Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications (Studies in Big Data, 132) 3031383249, 9783031383243



English Pages 614 [597] Year 2023


Table of contents :
Preface
Contents
Descriptive and Diagnostic Analytics
Cluster Analysis Using k-Means in School Dropout
1 Introduction
2 Related Works
3 k-Means Clustering Processing
4 Dataset
5 Experimentation and Results
5.1 Kappa Statistic
5.2 Mean Absolute Error
5.3 Root Mean Square Error
6 Discussion
7 Conclusions
References
Topic Modeling Based on OWA Aggregation to Improve the Semantic Focusing on Relevant Information Extraction Problems
1 Introduction
2 Background
2.1 Topic Modeling in Keyphrases Extraction
2.2 Topic Modeling in Extractive Summarization
3 Topic Modeling Aimed at Extracting Relevant Information
3.1 Fuzzy-Based Topic Modelling
3.2 Candidate Topics Extraction
3.3 Topics Identification
3.4 Topics Ranking Construction
3.5 Fuzzy-Based Topic Modelling Applied to Keyphrases Extraction
3.6 Fuzzy-Based Topic Modelling Applied to Text Summarization
4 Evaluation and Discussion
4.1 Experimental Results in Keyphrases Extraction Problem
4.2 Experimental Results in a Multi-document Summarization Problem
5 Conclusions and Future Works
References
An Affiliated Approach to Data Validation: US 2020 Governor’s County Election
1 Introduction
2 Literature Review
3 Benford’s Law
4 Zipf’s Law
5 Application
5.1 US 2020 Election
6 Comparative Study
7 Conclusion
References
Acquisition, Processing and Visualization of Meteorological Data in Real-Time Using Apache Flink
1 Introduction
2 Theoretical Fundamentals
2.1 Big Data
2.2 Data Streaming
2.3 NoSQL Databases
2.4 Apache Kafka
2.5 Apache Flink
2.6 Elasticsearch
2.7 Kibana
3 Proposed Architecture
3.1 Software and Computer Equipment Used
3.2 Construction of the Weather Station
3.3 Streaming of Data from the Weather Station to the Tools
3.4 Data Storage in Elasticsearch
3.5 Visualization of Data with Kibana
4 Verification
5 Results
6 Discussions
7 Conclusions
References
Topological Data Analysis for the Evolution of Student Grades Before, During and After the COVID-19 Pandemic
1 Introduction
2 Preliminaries
2.1 Topological Data Analysis
2.2 Topological Spaces
2.3 Simplicial Complexes
2.4 Homology Groups
3 Simplicial Complexes from Data
3.1 Vietoris-Rips Complex
3.2 Čech Complex
3.3 Nerves
3.4 Filtrations
4 Persistent Homology
4.1 Persistence Diagrams
4.2 Bottleneck Distance
5 Mapper Algorithm
6 Application
6.1 Datasets
6.2 Some Results
7 Conclusion and Future Work
References
Redescending M-Estimators Analysis on the Intuitionistic Fuzzy Clustering Algorithm for Skin Lesion Delimitation
1 Introduction
2 Redescending M-Estimation
2.1 Enhanced Intuitionistic Fuzzy Clustering Algorithm Through Redescending M-Estimators
3 Experimental Results
3.1 Metrics
3.2 Simulation Results
4 Conclusion and Future Scope
References
Big Data Platform as a Service for Anomaly Detection
1 Introduction
2 Problem Description and Motivation
3 Background
3.1 Modern Distributed Computing and Frameworks
3.2 Platforms as a Service with Container Orchestrators
4 Big Data
4.1 Big Data Reference Architecture
5 Big Data Architecture Proposal Built-In PaaSCO DC/OS
5.1 Assembly Frameworks in PaasCO DC/OS for Big Data Anomaly Detection
5.2 Test Case for Prediction of Severe Diabetic Retinopathy
5.3 Experimental Analysis
5.4 General Conclusions in Test Case
6 General Discussion and Conclusions
References
Predictive Analytics
An Overview of Model-Driven and Data-Driven Forecasting Methods for Smart Transportation
1 Introduction
2 Model-Driven Versus Data-Driven Approaches
3 Model-Driven for Traffic State Estimation
3.1 Macroscopic Models
3.2 Microscopic Modeling
3.3 Mesoscopic Modeling
3.4 Critiques and Limitations of Model-Driven Approaches
4 Traffic Flow Prediction Based on Data-Driven
4.1 The Challenges of Data-Driven Traffic Prediction
4.2 Review of Data-Driven Approaches
5 Naïve Methods
6 Parametric Methods
6.1 Historical Average Algorithm (HA)
6.2 Smoothing Techniques
6.3 Kalman Filtering Technique (KFT)
6.4 Auto-Regressive Linear Processes
7 Non-parametric Methods or Machine Learning Approach
7.1 Support Vector Regression (SVR)
7.2 Artificial Neural Networks (ANN)
7.3 Hybrid Prediction Methods
8 Conclusions
References
Data Augmentation Techniques for Facial Image Generation: A Brief Literature Review
1 Introduction
2 Generation of Artificial Facial Images
2.1 Generic Transformations
2.2 Component Transformation
2.3 Attribute Transformation
2.4 Age Progression and Regression
3 Generative Adversarial Networks (GANs)
3.1 Definition
3.2 Architecture
3.3 Training Process
3.4 Challenges
3.5 Face Image Generation Evolution with GANs
4 Related Work
5 Methodology
6 Face Image Generation with GANs
7 Conclusion and Future Work
8 Code Repository
References
A Review on Machine Learning Aided Multi-omics Data Integration Techniques for Healthcare
1 Introduction
2 Multi-omics
2.1 Genomics
2.2 Epigenomics
2.3 Transcriptomics
2.4 Proteomics
2.5 Metabolomics
3 Machine Learning
4 Multi-omics Data Integration Strategies
4.1 Early Integration
4.2 Mixed Integration
4.3 Intermediate Strategies
4.4 Late Integration
4.5 Hierarchical Integration
5 Machine Learning-Based Data Integration Methods
5.1 Concatenation-Based Integration Methods
5.2 Model-Based Integration Methods
5.3 Transformation-Based Integration Methods
6 Application
6.1 IMPaLA (Integrated Molecular Pathway-Level Analysis)
6.2 MixOmics
6.3 MOFA (Multi-omics Factor Analysis)
6.4 BioMiner
6.5 TCGA (The Cancer Genome Atlas)
6.6 ICGC
6.7 CPTAC (Clinical Proteomic Tumor Analysis Consortium)
6.8 DepMap
6.9 PaintOmics
7 Multi-omics Research Contributions
7.1 PARADIGM (Pathway Recognition Algorithm Using Data Integration on Genomic Models)
7.2 iCluster
7.3 Patient-Specific Data Fusion (PSDF)
7.4 Bayesian Consensus Clustering (BCC)
8 Challenges
9 Future Prospects
10 Conclusion
References
Learning of Conversational Systems Based on Linguistic Data Summarization Applications in BIM Environments
1 Introduction
2 Model of the Conversational System with Learning Based on Linguistic Summaries of Data
2.1 BRasa Assistant Subsystem Architecture
2.2 Architecture and Algorithms of BRasa_LDS Learning Subsystem
2.3 Example of Application of BRasa on BIM Project Management Environment
2.4 Indicators Used to Evaluate the Conversational System Knowledge Databased on Linguistic Summaries
3 Results Analysis
3.1 Validation of BRasa Performance in BIM Project Management Environment (BusinessRedmine Ecosystem)
4 Conclusions
References
Fuzzified Case-Based Reasoning Blockchain Framework for Predictive Maintenance in Industry 4.0
1 Introduction
1.1 Data Analytics, Computational Intelligence, and Predictive Maintenance
1.2 Emerging Technologies in Predictive Maintenance
1.3 Basic Concepts
2 Related Work
2.1 Models, Algorithms, and Applications for Solving Production Loss in Industry 4.0
2.2 Application of Case-Based Reasoning in Industry 4.0
3 Proposed FCBRB Framework
3.1 Overview of the Framework
3.2 Methodology
3.3 Sim (FnSa, FnSb)
4 Implementation and Discussion
4.1 Experimentation and System Implementation
4.2 Discussion
5 Conclusions
References
Machine Learning for Identifying Atomic Species from Optical Emission Spectra Generated by an Atmospheric Pressure Non-thermal Plasma
1 Introduction
1.1 Motivation
2 Automatic Recognition with Machine Learning
2.1 Optical Emission Spectroscopy
2.2 Characterization Techniques Based on Machine Learning
3 Method
3.1 Ensemble-Classifier Based on Decision Trees Algorithms
4 Results
5 Discussion
6 Conclusion
References
Agent-Based Simulation: Several Scenarios
1 Introduction
2 Signal Configuration
2.1 Case 1: Signal Configuration with Agent-Based Simulation
2.2 Traffic Simulation Tool
2.3 Discussion: Case 1
3 Simulation of Drifting Objects
3.1 Case 2: Simulation of Drifting Objects in Cuban Territorial Waters
3.2 GAMA Platform
3.3 Modeling Proposal
3.4 Results of the Experiments
3.5 Discussion: Case 2
4 Simulation for Training
4.1 Case 3: Simulator for the Training of Boiler Operators
4.2 Discussion: Case 3
5 Conclusion
References
Prescriptive Analytics
Multihop Ridesharing ∈ NPC
1 Introduction
2 Related Work
3 Multi-hop Ride Sharing Problem Definition
4 MHRS in NP
5 Vertex Covering and VC ≤p MHRS
5.1 If i Is in the YES Instances of VC, Then the Transformation Function of i Will Produce a YES Instance of MHRS
5.2 If i Is Not in the YES Instances of VC, Then the Transformation Function of i Will Produce an Instance That Is Not in the YES Instances of MHRS
6 Conclusions
References
A Content-Based Group Recommender System Using Feature Weighting and Virtual Users Aggregation
1 Introduction
2 Preliminaries
2.1 Content-Based Recommendation
2.2 Group Recommendation
2.3 Previous Works in Content-Based Group Recommender Systems
3 A New Hybrid Method for Content-Based Group Recommender Systems
3.1 Modeling Users and Items
3.2 Aggregation of User Profiles
3.3 Addition of Group Profile
3.4 Calculation of Weighted User-Item Similarity
3.5 Aggregation of Values and Final Recommendation
4 Experimentation
4.1 Datasets
4.2 Experimental Protocol
4.3 Results and Discussion
5 Conclusions
References
Performance Evaluation of AquaFeL-PSO Informative Path Planner Under Different Contamination Profiles
1 Introduction
2 Related Work
3 Statement of the Problem
3.1 Monitoring Problem
3.2 Assumptions
4 PSO-Based Path Planning Algorithms
4.1 Classic Particle Swarm Optimization (PSO)
4.2 Enhanced GP-Based PSO
4.3 AquaFeL-PSO
5 Results and Discussion
5.1 Ground Truth
5.2 Performance Metric
5.3 Setting Simulation Parameters
5.4 Performance Comparison
6 Summary of the Results
7 Conclusions
References
Adapting Swarm Intelligence to a Fixed Wing Unmanned Combat Aerial Vehicle Platform
1 Introduction
2 Proposed Work
3 Swarm Formation
3.1 Arrowhead Formation
3.2 Rectangular Prism Formation
3.3 Simulation Analysis
4 Mission Execution
4.1 Mission Initiation Module
4.2 Route Planning
5 Autonomous Lock-On Target Tracking
5.1 Determining the Target and the Route
5.2 Target Image Processing
5.3 Target Tracking via LVFG
6 Communication
6.1 Communication with Ground Station
6.2 Intercommunication of UCAVs
6.3 Collision Avoidance
6.4 Flight Control Module
7 Conclusion
References
Cellular Processing Algorithm for Time-Dependent Traveling Salesman Problem
1 Introduction
2 State of the Art
2.1 Time-Dependent Traveling Salesman Problem Contributions
2.2 Cellular Processing Algorithms Contributions
3 Time Dependent-Traveling Salesman Problem (TD-TSP)
3.1 Instance Structure
3.2 Calculation Process Example
4 Greedy Randomized Adaptive Search Procedure Algorithm for Time Dependent-Traveling Salesman Problem
4.1 Greedy Randomized Adaptive Search Procedure Construction
4.2 Roulette Procedure
4.3 Influence on the Candidate List
4.4 Shared Memory and Normalization
4.5 Reactive Greedy Randomized Adaptive Search Procedure
5 Experimental Results
5.1 Configuration and Instances
5.2 Parameter Comparison
5.3 Comparison Between GRASP Methods
5.4 Comparison Between CPA Methods
5.5 Comparison Between CPA and GRASP Methods
6 Conclusions
References
Portfolio Optimization Using Reinforcement Learning and Hierarchical Risk Parity Approach
1 Introduction
2 Related Work
3 Data and Methodology
3.1 Choosing the Sectors
3.2 Data Acquisition
3.3 Hierarchical Risk Parity Portfolio Design
3.4 Portfolio Design Using Reinforcement Learning
3.5 Backtesting the Portfolios on the Training and Test Data
4 Results
5 Conclusion
References
Reducing Recursion Costs in Last-Mile Delivery Routes with Failed Deliveries
1 Introduction
2 Formal Description of the Proposed Model
3 Solution Method
4 Case Study
4.1 Solution of the Instance A
4.2 Solution of the Instance B
5 Conclusions
References
Intelligent Decision-Making Dashboard for CNC Milling Machines in Industrial Equipment: A Comparative Analysis of MOORA and TOPSIS Methods
1 Introduction
2 Developing
3 Obtaining the Data
4 Analysis of Data Capture
5 Results
6 Conclusions
7 Future Research
References


Studies in Big Data 132

Gilberto Rivera Laura Cruz-Reyes Bernabé Dorronsoro Alejandro Rosete   Editors

Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications

Studies in Big Data Volume 132

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data- quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams and other. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and Operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are reviewed in a single blind peer review process. Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH. All books published in the series are submitted for consideration in Web of Science.

Gilberto Rivera · Laura Cruz-Reyes · Bernabé Dorronsoro · Alejandro Rosete Editors

Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications

Editors Gilberto Rivera División Multidisciplinaria de Ciudad Universitaria Universidad Autónoma de Ciudad Juárez Chihuahua, Mexico Bernabé Dorronsoro School of Engineering Universidad de Cádiz Cadiz, Spain

Laura Cruz-Reyes Tecnológico Nacional de México/Instituto Tecnológico de Ciudad Madero Ciudad Madero, Tamaulipas, Mexico Alejandro Rosete Universidad Tecnológica de La Habana “José Antonio Echeverría” La Habana, Cuba

ISSN 2197-6503 ISSN 2197-6511 (electronic) Studies in Big Data ISBN 978-3-031-38324-3 ISBN 978-3-031-38325-0 (eBook) https://doi.org/10.1007/978-3-031-38325-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This book presents a compilation of cutting-edge research in two prominent fields: computational intelligence and data analytics, along with some related research areas. These two disciplines synergistically complement each other, enabling the exploration of complex problems and maximizing the utilization of available data. Moreover, both are critical tools in the knowledge-based economy and society, as they enable harnessing the knowledge embedded in data to drive economic growth, foster innovation, and enhance the quality of life in society.

Let us begin by providing some brief definitions, which will serve as our introductory guide to the intricacies, approaches, and applications presented in the book chapters. Computational intelligence is founded on techniques and algorithms that emulate natural learning and adaptation processes. On the other hand, data analytics focuses on analyzing and extracting valuable information from datasets. Computational intelligence encompasses subdisciplines such as machine learning, fuzzy logic, evolutionary computation, and artificial neural networks. Conversely, data analytics employs statistical, mathematical, and computational techniques and tools to analyze extensive data volumes and derive meaningful insights. Although these fields have achieved significant transformations in various areas, interpretation and decision-making pose significant challenges in dealing with data's complexity and exponential growth, as well as ethical and privacy concerns. In that sense, "Data Analytics and Computational Intelligence: Novel Models, Algorithms, and Applications" represents an exciting avenue for fostering constructive discussions, conversations, and reflection about their impact and potential for addressing everyday and emerging needs.

This book is organized into three main parts that group chapters with similar topics.

Part I. Descriptive and Diagnostic Analytics. The seven chapters in this part highlight the diverse applications of data processing and analysis across various domains. This field plays a critical role today by facilitating informed decision-making and uncovering valuable business insights. These chapters showcase a range of applications, including student dropout prediction, relevant information extraction, election fraud detection, meteorological data processing, analysis of student grades, skin lesion delimitation, and anomaly detection in Big Data. Each study provides valuable insights and contributes to the advancement of data-driven approaches. They cover topics such as prediction algorithms for analyzing historical data, topic modeling using Ordered Weighted Average aggregation, analysis of word frequency distribution in text using Zipf's Law, data stream processing with Apache Flink, topological data analysis employing the Mapper algorithm, clustering utilizing the Intuitionistic Fuzzy C-Means algorithm, and Big Data Analytics Systems for cloud computing.

Part II. Predictive Analytics. The chapters in this section focus on gaining valuable insights into future events and providing practical knowledge to companies by utilizing machine learning, statistical models, and forecasting methods. The seven chapters in this part address relevant problems by expanding and adapting various cutting-edge approaches. The topics covered include forecasting models for intelligent transportation, data augmentation techniques for facial image generation, the integration of machine learning algorithms with multi-omics analysis for healthcare research, linguistic data summarization in Building Information Modeling environments for conversational systems, the fusion of three methodologies (fuzzy logic, blockchain, and case-based reasoning) for predictive maintenance in Industry 4.0, machine learning techniques for identifying atomic species, and the application of agent-based simulations in real-life scenarios.

Part III. Prescriptive Analytics. The eight chapters in this part present new approaches to learning from data, understanding past behavior, and recommending specific actions to make optimal decisions that can improve the current state of an entity. These original contributions are mainly based on deep learning, feature weighting, TOPSIS, MOORA, cellular processing, and adaptive Swarm Intelligence. In addition, they address challenges in areas such as group recommendation, logistics, decision-making, portfolio optimization, autonomous vehicles, and transportation problems. Finally, a theoretical chapter demonstrates that the problem of multi-hop ridesharing is an NP-complete problem; this study establishes the foundation for future research in this field.

With contributions from various authors, this valuable resource could serve researchers, practitioners, and academics in data analytics and computational intelligence. This book aims to inspire readers to implement these technologies in Smart Business or Industry 4.0 environments. We encourage researchers to drive the field forward and improve strategic decision-making processes. We hope readers find the book informative and helpful, fostering productive research and intelligent solutions across related disciplines.

Chihuahua, Mexico
Ciudad Madero, Mexico
Cadiz, Spain
La Habana, Cuba

Gilberto Rivera Laura Cruz-Reyes Bernabe Dorronsoro Alejandro Rosete

Contents

Descriptive and Diagnostic Analytics

Cluster Analysis Using k-Means in School Dropout (p. 3)
Luis Earving Lee Hernández, José Antonio Castán-Rocha, Salvador Ibarra-Martínez, Jésus David Terán-Villanueva, Mayra Guadalupe Treviño-Berrones, and Julio Laria-Menchaca

Topic Modeling Based on OWA Aggregation to Improve the Semantic Focusing on Relevant Information Extraction Problems (p. 17)
Yamel Pérez-Guadarramas, Alfredo Simón-Cuevas, Francisco P. Romero, and José A. Olivas

An Affiliated Approach to Data Validation: US 2020 Governor's County Election (p. 43)
Manan Roy Choudhury

Acquisition, Processing and Visualization of Meteorological Data in Real-Time Using Apache Flink (p. 65)
Jonathan Adrian Herrera Castro, Abraham López Najera, Francisco López Orozco, and Benito Alan Ponce Rodríguez

Topological Data Analysis for the Evolution of Student Grades Before, During and After the COVID-19 Pandemic (p. 97)
Mauricio Restrepo

Redescending M-Estimators Analysis on the Intuitionistic Fuzzy Clustering Algorithm for Skin Lesion Delimitation (p. 121)
Dante Mújica-Vargas, Blanca Carvajal-Gámez, Alicia Martínez-Rebollar, and José de Jesús Rubio

Big Data Platform as a Service for Anomaly Detection (p. 141)
Adrián Hernández-Rivas, Victor Morales-Rocha, and Oscar Ruiz-Hernández

Predictive Analytics

An Overview of Model-Driven and Data-Driven Forecasting Methods for Smart Transportation (p. 159)
Sonia Mrad and Rafaa Mraihi

Data Augmentation Techniques for Facial Image Generation: A Brief Literature Review (p. 185)
Blanca Elena Cazares, Rogelio Florencia, Vicente García, and J. Patricia Sánchez-Solís

A Review on Machine Learning Aided Multi-omics Data Integration Techniques for Healthcare (p. 211)
Hina Bansal, Hiya Luthra, and Shree R. Raghuram

Learning of Conversational Systems Based on Linguistic Data Summarization Applications in BIM Environments (p. 241)
Yuniesky Orlando Vasconcelo Mir, Iliana Pérez Pupo, Pedro Y. Piñero Pérez, Luis Alvarado Acuña, and Aimee Graffo Pozo

Fuzzified Case-Based Reasoning Blockchain Framework for Predictive Maintenance in Industry 4.0 (p. 269)
Kayode Abiodun Oladapo, Folasade Adedeji, Uchenna Jeremiah Nzenwata, Bao Pham Quoc, and Akinbiyi Dada

Machine Learning for Identifying Atomic Species from Optical Emission Spectra Generated by an Atmospheric Pressure Non-thermal Plasma (p. 299)
Octavio Rosales-Martínez, Allan A. Flores-Fuentes, Antonio Mercado-Cabrera, Rosendo Peña-Eguiluz, Everardo Efrén Granda-Gutiérrez, and Juan Fernando García-Mejía

Agent-Based Simulation: Several Scenarios (p. 341)
Mailyn Moreno-Espino, Ariadna Claudia Moreno-Román, Ariel López-González, Robert Ruben Benitez-Bosque, Cynthia Porras, and Yahima Hadfeg-Fernández

Prescriptive Analytics

Multihop Ridesharing ∈ NPC (p. 373)
Javier Alejandro Romero-Guzmán, Jesús David Terán-Villanueva, Mirna Patricia Ponce-Flores, and Aurelio Alejandro Santiago-Pineda

A Content-Based Group Recommender System Using Feature Weighting and Virtual Users Aggregation (p. 383)
Yilena Pérez-Almaguer, Manuel J. Barranco, Yailé Caballero Mota, and Raciel Yera

Performance Evaluation of AquaFeL-PSO Informative Path Planner Under Different Contamination Profiles (p. 405)
Micaela Jara Ten Kathen, Federico Peralta, Princy Johnson, Isabel Jurado Flores, and Daniel Gutiérrez Reina

Adapting Swarm Intelligence to a Fixed Wing Unmanned Combat Aerial Vehicle Platform (p. 433)
Murat Bakirci and Muhammed Mirac Ozer

Cellular Processing Algorithm for Time-Dependent Traveling Salesman Problem (p. 481)
Edgar Alberto Oviedo-Salas, Jesús David Terán-Villanueva, Salvador Ibarra-Martínez, and José Antonio Castán-Rocha

Portfolio Optimization Using Reinforcement Learning and Hierarchical Risk Parity Approach (p. 509)
Jaydip Sen

Reducing Recursion Costs in Last-Mile Delivery Routes with Failed Deliveries (p. 555)
Luis Suárez, Cynthia Porras, Alejandro Rosete, and Humberto Díaz-Pando

Intelligent Decision-Making Dashboard for CNC Milling Machines in Industrial Equipment: A Comparative Analysis of MOORA and TOPSIS Methods (p. 573)
Javier Andres Esquivias Varela, Humberto García Castellanos, and Carlos Alberto Ochoa Ortiz

Descriptive and Diagnostic Analytics

Cluster Analysis Using k-Means in School Dropout
Luis Earving Lee Hernández, José Antonio Castán-Rocha, Salvador Ibarra-Martínez, Jésus David Terán-Villanueva, Mayra Guadalupe Treviño-Berrones, and Julio Laria-Menchaca

Abstract Traditional measurements of university students are not providing the data needed for adequate guidance of the students. The evidence for this is the constantly growing problem of student dropout. This paper presents a comparison of selected prediction algorithms to detect students at risk of becoming inactive at the end of their first semester. Our proposal uses data from previous academic years to produce a dataset with the reports of the school performance of the students. Additionally, we used Random forest, the J48 decision tree, and Logistic regression. Our main approach is to implement the k-means algorithm to split the database into subgroups of classes and compare it with traditional applications to evaluate each case. The dataset uses the school performance of students from 2017 to 2019 to predict the chance that a newcomer student will drop out of school at the start of their academic year. The results show that the Random forest algorithm reaches better performance than the other algorithms. We show that, applying our proposed clustering-based processing, our algorithm flagged as potential dropouts 80% of those who actually did drop out. In the conclusions section, we discuss the effectiveness and propose future work.

L. E. Lee Hernández (B) · J. A. Castán-Rocha · S. Ibarra-Martínez · J. D. Terán-Villanueva · M. G. Treviño-Berrones · J. Laria-Menchaca
Faculty of Engineering, Autonomous University of Tamaulipas, Tampico, Tamaulipas, Mexico
e-mail: [email protected]
J. A. Castán-Rocha, e-mail: [email protected]
S. Ibarra-Martínez, e-mail: [email protected]
M. G. Treviño-Berrones, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications, Studies in Big Data 132, https://doi.org/10.1007/978-3-031-38325-0_1

1 Introduction

Governments around the world have been increasing the annual budget in education with the purpose of promoting the teaching quality of young students in higher education. That gives them the incentive to create mechanisms that benefit the students to
improve their quality of life during their studies. The benefits are scholarships, incentives to single mothers, school counseling, and psychological counseling, among others. Despite this, there are several adversities that students face during school, such as cognitive, mathematical, academic and even family adversities that can aggravate the student’s issues, affecting the student’s academic performance, which may result in school desertion. In recent years, many universities have recorded the performance of students in electronic files. This information can be processed with data mining and artificial intelligence techniques to provide the academic institutions with the ability to respond to such problems with various preventive measures to minimize the student’s likelihood dropout. The traditional performance evaluations, while giving qualitative evaluation of academic performance history, do not produce predictions of students’ future development as our methods do. Therefore, it is necessary to emphasize the effective assessment and prediction of student academic performance. Hence, the need arises for an academic performance prediction system to predict whether a student will drop out of school. With such predictions, academic authorities could adjust their orientation and help students to achieve better academic performance in the future to avoid school desertion. This paper presents a new method to predict school desertion based on students’ scores using clustering techniques. We propose three algorithms that use k-centroids to measure students’ score. Additionally, we compare the methods with and without the clustering technique and analyze the impact of the students’ performance by such techniques. Finally, we compare the performance of the prediction techniques, using kappa statistic, mean absolute error (MAE) and root mean square error (RMSE). We find that Random forest outperforms the other implementations. The remainder of this paper is structured as follows: Sect. 2 shows related works regarding prediction in school desertion. In Sects. 3 and 4, we describe the models and methods used to process the students data base, which includes data extraction and preparation. Section 5 shows the computational results of the experimentation. Section 6 presents discussions of the findings and the conclusions.

2 Related Works Hedge and Pragreeth [6] proposed a methodology to predict student dropout using naive Bayes classifier algorithm and decision trees in R language. They examined each student’s reason for early dropout. First the authors used a factor identification for attribute selection. They then prepared a questionnaire survey and established criteria from academic files. Next, they applied dimensionality reduction as a pre-processing technique. The result was the algorithm had 83% accuracy on 50 instances (using 54 attributes). The authors note that academic, demographic, and psychological factors play an important role in school dropout. The authors conclude that identifying potential dropouts at an early stage can prevent the dropout from happening with monitoring to provide valuable counseling.


Huisman et al. [8] proposed a study on the optimality of pairing different feedback reviewers with students, as a method that provides an efficient review of the performance of the students in their writing performance task. The authors categorize the pairing of students and reviewers as same-ability peers (homogeneously) or different ability peers (heterogeneously). The study addressed this issue in the context of an academic writing task. According to the authors, 94 undergraduate students were matched in 47 homogeneous or heterogeneous reciprocal pairs reviewers which provided anonymous, formative peer feedback on each other’s draft essays. The authors affirm that the feedback quality of the reviewers did not depend on the student’s ability or the pair composition, although the authors argue that there may be a benefit from highly skilled reviewers. Gore et al. [5] developed a study about student’s expectations for a particular career that need a higher education that involves student’s environment and education variables. In the study of some 6000 public school students in Australia, Univariate Logistic regression was used to analyze year, gender, and various diversity factors such as economic status where authors conclude that student diversity must be taken into account. The authors argues rather that solving the issue with one single activity for participation of students, instead, several activities to reach more students. Kool et al. [9] presented a study about the effects of honor programs on students, by using certain skills and applying propensity score matching. Student’s skills were used to match undergraduate honors students with non-honor students. The authors also employed longitudinal data. The authors utilized propensity scores that are statistically related pre-enrollment characteristics for the calculation having a total samples of 12,000 students. They argue that univariate tests of orientation and intellectual curiosity both highlighted a distinction between programs. Moreover, they argue honor programs might help mitigate a decrease in skills-mastery by students through suitable educational challenges. Yagci and Cevik [17] proposed a study to predict the academic achievement of vocational high school students of Turkish and Malaysian students in science courses by using artificial neural networks and offering preventive measures against students who drop out. The student population consisted of the tenth and eleventh grades of 922 students in Turkey and 1050 Malaysian students. The study was managed by using a questionnaire with variables affecting the level of academic achievement which included the averages of the students who studied various courses. The authors designed a model that predicted the students’ academic achievement with an artificial neural network using the Matlab program. As a conclusion of the study, an academic prediction system was developed with an average sensitivity of 98% on 922 samples for Turkey and 95% on 1050 samples for Malaysia. The work of Sukhbaatar et al. [16] is based on a simple forecast scheme using a decision tree contextualized for the analysis of categorization to recognize students that may drop out in the middle of the semester. The data included 700 online activities of students in required sophomore courses with blended learning styles. 
This study showed that 79% of the students who dropped out were correctly predicted; this gives the possibility to the school authorities to interfere, motivate and support them for better engagement and not to abandon school. The authors emphasize that


students who fail make a similar or greater effort in the online learning environment compared to students with good academic performance. Various predictive measures were performed to evaluate the effectiveness of the decision trees such as sensitivity, precision, accuracy, and F-score, reaching 89% accuracy. Perez et al. [14] used knowledge discovery techniques to analyze educational data focused on detecting the dropout of undergraduate students in Systems Engineering, after 7 years of being enrolled in a Colombian university. The data used by the authors was extended and enriched using a feature engineering process. The authors used Decision Trees, Logistic Regression, and Naive Bayes for the prediction modeling. In the evaluation part, the authors employed Watson Analytics software to ease the use of the service for a non-expert user. The dataset consisted of 800 students who matriculated in the computer science program at a private university in Bogota, and the attributes used were: admission information, graduation dates, academic programs, and financial aid. In the paper, they showed preliminary results for predicting dropout from a large and heterogeneous data set using demographic records and student transcripts. Notably, the authors argued that performance in systems engineering courses is correlated with performance in physics and mathematics courses. The experimental results showed that the best method was the decision tree model achieving an accuracy of 94%. Gera et al. [4] developed a study to propose an advanced feature selection algorithm to effectively apply educational data mining to improve the performance of the technique. The algorithm is based on the hadoop framework and map-reduce. Thus this study is a comparative analysis of various feature selection algorithms such as JRip, Naive Bayes, Decision Table, etc., which evaluates the accuracy of the data based on three parameters: precision, recall, and F-metric. The authors argue that the proposed algorithm is a better way to manage educational data compared to other objectives. Additionally, the authors argue there were no noteworthy changes by using different algorithms along with the classifiers. The authors conclude that AFSA (Artificial Swarm Algorithm) performs more effectively in consolidation with the different classifiers. Arun et al. [1] proposed a study whose objective was to present a data-driven system to assist teachers in early detection, prediction of course grades, and advanced prediction of cumulative low-grade point averages with an examination of first, second, and third-year students in Bengaluru, India. One of the stipulated considerations was grouping the results according to the type of algorithm implemented. This grouping of techniques was proposed according to the classification of algorithms in WEKA (Waikato Environment for Knowledge Analysis). The authors implemented the method of voting by discarding the techniques. The techniques employed were the Multi-layer Perceptron Clustering, Naive Bayes, Hoeffding Tree, JRip, Random Forest, Average-one-dependence Estimator, Logistic Regression, Simple Logistic, Linear Regression, Support Vector Machine, Gaussian Processes, Simple Linear Regression, Multilayer Perceptron Neural Networks, Regression Classification, Iterative Classification Optimizer, among others. As a result, the authors emphasize that the Random forest algorithm obtains the best result among the regression algorithms, reaching an accuracy of 89.15%.


All of the above works are feasible examples of current approaches with adequate and robust results that seek to provide academic solutions to students on time. The authors have based their efforts on presenting an improvement to the techniques for identifying potential dropouts and even using clustering techniques to highlight their own improvement as a measure of student performance. However, in spite of their efforts to control school desertion, the problem has been getting worse ([2]). The focus of this paper is to highlight the solution we achieve by applying clustering techniques. We use artificial intelligence techniques to identify students at risk of becoming inactive at the end of their first school year.

3 k-Means Clustering Processing

This section presents the processing method of the dataset using the k-means clustering algorithm. The k-means clustering objective in Eq. (1) was implemented to precisely cluster the academic performance of students into three groups, highlighting the academic performance history of first-semester students and creating a training set to form a robust dataset for later use in the analysis of this paper.

k-means = \sum_{j=1}^{k} \sum_{i=1}^{n} | x_i - \mu_j |^2        (1)

where k is the number of clusters, n is the number of cases, x_i is a case, and \mu_j is the centroid of cluster j.

Each student's grade was classified according to this scheme: low performance, intermediate performance, and high performance. The scheme employed to classify the grades of the student performance was adjusted according to the university score ranking used by most universities in México. The principal objective of k-means is to find the cluster points ("centroids") that can group the data in their vicinity. The algorithm helps to find the group of academic performance to which each student belongs. It initially selects k centroids at random, then groups the data points according to which centroid they are closest to by Euclidean distance in 10-space. Then it refines the selection of each centroid as the arithmetic mean of all the data points in that cluster. The process repeats with this new set of centroids until the selection of centroids does not change much. The approach of k-means is greedy; this means the algorithm will try to minimize the Euclidean distance between the points and the centroids [3]. In sum, this process groups each student's performance with other students, so that each student is included with others of their kind. Algorithm 1 summarizes all the steps followed.


Algorithm 1 K-means clustering algorithm with classifiers
1: for k ← 1 to K do
2:   µ_k ← some random location  // randomly initialize mean for kth cluster
3: end for
4: while not converged do
5:   for n ← 1 to N do
6:     z_n ← argmin_k ||µ_k − x_n||  // assign example n to closest center
7:   end for
8:   for k ← 1 to K do
9:     µ_k ← MEAN({x_n : z_n = k})  // re-estimate mean of cluster k
10:  end for
11: end while
12: return z  // return cluster assignments
13: Define data source and instance of z
14: Apply filters to z
15: Define classifiers
16: Evaluate model
17: Repeat for each classifier

This code is available at https://github.com/Earvingle/Kmeans/blob/main/kkmeans. Here is how the preprocessing produces new datasets for each school: after the k-means algorithm on a school dataset produces classes C1, …, Cm, in each class Ci we replace the class-score data for each of the students in Ci with the average of the class-score data in Ci; call this new set of data C'i. Then the new dataset is the union of C'1, …, C'm. This classification also matched the k-centroids applied in the algorithm, meaning that low performance was ranked by the k-centroid 5 with the range from 0 to 7, the intermediate performance was ranked by the k-centroid 8 with no range, only including classes with a grade of 8, and the high performance was ranked by the k-centroid 10 with the range from 9 to 10, respectively. The average and standard deviation were calculated, while also counting the number of courses belonging to their course classification (mathematics as exact, English/Spanish as linguistics, and so on) as shown in Fig. 1. The data was predicted between each cluster, where each tuple of the table corresponds to the academic performance of one student divided into different columns according to their performance, as shown in Fig. 2. Moreover, the clustering was analyzed to categorize the performance of the student in each course; the technique analyzes the overall performance per student, making a prediction about whether the student, who was active, becomes inactive and drops out of school. Thus a repository is built that contains all the data on the performance of the students. In the Arturo Narro Siller School of Engineering, we had 337 students as data points with 2359 courses categorized by their performance; taking k = 3, this grouping method separated the data points into 3 groups, each with its own centroid. The students in one group, comprising 800 courses, generally performed well; another group, comprising 600 courses, generally performed medium; and the last group, comprising 800 courses, generally performed poorly.
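For illustration, the grouping and class-average replacement described above can be sketched as follows. This is not the code from the authors' repository; the function names, the use of scikit-learn, and the course-wise averaging are assumptions made for the example.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_preprocess(grades, k=3, seed=0):
    # grades: (n_students, n_courses) array of numeric grades on the 0-10 scale
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(grades)
    labels = model.labels_
    processed = grades.astype(float).copy()
    for c in range(k):
        members = labels == c
        # replace every student's scores in cluster c with the cluster average
        processed[members] = grades[members].mean(axis=0)
    return processed, labels

# toy usage with three synthetic students and the regular block of 7 courses
X = np.array([[9, 10, 9, 8, 10, 9, 9],
              [5, 6, 5, 7, 5, 6, 5],
              [8, 8, 8, 8, 8, 8, 8]])
X_new, groups = kmeans_preprocess(X, k=3)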


Fig. 1 Data processing model

Fig. 2 Second phase of the data processing model

The group of good performance almost reached 90% of students who remained active. However, the group of low performance reached almost 40% of students who became inactive due to their low academic performance. This indicates that about half of our sample is categorized as low-performing students, and they were students with a high risk of becoming inactive. To achieve the comparison of the proposed method with the unprocessed data, the grades were sorted in ascending order, separating the students by career; this transformation was proposed because there is a disparity among the students in terms of the generational change of the careers.


The course of Advanced Programming II, for example, is different between the first generation and the current generation of computer systems engineering, because the engineering department is in constant revision, updating the students' study plans, the course topics of the subjects, and so on.

4 Dataset In this section, the data sets used in this document will be described, i.e., a database consisting of 58,879 records of the grades of the students of the Arturo Narro Siller School of Engineering of the Autonomous University of Tamaulipas. The Mexican school system includes examinations sets, called ordinary A, ordinary B, extraordinary A, and extraordinary B; every student takes one or more of these. During their enrollment, the students signed on a privacy notice to allow the university to use their data for research purposes and only the research team has access to this information. The database consists of 1396 students belonging to various engineering careers from the period of 2017, 2018, and 2019. The data consists of information about students (enrollment, school year of entry, career, whether they are active, courses taken, the school cycle of the course, grades of the ordinary A, ordinary B, extraordinary A, and extraordinary B). Incomplete data sets were processed and cleaned. (We removed the data of recent students who performed a re-validation of their courses due to change of career. This category of students was discarded because their enrollment consisted of a higher number of courses as opposed to the regular block of 7 courses of given to new entry students.) A database repository was generated using MySQL Workbench to perform indexed and easily accessible searches for the application and implementation of the system. The grades of ordinary A were converted to their numerical form according to the transformation made in the institution when the student’s average was being calculated; for example, NA (not approved) transformed into 5 because it is a failing grade, while AC (approved) is transformed into a 10, which is taken as a passing grade for the course. The extraordinary grades were excluded because the imparity of the comparison among students. The students selected as a sample were 317 new students who took the regular block of 7 courses. Each course was classified according to the group they belonged to; courses such as mathematics were classified in the “exact” category, advanced English was classified in the “linguistics” category, and so on as shown in the Table 1. This categorization was performed manually according to the specialization of each university department; each career has their own kind of specialized classes, and categorizing the courses attunes the predictors.
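The grade normalization and course categorization just described can be sketched as follows; this is an illustrative reconstruction only (the literal values NA → 5 and AC → 10 come from the text, while the function and dictionary names are assumptions). Table 1 shows the authors' actual SQL categorization statements.

def numeric_grade(raw):
    mapping = {"NA": 5, "AC": 10}          # not approved -> 5, approved -> 10
    return mapping.get(raw, float(raw))    # numeric grades pass through unchanged

COURSE_CATEGORY = {
    "basic maths": "Exacts",
    "differential calculus": "Exacts",
    "basic chemistry": "chemistry",
    "introduction to engineer": "engineer",
}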


Table 1 Example of categorization on courses according to university departments

UPDATE courses SET clasification = 'Exacts' WHERE courses.courses = 'basic maths';
UPDATE courses SET clasification = 'Exacts' WHERE courses.courses = 'differential calculus';
UPDATE courses SET clasification = 'chemistry' WHERE courses.courses = 'basic chemisty';
UPDATE courses SET clasification = 'engineer' WHERE courses.courses = 'introduction to engineer';

5 Experimentation and Results

The original dataset was replaced to obtain a new dataset by making use of the division of the data points into groups by k-means. The new dataset includes the count of courses, the mean and standard deviation of each group, and the class of the student as being active or inactive; the grades of each student are replaced by the mean of the grades in that student's group. The repository was used as an input in WEKA. Three prediction algorithms were implemented: Random Forest (RF), J48 Decision Tree (J48) and Logistic Regression (LR), using WEKA [7, 12, 15]. Accuracy is the probability that the algorithm correctly classifies a student, either way. Sensitivity (true positive rate, TPR) measures the probability that the algorithm, when examining a dropout, will correctly place the student in that category. Specificity (false positive rate, FPR) measures the probability that the algorithm, when examining a non-dropout, will correctly place the student in that category. As Table 2 shows, applying our k-means processing approach before using the RF, J48 and LR algorithms results in a significant improvement in accuracy.

Table 2 Measures obtained as result of the experimentation

Measures       Sorted grades            k-means processing
               RF      J48     LR       RF      J48     LR
Accuracy (%)   69.18   72.97   74.05    80.86   75.65   79.13
TPR            0.69    0.73    0.74     0.8     0.75    0.79
FPR            0.47    0.53    0.48     0.43    0.62    0.44
Precision      0.67    0.8     0.76     0.79    0.72    0.77
Recall         0.69    0.73    0.74     0.8     0.75    0.79
F-measure      0.67    0.66    0.69     0.79    0.7     0.77
ROC            0.69    0.59    0.81     0.72    0.54    0.74
RAE (%)        77.71   85.89   77.72    81.10   87.19   82.07

Average accuracy of sorted grades models = 72.06%
Average accuracy of k-means processing models = 78.54%
Improved accuracy = 78.54 - 72.06 = 6.48%


This improvement is particularly noticeable in the RF algorithm, which, with k-means grouping, achieved an accuracy almost 7 percentage points above the rest of the algorithms without k-means grouping. An alternative measure of the k-means processing improvement is obtained by subtracting the mean accuracy of the sorted-grades models (Random forest, J48 and Logistic regression) from the mean accuracy of the k-means processing models (Random forest, J48 and Logistic regression).
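The improvement measure just described can be reproduced directly from the accuracy row of Table 2; the following short computation is only an illustration of that arithmetic.

sorted_grades_acc = [69.18, 72.97, 74.05]   # RF, J48, LR
kmeans_acc        = [80.86, 75.65, 79.13]   # RF, J48, LR
improvement = sum(kmeans_acc) / 3 - sum(sorted_grades_acc) / 3
print(round(improvement, 2))                # about 6.48 percentage points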

5.1 Kappa Statistic

The kappa statistic is defined as the measurement of the degree of agreement between two categorized data sets [13]. The kappa score varies between 0 and 1. The higher the kappa value, the stronger the agreement. Kappa of 1 means perfect correlation, while kappa of 0 means no correlation at all. Kappa of more than 0.8 is generally considered a good correlation [11]. Figure 3 shows the statistical analysis of the kappa measurement applied to the improvement proposed in this paper, where we can observe an improvement in the Random Forest and Logistic Regression algorithms, which increase their degree of concordance, compared to the J48 algorithm, which shows a reduction.
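For reference, the standard definition of the kappa statistic, consistent with [11, 13], compares the observed agreement p_o with the agreement expected by chance p_e:

\kappa = \frac{p_o - p_e}{1 - p_e}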

Fig. 3 Statistic analysis of kappa (kappa values between 0 and 0.45 for RF, J48 and LR, comparing sorted grades against k-means processing)


5.2 Mean Absolute Error

The mean absolute error (MAE) is the average absolute error. This kind of error compares the forecast value to the actual value; it measures how close a predicted model is to the actual one [10]. Consequently, since this error is the average of the absolute differences, a lower MAE value corresponds to the prediction being closer to the actual value. Our results are shown in Fig. 4: applying the processing described in this paper causes the MAE value to decrease.
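For completeness, the standard definition of MAE over n predictions, consistent with [10], is

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|

where y_i is the observed value and \hat{y}_i the predicted one.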

5.3 Root Mean Square Error

The root mean square error (RMSE) is the square root of the mean of the squared differences between the predicted student values and the observed student values. It measures the differentiation between the students predicted to become inactive and those observed to do so. Therefore, lower RMSE and MAE values increase the performance and accuracy of the prediction [10]. Figure 5 shows how the root mean square error behaves when applying the processing proposed in this paper. What stands out, despite the J48 algorithm increasing its error, is that the Random forest and Logistic regression algorithms reduce their error by almost 4%.
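Likewise, the standard RMSE definition, consistent with [10], is

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}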

Fig. 4 Statistic analysis of Mean Absolute Error (MAE values between 0.28 and 0.38 for RF, J48 and LR, comparing sorted grades against k-means processing)


Fig. 5 Statistic analysis of root mean square deviation (RMSE values between 0.36 and 0.46 for RF, J48 and LR, comparing sorted grades against k-means processing)

6 Discussion The proposed forecast estimates that 18.69% of students are at risk of dropping out of school, which has a difference of 8.01% regarding the actual dropout data. The results showed that k-means processing outperformed the sorting grades model for predicting students’ inactivity by an average of 6.64%. However, we believe that these results may improve with a larger dataset.

7 Conclusions The main purpose of this research was to perform an analysis of college students who are at risk of becoming inactive at the end of the second semester during their stay in a university by applying advanced prediction systems and using artificial intelligence techniques. In this paper, we present an academic inactivity detection system based on grade prediction. In particular the k-means algorithm is designed to precisely cluster the subject grades per student using the distribution of their grades and classifying the subjects concerning each student’s performance. Each sub-clustering was stored in a knowledge base to later apply the Random forest, J48 Decision trees and Logistic regression algorithms. In this paper we emphasized the comparison of the sub-grouping of the knowledge base generated from the proposed methodology with the ascending order of the students’ grades, the latter being proposed because there is a disparity in the classes taken in terms of the generational change of the careers.


The statistics show that Random Forest provides the best performance: applying the proposed clustering-based processing method, this algorithm reached an accuracy of 80%. For future work, we propose applying other techniques to our inactivity prediction, such as convolutional neural networks, support vector machines, etc. Future work should be aimed at improving the proposed methodology to find a solution for the prediction of academic inactivity and student attrition based on grades, in conjunction with the tutoring system.

References 1. Arun, D.K., Namratha, V., Ramyashree, B.V., Jain, Y.P., Roy Choudhury, A.: Student academic performance prediction using educational data mining. In: 2021 International Conference on Computer Communication and Informatics, ICCCI 2021 (2021). https://doi.org/10. 1109/ICCCI50826.2021.9457021 2. Calderón Argomedo, M.A., Vergara López, L., Atilano Mireles, L., Moctezuma Barragán, E., Flores Mendoza, R., Mayorga Ríos, A.: Principales cifras del sistema eduativo nacional 2018–2019. Technical report, Secretaria de Educación Pública, Ciudad de México (2019) 3. Daume, H.: A course in machine learning (CIML, v0.99). Todo, p. 189 (2012) 4. Gera, T., Panwar, A., Malhotra, N., Malhotra, D.: AFSA: a comprehensive analysis of educational big data using the advanced feature selection algorithm. In: 2021 International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), pp. 349–353 (2021). https://doi.org/10.1109/ICACITE51222.2021.9404745 5. Gore, J., Holmes, K., Smith, M., Fray, L., McElduff, P., Weaver, N., Willington, C.: Unpacking the career aspirations of Australian school students: towards an evidence base for university equity initiatives in schools. High. Educ. Res. Dev. 36(7), 1383–1400 (2017). https://doi.org/ 10.1080/07294360.2017.1325847 6. Hegde, V., Prageeth, P.P.: Higher education student dropout prediction and analysis through educational data mining. In: Proceedings of the 2nd International Conference on Inventive Systems and Control, ICISC 2018, pp. 694–699. IEEE (2018). https://doi.org/10.1109/ICISC. 2018.8398887 7. Hormann, A.M.: Programs for machine learning. Part II. Inf. Control 7(1), 55–77 (1964). https://doi.org/10.1016/S0019-9958(64)90259-1 8. Huisman, B., Saab, N., van Driel, J., van den Broek, P.: Peer feedback on college students’ writing: exploring the relation between students’ ability match, feedback quality and essay performance. High. Educ. Res. Dev. 36(7), 1433–1447 (2017). https://doi.org/10.1080/07294360. 2017.1325854 9. Kool, A., Mainhard, T., Jaarsma, D., van Beukelen, P., Brekelmans, M.: Effects of honours programme participation in higher education: a propensity score matching approach. High. Educ. Res. Dev. 36(6), 1222–1236 (2017). https://doi.org/10.1080/07294360.2017.1304362 10. Kumar, Y., Sahoo, G.: Analysis of parametric & non parametric classifiers for classification technique using WEKA. Int. J. Inf. Technol. Comput. Sci. 4(7), 43–49 (2012). https://doi.org/ 10.5815/ijitcs.2012.07.06 11. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics (1977). https://doi.org/10.2307/2529310 12. Le Cessie, S., Van Houwelingen, J.C.: Ridge estimators in logistic regression. J. R. Stat. Soc. Ser. C (Appl. Stat.) 41, 191–201 (1992). https://doi.org/10.2307/2347628 13. Melville, P., Yang, S.M., Saar-Tsechansky, M., Mooney, R.: Active learning for probability estimation using Jensen-Shannon divergence. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2005). https://doi.org/10.1007/11564096_28




Topic Modeling Based on OWA Aggregation to Improve the Semantic Focusing on Relevant Information Extraction Problems Yamel Pérez-Guadarramas, Alfredo Simón-Cuevas, Francisco P. Romero, and José A. Olivas

Abstract The volume of textual information available has led to the development of many Text Mining solutions for extracting relevant information. Modeling the topics of texts has been an approach taken by many of these solutions, in various ways; however, in all of them semantic analysis is little exploited in identifying relationships between text terms. This paper presents a topic modeling method based on Ordered Weighted Average (OWA) aggregation as its core, aimed at improving the semantic focus in relevant information extraction problems. An OWA operator allows aggregating several measure values into a single value; in this work it is used to aggregate several similarity and distance measures between candidate keyphrases. The measures are combined through a fuzzy aggregation of their values with weights calculated with the RIM (Regular Increasing Monotone) quantifier. In addition, two other solutions, for keyphrases extraction and text summarization, are proposed to demonstrate the effectiveness of the topic modeling method, and the effectiveness of the topic modeling is evaluated through these two relevant information extraction solutions. Keyphrases extraction was evaluated with the Inspec and 500N-KPCrowd datasets, and text summarization with the English and Spanish datasets offered in MultiLing 2015. The highest values were reached in most of the metrics used in the experiments carried out with the two solutions. In the keyphrases extraction, significant improvements in precision and F-measure were obtained, while in the summarization, the most significant improvements were in the recall.

Keywords Topic modelling · Fuzzy aggregation · Keyphrases extraction · Text summarization · Semantic analysis

Y. Pérez-Guadarramas (B) Centro de Aplicaciones de Tecnologías de Avanzada (CENATAV), La Habana, Cuba. e-mail: [email protected]
A. Simón-Cuevas Universidad Tecnológica de La Habana José Antonio Echeverría, Cujae, La Habana, Cuba. e-mail: [email protected]
F. P. Romero · J. A. Olivas Universidad de Castilla-La Mancha, Ciudad Real, Spain. e-mail: [email protected]
J. A. Olivas e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. G. Rivera et al. (eds.), Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications, Studies in Big Data 132, https://doi.org/10.1007/978-3-031-38325-0_2

1 Introduction

Nowadays, the amount of textual information available on the Internet and in other information-centric application scenarios is becoming a valuable resource of knowledge for decision-making. The accelerated growth of textual and unstructured data in digital format causes considerable information overload for users when a specific piece of information is needed. Although information access seems to be guaranteed, the tremendous available volume, its heterogeneity, and the inherent ambiguity make identifying and extracting the most relevant information a significant challenge. The development of computational solutions based on the application of Natural Language Processing (NLP) and Text Mining (TM) techniques emerged as the most promising alternative to deal with this challenge [1–6].

In text mining solutions, automatic text summarization and keyphrases extraction condense and extract the most relevant and essential information in a textual source. Detecting the main topic(s) addressed in the documents is becoming one of the most promising techniques for improving the effectiveness of these solutions [7], and processing the underlying semantics of the language is an essential task. Therefore, automatic topic detection has been included in several of the automatic text summarization and keyphrases extraction solutions reported. Topic detection is also called topic modeling when it is used as part of the process of other text-mining solutions; in that case, it produces an intermediate representation of the text on which different techniques can be applied to obtain the desired result.

The topic modeling process is essentially an unsupervised process within which the nature and number of topics are not known in advance. Therefore, solution approaches are more related to text clustering algorithms than to classification or categorization [8]. Topic models assume that meanings are relational and that the meaning of a topic can be understood as a set of groups or clusters of words [9]. The initial methods for topic detection in collections of documents used text clustering as a solution, grouping each document according to the discussion topics addressed in its content [10]. The next generation of methods focused their analysis on clustering the relevant words instead of clustering the documents as the initial methods did. Topic modeling has not been addressed only through clustering: other techniques, such as LDA (Latent Dirichlet Allocation) [11–17] and LSA (Latent Semantic Analysis) [18–21], have also been widely used. Although LDA is central to topic modeling and has impacted this field, both LDA and LSA show some limitations. For example, LDA assumes words are exchangeable, and the conceptual structure (e.g., noun phrases composed of several words) is not modeled; the number of topics is fixed



in advance; the distribution of topics does not allow correlation relationships to be identified; they depend highly on the content of the documents and do not offer the possibility of considering external information; and they require a considerable amount of content in order to work well. These disadvantages have led to an increased emphasis on clustering-based approaches [13].

Clustering algorithms have the advantage of using measures of similarity and distance to identify relationships between the terms in the topic detection process. Selecting a measure that correctly discerns which terms belong to the same topic is challenging because each type of measure captures different information and represents a different relationship weight between the terms. These measures mainly focus on calculating co-occurrence relationships [22–26] and contextual relationships based on distance [3]. However, there are other measures of semantic relatedness, such as meaning-based semantic similarity and context-based semantic relationships, which are poorly considered [21, 23–27], and the combined use of several measures has also not been reported.

Subjectivity, vagueness and imprecision are problems present in the semantic analysis of textual content at the level of the meaning of the words or the relationships between them. These problems arise from the ambiguity characteristic of natural language, which constitutes a challenge for solutions that require intensive semantic processing. To deal with these problems, different techniques have been developed under the fuzzy logic approach, such as fuzzy set techniques, fuzzy clustering algorithms, aggregation operators and others. Our research addresses this issue by proposing a new topic modeling approach based on clustering and fuzzy logic techniques.

This chapter proposes a topic modeling method for solving the relevant information extraction problem. The method was conceived through a clustering-based topic identification, which is carried out from the fuzzy logic perspective. To deal with the limitation of using one or another relationship measure between terms, a fuzzy logic solution [28] is used, which allows the different values obtained with different measures to be added into a single relationship value, assigning weights to these values. This fuzzy technique allows assigning a relevance weight that determines which of the measures are more relevant than others to obtain the overall similarity score, which reduces the adverse effects of the uncertainty associated with the decision to assign the weights. This makes it possible to carry out a deeper semantic analysis of the relationships between the terms and, in the clustering process, to discern better whether a term belongs to one topic (cluster) or another. In this sense, the aggregation operator OWA (Ordered Weighted Averaging) [28] is applied to combine several syntactic and semantic measures, which increases the level of semantic processing used to identify the distance between the candidate topics in the clustering process.

In order to demonstrate the contribution of the proposed method to extracting relevant information, two other solutions are presented that include topic modeling as an essential part. The first proposal is a keyphrases extraction method, which, based on the identified topics, selects from each of these the most representative phrases according to different criteria. The second proposal is an extractive



summarization method, where the relevance of the sentences is calculated based on their similarity with the identified topics, and the most relevant sentences are chosen to build the summary.

Several experiments were carried out to evaluate the effectiveness of our fuzzy-based topic modeling proposal through the text summarization and keyphrases extraction solutions. Specifically, the keyphrases extraction method was evaluated with the Inspec [29] and 500N-KPCrowd [30] datasets, and the performance was measured using the precision, recall, and F-measure metrics. On the other hand, the multi-document text summarization method was evaluated with the datasets in English and Spanish offered in MultiLing 2015 (http://multiling.iit.demokritos.gr/), and the quality of the summaries obtained was measured using ROUGE-N metrics [31]. The results obtained with both methods were compared with those obtained by other solutions, and the superiority of our proposals is evident in most of the evaluated metrics. The main contributions of our work are (1) a new way of processing the semantic information in the topic-modeling process, applying a fuzzy aggregation operator (OWA), and (2) showing, through two different proposals and four datasets (two for each proposal), that the proposed fuzzy topic modeling improves the results in relevant information extraction through text summarization and keyphrases extraction.

The rest of the chapter is organized as follows: Sect. 2 sets out a background of the main techniques used in the topic modeling process; Sect. 3 describes the proposed methods; Sect. 4 presents the datasets, the metric descriptions, the experimental results, and the corresponding analysis. Conclusions and future lines of work are given in Sect. 5.

2 Background

Topic modeling in TM solutions has been approached in several ways; among the most common are those based on LDA [11–17], LSA [18–21], and clustering algorithms [22, 24, 32–38]. Using a clustering algorithm for topic modeling provides the flexibility of using different measures of similarity and distance as the distance function to calculate relationships between terms or sentences in the text. Using these measures in clustering also eliminates redundancies, since terms or sentences that are strongly related are grouped together. However, the semantic information obtained from each of these measures is used independently, and their integrated use into a single measure that encloses the semantic meaning of each one in one value has not been reported. Subjectivity, vagueness and imprecision are problems present in the semantic analysis of textual content at the level of the meaning of the words or the relationships between them. These problems arise from the ambiguity characteristic of natural language, which constitutes a challenge for solutions that require intensive semantic processing.



To deal with these problems, different techniques have been developed under the fuzzy logic approach, such as fuzzy set techniques, fuzzy clustering algorithms, aggregation operators and others. However, few proposals based on topic modeling that apply fuzzy logic to carry out some level of semantic analysis can be identified [37, 38]. Topic modeling is widely used in other TM solutions to extract relevant information, as evidenced by the number of works reported under this approach [11–14, 17–22, 24, 26, 32–36, 38–42]. In general, these solutions rely on topic modeling to identify concise information semantically related to the topics addressed in the textual source. Two of the solutions in which topic modeling has been applied most frequently and in the most varied ways are keyphrases extraction and text summarization. Therefore, a more focused analysis of how topic modeling has been used in these two types of TM solutions can provide a better idea of the existing limitations.

2.1 Topic Modeling in Keyphrases Extraction

Topic modeling has been carried out in keyphrases extraction using several techniques. In the methods that follow this approach, once the topics have been identified, two different strategies are generally followed: (1) a ranking of topics is created to identify the most relevant ones, or (2) the most representative phrases are selected from the main topics. In the Salience Rank algorithm [17], salience measures are combined in an LDA-based topic modeling approach to obtain the rank of the words in the document. In [39, 40], LDA is also used to identify the topics of the document, on which the relevant sentences are subsequently identified. Topics in TSAKE [24] are modeled by co-occurrence graphs of words from the input text, which are constructed from Wikipedia articles. Subsequently, the nodes and core communities are identified using the fuzzy modularity criterion to measure the goodness of the overlapping community structures. Wikipedia is also used in WikiRank [43], where a related page is linked to significant sequences of words (concepts) identified in the text using the TAGME topic annotator [44]. On the other hand, candidate keyword phrases in WikiRank are identified as noun phrases through patterns. The concepts and the candidate keyphrases that contain them are then represented in a semantic graph. Finally, the candidate keyphrases with the most links to concepts are selected as relevant. In [41], another graph-based approach to topic modeling is proposed, using words and sentences as vertices and three types of relationships (sentence-to-sentence, word-to-word, and sentence-to-word) to define the corresponding edges. TopicRank [38] proposes a strategy to identify and analyze topics in order to extract relevant phrases. In this method, a hierarchical agglomerative clustering (HAC) algorithm [44] is used to group syntactically similar noun phrases into a topic or theme. Next, a graph is constructed using the topics as vertices, and each edge is labeled with a weight that represents the strength of the contextual relationship in the text between the candidate keyphrases contained in the corresponding pair of themes. Finally, only one



keyphrase is selected for each theme, which is highlighted as a weakness because a theme can be represented by more than one phrase in the same text. In [26], a more flexible procedure is proposed for the selection of keyphrases from topics, which incorporates the definition of a distance function between phrases in the candidate phrase grouping process. However, in this proposal the semantic processing remains limited, as in [38]. Liu et al. [22] also carry out a grouping of candidate phrases to represent the topics of a document; in this case, the distance function used is co-occurrence. Good keyphrases in a document should be semantically relevant to the topic of the document and cover the entire document well [22]. In this sense, a low use of semantic analysis was identified in the process of grouping and modeling themes, as well as in the other tasks included in the works analyzed. Semantic analysis has focused only on calculating co-occurrence relationships [22–25] or distance-based contextual relationships [26]. However, other levels and measures of semantic analysis, such as semantic similarity and measures of semantic relatedness, have not been explored.

2.2 Topic Modeling in Extractive Summarization

In extractive summarization solutions that carry out topic modeling, once the topics are identified, they are generally used to calculate relevance levels for the sentences, and the most relevant sentences are later selected to form the summary. In this field, topic modeling has been addressed from different angles, with clustering algorithms, LDA and LSA among the most used. Methods based on clustering assume the groups of terms or sentences to be topics and generally select the most representative sentences of the topics [32, 33] or calculate the sentences' relevance depending on the topics' coverage [35]. Blair-Goldensohn et al. [33] evaluate the representative sentences according to the size of the group to which they belong. On the other hand, in Angheluta et al. [35], grouping was used to eliminate redundancy in abstract sentences and identify topics in the text. Meaningful sentences are then chosen based on the number of keywords they contain, which are detected using an author topic segmentation module. From different features, clustering-based methods compute the similarity between sentences, also known as sentence salience [26]. In MEAD [36], this task is performed with three parameters, including the centroid value (the average cosine similarity between the sentences and the rest of the sentences in the documents). In Saggion and Gaizauskas [34], different sentence characteristics are also used, and sentences with the role of centroid are taken into account. In this proposal, the sentences are evaluated based on three characteristics: the similarity with the group's centroid, the similarity with the document's main part, and the sentence's position. With the main objective of eliminating redundancies in the topics and having good coverage of the documents, in [45] similar propositions are grouped; for each group, the most representative proposition is selected, and the sentences that will make up the summary are selected using the most representative propositions of the groups. In [11], the syntactic and semantic similarities between sentences in the clustering process were considered.



The use of different characteristics in the form of measures of similarity between sentences or terms allows the grouping process to capture different information, each represented as a different weighting of the relationship between the terms or sentences of the text. This allows a better semantic analysis in the grouping process, leading to better discerning which sentences or terms are associated with the same topic. There are various clustering algorithms and similarity measures for weighing the relationships between terms or sentences used as the distance function. This constitutes an advantage of the methods based on clustering to model the topics, since it allows a more significant analysis of the relationships between the topics. On the other hand, using other approaches to model topics, such as LDA or LSA, does not have this advantage.

LDA-based methods generally select, after discovering the topics of the text, the most important sentences in each topic to be part of the summary [43]. Roul [11] identifies the number of independent topics using LDA and three probabilistic models. The probabilistic methods of Hellinger distance, Jensen-Shannon divergence and KL divergence are used to calculate the similarity between each pair of topics. Then the LDA technique is used again to reduce a large set to a smaller set while keeping the important information. The representative sentence of each topic is selected and ordered by the importance of the corresponding topic so that it appears in the summary. In [44], a heuristic method is proposed that uses LDA to determine the optimal number of independent topics that represent the corpus. In [12], the approach is graph-based, where the nodes are the sentences of the document. To determine which sentences make up the summary, a weight is calculated for the nodes of the graph, having as one of the criteria the similarity of each node with the topics of the document. The similarity measure used is the semantic similarity through WordNet [46], and LDA was used to identify the topics. LSA is another of the most common approaches for topic modeling, and the most recent works based on it [18, 20, 21] follow a similar strategy to those based on LDA. Neither LDA nor LSA handles polysemy, which can result in semantically similar topics and summaries with redundant information. In [19], a clustering algorithm is used to group the most similar topics to reduce the redundancy of the topics identified with LSA.

The methods that carry out topic modeling allow identifying the main topics addressed in the text, trying to ensure that the sentences that make up the summary cover the most significant number of topics possible. Clustering strongly related terms or sentences together for topic modeling, in addition to the advantages mentioned above, reduces redundancy when selecting several sentences that have a high relevance value but, at the same time, very similar semantics. It also allows us to concentrate on different text terms or sentences that may be discussing the same topic; again, this favors selecting from these instances the ones that may be the most representative and should be considered in the construction of the summary. On the other hand, it is a great challenge to select the criteria by which to group. For this, there are various measures that allow calculating relationships between the terms or sentences of the text and that can capture different semantic or statistical information. The increased semantic analysis in the clustering allows



us to discern better whether a term or sentence belongs to one group. The measures reported capture different information about a relationship between text strings, so using several measures leads to capturing more information and increasing semantic analysis; however, few studies [11, 12, 26] use this strategy. In addition, the combined use of these measures would allow the aggregation of all the information collected by the different measures into a single relational value. In extractive summarization methods, identifying the main topics addressed in the text is one of the most promising ways to identify the most relevant sentences [14], so improving the identification of topics will result in a higher quality of the resulting summary.

3 Topic Modeling Aimed at Extracting Relevant Information

This section proposes a new topic modeling method for solving the relevant information extraction problem. The method was conceived through a clustering-based topic identification, which is carried out from the fuzzy logic perspective. In this sense, syntactic and semantic measures are combined, applying the OWA aggregation operator [28] to increase the semantic processing level of the candidate topics in the clustering process. Lexical-syntactic patterns were defined for extracting terms from the text as candidate topics. Finally, a fuzzy approach for treating the semantic relationships of the candidate topics within the clustering process is proposed for identifying the main topics in the texts, increasing the semantic analysis in this process. In order to apply and evaluate the proposed topic modeling in relevant information extraction, we develop a keyphrases extraction method and a multi-document extractive summarization method. Both are methods to determine relevance and reduce information overload in texts, and it is common for both to apply topic detection as an intermediate task in their process. Figure 1 shows a general outline of the topic modeling flow integrated with the Keyphrases Extraction and Multi-Document Extractive Summarization.

3.1 Fuzzy-Based Topic Modelling

3.1.1 Text Pre-processing

In this first phase, the NLP tasks that provide the syntactic information necessary to extract the candidate topics from the text are carried out. First, the plain text is extracted and segmented into paragraphs, sentences and down to the level of tokens (words, numbers, among others). Then, deep parsing is applied using the Freeling parser [47].



Fig. 1 General outline of the topic modelling flow integrated with the keyphrases extraction and multi-document extractive summarization

3.2 Candidate Topics Extraction

The extraction of candidate topics is based on the identification of conceptual phrases, which are named candidate topics. A set of lexical-syntactic patterns is defined for this purpose, such as [D | P | Z] + [ ] + NN; [D | P | Z] + [ ] + NN + NN; [Z] + <sn>; NN + [IN] + NN; VBN + NN; JJ + NN + [NN], in a similar way to that reported in [27]. These patterns have been defined according to the grammar labeling used by Freeling [47], and they combine a set of grammatical categories that are relevant in the composition of candidate topics. Most of these patterns have their origin in the most frequent patterns identified in the concepts included in several ontological knowledge resources analyzed, e.g., the ontology of the DBpedia project [48], which has more than 1000 concepts of different domains from Wikipedia. Other reported proposals consider only nominal phrases at this stage; through these patterns, the coverage of the main topics of the text can be increased.
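To illustrate the idea, the following is a minimal sketch of this kind of pattern-based candidate extraction. It assumes Penn-Treebank-style POS tags instead of Freeling's tagset, uses only a few simplified patterns, and all function and pattern names are illustrative rather than the authors' implementation.

```python
import re

# Simplified patterns expressed as regular expressions over a string of POS tags
PATTERNS = [
    r"JJ NN( NN)?",   # adjective + noun (+ noun), e.g. "fuzzy topic modeling"
    r"NN IN NN",      # noun + preposition + noun, e.g. "extraction of keyphrases"
    r"VBN NN",        # past participle + noun, e.g. "weighted graph"
]

def candidate_topics(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs for one sentence."""
    tags = " ".join(pos for _, pos in tagged_sentence)
    words = [w for w, _ in tagged_sentence]
    candidates = []
    for pattern in PATTERNS:
        for match in re.finditer(pattern, tags):
            # map the matched span in the tag string back to token indices
            start = tags[:match.start()].count(" ")
            length = match.group().count(" ") + 1
            candidates.append(" ".join(words[start:start + length]))
    return candidates

print(candidate_topics([("fuzzy", "JJ"), ("topic", "NN"), ("modeling", "NN")]))
```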

3.3 Topics Identification

Topic identification consists of clustering the candidate phrases using a hierarchical agglomerative clustering (HAC) algorithm [44]. Cluster-based topic modeling has been reported in several proposals [11, 26, 34, 36, 38, 39], although in these proposals semantic analysis has been limited to the use of co-occurrence only. In our proposal, this process is approached as a fuzzy logic problem to reinforce the semantic analysis in the grouping of the candidate phrases. With HAC it is not necessary to set the number of groups in advance, which is a point in its favor, since the aim is a solution without domain restrictions that works for documents of different lengths.



The low use of semantic analysis reported in the literature is considered a weakness under the assumption that a topic of a text can be modeled by a set of phrases that are commonly used in the same context and that have similar or semantically related meanings. The clustering process in our proposal starts from the distance function, which is considered as the value resulting from combining four measures of similarity and distance. These measures are the syntactic similarity and the word distance reported in [26], along with two other semantic measures, combined by applying a fuzzy aggregation operator. The distance measure (in words) is calculated using Eq. 1, where a is each word of a sentence A with a total of Na words and b is each word of another sentence B with a total of Nb words.

dist(A, B) = (1 / (Na · Nb)) · Σ_{a∈A} Σ_{b∈B} dist(a, b)    (1)

The two semantic measures were conceived according to the sentence-to-sentence similarity metric reported in [49] and using two word-to-word semantic similarity-relatedness metrics from the WordNet::Similarity package, specifically the Jiang & Conrath and Leacock & Chodorow metrics [50]. Additionally, the word distance metric reported in [10] was redefined (see Eq. 2).

D(F1, F2) = 1, if F1 and F2 appear in the same paragraph; 1 − avg_dist(F1, F2) / TW, otherwise    (2)

where avg_dist(F1, F2) is the average distance in words between the words included in the pair of phrases F1 and F2, and TW is the total number of words in the text. In this method, we aggregate the resultant numerical values (ai) from the four defined measures into one similarity relatedness score (SRS) for a pair of candidate topics. These measures represent features with different semantic meanings for the phrase clustering and different relevance levels for the decision-making in this process. Aggregation is the process of combining several values (numeric or non-numeric) into a single value in such a way that the final result of the aggregation takes into account, in a certain way, all the individual values added [51]. The ordered weighted average (OWA) operator [28] is one of the many aggregation operators that have been developed. OWA has been widely used as a solution to multiple-criteria decision problems. Therefore, the OWA operator can be very useful in combining semantics with other linguistic aspects through weights assigned to each measure to be added. Combining these different measures using OWA makes it possible to obtain groups of strongly related phrases from different semantic dimensions and, at the same time, to achieve broad coverage of the entire document in the topic modeling process.
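As a simple illustration of the distance-based measures in Eqs. 1 and 2, the following sketch assumes that the word positions at which two candidate phrases occur in the text are already known; the function names are illustrative.

```python
def avg_word_distance(positions_a, positions_b):
    # Average distance in words between every pair of occurrences (in the spirit of Eq. 1)
    pairs = [(pa, pb) for pa in positions_a for pb in positions_b]
    return sum(abs(pa - pb) for pa, pb in pairs) / len(pairs)

def contextual_distance(positions_a, positions_b, total_words, same_paragraph):
    # Normalised contextual relationship between two candidate phrases (Eq. 2)
    if same_paragraph:
        return 1.0
    return 1.0 - avg_word_distance(positions_a, positions_b) / total_words

# e.g. two phrases occurring at word positions [12, 80] and [15] in a 300-word text
print(contextual_distance([12, 80], [15], total_words=300, same_paragraph=False))
```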



This aggregation operation is carried out using Eq. 3, where each value bj of the measurements is multiplied by the weight wj calculated with Eq. 4. There are different methods to determine the weights to use in an OWA operator; the use of linguistic quantifiers is one of them [52], for example, the RIM (Regular Increasing Monotone) quantifier. In our proposal, the RIM quantifier "Most (Feng & Dillon)" reported in [53] is used, which is calculated using Eq. 5. According to [54], the most promising results were obtained using this quantifier.

f_owa(a1, …, an) = Σ_{j=1}^{n} wj · bj    (3)

wj = Q(j / n) − Q((j − 1) / n)    (4)

Q(x) = 0, if 0 ≤ x ≤ 0.5; (2x − 1)^0.5, otherwise    (5)
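The following is a compact sketch of how Eqs. 3–5 can be computed, assuming the four measure values for a pair of candidate topics are already available and normalised to [0, 1]; it follows the usual OWA convention of re-ordering the values decreasingly before weighting them.

```python
def rim_most_feng(x):
    # RIM quantifier "Most (Feng & Dillon)" (Eq. 5)
    return 0.0 if x <= 0.5 else (2 * x - 1) ** 0.5

def owa_weights(n, quantifier=rim_most_feng):
    # Eq. 4: w_j = Q(j/n) - Q((j-1)/n)
    return [quantifier(j / n) - quantifier((j - 1) / n) for j in range(1, n + 1)]

def owa(values, quantifier=rim_most_feng):
    # Eq. 3: the values are ordered decreasingly and combined with the OWA weights
    ordered = sorted(values, reverse=True)
    weights = owa_weights(len(values), quantifier)
    return sum(w * b for w, b in zip(weights, ordered))

# Similarity relatedness score (SRS) of a pair of candidate topics from four measures
print(owa([0.8, 0.6, 0.4, 0.7]))
```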

The hierarchical agglomerative clustering process creates a symmetric square matrix of size n (the total number of candidate phrases identified). Initially, each phrase is considered a topic in the matrix, composed of only the phrase itself. Each topic represents both a column and a row, while the intersection between each column and row contains the SRS value (weight value) of the relationship between the pair of candidate phrases corresponding to the pair of topics. In each iteration, the pair of topics with the highest weight value is grouped, and the SRS of the new topic is recalculated with respect to the others in the matrix. The average of the SRS values is used as the recalculation strategy for a pair of topics, as reported in TopicRank [38]. Among the commonly used linking strategies, TopicRank proposes using average linking because it represents a trade-off between complete and single linking. Using average linking, the relation's weight between the newly formed topic Tx and another topic Tk is calculated by Eq. 6.

R(Tx, Tk) = (R(Ti, Tk) + R(Tj, Tk)) / 2    (6)

where R(Tx, Tk) is the relation's weight between Tx and Tk, and Ti and Tj are the topics that were merged to form Tx. In each iteration, after grouping two topics into a new one, the weights associated with the relations of the new topic are recalculated. The clustering process is shown in Algorithm 1. The algorithm input is the candidate topics list resulting from the Candidate Topics Extraction phase, and the output is a list of all identified topics.



First, the matrix is filled (line 2 to line 6), where the distance (line 4) between each pair of topics (C[i] and C[j]) is calculated. The distance is calculated by Eq. 3 and corresponds to the aggregation with the OWA operator. After building the matrix, the clustering process begins. The threshold computed by the Threshold(matrix) function (line 7) is the average SRS between each pair of phrases and is used as the HAC stop condition. The Min-Value(array) function (line 8) finds the minimum distance within the array, and the corresponding topic pair (C[i] and C[j]) will be clustered. The function MergePairOfTopics(min, matrix) (line 9) merges the row-column pair representing the topics C[i] and C[j] into a new topic. In each iteration of this process, a new matrix is obtained, from which a new minimum distance is selected (line 9), and the process is repeated (lines 9–12) until the stopping condition is met.

The phase concludes by representing the input text in the form of a graph using the identified topics. The graph is built with the topics as vertices, and the edges are labeled with the weight of the relationship between them. Each edge represents the semantic relationship between the pair of topics that it joins. If the average distance between each pair of phrases that make up a pair of topics is low, then that pair of topics has a strong semantic relationship. The weight Wi,j of an edge is calculated using Eqs. 7 and 8, which are based on the reciprocal distance between the positions of the candidate phrases ci and cj in the text, where pos(ci) represents all positions pi of ci.

Wi,j = Σ_{ci∈Ti} Σ_{cj∈Tj} D(ci, cj)    (7)

D(ci, cj) = Σ_{pi∈pos(ci)} Σ_{pj∈pos(cj)} 1 / |pi − pj|    (8)
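A simplified sketch of the grouping loop in Algorithm 1 is given below. It assumes a helper owa_srs(p, q) that returns the OWA-aggregated similarity of two candidate phrases and uses the average SRS as the stopping condition, as described above; the helper names are illustrative and this is not the authors' implementation.

```python
def cluster_topics(phrases, owa_srs):
    """Greedy average-link agglomerative grouping of candidate phrases into topics."""
    if len(phrases) < 2:
        return [[p] for p in phrases]
    topics = {i: [p] for i, p in enumerate(phrases)}            # topic id -> phrases
    srs = {frozenset((i, j)): owa_srs(phrases[i], phrases[j])
           for i in topics for j in topics if i < j}
    threshold = sum(srs.values()) / len(srs)                    # average SRS: stop condition
    next_id = len(phrases)
    while len(topics) > 1:
        pair, best = max(srs.items(), key=lambda kv: kv[1])     # most related pair of topics
        if best < threshold:                                    # nothing strongly related left
            break
        i, j = tuple(pair)
        topics[next_id] = topics.pop(i) + topics.pop(j)         # merge the two topics
        for k in list(topics):
            if k != next_id:
                # average linking (Eq. 6): relation of the new topic with every other topic
                srs[frozenset((next_id, k))] = (srs.pop(frozenset((i, k)))
                                                + srs.pop(frozenset((j, k)))) / 2
        srs = {p: s for p, s in srs.items() if i not in p and j not in p}
        next_id += 1
    return list(topics.values())
```

Here owa_srs would encapsulate the fuzzy aggregation of the four syntactic and semantic measures described above.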



3.4 Topics Ranking Construction

In this phase, a ranking of the topics is created by calculating the relevance of each one, applying the TextRank model [23] on the previously constructed graph. The document is thus represented, as in TopicRank, by a complete graph in which topics are vertices and edges are weighted according to the strength of the semantic relations between vertices. Then, TextRank's graph-based ranking model is used to assign a significance score to each topic. The relevance score computed for each topic Ti is based on the concept of "voting" (inspired by the PageRank algorithm [55]): the adjacent topics of Ti with the highest scores contribute more to the relevance evaluation of the topic Ti. The relevance score S(Ti) is obtained through Eq. 9, where Vi is the set of adjacent topics of Ti in the graph, and λ is a damping factor that is usually 0.85 [55].

S(Ti) = (1 − λ) + λ · Σ_{Tj∈Vi} [ (Wi,j / Σ_{Tk∈Vj} Wj,k) · S(Tj) ]    (9)
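As an illustration, the following small sketch computes the scores of Eq. 9 by iterating until they stabilise, assuming the topic graph is given as a symmetric weight matrix; this follows the usual TextRank/PageRank-style iteration and is not the authors' code.

```python
def rank_topics(W, damping=0.85, iterations=50):
    """Iteratively compute the relevance score S(Ti) of Eq. 9 for every topic.

    W: symmetric matrix (list of lists); W[i][j] is the edge weight between topics i and j.
    """
    n = len(W)
    scores = [1.0] * n
    out_weight = [sum(row) for row in W]          # sum of W[j][k] over the neighbours of j
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            voting = sum(W[i][j] / out_weight[j] * scores[j]
                         for j in range(n) if j != i and out_weight[j] > 0)
            new_scores.append((1 - damping) + damping * voting)
        scores = new_scores
    return scores

# Example: three topics, topic 0 strongly related to the other two
print(rank_topics([[0, 2, 1], [2, 0, 0], [1, 0, 0]]))
```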

Algorithm 2 describes how the weight of each vertex and edge of the graph is calculated.

The list T of topics identified in the previous phase is the input of Algorithm 2. The weights of the edges (line 2 to line 7) corresponding to each pair of vertices T[i], T[j] are calculated with the function Distance(T[i], T[j]) (line 4), which refers to Eq. 6. Each edge, after being weighted, is saved in a list (line 5). Then, the weight of each vertex (line 8 to line 11) is calculated from the weight of its relations (the edge weights) with the rest of the vertices. For each vertex i, a weight is calculated with the function VertexScore(i, edgeList) (line 9), which corresponds to Eq. 9.



Once a vertex is weighted, it is stored in a list (line 10), which is returned at the end of the algorithm (line 12). The proposed topic modeling method outputs a list of ranked topics, which are clusters of strongly related phrases. Based on the proposed topic modeling, we propose two other solutions: (1) a keyphrases extraction proposal and (2) a multi-document extractive summarization proposal. Both proposals are described below.

3.5 Fuzzy-Based Topic Modelling Applied to Keyphrases Extraction

The keyphrases of a text, in comparison with the modeled topics, offer a higher level of identification of the relevant information. Although the topics obtained represent informative elements that are also relevant, they remain at a more general level. In the context of the keyphrases extraction method, the candidate topics are assumed to be candidate keyphrases. The keyphrases extraction proposal consists of selecting the most representative candidate keyphrases from the best-ranked topics using various criteria, which allows identifying a set of keyphrases representative of the document. From each of the most relevant topics, the keyphrases are selected following three criteria: (1) the most frequent candidate phrase; (2) the candidate phrase that appears first in the text; and (3) the candidate phrase that is most related to the others of the topic (centroid role). For each topic, the three criteria are applied, so more than one keyphrase can be selected from each topic. This provides greater flexibility than what was reported in [36], where only one criterion is considered, which may limit the coverage of the main topics of the document. Algorithm 3 describes in more detail the steps followed by the keyphrase selection described above. The input of the algorithm is a list R of the best-ranked topics. First, the n best-ranked topics are selected through the function SelectNBestRankedTopics(R) (line 1), which sorts the topics in descending order and selects the first n. Later, the keyphrases are selected from each of the n best topics using the three criteria (line 2 to line 6). Using the first criterion, the most frequent phrase of the topic is selected as a keyphrase through the function mostFrequentKP(i) (line 3). The second criterion selects as relevant the phrase that appears first in the text, through the function firstKP(i) (line 4). The last criterion selects as a keyphrase the phrase most similar to the other phrases in the topic, using the function centroidKP (line 5). The algorithm finishes by returning the list of selected keyphrases KP (line 7).
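A minimal sketch of this selection step is shown below, assuming each topic is a list of candidate phrases and that phrase frequencies, first positions in the text, and a pairwise similarity function are available; all names are illustrative.

```python
def select_keyphrases(ranked_topics, n, freq, first_pos, sim):
    """Select keyphrases from the n best-ranked topics using the three criteria.

    ranked_topics: topics (lists of candidate phrases) sorted by relevance;
    freq(p), first_pos(p): occurrence count and first word position of phrase p;
    sim(p, q): similarity between two phrases.
    """
    keyphrases = []
    for topic in ranked_topics[:n]:
        chosen = {
            max(topic, key=freq),                                           # (1) most frequent
            min(topic, key=first_pos),                                      # (2) appears first
            max(topic, key=lambda p: sum(sim(p, q) for q in topic if q != p)),  # (3) centroid
        }
        keyphrases.extend(kp for kp in chosen if kp not in keyphrases)
    return keyphrases
```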



3.6 Fuzzy-Based Topic Modelling Applied to Text Summarization

In the same sense in which keyphrases extraction was previously introduced, a text summary constitutes more concise information and a higher level of representation of the relevant information with respect to topic modeling. The summary of a text or a collection is a small set of sentences, compared to the number that comprises the source text(s), which concentrates the main topics. The text summarization proposal starts from the sentences obtained in the segmentation of the text during the Pre-processing phase of the topic modeling. The best-ranked topics are used to score the sentences of multiple text documents. Computing the similarity between the sentences and each candidate topic of the main topics guarantees that the best-scored sentences contain the main topics addressed in the documents. To build a summary from the identified topics, we propose scoring each sentence by its similarity to the candidate topics that compose the best-ranked topics and later building the summary with the best-ranked sentences. The senScore relevance of each sentence Si is calculated as the average of its cosine similarity with each phrase Pj of the vector of size N of each selected topic, as seen in Eq. 10. According to [56], cosine similarity is one of the most used criteria to calculate the similarity between sentences (from the vector pair of terms that compose them).

senScore(Si) = (1 / N) · Σ_{j=1}^{N} cos_similarity(Si, Pj)    (10)

The resulting summary is built by selecting the best-ranked sentences according to the above criterion. The selected sentences are ordered according to two aspects: first, the order of the documents in which the most relevant sentences appear, and then



the order in which they appear within the document. These two aspects contribute to the resulting summary having more consistency and coherence.
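The following sketch illustrates this sentence scoring and ordering step, assuming a simple bag-of-words cosine similarity and that each sentence is tagged with its document index and position; all helper names are illustrative and the scoring is a simplification of Eq. 10 that averages over the phrases of all selected topics.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def summarize(sentences, top_topics, size):
    """sentences: list of (doc_id, position, text); top_topics: lists of phrases."""
    phrases = [p for topic in top_topics for p in topic]

    def sen_score(text):
        vec = Counter(text.lower().split())
        # average cosine similarity with the phrases of the best-ranked topics
        return sum(cosine(vec, Counter(p.lower().split())) for p in phrases) / len(phrases)

    best = sorted(sentences, key=lambda s: sen_score(s[2]), reverse=True)[:size]
    # restore document order and, inside each document, the original sentence order
    return [text for _, _, text in sorted(best)]
```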

4 Evaluation and Discussion

In this section, we describe the experimental setup and the results obtained to evaluate the effectiveness of our fuzzy-based topic modeling proposal in several problems. According to the results reported in [54], the linguistic quantifier Most(Feng) [54] is used in all experiments, because the most promising results were obtained using this quantifier. The following experimental tasks were performed:

1. To evaluate the impact of the OWA-based topic modeling in the context of keyphrases extraction and multi-document summarization problems, using several recognized datasets and metrics.
2. To compare the results obtained with those of other proposals reported in both scenarios.

4.1 Experimental Results in Keyphrases Extraction Problem

The effectiveness of our proposal is evaluated with two standard and publicly available datasets, Inspec [29] and 500N-KPCrowd [30]. The Inspec dataset [29] consists of 2000 abstracts of scientific journal articles in computing collected between 1998 and 2002, divided into sets of 1000, 500, and 500 documents, used as training, validation, and test datasets, respectively. In our experiments we use the 500 test documents and, as the gold standard, the list of keyphrases provided in the collection for each document. The 500N-KPCrowd dataset [30] consists of 500 English-language broadcast news stories in 10 different categories (e.g., politics, sports), with 50 documents per category. The provided gold standard was built using Amazon's Mechanical Turk service in conjunction with various annotators. A characterization of these datasets is shown in Table 1. The performance of the keyphrases extraction solution using OWA-based topic modeling was measured using the precision (P), recall (R), and F-measure (F) metrics, according to Eqs. 11–13.

Table 1 Characterization of Inspec and 500N-KPCrowd datasets

Characteristics   | Inspec          | 500N-KPCrowd
Type of doc       | Paper abstracts | News stories
Documents         | 500             | 500
Ave. of words     | 124.4           | 333.33
Ave. of gold keys | 9.8             | 39.9



Precision = correct extracted keyphrases / total extracted keyphrases    (11)

Recall = correct extracted keyphrases / total gold standard keyphrases    (12)

F-measure = (2 × Precision × Recall) / (Precision + Recall)    (13)
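A small sketch of this macro-averaged evaluation protocol (Eqs. 11–13), assuming exact string matching between extracted and gold keyphrases, could look as follows.

```python
def prf(extracted, gold):
    """Precision, recall and F-measure for one document (Eqs. 11-13)."""
    correct = len(set(extracted) & set(gold))
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_average(per_document):
    """per_document: list of (extracted, gold) pairs for the whole dataset."""
    scores = [prf(e, g) for e, g in per_document]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
```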

For each document, we compute the macro-averaged precision, recall, and F-measure to measure the performance of the algorithm. In the first experimental task, the results of our proposal are evaluated with two approaches: the single-criteria approach, where topic modeling is based on WordNet-based semantic relatedness and similarity metrics, and the multi-criteria approach, where the metrics are aggregated with OWA in the topic modeling. Table 2 shows these results, and the best values obtained with the metrics are highlighted in bold. With the single semantic relationship metrics, an increase in precision is obtained, but as the size of the documents increases, the recall decreases. The OWA-based solution does not show this behavior and reached the highest values in general in both datasets; with this approach, more balanced values between precision and recall are achieved for both datasets, with different document sizes.

Finally, Table 3 compares the obtained results with other algorithms reported in the state-of-the-art literature, which are grouped (A, B, or C) according to the dataset in which they were evaluated. With Inspec, our proposal shows the highest value of F-measure in the state of the art, being a slight improvement (only 0.2%) with respect to Vega et al. [61]. Liu et al. [22] obtained the highest value of recall (66%), guaranteeing a good coverage of the document due to the use of clustering through semantic relationships between terms. Although the recall obtained by our proposal with 500N-KPCrowd was the least satisfactory result among the metrics, the values obtained in precision and F-measure were significantly higher than those of the rest of the reported works. Ying et al. [41] obtain the highest recall value with 500N-KPCrowd, but their precision and F-measure are 9% and 2%, respectively, lower than those of our proposal.

Table 2 Results achieved with Inspec and 500N-KPCrowd with the different topic modeling approaches

Topic modeling approach                                      | Metrics       | Inspec (P / R / F) | 500N-KPCrowd (P / R / F)
Single criteria (WordNet-based semantic relatedness metrics) | JCN           | 38.2 / 53.2 / 44.5 | 59.0 / 19.0 / 28.6
                                                             | LCH           | 39.9 / 55.1 / 46.2 | 55.0 / 18.9 / 28.1
                                                             | Avg           | 39.0 / 54.1 / 45.3 | 57.0 / 18.9 / 28.3
Multi-criteria (fuzzy aggregating metrics)                   | OWAMost(Feng) | 42.8 / 62.0 / 50.7 | 57.4 / 43.9 / 49.8



Table 3 Comparative results of the OWA-based topic modeling proposal with other keyphrase extraction methods

Systems                                    | Inspec (P / R / F)           | 500N-KPCrowd (P / R / F)
Group A
Liu et al. [22]                            | 35.0 / 66.0 / 45.7           | – / – / –
Thi et al. [57]                            | 38.1 / 46.1 / 41.7           | – / – / –
WikiRank [43]                              | 28.4 / 25.9 / 27.0           | – / – / –
EmbedRank [58]                             | 31.5 / 49.2 / 38.2           | – / – / –
Group B
SMAF extractor [59]                        | – / – / –                    | 42.7 / 24.8 / 29.8
YAKE! [60]                                 | – / – / –                    | 25.1 / 6.3 / 10.1
Group C
TextRank [23]                              | 31.2 / 43.1 / 36.2           | 26.5 / 6.3 / 10.3
TopicRank [38]                             | 36.4 / 39.0 / 35.6           | 26.2 / 23.9 / 25.0
TSAKE [24]                                 | 40.1 / 20.3 / 26.9           | 14.3 / 46.6 / 21.9
Salience rank [17]                         | 26.5 / 29.8 / 26.6           | 25.3 / 22.2 / 22.9
Ying et al. [41]                           | 43.0 / 40.2 / 39.6           | 48.7 / 49.8 / 47.8
RAKE [25]                                  | 33.7 / 41.5 / 37.2           | 12.0 / 3.8 / 5.8
Vega et al. [61]                           | 49.2 / 51.8 / 50.5           | 44.8 / 44.3 / 44.5
Avg. (Baseline)                            | 32.4 / 41.2 / 36.8           | 19.3 / 23.0 / 21.7
(Solution proposal) KPE with OWAMost(Feng) | 42.8 (III) / 62.0 (II) / 50.7 | 57.4 / 43.9 / 49.5

The identification of named entities was not considered in the patterns defined by our proposal to identify candidate phrases, even though it is common for this type of element to be identified as a relevant phrase. In 500N-KPCrowd, a large number of named entities appear as relevant phrases, which may explain the low recall value obtained in this dataset. Furthermore, the aggregation of various semantic measures with OWA may fail to identify relationships involving named entities. This situation suggests a later, more specific analysis of this type of phrase. Despite this, our proposal generally obtains good results, and the effectiveness of the proposed topic modeling for improving the extraction of relevant phrases is verified in at least two types of texts: paper abstracts and news stories.

4.2 Experimental Results in a Multi-document Summarization Problem

To evaluate the effectiveness of our OWA-based topic modeling proposal in a multi-document summarization solution, we used text corpora in Spanish and English provided by MultiLing 2015 for evaluating the Multilingual Multi-Document Summarization (MMS) task. MultiLing is a community initiative that

Table 4 Characterization of the datasets for evaluation of the proposed approach in a multi-document summarization problem

Characteristics                 | ENCol   | SPCol
Language                        | English | Spanish
Corpus                          | 15      | 15
Documents in each corpus        | 10      | 10
Sentences                       | 4469    | 5000
Ave. of sentences in the corpus | 297.93  | 333.33
Ave. of document sentences      | 28.92   | 33.34

promotes state-of-the-art research on automatic text summarization and provides data sets that favor research topics in this scientific area. The selected corpora are made up of news from WikiNews and associated with 15 different topics. Table 4 characterizes both corpora, called ENCol and SPCol, the one in English and the one in Spanish, respectively. In both collections, a summary is provided for each corpus, which was manually elaborated and is used as a reference for evaluating the quality metrics.

The quality of the summaries obtained using the proposed method was also measured using the precision (P), recall (R), and F-measure (F) metrics in the context of ROUGE [31]; specifically, ROUGE-1 (using unigrams) and ROUGE-2 (using bigrams). Tables 5 and 6 show the results obtained and their comparison with other reported solutions, including solutions participating in the MultiLing 2015 MMS task (Table 5), as well as more recent ones that have been evaluated with that selected corpus (Table 6).

As shown in Table 5, the results obtained by the proposed method outperform the baseline values in each of the metrics and test collections. The values obtained for ROUGE-1 improve on those of the rest of the systems in both datasets, with recall values above 50%. In the case of the results associated with ROUGE-2, the recall in ENCol and the precision in SPCol stand out, the latter being 20% higher than that of the majority of the systems. However, the results of UWB [66] are distinguished in recall and F-measure in SPCol. In UWB, the vector size of sentences, obtained with LSA, is used as a measure to evaluate their relevance within the topics. Although the results obtained are good, this technique may require more than one sentence to express all the information associated with the topics, leading to redundancies in the topics, which is a limitation in some collections [18]. However, this limitation is not present in the proposed approach when using clustering techniques for the modeling of the topics, since for each cluster only the most representative is selected; there is also the advantage of modeling the topics in a more granular way, representing them as groups of sentences that have a strong semantic relationship with each other.

In general, our proposal achieves higher results in most metrics and in both collections, with a better performance with ROUGE-1. Exceeding 50% of recall and F-measure in SPCol for ROUGE-1 is an important result, outperforming those reported by [64]. The recall obtained in ENCol for ROUGE-1 is another promising result, the only value higher than 50%, exceeding the second-best value by more than 7%.

Table 5 Results with MultiLing 2015 MMS and according to the OWA-based topic modeling approaches

[The layout of this table could not be recovered from the extracted text. It reports precision (P), recall (R) and F-measure (F) for ROUGE-1 and ROUGE-2, on ENCol and SPCol, for the MultiLing 2015 MMS participant systems SCE-Poly, BUPT-CIST, BGU-MUSE, NCSR/SCIFY, UJF-Grenoble, UWB, ExB, ESIAllSummr, IDAOCCAMS and GiauUngVan, and for the Proposal, whose row is: ROUGE-1 ENCol 0.420 / 0.533 / 0.469, ROUGE-2 ENCol 0.156 / 0.253 / 0.190, ROUGE-1 SPCol 0.489 / 0.572 / 0.521, ROUGE-2 SPCol 0.330 / 0.202 / 0.249.]




Table 6 Comparison of the results of the proposed approach with other text summarization methods reported

Systems                        | ROUGE-1 ENCol (P / R / F) | ROUGE-2 ENCol (P / R / F) | ROUGE-1 SPCol (P / R / F) | ROUGE-2 SPCol (P / R / F)
Li et al. [22]                 | – / – / –                 | – / – / 0.122             | – / – / –                 | – / – / 0.231
Rao and Devi [62]              | – / – / –                 | – / – / 0.219             | – / – / –                 | – / – / –
Al-Saleh and Bachir Menai [63] | – / 0.468 / –             | – / 0.173 / –             | – / – / –                 | – / – / –
Del Camino et al. [64]         | 0.427 / 0.460 / 0.442     | 0.186 / 0.185 / 0.185     | 0.485 / 0.558 / 0.518     | 0.259 / 0.285 / 0.271
Valladares et al. [65]         | 0.39 / 0.41 / 0.38        | 0.13 / 0.13 / 0.13        | 0.47 / 0.47 / 0.47        | 0.20 / 0.19 / 0.20
Proposal                       | 0.420 / 0.533 / 0.469     | 0.156 / 0.253 / 0.190     | 0.489 / 0.572 / 0.521     | 0.330 / 0.202 / 0.249

In the metrics in which our solution did not achieve the best results, it is outperformed mainly by [64] and only in one case by [62]. Although [64] generally obtains good results, its fundamental limitation is a high dependence on the information contained in the texts to be summarized also being present in WordNet, since contents that do not appear in this lexical resource are not used in the process of identifying the relevance to obtain the summary. In our proposal, semantic analysis using WordNet is carried out through metrics that evaluate the semantic relationship between sentences, but these are combined with other measures that do not depend on that resource in order to achieve greater coverage in processing the input textual content.

5 Conclusions and Future Works

This chapter presents a new topic modeling method based on OWA aggregation as its core to improve the semantic focus in relevant information extraction problems. Several syntactic and semantic measures were aggregated to model the most relevant linguistic characteristics of the candidate topics by applying an OWA aggregation operator. The increase in semantic analysis when calculating the relationships between the candidate phrases in the clustering process was achieved by aggregating the measures through the OWA operator, an aspect little considered in most existing proposals. In addition to the topic modeling method, two other relevant information



extraction methods were proposed: (1) keyphrases extraction and (2) text summarization. These two methods were evaluated on two text collections each, and both proposals generally obtained the best results compared to other reported works. The evaluation of these proposals allowed us to demonstrate the contribution of the semantics of the topic modeling method in two different contexts for extracting relevant information. In the experimentation on keyphrases extraction, the improvement achieved in the F-measure was slight in both datasets with respect to the works that reported the best results. The precision achieved in 500N-KPCrowd does show a more significant improvement compared to the other reported proposals. The proposed solution did not generally achieve significant improvements in the precision and recall metrics, but a better balance was obtained between them, which contributed to the improvement of the F-measure values. On the other hand, the experimental results of the text summarization method show improvements over other solutions in the literature evaluated with the MultiLing 2015 text collection for the MMS task. In both collections, 50% was exceeded in several of the ROUGE-1 metrics, and with respect to the works presented in MultiLing 2015, several results were higher by 20%.

Improving the precision results in general-domain texts is a challenge to be solved in the future; this could be achieved by specifically analyzing named entities. Additionally, the application of the OWA operator with other linguistic quantifiers and the performance of these variants in the topic modeling process will be evaluated. It is also intended to extend the experimental evaluation with other collections and to include other clustering algorithms.

References

1. Merrouni, Z.A., Frikh, B., Ouhbi, B.: Automatic keyphrase extraction: an overview of the state of the art. In: 2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), pp. 306–313. IEEE (2016). https://doi.org/10.1109/CIST.2016.7805062
2. Pazos-Rangel, R.A., Rivera, G., Gaspar, J., Florencia-Juárez, R.: Natural language interfaces to databases: a survey on recent advances. In: Handbook of Research on Natural Language Processing and Smart Service Systems, pp. 1–30. IGI Global (2021). https://doi.org/10.4018/978-1-7998-4730-4.ch001
3. Rao, S.X., Piriyatamwong, P., Ghoshal, P., Nasirian, S., de Salis, E., Mitrović, S., Zhang, C.: Keyword extraction in scientific documents (2022). arXiv preprint arXiv:2207.01888. https://doi.org/10.48550/arXiv.2207.01888
4. Widyassari, A.P., Rustad, S., Shidik, G.F., Noersasongko, E., Syukur, A., Affandy, A.: Review of automatic text summarization techniques & methods. J. King Saud Univ.-Comput. Inf. Sci. (2020). https://doi.org/10.1016/j.jksuci.2020.05.006
5. Dehru, V., Tiwari, P.K., Aggarwal, G., Joshi, B., Kartik, P.: Text summarization techniques and applications. In: IOP Conference Series: Materials Science and Engineering, vol. 1099, no. 1, p. 012042. IOP Publishing (2021). https://doi.org/10.1088/1757-899X/1099/1/012042
6. Pazos-Rangel, R.A., Florencia-Juarez, R., Paredes-Valverde, M.A., Rivera, G. (eds.): Handbook of Research on Natural Language Processing and Smart Service Systems. IGI Global (2021). https://doi.org/10.4018/978-1-7998-4730-4

Topic Modeling Based on OWA Aggregation to Improve the Semantic …

39

7. Kherwa, P., Bansal, P.: Topic modeling: a comprehensive review. EAI Endorsed Trans. Scalable Inf. Syst. 7(24), (2019). http://dx.doi.org/10.4108/eai.13-7-2018.159623 8. Indurkhya, N.: Emerging directions in predictive text mining, Wiley Interdisciplinary Reviews. Data Min. Knowl. Disc. 5(4), 155–164 (2015). https://doi.org/10.1002/widm.1154 9. Ignatow, G., Mihalcea, R.: An introduction to text mining. research design, data collection, and analysis. SAGE Publications, (2018). https://doi.org/10.4135/9781506336985 10. Sayyadi, H., Raschid, L.: A graph analytical approach for topic detection. ACM Trans. Internet Technol. (TOIT) 13(2), 4–23 (2013). https://doi.org/10.1145/2542214.2542215 11. Roul, R.K.: Topic modeling combined with classification technique for extractive multidocument text summarization. Soft. Comput. 25(2), 1113–1127 (2021). https://doi.org/10. 1007/s00500-020-05207-w 12. Belwal, R.C., Rai, S., Gupta, A.: A new graph-based extractive text summarization using keywords or topic modeling. J. Ambient. Intell. Humaniz. Comput. 12(10), 8975–8990 (2021). https://doi.org/10.1007/s12652-020-02591-x 13. Issam, K.A.R., Patel, S., others.: Topic modeling based extractive text summarization. (2021). arXiv preprint arXiv:2106.15313. https://doi.org/10.48550/arXiv.2106.15313 14. Belwal, R.C., Rai, S., Gupta, A.: Text summarization using topic-based vector space model and semantic measure. Inf. Process. Manage. 58(3), 102536 (2021). https://doi.org/10.1016/j. ipm.2021.102536 15. Rani, R., Lobiyal, D.: An extractive text summarization approach using tagged-LDA based topic modeling. Multimed. Tools Appl. 80(3), 3275–3305 (2021). https://doi.org/10.1007/s11 042-020-09549-3 16. Roul, R.K., Mehrotra, S., Pungaliya, Y., Sahoo, J.K.: A new automatic multi-document text summarization using topic modelling. In: International conference on distributed computing and internet technology, pp. 212–221. Springer, (2019). https://doi.org/10.1007/978-3-03005366-6_17 17. Teneva, N., Cheng, W.: Salience rank: Efficient keyphrase extraction with topic modelling. In: Proceedings of the 55th annual meeting of the association for computational linguistics (Volume 2: Short Papers), pp. 530–535 (2017). https://doi.org/10.18653/v1/P17-2084 18. Steinberger, J.: The UWB summariser at multiling-2013. In: Proceedings of the MultiLing 2013 workshop on multilingual multi-document summarization, pp. 50–54 (2013) 19. Hafeez, R., Khan, S., Abbas, M.A., Maqbool, F.: Topic based summarization of multiple documents using semantic analysis and clustering. In: 2018 15th International conference on smart cities: improving quality of life using ICT & IoT (HONET-ICT), pp. 70–74. IEEE, (2018). https://doi.org/10.1109/HONET.2018.8551325 20. Gupta, H., Patel, M.: Method of text summarization using lsa and sentence based topic modelling with Bert. In: 2021 International conference on artificial intelligence and smart systems (ICAIS), pp. 511–517. IEEE, (2021). https://doi.org/10.1109/ICAIS50930.2021.939 5976 21. Yadav, C., Sharan, A.: A new LSA and entropy-based approach for automatic text document summarization. Int. J. Semant. Web Inf. Syst. (IJSWIS) 14(4), 1–32 (2018). https://doi.org/10. 4018/IJSWIS.2018100101 22. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 Conference on empirical methods in natural language processing: Volume 1-Volume 1, pp. 257–266 (2009). https://doi.org/10.3115/1699510.1699544 23. 
Mihalcea, R., Tarau, P.: Textrank: Bringing order into text. In: Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 404–4011 (2004) 24. Rafiei-Asl, J., Nickabadi, A.: TSAKE: A topical and structural automatic keyphrase extractor. Appl. Soft Comput. 58, 620–630 (2017). https://doi.org/10.1016/j.asoc.2017.05.014 25. Rose, S., Engel, D., Cramer, N., Cowley, W.: Automatic keyword extraction from individual documents. Text Min.: Appl. Theory 1, 1–20 (2010). https://doi.org/10.1002/978047068964 6.ch1 26. Pérez-Guadarramas, Y., Rodríguez-Blanco, A., Simón-Cuevas, A., Hojas-Mazo, W., Olivas, J., Ángel.: Combinando patrones léxico-sintécticos y anélisis de tópicos para la extracción automática de frases relevantes en textos. Proces. Del Leng.Je Nat. 59, 39–46 (2017)

40

Y. Pérez-Guadarramas et al.

27. Jalil, Z., Nasir, J.A., Nasir, M.: Extractive multi-document summarization: a review of progress in the last decade. IEEE Access (2021). https://doi.org/10.1109/ACCESS.2021.3112496 28. Yager, R.R.: On ordered weighted averaging aggregation operators in multi-criteria decisionmaking. IEEE Trans. Syst. Man Cybern. 18(1), 183–190 (1988). https://doi.org/10.1109/21. 87068 29. Hulth.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on Empirical methods in natural language processing, pp. 216–223, (2003). https://doi.org/10.3115/1119355.1119383 30. Marujo, L., Viveiros, M., Neto, J.P.D.S.: Keyphrase cloud generation of broadcast news. (2013). arXiv preprint arXiv:1306.4606. https://doi.org/10.48550/arXiv.1306.4606 31. Lin, C.-Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp. 74–81, (2004) 32. Gupta, V.K., Siddiqui, T.J.: Multi-document summarization using sentence clustering. In: 2012 4th International conference on intelligent human computer interaction (IHCI), pp. 1–5. IEEE. (2012). https://doi.org/10.1109/IHCI.2012.6481826 33. Blair-Goldensohn, S., Evans, D., Hatzivassiloglou, V., McKeown, K., Nenkova, A., Passonneau, R., Schiffman, B., Schlaikjer, A., Siddharthan, A., Siegelman, S.: Columbia university at duc 2004. In: Proceedings of the document understanding conference, Boston, USA (2004) 34. Saggion, H., Gaizauskas, R.: Multi-document summarization by cluster/profile relevance and redundancy removal. In: Proceedings of the document understanding conference, pp. 6–7 (2004) 35. Angheluta, R., Mitra, R., Jing, X., Moens, M.-F.: KU Leuven summarization system at DUC 2004. In: DUC workshop papers and agenda, pp. 53–60 (2004) 36. Radev, D.R., Jing, H., Sty’s, M., Tam, D.: Centroid-based summarization of multiple documents. Inf. Process. & Manag. 40(6), 919–938 (2004). https://doi.org/10.1016/j.ipm.2003. 10.006 37. Toleu, A., Tolegen, G., Mussabayev, R.: Keyvector: Unsupervised keyphrase extraction using weighted topic via semantic relatedness. Comput. Y Sist. 23(3), 861–869 (2019). https://doi. org/10.13053/cys-23-3-3264 38. Bougouin, A., Boudin, F., Daille, B.: Topicrank: Graph-based topic ranking for keyphrase extraction. In: Proceedings of the sixth international joint conference on natural language processing, Asian federation of natural language processing, Nagoya, Japan, pp. 543–551 (2013) 39. Romanadze, E.L., Sudakov, V.A., Kislinsky, V.G.: Development of a keyphrase extraction method based on a probabilistic topic model. Model. Data Anal. 12(2), 20–33 (2022) 40. Li, T., Hu, L., Li, H., Sun, C., Li, S., Chi, L.: TripleRank: An unsupervised keyphrase extraction algorithm. Knowl.-Based Syst. 219, 106846 (2021). https://doi.org/10.1016/j.knosys.2021. 106846 41. Ying, Y., Qingping, T., Qinzheng, X., Ping, Z., Panpan, L.: A graph-based approach of automatic keyphrase extraction. Procedia Comput. Sci. 107, 248–255 (2017). https://doi.org/10.1016/j. procs.2017.03.087 42. Afsharizadeh, M., Ebrahimpour-Komleh, H., Bagheri, A., Chrupala, G.: A survey on multidocument summarization and domain-oriented approaches. J. Inf. Syst. Telecommun. (JIST). 1(37), 68 (2022). https://doi.org/10.52547/jist.16245.10.37.68 43. Yu, Y.N.V.: Wikirank: Improving keyphrase extraction based on background knowledge. (2018). arXiv preprint arXiv:1803.09000. https://doi.org/10.48550/arXiv.1803.09000 44. Ferragina, P., Scaiella, U.: Tagme: On-thefly annotation of short text fragments. 
In: Proceedings of the 19th ACM international conference on Information and knowledge management, pp. 1625–1638 (2010). https://doi.org/10.1145/1871437.1871689 45. Müllner, D.: Modern hierarchical agglomerative clustering algorithms. (2011). arXiv preprint arXiv:1109.2378. https://doi.org/10.48550/arXiv.1109.2378 46. Ernst, O., Caciularu, A., Shapira, O., Pasunuru, R., Bansal, M., Goldberger, J., Dagan, I.: Proposition-level clustering for multi-document summarization. In: Proceedings of the 2022

Topic Modeling Based on OWA Aggregation to Improve the Semantic …

47. 48. 49.

50.

51. 52. 53.

54.

55.

56. 57.

58.

59.

60. 61.

62.

63. 64. 65.

41

conference of the North American chapter of the association for computational linguistics: Human language technologies, pp. 1765–1779. (2022). https://doi.org/10.18653/v1/2022. naacl-main.128 Miller, G.A.: WordNet: An electronic lexical database. MIT press (1998). https://doi.org/10. 2307/417141 Padró, L., others.: Analizadores Multilingües en FreeLing. Linguamática. 3(2), 13–20 (2011) Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., others.: DBpedia—a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web 6(2), 167–195 (2015). https://doi.org/10.3233/ SW-140134 Li, Y., McLean, D., Bandar, Z.A., O’shea, J.D., Crockett, K.: Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18(8), 1138–1150 (2006). https://doi.org/10.1109/TKDE.2006.130 Pedersen, T., Patwardhan, S., Michelizzi, J., others.: WordNet: Similarity-measuring the relatedness of concepts. In: AAAI, pp. 25–29, (2004) Xu, Z., Da, Q.-L.: An overview of operators for aggregating information. Int. J. Intell. Syst. 18(9), 953–969 (2003). https://doi.org/10.1002/int.10127 Zadeh, L.A.: A computational approach to fuzzy quantifiers in natural languages. In: Computational linguistics, pp. 149–184. Elsevier, (1983). https://doi.org/10.1016/B978-0-08-0302539.50016-0 Feng, L., Dillon, T.S.: Using fuzzy linguistic representations to provide explanatory semantics for data warehouses. IEEE Trans. Knowl. Data Eng. 15(1), 86–102 (2003). https://doi.org/10. 1109/TKDE.2003.1161584 Perez-Guadarramas, Y., Barreiro-Guerrero, M., Simon-Cuevas, A., Romero, F.P., Olivas, J.A.: Analysis of OWA operators for automatic keyphrase extraction in a semantic context. Intell. Data Anal. 24(S1), 43–62 (2020). https://doi.org/10.3233/IDA-200008 Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. (1998).https:// doi.org/10.1016/S0169-7552(98)00110-X Sanchez-Gomez, J.M., Vega-Rodríguez, M.A., Pérez, C.J.: An indicator-based multi-objective optimization approach applied to extractive multi-document text summarization. IEEE Lat. Am. Trans. 27(8), 1291–1299 (2019). https://doi.org/10.1109/TLA.2019.8932338 Le, T.T.N., Le Nguyen, M., Shimazu, A.: Unsupervised keyphrase extraction: Introducing new kinds of words to keyphrases. In: Australasian Joint Conference on Artificial Intelligence, pp. 665–671. Springer, (2016). https://doi.org/10.1007/978-3-319-50127-7_58 Bennani-Smires, K., Musat, C., Hossmann, A., Baeriswyl, M., Jaggi, M.: Simple unsupervised keyphrase extraction using sentence embeddings. arXiv preprint arXiv:1801.04470 (2018). https://doi.org/10.48550/arXiv.1801.04470 Abdou, M., Salah, M., AbdelGaber, S.: Unsupervised automatic keywords and keyphrases extractor for web documents. Int. J. Comput. Sci. Inf. Secur. (IJCSIS). 15(10), (2017) Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A.: YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 509, 257–289 (2020). https://doi.org/10.1016/j.ins.2019.09.013 Vega-Oliveros, D.A., Gomes, P.S., Milios, E.E., Berton, L.: A multi-centrality index for graphbased keyword extraction. Inf. Process. Manage. 56(6), 102063 (2019). https://doi.org/10.1016/ j.ipm.2019.102063t Rao, P.R., Lalitha Devi, S.: Enhancing multi-document summarization using concepts. S¯adhan¯a. 43(2), 1–11 (2018). 
https://doi.org/10.1007/s12046-018-0789-y Al-Saleh, Menai, M.E.B.: Solving multi-document summarization as an orienteering problem. Algorithms. 11(7), 96 (2018). https://doi.org/10.3390/a11070096 del Camino Valle, O., Simón-Cuevas, A., Valladares-Valdés, E., Olivas, J.Á.R.F.P.: Generación de resúmenes extractivos de múltiples documentos usando grafos semánticos. In: Sociedad Española para el Procesamiento del Lenguaje Natural, (2019). https://doi.org/10.26342/201963-11

42

Y. Pérez-Guadarramas et al.

66. Valladares-Valdés, E., Simón-Cuevas, A., Olivas, J. A., Romero, F. P.: A fuzzy approach for sentences relevance assessment in multi-document summarization. In: International workshop on soft computing models in industrial and environmental applications, pp. 57–67. Springer (2019). https://doi.org/10.1007/978-3-030-20055-8_6 67. Nenkova., McKeown, K.: A survey of text summarization techniques, in Mining text data, pp. 43–76. Springer, (2012). https://doi.org/10.1007/978-1-4614-3223-4_3

An Affiliated Approach to Data Validation: US 2020 Governor's County Election

Manan Roy Choudhury

Abstract Fraud has reached new heights due to rising prices and demand for products and services, and at present it cannot be entirely prevented at the outset. Developing a framework for fraud detection remains a difficult task for researchers, even though fraud detection has drawn constant attention from the academic community, businesses, and regulatory organisations. Benford's law has effectively served this purpose since the late 1900s; within a decade of its adoption, a great deal of fraudulent activity was uncovered. The law was later applied to examine finances, forensics, and electoral fairness, among other things. This chapter suggests a formula, derived from Zipf's law, that can identify fairness and fallacies when applied to datasets from forensics, finance, elections, and similar socio-economic domains. In contrast to Benford's law, our suggested formula is supported by rigorous proof rather than observation alone. All the data sets used in this study have undergone in-depth analysis and several fitting tests. The dataset chosen for applying Benford's and Zipf's laws is the 2020 US Governor's County Election data.

Keywords Benford's law · Zipf's law · Lagrange's Interpolation · Chi-Square test · Mean absolute deviation · Mantissa arc test

1 Introduction

Fraud refers to practices, procedures, or systems that do not comply with the regulations put in place for the benefit of society. Since the first recorded scam, somewhere around 300 BC, not much has changed: time passed, civilisation developed, colonisation took place, and economies soared and crashed, but humankind's fundamental drives remained the same. Fraud has been committed every day up to the present. Even though con artists are very cunning, the laws of nature cannot be changed. The problem, though, is that regimes can obstruct their application, as Russia did in 2008, by erecting imposing administrative and legal barriers that displace objective and legitimate oversight; or, as happened in Ukraine in 2004, both parties to a conflict can deploy their own cadres of observers asserting or disputing an election's outcome. This makes clear the need for statistical methods and software that, when applied to official results, both strengthen the findings, conclusions, and supporting data of the primary observation and direct further inquiries into an election's murky underbelly. This is essential to understand, since observers have been accused of having political goals beyond promoting free and fair elections.

Additionally, when Benford's and Zipf's laws are combined, most everyday frauds, such as income tax, credit card, GDP, and election fraud, may be detected conclusively. The first-digit law, sometimes referred to as the law of anomalous digits or Benford's law, is a probability distribution over first significant digits. The first non-zero digit on a number's far left, such as 6 for 6897, 9 for 99, and 7 for 0.007895, is considered the number's first significant digit. According to Benford's law, as a digit's value rises from 1 to 9, the likelihood that it will appear in a dataset as the first significant digit drops logarithmically. The predicted probability values are shown in Table 1.

Zipf's law states that the rank-frequency distribution follows an inverse relationship for many types of data studied in the physical and mathematical sciences. Zipf's law was derived empirically using mathematical statistics and probability. The Zipfian distribution is one of a family of interconnected discrete power-law probability distributions; it works with datasets similar to, but not quite the same as, the zeta distribution. When applied to a corpus of natural language occurrences, Zipf's law states that a word's frequency is inversely proportional to its rank in the frequency table. It was first developed for the study of word frequencies. Because of this, the most common word will appear roughly twice as often as the second-most common word, three times as often as the third-most common word, and so on.

Table 1 Probability of significant first digits using Benford's Law

First significant digit | Probability
1 | 0.3010
2 | 0.1761
3 | 0.1249
4 | 0.0969
5 | 0.0792
6 | 0.0669
7 | 0.0580
8 | 0.0512
9 | 0.0458


For instance, "the" is the most frequent word, making up over 7% of all word occurrences in the Brown Corpus of American English text (69,971 out of slightly over 1 million). The statistical basis of Benford's and Zipf's laws is strong and has served as a kind of law of nature working against fraudsters. These laws have been applied to several scams and have frequently succeeded in catching fraud. Checking election results, income tax records, and credit card transactions are a few other areas where Benford's law has been applied rather frequently. Additionally, this study introduces a formula derived from the well-known Zipf's law. The approach has been successful in assessing the accuracy of social and economic statistics. We also wish to emphasise that our formula holds for linguistic datasets and for several other datasets that may not be tractable under the formulation of Benford's law.

2 Literature Review

In this section, we review the previously existing literature. Zipf's law is named after the American linguist George Kingsley Zipf (1902–1955), who popularised and attempted to explain it. Although he did not create the rule, he was inspired by its potential. Before Zipf identified it, the French stenographer Jean-Baptiste Estoup (1868–1951) noted its pattern, and the German physicist Felix Auerbach (1868–1933) also observed it in 1913.

The first individual to practically utilise Benford's law in the fields of fraud detection and forensic analysis was Nigrini [1]. He conducted cutting-edge theoretical research on Benford's law and on court procedures related to fraud convictions. "Forensic Analytics" by Mark J. Nigrini, published by Wiley, describes tests, such as Benford's law, to find fraud, errors, estimations, and biases in financial and electoral data. The Wall Street Journal and other national media outlets have praised him, and he has written multiple papers on Benford's law. A 2011 research paper by Arno Berger and Theodore P. Hill, "Benford's Law Strikes Back: No Simple Explanation in Sight for Mathematical Gem" [2], discusses the randomness underlying Benford's law [3, 4]. This study establishes the consistency and similarity of the features identified by Benford's law with the results produced by PCA and the random forest approach on the same datasets. In the research paper by Jernej Vičič and Aleksandar Tošić [5], "Use of Benford's Law on Academic Publishing Networks," Benford's law is applied to scientific cooperation networks. The report offers a particular way to evaluate the development of the research system and delves deeply into the discrepancies between many and varied research topics in Slovenia.


The research paper "Detecting Academic Fraud Using Benford Law: The Case of Professor James Hunton" by Joanne Horton, Dhanya Krishna Kumar, and Anthony Wood was released on August 11, 2021 [6]. The study discusses the potential of Benford's law for determining the validity of the cardinal data used in various academic and research works. "Statistical metalinguistics and Zipf/Pareto/Mandelbrot" (SRI International Computer Science Laboratory, May 29, 2011) by Peter G. Neumann outlined a clear and reasonable path for asserting the applicability of this law in numerous and varied statistical areas. On March 25, 2014, Steven Piantadosi published the study "Zipf's word frequency law in natural language: A critical appraisal and future directions" [7], according to which Zipf's law has numerous applications, ranging from statistical linguistics to fraud detection. Piantadosi worked on confirming Zipf's law using a corpus of different terms as the dataset, analysing the incidence of some specific frequent phrases and ranking them in order to apply Zipf's rule. The paper discusses the many statistical methodologies that can be used to apply Zipf's law; its primary focus is the unusual frequency distribution of words that appear more often than others, which is why it was investigated using various statistical techniques. The research article by Jing Wei, Bofeng Cai, Jianjun Zhang, Ke Wang, Sen Liang, and Yuhuan Geng [8] presents the study "Characteristics of carbon dioxide emissions in response to local development: Empirical explanation of Zipf's law in Chinese cities". It examines a dataset of CO2 emissions from various Chinese cities and validates it using empirical data; the study also suggests a modified model for investigating the urbanisation of China. In the research report by Qiuping [9], a novel model, the Zipf-Pareto law, is proposed, which can be derived using the law of least effort; it is obtained by a method that links higher calculus with the non-additivity of efficient thermodynamic engines. In his research entitled "Zipf's law applications in patent landscape analysis," Adel [10] argues that Pareto-Zipf analysis can derive business insights from a patent topography; the primary goal of that paper is to calibrate the scale and ascendancy of the patent landscape.

3 Benford's Law

The discovery of Benford's law, an observation-based law, dates back to the 1800s, when Newcomb [11] noticed that the earlier pages of his book of logarithm tables, particularly those beginning with "1," were in worse condition than the later ones. This observation sparked a thought that later manifested as a formulation. Newcomb [12] proposed that the probability of a single digit λ being the first digit of a number is equal to log(λ + 1) − log(λ).

A physicist named Frank Benford observed this phenomenon again in the early 1900s. He applied the formulation to a large number of datasets and, to his astonishment, found that almost all of them agreed with it. Nearly 20,000 observations were used in total by Benford for his study, which was a huge number to handle at the time, and Benford eventually received recognition for it. In a nutshell, Benford's law states (or rather, observes) that many processes and measurements produce numbers (such as returns on investment, city populations, location data, sales and marketing figures, and building codes) whose digits follow a pattern that might otherwise seem paradoxical: smaller leading digits are more common than larger ones (see Fig. 1). The mathematical formulation of Benford's law is

\[
\pi(d) = \log_{10}(d + 1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)
\]

where π(d) is the probability of d being the first significant digit, for all 1 ≤ d ≤ 9. In Fig. 1, π(d) is plotted on the y-axis and d on the x-axis. Benford's law in this form applies to d as the first digit. Apart from that, the following formula describes the likelihood of a digit d occurring as the ζ-th digit:

Fig. 1 Benford’s law plot, where the first digits and their corresponding probabilities are denoted in the x and y-axis, respectively


\[
\pi(d) = \sum_{\kappa = 10^{\zeta - 2}}^{10^{\zeta - 1} - 1} \log_{10}\left(1 + \frac{1}{10\kappa + d}\right)
\]

where π(d) is the probability of d being the ζ-th digit, for all 0 ≤ d ≤ 9 and ζ > 1. On a general note, Benford's law can be stated as

\[
\pi(d) =
\begin{cases}
\log_{10}\left(1 + \dfrac{1}{d}\right), & \forall\, 1 \le d \le 9 \text{ and } \zeta = 1,\\[2ex]
\displaystyle\sum_{\kappa = 10^{\zeta - 2}}^{10^{\zeta - 1} - 1} \log_{10}\left(1 + \frac{1}{10\kappa + d}\right), & \forall\, 0 \le d \le 9 \text{ and } \zeta > 1.
\end{cases}
\]
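To make this formulation concrete, the short Python sketch below (added here as an illustrative aid; it is not part of the chapter's original R-Studio workflow) evaluates both cases of the law:

```python
import math

def benford_first_digit(d):
    """Expected probability of d (1-9) being the first significant digit."""
    return math.log10(1 + 1 / d)

def benford_nth_digit(d, zeta):
    """Expected probability of d (0-9) being the zeta-th digit, for zeta > 1."""
    return sum(math.log10(1 + 1 / (10 * k + d))
               for k in range(10 ** (zeta - 2), 10 ** (zeta - 1)))

# First-digit probabilities; these match the values listed in Table 1
print([round(benford_first_digit(d), 4) for d in range(1, 10)])

# Second-digit probabilities; they decrease only slowly, from about 0.120 to 0.085
print([round(benford_nth_digit(d, 2), 3) for d in range(10)])
```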

In the late 1900s, the economist Hal Varian proposed using Benford's law to check for fraud in socio-economic datasets, and this was done. Benford's law rose to fame, and with good reason: its range of operation is broad. Benford's law has applications in many different fields, including:

• Determining election fraud by looking for anomalies in the dataset
• Examining pricing digits
• Testing genome data
• Checking for inaccuracies in scientific and research works.

Not only that, but Benford's law has also been presented as evidence in US criminal courts. Benford's law also seems to have become a popular plot device in recent years for numerous television shows and films. In the 2016 film "The Accountant", Benford's law serves as a plot device to root out fraud, and it also appears in several Netflix series, including Ozark. Everything, as we all know, has both benefits and drawbacks, and Benford's law is no different. Benford's law has a wide range of applications, but it is not infallible; the biggest illustration of this is the entire dataset for the 2020 US election [13], which did not adhere to Benford's law and was therefore re-examined, yet no discrepancy was discovered.

4 Zipf's Law

In 1935, George Kingsley Zipf, who studied statistical regularities in word occurrences, observed that some words are used infrequently while others appear very frequently. This was the primary idea behind Zipf's law. Zipf's law [14] states (or rather observes) that the rank-frequency distribution follows an inverse relationship for many types of data analysed in the physical and social sciences. Originally defined in terms of quantitative linguistics, Zipf's law states that, for a corpus of natural language utterances, the frequency of each word is inversely proportional to its rank in the frequency table. As a result, the most frequent word will be used roughly twice as frequently as the second-most frequent word, three times as frequently as the third-most frequent word, and so on.

However, this law is not only applicable to linguistics; it can also be applied to numeric data. If a set of data is arranged in descending order of frequency, frequency times rank will always be approximately constant. Mathematically,

\[
f_{\zeta} \times \zeta = \vartheta
\]

where f_ζ is the frequency of the data item with rank ζ, ζ is the rank, and ϑ is a constant. For Zipf's law, we frequently use a log–log graph, where the y-axis is the logarithm of the frequency and the x-axis is the logarithm of the rank. A log–log graph displays the shape of the curve with more clarity and depth. Taking the logarithm on both sides of the Zipf's law equation above gives

\[
\log(f_{\zeta}) + \log(\zeta) = \log(\vartheta)
\]

This is the equation of a straight line with intercept log(ϑ). Therefore, a dataset adheres closely to Zipf's law if its log–log curve closely matches a straight line with intercept log(ϑ). The graph for Zipf's law is given in Fig. 2.

Zipf's law was first developed using quantitative linguistics. The most frequent word was found to be "the," followed by "of" and "and," ranked second and third. For any kind of linguistic data, the frequency of a word is claimed to be inversely proportional to its position in the frequency table: the frequency of "of" was found to be about half that of "the," while that of "and" was about one-third that of "the". As appreciated, Zipf's law can potentially model a wide range of problems in business and text analytics [15, 16]. The history of this law is noteworthy because, although Zipf popularised and explained it, he never claimed to have created it. Before Zipf, the French stenographer Jean-Baptiste Estoup seems to have recognised the pattern, and the German physicist Felix Auerbach made a similar observation in 1913 [17]. Though it differs in distribution, Zipf's law is conceptually comparable to Benford's law. Along with Benford's law [18, 19], we also use Zipf's law to verify datasets in a later section of the chapter.
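To make the rank-frequency relation concrete, the following sketch (illustrative only; the word counts are hypothetical and are not taken from the chapter's dataset) ranks a set of frequencies, confirms that frequency times rank stays roughly constant, and fits the slope of the log–log curve, which should be close to −1 for Zipfian data:

```python
import math

def zipf_check(frequencies):
    """Sort frequencies by rank, compute frequency * rank, and fit the log-log slope."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    products = [freq * rank for rank, freq in enumerate(ranked, start=1)]
    return slope, products

# Hypothetical word counts roughly following c / rank with c = 600
slope, products = zipf_check([600, 310, 205, 150, 118, 101])
print(round(slope, 2))   # close to -1
print(products)          # frequency * rank stays roughly constant (about 600)
```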


Fig. 2 Zipf's law graph is a rectangular hyperbola xy = c, shown with c = 1 and c = 2, respectively

5 Application

In this section, we apply the aforementioned laws, Benford's law and Zipf's law, to a robust dataset. Here, we work with the following dataset:

• US 2020 election data

This dataset is available online on Kaggle. The link to the dataset is: https://drive.google.com/file/d/15NQo8gu4pGdHkNKxu6NeovDZi_aQ2JCi/view?usp=share_link

Kaggle, a subsidiary of Google LLC, hosts an online community of data scientists and machine learning specialists. With the help of Kaggle, users can collaborate with other data scientists and machine learning specialists to find and publish datasets, investigate and build models in a web-based data science environment, and more. Before applying Benford's law, some tests are carried out to determine whether the dataset is appropriate for it. We perform three tests [20], namely the Mean Absolute Deviation, the Chi-square test, and the Mantissa Arc test (Table 2). R-Studio is used to perform these tests.

Table 2 Conformity range for the first digits and second digits

Conformity range | First digits | Second digits
Close conformity | 0.000–0.006 | 0.000–0.008
Acceptable conformity | 0.006–0.012 | 0.008–0.010
Marginal conformity | 0.012–0.015 | 0.010–0.012
Non-conformity | Above 0.015 | Above 0.012

• Mean Absolute Deviation (MAD): The Mean Absolute Deviation of a dataset is the mean of the absolute deviations from a central point, the mean. For a dataset {a₁, a₂, a₃, …, aₙ},

\[
\mathrm{MAD} = \frac{1}{n}\sum_{k=1}^{n} \left| a_k - \bar{X}(a) \right|
\]

where X̄(a) is the mean value, n is the total number of values, and aₖ are the data values.

• Chi-Square test: The Pearson Chi-Squared test [21] is used to evaluate statistically whether there is a significant discrepancy between expected (theoretical) frequencies and observed frequencies. Given that the null hypothesis is correct, as n tends to infinity the statistic follows a χ² distribution:

\[
Z^2 = \sum_{i=1}^{k} \frac{(z_i - m_i)^2}{m_i} = \sum_{i=1}^{k} \frac{z_i^2}{m_i} - k
\]

where m_i = k p_i for all i ∈ N_k, the p_i are the probabilities calculated under the null hypothesis, and Σ_{i ∈ N_k} p_i = 1.

• Mantissa Arc test: With this test, we identify the centre of mass of a set of mantissas spread around a unit circle. If the mantissas of the numbers are distributed evenly about the unit circle, the centre of mass (mean vector) will be at the origin (0, 0) [22]. The "Difference" graph is a scatter plot of the difference between the synthetic probability and the theoretical probability given by our suggested formula; if its points lie around zero, the dataset is likely to be valid. Based on this scatter plot, we can determine whether the dataset deviates from our suggested formula, i.e., whether it is valid or not.

The Mean Absolute Deviation test is the most precise test for determining whether the data fit Benford's law. The Chi-Square and Mantissa Arc tests are used for additional confirmation if the MAD test yields a result of "marginal conformity" or "non-conformity" (see Table 2).
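As a simplified illustration of how the first two checks can be computed for a first-digit distribution (a sketch only: the chapter's own figures were produced in R-Studio and, judging from the 89 degrees of freedom, appear to be based on the first two digits, so the numbers printed here are not expected to match the values reported in the next subsection):

```python
import math

BENFORD_FIRST = [math.log10(1 + 1 / d) for d in range(1, 10)]

def first_digit_conformity(counts):
    """counts[i] is the number of records whose first digit is i + 1 (i = 0..8)."""
    total = sum(counts)
    observed = [c / total for c in counts]
    # MAD: mean absolute deviation between observed and expected proportions
    mad = sum(abs(o - e) for o, e in zip(observed, BENFORD_FIRST)) / len(counts)
    # Pearson chi-square statistic against the Benford expected counts
    expected = [total * p for p in BENFORD_FIRST]
    chi_square = sum((c - e) ** 2 / e for c, e in zip(counts, expected))
    return mad, chi_square

# First-digit frequencies reported later in Table 4
mad, chi2 = first_digit_conformity([314, 162, 110, 108, 94, 63, 71, 60, 43])
print(round(mad, 6), round(chi2, 2))
```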


5.1 US 2020 Election

The 59th quadrennial presidential election was held on Tuesday, November 3, 2020. A contender needs 270 of the 538 electoral votes to win the election. If a candidate wins states that are not expected to support them, it is a good indication that they are performing well. A small snippet of the whole dataset is provided in Table 3 to understand it better. Here, we consider the first digit of the total votes from each county, and we apply all the tests, i.e., the Mantissa Arc test, the Pearson Chi-Square test, and the Mean Absolute Deviation, on that. If the dataset passes all the tests, we then apply Benford's and Zipf's laws to it to check its correctness and validate it.

Results of the mentioned tests for the US 2020 Governor's County Election dataset:

1. Mantissa:
   • Mean = −0.506
   • Skewness = −0.067
   • Var = 0.087
   • Ex. Kurtosis = −1.264
2. Mean Absolute Deviation (MAD): 0.002546374
   • Distortion Factor: 1.881918
   • MAD Conformity: Close Conformity
3. Pearson Chi-Square test:
   • χ² = 90.378
   • df = 89
   • p-value = 0.4393
4. Mantissa Arc Test:
   • L2 = 0.0031049
   • df = 2
   • p-value = 0.04161

All these results tell us that this dataset is suitable for Benford's law to be applied to it. We considered the first digits of the dataset, and the results in Table 4 were obtained. Adding up all the frequencies, we get \( \sum_{i=1}^{9} f_i = 1025 \). The synthetic probability, obtained using the formula \( P = f_i \big/ \sum_{i=1}^{9} f_i \), is compared with the theoretical probabilities obtained by Benford's law for the case ζ = 1 (see Table 5). The practical probability closely matches the theoretical probability.
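The practical-probability column of Table 5 can be reproduced directly from the Table 4 frequencies; the snippet below is a small illustrative check (the chapter's own computations were done in R-Studio):

```python
import math

# First-digit frequencies from Table 4
frequencies = {1: 314, 2: 162, 3: 110, 4: 108, 5: 94, 6: 63, 7: 71, 8: 60, 9: 43}
total = sum(frequencies.values())  # 1025

for digit, count in frequencies.items():
    practical = count / total                    # P = f_i / sum(f_i)
    theoretical = math.log10(1 + 1 / digit)      # Benford, first-digit case
    print(digit, round(practical, 9), round(theoretical, 4))
```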


Table 3 Snippet of the county-level data from the 59th quadrennial presidential election (2020)

State | County | Current votes | Total votes | Percent
Delaware | Kent county | 85,415 | 87,025 | 100
Delaware | New castle country | 280,039 | 287,633 | 100
Delaware | Sussex country | 127,181 | 129,352 | 100
Indiana | Adams county | 14,154 | 14,209 | 100
Indiana | Allen county | 168,312 | 169,082 | 100
Indiana | Bartholomew county | 36,037 | 36,235 | 100
Indiana | Benton county | 4100 | 4114 | 100
Indiana | Blackford county | 5283 | 5350 | 100
Indiana | Boone county | 38,492 | 38,520 | 100
Indiana | Brown county | 8957 | 8981 | 100
Indiana | Carroll county | 9510 | 9510 | 100
Indiana | Cass county | 15,146 | 15,198 | 100
Indiana | Clark county | 57,426 | 57,869 | 100
Indiana | Clay county | 12,186 | 12,267 | 100
Indiana | Clinton county | 12,891 | 12,949 | 100
Indiana | Crawford county | 4859 | 4944 | 100
Indiana | Daviess county | 11,860 | 11,954 | 100
Indiana | Dearborn county | 25,295 | 25,383 | 100
Indiana | Decatur county | 12,260 | 12,235 | 95
Indiana | DeKalb county | 19,493 | 19,628 | 100
Indiana | Delaware county | 47,949 | 48,191 | 100
Indiana | Dubois county | 21,588 | 21,770 | 100
Indiana | Elkhart county | 74,425 | 74,425 | 100
Indiana | Fayette county | 10,054 | 10,136 | 100
Indiana | Floyd county | 41,589 | 41,802 | 100
Indiana | Fountain county | 7952 | 7986 | 100
Indiana | Franklin county | 11,822 | 12,000 | 100
Indiana | Fulton county | 9123 | 9147 | 100
Indiana | Gibson county | 16,130 | 16,161 | 100
Indiana | Grant county | 27,021 | 27,159 | 100
Indiana | Greene county | 14,694 | 14,784 | 100
Indiana | Hamilton county | 193,584 | 193,999 | 100
Indiana | Hancock county | 42,809 | 42,911 | 100
Indiana | Harrison county | 20,199 | 20,236 | 100
Indiana | Hendricks county | 88,122 | 88,505 | 100
Indiana | Henry county | 21,061 | 21,160 | 100
Indiana | Howard county | 40,547 | 40,630 | 100
Indiana | Huntington county | 17,731 | 17,829 | 100
Indiana | Jackson county | 19,136 | 19,238 | 100
Indiana | Jasper county | 15,371 | 15,475 | 100
Indiana | Jay county | 8405 | 8461 | 100
Indiana | Jefferson county | 14,537 | 14,697 | 100
Indiana | Jennings county | 12,153 | 12,251 | 100
Indiana | Johnson county | 77,274 | 77,618 | 100
Indiana | Knox county | 15,840 | 16,011 | 100
Indiana | Kosciusko county | 35,658 | 35,787 | 100
Indiana | LaGrange county | 10,575 | 10,633 | 100
Indiana | Lake county | 217,232 | 219,956 | 100
Indiana | LaPorte county | 48,618 | 49,392 | 100
Indiana | Lawrence county | 20,868 | 21,025 | 100
Indiana | Madison county | 51,806 | 51,806 | 100
Indiana | Marion county | 390,854 | 390,854 | 100
Indiana | Marshall county | 19,804 | 19,927 | 100
Indiana | Martin county | 5118 | 5144 | 100
Indiana | Miami county | 14,361 | 14,473 | 100
Indiana | Monroe county | 62,523 | 63,151 | 100
Indiana | Montgomery county | 17,184 | 17,198 | 100
Indiana | Morgan county | 35,947 | 36,159 | 100
Indiana | Newton county | 6556 | 6591 | 100
Indiana | Noble county | 19,088 | 19,190 | 100
Indiana | Ohio county | 3186 | 3186 | 100
Indiana | Orange county | 8759 | 8832 | 100
Indiana | Owen county | 9846 | 9904 | 100
Indiana | Parke county | 6972 | 7011 | 100
Indiana | Perry county | 8648 | 8722 | 100
Indiana | Pike county | 6141 | 6220 | 100
Indiana | Union county | 3459 | 3483 | 100
Indiana | Vanderburgh county | 77,390 | 77,662 | 100
Indiana | Vermillion county | 7386 | 7481 | 100
Indiana | Vigo county | 43,335 | 43,594 | 100
Indiana | Wabash county | 14,511 | 14,573 | 100
Indiana | Warren county | 4458 | 4472 | 100
Indiana | Warrick county | 33,524 | 33,829 | 100
Indiana | Washington county | 12,073 | 12,139 | 100
Indiana | Wayne county | 27,620 | 27,679 | 100
Indiana | Wells county | 14,050 | 14,050 | 100
Indiana | White county | 11,138 | 11,166 | 100
Indiana | Whitley county | 17,451 | 17,556 | 100
Missouri | Adair county | 10,318 | 10,336 | 100
Missouri | Andrew county | 9733 | 9752 | 100
Missouri | Atchison county | 2770 | 2814 | 100
Missouri | Audrain county | 10,608 | 10,656 | 100
Montana | Broadwater county | 4099 | 4110 | 100
Montana | Carbon county | 7076 | 7101 | 100
Montana | Carter county | 857 | 864 | 100
Montana | Cascade county | 39,832 | 40,039 | 100
Montana | Chouteau county | 2972 | 2992 | 100
Montana | Custer county | 5869 | 5883 | 100
Montana | Daniels county | 1009 | 1030 | 100
Montana | Dawson county | 4826 | 4837 | 100
Montana | Deer lodge county | 4873 | 4891 | 100
Montana | Fallon county | 1555 | 1576 | 100
Montana | Fergus county | 6499 | 6534 | 100
Montana | Flathead county | 59,825 | 60,258 | 100
Montana | Gallatin county | 70,956 | 71,507 | 100
Montana | Garfield county | 813 | 813 | 100
Montana | Glacier county | 5684 | 5719 | 100

Table 4 Digit-wise frequency for first digit place for US Election 2020 – Governor's County dataset

Digit | Frequency
1 | 314
2 | 162
3 | 110
4 | 108
5 | 94
6 | 63
7 | 71
8 | 60
9 | 43

Table 5 Tabular comparison between practical and theoretical probabilities for the 1st digit place for the US Election 2020 – Governor's County dataset

Practical probability | Theoretical probability
0.306341463 | 0.3010
0.158048780 | 0.1760
0.107317073 | 0.1249
0.105365854 | 0.0969
0.091707317 | 0.0791
0.061463415 | 0.0669
0.069268293 | 0.0579
0.058536585 | 0.0511
0.041951220 | 0.0457

The comparison plot of the probabilities is shown in Fig. 3.

Secondly, we considered the second digit of the dataset, and we obtained the results in Table 6. Adding up all the frequencies again, we get \( \sum_i f_i = 1025 \). The synthetic probability is obtained using the relation \( P = f_i \big/ \sum_i f_i \) and is compared with the theoretical probabilities determined by Benford's law in the scenario ζ > 1 (see Table 7). The practical probability closely matches the theoretical probability. The comparison plot of the probabilities is shown in Fig. 4.

Next, we considered the third digit of the dataset, and we obtained the results in Table 8.

Fig. 3 Comparison plot between synthetic probability and the probability by Benford’s law

Table 6 Frequency of the second digits of the dataset

Digit | Frequency
0 | 120
1 | 138
2 | 93
3 | 99
4 | 123
5 | 90
6 | 87
7 | 99
8 | 95
9 | 81

Table 7 Comparative table between the practical and theoretical probability of the occurrence of the second digits of the dataset

Practical probability | Theoretical probability
0.117187500 | 0.120
0.134765625 | 0.114
0.090820313 | 0.109
0.096679688 | 0.104
0.120117188 | 0.100
0.087890625 | 0.097
0.084960938 | 0.093
0.096679688 | 0.090
0.092773438 | 0.088

Adding up all the frequencies again, we get \( \sum_i f_i = 1025 \). The synthetic probability is obtained using the relation \( P = f_i \big/ \sum_i f_i \) and is compared with the theoretical probabilities determined by Benford's law in the scenario ζ > 1 (see Table 9). The practical probability closely matches the theoretical probability. The comparison plot of the probabilities is shown in Fig. 5.

Finally, we considered the last digit of the dataset, and we obtained the results in Table 10. Adding up all the frequencies again, we get \( \sum_i f_i = 1025 \). The synthetic probability is obtained using the relation \( P = f_i \big/ \sum_i f_i \) and is compared with the theoretical probabilities determined by Benford's law in the scenario ζ > 1 (see Table 11). The practical probability was again close to the theoretical probability. The comparison plot of the probabilities is shown in Fig. 6.

Now, we validate the dataset using Zipf's law.


Fig. 4 Comparative bar plot of synthetic and theoretical probabilities of the occurrence of the second digit

Table 8 Frequency of the third digits of the dataset

Digit | Frequency
0 | 112
1 | 117
2 | 97
3 | 103
4 | 95
5 | 96
6 | 104
7 | 94
8 | 106
9 | 101

The results are presented in Table 12. In Zipf's Column I, the frequency is multiplied by the rank, and Zipf's Column II gives the corresponding probabilities, rounded to 2 decimal places. Zipf's Column II is obtained as \( ZC1_i \big/ \sum_{i=1}^{9} ZC1_i \), where ZC1 denotes Zipf's Column I. The log–log graph only moderately resembles a straight line; thus, we cannot get a conclusive result from the log–log curve alone (see Fig. 7).

The dataset for the 2020 US Election (Governor's County) appears to be legitimate according to all of these analyses. The fact that the dots in the scatter plot are so near to 0 indicates that the dataset is accurate.

Table 9 Comparative table between the practical and theoretical probability of the occurrence of the third digits of the dataset

Practical probability | Theoretical probability
0.109375000 | 0.102
0.114257813 | 0.101
0.094726563 | 0.101
0.100585938 | 0.101
0.092773438 | 0.100
0.093750000 | 0.100
0.101562500 | 0.099
0.091796875 | 0.099
0.103515625 | 0.099

Fig. 5 Comparative bar plot of synthetic and theoretical probabilities of the occurrence of the third digit

Table 10 Frequency of the last digits of the dataset

Digit | Frequency
0 | 100
1 | 100
2 | 115
3 | 98
4 | 115
5 | 106
6 | 98
7 | 86
8 | 98
9 | 109

Table 11 Comparative table between the practical and theoretical probability of the occurrence of the last digits of the dataset

Practical probability | Theoretical probability
0.097560976 | 0.1111
0.097560976 | 0.1111
0.112195122 | 0.1111
0.095609756 | 0.1111
0.112195122 | 0.1111
0.103414634 | 0.1111
0.095609756 | 0.1111
0.083902439 | 0.1111
0.095609756 | 0.1111

Fig. 6 Comparative bar plot of synthetic and theoretical probabilities of the occurrence of the last digit

Also, the graphs of Benford's law and our suggested formula strongly mirror those of this dataset. With all the test results and the graphs in Fig. 8, it is clear from these arguments that our technique can also identify fairness in datasets, and this dataset appears to be fair.


Table 12 Digit, frequency, rank and Zipf's columns for the US Election 2020 dataset

Digit | Frequency (Vm) | Rank (m) | Zipf's column I | Zipf's column II
1 | 314 | 1 | 314 | 0.09
2 | 162 | 2 | 324 | 0.09
3 | 110 | 3 | 330 | 0.09
4 | 108 | 4 | 432 | 0.12
5 | 94 | 5 | 470 | 0.13
6 | 63 | 7 | 441 | 0.14
7 | 71 | 6 | 426 | 0.10
8 | 60 | 8 | 480 | 0.16
9 | 43 | 9 | 387 | 0.11

Fig. 7 log–log graph of Zipf’s law for the US 2020 Governor’s county election

Fig. 8 The above graphs show the digit distribution, Chi-Square difference, summation distribution by digits, digit distribution (second-order test), and absolute summation difference


6 Comparative Study

2020 was the year in which COVID-19 surged to its peak, which made 2020 the most likely year for discrepancies in the election data of countries across the world. This chapter validates the US 2020 Governor's County Election dataset. The reason for choosing this particular election data is that, at the time, questions were raised against the validity and correctness of the US 2020 Governor's County Election dataset. It was observed that Joe Biden received a large number of votes from some particular counties, and the distribution of the votes supporting Joe Biden was not uniform across the country. The analysis done in this chapter, however, shows that the US 2020 Governor's County Election is fair.

We now analyse some existing research works on the US 2020 Governor's County Election and compare them with the work done in this chapter. Kate and Charles [21] focused on a case study that examined how the onset of the COVID-19 pandemic affected the planning and environment of the 2020 US Presidential and County Elections. Baltz et al. [23] described the development and quality control of a dataset encompassing almost all known precinct-level election results from the American elections of 2016, 2018, and 2020; precincts are where elections are administered, and it is at this level that many significant questions can be answered. Baccini et al. [24] had as their main topic the impact of the COVID-19 epidemic on the 2020 US presidential election; guided by a pre-analysis plan, they estimated the impact of COVID-19 cases and deaths on the shift in county-level support for Donald Trump between 2016 and 2020.

As we can see, all these papers focus on the effect of COVID-19 on the US Election, which was the main concern regarding data analysis of the 2020 US Elections. Most existing works therefore address the effect of COVID-19 on the US Election, but very few, if any, actually focus on its fairness and validity. This chapter does exactly that in an efficient and accurate way. Apart from that, there are very few research works that use both Benford's and Zipf's laws simultaneously to validate datasets; doing so provides a double verification that reduces the chance of error. Also, this chapter validates the US 2020 Governor's County dataset not only on the basis of the first digit but also on the basis of the second, third and last digits of the total number of votes from the dataset stated above.

7 Conclusion

The work done throughout this chapter is now complete on all fronts, with outcomes that can be verified. Using Benford's law and Zipf's law, the datasets used have been examined and tested to assess their fairness.


Additionally, we conducted several fitting tests to ensure that the pre-existing laws could properly be applied to the datasets. These two laws, with the projected assimilation, have demonstrated the axiom "prediction matches reality if it is made in reality." Furthermore, we would like to emphasise that the US 2020 Election datasets (divided by county) appear to be error-free. The validation of the dataset using Zipf's law depends entirely on the log–log curve. As we can see from Fig. 7 (frequency spectrum), the curve nearly represents a straight line with a negative slope. According to Zipf's law, if the log–log curve approximately represents a negatively sloped straight line, then the dataset follows Zipf's law, and in the case of the US 2020 Governor's County Election, it does so. Benford's law has many applications in the field of fraud detection and data validation. The validation of a dataset using Benford's law becomes stronger when we analyse the first two or three digits simultaneously. We can incorporate this idea for a two-fold verification of the US 2020 dataset, which will, in turn, provide firmer ground to support our conclusions.

References

1. Nigrini, M.: Forensic analytics: Methods and techniques for forensic accounting investigations, vol. 2, pp. 5–36. Wiley & Sons (2011)
2. Berger, A., Hill, T.: Benford's law strikes back: No simple explanation in sight for mathematical gem. Springer Science+Business Media, LLC 33, 85–89 (2021). https://doi.org/10.1007/s00283-010-9182-3
3. Hill, T.: A statistical derivation of the significant-digit law. Institute of Mathematical Statistics. Stat. Sci. 10, 354–363 (2001). http://doi.org/10.1214/ss/1177009869
4. Mbona, I., Eloff: Feature selection using Benford's law to support detection of malicious social media bots. J. Inf. 582, 369–381 (2021). https://doi.org/10.1016/j.ins.2021.09.038
5. Tošić, A., Vičič, J.: Use of Benford's law on academic publishing networks. J. Informet. 588, 36–81 (2021). https://doi.org/10.1016/j.joi.2021.101163
6. Horton, J., Krishna, K., Wood, A.: Detecting academic fraud using Benford law: The case of Professor James Hunton. Res. Policy 49, 8 (2021). https://doi.org/10.1016/j.respol.2020.104084
7. Piantadosi, S.T.: Zipf's word frequency law in natural language: A critical review and future directions. Psychon. Bull. Rev. 21, 1112–1130 (2014). https://doi.org/10.3758/s13423-014-0585-6
8. Wei, J., Zhang, J., Cai, B., Wang, K., Liang, S., Geng, Y.: Characteristics of carbon dioxide emissions in response to local development: Empirical explanation of Zipf's law in Chinese cities. Sci. Total. Environ. 757 (2021). https://doi.org/10.1016/j.scitotenv.2020.143912
9. Qiuping, A.W.: Principle of least effort versus maximum efficiency: deriving Zipf-Pareto's laws. Chaos, Solitons & Fractals 153(Part 1), 111489 (2021). https://doi.org/10.1016/j.chaos.2021.111489
10. Adel, M.: Zipf's law applications in patent landscape analysis. World Pat. Inf. 64, 102012 (2021). https://doi.org/10.1016/j.wpi.2020.102012
11. Newcomb, S.: Note on the frequency of use of the different digits in natural numbers. Am. J. Math. 4, 39–40 (1881)
12. Deckert, J., Myagkov, M., Ordeshook, P.C.: Benford's law and the detection of election fraud. Polit. Anal. 19(3), 245–268 (2017). https://doi.org/10.1093/pan/mpr014
13. Li, F., Han, S., Zhang, H., Ding, J., Zhang, J., Wua, J.: Application of Benford's law in data analysis. J. Phys.: Conf. Ser. 1168, 3 (2019). https://doi.org/10.1088/1742-6596/1168/3/032133
14. Saichev, A., Malevergne, Y., Sornette, D.: Theory of Zipf's Law and Beyond, vol. 632. Springer (2009)
15. Pazos-Rangel, R.A., Florencia-Juarez, R., Paredes-Valverde, M.A., Rivera, G.: Preface. In: Handbook of research on natural language processing and smart service systems, pp. xxv–xxx. IGI Global (2021). https://doi.org/10.4018/978-1-7998-4730-4
16. Pedrycz, W., Martínez, L., Espin-Andrade, R.A., Rivera, G., Gómez, J.M. (eds.): Preface. In: Computational intelligence for business analytics, pp. v–vi. Springer (2021). https://doi.org/10.1007/978-3-030-73819-8
17. Golan, A., Greene, W.H.: An information theoretic estimator for the mixed discrete choice model. Handbook of Empirical Economics and Finance, pp. 90–105 (2016)
18. Urzúa: Testing for Zipf's law: A common pitfall. Economics Lett. 112(3), 254–255 (2011). https://doi.org/10.1016/j.econlet.2011.05.049
19. Dutta, A., Choudhury, M.R., De, A.K.: A unified approach to fraudulent detection. IJAER 17(2), 110–124 (2022). RI Publication. https://dx.doi.org/10.37622/IJAER/17.2.2022.110-124
20. Roy Choudhury, M., Dutta, A., De, A.K.: Data corroboration of the catastrophic Chernobyl tragedy using arc-length estimate conjecture. Vertices, Duke University 1(2), Fall 2022, 69–84 (2022). https://doi.org/10.55894/dv1.24
21. Kate, S., Charles, S.: Impact of COVID-19 on the 2020 US presidential elections. Int. IDEA, 1–43 (2022)
22. Choudhury, M.R., Dutta, A.: A perusal of transaction details from Silk Road 2.0 and its cogency using the Riemann elucidation of integrals. Appl. Math. Comput. Intell. 11(2), 423–436 (2022)
23. Baltz, S., Agadjanian, A., Chin, D., et al.: American election results at the precinct level. Sci. Data 9, 651 (2022). https://doi.org/10.1038/s41597-022-01745-0
24. Baccini, L., Brodeur, A., Weymouth, S.: The COVID-19 pandemic and the 2020 US presidential election. J. Popul. Econ. 34, 739–767 (2021). https://doi.org/10.1007/s00148-020-00820-3

Acquisition, Processing and Visualization of Meteorological Data in Real-Time Using Apache Flink

Jonathan Adrian Herrera Castro, Abraham López Najera, Francisco López Orozco, and Benito Alan Ponce Rodríguez

Abstract Today, data processing has become a topic of vital importance for society and for business. Data has become essential for decision-making and for identifying business opportunities. Data processing is not a trivial task, since large amounts of data are generated in different formats, which makes them difficult to process, store, and visualize. This chapter presents an architecture for real-time data stream processing on Apache Flink. Sensors connected to a prototype weather station built for this purpose were used to generate the data. To transmit the data generated by the weather station prototype to Apache Flink, the Apache Kafka streaming tool was used. Finally, the Elasticsearch and Kibana tools were used for data processing and visualization. The tests carried out to verify the operation of all the components of the proposed architecture were satisfactory.

Keywords Data acquisition · Data streaming · Sensors · Data storage · Data visualization



1 Introduction

Several years ago, most companies discarded millions of pieces of data because storing them was expensive and there were no tools to store and process all that information. Over time, tools were developed to process and store this data, giving rise to the concept of Big Data, a term that refers to the large volume of data produced every day. Big Data technology has allowed companies to address the problem of data storage. However, processing remained an obstacle, so specialized tools for data processing began to emerge. Once companies began to solve storage and processing problems with the help of these tools, they found that their data lost value over time, so processing it later would bring no benefit. On the other hand, they noticed that mobile applications, social networks, and web pages generate a large amount of data in real-time when interacting with the user. This led to the need to process data in real-time, and specialized tools for this type of processing arose.

There are several works related to real-time data processing tools. For example, in Rodríguez [1], a platform for processing meteorological data was developed using an Arduino, a temperature sensor, and a WiFi module. The objective was to demonstrate the flow of data with the help of different tools. The tools used to develop the project were Eclipse Mosquitto, the Arduino IDE, Apache Kafka, Apache NiFi, Apache Spark, and Apache Zeppelin, which were essential for the ingestion, processing, and visualization of the data obtained from the Arduino. The Arduino was used to capture the data generated by the sensor. Subsequently, Mosquitto was used to send the data from the Arduino to NiFi for publication in the cloud. NiFi then sent the data to Kafka for storage, and finally, it was processed in Spark for further visualization in Zeppelin. The main contribution of this work was to give an idea of how to obtain meteorological data in real-time using a sensor and tools that interact with the data obtained by the sensor.

Another work is that of Barba [2], in which an embedded system was designed and built to monitor the production data of a greenhouse. The tools used were Apache Kafka, a Raspberry Pi device, and an Arduino. The main objective was to allow users real-time access to weather data, captured by the Arduino, through a web application. The Arduino was used to capture the temperature and humidity sensor data, which was then sent to the Raspberry Pi device. Apache Kafka was then used to process, store, and transmit the data to the web application. Finally, the embedded system was connected to the web application so that the user could observe the meteorological data in real-time. The main contributions of this work were to give an overview of interacting with data in real-time through specialized tools and of providing access to these data through an application open to users.

Also, Miralles [3] implemented a prototype to send meteorological data from a temperature sensor connected to an Arduino to the cloud so the end user could access them. The tools used were Circus of Things and the Arduino IDE. The Arduino was used to collect sensor data, which was sent to Circus of Things, an application for developing Internet of Things projects. Finally, to evaluate the project's performance, the application (Circus of Things) was configured so that users could access the data from a Web browser.


project's performance, the application (Circus of Things) was configured so that users could access the data from a web browser. The main contributions of this work were to give an idea of how to obtain meteorological data in real-time with a sensor and of how the user can access these data through the cloud.

As can be seen, the catalog of real-time data processing tools is vast, and some of them can interact with each other. Although there are various tools for real-time data processing, they have different characteristics, are designed for different platforms, and often lack proper documentation or configuration guides.

This chapter presents an architecture focused on processing data streams in real-time in Apache Flink. The architecture consists of five phases: acquisition, transmission of data streams, storage, processing, and visualization of data in real-time. For data acquisition, a prototype weather station was designed using temperature, humidity, air quality, and barometric pressure sensors connected to an Arduino board. Data was taken from the sensors using the Python programming language and sent from the Arduino to Apache Flink using Apache Kafka; Kafka was also used to buffer the data received by Flink. Finally, Elasticsearch and Kibana were used to store and visualize the data. The main contribution of this work is to describe the implementation of a data analysis environment for processing data streams in real-time. Unlike the works mentioned above, in this architecture the Python programming language was used for communication with the Arduino board; three sensors were connected to the same board, and the data obtained from these sensors were simultaneously sent to the tools proposed in this chapter. Python, Kafka, Flink, Elasticsearch, and Kibana were chosen because Python allows straightforward communication with the Arduino board and the sensors, Kafka receives and forwards messages as the architecture requires, Flink is specialized in real-time data processing, Elasticsearch has the flexibility to store data from various sources, and Kibana graphs the data stored in Elasticsearch in real-time.

The chapter is structured as follows. Section 2 describes the theoretical foundations. Section 3 explains the proposed architecture, describing the software and hardware used, the process followed to construct the weather station, the procedure applied to transmit the data captured by the sensors in real-time, the data storage, and the data visualization process. Section 4 presents the process carried out to verify the operation of all the components of the proposed architecture, Sect. 5 presents the results of the experimentation, Sect. 6 discusses them, and, lastly, Sect. 7 presents the conclusions of this work.

2 Theoretical Fundamentals In this section the fundamental concepts of this work are defined.


2.1 Big Data

Big Data refers to data sets, or combinations of data sets, whose size, complexity, value, and speed of growth make it difficult to capture, manage, process, or analyze them using conventional technologies and tools [4]. However, what matters is not the amount of data but its usefulness, since it helps companies make better decisions and define more strategic business actions [5, 6]. Big Data has five main characteristics, better known as the 5Vs:

• Volume: It refers to the enormous amount of data generated from different sources.
• Variety: It refers to the origin of the data, which can come from cameras, smartphones, GPS (Global Positioning System) systems, social networks, etc.
• Velocity: Millions of data records are generated and stored at an amazing speed.
• Veracity: It refers to whether the data from various sources is complete and correct.
• Value: Once the data is converted into information, companies have the opportunity to improve business management.

It is important to highlight that data is not simply available to companies; there is a process in charge of extracting it, called data ingestion. Data ingestion is the process by which data from different sources, structures, and/or characteristics are introduced into another data storage or processing system [7]. Ingestion can occur in real-time, when the source produces the data, or in batches, when a large amount of data is ingested in defined time periods. Three essential steps occur in data ingestion: extraction, transformation, and loading. Extraction refers to the collection of data from the information source. Transformation consists of validating, cleaning, and normalizing the data. Loading involves inserting the data into some database.
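As a simple illustration of these three steps (a minimal sketch; the file name, field names, and target table are hypothetical and not part of the original work), the following Python script extracts records from a CSV source, normalizes them, and loads them into a SQLite database:

# Minimal illustration of extraction, transformation, and loading (ETL).
# File name, field names, and the target table are hypothetical examples.
import csv
import sqlite3

def extract(path):
    # Extraction: collect raw records from the information source.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: validate, clean, and normalize the data.
    clean = []
    for row in rows:
        try:
            clean.append((row["sensor"].strip().upper(), float(row["value"])))
        except (KeyError, ValueError):
            continue  # discard incomplete or malformed records
    return clean

def load(records, db_path="readings.db"):
    # Loading: insert the normalized data into a database.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, value REAL)")
    con.executemany("INSERT INTO readings VALUES (?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("raw_readings.csv")))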

2.2 Data Streaming

Data Streaming refers to data constantly generated from thousands of data sources and usually sent simultaneously in small data sets [8]. This data includes, for example, log files generated by customers using mobile applications, Internet purchases, online games, and information from social networks, among others. These data must be processed sequentially and incrementally, record by record or in sliding time windows, and they are used for various types of analysis, such as correlations, aggregations, filtering, and sampling. The information derived from the analysis gives companies visibility into many aspects of business and customer activity. The main characteristics of Data Streaming are the following:

• Time-sensitive: Each element in a data stream carries a timestamp. Data Streaming is time-sensitive, and the data becomes less important after a certain time.


• Continuous: Data has no beginning or end. It is continuous and occurs in real-time, but the data is not always acted on at the moment.
• Heterogeneous: Data originates from different sources that may be geographically distant. Because of this, there can be various data formats.
• Imperfect: Due to the variety of sources and the different data transmission mechanisms, data can be corrupted, lost, or out of order.

2.3 NoSQL Databases

NoSQL databases allow data to be stored without previously defining the storage structure, that is, without having to define the data schema [9]. Some of the advantages of using NoSQL databases are:

• Flexibility: They offer flexible schemas for faster and more iterative development.
• Scalability: Distributed clusters can be used for information storage.
• High performance: They are designed for a specific data model and access patterns.
• Highly functional: The APIs are helpful, and the data types are designed for each data model.

Among the different types of NoSQL databases, the following can be found:

• Documents: Manages a set of named string fields and data values from objects in an entity. These fields are commonly saved in JSON format.
• Columns: Data is arranged in columns and rows. Its main feature is its denormalized approach to the data structure.
• Key–value: Each data value is associated with a unique key, which is used to store and retrieve the data.
• Graphs: Allows applications to quickly query nodes and edges and to analyze the relationships between entities.
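As a small illustration (not tied to any particular NoSQL product, and using made-up field names), the same sensor reading can be modeled as a JSON document for a document store or as a key–value pair:

# Illustrative only: the same reading modeled for two NoSQL families.
import json

# Document model: a self-describing JSON document.
reading_document = {
    "station": "prototype-1",
    "sensor": "DHT11",
    "temperature_c": 28.0,
    "humidity_pct": 15.0,
}
print(json.dumps(reading_document))

# Key-value model: a unique key pointing to an opaque value.
key = "prototype-1:DHT11:2022-06-01T12:00:00"
value = json.dumps({"temperature_c": 28.0, "humidity_pct": 15.0})
store = {key: value}  # a Python dict standing in for a key-value database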

2.4 Apache Kafka It is a distributed data store optimized for ingesting and processing streaming data in real-time [10]. It is mainly used to create applications and real-time data transmission channels that adapt to data flows. This platform combines messaging, storage, and stream processing for real-time and historical data analysis. Some of the benefits are: • The partitioned registry model allows data to be distributed across multiple servers. • Decouples the data flow. • The partitions are distributed and replicated across many servers; all data is written to disk.


Fig. 1 Example of how Kafka works [11]

Kafka contains four essential parts for its operation (Fig. 1). First is the topic, which refers to the set of messages written by one or more producers and read by one or more consumers; each topic is identified by its name. Producers are also known as publishers, and consumers as subscribers. Second is the broker (Fig. 2): each Kafka host runs a server called a broker that stores the messages sent to topics and serves consumer requests. Third is the record, which consists of a key/value pair and metadata, including a timestamp (Fig. 3); Kafka stores keys and values as byte arrays. Finally, there is partitioning: instead of storing all the records handled by the system in a single segment, Kafka divides the records into partitions. A partition can be considered a subset of all the records in a topic.
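To illustrate these concepts from Python, the following minimal sketch reads the records of a topic and prints, for each one, the topic, partition, and offset it came from. It uses the kafka-python package, which is only one of several Python Kafka clients; the broker address and topic name are placeholder examples, not the configuration used later in this chapter.

# Sketch: consume records from a topic and show topic/partition/offset metadata.
# kafka-python client, broker address, and topic name are illustrative assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",          # read the topic from the beginning
    consumer_timeout_ms=5000,              # stop after 5 s without new records
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    # Each record carries a value plus metadata such as its partition and offset.
    print(record.topic, record.partition, record.offset, record.value)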

Fig. 2 Example of the broker [12]


Fig. 3 Example of a partition [11]

2.5 Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over bounded and unbounded data streams [13]. It stands out because it was designed to run on any common cluster while, at the same time, allowing high-speed computations. Important aspects of Apache Flink include:

• Unbounded and bounded data processing: Any type of data is produced as a sequence of events. An example of this would be the money transactions generated every day in banks.
• Application deployment: Flink is a distributed system and requires computing resources to run applications. It can be integrated with Hadoop YARN, Apache Mesos, and Kubernetes, or it can be configured to run standalone.
• Application execution at any scale: It is designed to run applications at any scale and can distribute the tasks in a cluster.
• Memory performance: Applications are optimized for local state access. Tasks run in main memory but, if they exceed its capacity, continue executing on the hard disk.

The operation of Apache Flink revolves around its core, also called Flink Core, which contains the APIs and libraries needed for its operation. Two main APIs are a fundamental part of the proper functioning of Apache Flink:


Fig. 4 Flink libraries [14]

• DataSet API: It is an execution environment in which transformations are executed on static (bounded) data, for example, data coming from databases.
• DataStream API: Data is taken or collected from sockets or message queues in the environment.

Additionally, Apache Flink provides a series of libraries (Fig. 4) with which it is possible to go deeper into Big Data:

• FlinkML: It is a library specialized in machine learning.
• Gelly: It is an API for graph analysis.
• Table API: With this API, it is possible to work with SQL syntax.

The Apache Flink architecture comprises a JobManager, which acts as the coordinator of the tasks, and one or more TaskManagers, which execute parts of the program in parallel (Fig. 5). The optimizer transforms the task sent to Flink into a dataflow, which is run in parallel by the TaskManagers under the coordination of the JobManager [15].
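As a minimal illustration of the DataStream programming model from Python, the sketch below uses the PyFlink package. This is only an assumption-free local example of the API; the architecture described later in this chapter drives Flink through the SQL Client rather than through PyFlink.

# Minimal DataStream API sketch using PyFlink (illustrative only; this chapter
# itself interacts with Flink through the SQL Client, not through PyFlink).
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A small bounded stream standing in for sensor readings: (sensor, value).
readings = env.from_collection([
    ("DHT11", 28.0),
    ("DHT11", 29.5),
    ("BMP180", 806.2),
])

# Keep only the temperature readings and print their values to stdout.
readings.filter(lambda r: r[0] == "DHT11").map(lambda r: r[1]).print()

env.execute("sensor-demo")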

2.6 Elasticsearch

Elasticsearch is an open-source, distributed search and analytics engine for all types of data, including textual, numeric, geospatial, structured, and unstructured data [15]. This application is used for a variety of use cases, some of which are:

• Application search.
• Website search.


Fig. 5 Apache Flink architecture [14]

• Application performance monitoring.
• Analysis and visualization of geospatial data.
• Business analytics.

Both processed and raw data flowing into Elasticsearch can come from various sources, such as web pages, IoT (Internet of Things) devices, etc. With this software, the data is parsed, normalized, and enriched, and then stored in an index, which is a collection of interrelated documents [15]. In this work, Elasticsearch was used as the database.
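For illustration, a document can be indexed through Elasticsearch's REST interface with a few lines of Python. The host, index name, and fields below are placeholders and do not correspond to the configuration used later in the chapter:

# Minimal sketch: index one document into Elasticsearch over its REST API.
# Host, index name, and fields are illustrative, not this chapter's configuration.
import requests

doc = {"sensor": "DHT11", "temperature": 28.0, "humidity": 15.0}
resp = requests.post("http://localhost:9200/weather-demo/_doc", json=doc)
print(resp.status_code, resp.json())  # Elasticsearch returns the stored document's metadata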

2.7 Kibana

Kibana is an open-source data visualization and exploration tool for the data stored in Elasticsearch [15]. This application is used for a variety of use cases, some of which are:

• Search, view, and visualize data indexed in Elasticsearch, and analyze it by creating bar charts, pie charts, histograms, and maps.
• Monitor and manage an instance of the Elastic Stack through the web interface.
• Centralize access to integrated solutions built on the Elastic Stack for enterprise search and security applications.


Kibana allows analysis of data in an Elasticsearch index or multiple indices. These are created when unstructured data is saved from some external source and converted to a structured format for Elasticsearch’s storage and search capabilities. The Kibana interface allows users to visualize the indexes stored in Elasticsearch through different standard graphs.

3 Proposed Architecture

This work proposed an architecture for real-time data stream processing on Apache Flink (Fig. 6). Sensors connected to a weather station prototype built around an Arduino Uno board were used for data acquisition. The generated data is sent to Flink using the Apache Kafka streaming tool. Lastly, Elasticsearch and Kibana were used to store and visualize the data received in real-time.

The weather station prototype integrated temperature and humidity, barometric pressure, and air quality sensors connected to an Arduino Uno board; the technical characteristics of these sensors are given in Sect. 3.2.1. The data collected by the sensors was sent to Apache Kafka using the Python programming language. Subsequently, the data in Kafka was forwarded to Apache Flink. Finally, the data was stored in Elasticsearch and visualized using Kibana. Specifically, the following was implemented:

• The weather station prototype captured temperature, humidity, barometric pressure, and air quality data through the sensors connected to the Arduino.
• Communication with the Arduino board was established from the Python programming language using the Pyserial library.
• The Pykafka library was used to send the data captured with Pyserial to Apache Kafka.
• Apache Kafka received the data sent from Pykafka.
• Apache Kafka sent the data to Apache Flink.
• Apache Flink received and processed the data.
• The data was stored and visualized with Elasticsearch and Kibana, respectively.

Fig. 6 Proposed architecture


3.1 Software and Computer Equipment Used

Two operating systems, Windows and Ubuntu, were used in this work. Python was used as the programming language, and the tools used were Apache Flink, Apache Kafka, Elasticsearch, and Kibana. Windows was the operating system on which Python was installed, while Ubuntu was the operating system on which Apache Flink, Apache Kafka, Elasticsearch, and Kibana were installed. Python was used to connect to the Arduino board and send data to Apache Kafka; Kafka was used to receive the data from the sensors; Apache Flink was in charge of processing the data sent from Kafka; Elasticsearch was used to store the data processed by Apache Flink; and Kibana was used to create the graphs needed to display the sensor data in real-time.

Three computers were used to implement the proposed architecture. The first computer was a Lenovo laptop with 4 GB of RAM and a 2 TB hard drive running the Windows operating system. The sketch that establishes communication between the sensors and the Arduino board was programmed on this computer. The Pyserial library was used on it to communicate with the Arduino board, and the Pykafka library was used to send the data captured by the sensors to Apache Kafka on the second computer. The second computer was a Dell Inspiron with 2 GB of RAM and a 500 GB hard drive running the Ubuntu 20.04 operating system. The data from the first computer was received in Apache Kafka on this computer and then ingested into Apache Flink (also running on this same computer). The third computer was a Toshiba Satellite laptop with 4 GB of RAM and a 500 GB hard drive running the Ubuntu 20.04 operating system. This computer was used to store the sensor data in Elasticsearch and to visualize it in real-time with Kibana.

3.2 Construction of the Weather Station This section presents the essential components for constructing the weather station and the programmed sketch to obtain the data from the sensors.

3.2.1 Weather Station Components

In constructing the weather station, an Arduino Uno, a breadboard, cables for connecting the sensors, a power supply module, a 9 V battery, a battery adapter, and a USB cable were used. It should be noted that a study was previously carried out to determine the necessary components for the station’s construction. In addition, each type of sensor was studied to select the most appropriate for the weather station. Three different temperature and humidity sensor models were considered (Table 1).


Table 1 Characteristics of temperature and humidity sensors

Name   | Features
DHT11  | Digital temperature and humidity. Voltage: 3.3 V to 5 V. Temperature range: 0 °C to 50 °C. Humidity range: 0 to 100%
DHT22  | Temperature and relative humidity. Voltage: 3 V to 6 V. Temperature range: −40 °C to 80 °C. Humidity range: 0 to 100%
LM35DZ | Temperature. Voltage: 4 V to 20 V. Temperature range: 0.5 °C to 25 °C

For this work, the DHT11 sensor was used. We considered this sensor adequate since its voltage requirement was minimal, it could measure humidity, and its measurement ranges were acceptable. Two different models were considered for the selection of the barometric pressure sensor (Table 2). For the air quality sensor, it should be noted that there are several models; however, only the best known on the market were considered (Table 3).

Table 2 Characteristics of barometric pressure sensors

Name   | Features
BMP180 | Pressure, temperature, altitude. Voltage: 1.8 V to 3.3 V. Pressure range: 300 to 1100 mbar
BMP280 | Pressure, temperature, altitude. Voltage: 1.8 V to 3.6 V. Pressure range: 300 to 1100 mbar


Table 3 Characteristics of air quality sensors

Name   | Features
MQ-135 | NH4, CO2, CO. Voltage: 5 V. Detection range: 10 to 1000 ppm
MQ-5   | LPG and natural gas. Voltage: 5 V. Detection range: 10 to 1000 ppm
MQ-8   | H, CO2. Voltage: 5 V. Detection range: 10 to 1000 ppm

The weather station had five components: an Arduino Uno board, three sensors, and a 9 V battery. The Arduino Uno board was essential for programming the sensors (Fig. 7). The DHT11, BMP180, and MQ-135 sensors were programmed to take their readings simultaneously. However, for these sensors to work correctly, the 9 V battery was added as an external power module, since otherwise the sensors' combined voltage demand could not be covered.

Fig. 7 Weather station circuit

3.2.2 Programming the Arduino Board Sketch

Once the weather station was built, the Arduino board was programmed with the selected sensors. It should be noted that, initially, tests were carried out with each sensor to verify that it took readings correctly. The corresponding libraries were installed in the Arduino IDE so that the sketch could recognize the sensors, and each library was included in the sketch to establish communication with its sensor. The libraries used were the following:

• DHT.h: A library that allows communication with the DHT11 and DHT22 sensors.
• MQUnifiedsensor.h: A library that allows communication with various air quality sensors, among them the MQ-2, MQ-135, MQ-5, and MQ-8.
• BMP180I2C.h: A library that establishes communication with the BMP180 sensor.

Additionally, some values were declared for the DHT11 and MQ-135 sensors. For the DHT11 sensor, pin 7 was set to capture the temperature and humidity data. For the MQ-135 sensor, the board was defined as an Arduino Uno, the sensor voltage was set to 5, pin A0 was assigned for data capture, and the sensor model was set to MQ-135. These values were selected because they allowed the sensors to work correctly and the data to be obtained.

Programming a sketch requires two functions: setup and loop. The setup function establishes communication between the Arduino board and the computer and stores the settings needed to run the sketch; the loop function contains the sketch code to be executed repeatedly. In the setup function, a series of declarations were made so that the sensors could start and keep operating correctly during data capture. In the case of the MQ-135 sensor, a for loop was used to calibrate it so that it could constantly provide air quality data. For the BMP180 sensor, functions provided by the BMP180I2C.h library were used to restart it in case of failure and to reset the sensor's default values. A single function was used for the DHT11 sensor so that it could start capturing data.

In the loop function, variables were declared to store the data captured by each sensor (Table 4). It should be noted that, to obtain the MQ-135 values mentioned above, a series of prior calculations were made using the sensor's clean-air calibration value of 3.6 and the characteristic constants of each chemical compound. These values were stored in variables of type float.

Table 4 Values captured by each sensor

Sensor | Values
DHT11  | Temperature, Humidity
BMP180 | Pressure
MQ-135 | CO, Alcohol, CO2, Toluene, NH4, Acetone


For the DHT11 sensor, two functions were used, one called temperature and the other called humidity, to capture the values corresponding to temperature and humidity, respectively. For the BMP180 sensor, a function called getPressure was used to capture the pressure value. An important point considered was the unit in which each value is measured: temperature in degrees Celsius (°C), humidity in percentage (%), pressure in millibars (mbar), and the air quality values in parts per million (ppm). It should be noted that the barometric pressure sensor reports its data in pascals (Pa), while pressure is usually expressed in millibars; for this reason, a conversion from pascals to millibars was performed (1 mbar = 100 Pa, so, for example, 80,615 Pa corresponds to 806.15 mbar). Finally, the value of each variable was displayed with the print function in the loop function. After verifying the correct operation of the previous functions, the sketch was uploaded to the Arduino board. Once this was done, we verified the data capture from the Serial Monitor through the COM3 communication port, which was the port used to monitor the activity of the sensors from the Arduino IDE.

3.3 Streaming of Data from the Weather Station to the Tools This section presents the process of transmitting data from the sensors to the tools.

3.3.1 Connecting Python to the Arduino Board

Once the weather station was built, code was developed in the Python language to connect with the Arduino Uno board. The Pyserial library was used to interact with the Arduino board and thus obtain the data from the sensors. In this code, two libraries were used, json and serial: the first to store the data obtained in a JSON structure and the second to communicate with the Arduino board's port.

arduinouno = serial.Serial('COM3', 9600)

As observed in the previous statement, the serial library was used to establish communication through the COM3 port, to which the Arduino board was connected. Subsequently, we began reading the sensor data from Python using the following statement inside a while loop:

s = arduinouno.readline().decode("utf8").replace("\n", "").split(",")

In the previous line, the readline function reads each record produced by the sensors, the decode function decodes it as UTF-8, the replace function removes the newline character, and the split function separates the comma-delimited values into a list.
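Putting these statements together, a minimal reading loop could look like the sketch below. It assumes the serial line carries comma-separated values in the order listed in Table 4; the exact script used in this work is not reproduced in the chapter.

# Minimal sketch of the serial reading loop described above.
# Port name and field order are assumptions taken from the text (COM3, Table 4).
import json
import serial

FIELDS = ["Temperature", "Humidity", "Pressure",
          "CO", "Alcohol", "CO2", "Toluene", "NH4", "Acetone"]

arduinouno = serial.Serial("COM3", 9600)

while True:
    raw = arduinouno.readline().decode("utf8").replace("\n", "").split(",")
    if len(raw) != len(FIELDS):
        continue  # skip incomplete lines
    reading = {name: float(value) for name, value in zip(FIELDS, raw)}
    print(json.dumps(reading))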

3.3.2 Sending and Receiving Data from Pykafka to Apache Kafka

The data collected from the sensors on the first computer was sent, in a JSON structure, to Apache Kafka on the second computer using the Pykafka library. The KafkaProducer function, which acts as a message producer, was used to send the data. The following statement establishes the connection between Python and Apache Kafka:

producer = KafkaProducer(bootstrap_servers='192.168.1.76:9092', value_serializer=lambda v: json.dumps(v).encode('utf-8'))

The above statement uses the bootstrap_servers and value_serializer parameters of the KafkaProducer function. The first parameter defines the IP address of the second computer, on which Apache Kafka was running. The second parameter serializes the JSON structure containing the sensor data into a Kafka-compatible format. Subsequently, the send function was used to send the JSON structure to Kafka, specifying the topic called "weather" in which the data would be stored:

producer.send('weather', js)

In the next phase, Apache Kafka was used to receive the sensor data in JSON format sent from Python. Because Kafka depends on a service called Apache Zookeeper to coordinate distributed processes, Java OpenJDK 11.0.8 had to be installed beforehand for it to work correctly. After installing the Java OpenJDK, Zookeeper was downloaded, and inside the conf folder the zoo.cfg file was edited, configuring the tick time, the data directory, and the Zookeeper client port:

tickTime = 2599
dataDir = /data/zookeeper
clientPort = 2181

Subsequently, Kafka was downloaded, and within its config folder the server.properties file was edited, configuring the IP address and port on which the data sent from the first computer would be received:

listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://192.168.1.76:9092

After installing Zookeeper and Kafka, the Zookeeper service was started using the following command:

bin/zookeeper-server-start.sh config/zookeeper.properties

After starting Zookeeper, Kafka was started using the following command:

bin/kafka-server-start.sh config/server.properties

Once Zookeeper and Kafka were running, a topic called "weather" was created to receive the data sent from Python, using the following command:

bin/kafka-topics.sh --create --bootstrap-server 192.168.1.76:9092 --replication-factor 1 --partitions 1 --topic weather

In the above command, the following is highlighted:

bootstrap-server: The IP address and port of the computer on which Kafka was installed.


topic: The name of the topic, matching the one previously declared in the Python code that sends the data.

Finally, a consumer was declared to display the sensor data sent from Python, using the following command:

bin/kafka-console-consumer.sh --bootstrap-server 192.168.1.76:9092 --topic weather
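Assembling the statements quoted above, a minimal producer script on the first computer could look like the following sketch. The chapter attributes the producer to the Pykafka library, but the interface quoted in the text matches the KafkaProducer class of the kafka-python package, which is the import assumed here; the reading values are placeholders.

# Minimal producer sketch following the statements quoted in this section.
# The kafka import and the example values are assumptions; the broker IP
# and topic name are the ones given in the text.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="192.168.1.76:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

js = {"Temperature": 28.0, "Humidity": 15.0, "Pressure": 806.2,
      "CO": 1.2, "Alcohol": 0.5, "CO2": 1.1,
      "Toluene": 0.2, "NH4": 2.0, "Acetone": 0.2}

producer.send("weather", js)   # publish one record to the "weather" topic
producer.flush()               # make sure it is delivered before exiting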

3.3.3 Processing Data in Apache Flink Sent from Apache Kafka

After declaring the consumer in Kafka, we proceeded to use Apache Flink. This tool was used to process the sensor data sent from Apache Kafka. For the installation, the Apache Flink distribution was downloaded and Flink was started using the following command:

./bin/start-cluster.sh

After Apache Flink was started, the connection to Kafka was made. First, the SQL Client was started from the Ubuntu console. The SQL Client provides a simple way to write, debug, and submit queries to a Flink cluster without using a programming language. The Maven connector JARs containing the dependencies needed for the connection between Flink, Kafka, and Elasticsearch were also loaded. The command used is shown below:

./bin/sql-client.sh embedded --jar /home/jonathan/flink-sql-connector-kafka_2.11-1.11.0.jar --jar /home/jonathan/flink-sql-connector-elasticsearch_2.11-1.11.0.jar

In the SQL Client, it is possible to work with queries written in the SQL language. This provides an effortless way to write, debug, and submit queries to a Flink cluster without a single line of program code, and it allows results to be retrieved in real-time from the distributed application on the command line. For the connection with Apache Kafka, the previously loaded Maven connector was necessary to establish communication with Flink, in addition to creating a table to store the data. In the SQL Client, a table named "weather" was created:

CREATE TABLE weather (
  Temperature FLOAT,
  Humidity FLOAT,
  Pressure FLOAT,
  CO FLOAT,
  Alcohol FLOAT,
  CO2 FLOAT,
  Toluene FLOAT,
  NH4 FLOAT,
  Acetone FLOAT
)


Fig. 8 Query made from running jobs

Additionally, in this same table definition, extra properties were declared to establish communication with Kafka, such as the connector type and version, the topic, the bootstrap servers, and the supported format:

WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'weather',
  'connector.properties.bootstrap.servers' = '192.168.1.76:9092',
  'format.type' = 'json'
);

Subsequently, a query was made to confirm that the table had been created correctly. The executed command is presented below:

SELECT * FROM weather;

Finally, the web interface was used to verify that the query appeared as a job. The job diagram, together with its information and status, was displayed from the Running Jobs option of the web interface (Fig. 8). Up to this point, the data taken from the sensors with Pyserial was sent from Python using Pykafka, received in Apache Kafka, and forwarded to Apache Flink through the Kafka connector. With this, it can be said that the data can be processed in real-time.

3.4 Data Storage in Elasticsearch Apache Flink allows connecting to Elasticsearch to store data in real-time. This tool is a search service for storing data obtained from various sources. It is worth mentioning that installing Elasticsearch requires Java to be installed.


The APT transport package was installed with the following command to access the Elasticsearch repositories:

sudo apt install apt-transport-https

Curl was installed with the following command to verify URLs and transfer files:

sudo apt install curl

The public GPG key from Elasticsearch, which allows secure access to the Elasticsearch packages, was then imported into APT using curl:

curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

Later, an Elasticsearch source list, containing the package sources for Elasticsearch, was added to the sources.list.d directory, where APT looks for new sources:

echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list

After the above, Elasticsearch was installed using the following command:

sudo apt install elasticsearch

Once installed, the following lines were added to the elasticsearch.yml file:

network.host: localhost

The previous line specifies the address through which Elasticsearch is accessed; by default, localhost refers to the local machine.

http.port: 9200

The previous line specifies the communication port used to access Elasticsearch. After making the necessary configurations, the service was started using the following command:

sudo systemctl start elasticsearch

The following command was used to verify that the configurations made previously were correct:

curl -X GET 'http://localhost:9200'

Now, on the second computer, a table named "exploration" was created in Apache Flink. The same columns as in the weather table were defined in this table in order to transfer the data to it:

CREATE TABLE exploration (
  Temperature FLOAT,
  Humidity FLOAT,
  Pressure FLOAT,
  CO FLOAT,
  Alcohol FLOAT,
  CO2 FLOAT,
  Toluene FLOAT,
  NH4 FLOAT,
  Acetone FLOAT
)


Fig. 9 Elasticsearch index web interface

Parameters were then set in the exploration table to establish the connection with Elasticsearch. The specified parameters were the connector type (in this case, Elasticsearch), the connector version, the host (IP address and port of the third computer), the name of the index to create, the document type, the Elasticsearch-compatible format, and the update mode:

WITH (
  'connector.type' = 'elasticsearch',
  'connector.version' = '7',
  'connector.hosts' = 'http://192.168.1.69:9200',
  'connector.index' = 'scan',
  'connector.document.type' = 'weather',
  'format.type' = 'json',
  'update-mode' = 'append'
);

After creating the table, the data was transferred from the weather table to the exploration table through a query. For this operation, INSERT INTO was used to specify the destination table, SELECT to select all the columns of the source table, and FROM to indicate the source table, weather:

INSERT INTO exploration
SELECT Temperature, Humidity, Pressure, CO, Alcohol, CO2, Toluene, NH4, Acetone
FROM weather;

Finally, the creation of the index was verified from the third computer. An index is a collection of related documents. The following address was entered into a web browser to access the Elasticsearch index management and verify the index creation: 192.168.1.69:9200/_cat/indices (Fig. 9).
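The same check can be performed from a script instead of the browser, for example with the small sketch below, which lists the indices exposed by the third computer's Elasticsearch instance (the IP address is the one given in the text; the requests library is an assumption, since the chapter uses the browser for this step):

# List the Elasticsearch indices from Python instead of the browser.
import requests

resp = requests.get("http://192.168.1.69:9200/_cat/indices?v")
print(resp.text)  # one line per index, including the one fed from Flink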

3.5 Visualization of Data with Kibana Elasticsearch can be connected to Kibana to display data stored in Elasticsearch indices. Kibana is a tool capable of graphing data taken from Elasticsearch regardless


of whether it is in real-time or not. It also offers the option of creating dashboards with several types of graphs and performing machine learning experiments with the data. Kibana was installed using the following command:

sudo apt install kibana

After installation, Kibana was started from the Ubuntu console using the following command:

sudo systemctl start kibana

The following address was entered into a web browser to verify that Kibana was active: localhost:5601 (Fig. 10). On this page, the data stored in the "exploration" index was explored in the Discover section (Fig. 11).

As a demonstration, two different graphs were created from the Alcohol and CO columns of the "exploration" index, whose data is taken in real-time from the air quality sensor (MQ-135) connected to the Arduino board. The two graph types used were line and stacked area. In the stacked area graph, the Alcohol column was used, plotting the first 20 records based on their average (Fig. 12); the Alcohol level is indicated on the X-axis. In the line graph, the Alcohol and CO columns were used, plotting the first 20 records of each column based on their average (Fig. 13). In this graph, the green line corresponds to the Alcohol data and the blue line to CO. The purpose of plotting Alcohol and CO together was to show that the data for these chemical compounds is not static and that there is no similarity between the values obtained for the two compounds.

Fig. 10 Kibana home page


Fig. 11 Exploration index data

Fig. 12 Stacked area plot

Fig. 13 Line graph

4 Verification In this first part of the verification, various tests were carried out with data from the sensors and some functions of the Apache Flink SQL Client. The objective of these tests was to demonstrate that it is possible to interact and perform operations in real


time with the data stored in the table. For this, a table called "test" was created and connected to the topic "clima1" created in Kafka. The functions used in the tests were DESCRIBE, AVG, MAX, MIN, SELECT, and SUM. With the DESCRIBE function, it was possible to verify the data type of each column of the test table:

root
 |-- Temperature: FLOAT
 |-- Humidity: FLOAT
 |-- Pressure: FLOAT
 |-- CO: FLOAT
 |-- Alcohol: FLOAT
 |-- CO2: FLOAT
 |-- Toluene: FLOAT
 |-- NH4: FLOAT
 |-- Acetone: FLOAT

The SELECT function made it possible to select specific columns from the table; as a demonstration, the Pressure, CO, and NH4 columns were selected and their stored data displayed (Fig. 14). With the AVG function, the real-time average of each of the selected columns was obtained from the stored data; for this, SELECT AVG(Pressure) AS PresionProm, AVG(CO) AS COProm, AVG(NH4) AS NH4Prom FROM test was executed (Fig. 15).

Fig. 14 Selected columns

Fig. 15 Average of the columns


Fig. 16 Maximum data of the columns

The largest real-time value of the selected columns was obtained with the MAX function; for this, SELECT MAX(Pressure) AS PresionMayor, MAX(CO) AS COMayor, MAX(NH4) AS NH4Mayor FROM test was executed (Fig. 16). The smallest real-time value of the selected columns was obtained with the MIN function; for this, SELECT MIN(Pressure) AS PresionMenor, MIN(CO) AS COMenor, MIN(NH4) AS NH4Menor FROM test was executed (Fig. 17). The last test was with the SUM function, with which the real-time sum of each selected column was calculated; for this, SELECT SUM(Pressure) AS PresionSum, SUM(CO) AS COSum, SUM(NH4) AS NH4Sum FROM test was executed (Fig. 18).

The second part of the verification consisted of checking that no data was lost when it was sent from Kafka to Flink. Four tests were carried out, each using ten records that were monitored simultaneously in Kafka (Fig. 19). These records were used to verify the data sent from Kafka to Flink and to show that the synchronization between the two tools is correct. First, a test was carried out with the temperature and humidity sensor, whose data was stored in a Flink table called "test" (Fig. 20). In the second test, the barometric pressure sensor was added (Fig. 21); it was tested with ten records, which were stored in a table called "test2" (Fig. 22). In the third test, the air quality sensor was added and, for visualization reasons, it was tested with CO, Alcohol, and CO2 (Fig. 23); in the same way, ten test records were used and stored in a table called "test3" (Fig. 24). In the last test, Toluene, NH4, and Acetone from the air quality sensor were added (Fig. 25); ten records were used and stored in a table called "test4" (Fig. 26).
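One simple way to support this kind of check from the Kafka side is to count the records readable from the topic before comparing them with the rows visible in the Flink table. The chapter does not state how the counts were monitored, so the following sketch, based on the kafka-python consumer and on the broker address and topic name used earlier, is only an assumption:

# Sketch: count the records currently readable from the "weather" topic,
# to compare against the rows visible in the Flink table.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "weather",
    bootstrap_servers="192.168.1.76:9092",
    auto_offset_reset="earliest",   # start from the first retained record
    consumer_timeout_ms=5000,       # stop when no new record arrives for 5 s
)

count = sum(1 for _ in consumer)
print(f"Records read from Kafka: {count}")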

Fig. 17 Minimum column data

Fig. 18 Total sum of each column


Fig. 19 Temperature and humidity data in Kafka

Fig. 20 Flink data

Fig. 21 Temperature, humidity and pressure data in Kafka

Fig. 22 Temperature and humidity data in Flink


Fig. 23 Temperature, humidity, pressure, CO data in Kafka

Fig. 24 Temperature, humidity, pressure and CO data in Flink

Fig. 25 Temperature, humidity, pressure and toluene data in Kafka

Fig. 26 Temperature, humidity, pressure and toluene data in Flink

5 Results

This section shows the results obtained from the experiment in which data was taken from the meteorological station prototype. The prototype was placed outdoors for three hours, acquiring data every four minutes; this interval was selected so that there would be visible variations in the data obtained. At the end of the allotted time, 46 records had been acquired. The experiment consisted of verifying the correct functioning of the meteorological station prototype, the sending and receiving of the data, and the subsequent visualization of these data. The steps carried out during the experimentation were the following:


1. Collection of the data generated by the sensors of the meteorological station prototype in real-time.
2. Sending of the data to an Apache Kafka topic.
3. Sending of the Apache Kafka data to Apache Flink.
4. Creation of an index from Apache Flink to store the received data in Elasticsearch.

After performing the steps mentioned above, the data obtained from the weather station sensors, placed outdoors, were graphed using the Kibana tool. Figure 27 shows the results obtained from the temperature sensor; time is measured on the X-axis and the temperature on the Y-axis. As can be seen, the temperature remained stable during most of the assigned time, and the most frequently repeated value was 28 °C. During this experiment, the maximum temperature recorded by the weather station was 30 °C and the minimum was 28 °C.

In the case of humidity, there was no variation in the data: it remained at 15% during the hours assigned for the experiment, and it is very rare for the humidity to show variations over such a period (Fig. 28). Time is measured on the X-axis and the humidity percentage on the Y-axis.

Regarding the barometric pressure, it stands out that at the beginning of the experiment the sensor indicated a relatively high pressure, which dropped as the hours went by. In the assigned hours, the maximum pressure was 811.39 mbar and the minimum was 804.15 mbar (Fig. 29). The X-axis measures time and the Y-axis the pressure.

Concerning CO, the data kept changing; that is, it was not as stable as the temperature or humidity. The maximum CO registered was 6.18 ppm and the minimum was 0.75 ppm (Fig. 30). Time is measured on the X-axis and the CO level on the Y-axis.

For NH4, in the same way as for CO, the data varied during the experiment. The maximum NH4 was 5.74 ppm and the minimum 1.53 ppm (Fig. 31). Time is measured on the X-axis and the NH4 level on the Y-axis.

Fig. 27 Temperature results


Fig. 28 Humidity results

Fig. 29 Barometric pressure results

Fig. 30 CO results

In the case of CO2, the data kept changing during the experiment. The maximum CO2 was 3.94 ppm and the minimum 0.85 ppm (Fig. 32). The X-axis measures time and the Y-axis the CO2 level.


Fig. 31 NH4 results

Fig. 32 CO2 results

In the case of Alcohol, there were variations in the data, with the peak occurring at record 21. The maximum Alcohol recorded was 1.9 ppm and the minimum 0.35 ppm (Fig. 33). Time is measured on the X-axis and the Alcohol level on the Y-axis. For Toluene, several changes in the data were recorded during the time allotted for the experiment. The maximum Toluene registered was 0.81 ppm and the minimum 0.13 ppm (Fig. 34). Time is measured on the X-axis and the Toluene level on the Y-axis.

Fig. 33 Alcohol results


Fig. 34 Toluene results

Fig. 35 Acetone results

Finally, a variation in the acetone data was recorded during the time allotted for the experiment. The maximum value was 0.69 ppm and the minimum 0.11 ppm (Fig. 35). Time is measured on the X-axis and the Acetone level on the Y-axis.

6 Discussions

The results of the experiment showed that there was variation in the data during the allotted time and that the sensors captured the environmental data correctly and continuously. It should be noted that the sensors were programmed to obtain data simultaneously, and there was no delay in data capture. During the development of the project, it was found that the data from the sensors can be sent in real-time to other data processing tools through the Python language and the Pykafka library; this stands out because, in the related works reviewed, no clear way of sending data in real-time to specialized data processing tools was identified.


On the other hand, it became clear how the Apache Kafka, Apache Flink, and Elasticsearch tools interact with each other to transmit, process, and store data in real-time. In the case of Kafka, it was understood how this tool can be connected with the Python language to send the data, since at first it was not clear how to connect them. In the case of Flink, it was discovered how to work with data in real-time because, based on the information available about Flink, this was not clear; with the help of this architecture and the weather station, it was understood how this type of data can be processed within Flink.

7 Conclusions

This work implemented an architecture for real-time data stream processing on Apache Flink. The architecture consists of five phases: acquisition, transmission of data streams, storage, processing, and visualization of data in real-time. For data generation, sensors connected to a meteorological station prototype were used. The prototype integrated temperature and humidity, barometric pressure, and air quality sensors connected to an Arduino Uno board. The data generated by the sensors was sent to Apache Flink through the Apache Kafka streaming tool, which also buffered the data. Lastly, Elasticsearch and Kibana were used to store and visualize the data received in real-time. This architecture's relevance lies in transmitting sensor data in real-time; it is also highlighted that it was possible to process, store, and visualize this data through the Flink, Elasticsearch, and Kibana tools. Finally, this architecture can be used to implement machine learning algorithms with the Apache Flink Machine Learning Library in a big data environment, processing data streams in real-time.

In future work, we will integrate machine learning techniques to make weather predictions based on the data captured by the sensors in real-time. We plan to compare the actual values against the predictions made in order to validate the quality of our predictions. Also, in terms of scalability, we intend to deploy this architecture on an Apache Flink cluster to create a big data analytics environment.

References

1. Rodríguez, G.: Temperature streaming with Arduino + Big Data tools. Hackster. https://www.hackster.io/Gersaibot/temperature-streaming-with-arduino-big-data-tools-eb22fc. Accessed 2 Feb 2022
2. Barba, C.J.: Seguimiento de datos en tiempo real con Apache Kafka en Raspberry Pi; Caso práctico: monitoreo ambiental de la actividad del invernadero Espe–Iasa (2018)
3. Miralles, J.: Publish your Arduino data to the cloud. Hackster. https://www.hackster.io/jaume_miralles/publish-your-arduino-data-to-the-cloud-9dfaa2. Accessed 3 Feb 2020
4. Joyanes, L.: Big data—Análisis de grandes volúmenes de datos en organizaciones. Alfaomega, México (2013)


5. Pazos-Rangel, R.A., Florencia-Juarez, R., Paredes-Valverde, M.A., Rivera, G. (eds.): Handbook of Research on Natural Language Processing and Smart Service Systems. IGI Global (2021). https://doi.org/10.4018/978-1-7998-4730-4
6. Bolívar, A., García, V., Florencia, R., Alejo, R., Rivera, G., Sánchez-Solís, J.P.: A preliminary study of SMOTE on imbalanced big datasets when dealing with sparse and dense high dimensionality. In: Pattern Recognition: 14th Mexican Conference, MCPR 2022, Ciudad Juárez, Mexico, June 22–25, 2022, Proceedings, pp. 46–55. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-031-07750-0_5
7. Microsoft: Microsoft Learn. https://learn.microsoft.com/es-es/azure/data-explorer/ingest-data-overview. Accessed 20 Dec 2022
8. Amazon Web Services: AWS. https://aws.amazon.com/es/streaming-data/. Accessed 20 Dec 2022
9. Amazon Web Services: https://aws.amazon.com/es/nosql/. Accessed 20 Dec 2022
10. Narkhede, N., Shapira, G., Palino, T.: Kafka: The Definitive Guide. O'Reilly, Sebastopol (2017)
11. Apache Kafka: https://kafka.apache.org/intro. Accessed 14 Mar 2023
12. CloudKarafka: https://www.cloudkarafka.com/. Accessed 14 Mar 2023
13. Hueske, F., Kalavri, V.: Stream Processing with Apache Flink. O'Reilly Media, Inc. (2019)
14. Apache Flink: https://flink.apache.org/. Accessed 14 Mar 2023
15. Elasticsearch: https://www.elastic.co/es/what-is/elasticsearch. Accessed 13 Sep 2020

Topological Data Analysis for the Evolution of Student Grades Before, During and After the COVID-19 Pandemic Mauricio Restrepo Abstract This paper presents an examination of how topological data analysis was applied to trace the evolution of grades obtained by students enrolled in various courses within the mathematics department of a university before, during, and after the COVID-19 pandemic. The study utilized 24 datasets that included information on grades, subjects, academic programs, and basic details about the teachers, such as their academic background and the type of contract they held with the institution. This paper analyzes different fail rate scenarios using two key algorithms of topological data analysis and demonstrates that the Mapper algorithm can effectively identify significant differences in the evolution of grades in relation to the type of teachers’ contract. Keywords Mapper algorithm · Fail rates · COVID-19 pandemic

1 Introduction

Topological data analysis (TDA) is a recent discipline that seeks to study geometrical aspects of data employing classical concepts of algebraic topology related to simplicial complexes and homology groups. From the topological point of view, a set of points in an n-dimensional space is simply a completely disconnected set. Using a notion of distance or a measure of similarity, it is possible to construct simplicial objects over a set of points that can reveal interesting geometrical aspects of the data. The study of topological data analysis began in the first decade of this century with the works of Edelsbrunner et al. [1], Zomorodian [2], and Gunnar [3], which provided qualitative information about data structure.

Topological data analysis has had different applications. One of the most relevant application areas of TDA is medicine, where the works related to diabetes [4], cancer [5], the immune system [6], and heart diseases [7, 8] stand out. It is also worth mentioning applications related to ecology [9], chemical engineering [10], decision-

M. Restrepo (B)
Universidad Militar Nueva Granada, Bogotá, Colombia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications, Studies in Big Data 132, https://doi.org/10.1007/978-3-031-38325-0_5


making, investments [11, 12], artificial intelligence [13], and time series analysis [14, 15]. Some works oriented to a solid mathematical formulation of TDA include [16–19].

The objective of this paper is twofold: first, to present how the general ideas of topological data analysis can be applied to educational data; second, to obtain relevant information from 24 data sets containing the grades attained by students enrolled in mathematics courses before, during, and after the COVID-19 pandemic.

The remainder of this paper is organized as follows: Sect. 2 presents preliminary concepts about topological spaces and homology groups. Section 3 presents the construction of simplicial objects from a data set. Sections 4 and 5 present two of the most commonly used techniques in topological data analysis: persistent homology and the Mapper algorithm. Section 6 presents an application of these techniques to 24 data sets in order to analyze the evolution of course fail rates. Finally, Sect. 7 presents some conclusions and outlines future work on this topic.

2 Preliminaries Topological Data Analysis (TDA) has emerged in recent years as a new field whose aim is to uncover, understand, and exploit the topological and geometric structure underlying complex and high-dimensional data.

2.1 Topological Data Analysis

One of the most widely used techniques in topological data analysis is the persistence diagram, which is obtained by means of persistent homology groups. With persistent homology, it is possible to identify topological attributes through simple elements such as connected components, holes, voids, and their generalizations [10].

2.1.1 Data Shape is Important

Figure 1 shows two data sets, randomly generated with noise from a pair of circles of radii 0.5 and 1, respectively. If we accept that the shape of the data is important, it is worth studying geometric techniques that describe that shape. Figure 2 shows the persistence diagrams corresponding to the datasets in Fig. 1; these diagrams are used to understand the shape of the data. A persistence diagram is a visual representation consisting of points (x, y) in R² that record the birth (x) and death (y) of topological features [10]. The red dots represent zero-dimensional features known as connected components, while the blue dots represent one-dimensional features

Fig. 1 Randomly generated data from circles with noise (two scatter panels titled "Two circles dataset N = 1000")

(Persistence diagram panels: Birth on the horizontal axis, Death on the vertical axis)

0.30

Birth

Fig. 2 Persistence diagrams for data

known as loops. The idea is that if the persistence diagrams in a particular dimension are very different, it is very likely that the datasets have important differences. Using a distance between persistence diagrams, it is possible to quantify how different the diagrams are, and therefore it is possible to establish some differences between the datasets. In this way, TDA is an interesting application of algebraic topology, particularly the concept of homology groups. Homology is, in essence, a connection between topological spaces and group theory. This connection allows the study of properties of topological spaces through certain algebraic objects such as groups. The fundamental idea is that, if the groups associated to two topological spaces are not isomorphic, the topological spaces will not be isomorphic either.

2.2 Topological Spaces Here are some basic concepts related to topological spaces, continuity, and homeomorphisms.

100

M. Restrepo

Definition 1 (Willard [20]) A topology on a set X is a collection τ of subsets of X , called open sets, satisfying: 1. Any union of elements of τ belongs to τ , 2. Any finite intersection of elements of τ belongs to τ , 3. ∅ and X belongs to τ . We say that (X, τ ) is a topological space. Let (X, τ ) and (Y, σ ) be topological spaces. A function f among topological spaces f :(X, τ ) → (Y, σ ) is a continuous function f :X → Y . The continuity means that f −1 (O) ∈ τ , for each O ∈ σ . That is to say, the inverse image of an open set of Y is an open set of X .

2.2.1

Homeomorphic Spaces

Two topological spaces are homeomorphic if there exists a bijection f :(X, τ ) → (Y, σ ) such that f and f −1 are continuous. From the topological point of view, two homeomorphic spaces are indistinguishable. The continuity of f and its inverse means that one space can be continuously deformed into the other and vice versa. For example, a circumference can be continuously deformed into a square or a triangle.

2.2.2

Subspaces

If (X, τ ) is a topological space and Y ⊆ X , then (Y, τY ) is a topological space, where τY = {Y ∩ O : O ∈ τ } In this case, the inclusion function i:(Y, τY ) → (X, τ ) is continuous.

2.3 Simplicial Complexes A simplicial complex is a geometric structure built from simple objects such as points, segments, triangles, tetrahedrons, and their counterparts in higher dimensions. Let S = {a0 , a1 , . . . , ak } be a set, where ai ∈ Rn . The hyperplane π(S) is the set defined as:   k k   n π(S) = p ∈ R : p = λi ai , where λi = 1 (1) i=1

i=1

Topological Data Analysis for the Evolution of Student Grades …

101

A hyperplane is a generalization of a point, a line (passing through two points) and a plane (passing through three points).

2.3.1

Affinely Independent Set

In Fig. 3 the set of points {a0 , a1 , a2 } is in the hyperplane generated by {a0 , a1 } and it is exactly the same as that generated by {a1 , a2 }. This shows the idea of an affinely independent set. Definition 2 (Keese [22]) A finite set S = {a0 , a1 , . . . , ak }, with ai ∈ Rn , is affinely independent if S is not contained in π(T ) for any proper subset T of S. A proof of the following result is easy and can be found in [22]. The set S = {a0 , a1 , . . . , ak } is affinely independent if and only if the set of vectors V = {a1 − a0 , a2 − a0 , . . . , ak − a0 } is linearly independent.

2.3.2

Simplices

An affinely independent set defines a simplex as follows:

Definition 3 (Keese [22]) Let S = {a_0, a_1, ..., a_k} be an affinely independent set. The simplex of dimension k (k-simplex) generated by S, denoted Δ(S), is:

Δ(S) = { p ∈ R^n : p = Σ_i λ_i a_i, with Σ_i λ_i = 1 and λ_i > 0 }    (2)

where the sums run over i = 0, ..., k.

Examples of simplices are shown in Fig. 4. A 0-simplex is a point, a 1-simplex is a segment, a 2-simplex is a triangle with its interior, and a 3-simplex is a solid tetrahedron. The points a_i are the vertices of the simplex. Usually Δ(S) is represented as [a_0, ..., a_k]. A face of a k-simplex is the simplex spanned by a subset of S.

Fig. 3 The set S = {a0 , a1 , a2 } is affinely dependent




Fig. 4 Simplices of dimension k = 0, 1, 2, 3

In particular, a face of dimension k − 1 is the (k − 1)-simplex spanned by {a_0, ..., â_i, ..., a_k}, where â_i means that a_i was removed.

2.3.3

Simplicial Complexes

A simplicial complex is a subspace K of R^n with a finite list of non-empty simplices satisfying the following conditions:
1. The union of the simplices is K.
2. Each point in K is in the interior of a unique simplex.
3. If S^k ∈ K and S^t is a proper face of S^k, then S^t ∈ K.
The dimension of K is the largest dimension of its simplices. Figure 5 shows a simplicial complex in R^2. Note that this simplicial complex has a single 2-simplex: σ = [a_2, a_3, a_4].

Definition 4 A simplicial complex K is called a triangulation of a topological space X if there exists a homeomorphism h : K → X. In this case, X is said to be triangulable. According to Fig. 6, the circumference S^1 is homeomorphic to the simplicial complex having three 1-simplices. In this case, the simplicial complex does not contain the 2-simplex [a_0, a_1, a_2].

Fig. 5 Simplicial complex


Fig. 6 Two homeomorphic spaces. From a topological point of view, the figures are equivalent

2.4 Homology Groups

As mentioned above, homology groups are the backbone of this theory. We present below some basic concepts related to homology.

2.4.1

Chains, Cycles, and Boundaries

Let K be a simplicial complex in R^n. A chain of dimension k is a subset of k-simplices in K. We can define an addition operation of chains with coefficients in Z: the addition of two chains is a linear combination of chains with integer coefficients. It is also possible to use coefficients in a general group G. The set of all k-chains together with the addition operation forms a group, denoted C_k, with the empty set as the zero element. The boundary of a k-simplex σ = [a_0, ..., a_k] is the collection of its (k − 1)-dimensional faces, denoted ∂_k(σ). It is written as a linear combination of (k − 1)-simplices; in other words, it is a (k − 1)-chain:

∂_k(σ) = Σ_{i=0}^{k} (−1)^i [a_0, ..., â_i, ..., a_k].    (3)

The concept of boundary can be extended to a k-chain as the sum of the boundaries of its simplices. Figure 7 shows the boundary operator acting on a 2-simplex. The result is a chain of 1-simplices, which determines a loop. Each boundary operator is a group homomorphism ∂_k : C_k → C_{k−1}, and the collection of boundary operators connects the chain groups into a chain complex:

⋯ → ∅ → C_2(K) → C_1(K) → C_0(K),    (4)

where the maps are ∂_2 and ∂_1, respectively.

According to group theory, each homomorphism f : G → H defines two important subgroups. A subgroup of G called the kernel and a subgroup of H called the image. The kernel of the operators ∂k are sets of k-chains with empty boundary and the image of ∂k is the collection of (k − 1)-chains that are boundaries of k-chains.


Fig. 7 The boundary of a 2-simplex is a 1-chain: ∂[a_0, a_1, a_2] = [a_1, a_2] − [a_0, a_2] + [a_0, a_1]

ker ∂k = {c ∈ Ck : ∂k (c) = ∅}.

(5)

Definition 5 (Edelsbrunner et al. [1]) A k-cycle is a k-chain in the kernel of ∂_k, and a k-boundary is a k-chain in the image of ∂_{k+1}. A 1-cycle is called a loop.

The sets Z_k of k-cycles and B_k of k-boundaries, together with the addition operation, form subgroups of C_k. A fundamental property of boundary operators is that the boundary of every boundary is empty (∂_k ∘ ∂_{k+1}(c) = ∅). This implies that the groups are nested (B_k ⊆ Z_k ⊆ C_k), and that the image of the homomorphism ∂_{k+1}, denoted B_k(K), is a subgroup of the kernel of ∂_k, denoted Z_k(K):

C_{k+1}(K) → C_k(K) → C_{k−1}(K),    (6)

where the maps are ∂_{k+1} and ∂_k, respectively. Therefore, the k-dimensional homology group of a simplicial complex K over Z is defined by means of the quotient group:

H_k(K) = Z_k(K) / B_k(K).    (7)

From a less formal point of view, homology groups H0 (K ), H1 (K ) and H2 (K ) count the connected components, holes (loops), and voids (cavities) present in a surface or n-dimensional manifold. For example, we know that the sphere S 2 has a connected component, has no holes, and has a void, in the sense that it encloses an empty space.

2.4.2

Betti Numbers

For any topological space (X, τ) with a finite number of path components, β_0(X) is the number of path components. The Betti number β_1(X) represents the number of holes, β_2(X) the number of voids, and β_n(X) the number of n-dimensional cavities. These values are called Betti numbers, and they reveal the topological complexity of X [1]. Table 1 shows the homology groups for some surfaces. According to the homology groups of the torus T, the Betti numbers are exactly the number of factors of Z in each case: β_0(T) = 1, β_1(T) = 2, and β_2(T) = 1.

Table 1 Homology groups for some surfaces

Space            H0(K)   H1(K)      H2(K)
Circumference    Z       Z          0
Sphere           Z       0          Z
Torus            Z       Z × Z      Z
Klein bottle     Z       Z × Z_2    0
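As a quick illustration of how these groups can be obtained in practice, the sketch below is a minimal example, assuming the Gudhi library (one of the packages listed in Sect. 6.1.1) is available; it builds the hollow triangle of Fig. 6 and a filled triangle, and reports their Betti numbers.

```python
import gudhi

# Hollow triangle: the triangulation of the circumference S^1 of Fig. 6
# (three vertices, three edges, no 2-simplex).
hollow = gudhi.SimplexTree()
for edge in [[0, 1], [1, 2], [0, 2]]:
    hollow.insert(edge)          # inserting an edge also inserts its vertices

# Filled triangle: the 2-simplex [a0, a1, a2] together with all its faces.
filled = gudhi.SimplexTree()
filled.insert([0, 1, 2])

for name, st in [("hollow triangle", hollow), ("filled triangle", filled)]:
    st.compute_persistence()     # required before asking for Betti numbers
    print(name, st.betti_numbers())

# Expected: the hollow triangle has beta_0 = 1 and beta_1 = 1 (one loop),
# while the filled triangle has beta_0 = 1 and beta_1 = 0.
```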

3 Simplicial Complexes from Data

The theory of simplicial complexes, and especially persistent homology, is applied in a natural way to a set of points in R^n. For this purpose, Vietoris-Rips and Čech simplicial complexes can be constructed from data.

3.1 Vietoris-Rips Complex

Suppose X is a cloud of points in an n-dimensional space. On each x_i ∈ X we construct a ball B_r(x_i) of radius r, varying r from 0 as far as necessary. Formally, the Vietoris-Rips simplicial complex is defined as follows [21]:

Definition 6 Let X = {x_1, ..., x_n} be a cloud of points with a metric d, and let r ≥ 0. The Vietoris-Rips (VR) complex is defined inductively in the following way:
1. Each point in X makes a 0-simplex.
2. Each pair x_1, x_2 ∈ X makes a 1-simplex σ = [x_1, x_2] if d(x_1, x_2) ≤ r.
3. Each x_1, ..., x_k ∈ X makes a (k − 1)-simplex with vertices x_1, ..., x_k if all points are within a distance of r from each other.
Formally,

R_r(X) = { [x_1, ..., x_k] : d(x_i, x_j) ≤ r, ∀ i, j }    (8)

R_r(X) contains all the simplices whose diameter is less than or equal to r. Figure 8 shows that each value of r defines a VR complex.

Fig. 8 Vietoris-Rips complexes for different values of r (r = 0, 0.24, 0.37, 0.5, 1)

With r = 0 the complex is composed of nine points (connected components). For r = 0.37 there are eight connected components and one 1-simplex appears (birth). For r = 0.5 a loop is obtained. Finally, this loop disappears (death) at r = 1. It is easy to note that if r_1 ≤ r_2, then R_{r_1}(X) ⊆ R_{r_2}(X).
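A minimal sketch of how such a Vietoris-Rips filtration can be built from a point cloud is shown below, assuming the Gudhi library (one of the packages listed in Sect. 6.1.1); the point coordinates are made up for illustration.

```python
import numpy as np
import gudhi

# A small, made-up point cloud in the plane (nine points, as in Fig. 8).
points = np.array([[0.0, 0.0], [0.3, 0.1], [0.9, 0.0], [1.0, 0.5],
                   [0.8, 1.0], [0.3, 1.1], [0.0, 0.6], [0.5, 0.5],
                   [1.4, 1.2]])

# Build the Vietoris-Rips complex R_r(X) up to a maximal radius r = 1.
rips = gudhi.RipsComplex(points=points, max_edge_length=1.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)

# Each simplex is stored with its filtration value: the r at which it appears.
for simplex, r in list(simplex_tree.get_filtration())[:10]:
    print(simplex, round(r, 2))
```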

3.2 Čech Complex

The Čech complex of a dataset X with radius r is defined as:

C_r(X) = { [x_1, ..., x_k] : ⋂_i B_r(x_i) ≠ ∅ }    (9)

This means that a simplex σ with vertices {x_1, ..., x_k} is added to the complex when the balls B_r(x_i) have a non-empty intersection. The fundamental difference between these two complexes can be seen in Fig. 9: the Čech complex is a 1-chain, while the Vietoris-Rips complex is a 2-simplex. A well-known relationship between these simplicial complexes is:

C_r(X) ⊆ R_{2r}(X) ⊆ C_{2r}(X).    (10)

Fig. 9 Čech complex (a), Vietoris-Rips complex (b)


3.3 Nerves

The Čech complex of a dataset X is a particular case of the following construction. Given a covering C = {U_i}_{i∈I}, the nerve of C is the abstract simplicial complex S(C) whose vertices are the open sets U_i, and a simplex σ = [U_{i_0}, ..., U_{i_k}] belongs to S(C) if and only if ⋂_{j=0}^{k} U_{i_j} ≠ ∅. A relevant aspect of the nerve S(C) is that if the intersection of any subcollection of the U_i is either empty or contractible, then X and S(C) are homotopically equivalent, that is, they have the same homology groups. This allows the study of topological properties through simplicial objects, which are suitable for constructing efficient computational algorithms.

3.4 Filtrations

In general it is assumed that the simplices in the Vietoris-Rips complex are added one after another, i.e.:

∅ = K_0 ⊆ K_1 ⊆ ⋯ ⊆ K_n = R_r(X)

Therefore, a filtration of a simplicial complex K is a family of subcomplexes K_i such that K_i ⊆ K_j whenever i ≤ j. So, we have a sequence of homology groups:

0 = H_p(K_0) → H_p(K_1) → ⋯ → H_p(K_n) = H_p(R_r(X)),    (11)

where each map is denoted f_p.

4 Persistent Homology

Persistent homology is a methodology originally proposed by Edelsbrunner, Letscher, and Zomorodian in [1] and further developed by many others for extracting and quantifying topological information from data. The r-dimensional persistent homology group of K over Z is defined by means of the quotient:

H_r(K) = Z_r(K_i) / (B_r(K_j) ∩ Z_r(K_i))    (12)

This group contains the homology classes of K_i that are present in K_j.


4.1 Persistence Diagrams

The data on the left-side graph in Fig. 10a was generated using two concentric circles with noise. Its persistence diagram appears on the right, Fig. 10b. The red dots show the dynamics of 0-dimensional objects or connected components, while the blue dots show the elements of the 1-dimensional homology, or loops. For each point in the dataset a circle of radius r is constructed, with r varying from zero as far as necessary. Upon constructing a Vietoris-Rips simplicial complex, we can see the dynamics of the homology. In dimension zero (red-colored dots), we see that the connected components begin to disappear as r grows, because the dots are joined to other dots. In dimension 1 (blue-colored dots), we see that small loops begin to appear, but soon disappear. The two blue dots in Fig. 10b that are farther away from the diagonal represent the circles from which the dots were generated. Each coordinate of the persistence diagram expresses the value of r for which a topological property appears (is born) and disappears (dies). The points close to the diagonal are, in general, elements that die very fast, while the points that are far from the diagonal correspond to elements that persist. The birth and death of a topological attribute can be formally defined as:
• Birth. For a filtration of K and subcomplexes K_i and K_j, a topological attribute x ∈ H_r(K_j) is born at j if x ∉ H_r(K_i) for all i < j.
• Death. For a filtration of K and subcomplexes K_i and K_j, a topological attribute x ∈ H_r(K_j) dies at j if x ∉ H_r(K_i) for all i > j.

Fig. 10 A random dataset (a: two circles dataset, N = 1000) and its persistence diagram (b)
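The construction just described can be reproduced in a few lines; the sketch below is an illustrative example, assuming the Ripser and persim packages from the SciKit-TDA ecosystem, that samples two noisy concentric circles and computes the 0- and 1-dimensional persistence diagrams, mimicking Fig. 10.

```python
import numpy as np
from ripser import ripser
from persim import plot_diagrams

rng = np.random.default_rng(0)

def noisy_circle(n, radius, noise=0.05):
    """Sample n points on a circle of the given radius, with Gaussian noise."""
    angles = rng.uniform(0, 2 * np.pi, n)
    pts = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    return pts + rng.normal(scale=noise, size=pts.shape)

# Two concentric circles, as in the dataset of Fig. 10.
X = np.vstack([noisy_circle(500, 1.0), noisy_circle(500, 0.5)])

# dgms[0] holds the (birth, death) pairs of connected components (H0),
# dgms[1] those of the loops (H1); the two most persistent H1 points
# correspond to the two circles.
dgms = ripser(X, maxdim=1)['dgms']
plot_diagrams(dgms, show=True)
```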


4.2 Bottleneck Distance

It is unlikely that two persistence diagrams will have the same number of points. To define a distance between two persistence diagrams, the first requirement is that the diagrams have the same number of points. This is achieved by taking from the diagonal Δ as many points as necessary. In this way, if D_1, D_2 are persistence diagrams and h : D_1 ∪ Δ → D_2 ∪ Δ is a bijection, the bottleneck distance W_∞ between D_1 and D_2 is defined as follows (see Fig. 11):

W_∞ = inf_h  sup_{x ∈ D_1 ∪ Δ} ‖x − h(x)‖_∞    (13)
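In practice this distance does not have to be computed by hand; the following sketch, assuming the persim package, computes W_∞ for two small, made-up diagrams.

```python
import numpy as np
from persim import bottleneck

# Two small, made-up persistence diagrams given as (birth, death) pairs.
D1 = np.array([[0.00, 0.10], [0.05, 0.30], [0.02, 0.35]])
D2 = np.array([[0.00, 0.12], [0.04, 0.25]])

# W_infinity between the diagrams; unpaired points are matched to the
# diagonal, as in Eq. (13).
print(bottleneck(D1, D2))
```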

5 Mapper Algorithm

The Mapper algorithm is another important tool in any practical implementation of TDA. It is an unsupervised machine learning algorithm used to obtain different descriptions of a dataset, since it provides an approximate representation of the structure of the data. It has been successfully used for clustering and feature selection [13]. In practice, the Mapper algorithm has two major applications: data visualization and clustering, and feature selection. The basic idea of the Mapper algorithm is to construct a graph from a dataset of points in some n-dimensional space. The Mapper algorithm transforms data into a graph. First, we start with data, i.e. a cloud of points. Second, the data is projected into a lower dimension using a scalar function f : X → R. This function maps the higher-dimensional input data into a lower-dimensional representation.

Fig. 11 Bottleneck distance between two persistence diagrams. First of all, it is necessary to construct a bijection between the persistence diagrams, with the help of the points of the diagonal Δ. (Adapted from [15])

Fig. 12 Dataset in a two-dimensional space

Third, a covering C = {U_i}_{i∈I} is defined for the projected data. Fourth, we cluster each pre-image f^{-1}(U_i), and finally we construct a graph based on the clusters [17]. More specifically, we draw a node corresponding to each of the clusters in step 4, and connect two nodes together if they have any members in common. Perhaps the biggest problem with the Mapper algorithm is that the visualization of the data can change dramatically with small variations of the parameters (the projection function, the number of intervals in the covering, the overlap percentage, and the clustering algorithm used, along with its own parameters). For this reason, in this particular application the Mapper parameters were fixed and applied to the 24 datasets. A simple example of this algorithm is illustrated below for 2D data. Let us consider the dataset in Fig. 12. The functions f are the projections f_1(x, y) = x and f_2(x, y) = y. Let us consider the covering of f_1(X) defined by six open intervals with a small overlap percentage, Fig. 13a. The output of the algorithm is a two-dimensional graph in which the nodes represent sets of points obtained through the clustering process (Fig. 13b), and the edges represent a connection between nodes containing points in common. The graph in Fig. 13c was obtained from the dataset, using f as the x-projection. A covering of six intervals, known as cubes, with a 10% overlap was chosen, and the classical k-means clustering, with k = 2, was used. Similarly, if we use the second projection (on the y-axis) and repeat the same process, we obtain the graph in Fig. 14c.
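A sketch of these steps with the Kepler Mapper library (the tool used later in this chapter) is shown below; the toy point cloud is synthetic and stands in for the dataset of Fig. 12, and the parameters follow the ones described for Fig. 13 (x-projection, six cubes, 10% overlap, k-means with k = 2).

```python
import numpy as np
import kmapper as km
from sklearn.cluster import KMeans

# Step 1: the data, a made-up cloud of 2D points standing in for Fig. 12.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2))
                  for c in [(2, 2), (5, 6), (9, 3)]])

mapper = km.KeplerMapper(verbose=0)

# Step 2: project onto a lower dimension; here f1(x, y) = x.
lens = mapper.fit_transform(data, projection=[0])

# Steps 3-5: cover the projected values with 6 overlapping intervals (cubes),
# cluster each pre-image with k-means (k = 2), and build the graph.
graph = mapper.map(lens, data,
                   cover=km.Cover(n_cubes=6, perc_overlap=0.10),
                   clusterer=KMeans(n_clusters=2, n_init=10))

mapper.visualize(graph, path_html="mapper_first_projection.html",
                 title="Graph from the first projection")
```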

Fig. 13 Graph obtained from data, using the first projection

Fig. 14 Graph obtained from data, using the second projection

This example shows the importance of properly choosing the projection function f . For example, the graph in Fig. 13c has two connected components, while the graph in Fig. 14c has only one, so they are not topologically equivalent. Neither is better than the other; they are simply two different representations of the data. Another important aspect of the Mapper algorithm is the possibility of coloring the nodes based on specific criteria. For example, the values of an attribute can be used to color the graph nodes.

6 Application

According to data published by UNESCO,1 almost all Latin American countries closed their schools and universities for more than 40 weeks during the COVID-19 pandemic; that is, they were closed for almost two years. These closures strongly affected the learning process of millions of young people, particularly the least favored populations, who did not have the technological tools to attend their classes remotely. Even so, in the best of cases, the students who managed to continue their studies in this modality did not obtain a quality education, despite all the efforts made by the educational institutions. In Colombia, the lockdown for universities began in March 2020. For a period of two years, the fail rates of the mathematics courses decreased by almost half and remained steady. In January 2022, the lockdown ended and a massive return to face-to-face teaching took place. Figure 15 shows that course fail rates are gradually returning to pre-pandemic levels. Therefore, our interest is to analyze the fail rates among the university students enrolled in undergraduate mathematics courses.

1

https://en.unesco.org/covid19/educationresponse.


Fig. 15 Fail rates from 2019 to 2022

6.1 Datasets

For this application, 24 datasets corresponding to the grades obtained by university students in their mathematics courses were selected. Each dataset contains between 8000 and 9000 records, corresponding to the students registered in each academic period. After preprocessing the data, nine variables were selected in each dataset. These variables are related to student information: academic program, subjects, level at which the subjects were taught, and some data related to their teacher's academic background and the type of contract they held with the institution. The variables used in each dataset are described in Table 2.

Table 2 Variables used in each dataset

N  Variable          Type         Code
1  Grade             Numerical    Numerical
2  Approval status   Categorical  1: pass, 0: fail
3  Class absences    Numerical    Numerical
4  Program code      Numerical    Numerical
5  Subject           Categorical  Numerical according to areas
6  Semester          Numerical    Numerical
7  Teacher training  Categorical  1: Master, 2: Ph.D.
8  Contract type     Categorical  1: adjunct, 2: lecturer, 3: tenure
9  Teaching rank     Categorical  1: auxiliary, 2: assistant, 3: associate, 4: full


Fig. 16 Range percentages from 2019 to 2022

The 24 datasets correspond to the academic periods between 2019 and 2022 (six datasets per year). Since for each semester there are three grade reports, the datasets were coded in the form 19-1-1, meaning year 2019, semester 1, report 1; similarly, 22-2-3 means year 2022, semester 2, report 3. Some files can be found at https://github.com/MRestrepo08/TDA, as well as the code for the persistence diagrams and the graphs from the Mapper algorithm. Figure 15 shows the evolution of subject fail rates from 2019 until the end of 2022. An interesting aspect detected over time is the percentage of grades in the different ranges. The minimum grade is 0 and the maximum is 50, where 30 is the minimum passing grade. Figure 16 shows the percentages of the grades obtained, organized in these ranges: 0–9, 10–19, 20–29, 30–39, and 40–50. It is important to note that in Fig. 16 the percentage of grades in the 40–50 range doubled during the pandemic (from 2020-1-2 to 2021-2-3); once face-to-face teaching returned, the grades in this range became similar to those obtained before the pandemic. As a result, the 20–29 range was reduced by almost half, while the other ranges remained steady over time.
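A sketch of how these range percentages could be computed from one of the datasets, assuming it is loaded into a Pandas DataFrame with the numerical "Grade" column of Table 2; the file name is hypothetical.

```python
import pandas as pd

# Hypothetical file name; grades range from 0 to 50, 30 being the passing mark.
df = pd.read_csv("19-1-1.csv")

bins = [0, 10, 20, 30, 40, 51]
labels = ["0-9", "10-19", "20-29", "30-39", "40-50"]
ranges = pd.cut(df["Grade"], bins=bins, labels=labels, right=False)

# Percentage of grades falling in each range, as plotted in Fig. 16.
percentages = ranges.value_counts(normalize=True).sort_index() * 100
print(percentages.round(2))
```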

6.1.1

Software

Several applications have been developed for the computation of simplicial complexes and persistent homology. Among them stand out:
• SciKit-TDA, Python
• Teaspoon, Python
• Ripser, C++
• Gudhi, C++/Python
• Giotto-TDA, Python
• CliqueTop, Matlab
• TDA, R

Specifically, the Kepler Mapper library, developed in Python was used for the analysis of the datasets mentioned above, along with other commonly used Python libraries, such as SciKit-learn, Numpy, and Pandas.

6.2 Some Results

Figure 17 shows two of the graphs obtained after applying the Mapper algorithm to each dataset. The projection function used was the projection of each data point onto the line given by the direction of the first principal component, obtained by PCA. The covering consists of 8 intervals with an overlap percentage of 60%. The clustering algorithm used was k-means, with k = 2. This resulted in 16 nodes with their respective connections. The color of the dots (nodes) is obtained using a color scale. In this particular case, we used a color scale for the values of the attribute "Contract": red for high values, and green for low values. The node with the most intense tone of red contains a higher percentage of students who took the course with a teacher whose contract type is 3 (tenure), while the node with the most intense tone of green contains a higher percentage of students with teachers whose contract type is 1 (adjunct). In general, the use of these parameters in the Mapper algorithm for the datasets 20-1-2, 20-2-1, 21-1-1, 21-1-3, 22-1-2, 22-1-3, and 22-2-1 resulted in a graph with two connected components, while for the other datasets the graph only had one connected component.
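A sketch of how this configuration could be expressed with Kepler Mapper is given below; the data array is a synthetic stand-in for one preprocessed dataset (the real records are not reproduced here), and the coloring arguments follow recent Kepler Mapper versions, so the exact parameter names may differ in older releases.

```python
import numpy as np
import kmapper as km
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for one preprocessed dataset: rows are student records, columns the
# nine numerical variables of Table 2 (synthetic values, for illustration only).
rng = np.random.default_rng(2)
X = rng.random((500, 9))
contract_type = rng.integers(1, 4, size=500)   # variable 8 of Table 2

mapper = km.KeplerMapper(verbose=0)

# Lens: projection of each record onto the first principal component (PCA).
lens = mapper.fit_transform(X, projection=PCA(n_components=1))

# Covering of 8 intervals with a 60% overlap; k-means with k = 2 per pre-image.
graph = mapper.map(lens, X,
                   cover=km.Cover(n_cubes=8, perc_overlap=0.60),
                   clusterer=KMeans(n_clusters=2, n_init=10))

# Color the nodes by the contract-type attribute, as in Fig. 17.
mapper.visualize(graph, path_html="grades_mapper.html",
                 color_values=contract_type,
                 color_function_name="Contract type")
```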

Fig. 17 Graphs obtained using the Mapper algorithm for grades in 2019-1-1 and 2019-1-2


Fig. 18 Flares found in the 19-2-1 and the 22-2-1 dataset


6.2.1

Flares

In the visualization of data using this technique, it is very common to search for flares and try to characterize the individuals that belong to the nodes that make up the flare. In none of the graphs obtained up to this point was it possible to visualize any flare. But by reducing the percentage of overlap to 40%, some flares were observed in these sets: 19-1-2, 19-1-3, 19-2-1, 20-1-2, 21-1-1, 21-1-1, 21-1-2, 22-1-2, and 22-2-1. As an example, Fig. 18 shows a flare in a yellow box, for datasets 19-2-1 and 22-2-1. After identifying the individuals in the flares, it was revealed that the programs with the highest percentage of approval corresponded to three distance education programs.

6.2.2

Academic Background

Again, the Mapper algorithm was applied to each of the 24 datasets. In this case, the nodes were colored according to the teachers’ academic background. Each node contains the percentages of teachers with master’s and doctoral degrees. The node with the highest percentage of teachers with doctoral degrees was selected, and the fail rates in these nodes were also identified. The results shown in Table 3 correspond to the fail rates of courses held by teachers with a doctoral degree.


Fig. 19 Persistence diagrams for distance comparison

As Table 3 shows, fail rates in periods 22-1-1, 22-1-2, and 22-1-3 increased significantly compared to the other periods. The data shows that the fail rates of the selected nodes in each dataset are higher than the overall percentages of teachers with this academic background, which suggested that another method of comparison was necessary. We used the bottleneck distances for the diagrams shown in Fig. 19 to compare the 22-1 periods with other periods, as explained below. First of all, we calculated the distances between the three periods highlighted in Table 3 and found that the largest distance (60.41) is between datasets 22-1-1 and 22-1-2. On the other hand, after comparing period 22-1-1 with periods 19-2-1, 19-2-2, and 19-2-3, we found that the largest distance is 85.21.

Table 3 Fail rates for a specific node considering teacher's academic background

Dataset  Master's (%)  Ph.D. (%)    Dataset  Master's (%)  Ph.D. (%)
19-1-1   13.3          28.0         21-1-1   16.0          30.8
19-1-2   13.3          25.2         21-1-2   15.9          30.7
19-1-3   13.5          27.4         21-1-3   16.0          26.9
19-2-1   14.1          19.4         21-2-1   18.2          30.0
19-2-2   14.1          19.5         21-2-2   18.2          30.7
19-2-3   14.1          17.8         21-2-3   18.4          31.4
20-1-1   15.2          32.1         22-1-1   21.9          43.3
20-1-2   15.2          26.9         22-1-2   22.1          42.6
20-1-3   15.1          26.3         22-1-3   22.2          38.0
20-2-1   16.8          25.0         22-2-1   21.7          28.0
20-2-2   16.8          26.9         22-2-2   20.7          26.8
20-2-3   16.9          24.8         22-2-3   21.0          27.3


Figure 19a shows the datasets of periods 22-1-1, 22-1-2, and 22-1-3 (post-pandemic) compared with the datasets (Fig. 19b) of periods 2022-1-2, 19-2-2, and 19-2-3 (pre-pandemic and post-pandemic).

6.2.3

Employment Modalities

In general, university professors have employment contracts in the following modalities: Tenured, Lecturers, and Adjuncts. The aim is to see if there is any important connection between the fail rates and the type of employment. Figure 20 shows the percentages of students in engineering and economics majors who failed their mathematics courses, in reference to the type of teacher employment modalities. In this case, the nodes in the Mapper algorithm were colored according to the contract type of the teachers. In the same way, the node with the highest percentage of tenured teachers was selected, and the fail rates in these nodes were also obtained. The results are shown in Fig. 21. In Fig. 21, Tenure_N represents the percentage of tenured teachers present in the node, while Tenure represents the percentage of grades recorded by these teachers in the database, which reveals an important difference in the fail rates. Finally, from each dataset, the records of students who took and failed a subject with tenured teachers were selected and compared with those who took the subject with lecturer teachers. The bottleneck distances of persistence diagrams were calculated in each case, and the color diagrams shown in Fig. 22 were obtained. The box in the center of each graph shows the pandemic period. As Fig. 22a shows, the datasets for the post-pandemic periods for tenured teachers significantly differ (in distance) from the other datasets, in contrast to the datasets for lecturer teachers in Fig. 22b.

Fig. 20 Fail rates from 2019 to 2022


Fig. 21 Rates comparison for tenured teachers


Fig. 22 Color maps for bottleneck distances of each dataset. A comparison among tenured and lecturer teachers

7 Conclusion and Future Work

Topological data analysis provides a set of tools to visualize data in a high-dimensional space, using persistence diagrams and two-dimensional graphs. However, for an efficient visualization, it is necessary to adjust the parameters of each algorithm. This analysis made it possible to detect significant differences in the evolution of fail rates in mathematics courses. For example, bottleneck distances between persistence diagrams revealed an important difference among the post-pandemic datasets of tenured teachers. Given the difficulty of handling all the parameters of the Mapper algorithm simultaneously, as future work we are considering studying the influence of the Mapper algorithm parameters on obtaining a better visualization. Also, we will try to find topological attributes of the simplicial complexes of each dataset, in order to use them in other machine learning algorithms.


Acknowledgements This work was supported by Universidad Militar Nueva Granada’s VICEIN Special Research Fund, under project CIAS 2548-2018.

References 1. Edelsbrunner, H., Letscher, D. Zomorodian, A.: Topological persistence and simplification. Discrete Comput. Geom. 28, 511–533 (2002). https://doi.org/10.1007/s00454-002-2885-2 2. Zomorodian, A.: Topology for Computing. Cambridge University Press, New York (2015) 3. Gunnar, G.: Topology and data. Bulletin (New Series) AMS. 46(2), 255–308 (Apr 2009) 4. Lum, P., Singh, G., Lehman, A., et al.: Extracting insights from the shape of complex data using topology. Sci. Rep. 3, 1236 (2013). https://doi.org/10.1038/srep01236 5. Nicolau, M., Levine, A.J., Carlsson, G.: Topology-based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proc. Natl. Acad. Sci. USA 108(17), 7265–7270 (2011) 6. Sasaki, K., Bruder, D. Hernandez-Vargas, E.: Topological data analysis to model the shape of immune responses during co-infections. Commun. Nonlinear Sci. Numer. Simul. 85 (2020) 7. Aljanobi, F.A., Lee, J.: Topological data analysis for classification of heart disease data. In: 2021 IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 210–213 (2021) 8. Hwang, D., et al.: Topological data analysis of coronary plaques demonstrates the natural history of coronary atherosclerosis. Cardiovasc. Imaging 14(7) (2021) 9. Migdalek, G., Zelawski, M.: Measuring population-level plant gene flow with topological data analysis. Ecol. Inform. 70, 1–11 (2022) 10. Smith, A., Dlotko, P., Zavala, V.: Topological data analysis: concepts, computation, and applications in chemical engineering. Comput. Chem. Eng. 146 (2021) 11. Goel, A., Pasricha, P., Mehra, A.: Topological data analysis in investment decisions. Expert. Syst. Appl. 147 (2020) 12. Gidea, M.: Topological data analysis of critical transitions in financial networks. In: International Winter School and Conference on Network Science. Springer Proceedings in Complexity (2017) 13. Chazal, F., Michel, B.: An introduction to topological data analysis: fundamental and practical aspects for data scientists. Front. Artif. Intell. 4 (2021) 14. Majumdar, S., Kumar, A.: Clustering and classification of time series using topological data analysis with applications to finance. Expert. Syst. Appl. 162 (2020) 15. Karan, A., Kaygun, A.: Time series classification via topological data analysis. Expert. Syst. Appl. 183 (2021) 16. Carriere, M., Oudot, S.: Structure and stability of the one-dimensional mapper. Found. Comput. Math. 18(6), 1333–1396 (2018) 17. Bui, Q.T., Vo, B., Do, H.A.N., Hung, N.Q.V., Snasel, V.: F-mapper: a fuzzy mapper clustering algorithm. Knowl.-Based Syst. 189 (2020) 18. Munch, E.: A user’s guide to topological data analysis. J. Learn. Anal. 4(2), 47–61 (2017) 19. Sheffar, D.: Introductory Topological Data Analysis. Department of Mathematics and Statistics. University of Victoria, Canada (2020). arXiv:2004.04108v1 [math.HO] 20. Willard, S.: General Topology. Addison-Wesley Publishing, Massachusetts (1970) 21. Chen L.M., Su, Z., Jiang, B.: Mathematical Problems in Data Science. Theoretical and Practical Methods. Springer (2015) 22. Keese, J.W.: Introducción a la Topología Algebraica. Alhambra, Madrid (1971)

Redescending M-Estimators Analysis on the Intuitionistic Fuzzy Clustering Algorithm for Skin Lesion Delimitation Dante Mújica-Vargas, Blanca Carvajal-Gámez, Alicia Martínez-Rebollar, and José de Jesús Rubio

Abstract This study investigates the Redescending M-Estimators as an alternative to strengthen an Intuitionistic Fuzzy C-Means algorithm against atypical information. In this regard, the objective function was stated taking into account the loss function, while the update expressions for the membership matrix and the prototypes vector were formulated by a derivative procedure, depending on the influence functions. The evaluated estimators were Huber Skipped Mean, Simple Cut, Tukey Biweight, Hampel's Three Part Redescending, Andrew's Sine, German-MacClure, Lorentzian, Asad-Qadir, Insha and Alamgir. The empirical study was performed using the ISIC 2017 dataset for the skin lesion delimitation task; these images have inherent artifacts that were considered as atypical information, and the performance was quantified by popular state-of-the-art metrics. The quantitative results show an outstanding performance of the Hampel's Three Part Redescending, with Jaccard Similarity Coefficient = 0.905 ± 0.054, Dice Measure = 0.910 ± 0.060, Misclassification Ratio = 7.174 ± 0.864 and Hausdorff Distance = 6.281 ± 0.804, in contrast with all the other estimators.

Keywords Intuitionistic fuzzy C-Means · Skin lesion delimitation · Hampel's three part redescending

D. Mújica-Vargas (B) · A. Martínez-Rebollar Tecnológico Nacional de México/Centro Nacional de Investigación y Desarrollo Tecnológico, Interior Internado Palmira S/N, Cuernavaca-Morelos, Mexico e-mail: [email protected] A. Martínez-Rebollar e-mail: [email protected] B. Carvajal-Gámez Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas, Instituto Politécnico Nacional, Ciudad de México, Mexico e-mail: [email protected] J. de J. Rubio Sección de Estudios de Posgrado e Investigación, ESIME Azcapotzalco, Instituto Politécnico Nacional, Av. de Las Granjas No. 682, Col., Ciudad de México, Santa Catarina, Mexico © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al. (eds.), Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications, Studies in Big Data 132, https://doi.org/10.1007/978-3-031-38325-0_6



1 Introduction

Melanoma is the most lethal of all the skin cancers, and its incidence has increased significantly over the past several decades [11]. If melanomas are contained in the outer layers of the skin, then simple excisions are still curative, with a 5-year survival of approximately 98% [11]. Regrettably, although melanoma may be diagnosed early through simple visual inspection, many patients are diagnosed late in the disease. Therefore, over 57,000 people worldwide are estimated to have died of melanoma in 2020, despite dramatic recent progress in the treatment of advanced metastatic melanoma [11]. Given the importance of the issue, the research community has introduced interesting computer-aided systems, as well as innovative algorithms. The state of the art exhibits a trend toward proposals based on Deep Learning, which is supported in [1], where a survey of different Deep Learning techniques for skin lesion analysis and melanoma cancer detection was presented. In [38], it was proposed to integrate dermatologists' clinical knowledge into the learning process of a knowledge-aware deep framework for collaborative skin lesion segmentation and melanoma recognition. In [21], an architecture based on the VGG-16 and GoogLeNet architectures was introduced for the segmentation task; the proposal was complemented by an SVM algorithm in order to classify the lesions into different classes. In [17], a novel network architecture, so-called DSNet, was introduced for automatic dermoscopic skin lesion segmentation; in contrast with most Deep Learning submissions, this one did not use common generic architectures such as UNET, YOLO, VGG, ResNet or Mask R-CNN. The Deep Learning paradigm has also been complemented with other techniques: the proposal introduced in [3] used a YOLOv4-DarkNet for melanoma lesion detection, while the segmentation was performed by the Active Contour algorithm. In [30], an RCNN was used for melanoma lesion detection and deep feature extraction, and that information was clustered through a Fuzzy C-Means (FCM) algorithm. A similar operating principle was suggested in [29], since a faster-RCNN was used to obtain the feature vector, and then a Fuzzy C-Means algorithm was employed to segment the melanoma-affected portion of skin with variable size and boundaries. In [9], a methodology was developed in which the images were pre-processed using a bilateral filter to eliminate irrelevant noise artifacts; the pre-processed images were then fed as input to a Fuzzy U-Net for segmentation, optimized by the Mayfly Optimizer. In addition to the methods mentioned, there exist important contributions based on clustering algorithms, and even on classical image processing techniques. In [13], a three-stage methodology was proposed, in which a Fuzzy C-Means clustering algorithm was used to segment the given input image, then features were mined from the segmented image using the Local Vector Pattern (LVP) and the Local Binary Pattern (LBP); subsequently, a Fuzzy classifier was used to carry out the classification process. In [33], an approach of support vector machine-based black widow


optimization for skin disease classification was introduced; at an earlier stage, the segmentation of the lesion region was performed by a Level Set Fuzzy C-Means technique. In [16], the Block-Matching Fuzzy C-Means clustering algorithm was proposed to segment RGB color images degraded with Additive White Gaussian Noise; this proposal was designed to segment natural images, and in addition it was effective for segmenting real melanoma images. In [12], a mobile imaging system for early detection of melanoma was developed, where the hierarchical segmentation was based on the combination of Otsu's and Minimum Spanning Tree (MST) methods. By analyzing all these research papers, some aspects should be mentioned. The Deep Learning approaches demand large quantities of information for a good training stage, and many times special hardware is required in order to speed up that training procedure. Conventional architectures were used in most of the studies; therefore, only one proposal was specifically designed to work with melanoma images. For improving the Deep Learning performance there was the need to use other algorithms, e.g. those based on Fuzzy Theory. On the other hand, there is a trend towards using fuzzy clustering for melanoma segmentation, since it has an outstanding performance; however, it requires special hand-crafted features, including those obtained by convolutional neural network architectures. In addition, only a few studies reduced the inherent noise existing in this type of images. To overcome these problems, in this study a robust clustering algorithm is stated; its mathematical support is the Intuitionistic Fuzzy C-Means algorithm, strengthened by Redescending M-Estimation. To that end, the objective function is written in terms of the M-Estimators, and the update expressions for the membership matrix and the prototypes vector are derived. The main contributions of this study are summarized as follows:
1. A robust clustering algorithm tolerant to atypical information, with the capacity for skin lesion delimitation.
2. The proposal does not require a training stage, nor great amounts of information for learning.
3. The proposal is not a method composed of different techniques; rather, it is an algorithm that consists of an objective function and two update expressions for internal variables.
4. The intuitionistic clustering algorithm works only with color features, i.e. it does not use special or sophisticated characteristics obtained by other techniques.
5. Since it is an algorithm for clustering, this proposal may be used in other contexts, such as pattern recognition tasks, where atypical information is present.

124

D. Mújica-Vargas et al.

lesion segmentation. Consequently, this type of images is challenging, and therefore appropriate to evaluate the current proposal performance. The rest of the chapter is organized as follows. In Sect 2, the Redescending MEstimation is described in a nutshell. Section 3 depicts the proposal mathematical developing. Section 4 presents the experiments as well as a comparative analysis. Conclusions and recommendations for future work are given in the last section.

2 Redescending M-Estimation

Robust estimation deals with possible perturbations of the probabilistic models and proposes solutions for protecting the statistical parameters θ = [μ, σ], where the mean estimate μ corresponds to location estimation, whereas the standard deviation estimate σ corresponds to scale estimation. These robust procedures can be approached by means of the R, L and M estimators. In the beginning, robust estimation was stated as a solution to overcome the noise sensitivity of the least squares method [5, 18, 32, 36]; however, it has also been used in other tasks, such as nonlinear filters to reduce noise in images [27, 28]. In brief, the R-estimators are based on rank criteria for hypothesis testing on the symmetry center of a probability density distribution. That name is given by the statistical rank of an observation x, i.e. R = r_i(x). In this way, x_(1) ≤ x_(2) ≤ ⋯ ≤ x_(n) is the order statistics of the observations x_1, x_2, ..., x_n. For formulating a location R-estimator, it is imperative that the test statistic [32]:

θ_R = (1/n) Σ_{i=1}^{n} α_N(R_i)    (1)

becomes as close to zero as possible. The coefficients α_N may be computed by:

α_N = N ∫_{(i−1)/N}^{i/N} f(x) dx    (2)

where f(x) is a function constrained to be odd symmetric, f(1 − x) = −f(x), and to satisfy ∫ f(x) dx = 0. These constraints enforce that the coefficients α_N satisfy Σ_{i=1}^{n} α_N(i) = 0. The L-estimators are estimators in the form of a linear combination of order statistics; they are explicitly defined and easily calculated. For instance, the median is the most prominent L-estimator, widely used in signal and image filtering tasks. L-estimators have the following definition:

θ_L = (1/n) Σ_{i=1}^{n} α_i x_(i)    (3)


where x_(i) stands for the observation data after they have been statistically ordered. The L-estimator efficiency is constrained by the coefficients α_i. A location L-estimator may be defined through the following expression:

α_i = ( ∫_{(i−1)/n}^{i/n} h(λ) dλ ) / ( ∫_{0}^{1} h(λ) dλ )    (4)

where h(λ) is a function that satisfies ∫_{0}^{1} h(λ) dλ = 0. The M-estimators are a generalization of the Maximum Likelihood Estimator (MLE); their main purpose is to obtain an estimate θ̂ of θ such that [15, 26, 32]:

θ̂_M = arg min_{θ_M ∈ Θ_M} Σ_{i=1}^{n} ρ(x_i − θ_M)    (5)

or analogously,

Σ_{i=1}^{n} ψ(x_i − θ_M) = 0    (6)

where expression (5) defines a location M-estimator in terms of a loss function ρ; however, the estimate θ is inherently present. In this respect, the location M-estimator is also defined in terms of the influence function ψ. Those functions are related through ψ(x, θ_M) = ∂ρ(x, θ_M)/∂θ_M [32], if and only if the ρ-function satisfies the constraints of symmetry and positive definiteness, has its minimum value at zero, and is partially differentiable. The Redescending M-estimators Ψ_r are a special case of M-estimators; they have the peculiarity of fading out outside the central region. This implies that central observations receive the highest weighting; as observations move away from the central region their importance decreases, and once the threshold r is reached they are set to zero. Within the context of this investigation, these estimators allow the atypical information to be down-weighted. The Redescending M-estimators Ψ_r [14] are defined by:

Ψ_r = { ψ ∈ Ψ : ψ(x) = 0, ∀ |x| ≥ r }    (7)

where r is a fixed constant that restricts the bounds of the influence function. Throughout time, several Redescending M-estimators have been introduced in the literature, among them: Huber Skipped Mean, Simple Cut, Tukey Biweight, Hampel's Three Part Redescending, Andrew's Sine, German-MacClure, Lorentzian, Asad-Qadir, Insha and Alamgir [2, 4, 5, 7, 18, 37]. Table 1 summarizes the loss and influence functions for each Redescending M-estimator already mentioned. As can be seen, most of them consider the threshold r; in this study the median absolute deviation from the median is suggested, i.e. the robust estimator of scale with value r = 1.4823 · median(|x − median(x)|), where x stands for each pixel and the median is taken over the whole image.

Table 1 Redescending M-estimators: loss function ρ(x) and influence function ψ(x) for the Huber Skipped Mean (HSM), Simple Cut (SC), Tukey Biweight (TB), Hampel's Three Part Redescending (HTPR), Andrew's Sine (AS), German-MacClure (GM), Lorentzian (L), Asad-Qadir (AQ), Insha (I) and Alamgir (A) estimators. Most definitions are piecewise in |x|, with cut-off threshold r.

The Hampel's Three Part Redescending requires additional thresholds; we therefore suggest β = 0.05 · r and δ = 0.95 · r, whilst German-MacClure and Lorentzian do not require any parameter. From Eqs. (1) and (3), it may be contended that the R-estimators and L-estimators are not differentiable functions. This issue can be particularly sensitive, since the update expressions of a clustering algorithm are computed by the objective function


derivation with respect to each variable; for this reason they cannot be considered as an alternative to enhance the algorithm. In contrast, the M-estimators are defined through differentiable functions, since the influence function is computed as the derivative of the loss function. Therefore, they are more suitable for rewriting a clustering algorithm such as the intuitionistic fuzzy one considered in this study. Other relevant aspects to take into account are that the M-estimators are not based on the order statistics, and that they do not demand weight coefficients, unlike the other estimators.
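As an illustration, and not taken from the chapter's Table 1, the following sketch implements the MAD-based threshold suggested above together with two redescending influence functions in their standard textbook form (Tukey biweight and Hampel's three-part); the exact parameterizations used in the chapter may differ slightly.

```python
import numpy as np

def mad_threshold(x):
    """Robust scale estimate r = 1.4823 * median(|x - median(x)|)."""
    x = np.asarray(x, dtype=float)
    return 1.4823 * np.median(np.abs(x - np.median(x)))

def psi_tukey_biweight(x, r):
    """Tukey biweight influence: x(1 - (x/r)^2)^2 for |x| <= r, 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= r, x * (1.0 - (x / r) ** 2) ** 2, 0.0)

def psi_hampel(x, beta, delta, r):
    """Hampel three-part redescending influence (standard textbook form)."""
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    out = np.zeros_like(x)                                   # |x| >= r -> 0
    out = np.where(ax < beta, x, out)                        # linear part
    out = np.where((ax >= beta) & (ax < delta),
                   beta * np.sign(x), out)                   # constant part
    taper = beta * (r - ax) / (r - delta) * np.sign(x)
    out = np.where((ax >= delta) & (ax < r), taper, out)     # descending part
    return out

# Example: residuals with one gross outlier; its influence is driven to zero.
residuals = np.array([-0.2, 0.1, 0.05, -0.15, 8.0])
r = mad_threshold(residuals)
print(psi_tukey_biweight(residuals, r))
print(psi_hampel(residuals, beta=0.05 * r, delta=0.95 * r, r=r))
```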

2.1 Enhanced Intuitionistic Fuzzy Clustering Algorithm Through Redescending M-Estimators

2.1.1 Intuitionistic Fuzzy Transformation

Let X ∈ R^{H×W×p} be a color image, with size H × W and p = 3 color channels; it may be transformed from matrix notation into a vector when 1 ≤ i ≤ n = H × W is accepted. This consideration allows writing X ∈ R^{n×p}, thus every x_ij stands for the i-th pixel at the j-th color component. In order to work in the intuitionistic fuzzy domain, a transformation process f : X → X^{IFS} must be performed, taking into account the membership, non-membership and hesitancy indexes [6, 10]. In this respect, the first index is computed by:

μ(x_ij) = (x_ij − min(x_j)) / (max(x_j) − min(x_j))    (8)

where μ(x_ij) is the membership degree, while min(x_j) and max(x_j) are functions used for computing the minimum and maximum value, respectively. An intuitionistic fuzzy number consists of a membership degree, a non-membership degree and a hesitancy degree, i.e. x_ij^{IFS} = ⟨x_ij, μ(x_ij), ν(x_ij), π(x_ij)⟩; in this respect, an intuitionistic fuzzy generator may be used to compute the non-membership degree ν(x_ij) as shown below:

ν(x_ij) = (1 − μ(x_ij)) / (1 + (e^λ − 1) · μ(x_ij)),   λ ∈ [0, 1]    (9)

where λ is a constant parameter. From Equations (8) and (9), the hesitancy degree π(x_ij) is computed as follows:

π(x_ij) = 1 − μ(x_ij) − ν(x_ij)    (10)

A formal definition of x_ij^{IFS} can be stated by means of the tuple:

x_ij^{IFS} = ⟨ x_ij,  μ(x_ij) = (x_ij − min(x_j)) / (max(x_j) − min(x_j)),  ν(x_ij) = (1 − μ(x_ij)) / (1 + (e^λ − 1) · μ(x_ij)),  π(x_ij) = 1 − μ(x_ij) − ν(x_ij) ⟩,  ∀ x_ij ∈ X, i = 1, ..., n; j = 1, ..., p    (11)

Indexes (8), (9) and (10) are subject to μ(x_ij) + ν(x_ij) + π(x_ij) = 1, while (10) is bounded as 0 ≤ π(x_ij) ≤ 1. In the image segmentation context it is required to compute the similarity between pixels, e.g. by using the Euclidean distance; however, this distance must be adapted to the current intuitionistic fuzzy domain. Given two intuitionistic fuzzy numbers x_1j^{IFS} = ⟨x_1j, μ(x_1j), ν(x_1j), π(x_1j)⟩ and x_2j^{IFS} = ⟨x_2j, μ(x_2j), ν(x_2j), π(x_2j)⟩, their Euclidean intuitionistic fuzzy distance may be computed as [35]:

‖x_1j^{IFS} − x_2j^{IFS}‖_2^2 = (μ(x_1j) − μ(x_2j))^2 + (ν(x_1j) − ν(x_2j))^2 + (π(x_1j) − π(x_2j))^2    (12)
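A compact sketch of the transformation (8)-(10) and the distance (12) is given below, assuming the image has already been reshaped into an n × p array and reading the generator of Eq. (9) with the factor (e^λ − 1); this is an illustrative reading of the formula, not the authors' reference implementation.

```python
import numpy as np

def to_ifs(X, lam=0.5):
    """Map an n x p array of pixel values to (mu, nu, pi) per Eqs. (8)-(10)."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)   # assumes no constant column
    mu = (X - x_min) / (x_max - x_min)                  # membership, Eq. (8)
    nu = (1.0 - mu) / (1.0 + (np.exp(lam) - 1.0) * mu)  # non-membership, Eq. (9)
    pi = 1.0 - mu - nu                                  # hesitancy, Eq. (10)
    return mu, nu, pi

def ifs_sq_distance(a, b):
    """Squared intuitionistic fuzzy distance, Eq. (12), summed over channels."""
    return float(sum(np.sum((u - v) ** 2) for u, v in zip(a, b)))

# Toy example: five RGB pixels (n = 5, p = 3).
X = np.array([[ 12,  40, 200],
              [ 15,  42, 190],
              [230, 220,  10],
              [225, 215,  20],
              [120, 128, 100]], dtype=float)

mu, nu, pi = to_ifs(X, lam=0.5)
d01 = ifs_sq_distance((mu[0], nu[0], pi[0]), (mu[1], nu[1], pi[1]))
print(d01)
```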

2.1.2

Redescending M-Intuitionistic Fuzzy C-Means

Clustering algorithms such as K-Means, Fuzzy C-Means and their derivations conventionally perform a mean estimation during the prototypes vector update. It is well known that this mathematical operation is not robust to atypical information or noise. In [32], an analysis of the influence of atypical data on the mean and median estimators was presented. It was noted that the mean estimate has a breakdown point of ε = 0; this means that even one outlier can destroy its results. In contrast, the median estimate has a breakdown point of ε = 0.5; that is to say, it is reliable only if less than 50% of the observations are outliers. Proposals such as [20, 23] were the pioneers in replacing the mean by the median estimator; they highlighted that the median estimation needs a statistical ordering. Computationally, that replacement implied an increased computational cost, since the statistical ordering is needed throughout the iterative process. Those proposals are still considered in the state of the art, e.g. [8, 24, 25, 31, 34, 40]. To address that shortcoming, Robust Estimation may be considered; it has three major variants, the R, L and M estimators, but in this study just the last ones are analyzed. Specifically, the manuscript is focused on the Intuitionistic Fuzzy C-Means, a variant of the Fuzzy C-Means algorithm based on Intuitionistic Fuzzy Sets. In this study, some Redescending M-Estimators are analyzed as an alternative to strengthen the performance of an Intuitionistic Fuzzy C-Means clustering algorithm.


Let X^{IFS} = {x_ij^{IFS} | i = 1, ..., n; j = 1, ..., p} be a set of intuitionistic fuzzy numbers; its partitioning into c clusters is valid if and only if the following constraints hold:

0 ≤ u_ik ≤ 1,  i = 1, 2, ..., n;  k = 1, 2, ..., c    (13)

Σ_{k=1}^{c} u_ik = 1,  i = 1, 2, ..., n    (14)

0 4. On the other hand, in HetRec (Table 16) the new proposal has a tendency to obtain the best Precision result for higher values of n. In contrast, for sizes 1 and 2 of the

Table 14 Evaluation of the proposal (hyb) and comparison with previous works (avg, min); n ∈ {1, 2, 3, 4, 5, 10, 15, 20} and group size ∈ {3, 4, 5}. Binary profile. Metric: Precision. Each row lists the values for n = 1, 2, 3, 4, 5, 10, 15, 20.

Size 3, Avg: 0.5167 0.5313 0.5359 0.5372 0.5373 0.5374 0.5371 0.5373
Size 3, Min: 0.5083 0.5194 0.5266 0.53 0.5326 0.5346 0.536 0.5371
Size 3, Hyb: 0.5283 0.5480 0.5492 0.5501 0.5492 0.5486 0.5472 0.5461
Size 4, Avg: 0.524 0.532 0.536 0.5365 0.5363 0.5366 0.5366 0.5368
Size 4, Min: 0.5198 0.521 0.5261 0.53 0.5323 0.5345 0.5358 0.5369
Size 4, Hyb: 0.5222 0.5429 0.5471 0.5480 0.5477 0.5473 0.5462 0.5453
Size 5, Avg: 0.5336 0.5365 0.5378 0.538 0.5374 0.5372 0.5372 0.5372
Size 5, Min: 0.5229 0.5278 0.5301 0.5326 0.5344 0.5357 0.5367 0.5376
Size 5, Hyb: 0.5501 0.5512 0.5515 0.5504 0.5494 0.5475 0.5463 0.5455


Table 15 Evaluation of the proposal (hyb) and comparison with previous works (avg, min); n ∈ {2, 3, 4, 5, 10, 15, 20} and group size ∈ {3, 4, 5}. Binary profile. Metric: NDCG. Each row lists the values for n = 2, 3, 4, 5, 10, 15, 20.

Size 3, Avg: 0.9908 0.9769 0.967 0.9595 0.953 0.947 0.9422
Size 3, Min: 0.9911 0.9764 0.966 0.9582 0.9517 0.9457 0.941
Size 3, Hyb: 0.9909 0.9773 0.9674 0.9600 0.9536 0.9476 0.9428
Size 4, Avg: 0.9859 0.9735 0.9643 0.9575 0.9509 0.9453 0.9408
Size 4, Min: 0.9855 0.9725 0.9632 0.9561 0.9495 0.944 0.9397
Size 4, Hyb: 0.9852 0.9733 0.9645 0.9577 0.9513 0.9458 0.9414
Size 5, Avg: 0.9822 0.9705 0.962 0.9556 0.9489 0.9437 0.9396
Size 5, Min: 0.9813 0.9694 0.9607 0.9542 0.9477 0.9425 0.9385
Size 5, Hyb: 0.9822 0.9705 0.9623 0.9561 0.9496 0.9444 0.9403

Table 16 Evaluation of the proposal (hyb) and comparison with previous works (avg, min); n ∈ {1, 2, 3, 4, 5, 10, 15, 20} and group size ∈ {3, 4, 5}. Multi-valued profile. Metric: Precision. Each row lists the values for n = 1, 2, 3, 4, 5, 10, 15, 20.

Size 3, Avg: 0.4583 0.4819 0.4832 0.4828 0.482 0.4811 0.48 0.4793
Size 3, Min: 0.5267 0.5126 0.5103 0.5083 0.5064 0.504 0.5015 0.4995
Size 3, Hyb: 0.51 0.5118 0.5103 0.5094 0.5080 0.5060 0.5031 0.5005
Size 4, Avg: 0.471 0.4787 0.4804 0.4808 0.4804 0.4794 0.4788 0.4782
Size 4, Min: 0.4996 0.5078 0.5071 0.506 0.5044 0.5021 0.5 0.4982
Size 4, Hyb: 0.4993 0.5064 0.5074 0.5073 0.5059 0.5039 0.5014 0.4990
Size 5, Avg: 0.483 0.482 0.482 0.4817 0.481 0.4799 0.4793 0.4788
Size 5, Min: 0.5074 0.5099 0.5084 0.5068 0.5049 0.5023 0.5003 0.4983
Size 5, Hyb: 0.5149 0.5108 0.5097 0.5086 0.5068 0.5040 0.5015 0.4991

recommendation list, the minimal aggregation approach provides better results than the hybrid method. This performance can be associated with the nature of the item features in this dataset, whose values show a high level of imbalance. Therefore, it is necessary to carry out a more in-depth study in order to obtain features that can better identify the items in this dataset. Finally, in the case of the NDCG metric (Table 17), the min approach leads to the better performance, in contrast to the previous scenarios considered in this experimental evaluation. In order to analyze the statistical significance of the results presented in this section, Table 18 presents the results of the Wilcoxon signed-rank test [19], used to compare the new hybrid approach (hyb) against the avg and min approaches, considering the numerical results obtained across Tables 14, 15, 16 and 17. Taking as basis the null hypothesis with p < 0.01, Table 18 clearly evidences that this hypothesis is rejected for all the scenarios in the case of the binary profile, and for the Precision metric in the multi-valued profile. Moreover, the distribution of the W+ and W− values in these cases indicates the superiority of the proposed hybrid approach.


Table 17 Evaluation of the proposal (hyb) and comparison with previous works (avg, min); n ∈ {2, 3, 4, 5, 10, 15, 20} and group size ∈ {3, 4, 5}. Multi-valued profile. Metric: NDCG. Each row lists the values for n = 2, 3, 4, 5, 10, 15, 20.

Size 3, Avg: 0.9918 0.9804 0.9724 0.9663 0.9611 0.956 0.9518
Size 3, Min: 0.9923 0.9812 0.9732 0.9674 0.9623 0.9575 0.9535
Size 3, Hyb: 0.9916 0.9799 0.9718 0.9659 0.9609 0.9561 0.9523
Size 4, Avg: 0.9879 0.9776 0.9702 0.9646 0.9593 0.9544 0.9506
Size 4, Min: 0.9876 0.9778 0.9707 0.9655 0.9604 0.9559 0.9523
Size 4, Hyb: 0.9871 0.9766 0.9694 0.9641 0.9591 0.9547 0.9511
Size 5, Avg: 0.9852 0.9755 0.9684 0.9633 0.9577 0.9531 0.9496
Size 5, Min: 0.9851 0.9757 0.9692 0.9642 0.959 0.9547 0.9513
Size 5, Hyb: 0.9845 0.9746 0.9678 0.9629 0.9577 0.9535 0.9502

Table 18 Statistical analysis of the obtained results

Profile               Metric     Comparison      W−     W+     p value      Result over null hypothesis (p < 0.01)
Binary profile        Precision  Avg versus hyb  1      299    2.38E-07     R
Binary profile        Precision  Min versus hyb  0      300    1.19E-07     R
Binary profile        NDCG       Avg versus hyb  20.5   169.5  0.00283691   R
Binary profile        NDCG       Min versus hyb  3      228    9.65732E-05  R
Multi-valued profile  Precision  Avg versus hyb  0      300    1.19E-07     R
Multi-valued profile  Precision  Min versus hyb  40.5   235.5  0.00316991   R
Multi-valued profile  NDCG       Avg versus hyb  163    47     0.0316294    A
Multi-valued profile  NDCG       Min versus hyb  231    0      6.20175E-05  R

In a different direction, in the case of the NDCG criterion for the multi-valued profile, the comparison does not report a significant difference (p < 0.01) in the case of the avg vs. hyb comparison. However, in the case of min vs. hyb, there is a significant difference between both approaches, with the min approach leading to the best performance values.
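This kind of comparison can be reproduced with SciPy's implementation of the test; the sketch below uses the size-3 Precision rows of Table 14 as the paired samples, only to illustrate the call.

```python
from scipy.stats import wilcoxon

# Paired Precision scores (size 3, binary profile) of two approaches over the
# same list sizes n = 1, 2, 3, 4, 5, 10, 15, 20 (values from Table 14).
hyb = [0.5283, 0.5480, 0.5492, 0.5501, 0.5492, 0.5486, 0.5472, 0.5461]
avg = [0.5167, 0.5313, 0.5359, 0.5372, 0.5373, 0.5374, 0.5371, 0.5373]

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(hyb, avg)
print(f"W = {stat}, p = {p_value:.5f}")
if p_value < 0.01:
    print("Null hypothesis rejected at the 0.01 level.")
```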


5 Conclusions

The present work focuses on the proposal of a new content-based group recommendation method, which has some novel characteristics: the weighting of the features of the items, the incorporation of a consolidated profile of the group as an additional user, and the dynamic selection of the best aggregation function depending on the size of the group. An experimental study shows that the proposal is able to outperform previous works of the state of the art for most sizes of the list of recommendations in the case of items characterized by binary features, and for large sizes of the list of recommendations in the case of multi-valued features. Furthermore, an ablation study proves the level of importance of each component of the proposal. As future work, we suggest the following research directions: (1) the development of new content-based group recommendation methods supported by matrix factorization, (2) the proposal of data preprocessing methods for this specific scenario [7], and (3) the evaluation of the current proposal in specific domains of e-health [21, 22] or tourism [4].

References
1. Adomavicius, G., Tuzhilin, A.T.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734–749 (2005)
2. Aizawa, A.: An information-theoretic perspective of tf-idf measures. Inf. Process. Manag. 39(1), 45–65 (2003)
3. Almaguer, Y.P., Dueñas, N.M., Cruz, E.C., Yera, R.: Una revisión de los sistemas recomendadores grupales como herramienta innovadora en el área del turismo. Revista de Ciencia y Tecnología 35, 44–53 (2021)
4. Carballo-Cruz, E., Yera, R., Carballo-Ramos, E., Betancourt, M.E.: An intelligent system for sequencing product innovation activities in hotels. IEEE Latin Am. Trans. 17(2), 305–315 (2019)
5. Castro, J., Lu, J., Zhang, G., Dong, Y., Martínez, L.: Opinion dynamics-based group recommender systems. IEEE Trans. Syst. Man Cybern.: Syst. 48, 2394–2406 (2018). https://doi.org/10.1109/TSMC.2017.2695158. http://ieeexplore.ieee.org/document/7919222/
6. Castro, J., Rodríguez, R.M., Barranco, M.J.: Weighting of features in content-based filtering with entropy and dependence measures. Int. J. Comput. Intell. Syst. 7(1), 80–89 (2014)
7. Castro, J., Yera, R., Martínez, L.: An empirical study of natural noise management in group recommendation systems. Decis. Support Syst. 94, 1–11 (2017)
8. Cataltepe, Z., Uluyağmur, M., Tayfur, E.: Feature selection for movie recommendation. Turkish J. Electr. Eng. Comput. Sci. 24(3), 833–848 (2016)
9. Dara, S., Chowdary, C., Kumar, C.: A survey on group recommender systems. J. Intell. Inf. Syst. 54, 271–295 (2020)
10. De Pessemier, T., Dhondt, J., Vanhecke, K., Martens, L.: Travelwithfriends: a hybrid group recommender system for travel destinations. In: Workshop on Tourism Recommender Systems (TouRS15), in Conjunction with the 9th ACM Conference on Recommender Systems (RecSys 2015), pp. 51–60 (2015)
11. De Pessemier, T., Dooms, S., Martens, L.: Comparison of group recommendation algorithms. Multimed. Tools Appl. 72(3), 2497–2541 (2014)


12. Domingues, M.A., Jorge, A.M., Soares, C.: Dimensions as virtual items: improving the predictive ability of top-n recommender systems. Inf. Process. Manag. 49(3), 698–720 (2013)
13. Felfernig, A., Boratto, L., Stettinger, M., Tkalčič, M.: Group Recommender Systems: An Introduction. Springer (2018)
14. Kagita, V.R., Pujari, A.K., Padmanabhan, V.: Virtual user approach for group recommender systems using precedence relations. Inf. Sci. 294, 15–30 (2015)
15. Kaššák, O., Kompan, M., Bieliková, M.: Personalized hybrid recommendation for group of users: top-n multimedia recommender. Inf. Process. Manag. 52(3), 459–477 (2016)
16. Khoshkangini, R., Pini, M.S., Rossi, F.: A self-adaptive context-aware group recommender system. In: Conference of the Italian Association for Artificial Intelligence, pp. 250–265. Springer (2016)
17. Pera, M., Ng, Y.: A group recommender for movies based on content similarity and popularity. Inf. Process. Manag. 49, 673–687 (2013)
18. Pérez-Almaguer, Y., Yera, R., Alzahrani, A.A., Martínez, L.: Content-based group recommender systems: a general taxonomy and further improvements. Expert Syst. Appl. 184, 115444 (2021)
19. Rey, D., Neuhäuser, M.: Wilcoxon-signed-rank test. In: International Encyclopedia of Statistical Science, pp. 1658–1659. Springer (2011)
20. Ricci, F., Rokach, L., Shapira, B.: Recommender systems handbook. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, chap. 1, pp. 1–35. Springer (2011)
21. Yera, R., Alzahrani, A.A., Martínez, L.: A food recommender system considering nutritional information and user preferences. IEEE Access 7, 96695–96711 (2019)
22. Yera, R., Alzahrani, A.A., Martínez, L.: Exploring post-hoc agnostic models for explainable cooking recipe recommendations. Knowl.-Based Syst. 251, 109216 (2022)
23. Yera, R., Martínez, L.: Fuzzy tools in recommender systems: a survey. Int. J. Comput. Intell. Syst. 10(1), 776–803 (2017)
24. Yera, R., Martínez, L.: A recommendation approach for programming online judges supported by data preprocessing techniques. Appl. Intell. 47(2), 277–290 (2017)
25. Yera Toledo, R., Caballero Mota, Y., Martínez, L.: Correcting noisy ratings in collaborative recommender systems. Knowl.-Based Syst. 76, 96–108 (2015)

Performance Evaluation of AquaFeL-PSO Informative Path Planner Under Different Contamination Profiles

Micaela Jara Ten Kathen, Federico Peralta, Princy Johnson, Isabel Jurado Flores, and Daniel Gutiérrez Reina

Abstract The use of Autonomous Surface Vehicles allows streamlining the task of monitoring the water quality parameters of water resources, reducing costs and time spent. In addition, the vehicles are capable of taking water measurements in hard-to-reach areas. This chapter evaluates the performance of the AquaFeL-PSO monitoring system under different contamination profiles. The AquaFeL-PSO is an algorithm based on Particle Swarm Optimization, Gaussian processes and the Federated Learning technique. The algorithm operates in two phases: the exploration phase, which is responsible for covering the largest possible area of the water resource in order to detect contaminated areas, and the exploitation phase, which is responsible for accurately characterizing the water quality parameters in the contaminated areas. The system is evaluated using different benchmark functions in order to observe its performance under different contamination profiles. The results show that the AquaFeL-PSO was able to generate models of the water quality parameters in eight of the ten contamination profiles. In addition, it had the best performance in detecting pollution peaks and in generating the model of the water quality parameters of the entire water resource.

M. J. T. Kathen (B) · F. Peralta · I. Jurado Flores: Universidad Loyola, Av. de las Universidades, s/n, 41704 Cordoba, Spain
P. Johnson: Liverpool John Moores University, Byrom St, Liverpool L3 3AF, England
D. Gutiérrez Reina: University of Seville, C. Américo Vespucio, 41092 Seville, Spain

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. G. Rivera et al. (eds.), Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications, Studies in Big Data 132, https://doi.org/10.1007/978-3-031-38325-0_17


Keywords Informative path planning · Particle swarm optimization · Federated learning · Autonomous surface vehicles · Machine learning · Water monitoring · Multi-modal problems

1 Introduction
Polluted waters can be described as bodies of water that are imbalanced in terms of physico-chemical properties, such as pH level, Dissolved Oxygen (DO), turbidity, etc. [19]. Ecosystems with such conditions can be toxic and even deadly for life. A clear example is the current state of the Ypacarai Lake (Paraguay) [7], where toxic blue-green algae blooms appear during the hot summer months. The eutrophized [3] waters are hazardous for fish as well as humans, preventing wildlife from flourishing and hindering aquaculture and tourism. Although there are works that focus on recovering the waters via direct treatment [16], the pollution models or contamination profiles are so far left unaddressed, which may lead to inefficient and expensive water treatment. Contamination profiles are models that show a contamination level for each location or coordinate of a water body [26]. The usefulness of these models is evident, since contamination peaks need more attention and treatment. Therefore, obtaining reliable contamination models and pinpointing contamination peaks are tasks that will make treating or maintaining healthy water bodies more efficient. This is the main objective of the present work, and to provide good results the use of Autonomous Surface Vehicles (ASVs) equipped with Water Quality Parameter (WQP) sensor systems is proposed. ASVs have been used for WQP measuring since they provide significant advantages when compared to fixed stations and manual measurement campaigns [2, 24]. A coordinator system can efficiently distribute multiple ASVs to cover a water region and measure WQPs. However, since the water models can vary weekly [18], monitoring missions might have to start with little reliable initial information. In that sense, the system will benefit from including data gathered during the mission to decide on the measurement locations, i.e., an online system. Moreover, online missions will be useful for obtaining both good overall contamination models and peak contamination locations. The proposed system focuses on both of these objectives and is therefore divided into two steps: a first stage that considers shared information among all the ASVs, and a second stage that federates the learning into promising zones that could contain contamination peaks. The proposed system is called AquaFeL-PSO [8]. In the first phase, the AquaFeL-PSO generates a baseline WQP model, which is used to determine the areas where contamination levels are high. In the second phase, the proposed system focuses on characterizing the WQP in contaminated areas and detecting pollution peaks with higher resolution. Noting that the contamination behavior is unknown a priori, this work extends previous approaches to evaluate the performance of the AquaFeL-PSO considering different benchmark functions, disregarding expected WQP behaviors so that the evaluation of the system is broader.


Consequently, the main contribution of the present chapter is the evaluation of the AquaFeL-PSO monitoring system in the detection of contamination peaks and in the generation of the model of the WQP of a water resource under different contamination profiles. The chapter is organized as follows: Sect. 2 presents related work on path planners for ASVs. The monitoring problem and the assumptions of the monitoring system are described in Sect. 3. The path planners are explained in Sect. 4. Section 5 lists the different contamination profiles and comparison metrics, and presents the experiments and comparisons between path planners. Section 6 provides a summary of the results. Conclusions and future research directions are presented in Sect. 7.

2 Related Work
Patrolling and monitoring scenarios using autonomous vehicle technology have been studied by several researchers in recent years [3, 19, 24, 25, 31]. Monitoring and patrolling tasks of water resources are performed with ASVs equipped with sensors capable of measuring WQP, such as pH, temperature and DO sensors, and with sensors that allow the guidance and control system to know where the vehicle is located and its environment, such as a Global Positioning System (GPS), an Inertial Measurement Unit (IMU) and a Light Detection and Ranging (LiDAR) unit, among others. Researchers propose machine learning techniques as a solution for generating ASV trajectories that fulfill the assigned task, including Genetic Algorithms (GA) [3], Bayesian Optimization (BO) [20], Swarm Intelligence [26] and Deep Reinforcement Learning (DRL) [30]. The trajectory generation in [2] is based on a GA, an algorithm inspired by Charles Darwin's theory of natural selection. In addition, the authors use the Traveling Salesman Problem (TSP) as the monitoring problem. The objective of the monitoring system is to optimize the ASV trajectory and to cover the largest possible area of the water resource. The researchers use the Ypacarai lake as an example scenario, and beacons have been set up for the ASV to pass through. This work is extended in [1], where the monitoring problem changes to the Chinese Postman Problem (CPP). By modeling the monitoring problem as the CPP, the ASV can visit each beacon more than once. This improvement is observed in the results, since the new path planner maximizes the area covered by the ASV in the monitoring task. A two-phase GA-based monitoring system is proposed in [3]. In the first phase of the system, the monitoring task aims to explore the water resource to locate contaminated areas. In the second phase, the vehicles exploit the contaminated areas identified during the exploration stage. In [29], the main objective is the patrolling of several WQP simultaneously. To solve the established problem, the authors propose an algorithm based on a multi-objective GA, more specifically the NSGA-II algorithm. This algorithm obtained good results in solving single-objective and multi-objective problems.


In [30], the patrolling of the Ypacarai lake is performed with a DRL-based system. In the system proposed by the authors, RGB images are used to represent the possible ASV positions and states, and the problem is modeled as a Markov Decision Process (MDP). The authors trained a Deep Q-learning method based on convolutional neural networks using a customized reward function. The work was extended in [31]. The improved system uses a centralized strategy with multiple agents or ASVs. The advantage of the centralized approach is that the system proposed by the authors can train the entire fleet of ASVs with only one neural network. The results obtained demonstrated the superiority of the centralized approach over distributed Q-learning in the patrolling task. Another machine learning technique that is used as a basis is BO. In [20], the authors developed an Informative Path Planner (IPP) based on BO and Gaussian Processes (GP) that aims to generate models of the WQP of the Ypacarai lake. To optimize the coordinates where the measurements should be taken, the authors propose different tailored acquisition functions. The work was extended to an improved monitoring system capable of solving multi-objective problems in [19]. The monitoring system proposed by the authors calculates the best positions to take measurements considering several WQP simultaneously. This problem is solved with acquisition functions, based on various multi-objective techniques, that fuse data from water quality sensors. This approach is extended in [21], where the problem is extended to multiple agents. A multi-objective approach is addressed by the same authors in [17], where paths are generated and optimized using Genetic Algorithms, considering a GP as the basis for the fitness function. The use of several ASVs allows speeding up the monitoring task by generating models of the WQP in less time. Water resource monitoring problems with several agents or vehicles can also be solved by applying swarm intelligence algorithms. In [26], the authors developed a PSO and GP-based monitoring system, the Enhanced GP-based PSO. The system used measurements obtained from sensors in order to generate an accurate model of the WQP of the water resource and to detect peaks of water pollution. The difference between the actual models and the estimated models was very small. The Enhanced GP-based PSO was evaluated and compared with other PSO-based algorithms in [27, 28]. In [27], the focus of the evaluation was exploration, i.e., covering as much area as possible in order to minimize the error between the actual model and the estimated model of the entire water resource. The Enhanced GP-based PSO adjusted with BO obtained the best response among the compared algorithms. In contrast, in [28], the focus was on exploiting areas with high pollution levels with the objective of detecting water resource pollution peaks. Among the compared algorithms, the one that obtained the best response was the Contamination algorithm, a variant of the Enhanced GP-based PSO. These evaluations were performed in uni-modal scenarios. Therefore, in [9], the work was extended by evaluating the algorithms in multi-modal scenarios. The results showed that the Enhanced GP-based PSO based on the Epsilon Greedy method obtained the best responses in detecting pollution peaks and obtaining an accurate model of the WQP of the water resource. Based on these results, the authors designed the AquaFeL-PSO monitoring system [8].
The AquaFeL-PSO is an enhancement of the Enhanced GP-based PSO that


has two phases: the exploration phase and the exploitation phase. In the exploration phase, the ASVs traverse the entire surface of the water resource. At the end of the first phase, the system generates a first model of the WQP. This first model is used to divide the water resource into zones where the level of contamination is high, the action zones. Then, the ASVs are assigned to these zones, creating sub-populations of ASVs in order to exploit them. The final model of the WQP is obtained by merging the models of the action zones with the first model generated. The Federated Learning (FL) technique is used in the exploitation phase. FL allows each sub-population to act as a node or local server and adjust its own GP to generate a model of its action zone. The AquaFeL-PSO obtained the best results in the evaluation of ground truths based on the Shekel function; the difference between the real models and the estimated models was minimal. At this point, it is evident that the AquaFeL-PSO is one of the first methods evaluated with several benchmark functions instead of a single base benchmark function. This performance evaluation represents a step towards the generalization of the proposed method to a broader set of applications, such as different environments or measured data.

3 Statement of the Problem
This section presents the monitoring problem to be solved, as well as the assumptions considered for the monitoring system.

3.1 Monitoring Problem
The water resource used as a scenario for the simulations is Lake Ypacarai (Fig. 1), located in Paraguay. This lake has two large inflows to the southeast and northwest, the Pirayú and Yukyry streams, respectively. In addition, it has small inflows distributed to the east and west of the lake. The only natural outlet of the lake, the Salado River, is located to the north. The contamination of Lake Ypacarai has long been a case study for organizations and researchers in the country. The lake is polluted by sewage from surrounding houses and industries, and by fertilizer residues from surrounding crops [13]. Because of this, periodic monitoring is carried out by different organizations and entities in order to assess the state of the water quality of Lake Ypacarai [5, 6]. This monitoring can be carried out with ASVs, reducing the time required and reaching remote locations inside the lake to take water measurements. The monitoring system consists of a fleet of ASVs composed of P vehicles. These vehicles have on-board sensors S to take water measurements. These measurements are used to generate an estimated model of the water resource ŷ(x) and to calculate the speed v and position x of the ASVs. The measurements taken by the sensors are stored in a vector s = {s_k | k = 1, 2, . . . , n}, where k refers to the number of measurements taken, and the coordinates where the measurements were taken


Fig. 1 Inflows and outflows of the Ypacarai lake

are stored in the vector q = {q_k | k = 1, 2, . . . , n}. Measurements of the WQP are taken from a function f(x) that simulates the actual model of the WQP of the water resource. The actual model of the WQP is called the ground truth. That said, each measurement taken is represented as follows:

s_k = f(q_k)   (1)

The estimated model of the WQP of the water resource ŷ(x) can be obtained by applying the regression model of Eq. 2 and taking a sufficient number of measurements k:

ŷ(x) ≈ f(x)   (2)
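To make the measurement model of Eqs. (1)-(2) concrete, the following minimal sketch samples a hypothetical WQP field at a set of coordinates; the Gaussian-shaped ground truth and all names in it are illustrative placeholders, not one of the chapter's contamination profiles.

# Minimal sketch of Eqs. (1)-(2): samples s_k are drawn from an unknown ground-truth
# field f at the visited coordinates q_k. The field below is a stand-in for a WQP map.
import numpy as np

def ground_truth(x):
    """Hypothetical WQP field f(x) over normalized 2-D coordinates."""
    x = np.atleast_2d(x)
    return np.exp(-((x[:, 0] - 0.3) ** 2 + (x[:, 1] - 0.7) ** 2) / 0.05)

rng = np.random.default_rng(0)
q = rng.uniform(0.0, 1.0, size=(50, 2))   # coordinates where measurements are taken
s = ground_truth(q)                        # s_k = f(q_k), Eq. (1)
# A surrogate fitted on (q, s) should satisfy y_hat(x) ≈ f(x) (Eq. 2) once enough
# measurements k have been collected.
print(q.shape, s.shape)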

3.2 Assumptions
In order to implement the monitoring system, the following assumptions have been considered:
• Search space: The search space is a matrix M with dimensions m × n. Each element M_{i,j} has a state assigned to it: (i) the value 0 represents the zones where the ASVs cannot travel, either because it is a forbidden zone, an obstacle, or land, among others, and (ii) the value 1 represents the available zones, that is, the zones where the ASVs can travel (a minimal sketch of such a navigation mask is given after this assumption list).


Fig. 2 Communication system based on centralized learning technology

• ASV: The movements performed by the ASVs are considered ideal; there is no error in the trajectory. Collisions and obstacles are not considered. The autonomy of the vehicles allows them to complete the monitoring task: each ASV travels 20 km to complete the monitoring of the water resource.
• Sensors: The data provided by the sensors are free of error, including the GPS data and the data taken by the WQP sensors.
• Communication: The monitoring system has a global coordinator in the cloud. The ASVs communicate with the global coordinator via 4G or 5G technology. Communication between the WQP sensors and the ASV Guidance, Navigation and Control (GNC) system is via Universal Serial Bus (USB). Figure 2 shows the communication of the monitoring system based on the centralized learning technique.
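The sketch announced in the search-space assumption above: a binary matrix M whose cells mark navigable (1) and forbidden (0) positions. The toy map and the helper function are assumptions made for illustration; the actual Ypacarai navigation mask is not reproduced here.

# Sketch of the search-space assumption: a binary matrix M of size m x n where
# 1 marks navigable water and 0 marks land/forbidden cells (toy map, not the real lake).
import numpy as np

m, n = 10, 12
M = np.ones((m, n), dtype=int)
M[0, :] = 0          # e.g., shoreline along the first row
M[:, 0] = 0          # e.g., shoreline along the first column
M[4:6, 5:8] = 0      # e.g., a small forbidden region

def is_navigable(cell):
    """Return True if an ASV may occupy the cell (i, j)."""
    i, j = cell
    return 0 <= i < m and 0 <= j < n and M[i, j] == 1

print(is_navigable((2, 3)), is_navigable((4, 6)))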

4 PSO-Based Path Planning Algorithms
This section explains the path planners that are used as the basis for the AquaFeL-PSO. In addition, the operation of the AquaFeL-PSO monitoring system is explained.

4.1 Classic Particle Swarm Optimization (PSO)
PSO is a heuristic optimization algorithm developed in [10], inspired by the social behavior of a flock of birds. This optimization algorithm is composed of particles p that represent possible solutions to an optimization problem; the term swarm is used to refer to a set of particles. To obtain the best solution, the particles use their own experience and the experience of the other particles in the swarm.


Therefore, the particles are connected, constantly sharing information as they move in a multidimensional space. The calculation of particle motion is based on: (i) a control parameter, (ii) a self-cognitive component, and (iii) a social component. The self-cognitive component pbest is based on the particle's own experience and represents the best solution found by the particle up to the current time. In contrast, the social component gbest is based on the experience of the swarm and is the best solution found by all particles in the swarm up to the current time. The particles p have a velocity v_p and a position x_p. To calculate these parameters, the following equations are applied:

v_p^{t+1} = w v_p^t + c_1 r_1^t (pbest_p^t − x_p^t) + c_2 r_2^t (gbest^t − x_p^t)   (3a)
x_p^{t+1} = x_p^t + v_p^{t+1}   (3b)

The terms v_p^{t+1} and x_p^{t+1} are the values of the next velocity and next position of the particle p, respectively. The control parameter is composed of the inertia weight w and the velocity v_p^t of the particle p at time t. The c_1 and c_2 terms are acceleration coefficients that determine the importance of the self-cognitive and social components. In other words, they determine the weights of the exploitation approach, which is achieved with the self-cognitive component or local best, and of the exploration approach, which is achieved with the social component or so-called global best. The terms r_1 and r_2 are random values that vary in the range [0, 1]. Advantageously, PSO is a versatile algorithm: although it was initially designed for unconstrained continuous optimization, it has been extended to address constrained discrete optimization (e.g., [23]). Therefore, a wide range of optimization problems can be addressed using PSO.
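The following minimal sketch implements the update rules of Eqs. (3a)-(3b) for a small swarm maximizing a toy objective; the parameter values (w, c1, c2) and the objective are illustrative choices, not the chapter's tuned configuration.

# Minimal sketch of the classic PSO update of Eqs. (3a)-(3b).
import numpy as np

rng = np.random.default_rng(1)
w, c1, c2 = 0.7, 1.5, 1.5          # illustrative coefficients
n_particles, dim = 4, 2

x = rng.uniform(0, 1, (n_particles, dim))      # positions
v = np.zeros((n_particles, dim))               # velocities
pbest = x.copy()                               # best position found by each particle

def fitness(points):
    """Toy objective to maximize (stand-in for a contamination field)."""
    return -np.sum((points - 0.5) ** 2, axis=1)

pbest_val = fitness(pbest)
gbest = pbest[np.argmax(pbest_val)]            # best position found by the swarm

for _ in range(100):
    r1 = rng.uniform(size=(n_particles, dim))
    r2 = rng.uniform(size=(n_particles, dim))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (3a)
    x = x + v                                                    # Eq. (3b)
    val = fitness(x)
    better = val > pbest_val
    pbest[better], pbest_val[better] = x[better], val[better]
    gbest = pbest[np.argmax(pbest_val)]

print("gbest ≈", gbest)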

4.2 Enhanced GP-Based PSO
In this subsection, the surrogate model used, the GP, is explained first, since the GP responses are used in the equations of motion of the Enhanced GP-based PSO. Then, the operation of the path planner itself is explained. Finally, the variants of the algorithm used in the AquaFeL-PSO comparison are explained.

4.2.1

Gaussian Process Regression (GPR)

Gaussian Process (GP) modelling is a machine learning technique based on Bayesian inference [22]. The behavior of the GP is mainly determined by two components: the covariance function or kernel function, and the mean function [22]. Normally, the mean function is set to 0, so the GP would depend only on the kernel function. The purpose of the kernel function is to determine the shape, variability and smoothness of the model of the WQP of the water resource. The Enhanced GP-based PSO


monitoring system uses the kernel function called Radial Basis Function (RBF), since, according to [19], this kernel function has the best behavior for water resources. The GPR is updated using the input data, which form a set D. These input data are treated as random variables and are then conditioned and marginalized to fit the kernel function and the measured values. From the equations shown in Eq. 4, the mean value μ(x∗) and standard deviation σ(x∗) of the GPR f̂(x)∗ are obtained:

μ(f̂(x_i)∗ | D) = K_∗^T K^{−1} f(x)   (4a)
σ(f̂(x_i)∗ | D) = K_∗∗ − K_∗^T K^{−1} K_∗   (4b)

From the fitted kernel, the data of K, K_∗∗ and K_∗ are obtained. These data comprise covariances between known data k(x, x), unknown data k(x∗, x∗), and covariances between the known and unknown data k(x, x∗):

K = [ K  K_∗ ; K_∗^T  K_∗∗ ] = [ k(x, x)  k(x, x∗) ; k(x∗, x)  k(x∗, x∗) ]   (5)

The obtained data D along with the kernel are used to update the hyperparameters of the kernel itself, which, in this case, correspond to the length scale ℓ of the RBF kernel. We denote that a GP "converges" when a value for ℓ larger than a certain threshold can be obtained.
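Since the authors report implementing the monitoring systems with Scikit-learn, the following sketch shows how such a surrogate can be fitted: a GaussianProcessRegressor with an RBF kernel trained on scattered measurements, then queried for the mean and standard deviation of Eqs. (4a)-(4b). The ground-truth field, grid and exact model configuration are assumptions made for illustration; only the length-scale bounds mirror the range [0.1, 1e5] reported later in the chapter.

# Sketch of the GP surrogate: RBF kernel fitted to measurements (q, s) with scikit-learn.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)

def ground_truth(points):
    """Hypothetical contamination field used only for this example."""
    return np.exp(-np.sum((points - np.array([30.0, 70.0])) ** 2, axis=1) / 200.0)

q = rng.uniform(0, 100, size=(40, 2))            # measurement coordinates
s = ground_truth(q)                              # measured WQP values

# RBF kernel whose length scale is fitted from the data; bounds follow the
# [0.1, 1e5] range given in the parameter settings of this chapter.
kernel = RBF(length_scale=10.0, length_scale_bounds=(0.1, 1e5))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(q, s)

grid = np.stack(np.meshgrid(np.linspace(0, 100, 50),
                            np.linspace(0, 100, 50)), axis=-1).reshape(-1, 2)
mu, sigma = gp.predict(grid, return_std=True)    # mean and standard deviation maps

print("fitted length scale:", gp.kernel_.length_scale)
print("location of max mean:", grid[np.argmax(mu)])
print("location of max uncertainty:", grid[np.argmax(sigma)])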

4.2.2

Path Planner

The Enhanced GP-based PSO monitoring system combines the PSO components, the local best pbest and the global best gbest, with the GP, the mean of the model μ and the standard deviation σ. In this monitoring system, the ASVs are the particles and the swarm is represented by the fleet of ASVs. The data obtained from the GP allows the monitoring system to be guided by the estimated model data of the WQP. With this data, the ASVs can travel to unexplored areas, regions where the standard deviation or so-called uncertainty is high, and characterize in depth areas where contamination is high, i.e. zones where the mean of the model is high. The Enhanced GP based PSO takes measurements of the water resource in order to update the GP and obtain an estimated model of the WQP. With the model data, the velocity and position of the ASVs are calculated by applying Eqs. 6a and 6b respectively. = wvtp + c1r1t [pbesttp − xtp ] + c2 r2t [gbest t − xtp ] vt+1 p

(6a)

+ c3r3t [max_unt xt+1 = xtp + vt+1 p p

(6b)



xtp ]

+

c4 r4t [max_cont



xtp ]


As in the Classic PSO, to calculate the next velocity v_p^{t+1} of the particle p, Eq. 6a uses the control parameter w v_p^t, the local best pbest_p^t and the global best gbest^t. However, two new terms are added, max_un^t and max_con^t. The term max_un^t represents the coordinate where the maximum value of the model uncertainty max_σ is found at time t. The term max_con^t refers to the coordinate where the maximum value of the model mean max_μ, i.e., the maximum contamination of the water resource, is found at time t. The constants c_3 and c_4 are acceleration constants that determine the importance of exploration (max_un) and exploitation (max_con). r_3 and r_4 are random values in [0, 1]. To calculate the next position x_p^{t+1} of the particle p, the same equation as in the Classic PSO is used, shown in Eq. 6b. Water measurements are taken every distance l traveled by the ASVs. This condition is applied in order to reduce the GP processing time, since the greater the number of measurements, the longer the processing time. Additionally, the kernel provides high covariance values for coordinates that are close to each other; therefore, it is not necessary to perform measurements that are very close together. To obtain the value of l, Eq. 7 is applied [19]:

l = λ × ℓ_t   (7)

where λ represents a proportion value and ℓ_t refers to the length scale of the GP at time t.
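The sketch below expresses the velocity update of Eq. (6a), with the two GP-driven attractors max_un and max_con, and the measurement-spacing rule of Eq. (7). All coefficient values and inputs are placeholders chosen for illustration, not the tuned values used in the experiments.

# Sketch of the Enhanced GP-based PSO velocity update of Eq. (6a).
import numpy as np

rng = np.random.default_rng(3)

def enhanced_velocity(v, x, pbest, gbest, max_un, max_con,
                      w=0.7, c1=2.0, c2=0.0, c3=3.0, c4=0.0):
    """One velocity step for a fleet of ASVs (rows of x)."""
    n, dim = x.shape
    r1, r2, r3, r4 = (rng.uniform(size=(n, dim)) for _ in range(4))
    return (w * v
            + c1 * r1 * (pbest - x)
            + c2 * r2 * (gbest - x)
            + c3 * r3 * (max_un - x)      # attraction to max model uncertainty
            + c4 * r4 * (max_con - x))    # attraction to max estimated contamination

def measurement_distance(length_scale, lam=0.3):
    """Measurement spacing of Eq. (7): a proportion of the current GP length scale."""
    return lam * length_scale

x = rng.uniform(0, 100, (4, 2))
v = np.zeros_like(x)
v = enhanced_velocity(v, x, pbest=x, gbest=x[0],
                      max_un=np.array([80.0, 20.0]), max_con=np.array([30.0, 70.0]))
x = x + v                                 # Eq. (6b)
print(v.round(2), measurement_distance(10.0))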

4.2.3

Variants of the Enhanced GP-Based PSO

Two variants of the Enhanced GP-based PSO are shown below. These variants are based on keeping two active terms in Eq. 6a: the control parameter and one of the PSO components or GP responses. The purpose of these variants is to observe the behavior of the terms that are kept active when obtaining the model of the WQP of the water resource and detecting pollution peaks.
• Uncertainty: In this variant, the term that remains active along with the control parameter wv is the coordinate of the maximum uncertainty max_un. This variant allows the ASVs to target unexplored areas of the water resource surface, because the max_un term is obtained from the model uncertainty. The velocity of the ASVs is updated according to the following equation:

v_p^{t+1} = w v_p^t + c_3 r_3^t [max_un^t − x_p^t]   (8)

• Contamination: This variant allows the ASVs to target the areas where they detect the highest level of contamination. This is due to the terms that remain active, the control parameter wv and the coordinate of the maximum contamination max_con. The last term refers to the mean of the GP model. Equation 9 is used to update the velocity of the ASVs.


v_p^{t+1} = w v_p^t + c_4 r_4^t [max_con^t − x_p^t]   (9)

The position of the ASVs in both variants is updated according to Eq. 6b.

4.3 AquaFeL-PSO
The AquaFeL-PSO is a monitoring system developed in [8]. This system is based on the PSO, GP and FL paradigms. The AquaFeL-PSO is an improvement of the Enhanced GP-based PSO; the system consists of splitting the monitoring task into two different approaches in two different periods. When starting the monitoring task, the algorithm focuses on covering the largest possible surface of the water resource in order to obtain a first model of the WQP of the water body; this phase is called the exploration phase. After the ASVs have covered a certain distance, the focus shifts to the exploitation of areas with high levels of contamination in order to characterize the WQP in depth; this phase is called the exploitation phase.

4.3.1

Exploration Phase

The main objective of the exploration phase is to obtain a first model of the WQP of the water resource in order to determine the areas where contamination levels are severe, so that the WQP can later be characterized more accurately. To meet this objective, the ASVs must travel over a large part of the surface of the water body; therefore, the velocity equation in this phase is Eq. 10, obtained from the results of [27]. The results in [27] demonstrate that, for the exploration approach, the terms that should remain active are the local best pbest and the coordinate of maximum uncertainty max_un, in addition to the control parameter wv. With this equation of motion, the monitoring system succeeds in generating a first reliable model.

v_p^{t+1} = w v_p^t + c_1 r_1^t [pbest_p^t − x_p^t] + c_3 r_3^t [max_un^t − x_p^t]   (10)

4.3.2

Exploitation Phase

The exploitation phase has several stages: the division of the contaminated areas into action zones, the assignment of the ASVs to the action zones found, the exploitation of the action zones, and the obtaining of the final model of the WQP of the water resource by applying the FL concept.

Action Zones

The action zones are areas where the level of contamination of the water resource is considered worrisome or alarming. The level of contamination is considered worrisome when the percentage of contamination is between 33 and 65% of the maximum


Fig. 3 Example of water contamination levels of the water resource

contamination value found in the exploration phase, and alarming when the value equals or exceeds 66% of the maximum contamination value. The acceptable level is any value lower than 33%. An example of contamination levels is shown in Fig. 3. The action zones are circles defined by a center and a radius rad. The center of an action zone is the coordinate where a maximum contamination value is found, and the radius of the action zone is determined according to the number of ASVs in the fleet n_ASV and the width of the water resource, length (Eq. 11). To obtain the different action zones, first the coordinate of the maximum contamination value is found and, with the radius calculated using Eq. 11, the area covered by the action zone is plotted. To obtain the next action zone, the coordinates belonging to previously determined action zones are no longer considered, i.e., a point (x, y) in the search space belongs to only one action zone. Because of this, there is no overlap between action zones.

rad = length / n_ASV   (11)
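A minimal sketch of the action-zone construction just described: non-overlapping circular zones of radius length/n_ASV, centred on successive contamination maxima of the first model. The peak-selection loop and all data in it are a plausible reading of the description under stated assumptions, not the authors' exact implementation.

# Sketch of the action-zone construction around Eq. (11).
import numpy as np

def action_zones(mean_map, coords, n_asv, length, max_zones):
    """Return (center, radius) pairs; each coordinate belongs to at most one zone."""
    radius = length / n_asv                         # Eq. (11)
    available = np.ones(len(coords), dtype=bool)
    zones = []
    for _ in range(max_zones):
        if not available.any():
            break
        idx = np.argmax(np.where(available, mean_map, -np.inf))
        center = coords[idx]
        inside = np.linalg.norm(coords - center, axis=1) <= radius
        zones.append((center, radius))
        available &= ~inside                        # points already assigned are excluded
    return zones

rng = np.random.default_rng(4)
coords = rng.uniform(0, 100, (500, 2))
mean_map = np.exp(-np.sum((coords - [25, 75]) ** 2, axis=1) / 300) \
         + np.exp(-np.sum((coords - [70, 30]) ** 2, axis=1) / 300)
for center, rad in action_zones(mean_map, coords, n_asv=4, length=100.0, max_zones=4):
    print(np.round(center, 1), round(rad, 1))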

Regarding the number of action zones, two cases are considered: (i) the number of action zones is less than the number of ASVs in the fleet, and (ii) the number of action zones is equal to the number of ASVs in the fleet. The AquaFeL-PSO does not consider the case where the number of action zones is greater than the number of ASVs in the fleet; to solve the latter case, more ASVs would be needed.

Resource Allocation

ASVs are assigned according to the distance between the vehicle and the action zone. The ASV that is closest to a given action zone is the vehicle that has the task of exploiting that area. In case the number of ASVs is greater than the number of action zones, more vehicles are assigned to the action zones with higher priority. The priority of an action zone is determined by considering the peak value of the


zone: based on the data obtained in the exploration phase, the peak with the highest contamination value is the one with the highest priority, max_prt. Eq. 12 is applied to calculate the highest priority:

max_prt = n_ASV × 10 + 10   (12)

The maximum value of the priority is determined by the total number of ASVs n_ASV, since the number of action zones may be equal to the number of vehicles. Each time a priority value is assigned to an action zone, the priority value for the next zone decreases by 10 points.

Exploitation of the Action Zones

The exploitation task is performed after assigning the ASVs to the action zones. Since the vehicles must exploit different action zones, the fleet is divided into sub-populations. This allows information to be shared only between the ASVs that are assigned to the same action zone, without interfering in the exploitation of the other areas. Each sub-population generates a model of the WQP with the data from the first model obtained in the exploration phase together with the measurements taken by the vehicles in their assigned area. To calculate the velocity of the ASVs in this phase, Eq. 13 is applied. This equation is the result obtained in [28], where the focus of the monitoring task was the characterization of the WQP (exploitation).

v_p^{t+1} = w v_p^t + c_1 r_1^t [pbest_p^t − x_p^t] + c_2 r_2^t [gbest^t − x_p^t] + c_4 r_4^t [max_con^t − x_p^t]   (13)

To obtain the global best gbest and the coordinate of the maximum contamination value max_con, only the data from the ASVs of the sub-population and the model of the WQP generated with the data from that sub-population are considered.

Federated Learning

The final model of the WQP of the water resource is obtained by applying the Federated Learning (FL) concept, a technique developed by McMahan et al. [11, 12, 14, 15]. This concept was introduced with the aim of updating language models on cell phones. Through FL, an ensemble ML model is obtained [32]. To generate this model, data found at multiple sites is used. The data is trained at its own site, using local servers, which then send only the final training result to the central server. According to [32], the final results of the models generated with FL and with the centralized learning technique are quite similar. FL preserves the security and privacy of the data and nodes, respectively; moreover, it allows better results to be generated, since the ML models are trained collaboratively [4, 32]. This last point can be observed in the proposed monitoring system. Reference is made to the information provided by [32] to explain the mathematical definition of FL. The node vector F = {F_k | k = 1, 2, . . . , L} is considered, where L is the number of nodes and k is the node index. For this monitoring system, the sub-populations are considered the nodes of the FL technique. To train the ML models on the nodes, their own data are used, W = {D_k | k = 1, 2, . . . , L}. These data refer to the water measurements and the coordinates where the measurements were taken. In the case of a central server (centralized learning technique), the data of the nodes D = D_1 ∪ D_2 ∪ ... ∪ D_L are merged to generate a model M_CEN. In contrast, with the FL technique, models M_FED are generated at each node, since each node trains its own ML model with its own data. Furthermore, the accuracy V_FED of the model generated with the FL technique M_FED is very similar to the accuracy V_CEN of the model generated with the centralized learning technique M_CEN. With δ being a non-negative real number in Eq. 14, the FL algorithm loses an accuracy of at most δ:

|V_FED − V_CEN| < δ   (14)

The FL technique is used in the AquaFeL-PSO in order to generate the models of the WQP of the action zones in the sub-populations; the sub-populations are the nodes of the FL. At the end of the monitoring task, the first model of the WQP obtained in the exploration phase is merged with the models of the WQP generated in the sub-populations. The generation of the final model of the WQP and the operation of the FL technique work as follows in the AquaFeL-PSO:
1. A first model of the WQP is generated at the end of the exploration phase on the central server, joining all the measurements taken by the ASVs of the fleet.
2. In the exploitation phase, after assigning the vehicles to the action zones, each sub-population is a node, where the model of the WQP of the action zone is generated by merging the data from the exploration phase with the measurements taken by the sub-population. This allows a more accurate model of the action zone to be obtained.
3. To obtain the final model of the WQP, the first model obtained in the exploration phase is used as a base: outside the action zones, the mean values of the coordinates are kept from the exploration phase, and at the coordinates of the action zones, the values of the first model are replaced by the values of the models generated in the sub-populations of the corresponding action zones.

An example schematic of the operation is shown in Fig. 4.

Fig. 4 Final model of the WQP generation process using the Federated Learning technique

To obtain the final model of the WQP of the water resource, the following cases are considered:
• Case 1: The GP converges in the exploration phase and in the exploitation phase. That is, it was possible to generate a first model in the exploration phase and, in the exploitation phase, more accurate models of the action zones were generated. Therefore, to obtain the final model of the WQP, the data from the models of all the action zones are replaced in the first model.
• Case 2: The GP converges in the exploration phase. However, in the exploitation phase, the GP does not converge in all nodes or sub-populations. In this case, the model data of the WQP from the action zones where the GP converged are replaced in the first model, and for the action zones where the GP did not converge, the data from the first model are used.
• Case 3: The GP converges in the exploration phase. However, the GP does not converge in the exploitation phase in any action zone, so it is not possible to obtain a more accurate model of these action zones. In this case, the final model is equal to the first model, which was obtained in the exploration phase.
• Case 4: Finally, the GP does not converge in the exploration phase. In this case, it is considered that the monitoring system was not successful in generating the model of the WQP of the water resource.

The monitoring system is considered successful in Cases 1, 2 and 3.
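The federated merging step can be sketched as follows: the final map starts from the exploration-phase model and is overwritten, inside each action zone, with the local model of the sub-population whose GP converged (Cases 1-3); zone models that did not converge are simply kept from the first model. The data structures and names are assumptions for illustration, not the authors' code.

# Sketch of the FL-style merging of the first model with the action-zone models.
import numpy as np

def merge_models(first_model, coords, zones, zone_models):
    """first_model: global mean per coordinate; zone_models: local mean array or None per zone."""
    final_model = first_model.copy()
    for (center, radius), local in zip(zones, zone_models):
        if local is None:                 # GP did not converge in this node (Case 2/3)
            continue
        inside = np.linalg.norm(coords - center, axis=1) <= radius
        final_model[inside] = local[inside]
    return final_model

coords = np.stack(np.meshgrid(np.arange(10), np.arange(10)), -1).reshape(-1, 2).astype(float)
first_model = np.zeros(len(coords))
zones = ((np.array([3.0, 3.0]), 2.5),)
local = np.full(len(coords), 0.9)         # a hypothetical, more accurate local model
final = merge_models(first_model, coords, zones, (local,))
print(final.reshape(10, 10))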

5 Results and Discussion
This section presents the results obtained from the evaluation of the AquaFeL-PSO¹ in different contamination profiles and the comparison with the performance of the Enhanced GP-based PSO variants. The contamination profiles on which the algorithms are evaluated are listed, as well as the performance metrics and the parameter and hyper-parameter settings. The monitoring systems were developed in Python using the following libraries: Scikit-learn,² DEAP³ and Bayesian Optimization.⁴ A laptop computer with 8 GB RAM and an Intel i5 1.60 GHz processor was used to run the simulations.

¹ https://github.com/MicaelaTenKathen/AquaFeL_bench.git (accessed on 02 December 2022).
² https://scikit-learn.org/stable/ (accessed on 27 November 2022).
³ https://deap.readthedocs.io/en/master/ (accessed on 27 November 2022).
⁴ https://github.com/fmfn/BayesianOptimization (accessed on 27 November 2022).


Table 1 Benchmark functions used as contamination profiles

Ackley (Eq. 15, example in Fig. 5a):
f(x) = 20 − 20 exp(−0.2 √((1/N) Σ_{i=1}^{N} x_i²)) + e − exp((1/N) Σ_{i=1}^{N} cos(2π x_i))   (15)

Bohachevsky (Eq. 16, example in Fig. 5b):
f(x) = Σ_{i=1}^{N−1} (x_i² + 2 x_{i+1}² − 0.3 cos(3π x_i) − 0.4 cos(4π x_{i+1}) + 0.7)   (16)

Griewank (Eq. 17, example in Fig. 5c):
f(x) = (1/4000) Σ_{i=1}^{N} x_i² − Π_{i=1}^{N} cos(x_i / √i) + 1   (17)

h1 (Eq. 18, example in Fig. 5d):
f(x) = (sin(x_1 − x_2/8)² + sin(x_2 + x_1/8)²) / (√((x_1 − 8.6998)² + (x_2 − 6.7665)²) + 1)   (18)

Himmelblau (Eq. 19, example in Fig. 5e):
f(x_1, x_2) = (x_1² + x_2 − 11)² + (x_1 + x_2² − 7)²   (19)

Rastrigin (Eq. 20, example in Fig. 5f):
f(x) = 10N + Σ_{i=1}^{N} (x_i² − 10 cos(2π x_i))   (20)

Rosenbrock (Eq. 21, example in Fig. 5g):
f(x) = Σ_{i=1}^{N−1} ((1 − x_i)² + 100 (x_{i+1} − x_i²)²)   (21)

Schaffer (Eq. 22, example in Fig. 5h):
f(x) = Σ_{i=1}^{N−1} ((x_i² + x_{i+1}²)^{0.25} · (sin²(50 · (x_i² + x_{i+1}²)^{0.10}) + 1.0))   (22)

Schwefel (Eq. 23, example in Fig. 5i):
f(x) = 418.9828872724339 · N − Σ_{i=1}^{N} x_i sin(√|x_i|)   (23)

Shekel (Eq. 24, example in Fig. 5j):
f_Shekel(x) = Σ_{i=1}^{M} 1 / (c_i + Σ_{j=1}^{N} (x_j − a_{ij})²)   (24)

5.1 Ground Truth
The proposed scenario for the simulations is the Ypacarai lake. Since the objective is to evaluate the performance of the AquaFeL-PSO in different contamination profiles, several benchmark functions are used; each benchmark function is a contamination profile. These functions have the characteristics of being multidimensional, multimodal, continuous, and deterministic. The benchmark functions and their equations are listed in Table 1. Ground truths generated from the different benchmark functions can be seen in Fig. 5. Scenarios where the contamination is dispersed throughout the lake are considered, such as in Fig. 5b, j. In addition, extreme scenarios are also considered, where the contamination is extremely severe and varies abruptly over a small surface area, as in Fig. 5a, f. For each benchmark function, 10 different ground truth models are used.

Fig. 5 Examples of ground truth with different benchmark functions: (a) Ackley, (b) Bohachevsky, (c) Griewank, (d) h1, (e) Himmelblau, (f) Rastrigin, (g) Rosenbrock, (h) Schaffer, (i) Schwefel, (j) Shekel

Because the range of the WQP values varies randomly across the ground truths, all ground truths used are normalized using Eq. 25 in order to make the comparisons of algorithm performance possible.


f_Normalized(x) = (f_Benchmark(x) − f_min_Benchmark(x)) / (f_max_Benchmark(x) − f_min_Benchmark(x))   (25)

The term f_Benchmark(x) represents the current value of the benchmark function, f_min_Benchmark(x) represents the minimum value of the benchmark function, and f_max_Benchmark(x) refers to its maximum value.
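As an illustration, the sketch below writes out two of the contamination profiles of Table 1 (Ackley, Eq. 15, and Himmelblau, Eq. 19) together with the min-max normalization of Eq. (25). The authors report using the DEAP benchmark implementations in their experiments; the direct definitions and the evaluation grid below are assumptions made only for this example.

# Sketch of two Table 1 benchmark functions plus the normalization of Eq. (25).
import numpy as np

def ackley(x):
    x = np.atleast_2d(x)
    n = x.shape[1]
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(np.sum(x ** 2, axis=1) / n))
    term2 = -np.exp(np.sum(np.cos(2 * np.pi * x), axis=1) / n)
    return 20.0 + np.e + term1 + term2                      # Eq. (15)

def himmelblau(x):
    x = np.atleast_2d(x)
    return (x[:, 0] ** 2 + x[:, 1] - 11) ** 2 + (x[:, 0] + x[:, 1] ** 2 - 7) ** 2   # Eq. (19)

def normalize(values):
    """Min-max normalization of Eq. (25) so that profiles are comparable."""
    return (values - values.min()) / (values.max() - values.min())

grid = np.stack(np.meshgrid(np.linspace(-5, 5, 100),
                            np.linspace(-5, 5, 100)), axis=-1).reshape(-1, 2)
for f in (ackley, himmelblau):
    gt = normalize(f(grid))
    print(f.__name__, gt.min(), gt.max())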

5.2 Performance Metric
The performance of the algorithms is evaluated according to: (i) the success rate of the algorithms, (ii) the model of the WQP generated for the action zones, (iii) the detection of the peaks of the action zones, and (iv) the model of the WQP generated for the whole water resource surface. The success rate of the algorithms corresponds to the percentage of cases where the monitoring system succeeded in generating models of the WQP of the water resource. A run is considered a success when the monitoring system was able to generate a model of the WQP, i.e., when the GP converged. In the case of the AquaFeL-PSO, Cases 1, 2 and 3 mentioned in Sect. "Federated Learning" are considered successful. To evaluate the reliability of the generated models of the WQP, the Mean Square Error (MSE) and the error between the ground truth and the model data generated by the monitoring system are calculated. In order to evaluate the models obtained for the action zones, the MSE is calculated between the values found at the coordinates of the ground truth action zones f(x) and the generated model y (Eq. 26):

MSE_action_zone(f(x), y) = (1 / n_action_zone) Σ_{k=0}^{n_action_zone − 1} (f(x_k) − y_k)²   (26)

The same procedure is carried out to obtain the results corresponding to the model of the WQP of the entire surface of the lake, with the difference that all the coordinates of the search space of the water resource are considered. The equation used to calculate this MSE value is Eq. 27:

MSE_map(f(x), y) = (1 / n_map_points) Σ_{k=0}^{n_map_points − 1} (f(x_k) − y_k)²   (27)

Finally, in order to assess the capacity of the algorithms to detect the peaks of the action zones, the error between the peaks of the ground truth f(x_action_zone_peak) and the peaks of the generated model y_action_zone_peak is calculated using Eq. 28:

Error_action_zone_peak(f(x), y) = | f(x_action_zone_peak) − y_action_zone_peak |   (28)
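A short sketch of the three evaluation quantities of Eqs. (26)-(28); the synthetic ground truth, the noisy model and the action-zone mask are placeholders, and taking the in-zone maxima as the peaks in Eq. (28) is our reading of the definition.

# Sketch of the evaluation metrics of Eqs. (26)-(28).
import numpy as np

def mse(ground_truth, model, mask=None):
    """Eqs. (26)-(27): mean squared error, optionally restricted by a boolean mask."""
    if mask is not None:
        ground_truth, model = ground_truth[mask], model[mask]
    return np.mean((ground_truth - model) ** 2)

def peak_error(ground_truth, model, zone_mask):
    """Eq. (28): |f(x_peak) - y_peak| inside one action zone (peaks taken as in-zone maxima)."""
    return abs(ground_truth[zone_mask].max() - model[zone_mask].max())

rng = np.random.default_rng(5)
f = rng.uniform(0, 1, 1000)                    # ground-truth values on the grid
y = f + rng.normal(0, 0.05, 1000)              # an estimated model with some error
zone = np.zeros(1000, dtype=bool)
zone[:100] = True                              # hypothetical action-zone coordinates

print("MSE_map        :", mse(f, y))
print("MSE_action_zone:", mse(f, y, mask=zone))
print("peak error     :", peak_error(f, y, zone))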


5.3 Setting Simulation Parameters
The ASV fleet has four vehicles that can travel at a maximum speed of 2 m/s. The maximum distance that the ASVs can travel is 20 km; once the ASVs reach this distance, the monitoring task is considered complete. Regarding the GP configuration, the initial value of the length scale is 10% of the size of the surface area of the water resource [20]; therefore, the length scale is equal to 10. The length scale values are in the range [0.1, 10^5]. To determine the λ value, [19, 26] are used as a basis. In order to balance the run time of the monitoring system and the number of measurements, the λ value is set to 0.3. Since the AquaFeL-PSO is composed of two phases, the exploration phase and the exploitation phase, the acceleration coefficient values vary; Table 2 shows the values for the two phases. The values for the exploration phase were obtained from [27], and for the exploitation phase the values obtained from [28] were considered. Another variable that must be established is the distance that the ASVs must travel in the exploration phase and in the exploitation phase. To set this value, the results obtained in [8] are considered. Therefore, the ASVs travel 10 km in the exploration phase and then travel 10 km exploiting the most contaminated areas (the action zones).

5.4 Performance Comparison
After setting the parameters for the AquaFeL-PSO, comparisons are made with two variants of the Enhanced GP-based PSO, the Uncertainty algorithm and the Contamination algorithm. The performance of the AquaFeL-PSO is compared with these algorithms because the Uncertainty algorithm covers the largest area of the water resource, since it focuses on exploration, and the Contamination algorithm focuses on characterizing the WQP of the most contaminated areas, since its main focus is exploitation. Considering Eq. 8, the acceleration coefficients of the Uncertainty algorithm are set to c1 = 0, c2 = 0, c3 = 3, and c4 = 0, and considering Eq. 9, the coefficients of the Contamination algorithm are set to c1 = 0, c2 = 0, c3 = 0, and c4 = 3. These values were established considering [26–28]. In all simulations, with the three algorithms, the ASVs started from the same initial position.

Table 2 Values of the acceleration coefficients for the phases of the AquaFeL-PSO
Hyper-parameter | Exploration phase | Exploitation phase
c1 | 2.0187 | 3.6845
c2 | 0 | 1.5614
c3 | 3.2697 | 0
c4 | 0 | 3.1262
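For reference, the coefficient settings of Table 2 and Sect. 5.4 can be gathered in a single configuration structure; this is a sketch of how the three compared planners could be parameterized, not the authors' code.

# Coefficient and simulation settings reported in Sects. 5.3-5.4 and Table 2.
CONFIG = {
    "uncertainty":   {"c1": 0.0,    "c2": 0.0,    "c3": 3.0,    "c4": 0.0},
    "contamination": {"c1": 0.0,    "c2": 0.0,    "c3": 0.0,    "c4": 3.0},
    "aquafel_pso": {
        "exploration":  {"c1": 2.0187, "c2": 0.0,    "c3": 3.2697, "c4": 0.0},
        "exploitation": {"c1": 3.6845, "c2": 1.5614, "c3": 0.0,    "c4": 3.1262},
    },
    "fleet_size": 4,
    "max_speed_m_s": 2.0,
    "max_distance_km": 20.0,
    "exploration_distance_km": 10.0,
    "lambda": 0.3,
    "initial_length_scale": 10.0,
}

def coefficients(planner, phase=None):
    """Return the (c1..c4) set for a planner (and phase, for the AquaFeL-PSO)."""
    cfg = CONFIG[planner]
    return cfg[phase] if phase else cfg

print(coefficients("aquafel_pso", "exploration"))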


Table 3 Success rate of the compared algorithms
Benchmark function | Uncertainty algorithm (%) | Contamination algorithm (%) | AquaFeL-PSO (%)
Ackley | 0 | 0 | 50
Bohachevsky | 100 | 100 | 100
Griewank | 0 | 0 | 0
h1 | 0 | 0 | 20
Himmelblau | 100 | 100 | 100
Rastrigin | 0 | 10 | 50
Rosenbrock | 100 | 100 | 100
Schaffer | 0 | 0 | 0
Schwefel | 10 | 60 | 100
Shekel | 100 | 100 | 100

Table 3 shows the success rate of the algorithms with the different benchmark functions. For the success cases, Cases 1, 2 and 3 of Sect. "Federated Learning" are considered; a success rate of 0 corresponds to Case 4 of that section. A run is considered a success when the GP converged and it was possible to generate a model of the WQP of the water resource. The algorithm that was able to generate water quality models for the most benchmark functions was the AquaFeL-PSO. This is due to the two phases of the algorithm: first, its objective is to cover the entire surface of the lake, which allows the system to generate a first model of the WQP; then, when it switches to the exploitation phase, more accurate models of the action zones are generated. This leads to higher success rates, because the cases mentioned in Sect. "Federated Learning" are retained. However, there are cases where the GP does not converge and is not able to generate a model of the WQP even in the exploration phase, as happened with the Griewank and Schaffer functions. The reason why the GP did not converge is the value of the length scale: the minimum value set for the GP was 0.1, but there are functions where the ground truth values vary so sharply between neighboring coordinates that this minimum value was still too high for the length scale of the GP. In the generation of the action zone models, as shown in Table 4, there are cases where the three algorithms have a success rate of 100%; among them, the monitoring system that generates the most reliable models, by a small difference, is the Contamination algorithm. This is because, in this algorithm, the ASVs go directly to the areas of highest contamination, which correspond to the action zones. However, this algorithm is not able to generate any model for at least four benchmark functions, and for two functions its success rates vary between 10% and 60%. On the other hand, the AquaFeL-PSO is able to generate models for 8 of the 10 benchmark functions, and for 3 of these functions the success rate varies between 20% and 50%. Despite generating the most reliable models of the action zones, the Contamination algorithm did not obtain the best results in detecting contamination peaks (Table 5).


Table 4 MSE of the model of the action zones of the compared algorithms
Benchmark function | Uncertainty algorithm | Contamination algorithm | AquaFeL-PSO
Ackley | – | – | 0.09758 ± 0.14598
Bohachevsky | 0.01355 ± 0.03104 | 0.01278 ± 0.02958 | 0.01343 ± 0.03301
Griewank | – | – | –
h1 | – | – | 0.07705 ± 0.00496
Himmelblau | 0.03224 ± 0.08064 | 0.03182 ± 0.08195 | 0.03238 ± 0.08298
Rastrigin | – | 0.10513 ± 0.21185 | 0.09066 ± 0.17625
Rosenbrock | 0.03940 ± 0.06415 | 0.04055 ± 0.06713 | 0.03868 ± 0.06604
Schaffer | – | – | –
Schwefel | 0.07580 ± 0.10647 | 0.05614 ± 0.09920 | 0.05434 ± 0.11330
Shekel | 0.03420 ± 0.08262 | 0.03388 ± 0.08469 | 0.03419 ± 0.08051

Table 5 Error in the peaks of the action zones of the compared algorithms
Benchmark function | Uncertainty algorithm | Contamination algorithm | AquaFeL-PSO
Ackley | – | – | 0.41941 ± 0.54160
Bohachevsky | 0.05470 ± 0.04451 | 0.03746 ± 0.11310 | 0.02621 ± 0.08778
Griewank | – | – | –
h1 | – | – | 0.73664 ± 0.65926
Himmelblau | 0.07676 ± 0.17413 | 0.02539 ± 0.07548 | 0.01744 ± 0.04327
Rastrigin | – | 0.21511 ± 0.47034 | 0.32011 ± 0.56378
Rosenbrock | 0.05557 ± 0.15710 | 0.02898 ± 0.04707 | 0.01409 ± 0.02622
Schaffer | – | – | –
Schwefel | 0.42700 ± 0.72971 | 0.30488 ± 0.59490 | 0.29774 ± 0.57475
Shekel | 0.03262 ± 0.10427 | 0.05142 ± 0.17252 | 0.01799 ± 0.10266

The algorithm that performed best in detecting action-zone contamination peaks was the AquaFeL-PSO. This is because the Uncertainty algorithm, as it focuses on exploring, does not take the pollution peaks into account and therefore does not characterize the WQP in depth in these zones. The Contamination algorithm, in turn, does have the ability to characterize the parameters but, since it lacks the ability to explore the lake surface, it does not detect all the peaks of the action zones because the ASVs get stuck at a local maximum. In contrast, the AquaFeL-PSO combines both approaches, allowing it to cover a large part of the lake surface and then detect contamination peaks in the polluted areas. The monitoring system that generates the best models of the lake WQP is the AquaFeL-PSO. The results can be seen in Table 6. In most of the contamination profiles, except for the ground truths of the Griewank and Schaffer functions, the best performance was obtained with the AquaFeL-PSO. In the Rastrigin function, it


Table 6 MSE of the model of the Ypacarai Lake of the compared algorithms
Benchmark function | Uncertainty algorithm | Contamination algorithm | AquaFeL-PSO
Ackley | – | – | 0.13269 ± 0.11981
Bohachevsky | 0.00064 ± 0.00060 | 0.00097 ± 0.00133 | 0.00045 ± 0.00083
Griewank | – | – | –
h1 | – | – | 0.10568 ± 0.07438
Himmelblau | 0.00108 ± 0.00142 | 0.00030 ± 0.00024 | 0.00011 ± 0.00023
Rastrigin | – | 0.03636 (a) | 0.06249 ± 0.06181
Rosenbrock | 0.00108 ± 0.00075 | 0.00041 ± 0.00059 | 0.00008 ± 0.00011
Schaffer | – | – | –
Schwefel | 0.05992 (a) | 0.07196 ± 0.06012 | 0.04742 ± 0.07190
Shekel | 0.00056 ± 0.00114 | 0.00128 ± 0.00155 | 0.00042 ± 0.00136
(a) The mean value is the MSE of the only scenario in which the algorithm was able to generate a model of the WQP of the water resource

can be seen that the MSE of the Contamination algorithm is lower than that of the AquaFeL-PSO. However, the success rate of the Contamination algorithm was only 10%, i.e., it obtained only one model of the WQP out of 10 monitoring task simulations. The figures in Fig. 6 show the results obtained with the AquaFeL-PSO for different contamination profiles. The maps at the top represent the uncertainty of the model during the monitoring task. In addition, the movement of the ASVs is represented by colored lines on the top map; the initial positions are indicated by black dots and the final positions by red dots. The maps shown at the bottom represent the models of the WQP. Figure 6a, d show the results obtained with the Ackley function and the h1 function as ground truth, respectively. In these examples, it was possible to generate models with the AquaFeL-PSO in the exploration phase, so first models were obtained. However, due to the value of the length scale of the GP, the GP did not converge in the exploitation phase and was not able to generate models of the action zones. Because of this, the final results were the models obtained in the exploration phase. These results are examples of Case 3 of Sect. "Federated Learning". In Fig. 6b, e, g, j, in addition to generating the models in the exploration phase, the GP was able to generate the models of the action zones in the exploitation phase. These models obtained in the exploitation phase then replace the corresponding regions of the first models obtained in the exploration phase (Case 1 of Sect. "Federated Learning"). One of the advantages of applying federated learning can be observed in the cases of the Rastrigin (Fig. 6f) and Schwefel (Fig. 6i) functions. In these examples, the monitoring system generated a model in the exploration phase of each scenario. However, in the exploitation phase, the GP was only able to generate a more accurate model in some action zones. Therefore, in the first model generated, only the models of the action zones where it was possible to obtain the model of the WQP were replaced (Case 2 of Sect. "Federated Learning"). Finally, there are the cases of the results of the Griewank (Fig. 6c) and Schaffer (Fig. 6h) functions, in


Fig. 6 Example of the operation of the AquaFeL-PSO with different contamination profiles: (a) Ackley, (b) Bohachevsky, (c) Griewank, (d) h1, (e) Himmelblau, (f) Rastrigin, (g) Rosenbrock, (h) Schaffer, (i) Schwefel, (j) Shekel. At the top of the figures are shown the movement of the vehicles, the initial positions (black dots), the final positions (red dots), and the uncertainty of the model generated by the GP. At the bottom of the figures, the model of the WQP obtained in the monitoring task is shown


which no model could be generated in the exploration phase because the minimum value of the range of length scale values was too high to fit the GP (Case 4 of Sect. “Federated Learning”). Therefore, the uncertainty value is 1 over the whole map except at the coordinates where measurements were taken, where the uncertainty is 0. In the model of the WQP, at all the coordinates where no measurements were taken, the value of the mean is equal to 0.
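
The replacement logic of these cases can be summarized in a short sketch. This is not the authors' implementation; it assumes hypothetical structures (NumPy arrays for the lake model, a dictionary of exploitation-phase zone models, and boolean masks marking each action zone).

import numpy as np

def merge_models(exploration_model, zone_models, zone_masks):
    # exploration_model: WQP model of the whole lake from the exploration phase,
    #                    or None if no first model could be generated (Case 4).
    # zone_models: dict {zone_id: local exploitation-phase model, or None if the
    #              GP of that node did not converge}.
    # zone_masks: dict {zone_id: boolean array marking the cells of that action zone}.
    if exploration_model is None:
        return None                          # Case 4: no model at all
    final_model = exploration_model.copy()
    for zone_id, local_model in zone_models.items():
        if local_model is not None:          # Cases 1 and 2: overwrite converged zones
            mask = zone_masks[zone_id]
            final_model[mask] = local_model[mask]
    return final_model                       # Case 3: nothing replaced, exploration model kept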

6 Summary of the Results

The main findings of this work are discussed below:

• The Uncertainty algorithm focuses on the exploration of the water resource surface. Because of this, it cannot accurately detect the contamination peaks of the action zones. However, it generates good models of the WQP of the entire water resource.
• The Contamination algorithm, unlike the Uncertainty algorithm, focuses on exploiting areas with high contamination peaks. However, since it does not cover enough of the surface of the water resource, it cannot detect all the contamination zones and all the contamination peaks. In addition, by not exploring the surface sufficiently, its ASVs can get stuck at a local maximum.
• The AquaFeL-PSO was able to generate good models of the action zones and obtain the most accurate WQP models. It also obtained the best results in detecting contamination peaks in the action zones. This is due to the change of focus of the monitoring system and the distribution of the fleets in different action zones.
• In four of the benchmark functions used in this work, the GP converged at all nodes or sub-populations in the exploitation phase, enabling the generation of more accurate models of the WQP in the action zones. These functions are the Bohachevsky, Himmelblau, Rosenbrock, and Shekel functions.
• By applying the federated learning technique, it was possible to replace the models of the WQP generated in the sub-populations or nodes where the GP converged and to keep the data from the action zones of the first model in the nodes where the GP did not converge. This advantage of FL could be observed in the AquaFeL-PSO results with the Rastrigin and Schwefel functions as ground truth.
• In functions such as Ackley and Schwefel, the GP was not able to converge in the exploitation phase at the nodes. However, in the exploration phase, it was possible to generate a first model of the WQP of the water resource, which was taken as the final model when the GP did not converge.


• Due to the large variability of the data at nearby coordinates in the ground truths of the Griewank and Schaffer functions, the GP was not able to converge with the selected configuration; the length scale value was too high for these functions. However, the scenarios presented by these functions are very unlikely to occur in real life, since the variability of water parameters at nearby coordinates is normally not expected to be large.

7 Conclusions

This work evaluates the performance of the AquaFeL-PSO on ground truths of different benchmark functions, also called contamination profiles. The objectives of the evaluation were the detection of contamination peaks, the generation of accurate models of the areas with high pollution levels, and the modeling of the entire water resource. In addition, the results of the AquaFeL-PSO were compared with two variants of the Enhanced GP-based PSO [26], the Uncertainty algorithm and the Contamination algorithm. The Uncertainty algorithm bases the movements of the ASVs on the control parameter of the Enhanced GP-based PSO and the model uncertainty term, which allows the ASVs to traverse unexplored areas of the water resource. In contrast, the velocity in the Contamination algorithm is calculated from the control parameter and the maximum contamination term, so the ASVs are guided to areas where contamination is high. The results show that the AquaFeL-PSO was able to obtain models of the WQP of the water resource in eight of the ten benchmark functions evaluated. Regarding the generation of the models of the action zones, the Contamination algorithm generated the best models by a small margin. However, the AquaFeL-PSO obtained the best performance in the generation of the model of the WQP of the whole lake and in the detection of contamination peaks. The results demonstrate the strong performance of the AquaFeL-PSO across different contamination profiles. From this, the development of a PSO- and GP-based monitoring system capable of multi-target problem solving, where each target is a water quality sensor, is proposed as a future research direction. In other words, the movement of the ASVs should be based on data from several water quality sensors simultaneously.

Acknowledgements This work has been funded by MCIN/AEI/10.13039/501100011033 and the European Union Next Generation EU/PRTR under the Project “Gestión del Aprendizaje y Planificación de Flotas de Vehículos Acuáticos No Tripulados para la Monitorización de Masas de Agua Superficiales”, project reference: PID2021-126921OB-C21.


References

1. Arzamendia, M., Espartza, I., Reina, D.G., Toral, S., Gregor, D.: Comparison of Eulerian and Hamiltonian circuits for evolutionary-based path planning of an autonomous surface vehicle for monitoring Ypacarai Lake. J. Ambient. Intell. Hum. Comput. 10(4), 1495–1507 (2019)
2. Arzamendia, M., Gregor, D., Reina, D.G., Toral, S.L.: An evolutionary approach to constrained path planning of an autonomous surface vehicle for maximizing the covered area of Ypacarai Lake. Soft Comput. 23(5), 1723–1734 (2019)
3. Arzamendia, M., Reina, D.G., Toral, S., Gregor, D., Asimakopoulou, E., Bessis, N.: Intelligent online learning strategy for an autonomous surface vehicle in lake environments using evolutionary computation. IEEE Intell. Transp. Syst. Mag. 11(4), 110–125 (2019)
4. Chen, M., Poor, H.V., Saad, W., Cui, S.: Wireless communications for collaborative federated learning. IEEE Commun. Mag. 58(12), 48–54 (2020)
5. Dirección General del Centro Multidisciplinario de Investigaciones Tecnológicas (CEMIT): Servicios de monitoreo de calidad de agua por campañas de muestreo en el lago Ypacaraí 2016–2018. Technical report, Universidad Nacional de Asunción (UNA) (2018)
6. Dirección General del Centro Multidisciplinario de Investigaciones Tecnológicas (CEMIT): Monitoreo de calidad de agua por campañas de muestreo en el lago Ypacaraí 2019–2021. Technical report, Universidad Nacional de Asunción (UNA) (2021)
7. González, E.J., Roldán, G.: Eutrophication and phytoplankton: some generalities from lakes and reservoirs of the Americas. Microalgae—From Physiology to Application (2019)
8. Kathen, M.J.T., Johnson, P., Flores, I.J., Reina, D.G.: AquaFeL-PSO: a monitoring system for water resources using autonomous surface vehicles based on multimodal PSO and federated learning (2022). https://doi.org/10.48550/ARXIV.2211.15217, https://arxiv.org/abs/2211.15217
9. Kathen, M.J.T., Johnson, P., Flores, I.J., Reina, D.G.: Monitoring peak pollution points of water resources with autonomous surface vehicles using a PSO-based informative path planner. Mobile Robots: Motion Control and Path Planning (in press)
10. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of ICNN'95—International Conference on Neural Networks, vol. 4, pp. 1942–1948. IEEE (1995)
11. Konečný, J., McMahan, H.B., Ramage, D., Richtárik, P.: Federated optimization: distributed machine learning for on-device intelligence (2016). arXiv:1610.02527
12. Konečný, J., McMahan, H.B., Yu, F.X., Richtárik, P., Suresh, A.T., Bacon, D.: Federated learning: strategies for improving communication efficiency (2016). arXiv:1610.05492
13. López Moreira, G.A., Hinegk, L., Salvadore, A., Zolezzi, G., Hölker, F., Monte Domecq, S.R.A., Bocci, M., Carrer, S., De Nat, L., Escribá, J., et al.: Eutrophication, research and management history of the shallow Ypacaraí Lake (Paraguay). Sustainability 10(7), 2426 (2018)
14. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, pp. 1273–1282. PMLR (2017)
15. McMahan, H.B., Moore, E., Ramage, D., y Arcas, B.A.: Federated learning of deep networks using model averaging (2016). arXiv:1602.05629
16. Mitsch, W.J., Wang, N.: Large-scale coastal wetland restoration on the Laurentian Great Lakes: determining the potential for water quality improvement. Ecol. Eng. 15(3–4), 267–282 (2000)
17. Peralta, F., Pearce, M., Poloczek, M., Reina, D.G., Toral, S., Branke, J.: Multi-objective path planning for environmental monitoring using an autonomous surface vehicle. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '22, pp. 747–750. Association for Computing Machinery, New York, USA (2022)
18. Peralta, F., Reina, D.G., Toral, S.: Towards an online water quality monitoring system of dynamic environments using an autonomous surface vehicle. In: International Conference on Optimization and Learning (OLA2022). Science Conferences (2022)


19. Peralta, F., Reina, D.G., Toral, S., Arzamendia, M., Gregor, D.: A Bayesian optimization approach for multi-function estimation for environmental monitoring using an autonomous surface vehicle: Ypacarai Lake case study. Electronics 10(8), 963 (2021)
20. Peralta, F., Reina, D.G., Toral, S., Arzamendia, M., Gregor, D.: A Bayesian optimization approach for water resources monitoring through an autonomous surface vehicle: the Ypacarai Lake case study. IEEE Access 9, 9163–9179 (2021). https://doi.org/10.1109/ACCESS.2021.3050934
21. Peralta, F., Yanes, S., Reina, D.G., Toral, S.: Monitoring water resources through a Bayesian optimization-based approach using multiple surface vehicles: the Ypacarai Lake case study. In: 2021 IEEE Congress on Evolutionary Computation (CEC), pp. 1511–1518. IEEE (2021)
22. Rasmussen, C.E.: Gaussian processes in machine learning. In: Summer School on Machine Learning, pp. 63–71. Springer, Berlin (2003)
23. Rivera, G., Porras, R., Sanchez-Solis, J.P., Florencia, R., García, V.: Outranking-based multi-objective PSO for scheduling unrelated parallel machines with a freight industry-oriented application. Eng. Appl. Artif. Intell. 108, 104556 (2022). https://doi.org/10.1016/j.engappai.2021.104556
24. Sánchez-García, J., García-Campos, J.M., Arzamendia, M., Reina, D.G., Toral, S., Gregor, D.: A survey on unmanned aerial and aquatic vehicle multi-hop networks: wireless communications, evaluation tools and applications. Comput. Commun. 119, 43–65 (2018)
25. Sánchez-García, J., Reina, D., Toral, S.: A distributed PSO-based exploration algorithm for a UAV network assisting a disaster scenario. Futur. Gener. Comput. Syst. 90, 129–148 (2019)
26. Ten Kathen, M.J., Flores, I.J., Reina, D.G.: An informative path planner for a swarm of ASVs based on an enhanced PSO with Gaussian surrogate model components intended for water monitoring applications. Electronics 10(13), 1605 (2021)
27. Ten Kathen, M.J., Flores, I.J., Reina, D.G.: A comparison of PSO-based informative path planners for autonomous surface vehicles for water resource monitoring. In: 7th International Conference on Machine Learning Technologies (ICMLT 2022). ACM (in press)
28. Ten Kathen, M.J., Reina, D.G., Flores, I.J.: A comparison of PSO-based informative path planners for detecting pollution peaks of the Ypacarai Lake with autonomous surface vehicles. In: International Conference on Optimization and Learning (OLA'2022) (in press)
29. Yanes, S., Peralta, F., Córdoba, A.T., del Nozal, Á.R., Marín, S.T., Reina, D.G.: An evolutionary multi-objective path planning of a fleet of ASVs for patrolling water resources. Eng. Appl. Artif. Intell. 112, 104852 (2022)
30. Yanes, S., Reina, D.G., Toral, S.: A deep reinforcement learning approach for the patrolling problem of water resources through autonomous surface vehicles: the Ypacarai Lake case. IEEE Access 8, 204076–204093 (2020)
31. Yanes, S., Reina, D.G., Toral, S.: A multiagent deep reinforcement learning approach for path planning in autonomous surface vehicles: the Ypacarai Lake patrolling case. IEEE Access (2021)
32. Yang, Q., Liu, Y., Cheng, Y., Kang, Y., Chen, T., Yu, H.: Federated learning. Synth. Lect. Artif. Intell. Mach. Learn. 13(3), 1–207 (2019)

Adapting Swarm Intelligence to a Fixed Wing Unmanned Combat Aerial Vehicle Platform

Murat Bakirci and Muhammed Mirac Ozer

Abstract The majority of swarm UAV studies focus on a single aspect, investigating only stages such as formation development, path planning, or target tracking for a swarm already in mission flight. Besides, the dynamic coordination and operation of the system based on new commands that can be transmitted to the swarm during the mission are not taken into account; that is, the input of ground resources is often ignored. In this study, all stages of a swarm of unmanned combat aerial vehicles (UCAVs), from take-off to the end of the mission, are detailed in a single holistic framework, including communication with the ground station and intercommunication between swarm members. The designed solution is a platform that enables the swarm structure to prevail by developing alternative strategies and tactics against existing manned or unmanned air, land, and sea platforms. In this context, operational algorithms have been developed for fixed-wing, fully autonomous UCAVs, which can successfully detect in-sight and beyond-sight targets for a desired period of time and can communicate seamlessly with ground stations. Furthermore, dynamic swarm-type algorithms have been developed in order to fulfill the desired task in the event of the loss of any UCAV during the mission, to replace the lost vehicle with a new vehicle, and to communicate directly with the UCAVs in the swarm. As a result of adapting swarm intelligence to the UCAV platform, all individuals in the swarm perform tasks such as taking off in formation, adding or removing individuals to or from the swarm, and formation protection. Moreover, they have the ability to change direction as a swarm, change formation, split or merge, navigate, ascend and descend, and perform simultaneous/sequential auto-landing as a swarm.

Keywords Swarm systems · UAV · Flight formation · Operational algorithms · Target detection · Tracking

M. Bakirci (B) · M. M. Ozer
Unmanned/Intelligent Systems Lab, Faculty of Aeronautics and Astronautics, Tarsus University, Mersin 33400, Turkey
e-mail: [email protected]
M. M. Ozer
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications, Studies in Big Data 132, https://doi.org/10.1007/978-3-031-38325-0_18



List of Acronyms, Variables, and Symbols

Backbone: First layer of the YOLO algorithm.
CNN: Convolutional Neural Network.
CSPResNext50: Backbone architecture.
CSPDarknet53: Backbone architecture.
CSRT: Channel and Spatial Reliability Tracking.
EfficientNet-B3: Backbone architecture.
Faster R-CNN: Faster Region-Based Convolutional Neural Network.
FOV: Field of View.
FPS: Frame Per Second.
Head: Third layer of the YOLO algorithm.
ImageNet: Detection algorithm training dataset.
IVG: Impact Vector Guidance.
KCF: Kernelized Correlation Filter.
LVFG: Lyapunov Vector Field Guidance.
Neck: Second layer of the YOLO algorithm.
YOLO: You Only Look Once.
OpenCV: An open-source library for programming functions.
SSD: Single Shot Detector.
TCP: Transmission Control Protocol.
TLD: Tracking-Learning-Detection.
UDP: User Datagram Protocol.
UGV: Unmanned Ground Vehicle.
VTOL: Vertical Take-Off and Landing.
altitude: Flight altitude (feet).
arrowhead_array: Instantaneous position of the UCAV in the arrowhead formation.
d1: The distance between the UCAV with the arrowhead_array value of 1 (UCAV-1) and the guide UCAV (feet).
d1x: Parameter to reflect the distance of the guide UCAV and UCAV-1 to other UCAVs in 2D space (feet).
d1y: Parameter to reflect the distance of the guide UCAV and UCAV-1 to other UCAVs in 2D space (feet).
d2: Distance between consecutive UCAVs (feet).
def.ucav_com[‘ucav_formation’][‘type’]: Flight formation type.
def.ucav_com[‘guide_ucav’][‘gps_noise_flag’]: GPS service availability.
direction (dir): The value used to determine which wing a UCAV is on in the formation.
dispatch: Departure command value.
prism_array: Instantaneous position of the UCAV in the prism formation.
theta (θ): Angle made by UCAVs other than UCAV-1 in a prism formation with the direction of flight of the guide UCAV (degrees).
ucav_com: Data set containing ucav_id and ucav_link.
ucav_id: Position of a UCAV in a formation.
ucav_link: Instant location and speed information of all UCAVs within the communication range of a default UCAV.
x: X-coordinate of the guide UCAV (longitude).
x_vel: X-component of the speed of the guide UCAV (knots).
y: Y-coordinate of the guide UCAV (latitude).
y_vel: Y-component of the speed of the guide UCAV (knots).


1 Introduction

Unmanned combat aerial vehicles (UCAVs) are very popular due to the wide variety of benefits they provide, such as being economical compared to manned aircraft and reducing the risk of death to zero by keeping the user away from the conflict zone. More than twenty years have passed since UCAVs entered service in the early 2000s, when the US began equipping UAVs with various weapons. Today, due to the many benefits they provide, they have become an extremely important tactical weapon for countries that want to act as playmakers on a global or regional scale. In parallel with rapid technological developments, new models are added to these vehicles, which can easily be equipped with various features according to mission requirements [1, 2].

An unmanned aircraft system refers to the aircraft together with the equipment, infrastructure, and personnel necessary to control it. UCAVs are unmanned vehicles within the family of unmanned combat vehicles, which can be controlled remotely and are lethal with their weapons and ammunition. These vehicles are classified according to the communication systems that provide communication with the operator or other vehicles, the type of propulsion, and the size of the different weapons within the maximum load capacity they can carry [3]. Their cost of operation and supply is much lower than that of manned vehicles. Moreover, the fact that UCAV operations do not cause any additional costs in terms of life and property also provides political advantages in the eyes of the user countries [4, 5].

Although a single UCAV can provide operational services over a very wide area, both the difficulty of the required operations and the cost of losing a single UCAV system motivate the use of more than one low-cost UCAV at the same time [6]. Deploying multiple UCAVs at the same time instead of one UCAV for different phases of a mission has great advantages in terms of both the success and the duration of the mission. The limitations on payload, energy, size, time, performance, accident, and loss that arise when using a single UCAV can be significantly reduced by the use of multiple or swarm UCAVs [7–9].

UCAV swarms are formed when more than one UCAV comes together and performs the same task, or different parts of the same task, at the same time. In swarm UCAV applications, it is of great importance to determine the mission planning, the decision-making process, and the stages of this process well. In addition, it is necessary to ensure interaction, communication, and harmony between a large number of different UCAVs. While swarm UCAVs are performing their tasks, when their communication with each other is seamless and good coordination is ensured, they can offer efficient task sharing in cooperation. In order to achieve this, algorithms with an advanced level of autonomy and a certain amount of artificial intelligence are being developed for UCAV systems that operate in large numbers and cooperate to perform a specific task [10–16]. Security systems are needed for the removal from the swarm of a single UCAV that can be used for different missions and of UCAVs that should not be in a certain flight zone [17]. This situation, which was evaluated with various simulators,


provided successful results, especially in terms of the reduction of communication losses [17]. For swarm UCAVs, control algorithms such as particle swarm optimization have been proposed to ensure cooperation and to detect situations such as gas leakage [18]. Similar algorithms are also being developed to correctly manage the movement within the swarm and to prevent possible collisions. In practice, swarm intelligence algorithms have proved to be competitive and versatile, addressing a wide range of optimization problems: discrete and continuous, constrained and unconstrained, and single-, multi-, and many-objective problems [19–22].

Swarm UCAVs are used especially in the military as a means of target detection, reconnaissance, surveillance, intelligence, and telecommunication [23–25]. However, the importance of this technology also emerges in times of natural disasters such as earthquakes and fires. They make an important contribution to the field of civil practice, which includes locating victims and delivering the necessary assistance, especially in situations where access to the disaster area is difficult [26, 27]. In addition, they are used to map a particular region with lower budgets in order to facilitate the work of some occupational groups [28]. Swarm UCAVs are divided into homogeneous and heterogeneous swarms according to whether their members have the same characteristics [29].

There are two control architectures for the control of swarm UCAV systems: centralized unmanned aircraft control algorithms and decentralized unmanned aircraft control algorithms. In a decentralized control solution, UCAVs must be able to communicate with each other and act fully autonomously with individual decisions, and in case of the loss of any UCAV, the remaining UCAVs must complete the task. In this case, the UCAV systems are not controlled from a headquarters, and each vehicle makes its own decisions during the operation [30]. During the development of swarm systems, different disciplines work together and use highly advanced algorithms that take a long time to develop. Therefore, in swarm UCAVs, task distribution, the decision-making mechanism, path planning, coordination, and the controllability of these elements are the main areas that need to be emphasized [31–34]. There are various mission algorithms, such as communication technologies, positioning and mapping algorithms, collision avoidance algorithms, mission assignment algorithms, vehicle dynamics, trajectory planning, and UCAV control algorithms, to ensure communication between swarm systems [35]. In addition to these, optimization algorithms that can be used in real time are also employed in swarm UCAV systems [36, 37]. These algorithms can solve the optimization problem very quickly on processors with low processing power.

The remainder of the article is organized as follows. In the second section, the proposed study is discussed and compared with similar studies in the literature. Formation development, which is one of the main features expected from swarm systems, is discussed in the third section. Section four covers mission initiation and route planning procedures. The details of the detection and tracking of the target systems, which are expected to be detected and destroyed by the swarm UCAVs within the scope of the mission, are given in the fifth section. All communication architectures that must be implemented during the mission are described in section six. The results obtained and a general evaluation are compiled in the seventh section, the conclusion.


2 Proposed Work

Swarm UAV systems, which are expected to perform challenging and critical tasks, require highly sensitive coordination, advanced automation, and artificial intelligence integration. Since it is insufficient to investigate such a complex system from a single aspect, it is instead analyzed by dividing it into sub-categories. Swarm UAV systems can generally be considered in five stages [38]. All the requirements for the operation of the system can be addressed in these five stages. The success of the swarm mission depends on the coordination and seamless execution of these phases, each of which is interconnected with the others.

The first of these is the decision-making step, in which preliminary decisions are made regarding the planned task. In this step, the requirements of the task to be implemented, the extent to which it is necessary, the benefits and harms, scheduling, management, and possible consequences are determined. In the next step, path planning is carried out in accordance with the scope and objectives of the task. Considering all mapping and environmental information, the most suitable flight path is created for the swarm members. In cases where unexpected scenarios are encountered, alternatives are also determined in this step. The next step is the control step, which is extremely critical and responsible for the coordination of the entire task. Important tasks such as the control of both the individual and the collective flight of the swarm, changing flight mode according to conditions, and obstacle avoidance are managed within this step. The dynamic interaction of the swarm on mission with ground resources and, at the same time, information sharing among individuals is carried out through the communication step. Issues such as over which network and how the communication will be made, communication security, and the IoT architecture are evaluated in this step. The last step is the application step, which expresses in which field the task to be performed is applied.

Current studies in the literature predominantly focus on only one of the abovementioned steps, such as path planning [39–43], control [44–47], or detection and tracking [48–51]. Few studies have addressed more than one of these issues, such as decision and control [52] or path planning and control [53]. In addition, the communication step is often overlooked, and the number of studies on this subject is quite limited too.

In this study, assuming that the decision-making process has been completed, the entire task set that a UCAV swarm system will perform in line with the planned task is discussed in a single framework. In this context, all of the mission phases, namely take-off, formation, path planning, target detection and tracking, and communication architectures, were discussed, and the relevant steps were supported by the necessary algorithms. The sequential take-off and formation development phases, which are the most prominent features of swarm systems, were verified by simulations. Additionally, the performances of the algorithms used for detection and tracking were measured through various training tests. During swarm UCAV mission planning, tasks are assigned to a sufficient number of UCAVs for multiple targets, taking certain objectives into account. While selecting the target for each UCAV, obstacles in the


environment, the number of enemies, weather conditions, and environmental factors are also taken into consideration. Moreover, the swarm UCAVs are able to choose the route they will take to the targets, taking into account the obstacles on the way, and they can also select targets according to a factor that indicates their importance.

3 Swarm Formation

After the autonomous take-off, the vehicle starts to search for the easiest opponent according to the data coming from the server. When it determines a suitable opponent, it moves over the position until the opponent enters the field of view of the main camera. Once the target is in the field of view, the locking algorithm and the tracking algorithm are activated in turn, the easy target is locked through the image, and the ammunition is released. In order for swarm-armed UCAVs to fulfill this task, an appropriate control algorithm must be used. This control algorithm should be able to provide simultaneous assignment, collision avoidance, and determination of the formation points.

The formation flight is the positioning sequence created by the UCAVs while following the guide UCAV according to the flight formation parameter transmitted to the UCAVs by the ground station after taking off from the airbase. There are two general formations: arrowhead and rectangular prism. The transition to the arrowhead and rectangular prism formations is achieved by reaching the position of the guide UCAV and creating as many coordinate frames as the number of UCAVs, after which the UCAVs move to those coordinates. From the moment of take-off to the target area, the heading angle of all UCAVs is in the same direction as the heading angle of the guide UCAV. While the UCAVs are following the guide UCAV, if they enter an area where GPS disturbance is experienced, the UCAVs continue until the end of the GPS disturbance by equalizing their speed and direction to minimize the risk of collision.

The guide takes the information that the UCAVs should make a formation flight, depending on UCAV tracking, from the def.ucav_com[‘guide_ucav’][‘dispatch’] value. The dispatch value is True when the guide UCAV is taking off from the airbase and entering the mission area. When flying from the airbase to the duty area, this value is False. This is checked every time the dispatch value changes with the condition stated in Algorithm 1.


Algorithm 1. Sending formation information to the guide UAV.

if def.ucav_com['guide_ucav']['dispatch'] != def.Dispatch:
    def.Dispatch = not (def.Dispatch)
    def.DispatchI = def.DispatchI + 1

In cases where the DispatchI value is 0 or 1, formation flights are made by the UCAVs. If this value is 2, the UCAVs are in the duty area. At the time of departure from the airbase, each of the UCAVs is at a different distance from the guide UCAV, or those at an equal distance are on opposite sides. Initially, all UCAVs are given generic values of arrowhead_array and prism_array according to their distance from the position of the guide UCAV. Before assigning these values to the UCAVs, the current position of the guide UCAV is coded as in Algorithm 2, on the same axis as the UCAV with ID number 1 (UCAV-1) in Fig. 1.

Fig. 1 Airbase departure sequence


Algorithm 2. Equalization of the coordinates of the guide and UCAV-1.

mid_ucav = ucav_statements.getAltCrd(all_ucav_positions)
guide_position = (guide_position[0], mid_ucav[1])

Coordinates of all UCAVs within range are sent as a parameter to the getAltCrd() function. This function returns the coordinates of the UCAV-1 in Fig. 1 based on the sequences of the UCAVs. Then, the coordinate of the y-axis of the position of the guide UCAV is considered to be equal to the position of the UCAV-1. Thus, the distances of the UCAVs arrayed in two directions to the guide UCAV can be scaled properly.
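
The body of getAltCrd() is not listed in the chapter; the fragment below is only a hypothetical stand-in for the alignment step described above, assuming the UCAV positions are stored in a dictionary keyed by ID.

# Hypothetical sketch of the alignment step (not the chapter's implementation).
def getAltCrd(all_ucav_positions):
    # all_ucav_positions: dict {ucav_id: (x, y)}; UCAV-1 is the vehicle with ID 1 in Fig. 1.
    return all_ucav_positions[1]

mid_ucav = getAltCrd(all_ucav_positions)
# The guide keeps its own x-coordinate but adopts the y-coordinate of UCAV-1,
# so the distances along the two take-off wings can be scaled consistently.
guide_position = (guide_position[0], mid_ucav[1])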

3.1 Arrowhead Formation

In the arrowhead formation, after the swarm UCAVs take off from the airbase in the arrangement shown in Fig. 2, the UCAV closest to the guide UCAV is positioned d1 units behind the guide UCAV in the horizontal coordinate. The two UCAVs closest to this UCAV are positioned either to the right or to the left, at an angle of θ and with a distance of d2 between them. Each UCAV performs the same operation in pairs according to the UCAV in front of it and its position, and the arrowhead formation is created as in Fig. 3. After coding that the y-axis coordinate of the guide UCAV is equal to that of UCAV-1 in Fig. 1, the distance of all UCAVs to the guide UCAV is calculated as shown in Algorithm 3 and added to an array.

Fig. 2 Swarm UCAVs take-off formation


Fig. 3 Swarm UCAVs arrowhead formation

Algorithm 3. Determining the distances of swarm members to the guide.

for m in range(0, abucav[1]):
    ucav_guide_distances.append([util.dist(guide_position, all_ucavs_positions[m]), abucav[0][m]])

Then, the sequence of distances is sorted according to proximity to the guide, from smallest to largest, through the guide_ucav_distance.sort() function, and a new sequence is created by taking the UCAV ID numbers from the sorted values. In this way, the sequence [1–9] is formed and returned according to the proximity of the arrangement in Fig. 1 to the guide UCAV. As a result of looping over the array coming from the function, the index of the UCAV's ID number is determined as the UCAV's arrowhead_array number, as shown in Algorithm 4.
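
Before the loop of Algorithm 4, the sorting and ID-extraction step described above could look like the following sketch; the [distance, ID] pairs come from Algorithm 3, and the variable names are assumptions.

# Sort the [distance, ucav_id] pairs by distance to the guide (ascending) ...
ucav_guide_distances.sort()
# ... and keep only the UCAV IDs, which yields the proximity ordering used below.
ucavs_sorted_by_proximity = [pair[1] for pair in ucav_guide_distances]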


Algorithm 4. Determination of the places of individuals in the arrowhead formation.

for m in range(0, ucavs[1]):
    if (def.arrowhead_ucav_tg_array[ucavs[0][m]]) == (def.ucav_tg):
        def.arrowhead_array = ucavs[0][m]

The arrowhead_array values, which are the red numbers of the UCAVs in Fig. 1, are the generic sequence numbers of the UCAVs. After taking the arrowhead_array value for the formation pattern, the position is produced for each UCAV at the distances of d1 and d2 and an angle of θ to the guide UCAV and other UCAVs. The UCAV with the arrowhead_array value of 1 just behind the guide UCAV must be d1 units behind the guide UCAV. For this, an if condition, as shown in Algorithm 5 below, should be added only for the UCAV-1. Thus, the position of UCAV-1 in the arrowhead formation is determined, and the other swarm members in the formation take their places according to this vehicle.

Algorithm 5. Calculation of the position of UCAV-1 in the arrowhead formation.

crd = [x, y, z]
if m == 1:
    crd[0] = gucavx + int(sin(theta) * (-1 * d1))
    crd[1] = gucavy - int(cos(theta) * (-1 * d1))
    crd[2] = def.ucav_com['guide_ucav']['altitude']
return crd

If the value of arrowhead_array is ‘1’, this formulation is applied directly to the position of the guide UCAV, and the position d1 units behind it is calculated regardless of the heading angle. Algorithm 6 is used for all arrowhead_array values except ‘1’, i.e., for the UCAVs other than UCAV-1.


Algorithm 6. Calculation of the positions of other individuals in the arrowhead formation.

if i % 2 == 0:
    dir = 1
else:
    dir = -1
costheta = cos(theta)
sintheta = sin(theta)
d1x = int(sin(theta) * (-1 * d1))
d1y = int(cos(theta) * (-1 * d1))
x = int(-1 * (sin(theta) * int(i / 2) * d2 * costheta))
y = int(-1 * (sin(theta) * int(i / 2) * d2 * sintheta * dir))
crd = [0, 0, 0]
crd[0] = gucavx + x + d1x
crd[1] = gucavy + y - d1y
crd[2] = def.ucav_com['guide_ucav']['altitude']
return crd

In the algorithm, the direction value is determined first. In Fig. 1, the UCAVs are located along the two wings on the left and right edges of the guide UCAV. These UCAVs take their places on the nearest wing while the formation is being formed. When the arrowhead_array values of these UCAVs are examined, they increase alternately from the right wing to the left wing: even numbers continue on the right wing, and odd numbers continue on the left wing. The direction value in the algorithm is used to determine which wing the UCAV is on in the formation. Then, the cosine and sine values of the heading angle are calculated, and the angle values of the formation are prepared to produce the position. In addition, the calculated d1x and d1y values are used to reflect the distance between the guide UCAV and the UCAV behind it to the other UCAVs. Besides, the calculated x and y values are used to find out how many units the UCAV is behind, and how many units to the right or left of, the guide UCAV. Finally, when the d1 distance and the x and y variables are subtracted from or added to the x and y coordinates of the guide UCAV, the instantaneous position in the arrowhead formation of the UCAV with the given arrowhead_array value is determined.


Since the position of the guide UCAV changes during the formation flight, the instantaneous target position of each UCAV is calculated from this function. Initially, until the guide UCAV gains altitude, all other UCAVs perform a level flight within the fetch_guide() function by turning their heading angles towards the position calculated relative to the guide UCAV. When they arrive at their required positions, they make the same flight as the guide UCAV, in accordance with the parallel_fly_to_guide() function, by equating their heading angles, speeds, and altitudes with the values of the guide.
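
Neither fetch_guide() nor parallel_fly_to_guide() is listed in the chapter. The sketch below only illustrates their intent, reusing the trans_fly_ord(speed_x, speed_y, heading, altitude) command seen in the other algorithms; it is written with a def_ parameter instead of the chapter's def prefix so that the sketch is valid Python, and the heading computation is an assumption.

from math import atan2, degrees

def fetch_guide(def_, slot_crd):
    # Level flight toward the formation slot computed for this UCAV.
    dx = slot_crd[0] - def_.x
    dy = slot_crd[1] - def_.y
    heading_to_slot = degrees(atan2(dx, dy)) % 360      # compass-style heading
    def_.trans_fly_ord(0, 0, heading_to_slot, def_.altitude)

def parallel_fly_to_guide(def_):
    # Once the slot is reached, copy the guide's speed, heading, and altitude.
    guide = def_.ucav_com['guide_ucav']
    def_.trans_fly_ord(guide['speed']['x'], guide['speed']['y'],
                       guide['heading'], guide['altitude'])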

3.2 Rectangular Prism Formation

In the rectangular prism formation, after the UCAVs take off from the airbase in the arrangement shown in Fig. 2, the UCAV that first reaches the d1 distance takes its position while the rectangular prism is being formed. The other UCAVs are then arranged in pairs, up or down in the vertical coordinate, according to their proximity to the UCAV in front. Then, the UCAVs lined up and down in pairs are brought to the same horizontal coordinate with a distance of d2 between neighboring individuals, forming a rectangular prism formation as in Fig. 4.

While transitioning between formations, point position determination is performed as in the arrowhead and rectangular formations. When ucav_formation_type, which is the command to switch between formations, is received from the ground station, UCAVs in a certain formation determine the closest point to them and share the information that the location of the destination point is occupied, which prevents other UCAVs from going to the same point. Thus, a transition between formations is ensured. After coding that the y-axis coordinate of the guide UCAV is equal to that of UCAV-8 in Fig. 1, a ranking is formed for the arrowhead formation according to the proximity

Fig. 4 Swarm UCAVs rectangular prism formation


Fig. 5 Rectangular prism formation order

to the guide UCAV. While the sequence obtained in Fig. 1 is [8, 0, 4, 1, 5, 2, 6, 3, 7], sequence [8, 5, 1, 4, 0, 7, 3, 6, 2] is obtained when the algorithm applies the necessary operation to this sequence with the formPrismPositionSort() function. When this sequence is used for the prism, the desired order for the formation in Fig. 5 is obtained. As in the arrowhead formation, the index of the UCAV’s ID, by looping through the array coming from the function, becomes the prism_array number of the UCAV as given in Algorithm 7.

Algorithm 7. Assigning the indices of UCAVs in the prism formation.

for m in range(0, ucavs[1]):
    if (def.prism_ucav_tg_array[ucavs[0][m]]) == (def.ucav_tg):
        def.prism_array = ucavs[0][m]

As in the arrowhead formation, the UCAV with the prism_array value of ‘1’ just behind the guide UCAV is d1 units behind the guide UCAV in the prism formation. An if condition is added only for the value ‘1’ as indicated in Algorithm 8 below.


Algorithm 8. Determining the location of UCAV-1 in the prism formation.

crd = [x, y, z]
if m == 1:
    crd[0] = gucavx + (sin(theta) * (-1 * d1))
    crd[1] = gucavy - int(cos(theta) * (-1 * d1))
    crd[2] = def.ucav_com['guide_ucav']['altitude']
return crd

If the prism_array value is ‘1’, this formulation is applied directly to the position of the guide UCAV to calculate the position d1 units behind it, regardless of the heading angle. For all prism_array values except ‘1’, Algorithm 9 is executed.

Algorithm 9. Determination of the positions of other individuals in the prism formation.

# Note: the printed listing mixes the indices i and m and uses an undefined divisor
# in the first condition; 'm' and '% 2' are assumed here.
if m % 2 == 0:
    dir = -1
else:
    dir = 1
ucavs_order = ((m - 1) / 2) + 1
d1x = int(sin(theta) * (-1 * d1))
d1y = int(cos(theta) * (-1 * d1))
x = int(-1 * (sin(theta) * d2 * ucavs_order))
y = int(-1 * (sin(theta) * (d2 / 2) * dir))
crd = [0, 0, 0]
crd[0] = gucavx + x + d1x
crd[1] = gucavy + y - d1y
while m >= 3:
    m = m - 2
if (m == 1) or (m == 2):
    crd[2] = def.ucav_com['guide_ucav']['altitude'] + (d2 / 2)
elif (m == 4) or (m == 1):
    crd[2] = def.ucav_com['guide_ucav']['altitude'] - (d2 / 2)
return crd


In this algorithm, the direction value is determined first. In Fig. 5, the UCAVs are located along the two wings on the left and right sides of the guide UCAV. The UCAVs on these wings take their places on the nearest wing while forming the prism. The direction value in the algorithm determines on which wing the UCAV will take its place in the formation. In the prism formation, the UCAVs are then lined up as four individuals on the same x-axis. Which of these four positions the UCAV should occupy is calculated with the value of ucavs_order. The d1x and d1y values are calculated to reflect the distance between the guide UCAV and the UCAV behind it to the other UCAVs. The x and y values calculated afterwards are used to find out how many units the UCAV is behind, and how many units to the right or left of, the guide UCAV. Afterwards, when the d1 distance and the left/right offsets x and y are subtracted from or added to the x and y coordinates of the guide UCAV, the instantaneous position in the prism formation of the UCAV with the given prism_array value is obtained. Finally, to calculate the altitudes of the UCAVs, the algorithm places two individuals below and two above, and the resulting coordinate is returned.

Since the position of the guide UCAV changes during the formation flight, this function constantly calculates the position where each UCAV should be. Initially, until the guide UCAV gains altitude, all other UCAVs perform a level flight within the fetch_guide() function by turning their heading angle towards the position calculated relative to the guide UCAV. When they arrive at their required positions, they perform the same flight as the guide UCAV by equating their heading angles, speeds, and altitudes with the values of the guide, obeying the parallel_fly_to_guide() function.

In addition, the guide UCAV leaves the swarm after the swarm arrives at its mission area. Then, the UCAVs perform their duties together, with the advantages of being a swarm, by forming the diagonal formation shown in Fig. 6. In the diagonal area scanning formation, the distance between individuals is such that the communication ranges of the UCAVs at least intersect and/or the diameter of the area scanned by a UCAV is equal to the length of the area. The purpose of this arrangement is to detect the enemy faster by scanning the maximum area and to minimize the risk of collision that may be encountered as a result of maneuvers performed in the area.

In order to calculate the required distances of the UCAVs relative to the guide UCAV and to each other before and during the formation flight on task, the coordinate at a given distance from the UCAV must be calculated. For example, during the formation flight, the position of the UCAV behind the guide UCAV should be d1 units behind along the heading angle of the guide UCAV. Additionally, while operating in the task area, the UCAVs should check whether there is a prohibited zone in the areas in front of them or whether there is a risk of collision with another UCAV on the route. In such cases, the UCAV needs to calculate the coordinates to go to according to the heading angle.

During the transition between formations, it is necessary to check the def.ucav_com[‘ucav_formation’][‘type’] data during the formation flight and know whether the flight formation is an arrowhead or a prism. In any formation flight change, the


Fig. 6 Diagonal field scanning formation

UCAVs move toward the position they should occupy in the other formation. In this way, formation distortions of all kinds are minimized as far as possible. In areas where the GPS signal is cut off, the def.ucav_com[‘guide_ucav’][‘gps_noise_flag’] data is checked during the formation flight to determine the availability of the GPS service. From the moment the GPS service is interrupted, a movement command is sent to the UCAVs in parallel with the speed, heading angle, and altitude values of the guide UCAV, as given in Algorithm 10.

Algorithm 10. Checking GPS data in corrupted GPS area.

print('State : GPS corrupted area')
def.trans_fly_ord(def.ucav_com['guide_ucav']['speed']['x'],
                  def.ucav_com['guide_ucav']['speed']['y'],
                  def.ucav_com['guide_ucav']['heading'],
                  def.ucav_com['guide_ucav']['altitude'])


When the formation flight change information is received in the region where the GPS signal is problematic, the flight is made only by gaining the necessary altitude for the other formation without passing to that formation until this region is over. After passing this region, the transition to the other formation order is provided.
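
A minimal sketch of this deferral logic is given below. It assumes hypothetical helpers (formation_altitude() and switch_to_formation()) and a pending_formation attribute; the chapter does not show the corresponding code, and def_ is used instead of the chapter's def prefix to keep the sketch valid Python.

def handle_formation_change(def_, new_formation):
    if def_.ucav_com['guide_ucav']['gps_noise_flag']:
        # GPS-denied region: only adopt the altitude required by the new formation,
        # keeping speed and heading matched to the guide (see Algorithm 10).
        def_.pending_formation = new_formation
        guide = def_.ucav_com['guide_ucav']
        def_.trans_fly_ord(guide['speed']['x'], guide['speed']['y'],
                           guide['heading'],
                           formation_altitude(def_, new_formation))   # hypothetical helper
    else:
        switch_to_formation(def_, new_formation)                      # hypothetical helper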

3.3 Simulation Analysis

A series of simulations were performed to measure how consistently the determined swarm formations could be achieved. A total of 10 identical fixed-wing UCAVs, one of which was the guide UCAV, were used for both the arrowhead and rectangular prism formation simulations. The algorithm that created the formation was programmed in Python, and the simulations were performed on the Visual Studio 16.0 platform. The simulation results showing the development of the arrowhead and rectangular prism formations by UCAVs taking off from the airbase are shown in Figs. 7 and 8, respectively.

In Fig. 7, it can be seen that the swarm members that take off in the take-off formation act in coordination and begin to form the arrowhead formation after t = 2, as commanded. The red dashed lines in the frame corresponding to t = 5.04 represent the formation pattern that is expected to be formed. Although there are minor position errors, it can be seen that the formation has been completed successfully. On the other hand, in Fig. 8, it can be seen that the members of the swarm that take off in the take-off formation rapidly begin to form the rectangular prism formation at t = 2. The UCAVs shown with blue squares are located at the bottom of the prism. It can be seen that the formation is mostly completed at t = 4.38, with errors

Fig. 7 Arrowhead formation simulation


Fig. 8 Rectangular prism formation simulation

similar to those of the arrowhead formation. It should be noted that at t = 4.38, the distance between the UCAVs other than the guide and UCAV-1, that is, the UCAVs forming the prism, is approximately 2 m in both the x-direction and the y-direction; however, this is easy to overlook since the axis scaling differs. The variation of the position error of each UCAV in the formation relative to the guide UCAV is shown in Fig. 9. In other words, the position of the guide UCAV was assumed to be absolutely correct, and based on the position of the guide, it was determined how much the other UCAVs deviated from the positions they should have occupied in the formation. While the average formation error is 0.601 at t = 10 in the arrowhead formation, this error decreases over time and stabilizes after approximately t = 35. The average error after t = 35 is about 0.115 for the rest of the simulation. The UCAVs shown with dashed-dot lines represent the swarm members on the upper (left) wing. In the rectangular prism formation, the average formation error at t = 10 is 0.817, and it remains at approximately this level throughout the simulation. Here, the UCAVs denoted by dashed-dot lines represent the swarm members at the bottom layer of the prism. The variation of the distances between neighboring UCAVs in the swarm over time is shown in Fig. 10. In the arrowhead formation, the distances between the UCAVs, which are set to 0.5 m before take-off, increase immediately after take-off. At t = 5, when they start to form the full formation, the distances between neighboring UCAVs vary between 2.07 m and 3.63 m. The UCAVs, which try to correct their positions in the next five seconds, take their best positions at t = 12.41, and the distances between them change within a small range from this moment on. After t = 13, the average distance between UCAVs throughout the simulation is 3.01 m, which coincides with the set distance of 3 m between UCAVs in the arrowhead formation.


Fig. 9 Formation errors of UCAVs

Fig. 10 Distance between neighbor UCAVs

In the rectangular prism formation, the distances between the UCAVs immediately after take-off are relatively greater than in the arrowhead formation. The reason for this is that the UCAVs to be located at the bottom layer of the prism fly at different altitudes after take-off. The UCAVs, which started to form their formation, reached their desired positions in the formation with minor errors right after t = 5. After this moment, the distances between neighboring swarm members do not change significantly. The desired distances between neighboring UCAVs are as follows: the distance between the guide and UCAV-1 is 2 m; the distance between UCAV-1 and the four UCAVs (UCAV-2, UCAV-5, UCAV-6, and UCAV-9) located on the yz plane behind it is 2.25 m; and the distance between the UCAVs forming the prism (UCAV-2 to UCAV-9) was set to 2 m. As can be clearly seen from the figure, all desired values were obtained with small errors. The required 2.25-m distance between UCAV-1 and the UCAVs in the yz plane is distinctly separated from the others by dashed-dot lines.
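
The formation error of Fig. 9 is a per-UCAV deviation from the nominal slot relative to the guide; a sketch of how such a metric could be computed is shown below. The desired_slot() helper stands for the position formulas of Algorithms 5, 6, 8, and 9 and is an assumption, not the chapter's code.

import math

def formation_error(ucav_positions, guide_position, desired_slot):
    # ucav_positions: dict {ucav_id: (x, y)}; desired_slot(ucav_id, guide_position)
    # returns the slot this UCAV should occupy for the current guide position.
    errors = {}
    for ucav_id, (x, y) in ucav_positions.items():
        sx, sy = desired_slot(ucav_id, guide_position)
        errors[ucav_id] = math.hypot(x - sx, y - sy)
    mean_error = sum(errors.values()) / len(errors)
    return errors, mean_error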


4 Mission Execution

4.1 Mission Initiation Module

When the task starts, the getInitAct() function is executed, which will run only once inside the act() function, as shown in Algorithm 11.

Algorithm 11. Mission initiation.

# Control runs once.
if def.init_cont:
    def.getInitAct()
    def.init_cont = False

The generic ordering of arrowhead_array and prism_array for the arrowhead and rectangular prism formations is also obtained in Algorithm 12, where initial operations are performed.

Algorithm 12. Formation order.

# The x, y lengths of the task area to be randomly searched are drawn
# from the 'global_limits' parameter.
glb_lmts = def.prmt['global_limits']
x_axis = []
y_axis = []
for m in range(0, len(glb_lmts)):
    x_axis.append(int(glb_lmts[m][0]))
    y_axis.append(int(glb_lmts[m][1]))
def.taskareax = min(x_axis)
def.taskareay = max(y_axis)


The lengths of the mission area to be searched by the UCAVs are also assigned to variables as in Algorithm 13. Thus, the data does not need to be received from the ground station again while searching the area; it is sufficient to do this once. Polygon areas are created by taking into account the forbidden areas and the corner points of tall buildings. These polygon areas are continuously checked in the task area while the flight is carried out.

Algorithm 13. Scanning the mission area and detecting no-fly zones.

spef_obj = def.prmt['specific_objects']
spef_obj_no = int(len(spef_obj))
for m in range(0, spef_obj_no):
    if spef_obj[m]['show'] == 'tall_obj':
        tall_obj_loc = spef_obj[m]['positions']
        tall_obj_no = int(len(tall_obj_loc))
        for n in range(0, tall_obj_no):
            def.tall_obj_bound.append(
                ([ucav_cnd.getObjLmt(spef_obj[m]['positions'][n],
                                     spef_obj[m]['size_x'])],
                 [spef_obj[m]['size_y']]))
nofly_areas = def.prmt['nofly_area']
nofly_area_no = int(len(nofly_areas))
# in case there is more than one nofly zone
for m in range(0, nofly_area_no):
    nofly_area = nofly_areas[m]
    nofly_area_bound = [
        (nofly_area[0][0], nofly_area[0][1]),
        (nofly_area[1][0], nofly_area[1][1]),
        (nofly_area[2][0], nofly_area[2][1]),
        (nofly_area[3][0], nofly_area[3][1])]
    def.nofly_area.append(nofly_area_bound)


4.2 Route Planning

Within the scope of the mission requirements, UCAVs are required to create routes multiple times. Path planning algorithms are one of the important fundamental parts of autonomous vehicles. The main purpose of path planning algorithms is to create a task-oriented geometric route from the current point to the target point, taking into account the obstacles in the world in which the robotic platform is located. In fact, the important thing is to be compatible with the speed of the process that will follow the route to the targeted point. The trend in the field of autonomous systems, where path planning algorithms are used extensively, is to perform the operations in an optimal way in a short time, and advanced path planning algorithms have an important role in achieving this. High processing speeds can hinder accuracy and repeatability due to the extreme performance required from mechanical systems and controllers. Within the framework of the capabilities of the machine on which the algorithm will run, attention should be paid to determining the route quickly and in a way that will not harm the system. For this reason, the importance of accelerated path planning algorithms increases with the available processing power.

When the UCAVs complete the formation flight with the guide UCAV and enter the task area, the search process begins as expressed in Algorithm 14, with the dispatch value set to True.

Algorithm 14. Starting the search process in the task area.

if def.area_init_arrv == 0:
    # Speeds are reset when you first enter the area.
    def.trans_fly_ord(0, 0, def.size[3], def.size[2])
    print('Entry into the area, slowing down.')
    if int(def.ucav_com['active_ucav']['x_speed']) < 2:
        def.area_init_arrv = 1
        print('The area has been entered.')
else:
    def.search()

When UCAVs enter the task area, their speed is reset by controlling the flag variable. After resetting the speeds, all UCAVs determine a random target location according to their positions, as in Algorithm 15.


Algorithm 15. Random targeting in the mission area.

def getrandomLocationOutsideDeniedZones(def):
    while True:
        rndPos = (
            rnd.rndint(def.mission_area_x, def.mission_area_y),
            rnd.rndint(def.mission_area_x, def.mission_area_y)
        )
        counter = 0
        for m in range(0, len(def.nofly_zones)):
            i_o = ucav_sts.coordinateinPolygon(rndPos, def.nofly_zones[m])
            if i_o:
                counter = counter + 1
                break
        if counter == 0:
            break
    return rndPos

While the UCAVs determine random locations, random numbers are generated according to the width and length of the task area obtained in the initialization module, and it is checked whether the generated random location falls within the prohibited zones. After determining random positions, the UCAVs with reset speeds make a straight flight by turning their heading angles toward the targets. The reason why their speed is reset at the first entry into the task area is that the risk of collision is high when they enter as a formation; the best way to reduce this risk is to have them move to random targets after resetting their speed. Even though the collision risks are reduced, each UCAV still carries out collision control. While the UCAVs are turning their heading angles towards the targets, it is checked whether the heading angles point towards the target by means of the headingcontrol() function, taking into account that, on a 360-degree compass, a heading of 90 degrees is equivalent to −270 degrees. If the UCAVs cannot detect the enemy while scanning the area, they continuously fly to random target points, as indicated in Algorithm 16.


Algorithm 16. Flying to random target points (when the enemy cannot be detected).

if not ucav_sts.headingcontrol(int(def.ucav_com['test_ucav']['direction']),
                               int(opponent_heading)):
    def.trans_fly_ord(0, 0, opponent_heading, def.size[2])

After the swarm of UCAVs creates the diagonal formation, they continue to scan the task area along the plane toward which their sensors are directed until an obstacle or prohibited area appears in front of them. While scanning, the UCAVs in the swarm instantly assess the situation regarding obstacles or prohibited areas. The UCAV located at the midpoint of the formation monitors the instantaneous status of the swarm and determines rotation decisions. By keeping in memory the coordinates of the route followed over a certain time, the UCAV at the midpoint prevents itself from turning back toward where it came from and decides which way to turn. In case the swarm encounters a single enemy, the formation is protected by separating the farthest UCAV from the swarm without disturbing the swarm alignment. Using the corner-point coordinates of the no-fly zones given by the ground station at the beginning of the task scenario, the entire prohibited zone is excluded from the searchable task area. The UCAV in the middle of the diagonal formation avoids this prohibited zone and directs the swarm by making rotation decisions. As shown in Fig. 11, this determination is made for the prohibited zones using the Shapely library in Python, and rotation decisions are made accordingly. If any tall structure that the UCAVs must avoid hitting lies on the route of the swarm, the UCAV in the middle of the swarm determines whether the swarm should turn left or right or whether the UCAVs in the swarm should increase their altitude, based on information such as the number of UCAVs in the swarm, the height of the structure to be avoided, the coordinates of the swarm, and the coordinates of the scanned areas. In this way, the swarm avoids tall structures with minimum risk and can continue to scan the area. Initially, the coordinates of the restricted areas and tall structures are stored in arrays as expressed in Algorithm 17.
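The Shapely-based containment test mentioned above typically follows the pattern sketched here; the corner coordinates and the projected position are hypothetical values used only to illustrate the call.

from shapely.geometry import Point, Polygon

# Corner points of one no-fly zone as reported by the ground station (illustrative values).
nofly_zone = Polygon([(100, 100), (100, 200), (200, 200), (200, 100)])

# Projected position of the swarm midpoint a few steps ahead on its route.
projected_position = Point(150, 170)

# True when the projected position falls inside the prohibited zone,
# in which case a rotation (or climb) decision is required.
if nofly_zone.contains(projected_position):
    print('Projected route enters a no-fly zone; a turn is required.')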


Fig. 11 Illustration of roadmap algorithm

Algorithm 17. Notification of restricted areas.

if ucav_sts.areNoflyZonesEntered(active_ucav,
                                 int(def.ucav_com['test_ucav']['direction']),
                                 0, 30, def.NoflyZonesPolygons):
    print('There may be entry to the denied zone, the target location is being changed.')
    self.change_target = 1

The areNoflyZonesEntered() function runs and is checked while the UCAVs move toward the targets they randomly created while scanning the area. The function checks whether, at the coordinates toward which the UCAV is heading, any of the prohibited areas would be entered within a distance of 30 units in front of it. If a forbidden area would be entered while navigating to a random destination, the target location is changed, thereby avoiding the forbidden area. Collision checks against tall structures are carried out with the areTallStructuresEntered() function, in the same way as the check for prohibited areas. Tall structures, like prohibited zones, are avoided by changing the target while traveling to a random destination. In addition, when the target changes more than ten times in a row, the altitude of the UCAV is increased, on the assumption that its room for maneuver has become restricted. Thus, the UCAV gains altitude and escapes the congested region.
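A minimal sketch of this escape logic is given below. The ten-change threshold follows the text, while the state keys and the altitude step are assumptions made purely for illustration.

# Sketch of the stuck-escape logic; state keys and the altitude step are assumed.
def handle_blocked_target(state, altitude_step=50):
    # Called each time the random target must be changed because the projected
    # path would enter a no-fly zone or hit a tall structure.
    state['target_changes'] += 1
    if state['target_changes'] > 10:
        # More than ten consecutive target changes suggest the UCAV is boxed in,
        # so it climbs to escape the congested region.
        state['altitude'] += altitude_step
        state['target_changes'] = 0

ucav_state = {'target_changes': 0, 'altitude': 300}
for _ in range(11):
    handle_blocked_target(ucav_state)
print(ucav_state['altitude'])  # increased once, after the eleventh consecutive change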


5 Autonomous Lock-On Target Tracking

The primary task expected from UCAVs is to achieve the autonomous target-locking task. In the first stage of this task, it should be determined how to approach the target from the desired angle and how to bring the target into the field of view (FOV). For the second stage of the task, the problems of how to detect the target and how to use the information of the target detected in the image should be resolved.

5.1 Determining the Target and the Route

In order to achieve autonomous locking, each individual UCAV must select one of the rival UCAVs as a target. While making this choice, waiting for a UCAV to enter the FOV and then locking onto it is not a sound approach, because the movements of a target UCAV entering the FOV may not be suitable for locking, or no UCAV may enter the field of view at all. For this reason, in the search phase, before entering the locked state, no search is performed on the image from the UCAV's camera. Instead, a guidance algorithm is created using the location, speed, and orientation information from the server to approach the target. The primary feature expected from the guidance algorithm is that it enables the UCAV to follow the route of the selected target. To meet this requirement, the impact vector guidance (IVG) algorithm, which can hit the target at the desired angle, can be used. In such algorithms, the trajectory to be followed is determined by two parameters, namely the arrival time and the arrival angle. Methods concerning arrival time are divided into two groups: algorithms that estimate the arrival time [54–56] and those that do not [57, 58]. Arrival time estimation algorithms are widely used; however, they have the disadvantage that estimation errors directly affect performance. On the other hand, algorithms that do not estimate the time of arrival are still under development. Methods based on the angle of arrival are relatively mature and are therefore highly preferred. Among these methods are models based on proportional navigation [59], optimal control [60], and sliding mode control [61]. Since the issue in this study is the approach of a UCAV to an enemy aircraft and engaging in a dogfight when necessary, it is more beneficial to choose a method based on the angle of arrival, such as the IVG [62]: because the aim is to get close enough to the target and destroy it, approaching the target at an appropriate angle provides an important advantage. The IVG generates acceleration vectors in three axes by using the instantaneous position vectors obtained. Since this algorithm is used to approach the target UCAV, noise in the data received from the server is not a critical problem, because the main goal is not to reach the target exactly but to approach it to an acceptable level and bring it into view. For this reason, the position, speed, and orientation data of the target UCAV received from the server are transmitted to the IVG as input, and acceleration outputs are thus obtained in all three axes.


In order to meet the calculated acceleration values, the required amount of thrust and the required changes in the elevator, aileron, and rudder angles are calculated and applied by the autopilot. Since IVG is a missile guidance algorithm, using it directly could lead to a collision with the target UCAV. For this reason, the IVG is used to approach the target UCAV at the desired angle rather than to lock onto the target itself. Even so, a collision with the target UCAV remains possible during the approach. Therefore, as shown in Fig. 12, a virtual aircraft is defined that moves a certain distance behind the target and in the same direction as the target. Operating the IVG with the position and velocity data of this virtual target instead of the actual target both ensures successful entry into the route and prevents possible collisions. However, after reaching the virtual target with the IVG, the UCAV may overshoot it and collide with the real target. For this reason, after entering the route, another algorithm that keeps the distance from the target and follows it should be used instead of IVG. For this purpose, the Lyapunov Vector Field Guidance (LVFG) algorithm, described in the following sections, is used. Moreover, after the target UCAV enters the camera view, the information obtained from the ground station is insufficient to track it: while knowing only the route of the target UCAV is sufficient for the approach movement, locking on requires knowing all of its movements with high accuracy. For this reason, after the target enters the UCAV camera image, the movements of the opponent are extracted by image processing techniques.
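One simple way to place such a virtual aircraft, assuming the target's position and heading are available from the server, is sketched below. The 80-unit standoff distance is an illustrative assumption, and the heading convention follows the one used in the collision-check code later in the chapter (x advances with sin, y with −cos).

from math import sin, cos, radians

# Sketch: position of a virtual target placed a fixed distance behind the real
# target, opposite to its direction of travel. The standoff distance is assumed.
def virtual_target_position(target_x, target_y, target_heading_deg, standoff=80):
    heading = radians(target_heading_deg)
    # Step backwards along the target's heading (forward x = +sin, forward y = -cos).
    virtual_x = target_x - standoff * sin(heading)
    virtual_y = target_y + standoff * cos(heading)
    return virtual_x, virtual_y

# The IVG is then fed the virtual target's coordinates instead of the real ones.
print(virtual_target_position(1000.0, 2000.0, 45.0))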

5.2 Target Image Processing

This section is examined under two sub-headings: target detection in the image, and detection of the relative motion of the target detected in the image. In the first part, it is discussed how to detect the target UCAV in the image using the You Only Look Once (YOLO) and Kernelized Correlation Filter (KCF) algorithms. In the second part, it is explained how the relative position and speed of the detected target UCAV are found.

5.2.1 Target Detection

Most object detection algorithms with a high success rate are not fast enough to work in a real-time system because they require too much processing. For this reason, two basic requirements, algorithm speed and accuracy, were taken into consideration while selecting the algorithm to be used for image processing and opponent UCAV detection. In this context, three different algorithms that stand out in their field were tested. The first of these is the YOLO algorithm, which is used in many image-processing applications [63]. It is an algorithm for fast object detection using convolutional neural networks.

Fig. 12 Approaching the target with the IVG algorithm

What makes this algorithm faster than its counterparts is that it passes the entire image through a neural network at once. It divides the input image into n × n grids, and each grid determines whether there is an object in it. When a grid decides that an object's center point lies within it, it finds the class, height, and width of the detected object and surrounds it with a bounding box. The Faster Region-Based Convolutional Neural Network (Faster R-CNN) is another deep learning algorithm used in image processing [64]. Similar to YOLO, it finds classes of objects in images by creating bounding boxes. The image is divided into a certain number of grids, and a CNN is applied to each obtained region in turn. The other image processing algorithm tested is the Single Shot Detector (SSD). The SSD is an algorithm that identifies objects from feature maps in different layers [65]. It has been reported that although it has a lower performance rate than the Faster R-CNN algorithm, it produces faster results. To select the most suitable of these three object detection algorithms, a comparison was made within the framework of the two key requirements, namely speed and consistency. A data library containing a large number of UAV images was created to be used in the training of the detection algorithms. This data set was diversified as


much as possible, and it was tried to cover all situations that the UCAV might encounter during its mission. For this purpose, UAV photos containing all kinds of weather conditions were added to the dataset. Moreover, taking into account different light levels, aircraft photos taken at sunrise and sunset were also added. In addition to aerial photographs and video frames, images taken from the ground were also included in the data set. The performance of each method in algorithm training was measured with the commonly used precision, recall, and mean average precision (mAP) parameters. Precision is a measure of the number of true detections among the total number of detections made and is expressed as

Precision = TP / (TP + FP)    (1)

where TP indicates true positive, meaning that the object, i.e., a UAV, present in the image has been detected. FP indicates a false positive, i.e., the detection of an object not in the image. Recall is a parameter that shows how much of the total number of objects is detected in the images given as input to the algorithm and is expressed as

Recall = TP / (TP + FN)    (2)

FN here indicates a false negative and refers to the number of objects in the image that could not be detected by the algorithm. mAP is expressed as the mean precision over all classes defined in the training set and is given as

mAP = (1/N) \sum_{i=1}^{N} AP_i    (3)

where N is the total number of classes, and AP is the average precision. The area under a precision-recall curve plotted from the obtained precision and recall values gives the AP value. The performance parameter scores obtained as a result of the training tests are compiled in Fig. 13. As can be seen, YOLOv4 clearly outperforms the other detection algorithms in terms of precision, recall, and mAP values. In addition, YOLOv4's missed detection rate is 11% and 27% lower than those of Faster R-CNN and SSD, respectively. Likewise, the false detection rate of YOLOv4 is 20% and 29% lower than those of Faster R-CNN and SSD, respectively. Moreover, the average detection speed of YOLOv4 is 1.45 times and 2.17 times faster than Faster R-CNN and SSD, respectively. These results are consistent with results previously reported in the literature [66, 67]. In line with these results, YOLOv4 was preferred as the detection algorithm to be used in the study. The feature that distinguishes the YOLO algorithm from other object detection algorithms and enables it to work fast is that it uses a single-stage detection process when detecting objects in the given image frame.
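For reference, Eqs. (1)-(3) can be computed directly from the per-class detection counts; the sketch below uses made-up counts and a placeholder AP value purely for illustration.

# Sketch: computing precision, recall, and mAP from detection counts.
def precision(tp, fp):
    return tp / (tp + fp)                          # Eq. (1)

def recall(tp, fn):
    return tp / (tp + fn)                          # Eq. (2)

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)   # Eq. (3)

# Illustrative counts for a single 'UAV' class.
tp, fp, fn = 180, 20, 30
print(precision(tp, fp))                # 0.9
print(recall(tp, fn))                   # about 0.857
print(mean_average_precision([0.82]))   # with one class, mAP equals its AP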


Fig. 13 Performance scores of YOLOv4, Faster R-CNN, and SSD algorithms as a result of training

The general architecture of the YOLO algorithm consists of three basic parts: the backbone (feature extraction), the neck (feature combination), and the head (object detection and positioning). For the YOLO algorithm to work, the given input must match the input dimensions of the backbone structure. Where it does not, the image input is brought to the desired size by the YOLO algorithm through reduction or enlargement operations. Afterward, the resized image is divided into regions of equal size and transferred to the backbone part of the YOLO algorithm. This part is the structure where the convolutional layers are located, and it aims to extract the features in the picture. There are three different backbone architectures in YOLOv4, namely CSPResNext50, CSPDarknet53, and EfficientNet-B3. These architectures are pretrained on the ImageNet dataset, and their weights are included in the YOLOv4 library to provide a convenient starting point. The features obtained from the backbone section are kept for each region. In order to derive a holistic meaning from the properties of all regions, they are given as input to the neck section. The results obtained after mixing and combining all the features in the neck section are transmitted to the head section to undergo an anchor-based detection stage. The features and estimates found are assigned to the best-matching anchors among the YOLO anchors. Finally, the non-max suppression method is used to check whether the predictions in the regions belong to the same objects and to eliminate redundant predictions, and the object detection process is terminated for the given input.
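For completeness, a typical way to run a trained YOLOv4 model on a single frame with OpenCV's DNN module is sketched below; the file names, the 512 × 512 input size, and the thresholds are assumptions for illustration and are not the configuration used in the study.

import cv2

# Sketch of YOLOv4 inference with OpenCV's DNN module; file names, input size,
# and thresholds are illustrative only.
net = cv2.dnn.readNetFromDarknet('yolov4.cfg', 'yolov4.weights')
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(512, 512), scale=1.0 / 255, swapRB=True)

frame = cv2.imread('frame.jpg')
# Returns class ids, confidences, and bounding boxes after non-max suppression.
class_ids, confidences, boxes = model.detect(frame, confThreshold=0.4, nmsThreshold=0.5)

for class_id, confidence, box in zip(class_ids, confidences, boxes):
    x, y, w, h = box
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)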


Detection Algorithm

The operating speed and accuracy rate, the two basic requirements taken into account when choosing the YOLO algorithm, are also taken into account when choosing the YOLO architecture, and the YOLOv3, YOLOv4, and scaled-YOLOv4 architectures are compared. To make this comparison, each architecture is trained with the previously used training set and then tested. As a result of this comparison, the highest success is achieved with the scaled-YOLOv4 architecture. A comparison between YOLOv3 and YOLOv4 further reveals that YOLOv4 is the faster of the two architectures. In line with these results, the YOLOv4 architecture is used. In order for YOLO and other detection algorithms to work in real time, each frame taken from the camera must be resized to match the dimensions of the model input layer. Since the target UCAVs that need to be detected occupy very few pixels compared to the whole image, it is not possible to extract the characteristics of the target UCAVs from the convolutional layers after this reduction operation. To solve this problem, possible feature losses are minimized by keeping the input layer of the model as wide as possible. However, expanding the input layer reduces the algorithm's operating speed. Accordingly, an input size of 786432pix3 is preferred, which both limits feature loss and works with high accuracy while remaining suitable for real-time operation. The 50,000-iteration training of the 786432pix3 YOLOv4 model takes longer than expected because the input contains so many values. Even though the dataset studied is not very large, the computational power of the Nvidia Jetson AGX Xavier module used for training is limited. As a solution to this problem, the input values are changed by monitoring the learning curve: training is started with an input size of 519168pix3, and when the learning curve is observed to level off, the training of the model is stopped and continued with input values of 786432pix3.

Data Arrays

A UCAV dataset is created to be used in the training of the YOLO model. Videos covering many different situations are selected in order to minimize the effect on the model of conditions such as weather, viewing angle, and light level that may be encountered during the mission. These videos generally consist of two different UAVs following each other or of UAV footage taken from the ground. While editing the videos, the frames containing the important parts are included in the data set. A data set consisting of 9750 frames in total is created.

Auxiliary Detection Algorithms

The biggest disadvantage of the YOLO algorithm, which is used as the basic detection algorithm, is that it does not start from the position of the aircraft it found in the


previous frame; instead, it tries to find the aircraft again in every frame and examines each frame individually. Although it can achieve high success with such a detection method, in some cases the position of the aircraft changes very little, and there may be situations that the YOLO algorithm cannot detect. To solve this problem, a detection model supported by the tracking algorithms in the OpenCV library is developed. Tracking algorithms such as Tracking-Learning-Detection (TLD) [68], Boosting [69], KCF [70], and Channel and Spatial Reliability Tracking (CSRT) [71] in the OpenCV library are tested on flight videos, and the success levels of these algorithms are compared. The performance of these tracking algorithms was measured by two commonly used metrics, multiple object tracking accuracy (MOTA) and multiple object tracking precision (MOTP). Mathematically, MOTA is expressed as follows:

MOTA = 1 − \sum_t (FN_t + FP_t + MM_t) / \sum_t TNO_t    (4)

where MM indicates the number of mismatches, and TNO is the total number of objects (to be detected) in each frame at time t. As can be deduced from this expression, MOTA captures all the errors made by the tracking algorithm. The sum of these errors gives the total error, E_T, and subtracting this value from 1, i.e., 1 − E_T, gives the measure of tracking accuracy. MOTP, on the other hand, expresses the positional error of the estimated position for true matches over all tracking frames and is expressed as follows:

MOTP = \sum_{t,i} d_{t,i} / \sum_t c_t    (5)

where d_{t,i} is the error between the bounding box and the actual position of the target object i; in other words, it is the Euclidean distance between the true location of the target object and the location determined by the tracking algorithm. c_t stands for the total number of matches at time t. The four tracking algorithms were tested with ten videos of different durations and frames-per-second (FPS) values containing UAVs in flight, and the results are compiled in Tables 1 and 2. When the MOTA values in Table 1 are examined, it is seen that the KCF algorithm has the best overall accuracy. The MOTP values, expressed in pixels, are given in Table 2; in this performance measure, KCF clearly outperforms the other tracking algorithms. Based on these results, it was decided to continue with KCF. KCF is an algorithm designed to give a maximum response when applied to the target to be tracked, because more similar signals, that is, signals of the same object, have a higher correlation value. In addition to being one of the high-speed tracking algorithms, KCF stands out for its robustness and accuracy under orientation and scale changes. Since running both YOLO and KCF at the same time would decrease the FPS value, the KCF tracking algorithm is activated only when YOLO finds the target UCAV in


Table 1 MOTA scores of the four tracking algorithms

Trial   FPS   Duration/Frame    TLD      Boosting   KCF      CSRT
I       30    00:34:00/1027 f   0.1674   0.3417     0.3668   0.2595
II      30    01:03:00/1896 f   0.0728   0.3386     0.4620   0.3027
III     30    00:58:00/1746 f   0.1807   0.3196     0.4489   0.2764
IV      25    00:13:00/0330 f   0.1461   0.3752     0.3500   0.2578
V       25    01:37:00/2426 f   0.1321   0.2941     0.4430   0.2708
VI      30    02:22:00/4325 f   0.1041   0.3793     0.3623   0.2665
VII     30    00:11:00/0346 f   0.0980   0.3804     0.3575   0.3025
VIII    25    01:10:00/1762 f   0.1247   0.3652     0.3904   0.2839
IX      30    00:44:00/1329 f   0.1504   0.3204     0.3745   0.2917
X       60    00:23:00/1393 f   0.1087   0.3497     0.3419   0.2947

Table 2 MOTP scores (in pixels) of the four tracking algorithms

Trial   FPS   Duration/Frame    TLD      Boosting   KCF      CSRT
I       30    00:34:00/1027 f   673.12   237.11     208.16   208.86
II      30    01:03:00/1896 f   753.27   248.80     73.18    328.12
III     30    00:58:00/1746 f   708.91   261.69     232.10   242.25
IV      25    00:13:00/0330 f   573.59   251.54     250.84   221.70
V       25    01:37:00/2426 f   760.64   169.59     96.35    234.17
VI      30    02:22:00/4325 f   609.65   244.24     130.60   348.68
VII     30    00:11:00/0346 f   602.01   143.32     101.58   262.89
VIII    25    01:10:00/1762 f   563.39   222.10     175.00   335.31
IX      30    00:44:00/1329 f   751.14   233.79     160.17   347.86
X       60    00:23:00/1393 f   745.41   254.22     266.25   216.17

the previous frame and cannot find it in the current frame. When the target UCAV cannot be detected with YOLO, KCF tries to detect it by comparing the current frame with previous frames, using the position of the last detected UCAV. As soon as the YOLO algorithm starts detecting again, the KCF algorithm stops working. Thus, the loss of FPS caused by running two algorithms at the same time is prevented, and thanks to this method, detections are also obtained in frames in which the YOLO algorithm alone fails. The joint operation of these two algorithms is shown in the flowchart in Fig. 14.
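A sketch of this hand-over logic is given below. The detect_with_yolo wrapper is hypothetical, and depending on the OpenCV version the KCF tracker may need to be created via cv2.legacy instead; the sketch only illustrates the switching pattern of Fig. 14.

import cv2

tracker = None
last_box = None

def process_frame(frame):
    # Sketch of the YOLO/KCF hand-over of Fig. 14; detect_with_yolo is an
    # assumed wrapper returning a bounding box (x, y, w, h) or None.
    global tracker, last_box
    box = detect_with_yolo(frame)
    if box is not None:
        # YOLO found the target: use its result and drop any active tracker.
        last_box, tracker = box, None
        return box
    if last_box is not None and tracker is None:
        # YOLO lost the target: start KCF from the last known position.
        tracker = cv2.TrackerKCF_create()
        tracker.init(frame, tuple(last_box))
    if tracker is not None:
        ok, tracked_box = tracker.update(frame)
        if ok:
            last_box = tracked_box
            return tracked_box
    return None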

5.2.2 Relative Movement of Dynamic Target

In the algorithm developed for autonomous locking, the ultimate goal is to detect the target UCAV, then find the motion vector for its tracking and then give it as input


Fig. 14 Target UCAV detection flowchart

to the guidance algorithm. To find the motion vector, the frames obtained from the camera are compared, the position changes of the target UCAV between successive frames are examined, and the motion vector is found by taking various external factors into account. Although it seems reasonable to follow the target UCAV by looking at the opponent's position in the current frame, this is not actually a correct method, because it does not take into account the swinging movement, the approach movement toward the target UCAV, or the sudden maneuvers of the target UCAV. As seen in Fig. 15, heading toward the target UCAV's current location at any given time causes tracking to be lost in case of a possible maneuver. Likewise, comparing only two consecutive frames to find the motion vector of the target UCAV does not give a very reliable result: a measurement based on only two frames contains limited data and ignores a sudden maneuver by the opponent. Comparing multiple consecutive frames gives a much more reliable result.

Fig. 15 Loss of tracking in case of rapid maneuver of the target UCAV


In order to find the motion vector of the target UCAV correctly when comparing frames, two basic effects must be taken into account: the rotation of the UCAV about its axes due to its swinging movement, and the approach movement of the UCAV toward the target. When the opponent's motion vector between two consecutive frames is calculated based only on the position change in those two frames, these two effects are ignored: the rotational movement of the UCAV about its axes is neglected, as if the UCAV followed a straight and smooth path, and the approach movement of the UCAV toward the target between the two frames is disregarded. Both effects are therefore taken into account whenever any two consecutive frames are compared.
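One minimal way to obtain a smoother estimate from more than two frames is to average the displacement of the detected bounding-box center over a short window, as sketched below with assumed names; the compensation for the own aircraft's rotation and approach discussed above is deliberately left out of this simplified sketch.

from collections import deque

# Sketch: estimate the target's apparent image motion by averaging the
# displacement of its bounding-box center over the last few frames.
# The window length of six frames is an illustrative assumption.
centers = deque(maxlen=6)

def update_motion_vector(box):
    x, y, w, h = box
    centers.append((x + w / 2.0, y + h / 2.0))
    if len(centers) < 2:
        return (0.0, 0.0)
    # Average per-frame displacement over the window: smooths single-frame
    # noise while still reacting to sustained maneuvers.
    dx = (centers[-1][0] - centers[0][0]) / (len(centers) - 1)
    dy = (centers[-1][1] - centers[0][1]) / (len(centers) - 1)
    return (dx, dy)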

5.3 Target Tracking via LVFG

After entering the route of the selected target and approaching it, it is necessary to lock onto this target and stay on its route continuously. In order to stay on the same route as the target aircraft at a fixed distance, the attitude of the target vehicle must be reproduced exactly. Fixed-wing unmanned systems have a significant disadvantage here compared to other systems: for them to fly, their wings must generate sufficient lift, which requires a high flight speed. Such a requirement does not exist for vertical take-off and landing (VTOL) systems or unmanned ground vehicles (UGVs). The required high speed makes fixed-wing systems difficult to control and calls for complex nonlinear models in target tracking. In this context, there are various approaches, such as behavior-based, predefined location coordinate-based, and side-bearing angle-based tracking methods [72–74]. However, the LVFG algorithm is a highly accurate tracking algorithm that can respond to these requirements on its own [75, 76]. For this reason, LVFG was preferred for uninterrupted tracking of the target aircraft on the route. LVFG has a relatively simple structure and does not impose a heavy computational burden. In addition, it is a generally accepted method since it does not pose convergence problems. As shown in Fig. 16, the position vector of the target aircraft can be expressed as r_T = [x_T, y_T]^T, and the position vector of the follower UCAV as r_UCAV = [x_UCAV, y_UCAV]^T. In this case, the velocity vector of the follower UCAV will be a function of the variation of r_UCAV and of the heading angle \psi. In a first-order tracking control, the kinematic model of the UCAV can be expressed as follows [76]:

[\dot{x}, \dot{y}, \dot{\psi}]^T = [u\cos(\psi), u\sin(\psi), \tau(\psi_t − \psi)]^T    (6)

where u represents the speed of the UAV, τ represents the time constant, and \psi_t is the heading of the desired tracking path. Based on this, the Lyapunov vector field can be configured as follows.


Fig. 16 Illustration of the tracking scenario


[\dot{d}, d\dot{\zeta}]^T = (u/\eta) [−d^2 + r_t^2, r_t d]^T    (7)

where d is the radial distance to the target, ζ is the bearing angle, η is the velocity normalization coefficient, and r_t is the desired fixed tracking distance. For the Lyapunov vector field to work smoothly, the guidance parameter must be a continuous function of the radial distance between the follower and the target. Moreover, for tracking at a fixed distance, it must be maximum at d = r_t and 0 when the radial distance approaches zero or infinity. The position vector of the follower UCAV can be expressed as r_UT = [x_UT, y_UT]^T in the local coordinate frame of the target aircraft. If the expression for the Lyapunov vector field is written in the Cartesian coordinate system, the relative velocity of the follower UCAV with respect to the target aircraft is given as follows:

[\dot{x}, \dot{y}]^T_{desired} = (u/(\eta d)) [d^2 x_{UT} − r_t^2 x_{UT} + d r_t y_{UT}, \; d^2 y_{UT} − r_t^2 y_{UT} + d r_t x_{UT}]^T    (8)

This expression gives the desired velocity of the UCAV during target tracking. Based on it, the rate of change of the heading along the tracking route is expressed as follows:

\dot{\psi} = (\dot{x}\ddot{y} − \dot{y}\ddot{x}) / (\dot{x}^2 + \dot{y}^2)    (9)

Thus, the obtained parameters of the target in the previous section are utilized in LVFG, and tracking is performed at a fixed distance.
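A direct transcription of Eq. (8) is sketched below. Treating the normalization so that the commanded ground speed equals u is an assumption made for the sketch; it is one common choice and not necessarily the normalization used in the chapter.

from math import hypot

# Sketch of the LVFG desired-velocity computation of Eq. (8). Inputs are the
# follower's coordinates (x_ut, y_ut) in the target's local frame, the standoff
# distance r_t, and the speed u. Normalizing to speed u is an assumption.
def lvfg_desired_velocity(x_ut, y_ut, r_t, u):
    d = hypot(x_ut, y_ut)                    # radial distance to the target
    # Un-normalized field components from Eq. (8).
    fx = d * d * x_ut - r_t * r_t * x_ut + d * r_t * y_ut
    fy = d * d * y_ut - r_t * r_t * y_ut + d * r_t * x_ut
    norm = hypot(fx, fy)
    if norm == 0.0:
        return (0.0, 0.0)
    # Scale so that the commanded ground speed equals u.
    return (u * fx / norm, u * fy / norm)

print(lvfg_desired_velocity(300.0, 100.0, 150.0, 70.0))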


6 Communication

6.1 Communication with Ground Station

The ucav_link data included in the ucav_com data shared instantly during the mission by the ground station makes it possible to capture information such as the instantaneous location and speed of all UCAVs within the communication range of a given UCAV. All UCAVs within range are sorted by their ucav_id. However, this ordering breaks down when the number of UCAVs in range decreases or after any UCAV falls. For example, when there are 9 UCAVs in the mission and all of them are within the range of the default UCAV, the ucav_link data is retrieved sequentially as ucav_0, ucav_1, ucav_2, …, ucav_8. With this ordering, calling the def.ucav_com['ucav_link'][2] parameter retrieves the data of ucav_2. However, when there are 9 UCAVs in the environment but only 3 of them are within the range of the default UCAV (the others being out of range or down), the data received may be that of ucav_3, ucav_5, and ucav_7; in this case, calling def.ucav_com['ucav_link'][2] returns the data of ucav_7. The UCAV therefore needs to check which index corresponds to which UCAV in the ucav_link data of the other UCAVs in the communication chain; otherwise, a sequencing error may occur when pulling the ucav_link data in the code. In order to use the ucav_link data in all modules without errors, a function was developed that checks how many UCAVs are in the UCAV's communication chain and reports which UCAV with which ID is present, as shown in Algorithm 18.

Algorithm 18. Detection of the number of UCAVs in the communication chain.

# A function that returns the IDs of the UCAVs in the UCAV's communication network in an array.
def getCommunicationChainUCAVs(def):
    communication_chain_of_UCAV = def.ucav_com['ucav_link']
    communication_chain_of_UCAV_count = lgth(communication_chain_of_UCAV)
    ucav_id = []
    for m in range(0, communication_chain_of_UCAV_count):
        # Extract the numeric ID from the 'ucav_<id>' key of each entry.
        ucav_id.append(int(str(communication_chain_of_UCAV[m].keys())
                           [2:lgth(communication_chain_of_UCAV[m].keys()) - 3]
                           .replace('ucav_', '')))
    return ucav_id, communication_chain_of_UCAV


6.2 Intercommunication of UCAVs

Each UCAV in the swarm is controlled via the transmission control protocol (TCP) port defined for it at the beginning of the scenario. TCP is one of the primary protocols of the internet protocol suite and is based on the connection-oriented communication mode. The addresses of these ports are determined per task, and the UCAVs use these port addresses as identification numbers. UCAVs do not send orders, commands, or information packets directly to each other. Since data packets can hop in a configuration of three UCAVs in which at least two UCAVs can see each other, UCAVs can access the data of all other UCAVs if they fly in a suitable formation. Depending on the range determined at the beginning of each mission, UCAVs can communicate with each other; however, each UCAV can communicate directly only with the other UCAVs within its communication range. As long as the communication ranges of the UCAVs intersect, UCAVs that are not within each other's direct communication range can still establish a communication chain by communicating through the UCAVs that are within range of both. If a UCAV is not within the communication range of any other UCAV during the mission, the out-of-range UCAV cannot obtain any information about the other UCAVs in the swarm. As shown in Fig. 17, UCAVs in the communication chain can share information with each other, whereas out-of-range UCAVs cannot directly or indirectly access information about other UCAVs.
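The transitive chain described above can be computed as a breadth-first search over the pairwise-distance graph, as sketched below; the 2-D positions, the common communication range, and the function name are illustrative assumptions.

from collections import deque
from math import hypot

# Sketch: find every UCAV reachable from 'start' through overlapping
# communication ranges (a breadth-first search over the proximity graph).
def communication_chain(positions, start, comm_range):
    reachable = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        cx, cy = positions[current]
        for ucav_id, (x, y) in positions.items():
            if ucav_id not in reachable and hypot(x - cx, y - cy) <= comm_range:
                reachable.add(ucav_id)
                queue.append(ucav_id)
    return reachable

positions = {0: (0, 0), 1: (80, 0), 2: (160, 0), 5: (500, 500)}
print(communication_chain(positions, 0, 100))   # {0, 1, 2}; UCAV 5 is out of range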

6.3 Collision Avoidance

One of the prominent problems in vehicles moving in a swarm is that the vehicles in the swarm collide with each other and cause damage, and this problem is even more critical for airborne vehicles. Vehicles must not collide with each other or with obstacles while forming the formation, while moving to the target point in formation, or during the mission. Each UCAV constantly accesses the positions of the other UCAVs in the communication chain and, by projecting the instantaneous position changes, checks whether it will arrive at the same positions as other UCAVs within a certain distance. For example, as shown in Algorithm 19, in the ucavs_collision_control() function the value False is returned directly, on the assumption that UCAVs with altitude differences greater than 40 have no collision probability.


Fig. 17 Illustration of communication chain between UCAVs

Algorithm 19. Instant collision check.

if lgth([int(ucav1_position[2])], [int(ucav2_position[2])]) > 40:
    print('Since the height differences are greater than 40, there is no collision.')
    return False


Algorithm 20. Continuous collision checking.

for m in range(0, 60):
    crd1 = [
        int(ucav1_position[0] + sin(ucav1_aim) * m),
        int(ucav1_position[1] - cos(ucav1_aim) * m)
    ]
    crd2 = [
        int(ucav2_position[0] + sin(ucav2_aim) * m),
        int(ucav2_position[1] - cos(ucav2_aim) * m)
    ]
    lgth1 = lgth(crd1, crd2)
    crd1_2 = [
        int(ucav1_position[0] + sin(ucav1_aim) * (m + 2)),
        int(ucav1_position[1] - cos(ucav1_aim) * (m + 2))
    ]
    crd2_2 = [
        int(ucav2_position[0] + sin(ucav2_aim) * (m + 2)),
        int(ucav2_position[1] - cos(ucav2_aim) * (m + 2))
    ]
    lgth2 = lgth(crd1_2, crd2_2)
    # Returns True if they are too close to each other.
    if lgth1 < 20 and lgth2 < 20 and lgth([int(ucav1_position[2])], [int(ucav2_position[2])]) < 30:
        extclose = True
    # If the distance between them increases as the coordinates advance, they are moving away from each other.
    if lgth2 > lgth1:
        distancing = distancing + 1
    elif lgth2 < lgth1:
        approaching = approaching + 1


The True value returned from this function indicates a probable collision between two UCAVs. Afterward, the speeds of both UCAVs are reduced, and the UCAV with the larger ucav_id climbs to a certain altitude. Having climbed, it avoids the possibility of colliding with the other UCAV, which waits at the lower altitude, and continues on its way; in the illustrated example, UCAV-3 waits at low altitude while UCAV-6 climbs, passes, and then continues on its way by descending diagonally, thus avoiding the collision. There are also cases in which, after a certain period of time, the UCAVs change their target positions so that they do not remain stuck when they cannot escape collision obstacles or cannot pass each other.

6.4 Flight Control Module

In order to operate the UCAVs in all modules, a UCAV flight control module is needed. In the flight control module, the basic information required for the flight is sent to the ground station over the TCP/UDP ports determined for each UCAV. In UCAV flight control, the UCAVs must act as desired, that is, in accordance with the parameters they receive. In addition, because of the aerodynamic characteristics of the UCAVs, handling the dynamic interactions known as coupling is one of the responsibilities of this module. In order to control the UCAVs properly, the heading direction is given in degrees with reference to true north. To adjust the speed of the movement, the parameters x_vel on the x-axis and y_vel on the y-axis are sent in knots. Height is controlled by the altitude parameter in feet. In addition, while the UCAVs are scanning the area, the UCAVs in the communication chain constantly check the collision status. During this check, it is not necessary to test the collision situation for UCAVs separated by a distance of more than 100 units, which minimizes the collision-control load. This distance, e.g., 100 units, between the UCAVs can be changed depending on the mission requirements. The fly_to_opponent() function is run while the UCAVs are flying toward a given target. Here, if the speed for minimum power consumption is set to 70 knots, the deceleration of a UCAV flying at 70 knots is achieved by reducing its speed at 1000 and 500 units from the target, as sketched below. Thus, the UCAVs slow down in accordance with their aerodynamic features.
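The following is only an assumed sketch of such a distance-based speed schedule; the 1000- and 500-unit thresholds and the 70-knot cruise speed come from the text, while the reduced speed values are illustrative.

# Sketch of a distance-based speed schedule for approaching a target point.
# Thresholds and cruise speed follow the text; reduced speeds are assumptions.
def commanded_speed(distance_to_target, cruise_speed=70):
    if distance_to_target > 1000:
        return cruise_speed          # minimum-consumption cruise speed
    elif distance_to_target > 500:
        return cruise_speed * 0.6    # first deceleration stage
    else:
        return cruise_speed * 0.3    # final approach

for d in (2000, 800, 300):
    print(d, commanded_speed(d))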

7 Conclusion

A system with multiple UCAVs was reinforced with swarm intelligence for the reconnaissance, surveillance, and destruction of in-sight and beyond-sight targets. On the basis of the designed model, a fixed-wing, fully autonomously controlled UCAV swarm was developed that provides efficient communication, formation, target acquisition, return to base, and runway landing capabilities. For this purpose, a series of solutions was put forward with different criteria and methods. Based on


the performance-strength-qualification-success index, it was ensured that the swarm UCAV could deliver a high mission capacity, fully autonomous mission flight, a successful swarm strategy, and real-time data transmission without any problems. By establishing effective communication with the ground station, the swarm was made more effective in the mission area, and the data obtained during the mission were processed and transferred to the servers. The necessary simulations and coding were carried out for the integrated operation of the UCAVs in formation. For UCAV studies that can make high-resolution spatio-temporal sampling possible, efforts were made to verify the durability and long-term operational stability of the proposed system. The proposed system offers the capability and flexibility for future swarm platform work. In addition to the tested arrowhead and rectangular prism formations, other formation types such as column, front, and diamond can also be tested. Considering the wind effect, it can be investigated which formation type is more efficient; the effect of wind on energy consumption is another key subject to be investigated. Integrating a 3D map of the mission area into the software of the swarm system could greatly increase the success of the mission. In target detection, YOLOv5, a relatively new detection algorithm, could be used and supported by a recently developed tracking algorithm. Moreover, the communication layer can be further developed, and a more dynamic swarm structure can be obtained by taking instant action based on information transmitted from ground stations or ranger UAVs.

References 1. Bolourian, N., Hammad, A.: LiDAR equipped UAV path planning considering potential locations of defects for bridge inspection. Automat. Constr. 117, 1–16 (2020). https://doi.org/10. 1016/j.autcon.2020.103250 2. Varbla, S., Puust, R., Ellmann, A.: Accuracy assessment of RTK-GNSS equipped UAV conducted as-built surveys for construction site modeling. Surv. Rev. 53(381), 477–492 (2020). https://doi.org/10.1080/00396265.2020.1830544 3. Adamski, M.: Effectiveness analysis of UCAV used in modern military conflicts. Aviation 24(2), 66–71 (2020). https://doi.org/10.3846/aviation.2020.12144 4. Li, W., Shi, J., Wu, Y., Wang, Y., Lyu, Y.: A multi-UCAV cooperative occupation method based on weapon engagement zones for beyond-visual-range air combat. Def. Tech. 18(6), 1006–1022 (2022). https://doi.org/10.1016/j.dt.2021.04.009 5. Wang, X., Zhao, H., Han, T., Wei, Z., Liang, Y., Li, Y.: A Gaussian estimation of distribution algorithm with random walk strategies and its application in optimal missile guidance handover for multi-UCAV in over-the-horizon air combat. IEEE Access 7, 43298–43317 (2019). https:// doi.org/10.1109/ACCESS.2019.2908262 6. Ju, C., Son, H.: Multiple UAV systems for agricultural applications: control, ımplementation and evaluation. Electronics 7(9), 1–19 (2018). https://doi.org/10.3390/electronics7090162 7. Eaton, C.M., Chong, E.K.P., Maciejewski, A.A.: Multiple-scenario unmanned aerial system control: a systems engineering approach and review of existing control methods. Aerospace 3(1), 1–26 (2016). https://doi.org/10.3390/aerospace3010001 8. Zhu, H., Wang, Y., Ma, Z., Li, X.: A comparative study of swarm intelligence algorithms for UCAV path-planning problems. Mathematics 9(2), 1–31 (2021). https://doi.org/10.3390/mat h9020171


9. Weia, Y., Blake, M.B., Madey, G.R.: An operation-time simulation framework for UAV swarm configuration and mission planning. Procedia Comp. Sci. 18, 1949–1958 (2013). https://doi. org/10.1016/j.procs.2013.05.364 10. Yang, Z., Sun, Z., Piao, H., Zhao, Y., Zhou, D., Kong, W., Zhang, K.: An autonomous attack guidance method with high aiming precision for UCAV based on adaptive fuzzy control under model predictive control framework. Appl. Sci. 10(16), 1–21 (2020). https://doi.org/10.3390/ app10165677 11. Tan, M., Tang, A., Ding, D., Xie, L., Huang, C.: Autonomous air combat maneuvering decision method of UCAV based on LSHADE-TSO-MPC under enemy trajectory prediction. Electronics 11(20), 1–25 (2022). https://doi.org/10.3390/electronics11203383 12. Ruan, W., Duan, H., Deng, Y.: Autonomous maneuver decisions transfer learning pigeoninspired optimization for UCAVs in dogfight engagements. IEEE/CAA J. Automat. Sinica 9(9), 1639–1657 (2022). https://doi.org/10.1109/JAS.2022.105803 13. Yue, L., Xiaohui, Q., Xiaodong, L., Qunli, X.: Deep reinforcement learning and its application in autonomous fitting optimization for attach areas of UCAVs. J. Syst. Eng. Electr. 31(4), 734–742 (2020). https://doi.org/10.23919/JSEE.2020.000048 14. Yang, K., Dong, W., Cai, M., Jia, S., Liu, R.: UCAV air combat maneuver decisions based on a proximal policy optimization algorithm with situation reward shaping. Electronics 11(16), 1–19 (2022). https://doi.org/10.3390/electronics11162602 15. Liu, X., Yin, Y., Su, Y., Ming, R.: A multi-UCAV cooperative decision making method based on an MAPPO algorithm for beyond-visual range air combat. Aerospace 9(19), 1–19 (2022). https://doi.org/10.3390/aerospace9100563 16. Agarwala, S., Pape, L.E., Dagli, C.H.: A hybrid genetic algorithm and particle swarm optimization with type-2 fuzzy sets for generating systems of systems architectures. Procedia Comp. Sci. 36, 57–64 (2014). https://doi.org/10.1016/j.procs.2014.09.037 17. Huang, H., Zhuo, T.: Multi-model cooperative task assignment and path planning of multiple UCAV formation. Multimed. Tools Appl. 78, 415–436 (2019). https://doi.org/10.1007/s11042017-4956-7 18. Phung, M.D., Ha, Q.P.: Safety-enhanced UAV path planning with spherical vector-based particle swarm optimization. Appl. Soft Comp. 107, 1–15 (2021). https://doi.org/10.1016/j. asoc.2021.107376 19. Rivera, G., Porras, R., Sanchez-Solis, J.P., Florencia, R., García, V.: Outranking-based multiobjective PSO for scheduling unrelated parallel machines with a freight industry-oriented application. Eng. Appl. Artif. Intell. 108, 104556 (2022). https://doi.org/10.1016/j.engappai. 2021.104556 20. Olmos, J., Florencia, R., García, V., González, M.V., Rivera, G., Sánchez-Solís, P.: Metaheuristics for Order Picking Optimisation: A Comparison Among Three Swarm-Intelligence Algorithms. Technological and Industrial Applications Associated With Industry 4, 177–194 (2022). https://doi.org/10.1007/978-3-030-68663-5_13 21. Castellanos, A., Cruz-Reyes, L., Fernández, E., Rivera, G., Gomez-Santillan, C., RangelValdez, N.: Hybridisation of Swarm Intelligence Algorithms with Multi-Criteria Ordinal Classification: A Strategy to Address Many-Objective Optimisation. Mathematics 10(3), 322 (2022). https://doi.org/10.3390/math10030322 22. Rivera, G., Florencia, R., Guerrero, M., Porras, R., Sánchez-Solís, J.P.: Online multi-criteria portfolio analysis through compromise programming models built on the underlying principles of fuzzy outranking. Inf. Sci. 580, 734–755 (2021). 
https://doi.org/10.1016/j.ins.2021.08.087 23. Qin, B., Zhang, D., Tang, S., Wang, M.: Distributed grouping cooperative dynamic task assignment method of UAV swarm. Appl. Sci. 12(6), 1–27 (2022). https://doi.org/10.3390/app120 62865 24. Zhen, Z., Wen, L., Wang, B., Hu, Z., Zhang, D.: Improved contract network protocol algorithm based cooperative target allocation of heterogeneous UAV swarm. Aerosp. Sci. Technol. 119, 1–8 (2021). https://doi.org/10.1016/j.ast.2021.107054 25. Dui, H., Zhang, C., Bai, G., Chen, L.: Mission reliability modeling of UAV swarm and its structure optimization based on importance measure. Reliab. Eng. Syst. Safe 215, 1–12 (2021). https://doi.org/10.1016/j.ress.2021.107879


26. Hildmann, H., Kovacs, E.: Review: using unmanned aerial vehicles (UAVs) as mobile sensing platforms (MSPs) for disaster response, civil security and public safety. Drones 3(3), 1–26 (2019). https://doi.org/10.3390/drones3030059 27. Mohsan, S.A.H., Khan, M.A., Noor, F., Ullah, I., Alsharif, M.H.: Towards the unmanned aerial vehicles (UAVs): a comprehensive review. Drones 6(6), 1–27 (2022). https://doi.org/10.3390/ drones6060147 28. Hong, L., Guo, H., Liu, J., Zhang, Y.: Toward swarm coordination: topology-aware inter-UAV routing optimization. IEEE T. Veh. Technol. 69(9), 10177–10187 (2020). https://doi.org/10. 1109/TVT.2020.3003356 29. Zhou, W., Ll, J., Liu, Z., Shen, L.: Improving multi-target cooperative tracking guidance for UAV swarms using multi-agent reinforcement learning. Chinese J. Aeronaut. 35(7), 100–112 (2022). https://doi.org/10.1016/j.cja.2021.09.008 30. Wang, J., Ding, D., Han, B., Li, C., Ku, S.: Fast calculation method of UCAV maneuver flight control based on RBF network. J. Phys.: Conf. Ser. 1087(2), 1–8 (2018). https://doi.org/10. 1088/1742-6596/1087/2/022027 31. Peng, Q., Wu, H., Xue, R.: Review of dynamic task allocation methods for UAV swarms oriented to ground targets. Com. Syst. Model. Sim. 1(3), 163–175 (2021). https://doi.org/10. 23919/CSMS.2021.0022 32. Xing, D., Zhen, Z., Gong, H.: Offense-defense confrontation decision making for dynamic UAV swarm versus UAV swarm. Proceed. Inst. Mech. Eng., Part G: J. Aerosp. Eng. 233(15), 5689–5702 (2019). https://doi.org/10.1177/0954410019853982 33. Jia, Y., Qu, L., Li, X.: A double layer coding model with a rotation-based particle swarm algorithm for unmanned combar aerial vehicle path planning. Eng. Appl. Artif. Intel. 116, 1–22 (2022). https://doi.org/10.1016/j.engappai.2022.105410 34. Chen, X., Tang, J., Lao, S.: Review of unmanned aerial vehicle swarm communication architectures and routing protocols. Appl. Sci. 10(3661), 1–23 (2020). https://doi.org/10.3390/app 10103661 35. Zhu, H., Wang, Y., Li, X.: UCAV path planning for avoiding obstacles using cooperative coevolution spider monkey optimization. Knowledge-Based Syst. 246, 1–19 (2022). https://doi. org/10.1016/j.knosys.2022.108713 36. Chen, J., Cheng, S., Chen, Y., Xie, Y., Shi, Y.: Enhanced brain storm optimization algorithm for wireless sensor networks deployment. Adv. Swarm Comp. Intel. (Springer LNCS) 9140, 373–381 (2015). https://doi.org/10.1007/978-3-319-20466-6_40 37. Li, Y., Han, T., Zhao, H., Gao, H.: An adaptive whale optimization algorithm using Gaussian distribution strategies and its application in heterogeneous UCAVs task allocation. IEEE Access 7, 110138–110158 (2019). https://doi.org/10.1109/ACCESS.2019.2933661 38. Zhou, Y., Rao, B., Wang, W.: UAV swarm intelligence: recent advances and future trends. IEEE Access 8, 183856–183878 (2020). https://doi.org/10.1109/ACCESS.2020.3028865 39. Shao, Z., Yan, F., Zhou, Z., Zhu, X.: Path planning for multi-UAV formation rendezvous based on distributed cooperative particle swarm optimization. Appl. Sci. 9(2), 1–16 (2019). https:// doi.org/10.3390/app9132621 40. Madridano, A., Al-Kaff, A., Martin, D., Escalera, A.: 3D trajectory planning method for UAVs swarm in building emergencies. Sensors 20(3), 1–20 (2019). https://doi.org/10.3390/ s20030642 41. Ling, H., Luo, H., Chen, H., Bai, L., Zhu, T., Wang, Y.: Modelling and simulation of distributed UAV swarm cooperative planning and perception. Int. J. Aerosp. Eng. 2021(9977262), 1–11 (2021). https://doi.org/10.1155/2021/9977262 42. 
Zhen, X., Enze, Z., Qingwei, C.: Rotary unmanned aerial vehicles path planning in rough terrain based on multi-objective particle swarm optimization. J. Syst. Eng. Elect. 31, 130–141 (2020). https://doi.org/10.21629/JSEE.2020.01.14 43. Liu, Y., Wang, Q., Zhuang, Y., Hu, H.: A novel trail detection and scene understanding framework for a quadcopter with monocular vision. IEEE Sensors J. 17(20), 6778–6787 (2017). https://doi.org/10.1109/JSEN.2017.2746184


44. Suo, W., Wang, M., Zhang, D., Qu, Z., Yu, L.: Formation control technology of fixed-wing UAV swarm based on distributed ad hoc network. Appl. Sci. 12(535), 1–23 (2022). https://doi. org/10.3390/app12020535 45. Azam, M.A., Mittelmann, H.D., Ragi, S.: UAV formation shape control via decentralized Markov decision process. Algorithms 14(91), 1–12 (2021). https://doi.org/10.3390/a14030091 46. Fu, X., Pan, J., Wang, H., Gao, X.: A formation maintenance and reconstruction method of UAV swarm based on distributed control. Aerosp. Sci. Tech. 104, 1–10 (2020). https://doi.org/ 10.1016/j.ast.2020.105981 47. Fabra, F., Zamora, W., Masanet, J., Calafate, C.T., Cano, J.C., Manzoni, P.: Automatic system supporting multicopter swarms with manual guidance. Comp. Electr. Eng. 74, 413–428 (2019). https://doi.org/10.1016/j.compeleceng.2019.01.026 48. Li, S., Fang, X.: A modified adaptive formation of UAV swarm by pigeon flock behavior within local visual field. Aerosp. Sci. Tech. 114, 1–15 (2021). https://doi.org/10.1016/j.ast. 2021.106736 49. Brust, M.R., Danoy, G., Stolfi, D.H., Bouvry, P.: Swarm-based counter UAV defense system. Discover Intern. Things 1(2), 1–19 (2021). https://doi.org/10.1007/s43926-021-00002-x 50. Xu, C., Zhang, K., Jiang, Y., Niu, S., Yang, T., Song, H.: Communication aware UAV swarm surveillance based on hierarchical architecture. Drones 5(33), 1–26 (2021). https://doi.org/10. 3390/drones5020033 51. Zhang, X., Ali, M.: A bean optimization-based cooperation method for target searching by swarm UAVs in unknown environments. IEEE Access 8, 43850–43862 (2020). https://doi.org/ 10.1109/ACCESS.2020.2977499 52. Sanchez-Lopez, J.L., Pestana, J., Paloma, D.L.P.: A reliable open-source system architecture for the fast designing and prototyping of autonomous multi-UAV systems: simulation and experimentation. J. Intel. Robo. Syst. 84(1–4), 1–19 (2016). https://doi.org/10.1007/s10846015-0288-x 53. Puente-Castro, A., Rivero, D., Pazos, A., Fernandez-Blanco, E.: A review of artificial intelligence applied to path planning in UAV swarms. Neural Comp. Appl. 34, 153–170 (2022). https://doi.org/10.1007/s00521-021-06569-4 54. Tekin, R., Erer, K.S., Holzapfel, F.: Control of impact time with increased robustness via feedback linearization. J. Guid. Cont. Dynam. 39(7), 1682–1689 (2016). https://doi.org/10. 2514/1.G001719 55. Saleem, A., Ratnoo, A.: Lyapunov-based guidance law for impact time control and simultaneous arrival. J. Guid. Cont. Dynam. 39(1), 164–173 (2016). https://doi.org/10.2514/1.G00 1349 56. Cho, D., Kim, H.J., Tahk, M.J.: Nonsingular sliding mode guidance for impact time control. J. Guid. Cont. Dynam. 39(1), 61–68 (2016). https://doi.org/10.2514/1.G001167 57. Kim, H., Lee, J., Kim, H.J., Kwon, H., Park, J.: Look-angle-shaping guidance law for impact angle and time control with field-of-view constraint. IEEE Trans. Aerosp. Electro. Syst. 56(2), 1602–1612 (2019). https://doi.org/10.1109/TAES.2019.2924175 58. Tekin, R., Erer, K.S., Holzapfel, F.: Polynomial shaping of the look angle for impact time control. J. Guid. Cont. Dynam. 40(10), 266–273 (2017). https://doi.org/10.2514/1.G002751 59. Tekin, R., Erer, K.S.: Switched-gain guidance for impact angle control under physical constraints. J. Guid. Cont. Dynam. 38(2), 205–216 (2015). https://doi.org/10.2514/1.G000766 60. Ohlmeyer, E.J., Phillips, C.A.: Generalized vector explicit guidance. J. Guid. Cont. Dynam. 29(2), 261–268 (2006). https://doi.org/10.2514/1.14956 61. 
Yao, Z., Yongzhi, S., Xiangdong, L.: Sliding mode control based guidance law with impact angle. Chinese J. Aeronaut. 27(1), 145–152 (2014). https://doi.org/10.1016/j.cja.2013.12.011 62. Erer, K.S., Tekin, R.: Impact vector guidance. J. Guid. Cont. Dynam. 44(10), 1892–1899 (2021). https://doi.org/10.2514/1.G006087 63. Roy, A.M., Bose, R., Bhaduri, J.: A fast accurate fine-grain object detection model based on YOLOv4 deep neural network. Neural Comp. Appl. 34, 3895–3921 (2022). https://doi.org/10. 1007/s00521-021-06651-x


64. Xiao, Y., Wang, X., Zhang, P., Meng, F., Shao, F.: Object detection based on faster R-CNN algorithm with skip pooling and fusion of contextual information. Sensors 20(19), 1–20 (2020). https://doi.org/10.3390/s20195490 65. Zhai, S., Shang, D., Wang, S., Dong, S.: DF-SSD: an improved SSD object detection algorithm based on DenseNet and feature fusion. IEEE Access 8, 24344–24357 (2020). https://doi.org/ 10.1109/ACCESS.2020.2971026 66. Li, J., Liu, C., Lu, X., Wu, B.: CME-YOLOv5: an efficient object detection network for densely spaced fish and small targets. Water 14(2412), 1–12 (2022). https://doi.org/10.3390/w14152412 67. Wang, Z., Wu, L., Li, T., Shi, P.: A smoke detection based on improved YOLOv5. Mathematics 10(1190), 1–13 (2022). https://doi.org/10.3390/math10071190 68. Yang, X., Zhu, S., Xia, S., Zhou, D.: A new TLD target tracking method based on improved correlation filter and adaptive scale. The Visual Comp. 36, 1783–1795 (2020). https://doi.org/ 10.1007/s00371-019-01772-w 69. Cazzato, D., Leo, M., Distante, C., Voos, H.: When i look into your eyes: a survey on computer vision contributions for human gaze estimation and tracking. Sensors 20(13), 1–42 (2020). https://doi.org/10.3390/s20133739 70. Zhao, F., Hui, K., Wang, T., Zhang, Z., Chen, Y.: A KCF-based incremental target tracking method with constant update speed. IEEE Access 9, 73544–73560 (2021). https://doi.org/10. 1109/ACCESS.2021.3080308 71. Xie, J., Stensrud, E., Skramstad, T.: Detection-based object tracking applied to remote ship inspection. Sensors 21(3), 1–23 (2021). https://doi.org/10.3390/s21030761 72. Kim, M., Kim, Y.: Multiple UAVs nonlinear guidance laws for stationary target observation with waypoint incidence angle constraint. Int. J. Aeronaut. Space Sci. 14(1), 67–74 (2013). https://doi.org/10.5139/IJASS.2013.14.1.67 73. Park, S.: Circling over a target with relative side bearing. J. Guid. Cont. Dynam. 39(6), 1450– 1456 (2016). https://doi.org/10.2514/1.G001421 74. Park, S., Deyst, J., How, J.P.: Performance and Lyapunov stability of a nonlinear path following guidance method. J. Guid. Cont. Dynam. 30(6), 1718–1728 (2007). https://doi.org/10.2514/1. 28957 75. Sun, S., Wang, H., Liu, J., He, Y.: Fast Lyapunov vector field guidance for standoff target tracking based on offline search. IEEE Access 7, 124797–124808 (2019). https://doi.org/10. 1109/ACCESS.2019.2932998 76. Pothen, A.A., Ratnoo, A.: Curvature-constrained Lyapunov vector field for standoff target tracking. J. Guid. Cont. Dynam. 40(10), 2725–2732 (2017). https://doi.org/10.2514/1.G002281

Cellular Processing Algorithm for Time-Dependent Traveling Salesman Problem

Edgar Alberto Oviedo-Salas, Jesús David Terán-Villanueva, Salvador Ibarra-Martínez, and José Antonio Castán-Rocha

Abstract This research addresses the Time-Dependent Traveling Salesman Problem with the objective of minimizing travel time; it is an NP-hard optimization problem. We propose three Cellular Processing Algorithm (CPA) variants and three Greedy Randomized Adaptive Search Procedure (GRASP) implementations, including a reactive GRASP. We include shared memory as a new approach to this problem. The CPAs have a shared memory that allows them to prioritize good-quality arcs based on reasonable previous solutions. Finally, we propose three parameters for fine-tuning a shared-memory normalization to improve the GRASP construction. The tests showed that the Cellular Processing Algorithm outperforms all the proposed GRASP heuristic methods regarding quality and efficiency.

Keywords Cellular processing algorithms · Reactive GRASP · Time-dependent TSP · Shared memory

1 Introduction

The Time Dependent-Traveling Salesman Problem (TD-TSP) is a specific case of the classical Traveling Salesman Problem (TSP) in which the time required to traverse an arc depends on the departure time [10]. The TD-TSP is present in private transportation, deliveries, and tourism, among other settings [12, 16]. The objective is to find an optimal route that minimizes the total tour time for a given graph. In urban areas, the travel time from one place to another changes dynamically during the day because of traffic congestion.


Thus, route planning is required that adapts to time-dependent travel and traffic congestion. The TD-TSP is an NP-hard optimization problem [22]. This problem allows real-world modeling that the classic TSP cannot handle [19]. On the other hand, most of the contributions to this problem are exact algorithm implementations. The TD-TSP has variants such as the Time-Dependent Traveling Salesman Problem with Time Windows (TD-TSPTW) [18], the Time-Dependent Shortest Path Problem (TDSPP) [14], the Time-Dependent Vehicle Routing Problem (TDVRP) [24], the Time-Dependent Traveling Deliveryman Problem (TDTDP) [11], the Time-Dependent Traveling Repairman Problem (TDTRP) [21], and the Asymmetric Traveling Salesman Problem (ATSP) [3], among others. In this paper, we propose a Cellular Processing Algorithm (CPA) with shared memory to tackle the TD-TSP.

2 State of the Art

2.1 Time-Dependent Traveling Salesman Problem Contributions

Picard et al. propose the first implementation in [22], where the authors presented three integer linear programming formulations. Two of these formulations were designed to model the TD-TSP for a scheduling problem in a manufacturing process. They used those formulations to minimize the cost on one machine by scheduling its movements, and their results showed that the model could solve instances with up to twenty vertices.

In [17], Miranda-Bront et al. implemented a Branch and Cut algorithm to analyze the inequalities between the time dependence of the travel and the driver. The computational comparison between the Branch and Cut algorithm and CPLEX showed that the Branch and Cut algorithm produces better results on multiple instances, surpassing the CPLEX approach.

In [6], Cordeau et al. implemented a Branch and Cut algorithm to analyze two components. The first component is the computation of lower and upper bounds from the asymmetric traveling salesman problem (ATSP). The test showed that ATSP solutions were optimal for the TD-TSP when all arcs shared a common traffic pattern. Finally, they implemented a linear programming model that validates inequalities, where the computational test showed that the proposal could solve instances with up to forty vertices.

Arigliano et al. [4] implemented a Branch and Bound algorithm to analyze the properties of the TD-TSP inequalities from the state-of-the-art. The computational test demonstrated that the Branch and Bound algorithm could solve instances with up to fifty vertices. Also, the authors concluded that the procedure could solve a larger number of instances compared with the Branch and Cut state-of-the-art.


Adamo et al. in [1] studied a new degree of freedom regarding the speed decomposition of the TD-TSP using a Branch and Bound algorithm, where they defined a new family of lower bounds. Additionally, the authors implemented a compact Mixed Integer Linear Programming model to reduce the computational cost. The computational results indicate that the Branch and Bound algorithm could optimally solve more than seventy-two instances and reduced the gap from 7.43% to 2.79% at the root node.

Cacchiani et al. [5] proposed a Mixed Integer Programming model for the Traveling Salesman Problem with Time-dependent Service times (TSP-TDS) that improved the lower and upper bounds, where they removed sub-tours with a Branch and Cut and a genetic algorithm. The Mixed Integer Programming model improved the upper and lower bounds during the process. The computational results showed that the exact methods could prove the optimality of the solutions of a large set of instances with low computational times. Additionally, the Branch and Cut algorithms could solve sets of instances with up to fifty-eight nodes. Furthermore, the asymmetric method can solve sets with forty-five nodes. Finally, the genetic algorithm is efficient on symmetric and asymmetric instances with up to two hundred nodes.

Adamo et al. [2] analyzed a property of time-dependent graphs called path ranking "invariance"; the authors argued that the ordering of the paths was independent of the start time. They indicated that if a graph is path-ranking invariant, the problem can be tackled as a time-independent routing problem using a Branch and Bound algorithm, which allows solving a large class of time-dependent routing problems such as the Time-Dependent Traveling Salesman Problem and the Time-Dependent Rural Postman Problem. The computational test demonstrates that Branch and Bound outperformed the state-of-the-art algorithms, where the approach focuses on finding the upper and lower bounds.

Table 1 shows the differences between approaches regarding the Time-Dependent Traveling Salesman Problem contributions. Here we can see that most of the approaches are exact algorithms with some differences regarding their restrictions.

2.2 Cellular Processing Algorithms Contributions

In [23], Santiago et al. implemented a Greedy Randomized Adaptive Search Procedure (GRASP) with Cellular Processing Algorithms (CPA), called GRASP-CPA, for precedence-constrained task scheduling of parallel programs. The GRASP-CPA algorithm generated task execution orders with the GRASP algorithm and used the communication between Processing Cells to explore the search space. The computational test showed that GRASP-CPA surpassed the high-performance algorithm called Earliest Finish Time-Iterative Local Search (EFT-ILS) from the state-of-the-art with statistical significance for a set of instances.

Terán et al. in [26] implemented a Cellular Processing Algorithm (CPA) for the Vertex Bisection problem and compared it against a Memetic Algorithm called


Table 1  Differences between approaches

Author | Method | Objective | Attribute
Picard [22] | ILP | Reduce lower bounds | Minimize the tardiness function; used in scheduling problems; use of ILP
Miranda-Bront [17] | B&C | Analyze inequalities | Analyze ATSP
Cordeau [6] | B&C | Analyze inequalities | Analyze TD-TSP properties
Arigliano [4] | B&B | Analyze inequalities | Analyze the upper and lower bound
Adamo [1] | B&B | Analyze a new freedom degree | Use of MILP; new lower bounds
Cacchiani [5] | B&C | Improve upper and lower bounds | Use of GA; use of MILP
Adamo [2] | B&B | Analyze path ranking invariance | Use of MILP; can solve TD-RPP; new lower bounds
CPA-TDTSP (our proposal) | CPA with GRASP | Minimize the total travel time | Use of shared memory; parameters for fine-tuning; prioritize arcs from previous solutions

MA2. The computational tests were carried out on one hundred thirty-seven instances and demonstrated that CPA improved the number of best solutions by 190% and reduced the computational time by 21% with respect to the state-of-the-art. The authors concluded that CPA surpassed MA2 in three out of six quality tests and five out of six efficiency tests, while in the remaining tests both methods generated the best results, and their proposal produced better results on average than MA2.

In [27], Vahidipour et al. implemented an algorithm named Cellular Adaptive Petri Net-Learning Automata (CAPN-LA) as an alternative to the design of the Petri Net-Learning Automata algorithm (APN-LA) for the Vertex Coloring Problem. The authors concluded with an explanation of the cooperation between APN-LA and CAPN-LA in a graph consisting of groups assigned to adjacent vertices, from which a new algorithm called Petri Net-Learning Automata-Vertex Coloring (APN-LA-VC) was created.

Lopez et al. in [15] proposed an Integer Linear Programming model (ILP) and two heuristic solutions for the Internet shopping optimization problem with delivery costs, using the MinMin algorithm and the Cellular Processing Algorithm. The authors argued that online shopping is a common activity where the number of offers keeps increasing, making it difficult to select the best option among online stores. The computational test

Table 2  Differences between CPA approaches

Author | Method | Objective | Attribute
Santiago [23] | GRASP-CPA | Find a good-quality task order | Analyze scheduling of precedence-constrained tasks; explore several solution spaces
Terán [26] | CPA-AdHoc | Connect most arcs | Analyze the vertex bisection problem; explore several solution spaces; use of a memetic algorithm
Vahidipour [27] | APN-LA | Minimize the number of colors for a graph | Analyze the vertex coloring problem
Lopez-Loces [15] | CPA | Minimize the total cost of the shopping list | Use of a linear programming model; use of ILP; analyze the Internet shopping optimization problem; use of the MinMin algorithm
CPA-TDTSP (our proposal) | CPA with GRASP | Minimize the total travel time | Use of shared memory; parameters for fine-tuning; prioritize arcs from previous solutions

demonstrated that the ILP model could determine the gap between the optimal solutions and the objective values obtained for medium-size instances. Also, they added a new problem to the cellular processing algorithm framework. Table 2 shows the differences between approaches regarding the Cellular Processing Algorithms' contributions, which show good performance in a variety of problems and with different internal algorithms and structures.

3 Time Dependent-Traveling Salesman Problem (TD-TSP) The Time Dependent-Traveling Salesman Problem is an NP-hard optimization problem commonly found in private transportation, deliveries, and tourism. The purpose is to find the minimum travel cost and the optimal route in an urban area. It means


Fig. 1 Instance description

that in a city, the travel time from one point to another changes dynamically during the day because of traffic. Here we consider traffic as a vehicular influx within an urban area that negatively affects the arrival time from one point to another. Problem definition: given a directed graph G = (V, E), where V = {v_1, v_2, ..., v_n} is a set of vertices and E = {(v_i, v_j) ∀ v_i, v_j ∈ V} is a set of edges, the objective is defined in Eq. (1).

  Min(z_n) = \sum_{i=0}^{n} \tau_{p_i\, p_{i+1}}^{\,t} \;\big|\; t = z_{i-1}        (1)

where τ_{p_i p_{i+1}}^{t} is the travel duration from p_i to p_{i+1} at time t, with p_i = v_r, p_{i+1} = v_k, v_r ≠ v_k, and v_r, v_k ∈ V, where P = (p_1, p_2, ..., p_n); additionally, t is the time at which the trip from p_i to p_{i+1} takes place, which lies inside a time interval of H = {h_1, h_2, ..., h_n}; therefore, (t_j, t_k) ∈ H | t_j ≤ t ≤ t_k defines the specific traffic speed during each time interval. Finally, the goal is to minimize the total travel time z_n.

We implement three metaheuristic algorithms: a GRASP, a reactive GRASP, and a CPA. One interesting feature is the shared memory in the CPA, which improves the total travel time by retaining good arcs from previous routes. On the other hand, we analyze the performance of the algorithms, the impact of the shared memory on vertex selection, and the conventional assignment of the routes.

3.1 Instance Structure

Figure 1 describes the instance elements, where (a) is the time interval, (b) is the traffic zone F = {f_1, ..., f_c}, where f_1, ..., f_c are disjoint sets and f_1 ∪ ... ∪ f_c = E, and (c) describes the speed limit at which the salesman can travel.


The speed limit and traffic zone from Fig. 1 depend directly on the time interval because the speed and traffic zone change during the day. These attributes can increase or decrease depending on the time of day; while the travel speed decreases, the traffic zone increases, and vice versa.

3.2 Calculation Process Example

In this research, we use Ichoua's proposal in [13] to calculate the TD-TSP route with traffic patterns. For this example, we use the instance 15ANodi_1. This instance produces an objective value of 952.34 with the following route: 0, 6, 15, 14, 3, 1, 11, 8, 2, 12, 4, 7, 9, 13, 10, 5, 16. Figure 2 describes the way speeds and traffic are taken along a trip, where the green arrow represents the traffic zone and the red arrow represents the time interval. The speed parameter is a value in [0, 1], representing an inverse percentage of the distance. Therefore, to determine the time value, the process divides the distance by the speed percentage (distance/speed). Table 3 describes the results for the sample instance, where p_i and p_{i+1} represent two different vertices, (t_j, t_k) is the time interval (TI), t_{p_i} is the current time (CT), d_{p_i, p_{i+1}} is the distance (Dst), s_{p_i, p_{i+1}(t_j, t_k)} is the speed (Spd) from p_i to p_{i+1} at time interval (t_j, t_k), and a_{p_{i+1}} is the arrival time (AT). Additionally, the data are rounded to one decimal for visibility purposes. Table 4 shows the first part of the travel, where the vertices do not surpass the time interval, which goes from τ_{0,6} to τ_{14,3}, obtaining a total cost of 344.45. Table 5 describes the second part of the calculation process. In this case, the route τ_{3,1} surpasses the time interval (t_j, t_k) = 362.6. Thus, it is necessary to split the cost into two parts to reflect the speed change. The first part uses a portion of the total distance to reach the time interval (t_j, t_k) = 362.6, and the second uses the remainder of the distance with the new speed and the new time interval, obtaining a total cost of 370.47. The rest of the calculation in Table 3 continues in the same way, either considering speed changes during a route or not, until the end of the tour, where the final cost is 952.34.

Fig. 2 Traffic zones and time interval interpretation


Table 3  Instance with all results

Vertex p_i | Vertex p_{i+1} | TI (t_j, t_k) | CT t_{p_i} | Dst | Spd | AT
0 | 6 | 362.6 | 0 | 58.0 | 0.44 | 263.4
6 | 15 | 362.6 | 263.4 | 9.1 | 0.22 | 304.9
15 | 14 | 362.6 | 304.9 | 4.2 | 0.22 | 324.1
14 | 3 | 362.6 | 324.1 | 4.5 | 0.22 | 344.5
3 | Split | 362.6 | 344.5 | 4.0 | 0.22 | 362.6
Split | 1 | 725.2 | 362.6 | 3.1 | 0.4 | 370.5
1 | 11 | 725.2 | 370.5 | 6.1 | 0.4 | 385.7
11 | 8 | 725.2 | 385.7 | 5.2 | 0.4 | 398.7
8 | 2 | 725.2 | 398.7 | 5.3 | 0.4 | 411.9
2 | 12 | 725.2 | 411.9 | 6.4 | 0.4 | 427.9
12 | 4 | 725.2 | 427.9 | 8.3 | 0.4 | 448.6
4 | 7 | 725.2 | 448.6 | 5.6 | 0.4 | 462.6
7 | 9 | 725.2 | 462.6 | 5.3 | 0.4 | 475.8
9 | 13 | 725.2 | 475.8 | 7.0 | 0.4 | 493.4
13 | 10 | 725.2 | 493.4 | 8.7 | 0.4 | 515.2
10 | 5 | 725.2 | 515.2 | 6.7 | 0.4 | 531.8
5 | 16 | 725.2 | 531.8 | 57.2 | 0.4 | 674.8
16 | Split | 725.2 | 674.8 | 15.1 | 0.4 | 725.2
Split | 0 | 1087.8 | 725.2 | 99.2 | 0.44 | 952.3

Table 4  Example 1

Vertex p_i | Vertex p_{i+1} | TI (t_j, t_k) | CT t_{p_i} | Dst | Spd | AT
0 | 6 | 362.6 | 0 | 58.0 | 0.44 | 263.4
6 | 15 | 362.6 | 263.4 | 9.1 | 0.22 | 304.9
15 | 14 | 362.6 | 304.9 | 4.2 | 0.22 | 324.1
14 | 3 | 362.6 | 324.1 | 4.5 | 0.22 | 344.5

Table 5  Instance with all results

Vertex p_i | Vertex p_{i+1} | TI (t_j, t_k) | CT t_{p_i} | Dst | Spd | AT
3 | Split | 362.6 | 344.5 | 4.0 | 0.22 | 362.6
Split | 1 | 725.2 | 362.6 | 3.1 | 0.4 | 370.5
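The split calculation illustrated in Tables 3, 4 and 5 can be sketched in a few lines of Python. This is only an illustration of the interval-splitting idea of Ichoua et al. [13] and not the authors' C++ implementation; the function name leg_time and the speed profile below are assumptions made for the example, with the numeric values taken from the tables above.

    def leg_time(distance, depart, speed_profile):
        """Travel time of a single arc when its speed changes at the boundaries
        of the time intervals (t_j, t_k), following the idea of Ichoua et al. [13].
        speed_profile: sorted list of (t_j, t_k, speed) covering the horizon."""
        t, remaining, elapsed = depart, distance, 0.0
        for t_j, t_k, speed in speed_profile:
            if t >= t_k:
                continue                               # this interval is already over
            reachable = (t_k - t) * speed              # distance coverable before t_k
            if remaining <= reachable:                 # the arc finishes inside the interval
                return elapsed + remaining / speed
            elapsed += t_k - t                         # split: spend the whole interval
            remaining -= reachable
            t = t_k
        raise ValueError("departure time is outside the planning horizon")

    # Speed profile assumed for the arc (3, 1) of the 15ANodi_1 example:
    # 0.22 until 362.6, then 0.4 until 725.2, then 0.44 until 1087.8.
    profile = [(0.0, 362.6, 0.22), (362.6, 725.2, 0.4), (725.2, 1087.8, 0.44)]
    arrival = 344.5 + leg_time(7.1, 344.5, profile)
    print(round(arrival, 1))   # about 370.4, matching the 370.5 of Table 5 up to rounding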


4 Greedy Randomized Adaptive Search Procedure Algorithm for Time Dependent-Traveling Salesman Problem

The greedy randomized adaptive search procedure (GRASP) is a metaheuristic proposed by Feo et al. [7, 8]. The algorithm is a multi-start process with two phases. The first phase is a heuristic construction, which provides a high-quality solution, and the second phase improves the solution using a local search.

In this research, we use a Cellular Processing Algorithm (CPA). This approach is a metaheuristic proposed in [25], which uses several Processing Cells (PCell) to explore different solution spaces and verifies a stagnation condition to avoid unnecessary work and save time. We implement the GRASP heuristic construction and shared memory in each processing cell to influence the vertex selection process.

Algorithm 1 shows our proposed optimization procedure; it generates solutions within a predefined time limit. The process starts with the InitializeParameters function, which obtains the main components of the GRASP and CPA algorithms. CreateLC provides the required information to build the candidate list. The TdTspGrasp function constructs the GRASP solutions for the TD-TSP. The algorithm uses a communication component to share information among the processing cells; in this case, this component is called SaveSharedMemory. Finally, the Normalization process normalizes the data from the SaveSharedMemory function, which will influence the candidate list.

Algorithm 1 CPA General Procedure
1: Pcell = ∅, best = ∅, lc = ∅, BPc = ∅, records = ∅, norm = ∅
2: InitializeParameters()
3: while time < MaxTime do
4:   for j = 1 → MaxPcell do
5:     lc = CreateLC()
6:     Pcell_j = TdTspGrasp(lc, norm)
7:     if Pcell_j < best then
8:       best ← Pcell_j
9:       SaveSharedMemory(Pcell_j)
10:    end if
11:  end for
12:  BPc = getBestPcell()
13:  records = SaveSharedMemory(BPc)
14:  norm = Normalization(records)
15: end while
Output: best


4.1 Greedy Randomized Adaptive Search Procedure Construction

The main contribution of this paper is the use of shared memory during the construction process, which consists of generating the permutations and storing the recurring vertices that help influence the selection process. The vertex v_0 is the first and last point of the permutation. Algorithm 2 describes the route construction in two parts. The first part iterates over the list of available vertices to calculate the cost (see lines 5–11) using the cost function (line 8). In the second part, the algorithm constructs the candidate list (CL), the cost set (C), the restricted candidate list (RCL), and the restricted cost set (RC), and selects the next vertex using the roulette procedure (lines 17–31). Also, the procedure uses the information from the Normalization function to influence the candidate list through the InfluenceCL function (line 13).

Algorithm 2 GRASP Construction
Require: i, CL, S
1: miss = 1
2: while miss < |V| do
3:   CL = ∅
4:   C = ∅
5:   for i = 0 → |V| do
6:     if s_i = 0 then
7:       CL = CL ∪ {v_i}
8:       c_i = CalculateCost(v_i)
9:       C = C ∪ {c_i}
10:    end if
11:  end for
12:  if Reg_exist then
13:    InfluenceCL(v_{r-1}, CL, norm)
14:  end if
15:  RCL = ∅
16:  RC = ∅
17:  for i = 1 → |CL| do
18:    v_min = arg min_{c_i ∈ C} c_i
19:    RCL = RCL ∪ {v_{v_min}}
20:    RC = RC ∪ {c_{v_min}}
21:    C = C \ {c_{v_min}}
22:  end for
23:  Limit = min(C) + (β × (max(C) − min(C)))
24:  for i = 1 → |RCL| do
25:    if c_i ∈ RC and c_i > Limit then
26:      RC = RC \ {c_i}
27:      RCL = RCL \ {v_i}
28:    end if
29:  end for
30:  v_r = Roulette(RCL, RC)
31:  S = S ∪ {v_r}
32: end while


The algorithm verifies that the set value s_i is equal to 0, where S = {s_1, s_2, ..., s_n} is a set that represents the availability of vertex v_i. During the construction of the candidate list (CL), the vertex v_0 is the first and last point by default. The selection of the remaining v_i vertices follows the GRASP scheme to build the candidate list and the restricted candidate list. The CalculateCost function obtains the cost of a vertex v_i (see line 8) using Eq. (1) from Sect. 3. After adding the elements to CL (see line 9 in Algorithm 2), to use the InfluenceCL function (line 13), we need to normalize the data from SaveSharedMemory and search CL for candidate vertices whose cost must be updated (see Eq. (2)).

  CL.ncost_i = CL.cost_i − (norm_{ijk} × CL.cost_i)        (2)

where CL.ncost_i is the new cost in the candidate list, CL.cost_i is the original one, and norm_{ijk} is the normalized data from the shared memory; both the shared memory and the normalized data are time-interval dependent. Later, Eq. (3) calculates a limit value, which is compared with the cost of each candidate vertex to create the restricted candidate list (RCL) (see line 23 in Algorithm 2).

  Limit = min(C) + (β × (max(C) − min(C)))        (3)

The algorithm initializes the restricted candidate list (RCL) with the elements of CL with the lowest cost. Line 18 obtains the index with minimal cost c_i. Later, the process updates the restricted candidate list (RCL) and the restricted cost list (RC), respectively (see lines 19 and 20). Then, we remove c_{v_min} from C (see line 21). The algorithm uses the Limit value to delimit the RCL to candidates whose cost is lower than or equal to Limit (see line 23 in Algorithm 2); accordingly, the procedure updates RC and RCL in lines 26 and 27, respectively. Finally, the algorithm selects a vertex v_r from the RCL randomly using the roulette technique [9] (see line 30) and adds it to the route.
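The effect of Eqs. (2) and (3) on the candidate list can be sketched as follows. This is an illustrative Python fragment, not the chapter's C++ code; the function name build_rcl, the dictionary-based norm lookup, and the toy values are assumptions made for the example.

    def build_rcl(candidates, costs, norm, beta):
        """Apply the shared-memory influence of Eq. (2) to the candidate costs and
        keep the candidates whose influenced cost is within the limit of Eq. (3)."""
        influenced = {v: c - norm.get(v, 0.0) * c for v, c in zip(candidates, costs)}
        c_min, c_max = min(influenced.values()), max(influenced.values())
        limit = c_min + beta * (c_max - c_min)            # Eq. (3)
        rcl = [v for v in candidates if influenced[v] <= limit]
        return rcl, [influenced[v] for v in rcl]

    # Toy usage: the arc towards vertex 7 was frequent in previous good solutions,
    # so its normalized frequency (0.9) pulls its cost down and into the RCL.
    candidates, costs = [2, 5, 7, 9], [10.0, 12.0, 18.0, 25.0]
    rcl, rcl_costs = build_rcl(candidates, costs, norm={7: 0.9}, beta=0.4)
    print(rcl, rcl_costs)      # [2, 7]; vertex 7's influenced cost drops to about 1.8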

4.2 Roulette Procedure

Algorithm 3 shows the selection process using the roulette technique. The roulette assigns larger probabilities to the elements in the RCL with a lower cost.


Algorithm 3 General Roulette Procedure
Require: RCL, RC
1: RC = {c'_i | c'_i = max(RC) + min(RC) − c_i, ∀ c_i ∈ RC}
2: totalS = Σ_{c_i ∈ RC} c_i
3: randV = Random(1, totalS)
4: for ∀ c_i ∈ RC do
5:   res = randV − c_i
6:   if res ≤ 0 then
7:     return (v_i ∈ RCL)
8:   end if
9: end for

The procedure starts by obtaining the maximum and minimum cost values from RC, transforming high values of c_i into low values and vice versa (see line 1). Furthermore, totalS stores the sum of the RC elements (line 2). On the other hand, randV obtains a random value between 1 and totalS (line 3). Finally, the procedure iterates over RC and subtracts c_i from randV to produce a res value (lines 4 and 5 in Algorithm 3). The process ends when res is less than or equal to zero and returns the selected vertex v_i (lines 6 and 7).
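A common way to realize this roulette in Python is shown below. It is a sketch under the assumption that lower costs must receive larger probabilities, with the random value reduced cumulatively as the candidates are scanned; the function name and the example values are ours, not the chapter's.

    import random

    def roulette(rcl, rc):
        """Pick a vertex from the restricted candidate list, giving a larger
        selection probability to lower-cost candidates (cf. Algorithm 3)."""
        inverted = [max(rc) + min(rc) - c for c in rc]    # low cost -> large weight
        rand_v = random.uniform(0, sum(inverted))
        for vertex, weight in zip(rcl, inverted):
            rand_v -= weight                              # cumulative subtraction
            if rand_v <= 0:
                return vertex
        return rcl[-1]                                    # numerical safety net

    print(roulette([2, 7], [10.0, 1.8]))    # returns vertex 7 most of the time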

4.3 Influence on the Candidate List

Once CL has the candidate vertices, we need to check for records in the shared memory. If there is any record, then we update CL, exchanging the original cost for a new cost using the InfluenceCL function (line 13 in Algorithm 2). Figure 3 describes the influence on CL. This process iterates over CL, looking in the normalization matrix for the proper interval and updating the cost according to the ratio stored there. This influence biases the roulette towards arcs found in previous solutions.

4.4 Shared Memory and Normalization Figure 4 shows the idea of shared memory and its normalization. When the construction process ends, the algorithm stores the frequency of each arc in the best solution of each processing cell considering their time interval k and normalizes it (see lines 13, and 14 in Algorithm 1). Shared memory helps the processing cells to determine the best candidate arc to add to the permutation. Additionally, it helps to determine the best interval to use in the permutation. However, the frequencies change dynamically. Therefore the process updates and normalizes the data from the shared memory constantly.


Fig. 3 Influence of candidate list

Fig. 4 Shared memory and normalization


To normalize the data, we have proposed three values called x_1, x_2, and x_3, where x_1 and x_2 help to influence the arcs in the candidate list construction and x_3 diminishes the value of the frequencies in SaveSharedMemory, SSM = {ssm_{111}, ..., ssm_{ijk}, ..., ssm_{nnk}} (see Eqs. (4) and (5)).

  norm_{ijk} = \begin{cases} x_1 \times \dfrac{ssm_{ijk} - \min(SSM)}{\max(SSM) - \min(SSM)} & \text{if } norm_{ijk} \neq x_1 \\[4pt] x_2 & \text{otherwise} \end{cases}        (4)

  ssm_{ijk} = ssm_{ijk} \times x_3        (5)
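A minimal sketch of this bookkeeping is given below. The dictionary keyed by (i, j, k) triples, the function names, and the interpretation of the fallback to x_2 are our own assumptions for illustration; only the min-max scaling, the cap x_1, the fallback x_2, and the decay x_3 follow Eqs. (4) and (5).

    def update_shared_memory(ssm, best_arcs, x3=0.6):
        """Add the arcs (i, j, k) of a processing cell's best solution to the
        frequency table and then decay all frequencies, as in Eq. (5)."""
        for arc in best_arcs:
            ssm[arc] = ssm.get(arc, 0.0) + 1.0
        for arc in ssm:
            ssm[arc] *= x3                                # Eq. (5)

    def normalize(ssm, x1=0.9, x2=0.5):
        """Scale the frequencies to [0, x1]; entries that hit the cap x1 are
        replaced by x2, as stated in Eq. (4)."""
        lo, hi = min(ssm.values()), max(ssm.values())
        span = (hi - lo) or 1.0
        norm = {}
        for arc, freq in ssm.items():
            value = x1 * (freq - lo) / span
            norm[arc] = value if value != x1 else x2
        return norm

    # Two cells report their best routes; the repeated arc (0, 6) in interval 0
    # accumulates the largest decayed frequency and keeps the strongest influence.
    ssm = {}
    update_shared_memory(ssm, [(0, 6, 0), (6, 15, 0)])
    update_shared_memory(ssm, [(0, 6, 0), (15, 14, 0)])
    print(normalize(ssm))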

4.5 Reactive Greedy Randomized Adaptive Search Procedure

The reactive greedy randomized adaptive search procedure (Reactive GRASP) is a metaheuristic proposed by Prais and Ribeiro [20]. The algorithm works the same as the GRASP algorithm, with a variation in the selection of β. Conventionally, the β parameter used to create the RCL is fixed, while in the reactive version the β parameter is selected by a roulette that considers the performance of previous solutions. Initially, all values have the same probability, and the probabilities are updated according to new reasonable solutions. The Reactive GRASP algorithm leads to improvements over the classical GRASP approach in robustness and solution quality due to diversification and less dependency on parameter tuning.
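The probability update that reactive GRASP typically applies can be sketched as follows, following the general scheme of Prais and Ribeiro [20]; the variable names, the exponent delta, and the numeric values are illustrative assumptions rather than the chapter's implementation.

    import random

    def pick_beta(betas, probs):
        """Roulette selection of a beta value according to the current probabilities."""
        return random.choices(betas, weights=probs, k=1)[0]

    def update_probs(betas, avg_cost, best_cost, delta=1.0):
        """Give more probability to the beta values whose solutions stayed close to
        the best cost found so far."""
        q = [(best_cost / avg_cost[b]) ** delta for b in betas]
        return [qi / sum(q) for qi in q]

    betas = [0.3, 0.4, 0.5, 0.6]
    avg_cost = {0.3: 1010.0, 0.4: 980.0, 0.5: 995.0, 0.6: 1040.0}   # mean tour cost per beta
    probs = update_probs(betas, avg_cost, best_cost=952.3)
    print(pick_beta(betas, probs), probs)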

5 Experimental Results This section presents the experimental results from the proposed optimization method with CPA, where we incorporate two variations and three GRASP optimization methods for this problem. Additionally, we show the results of a series of non-parametric tests.

5.1 Configuration and Instances

The computational tests were carried out on a computer with an AMD Ryzen 7 4000-series processor at 2.90 GHz and 8 GB of RAM, Windows 10, Microsoft Visual Studio, and C++ as the implementation language. Table 6 shows the characteristics of the instances, where the first column shows the instance name, the second column shows the author reference, and the third and


Table 6  Instance set

Name | Author references | n | Traffic A 70% | Traffic A 80% | Traffic A 90% | Traffic B 70% | Traffic B 80% | Traffic B 90% | Total
15A-C | Cordeau [6] | 15 | 30 | 30 | 30 | 30 | 30 | 30 | 180
20A-C | Cordeau [6] | 20 | 30 | 30 | 30 | 30 | 30 | 30 | 180
25A-C | Cordeau [6] | 25 | 30 | 30 | 30 | 30 | 30 | 30 | 180
30A-C | Cordeau [6] | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 180
35A-C | Cordeau [6] | 35 | 30 | 30 | 30 | 30 | 30 | 30 | 180
40A-C | Cordeau [6] | 40 | 30 | 30 | 30 | 30 | 30 | 30 | 180
45A-C | Cordeau [6] | 45 | 30 | 30 | 30 | 30 | 30 | 30 | 180
50A-C | Cordeau [6] | 50 | 30 | 30 | 30 | 30 | 30 | 30 | 180
55A-C | Cordeau [6] | 55 | 30 | 30 | 30 | 30 | 30 | 30 | 180
60A-C | Cordeau [6] | 60 | 30 | 30 | 30 | 30 | 30 | 30 | 180
Totals |  |  | 300 | 300 | 300 | 300 | 300 | 300 | 1800

fourth columns show the total of the instances and their sizes. The instances, code, and results are available at https://github.com/csalas07/CPATd-Tsp.git.

5.2 Parameter Comparison

In this section, we present the performance comparison between the three CPA optimization methods with GRASP processing cells: CPAEst, where the GRASP algorithm uses a static β parameter; CPARank, where the GRASP uses ranked β parameters; and CPAReact, with a Reactive GRASP algorithm (see line 6 in Algorithm 1). The other performance comparison consists of three GRASP optimization methods: GRand, where the algorithm uses a random β parameter; GRank, which selects a β parameter randomly in the range 0.3–0.6; and GReact, with a Reactive GRASP algorithm. Additionally, we analyze the normalization parameters called x_1, x_2, and x_3; these parameters define the probability of selecting a frequent arc (see Eqs. 4 and 5 in Sect. 4.4). Figures 5, 6, 7 and 8 show a sample of the 900 instances, and their behavior is consistent throughout this experimentation. The tests use two traffic patterns, A and B; therefore, we refer to GRandA–GRandB, GRankA–GRankB, GReactA–GReactB, CPAEstA–CPAEstB, CPARankA–CPARankB, and CPAReactA–CPAReactB.


Fig. 5 Cost comparison among GRASP method with pattern A

Fig. 6 Cost comparison among GRASP method with pattern B

Figure 5 describes the quality results among the GRandA, GRankA, and GReactA methods. This comparison shows that the GRankA method obtains lower costs than the GRandA and GReactA methods. A Friedman test showed that GRankA is the best-ranked method with 1.68, followed by GRandA with 2.06 and GReactA with 2.25; here, the lower the rank, the lower the objective function value (see Table 7). Additionally, the results show that there is a statistical difference between them. Finally, the Wilcoxon test shows that GRankA has statistically better performance than GRandA and GReactA with a p-value of 0.000 (100% certainty, see Table 11). Figure 6 describes the quality results among GRandB, GRankB, and GReactB. The comparison shows that the GRankB method obtains lower costs than the GRandB and GReactB methods.
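The non-parametric comparisons reported in this section (Friedman ranks and pairwise Wilcoxon tests) can be reproduced with scipy; the arrays below are placeholder numbers standing for the per-instance tour costs of each method, not the actual experimental data.

    import numpy as np
    from scipy import stats

    # per-instance tour costs of three methods (one entry per instance)
    grand  = np.array([985.1, 1210.4, 1503.2, 1712.8, 1954.6, 2210.3])
    grank  = np.array([952.3, 1189.9, 1480.5, 1690.1, 1921.7, 2188.0])
    greact = np.array([991.7, 1225.0, 1512.6, 1730.4, 1970.2, 2231.9])

    chi2, p_friedman = stats.friedmanchisquare(grand, grank, greact)   # overall ranking test
    stat, p_wilcoxon = stats.wilcoxon(grank, grand)                    # pairwise comparison
    print(p_friedman, p_wilcoxon)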


Fig. 7 Cost comparison between CPA methods with pattern A

Table 7  Friedman test of GRASP with traffic pattern A

Technique | Rank
GRankA | 1.68
GRandA | 2.06
GReactA | 2.25

Here, the Friedman test showed that GRankB is the best-ranked method with 1.67, followed by GRandB with 1.99 and GReactB with 2.33 (see Table 8). Finally, the Wilcoxon test shows that GRankB performs better than GRandB and GReactB with a p-value of 0.000 (100% certainty, see Table 11). These results indicate that using discrete β values, as in the GRank(A/B) approach, outperforms the use of β as a continuous value. Furthermore, giving probabilities to β as in the GReact(A/B) approach significantly affects the production of good-quality solutions; we consider that this behavior happens because it is easier to find new good solutions in the first iterations of the algorithm than in the last ones, which has an impact on the selection of future β values. Therefore, it would be interesting to implement, as future work, a cleanup of the β selection probabilities. Figure 7 describes the quality results among CPAEstA, CPARankA, and CPAReactA. This comparison shows that the CPAEstA method obtains lower costs than the CPARankA and CPAReactA methods. The Friedman test showed that CPAEstA is the best-ranked method with 1.31, followed by CPARankA with 2.23 and CPAReactA with 2.46; here, the lower the rank, the lower the objective function value (see Table 9). Finally, the Wilcoxon test shows that CPAEstA has statistically better performance than CPARankA and CPAReactA with a p-value of 0.000 (100% certainty, see Table 11). Figure 8 describes the quality results among CPAEstB, CPARankB, and CPAReactB. This comparison shows that the CPAEstB method obtains lower costs than the CPARankB and CPAReactB methods.


Fig. 8 Cost comparison among CPA methods with pattern B

Table 8  Friedman test of GRASP with traffic pattern B

Technique | Rank
GRankB | 1.67
GRandB | 1.99
GReactB | 2.33

Table 9  Friedman test of CPA with traffic pattern A

Technique | Rank
CPAEstA | 1.31
CPARankA | 2.23
CPAReactA | 2.46

Table 10  Friedman test of CPA with traffic pattern B

Technique | Rank
CPAEstB | 1.35
CPARankB | 2.21
CPAReactB | 2.44

Here, the Friedman test showed that CPAEstB is the best-ranked method with 1.35, followed by CPARankB with 2.21 and CPAReactB with 2.44 (see Table 10 for more information). Finally, the Wilcoxon test shows that CPAEstB performs statistically better than CPARankB and CPAReactB with a p-value of 0.000 (100% certainty, see Table 11). These results show that shared memory produces good-quality solutions. However, we believe that the randomness in β parameters could produce not-so-good solutions, negatively affecting the shared memory and producing undesirable solutions in the remainder of the execution of the algorithm in other processing cells.

Table 11  Wilcoxon test of the best GRASP and CPA techniques with traffic pattern A/B

Techniques | P-value
GRankA-GRandA | 0.000
GRankB-GRandB | 0.000
CPAEstA-CPARankA | 0.000
CPAEstB-CPARankB | 0.000

Fig. 9 Parameters for x1 , x2 and x3

Table 11 shows the Wilcoxon test of the best GRASP and CPA techniques with traffic patterns A/B, where Techniques are the proposed approaches and the P-value is the percentage certainty. We analyzed the three parameters x_1, x_2, and x_3 from the normalization procedure. These parameters govern the probability of selecting a frequent arc during the permutation construction process. Figure 9 shows the optimal parameters for the normalization process, where (a) is the x_1 value of 0.9 and (b) is the x_2 parameter of 0.5; the x_1 and x_2 parameters increase the chance of selecting a previously seen candidate arc. Finally, (c) is the x_3 value of 0.6; the x_3 parameter reduces the weight of the normalized elements so that the data change dynamically.

5.3 Comparison Between GRASP Methods

This section compares the two best GRASP optimization methods, GRand and GRank.


Figures 10 and 11 show the sum of the results of each instance from Table 6 for both optimization methods. These results consider traffic patterns A and B with jams of 70%, 80%, and 90%. We carried out the experimentation within a time limit of 60 s. Figure 10 describes the sum of the results of each instance for GRandA and GRankA. Here, GRankA outperformed GRandA in all cases. The Wilcoxon test shows that GRankA has statistically better performance than GRandA with a p-value of 0.000, which is equivalent to 100% certainty that GRankA outperforms GRandA (see Table 12). Figure 11 describes the sum of the results of each instance from Table 6 for both optimization methods, GRandB and GRankB. In this test, GRankB also outperformed GRandB in all cases. Additionally, the Wilcoxon test also showed that GRankB outperformed GRandB with 100% certainty (see Table 12).

Fig. 10 Comparison between GRandA and GRankA for instances 15A-C to 60A-C

Fig. 11 Comparison between GRandB and GRankB for instances 15A-C to 60A-C

Table 12  Wilcoxon test of the best GRASP techniques with traffic pattern A/B

Techniques | P-value
GRankA-GRandA | 0.000
GRankB-GRandB | 0.000

These results suggest that delimiting the range of β in GRank(A/B) produces reasonable solutions because it obtains constant improvements. However, the total randomness of β in GRand(A/B) could create not-so-good solutions; therefore, it can affect the production of reasonable solutions in new iterations. Figure 12 illustrates the production of not-so-good solutions derived from best solutions as a consequence of the randomness of the β parameters, where the green ones are the best solutions with good performance and the red ones are the best solutions with poor performance. Table 12 shows the Wilcoxon test for the two best GRASP optimization methods with traffic patterns A/B, where Techniques are the proposed approaches and the P-value is the percentage certainty.

5.4 Comparison Between CPA Methods

This section compares the results of the two best CPA optimization methods, CPAEst and CPARank. Figures 13 and 14 show the sum of the results of each instance from Table 6 for both optimization methods. These results consider traffic patterns A and B with jams of 70%, 80%, and 90%. We carried out the experimentation within a time limit of 60 s. Figure 13 compares the sum of the results of each instance for CPAEstA and CPARankA. Here, CPAEstA outperformed CPARankA in all cases, which is also supported by the Wilcoxon test with a p-value of 0.000 (100% certainty of a difference in the performance of both algorithms, see Table 13).

Fig. 12 GRASP solutions interpretations


Figure 14 compares the sum of the results of each instance from Table 6 for CPAEstB and CPARankB. CPAEstB also outperformed CPARankB in all cases, just as CPAEstA outperformed CPARankA. Here, the Wilcoxon test also showed that CPAEstB outperformed CPARankB with 100% certainty (see Table 13). Regarding these results, we believe that the randomness of β in CPARank(A/B) could produce not-so-good solutions, which could tamper with the shared memory by adding not-so-good edges; therefore, it can affect the production of reasonable solutions in other processing cells. Hence, we consider as future research the study of memory alteration proportions with respect to the quality of the solutions. Figure 15 illustrates the alteration of the shared memory by not-so-good solutions derived from best solutions as a consequence of the randomness of the β parameters, where the

Fig. 13 Comparison between CPAEstA and CPARankA for instances 15A-C to 60A-C

Fig. 14 Comparison between CPAEstB and CPARankB

Table 13  Wilcoxon test of the best CPA techniques with traffic pattern A/B

Techniques | P-value
CPAEstA-CPARankA | 0.000
CPAEstB-CPARankB | 0.000

Fig. 15 CPA Solutions interpretations

green ones are the best solutions with good performance and the red ones are the best solutions with poor performance. Table 13 shows the Wilcoxon test for the best CPA optimization techniques with Traffic pattern A/B, where the columns show the proposed approaches to be compared and their respective P-value.

5.5 Comparison Between CPA and GRASP Methods

This section compares the best CPA and GRASP optimization algorithms. The methods have the same denomination as in Sects. 5.3 and 5.4 (GRank and CPAEst). We execute the approaches with a time limit of 60 s. Figures 16 and 17 show a sample of the 900 instances, and their behavior is consistent throughout this experimentation. The tests use two traffic patterns, A and B; therefore, we refer to GRank(A/B) and CPAEst(A/B). Figure 16 shows the quality results for CPAEstA and GRankA. CPAEstA surpasses GRankA in all cases. We consider this to be the impact of the shared memory, which prioritizes the selection of vertices that had good performance in previous solutions. The Wilcoxon test shows that CPAEstA outperforms GRankA with a p-value of 0.000 (100% certainty, see Table 14). Figure 17 shows the comparison between CPAEstB and GRankB. Here, CPAEstB also surpasses GRankB in all cases, which is supported by the Wilcoxon test with 100% certainty (see Table 14).


Fig. 16 Comparison between CPAEstA, and GRankA

Fig. 17 Comparison between CPAEstB and GRankB

Table 14  Wilcoxon test of the best GRASP and CPA techniques with traffic pattern A/B

Techniques | P-value
CPAEstA-GRankA | 0.000
CPAEstB-GRankB | 0.000

These results show that the use of shared memory in CPAEst(A/B) produces reasonable quality solutions because it prioritizes the best solutions of each processing cell, contrary to the GRASP approaches, which despite having constant improvements, are outperformed by CPAEst(A/B). Table 14 shows the Wilcoxon tests between the best GRASP and CPA optimization techniques with both traffic patterns A/B, where the columns show the proposed optimization techniques and their respective P-values.


6 Conclusions

In this paper, we tackle the Time Dependent-Traveling Salesman Problem, which is NP-hard [22]. We proposed a Cellular Processing Algorithm with shared memory and three GRASP implementations, including a Reactive GRASP, for this problem. Additionally, we carried out extensive experimentation with the proposed algorithms and instances. Regarding the GRASP algorithms, the results showed that the best-performing implementation used a random β parameter between 0.3 and 0.6 instead of a fixed, supposedly good β or even the reactive GRASP. On the other hand, the Cellular Processing Algorithm used shared memory, which counts the frequency with which an edge appears in the best solution of each processing cell in each iteration, to prioritize future selections based on those edges. Here, the shared memory influences the candidate list in the GRASP construction. Additionally, the best-performing CPA implementation used a static β parameter instead of a ranked variant as in the GRASP implementations; we believe that CPARank tampers with the memory by including not-so-good elements obtained with the higher β values of the rank. Additionally, we proposed three parameters to fine-tune the shared-memory normalization, allowing us to balance and influence the vertex selection in the candidate list, where x_1 and x_2 prioritize previously good vertices and x_3 diminishes the value of the frequencies in the shared memory (see Sect. 4.4). As future work, we consider studying proportional shared-memory alterations with respect to the solutions' quality. We believe that the randomness of the β parameter could produce poor-quality solutions that might negatively affect the shared memory and the performance of the whole optimization technique.

References

1. Adamo, T., Ghiani, G., Guerriero, E.: An enhanced lower bound for the time-dependent travelling salesman problem. Comput. Oper. Res. 113 (2020). https://doi.org/10.1016/j.cor.2019.104795
2. Adamo, T., Ghiani, G., Guerriero, E.: On path ranking in time-dependent graphs. Comput. Oper. Res. 135(May), 105446 (2021). https://doi.org/10.1016/j.cor.2021.105446
3. Albiach, J., Sanchis, J.M., Soler, D.: An asymmetric TSP with time windows and with time-dependent travel times and costs: An exact solution through a graph transformation. Eur. J. Oper. Res. 189(3), 789–802 (2008). https://doi.org/10.1016/j.ejor.2006.09.099
4. Arigliano, A., Calogiuri, T., Ghiani, G., Guerriero, E.: A branch-and-bound algorithm for the time-dependent travelling salesman problem. Networks 72(3), 382–392 (2018). https://doi.org/10.1002/net.21830
5. Cacchiani, V., Contreras-Bolton, C., Toth, P.: Models and algorithms for the traveling salesman problem with time-dependent service times. Eur. J. Oper. Res. 283(3), 825–843 (2020). https://doi.org/10.1016/j.ejor.2019.11.046


6. Cordeau, J.F., Ghiani, G., Guerriero, E.: Analysis and branch-and-cut algorithm for the time-dependent travelling salesman problem. Transp. Sci. 48(1), 46–58 (2014). https://doi.org/10.1287/trsc.1120.0449
7. Feo, T.A., Resende, M.G.: A probabilistic heuristic for a computationally difficult set covering problem. Oper. Res. Lett. 8(2), 67–71 (1989). https://doi.org/10.1016/0167-6377(89)90002-3
8. Feo, T.A., Resende, M.G.: Greedy randomized adaptive search procedures. J. Glob. Optim. 6(2), 109–133 (1995). https://doi.org/10.1007/BF01096763
9. Gendreau, M., Potvin, J.Y.: Handbook of Metaheuristics, vol. 146, 2nd edn. Springer, New York, USA (2012). https://doi.org/10.1007/978-1-4614-1900-6
10. Gomez-Santillán, C.G., Cruz-Reyes, L., Morales-Rodríguez, M.L., González-Barbosa, J.J., Castillo López, O., Rivera, G., Hernández, P.: Variants of VRP to optimize logistics management problems. In: Logistics Management and Optimization through Hybrid Artificial Intelligence Systems, pp. 207–237. IGI Global (2012). https://doi.org/10.4018/978-1-4666-0297-7.ch008
11. Heilporn, G., Cordeau, J.F., Laporte, G.: The Delivery Man Problem with time windows. Discret. Optim. 7(4), 269–282 (2010). https://doi.org/10.1016/j.disopt.2010.06.002
12. Holguin, L., Ochoa-Zezzatti, A., Larios, V.M., Cossio, E., Maciel, R., Rivera, G.: Small steps towards a smart city: Mobile application that provides options for the use of public transport in Juarez City. In: 2019 IEEE International Smart Cities Conference (ISC2), pp. 100–105. IEEE (2019). https://doi.org/10.1109/ISC246665.2019.9071728
13. Ichoua, S., Gendreau, M., Potvin, J.Y.: Vehicle dispatching with time-dependent travel times. Eur. J. Oper. Res. 144(2), 379–396 (2003). https://doi.org/10.1016/S0377-2217(02)00147-9
14. Jigang, W., Jin, S., Ji, H., Srikanthan, T.: Algorithm for time-dependent shortest safe path on transportation networks. In: Procedia Computer Science, vol. 4, pp. 958–966. Elsevier (2011). https://doi.org/10.1016/j.procs.2011.04.101
15. Lopez-Loces, M.C., Musial, J., Pecero, J.E., Fraire-Huacuja, H.J., Blazewicz, J., Bouvry, P.: Exact and heuristic approaches to solve the Internet shopping optimization problem with delivery costs. Int. J. Appl. Math. Comput. Sci. 26(2), 391–406 (2016). https://doi.org/10.1515/amcs-2016-0028
16. Mancha, J.J., Guerrero, M.S., Chong, A.G.V., Barbosa, J.G., Gómez, C., Cruz-Reyes, L., Rivera, G.: A mobile application for helping urban public transport and its logistics. In: Handbook of Research on Military, Aeronautical, and Maritime Logistics and Operations, pp. 385–406. IGI Global (2016). https://doi.org/10.4018/978-1-4666-9779-9.ch020
17. Miranda-Bront, J.J., Méndez-Díaz, I., Zabala, P.: An integer programming approach for the time-dependent TSP. Electron. Notes Discret. Math. 36(C), 351–358 (2010). https://doi.org/10.1016/j.endm.2010.05.045
18. Montero, A., Méndez-Díaz, I., Miranda-Bront, J.J.: An integer programming approach for the time-dependent traveling salesman problem with time windows. Comput. Oper. Res. 88, 280–289 (2017). https://doi.org/10.1016/j.cor.2017.06.026
19. Ochoa-Zezzatti, A., Carbajal, U., Castillo, O., Mejía, J., Rivera, G., Gonzalez, S.: Development of a Java library to solve the school bus routing problem. In: Smart Technologies for Smart Cities, pp. 175–196. Springer, Berlin (2020). https://doi.org/10.1007/978-3-030-39986-3_9
20. Prais, M., Ribeiro, C.C.: Reactive GRASP: An application to a matrix decomposition problem in TDMA traffic assignment. INFORMS J. Comput. 12(3), 164–176 (2000). https://doi.org/10.1287/ijoc.12.3.164.12639
21. Pei, J., Mladenović, N., Urošević, D., Brimberg, J., Liu, X.: Solving the traveling repairman problem with profits: A novel variable neighborhood search approach. Inf. Sci. 507, 108–123 (2020). https://doi.org/10.1016/j.ins.2019.08.017
22. Picard, J.C., Queyranne, M.: Time-dependent traveling salesman problem and its application to the tardiness problem in one-machine scheduling. Oper. Res. 26(1), 86–110 (1978). https://doi.org/10.1287/opre.26.1.86
23. Santiago, A., Terán-Villanueva, J.D., Martínez, S.I., Rocha, J.A.C., Menchaca, J.L., Berrones, M.G.T., Ponce-Flores, M.: GRASP and iterated local search-based cellular processing algorithm for precedence-constraint task list scheduling on heterogeneous systems. Appl. Sci. (Switzerland) 10(21), 1–19 (2020). https://doi.org/10.3390/app10217500


24. Solomon, M.M.: Algorithms for the vehicle routing and scheduling problems with time window constraints. Oper. Res. 35(2), 254–265 (1987). https://doi.org/10.1287/opre.35.2.254
25. Terán-Villanueva, D., Martínez-Flores, J.A., López-Locés, M.C., Zamarrón-Escobar, D.E., Santiago, A.: Hybrid GRASP with composite local search and path-relinking for the linear ordering problem with cumulative costs. Int. J. Comb. Optim. Probl. Inform. 3(1), 21–30 (2012)
26. Terán-Villanueva, J.D., Fraire-Huacuja, H.J., Ibarra Martínez, S., Cruz-Reyes, L., Castán Rocha, J.A., Gómez Santillán, C., Menchaca, J.L.: Cellular processing algorithm for the vertex bisection problem: Detailed analysis and new component design. Inf. Sci. 478, 62–82 (2019). https://doi.org/10.1016/j.ins.2018.11.020
27. Vahidipour, S.M., Meybodi, M.R., Esnaashari, M.: Cellular adaptive Petri net based on learning automata and its application to the vertex coloring problem. Discret. Event Dyn. Syst.: Theory Appl. 27(4), 609–640 (2017). https://doi.org/10.1007/s10626-017-0251-z

Portfolio Optimization Using Reinforcement Learning and Hierarchical Risk Parity Approach

Jaydip Sen

Abstract Portfolio Optimization deals with identifying a set of capital assets and their respective weights of allocation, which optimizes the risk-return pairs. Optimizing a portfolio is a computationally hard problem. The problem gets more complicated if one needs to optimize future return and risk values, as predicting future stock prices is equally challenging. This work compares the performance of two approaches to portfolio optimization, hierarchical risk parity (HRP) and reinforcement learning (RL). Portfolios are designed using these two approaches on stocks chosen from thirteen important sectors listed on the National Stock Exchange of India, using historical stock prices from January 1, 2017, to December 31, 2021. The portfolios are tested on data from January 1, 2022, to November 30, 2022. The performances of the two portfolios for each sector are compared on their annual returns and risks and Sharpe ratios. The results exhibit a clear superiority of the RL portfolio over its HRP counterpart.

Keywords Portfolio optimization · Reinforcement learning · Minimum variance portfolio · Hierarchical risk parity portfolio · Return · Risk · Sharpe ratio

1 Introduction

Portfolio Optimization is the task of identifying a set of capital assets and their respective weights of allocation, which optimizes the risk-return pairs. Optimizing a portfolio with a reasonably large number of assets and investment constraints is a computationally hard problem [1]. The problem gets more complicated if one needs to optimize future return and risk values, as predicting future stock prices is equally challenging. Following the seminal work of Markowitz on the minimum-variance portfolio [2], several propositions have been made for different approaches to portfolio optimization. However, the mean-variance portfolio optimization suffers from


several weaknesses, including (i) estimation errors in the expected returns and covariance matrix caused by the erratic nature of financial returns, and (ii) unstable quadratic optimization that greatly jeopardizes the optimality of the resulting portfolios. The hierarchical risk parity (HRP) algorithm addresses three problems in the mean-variance optimization approach to portfolio design [3]. This work proposes a systematic approach for building robust portfolios of stocks from thirteen sectors of the Indian stock market. Thirteen critical economic sectors of India are first chosen, and for each sector, the ten stocks with the highest free-float market capitalization are identified as per their listing on the National Stock Exchange (NSE) of India [4]. The historical prices of these stocks are scraped from the web using their ticker names. Based on the historical prices for five years, two innovative approaches to portfolio design are adopted. In the first approach, portfolios are built for each sector based on the hierarchical risk parity approach [3]. In this approach, based on past returns, the stocks are categorized into several clusters, and the weights of the clusters are allocated in inverse proportion to their respective variances. In the second approach, a reinforcement learning (RL)-based model is designed for portfolio optimization. The optimization approach here works in a continuous control mode with delayed rewards, which is ideally suited for a continuously changing market. The RL agent exploits a Q-learning framework [5] that depends on a DQN (deep Q-learning network) [5, 6] to determine the policies for optimal allocation among the stocks of a given sector. The architecture of the RL model is adapted based on the historical stock prices of a given sector (i.e., on the training data). The states of the RL agent are the correlation matrices of the stock return values for a given sector over a specific time window, while the actions of the agent involve assigning and rebalancing the portfolio weights. For the RL agent corresponding to a sector, the environment consists of three years of historical close prices of the ten stocks in that sector. Both the HRP and the RL portfolios for the thirteen sectors are built based on the historical stock prices from January 1, 2017, to December 31, 2020, and the performances of the portfolios are evaluated during the test period from January 1, 2022, to November 30, 2022, in terms of their annual volatilities, cumulative returns, and maximum Sharpe ratios [7]. For every sector, the portfolio (RL or HRP) that yields the higher cumulative return over the test period is identified. The results will indicate which portfolio design approach is more likely to yield higher returns for most of the sectors listed on the NSE. The main contribution of the current work is threefold. First, it presents two different methods of designing robust portfolios, the HRP algorithm and the DQN-based RL approach. These portfolio design approaches are applied to thirteen critical sectors of stocks of the NSE. The results can be used as a guide for investors in the stock market for making profitable investments. Second, a backtesting method is proposed for evaluating the performance of the portfolios based on annual returns and risks and Sharpe ratios. Since the backtesting is done both on the training and the test data of the stock prices, the work has identified the more efficient portfolio both on training and test data.
Hence, a robust framework for evaluating different portfolios is demonstrated. Third, the returns of the portfolios on the thirteen major


sectors on the test data highlight the current profitability and volatility of these sectors. This information can be useful for investors. The chapter is organized as follows. Section 2 discusses some related works. Section 3 describes the methodology followed. Section 4 presents the results of the portfolio performances. Section 5 concludes the chapter.

2 Related Work

Several approaches have been proposed by researchers for accurate prediction of stock prices and robust portfolio optimization. Time series decomposition and econometric approaches like autoregressive integrated moving average (ARIMA), Granger causality, and vector autoregression (VAR) are extensively used for stock price prediction and portfolio optimization [8–14]. The use of machine learning and deep learning models for future stock price prediction has been the most popular approach of late [15–28]. Unlike classical approaches to portfolio optimization [2], which are based on the computation of the expected values of the future returns of stocks, machine learning and deep learning models attempt to directly maximize the portfolio returns and Sharpe ratios [7] using the historical values of the stock prices. These models aggregate multiple features from the stocks in a portfolio and, using multiple layers, extract salient features from the stocks and compute the portfolio weights to maximize the returns or risk-adjusted returns. The learning-based models for portfolio optimization are found to be more efficient since it has been observed that the classical forecasting approach most often does not lead to the maximization of the portfolio returns [29]. Hybrid models have also been proposed that utilize the algorithms and architectures of machine learning and deep learning and exploit the sentiments in textual sources on the social web [30–35]. The use of metaheuristic algorithms in solving multi-objective optimization problems for portfolio management has been proposed in several works [36–38]. The use of fuzzy logic, genetic algorithms (GAs), and algorithms of swarm intelligence (SI), e.g., particle swarm optimization (PSO), is also quite common in portfolio optimization [39–43]. The performances of the mean–variance, Eigen, and HRP portfolios have been compared on different stocks from various sectors of the Indian stock market [44–50]. A pair portfolio design approach using cointegration for the Indian stock market has also been proposed in the literature [51]. The use of generalized autoregressive conditional heteroscedasticity (GARCH) in estimating the future volatility of stocks and portfolios has also been illustrated [52, 53]. Finally, deep reinforcement learning approaches have been extensively used in portfolio optimization [54–67]. In the context of portfolio optimization, while machine learning and deep learning models learn from the patterns in historical stock prices and aspire to maximize future returns, reinforcement learning is an experience-driven autonomous system wherein an agent uses the historical stock
prices to randomly take actions and learn from its experience. Based on the experience gathered, the agent allocates weights to the stocks for maximizing future returns and minimizing the risk.

3 Data and Methodology

In this section, the five-step approach adopted in designing the proposed system is discussed in detail. The steps are as follows.

3.1 Choosing the Sectors

For designing the portfolios, the following sectors are chosen in the current study: (i) auto, (ii) consumer durables, (iii) financial services, (iv) FMCG, (v) healthcare, (vi) information technology (IT), (vii) media, (viii) metal, (ix) oil & gas, (x) private banks, (xi) PSU banks, (xii) realty, and (xiii) NIFTY 50. For each sector, the ten stocks with the maximum free-float market capitalization, based on the NSE's report of December 31, 2021, are identified [4]. Likewise, for designing the NIFTY 50 portfolio, the 50 stocks identified in the NSE's report of December 31, 2021, are used. The NIFTY 50 stocks are the leading 50 stocks in the Indian stock market chosen from several sectors. Hence, the portfolio designed on the NIFTY 50 stocks can be considered a diversified portfolio.

3.2 Data Acquisition

The historical prices of the stocks from January 1, 2017, to November 30, 2022, are extracted using Python's DataReader function and the Yahoo Finance API. The portfolios are built on the close prices of the stocks from January 1, 2017, to December 31, 2021. The portfolios are tested on the stock price data from January 1, 2022, to November 30, 2022.
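As an illustration of this step, the following minimal sketch downloads the same date ranges. The use of the yfinance package (rather than the DataReader function named above) and the three auto-sector tickers shown are assumptions made only for the example; NSE-listed symbols on Yahoo Finance carry the ".NS" suffix.

```python
# Minimal data-acquisition sketch (assumes the yfinance package is installed).
import yfinance as yf

# Hypothetical subset of auto-sector tickers, used only for illustration.
tickers = ["M&M.NS", "MARUTI.NS", "TATAMOTORS.NS"]

# Daily price history covering both the training and the test windows.
data = yf.download(tickers, start="2017-01-01", end="2022-11-30")
close = data["Close"]                          # close prices, one column per ticker

train = close.loc["2017-01-01":"2021-12-31"]   # portfolio construction window
test = close.loc["2022-01-01":"2022-11-30"]    # out-of-sample evaluation window
```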

3.3 Hierarchical Risk Parity Portfolio Design

The execution of HRP involves three steps: (a) cluster formation, (b) quasi-diagonalization, and (c) bisecting the clusters recursively. A brief discussion of the steps is as follows.

Cluster Formation: The tree clustering used in the HRP algorithm is an agglomerative clustering algorithm [68]. To design the agglomerative clustering algorithm, a hierarchy class is first created in Python. The hierarchy class contains a dendrogram method that receives the value returned by a method called linkage defined in the same class. The linkage method receives the dataset after pre-processing and transformation and computes the minimum distances between stocks based on their return values. There are several options for computing the distance. However, the ward distance [69] is a good choice since it minimizes the variance in the distance between two clusters in the presence of high volatility in the stock return values. In this work, the ward distance has been used to compute the distance between two clusters. The linkage method performs the clustering and returns a list of the clusters formed. The computation of linkages is followed by the visualization of the clusters through a dendrogram. In the dendrogram, the leaves represent the individual stocks, while the root depicts the cluster containing all the stocks. The distance between each cluster formed is represented along the y-axis; longer arms indicate less correlated clusters and vice versa. The details of the clustering process are described in [3].

Quasi-Diagonalization: In this step, the rows and the columns of the covariance matrix of the return values of the stocks are reorganized in such a way that the largest values lie along the diagonal. Without requiring a change in the basis of the covariance matrix, quasi-diagonalization yields a very important property of the matrix: the assets (i.e., stocks) with similar return values are placed closer to each other, while disparate assets are put at a far distance. The working principle of the algorithm is as follows. Since each row of the linkage matrix merges two branches into one, the clusters (C_{N-1}, 1) and (C_{N-2}, 2) are replaced with their constituents recursively until there are no more clusters to merge. This recursive merging of clusters preserves the original order of the clusters [70]. The output of the algorithm is a sorted list of the original stocks (as they were before the clustering).

Recursive Bisection: The quasi-diagonalization step transforms the covariance matrix into a quasi-diagonal form. It is proven mathematically that the allocation of weights to the assets in inverse ratio to their variances is an optimal allocation for a quasi-diagonal matrix [70]. This allocation may be done in two different ways. In the bottom-up approach, the variance of a contiguous subset of stocks is computed as the variance of an inverse-variance allocation of the composite cluster. In the alternative top-down approach, the allocation among two adjacent subsets of stocks is done in inverse proportion to their aggregated variances. In the current implementation, the top-down approach is followed. A Python function, computeIVP, computes the inverse-variance portfolio based on the computed variances of the two clusters given as its input. The variance of a cluster is computed using another Python function called clusterVar. The output of the clusterVar function is used as the input to another Python function called recBisect, which computes the final weights allocated to the individual stocks based on the recursive bisection algorithm.
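A condensed sketch of these three steps is given below. It assumes a pandas DataFrame of daily returns with one column per stock; the helper names computeIVP, clusterVar, and recBisect follow the text, but their bodies are an illustrative reconstruction of the standard HRP formulation in [3] rather than the author's exact code.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

def correlDist(corr):
    # Distance between stocks derived from their return correlations.
    return np.sqrt(0.5 * (1.0 - corr))

def computeIVP(cov):
    # Inverse-variance weights for the assets whose covariance matrix is `cov`.
    ivp = 1.0 / np.diag(cov)
    return ivp / ivp.sum()

def clusterVar(cov, items):
    # Variance of a cluster under an inverse-variance allocation of its members.
    sub = cov.loc[items, items].values
    w = computeIVP(sub).reshape(-1, 1)
    return float(w.T @ sub @ w)

def recBisect(cov, sorted_stocks):
    # Step (c): top-down recursive bisection of the quasi-diagonalized stock list.
    weights = pd.Series(1.0, index=sorted_stocks)
    clusters = [sorted_stocks]
    while clusters:
        clusters = [c[i:j] for c in clusters
                    for i, j in ((0, len(c) // 2), (len(c) // 2, len(c)))
                    if len(c) > 1]
        for left, right in zip(clusters[::2], clusters[1::2]):
            var_left, var_right = clusterVar(cov, left), clusterVar(cov, right)
            alpha = 1.0 - var_left / (var_left + var_right)
            weights[left] *= alpha          # cluster weights are split in inverse
            weights[right] *= 1.0 - alpha   # proportion to the cluster variances
    return weights

def hrp_weights(returns):
    cov, corr = returns.cov(), returns.corr()
    # Step (a): agglomerative tree clustering with the ward linkage.
    link = sch.linkage(squareform(correlDist(corr).values, checks=False), "ward")
    # Step (b): quasi-diagonalization, i.e., the order of the dendrogram leaves.
    order = [returns.columns[i] for i in sch.leaves_list(link)]
    return recBisect(cov, order)
```

Calling hrp_weights(returns) yields one weight per stock, analogous to the HRP allocations reported in Sect. 4.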


3.4 Portfolio Design Using Reinforcement Learning

Reinforcement learning based on the Q-learning approach involves several technical terms. For the benefit of the readers, a brief explanation of those terms is presented first.

Agent: The agent is the entity that learns and then makes decisions or performs actions.

Actions: The activities that an agent can perform in an environment are called actions. For example, in a portfolio optimization context, the agent performs the assignment of weights to the stocks constituting the portfolio. Three possible actions are buy, sell, and hold.

Environment: The environment refers to the world in which the agent resides and performs its actions. In a portfolio design context, the environment encompasses the historical prices of all stocks and the factors that affect those prices.

State: The state of an RL system refers to its current situation. In a stock trading and portfolio design context, when an agent buys, sells, or holds a stock based on its current price, the state of the agent changes. In a multi-stock portfolio, the correlation matrix of the historical returns of the stocks is a good representation of the state.

Reward: The reward is the immediate return received by the agent from the environment based on the action performed by the agent. In a portfolio management context, the cumulative return or the Sharpe ratio represents the possible rewards of the agent.

Policy: A policy refers to a mapping rule from an instance of a state to an action. This mapping rule can be stochastic or deterministic based on the problem the agent is deployed to solve. In the portfolio optimization case, the policy is a stochastic mapping rule from the state set to the action set. As expressed in (1), a state s_t at time instant t and an action a_t are related via the policy P.

a_t = P(s_t)    (1)

A stochastic policy rule yields a probability distribution over the set of actions based on the current state of the agent. The goal of the RL agent is to learn the optimal policy for a given state so that its reward is maximized.

Discount Factor: The discount factor, denoted as γ, is a real number lying in the interval [0, 1] that adjusts the rewards earned by the agent over time. The discount factor penalizes future rewards, as the rewards that are earned later do not provide any immediate benefit and have higher uncertainty associated with them.

Value Function and Q-Value: The value function computes the expected reward for the agent when it performs an action at a given state. The reward C_t the agent achieves at time t is given by (2), in which γ represents the discount factor.

C_t = R_{t+1} + γ R_{t+2} + ··· = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (2)
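As a quick numerical check of (2), the snippet below evaluates C_t for a short, purely hypothetical reward sequence with γ = 0.9.

```python
# Illustrative evaluation of Eq. (2) with a hypothetical reward sequence.
gamma = 0.9
future_rewards = [1.0, 2.0, 3.0]          # R_{t+1}, R_{t+2}, R_{t+3}
C_t = sum(gamma ** k * r for k, r in enumerate(future_rewards))
print(C_t)                                # 1.0 + 0.9*2.0 + 0.81*3.0 = 5.23
```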

For a given state s at time t, the value function reflects its attractiveness based on its expected future reward. The value function V(s) of a state s at time t is given by (3).

V(s) = E(C_t | S_t = s)    (3)

The Q-value, also known as the action-value function, for a state–action pair (s, a) of an agent is given by (4).

Q(s, a) = E(C_t | S_t = s, A_t = a)    (4)

Hence, the value function is the expected reward of the agent at a given state when it follows a policy rule P. The Q-value, on the other hand, is the expected reward for the agent for a given (state, action) pair when it follows a policy rule P.

Q-Learning: The Q-learning algorithm evaluates the Q-values for all possible actions that are available to the agent at a given state and recommends to the agent the action that has the maximum Q-value. For every possible (state, action) pair, the Q-learning algorithm maintains a running average of the rewards the agent receives on transiting from the state s by performing an action a, plus the expected future discounted rewards. The agent performs the action that yields the maximum expected Q-value for the next state. The relationship between the reward function (R), the cumulative future return function (C_t), and the Q-value is used in forming the Bellman equation in (5), which is used for constructing a Q-table that stores the Q-values for all (state, action) pairs for the agent.

V(s) = E(R_{t+1} + γ V(S_{t+1}) | S_t = s)    (5)

As observed in (5), the Bellman equation presents the value function as an aggregate of two components. The first part, R_{t+1}, is the immediate reward the agent receives, and the second part, γ V(S_{t+1}), represents the discounted reward received from the next state S_{t+1}. The maximum value of the value function, V*(s), at a state s is the reward that the agent seeks to achieve through its actions.

Deep Q-Learning Network: Although Q-learning is an effective way to train an agent in an RL system, handling the Q-table becomes computationally challenging when the state and action spaces are very large. To handle such situations, a deep neural network is trained. The set of parameters of the deep neural network, represented as θ, is used to compute the Q-values based on a function Q(s, a; θ). The deep neural network, known as the deep Q-network (DQN), learns the Q-values through its set of weights θ using the backpropagation algorithm.

The Epsilon Parameter and Exploration–Exploitation Trade-off: At the initial phase of learning, the agent randomly chooses its actions at a given state and gathers experience in the environment. This is called the exploration phase. However, as the agent becomes more experienced, it starts taking actions mostly based on the knowledge it has gathered. This phase is known as the exploitation phase. A parameter, epsilon, controls the trade-off between exploration and exploitation. A low value of epsilon makes the agent less likely to take random actions and more likely to exploit the knowledge it has already gathered.
Usually, a high initial value of epsilon is chosen for the training of the agent. The epsilon value then decreases at a chosen rate so that, as the agent learns from the environment, it adaptively reduces its probability of exploring and increases its likelihood of exploiting its knowledge.

Episodes: The number of times the agent is trained over the entire training data (i.e., the environment) is defined as the number of episodes. A higher number of episodes usually leads to better training of the agent. However, if an agent is trained over too many episodes, it may lead to overfitting.

Batch Size: The batch size, also called a minibatch, refers to the size of the replay buffer or memory used during the training of the agent. In other words, the batch size refers to the maximum number of records from the training dataset that is used for updating the Q-values in a lookup table. The Q-values are updated based on the records in a batch by minimizing a loss function, namely the mean squared error (MSE) between the predicted and the target Q-values.

Now that all the relevant terms in the context of an RL system have been explained, the methodology followed for building the portfolio optimizer is described in detail.

RL Portfolio Optimizer Design: The RL portfolio optimizer consists of several classes and functions, which are discussed in the following. First, an agent class is designed that stores the variables and functions used in the Q-learning process. The agent class consists of the following components: (i) a constructor method, (ii) a model function, (iii) an act function, (iv) a histReplay function, and (v) a weight function. The constructor method includes an init function for instantiating objects from the class, the discount factor for the future returns (i.e., the rewards), and the epsilon parameter for the exploration–exploitation trade-off of Q-learning. The function model maps the states to the actions of the agent. The inputs to the function are the environment states, and its outputs are the values in the cells of the Q-table. The function act determines the best action for the agent for a given state. The function histReplay uses a trained deep neural network based on the historical actions of the agent. The approach is based on storing the history of the state, action, reward, and the transition to the next state that the agent has experienced in the past, and exploiting this information for taking subsequent actions. The history over a minibatch is stored and used as the input, while the Q-values are determined based on the output of a deep neural network minimizing the loss function. The loss function used is the mean squared error (MSE). A greedy update method is used to prevent any possible overfitting that may creep in. The weight function converts the output Q-values to the weights of the portfolio. Some auxiliary functions are also designed for the execution of the agent in the environment. The agent is trained in an environment that consists of the historical prices of the stocks constituting the portfolio. A function named getState is designed to return the state based on the past stock prices, the covariance matrix of the stock returns, and the number of days in the past used for computing the state. The agent is trained over 100 episodes. Training over a larger number of episodes is found to yield worse performance, possibly due to the overfitting of the model. The batch size for training is chosen as 32. The training module involves
iterations over 100 episodes. On completion of each episode, the information on the state, action, reward, and the next state to be used in training is saved. The environment in which the agent is trained consists of the historical daily stock prices (i.e., close values) over three years (i.e., January 1, 2018, to December 31, 2020). A class in Python is written wherein the environment for the stock prices is created. As mentioned earlier, the environment provides two important functions: (i) getState and (ii) getReward. While the getState function returns the state based on the historical prices and the covariance matrix of the stock returns, the getReward function computes and returns the reward in the form of the Sharpe ratio of the portfolio based on the lookback period and the portfolio weights. The training of the agent involves the following four major steps: (i) initializing the agent object, (ii) invoking the stock price environment, (iii) setting the number of episodes to 100, and (iv) setting the batch size to 32. Since stock prices are highly volatile in general, the state window size is chosen as 180 days, and the portfolio rebalancing period is taken as 90 days. Each sector-specific portfolio consists of 10 stocks. However, the NIFTY 50 portfolio has 50 stocks. As depicted in Fig. 1, the training of the agent involves the following steps (a schematic code sketch of this loop is given after the list):

(1) Using the getState function of the environment class, the current state of the agent is determined. The function computes the correlation matrix of the stock returns over the time frame defined by the window size parameter.
(2) Using the act function, the current best action of the agent is determined. The action of the agent refers to computing the weights of the portfolio.

Fig. 1 The training of the reinforcement learning agent using deep Q-learning for optimum portfolio design. The actions performed by the agent are indicated by the blue-colored boxes


(3) Based on the getReward function, the rewards are computed. The rewards here are the returns and the Sharpe ratios.
(4) The next state is identified based on the output of the getState function. The next-state information is used in updating the Q-values.
(5) The historical information on the current state, next state, action, and reward (i.e., the Q-value) is stored in the buffer for further use by the histReplay function.
(6) If the batch size is reached, the agent executes the histReplay function and updates the Q-table with Q-values by minimizing the loss function (i.e., the MSE) computed on the predicted and target Q-values, using steps 8–10 as depicted in Fig. 1. If the batch is not complete, the agent proceeds to the next iteration.

After the execution of the agent code for all the episodes, the cumulative daily return and the weight allocation done by the agent to each constituent stock in the portfolio are produced as the output. Based on the daily returns, the annual return, the annual volatility, and the maximum Sharpe ratio are computed for both the training and the test data.
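The following self-contained sketch shows how the six steps above fit together. The environment and agent below are deliberately simplified stand-ins (synthetic prices, a random placeholder policy, and an empty histReplay), so the sketch only illustrates the control flow of the training loop, not the author's DQN implementation; the 180-day state window, the 90-day rebalancing period, 100 episodes, and the batch size of 32 are taken from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

class StockEnv:
    """Simplified stand-in for the stock-price environment."""
    def __init__(self, prices):
        self.prices = prices
        self.returns = prices.pct_change().dropna()

    def getState(self, t, window):
        # State: correlation matrix of the returns over the trailing window.
        return self.returns.iloc[t - window:t].corr().values

    def getReward(self, weights, t, lookback):
        # Reward: annualized Sharpe ratio of the weighted portfolio.
        port = self.returns.iloc[t:t + lookback] @ weights
        return np.sqrt(252) * port.mean() / (port.std() + 1e-9)

class DummyAgent:
    """Placeholder agent; a real DQN would replace act and histReplay."""
    def act(self, state):
        w = rng.random(state.shape[0])
        return w / w.sum()                 # portfolio weights summing to one

    def histReplay(self, batch):
        pass                               # DQN update (MSE on Q-values) omitted here

# Synthetic close prices for 10 hypothetical stocks over roughly six years of trading days.
prices = pd.DataFrame(100 * np.cumprod(1 + rng.normal(0, 0.01, (1500, 10)), axis=0))
env, agent = StockEnv(prices), DummyAgent()
EPISODES, BATCH, WINDOW, REBALANCE = 100, 32, 180, 90

memory = []                                               # replay buffer
for episode in range(EPISODES):
    for t in range(WINDOW, len(env.returns) - REBALANCE, REBALANCE):
        state = env.getState(t, WINDOW)                   # step (1)
        weights = agent.act(state)                        # step (2)
        reward = env.getReward(weights, t, REBALANCE)     # step (3)
        next_state = env.getState(t + REBALANCE, WINDOW)  # step (4)
        memory.append((state, weights, reward, next_state))  # step (5)
        if len(memory) % BATCH == 0:                      # step (6)
            agent.histReplay(memory[-BATCH:])
```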

3.5 Backtesting the Portfolios on the Training and Test Data

Finally, the performances of the two portfolios (i.e., HRP and RL) are evaluated on the training and the test data. The metrics used for the evaluation are the annual volatility and the Sharpe ratio. Additionally, the cumulative returns are also computed for the portfolios. For each sector, the portfolios that perform better on the training and the test data are identified. From the point of view of the investors, the portfolio that performs better on the test data for a given sector is considered to have exhibited superior performance for that sector.
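A minimal sketch of these backtesting metrics is given below. It assumes a pandas Series of daily portfolio returns, annualization by 252 trading days, and a zero risk-free rate; these conventions are assumptions for illustration rather than the author's exact settings.

```python
import numpy as np
import pandas as pd

def backtest_metrics(daily_returns: pd.Series):
    """Cumulative return curve, annual return, annual volatility, and Sharpe ratio."""
    cumulative = (1 + daily_returns).cumprod() - 1        # cumulative return curve
    annual_return = daily_returns.mean() * 252            # annualized mean return (assumption)
    annual_vol = daily_returns.std() * np.sqrt(252)       # annualized volatility (assumption)
    sharpe = annual_return / annual_vol                   # assumes a zero risk-free rate
    return cumulative, annual_return, annual_vol, sharpe

# Example with synthetic daily returns (illustration only).
daily = pd.Series(np.random.default_rng(1).normal(0.0005, 0.01, 252))
curve, ann_ret, ann_vol, sharpe = backtest_metrics(daily)
print(f"annual return {ann_ret:.2%}, annual volatility {ann_vol:.2%}, Sharpe {sharpe:.2f}")
```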

4 Results

This section presents the performance results of the portfolios. The portfolios are implemented using the Python language and its libraries. The training and testing are carried out on the Google Colab platform [71].

Auto Sector Portfolios: The ten stocks from the auto sector with the maximum free-float market capitalization and their respective contributions to the computation of the auto sector index, according to the report published by the NSE on December 31, 2021, are as follows: Mahindra & Mahindra (M&M): 19.93, Maruti Suzuki (MARUTI): 19.02, Tata Motors (TATAMOTORS): 12.57, Eicher Motors (EICHERMOT): 7.75, Bajaj Auto (BAJAJ-AUTO): 7.67, Hero MotoCorp (HEROMOTOCO): 5.91, Tube Investments of India (TIINDIA): 4.61, TVS Motor Company (TVSMOTOR): 3.89, Bharat Forge (BHARATFORG): 3.51, and Ashok Leyland (ASHOKLEY): 3.41 [4]. The figures mentioned, along with the names of the stocks,
represent the respective weights (in percent) of the stocks used in computing the sectoral index of the auto sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data (January 1, 2017–December 31, 2021) are presented in Table 1. Figure 2 depicts the portfolio weights allocated by the portfolios in the form of pie charts. It is observed that the stock Bajaj Auto received the maximum weights as per both portfolio strategies. Figures 3 and 4 show the cumulative returns of the portfolios over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. The portfolio yielding a higher cumulative return is more profitable for the investors. However, the returns need to be adjusted by their associated risks. Hence, the portfolio risks and the values of their Sharpe ratios are computed so that the performances of the portfolios can be compared based on their respective Sharpe ratios.

Table 1 Weights allocation by RL and HRP portfolios for the auto sector stocks

Stock name | RL portfolio weights | HRP portfolio weights
M&M | 0.0963 | 0.1462
MARUTI | 0.0668 | 0.0848
TATAMOTORS | 0.0041 | 0.0568
EICHERMOT | 0.0954 | 0.0741
BAJAJ-AUTO | 0.3581 | 0.1752
HEROMOTOCO | 0.0963 | 0.1346
TIINDIA | 0.2146 | 0.1147
TVSMOTOR | 0.0427 | 0.1114
BHARATFORG | 0.0192 | 0.0600
ASHOKLEY | 0.0064 | 0.0422

Fig. 2 Weight allocation to the auto sector stocks by the RL and the HRP portfolios


In Table 2, the summary of the performances of the two portfolios of the auto sector is presented for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 2. The RL portfolio has yielded the highest Sharpe ratios for both training and test data for the auto sector.

Fig. 3 Cumulative returns yielded by the auto sector portfolios over the training period (January 1, 2017–December 31, 2021)

Fig. 4 Cumulative returns yielded by the auto sector portfolios over the test period (January 1, 2022–November 30, 2022)


Table 2 Summary of the performances of RL and HRP portfolios of the auto sector on the training and test data

Portfolio | Training: annual return (%) | Training: annual vol | Training: Sharpe ratio | Test: annual return (%) | Test: annual vol | Test: Sharpe ratio
RL | 14.45 | 23.02 | 0.6277 | 32.94 | 21.42 | 1.5379
HRP | 10.50 | 24.40 | 0.4303 | 33.44 | 21.90 | 1.5268

Consumer Durables Sector Portfolios: The ten stocks from the consumer durables sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index, according to the report published by the NSE on December 31, 2021, are as follows: Titan Company (TITAN): 34.17, Havells India (HAVELLS): 13.60, Crompton Greaves Consumer Electricals (CROMPTON): 9.39, Voltas (VOLTAS): 8.22, Dixon Technologies India (DIXON): 6.77, Bata India (BATAINDIA): 4.78, Rajesh Exports (RAJESHEXPO): 4.47, Kajaria Ceramics (KAJARIACER): 4.13, Blue Star (BLUESTARCO): 3.12, and Relaxo Footwears (RELAXO): 2.96 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percent) used in computing the sectoral index of the consumer durables sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 3. The weight allocation done by the portfolios is also presented as pie charts in Fig. 5. It is observed that the stock Rajesh Exports has received the maximum weights from both portfolios. Figures 6 and 7 show the cumulative returns of the portfolios over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. The portfolio yielding a higher cumulative return is more profitable for the investors. However, the returns need to be adjusted by their associated risks.

Table 3 Weights allocation by RL and HRP portfolios for the consumer durables sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
TITAN | 0.0531 | 0.0554
HAVELLS | 0.0735 | 0.0924
VOLTAS | 0.0264 | 0.0887
CROMPTON | 0.1120 | 0.1322
DIXON | 0.0519 | 0.0778
BATAINDIA | 0.0762 | 0.0676
KAJARIACER | 0.0589 | 0.0865
RAJESHEXPO | 0.2731 | 0.1699
RELAXO | 0.1567 | 0.1242
BLUESTARCO | 0.1182 | 0.1054


Fig. 5 Weight allocation to the consumer durable sector stocks by the RL and the HRP portfolios

Hence, the portfolio risks and the values of their Sharpe ratios are computed so that the performances of the portfolios can be compared based on their respective Sharpe ratios. Table 4 presents the summary of the performances of the two portfolios of the consumer durables sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the max Sharpe ratios are tabulated in Table 4. It is observed that while the HRP portfolio has yielded the highest Sharpe ratios for the training data, for the test period, the RL portfolio has produced the highest Sharpe ratio. For portfolios with negative Sharpe ratios, higher negative values imply more negative returns and hence, more losses for the investors.

Fig. 6 Cumulative returns yielded by the consumer durables sector portfolios over the training period (January 1, 2017–December 31, 2021)


Fig. 7 Cumulative returns yielded by the consumer durables sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 4 Summary of the performances of RL and HRP portfolios of the consumer durables sector on the training and test data

Portfolio | Training: annual return (%) | Training: annual vol | Training: Sharpe ratio | Test: annual return (%) | Test: annual vol | Test: Sharpe ratio
RL | 22.62 | 17.21 | 1.3142 | −8.78 | 18.91 | −0.4643
HRP | 25.21 | 17.73 | 1.4217 | −11.32 | 18.48 | −0.6124

Financial Services Sector Portfolios: The ten stocks from the financial services sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index, according to the report published by the NSE on December 31, 2021, are as follows: HDFC Bank (HDFCBANK): 22.12, ICICI Bank (ICICIBANK): 20.75, Housing Development Finance Corporation (HDFC): 15.26, Kotak Mahindra Bank (KOTAKBANK): 8.94, Axis Bank (AXISBANK): 7.44, State Bank of India (SBIN): 7.22, Bajaj Finance (BAJFINANCE): 5.59, Bajaj Finserv (BAJFINSV): 3.08, SBI Life Insurance Company (SBILIFE): 1.81, and HDFC Life Insurance Company (HDFCLIFE): 1.75 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the financial services sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 5. The weight allocation done by the portfolios is also presented as pie charts in Fig. 8. It is observed that the stock HDFC Bank has received the maximum weights from both portfolios.


Table 5 Weights allocation by RL and HRP portfolios for the financial services sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
HDFCBANK | 0.3489 | 0.1812
ICICIBANK | 0.0066 | 0.0533
HDFC | 0.0224 | 0.1443
KOTAKBANK | 0.1275 | 0.1337
AXISBANK | 0.0090 | 0.0459
SBIN | 0.0520 | 0.0912
BAJFINANCE | 0.0036 | 0.0671
BAJFINSV | 0.0112 | 0.0812
SBILIFE | 0.2907 | 0.1087
HDFCLIFE | 0.1280 | 0.0934

Fig. 8 Weight allocation to the financial services sector stocks by the RL and the HRP portfolios

Figures 9 and 10 show the cumulative returns of the portfolios of the financial services sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 6 presents the summary of the performances of the two portfolios of the financial services sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 6. It is observed that the HRP portfolio has yielded the highest Sharpe ratios for both training and test data.

FMCG Sector Portfolios: The ten stocks from the FMCG sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index according to the report published by the NSE on December 31, 2021, are as follows: ITC (ITC): 30.00, Hindustan Unilever (HINDUNILVR): 24.02, Nestle India (NESTLEIND): 7.22, Britannia Industries (BRITANNIA): 6.30, Tata Consumer Products (TATACONSUM): 5.99,


Fig. 9 Cumulative returns yielded by the financial services sector portfolios over the training period (January 1, 2017–December 31, 2021)

Fig. 10 Cumulative returns yielded by the financial services sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 6 Summary of the performances of RL and HRP portfolios of the financial services sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 18.13% | 22.56 | 0.8037 | 9.14% | 20.92 | 0.4368
HRP | 20.89% | 24.66 | 0.8469 | 10.42% | 21.51 | 0.4844


Dabur India (DABUR): 4.21, Godrej Consumer Products (GODREJCP): 4.08, Varun Beverages (VBL): 3.57, United Spirits (MCDOWELL-N): 3.40, and Marico (MARICO): 3.21 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the FMCG sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 7. The weight allocation done by the portfolios is also presented as pie charts in Fig. 11. It is observed that the stock ITC has received the maximum weights from both portfolios. Figures 12 and 13 show the cumulative returns of the portfolios of the FMCG sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 8 presents the summary of the performances of the two portfolios of the FMCG sector for the training and the test periods.

Table 7 Weights allocation by RL and HRP portfolios for the FMCG sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
ITC | 0.1705 | 0.1497
HINDUNILVR | 0.1176 | 0.0851
NESTLEIND | 0.1676 | 0.0806
TATACONSUM | 0.0104 | 0.0612
BRITANNIA | 0.0720 | 0.1144
DABUR | 0.0975 | 0.1364
GODREJCP | 0.0443 | 0.0737
MARICO | 0.1540 | 0.1303
MCDOWELL-N | 0.0309 | 0.0777
VBL | 0.1352 | 0.0909

Fig. 11 Weight allocation to the FMCG sector stocks by the RL and the HRP portfolios


Fig. 12 Cumulative returns yielded by the FMCG sector portfolios over the training period (January 1, 2017–December 31, 2021)

Fig. 13 Cumulative returns yielded by the FMCG sector portfolios over the test period (January 1, 2022–November 30, 2022)

For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 8. It is observed that the RL portfolio has yielded the highest Sharpe ratios for both training and test data.

Healthcare Sector Portfolios: The ten stocks from the healthcare sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index according to the report published by the NSE on December 31, 2021, are as follows: Sun Pharmaceuticals Industries (SUNPHARMA): 23.19, Cipla (CIPLA): 12.65, Dr. Reddy's Laboratories (DRREDDY): 11.19, Apollo Hospitals Enterprises (APOLLOHOSP): 9.91,


Table 8 Summary of the performances of RL and HRP portfolios of the FMCG sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 20.90% | 15.83 | 1.3204 | 26.92% | 16.81 | 1.6015
HRP | 20.96% | 16.27 | 1.2880 | 22.88% | 17.38 | 1.3168

Divi's Laboratories (DIVISLAB): 8.91, Lupin (LUPIN): 3.79, Laurus Labs (LAURUSLABS): 3.36, Torrent Pharmaceuticals (TORNTPHARM): 3.34, Alkem Laboratories (ALKEM): 3.14, and Aurobindo Pharma (AUROPHARMA): 2.70 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the healthcare sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 9. The weight allocation done by the portfolios is also presented as pie charts in Fig. 14. It is observed that the stock Alkem Laboratories has received the maximum weights from both portfolios. Figures 15 and 16 show the cumulative returns of the portfolios of the healthcare sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 10 presents the summary of the performances of the two portfolios of the healthcare sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 10. It is observed that the RL portfolio has yielded the highest Sharpe ratios for both training and test data. It may be noted that for portfolios with negative Sharpe ratios, higher negative values imply more negative returns and, hence, more losses for the investors.

Table 9 Weights allocation by RL and HRP portfolios for the healthcare sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
SUNPHARMA | 0.0366 | 0.0852
CIPLA | 0.1229 | 0.0757
DRREDDY | 0.1120 | 0.1184
APOLLOHOSP | 0.1117 | 0.1087
DIVISLAB | 0.0593 | 0.0829
LAURUSLABS | 0.0667 | 0.0928
LUPIN | 0.0416 | 0.0617
TORNTPHARM | 0.1290 | 0.1090
ALKEM | 0.3143 | 0.2059
AUROPHARMA | 0.0060 | 0.0596


Fig. 14 Weight allocation to the healthcare sector stocks by the RL and the HRP portfolios

Fig. 15 Cumulative returns yielded by the healthcare sector portfolios over the training period (January 1, 2017–December 31, 2021)

Information Technology Sector Portfolios: The ten stocks from the information technology (IT) sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index according to the report published by the NSE on December 31, 2021, are as follows: Infosys (INFY): 27.56, Tata Consultancy Services (TCS): 26.39, HCL Technologies (HCLTECH): 9.67, Tech Mahindra (TECHM): 8.35, Wipro (WIPRO): 8.10, Larsen & Toubro Infotech (LTI): 5.17, Persistent Systems (PERSISTENT): 4.92, MphasiS (MPHASIS): 3.92, Coforge (COFORGE): 3.46, and L&T Technology Services (LTTS): 2.46 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the IT sector. The ticker names of the stocks are mentioned within parentheses in upper case.


Fig. 16 Cumulative returns yielded by the healthcare sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 10 Summary of the performances of RL and HRP portfolios of the healthcare sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 22.08% | 18.01 | 1.2259 | −4.73% | 16.39 | −0.2885
HRP | 21.83% | 18.69 | 1.1681 | −7.20% | 17.12 | −0.4206

The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 11. The weight allocation done by the portfolios is also presented as pie charts in Fig. 17. It is observed that the stock Tata Consultancy Services has received the maximum weights from both portfolios. Figures 18 and 19 show the cumulative returns of the portfolios of the IT sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 12 presents the summary of the performances of the two portfolios of the IT sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 12. It is observed that the HRP portfolio has yielded the highest Sharpe ratio for the training data. However, for the test data, the Sharpe ratio of the RL portfolio is higher than that of the HRP portfolio. It may be noted that for portfolios with negative Sharpe ratios, higher negative values imply more negative returns and, hence, more losses for the investors.

Media Sector Portfolios: The ten stocks from the media sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index according to the report published by the NSE on December 31, 2021, are as follows: Zee Entertainment Enterprises (ZEEL): 31.14,


Table 11 Weights allocation by RL and HRP portfolios for the IT sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
INFY | 0.0972 | 0.0794
TCS | 0.2522 | 0.1487
HCLTECH | 0.0461 | 0.0796
TECHM | 0.0357 | 0.1148
WIPRO | 0.1970 | 0.1486
LTI | 0.0358 | 0.0941
PERSISTENT | 0.1487 | 0.1191
MPHASIS | 0.1102 | 0.1168
COFORGE | 0.0009 | 0.0430
LTTS | 0.0762 | 0.0559

Fig. 17 Weight allocation to the IT sector stocks by the RL and the HRP portfolios

PVR (PVR): 19.02, Sun TV Network (SUNTV): 11.17, Inox Leisure (INOXLEISUR): 8.49, Saregama India (SAREGAMA): 7.13, Dish TV India (DISHTV): 6.54, TV18 Broadcast (TV18BRDCST): 6.18, Nazara Technologies (NAZARA): 4.33, Network18 Media & Investments (NETWORK18): 4.15, and Hathway Cable & Datacom (HATHWAY): 1.84 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the media sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 13. The weight allocation done by the portfolios is also presented as pie charts in Fig. 20. It is observed that the stock Sun TV Network has received the maximum weight in the RL portfolio. The HRP portfolio, however, has allocated the maximum weight to the stock PVR.


Fig. 18 Cumulative returns yielded by the IT sector portfolios over the training period (January 1, 2017–December 31, 2021)

Fig. 19 Cumulative returns yielded by the IT sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 12 Summary of the performances of RL and HRP portfolios of the IT sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 35.95% | 19.60 | 1.8337 | −29.93% | 26.00 | −1.1512
HRP | 38.23% | 20.27 | 1.8863 | −33.50% | 26.81 | −1.2493


Table 13 Weights allocation by RL and HRP portfolios for the media sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
ZEEL | 0.0708 | 0.1051
PVR | 0.1989 | 0.1977
SUNTV | 0.2116 | 0.1734
INOXLEISUR | 0.1632 | 0.1521
SAREGAMA | 0.1401 | 0.1156
TV18BRDCST | 0.0160 | 0.0569
DISHTV | 0.0429 | 0.0752
NETWORK18 | 0.0710 | 0.0555
HATHWAY | 0.0854 | 0.0685

Fig. 20 Weight allocation to the media sector stocks by the RL and the HRP portfolios

Figures 21 and 22 show the cumulative returns of the portfolios of the media sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 14 presents the summary of the performances of the two portfolios of the media sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 14. It is observed that the RL portfolio has yielded the highest Sharpe ratio for the training data. However, for the test data, the highest Sharpe ratio is produced by the HRP portfolio.

Metal Sector Portfolios: The ten stocks from the metal sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index according to the report published by the NSE on December 31, 2021, are as follows: Adani Enterprises (ADANIENT): 24.12, Tata Steel (TATASTEEL): 19.54, JSW Steel (JSWSTEEL): 15.77,


Fig. 21 Cumulative returns yielded by the media sector portfolios over the training period (January 1, 2017–December 31, 2021)

Fig. 22 Cumulative returns yielded by the media sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 14 Summary of the performances of RL and HRP portfolios of the media sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 20.93% | 26.88 | 0.7787 | 10.61% | 24.91 | 0.4257
HRP | 18.48% | 27.17 | 0.6801 | 11.76% | 26.02 | 0.4521


Hindalco Industries (HINDALCO): 14.81, Vedanta (VEDL): 7.64, Jindal Steel & Power (JINDALSTEL): 4.80, APL Apollo Tubes (APLAPOLLO): 3.66, Steel Authority of India (SAIL): 2.76, Hindustan Zinc (HINDZINC): 1.75, and National Aluminum Company (NATIONALUM): 1.56 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the metal sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 15. The weight allocation done by the portfolios is also presented as pie charts in Fig. 23. It is observed that the stock Hindustan Zinc has received the maximum weights from both portfolios. Figures 24 and 25 show the cumulative returns of the portfolios of the metal sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios.

Table 15 Weights allocation by RL and HRP portfolios for the metal sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
ADANIENT | 0.0559 | 0.1026
TATASTEEL | 0.0818 | 0.0624
JSWSTEEL | 0.1285 | 0.1108
HINDALCO | 0.0089 | 0.1034
VEDL | 0.0079 | 0.0763
JINDALSTEL | 0.0008 | 0.0575
APLAPOLLO | 0.2768 | 0.1777
SAIL | 0.0016 | 0.0381
HINDZINC | 0.3802 | 0.1973
NATIONALUM | 0.0575 | 0.0739

Fig. 23 Weight allocation to the metal sector stocks by the RL and the HRP portfolios


Fig. 24 Cumulative returns yielded by the metal sector portfolios over the training period (January 1, 2017–December 31, 2021)

Fig. 25 Cumulative returns yielded by the metal sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 16 presents the summary of the performances of the two portfolios of the metal sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the max Sharpe ratios are tabulated in Table 16. It is observed that the HRP portfolio has yielded the highest Sharpe ratios for both training data and test data. Oil & Gas Sector Portfolios: The ten stocks from the oil & gas sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index according to the report published by the NSE on December 31, 2021, are as follows: Reliance Industries (RELIANCE): 32.46,


Table 16 Summary of the performances of RL and HRP portfolios of the metal sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 32.97% | 26.45 | 1.2464 | 14.79% | 26.98 | 0.5483
HRP | 36.82% | 28.77 | 1.2799 | 17.47% | 28.34 | 0.6165

Adani Total Gas (ATGL): 19.31, Oil & Natural Gas Corporation (ONGC): 10.64, Bharat Petroleum Corporation (BPCL): 7.06, Indian Oil Corporation (IOC): 6.34, GAIL India (GAIL): 5.54, Petronet LNG (PETRONET): 3.46, Indraprastha Gas (IGL): 3.36, Hindustan Petroleum Corporation (HINDPETRO): 3.31, and Gujarat Gas (GUJGASLTD): 1.86 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the oil & gas sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 17. The weight allocation done by the portfolios is also presented as pie charts in Fig. 26. It is observed that the stock Petronet LNG has received the maximum weights from both portfolios. Figures 27 and 28 show the cumulative returns of the portfolios of the oil & gas sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 18 presents the summary of the performances of the two portfolios of the oil & gas sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 18. It is observed that the RL portfolio has yielded the highest Sharpe ratios for both training and test data.

Table 17 Weights allocation by RL and HRP portfolios for the oil & gas sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
RELIANCE | 0.2017 | 0.1573
ONGC | 0.0245 | 0.0564
BPCL | 0.0059 | 0.0648
IOC | 0.1231 | 0.0912
GAIL | 0.0497 | 0.0627
PETRONET | 0.2610 | 0.2008
IGL | 0.1394 | 0.1547
HINDPETRO | 0.0048 | 0.0642
GUJGASLTD | 0.1899 | 0.1479


Fig. 26 Weight allocation to the oil & gas sector stocks by the RL and the HRP portfolios

Fig. 27 Cumulative returns yielded by the oil & gas sector portfolios over the training period (January 1, 2017–December 31, 2021)

Private Banking Sector Portfolios: The ten stocks from the private banking sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index, according to the report published by the NSE on December 31, 2021, are as follows: HDFC Bank (HDFCBANK): 25.83, ICICI Bank (ICICIBANK): 25.37, Axis Bank (AXISBANK): 12.35, Kotak Mahindra Bank (KOTAKBANK): 10.94, IndusInd Bank (INDUSINDBK): 10.13, Federal Bank (FEDERALBNK): 4.75, IDFC First Bank (IDFCFIRSTB): 3.29, Bandhan Bank (BANDHANBNK): 3.25, City Union Bank (CUB): 2.44, and RBL Bank (RBLBANK): 1.64 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the private banking sector. The ticker names of the stocks are mentioned within parentheses in upper case.


Fig. 28 Cumulative returns yielded by the oil & gas sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 18 Summary of the performances of RL and HRP portfolios of the oil & gas sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 20.49% | 20.01 | 1.0236 | 0.95% | 19.71 | 0.0481
HRP | 18.10% | 20.75 | 0.8721 | −0.44% | 19.53 | −0.0226

The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 19. The weight allocation done by the portfolios is also presented as pie charts in Fig. 29. It is observed that the stock HDFC Bank has received the maximum weights from both portfolios.

Table 19 Weights allocation by RL and HRP portfolios for the private banking sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
ICICIBANK | 0.0160 | 0.1113
HDFCBANK | 0.5205 | 0.2386
AXISBANK | 0.0067 | 0.0877
KOTAKBANK | 0.1767 | 0.1723
INDUSINDBK | 0.0002 | 0.0749
FEDERALBNK | 0.0023 | 0.0497
IDFCFIRSTB | 0.0613 | 0.0479
CUB | 0.2150 | 0.1575
RBLBANK | 0.0014 | 0.0601


Fig. 29 Weight allocation to the private banking sector stocks by the RL and the HRP portfolios

Figures 30 and 31 show the cumulative returns of the portfolios of the private banking sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 20 presents the summary of the performances of the two portfolios of the private banking sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the max Sharpe ratios are tabulated in Table 20. It is observed that the RL portfolio has yielded the highest Sharpe ratios for the training data. However, for the test data, the highest Sharpe ratio is produced by the HRP portfolio. PSU Banking Sector Portfolios: The ten stocks from the PSU banking sector with the maximum free-float market capitalization and their respective contributions

Fig. 30 Cumulative returns yielded by the private banking sector portfolios over the training period (January 1, 2017–December 31, 2021)


Fig. 31 Cumulative returns yielded by the private banking sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 20 Summary of the performances of RL and HRP portfolios of the private banking sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 18.42% | 22.50 | 0.8189 | 18.16% | 22.76 | 0.7980
HRP | 16.50% | 24.64 | 0.6697 | 24.87% | 22.98 | 1.0820

to the computation of the sectoral index according to the report published by the NSE on December 31, 2021, are as follows: State Bank of India (SBIN): 23.63, Bank of Baroda (BANKBARODA): 19.78, Canara Bank (CANBK): 13.50, Punjab National Bank (PNB): 12.85, Union Bank of India (UNIONBANK): 9.96, Indian Bank (INDIANB): 6.98, Bank of India (BANKINDIA): 6.74, Indian Overseas Bank (IOB): 1.76, Bank of Maharashtra (MAHABANK): 1.68, and Central Bank of India (CENTRALBK): 1.57 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the PSU banking sector. The ticker names of the stocks are mentioned within parentheses in upper case. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 21. The weight allocation done by the portfolios is also presented as pie charts in Fig. 32. It is observed that the stock State Bank of India has received the maximum weights from both portfolios. Figures 33 and 34 show the cumulative returns of the portfolios of the PSU banking sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 22 presents the summary of the performances of the two portfolios of the PSU banking sector for the training and the test periods. For both training and test


Table 21 Weights allocation by RL and HRP portfolios for the PSU banking sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
SBIN | 0.5668 | 0.1768
BANKBARODA | 0.0002 | 0.1242
CANBK | 0.0001 | 0.1161
PNB | 0.0005 | 0.0542
INDIANB | 0.0095 | 0.0995
UNIONBANK | 0.0005 | 0.0555
BANKINDIA | 0.0001 | 0.0927
IOB | 0.1796 | 0.0848
CENTRALBK | 0.1090 | 0.0692
MAHABANK | 0.1337 | 0.1269

Fig. 32 Weight allocation to the PSU banking sector stocks by the RL and the HRP portfolios

periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 22. It is observed that the RL portfolio has yielded the highest Sharpe ratio for the training data. However, for the test data, the highest Sharpe ratio is produced by the HRP portfolio.

Realty Sector Portfolios: The ten stocks from the realty sector with the maximum free-float market capitalization and their respective contributions to the computation of the sectoral index according to the report published by the NSE on December 31, 2021, are as follows: DLF: 26.40, Godrej Properties: 16.15, Phoenix Mills: 14.63, Oberoi Realty: 11.39, Macrotech Developers: 9.05, Prestige Estates Projects: 7.06, Brigade Enterprises: 6.65, Indiabulls Real Estate: 4.21, Sobha: 2.39, and Sunteck Realty: 2.07 [4]. The figures mentioned, along with the names of the stocks, represent their respective weights (in percentage) used in computing the sectoral index of the realty sector. The ticker names of the stocks are listed in Table 23.


Fig. 33 Cumulative returns yielded by the PSU banking sector portfolios over the training period (January 1, 2017–December 31, 2021)

Fig. 34 Cumulative returns yielded by the PSU banking sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 22 Summary of the performances of RL and HRP portfolios of the PSU banking sector on the training and test data

Portfolio | Training: annual return | Training: annual vol | Training: Sharpe ratio | Test: annual return | Test: annual vol | Test: Sharpe ratio
RL | 11.11% | 32.96 | 0.3370 | 29.21% | 26.00 | 1.1236
HRP | 2.50% | 36.14 | 0.0692 | 50.50% | 29.91 | 1.6884


The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 23. The weight allocation done by the portfolios is also presented as pie charts in Fig. 35. It is observed that the stock Phoenix Mills has received the maximum weights from both portfolios. Figures 36 and 37 show the cumulative returns of the portfolios of the realty sector over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios. Table 24 presents the summary of the performances of the two portfolios of the realty sector for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the maximum Sharpe ratios are tabulated in Table 24. It is observed that the HRP portfolio has yielded the highest Sharpe ratio for the training data. However, for the test data, the highest Sharpe ratio is produced by the RL portfolio.

Table 23 Weights allocation by RL and HRP portfolios for the realty sector stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio
DLF | 0.0190 | 0.0660
GODREJPROP | 0.1050 | 0.1047
PHOENIXLTD | 0.2806 | 0.1731
OBEROIRLTY | 0.1601 | 0.1415
BRIGADE | 0.1634 | 0.1553
PRESTIGE | 0.0566 | 0.0930
IBREALEST | 0.0152 | 0.0349
SOBHA | 0.0742 | 0.1105
SUNTECK | 0.1259 | 0.1211

Fig. 35 Weight allocation to the realty sector stocks by the RL and the HRP portfolios


Fig. 36 Cumulative returns yielded by the realty sector portfolios over the training period (January 1, 2017–December 31, 2021)

Fig. 37 Cumulative returns yielded by the realty sector portfolios over the test period (January 1, 2022–November 30, 2022)

Table 24 Summary of the performances of RL and HRP portfolios of the realty sector on the training and test data

Portfolio | Training: annual return (%) | Training: annual vol | Training: Sharpe ratio | Test: annual return (%) | Test: annual vol | Test: Sharpe ratio
RL | 34.73 | 26.56 | 1.3074 | 8.86 | 25.57 | 0.3467
HRP | 35.56 | 27.13 | 1.3109 | 1.65 | 26.27 | 0.0627


NIFTY 50 Stocks Portfolio: The NIFTY 50 stocks are the market leaders across 13 sectors in the NSE, and these stocks have low-risk quotients. These large-cap stocks largely determine the overall performance of the Indian stock market [72]. The weights assigned by the RL and HRP portfolios based on the training data are presented in Table 25. The weight allocation done by the portfolios is also presented as pie charts in Fig. 38. It is observed that the stock Tata Consultancy Services has received the maximum weight in the RL portfolio. However, the HRP portfolio has allocated the maximum weight to the stock ITC. Figures 39 and 40 show the cumulative returns of the portfolios of the NIFTY 50 stocks over the training and the test periods, respectively. These plots depict the cumulative daily returns of the portfolios.

Table 25 Weights allocation by RL and HRP portfolios for the NIFTY 50 stocks

Stock name | Weights of RL portfolio | Weights of HRP portfolio | Stock name | Weights of RL portfolio | Weights of HRP portfolio
ADANIPORTS | 0.0006 | 0.0146 | IOC | 0.0082 | 0.0135
ASIANPAINT | 0.0469 | 0.0231 | ITC | 0.0722 | 0.0413
AXISBANK | 0.0002 | 0.0126 | JSWSTEEL | 0.0002 | 0.0131
BAJAJ-AUTO | 0.0142 | 0.0319 | KOTAKBANK | 0.0011 | 0.0139
BAJAJFINANCE | 0.0002 | 0.0107 | LT | 0.0013 | 0.0158
BAJAJFINSV | 0.0002 | 0.0085 | M&M | 0.0004 | 0.0190
BHARTIARTL | 0.0117 | 0.0316 | MARUTI | 0.0004 | 0.0139
BPCL | 0.0004 | 0.0091 | NESTLEIND | 0.0897 | 0.0246
BRITANNIA | 0.0368 | 0.0252 | NTPC | 0.0242 | 0.0351
CIPLA | 0.0741 | 0.0274 | ONGC | 0.0003 | 0.0171
COALINDIA | 0.0395 | 0.0201 | POWERGRID | 0.1104 | 0.0291
DIVISLAB | 0.0119 | 0.0234 | RELIANCE | 0.0006 | 0.0304
DRREDDY | 0.0848 | 0.0410 | SBIN | 0.0003 | 0.0090
EICHERMOT | 0.0021 | 0.0122 | SBILIFE | 0.0446 | 0.0234
GRASIM | 0.0002 | 0.0109 | SHREECEM | 0.0014 | 0.2257
HCLTECH | 0.0035 | 0.0307 | SUNPHARMA | 0.0043 | 0.0210
HDFC | 0.0005 | 0.0210 | TATAMOTORS | 0.0002 | 0.0064
HDFCBANK | 0.0680 | 0.0237 | TATASTEEL | 0.0002 | 0.0081
HDFCLIFE | 0.0011 | 0.0201 | TCS | 0.1149 | 0.0253
HEROMOTOCO | 0.0008 | 0.0186 | TATACONSUM | 0.0006 | 0.0163
HINDALCO | 0.0001 | 0.0108 | TECHM | 0.0008 | 0.0177
HINDUNILVR | 0.0742 | 0.0262 | TITAN | 0.0019 | 0.0199
ICICIBANK | 0.0003 | 0.0098 | ULTRACEMCO | 0.0006 | 0.0148
INDUSINDBK | 0.0001 | 0.0069 | UPL | 0.0003 | 0.0208
INFY | 0.0208 | 0.0261 | WIPRO | 0.0277 | 0.0319


Fig. 38 Weight allocation to the NIFTY 50 stocks by the RL and the HRP portfolios

Fig. 39 Cumulative returns yielded by the portfolios of the NIFTY 50 stocks over the training period (January 1, 2017–December 31, 2021)

Table 26 presents the summary of the performances of the two portfolios of the NIFTY 50 stocks for the training and the test periods. For both training and test periods, the annual returns, annual volatilities (i.e., standard deviations), and the max Sharpe ratios are tabulated in Table 26. It is observed that the RL portfolio has yielded the highest Sharpe ratios for both training and test data. Table 27 presents a summary of the results, in which the portfolio yielding the higher Sharpe ratio for a sector is mentioned along with the corresponding Sharpe ratio for the training and test data of stock prices. It is observed that the RL portfolio has produced higher Sharpe ratios for eight out of thirteen sectors on both training and test data, clearly outperforming its HRP counterpart. Since there is still scope for optimizing the DQN model of the RL agent, the performance of the RL portfolio may be improved further. This is a possible future work.


Fig. 40 Cumulative returns yielded by the portfolios of the NIFTY 50 stocks over the test period (January 1, 2022–November 30, 2022)

Table 26 Summary of the performances of RL and HRP portfolios of the NIFTY 50 stocks on the training and test data

Portfolio   Training performance                                    Test performance
            Annual return (%)   Annual vol (%)   Sharpe ratio       Annual return (%)   Annual vol (%)   Sharpe ratio
RL          17.48               15.13            1.1554             10.35               14.12            0.7336
HRP         18.01               17.78            1.0131             9.55                16.10            0.5934
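As a side note on how the figures reported in Tables 22–26 are typically obtained, the following Python sketch computes an annual return, annual volatility, and Sharpe ratio from a series of daily portfolio returns. It is only an illustration: the 252-trading-day annualization factor, the zero risk-free rate, and the synthetic data are assumptions, not details taken from this chapter.

import numpy as np

def annualized_metrics(daily_returns, trading_days=252, risk_free_rate=0.0):
    """Annual return, annual volatility, and Sharpe ratio from daily returns."""
    r = np.asarray(daily_returns, dtype=float)
    annual_return = r.mean() * trading_days
    annual_volatility = r.std() * np.sqrt(trading_days)
    sharpe = (annual_return - risk_free_rate) / annual_volatility
    return annual_return, annual_volatility, sharpe

# Example with synthetic daily returns of an equal-weighted two-stock portfolio.
rng = np.random.default_rng(0)
stock_returns = rng.normal(0.0005, 0.015, size=(250, 2))   # hypothetical data
portfolio_daily = stock_returns @ np.array([0.5, 0.5])
ret, vol, sharpe = annualized_metrics(portfolio_daily)
print(f"Annual return: {ret:.2%}, annual volatility: {vol:.2%}, Sharpe ratio: {sharpe:.4f}")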

Table 27 Summary of the results of the performances of the portfolios

Sector               Training data: portfolio with    Test data: portfolio with
                     higher Sharpe ratio              higher Sharpe ratio
Auto                 RL (0.6277)                      RL (1.5379)
Consumer durables    HRP (1.4217)                     RL (-0.4643)
Financial services   HRP (0.8469)                     HRP (0.4844)
FMCG                 RL (1.3204)                      RL (1.6015)
Healthcare           RL (1.2259)                      RL (-0.2885)
IT                   HRP (1.8863)                     RL (-1.1512)
Media                RL (0.7787)                      HRP (0.4521)
Metal                HRP (1.2799)                     HRP (0.6165)
Oil & Gas            RL (1.0236)                      RL (0.0481)
Private banks        RL (0.8189)                      HRP (1.0820)
PSU banks            RL (0.3370)                      HRP (1.6884)
Realty               HRP (1.3109)                     RL (0.3467)
NIFTY 50             RL (1.1554)                      RL (0.7336)


5 Conclusion

This chapter has presented portfolio design approaches on twelve important sectors and a diversified sector (NIFTY 50) of the Indian stock market using hierarchical risk parity (HRP) and reinforcement learning (RL). The portfolios are designed based on the historical prices of the stocks from the thirteen sectors. The stock price data from January 1, 2017, to December 31, 2021, are used for building the portfolios, and the performances of the portfolios are tested on stock price data from January 1, 2022, to November 30, 2022. The evaluation of the portfolios is done on three metrics: (i) annual return, (ii) annual risk (i.e., annual volatility), and (iii) Sharpe ratio. The portfolio yielding the highest Sharpe ratio on the test data for a given sector is considered the superior one. It is found that the RL portfolio has outperformed the HRP portfolio for most of the sectors on both training and test data. On both training and test data, the RL portfolio has yielded higher Sharpe ratios for eight out of the thirteen sectors analyzed in this work. An analysis of stocks listed on the major global stock exchanges may be an interesting future research work.

References 1. Kizys, R., Doering, J., Juan, A. A., Polat, O., Calvet, L., and Panadero, J.: A Simheuristic Algorithm for the Portfolio Optimization Problem with Random Returns and Noisy Covariances. Computers & Operations Research, Vol 139, Art ID 105631, (2022). https://doi.org/10.1016/ j.cor.2021.105631 2. Markowitz, H.: Portfolio Selection. Journal of Finance 7(1), 77–91 (1952). https://doi.org/10. 2307/2975974 3. De Prado, M.L.: Building Diversified Portfolios that Outperform Out of Sample. J. Portf. Manag. 42(4), 59–69 (2016). https://doi.org/10.3905/jpm.2016.42.4.059 4. NSE Website: https://www1.nseindia.com 5. Rao, A. and Jelvis, T.: Foundations of Reinforcement Learning with Applications. CRC Press, USA, (2022). ISBN: 9781032124124 6. Francois-Lavet, Henderson, P., Islam, R., Bellemare, M. G., Pineau, J.: An Introduction to Deep Reinforcement Learning. Foundations and Trends in Machine Learning, 11(3–4), 219–354 (2018). https://doi.org/10.1561/2200000071 7. Sharpe, W.F.: The Sharpe Ratio. J. Portf. Manag. 21(1), 49–58 (1994). https://doi.org/10.3905/ jpm.1994.409501 8. Sen, J., Datta Chaudhuri, T.: An Alternative Framework for Time Series Decomposition and Forecasting and its Relevance for Portfolio Choice – A Comparative Study of the Indian Consumer Durable and Small Cap Sectors. J. Econ. Libr. 3(2), 303–326 (2016). https://doi. org/10.48550/arXiv.1605.03930 9. Sen, J. and Datta Chaudhuri, T.: An Investigation of the Structural Characteristics of the Indian IT Sector and the Capital Goods Sector: An Application of the R Programming Language in Time Series Decomposition and Forecasting. Journal of Insurance and Financial Management, 1(4), 68–132, (2016). https://doi.org/10.36227/techrxiv.16640227.v1 10. Sen, J. and Datta Chaudhuri, T.: Understanding the Sectors of the Indian Economy for Portfolio Choice. International Journal of Business Forecasting and Marketing Intelligence (IJBFMI), 4(2), 178–222, (2018). https://doi.org/10.1504/IJBFMI.2018.090914


11. Sen, J.: A Forecasting Framework for the Indian Healthcare Sector Index. International Journal of Business Forecasting and Marketing Intelligence (IJBFMI) 7(4), 311–350 (2021). https:// doi.org/10.1504/IJBFMI.2022.10047095 12. Yang, X., Mao, S., Gao, H., Duan, Y., Zou, Q.: Novel Financial Capital Flow Forecast Framework Using Time Series Theory and Deep Learning: A Case Study Analysis of Yu’e Bao Transaction Data. IEEE Access 7, 70662–70672 (2019). https://doi.org/10.1109/ACCESS.2019.291 9189 13. Sen, J.: Stock Composition of Mutual Funds and Fund Style: A Time Series Decomposition Approach towards Testing for Consistency. International Journal of Business Forecasting and Marketing Intelligence 4(3), 235–292 (2018). https://doi.org/10.1504/IJBFMI.2018.092781 14. Bisht, K. and Kumar, A.: A Portfolio Construction Model Based on Sector Analysis Using Dempster-Shafer Evidence Theory and Granger Causal Network: An Application to National Stock Exchange of India. Expert Systems with Applications, Vol 215, (2023). https://doi.org/ 10.1016/j.eswa.2022.119434 15. Mehtab, S., Sen, J., and Dutta, A.: Stock Price Prediction Using Machine Learning and LSTMBased Deep Learning Model. In: Thampi, S. M., Piramuthu, S., Li, K. C., Berretti, S., Wozniak, M., and Singh, D. (eds) Machine Learning and Metaheuristics Algorithms, and Applications. SoMMA 2020. Communications in Computer and Information Science, Vol 1366, pp. 88–106, Springer, Singapore, (2021). https://doi.org/10.1007/978-981-16-0419-5_8 16. Mehtab, S. and Sen, J.: Stock Price Prediction Using Convolutional Neural Networks on a Multivariate Time Series. In: Proceedings of the 2nd National Conference on Machine Learning and Artificial Intelligence (NCMLAI), February 1, 2020, New Delhi, India, (2020). https://doi. org/10.36227/techrxiv.15088734.v1 17. Sen, J.: Stock Price Prediction Using Machine Learning and Deep Learning Frameworks. In: Proceedings of the 6th International Conference on Business Analytics and Intelligence (ICBAI), December 20–22, Bangalore, India. (2018) 18. Mehtab, S. and Sen, J.: Analysis and Forecasting of Financial Time Series Using CNN and LSTM-Based Deep Learning Models. In: Sahoo, J. P., Tripathy, A. K., Mohanty, M., Li, K. C., and Nayak, A. K. (eds) Advances in Distributed Computing and Machine Learning. Lecture Notes in Networks and Systems, Vol 302, pp. 405–423, Springer, Singapore, (2022). https:// doi.org/10.1007/978-981-16-4807-6_39 19. Sen, J., Mondal, S., and Nath, G.: Robust Portfolio Design and Stock Price Prediction Using an Optimized LSTM Model. In: Proceedings of the IEEE 18th India Council International Conference (INDICON), pp. 1–6, December 19–21, Guwahati, India, (2021). https://doi.org/ 10.1109/INDICON52576.2021.9691583 20. Sen, J. and Mehtab, S.: Accurate Stock Price Forecasting Using Robust and Optimized Deep Learning Models. In: Proceedings of the IEEE International Conference on Intelligent Technologies (CONIT), pp. 1–9, June 25–27, Hubballi, India, (2021). https://doi.org/10.1109/CON IT51480.2021.9498565 21. Sen, J., Dutta, A., and Mehtab, S.: Profitability Analysis in Stock Investment Using an LSTMBased Deep Learning Model. In: Proceedings of the IEEE 2nd International Conference for Emerging Technology (INCET), pp. 1–9, May 21–23, Belagavi, India, (2021). https://doi.org/ 10.1109/INCET51464.2021.9456385 22. Mehtab, S. and Sen, J.: Stock Price Prediction Using CNN and LSTM-Based Deep Learning Models. 
In: Proceedings of the IEEE International Conference on Decision Aid Science and Applications (DASA), pp. 447–453, November 8–9, 2020, Sakheer, Bahrain, (2020). https:// doi.org/10.1109/DASA51403.2020.9317207 23. Mehtab, S., Sen, J., and Dasgupta, S.: Robust Analysis of Stock Price Time Series using CNN and LSTM-Based Deep Learning Models. In: Proceedings of the IEEE 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA), pp. 1481– 1486, November 5–7, Coimbatore, India, (2020). https://doi.org/10.1109/ICECA49313.2020. 9297652 24. Mehtab, S., Sen, J.: A time Series Analysis-Based Stock Price Prediction Using Machine Learning and Deep Learning Models. International Journal of Business Forecasting and

Marketing Intelligence (IJBFMI) 6(4), 272–335 (2020). https://doi.org/10.1504/IJBFMI.2020.115691
25. Sen, J. and Mehtab, S.: Long-and-Short-Term Memory (LSTM) Networks – Architectures and Applications in Stock Price Prediction. In: Singh, U., Murugesan, S., and Seth, A. (eds) Emerging Computing Paradigms – Principles, Advances, and Applications, pp. 143–160, Wiley, (2022). https://doi.org/10.1002/9781119813439.ch8
26. Chandola, D., Mehta, A., Singh, S., Tikkiwal, V.A., Agrawal, H.: Forecasting Directional Movement of Stock Prices using Deep Learning. Annals of Data Science (2022). https://doi.org/10.1007/s40745-022-00432-6
27. Qiu, J., Wang, B.: Forecasting Stock Prices with Long-Short Term Memory Neural Network Based on Attention Mechanism. PLoS ONE 15(1), e0227222 (2020). https://doi.org/10.1371/journal.pone.0227222
28. Zhang, Z., Zohren, S., Roberts, S.: Deep Learning for Portfolio Optimization. The Journal of Financial Data Science 2(4), 8–20 (2020). https://doi.org/10.3905/jfds.2020.1.042
29. Moody, J., Saffell, M.: Learning to Trade via Direct Reinforcement. IEEE Trans. Neural Networks 12(4), 875–889 (2001). https://doi.org/10.1109/72.935097
30. Mehtab, S. and Sen, J.: A Robust Predictive Model for Stock Price Prediction Using Deep Learning and Natural Language Processing. In: Proceedings of the 7th International Conference on Business Analytics and Intelligence (BAICONF), December 5–7, 2019, Bangalore, India, (2019). https://doi.org/10.36227/techrxiv.15023361.v1
31. Sharaf, M., Hemdan, E.E.-D., El-Sayed, A., El-Bahnasawy, A.: An Efficient Hybrid Stock Trend Prediction System During COVID-19 Pandemic Based on Stacked-LSTM and News Sentiment Analysis. Multimedia Tools and Applications (2022). https://doi.org/10.1007/s11042-022-14216-w
32. Nousi, C., Tjortjis, C.: A Methodology for Stock Movement Prediction Using Sentiment Analysis on Twitter and Stock Twits Data. In: Proceedings of the 6th South-East Europe Design, Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM), pp. 1–7, September 24–26, Preveza, Greece, (2021). https://doi.org/10.1109/SEEDA-CECNSM53056.2021.9566242
33. Bollen, J., Mao, H., Zeng, X.: Twitter mood predicts the stock market. Journal of Computational Science 2(1), 1–8 (2011). https://doi.org/10.1016/j.jocs.2010.12.007
34. Audrino, F., Sigrist, F., Ballinari, D.: The Impact of Sentiment and Attention Measures on Stock Market Volatility. Int. J. Forecast. 36(2), 334–357 (2020). https://doi.org/10.1016/j.ijforecast.2019.05.010
35. Carta, S.M., Consoli, S., Piras, L., Podda, A.S., Recupero, D.R.: Explainable machine learning exploiting news and domain-specific lexicon for stock market forecasting. IEEE Access 9, 30193–30205 (2021). https://doi.org/10.1109/ACCESS.2021.3059960
36. Corazza, M., Di Tollo, G., Fasano, G., Pesenti, R.: A Novel Hybrid PSO-Based Metaheuristic for Costly Portfolio Selection Problem. Ann. Oper. Res. 304, 109–137 (2021). https://doi.org/10.1007/s10479-021-04075-3
37. Zhao, P., Gao, S., and Yang, N.: Solving Multi-Objective Portfolio Optimization Problem Based on MOEA/D. In: Proceedings of the 12th International Conference on Advanced Computational Intelligence (ICACI), pp. 30–37, August 14–16, 2020, Dali, China, (2020). https://doi.org/10.1109/ICACI49185.2020.9177505
38. Chen, C., Zhou, Y.: Robust Multi-Objective Portfolio with Higher Moments. Expert System with Application 100, 165–181 (2018). https://doi.org/10.1016/j.eswa.2018.02.004
39. Wang, Z., Zhang, X., Zhang, Z., Sheng, D.: Credit Portfolio Optimization: A Multi-Objective Genetic Algorithm Approach 22(1), 69–76 (2022). https://doi.org/10.1016/j.bir.2021.01.004
40. Erwin, K. and Engelbrecht, A.: Improved Set-Based Particle Swarm Optimization for Portfolio Optimization. In: Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1573–1580, Canberra, Australia, (2020). https://doi.org/10.1109/SSCI47803.2020.9308579
41. Garcia, F., Gujjaro, F., Oliver, J.: Index Tracking Optimization with Cardinality Constraint: A Performance Comparison of Genetic Algorithms and Tabu Search Heuristic. Neural Comput. Appl. 30(8), 2625–2641 (2018). https://doi.org/10.1007/s00521-017-2882-2


42. Rivera, G., Florencia, R., Guerrero, M., Porras, R., Sánchez-Solís, J.P.: Online multi-criteria portfolio analysis through compromise programming models built on the underlying principles of fuzzy outranking. Inf. Sci. 580, 734–755 (2021). https://doi.org/10.1016/j.ins.2021.08.087 43. Cruz, L., Fernandez, E., Gomez, C., Rivera, G., & Perez, F. Many-objective portfolio optimization of interdependent projects with ‘a priori’ incorporation of decision-maker preferences. Applied Mathematics & Information Sciences, 8(4), 1517, (2014). https://doi.org/10.12785/ amis/080405 44. Sen, J., Mehtab, S.: A Comparative Study of Optimum Risk Portfolio and Eigen Portfolio on the Indian Stock Market. International Journal of Business Forecasting and Marketing Intelligence 7(2), 143–193 (2021). https://doi.org/10.1504/IJBFMI.2021.10043037 45. Sen, J., Dutta, A., and Mehtab, S.: Stock Portfolio Optimization Using a Deep Learning LSTM Model. In: Proceedings of the IEEE Mysore Sub Section International Conference (MysuruCon), pp. 263–271, October 24–25, Hassan, India, (2021). https://doi.org/10.1109/Mysuru Con52639.2021.9641662 46. Sen, J., Mondal, S., and Mehtab, S.: Portfolio Optimization on NIFTY Thematic Sector Stocks Using an LSTM Model. In: Proceedings of the IEEE International Conference on Data Analytics for Business and Industry (ICDABI), pp. 364–369, October 25–26, Sakheer, Bahrain, (2021). https://doi.org/10.1109/ICDABI53623.2021.9655886 47. Sen, J., Dutta, A.: Design and Analysis of Optimized Portfolios for Selected Sectors of the Indian Stock Market. In: Proceedings of the IEEE International Conference on Decision Aid Sciences and Applications (DASA), March 23–25, 2022, Chiangrai, Thailand, (2022). https:// doi.org/10.1109/DASA54658.2022.9765289 48. Sen, J. and Dutta, A.: A Comparative Study of Hierarchical Risk Parity Portfolio and Eigen Portfolio on the NIFTY 50 Stocks. In: Buyya, R., Hernandez, S. M., Kovvur, R. M. R., and Sarma, T. H. (eds) Computational Intelligence and Data Analytics. Lecture Notes on Data Engineering and Communications Technologies, Vol 142, pp. 443–460, Springer, Singapore, (2022). https://doi.org/10.1007/978-981-19-3391-2_34 49. Sen, J., Mehtab, S., Dutta, A., and Mondal, S.: Precise Stock Price Prediction for Optimized Portfolio Design Using an LSTM Model. In: Proceedings of the IEEE 19th OITS International Conference on Information Technology (OCIT), pp. 210–215, December 16–18, 2021, Bhubaneswar, India, (2021). https://doi.org/10.1109/OCIT53463.2021.00050 50. Sen, J., Mehtab, S., Dutta, A., and Mondal, S.: Hierarchical Risk Parity and Minimum Variance Portfolio Design on NIFTY 50 Stocks. In: Proceedings of the IEEE International Conference on Decision Aid Science and Applications (DASA), pp. 668–675, December 7–8, 2021, Sakheer, Bahrain, (2021). https://doi.org/10.1109/DASA53625.2021.9681925 51. Sen, J.: Designing Efficient Pair-Trading Strategies Using Cointegration for the Indian Stock Market. In: Proceedings of the IEEE 2nd Asian Conference on Innovation in Technology (ASIANCON), pp. 1–9, August 26–28, 2022, Pune, India, (2022). https://doi.org/10.1109/ASI ANCON55314.2022.9909455 52. Sen, J., Mehtab, S., and Dutta, A.: Volatility Modeling of Stocks from Selected Sectors of the Indian Economy using GARCH. In: Proceedings of the IEEE Asian Conference on Innovation in Technology (ASIANCON), August 28–29, pp 1–9, Pune, India, (2021). https://doi.org/10. 1109/ASIANCON51346.2021.9544977 53. 
Chatterjee, A., Bhowmick, H., and Sen, J.: Stock Volatility Prediction Using Time Series and Deep Learning Approach. In: Proceedings of the IEEE 2nd Mysore Sub Section International Conference (MysuruCon), pp. 1–6, October 16–17, 2022, Mysuru, India, (2022). https://doi. org/10.1109/MysuruCon55714.2022.9972559 54. Sinha, M.: Portfolio Optimization Using Reinforcement Learning: A Study of Implementation of Learning to Optimize. In: Choudrie, J., Mahalle, P., Perumal, T., and Joshi, A. (eds) ICT with Intelligent Applications. Smart Innovation, Systems and Technologies, Vol 311, pp. 719–728, Springer, Singapore, (2023). https://doi.org/10.1007/978-981-19-3571-8_65 55. Soleymani, F. and Paquet, E.: Financial Portfolio Optimization with Online Deep Reinforcement Learning and Restricted Stacked Autoencoder – DeepBreath. Expert Systems with

Applications, 156, Paper ID: 113456, October, (2020). https://doi.org/10.1016/j.eswa.2020.113456
56. Lim, Q.Y.E., Cao, Q., Quek, C.: Dynamic Portfolio Rebalancing through Reinforcement Learning. Neural Comput. Appl. 34, 7125–7139 (2022). https://doi.org/10.1007/s00521-02106853-3
57. Hu, Y.-J. and Lin, S.-J.: Deep Reinforcement Learning for Optimizing Finance Portfolio Management. In: Proceedings of the Amity International Conference on Artificial Intelligence (AICAI), pp. 14–20, February 4–6, Dubai, UAE, (2019). https://doi.org/10.1109/AICAI.2019.8701368
58. Zha, L., Dai, L., Xu, T., and Wu, D.: A Hierarchical Reinforcement Learning Framework for Stock Selection and Portfolio. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1–7, July 18–23, Padua, Italy, (2022). https://doi.org/10.1109/IJCNN55064.2022.9892378
59. Wei, L. and Weiwei, Z.: Research on Portfolio Optimization Models Using Deep Deterministic Policy Gradient. In: Proceedings of the International Conference on Robots & Intelligent System (ICRIS), pp. 698–701, November 7–8, Sanya, China, (2020). https://doi.org/10.1109/ICRIS52159.2020.00174
60. Huang, S.-H., Miao, Y.-H., Hsiao, Y.-T.: Novel Deep Reinforcement Algorithm with Adaptive Sampling Strategy for Continuous Portfolio Optimization. IEEE Access 9, 77371–77385 (2021). https://doi.org/10.1109/ACCESS.2021.3082186
61. Wang, H. and Yu, S.: Robo-Advising: Enhancing Investment with Inverse Optimization and Deep Reinforcement Learning. In: Proceedings of the 20th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 365–372, December 13–16, Pasadena, CA, USA, (2021). https://doi.org/10.1109/ICMLA52953.2021.00063
62. Maree, C. and Omlin, C. W.: Balancing Profit, Risk, and Sustainability for Portfolio Management. In: Proceedings of the 2022 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics (CIFEr'22), pp. 1–8, May 4–5, Helsinki, Finland, (2022). https://doi.org/10.1109/CIFEr52523.2022.9776048
63. Gasperov, B., Saric, F., Begusic, S., and Kostanjcar, Z.: Adaptive Rolling Window Selection for Minimum Variance Portfolio Estimation Based on Reinforcement Learning. In: Proceedings of the 43rd International Convention on Information, Communication and Electronic Technology (MIPRO20), pp. 1098–1102, September 28–October 2, Opatija, Croatia, (2020). https://doi.org/10.23919/MIPRO48935.2020.9245435
64. Ha, M. H., Chi, S-G., Lee, S., Cha, Y., Ro, M. B.: Evolutionary Meta Reinforcement Learning for Portfolio Optimization. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'21), pp. 964–972, July 10–14, Lille, France, (2021). https://doi.org/10.1145/3449639.3459386
65. Gu, F., Jiang, Z., Su, J.: Application of Features and Neural Network to Enhance the Performance of Deep Reinforcement Learning in Portfolio Management. In: Proceedings of the 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA), pp. 92–97, March 5–8, 2021, Xiamen, China, (2021). https://doi.org/10.1109/ICBDA51983.2021.9403044
66. Almahdi, S., Yang, S.Y.: A Constrained Portfolio Trading System Using Particle Swarm Algorithm and Recurrent Reinforcement Learning. Expert Syst. Appl. 130, 145–156 (2019). https://doi.org/10.1016/j.eswa.2019.04.013
67. Jiang, Z. and Liang, J.: Cryptocurrency Portfolio Management with Deep Reinforcement Learning. In: Proceedings of the Intelligent Systems Conference (IntelliSys), pp. 905–913, March 7–8, London, UK, (2017). https://doi.org/10.1109/IntelliSys.2017.8324237
68. Zepeda-Mendoza, M. L., Resendis-Antonio, O.: Hierarchical Agglomerative Clustering. In: Dubitzky, W., Wolkenhauer, O., Cho, KH., Yokota, H. (eds) Encyclopedia of Systems Biology, pp. 886–887, Springer, New York, NY, (2013). https://doi.org/10.1007/978-1-4419-9863-7_1371
69. Murtagh, F., Legendre, P.: Ward's Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward's Criterion? J. Classif. 31, 274–295 (2014). https://doi.org/10.1007/s00357-014-9161-z


70. Baily, D, de Prado, M. L.: Balanced Baskets: A New Approach to Trading and Hedging Risks. Journal of Investment Strategies, 1(4), 21–62 (2012). https://doi.org/10.21314/JOIS.2012.010 71. Google Colaboratory: https://colab.research.google.com (accessed Apr. 12, 2023) 72. NIFTY 50 Stock List for 2022: https://tradingfuel.com/nifty-50-stock-list-2022 (accessed Apr. 12, 2023)

Reducing Recursion Costs in Last-Mile Delivery Routes with Failed Deliveries Luis Suárez, Cynthia Porras, Alejandro Rosete, and Humberto Díaz-Pando

Abstract There are many challenges in last-mile logistics arising from the growth of e-commerce, and the optimization of delivery routes is a fundamental step in the sustainability of these services. The main issues that impact the cost of delivery routes are tour length and failed deliveries. VRP models in the literature address the first issue, but few proposals treat first-time failed deliveries. Recent works address this issue by defining time windows and computing availability profiles for each customer using previous records of failed and successful deliveries. Among the caveats of this approach, the main one is data availability. These strategies are also prone to bias due to the representativity of the available data. This work proposes a recursion approach to treat failed deliveries. The main idea is to optimize the distance among the customers of the same route, ensuring a reduction in the cost of revisiting failed deliveries. Two examples of the advantages of the proposed approach are presented in a case study where two instances of the Capacitated Vehicle Routing Problem are solved. As a result, a reduction of recursion costs between 6% and 15% is obtained for a real instance of the problem under different failure rates. Keywords Last-mile logistics · Failed deliveries recursion · Vehicle routing problem · Second visits

L. Suárez (B) · C. Porras · A. Rosete · H. Díaz-Pando Facultad de Ingeniería Informática, Universidad Tecnológica de La Habana José Antonio Echeverría, La Habana, Cuba e-mail: [email protected] A. Rosete e-mail: [email protected] H. Díaz-Pando e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al. (eds.), Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications, Studies in Big Data 132, https://doi.org/10.1007/978-3-031-38325-0_21


1 Introduction

The generalization of e-commerce with home delivery has increased the interest in last-mile logistics, especially in the design of distribution routes. According to [13], last-mile logistics is the last segment of a delivery service, which starts in a depot and ends with the delivery at a location specified by the customer that initiated the order. This type of delivery system is very common in e-commerce, since e-commerce customers prefer that the products are delivered directly to their homes [17].

The Vehicle Routing Problem (VRP) is a general class of optimization problems that models the distribution of goods, services, and personnel. The objective is to design a set of routes associated with a fleet of vehicles, such that the cost to travel these routes is minimum and the demand is satisfied [9]. According to [24], the VRP consists of two decisions: partitioning the set of customers into K groups and ordering the customers inside each group to create routes. The VRP is relevant to most of the logistic processes where the transportation of goods or personnel is required, including e-commerce with delivery service, according to [2, 14]. With the increasing competition among service providers, special attention is given to last-mile logistics, as stated by [13]. In [15], a CVRP variant was proposed to model last-mile deliveries in rural zones of China. The proposal aims to minimize distribution costs and to increase the profits generated in the transportation process. In [23], a generalization of the VRP with Time Windows (VRP-TW) was presented, where the customer can specify delivery options (home delivery or pickup at a parcel locker). The objective is to minimize the distribution cost. The proposal found in [12] applies the VRP-TW in a food delivery context, where customers have expressed delivery time windows. Furthermore, the authors propose a heuristic for the segmentation of customers into delivery regions to increase the service levels and reduce the number of vehicles.

According to [8, 25], the main factors that increase the cost of last-mile deliveries are route length and failed first-time deliveries. In [8], it was stated that failed first-time deliveries occur mainly when a time window for the delivery is not specified, or when a delivery confirmation is required. Although defining time windows for deliveries might seem like a solution to mitigate this issue, it causes an additional cost in terms of route length to deliver the same items. In a study performed by [1], when the size of time windows decreases, the length of the route increases. According to [8], this is called the ping-pong effect. In [7], it was stated that, for small (cheaper) products, providers do not usually give the option to specify a time window, except in the case of a specialized or premium service. Furthermore, the vehicle might have to wait to perform the delivery if it arrives before the specified time, or discard the delivery if it arrives later, according to [26]. According to [2], when deliveries fail, providers choose one of the following alternatives: deliver the product to an alternative receptor nearby, return the product to the depot, or revisit the customer location later in the day.


In [7], a VRP-TW variant with customer availability profiles (CAPs) was proposed to mitigate the impact of failed first-time deliveries. With these availability profiles, the model seeks to maximize the estimated success rate of deliveries in each route. Furthermore, the authors consider the possibility of second visits to maximize the delivery success rate. The authors state that this approach can increase the success rate by 10%. Another alternative is the proposal of [21], which aims to minimize the transportation and penalty costs expected because of failed delivery attempts. According to the authors, their proposal increases the success rate of first-time delivery attempts by 40%. Both approaches require the analysis and availability of historical records of successful deliveries to determine the probability that a customer is available at a given time for the delivery to be performed. This requirement makes the application of these approaches impossible in businesses with only partial automation of the distribution route planning process, or that simply do not possess this information. Furthermore, even if the data is available, it was stated in [21] that there is still the issue of the representativity of the data, which causes bias when estimating the success probability of the deliveries at the planned moment.

The proposal of [6] explores the perspective of asking the customer to specify several alternative delivery locations, each one associated with a time window. This approach allows the client to specify, for example, a location for the time window when they are at work and another for when they are at home.

In the case of the Cuban experience in last-mile logistics (dramatically increased in the context caused by COVID-19), failures of delivery attempts are frequent, even though there is no available data related to the success rate of deliveries. This paper presents a Capacitated Vehicle Routing Problem (CVRP) that adds as an objective the minimization of the distance among the customers that belong to the same route. The hypothesis is that optimizing this objective will reduce the cost of revisiting failed deliveries. To test this, we present a case study where two instances (one from the literature and another from a real scenario) are solved using the proposed model and the classic CVRP. In the case study, a simulation of failed deliveries is performed over the solutions of the two models, evaluating the cost of the recursion routes and finding that the proposed model allows a reduction of recursion costs between 6% and 8% for a practical instance of the problem.

This paper is structured as follows: Sect. 2 describes the proposed CVRP model. Section 3 describes a heuristic approach to solve the CVRP. Section 4 presents a case study where the proposed model is solved for two instances of the CVRP to evaluate the impact of the optimization of the clustering distance on the recursion cost. Finally, Sect. 5 summarizes the findings of this work.

2 Formal Description of the Proposed Model

The model considers a last-mile delivery service that arranges several routes from a depot to the final destination indicated by each customer. Each customer is informed when they are next in line to be visited. In some cases, this causes the customer to be absent at the moment of the delivery, mainly because the customer is unavailable at the planned moment. It is considered that a customer might be available later in the day; therefore, it might be feasible to apply a recursion approach to satisfy failed first-time deliveries at the end of the day.


The model works under the hypothesis that, by reducing the distance among the customers of a distribution route, the length of the tour that contains only the failed deliveries will also be minimized, as opposed to the approach of only minimizing the initial tour cost. To achieve this purpose, the model combines the Capacitated P-Median (CPMP) (see [3]) and the Traveling Salesman (TSP) (see [11]) problems. The CPMP aims to minimize the cost of allocating customers to their corresponding vehicles. The TSP seeks to reduce the cost of visiting customers on each route.

The model is defined over a directed graph G = (V, A), where V is the set of vertices associated with the delivery locations, and A is the set of arcs that define the cost between each pair of vertices $(V_i, V_j)$, considering $V_0$ as the depot. A homogeneous fleet of K vehicles with a capacity C is also considered. Each customer has a demand that determines how much of the vehicle capacity it occupies. The proposed model can be formulated as a Mixed Integer Linear Program (MILP) as follows:

Parameters:
• $n$: number of customers.
• $a_i$: demand of the customer $i$.
• $d_{ij}$: cost to travel from the location of the customer $i$ to the location of the customer $j$.
• $K$: number of vehicles in the fleet.
• $C$: capacity of the vehicles in the fleet.

Decision variables:
• $y_{ik}$: binary variable that takes a value of 1 if the customer $i$ is selected as a medoid for the group $k$, and 0 otherwise.
• $w_{ik}$: binary variable that takes a value of 1 if the customer $i$ is allocated to the group $k$, and 0 otherwise.
• $x_{ijk}$: binary variable that takes a value of 1 if the path between customers $i$ and $j$ is selected in the route associated with the group $k$, and 0 otherwise.
• $t_i \in \mathbb{R}$: continuous variable that indicates the size of the demand accumulated at the moment of visiting the delivery location of customer $i$.

Objectives:

$$\text{Minimize} \quad Z_1 = \sum_{i=0}^{n} \sum_{j=0}^{n} \sum_{k=0}^{K} w_{ik}\, y_{jk}\, d_{ij} \qquad (1)$$

$$\text{Minimize} \quad Z_2 = \sum_{i=0}^{n} \sum_{j=0}^{n} \sum_{k=0}^{K} x_{ijk}\, d_{ij} \qquad (2)$$

Subject to:

$$\sum_{i=1}^{n} y_{ik} = 1 \qquad \forall k \in K \qquad (3)$$

$$\sum_{k=0}^{K} y_{ik} \le 1 \qquad \forall i \in n,\ i \ne 0 \qquad (4)$$

$$\sum_{k=0}^{K} w_{ik} = 1 \qquad \forall i \in n,\ i \ne 0 \qquad (5)$$

$$w_{0k} = 1 \qquad \forall k \in K \qquad (6)$$

$$w_{ik} \ge y_{ik} \qquad \forall k \in K,\ \forall i \in n,\ i \ne 0 \qquad (7)$$

$$\sum_{i=1}^{n} w_{ik}\, a_i \le C \qquad \forall k \in K \qquad (8)$$

$$\sum_{j=0}^{n} x_{ijk} = w_{ik} \qquad \forall i \in n,\ \forall k \in K \qquad (9)$$

$$\sum_{i=0}^{n} x_{ijk} = w_{jk} \qquad \forall j \in n,\ \forall k \in K \qquad (10)$$

$$\sum_{k=0}^{K} x_{iik} = 0 \qquad \forall i \in n \qquad (11)$$

$$t_i - t_j + C\, x_{ijk} \le C - a_j \qquad \forall k \in K,\ \forall i, j \in n,\ i, j \ne 0,\ i \ne j \qquad (12)$$

$$a_i \le t_i \le C \qquad \forall i \in n,\ i \ne 0 \qquad (13)$$

In the above model, Eq. 1 aims to minimize the distance among the customer selected as medoid of a group k and the customers allocated to it. Eq. 2 seeks to minimize the length of the tour within each group k. Eqs. 3, 4, 5, 6, 7 and 8 define the constraints associated with the CPMP, while Eqs. 9, 10, 11 and 12 define the constraints of the TSP associated with each group. The constraints represented by Eqs. 3 and 4 ensure that a unique medoid is selected for each group k. The constraints of Eq. 5 ensure that a customer is allocated to a single group k, while the constraints in Eq. 6 ensure that the depot is included in every group k, and the restrictions in Eq. 7 include the medoid in the corresponding group k. The capacity limitations of each vehicle k are enforced by the restrictions in Eq. 8. The constraints defined by Eqs. 9 and 10 link the selected arcs in each route k to the customers allocated to the corresponding group. The restrictions in Eqs. 11, 12 and 13 ensure that no sub-tours are generated in each route k. Eqs. 12 and 13 correspond to the sub-tour elimination constraints proposed by [18] for the TSP and extended to the VRP by [4].

Objective (1) makes the model non-linear since it multiplies two decision variables. This objective function can be linearized by introducing a binary decision variable $z_{ijk}$ to associate the allocated customers to each group k and the medoid of the group. If $z_{ijk} = 1$, then the customer i is allocated to group k with medoid j. To allow this, three additional constraints must be added:

$$z_{ijk} \le y_{jk} \qquad \forall i, j \in n,\ \forall k \in K \qquad (14)$$

$$z_{ijk} \le w_{ik} \qquad \forall i, j \in n,\ \forall k \in K \qquad (15)$$

$$z_{ijk} \ge w_{ik} - (1 - y_{jk}) \qquad \forall i, j \in n,\ \forall k \in K \qquad (16)$$

The constraints in Eqs. 14 and 15 ensure that $z_{ijk}$ takes the value 1 only when j is selected as medoid of the group k or when i is allocated to the group k, respectively. The constraint in Eq. 16 ensures that $z_{ijk}$ takes the value 1 only when the customer i is allocated to the group k and the selected medoid is j. With these new constraints, Eq. 1 can be redefined as follows:

$$\text{Minimize} \quad Z_1 = \sum_{i=0}^{n} \sum_{j=0}^{n} \sum_{k=0}^{K} z_{ijk}\, d_{ij} \qquad (17)$$
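To make the formulation more concrete, the sketch below shows how the clustering part of the model (the medoid and allocation constraints (3)–(8) together with the linearization (14)–(17)) could be written with the PuLP modeler mentioned in Sect. 4.1. It is only an illustrative outline under simplifying assumptions — a small random instance, unit demands, and no routing variables $x_{ijk}$ or sub-tour constraints — and it is not the authors' implementation.

import random
import pulp

# Hypothetical small instance: node 0 is the depot, nodes 1..n are customers.
random.seed(1)
n, K, C = 8, 2, 5                                   # customers, vehicles, capacity (assumed)
nodes = list(range(n + 1))
a = {i: (0 if i == 0 else 1) for i in nodes}        # unit demands (assumption)
pts = {i: (random.random(), random.random()) for i in nodes}
d = {(i, j): ((pts[i][0] - pts[j][0]) ** 2 + (pts[i][1] - pts[j][1]) ** 2) ** 0.5
     for i in nodes for j in nodes}

prob = pulp.LpProblem("clustering_phase", pulp.LpMinimize)
y = pulp.LpVariable.dicts("y", (nodes, range(K)), cat="Binary")      # medoid selection
w = pulp.LpVariable.dicts("w", (nodes, range(K)), cat="Binary")      # allocation
z = pulp.LpVariable.dicts("z", (nodes, nodes, range(K)), cat="Binary")

# Linearized clustering objective (17).
prob += pulp.lpSum(d[i, j] * z[i][j][k] for i in nodes for j in nodes for k in range(K))

for k in range(K):
    prob += pulp.lpSum(y[i][k] for i in nodes if i != 0) == 1         # (3) one medoid per group
    prob += y[0][k] == 0                                              # depot is never a medoid (assumption)
    prob += w[0][k] == 1                                              # (6) depot belongs to every group
    prob += pulp.lpSum(a[i] * w[i][k] for i in nodes if i != 0) <= C  # (8) capacity
for i in nodes:
    if i == 0:
        continue
    prob += pulp.lpSum(y[i][k] for k in range(K)) <= 1                # (4)
    prob += pulp.lpSum(w[i][k] for k in range(K)) == 1                # (5)
    for k in range(K):
        prob += w[i][k] >= y[i][k]                                    # (7)
for i in nodes:
    for j in nodes:
        for k in range(K):
            prob += z[i][j][k] <= y[j][k]                             # (14)
            prob += z[i][j][k] <= w[i][k]                             # (15)
            prob += z[i][j][k] >= w[i][k] - (1 - y[j][k])             # (16)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
groups = {k: [i for i in nodes if pulp.value(w[i][k]) > 0.5] for k in range(K)}
print(pulp.LpStatus[prob.status], groups)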

3 Solution Method

Both the CPMP and the TSP are NP-hard problems, and so is the proposed combined model. This complexity makes it impossible to guarantee finding an exact solution in a reasonable amount of time. The model is solved using a lexicographic ordering approach, solving the CPMP and then the TSP. To achieve this with practical instances of the problem, a Cluster First, Route Second (CFRS) approach, based on the strategy presented by [22], is used. The proposed method begins with a decomposition of the original instance into smaller sub-instances. Then, for each sub-instance, a solution for the CPMP is obtained to determine the allocation of the customers to their corresponding vehicles. Finally, the TSP corresponding to each vehicle is solved. A greedy algorithm is used to generate initial solutions for the CPMP. In each iteration, the procedure includes in the solution the medoid that adds the least cost to the solution, until K medoids are added. The neighborhood structure comprises three improvement heuristics. The first heuristic exchanges a selected medoid for another customer within the same group.


Fig. 1 Example of the cyclic transfer heuristic

The second heuristic exchanges two customers between two random groups. Finally, the third heuristic consists of a cyclic transfer of the clients among the groups. Figure 1 presents an example of this cyclic transfer heuristic. This procedure begins with the selection of a random group. Then, it transfers the farthest customer (node 2 in Fig. 1) from the current group (group 1 in Fig. 1) to the next, until the last group is reached. In the final solution, all groups keep the same number of clients they were assigned in the initial solution.
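A possible reading of this cyclic transfer move is sketched below in Python. The group representation, the use of the distance to an approximate medoid to pick the "farthest" customer, and the closing of the cycle (the last group hands its customer back to the starting one, which keeps all group sizes unchanged) are assumptions made for illustration; the chapter does not give the authors' exact data structures.

import random

def cyclic_transfer(groups, dist, depot=0, rng=random):
    """One cyclic-transfer move over a list of groups (lists of customer ids).
    Each group passes the customer farthest from an approximate medoid to the
    next group in the cycle, so every group keeps its original size."""
    def medoid(group):
        members = [c for c in group if c != depot]
        return min(members, key=lambda m: sum(dist[m][c] for c in members))

    m = len(groups)
    start = rng.randrange(m)
    order = [(start + s) % m for s in range(m)]
    moved = []
    for g in order:
        candidates = [c for c in groups[g] if c != depot and c not in moved]
        med = medoid(groups[g])
        moved.append(max(candidates, key=lambda c: dist[med][c]))
    for s, g in enumerate(order):                      # apply the shifts along the cycle
        nxt = order[(s + 1) % m]
        groups[g].remove(moved[s])
        groups[nxt].append(moved[s])
    return groups

# Tiny usage example with a toy distance matrix (hypothetical data, depot = 0).
random.seed(3)
dist = [[abs(i - j) for j in range(7)] for i in range(7)]
print(cyclic_transfer([[0, 1, 2, 3], [0, 4, 5, 6]], dist))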

4 Case Study

This section presents a case study to evaluate the impact of optimizing the distance among customers of the same route on the cost of revisiting failed first-time deliveries. To this purpose, two solutions are computed. In the first solution, both objectives of the proposed model (the distance among customers of the same route and the tour length) are optimized. In the second solution, only the total tour length of the routes is optimized (classic VRP). Since we are unable to determine the real success probability of the deliveries, and following the approach in [21], in both test cases first-time failed deliveries are simulated considering 5 scenarios with fail rates of 10%, 20%, 30%, 40% and 50%, respectively. In each simulation, a percentage of customers is randomly selected as failed deliveries. The cost of revisiting these customers is determined by the length of the tour that comprises the last customer in the original route, the failed customers (in the same visit order as in the original route) and the depot. Figure 2 shows an example of a recursion route, given a subset of customers with failed deliveries. The recursion routes are computed this way to give more time to the recipients of the last failed deliveries in the route to become available. If, for example, the last customer of the route depicted in Fig. 2 (customer 8) causes a failed delivery, it would make no sense to revisit that customer immediately after the delivery failed. All solutions are computed using an Intel Core i5 6200U personal laptop, with 8 GB of DDR3 RAM and the Ubuntu 22.04 operating system.
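The failed-delivery simulation described above can be reproduced along the following lines. The code is an illustrative sketch with hypothetical data structures (a route given as an ordered list of customer ids and a distance matrix); it is not the code used to produce the results reported below.

import random

def recursion_route_cost(route, failed, dist, depot=0):
    """Cost of the second-visit tour: last customer of the original route ->
    failed customers in their original visiting order -> depot."""
    revisit = [c for c in route if c in failed and c != route[-1]]
    tour = [route[-1]] + revisit + [depot]
    return sum(dist[tour[i]][tour[i + 1]] for i in range(len(tour) - 1))

def average_recursion_cost(route, fail_rate, dist, runs=20, depot=0, seed=0):
    """Average recursion cost over random draws of failed customers."""
    rng = random.Random(seed)
    n_fail = max(1, round(fail_rate * len(route)))
    costs = [recursion_route_cost(route, set(rng.sample(route, n_fail)), dist, depot)
             for _ in range(runs)]
    return sum(costs) / len(costs)

# Hypothetical example: one 10-customer route with depot 0 and toy distances.
dist = [[abs(i - j) for j in range(11)] for i in range(11)]
route = list(range(1, 11))
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"fail rate {rate:.0%}: avg recursion cost {average_recursion_cost(route, rate, dist):.1f}")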


Fig. 2 Example of a simulated recursion route

Fig. 3 Customers distribution of instance A

To illustrate the advantages of the proposed model when performing second visits, two instances (A and B)1 are solved. The following sections describe the instances and present the solutions obtained for each one.

4.1 Solution of the Instance A

The instance A was generated by a random sampling of 30 customers from a 50-customer instance elaborated by [5]. Additionally, 3 vehicles with enough capacity to satisfy the demand of all customers are considered. The cost to travel between each pair of customers corresponds to the Euclidean distance. Figure 3 shows the distribution of the customers' locations around the depot. The instance A is solved using the academic version of the solver CPLEX, version 12.8.0 (see [20]), through the linear programming modeler PuLP developed by [19]. A maximum execution time of 24 hours is defined to obtain both solutions. With the solution of this instance, we aim to provide a detailed example of the impact of the optimization of the clustering distance on the overall recursion cost.

1 https://github.com/lsuarezgo96/recursion_routes_instances.


Fig. 4 Exact solutions obtained for the instance A: (a) optimal solution for the classic VRP; (b) optimal solution for the proposed model

Figure 4 shows the optimal solutions computed. The solution depicted in Fig. 4a is obtained by only optimizing the tour length (classic VRP), while the solution represented by Fig. 4b is computed by optimizing both objectives of the model presented in this work. In both solutions, the black points indicate a customer location. The colored arrows connecting pairs of points indicate the selection of a path between two customers to be included in the route. The arrows of the same color belong to the same route. As can be appreciated in the solutions depicted, the overlap among tours is eliminated when optimizing the cluster distance. Table 1 shows that, when the tour length objective is optimized, the solution obtained achieves the smallest value of this metric. However, in the solution of the proposed model, the final tour length of the solutions is increased only by 2.8%, while the cluster distance is reduced by 22.0%. This makes the increase in the tour length insignificant with respect to the decrease in the clustering distance when this metric is optimized.


Table 1 Results obtained with the optimization of the objectives defined

Metric                Classic VRP   Proposed model   Reduction (%) of the proposed model
Clustering distance   502.05        392.37           21.84
Tour length           361.37        371.61           +2.8
Time required         4.21          1.2              72%

Fig. 5 Impact of cluster optimization in recursion cost for the solutions of instance A

This behavior makes it possible to optimize the clustering distance to reduce recursion costs without significantly sacrificing the length of the original tour. Additionally, to obtain the solution of the proposed model the solver required 71.42% of the time required to solve the classical VRP. This is due to the fact that, once the allocation of the customers to the corresponding route is fixed, the TSP associated with each vehicle is easier to solve for this amount of customers. Figure 5 shows the results obtained after 20 (random) failed delivery simulations for each fail rate defined. As can be noticed, optimizing the clustering distance reduces the average recursion cost in all simulations, obtaining the maximum reduction when the failure rate is 20%. When fail rates are higher than 30%, the recursion costs increase because, with higher failure rates, the recursion route will contain the majority of the customers of the original one. Figure 6 shows an example of recursion routes from the simulations for each failure rate. Figure 6a represents the best recursion tours derived from the solution of the classic VRP, while Fig. 6b shows the recursion routes obtained from the solution of the proposed model. The most representative characteristic of the recursion routes represented in Fig. 6a is the degree of overlapping among the tours, contrary to the ones depicted in Fig. 6b. This characteristic of the solution where only the tour length is optimized could be causing the increase in the recursion cost. Even though the solutions represented in Fig. 6b are not the best found for the proposed model, only for a 20% failure rate is the recursion cost increased with respect to the solution obtained for the classic VRP. This increase is due to the inclusion of customer (i = 11) in route 3 (green) from the solution of the proposed model.


Fig. 6 Recursion solutions obtained for the instance A for each failure rate: (a) best recursion solutions obtained for the classical VRP solution; (b) best recursion solutions obtained for the solution of the proposed model

This customer is included in the route because it is the last customer and therefore the starting point of the recursion route. A characteristic of every optimal solution of a symmetric TSP is that the arcs corresponding to the paths between two clients do not cross each other. This rule seems to be violated in every route depicted in Fig. 6a. The contrary seems to happen in the recursion routes from Fig. 6b, which describe a circular tour in most cases.


Table 2 Average cost of all recursion routes derived from the solutions of the instance A

Failure rate   Avg. cost classic VRP   Avg. cost proposed model   Reduction (%) of the proposed model   Total solutions classic VRP   Total solutions proposed model
10%            367.39                  302.87                     17.56                                 999                           989
20%            764.51                  518.27                     32.21                                 91,125                        22,274

To rule out the effect of the randomness of the selection of the customers with failed deliveries, the cost of all possible recursion routes that can be obtained considering failure rates of 10% and 20% is computed. The costs of all the possible recursion routes for the remaining failure rates were not computed due to the large number of alternatives that need to be processed (for a 30% failure rate, the amount of possible failure combinations ascends to 1,728,000). Table 2 shows the average cost of all the recursion routes for the solutions of the instance A. For the 10% failure rate, 999 recursion solutions were computed for the solution of the classic VRP, while 989 recursion solutions were computed for the solution of the proposed model. For this failure rate, a 17.56% reduction in the recursion cost is obtained when the clustering distance is optimized. For the 20% failure rate, 91,125 recursion solutions are computed for the solution of the classic VRP, while 22,274 recursion solutions are computed for the solution of the proposed model. In this case, a 32.21% average reduction in the recursion cost is obtained when the clustering distance is optimized. Note that the amount of recursion routes for the solution of the proposed model is smaller than the amount for the solution of the classic VRP. This is due to the fact that in the solution of the classic VRP all routes contain 10 customers, while in the solution of the proposed model the routes contain 9, 10 and 11 customers, respectively.
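For completeness, the exhaustive evaluation of every possible failure scenario of a route can be carried out by enumerating all subsets of failed customers of the required size, as in the sketch below. The figures it prints refer to a toy 10-customer route (C(10, 2) = 45 scenarios at a 20% failure rate) and are purely illustrative; they are not meant to reproduce the counts reported in Table 2.

from itertools import combinations

def all_recursion_costs(route, n_failed, dist, depot=0):
    """Recursion cost of every possible set of n_failed failed customers."""
    last = route[-1]
    costs = []
    for failed in combinations(route, n_failed):
        failed = set(failed)
        revisit = [c for c in route if c in failed and c != last]
        tour = [last] + revisit + [depot]
        costs.append(sum(dist[tour[i]][tour[i + 1]] for i in range(len(tour) - 1)))
    return costs

# Toy example: one 10-customer route at a 20% failure rate (2 failed customers).
dist = [[abs(i - j) for j in range(11)] for i in range(11)]
costs = all_recursion_costs(list(range(1, 11)), 2, dist)
print(len(costs), sum(costs) / len(costs))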

4.2 Solution of the Instance B

The instance B corresponds to data from the Cuban e-commerce platform TuEnvío, containing 2650 customer locations 2 corresponding to one day of sales and presented in [22]. The Open Source Routing Machine, developed by [16], is used to calculate the travel costs between each pair of locations. Figure 7 depicts the distribution of the customers. Among the operational characteristics of the delivery service provided by this platform, the following were considered:

2 To protect the anonymity of the customers, the locations were shifted by a randomly generated value.


Fig. 7 Customers distribution of instance B

• Each customer can only request a single order in one day and the items purchased are small enough to dismiss each particular volume; therefore, the demand of each customer is considered to be unitary: $a_i = 1, \forall i \in n$.
• A fleet of vehicles with a capacity of 40 units and enough vehicles to satisfy the demand $\sum_{i=0}^{n} a_i$ is available; therefore, the amount of vehicles in the fleet is $K = \sum_{i=0}^{n} a_i / C$ (a rough numerical check follows below).
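As a rough numerical check of the second point, and under the assumption that the ratio is rounded up to a whole number of vehicles, the 2650 unit demands of this instance with C = 40 give 2650/40 = 66.25, i.e., a fleet of about 67 vehicles; the exact figure depends on the rounding convention, which is an assumption here rather than a detail stated in the text.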

(a) Solution obtained for the classic VRP.

Fig. 8 Solutions obtained for the instance B

(b) Solution obtained for the proposed model.

568

L. Suárez et al.

The polygons were computed by applying the concave-hull algorithm, provided by the software QGIS [10], to the location of the customers allocated to each route. As can be noticed, the routes in Fig. 8a seem to be more overlapped than the ones in the Fig. 8b. In the solution represented in the Fig. 8a there is a total of 151 intersections among the areas that the routes cover, while in the solution represented in the Fig. 8b there is 115 intersections (23.85% less). The area covered by the intersections among the routes in the solution of the Fig. 8a is of 60.11 km2 , while in the solution of the Fig. 8b the area is of 82.10 km2 (26.78% more). This increase in the overlap area among the routes might be caused by insufficient optimization time, or the incapacity of the designed heuristics to find a better allocation of the customers, resulting in outlier allocations. In the solution for the classic VRP, the total tour length is 2242.52 km, yielding an average tour length of 33.47 km and an average clustering distance of 141.26 km. In the solution for the proposed model, the resulting tour length is 2213.78 km, yielding an average tour length of 33.04 km and an average clustering distance of 50.69 km. It should be noticed that with the optimization of the clustering distance the tour length is expected to increase, but the obtained solutions are not optimal because of the approximate nature of metaheuristics. Both solutions achieve similar tour lengths with only a 2% decrease when the clustering distance is optimized. On the other hand, the clustering distance gets a significant reduction when optimizing (also) this metric, resulting in a 64.12% decrease. These results correspond to the characteristic observed in the exact solutions obtained for the proposed model, making the heuristic solution proposal a suitable candidate to solve practical instances of the problem in a reasonable time. Figure 9 shows the average cost reduction of the second visit routes among the 20 simulations performed by each failure rate applied on both solutions of Fig. 8. The results for this instance show that, by optimizing the clustering distance and then the tour length, the overall recursion cost is reduced. The cost of the recursion routes for this instance shows a similar behavior of those obtained for the optimal solutions of the instance A, achieving the maximum average reduction with a 20% failure rate.

Fig. 9 Impact of cluster optimization in recursion cost for the solutions of the instance B

Reducing Recursion Costs in Last-Mile Delivery Routes with Failed Deliveries

569

The results obtained for instance B with the proposed heuristic method also support the hypothesis that reducing the distance among the customers in the same route, the cost of the recursion routes to revisit failed deliveries is reduced in average. Furthermore, this can be done without incurring in a significant increase of the cost to visit the planned customers in the original route. This behavior makes possible to apply the proposed solution method in case failed deliveries arise and it is required to perform a second visit to a subset of the customers.

5 Conclusions Among the challenges in last-mile logistics, failed deliveries play an expensive role. The proposals found in the literature attempt to increase the success rate of deliveries by estimating the success probability of planned deliveries in a time window. Besides the obvious issue of data availability, another shortcoming of these proposals is the bias caused by insufficient data or its non-representativeness. The model presented in this work reduces the clustering distance of the customers aiming to reduce the cost of revisiting first-time failed deliveries. The optimal solution obtained by optimizing the clustering distance reduces recursion costs between 7% and 15% (on average) compared to the optimal solution of the classic CVRP. Furthermore, the solutions obtained using the proposed heuristic approach achieve an average reduction between 6% and 8% on recursion costs for a practical instance of the problem. The reduction of the recursion cost for the considered instances is obtained without a significant detriment of the tour length (less than 3%). This makes possible the application of this strategy as a preventive measure when is not possible to estimate the success probability of a delivery. In spite of the advantages of the presented model, only two instances of the problem were tested. Therefore, a further analysis is due to determine if the behavior of the model is similar for other instances of the problem, or the existence of particular characteristics of the instances that affect the reduction of recursion cost when the distance among the customers is optimized. So far, an association between the clustering distance and recursion costs has been found, future works should expand this research to prove that in fact the reduction of the clustering distance causes a reduction of recursion costs. The work presented in this paper focuses on the context of last-mile delivery routes for e-commerce, but it could be extended to other areas like customer service, food delivery and other delivery applications that require the customer presence. Additionally, in recent years electric vehicles such as E-Bikes have been introduced in city logistics. Given the short travel range that these vehicles can achieve, it could be beneficial to explore the benefits of reducing the distance among the customers that this type of vehicles have to visit.

570

L. Suárez et al.

Acknowledgements We would like to thank the Datacimex division from CIMEX Corporation, for providing the data used to evaluate our model.


Intelligent Decision-Making Dashboard for CNC Milling Machines in Industrial Equipment: A Comparative Analysis of MOORA and TOPSIS Methods

Javier Andres Esquivias Varela, Humberto García Castellanos, and Carlos Alberto Ochoa Ortiz

Abstract The efficient and precise handling of metals is a current requirement in many industries. To address this need, the CNC milling machine has proven to be an essential tool. This machine uses a specific coordinate system to work on heavier and harder metals, thereby allowing the production of objects of various sizes with high precision. To identify the best machine for a particular task, this study develops an intelligent dashboard that determines the ideal option based on the importance of different machinery attributes. To accomplish this, the study uses the TOPSIS and MOORA methods to evaluate and compare the critical variables when purchasing CNC equipment. The analysis of the attributes is used to determine the best option in terms of equipment cost and efficiency for a given task. The final use of the equipment and the type of materials to be worked are also taken into consideration, to avoid buying machinery that exceeds production requirements and to ensure that the ideal equipment is acquired for the job.

Keywords Industry · Machining · Innovation · CNC milling machine

J. A. Esquivias Varela (B) · H. García Castellanos Tecnológico Nacional de México IT Ciudad Juárez, Juárez City, Chih, Mexico e-mail: [email protected] H. García Castellanos e-mail: [email protected] C. A. Ochoa Ortiz Research at Technology Ph.D., Autonomous University of Ciudad Juárez, Juárez City, Chih, Mexico e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al. (eds.), Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications, Studies in Big Data 132, https://doi.org/10.1007/978-3-031-38325-0_22


1 Introduction

The origin of milling machines can be traced back to 1818, with the invention of the first manual milling machine [1]. Despite its progression over time, the manual milling process still presents dangers to both the worker and the workpiece being processed. This has led to the pursuit of more efficient and accurate manufacturing approaches, resulting in the development of various methods, including the implementation of computer numerical control (CNC) to improve accuracy [2]. A study was conducted to improve operating procedures and equipment selection by identifying errors in CNC selection [3]. This requires conducting different machining tests and control measurements to establish a model that describes machine tool errors [4]. A recent study developed a model that accurately identifies the necessary variables at the time of CNC machine selection [5]. In this model, factors such as price, voltage, and device size are considered inputs. Experiments with CNC machine tools have significantly reduced the volumetric error through error compensation [5].

The production of parts using CNC machining equipment consumes a substantial amount of electricity, contributing significantly to the carbon footprint [6]. Hence, it is imperative to improve energy efficiency by reducing the energy consumption of machine tools, including CNC machines [7]. CNC milling machines, equipped with computerized numerical control and a coordinate-system reference, are widely used in industry, particularly in the metallurgical and automotive sectors, to make precise cuts or controlled wear on the workpiece. Despite this advanced technology, there is still room for improvement, such as the ability to work in unison with other machinery without an operator, or the possibility of modifying the workpiece's finish during operation. CNC technology has also facilitated the production of parts from light materials that would break with conventional tools, as well as from the hardest materials that are challenging to work with. The best CNC milling machine will be determined by evaluating its most important attributes.

Another problem that needs to be addressed is the production of low-cost prostheses that can be tailored to each individual's measurements. To achieve this, a CNC milling machine with the specific necessary characteristics will be sought for the production of the parts that serve as support for the prostheses, as well as the joints between the prostheses and the individual. The selection of the milling machine will be aided by an intelligent dashboard that incorporates databases loaded with information on the milling machines, with the option to add alternatives or modify existing ones. The TOPSIS and MOORA calculations will also be needed to adjust the importance of each attribute's values [8]. The selection criteria depend on the intended use of the milling machine; attributes such as work-table size are given more weight for the production of larger parts. For the purposes of the intelligent dashboard, seven different models are used as a basis, with sixteen characteristics for the first tests (which can be modified).

Spindle acceleration, the process by which the spindle moves from a stationary state or a lower speed to a higher speed without removing material, requires significantly


more energy than the steady state, due to the high torque required to accelerate the spindle system. Modeling spindle energy consumption is challenging because of the short duration of the transient [9]. The use of computerized equipment also offers several advantages, such as the ability to connect to a wireless network. This makes it possible to connect equipment such as an office computer to a milling machine in the production area, providing employees with added safety. It can also be used for personal projects, facilitating communication with experts in the field for support through a network connection (Fig. 1).

Empirical models of the energy consumed by spindle rotation, spindle acceleration, feed, and material removal, together with evaluation criteria for the classification of CNC models, were proposed in [10], where empirical models were analyzed to calculate the energy consumption during spindle start-up. Three piecewise functions (a linear function, Hermite interpolation, and a cubic function) were assumed to fit the power curve for power-consumption prediction. According to the power-consumption characteristics [11, 12], the critical state transitions of the turning process are identified and classified. A power-consumption model of the CNC machine tool was established, and an empirical model based on regression analysis calculates the power consumption during the spindle acceleration process [13]. Implementing an intelligent dashboard for decision-making about CNC machining equipment is a viable way to address the problem, since what

Fig. 1 Conceptual diagram of the elaboration of a smart board


must be done is to capture the variables to be analyzed and compare the results obtained for equipment with different specifications.

2 Development

Developing the intelligent dashboard for decision-making requires learning the economic models, because the calculation would be complicated if the method were manual. Another reason is to integrate knowledge from other issues related to the economy, such as job rotation and carbon footprint, among other things. The first step is to define the characteristics that will be analyzed for each alternative, such as price, power, speed, electrical requirement, electrical consumption, work area, movement along the different axes, the type of cone used for the drills, the dimensions of the machine (Fig. 2), and the distances between the spindle and the work area. Each of the essential characteristics to be evaluated for each chosen machine is described below:

Price. The monetary value of the machine, expressed in United States dollars. This is the most important factor, because if a machine exceeds the available budget it is unlikely to be acquired.

Power. Another essential factor: the amount of force the machine can exert on the drill bit or cutting element to cut the workpiece. Too little power cannot move the drill enough to cut hardened metals.

Velocity. Of the same level of importance as power; an adequate speed provides the revolutions per minute at which the spindle rotates, making the wear of the part uniform in the desired area.

Electrical requirement. The type of electrical installation in the workplace, consisting of the number of phases and the currents generated, which divide the voltage received from the source.

Work area. The amount of space in which the spindle can move over the part; work is only possible within this area, which is limited by the movement of the axes.

Movement in the axes (X, Y, Z). The maximum distance the spindle can move along each axis, horizontally and vertically. In addition to modifying the height, the axes define a cube-shaped working space.

Cone type. A critical factor, being the size of the spindle taper. Of the different standards that exist, the most common is the ISO taper, which has several types of cones, the most common being 7/24 in.

Electrical consumption. The voltage supported by the electrical installation of the work area; in Mexico, the standard voltage is 220–210 V in industry and 110 V in houses.

Weight. A factor of lower importance, relevant when the machine needs to be moved. It is simply the weight of the entire machinery.


Fig. 2 CNC milling machine KENTA XK9036TM

Width, Length, Height. Not an essential factor; it is only considered when looking for a place to put the machine. These are the overall dimensions of the machinery, used to assign it a location in the workplace.

Distance from spindle to table (DHM in the tables). The distance from the table to the spindle tip. The space taken by the drill or cutting element must be subtracted from it.

Distance from spindle to column (DHC in the tables). The distance between the column and the vertical space between the spindle nose and the table.

A small illustrative data record for one of these machines is sketched below.
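As a rough illustration only (not code from the dashboard itself), one machine and its criteria can be held in a plain Python record; the values below are those listed for the KENTA XK9036TM in Tables 1, 2 and 12, and the field names are ours:

# Hypothetical record for one alternative; field names are illustrative.
xk9036tm = {
    "name": "XK9036TM",
    "price_usd": 23500,                       # Price
    "power_kw": 24,                           # Power
    "speed_rpm": 7000,                        # Velocity (spindle speed)
    "electrical_requirement": "monophase",    # Electrical requirement
    "electrical_consumption_v": 220,          # Electrical consumption
    "work_area": 470400,                      # Work area
    "mov_x_mm": 900, "mov_y_mm": 360, "mov_z_mm": 150,   # Movement in the axes
    "cone_type": "BT 40",                     # Cone type
    "weight_kg": 1800,                        # Weight
    "width_mm": 1800, "length_mm": 1600, "height_mm": 2480,
    "dhm_mm": 400,                            # Distance from spindle to table
    "dhc_mm": 360,                            # Distance from spindle to column
}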


3 Obtaining the Data

Now that the characteristics to look for in each machine have been defined, the next step is to search for them on reliable pages. The following methodology is used because the manufacturers' pages do not give the price without a previous quote, which is complicated and time-consuming: the used machines are searched for on second-hand pages to obtain an estimate of the cost of the machinery, and then the same milling machine is located on the company's sales page to obtain the catalog of technical specifications needed to fill out Table 1. As seen in Fig. 4, there are more characteristics for which data could be obtained for a better comparison, but there are too many, and some technical sheets do not specify them, so the designers decided to reduce the set of characteristics to the most critical ones that appear in the data sheets with greater frequency, thus facilitating both the collection of the data and the calculation of the different economic models.

Knowing what must be done, the information about each CNC milling machine, and the economic models to be used for the decision, the next step is the program. The first thing the intelligent dashboard needs is to visualize the data, in case the user wants to incorporate more alternatives or modify one of the CNC machines already in the selected data. Therefore, the option to change the alternatives and share them in the intelligent dashboard must be added. The first step is designing an interface that shows the information in a table and provides buttons that make it easy for the user to add, remove or modify the pre-selected alternatives (a minimal sketch of such a structure is given after Table 1). The next step is calculating the best CNC milling machine (Fig. 3) using the economic models: TOPSIS is the first method to calculate, and after that another economic model is used to build the decision table (Table 2).

Table 1 Comparison table of the characteristics

Z-Move   Cone type   Weight-kg   Width-mm   Length-mm   Height-mm   DHM-mm   DHC-mm
150 mm   bt 40       1800        1800       1600        2480        400      360
50 mm    bt 40       2100        2250       2000        2400        680      460
550 mm   bt 40       5000        2615       2050        2850        690      550
710 mm   bt 40       6200        3540       2710        2750        810      800
570 mm   bt 30       2500        2200       2180        2400        670      470
413 mm   bt 30       726         1753       1435        2337        419      279
440 mm   Iso 40      2300        1820       1860        2340        610      410
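A minimal sketch of how such a table of alternatives could be stored and edited behind the dashboard, assuming pandas is available (the function and column names below are ours and only illustrative, not those of the actual implementation):

import pandas as pd

# One row per CNC milling machine; the columns follow the criteria of Table 12.
COLUMNS = ["Name", "Price", "Power", "Speed", "ElecCon", "WorkSpa",
           "MovX", "MovY", "MovZ", "Weight", "Length", "Width", "Height", "DHM", "DHC"]
alternatives = pd.DataFrame(columns=COLUMNS)

def add_alternative(df, record):
    # Append a new machine; 'record' is a dict keyed by the column names.
    return pd.concat([df, pd.DataFrame([record])], ignore_index=True)

def update_alternative(df, name, **changes):
    # Modify one or more attributes of an existing machine, selected by its name.
    df.loc[df["Name"] == name, list(changes)] = list(changes.values())
    return df

alternatives = add_alternative(alternatives, {"Name": "XK9036TM", "Price": 23500, "Power": 24})
alternatives = update_alternative(alternatives, "XK9036TM", Price=23000)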


Fig. 3 CNC milling machine parts

Fig. 4 Virtual processes on Unity


Table 2 Graphic interface of the intelligent dashboard, listing the seven pre-loaded machines with their name, price, power, speed (RPM), electrical requirement, electrical consumption, work space, movements in X, Y and Z, cone type, weight, width, length, height, DHM and DHC

4 Analysis of Data Capture

In this case, the first step is to assign a weight to each criterion, a number from 1 to 10 (or from 1 to 100). Another essential point is to decide the direction of the scale for each criterion: criteria marked in red in the dashboard run from 1 to 10, while criteria marked in green run from 10 to 1 (for example, for price the lowest value is best, so a value close to one is better, whereas for power a high value is needed, so a value close to ten is better); the weights used are shown in Table 3. The first step of TOPSIS is to normalize the decision matrix: for each column of the matrix, the root of the sum of squares of that criterion is obtained, usually called the "normal" (Table 4). The decision matrix is then standardized by dividing every number of each column by the normal obtained before; for example, all the prices are squared and added, the root of the result is taken, and every price is divided by it (Table 5). Next, a weighted standardized matrix is built by multiplying each rating by the weight of its attribute; for example, the standardized "price" of the first alternative is multiplied by the weight of the price criterion (Table 6). The next step is to generate the ideal solution, determined by the maximum value of each criterion in the weighted standardized matrix; for example, for "price" the ideal solution corresponds to the alternative whose original price is 35,000, but the value itself is taken from the weighted standardized matrix. A small sketch of the normalization and weighting steps follows.
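A minimal NumPy sketch of these first steps (normalization and weighting), using the weights of Table 3 and, for brevity, only the first three criteria and three machines of Table 12; the variable names are ours and the snippet is illustrative, not the dashboard code:

import numpy as np

# Decision matrix: one row per machine, columns = Price, Power, Speed (cf. Table 12)
X = np.array([
    [23500, 24, 7000],    # XK9036TM
    [22000,  7, 5000],    # FCM-800NC
    [35000, 15, 10000],   # VMC-1050I
], dtype=float)
weights = np.array([1, 9, 8])            # criterion weights taken from Table 3

norms = np.sqrt((X ** 2).sum(axis=0))    # the "normal" of each column (cf. Table 4)
R = X / norms                            # standardized decision matrix (cf. Table 5)
V = R * weights                          # weighted standardized matrix (cf. Table 6)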

Table 3 Weight for each criterion

Name     Price   Power   Speed   Elec. Con   Work Spa   Mov X   Mov Y   Mov Z   Weight   Length   Width   Height   DHM   DHC
Weight   1       9       8       1           7          3       3       3       9        2        2       2        6     6


Table 4 Standardized criteria

Name     Price       Power   Speed       Elec Con   Work Spa       Mov X     Mov Y     Mov Z     Weight    Length    Width     Height    DHM       DHC
Normal   71,864.58   36.78   20,928.54   539.09     2,064,049.20   2687.40   1224.92   1152.18   8824.36   5965.49   4992.00   6230.51   1542.61   1258.11


Table 5 Standardized alternatives (the value of each criterion, Price through DHC, divided by its normal, for the machines XK9036TM, FCM-800NC, VMC-1050I, M8, V740M400, 1100MX and XKW7130A)


Table 6 Weighted standardized decision matrix (each standardized value of Table 5 multiplied by the corresponding criterion weight of Table 3, for the seven machines)


The negative ideal solution is obtained analogously from the minimum values of each column. Once the ideal and the negative ideal solutions have been determined, it is necessary to obtain the separation from the ideal solution (S') and the separation from the negative ideal solution (S''). The separation from the ideal solution is calculated by subtracting the ideal solution from the weighted normalized matrix and squaring the result; the sum of these values for each alternative is its separation S' (Table 8). In this step, an inconsistency appears for the electrical-consumption criterion, since all alternatives have the same value. The separation from the negative ideal is calculated with the same procedure: the negative ideal is subtracted from every number of the weighted standardized matrix, the result is squared, and the resulting values are summed for each alternative, giving S'' (Table 9). The last step of TOPSIS is to build a comparison table (Table 10) with the separations from the ideal and the negative ideal solutions, their sum, and the closeness to the ideal solution, computed with Eq. (1) [14]:

$$C_i = \frac{S''_i}{S'_i + S''_i} \qquad (1)$$

Equation (1) gives the relative closeness of each alternative to the ideal solution in the Technique for Order Preference by Similarity to the Ideal Solution (TOPSIS). In the formula, S'_i and S''_i are the separations of alternative i from the positive ideal and from the negative ideal solution, respectively, both calculated from the weighted normalized decision matrix. The result is a relative measure of the closeness of each alternative to the ideal solution, with a higher value indicating closer proximity to the ideal; these values can then be used to rank the alternatives and determine which one is preferred under the criteria used in the analysis.

In the closeness factor, an error appears for the criterion corresponding to the power consumption: because all alternatives have the same value for that criterion, the ideal and the negative ideal coincide with the values of the normalized matrix, so both separations are zero and the closeness becomes zero divided by zero. Sorting the alternatives from the best to the worst closeness gives the order shown in Table 11.

The next step of the comparison is the calculation of the MOORA method, introduced by Brauers in 2004. It is a technique that can be successfully applied to solve diverse types of complex decision-making problems; the MOORA method is the optimization process of two or more conflicting objectives subject to certain restrictions (Ochoa A.).
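Continuing the small NumPy sketch from the beginning of this section (still an illustration under the same assumptions, not the dashboard code), the ideal solutions, the separations and the closeness of Eq. (1) could be computed as follows; here the square root of the summed squared differences is taken, as in the classical formulation of TOPSIS:

# V is the weighted standardized matrix from the previous sketch
ideal_pos = V.max(axis=0)        # ideal solution, maximum per column (cf. Table 7)
ideal_neg = V.min(axis=0)        # negative ideal solution, minimum per column

s_pos = np.sqrt(((V - ideal_pos) ** 2).sum(axis=1))   # separation S' (cf. Table 8)
s_neg = np.sqrt(((V - ideal_neg) ** 2).sum(axis=1))   # separation S'' (cf. Table 9)

closeness = s_neg / (s_pos + s_neg)   # Eq. (1); undefined when S' + S'' is zero
ranking = np.argsort(-closeness)      # indices of the machines, best first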

Table 7 Ideal solution for the weighted standardized matrix

Name             Price    Power    Speed    Elec Con   Work Spa   Mov X    Mov Y    Mov Z    Weight   Length   Width    Height   DHM      DHC
Ideal solution   0.4870   5.9036   3.8225   0.4081     0.6108     2.2326   2.0818   1.8487   6.3234   1.1868   1.0857   0.9149   3.1505   3.8152


Table 8 Separation from the ideal solution (the squared difference between each entry of the weighted standardized matrix and the ideal solution, per criterion, together with the resulting separation S' of each machine)


Table 9 Separation from the negative ideal solution (the squared difference between each entry of the weighted standardized matrix and the negative ideal solution, per criterion, together with the resulting separation S'' of each machine)


Table 10 Comparison of the separations from the ideal (S') and negative ideal (S'') solutions, their sum, and the closeness of each alternative to the ideal solution


Table 11 Results of TOPSIS

Options    Order
0.335393   3
0.321812   5
0.381957   2
1.370719   1
0.269481   4
0.134738   6
0.276806   7

The MOORA method starts from an arrangement with the different alternatives and criteria already defined: the alternatives are placed along one axis and the criteria along the other, as in Eq. (2):

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} \\ x_{21} & x_{22} & \cdots & x_{2j} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} \end{pmatrix} \qquad (2)$$

Table 12 is analyzed in detail in Fig. 4, which represents the variability of each process to be performed for the selected maneuver in the Industry 4.0 workspace. This environment was built in Unity, a program that helps to virtualize the process; in this way, the interaction with the user can be modeled. The next step is to obtain the "normal" of each criterion (Table 13) and then divide each value by its normal (Table 14). This normalization follows Eq. (3), the same column-wise normalization used in the TOPSIS method, and it is important because it allows criteria with different units or scales to be compared. In Eq. (3), Nx_ij is the normalized value of the element x_ij of the decision matrix for alternative i and criterion j, and the denominator is the square root of the sum of the squares of the elements of the j-th column. By normalizing the decision matrix in this way, each criterion is put on an equal footing in the analysis, and the resulting normalized matrix can be used to calculate the positive ideal solution (S') and the negative ideal solution (S'') [14]:

$$Nx_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{m} x_{ij}^{2}}} \qquad (3)$$

It is necessary to declare a weighting for each criterion (Table 15), according to its importance; for example, price is the most important criterion, and therefore its weighting is 0.14. The sum of the weightings is equal to one:

0.14 + 0.12 + 0.1 + 0.09 + 0.06 + 0.05 + 0.06 + 0.06 + 0.04 + 0.04 + 0.04 + 0.04 + 0.08 + 0.08 = 1

Table 12 Array with the alternatives and criteria

Name        Price    Power   Speed    Elec. Con   Work Spa    Mov X   Mov Y   Mov Z   Weight   Length   Width   Height   DHM   DHC
XK9036TM    23,500   24      7000     220         470,400     900     360     150     1800     1800     1600    2480     400   360
FCM-800NC   22,000   7       5000     220         452,100     800     400     50      2100     2250     2000    2400     680   460
VMC-1050I   35,000   15      10,000   220         552,000     1000    500     550     5000     2615     2050    2850     690   550
M8          33,200   20      8000     220         1,822,500   2000    850     710     6200     3540     2710    2750     810   800
V740M400    35,000   10      10,000   220         406,400     750     400     570     2500     2200     2180    2400     670   470
1100MX      34,000   2       10,000   220         207,983     457     279     413     726      1753     1435    2337     419   279
XKW7130A    28,400   5       4000     220         411,000     770     490     440     2300     1820     1860    2340     610   410


Table 13 Normal for MOORA

Name     Price       Power   Speed       Elec. Con   Work Spa       Mov-X     Mov-Y     Mov-Z     Weight    Length    Width     Height    DHM       DHC
Normal   77,272.50   37.06   21,307.28   582.07      2,104,571.15   2795.40   1319.11   1233.15   9119.05   6236.80   5327.08   6655.27   1658.66   1323.08


Table 14 Normalized array (each value of Table 12 divided by the normal of its criterion in Table 13, for the seven machines and the fourteen criteria)


Table 15 The weighting of the criteria

Criterion   Price   Power   Speed   Elec. Con   Work Spa   Mov X   Mov Y   Mov Z   Weight   Length   Width   Height   DHM    DHC
Weight      0.14    0.12    0.1     0.09        0.06       0.05    0.06    0.06    0.04     0.04     0.04    0.04     0.08   0.08


Continuing with the calculation of the MOORA economic model [16], the next step is to multiply each value of each criterion by its weighting, which gives the weighted matrix shown in Table 16. With this weighted matrix, it is necessary to decide which criteria are to be maximized and which are to be minimized; depending on this, each criterion is added to or subtracted from the solution (Table 17). Based on this sum or subtraction, the times associated with the human factor can be determined by running a computer simulation in a Unity environment (Fig. 5); in this way, the search space of the solutions associated with the industrial process to be performed can be explored.

5 Results

All the previous steps are driven by Eq. (4), the aggregation formula of the MOORA method. In the formula, Ny_j is the assessment value of alternative j, Nx_ij is the normalized value of criterion i for alternative j, S_t is the weight assigned to the criterion, and g is the number of criteria to be maximized (benefit criteria). The first sum is the weighted sum over the criteria to be maximized, and the second sum runs over the criteria to be minimized (cost criteria); the difference between the two determines the relative standing of each alternative, which can then be used to rank the alternatives and make a decision [14, 15]:

$$Ny_j = \sum_{i=1}^{g} S_t \, Nx_{ij} - \sum_{i=g+1}^{n} Nx_{ij} \qquad (4)$$
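A compact NumPy sketch of this aggregation (an illustration under our own variable names, assuming a normalized matrix as in Table 14, the weights of Table 15 and the maximize/minimize split of Table 17):

import numpy as np

def moora_scores(nX, w, benefit):
    # nX: normalized decision matrix, one row per machine, one column per criterion
    # w: criteria weights (Table 15); benefit: True where the criterion is maximized (Table 17)
    weighted = nX * w                        # weighted normalized matrix (cf. Table 16)
    return weighted[:, benefit].sum(axis=1) - weighted[:, ~benefit].sum(axis=1)   # Eq. (4)

# Illustrative call with three criteria only: Price (min), Power (max), Speed (max);
# the normalized values below are rounded examples, not copied from Table 14.
nX = np.array([[0.30, 0.65, 0.33],
               [0.28, 0.19, 0.23]])
w = np.array([0.14, 0.12, 0.1])
benefit = np.array([False, True, True])
print(moora_scores(nX, w, benefit))          # higher score = better-ranked alternative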

In the last step, MOORA adds or subtracts each criterion of each alternative (Table 18), depending on whether the criterion is to be maximized or minimized; after applying the weights, the result is the efficiency of each option. With the results of the two economic decision models, TOPSIS and MOORA, the next part of the investigation is to compare the two models and verify whether the results are the same or whether one of the calculations is wrong. As shown in Table 19, the results of the two economic models are largely consistent: the first three alternatives keep the same relative importance, with the first option favored because it is a larger, more powerful and more efficient CNC milling machine. In the context of the intelligent dashboard design, the focus was on developing an economic decision-making model using Python programming and the TOPSIS method [17], as presented in Listing 1. The objective was to select the optimal CNC milling machine in order to optimize the production of prostheses and improve production efficiency.

Table 16 Weighted normalized matrix

Name        Price     Power     Speed     Elec. Con   Work Spa   Mov X     Mov Y     Mov Z     Weight    Length    Width     Height    DHM       DHC
XK9036TM    0.04258   0.00004   0.00906   0.00260     0.36525    0.00058   0.00028   0.00012   0.00093   0.00093   0.00083   0.00128   0.00041   0.00037
FCM-800NC   0.03986   0.00001   0.00647   0.00260     0.35104    0.00052   0.00031   0.00004   0.00109   0.00116   0.00104   0.00124   0.00070   0.00048
VMC-1050I   0.06341   0.00002   0.01294   0.00260     0.42861    0.00065   0.00039   0.00043   0.00259   0.00135   0.00106   0.00148   0.00071   0.00057
M8          0.06015   0.00003   0.01035   0.00260     1.41512    0.00129   0.00066   0.00055   0.00321   0.00189   0.00140   0.00142   0.00084   0.00083
V740M400    0.06341   0.00002   0.01294   0.00260     0.31556    0.00049   0.00031   0.00044   0.00129   0.00114   0.00113   0.00124   0.00069   0.00049
1100MX      0.04348   0.00000   0.01294   0.00260     0.16149    0.00030   0.00022   0.00032   0.00038   0.00091   0.00074   0.00121   0.00043   0.00029
XKW7130A    0.05145   0.00001   0.00518   0.00260     0.31913    0.00050   0.00038   0.00034   0.00119   0.00094   0.00096   0.00121   0.00063   0.00042


Table 17 Sum or subtraction of the criterion

Criterion   Price   Power   Speed   Elec. Con   Work Spa   Mov X   Mov Y   Mov Z   Weight   Length   Width   Height   DHM   DHC
Direction   Min     Max     Max     Min         Max        Max     Max     Max     Min      Max      Max     Max      Max   Max


Fig. 5 Unity environment

Table 18 Order of alternatives calculated with MOORA

Alternative   Weight        Order
XK9036TM      0.335393216   3
FCM-800NC     0.321811532   4
VMC-1050I     0.381956627   2
M8            1.370719269   1
V740M400      0.269481138   6
1100MX        0.134738297   7
XKW7130A      0.276805859   5

Table 19 Comparison of the two methods

Alternative   MOORA order   TOPSIS order
XK9036TM      3             3
FCM-800NC     4             5
VMC-1050I     2             2
M8            1             1
V740M400      6             4
1100MX        7             6
XKW7130A      5             7


A user interface was created to calculate the TOPSIS model, with the calculation process being performed in Excel for the purposes of this example. TOPSIS, a multi-criteria evaluation method, is based on the distance between the ideal positive and ideal negative solutions [17].

Listing 1 Python script for evaluating criteria using both the MOORA and TOPSIS methods

import numpy as np

def topsis(decision_matrix, weights, impacts):
    """TOPSIS evaluation of several alternatives based on multiple criteria.

    decision_matrix: numpy array, one row per alternative and one column per criterion.
    weights:         numpy array with the weight of each criterion.
    impacts:         numpy array with 1 for benefit criteria and -1 for cost criteria.
    Returns the closeness of each alternative to the ideal solution, Eq. (1).
    """
    # normalize the decision matrix column by column
    normalized_matrix = decision_matrix / np.linalg.norm(decision_matrix, axis=0)
    # weighted normalized decision matrix
    weighted_normalized_matrix = normalized_matrix * weights
    # ideal and negative ideal solutions (max of benefit columns, min of cost columns)
    ideal_positive = np.where(impacts == 1,
                              weighted_normalized_matrix.max(axis=0),
                              weighted_normalized_matrix.min(axis=0))
    ideal_negative = np.where(impacts == 1,
                              weighted_normalized_matrix.min(axis=0),
                              weighted_normalized_matrix.max(axis=0))
    # separations from the ideal (S') and from the negative ideal (S'')
    s_positive = np.sqrt(np.sum((weighted_normalized_matrix - ideal_positive) ** 2, axis=1))
    s_negative = np.sqrt(np.sum((weighted_normalized_matrix - ideal_negative) ** 2, axis=1))
    # relative closeness to the ideal solution, Eq. (1)
    topsis_scores = s_negative / (s_positive + s_negative)
    return topsis_scores

def moora(decision_matrix, weights, impacts, ranking_weights):
    """MOORA evaluation of several alternatives based on multiple criteria.

    decision_matrix: numpy array, one row per alternative and one column per criterion.
    weights:         numpy array with the weight of each criterion.
    impacts:         numpy array with 1 for benefit criteria and -1 for cost criteria.
    ranking_weights: numpy array with the weight of each criterion for the ranking step.
    Returns the MOORA score of each alternative (higher is better).
    """
    # normalize the decision matrix column by column
    normalized_matrix = decision_matrix / np.linalg.norm(decision_matrix, axis=0)
    # weighted normalized decision matrix
    weighted_normalized_matrix = normalized_matrix * weights
    # ranking matrix: benefit criteria are added, cost criteria are subtracted (Eq. 4)
    ranking_matrix = np.zeros(weighted_normalized_matrix.shape)
    for i in range(weighted_normalized_matrix.shape[0]):
        for j in range(weighted_normalized_matrix.shape[1]):
            if impacts[j] == 1:
                ranking_matrix[i][j] = weighted_normalized_matrix[i][j] * ranking_weights[j]
            else:
                ranking_matrix[i][j] = -weighted_normalized_matrix[i][j] * ranking_weights[j]
    # MOORA score: sum of the ranking matrix row by row
    moora_scores = np.sum(ranking_matrix, axis=1)
    return moora_scores
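A possible call of the two functions of Listing 1 is sketched below, assuming they are defined in the same script; the decision matrix is abbreviated to three machines and three criteria for readability (the full data is in Table 12), and the weights and impacts shown are illustrative rather than the exact configuration of the dashboard:

import numpy as np

decision_matrix = np.array([
    [23500, 24, 7000],    # XK9036TM: Price, Power, Speed
    [22000,  7, 5000],    # FCM-800NC
    [35000, 15, 10000],   # VMC-1050I
], dtype=float)
weights = np.array([0.14, 0.12, 0.1])   # weights as in Table 15
impacts = np.array([-1, 1, 1])          # Price is a cost criterion, the others are benefits

print("TOPSIS:", topsis(decision_matrix, weights, impacts))
print("MOORA :", moora(decision_matrix, weights, impacts, ranking_weights=weights))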

6 Conclusions

The development of the smart dashboard is an excellent way to bring together different topics, such as programming, the economic models for decision-making, and other subjects not detailed in this paper. Beyond the economic models themselves, related matters such as scheduling are also important in daily work.

The correlation of the data being compared is of great importance when making the decision, since an exhaustive analysis of the characteristics would be useless if those characteristics are not appropriate. What this comparison seeks is to save time in equipment selection and, in turn, to gain efficiency in quality and energy consumption, since in the long run this leads to better cost results. At the same time, the price of the equipment is an essential factor in the analysis (the higher the price, the more efficient the equipment tends to be), and the decision-maker will have to weigh this trade-off.

To demonstrate the industrial production, it is useful to elaborate an animation of the process, because reproducing the real production can be difficult and costly. The proposal is to build the animation with the UNITY software, where the user creates an avatar and an environment similar to the industrial sector that will be upgraded. The first step is to create an avatar in 3D modeling software such as VRoid Studio, where the eyes, mouth, hair, body style, clothing and accessories can be personalized so that the avatar resembles the user. Other software can also be used; in our case we used MIXAMO, a web page that provides free avatars and animations but also accepts uploaded avatars. The page generates the animation for the avatar and exports it in a file format acceptable to UNITY. In the result shown next, the hair appears white because of the illumination inside Unity, while the original avatar's hair is black (Fig. 6). It is then necessary to build the working environment with the station that will be upgraded (Fig. 7).

In line with the value of solidarity, it is essential to enhance certain industrial production processes to improve efficiency and speed. For instance, consider the production line for the assembly of prostheses, where the first step involves the arrival of raw materials, such as metal tubes and plastic components. These materials are


avatar to make it like you. You can personalize the eyes, mouth, hair, or body style and add dress and accessories. But also, you can use other software. In our case, we use MIXAMO, an internet page that gives free avatars and animations, but you can upload your avatar. The page makes the animation for the avatar and adds it to the unity making a type of archive acceptable for UNITY. Like the next, the hair is white because of the illumination inside the software unity, and the original avatar’s hair is black (Fig. 6). Now, it is necessary to elaborate on the working environment with the station that will be upgraded Fig. 7. In line with the value of solidarity, it is essential to enhance certain industrial production processes to improve efficiency and speed. For instance, consider the production line for the assembly of prostheses, where the first step involves the arrival of raw materials, such as metal tubes and plastic components. These materials are

Fig. 6 Avatar in UNITY

602

J. A. Esquivias Varela et al.

Fig. 7 Space of work

then processed in CNC milling machines before being assembled into prosthetic columns with their supports. It is important to note that the selection of industrial processes has a crucial role in determining the behavior of components associated with electric vehicles, influencing their future purchasing patterns, and mapping their use in a smart city. The aim is to carefully evaluate these processes to ensure optimal results [18–20]. As can be seen in the graphical results analyzed in Fig. 8.

Fig. 8 Smart City vehicles simulation

Intelligent Decision-Making Dashboard for CNC Milling Machines …

603

7 Future Research

Planning a methodology to guide, organize and intelligently manage the efforts and initiatives of workers in order to achieve better work efficiency, seeking to meet the objectives and achievements of the workplace, is significant. With the analysis described in this article, and thanks to the cooperation of the Tecnológico Nacional de México (TecNM) [21], an analysis of work procedures can be implemented that would reduce the work stress of the staff by avoiding their being saturated with activities that lead to psychological stress. Certain workers experience stress when they are not able to complete their activities correctly; in some cases this is due to a lack of training, or simply because some workers need more time when the learning curve of the activity is evaluated. This research can also be extended by evaluating other multi-criteria decision models, especially those based on non-compensatory preference relations [22–24].

For this reason, it is suggested that future studies analyze the activities to be performed through intelligence tables, in order to better understand the work stress that a heavy workload could cause. Future studies will have to be supported by previous analyses of work stress and human behavior, trying to reduce the bias of the information with systematic and previously calibrated valuations. It must also be understood that this methodology may vary for each worker, since their environment is also a factor that may affect the results of future studies.

The integration of technology with a focus on the human factor in Industry 4.0 requires a comprehensive approach to virtual training. This is crucial in order to effectively combine a range of capabilities and determine an optimal learning curve that complements the tacit knowledge involved in product development. It will be essential for meeting the needs of Generation Z and will result in a paradigm shift in operations. To ensure success, it will be imperative to identify the necessary resources while also considering future sustainability and minimizing any negative impact on the environment.

References

1. Ravimachines: What is milling machine, history of milling machine (n.d.). https://ravimachines.com/what-is-milling-machine-history-of-milling-machine/ (Accessed: 15 Apr. 2023)
2. Ramesh, R., Mannan, M.A., Poo, A.N.: Error compensation in machine tools: a review. Part I: geometric, cutting force induced and fixture dependent errors. International Journal of Machine Tools and Manufacture 40, 1235–1256 (2000). https://doi.org/10.1016/S0890-6955(00)00009-2
3. Wei, W., Zhang, D., Huang, T.: A general approach for error modeling of machine tools. International Journal of Machine Tools and Manufacture 79 (2014). https://doi.org/10.1016/j.ijmachtools.2014.01.003
4. Du, Z., Zhang, S., Hong, M.: Development of a multistep measuring method for motion accuracy of NC machine tools based on cross grid encoder. International Journal of Machine Tools and Manufacture 50, 270–280 (2010). https://doi.org/10.1016/j.ijmachtools.2009.11.010
5. Lasemi, A., Xue, D., Gu, P.: Accurate identification and compensation of geometric errors of 5-axis CNC machine tools using double ball bar. Meas. Sci. Technol. 27(5), 055004 (2016). https://doi.org/10.1088/0957-0233/27/5/055004
6. Wei, W., Liang, Y., Liu, F., Mei, S., Tian, F.: Taxing strategies for carbon emissions: a bilevel optimization approach. Energies 7, 2228–2245 (2014). https://doi.org/10.3390/en7042228
7. Avram, O.I., Xirouchakis, P.: Evaluating the use phase energy requirements of a machine tool system. J. Clean. Prod. 19, 699–711 (2011). https://doi.org/10.1016/j.jclepro.2010.10.010
8. Brauers, W.K., Zavadskas, E.K.: The MOORA method and its application to privatization in a transition economy. Control and Cybernetics 35, 107–124 (2006). ISSN 0324-8569
9. Mori, M., Fujishima, M., Inamasu, Y., Oda, Y.: A study on energy efficiency improvement for machine tools. CIRP Ann. Manuf. Technol. 60, 145–148 (2011). https://doi.org/10.1016/j.cirp.2011.03.099
10. Huang, J., Liu, F., Xie, J.: A method for determining the energy consumption of machine tools in the spindle start-up process before machining. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture 230, 1639–1649 (2016). https://doi.org/10.1177/0954405415600679
11. Jia, S., Yuan, Q., Ren, D., Lv, J., Peng, T.: Energy demand modeling methodology of key state transitions of turning processes. Energies 10, 1–19 (2017). https://doi.org/10.3390/en10040462
12. Jia, S., Tang, R., Lv, J., Yuan, Q., Peng, T.: Energy consumption modeling of machining transient states based on finite state machine. The International Journal of Advanced Manufacturing Technology 88, 2305–2320 (2017). https://doi.org/10.1007/s00170-016-8952-2
13. Pawanr, S., Garg, G.K., Routroy, S.: Development of a transient energy prediction model for machine tools. Procedia CIRP 98, 678–683 (2021). https://doi.org/10.1016/j.procir.2021.01.174
14. Yuvaraj, N., Praghash, K., Rajan Arshath Raja, T., Karthikeyan, T.: An investigation of Garbage Disposal Electric Vehicles (GDEVs) integrated with Deep Neural Networking (DNN) and Intelligent Transportation System (ITS) in Smart City Management System (SCMS). Wireless Personal Communications 123(2), 1733–1752 (2022). https://doi.org/10.1007/s11277-021-09210-8
15. Miç, P., Antmen, Z.F.: A decision-making model based on TOPSIS, WASPAS, and MULTIMOORA methods for university location selection problem. SAGE Open 11(3) (2021). https://doi.org/10.1177/21582440211040115
16. Salvatore, C., Menelaos, T.: A robust TOPSIS method for decision making problems with hierarchical and non-monotonic criteria. Expert Systems with Applications 214, 119045 (2023). ISSN 0957-4174. https://doi.org/10.1016/j.eswa.2022.119045
17. Triantaphyllou, E.: Multi-criteria Decision Making Methods: A Comparative Study. Springer (2000). https://doi.org/10.1007/978-1-4757-3157-6
18. Yoon, K.P., Hwang, C.L.: Multiple Attribute Decision Making: An Introduction. Sage (1981). https://doi.org/10.1007/978-3-642-48318-9
19. Soares, J.P., Lezama, F., Trindade, A., Ramos, S., Canizes, B., Vale, Z.A.: Electric vehicles trips and charging simulator considering the user behaviour in a smart city. ISGT-Europe, 1–6 (2021). https://doi.org/10.1109/ISGTEurope52324.2021.9640054
20. Fernández Pallarés, V., Guerri Cebollada, J.C., Roca Martínez, A.: Interoperability network model for traffic forecast and full electric vehicles power supply management within the smart city. Ad Hoc Networks 93 (2019). https://doi.org/10.1016/j.adhoc.2019.101929
21. TecNM: Matrícula 2020–2023 [Enrollment 2020–2023] (2022). https://sne.tecnm.mx/public/ (Accessed: 15 Apr. 2023)
22. Rivera, G., Porras, R., Sanchez-Solis, J.P., Florencia, R., García, V.: Outranking-based multi-objective PSO for scheduling unrelated parallel machines with a freight industry-oriented application. Eng. Appl. Artif. Intell. 108, 104556 (2022). https://doi.org/10.1016/j.engappai.2021.104556


23. Rivera, G., Florencia, R., Guerrero, M., Porras, R., Sánchez-Solís, J.P.: Online multi-criteria portfolio analysis through compromise programming models built on the underlying principles of fuzzy outranking. Inf. Sci. 580, 734–755 (2021). https://doi.org/10.1016/j.ins.2021.08.087
24. Rivera, G., Coello Coello, C.A., Cruz-Reyes, L., Fernandez, E.R., Gomez-Santillan, C., Rangel-Valdez, N.: Preference incorporation into many-objective optimization: an ant colony algorithm based on interval outranking. Swarm and Evolutionary Computation 69, 101024 (2022). https://doi.org/10.1016/j.swevo.2021.101024