Intelligent Systems and Applications: Proceedings of the 2023 Intelligent Systems Conference (IntelliSys) Volume 3 (Lecture Notes in Networks and Systems, 824) [1st ed. 2024] 3031477146, 9783031477140

The book is a unique collection of studies involving intelligent systems and applications of artificial intelligence in …


English · 885 pages · 2024




Table of contents :
Preface
Contents
TPDNet: A Tiny Pupil Detection Neural Network for Embedded Machine Learning Processor Arm Ethos-U55
1 Introduction
2 Related Work
3 Pupil Detection Dataset
4 Tiny Pupil Detection Neural Network
4.1 Architecture of Neural Network
4.2 Training Process
4.3 Quantization of Neural Network and Simulation
5 Results
5.1 Detection Rate of Tiny Pupil Detection Neural Network
5.2 Run-Time of Quantized Tiny Pupil Detection Neural Network
6 Conclusion and Future Work
References
Low Cost Machine Vision for Insect Classification
1 Introduction
2 Materials and Methods
2.1 Imaging Requirements
2.2 Hardware Setup
2.3 Dataset and Preprocessing
2.4 Machine Learning, Insect Classification
2.5 Semantic Segmentation
3 Results and Discussion
3.1 Classification
3.2 Bounding Boxes via Segmentation
4 Conclusion and Outlook
A Appendix
References
Shape Complexity Estimation Using VAE
1 Introduction
2 Methods
2.1 Datasets
2.2 Variational Autoencoder Reconstruction Measure
2.3 Zlib Compression Measure
2.4 Fourier Transform Measure
2.5 Combining Measures
3 Results
4 Discussion, Conclusion, and Future Work
References
Training System for the Tomato Paste Production Process Through Virtual Environments
1 Introduction
2 Virtual Environment
3 Control Scheme
4 Analysis and Results
5 Conclusions
References
Shared Intentionality Before Birth: Emulating a Model of Mother-Fetus Communication for Developing Human-Machine Systems
1 Introduction
1.1 Achievements of Fetuses
1.2 A Launching Cognition Hypothesis
1.3 Two Questions About Perception
1.4 Shared Intentionality
2 Objective
3 Method
4 Results
4.1 Pulsed ElectroMagnetic Field
4.2 The Effect of PEMF on A(2A) ARs
4.3 Computer-Aided Assessment of Shared Intentionality
5 Discussion
6 Conclusion
References
DAP: A Framework for Driver Attention Prediction
1 Introduction
2 Experimental Setup
2.1 Dataset
2.2 Architecture
2.3 Training Details
3 Results and Discussion
4 Conclusion
References
On Object Detection Based on Similarity Measures from Digital Maps
1 Introduction
2 Problem Description
2.1 Elements of Wastewater Networks
2.2 Data on Wastewater Networks
3 The Proposed Approach
3.1 Extraction Methods
3.2 Matching Measures
4 Experimental Evaluation
5 Conclusion and Further Work
References
Virtualization of the Paint Mixing Process Using the Hardware in the Loop Technique
1 Introduction
2 Proposed Implementation
3 Design of the Process
3.1 Virtualization of the Mixing Process
3.2 Design and Implementation of the SCADA System
3.3 Communication
3.4 Network Setting
4 Experimental Results
4.1 Stages of Virtualization of the Paint Mixing
5 Conclusions
References
Locally Enhanced Chan-Vese Model with Anisotropic Mesh Adaptation for Intensity Inhomogeneous Image Segmentation
1 Introduction
2 Models and Methods
2.1 Review of Image Segmentation Models
2.2 AMA Image Segmentation Framework
3 Results
3.1 Segmentation for Images with Geometric Shapes
3.2 Segmentation of Real Images
3.3 Comparison with Other Models
3.4 Discussion
4 Conclusions
References
An Unmanned System for Automatic Classification of Hazardous Wastes in Norway
1 Introduction
2 Related Work
3 Waste Classification System
4 Collected Data
5 Classification
5.1 Model
5.2 Data Transformations and Augmentations
5.3 Training Hyperparameters
5.4 Evaluation Process of Models
6 Evaluation
6.1 Final Model: Evaluation on Collected Data
6.2 Final Model: On-Site Testing
7 Discussion and Outlook
7.1 Investigation of the Difference in Performance
7.2 Multiple Objects per Image
7.3 Improvement of Classification Models
7.4 Extension of the Dataset
7.5 Multi-view Classification
A Appendix
A.1 Data and Source Code
A.2 All Cross-Validated Models
References
Remote Learning of Mathematics for Visually Impaired Students During COVID-19: Exploring Online Intervention, Resources, Challenges and Issues
1 Introduction
2 Related Work
3 Remote Learning-Based Educational Model for Visually-Impaired Students
4 COVID-19 Resources and Tips for Remote Learning
5 Overall Educational Experience with Remote Learning
5.1 Analysis of Remote Learning During COVID
6 Discussion
7 Conclusion
8 Future Work
References
Adversarial Robustness of Multi-bit Convolutional Neural Networks
1 Introduction
2 Related Work
2.1 Quantized and Binary Neural Networks
2.2 Adversarial Robust Compression
3 Methodology
3.1 Design Space of Multi-bit Networks
3.2 Analysing Gradient Flows
3.3 Compute Complexity
4 Experiments
4.1 Worst-Case Threat-Model
4.2 Inherent Robustness of Multi-bit Networks
4.3 Adversarial Training of Multi-bit Networks
5 Conclusion
References
Optimization of Lacrimal Aspect Ratio for Explainable Eye Blinking
1 Introduction
2 Background
3 Lacrimal Aspect Ratio
4 System Configuration and Experimental Setup
4.1 Dataset
4.2 Proposed LAR Blink Detection
4.3 Data Optimization
5 Results and Discussion
6 Conclusion
References
SIMRL: A New Approach for Integrating Simulation with Reinforcement Learning
1 Introduction
2 Methodology
2.1 Simulation Engine
2.2 Reinforcement Learning
2.3 SIMRL Integrator
3 SIMRL Implementation
3.1 Architecture Components
3.2 Packages Description
3.3 System Processes
3.4 Metrics
4 Test Case Objective
4.1 Entities for the Epidemic Spreading Test Case
4.2 Calculating the RL Agent Score
4.3 Input Variables
4.4 Minimal Check
4.5 Running the Simulation Engine on Actual Data
4.6 Classes Description
4.7 System GUI
4.8 Assumption for the Test Case
5 Experimental Results
5.1 Training Data
5.2 Simulated Data, Before and After the Learning
5.3 Further Investigation
6 Conclusion
References
Grouping Shapley Value Feature Importances of Random Forests for Explainable Yield Prediction
1 Introduction
2 Related Work
3 Approach
3.1 The Value of Predefined Coalitions in a Cooperative Game
3.2 From Grouped Shapley Values to Local Explanations
3.3 From Local Explanations to Global Understanding
3.4 Grouped Shapley Values on Tree Structures
4 Experimental Results
4.1 Soybean Yield Prediction Based on Remote Sensing Data
4.2 Grapevine Yield Prediction Based on Phenological Data
5 Conclusions
References
Mining Interesting Aggregate Tuples
1 Introduction
2 Preliminary
2.1 Data Cube
2.2 Skyline Operation
2.3 Interesting Aggregate Tuples
3 Computing the Complete Index Base of Data Cube
3.1 Complete Index Base
3.2 The Complete Index Base for Querying Data Cube
4 Computing Interesting Aggregate Tuples Based on the Complete Index Base
5 Experimental Results and Discussions
5.1 On Building the Complete Index Base
5.2 On Query with Aggregate Functions
5.3 On Computing the Interesting Aggregate Tuples
5.4 Synthesis of Results
6 Discussion, Conclusion, and Further Work
References
Optimization of Agrivoltaic Plants: Development and Validation of a Numerical Model to Account for Shading Effects on Crop Yields
1 Introduction
1.1 Shading Effect of Photovoltaic Panels on Crops Production: State of the Art
2 Evaluation of Crop Yields
3 Methodology
3.1 Incident Radiation Reduction at Ground Level
3.2 Agrivoltaic Plants Geometry
3.3 Model Results Validation
4 Agrivoltaic Plants Dynamic Simulation
4.1 TRNSYS© Model
4.2 Results of the Simulation
4.3 Solution Comparison
5 Conclusions
References
Algorithmic Fairness in Healthcare Data with Weighted Loss and Adversarial Learning
1 Introduction
2 Related Works
2.1 Fair Prediction
2.2 Reducing the Impact of Algorithmic Bias
2.3 Different Approaches for Mitigating Bias
3 Dataset
3.1 Descriptive Analysis by Age Groups and Gender
3.2 Feature Extraction
4 Approaches to Reduce Bias
4.1 Classification Using Artificial Neural Network
4.2 Data Construction and Weighted Loss
5 Adversarial Learning
5.1 Model Structure
5.2 Model Training
6 Result Analysis
6.1 The Effect of Adding Sensitive Attributes
6.2 Mitigating Imbalanced Labels and Sensitive Groups Disparities
6.3 Group Fairness
7 Conclusions
References
Application of Mean-Variance Cloning Technique to Investigate the Comparative Performance Analysis of Classical Classifiers on Imbalance and Balanced Data
1 Introduction
2 Material and Methods
2.1 Mean-Variance Cloning Technique
2.2 Imbalance Classifier
3 Performance Evaluation
3.1 Data Set
4 Results and Discussion
4.1 Part A: Over-Sampling
4.2 Part B: Under-Sampling
4.3 Comparative Performance Analysis of Imbalance and Balanced Data Set Based on Classifiers
4.4 Discussion
5 Conclusion
References
Research on Music Recommendation Model with Limited Historical Data and User’s Information
1 Introduction
2 Data Analysis
2.1 Data Information
2.2 Evaluation
3 Music Recommendation Model
3.1 General Music Recommendation
3.2 Clustering Method
3.3 Recommendation Model Based on Classification
4 Conclusion
References
A System for Analyzing Human Capability at Scale Using AI
1 Introduction
1.1 Related Literature
2 System Implications
3 Data
4 Human Capability Text Extraction
5 HC Lexicons
6 Engineering the System
7 HC Reporting and Business Outcomes
7.1 Revenue per Employee
7.2 Tobin's Q (Price to Book Value)
7.3 Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA)
7.4 Social Responsibility
8 Concluding Discussion
References
The Impact of Data Ingestion Layer in an Improved Lambda Architecture
1 Introduction
2 State of the Art
3 An Improved Lambda Architecture
4 Simulation Results
4.1 Data Ingestion Evaluation Process
4.2 Impact on the Real-Time Processing Process
4.3 Impact on the Batch Processing
5 Conclusion
References
Using Clustering Ensembles and Heuristic Search to Estimate the Number of Clusters in Datasets
1 Introduction
2 Related Work
2.1 Methods for Estimating the Number of Clusters
2.2 How Does a Cluster Estimator Work?
3 The Ensemble Framework
3.1 Generation of the Base Clustering
3.2 Construction of the Agreement Matrix
3.3 Sub-setting the Agreement Matrix
3.4 Experimental Data
3.5 Selection of the Best Subset
3.6 The Fitness Function
4 Methods
4.1 The Exhaustive Approach
4.2 Random Mutation Hill Climbing (RMHC)
5 Experiments
6 Results and Discussions
7 Conclusions and Recommendations for Future Work
References
Data-Driven Urban Air Mobility Flight Energy Consumption Prediction and Risk Assessment
1 Introduction
2 Literature Review
3 Problem Statement
4 Methodology
4.1 Flight Data Description
4.2 Factors Affecting Energy Consumption
4.3 Ensemble Learning Prediction Model
4.4 Risk Formulation and Metrics
4.5 Energy Reserve Requirement
5 Results and Discussion
6 Conclusions and Future Work
References
Development of a Decision Support System in a Canning Industry
1 Introduction
2 Data Acquisition Through PLC-OPC Server: Database
3 Decision Support Systems (DSS)
3.1 Data Cleaning
3.2 Data Pre-processing
3.3 Decision Making
3.4 Data Visualization
4 Conclusion
References
Estimating the Tendency of Social Media Users to Spread Fake News
1 Introduction
2 Related Work
3 Dataset Creation
4 Methodology
4.1 Task Description
4.2 Feature Extraction
4.3 Model Description
5 Experiments and Results
5.1 Experimental Setup
5.2 Regression Algorithm Selection
5.3 Evaluation and Analysis
5.4 Feature Importance
6 Discussion
7 Conclusions and Future Work
References
Analyzing Apache Storm as Core for an Event Processing Network Model
1 Introduction
2 Related Work
3 Event Processing Networks and Requirements
3.1 Basics of Event Processing Networks
3.2 Requirements
4 Apache Storm
4.1 Overview of Apache Storm
4.2 Evaluation of Apache Storm for EPN Modeling
5 Comparison
6 Conclusion
References
Designing Against Bias: Identifying and Mitigating Bias in Machine Learning and AI
1 Introduction: Bias in Machine Learning and AI
1.1 Overview
1.2 Important Examples
2 Types of Bias and Their Root Causes
2.1 Overview
2.2 Sampling Bias
2.3 The History Problem
2.4 The Spaghetti Problem
2.5 Lack of Transparency
3 Bias Mitigation
3.1 Measuring Bias with Disparate Impact
3.2 Biased Minimized Comparison Algorithm
3.3 Open Source Packages for Measuring and Mitigating Bias
4 Summary
References
Study on Grade Discrimination Method of Traffic Accident Risk Factors Considering Road Type
1 Introduction
2 Accident Dataset and Visual Analysis
2.1 Traffic Accident Dataset Information
2.2 Multidimensional Visualization Analysis of Traffic Accident Data
3 Data Mining Methods and Model Building
3.1 Data Mining Method Theory Overview
3.2 Establishment and Optimization of Risk Factor Grade Screening Model for Different Road Types
4 Model Results Analysis and Accident Prevention Suggestions
5 Discussion
6 Conclusion
References
Summarization of Feedback from Residents in Urban Area Using the Unsupervised Method
1 Introduction
2 Literature Review
2.1 Urban and Urbanization
2.2 Topic Modelling Using LDA Method
2.3 The Importance of Voices of Urban Residents
2.4 Filling the Gaps of Previous Research
3 Methodology
3.1 Data Pre-processing
3.2 Modelling
3.3 Visualization of Results
4 Results and Discussion
5 Conclusion
References
Generating and Updating Supervised Data Mining Models on a Periodic Basis
1 Introduction
2 Handling Concept Drift
3 Self-updating Algorithms
3.1 Constructing and Training the Models
3.2 Evaluating the Models
3.3 Testing the System
4 Conclusions and Future Work
References
Home Appliance Demand Forecasting: A Comparative Approach Using Traditional and Machine Learning Algorithms
1 Introduction
1.1 Literature Review
2 Materials and Methods
2.1 Industry Production Process for the Forecast
2.2 Dataset Description
2.3 Forecast Models
3 Experimentation and Discussions
3.1 First Experiment
3.2 Second Experiment
4 Conclusions and Limitations
4.1 Future Work
References
Predictive Analysis of Bank Marketing for Financial Decision Support and Smart Economy
1 Introduction
2 Data Mining on Bank Customer Profiles
2.1 Data Description and Preprocessing
2.2 Association Rule Mining
2.3 Decision Tree Classification
3 Tool for Bank Marketing Predictive Analysis
3.1 Implementation of the Tool
3.2 System Demo and Experiments
4 Applications: Financial Decision Support and Smart Economy
5 Related Work
6 Conclusions
References
Wireless Sensor Networks Anomaly Detection Using Machine Learning: A Survey
1 Introduction
2 Classification Criterion of Anomalies
3 Anomaly Detection Techniques in WSN
3.1 Classification of the Approaches
3.2 Supervised Learning Approaches
3.3 Unsupervised Learning Approaches
3.4 Semi-Supervised Learning Approaches
4 Conclusion
References
Android Malware Detection Using Machine Learning: A Review
1 Introduction
2 Literature Review
2.1 Overview of the Relevant Research
2.2 Classification of the Existing Approaches
2.3 Comparison of the Approaches
3 Methodology
3.1 Overview of the Selection Criteria
3.2 Selection of the Papers
3.3 Data Collection and Analysis
4 Results and Discussion
4.1 Overview of the Key Findings
4.2 Summary of the Contributions
4.3 Discussion of the Limitations
4.4 Identification of Future Research Directions
5 Conclusion
References
A Data Protection Design for Online Exam Proctoring in Compliance with the Indonesian Personal Data Protection Law
1 Introduction
2 Background
2.1 Indonesian Personal Data Protection Regulation
2.2 Learning Management System (LMS)
3 Methodology
3.1 System Model
4 Indonesian PDP Compliant Online Proctoring in Learning Management System Architecture
4.1 Privacy and Security by Data Labelling
4.2 Use Cases
4.3 Operation Schemata
5 Conclusion
References
A Method to Transform Datasets into Knowledge Graphs
1 Introduction
2 Revision of Related Work
2.1 Knowledge Graphs in the Medical Domain
2.2 Knowledge Graph Construction Methods
3 Methodological Issues
3.1 Description of the Method
4 Analysis of Dataset
5 Data Extraction
5.1 Patient Data Parsing
5.2 Insulin Administration Data Parsing
5.3 Physical Exam Data Parsing
5.4 Continuous Glucose Monitoring Data Parsing
6 Knowledge Graph Construction
6.1 Knowledge Graph Design
6.2 Knowledge Graph Implementation
6.3 Knowledge Graph Evaluation
7 Conclusions
References
Explainable Dynamic Ensemble Framework for Classification Based on the Late Fusion of Heterogeneous Multimodal Data
1 Introduction
2 Related Work
2.1 Static Ensemble
2.2 Dynamic Ensemble
2.3 Early Fusion-Based Ensemble
2.4 Late Fusion-Based Ensemble
3 Methods
3.1 Dataset
3.2 Proposed Model
3.3 Data Preprocessing
4 Results
4.1 Early Fusion-Based Ensemble Modeling
4.2 Late Fusion-Based Ensemble Modeling
5 Model Explainability
5.1 Case-Based Reasoning
5.2 Deep-Based Classifiers Contributions
6 Discussion
7 Conclusion
References
Towards Efficient Edge Learning with Limited Storage Resource: Bandit-Based Training Data Retrieval in AIoT
1 Introduction
2 Related Work
2.1 Data Compression and Data Filtering for Edge Intelligence
2.2 Data Caching for Intelligent Applications at Edge
3 Problem Formulation
3.1 Problem Definition
3.2 System Model
4 Bandit-Based Training Data Retrieval
4.1 Rewards of End Devices Based on Thompson Sampling
4.2 Data Retrieval Paradigm
5 Performance Evaluation
5.1 Experiment
5.2 Results Analysis
6 Conclusion
References
EduChat: AI-Powered Chatbot with Personalized Engagement for Online Learning
1 Introduction
2 Related Work
2.1 A Chatbot
2.2 Chatbot in Education
3 EduChat Design and Development
3.1 NLP Pre-processing
3.2 Semantic Knowledge Representation
3.3 AI-Engine
4 Steps and Sample Scripts of EduChat
5 Discussion and Conclusion
References
Q8KNN: A Novel 8-Bit KNN Quantization Method for Edge Computing in Smart Lighting Systems with NodeMCU
1 Introduction
2 Related Works
3 Proposed Design
3.1 Smart Lighting System with NodeMCU Edge Computing Concept
3.2 The Q8KNN Model Compression Method
3.3 Performance Metrics
4 Results and Discussion
4.1 Results
4.2 Discussion
5 Conclusion
References
Internet of Things Platform as a Service for Building Digital Twins and Blockchain
1 Introduction
2 Related Works
2.1 IoT Platform Architecture Comparison
2.2 Blockchain-Based Architecture for IoT Platform
2.3 Digital Twin Based for IoT Platform
3 Internet of Things Platform Analysis
3.1 Tuya Platform
3.2 ThingsBoard Platform
3.3 Azure IoT Platform
3.4 SEMAR Platform
3.5 OpenHAB Platform (Open Home Automation Bus)
3.6 Home Assistant Platform
4 IoT Platform Comparison and Conceptual Model of the New IoT Platform Architecture
4.1 IoT Platform Evaluation
5 Conclusions
6 Future Work
References
From Industry 4.0 Toward Agriculture 4.0
1 Introduction
2 Previous Research
3 Definition of Industry 4.0
4 Agriculture 4.0 and Food Industry
5 From Industry 4.0 Toward Agriculture 4.0
6 Technology-Driven Agriculture
6.1 Internet of Things in Agriculture
6.2 Wireless Sensor Networks in Agriculture
6.3 Artificial Intelligence in Agriculture 4.0
6.4 Advanced Robotics
6.5 Machine Learning in Agriculture
6.6 Drones (UAVs) Technology in Agriculture 4.0
7 The Benefits of Industry 4.0 in Agriculture
8 Challenges Facing Agriculture 4.0
9 Discussions
10 Conclusions
References
Sustainable Hydroponics: Embedding IoT and Efficient Energy Generation and Distribution Algorithms
1 Introduction
2 Literature Review
2.1 Hydroponics
2.2 Wireless Sensor Networks
2.3 IoT
2.4 SEH-WSN (Solar Energy Harvesting Wireless Sensor Nodes)
3 Research Methodology
4 Data Analysis and Findings
4.1 Energy Harvesting Model
4.2 Energy Efficient SWORD Algorithm in Wireless Sensor Network
4.3 Energy Efficient Routing Algorithms
5 Simulated Experiment
6 Conclusion
7 Limitations and Future Research Directions
References
Towards Programmable Context Aware Interaction with Smart Home Internet of Things
1 Introduction
2 Related Work
3 Proposed Context Aware Framework
3.1 Child Aware Device Programming
3.2 Location Aware Device Programming
3.3 Encouraging Healthy Posture
4 Annotation of IoT and Scene Programming Editor
5 Realization and Measurements
5.1 Location Accuracy
5.2 Comparisons of Interactions with Annotated Objects
5.3 Study of Interactions with a Light
5.4 Usability Study of Healthy Posture Awareness
5.5 User Posture Accuracy
6 Conclusion
References
IPAssess: A Protocol-Based Fingerprinting Model for Device Identification in the IoT
1 Introduction
2 Related Works
3 Proposed Methodology
3.1 Dataset
3.2 Data Pre-processing
3.3 Selection of Optimal Feature Set and Machine Learning Algorithm
3.4 Data Augmentation
3.5 Data Aggregation
3.6 Prediction Model
4 Results
5 Discussion
6 Conclusion
References
Developing a Smart Monitoring App and Water Management System for Arid Environments Using IoT: An Innovative Approach
1 Introduction
2 Related Work
3 Proposed System
4 Discussion/Results
5 Conclusion/Future Work
References
Taking Natural Language Generation and Information Extraction to Domain Specific Tasks
1 Introduction
2 Background and Related Work
3 Tools and Techniques for IE and NLG
3.1 Text Extraction
3.2 Named Entity Recognition
3.3 Keyword Extraction/Matching
3.4 Template-Based Content Generation
3.5 Text Simplification
3.6 Text Summarization
4 Case Study and Results
4.1 Contract Annotation
4.2 SmPC to PIL
5 Conclusions and Future Work
References
A Simple But Powerful Graph Encoder for Temporal Knowledge Graph Completion
1 Introduction
2 Preliminaries and Related Work
3 Our Method
3.1 Subgraph Sampling in Temporal Neighborhood
3.2 Time-Aware Relational Aggregation
3.3 Learning and Inference
4 Experiments
4.1 Experimental Setup
4.2 Implementation Details
4.3 Main Results
4.4 Parameter Efficiency Analysis
4.5 Generalization to Unseen Timestamps and Irregular Timestamped Data
4.6 Ablation Study
4.7 Comparison over Different KG Score Functions
4.8 Temporal Neighborhood Exploration
5 Conclusion
References
A Hybrid Model for Novel Story Generation Using the Affective Reasoner and ChatGPT
1 Introduction and Motivation
2 Background Theory
3 The Basics of Story Morphing in the Affective Reasoner
4 The Hybrid Model of Story Generation, and Its Significance
5 Different Appraisals for Characters Yields Different Stories
6 Using ChatGPT to Change Narrative Perspectives
6.1 Mona’s Perspective Story One
6.2 Mona’s Perspective Story Two
6.3 Bruce’s Perspective Story 3
6.4 Professor Kim’s Perspective Story 4
7 Building a Compassionate Computer Companion, Including Understanding Fortunes-of-Others Emotions
8 A Companion Computer, That Has a Relationship with the User, and Feelings
9 Third Person Perspective, with Cases That Are Metaphorically Similar, Representing an Automated Therapist
10 How Different AR Temperaments Affect Spoken Dialogue and Actions
10.1 John, Temperament One
10.2 John, Temperament Two
11 AR Moods
11.1 John, Depressed Mood
11.2 John, Manic Mood
11.3 John, in Physical Pain
12 Implementation
13 Conclusion and Summary
References
Prompt-Based Learning for Thread Structure Prediction in Cybersecurity Forums
1 Introduction
2 Related Work
2.1 Reply Relationships Identification
2.2 Thread Structure Prediction
2.3 Instructional Prompting
3 Model Description
4 Evaluation and Results
4.1 Datasets
4.2 Metrics and Task
4.3 Libraries and Hyperparameters
4.4 Results
5 Analysis and Discussion
5.1 NPP-IP Performance
5.2 Precision and Recall Implications
5.3 Error Cases
6 Conclusion
References
Part-of-Speech Tagging Accuracy for Manufacturing Process Documents and Knowledge
1 Introduction
2 Literature Review
3 Methodology
4 Result and Discussion
5 Conclusion and Future Work
References
A Multi-label Few-Shot Learning with Combinations of Layers
1 Introduction
2 Related Work
3 Architecture of the Framework
3.1 Creating Features for Classification Pipelines: Feature Extraction Module
3.2 Filtering Irrelevant Transcripts: Semi-Supervised Binary Classification of Unlabeled Transcripts
3.3 Utilizing Multi-class Classification Estimator for Multi-label Classification
4 Experiments
4.1 Data
4.2 Experiment Set-Up
4.3 Experimental Results
5 Discussion
6 Conclusion
References
Application of Language Models for the Segmentation of Website Visitors
1 Introduction
2 Related Work
2.1 Other Application Areas of Language Models
2.2 Other Approaches to User Segmentation of Web Users
3 Problem Statement
4 Data Preprocessing
4.1 Online Analytics
4.2 Extraction of the Text Content
4.3 Summarization of the Page Content
4.4 Creation of User Profiles in Text Form
5 Analysis/Methods
5.1 Unsupervised Segmentation Approach
5.2 Supervised Querying in Natural Language
6 Evaluation
7 Conclusion
References
Identifying Research Topics in Human-Computer Interaction for Development: What Value Can Natural Language Processing Techniques Add?
1 Introduction
2 Literature
3 Collection of Data
4 Method
4.1 Pre-processing of Data
4.2 Applying tSNE and BERT
4.3 Topic Labelling
5 Results and Findings
5.1 BERT Results
6 Discussion
7 Conclusion
8 Future Work
9 Appendix
References
End-to-End Aspect Extraction and Aspect-Based Sentiment Analysis Framework for Low-Resource Languages
1 Introduction
2 Related Work
2.1 Aspect-Based Sentiment Analysis in English
2.2 Aspect-Based Sentiment Analysis in Greek
3 Dataset
3.1 Dataset Preprocessing and Augmentation
4 Architecture
4.1 Aspect Extraction
4.2 Aspect-Based Sentiment Analysis
5 Experiments
5.1 Experimental Setup
5.2 Experimental Results
5.3 Error Analysis
5.4 Performance Comparison
6 Discussion
7 Conclusions
References
Attending to Entity Class Attributes for Named Entity Recognition with Few-Shot Learning
1 Introduction and Background
2 AtEnA Framework
2.1 Entity Class Attribute Extraction
2.2 Attribute Attention Mask for Pre-training
2.3 Application Fine-Tuning and Inference
3 Experimental Setup
4 Experimental Results
4.1 Ablation and Parameter Study
5 Conclusion
References
Author Index


Lecture Notes in Networks and Systems 824

Kohei Arai   Editor

Intelligent Systems and Applications Proceedings of the 2023 Intelligent Systems Conference (IntelliSys) Volume 3

Series Editor Janusz Kacprzyk , Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors:
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Türkiye
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).

Editor: Kohei Arai, Faculty of Science and Engineering, Saga University, Saga, Japan

ISSN 2367-3370, ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-3-031-47714-0, ISBN 978-3-031-47715-7 (eBook)
https://doi.org/10.1007/978-3-031-47715-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland. Paper in this product is recyclable.

Preface

It gives me immense pleasure and privilege to present the proceedings of Intelligent Systems Conference (IntelliSys) 2023 which was held in a hybrid mode on 7 and 8 September 2023. IntelliSys was designed and organized in Amsterdam, the Netherlands, that aimed to advance and apply artificial intelligence to real world. IntelliSys is an annual conference which provides the platform to researchers, academics and industry practioners across the globe to share their valuable findings and insights. The conference witnessed huge international participation with delegates from all parts of the world. A technological revolution has hit the world where Artificial Intelligence, Robotics, Machine Vision and Ambient Intelligence have gained preeminence over all the other fields. The researches in these fields have managed to give workable solutions to many intriguing problems. They also let us see through what the future would look like if artificial intelligence was entwined in our life. One of the meaningful and valuable dimensions of this conference is the way it lets researchers report and discuss these breakthroughs. The aim was to further increase the body of knowledge in this specific area by providing a forum to exchange ideas and to build international links. Authors from 50+ countries submitted a total of 605 papers to be considered for publication. Each paper was reviewed on the basis of originality, novelty and rigorousness. After the reviews, 239 were accepted for presentation, out of which 227 papers are finally being published in the proceedings. We would like to extend our gratitude to all the learned guests who participated on site as well as online to make this conference extremely fruitful and successful and also special note of thanks to the technical committee members and reviewers for their efforts in the reviewing process. Special acknowledgment to all the distinguished keynote speakers. 
We are extremely glad to bring forth the precious research from our learned scholars and hope to whet the appetite of our readers. Your continued support and enthusiasm motivate us to grow and evolve. Saga, Japan

Kohei Arai

Contents

TPDNet: A Tiny Pupil Detection Neural Network for Embedded Machine Learning Processor Arm Ethos-U55 . . . 1
Gernot Fiala, Zhenyu Ye, and Christian Steger

Low Cost Machine Vision for Insect Classification . . . 18
Danja Brandt, Martin Tschaikner, Teodor Chiaburu, Henning Schmidt, Ilona Schrimpf, Alexandra Stadel, Ingeborg E. Beckers, and Frank Haußer

Shape Complexity Estimation Using VAE . . . 35
Markus Rothgänger, Andrew Melnik, and Helge Ritter

Training System for the Tomato Paste Production Process Through Virtual Environments . . . 46
Fernando J. Pantusin, Jimmy W. Cordonez, Milton A. Quimbita, Víctor H. Andaluz, and Andrés D. Vargas

Shared Intentionality Before Birth: Emulating a Model of Mother-Fetus Communication for Developing Human-Machine Systems . . . 56
Igor Val Danilov

DAP: A Framework for Driver Attention Prediction . . . 70
Ahmed Kamel, Ibrahim Sobh, and Walid Al-Atabany

On Object Detection Based on Similarity Measures from Digital Maps . . . 81
Arthur Marzinkowski, Salem Benferhat, Anastasia Paparrizou, and Cédric Piette

Virtualization of the Paint Mixing Process Using the Hardware in the Loop Technique . . . 98
Kelvin F. Toapanta, Katherin G. Pichucho, Jessica S. Ortiz, and José Varela-Aldás

Locally Enhanced Chan-Vese Model with Anisotropic Mesh Adaptation for Intensity Inhomogeneous Image Segmentation . . . 111
Karrar K. Abbas and Xianping Li

An Unmanned System for Automatic Classification of Hazardous Wastes in Norway . . . 128
Marc Gröling, Laurent Huang, and Ibrahim A. Hameed

Remote Learning of Mathematics for Visually Impaired Students During COVID-19: Exploring Online Intervention, Resources, Challenges and Issues . . . 145
Muhammad Shoaib, Donal Fitzpatrick, and Ian Pitt

Adversarial Robustness of Multi-bit Convolutional Neural Networks . . . 157
Lukas Frickenstein, Shambhavi Balamuthu Sampath, Pierpaolo Mori, Manoj-Rohit Vemparala, Nael Fasfous, Alexander Frickenstein, Christian Unger, Claudio Passerone, and Walter Stechele

Optimization of Lacrimal Aspect Ratio for Explainable Eye Blinking . . . 175
Misbah Ayoub, Andrew Abel, and Haiyang Zhang

SIM_RL: A New Approach for Integrating Simulation with Reinforcement Learning . . . 193
Elad Razy and Maya Herman

Grouping Shapley Value Feature Importances of Random Forests for Explainable Yield Prediction . . . 210
Florian Huber, Hannes Engler, Anna Kicherer, Katja Herzog, Reinhard Töpfer, and Volker Steinhage

Mining Interesting Aggregate Tuples . . . 229
Viet Phan-Luong

Optimization of Agrivoltaic Plants: Development and Validation of a Numerical Model to Account for Shading Effects on Crop Yields . . . 244
Michela Costa, Stefano Barba, Daniele Piazzullo, and Adolfo Palombo

Algorithmic Fairness in Healthcare Data with Weighted Loss and Adversarial Learning . . . 264
Pronaya Prosun Das, Marcel Mast, Lena Wiese, Thomas Jack, Antje Wulff, and ELISE STUDY GROUP

Application of Mean-Variance Cloning Technique to Investigate the Comparative Performance Analysis of Classical Classifiers on Imbalance and Balanced Data . . . 284
Friday Zinzendoff Okwonu, Nor Aishah Ahad, Joshua Sarduana Apanapudor, Festus Irimisose Arunaye, and Olimjon Shukurovich Sharipov

Research on Music Recommendation Model with Limited Historical Data and User's Information . . . 301
Myeongjun Kang

A System for Analyzing Human Capability at Scale Using AI . . . 308
Daniel Zhu, Dave Ulrich, Sanjiv Das, and Norm Smallwood

The Impact of Data Ingestion Layer in an Improved Lambda Architecture . . . 325
Miguel Landry Foko Sindjoung, Ernest Basile Fotseu Fotseu, Mthulisi Velempini, Bernard Fotsing Talla, and Alain Bertrand Bomgni (PI)

Using Clustering Ensembles and Heuristic Search to Estimate the Number of Clusters in Datasets . . . 334
Afees Adegoke Odebode, Mahir Arzoky, Allan Tucker, Ashley Mann, Faisal Maramazi, and Stephen Swift

Data-Driven Urban Air Mobility Flight Energy Consumption Prediction and Risk Assessment . . . 354
Yonas Ayalew, Wendwosen Bedada, Abdollah Homaifar, and Kenneth Freeman

Development of a Decision Support System in a Canning Industry . . . 371
Panagiotis Mallioris, Georgios Kokkas, Alexandros Styliadis-Heinz, Ioannis Margaritis, Fotios Stergiopoulos, and Dimitrios Bechtsis

Estimating the Tendency of Social Media Users to Spread Fake News . . . 381
Ahmad Hashemi, Wei Shi, Mohammad Reza Moosavi, and Anastasia Giachanou

Analyzing Apache Storm as Core for an Event Processing Network Model . . . 397
Christin Schulze, Christian Gerner, Matthias Tyca, Arne Koschel, Anna Pakosch, and Irina Astrova

Designing Against Bias: Identifying and Mitigating Bias in Machine Learning and AI . . . 411
David J. Corliss

Study on Grade Discrimination Method of Traffic Accident Risk Factors Considering Road Type . . . 419
Wang Jie, Zhan Zhenfei, and Qian Liuzhu

Summarization of Feedback from Residents in Urban Area Using the Unsupervised Method . . . 436
Nur Maisara Deli, Sofianita Mutalib, Mohd Fadzil Abdul Rashid, Haslizatul Fairuz Mohamed Hanum, and Shuzlina Abdul-Rahman

Generating and Updating Supervised Data Mining Models on a Periodic Basis . . . 449
Ana Duarte and Orlando Belo

Home Appliance Demand Forecasting: A Comparative Approach Using Traditional and Machine Learning Algorithms . . . 457
Lissette Culcay, Fernanda Bustillos, and Diego Vallejo-Huanga

Predictive Analysis of Bank Marketing for Financial Decision Support and Smart Economy . . . 474
Hiral Vashi, Jyoti Yadav, and Aparna S. Varde

Wireless Sensor Networks Anomaly Detection Using Machine Learning: A Survey . . . 491
Ahshanul Haque, Naseef-Ur-Rahman Chowdhury, Hamdy Soliman, Mohammad Sahinur Hossen, Tanjim Fatima, and Imtiaz Ahmed

Android Malware Detection Using Machine Learning: A Review . . . 507
Naseef-Ur-Rahman Chowdhury, Ahshanul Haque, Hamdy Soliman, Mohammad Sahinur Hossen, Tanjim Fatima, and Imtiaz Ahmed

A Data Protection Design for Online Exam Proctoring in Compliance with the Indonesian Personal Data Protection Law . . . 523
Sidik Prabowo, Maman Abdurohman, Hilal Hudan Nuha, and Sarwono Sutikno

A Method to Transform Datasets into Knowledge Graphs . . . 536
Maricela Bravo, José L. Barbosa, and Leonardo D. Sánchez-Martínez

Explainable Dynamic Ensemble Framework for Classification Based on the Late Fusion of Heterogeneous Multimodal Data . . . 555
Firuz Juraev, Shaker El-Sappagh, and Tamer Abuhmed

Towards Efficient Edge Learning with Limited Storage Resource: Bandit-Based Training Data Retrieval in AIoT . . . 571
Jing Wang, Siyuan Liu, Wenjing Liu, Zhiwei Xu, Jiaqi Zhang, and Jie Tian

EduChat: AI-Powered Chatbot with Personalized Engagement for Online Learning . . . 589
Rabiah Abdul Kadir, Mohamad Fairus Zulkifli, Sabrina Binti Tiun, Mohd Modi Lakulu, Shaidah Jusoh, and Ahmad Faridz Ahmad Faudzi

Q8KNN: A Novel 8-Bit KNN Quantization Method for Edge Computing in Smart Lighting Systems with NodeMCU . . . 598
Aji Gautama Putrada, Maman Abdurohman, Doan Perdana, and Hilal Hudan Nuha

Internet of Things Platform as a Service for Building Digital Twins and Blockchain . . . 616
Sritrusta Sukaridhoto, Agus Prayudi, Muhammad Udin Harun Al Rasyid, and Hendro Wicaksono

From Industry 4.0 Toward Agriculture 4.0 . . . 636
Bushara Ali, Anamarija Ilieva, Ahmad Zakeri, and Oliver Iliev

Sustainable Hydroponics: Embedding IoT and Efficient Energy Generation and Distribution Algorithms . . . 652
Menachem Domb, Viraj Hudli, and Sujata Joshi

Towards Programmable Context Aware Interaction with Smart Home Internet of Things . . . 663
A. S. M. M. Rahman, Mohammadhossein Momeni, and Colby Braton

IPAssess: A Protocol-Based Fingerprinting Model for Device Identification in the IoT . . . 682
Parth Ganeriwala, Siddhartha Bhattacharyya, Raja Muthalagu, Shreya Nandanwar, and Anubhav Gupta

Developing a Smart Monitoring App and Water Management System for Arid Environments Using IoT: An Innovative Approach . . . 699
Azza Mohamed and Ibrahim Ismail

Taking Natural Language Generation and Information Extraction to Domain Specific Tasks . . . 713
Sandeep Varma, Shivam Shivam, Sarun Natarajan, Snigdha Biswas, and Jahnvi Gupta

A Simple But Powerful Graph Encoder for Temporal Knowledge Graph Completion . . . 729
Zifeng Ding, Yunpu Ma, Bailan He, Jingpei Wu, Zhen Han, and Volker Tresp

A Hybrid Model for Novel Story Generation Using the Affective Reasoner and ChatGPT . . . 748
Clark Elliott

Prompt-Based Learning for Thread Structure Prediction in Cybersecurity Forums . . . 766
Kazuaki Kashihara, Kuntal Kumar Pal, Chitta Baral, and Robert P. Trevino

Part-of-Speech Tagging Accuracy for Manufacturing Process Documents and Knowledge . . . 782
Fatemeh Besharati Moghaddam, Angel J. Lopez, Stijn De Vuyst, and Sidharta Gautama

A Multi-label Few-Shot Learning with Combinations of Layers . . . 792
Bilge Sipal Sert, Cem Rifki Aydin, and Arjumand Younus

Application of Language Models for the Segmentation of Website Visitors . . . 807
Andreas Stöckl and Oliver Krauss

Identifying Research Topics in Human-Computer Interaction for Development: What Value Can Natural Language Processing Techniques Add? . . . 822
Judy van Biljon, Etienne van der Poel, and Tawanda Chiyangwa

End-to-End Aspect Extraction and Aspect-Based Sentiment Analysis Framework for Low-Resource Languages . . . 841
Georgios Aivatoglou, Alexia Fytili, Georgios Arampatzis, Dimitrios Zaikis, Nikolaos Stylianou, and Ioannis Vlahavas

Attending to Entity Class Attributes for Named Entity Recognition with Few-Shot Learning . . . 859
Raj Nath Patel, Sourav Dutta, and Haytham Assem

Author Index . . . 871

TPDNet: A Tiny Pupil Detection Neural Network for Embedded Machine Learning Processor Arm Ethos-U55

Gernot Fiala, Zhenyu Ye, and Christian Steger

Institute of Technical Informatics, Graz University of Technology, Graz, Austria
{gernot.fiala,steger}@tugraz.at
ams-OSRAM AG, INO Innovation Office Semiconductors, Premstaetten, Austria
{gernot.fiala,zhenyu.ye}@ams-osram.com

Abstract. Augmented reality and virtual reality (AR/VR) systems contain several different sensors, including image sensors for gesture recognition, head-pose tracking and pupil/eye tracking. The data of all these sensors must be processed by a host processor in real time. Future AR/VR systems require new sensing technologies to fulfill their demands in power consumption and performance. Currently, pupil detection is performed on image resolutions of around 300 × 300 pixels and above, so deep neural networks (DNN) need host platforms capable of computing DNNs with such input resolutions in real time. In this work, the image resolution for pupil detection is optimized to 100 × 100 pixels. A tiny pupil detection neural network is introduced, which can be processed with the Arm Cortex-M55 and the embedded machine learning (ML) processor Arm Ethos-U55 at 189 frames per second (FPS) with high detection rates. This makes it possible to reduce the power consumption of the communication between image sensor and host for future AR/VR devices. Keywords: Arm Ethos-U55 · Arm M55 · Augmented reality · Deep neural network · Machine learning · Neural network processor · Pupil detection · Virtual reality

1 Introduction

Pupil detection and eye tracking are widely used for different kinds of applications, such as eye-tracking-based illness detection [1,2], in-cabin sensing of cars to detect the attention of drivers, and augmented reality (AR) and virtual reality (VR) systems. In the field of AR/VR systems, smart glasses are used for teaching, industry [3,4], gaming [5,6] and simulation systems [7]. Future devices will use more and more human-machine interfaces (HMI) to interact with the environment through different sensors and displays; hence, more and more processing is required. Some of these systems allow a higher power consumption.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 1–17, 2024. https://doi.org/10.1007/978-3-031-47715-7_1


For other devices like smart glasses, however, power consumption is an important factor which needs to be optimized. Future AR/VR devices are expected to use stacked configurations for image sensors, as stated by Chiao Liu et al. in [8]. Furthermore, conventional image sensors cannot fulfill the requirements [9] of future AR/VR systems, and next-generation smart glasses will probably become the next smartphones, as stated by Chiao Liu et al. in [9]. Therefore, new sensor technologies and system designs must be developed to lower the power consumption of such AR/VR devices. One idea to lower the power consumption is to send less data from the sensor towards the next processing levels, as described by Chiao Liu et al. in [9]. For image sensors, the mobile industry processor interface (MIPI)1 is used to transfer the image sensor data to a host platform. If only region-of-interest (ROI) data or images with lower resolutions are transferred from the image sensor to the host, the power consumption of the communication can be reduced significantly by switching the MIPI interface into sleep mode, as shown by Gernot Fiala et al. in [12]. To lower the amount of data transferred to the host, the ROI of the eye or even the processed data, in the form of pupil center coordinates, can be transmitted. Both ways need some processing steps directly on the image sensor. Machine learning algorithms for pupil detection outperform standard edge-based pupil detection algorithms [13], but their processing demand is higher. In 2020, Sony developed the first intelligent image sensor with integrated processing units in a stacked-layer configuration [14,15]; MobileNet V1 can be processed directly on the sensor for object classification. Since image sensors for pupil detection or eye tracking in AR/VR systems are required to have a smaller form factor, the resources for processing units and memory are more limited due to the smaller available chip area.
Therefore, optimizations of the image processing system are required. In this paper, the focus is on optimizing pupil detection with an image resolution of 100 × 100 pixels and a neural network which can be processed on the embedded processor Arm M55 [16] in cooperation with the neural network processing unit (NPU) Arm Ethos-U55 [17], and on evaluating the pupil detection rate and execution time. The main contributions of this paper are:
– Introduction of a tiny neural network for pupil detection with an input resolution of 100 × 100 pixels, together with a training process that improves the detection rate. The trained neural network is quantized with TensorFlow Lite and can be processed in real time on the embedded processors Arm M55 and Arm Ethos-U55 at 189 FPS with high detection rates.
– An extension of an existing pupil detection dataset with images generated by an image sensor software model, with different illumination powers, sensor-specific artifacts and noise levels, for resolutions of 100 × 100 and 200 × 200 pixels.
This paper is structured as follows: Sect. 2 shows related work on pupil detection and optimization methods. Section 3 introduces the extension of the

1 https://www.mipi.org/


dataset. Section 4 describes the neural network architecture, the training process and quantization. Section 5 shows the pupil detection results and execution times and Sect. 6 concludes the paper.

2 Related Work

There is a lot of research in the area of pupil detection and eye tracking for AR/VR applications. Several years ago, standard edge-based pupil detection algorithms were introduced by Lech Swirski et al. [19], Wolfgang Fuhl et al. [22], Amir-Homayoun Javadi et al. [24] and many more. More recently, neural networks and machine learning algorithms have become more popular for pupil detection and eye tracking. With synthetic eye data, the generation of datasets with ground truth data is easier compared to a complete hardware setup. Some of the 3D eye models use Blender [10,20,21,27], Unity3D [25,26] or even the Unreal Engine to render images with ground truth data containing either eye segmentation, pupil center coordinates, gaze vectors or other eye information. Joohwan Kim et al. highlight in [13] that machine learning algorithms outperform standard edge-based pupil detection algorithms, but these deep neural networks (DNN) use different kernel sizes for their convolutions and require more processing power than standard edge-based algorithms; their evaluation is mostly done with general-purpose processors [22,23,28] or graphics processing units (GPUs) [13]. With these hardware platforms, high frame rates can be achieved, but they are not suitable for small AR/VR devices like future smart glasses. There is also research on improving the execution time of neural networks, as described by Wolfgang Fuhl et al. in [23]. They split the input image into sections and use two neural networks to find the pupil center position: one for a rough estimation and one for a more accurate estimation. It can be processed in 7 ms on a single core and in 2 ms with multiple cores of an Intel i5-4570. Another algorithm was developed by Thiago Santini et al., described in [28]. With an Intel i5-4590, 120 FPS were reached.
Since the demand for smart or intelligent image sensors is growing, new methods are required to optimize the processing of such neural networks. The best-known framework nowadays is TensorFlow [29] with TensorFlow Lite [30], which allows a quantization of 32 bit float weights to 8 bit integer values and thereby lowers the memory footprint of the whole neural network model. Another framework is Apache's TVM [31], a machine learning compiler that optimizes the code for a specific hardware architecture. Lately, research into RISC-V processors and code optimization with instruction set extensions is growing. Such extensions can be single instruction multiple data (SIMD) instructions or even hardware (HW) accelerators for speeding up the processing of convolutions and other neural network related operations. Conventional eye tracking hardware mainly uses embedded GPUs to process the image frames from the image sensors, like the Qualcomm Snapdragon [32,33]. These can process more than 60 frames per second (FPS) at higher resolutions. To make AR/VR devices even smaller, or even go to smart glasses,


the focus is more on embedded NPUs and image signal processors (ISP), and the research to develop them is growing. There are new products on the commercial market which are optimized with vector instructions and HW accelerators for neural network processing, like Arm M55 [16], Arm Ethos-U55 [17] or Quadric Chimera [34]. In this work, pupil detection is optimized for an image resolution of 100 × 100 pixels. A tiny neural network for pupil detection was developed, trained and tested with an extended pupil detection dataset with different illumination powers and noise levels. The trained neural network was quantized with TensorFlow Lite and processed with the embedded processors Arm M55 and Arm Ethos-U55. High frame rates of more than 90 FPS can be achieved by using images with 100 × 100 pixel resolution, with the NPU accelerating the processing of the neural network. This shows that smaller embedded hardware can be used to process neural networks with good performance, and that an image resolution of 100 × 100 pixels is sufficient for pupil detection with high detection rates.

3 Pupil Detection Dataset

For training the neural networks, large eye datasets are required. Therefore, we used the modified 3D eye model from [10], which is based on [20,21], to render a dataset of 11000 images for each of the resolutions 100 × 100, 200 × 200 and 500 × 500 pixels, with the pupil center positions as ground truth information. Furthermore, an image sensor model (ISM) was used to generate additional images. The image sensor model contains proprietary information and processes; therefore, this paper does not explain the internals of the ISM. However, the image sensor model can be tuned with different illumination powers and different resolutions for the output image, and the generated output images contain camera artifacts and noise. As input for the ISM, the rendered 500 × 500 pixel dataset was used to generate output images with resolutions of 100 × 100 and 200 × 200 pixels, as shown in Fig. 1. The illumination power was tuned with values of 1, 1/4, 1/16 and 1/64; that means the illumination power was reduced each time to one quarter of the previous illumination power. Due to the output behaviour of the ISM, the images generated with an illumination power of 1/16 look brighter compared to images generated with an illumination power of 1/64. In total, the rendered eye dataset was expanded by an additional 44000 images with the previously mentioned illumination powers, to a total of 55000 images for each of the resolutions 100 × 100 and 200 × 200 pixels. These datasets were used to train the neural network. Example images of the dataset are shown in Fig. 2. On the left are the rendered images and on the right the images generated by the ISM with the different illumination powers of 1, 1/4, 1/16 and 1/64. The images with higher illumination power are darker compared to the rendered images, and lower illumination power increases the noise level. For an illumination power of 1/16, the images are brighter and noisy due to a specific bit readout strategy.
A further reduction of the illumination power adds even more noise to the images.
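The illumination sweep and the resulting dataset sizes follow directly from the numbers above; a quick stdlib sketch (the constant and variable names are ours, not from the paper):

```python
# Illumination sweep used for the image sensor model (ISM): each step is
# one quarter of the previous power, i.e. 1, 1/4, 1/16, 1/64.
ILLUMINATION_POWERS = [1 / 4**k for k in range(4)]

RENDERED_PER_RESOLUTION = 11000  # rendered 3D eye model images per resolution

# One ISM pass per illumination power expands the dataset.
ism_images = len(ILLUMINATION_POWERS) * RENDERED_PER_RESOLUTION
total_images = RENDERED_PER_RESOLUTION + ism_images

print(ILLUMINATION_POWERS)       # [1.0, 0.25, 0.0625, 0.015625]
print(ism_images, total_images)  # 44000 55000
```

This reproduces the 44000 additional ISM images and the 55000-image total per resolution stated in the text.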


Fig. 1. Generation of images with image sensor software model.

Fig. 2. Example images of the extended dataset. Left is the rendered image, then images with illumination powers of 1, 1/4, 1/16 and 1/64, from left to right.

As separate test data, we used the small dataset generated by Gernot Fiala et al., introduced in [11]. There, two different image sensor models were used to tune the pixel pitch value, which influences the brightness of the output images. We again used the 500 × 500 resolution of the rendered images as input for our image sensor model and generated test images with the different illumination powers, in addition to the test images provided in [11]. A detailed overview of the entire dataset with its subsets is shown in Table 1.

4 Tiny Pupil Detection Neural Network

Machine learning with neural networks for pupil detection is state-of-the-art and neural networks outperform standard edge-based computer vision (CV) algorithms. Therefore, the focus is on the development of a tiny pupil detection neural network, which can be processed on Arm M55 and Arm Ethos-U55.


Table 1. Overview of the pupil detection dataset for training and testing, for each resolution of 100 × 100 and 200 × 200 pixels, with subsets.

Train dataset           Test dataset from [11]   Extended test dataset
11000 × rendered        130 × rendered           130 × illPower 1
11000 × illPower 1      130 × ISM1 Set1          130 × illPower 1/4
11000 × illPower 1/4    130 × ISM1 Set2          130 × illPower 1/16
11000 × illPower 1/16   130 × ISM1 Set3          130 × illPower 1/64
11000 × illPower 1/64   130 × ISM2 Set1
                        130 × ISM2 Set2
                        130 × ISM2 Set3

4.1 Architecture of Neural Network

Since the input image resolution impacts the processing time of the layers and the number of parameters of the neural network, we started with an input resolution of 100 × 100 pixels. The architecture of the model (model 1) is shown in Table 2. We used Python 3.8.10 and TensorFlow version 2.10.0. All convolution layers have a kernel size of 3, except for the last convolution layer, which has a kernel size of 1. The max pooling layers have size 2 and are used after the first, third and fourth convolution layers. The activation function is 'relu' for all convolution layers, and the first layer uses a random uniform kernel initializer. In total, this model has 363758 parameters. The parameter number is important due to the resource requirements for processing on the Arm M55 and Arm Ethos-U55. We also slightly varied some layers to see the change in performance and number of parameters. The changes described below were performed on model 1:
• model 2: The layers conv 3 and conv 4 were changed. The numbers of kernels were increased from 24 to 32 and from 64 to 96. The number of parameters increased to 452502.
• model 3: The max pooling layers were removed. For conv 1 and conv 3, a stride of 2 was used, and the kernel size of conv 5 was changed from 3 to 1. This model has 618734 parameters.
• model 4: The kernel size of the first two convolution layers was changed to 5. The number of kernels of conv 5 was changed from 256 to 128, and after this convolution layer an additional max pooling layer with size 2 was added. The kernel size of conv 6 was changed from 1 to 3. This model has 689934 parameters.
In addition to our own models, we used the DNN for pupil detection from Nvidia, described by Joohwan Kim in [13]. To process it for a 200 × 200 input resolution, the stride of the first convolution layer had to be removed. The architecture is shown in Table 3.
For a comparison of the models with different input resolutions, the pixel error between the ground truth values and the predicted


pupil locations were normalized to an image resolution of 500 × 500 pixels and the pupil detection rate was calculated.

Table 2. Architecture of the tiny neural network for pupil detection (model 1).

Layer        Kernel size   Output size
data                       100 × 100 × 1
conv 1       3             100 × 100 × 10
max pool 1   2             50 × 50 × 10
conv 2       3             48 × 48 × 16
conv 3       3             46 × 46 × 24
max pool 2   2             23 × 23 × 24
conv 4       3             21 × 21 × 64
max pool 3   2             10 × 10 × 64
conv 5       3             8 × 8 × 256
conv 6       1             8 × 8 × 512
flatten                    32768
dense                      2
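As a sanity check, the 363758-parameter figure for model 1 can be reproduced from Table 2 without TensorFlow. The sketch below is ours (layer tuples and helper name are not from the paper); it assumes a bias per filter in every convolution and in the dense layer, which matches the stated total:

```python
# (in_channels, out_channels, kernel_size) for each convolution of model 1;
# the dense layer maps the flattened 8 x 8 x 512 tensor to the (x, y) output.
convs = [
    (1, 10, 3),     # conv 1
    (10, 16, 3),    # conv 2
    (16, 24, 3),    # conv 3
    (24, 64, 3),    # conv 4
    (64, 256, 3),   # conv 5
    (256, 512, 1),  # conv 6
]

def conv_params(cin, cout, k):
    # k*k*cin weights per filter, plus one bias per filter
    return (k * k * cin + 1) * cout

total = sum(conv_params(*c) for c in convs)
total += (8 * 8 * 512 + 1) * 2  # dense layer: 32768 inputs + bias, 2 outputs

print(total)  # 363758, matching the parameter count stated in the text
```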

Table 3. Architecture of the Nvidia pupil detection DNN with 200 × 200 input.

Layer     Kernel size   Stride   Output size
data                             200 × 200 × 1
conv 1    9                      200 × 200 × 24
conv 2    7             2        97 × 97 × 36
conv 3    5             2        47 × 47 × 52
conv 4    5             2        22 × 22 × 80
conv 5    3             2        10 × 10 × 124
conv 6    3             2        4 × 4 × 256
conv 7    3             2        1 × 1 × 512
flatten                          512
dense                            2

4.2 Training Process

For the training of the DNNs, the training dataset was used. In total, 55000 images per resolution are available. The dataset was split into 80% train data and 20% test data. From the train data, another 10% were used as a validation


dataset. Since we require a very high accuracy of the pupil center prediction, iterative training was used. An overview of this process is shown in Fig. 3. First, the DNNs are trained with a learning rate of 0.0005 and a batch size of 32. The model is trained, saved and tested. Then, for the next iteration, the already trained model is loaded, retrained on the same dataset, and again saved and tested. We used 10 iterations to train the model, and after half of the iterations, the learning rate was reduced to 0.0001. This allows the model to predict the pupil center position with higher accuracy. For all predictions, the pixel error is calculated and normalized to an image resolution of 500 × 500 pixels. Then, the pupil detection rate is calculated, which states the fraction of correctly detected pupil center positions for a given pixel error. Typically, a pixel error of up to 5 is considered a correct detection. This allows a good comparison between the models with different input resolutions. Due to the normalization to a resolution of 500 × 500 pixels, the models must be very accurate in their predictions, because at a 100 × 100 pixel resolution, only a pixel error of at most 1 would count as a correct detection. The Nvidia model was trained with the 200 × 200 dataset, and its results were also normalized to a resolution of 500 × 500 pixels.
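The normalization and detection-rate computation described above can be sketched in a few lines (a minimal illustration; the function name and example coordinates are ours, using the 500 × 500 normalization and 5-pixel threshold from the text):

```python
import math

def detection_rate(predictions, ground_truth, resolution, max_error=5.0):
    """Fraction of predictions whose pixel error, normalized to a
    500 x 500 image, is at most `max_error` (5 px in the text)."""
    scale = 500.0 / resolution  # normalize errors to 500 x 500 pixels
    hits = 0
    for (px, py), (gx, gy) in zip(predictions, ground_truth):
        error = math.hypot(px - gx, py - gy) * scale
        if error <= max_error:
            hits += 1
    return hits / len(predictions)

# At 100 x 100 input resolution, a 1 px error maps to 5 px after
# normalization, so only sub-pixel-accurate predictions count as correct.
preds = [(50.0, 50.0), (52.0, 50.0)]   # second prediction is 2 px off
truth = [(50.0, 50.0), (50.0, 50.0)]
print(detection_rate(preds, truth, resolution=100))  # 0.5
```

The same 2-pixel error would still count as correct at 500 × 500 input resolution, which is why the normalization makes low-resolution models look strict.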

Fig. 3. Overview of the iterative training process for the DNNs.

4.3

Quantization of Neural Network and Simulation

After the training, all models were tested on the additional test set and on the extended test set with the different illumination powers shown in Table 1. Afterwards, the models were quantized with the TensorFlow Lite converter from 32 bit float weights to 8 bit integer weights. The quantization process slightly decreases the pupil detection rate. The workflow of the model quantization and simulation on Arm M55 and Arm Ethos-U55 is shown in Fig. 4. The trained model with the 32 bit float weights


is converted into the 8 bit integer model. Representative data are used to create a binary data file. These files are used together with the quantized model as input for the Arm Vela compiler, which generates two header files, one for the data and one for the model. Both header files are used in the Arm Development Studio (Arm-DS)2, version 2021.1. To simulate the Arm M55 and Arm Ethos-U55, the Corstone SSE-300 [18] reference package was used. The Arm-DS project was built and the run-time was simulated with the fast cycle simulator, which counts the cycles for processing the DNNs on Arm M55 and Arm Ethos-U55. With a given frequency, the processing time can be calculated. In this work, the run-time was calculated based on a frequency of 100 MHz. The results are presented in the next section.
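The TensorFlow Lite part of this workflow can be sketched as below. The affine mapping in `to_int8` is the standard int8 quantization formula; the function names, `rep_images` and the file paths are our placeholders, not the authors' pipeline:

```python
def to_int8(x, scale, zero_point):
    """Standard affine quantization: map a float value to an int8 code."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def quantize_model(saved_model_dir, rep_images, out_path="model_int8.tflite"):
    # TensorFlow imported lazily; sketch of the TFLite full-integer converter flow.
    import tensorflow as tf

    def rep_data():
        # Representative data drive the calibration of activation ranges.
        for img in rep_images:
            yield [img]

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = rep_data
    # Force full-integer (8 bit) weights and activations for the Ethos-U55.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    with open(out_path, "wb") as f:
        f.write(converter.convert())
```

The resulting `.tflite` file is what the Vela compiler consumes in the workflow above.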

Fig. 4. Overview of model quantization with TensorFlow Lite and simulation on Arm M55 and Arm Ethos-U55.

5 Results

This section shows the results of the tiny pupil detection neural networks and compares them with each other. Furthermore, a detailed analysis of the detection rates is done based on the separate test sets introduced in Table 1.

2 https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio.

5.1 Detection Rate of Tiny Pupil Detection Neural Network

Each of the neural networks was tested separately on the different test sets. A detailed overview of the detection rates for a pixel error of 5 is shown in Table 4, and for the quantized models in Table 5. All detection rates, as already mentioned, are normalized to a resolution of 500 × 500 pixels; for a 100 × 100 image resolution this corresponds to a pixel error of 1. As shown in Tables 4 and 5, all neural network models achieve similar normalized detection rates. However, there are two test sets where some models perform slightly worse than others. For model 1 and model 3, the normalized detection rate is lower for the test dataset ISM1 Set1 from [11]. This test set contains bright images that are slightly overexposed. The normalized detection rate also drops a little for model 4 and the Nvidia model on the test data with the illumination power set to 1/64. This low illumination power generates a lot of noise, shown in Fig. 2 on the right side. The same behavior can be seen with the quantized models: in some cases the detection rate drops, but it can also be slightly higher due to the quantization step. The overall performance of all models is very good, and all of the models are used for the run-time evaluation. The edge-based CV algorithms Swirski [19] and ElSe [22] perform worse than the neural networks, as shown in Table 4. The highest detection rates of Swirski and ElSe are achieved on the rendered test dataset, with nearly 20% and 51.5%, respectively. For the other test sets, the detection rate drops. The worst detection rate is observed for the test set with the illumination power set to 1/64, at lower than 5%; the noise influences the edge detection steps of the algorithms. Examples of a graphical comparison between the normalized pupil detection rates of the algorithms from Table 4 are shown in Fig. 5 for the test

Table 4. Normalized pupil detection rates in % of the DNNs for the different test sets at a pixel error of 5.
| Test dataset        | model 1 | model 2 | model 3 | model 4 | Nvidia [13] | Swirski [19] | ElSe [22] |
|---------------------|---------|---------|---------|---------|-------------|--------------|-----------|
| 130 × rendered      | 92.3    | 92.3    | 93.0    | 92.3    | 92.3        | 19.6         | 51.5      |
| 130 × ISM1 Set1     | 87.6    | 91.5    | 82.3    | 94.6    | 92.3        | 16.6         | 38.4      |
| 130 × ISM1 Set2     | 91.5    | 91.5    | 92.3    | 95.3    | 92.3        | 16.8         | 40.0      |
| 130 × ISM1 Set3     | 91.5    | 91.5    | 91.5    | 92.3    | 92.3        | 16.6         | 43.0      |
| 130 × ISM2 Set1     | 90.7    | 91.5    | 91.5    | 95.3    | 92.3        | 16.6         | 43.0      |
| 130 × ISM2 Set2     | 91.5    | 92.3    | 91.5    | 93.8    | 92.3        | 15.7         | 44.6      |
| 130 × ISM2 Set3     | 92.3    | 91.5    | 92.3    | 91.5    | 92.3        | 17.6         | 47.6      |
| 130 × illPower 1/64 | 88.4    | 91.5    | 90.0    | 86.1    | 89.2        | 0.5          | 3.0       |
| 130 × illPower 1/16 | 92.3    | 91.5    | 90.7    | 92.3    | 91.5        | 15.0         | 32.3      |
| 130 × illPower 1/4  | 92.3    | 91.5    | 92.3    | 92.3    | 92.3        | 17.9         | 33.8      |
| 130 × illPower 1    | 92.3    | 89.2    | 91.5    | 91.5    | 92.3        | 16.4         | 40.0      |
| 130 overall         | 91.1    | 91.4    | 90.8    | 92.5    | 91.9        | 15.39        | 37.9      |
| 20% of trainset     | 99.0    | 99.4    | 94.8    | 97.2    | 99.6        | n.a.         | n.a.      |


Table 5. Normalized pupil detection rates in % of the quantized DNNs for the different test sets at a pixel error of 5. Weights converted to 8 bit integer values.

| Test dataset        | model 1 | model 2 | model 3 | model 4 | Nvidia [13] |
|---------------------|---------|---------|---------|---------|-------------|
| 130 × rendered      | 92.3    | 92.3    | 93.0    | 91.5    | 92.3        |
| 130 × ISM1 Set1     | 86.9    | 91.5    | 82.3    | 93.0    | 91.5        |
| 130 × ISM1 Set2     | 92.3    | 92.3    | 91.5    | 94.6    | 92.3        |
| 130 × ISM1 Set3     | 90.7    | 90.7    | 91.5    | 92.3    | 92.3        |
| 130 × ISM2 Set1     | 90.7    | 91.5    | 91.5    | 95.3    | 92.3        |
| 130 × ISM2 Set2     | 92.3    | 92.3    | 91.5    | 93.0    | 92.3        |
| 130 × ISM2 Set3     | 90.7    | 90.7    | 90.7    | 92.3    | 92.3        |
| 130 × illPower 1/64 | 90.0    | 90.7    | 90.7    | 84.6    | 90.0        |
| 130 × illPower 1/16 | 92.3    | 91.5    | 92.3    | 92.3    | 91.5        |
| 130 × illPower 1/4  | 92.3    | 91.5    | 92.3    | 92.3    | 92.3        |
| 130 × illPower 1    | 91.5    | 89.2    | 91.5    | 91.5    | 92.3        |
| 130 overall         | 91.1    | 91.2    | 90.7    | 92.5    | 92.0        |
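The normalization used throughout these results (pixel errors scaled to a 500 × 500 reference resolution, correct-detection threshold of 5 pixels) can be sketched as follows; the function names are ours, not the paper's:

```python
def normalized_error(error_px, resolution):
    """Scale a pixel error measured at `resolution` (square images assumed)
    to the 500 x 500 reference resolution."""
    return error_px * 500.0 / resolution

def detection_rate(errors_px, resolution, threshold=5.0):
    """Percentage of predictions whose normalized error is within the threshold."""
    hits = sum(1 for e in errors_px if normalized_error(e, resolution) <= threshold)
    return 100.0 * hits / len(errors_px)
```

At a 100 × 100 input, an error of 1 pixel normalizes to 5 and still counts as correct, matching the statement in the text.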

Fig. 5. Comparison of the normalized pupil detection rate on the test dataset with illumination power set to 1 for the edge-based algorithms Swirski [19] and ElSe [22], the Nvidia neural network [13] and our own neural networks model 1, model 2, model 3 and model 4.

dataset with the illumination power set to 1 and in Fig. 6 for the rendered test dataset. All of the neural networks have a detection rate of around 90%. Representative results of the pupil center detection of the TensorFlow Lite converted model 1 are shown in Fig. 7. The white dots represent the ground


Fig. 6. Comparison of the normalized pupil detection rate on the rendered test dataset for the edge-based algorithms Swirski [19] and ElSe [22], the Nvidia neural network [13] and our own neural networks model 1, model 2, model 3 and model 4.

truth value and the gray X is the predicted pupil center position. The shown images include different illumination powers, noise levels and brightness values.

5.2 Run-Time of Quantized Tiny Pupil Detection Neural Network

This section shows the run-time/cycle count simulation results of the models on Arm M55 and Arm Ethos-U55, simulated with the Arm-DS. The cycle counts with the associated frame rates are shown in Table 6. Model 1 is the smallest model and therefore the fastest to process, at 189 FPS. Nearly all processing is done on the Arm Ethos-U55. This holds for all other models except model 3. Model 3 has fewer parameters than model 4, but it takes much longer to process. This is due to the big flatten layer of model 3 with 225792 parameters, which is too large for the U55 internal memory. Therefore, the M55 takes over the processing. The advantages of the NPU are not fully used in this case, and the frame rate is much lower compared to the similarly sized model 4; the difference is 48 FPS. The M55 in this case needs more than 374000 cycles. All of these models have an input resolution of 100 × 100 pixels. The Nvidia model, with an input resolution of 200 × 200 pixels, takes almost 11 million cycles on Arm Ethos-U55. Therefore, this model can only be processed at 9 FPS and is the slowest. The original Nvidia model uses an input resolution of 293 × 293 pixels and was processed on a Titan V and a Jetson TX2, as shown in [13]. A comparison of the run-time with other algorithms is shown in Table 7. The execution times for the Arm M55 and Arm Ethos-U55 are based on a frequency


Fig. 7. Example images with the predicted pupil center position with the quantized model 1 for different test subsets. The ground truth is highlighted with a white dot and the prediction is a gray X.

Table 6. Arm M55 and Arm Ethos-U55 simulation results of the different pupil detection neural networks. The used frequency is 100 MHz.

|                 | model 1 | model 2 | model 3 | model 4 | Nvidia [13] |
|-----------------|---------|---------|---------|---------|-------------|
| Parameters      | 363758  | 452502  | 618734  | 689934  | 1751814     |
| Cycle count M55 | 205     | 205     | 374242  | 205     | 206         |
| Cycle count U55 | 527073  | 623073  | 712073  | 713073  | 10846073    |
| Active cycles   | 526474  | 622604  | 711365  | 712170  | 10845500    |
| Idle cycles     | 599     | 469     | 708     | 903     | 573         |
| Total cycles    | 527278  | 623278  | 1086315 | 713278  | 10846279    |
| Exec. time [ms] | 5.273   | 6.233   | 10.863  | 7.133   | 108.463     |
| FPS             | 189.64  | 160.43  | 92.05   | 140.19  | 9.21        |
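The conversion from the cycle counts in Table 6 to execution time and frame rate is straightforward; a small sketch (function names are ours):

```python
CLOCK_HZ = 100e6  # 100 MHz, as used in the simulation

def exec_time_ms(total_cycles, clock_hz=CLOCK_HZ):
    """Execution time in milliseconds for a given total cycle count."""
    return total_cycles / clock_hz * 1e3

def fps(total_cycles, clock_hz=CLOCK_HZ):
    """Frames per second, assuming one inference per frame."""
    return clock_hz / total_cycles

# model 1: 527278 total cycles -> about 5.273 ms, about 189.6 FPS
```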


of 100 MHz. The values for the other algorithms are taken from the associated papers, which used different evaluation hardware. With 90–189 FPS, our models are in a good range for pupil detection even though they are processed on embedded hardware.

Table 7. Arm M55 and Arm Ethos-U55 simulation results with a used frequency of 100 MHz, compared with execution times of other algorithms on different hardware platforms.

| Model              | HW            | Exec. time [ms] | FPS  |
|--------------------|---------------|-----------------|------|
| model 1            | Arm M55/U55   | 5.273           | 189  |
| model 2            | Arm M55/U55   | 6.233           | 160  |
| model 3            | Arm M55/U55   | 10.863          | 92   |
| model 4            | Arm M55/U55   | 7.133           | 140  |
| Nvidia [13]        | Arm M55/U55   | 108.463         | 9    |
| Nvidia [13]        | Titan V       | 0.914           | 1694 |
| Nvidia [13]        | Jetson TX2    | 3.781           | 264  |
| ElSe [22]          | Intel i5-4570 | 7               | 142  |
| PupilNet v2.0 [23] | Intel i5-4570 | 2               | 500  |
| PuRe [28]          | Intel i5-4590 | 8.333           | 120  |

6 Conclusion and Future Work

This paper introduces a tiny pupil detection neural network, which can be processed on the embedded processors Arm M55 and Arm Ethos-U55 at 189 FPS with a normalized pupil detection rate of around 90%. This was achieved by training the neural networks iteratively and lowering the learning rate during the iterations. The hardware capability of Arm Ethos-U55 limits the number of parameters that can be used in the flatten layer. If the flatten layer has more than 65000 parameters, the Arm Ethos-U55 hands the processing back to the Arm M55 due to memory limitations. Therefore, the advantages of the NPU cannot be fully used, which significantly increases the processing time. Furthermore, an existing dataset was extended with images generated by an image sensor model. For that, different illumination powers were simulated, which adds noise to the images and also changes the brightness. Using the proposed tiny pupil detection neural networks, which achieve both good detection rates and short processing times, future AR/VR image sensors can make use of in-sensor processing to lower the power consumption of the communication between the image sensor and the host.


Future work is to extend the dataset with more challenging eye images (occlusion, reflections). Additionally, the neural networks should be optimized with Apache TVM and their performance evaluated on a RISC-V processor with instruction set extensions or HW accelerators.

Acknowledgment. A big thank you goes to my colleagues and ams-OSRAM AG, Premstaetten, Austria, who supported this research.

References

1. Akter, T., Ali, M.H., Khan, M.I., Satu, M.S., Moni, M.A.: Machine learning model to predict autism investigating eye-tracking dataset. In: 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, pp. 383–387 (2021). https://doi.org/10.1109/ICREST51555.2021.9331152
2. Mengoudi, K., Ravi, D., Yong, K.X.X., Primativo, S., Pavisic, I.M., Brotherhood, E., Lu, K., Schott, J.M., Crutch, S.J., Alexander, D.C.: Augmenting dementia cognitive assessment with instruction-less eye-tracking tests. IEEE J. Biomed. Health Inform. 24(11), 3066–3075 (2020). https://doi.org/10.1109/JBHI.2020.3004686
3. Angrisani, L.: A wearable brain-computer interface instrument for augmented reality-based inspection in industry 4.0 (2020)
4. Katkuri, P.K., Mantri, A., Anireddy, S.: Innovations in tourism industry & development using Augmented Reality (AR), Virtual Reality (VR). In: TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON), Kochi, India, pp. 2578–2581 (2019). https://doi.org/10.1109/TENCON.2019.8929478
5. Microsoft: HoloLens 2 (2022). https://www.microsoft.com/en-us/hololens
6. Meta: Meta Quest (2022). https://www.meta.com/at/en/quest/
7. FLAIM Solutions: FLAIM. https://flaimsystems.com/
8. Liu, C., Hall, M., De Nardi, R., Trail, N., Newcombe, R.: Sensors for future VR applications. In: 2017 International Image Sensor Workshop (IISW), pp. 250–253 (2017)
9. Liu, C., Berkovich, A., Chen, S., Reyserhove, H., Sarwar, S.S., Tsai, T.-H.: Intelligent vision systems - bringing human-machine interface to AR/VR. In: 2019 IEEE International Electron Devices Meeting (IEDM), pp. 10.5.1–10.5.4 (2019). https://doi.org/10.1109/IEDM19573.2019.8993566
10. Fiala, G., Ye, Z., Steger, C.: Pupil detection for augmented and virtual reality based on images with reduced bit depths. In: 2022 IEEE Sensors Applications Symposium (SAS), pp. 1–5 (2022). https://doi.org/10.1109/SAS54819.2022.9881378
11. Fiala, G., Ye, Z., Steger, C.: Framework for image sensor design parameter optimization for pupil detection. In: 2022 8th International Conference on Systems and Informatics (ICSAI), Kunming, China, pp. 1–6 (2022). https://doi.org/10.1109/ICSAI57119.2022.10005532
12. Fiala, G., Loinig, J., Steger, C.: Impact of image sensor output data on power consumption of the image processing system. In: Arai, K. (ed.) Intelligent Systems and Applications, IntelliSys 2022. Lecture Notes in Networks and Systems, vol. 542, pp. 618–636. Springer, Berlin (2023). https://doi.org/10.1007/978-3-031-16072-1_45
13. Kim, J., Stengel, M., Majercik, A., De Mello, S., Dunn, D., Laine, S., McGuire, M., Luebke, D.: NVGaze: an anatomically-informed dataset for low-latency, near-eye gaze estimation. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2019). https://doi.org/10.1145/3290605.3300780
14. Sony Corporation: Sony to release world's first intelligent vision sensors with AI processing functionality, 14 May 2020. https://www.sony.com/en/SonyInfo/News/Press/202005/20-037E/. Accessed 4 Jan 2023
15. Sony Group Corporation: Sony's latest image sensors and the technologies that lie behind them, 15 Oct 2020. https://www.sony.com/en/SonyInfo/technology/stories/imagesensor7tech/. Accessed 4 Jan 2023
16. Arm: Arm Cortex-M55. https://www.arm.com/products/silicon-ip-cpu/cortex-m/cortex-m55
17. Arm: Arm Ethos-U55. https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55
18. Arm: Arm Corstone-300. https://developer.arm.com/Processors/Corstone-300
19. Swirski, L., Bulling, A., Dodgson, N.: Robust real-time pupil tracking in highly off-axis images. In: Proceedings of the Eye Tracking Research and Applications Symposium (ETRA), pp. 173–176 (2012). https://doi.org/10.1145/2168556.2168585
20. Swirski, L., Dodgson, N.: Rendering synthetic ground truth images for eye tracker evaluation. In: Proceedings of the Eye Tracking Research and Applications Symposium (ETRA), pp. 219–222 (2014). https://doi.org/10.1145/2578153.2578188
21. Swirski, L.: Eyemodel. https://github.com/LeszekSwirski/eyemodel
22. Fuhl, W., Santini, T., Kübler, T., Kasneci, E.: ElSe: ellipse selection for robust pupil detection in real-world environments. In: Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications (ETRA), pp. 123–130 (2016). https://doi.org/10.1145/2857491.2857505
23. Fuhl, W., Santini, T., Kasneci, G., Rosenstiel, W., Kasneci, E.: PupilNet v2.0: convolutional neural networks for CPU based real time robust pupil detection (2017). https://doi.org/10.48550/arXiv.1601.04902
24. Javadi, A.-H., Hakimi, Z., Barati, M., Walsh, V., Tcheang, L.: SET: a pupil detection method using sinusoidal approximation. Front. Neuroeng. (2015). https://doi.org/10.3389/fneng.2015.00004
25. Wood, E., Baltrušaitis, T., Zhang, X., Sugano, Y., Robinson, P., Bulling, A.: Rendering of eyes for eye-shape registration and gaze estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015) (2015)
26. Wood, E., Baltrušaitis, T., Morency, L.-P., Robinson, P., Bulling, A.: Learning an appearance-based gaze estimator from one million synthesised images. In: Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications, pp. 131–138 (2016)
27. Porta, S., Bossavit, B., Cabeza, R., Larumbe-Bergera, A., Garde, G., Villanueva, A.: U2Eyes: a binocular dataset for eye tracking and gaze estimation. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea (South), pp. 3660–3664 (2019). https://doi.org/10.1109/ICCVW.2019.00451
28. Santini, T., Fuhl, W., Kasneci, E.: PuRe: robust pupil detection for real-time pervasive eye tracking. J. Comput. Vis. Image Underst. 170, 40–50 (2018). https://doi.org/10.1016/j.cviu.2018.02.002
29. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). Software available from https://www.tensorflow.org/
30. Google Brain Team: TensorFlow Lite. https://www.tensorflow.org/lite
31. Apache Software Foundation: Apache TVM. https://tvm.apache.org/download
32. Qualcomm: Qualcomm Snapdragon. https://www.qualcomm.com/snapdragon
33. Qualcomm: Qualcomm Snapdragon Wear. https://www.qualcomm.com/products/application/wearables/snapdragon-wear-3100-platform
34. Quadric: Quadric Chimera. https://quadric.io/products/

Low Cost Machine Vision for Insect Classification

Danja Brandt1(B), Martin Tschaikner1, Teodor Chiaburu1, Henning Schmidt1, Ilona Schrimpf2, Alexandra Stadel2, Ingeborg E. Beckers1, and Frank Haußer1

1 Berliner Hochschule fuer Technik, Berlin, Germany
[email protected]
2 UBZ Listhof e.V., Reutlingen, Germany
http://bht-berlin.de, http://www.listhof-reutlingen.de

Abstract. Preserving the number and diversity of insects is one of our society's most important goals in the area of environmental sustainability. A prerequisite for this is systematic and up-scaled monitoring in order to detect correlations and identify countermeasures. Automated monitoring using live traps is therefore important, but so far there is no system that provides image data with sufficiently detailed information for entomological classification. In this work, we present an imaging method as part of a multisensor system developed as a low-cost, scalable, open-source system that is adaptable to classical trap types. The image quality meets the requirements needed for classification in the taxonomic tree. To this end, illumination and resolution have been optimized and motion artefacts have been suppressed. The system is evaluated exemplarily on a dataset consisting of 16 insect species of the same as well as different genus, family and order. We demonstrate that standard CNN architectures like ResNet50 (pretrained on iNaturalist data) or MobileNet perform very well for the prediction task after re-training. Smaller custom-made CNNs also lead to promising results. A classification accuracy of >96% has been achieved. Moreover, it was shown that image cropping of insects is necessary for the classification of species with high inter-class similarity.

Keywords: Computer vision · Image classification · Insect monitoring · Biodiversity

1 Introduction

With over 1 million species worldwide, insects are the largest group of animals on our planet. Many ecological and economic interactions of insects with their environment are still unexplored. The importance of insects for humans is becoming increasingly clear. One example is their pollination service, which is estimated at 3.8 billion € for Germany alone [19]. All over the world the number of insect species and individuals is declining sharply [30]. In Germany, the Krefeld study reports a 75% decrease in insect biomass from 1989 to 2015 [11]. Both in the

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 18–34, 2024. https://doi.org/10.1007/978-3-031-47715-7_2


short term and in the long term, knowledge about the occurrence, distribution and population trends of insects in correlation with environmental influences is of great importance. Still, most classical monitoring systems consist of dead traps: biomass is determined and only a few species are identified. In order to avoid blind spots in monitoring and to multiply expert knowledge, research in the field of automated classification systems is becoming more and more important. In this context, three partly overlapping use cases can be identified:

1. Mobile applications (Picture Insect, Insect Identifier, a.o.) and open source citizen science applications, which contribute to raising public awareness through gamification approaches. Large amounts of data are generated this way. In particular, the iNaturalist platform serves as a global database and has had an enormous impact on scientific research questions [34]. It also serves as a networking tool. Large data sets of images are labeled by the community in a collaborative manner (whenever more than two thirds of the annotators agree on the label, the sample is awarded a Research Grade as a seal of quality). However, these randomly collected cell phone or camera images cannot properly contribute to scientific monitoring, which rather requires standardized data collected systematically over a period of time at fixed locations.

2. Monitoring for pest control and identification of beneficial insects in agriculture [18]. In this use case, multilabel classification machine learning algorithms are used to identify certain insect species which are of special interest [17,39].

3. Biodiversity monitoring [35]. Systematic monitoring of insect diversity requires suitable sensor systems and multilabel classification algorithms for fine-grained images or multisensor data. Moreover, the data sets are highly unbalanced due to rare insect species.

Automated monitoring uses different technologies for insect species detection and differentiation [35]. The data acquisition for monitoring is mainly based on wingbeat frequency measurements and image data. For wingbeat frequency acquisition alone, a wide variety of different methods is used, including acoustic sensors [21], radar [25,38], multispectral analysis of reflected light [26,29], as well as capacitance change [16] and optoacoustic sensors [5,6,9,22,36]. There are also different approaches for imaging. One requirement is that insects must be in relative rest for a camera to take sharp images. The studies [4,37] use images from museums and other collections of dead insects for training of automated systems. The works [10,23,31] use yellow traps on which the insects stick. For monitoring living insects, yellow fields [20] are used, suitable for pollinators, or light traps [13,15,40], suitable for nocturnal insects. A recent survey of machine learning methods for insect classification is given in [32]. We developed an automated low cost open source insect multisensor monitoring system where insects are counted and classified according to the taxonomic system of GBIF [1] (order, family, genus, species). Citizen scientists are involved in developing and measuring. The sensor system includes an infrared optoacoustic


wingbeat frequency detector and a computer vision unit. For a robust classification, local environmental data are collected additionally, i.e. time, temperature, humidity as well as spectral irradiance. While based on standard insect traps from classical entomology, our multisensor [3] discards the container with the lethal liquid and leaves the insects unharmed. Hence, our approach allows to deal with all three use cases discussed above. It is designed for scientific biodiversity monitoring, but due to its adaptability to various types of traps, it may as well be used to monitor specific insects, e.g. by training the neural network in a one-versus-all classification. In this paper we focus on the imaging system: the hardware setup, image post-processing, as well as evaluation by solving a classification task with machine learning. A particular effort has been made to optimize and standardize image acquisition. The paper is structured as follows: First, the requirements for the imaging system are summarized (Sect. 2.1). Structures by which experts distinguish species need to be resolved on images of living insects in motion. According to these requirements, the camera system is developed (Sect. 2.2). The present dataset exhibits one of the biggest challenges, which stands as representative for insect monitoring: the unbalanced distribution of insect data (Sect. 2.3). The overall system was evaluated on a data set of 16 species, some differing at the family level, others only at the species level (Sect. 2.4). Standard machine learning models like U-Net for semantic segmentation and ResNet-50 [12], MobileNet [14] and a small self-designed CNN demonstrate convincing prediction power for the insect species. Finally, results are presented and discussed in the context of currently used methods (Sect. 3). The conclusion and future outlook are presented in Sect. 4.

2 Materials and Methods

2.1 Imaging Requirements

Classifying insects from images requires taking various features into consideration, such as color, contrast, patterning, size, and shape relationships. Special features for entomological differentiation, such as pubescence and wing vascular pattern, may also be important for certain species, see Fig. 1. For example, Vespula vulgaris and Vespula germanica, both of the Vespidae family, look similar but are distinguished by a particular pattern on the hind limb. On the other hand, Polistes dominula, also from the Vespidae family, has a distinctive shape relationship (Fig. 2). The developed system is standardized with respect to homogeneous illumination, suppression of motion blur, resolution and depth-of-field, which is critical for an accurate classification. A homogeneous, diffuse illumination is essential, as well as a resolution that matches insect structural sizes. Furthermore, shadows on the background make automatic segmentation and detection difficult. To address these requirements, the acquisition unit has diffusing panels on the side walls and the bottom. A trade-off between maximal resolving power and large


Fig. 1. Image of a European paper wasp taken with the imaging unit, with zoomed-in characteristic details of body pattern, tibia-tarsus junction and wing details.

Fig. 2. Comparison of 3 wasp species. (Top) Images acquired with the imaging unit. (Bottom) Schematic representation according to Ripberger [27]. From left to right: Vespula germanica, Vespula vulgaris, Polistes dominula.

depth-of-field was determined. Characteristic structures like the hairs or the wing veins are in the range of 20–100 µm. However, the distance between the veins is at least one order of magnitude greater. The choice of an effective aperture (f/8) leads to a resolution of 10 µm, which makes it possible to differentiate


vein sequences on the object side. The related depth of field is 15 mm for a rear distance of 150 mm (Fig. 3).
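The quoted 10 µm figure is consistent with the diffraction limit at f/8; a quick check (the 550 nm wavelength is our assumption for mid-visible light, not stated in the text):

```python
# Diameter of the Airy disk (diffraction-limited spot) for an aperture of f/8.
wavelength_um = 0.55   # assumed mid-visible wavelength, 550 nm
f_number = 8

airy_diameter_um = 2.44 * wavelength_um * f_number
# about 10.7 um, the same order as the ~10 um resolution quoted above
```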

2.2 Hardware Setup

Behind the insect trap, the multisensor system [3] with the imaging unit (Fig. 3 (left)) replaces the container filled with the poisonous agent. This imaging unit is described below. The area for imaging, as shown in Fig. 3 (right), is confined to 60 × 45 × 20 mm³. The field of view (FoV) is optimized to FoV = 60 × 45 mm². The adapters at the input and output are 3D printed from PLA. The arena itself is cut from Plexiglas by a laser cutter. The top side is transparent, while the side walls and bottom are made of diffusing PMMA with a light transmission of 62%. Two photoelectric sensors (IS471FE OPIC optical integrated circuit), one at the bottom and one at the top, facing each other in the center of the arena, trigger the photo detection. Insects passing by trigger the strobe flash light, consisting of three 12 V LED strips with three units each (nine in total), which homogeneously illuminate the insects over the entire height. Moreover, the LED strips provide a neutral gray background by allowing light to enter the floor from the side facets. The IS471FE chip as receiver of the light barrier provides a modulated signal to

Fig. 3. Image acquisition unit (left) and principle of the insect arena (right). The red region indicates the light barrier in the center of the arena that starts the image recording.


an external IR LED (transmitter at 940 nm), so that interfering influences such as ambient light are filtered. To reduce stray light, an additional 0.75 mm slit aperture is applied. This increases the sensitivity of the photoelectric light barrier, especially for small insects such as fruit flies and hover flies. Custom electronic circuit boards were designed and developed to be plugged into the Raspberry Pi minicomputer. For imaging, a Raspberry Pi HQ camera (Sony CMOS sensor IMX477R, 12.3 megapixels) is used in combination with a 10 MP 16 mm telephoto lens with variable aperture. The entire system is located in a dark case of size 296 × 127 × 115 mm³. The length is chosen according to an optimal trade-off between resolution, field of view and depth of field. As a result, an additional 2 mm spacer ring was 3D-printed to extend the image distance, resulting in greater magnification and a shorter object distance. Flashing in a dark scene minimizes motion blur caused by the CMOS rolling shutter technology. The FSTROBE pin of the camera chip sends a PWM signal synchronized with the frame rate. The combination of a short flash of 500 µs and an exposure time of 23.5 ms, which is related to the readout time and the reset time of a single line, results in the simulation of a global shutter that minimizes distortion effects. Ants, for example, crawl very fast, with a speed of 50 cm/s. During one single flash, this corresponds to a path of 0.25 mm. At 10× demagnification, this is equivalent to 25 µm on the chip (13 pixels on the Sony IMX477R sensor chip). This results in motion blur 2.5 times worse than the blur due to the diffraction-limited resolving power at f/8. However, most insects are slower; besides, they also slow down in the arena. It is evident in the acquired images that the wings, whose wingbeat frequency can reach values in the kHz range, as well as individual limbs, exhibit motion blur, while the trunk is detected free of motion blur, see Fig. 2. This has been demonstrated for various insects including ants. To calibrate the camera system, a Bokeh application (Python) has been implemented, which runs on the Raspberry Pi as a local server. The automated measurement is processed by a Python script, too. It controls the individual sensors, processes the incoming data and stores the information on an SD card. The camera provides an in-memory H264 video stream based on a ring buffer to limit RAM usage. The size of the ring buffer corresponds to an H264 video of about 1.5 s in length at a resolution of 1440 × 1080 pixels. A sudden increase in frame brightness triggers the extraction of the first image frame from the video, followed by two others, saved as PNG images.
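The motion-blur arithmetic above can be written out explicitly; the insect speed, flash duration and 10× demagnification are taken from the text:

```python
# Motion blur of a fast insect during one strobe flash.
speed_m_s = 0.5        # ant crawling at 50 cm/s
flash_s = 500e-6       # 500 us flash duration

path_object_um = speed_m_s * flash_s * 1e6   # path on the object side
path_chip_um = path_object_um / 10           # 10x demagnification onto the sensor
# path_object_um is 250 um (0.25 mm); path_chip_um is 25 um on the chip
```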

2.3 Dataset and Preprocessing

To evaluate the quality of our insect images, we solve a classification task using a dataset of 1154 annotated images with 16 insect species classes, which are spread over the taxonomic tree: order, family, genus, species. The dataset is extremely unbalanced, as can be seen in Fig. 4. This is realistic for real-world data, since certain insect species will pass through the sensor system far more often than others, while rare species appear only occasionally. Thus, the data set represents the reality


D. Brandt et al.

Fig. 4. Distribution of the species in the total data set used for the experiments. Notice the long tail of the histogram which is typical for insect monitoring data.

within insect monitoring very well. The detailed distribution of species in the data set used in this work is given in Table 1. The original images are of size 1440 × 1080 pixels. We used Label Studio [2] to manually crop them to a square region containing the insect. Section 2.5 discusses how cropping may be done automatically and very reliably using semantic segmentation.
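The manual square crop can be mimicked with a small helper. This is a sketch under the assumption that the square side fits into the image; the function name is ours, not from the paper's code.

```python
def square_crop_box(x0, y0, x1, y1, width, height):
    """Expand an insect bounding box to a centered square crop region,
    shifted back inside the image bounds if it overflows an edge."""
    side = max(x1 - x0, y1 - y0)            # square side = longer box edge
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    left = min(max(cx - side / 2.0, 0), width - side)
    top = min(max(cy - side / 2.0, 0), height - side)
    return int(left), int(top), int(left) + side, int(top) + side

# Centered case and a case clamped at the right image border (1440 x 1080 frames):
box_a = square_crop_box(100, 200, 300, 300, 1440, 1080)   # -> (100, 150, 300, 350)
box_b = square_crop_box(1400, 0, 1440, 60, 1440, 1080)    # -> (1380, 0, 1440, 60)
```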

2.4 Machine Learning, Insect Classification

This work shows that the high-quality image data provided by the presented imaging system gives very good results for automatic insect classification on minicomputers like the Raspberry Pi. We decided to use two standard convolutional neural networks (CNNs) of different sizes and a rather small custom-made CNN. In all three cases, the input image size is 224 × 224. As already stated, insect classification is a fine-grained image analysis task. Moreover, data from monitoring will always be extremely unbalanced. An additional constraint on the monitoring system is that predictions must be performed on small devices with low power consumption, which precludes massive neural network models. As a benchmark model, ResNet-50 [12] was used, which is a rather large CNN (≈25 M parameters). Pretrained on the iNaturalist dataset 2021 with

Low Cost Machine Vision for Insect Classification


Table 1. The dataset consists of 1154 images of 16 insect species from various insect orders (5), families (8) and genera (15).

Order       | Family        | Genus        | Species
Hymenoptera | Apidae        | Apis         | Mellifica
            |               | Bombus       | Lapidarius
            |               |              | Terrestris
            |               | Xylocopa     | Violacea
            | Vespidae      | Vespa        | Crabro
            |               | Vespula      | Germanica
            |               | Polistes     | Dominula
Coleoptera  | Chrysomelidae | Leptinotarsa | Decemlineata
            | Cetoniidae    | Cetonia      | Aurata
Mecoptera   | Panorpidae    | Panorpa      | Communis
Diptera     | Tachinidae    | Sarcophaga   | Carnaria
            | Syrphidae     | Scaeva       | Pyrastri
            |               | Eristalis    | Tenax
            |               | Episyrphus   | Balteatus
            |               | Syrphus      | Ribesii
Heteroptera | Pentatomidae  | Graphosoma   | Lineatum

2.7 M images from 10k different species [33], the ResNet converged within a small number of epochs. No layers were frozen. Secondly, a smaller MobileNet (≈3.5 M parameters, default depth parameter α = 1) pretrained on ImageNet was retrained, also with no frozen layers. Finally, a small custom-made CNN with 5 layers and ≈1.3 M parameters was trained from scratch, see Fig. 8 in the Appendix. The performance of the three CNNs is compared for two distinct cases: full images of the 6 × 4 cm arena (see Fig. 3) and cropped images of the insects. In the first case, downscaling from 1440 × 1080 pixels to the NN input size of 224 × 224 results in a serious loss of morphological detail. Top-1 accuracies as well as confusion matrices are discussed in Sect. 3.1. The models and experiments were implemented in Python using Keras and TensorFlow 2.8.

2.5 Semantic Segmentation

For a precise and robust object localisation, we evaluated a small sized U-Net [28] as provided by [7]. The images were resized from 1440 × 1080 down to 256 × 256. Label Studio [2] was used to segment insects on 301 images by hand. Based on this ground truth, the U-Net was trained to predict masks. Preliminary results are presented in Sect. 3.2. The predicted masks can be used to define a minimal bounding box containing the insect and, thus, perform an automatic cropping.


3 Results and Discussion

3.1 Classification

The data set was split into fixed training, validation and test sets with ratios 60/20/20. Top-1 accuracies on the test set are reported in Table 2.¹

Table 2. Comparison of insect classification performance (top-1 accuracy) on the test set using full images and cropped images.

Model      | Accuracy (full) | Accuracy (cropped)
ResNet-50  | 0.9685          | 0.9604
MobileNet  | 0.8776          | 0.9780
Custom CNN | 0.7058          | 0.8722
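The fixed 60/20/20 split described above can be sketched as follows. This is illustrative only; the seed and the exact split procedure used by the authors are not published.

```python
import random

def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle once with a fixed seed, then cut into train/val/test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# For the 1154 images this yields 692/230/232 samples.
train, val, test = split_dataset(range(1154))
```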

It is quite evident that all models perform very well on cropped images. Also worth emphasizing is that the MobileNet and the custom CNN reach noticeably higher scores when trained on the cropped samples as opposed to the uncropped ones. This is, in contrast, not the case with the ResNet, which seems to perform equally well on the raw and the zoomed-in images. At first sight, this appears counterintuitive, since downsampling full images to 224 × 224 pixels causes substantial information loss and results in rather coarse, blurry images. This indicates that, up to a certain classification accuracy within the taxonomic tree, the features expressed by high spatial frequencies (fine-grained structures) are not relevant. Moreover, a reasonable explanation for the still very high scores may lie in the favorable initialization with iNat21 weights: the ResNet starts the retraining with prior knowledge about insects, which the MobileNet, for instance, did not inherit from the ImageNet weights. The size of the ResNet (around six times more parameters than the MobileNet and 20 times more than the custom CNN) is also not to be neglected. Since the dataset is highly unbalanced, the test accuracy alone does not provide a complete picture of the performance of the models on the individual classes. The confusion matrices in Figs. 7 and 6 in the Appendix reveal that especially the rare species are considerably harder to distinguish in full images. In particular, the two rarest species, Polistes dominula and Episyrphus balteatus (Fig. 4), are never predicted correctly in the full images, see Fig. 7. A key lever to further improve the accuracy of the models and to handle a large number of different species is to increase the amount of data: the more data provided from different species, the more species can be classified. For this reason, the contribution of citizen scientists to data collection is central to the project.

¹ As a side note, we did not fine-tune class weights or over-/under-sampling techniques to deal with the underrepresented classes. Nonetheless, the models performed very well on the cropped data.
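Per-class recall computed from the confusion matrix makes the rare-species failures visible that overall accuracy hides. A minimal numpy sketch (the helper is hypothetical, not the paper's code):

```python
import numpy as np

def confusion_and_recall(y_true, y_pred, n_classes):
    """Confusion matrix (rows: true class, columns: predicted class) and per-class recall."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    support = cm.sum(axis=1)                       # samples per true class
    recall = np.divide(cm.diagonal(), support,
                       out=np.zeros(n_classes), where=support > 0)
    return cm, recall

# A majority class masks a never-predicted rare class despite 75% overall accuracy:
cm, recall = confusion_and_recall([0, 0, 0, 1], [0, 0, 0, 0], n_classes=2)
```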


There are insect species for which it is very difficult or even impossible to determine the correct taxon based on multisensor data; here, dissection or genetic tests would be necessary. Therefore, the classifier should not always predict the insect species but rather the most specific level in the taxonomic hierarchy that can be determined reliably, e.g. species, genus, or family. To achieve this, a hierarchical classification scheme may be implemented, see the Gitlab repository [8].

3.2 Bounding Boxes via Segmentation

The results discussed in detail in Sect. 3.1 show that zooming in on the insect produces better results, although size information is lost when cropping. Therefore, we implement automated cropping in our classification workflow, see Fig. 5.

Fig. 5. Example of inferring a bounding box, from left to right: original image, manually annotated segmentation mask (used as label for training the U-Net in a supervised setting), U-Net predicted mask, bounding box inferred from the predicted mask.

The automated cropping algorithm based on segmentation, described in Sect. 2.5, reaches an IoU of 0.7658 for the predicted masks using the 301 labeled images as input. This method is more robust than commonly used thresholding techniques. Variations in brightness, white balance and reflections can occur in every image, making it impossible to set a global threshold value suitable for every image. Thresholding also leads to problems if several insects of different colours are present in one image or if certain parts of the insect have colours similar to the background, such as transparent wings. In addition, we observed that the arena gets dirty quickly, making it necessary to filter out nonspecific objects such as dust or dirt left by insects from the bounding box evaluation. Prediction of insect masks overcomes these problems. To improve the preprocessing pipeline, one-stage models such as YOLO [24] or other single-shot detectors will also be tested to find the bounding box estimator best suited to our data.
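Both the IoU evaluation and the bounding-box inference from a predicted mask reduce to a few numpy operations. A hedged sketch of both steps (helper names are ours):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 0.0

def bbox_from_mask(mask):
    """Minimal bounding box (x0, y0, x1, y1) around the foreground pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

pred = np.zeros((4, 4), dtype=int); pred[1:3, 1:3] = 1
gt = np.zeros((4, 4), dtype=int); gt[1:3, 2:4] = 1
```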

4 Conclusion and Outlook

The goal of this project is to develop a complete and autonomous system for insect species classification in the field that can be replicated and operated by citizen scientists. To this end, a low-cost system based on a minicomputer with various sensors and a Raspberry Pi HQ camera has been developed for standardized, AI-based automated monitoring of insects. The presented work focuses on the image acquisition unit including image processing. Both have been optimized to meet the requirements for automated species-level classification on a minicomputer that can serve as an edge device. The image processing includes an automatic trigger to capture a video clip, an adjusted frame selection, identification of insects in each frame and cropping, and classification of the insect.

The image processing system is evaluated based on the correlation between the preprocessing (cropping) and the classification accuracy for different CNNs (ResNet-50, MobileNet, and a custom-defined CNN). The ResNet-50 pretrained on the iNaturalist insect data shows the best results for full images with up to 96% accuracy, whereas the MobileNet and the ResNet both achieved at least 96% accuracy for cropped images. However, a custom-designed network proves to be a promising approach for an energy-efficient application, in particular inference on the Raspberry Pi as an edge device. In addition, we found that the accuracy of the smaller models is significantly higher when trained on cropped images.

As a next step, we will integrate multisensor data such as wingbeat frequencies, environmental data, and metadata such as insect size, which can be derived from the fixed field of view. Characteristic wingbeat frequencies are expected to help especially with the classification of very small insects. Prior probabilities depending on size, temperature, season and other factors, which are also used by experts, should similarly support the classification in automated monitoring. To address the problem of rare species and achieve high accuracy even with few input samples, class weighting and oversampling techniques will be investigated, as well as few-shot learning methods such as contrastive learning. Extensive data collection with citizen scientists will be conducted during this year's flight season.

A Appendix

A. Confusion Matrices

Fig. 6. Confusion matrix for MobileNet, test set with cropped images.


B. Custom CNN Model

Fig. 7. Confusion matrix for MobileNet, test set with full images. Although the top-1 accuracy (0.8776) is rather high also for full images, the classifier does not perform well on some rare species, e.g. Polistes dominula and Episyrphus balteatus.


Fig. 8. Custom CNN architecture with 1,270,992 parameters.


References

1. GBIF. http://www.gbif.org
2. Label Studio. http://github.com/heartexlabs/label-studio
3. Stadel, A., Beckers, I., Brandt, D., Haußer, F., Tschaikner, M., Schrimpf, I., Treutlein, H.: KInsecta project. http://www.kinsecta.org/project. Accessed 30 Jan 2023
4. Amrani, A., Sohel, F., Diepeveen, D., Murray, D., Jones, M.G.: Insect detection from imagery using YOLOv3-based adaptive feature fusion convolution network. Crop Pasture Sci. (2022)
5. Balla, E., Flórián, N., Gergócs, V., Gránicz, L., Tóth, F., Németh, T., Dombos, M.: An opto-electronic sensor-ring to detect arthropods of significantly different body sizes. Sensors (Switzerland) 20, 2 (2020)
6. Batista, G.E., Keogh, E.J., Mafra-Neto, A., Rowton, E.: SIGKDD demo: sensors and software to allow computational entomology, an emerging application of data mining, 761–764 (2011)
7. Sreenivas Bhattiprolu (2021)
8. Teodor Chiaburu (2021)
9. Università degli Studi Roma Tre, European Association for Signal Processing, IEEE Signal Processing Society, Institute of Electrical and Electronics Engineers: EUSIPCO 2018: 26th European Signal Processing Conference, Rome, Italy, 3–7 Sept 2018 (2018)
10. Gerovichev, A., Sadeh, A., Winter, V., Bar-Massada, A., Keasar, T., Keasar, C.: High throughput data acquisition and deep learning for insect ecoinformatics. Front. Ecol. Evol. 9, 600931 (2021)
11. Hallmann, C.A., Sorg, M., Jongejans, E., Siepel, H., Hofland, N., Schwan, H., Stenmans, W., Müller, A., Sumser, H., Hörren, T., Goulson, D., De Kroon, H.: More than 75% decline over 27 years in total flying insect biomass in protected areas. PLoS ONE 12, 10 (2017)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015)
13. Hogeweg, L., Zeegers, T., Katramados, I., Jongejans, E.: Smart insect cameras. Biodivers. Inf. Sci. Stand. 3, e39241 (2019)
14. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications (2017)
15. Høye, T.T., Ärje, J., Bjerge, K., Hansen, O.L., Iosifidis, A., Leese, F., Mann, H.M., Meissner, K., Melvad, C., Raitoharju, J.: Deep learning and computer vision will transform entomology. PNAS 118, e2002545117 (2021)
16. Khoo, B., Khoo, B.T.T., Rosalyn, R.P., Kitt, W.W.: Capacitive mosquito wing-beat sensor: a novel sensor for intelligent traps. Int. J. Eng. Res. Appl. 10, 18–27 (2020). http://www.ijera.com
17. Kirkeby, C., Rydhmer, K., Cook, S.M., Strand, A., Torrance, M.T., Swain, J.L., Prangsma, J., Johnen, A., Jensen, M., Brydegaard, M., Græsbøll, K.: Advances in automatic identification of flying insects using optical sensors and machine learning. Sci. Rep. 11, 12 (2021)
18. Lima, M.C.F., de Almeida Leandro, M.E.D., Valero, C., Coronel, L.C.P., Bazzo, C.O.G.: Automatic detection and monitoring of insect pests: a review. Agriculture 10(5), 161 (2020)


19. Lippert, C., Feuerbacher, A., Narjes, M.: Revisiting the economic valuation of agricultural losses due to large-scale changes in pollinator populations. Ecol. Econ. 180, 106860 (2021)
20. Sittinger, M.: Insect Detect. http://maxsitt.github.io/insect-detect-docs. Accessed 30 Jan 2023
21. Mankin, R., Hagstrum, D., Guo, M., Eliopoulos, P., Njoroge, A.: Automated applications of acoustics for stored product insect detection, monitoring, and management. Insects 12, 3 (2021)
22. Potamitis, I., Rigakis, I., Vidakis, N., Petousis, M., Weber, M.: Affordable bimodal optical sensors to spread the use of automated insect monitoring. J. Sens. 2018 (2018)
23. Ramalingam, B., Mohan, R.E., Pookkuttath, S., Gómez, B.F., Sairam Borusu, C.S.C., Wee Teng, T., Tamilselvam, Y.K.: Remote insects trap monitoring system using deep learning framework and IoT. Sensors (Switzerland) 20(9), 1–17 (2020)
24. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016)
25. Rhodes, M.W., Bennie, J.J., Spalding, A., French-Constant, R.H., Maclean, I.M.: Recent advances in the remote sensing of insects. Biol. Rev. 97(1), 343–360 (2022)
26. Rigakis, I., Potamitis, I., Tatlas, N.A., Livadaras, I., Ntalampiras, S.: A multispectral backscattered light recorder of insects' wingbeats. Electronics (Switzerland) 8, 3 (2019)
27. Ripberger, R., Hutter, C.P., Koch, F.: Schützt die Hornissen. 1992. 119 S., hardcover. K. Thienemanns Verlag GmbH & Co., Stuttgart. Preis: 36,- DM. ISBN 3-522-30450-0 (1993)
28. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation (2015)
29. Rydhmer, K., Bick, E., Still, L., Strand, A., Luciano, R., Helmreich, S., Beck, B.D., Grønne, C., Malmros, L., Poulsen, K., et al.: Automating insect monitoring using unsupervised near-infrared sensors. Sci. Rep. 12(1), 1–11 (2022)
30. Sánchez-Bayo, F., Wyckhuys, K.A.G.: Worldwide decline of the entomofauna: a review of its drivers. Biol. Cons. 232, 8–27 (2019)
31. Sütő, J.: Embedded system-based sticky paper trap with deep learning-based insect-counting algorithm. Electronics (Switzerland) 10, 8 (2021)
32. Teixeira, A.C., Ribeiro, J., Morais, R., Sousa, J.J., Cunha, A.: A systematic review on automatic insect detection using deep learning. Agriculture 13(3) (2023)
33. Van Horn, G., Cole, E., Beery, S., Wilber, K., Belongie, S., Mac Aodha, O.: Benchmarking representation learning for natural world image collections (2021)
34. Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., Belongie, S.: The iNaturalist species classification and detection dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8769–8778 (2018)
35. van Klink, R., August, T., Bas, Y., Bodesheim, P., Bonn, A., Fossøy, F., Høye, T.T., Jongejans, E., Menz, M.H., Miraldo, A., Roslin, T., Roy, H.E., Ruczyński, I., Schigel, D., Schäffler, L., Sheard, J.K., Svenningsen, C., Tschan, G.F., Wäldchen, J., Zizka, V.M.A., Åström, J., Bowler, D.E.: Emerging technologies revolutionise insect ecology and monitoring. Trends Ecol. Evol. 37(10), 872–885 (2022)
36. Wang, J., Zhu, S., Lin, Y., Svanberg, S., Zhao, G.: Mosquito counting system based on optical sensing. Appl. Phys. B 126(2), 1–10 (2020). https://doi.org/10.1007/s00340-019-7361-2


37. Wang, Q.-J., Zhang, S.-Y., Dong, S.-F., Zhang, G.-C., Yang, J., Li, R., Wang, H.-Q.: Pest24: a large-scale very small object data set of agricultural pests for multi-target detection. Comput. Electron. Agric. 175, 105585 (2020)
38. Wang, R., Cheng, H., Xiaowei, F., Long, T., Zeng, T.: Micro-Doppler measurement of insect wing-beat frequencies with W-band coherent radar. Sci. Rep. 7, 12 (2017)
39. Wu, X., Zhan, C., Lai, Y.K., Cheng, M.M., Yang, J.: IP102: a large-scale benchmark dataset for insect pest recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8787–8796 (2019)
40. Yao, Q., Feng, J., Tang, J., Xu, W.G., Zhu, X.H., Yang, B.J., Jun, L., Xie, Y.Z., Yao, B., Wu, S.Z., Kuai, N.Y., Wang, L.J.: Development of an automatic monitoring system for rice light-trap pests based on machine vision. J. Integr. Agric. 19(10), 2500–2513 (2020)

Shape Complexity Estimation Using VAE

Markus Rothgänger, Andrew Melnik, and Helge Ritter

Bielefeld University, Bielefeld, Germany
[email protected]
https://github.com/mmrrqq/shape-complexity

Abstract. In this paper, we compare methods for estimating the complexity of two-dimensional shapes and introduce a method that exploits the reconstruction loss of variational autoencoders with different sizes of latent vectors. Although the complexity of a shape is not a well-defined attribute, different aspects of it can be estimated. We demonstrate that our method captures some aspects of shape complexity. Code and training details will be publicly available.

Keywords: Shape complexity · Variational autoencoders · Computer vision

1 Introduction

The complexity of shapes is not a well-defined attribute. Nevertheless, computer vision systems are faced with the problem of ranking shapes by their complexity. For the case of two-dimensional shapes, previous definitions of shape complexity used approximations of the Kolmogorov complexity [5] or the entropy in local features [14]. Other work defined strict criteria, among others that circles are the least complex shapes and that adding parts different from the existing parts of a shape should increase its complexity [4,13]. Reference [4] applied a range of measures grouped into boundary-based, regional, and skeletal methods [1,4]. These measures were evaluated with a user study [1], and the authors listed three measures capturing most of the complexity information according to the users: boundary, convexity and skeleton. For humans, the complexity of a shape can be measured by the memory load of remembering the number and arrangement of the composite blocks [12]. Other methods only consider shapes represented by their outlines [5,6], which implies that these methods often miss the inner structure of the shapes. Shape complexity estimation exploiting Fourier analysis was previously conducted in [10], where the outlines of different biological species were modeled using elliptical Fourier analysis. The method of [15] considers the inner structure by minimizing the number of ellipses needed to fill the shape while preserving a certain accuracy as an indicator for shape complexity, which, however, can become computationally intensive. For scene-like images, the work in [16] exploits convolutional neural networks to estimate visual complexity.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 35–45, 2024. https://doi.org/10.1007/978-3-031-47715-7_3


We propose a shape complexity estimation method (see Fig. 1) driven by variational autoencoders (VAEs). The latent vector of a VAE is often used as an information bottleneck [7,17]. We investigated whether complexity measures can be obtained by employing VAEs with differently sized latent vectors for the shape reconstruction step. We focus on datasets of two-dimensional binary images and compare complexity orderings obtained with VAEs to more conventional measures based on frequency spectra in Fourier space and to an estimation of the Kolmogorov complexity following [5] but using lossless compression. The Kolmogorov complexity and its close relation to data compression are thoroughly examined in [8]. We further compare our rankings with orderings obtained from human judgments [6]. Finally, we propose a measure that combines the VAE with the previous measures. The datasets used in this work are described in Sect. 2.1, followed by the introduction of our methods in Sects. 2.2–2.5.

Fig. 1. Diagram of the architecture and calculation of the Complexity Score (CS) of a shape. The higher the score, the more complex the shape. S1, S2, S3—Shape 1, Shape 2, Shape 3.

2 Methods

2.1 Datasets

We evaluated and compared the methods on three different datasets. The first dataset consists of shapes of segments from first-person agent views in the MineRL


dataset [9,11]. We also used the MPEG-7 shape dataset [3], which provides a wide range of shape classes from real-world examples that are altered in subtle ways within those classes. We further apply the estimation methods to a shape dataset from the work of [6], as this enables us to compare our results to human-made rankings. Images from these datasets were preprocessed by cropping to the minimal centered square bounding box, followed by resizing to 64 × 64 pixels.

2.2 Variational Autoencoder Reconstruction Measure

For training the VAEs, data augmentation is applied in the form of random horizontal and vertical flips (p = 0.5) as well as random rotations in the range of ±85° (p = 0.5). We propose to use the absolute pixel-wise difference of the reconstructions obtained from two differently limited VAEs. In our experiments, we limit the latent representations of the two VAEs to 16 and 64 neurons, respectively. Using VAEs to estimate complexity follows the same intuition as the Kolmogorov complexity: a limited network succeeds at reconstructing simple shapes but generally produces more errors the more complex a shape gets. We limit the reconstruction capabilities by setting the number of neurons available to represent the latent mean and variance of the underlying distributions. For encoding, we use three convolutional layers with ReLU activation and max pooling, reducing the 64 × 64 × 1 input to 6 × 6 × 64. Lastly, two linear layers encode the mean and variance depending on the desired limitation of the network. For decoding the reparameterized latent representation, we first use a linear layer followed by five transposed convolution layers with ReLU activation. The last decoding layer consists of another transposed convolution with sigmoid activation, yielding the same dimensions as the input.

To calculate a single complexity value, our proposed method passes the image through both VAEs and reconstructs it from the latent representations. We then calculate the absolute pixel-wise difference of the two reconstructions, divided by the sum of all pixel values in the input image. Note that the reconstructions are not thresholded to bitmasks and therefore contain floating-point pixel values from zero (black) to one (white). By construction, the resulting complexity value is not strictly limited to the interval [0, 1] and may exceed the upper limit; however, it can safely be clipped to one, since any total difference exceeding the sum of white pixel values in the input (each of which is 1) implies a very strong deviation between the reconstructions and therefore suggests high complexity anyway. Both VAEs were trained on the same data subset.
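Given the two reconstructions, the score itself is a one-liner. A numpy sketch of the scoring step described above (the VAE forward passes are omitted; the function name is ours):

```python
import numpy as np

def complexity_score(x, recon_16, recon_64):
    """Absolute pixel-wise difference between the two VAE reconstructions,
    normalized by the sum of pixel values of the binary input and clipped to 1."""
    diff = np.abs(np.asarray(recon_64, float) - np.asarray(recon_16, float)).sum()
    return min(diff / np.asarray(x, float).sum(), 1.0)

x = np.zeros((8, 8)); x[2:6, 2:6] = 1.0   # binary input shape with 16 white pixels
identical = x * 0.9                        # identical reconstructions -> score 0
```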

2.3 Zlib Compression Measure

We approximate the Kolmogorov complexity by calculating the ratio of the byte lengths of the uncompressed image and its compressed counterpart. In theory, lossless compression of images with homogeneous areas and therefore large


chunks of the same data is easier than compressing scattered shape images. As the compression algorithm, we chose the zlib implementation of the lossless DEFLATE algorithm.¹ For small shapes with a small circumference (and therefore larger homogeneous black areas), the compression ratio is not as large as for shapes with a large circumference that fill most of their bounding box (i.e., that have smaller homogeneous black areas). To account for this, we multiply the compression ratio by (1 − fill_ratio), where fill_ratio denotes the fraction of white pixels in the image. This aims to increase the size invariance of our measure.
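A minimal sketch of this measure. Note one interpretation choice on our part: we take compressed size over raw size so that harder-to-compress images score higher and the value stays in (0, 1], matching the [0, 1] component range required in Sect. 2.5.

```python
import zlib
import numpy as np

def compression_complexity(mask):
    """DEFLATE-based complexity of a binary shape image: harder-to-compress
    images score higher; (1 - fill_ratio) damps large filled shapes."""
    raw = np.ascontiguousarray(mask, dtype=np.uint8).tobytes()
    ratio = len(zlib.compress(raw)) / len(raw)        # compressed / raw, in (0, 1] here
    fill_ratio = float(np.asarray(mask, dtype=float).mean())
    return ratio * (1.0 - fill_ratio)

blank = np.zeros((64, 64), dtype=np.uint8)                          # trivially compressible
noisy = np.random.default_rng(0).integers(0, 2, (64, 64)).astype(np.uint8)  # scattered
```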

2.4 Fourier Transform Measure

High spatial frequency is an indicator for shape complexity, but at the same time it can also indicate image noise [2]. The Fast Fourier Transform (FFT) inevitably lumps both together, reflecting the problem that "unrecognized complexity" may appear as noise. In some cases, the spatial frequencies of content and noise may be separable, allowing the noise to be filtered out without affecting the content. Otherwise, distinguishing noise from content is only possible with more detailed models for both, as described in [2]. While high spatial frequency may thus represent image noise as well as shape complexity, the FFT measure is based on the assumption that images of complex shapes are often less homogeneous and therefore contain more high-frequency areas. It follows that the average spatial frequency over the image increases with the complexity of the shape. To find the average frequency of the image, we calculate the two-dimensional discrete Fourier transform of the image and extract the mean frequency for both dimensions. The two frequency means are then combined into a single value using the Euclidean norm, yielding a measure for the mean frequency over both dimensions in the range of 0 to √(0.5² + 0.5²) cycles per pixel. Finally, we normalize this measure by dividing it by √(0.5² + 0.5²).
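A numpy sketch under our reading of "mean frequency" as a magnitude-weighted average of the absolute FFT frequencies per axis; the paper's exact weighting may differ.

```python
import numpy as np

def fft_complexity(img):
    """Magnitude-weighted mean spatial frequency per axis, combined by the
    Euclidean norm and normalized by sqrt(0.5**2 + 0.5**2) cycles/pixel."""
    spectrum = np.abs(np.fft.fft2(img))
    fy = np.abs(np.fft.fftfreq(img.shape[0]))[:, None]   # cycles/pixel along rows
    fx = np.abs(np.fft.fftfreq(img.shape[1]))[None, :]   # cycles/pixel along columns
    total = spectrum.sum()
    mean_fy = (spectrum * fy).sum() / total
    mean_fx = (spectrum * fx).sum() / total
    return float(np.hypot(mean_fx, mean_fy) / np.hypot(0.5, 0.5))

flat = np.ones((64, 64))                                        # only DC energy -> 0
checker = (np.indices((64, 64)).sum(axis=0) % 2).astype(float)  # Nyquist-heavy pattern
```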

2.5 Combining Measures

As previously stated, we conducted experiments using a combination of the single complexity estimators described above. In line with [4], the combination of measures aims at capturing more than one aspect of complexity. Although [4] suggest that complexity should not be viewed as a one-dimensional measure, we often wish to have a complexity ranking of shapes. Therefore, we consider the combination of measures as vectors in an n-dimensional space where all component values lie within the range [0, 1], with zero suggesting low and one high complexity. To combine them into a single value, we use the magnitude of the resulting n-dimensional vector. Although the ranges of the single measures are the same, the values are not necessarily distributed equally over the interval. In simple sorting scenarios

¹ https://www.rfc-editor.org/rfc/rfc1951


a min-max normalization of each measure is possible, which equalizes the contributions of the measures, as already described in [6]. However, this is not applicable in automated pipelines where the complexity of a single shape must be evaluated in a one-shot manner. In the following, we will refer to the combination of the compression, FFT and VAE measures as the combined measure.
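The combination step is then just the vector magnitude, with min-max normalization as an option for batch sorting. A sketch; the function name is ours.

```python
import numpy as np

def combined_complexity(scores, mins=None, maxs=None):
    """Magnitude of the n-dimensional vector of single measures in [0, 1].
    If per-measure mins/maxs over a whole shape set are given, apply the
    min-max normalization described above; skip it in one-shot settings."""
    v = np.asarray(scores, dtype=float)
    if mins is not None and maxs is not None:
        mins, maxs = np.asarray(mins, float), np.asarray(maxs, float)
        v = (v - mins) / (maxs - mins)
    return float(np.linalg.norm(np.clip(v, 0.0, 1.0)))
```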

3 Results

A study collecting human judgments of complexity was conducted by Dai et al. [6]. We use this dataset to compare the shape-complexity sortings produced by all methods from Sect. 2 with the human-made sorting using the Spearman rank correlation, as shown in Fig. 2. The strongest correlation to the human reference sorting exists for the compression measure. The rankings of the measures are plotted against the human-made ranking of the reference dataset in Fig. 3. The trend lines indicate that the compression-ratio-based sorting is most similar to the human sorting.

Fig. 2. Spearman's rank correlation evaluated on the reference test dataset [6] with 30 shapes. Fill Ratio is the percentage of white pixels in the images; 3D is the combined measure of VAE, FFT and compression.

We additionally compare the Spearman correlation coefficients for rankings of 2000 randomly sampled sets of nine shapes from the MPEG-7 dataset [3] (Table 1). Among the single-value methods, we observed the strongest correlation between the FFT and the compression-based measures. However, with a value of 0.264, this is a weak correlation. The strongest correlation to the combined measure is given by the FFT-based method. These results agree with the previous findings: if we take the combined measure to capture most of the complexity features, there seems to be a slight emphasis on the frequency distribution of the shapes.


Fig. 3. Rankings of measures over human-judged rankings, including trend lines.

Table 1. Average Spearman correlation coefficients for rankings of nine randomly sampled shapes (n = 2000).

            | Compression | FFT   | Combined
VAE         | 0.146       | 0.170 | 0.275
Compression |             | 0.264 | 0.369
FFT         |             |       | 0.436

To visualize the difference in reconstruction capabilities of the VAEs, the reconstructions for the reference dataset [6] are shown in Fig. 4. Each cell consists of three images: the ground-truth bitmask image, the 64-latent-neuron reconstruction, and the 16-latent-neuron reconstruction (from top to bottom). While large blocks of filled regions are reconstructed with similar accuracy, the differences are most visible in detailed or high-frequency regions, where the reconstructions of the 16-latent-neuron network are considerably more blurred. Visualizations of the sortings for all methods on the reference dataset are given in Fig. 5. Applying the measures to a randomly chosen but fixed subset of the MPEG-7 data without augmentation sorts the shapes as depicted in Fig. 6. In

Fig. 4. Reconstruction example for the reference dataset [6]. The first row contains the ground truth images, second row the reconstruction results from the 64 latent neuron VAE, last row the reconstructions by the 16 latent neuron network. The associated complexity value is given in orange.
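The reconstruction-error idea behind Fig. 4 can be illustrated without a trained network. As a dependency-free stand-in for a capacity-limited VAE, the sketch below squeezes a binary mask through a coarse block-average bottleneck and reports the mean absolute reconstruction error: detailed, high-frequency shapes reconstruct poorly and thus score as more complex. The block size and test masks are illustrative assumptions, not the paper’s VAE.

```python
# Stand-in for the limited-capacity reconstruction measure: reconstruct
# a binary mask through a coarse bottleneck and score the error.

def bottleneck_complexity(mask, block=4):
    """mask: list of rows of 0/1 ints; block: bottleneck coarseness."""
    h, w = len(mask), len(mask[0])
    recon = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            cells = [(y, x) for y in range(by, min(by + block, h))
                            for x in range(bx, min(bx + block, w))]
            mean = sum(mask[y][x] for y, x in cells) / len(cells)
            for y, x in cells:
                recon[y][x] = mean
    err = sum(abs(mask[y][x] - recon[y][x])
              for y in range(h) for x in range(w))
    return err / (h * w)

solid = [[1] * 8 for _ in range(8)]                            # simple
checker = [[(x + y) % 2 for x in range(8)] for y in range(8)]  # detailed
```

The solid block reconstructs perfectly (error 0), while the checkerboard, like the high-frequency regions in Fig. 4, is blurred to a uniform gray and scores high.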

Shape Complexity Estimation Using VAE

41

Fig. 5. A comparison of sorting the reference dataset [6]. The associated complexity value is given in orange text (for the combined measure it is FFT, compression and VAE from top to bottom). (a—human reference, b—pixel fill percentage, c—compression, d—FFT, e—VAE, f—combined, g—combined equal contribution).

Fig. 6. A comparison of sorting a random subset of the MPEG-7 [3] dataset using our measures. The associated complexity value is given in orange text (for the combined measure it is FFT, compression and VAE from top to bottom). (a—pixel fill percentage, b—compression, c—FFT, d—VAE, e—combined, f—combined equal contribution).

In addition to the measures from Sect. 2, we use the percentage of white pixels in the image to analyze the influence of the fill ratio on the measures. A visible difference in the sensitivity of the measures emerges: the compression and FFT measures tend to agree on what they consider the more complex shapes, whereas the FFT and VAE measures agree more on the lower end. In general, if we take high spectral frequency as an indicator of high complexity, we notice that both large pixelwise differences and large compression differences result from high spectral frequency. Because the values of the VAE measure are more evenly distributed over the range [0, 1], the sorting using the combined measure without min/max normalization of the component contributions naturally agrees mostly with the single VAE measure. Similar results are visible in Fig. 7 for shapes extracted via segmentation from Minecraft first-person views. To show the invariance to the white-pixel percentage of the images, we present results for a subset of the MPEG-7 dataset [3] in which we only chose


Fig. 7. A comparison for shapes extracted from Minecraft first-person views [9, 11] using our measures. The associated complexity value is given in orange text (for the combined measure it is FFT, compression and VAE from top to bottom). (a—pixel fill percentage, b—compression, c—FFT, d—VAE, e—combined, f—combined equal contribution).

Fig. 8. A comparison of sorting a subset of the MPEG-7 dataset containing only shapes with a white-pixel percentage in the interval (0.6, 0.8). The associated complexity value is given in orange text (for the combined measure it is FFT, compression and VAE from top to bottom). (a—pixel fill percentage, b—compression, c—FFT, d—VAE, e—combined, f—combined equal contribution).

images with 60 to 80% white pixels (Fig. 8). The sorting based on white-pixel percentage in Fig. 8a deviates strongly from the sortings by our measures, which indicates invariance to the white-pixel percentage. In a similar fashion to [4], we evaluate the performance of our methods by analyzing how consistently shapes belonging to the same class of the MPEG-7 dataset are judged to be of the same complexity. The results in Fig. 9 are obtained using all three measures combined. While class membership is not necessarily related to complexity, some classes of the dataset are altered in such a way that their level of detail does not change and the overall shape structure remains the same (apple). For other classes, such as the pocket watch, the detail within the class varies considerably, and our measures accordingly judge some instances simple and others complex. Nevertheless, the sorting seems plausible, as shapes with filled regions are considered less complex than others.
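The fill-ratio selection used for the experiment in Fig. 8 amounts to a simple filter over binary masks. The sketch below uses tiny synthetic placeholders, not MPEG-7 shapes.

```python
# Select shapes whose white-pixel percentage lies in (0.6, 0.8), as done
# for the fill-ratio invariance experiment.

def fill_ratio(mask):
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

def select_by_fill(masks, lo=0.6, hi=0.8):
    return [m for m in masks if lo < fill_ratio(m) < hi]

masks = [
    [[1, 1, 1, 1]] * 4,   # fill 1.00 -- excluded
    [[1, 1, 1, 0]] * 4,   # fill 0.75 -- kept
    [[1, 0, 0, 0]] * 4,   # fill 0.25 -- excluded
]
subset = select_by_fill(masks)
```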


Fig. 9. Results of sorting a subset of the MPEG-7 dataset [3] including ten classes instead of 70 using the normalized combined measure. Shapes belonging to the same class are colorized equally. The associated complexity value for each measure is given as white text for FFT, compression and VAE from top to bottom.

4 Discussion, Conclusion, and Future Work

The informal definition of shape complexity makes it hard to determine the validity of the presented methods. To further evaluate the sorting capabilities of the different methods, a larger user study might be the most fitting evaluation, similar to the study conducted by [1]. In this paper, we showed that traditional image analysis and estimations of the Kolmogorov complexity, including our method using VAEs, are suitable for shape complexity estimation, especially in environments where computational overhead must be limited. While not following strict definitions of complexity, the rankings obtained from the explored methods correlate with human-made rankings and open the field for further research on computer vision pipelines. Our proposed method using VAEs shows promising results on multiple datasets while being lightweight in training and application. While the normalized combination of all measures draws on a variety of complexity indicators, a single measure may be more fitting in scenarios where the general structure of the input shapes can be foreseen. Rejecting noisy shapes might work best using the FFT


or the compression measure, while a certain deviation from an expected shape can be evaluated using VAEs trained on the expected shapes. Although our experiments support the assumption that capacity-limited neural networks can reconstruct shapes of low complexity, the field of reconstruction-based neural networks offers many further points of interest for shape complexity reconstruction and estimation. Further investigation of learning-based models might address the question of whether it is possible to learn to distinguish complexity from noise while approximating the Kolmogorov complexity, as described by [2].
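The single measures and their normalized combination can be sketched compactly. The compression term below follows the zlib idea directly; the spectral term is a simplified row-wise DFT high-frequency energy fraction standing in for the 2-D FFT measure, and the threshold `n // 4` is an illustrative choice, not the paper’s parameterization.

```python
import zlib
import cmath

# Compression measure: ratio of compressed to raw mask size.
def compression_measure(mask):
    raw = bytes(v for row in mask for v in row)
    return len(zlib.compress(raw)) / max(len(raw), 1)

# Simplified spectral measure: fraction of row-wise DFT energy in the
# upper part of the spectrum.
def fft_measure(mask):
    hi = total = 0.0
    for row in mask:
        n = len(row)
        for k in range(n):
            coef = sum(row[x] * cmath.exp(-2j * cmath.pi * k * x / n)
                       for x in range(n))
            energy = abs(coef) ** 2
            total += energy
            if min(k, n - k) > n // 4:   # high-frequency band
                hi += energy
    return hi / total if total else 0.0

def combined(masks, measures):
    # Min/max-normalize each measure over the shape set, then average,
    # so that no single measure dominates the combined score.
    cols = [[m(s) for s in masks] for m in measures]
    normed = []
    for col in cols:
        lo, span = min(col), (max(col) - min(col)) or 1.0
        normed.append([(v - lo) / span for v in col])
    return [sum(vals) / len(vals) for vals in zip(*normed)]

solid = [[1] * 16 for _ in range(16)]
checker = [[(x + y) % 2 for x in range(16)] for y in range(16)]
scores = combined([solid, checker], [compression_measure, fft_measure])
```

As expected, the high-frequency checkerboard receives a higher combined score than the solid block.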

References

1. Bazazian, D., Magland, B., Grimm, C., Chambers, E.W., Leonard, K.: Perceptually grounded quantification of 2D shape complexity. Vis. Comput. 38, 3351–3363 (2022)
2. Bloem, P., de Rooij, S., Adriaans, P.W.: Two problems for sophistication. In: International Conference on Algorithmic Learning Theory (2015)
3. Bober, M.: MPEG-7 visual shape description. IEEE Trans. Circuits Syst. Video Technol. 11(7), 716–719 (2001)
4. Chambers, E., Emerson, T., Grimm, C., Leonard, K.: Exploring 2D Shape Complexity, pp. 61–83. Springer International Publishing, Cham (2018)
5. Chen, Y., Sundaram, H.: Estimating complexity of 2D shapes. In: 2005 IEEE 7th Workshop on Multimedia Signal Processing, pp. 1–4 (2005)
6. Dai, L., Zhang, K., Zheng, S., Martin, R., Li, Y.-N., Yu, J.: Visual complexity of shapes: a hierarchical perceptual learning model. Vis. Comput. 38 (2022)
7. Korthals, T., Hesse, M., Leitner, J., Melnik, A., Rückert, U.: Jointly trained variational autoencoder for multi-modal sensor fusion. In: 2019 22nd International Conference on Information Fusion (FUSION), pp. 1–8. IEEE (2019)
8. An Introduction to Kolmogorov Complexity and Its Applications. TCS, Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11298-1
9. Malato, F., Leopold, F., Raut, A., Hautamäki, V., Melnik, A.: Behavioral cloning via search in video pretraining latent space (2022). arXiv:2212.13326
10. McLean, C., Garwood, R., Brassey, C.: Assessing the patterns and drivers of shape complexity in the amblypygid pedipalp. Ecol. Evol. 11(15), 10709–10719 (2021)
11. Melnik, A., Harter, A., Limberg, C., Rana, K., Sünderhauf, N., Ritter, H.: Critic guided segmentation of rewarding objects in first-person views. In: Edelkamp, S., Möller, R., Rueckert, E. (eds.) KI 2021. LNCS (LNAI), vol. 12873, pp. 338–348. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87626-5_25
12. Melnik, A., Schüler, F., Rothkopf, C.A., König, P.: The world as an external memory: the price of saccades in a sensorimotor task. Front. Behav. Neurosci. 12, 253 (2018)
13. Nolte, F., Melnik, A., Ritter, H.: Stroke-based rendering: from heuristics to deep learning (2022). arXiv:2302.00595
14. Page, D., Koschan, A., Sukumar, S., Roui-Abidi, B., Abidi, M.: Shape analysis algorithm based on information theory. In: Proceedings 2003 International Conference on Image Processing, vol. 1, pp. I–229 (2003)
15. Panagiotakis, C., Argyros, A.: Parameter-free modelling of 2D shapes with ellipses. Pattern Recogn. 53 (2015)


16. Saraee, E., Jalal, M., Betke, M.: Visual complexity analysis using deep intermediate-layer features. Comput. Vis. Image Underst. 195, 102949 (2020)
17. Zai El Amri, W., Tautz, O., Ritter, H., Melnik, A.: Transfer learning with Jukebox for music source separation. In: Artificial Intelligence Applications and Innovations: 18th IFIP WG 12.5 International Conference, AIAI 2022, Hersonissos, Crete, Greece, 17–20 June 2022, Proceedings, Part II, pp. 426–433. Springer, Berlin (2022)

Training System for the Tomato Paste Production Process Through Virtual Environments Fernando J. Pantusin, Jimmy W. Cordonez, Milton A. Quimbita, Víctor H. Andaluz(B) , and Andrés D. Vargas Universidad de las Fuerzas Armadas ESPE, Sangolquí, Ecuador {fjpantusin,jwcordonez,maquimbita3,vhandaluz1, advargas3}@espe.edu.ec

Abstract. This paper presents the virtualization and control of a tomato paste manufacturing plant (double-effect evaporator). The virtual environment was developed with CAD programs and Unity 3D in order to simulate the real behavior of the process. A control algorithm based on the inverse of the process is proposed. The control algorithm is implemented in Matlab, which communicates with the virtual environment through shared memories: the virtual environment sends the values of the process variables and receives the control actions from the software where the control law is established. The virtual simulator allows testing the robustness of the controller against different disturbances, e.g., sensor measurement error, input error, and modeling error, without compromising the safety of the real plant or the operator. As a result, an immersive and interactive virtual environment was obtained that represents the real closed-loop behavior of the process, with a user-friendly interface that allows observing the real-time evolution of the plant together with the implemented controller. Keywords: Virtual reality · Multi-effect evaporator · Stability · Lyapunov theory

1 Introduction

For efficiency, the food industry uses multiple-effect evaporators in syrup production and in the concentration of milk, fruit juice, and tomato paste. The tomato paste industry focuses on the concentration stage because it is essential for reducing costs and increasing production profits; in addition, for energy saving, a double-effect evaporator is used, in which the first effect feeds the second effect [1, 2]. In the evaporator, the input is tomato juice with more than 85% water, which is cooked for 30 to 45 min at a temperature of 90 °C to obtain a tomato paste concentration of no more than 61.5% water, which facilitates storage, reduces shipping cost, and satisfies consumers [3]. It should be noted that the cooking time is determined by the concentration required at the end of the product [4]. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 46–55, 2024. https://doi.org/10.1007/978-3-031-47715-7_4


The mathematical model of the evaporator was derived by analyzing each effect separately in order to ease understanding, since it involves mass and energy balances. Owing to the composition of tomato juice, it was considered a binary solution of water and soluble solids. The following assumptions were made: there is a perfect mixture in each effect, the juice level is constant in each evaporator, there is thermodynamic equilibrium (of liquid and vapor), the composition and temperature are homogeneous inside each evaporator, and the evaporator is made of stainless steel because it is a food process [5, 6]. Industrial control has advanced greatly in terms of hardware and the complexity of the tasks performed; however, the basic concepts of system structure and control techniques have remained unchanged for the last 40 years [7]. Accurate control of evaporators leads directly to better control of subsequent processes, which implies greater energy efficiency, good quality of the final product, and minimum cost at the plant's production rate [8, 9]. For the control of multistage evaporation processes, different control strategies have been designed over time, based on cascade controllers and setpoint adaptation algorithms; the most classical approach is multivariable PID control with decouplers, and predictive controllers have also been designed for process automation and higher final-product quality [10, 11].
The virtualization of processes, as a new technological tool, has enabled advances and growth in different areas of knowledge. During the pandemic, virtual reality was widely used; it is a computer system that creates a digital world in which the user has the feeling of being present. Various platforms exist for developing such animations, for example Unity 3D, a comprehensive multiplatform tool for developing 3D games that is compatible with programs such as 3ds Max, SketchUp, AutoCAD, and other modelers, making it possible to achieve a good approximation of the physical phenomena that represent the behavior of the system. Creating the model virtually enables virtual tours inside industrial plants and interaction with them; the designs are based on the real plant, which benefits students in the absence of classroom practice in a real laboratory [12–15].
With the aforementioned background, this work considers a second-order nonlinear industrial process with two inputs and two outputs. The analysis was performed in steady state; therefore, a controller based on the inverse of the process matrix is implemented in continuous time to govern the concentration of each effect, ensuring the optimal operation of the plant; the control law is nonlinear due to the nonlinear nature of the process. The closed-loop equation is used to analyze the stability of the system with Lyapunov theory when the process is under ideal conditions, i.e., without disturbances, and to verify that the outputs equal the desired values. Since processes are dynamic rather than static, several disturbances may appear as variables to be corrected; therefore, a robustness analysis of the controller was carried out using Lyapunov theory for the three possible disturbances: disturbances at the plant inlet, sensor measurement error, and modeling errors [16].
For the development of the immersive and interactive virtual environment, AutoCAD Plant 3D was used to design the double-effect evaporator, which comprises the structure, stairs, railings, vertical tubes, evaporators, and sensors, based on a P&ID diagram of the real process. The CAD design was exported through 3ds Max, with which the FBX file


is obtained for use in Unity 3D, while bilateral communication is achieved through shared memories coded in Microsoft Visual Studio and Matlab in order to control the system in real time with the implemented control algorithm. The remainder of this work is organized as follows: Sect. 2 describes the virtual environment developed for the 3D process; Sect. 3 develops the control algorithm, the stability analysis, and the robustness to disturbances; Sect. 4 presents the analysis of results; and Sect. 5 gives the conclusions of the work.

2 Virtual Environment

This work addresses the simulation of the concentration stage (double-effect evaporator) of a tomato paste production plant together with its control algorithm. For better visualization of the system behavior, virtual reality was used to create an immersive and interactive environment based on a real P&ID of a double-effect evaporator, using the Unity 3D platform. Figure 1 shows the composition of the proposed virtual environment. The architecture is composed of two stages: (i) Process Creation, which studies the structure of the real plant with the help of a P&ID that specifies the dimensions, composition, and operation of the process together with the elements needed for the double-effect evaporator; from these considerations the CAD of the 3D plant model is designed, and the FBX file needed for export to the virtual environment is obtained. (ii) Virtual Environment, in which the modeled evaporator file is imported from the CAD program into the Unity 3D platform, where the coding for each object is done; in addition, Unity's own virtual environments are imported through assets, sliders are added, and an avatar of the operator is imported for greater realism, which interacts directly with the process, while bilateral communication is provided by shared memories created to send and receive data in order to accurately simulate the behavior of the process.

Fig. 1. Real and simulated 3D process diagram.

Figure 2 shows the architecture of the entire process, which consists of three stages: Graphical Interface, Controller, and Process. The Graphical Interface stage is developed on the Unity 3D platform and is composed of: (i) Avatar, the character that interacts with the entire plant and can also visualize the


Fig. 2. Diagram of virtualization, process and controller.

process graphics; (ii) Evaporator tanks, which change color depending on the output concentration; (iii) Valve, which shows the opening of the steam valve depending on the control actions; and (iv) System Graphics, the representation of the graphs obtained from Process Evolution, i.e., output concentrations, control actions for the actuators that enter the plant, and control errors measured by the sensors present in the process. The Process stage is developed in Microsoft Visual Studio, where the code for the object animations and for correctly simulating the plant behavior resides, and is subdivided into: (i) Shared Memories, whose function is bilateral communication between the Process and Controller stages to send and receive data; (ii) Process Evolution, responsible for evolving the process according to the setpoint values and the identified mathematical model, considered the main part of the process; (iii) Concentration, whose function is to modify the color of the tanks depending on the output concentrations; and (iv) Steam Valve, whose function depends on the control actions for the valve opening. The Controller stage is developed in the mathematical software Matlab and consists of: (i) Control Algorithm, containing the control law code implemented for a regulation task; (ii) Shared Memories, with the same operation described in the Process stage; and (iii) System Graphics, with the same operation described in the Graphical Interface stage.
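The shared-memory bridge exchanges the measured concentrations in one direction and the control actions in the other through a single named block. The paper's bridge is implemented in C# and Matlab; the Python analogue below only illustrates the idea, and the field order (C1, C2, F0, u) is an assumption. In practice both programs would agree on a fixed block name instead of the auto-generated one used here.

```python
import struct
from multiprocessing import shared_memory

# Fixed binary layout for the bridge: four little-endian doubles.
LAYOUT = struct.Struct("<4d")   # C1, C2, F0, u

shm = shared_memory.SharedMemory(create=True, size=LAYOUT.size)
try:
    # Process side writes the measured concentrations.
    LAYOUT.pack_into(shm.buf, 0, 0.21, 0.43, 0.0, 0.0)
    # Controller side reads them and writes the control actions back.
    c1, c2, _, _ = LAYOUT.unpack_from(shm.buf, 0)
    LAYOUT.pack_into(shm.buf, 0, c1, c2, 1.5, 0.8)
    _, _, f0, u = LAYOUT.unpack_from(shm.buf, 0)
finally:
    shm.close()
    shm.unlink()
```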

3 Control Scheme

In the production of tomato paste, a double-effect evaporator was used for the concentration stage; given its operation and the assumptions made during mathematical modeling, the result is a second-order nonlinear MIMO system. The assumptions of the model are: there is a perfect mixture in each effect, the juice level is constant, there is thermodynamic equilibrium (liquid and vapor) for the whole system, the composition and temperature are homogeneous inside each evaporator, and

50

F. J. Pantusin et al.

the evaporator is made of stainless steel because it is a food process. The nonlinear model, obtained from [17], is given by the following equations, which represent the behavior of the double-effect evaporator:

w1 (dC1/dt) = F0 (C0 − C1) + k1 u C1
w2 (dC2/dt) = F0 (C1 − C2) + C2 (k1 u + k2 u) − k1 u C1

(1)

where the manipulated variables are the vapor flow rate of the first effect, u, and the tomato juice flow, F0. The controlled variables are the concentrations of the first and second effects, C1 and C2. The system has a disturbance, the input concentration C0; it also involves the masses of retained liquid in the first and second effects, w1 and w2, and, finally, the static flow constants of both effects, k1 and k2 [10]. For the control algorithm, a controller based on the inverse of the process was used, programmed in Matlab, while the plant is programmed in Visual Studio and connected to the Unity 3D platform to send and receive data on the evolution of the closed-loop process in a real-time regulation task; the communication bridge between both programs is bilateral through shared memories.
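The closed loop described above can be sketched numerically: the plant follows the model (1) and the controller inverts the process matrix at each step. All numeric values below (C0, k1, k2, gains, setpoints, and w1 = w2 = 1) are illustrative assumptions, not the plant parameters of the paper.

```python
# Closed-loop sketch: double-effect evaporator under an inverse-process
# control law, integrated with explicit Euler.

def A_matrix(c1, c2, C0=0.02, k1=0.4, k2=0.3):
    # Nonlinear process matrix A(c), with w1 = w2 = 1 absorbed.
    return [[C0 - c1, k1 * c1],
            [c1 - c2, c2 * (k1 + k2) - k1 * c1]]

def control(c, cd, w=(2.0, 2.0)):
    # u_ref = A^{-1} W (cd - c) for the 2x2 case, W = diag(w).
    a = A_matrix(c[0], c[1])
    e = [w[0] * (cd[0] - c[0]), w[1] * (cd[1] - c[1])]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [( a[1][1] * e[0] - a[0][1] * e[1]) / det,
            (-a[1][0] * e[0] + a[0][0] * e[1]) / det]

c, cd, dt = [0.10, 0.20], [0.30, 0.45], 0.01
for _ in range(2000):
    f0, u = control(c, cd)            # manipulated variables [F0, u]
    a = A_matrix(c[0], c[1])
    c = [c[0] + dt * (a[0][0] * f0 + a[0][1] * u),
         c[1] + dt * (a[1][0] * f0 + a[1][1] * u)]
# With exact inversion the closed loop reduces to c_dot = W (cd - c),
# so both concentrations converge to their setpoints.
```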

Fig. 3. Undisturbed control algorithm.

Considering the control system of the double-effect evaporator presented in Fig. 3, the process is defined as:

c˙(t) = A(c(t)) x(t)

(2)

where c˙(t) = [c˙1 c˙2]ᵀ is the output vector of the process; A(c(t)) ∈ R²ˣ² represents the nonlinear matrix of the process with respect to the input variables,

A = [ C0 − C1(t)        k1 C1(t)
      C1(t) − C2(t)     C2(t)(k1 + k2) − k1 C1(t) ]

(3)

Training System for the Tomato Paste Production

51

x(t) = [F0 u]ᵀ symbolizes the input vector of the process. Solving the process equation (2) for the control action gives x(t) = A⁻¹ c˙. The rate of change of the system output can be represented as the difference between the desired value and the actual value, multiplied by an adjustment gain, which normalizes units and weights the importance of the error. Therefore, the following control law is proposed:

uref = A⁻¹ W (cd − c)

(4)

where cd is the vector of desired values, W ∈ R²ˣ² is a gain matrix, and c(t) is the vector of real values measured by the sensors. To analyze stability, Lyapunov theory was used, which requires finding a candidate function; since defining one correctly can take considerable effort, the quadratic error V(c̃) = ½ c̃ᵀc̃ was used as the candidate function. Differentiating the Lyapunov candidate yields:

V˙(c̃) = c̃ᵀ c̃˙

(5)

To find the closed-loop equation, the process input is equated with the proposed control law, uref(t) = x(t), and the control errors are defined as c̃(t) = cd(t) − c(t). For a regulation task, c˙d(t) = 0, so the closed-loop equation is c̃˙(t) = −W c̃. Substituting the closed-loop equation into the derivative of the Lyapunov candidate gives V˙(c̃) = −c̃ᵀ W c̃. Therefore, for the process to be globally uniformly asymptotically stable, the gain matrix must be positive definite, W > 0. Next, the robustness of the proposed control system is analyzed, considering the disturbances indicated in Fig. 4, so the control action is:

x(t) = uref(t) + ρ(t)

(6)

where ρ(t) represents the disturbances at the process input. Considering errors in the process model, the following is obtained:

A(t) = Â(t) + Ã(t)

(7)

where Ã ∈ R²ˣ² represents a matrix of random values, A ∈ R²ˣ² indicates the real process matrix, and Â ∈ R²ˣ² represents the process matrix used in the control law established in Eq. (4); in the absence of modeling errors, A Â⁻¹ = I. Finally, errors in the sensor readings are considered, as shown in Fig. 4, so the following is established:

ĉ(t) = c(t) + γ(t)

(8)

Subsequently, for the robustness analysis, the following closed-loop equation is obtained from Fig. 4, considering the perturbations established in (6)–(8):

c̃˙ = −(A Â⁻¹ W c̃ + A ρ + γ˙)

(9)


Fig. 4. Closed-loop control scheme.

Applying Lyapunov's theory again, with the quadratic error as candidate function and evolving the system, the following expression is obtained:

V˙(c̃) = −c̃ᵀ A Â⁻¹ W c̃ − c̃ᵀ A ρ − c̃ᵀ γ˙

(10)

Since the gain matrix was established as positive definite in the stability analysis, the error varies at a rate bounded as follows:

‖c̃‖ ≤ (‖A ρ‖ + ‖γ˙‖) / (‖A Â⁻¹‖ |λmin(W)|)

(11)

Therefore, the error is bounded with (11).
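The nominal stability argument can be checked numerically: with W positive definite, the closed-loop error obeys the relation from the undisturbed analysis and V = ½ c̃ᵀc̃ must decrease monotonically. The gain matrix and initial error below are arbitrary test values.

```python
# Numerical check of the nominal Lyapunov argument: error dynamics
# c_tilde_dot = -W c_tilde with W symmetric positive definite.

W = [[2.0, 0.3], [0.3, 1.5]]      # symmetric, positive definite
e, dt = [0.4, -0.6], 0.001
V_prev = 0.5 * (e[0] ** 2 + e[1] ** 2)
monotone = True
for _ in range(5000):
    de = [-(W[0][0] * e[0] + W[0][1] * e[1]),
          -(W[1][0] * e[0] + W[1][1] * e[1])]
    e = [e[0] + dt * de[0], e[1] + dt * de[1]]
    V = 0.5 * (e[0] ** 2 + e[1] ** 2)
    monotone = monotone and V <= V_prev
    V_prev = V
# V shrinks at every step and the error converges to zero.
```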

4 Analysis and Results

Next, the implementation of the interactive and immersive virtual environment is presented. The simulation corresponding to the operation of the process is shown in Fig. 5, which depicts the industrial process with its different sections, among which the double-effect evaporator, the control panel, and the visualization screens stand out. The double-effect evaporator is intended to simulate the evolution of the process realistically, representing the variation of each concentration by a change of color tone in the outlet pipes of each tank: the color tone of concentration 1 turns orange as it increases, while the color tone of concentration 2 turns red as it increases. It can also represent the variation of steam flow in the process by a change of color tone in the steam flow pipes, which turn red as the steam flow increases; additionally, a steam-particle animation is shown in each tank.


Fig. 5. Simulation of the virtual environment.

For varying the desired values and visualizing the process behavior, a control panel is provided that allows modifying the desired parameters of each variable; likewise, the actual value of the concentrations is presented on the right side of the control panel, as shown in Fig. 6a. Figure 6b shows the behavior of the plant through graphs of the evolution of the process (desired value and actual value), the control error, and the control action. The system output reaches the desired value whenever this is feasible, since the double-effect evaporator is a series process.

Fig. 6. Graphical interface of the control process.

The graphs corresponding to the evolution of the process are presented in the Unity virtual environment and, additionally, in the Matlab program, demonstrating the bilateral communication between these programs. Figure 7a presents the evolution of concentration 1 and Fig. 7b the evolution of concentration 2, demonstrating the similarity with the graphs in Fig. 6b. The virtual environment was evaluated with a group of students from the Universidad de las Fuerzas Armadas ESPE with knowledge of immersive environment development, process control, and the Unity software. To run the program correctly, the participants first configured their computers to establish communication between Unity and Matlab. After executing the program, the participants completed a usability test to measure the degree of acceptance of the virtual environment. To measure the usability of the virtual environment in the current project, a System Usability Scale (SUS)


Fig. 7. Evolution of the process outputs with the controller.

was used; the score was 81.7% with respect to the usability of the digitalization of the process.
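For reference, the SUS aggregates ten 1–5 Likert responses: each odd-numbered item contributes (score − 1), each even-numbered item (5 − score), and the sum is scaled by 2.5 onto a 0–100 range. The scoring rule below is the standard SUS formula; the sample responses are invented and do not reproduce the study's 81.7% result.

```python
# Standard SUS scoring for a single participant's ten responses.

def sus_score(responses):
    """responses: ten answers in 1..5, item 1 first."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

one_participant = [5, 2, 4, 1, 5, 2, 4, 2, 4, 1]   # invented answers
score = sus_score(one_participant)
```

Averaging such per-participant scores over the whole group yields the study-level percentage.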

5 Conclusions

The virtual environment created in the Unity 3D graphics engine allows visualizing the evolution of the process based on the mathematical model determined by the heuristic method; the visualization of the animations and the effects produced during the evolution of the system facilitates and improves interaction with the user. The industrial process of a double-effect evaporator is a nonlinear system, because the matrix that represents the process is not constant; therefore the control law used is based on the inverse of the process matrix, which allows the system to reach the desired values despite the disturbances present, taking into account that the double-effect evaporator is a series process, since the second concentration depends on the first.

References

1. Farsi, M., Jahanmiri, A.: A new control algorithm for concentration control in three effect falling film evaporators. Iran. J. Sci. Technol. 387–396 (2009)
2. Pérez, M.P., Berná, J.G., Gil, J.C.: Modelo Análisis y Control de un Evaporador de Doble Efecto. XXV Jornadas de Automática (2014)
3. Runyon, C.H., Rumsey, T.R., McCarthy, K.L.: Dynamic simulation of a nonlinear model of a double effect evaporator. J. Food Eng. 185–201 (1991)
4. Cardona, J.E., Hoyos, J., Zapata, J.H.: Control de un evaporador de doble efecto por linealización instantánea usando redes neuronales artificiales. Scientia et Technica 97–102 (2007)
5. Miranda, V., Prada, C.: Controlador predictivo explícito de un evaporador (2006). http://www.ecosimpro.com/download/articles/ESC_2006_02_es.pdf
6. Núñez, Á., Tarifa, E., Farfán, N.: Simulación Estacionaria De Un Evaporador De Doble Efecto Para Jugo De Tomate. Avances en Ciencias e Ingeniería 2–3 (2011)


7. Moncada, L.: Automatización y control de procesos. Universidad Nacional de Trujillo, Trujillo (2005)
8. Russell, N.T., Bakker, H.C., Chaplin, R.I.: Comparison of dynamic models for an evaporation process. Trans. IChemE 78, 1120–1128 (2000)
9. Winchester, J.A., Marsh, C.: Dynamics and control of falling film evaporators with mechanical vapour recompression. Chem. Eng. Res. Design 357–371 (1999)
10. Ortega García, O.W.: Control Predictivo Basado en Modelo Aplicado a una Planta de Evaporación de la Industria Azucarera. Universidad Central "Marta Abreu" de Las Villas, Santa Clara (2014)
11. Ritter, R.A., Andre, H.: Evaporator control system design. Canad. J. Chem. Eng. 48(6), 696–701 (1970)
12. Suaréz-Santillán, R.A., Sepúlveda-Cervantes, G., Portilla-Flores, E.A., Calva-Yánez, M.B., Morales-Cruz, C., Hernández-Oliva, N.: Sintonización óptima de un controlador PD utilizando evolución diferencial con un modelo dinámico virtualizado en Unity 3D. Res. Comput. Sci. 148(8), 423–435 (2019)
13. Sarhadi, P., Jalil, A.: Introduction to virtual reality technology. Mech. Eng. 1, 22 (2005)
14. Cadena, M.S., Novillo, K., Ruiz, B., Ortiz, J.S., Rodríguez, D., Varela-Aldás, J.: Hardware in the loop simulation for soda can palletizing process applying Ethernet TCP/IP communication protocols based on Unity 3D. In: Advances in Information and Communication: Proceedings of the 2023 Future of Information and Communication Conference (FICC), vol. 1, pp. 558–569. Springer Nature Switzerland, Cham (2023)
15. Pico, L., Marroquín, F., Ortiz, J.S., Varela-Aldás, J.: Design and virtualization of an automated system with hardware in the loop for the virtualized water bottling process in a Unity environment. In: Advances in Information and Communication: Proceedings of the 2023 Future of Information and Communication Conference (FICC), vol. 1, pp. 570–577. Springer Nature Switzerland, Cham (2023)
16. Marquina Ramos, A.I.: Comparación de métodos de sintonización de un controlador PID para procesos industriales. Puebla (2018)
17. Hernández Torres, R.: Investigación Del Comportamiento De Controladores No Lineales En Un Evaporador De Doble Efecto. Instituto Superior Politécnico José Antonio Echeverría, La Habana (2007)

Shared Intentionality Before Birth: Emulating a Model of Mother-Fetus Communication for Developing Human-Machine Systems

Igor Val Danilov
Liepaja University, Liepaja, Latvia
[email protected]

Abstract. Numerous studies show that fetuses are capable of affective evaluation and are sensitive to the sounds and rhythms of another human presence. This indicates the emergence of fetal perception through intentional engagement with the environment: fetuses are able to select the relevant stimulus from a noisy environment filled with a cacophony of other stimuli, namely chemical interactions, pressure changes, and electromagnetic fields. Such an ability can emerge only through ecological learning. This theoretical study surveys the literature to understand which environmental features of the mother-fetus communication model enable a fetus to interact with the mother in ecological training. The objective is to design Human-Machine Systems and, specifically, computer-aided Medical Diagnosis systems based on the mother-fetus communication model. The article proposes a physiological mechanism of shared intentionality that relies on the impact of the mother's cardiac pulsed electromagnetic field (PEMF) on the adenosine receptors of both organisms. The study presents a concept design for future research to provide evidence for the mother-fetus communication model and to establish human-computer connectivity.

Keywords: Communication model · Human-machine systems · Shared intentionality

1 Introduction

In modern pediatrics, there is agreement that the fetus should be considered a conscious agent by twenty-four weeks. According to the received view in cognitive sciences, children's cognition develops through ecological learning. However, the model of interaction between mother and fetus in cognitive development remains under-studied. This article seeks to understand the model of mother-fetus communication for developing Human-Machine Systems. This section introduces the topic by examining the difficulties a fetus faces in interaction even for launching cognition, surveys the literature on features of the mother-fetus communication model, and poses the research problem.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 56–69, 2024. https://doi.org/10.1007/978-3-031-47715-7_5

1.1 Achievements of Fetuses

Ecological learning is the only way for fetuses to assimilate knowledge. Intentionality and perception are the prerequisites of cognition. The development of these abilities indicates their achievements in launching cognition. Research on the behavior of fetuses shows their perception in intentional engagement with the environment. Fetuses' motions appear intentional [1]. The intra-pair movements of twins show an even higher degree of accuracy [2, 3]. Single fetuses show the ability to recognize voices [4–8]. Fetuses distinguish a change in the gender of a speaker reading a sentence [8]. The fetus can discriminate speech sounds (e.g., "babi" and "biba") [9]. Fetuses can learn frequently heard sounds (e.g., music, voices) and flavors they experience in the womb [9]. Neuroscience research has revealed the neural mediation underlying the behavioral responses to language and voice stimuli [9]. At 33 weeks of gestation, activity increased in the left temporal lobe of the fetal brain when it was exposed to an unfamiliar female voice compared with pure tones [9, 10]. At 34 weeks of gestational age, the lower bank of the temporal lobe was significantly more active during exposure to the maternal voice than to an unfamiliar female voice [9, 10]. These data show the beginning of cognitive development as early as the first trimester of pregnancy. Could fetuses assimilate initial knowledge independently?

1.2 A Launching Cognition Hypothesis

The embodied dynamic system approach accounts for the problem of launching cognition. It addresses the problem of initial knowledge assimilation through information embodied in sensorimotor networks [11–13]. According to Trevarthen and Delafield-Butt [14], patterns of sensory regulation of movements of the fetus's body and limbs 'gain affective evaluation and sensitivity for sounds and rhythms of other human presence [14, p. 19]'. However, to be complete, the embodied dynamic approach needs to consider the organism's capacity for intentionality, the directed property of certain mental states [15] that manifests both an individual's conscious and unaware targeting [16].
In a multi-stimuli environment (a cacophony of chemical interactions, pressure changes, and electromagnetic fields), the association of a particular stimulus with a specific pattern of sensory regulation of movements [14] is unpredictable, because unrelated stimuli can be haphazardly associated with this embodied meaning. The nervous system 'regulates prospective perception in intentional engagement with the world [14, p. 22]'. While intentionality and perception are different psychological constructs, both deal with a holistic representation of an object when applied to the material world. Intentionality and perception are co-dependent components of ecological learning. Therefore, the article studies the appearance of perception in intentional engagement with the environment, since the central issue of how fetuses assimilate initial knowledge is how they become intentional in distinguishing the proper stimulus from a cacophony of chemical interactions, pressure changes, and electromagnetic fields. The literature survey reveals two questions that lead to understanding the appearance of perception.

1.3 Two Questions About Perception

There is agreement in cognitive sciences that perception shapes a holistic representation of an object through the organization, identification, and interpretation of sensory information [17]. From the perspective of cognitive psychology, there is a question of
perception emergence in organisms at the reflex substage of the sensorimotor stage of development: the binding problem [18]. The binding problem highlights the obstacles young organisms face in independently coordinating simultaneous sensory inputs of different modalities [18], challenging their autonomy in the independent appearance of perception. One component of this problem, the so-called perception stability issue, highlights the difficulty young children have in grasping holistic representations of objects similar to those of adults, given the difference in sensory capabilities between these organisms. Physiology shows lower sensory capabilities in newborns due to the underdevelopment of the organs involved in sensory processing. Reduced sensing challenges young children's autonomy to grasp percepts similar to those of adults. Neuroscience poses one more question about perception in organisms. According to the received view, perception is grounded in the ecological mechanism by which neurons are shaped into networks by particular stimuli. A variation of Long-Term Potentiation (LTP) or Long-Term Depression (LTD) has been described in pyramidal neurons that involves the pairing of pre- and postsynaptic action potentials (APs), known as spike-timing-dependent plasticity (STDP) [19–23]. In this process, the timing between pre- and postsynaptic APs modulates synaptic strength, triggering LTP or LTD [21]. The sign and magnitude of the change in synaptic strength depend on the relative timing between spikes of two connected neurons (the pre- and postsynaptic neuron) [19, 24]. Thus, neurons know the timing code for modulating a particular synaptic strength, which then triggers either LTP or LTD [19]. From this perspective, they modulate a particular synaptic strength in correspondence with the engagement of the emotional and sensorimotor networks.
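The pair-based spike-timing rule just described can be sketched in a few lines. This is an illustrative, textbook-style STDP rule, not code from the cited studies; the function name and the amplitude and time-constant parameters are hypothetical.

```python
import math

def stdp_weight_change(dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Pair-based STDP rule: dt_ms = t_post - t_pre (milliseconds).

    Pre-before-post spiking (dt_ms > 0) strengthens the synapse (LTP);
    post-before-pre spiking (dt_ms < 0) weakens it (LTD). The magnitude
    decays exponentially as the spike-timing interval grows.
    """
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_ms)    # potentiation (LTP)
    if dt_ms < 0:
        return -a_minus * math.exp(dt_ms / tau_ms)   # depression (LTD)
    return 0.0

# Pre fires 5 ms before post: potentiation; reversed order: depression.
print(stdp_weight_change(5.0))   # positive value
print(stdp_weight_change(-5.0))  # negative value
```

The sign of the change encodes the relative spike timing, which is exactly the "timing code" the text says neurons must somehow acquire in correspondence to particular stimuli.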
How neurons learn the structural organization of the excitatory inputs supporting STDP in correspondence to particular stimuli remains unknown [19]. So, at the reflex substage of the sensorimotor stage of development, while fetuses seem to possess intentionality and perception, these two problems of perception emergence challenge the autonomy of young organisms in solving them. How can fetuses gain the quality of perception independently, dwelling in the womb without any chance of ecological learning with the external environment via sensory cues? If any kind of bilateral communication were possible between the mother and the fetus, it would explain the ecological training of the fetus in interaction with the mother. However, from the physiological viewpoint, meaningful sensory interaction between these organisms is impossible. Moreover, even supposing fetuses could somehow learn from feedback by interacting with the external world via sensory cues, neuroscience shows a gap in knowledge about the autonomy of ecological learning in any environment. How can the relevant sensory stimulus be associated with the neurons' quality of STDP without any clue or template? At the neuronal level, any interaction between the mother and the fetus is excluded because their nervous systems are independent. Regarding the mother-fetus communication model, the data show that young organisms 'gain affective evaluation and sensitivity for sounds and rhythms of other human presence [14, p. 19]' before understanding their meanings and without sensory cues.

1.4 Shared Intentionality

After birth, newborns meet the same problems: they still lack communication skills, since they are at the same reflex substage of the sensorimotor stage of development
as fetuses [25]. However, research on newborns' behavior [26–45] also shows achievements of these young organisms that appear intentional: a reaction to the crying of another newborn [26–30], early imitation [31–34], the other-race effect [35, 36], recognizing faces by their parts [37], recognizing faces without their context in schematic pictures [38, 39], sensitivity to facial attractiveness [40], distinguishing the mother from a stranger [41–44], the other-species effect, and others [45]. Again, newborns show these achievements despite a lack of communication skills. After Piaget, we know that, at the reflex substage of the sensorimotor stage of development, organisms can only demonstrate pure reflexes when they independently react to stimuli. This means that newborns still cannot convey meaningful interaction through sensory cues. Because the mother-fetus and mother-newborn communication models seem to be similar, we can look at communication between the mother and the child from both sides of birth. Indeed, these two communication models describe interactions between organisms at the same stage of development; they are distinguished only by the important hours of birth and the different environments of the young organism. In terms of the mother-newborn model, there is broad consensus that the cognitive achievements of neonates are due to protoconversation, an interaction that conveys meanings before the young child begins to communicate through symbols [46–50]. Tomasello [50] argues that protoconversation occurs due to the young child's primary motive force of shared intentionality. According to the received view in cognitive sciences, shared intentionality is a putatively human-unique motivation to share mental states with others [15, 51, 52]. Interestingly, shared intentionality seems to complement the modalities of social interaction in adulthood.
Psychophysiological research has shown increased implicit coordination [53–56], joint commitment [57], and performance [53, 55, 57–59] of subjects due to shared intentionality when they solved identical tasks without communication. In recent years, hyperscanning neuroscience research has shed light on the neurophysiological processes occurring during shared intentionality: studies found increased inter-brain coordinated activity in pairs performing a similar mental task without interaction, in contrast to the condition in which subjects accomplished a similar intellectual task alone [60–64]. To accommodate this empirical evidence, recent articles have proposed redefining shared intentionality as collaborative interaction in which an organism chooses a shared relevant stimulus from many irrelevant ones due to the intentionality of another organism familiar with the environment [25, 59, 65–67]. This pre-perceptual communication via nonlocal neuronal coupling allows ecological training, starting at the reflex substage of development, for processing the organization, identification, and interpretation of sensory information in developing perception [25, 59, 65–67]. In terms of the communicative model, shared intentionality is the only quality that can enable protoconversation between a newborn and a mother. It remains unclear whether the fetus is also involved in interaction with the mother via shared intentionality, or whether this quality appears only after birth. Therefore, the research problem is: what environmental features of the communication model enable a fetus to interact with the mother in ecological training for acquiring the skills necessary to develop perception?


2 Objective

The analysis shows that shared intentionality via nonlocal cell coupling in the mother-fetus interaction is a solution for developing 'intentional engagement with the world [14, p. 22]' in the fetus. If the intrinsic physiological features of the mother's body that enable shared intentionality with the fetus were defined, this would support the hypothesis of such an interaction between them, providing their simultaneous reaction to sensory cues. The central question concerns the environmental features that would predictably enable bioengineering systems to emulate this model. This knowledge would allow the development of a technique that can predictably emulate the model of mother-fetus communication for developing a non-invasive method of interaction within and between human-computer systems. Therefore, the objective is to reveal the environmental features of the physiological mechanism of shared intentionality for designing Human-Machine Systems and, specifically, for computer-aided Medical Diagnosis of preverbal children.

3 Method

This theoretical article synthesizes relevant observations from computer science, cognitive psychology, physiology, and neuroscience to reason toward a conclusion about the modus operandi of shared intentionality. Specifically, it surveys research on the impact of a pulsed electromagnetic field (PEMF) on A(1) and A(2A) adenosine receptors (ARs) in the cerebral cortex and cortical neurons. Finally, the article proposes a conceptual design for further research on human-computer interaction for various applications to Human-Machine Systems and, specifically, to Medical Diagnosis.

4 Results

4.1 Pulsed Electromagnetic Field

According to Faraday's law, a time-varying electromagnetic field can induce internal currents in the space wherever it changes, for example in brain tissues [68, 69]. Magnetic fields (MF) and electromagnetic fields might share a similar mechanism, such as generating internal currents [68, 69]. Charged particles exist among proteins and cells; for example, K+ is the dominant positive ion in cells [68]. Hence, applying an MF externally could affect internal particles and may thereby alter biological processes in cells [68]. PEMF produces signals that the brain can perceive more easily and delivers a large amount of energy in short bursts at a lower average energy level [68, 70, 71]. A growing body of literature shows different applications of PEMF in treatment [72–76], and PEMF has been found to be effective when its frequency ranges from 0.16 Hz to 480 Hz and its intensity ranges from 0.6 to 250 mT [68, 77]. One Hertz (Hz) equals one vibration per second. Normal human heart rates vary from about 45 to 220 beats/minute; in Hertz, that is from 0.75 to 3.67 Hz. Recent neuroscience studies have shown a protective effect of PEMF, which was able to reduce the death of neurons [78]. Exposure to PEMF alters neuroprocessing in human brains [79, 80].
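The heart-rate conversion above is simple arithmetic; a minimal sketch (the function and constant names are illustrative) checks that the cardiac frequency range falls inside the effective PEMF band reported in [68, 77]:

```python
def bpm_to_hz(beats_per_minute: float) -> float:
    """Convert a heart rate in beats per minute to its fundamental frequency in Hz."""
    return beats_per_minute / 60.0

# Effective PEMF frequency band reported in the text [68, 77], in Hz.
PEMF_BAND_HZ = (0.16, 480.0)

for bpm in (45, 220):
    f = bpm_to_hz(bpm)
    in_band = PEMF_BAND_HZ[0] <= f <= PEMF_BAND_HZ[1]
    print(f"{bpm} beats/min = {f:.2f} Hz, inside effective PEMF band: {in_band}")
```

Both endpoints of the normal heart-rate range (0.75 Hz and about 3.67 Hz) land well inside the reported effective band.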


4.2 The Effect of PEMF on A(2A) ARs

Adenosine is an omnipresent homeostatic substance released from most cells, including glia and neurons [81]. Adenosine changes cell functioning by operating G-protein-coupled receptors (GPCRs; A(1), A(2A), A(2B), A(3)) that can inhibit (A(1)) or enhance (A(2A)) neuronal interaction [81]. Interactions between adenosine receptors and other G-protein-coupled receptors, ionotropic receptors, and receptors for neurotrophins contribute to fine-tuning neuronal function [81]. Manipulating adenosine receptors influences arousal and sleep, memory and cognition, and neuronal maturation [81, 82]. Adenosine is a fine-tuning modulator of neuronal activity that provokes harmonic actions on neuronal activity [81]. In the nervous system, the adenosine receptors (ARs) act as a kind of "go-between" to regulate the release of neurotransmitters and the action of neuromodulators [82]. Receptor-receptor interactions and AR-transporter interplay occur as part of adenosine's attempt to control synaptic transmission [82]. A(2A) ARs are more common in the striatum, and A(1) ARs in the hippocampus; both receptors interfere with the efficiency and plasticity of synaptic transmission in most brain areas [82]. The omnipresence of adenosine and of A(2A) and A(1) ARs in all nervous system cells (neurons and glia), together with the intensive release of adenosine following insults, makes adenosine a kind of "maestro" of the tripartite synapse in the homeostatic coordination of brain function [82]. Research has observed the effect of PEMF on A(1) and A(2A) ARs in the rat cerebral cortex and cortical neurons [83]. PEMF treatment of the intact rat cerebral cortex or cortical neurons at 1.5 mT mediated a transient and significant increase in A(2A) ARs after 4 h (2.0-fold increase) and 6 h (1.4- and 1.8-fold increase, respectively) of exposure [83].
In addition, PEMF treatment of the rat cerebral cortex and rat cortical neurons at 3 mT upregulated A(2A) ARs after 2 h (2.0- and 2.2-fold increase, respectively) and 4 h (1.6- and 1.9-fold increase, respectively) [83]. Treatment of rat cortex membranes with PEMF at 1.5 and 3 mT induced an increase in A(2A) AR density after 2 h (1.9- and 2.2-fold increase, respectively) that was constant at all incubation times investigated [83].

4.3 Computer-Aided Assessment of Shared Intentionality

A recent study assessed the magnitude of shared intentionality in mother-child dyads with children aged 3–6 years (n = 15) [59]. The experiment emulated the mother-newborn communication model in the subjects [59]. Using the subjects' smartphones, the bioengineering system (mother-child-smartphone) generated PEMF at wavelengths of 400 and 700 nm, alternating at a frequency of 1.3 Hz (80 pulses per minute), with a pulse duration of 650 ms and a period of 1300 ms [59]. The flash-chat of this bioengineered system is shown in Fig. 1.
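The reported stimulus parameters can be collected in a small structure for reference. This is an illustrative sketch: the class and field names are hypothetical, and only the values stated above are taken from [59].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlashStimulus:
    """Pulse parameters of the smartphone-generated stimulus reported in [59]."""
    wavelengths_nm: tuple = (400, 700)  # alternating stimulus wavelengths
    frequency_hz: float = 1.3           # stated pulse frequency (80 pulses/min)
    pulse_ms: float = 650.0             # duration of each pulse
    period_ms: float = 1300.0           # stated period

    @property
    def duty_cycle(self) -> float:
        """Fraction of each period during which the stimulus is on."""
        return self.pulse_ms / self.period_ms

stimulus = FlashStimulus()
print(stimulus.duty_cycle)  # 0.5: the stimulus is on for half of each period
```

With a 650 ms pulse in a 1300 ms period, the stimulus has a 50% duty cycle.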


Fig. 1. The flash-chat of the bioengineering system [59].

5 Discussion

Recent neuroscience research shows that exposure to PEMF changes neuroprocessing in human brains [79, 80, 83]. Studies show that adenosine governs the tripartite synapse in the homeostatic coordination of brain function [82, 83]. Experiments detected the effect of PEMF on A(1) and A(2A) ARs in the rat cerebral cortex and cortical neurons [83]. PEMF is effective when its frequency ranges from 0.16 Hz to 480 Hz and its intensity ranges from 0.6 to 250 mT [68, 77]. However, even using smartphones, other research [59] successfully generated PEMF for assessing shared intentionality in mother-child pairs. The intensity of the electromagnetic field at a distance of 20 cm from a monitor (of a computer or smartphone) is no more than 1.5 × 10⁻⁷ T (a range of 1.3–1.5 mG, i.e., 0.00013–0.00015 mT) [84]. For comparison, the intensity of solar sunspots is about 0.3 T. Experiments on the PEMF effect emitted fields with intensities from 0.6 to 250 mT [68, 77]. That is, the research [59] stimulated shared intentionality in the subjects with a field more than a thousand times weaker. Research [59] showed that even this weaker PEMF stimulates shared intentionality in subjects. Can the still weaker PEMF of the heart stimulate shared intentionality in subjects? Normal human heart rates vary from about 45 to 220 beats/minute (in Hertz, from 0.75 to 3.67 Hz). The heart's field frequency thus falls within the range from 0.16 Hz to 480 Hz reported in recent research as the effective band [68, 77]. In terms of field frequency, this fact supports the efficacy of the heart's PEMF for neuronal stimulation. The heart's magnetic field is the strongest rhythmic field in the human body [85]. However, even this strongest field of the human body is weaker than the weak electromagnetic field of a smartphone. The currents set up in the chest by the electromotive forces of the heart produce magnetic fields at the chest surface, which have a peak
intensity of about 10⁻¹¹ T (one microgauss) [86]. The field strength at the surface of the body is between 10⁻¹¹ T and 10⁻¹⁴ T [87]. At the same time, according to McCraty et al. [88], even this weak field affects another person's heart activity when people touch or are in proximity [88]. Research reported that one person's electrocardiogram (ECG) signal was registered in another person's electroencephalogram (EEG) and elsewhere on the other person's body [88]. While this signal was most potent when people were in contact, it was still detectable when subjects were in proximity without contact [88]. From this perspective, the heart's PEMF is felt by other organisms in proximity and, therefore, can alter neuroprocessing in other human brains in the same way as the stronger PEMF with intensities from 0.6 to 250 mT has done [88]. In these terms, the heart's PEMF is a stimulus for shared intentionality in the prenatal period, and the more robust electromagnetic field of a smartphone is even more so. The heart's PEMF is a possible candidate for the crucial environmental feature that predictably stimulates shared intentionality between the mother and the fetus, also because of other empirical research data on fetal development: recent research showed that a mother's heartbeat sounds elicit auditory plasticity in the fetal brain before full gestation [89]. The article proposes the physiological mechanism of shared intentionality that relies on the impact of the mother's cardiac PEMF on adenosine receptors. The hypothesis is that, under specific conditions, the mother's heart PEMF mediates a transient and significant increase in A(2A) adenosine receptor neuronal communication in both the fetus's brain and her own.
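Taking the field intensities quoted above at face value, a quick order-of-magnitude comparison (the constant names are illustrative) makes the disparity concrete:

```python
# Field intensities quoted in the text, in tesla (taken at face value).
SMARTPHONE_FIELD_T = 1.5e-7       # at ~20 cm from a monitor [84]
PEMF_EXPERIMENT_MIN_T = 0.6e-3    # lower bound of the effective range [68, 77]
HEART_SURFACE_PEAK_T = 1e-11      # peak cardiac field at the chest surface [86]

# The experimental PEMF is thousands of times stronger than a smartphone's field,
# and the smartphone's field is in turn far stronger than the heart's surface field.
print(PEMF_EXPERIMENT_MIN_T / SMARTPHONE_FIELD_T)   # roughly 4000
print(SMARTPHONE_FIELD_T / HEART_SURFACE_PEAK_T)    # roughly 15000
```

The hypothesis thus rests on the claim that effects already observed with a field thousands of times weaker than the laboratory PEMF range extend further down to the still weaker cardiac field.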
Adenosine modifies cell functioning by operating G-protein-coupled receptors (GPCRs; A(1), A(2A), A(2B), A(3)) that can enhance neuronal communication [81], since A(2A) has an important role in the brain, regulating the release of other neurotransmitters such as dopamine and glutamate. Interactions between adenosine receptors and other G-protein-coupled receptors, ionotropic receptors, and receptors for neurotrophins also occur, and these relations contribute to the fine-tuning of neuronal function [81]. These simultaneous processes in the mother's and fetus's brains in a common ecological situation, when solving the same task, provide the ecological learning of the fetus. If so, this knowledge sheds light on the above-noted question about the appearance of perception: how the fetus's neurons grasp the structural organization of the excitatory inputs supporting STDP in correspondence to particular stimuli. A technique based on the PEMF effect would also inform the design of Human-Machine Systems. Further quantitative research can provide evidence for the mother-fetus communication model. The mother's emotional stimulation during sound stimuli would become the experiment's independent variable in further research; sound stimuli without emotional stimulation would serve as the baseline. The research would thus observe the association between sound stimuli and fetuses' reactions under the two conditions. In addition, the concentration of the endogenous nucleoside adenosine seems to be modulated by PEMF; measuring its concentration in the mothers during the experiments can shed light on the mechanism of shared intentionality. Oxytocin, cortisol, glutamate, and dopamine concentrations in the mother influence social entrainment and performance [90]. The research can also observe the association of hormone concentrations in the mothers with the fetus's performance during the experiment. The mother-fetus interaction is a crucial component of fetal cognitive development. Knowledge about the mother-fetus communication model
can also help in understanding the causes of innate cognitive issues. This research also provides evidence for the nonlocal coupling of neurons enabled by the effect of the heart's PEMF on neuroprocessing in organisms dwelling in interpersonal dynamics. The PEMF-effect technique for nonlocal cell coupling can establish human-computer connectivity if further research defines the computer interface features (in the PEMF range from 0.75 to 3.67 Hz) for this nonlocal connection.

6 Conclusion

This theoretical article showed the ability of fetuses to evaluate, and to be sensitive to, sounds and rhythms from outside the womb. This indicates the emergence of fetal perception through intentional engagement with the environment: fetuses are able to select the relevant stimulus from a noisy environment with a cacophony of other stimuli, namely chemical interactions, pressure changes, and electromagnetic fields. Such an ability can emerge only through ecological learning. The study surveyed the literature to understand which environmental features of the mother-fetus communication model enable a fetus to interact with the mother in ecological training. The objective was to design Human-Machine Systems and, specifically, computer-aided Medical Diagnosis systems based on the mother-fetus communication model. The article proposed a physiological mechanism of shared intentionality that relies on the impact of the mother's cardiac PEMF on the adenosine receptors of both organisms. The hypothesis is that, under specific conditions, the mother's heart PEMF mediates a transient and significant increase in A(2A) adenosine receptor neuronal communication in both the fetus's brain and her own. Adenosine modifies cell functioning by operating G-protein-coupled receptors (GPCRs; A(1), A(2A), A(2B), A(3)) that can enhance neuronal communication [81], since A(2A) has an important role in the brain, regulating the release of other neurotransmitters such as glutamate and dopamine. Interactions between adenosine receptors and other G-protein-coupled receptors, ionotropic receptors, and receptors for neurotrophins also occur, and this contributes to the fine-tuning of neuronal function [81]. These simultaneous processes in the mother's and fetus's brains in a common ecological situation, when solving the same task, provide the ecological learning of the fetus. If so, this knowledge sheds light on the two above-noted questions about the appearance of perception: the binding problem and how the fetus's neurons learn the structural organization of the excitatory inputs supporting STDP in correspondence to particular stimuli. A technique based on the PEMF effect would also inform the design of Human-Machine Systems. The study created a concept design for future research to provide evidence for the mother-fetus communication model and to establish human-computer connectivity.

References

1. Zoia, S., et al.: Evidence of early development of action planning in the human foetus: a kinematic study. Exp. Brain Res. 176, 217–226 (2007)
2. Castiello, U., et al.: Wired to be social: the ontogeny of human interaction. PLoS ONE 5(10), e13199 (2010)


3. Sasaki, M., Yanagihara, T., Naitoh, N., Hata, T.: Four-dimensional sonographic assessment of inter-twin contact late in the first trimester. Int. J. Gynecol. Obstet. 108(2), 104–107 (2010)
4. Kisilevsky, B.S.: Fetal auditory processing: implications for language development? In: Fetal Development: Research on Brain and Behavior, Environmental Influences, and Emerging Technologies, pp. 133–152 (2016)
5. Hepper, P.G., Scott, D., Shahidullah, S.: Newborn and fetal response to maternal voice. J. Reprod. Infant Psychol. 11, 147–153 (1993)
6. Lee, G.Y.C., Kisilevsky, B.S.: Fetuses respond to father's voice but prefer mother's voice after birth. Dev. Psychobiol. 56, 1–11 (2014)
7. Krueger, C.A., Cave, E.C., Garvan, C.: Fetal response to live and recorded maternal speech. Biol. Res. Nurs. 17, 112–120 (2015)
8. Lecanuet, J.P., Granier-Deferre, C., Jacquet, A.Y., Capponi, I., Ledru, L.: Prenatal discrimination of a male and a female voice uttering the same sentence. Early Dev. Parent. 2(4), 217–228 (1993)
9. Hepper, P.: Behavior during the prenatal period: adaptive for development and survival. Child Dev. Perspect. 9(1), 38–43 (2015). https://doi.org/10.1111/cdep.12104
10. Jardri, R., Houfflin-Debarge, V., Delion, P., Pruvo, J.-P., Thomas, P., Pins, D.: Assessing fetal response to maternal speech using a noninvasive functional brain imaging technique. Int. J. Dev. Neurosci. 30, 159–161 (2012). https://doi.org/10.1016/j.ijdevneu.2011.11.002
11. Thompson, E.: Mind in Life: Biology, Phenomenology, and the Sciences of Mind. Harvard University Press (2010)
12. Varela, F.J.: Principles of Biological Autonomy (1979). ISBN-13: 978-0135009505
13. Bourgine, P., Varela, F.J.: Towards a practice of autonomous systems. In: Toward a Practice of Autonomous Systems: Proceedings of the First European Conference on Artificial Life, pp. xi–xvii. MIT Press, Cambridge, MA (1992)
14. Trevarthen, C., Delafield-Butt, J.: Development of consciousness. In: Hopkins, B., Geangu, E., Linkenauger, S. (eds.) Cambridge Encyclopedia of Child Development, 2nd edn., pp. 821–835. Cambridge University Press, Cambridge (2017). ISBN: 9781107103412. https://core.ac.uk/download/pdf/77034048.pdf. Accessed 02 Mar 2023
15. Searle, J.R.: The Rediscovery of the Mind. MIT Press, London (1992)
16. Bargh, J.A.: Goal and intent: goal-directed thought and behavior are often unintentional. Psychol. Inq. 1(3), 248–251 (1990)
17. Gregory, R.: Perception. In: Gregory, Zangwill (eds.), pp. 598–601 (1987)
18. Treisman, A.: Solutions to the binding problem: progress through controversy and convergence. Neuron 24(1), 105–125 (1999)
19. Tazerart, S., Mitchell, D.E., Miranda-Rottmann, S., Araya, R.: A spike-timing-dependent plasticity rule for dendritic spines. Nat. Commun. 11, 4276 (2020)
20. Feldman, D.E.: The spike-timing dependence of plasticity. Neuron 75, 556–571 (2012)
21. Bi, G.Q., Poo, M.M.: Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci. 18, 10464–10472 (1998)
22. Debanne, D., Gähwiler, B.H., Thompson, S.M.: Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol. (Lond.) 507, 237–247 (1998)
23. Sjöström, P.J., Turrigiano, G.G., Nelson, S.B.: Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron 32, 1149–1164 (2001)
24. Zhang, L.I., Tao, H.W., Holt, C.E., Harris, W.A., Poo, M.M.: A critical window for cooperation and competition among developing retinotectal synapses. Nature 395, 37–44 (1998)


25. Val Danilov, I.: Smartphone in detecting developmental disability in infancy: a theoretical approach to shared intentionality for assessment tool of cognitive decline and e-learning. In: Arai, K. (ed.) Proceedings of the SAI 2022, LNNS 508, pp. 1–11. Springer Nature Switzerland AG (2022). https://doi.org/10.1007/978-3-031-10467-1_19. Accessed 02 Mar 2023
26. Geangu, E., Benga, O., Stahl, D., Striano, T.: Contagious crying beyond the first days of life. Infant Behav. Dev. 33, 279–288 (2010)
27. Dondi, M., Simion, F., Caltran, G.: Can newborns discriminate between their own cry and the cry of another newborn infant? Dev. Psychol. 35, 418 (1999)
28. Martin, G.B., Clark, R.D.: Distress crying in neonates: species and peer specificity. Dev. Psychol. 18, 3–9 (1982)
29. Sagi, A., Hoffman, M.L.: Empathic distress in the newborn. Dev. Psychol. 12, 175–176 (1976)
30. Simner, M.L.: Newborn's response to the cry of another infant. Dev. Psychol. 5, 136–150 (1971)
31. Meltzoff, A.N.: Imitation and other minds: the 'like me' hypothesis. In: Hurley, S., Chater, N. (eds.) Perspectives on Imitation: From Neuroscience to Social Science, vol. 2, pp. 55–78. MIT Press, Cambridge (2005)
32. Meltzoff, A.N., Moore, M.K.: Imitation of facial and manual gestures by human neonates. Science 198, 75–78 (1977)
33. Nagy, E., Pilling, K., Orvos, H., Molnar, P.: Imitation of tongue protrusion in human neonates: specificity of the response in a large sample. Dev. Psychol. 49, 1628–1638 (2013)
34. Simpson, E.A., Murray, L., Paukner, A., Ferrari, P.F.: The mirror neuron system as revealed through neonatal imitation: presence from birth, predictive power and evidence of plasticity. Philos. Trans. R. Soc. B Biol. Sci. 369, 20130289 (2014)
35. Pascalis, O., de Schonen, S.: Recognition memory in 3- to 4-day-old human neonates. NeuroReport 5, 1721–1724 (1994)
36. Kelly, D.J., et al.: Cross-race preferences for same-race faces extend beyond the African versus Caucasian contrast in 3-month-old infants. Infancy 11, 87–95 (2007)
37. Simion, F., Leo, I., Turati, C., Valenza, E., Dalla Barba, B.: How face specialization emerges in the first months of life. Prog. Brain Res. 164, 169–185 (2007)
38. Goren, C.C., Sarty, M., Wu, P.Y.K.: Visual following and pattern discrimination of face-like stimuli by newborn infants. Pediatrics 56, 544–549 (1975)
39. Johnson, M.H., Dziurawiec, S., Ellis, H., Morton, J.: Newborns' preferential tracking of face-like stimuli and its subsequent decline. Cognition 40, 1–19 (1991)
40. Quinn, P.C., Kelly, D.J., Lee, K., Pascalis, O., Slater, A.M.: Preference for attractive faces in human infants extends beyond conspecifics. Dev. Sci. 11, 76–83 (2008)
41. Bushnell, I.W.R.: Mother's face recognition in newborn infants: learning and memory. Infant Child Dev. 10, 67–74 (2001)
42. Bushnell, I.W.R., Sai, F., Mullin, J.T.: Neonatal recognition of the mother's face. Br. J. Dev. Psychol. 7, 3–15 (1989)
43. Field, T.M., Cohen, D., Garcia, R., Greenberg, R.: Mother–stranger face discrimination by the newborn. Infant Behav. Dev. 7, 19–25 (1984)
44. Pascalis, O., de Schonen, S., Morton, J., Deruelle, C., Fabre-Grenet, M.: Mother's face recognition in neonates: a replication and an extension. Infant Behav. Dev. 17, 79–85 (1995)
45. Val Danilov, I.: Social interaction in knowledge acquisition: advanced curriculum. Critical review of studies relevant to social behavior of infants. J. High. Educ. Theory Pract. 20(12) (2020)
46. Gopnik, A.: Index of authors. J. Child Lang. 8, 495–499 (1981)
47. Trevarthen, C.: Signs before speech. In: Sebeok, T.A., Umiker-Sebeok, J. (eds.) The Semiotic Web, pp. 689–756. De Gruyter Mouton, Berlin and Boston (1989). https://doi.org/10.1515/9783110874099.689

Shared Intentionality Before Birth: Emulating

67

48. Yingling, J.M.: Does that mean ‘no’? Negotiating proto-conversation in infant-caregiver pairs. Res. Lang. Soc. Interact. 24, 71–108 (1990). https://doi.org/10.1080/08351819009389333 49. Bråten, S., Manstead, A., Oatley, K. (eds.): Intersubjective Communication and Emotion in Early Ontogeny, No. 3. Cambridge University Press, Cambridge (1998) 50. Tomasello, M.: Becoming Human: A Theory of Ontogeny. Belknap Press of Harvard University Press, Harvard (2019). https://doi.org/10.4159/9780674988651 51. Gilbert, M.: On Social Facts. Routledge, New York (1989) 52. Tuomela, R.: The Importance of Us. SUP. Stanford University Press, Stanford, CA (1995) 53. Atmaca, S., Sebanz, N., Prinz, W., Knoblich, G.: Action co-representation: the joint SNARC effect. Soc. Neurosci. 3(3–4), 410–420 (2008). https://doi.org/10.1080/17470910801900908 54. Shteynberg, G., Galinsky, A.D.: Implicit coordination: sharing goals with similar others intensifies goal pursuit. J. Exp Soc Psychol 47(6), 1291–1294 (2011). ISSN 0022–1031. https:// doi.org/10.1016/j.jesp.2011.04.012 55. Reddish, P., Fischer, R., Bulbulia, J.: Let’s dance together: synchrony, shared intentionality and cooperation. PLoS ONE 8(8), e71182 (2013). https://doi.org/10.1371/journal.pone.007 1182 56. McClung, J.S., Placì, S., Bangerter, A., Clément, F., Bshary, R.: The language of cooperation: shared intentionality drives variation in helping as a function of group membership. Proc. R. Soc. B 284, 20171682 (2017). https://doi.org/10.1098/rspb.2017.1682 57. Tang, N., Gong, S., Zhao, M., Gu, C., Zhou, J., Shen, M., Gao, T.: Exploring an imagined “We” in human collective hunting: In: Joint Commitment within Shared Intentionality, Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 2022, pp. 44 (2022). https://esc holarship.org/uc/item/3wj722pb 58. 
Val Danilov, I., Mihailova, S., Perepjolkina, V.: Unconscious social interaction, Coherent intelligence in Learning, Proceedings of the 12th Annual Conference ICERI Seville (Spain), pp. 2217–2222 (2019). https://doi.org/10.21125/iceri.2019.0606 59. Val Danilov, I., Mihailova, S., Svajyan, A.: Computerized assessment of cognitive development in neurotypical and neurodivergent children. OBM Neurobiol 6(3), 18 (2022). https://doi.org/10.21926/obm.neurobiol.2203137. https://www.lidsen.com/journals/ neurobiology/neurobiology-06-03-137. Accessed 02 Mar 2023 60. Fishburn, F.A., et al.: Putting our heads together: interpersonal neural synchronization as a biological mechanism for shared intentionality. Soc. Cogn. Affect. Neurosci. 13(8), 841–849 (2018) 61. Astolfi, L., et al.: Neuroelectrical hyperscanning measures simultaneous brain activity in humans. Brain Topogr. 23, 243–256 (2010) 62. Szymanski, C., et al.: Teams on the same wavelength perform better: inter-brain phase synchronization constitutes a neural substrate for social facilitation. Neuroimage 152, 425–436 (2017) 63. Hu, Y., Pan, Y., Shi, X., Cai, Q., Li, X., Cheng, X.: Inter-brain synchrony and cooperation context in interactive decision making. Biol. Psychol. 133, 54–62 (2018) 64. Painter, D.R., Kim, J.J., Renton, A.I., Mattingley, J.B.: Joint control of visually guided actions involves concordant increases in behavioural and neural coupling. Commun. Biol. 29, 4(1), 1–5 (2021). https://doi.org/10.1038/s42003-021-02319-3 65. Val Danilov, I., Mihailova, S.: A new perspective on assessing cognition in children through estimating shared intentionality. J. Intell. 10, 21 (2022). https://doi.org/10.3390/jintelligenc e10020021 66. Val Danilov, I., Mihailova, S.: Neuronal coherence agent for shared intentionality: a hypothesis of neurobiological processes occurring during social interaction. OBM Neurobiol. 5(4), 26 (2021). https://doi.org/10.21926/obm.neurobiol.2104113

68

I. Val Danilov

67. Val Danilov, I.: Advanced artificial intelligence in contactless human-computer systems through shared intentionality. In: Arai, K. (ed.) Advances in Information and Communication. FICC 2022. Lecture Notes in Networks and Systems, vol. 438. Springer, Cham. (2022). https://doi.org/10.1007/978-3-030-98012-2_54 68. Huang, M., Li, P., Chen, F., Cai, Z., Yang, S., Zheng, X., Li, W.: Is extremely low frequency pulsed electromagnetic fields applicable to gliomas? A literature review of the underlying mechanisms and application of extremely low frequency pulsed electromagnetic fields. Cancer Med. (2022) 69. Barker, A.T., Jalinous, R., Freeston, I.L.: Non-invasive magnetic stimulation of human motor cortex. Lancet (London, England) 1(8437), 1106–1107 (1985). https://doi.org/10.1016/ s0140-6736(85)92413-4 70. Frey, A.H.: Differential biologic effects of pulsed and continuous electromagnetic fields and mechanisms of effect. Ann. N. Y. Acad. Sci. 238, 273–279 (1974). https://doi.org/10.1111/j. 1749-6632.1974.tb26796.x 71. Liboff, A.R., Jenrow, K.A.: Physical mechanisms in neuroelectromagnetic therapies. NeuroRehabilitation 17(1), 9–22 (2002) 72. Larsen, E.R., Licht, R.W., Nielsen, R.E., et al.: Transcranial pulsed electromagnetic fields for treatment-resistant depression: a multicenter 8-week single-arm cohort study. Eur. Psychiatry 63(1), e18 (2020). https://doi.org/10.1192/j.eurpsy.2020.3 73. Bagnato, G.L., Miceli, G., Marino, N., Sciortino, D., Bagnato, G.F.: Pulsed electromagnetic fields in knee osteoarthritis: a double blind, placebo-controlled, randomized clinical trial. Rheumatology (Oxford) 55(4), 755–762 (2016). https://doi.org/10.1093/rheumatology/ kev426 74. Ross, C.L., Ang, D.C., Almeida-Porada, G.: Targeting mesenchymal stromal cells/pericytes (MSCs) with pulsed electromagnetic field (PEMF) has the potential to treat rheumatoid arthritis. Front. Immunol. 10, 266 (2019). https://doi.org/10.3389/fimmu.2019.00266 75. 
Marmotti, A., Peretti, G.M., Mattia, S., et al.: Pulsed electromagnetic fields improve tenogenic commitment of umbilical cord-derived mesenchymal stem cells: a potential strategy for tendon repair-an in vitro study. Stem Cells Int. 2018, 9048237 (2018). https://doi.org/10.1155/2018/ 9048237 76. Callaghan, M.J., Chang, E.I., Seiser, N., et al.: Pulsed electromagnetic fields accelerate normal and diabetic wound healing by increasing endogenous FGF-2 release. Plast. Reconstr. Surg. 121(1), 130–141 (2008). https://doi.org/10.1097/01.prs.0000293761.27219.84 77. Cameron, I.L., Markov, M.S., Hardman, W.E.: Optimization of a therapeutic electromagnetic field (EMF) to retard breast cancer tumor growth and vascularity. Cancer Cell. Int.;14(1), 125 (2014). https://doi.org/10.1186/s12935-014-0125-5 78. Gessi, S., et al.: Pulsed electromagnetic field and relief of hypoxia-induced neuronal cell death: the signaling pathway. J. Cell. Physiol. 234(9), 15089–15097 (2019) 79. Robertson, J.A., Théberge, J., Weller, J., Drost, D.J., Prato, F.S., Thomas, A.W.: Lowfrequency pulsed electromagnetic field exposure can alter neuroprocessing in humans. J. R. Soc. Interface. 7(44), 467–473 (2010) 80. Premi, E., et al.: Modulation of long-term potentiation-like cortical plasticity in the healthy brain with low frequency-pulsed electromagnetic fields. BMC Neurosci. 19, 1–6 (2018) 81. Ribeiro, J.A., Sebastião, A.M., de Mendonça, A.: Adenosine receptors in the nervous system: pathophysiological implications. Prog. Neurobiol. 68(6), 377–392 (2002). https://doi.org/10. 1016/s0301-0082(02)00155-7. PMID: 12576292 82. Sebastião, A,M„ Ribeiro, J.A.: Adenosine receptors and the central nervous system. Handb. Exp. Pharmacol. 193, 471–534 (2009). https://doi.org/10.1007/978-3-540-896159_16. PMID: 19639292 83. Varani, K., et al.: Effect of pulsed electromagnetic field exposure on adenosine receptors in rat brain. Bioelectromagnetics 33(4), 279–287 (2012)

Shared Intentionality Before Birth: Emulating

69

84. Ilori, A., Gbadamosi, O.: Measurement and evaluation of extremely low frequency (Radiation) from computer monitors and laptops. Int. J. Sci. Technoledge 6(3), 73–80 (2018) 85. McCraty, R.: Science of the Heart, Exploring the Role of the Heart in Human, vol. 2. HeartMath Institute, Boulder Creek (2015). ISBN 978-1-5136-0636-1 Paperback 86. Baule, G., McFee, R.: Theory of magnetic detection of the heart’s electrical activity. J. Appl. Phys. 36(6), 2066–2073 (1965) 87. Hart, G.: Biomagnetometry: imaging the heart’s magnetic field. Br. Heart J. 65(2), 61–62 (1991). https://doi.org/10.1136/hrt.65.2.61 88. McCraty, R., Atkinson, M., Tomasino, D., Tiller, W.A.: The electricity of touch: detection and measurement of cardiac energy exchange between people. In: Brain and Values, pp. 359–379. Psychology Press (2018) 89. Kadic, A.S., Kurjak, A.: Cognitive functions of the fetus. Ultraschall Med. 38, 181–189 (2017) 90. Val Danilov, I.: A bioengineering system for assessing children’s cognitive development by computerized evaluation of shared intentionality. In: International Conference on Computational Science and Computational Intelligence (CSCI), Proceedings IEEE Xplore, Conference Location: Las Vegas, NV, USA (2022)

DAP: A Framework for Driver Attention Prediction

Ahmed Kamel1,2(B), Ibrahim Sobh2, and Walid Al-Atabany1
1 Nile University, Giza, Egypt
{Ah.Atta,W.Al-Atabany}@nu.edu.eg
2 Valeo, Giza, Egypt
{ahmed.kamel,ibrahim.sobh}@valeo.com

Abstract. Human drivers employ their attentional systems while driving to focus on critical items and make judgments. Because gaze data can indicate human attention, collecting and analyzing gaze data has emerged in recent years as a way to improve autonomous driving technologies. In safety-critical situations, it is important to predict not only where the driver focuses their attention but also which objects they attend to. In this work, we propose DAP, a novel framework for driver attention prediction that bridges the attention prediction gap between pixels and objects. The DAP Framework is evaluated on the Berkeley DeepDrive Attention (BDD-A) dataset. DAP achieves state-of-the-art performance in both pixel-level and object-level attention prediction, notably improving object detection accuracy from 78% to 90%.

Keywords: Attention prediction · Transformer · Deep learning · Driver attention

1 Introduction

The driver's visual attention helps identify the most critical items on the road. Studying where the driver focuses their attention, and on which items, while driving is critical for alerting the driver to potentially dangerous situations. Several driving datasets that include front-camera frames together with eye-tracking recordings have been created to help researchers build and test attention prediction models. These datasets fall into two categories. In-car datasets, such as the Dr(eye)VE dataset [16,23], are recorded inside a vehicle while moving on the roads across different scenarios. The drawback of this setting is the consistency of testing: the same scenario cannot be replicated with multiple drivers, and eye-tracking noise must be taken into account. The second category is in-lab datasets, such as the DADA datasets [4,13,14], where many drivers are positioned in front of a screen showing recorded videos while their eye movements are recorded across different scenarios. This setting offers very good consistency, because the same scenario can be replicated with different drivers, but it is not a real environment: it is a simulated one, in which the screen size and the distance between the driver and the screen can affect the whole experiment.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 70–80, 2024. https://doi.org/10.1007/978-3-031-47715-7_6

In the past decade, deep learning (DL) has achieved considerable improvements in both visual saliency and object detection. The convolutional neural network (CNN) is one of the main classes of deep learning models and is heavily used in computer vision tasks. In earlier models [20,21], the driver's attention was mostly predicted by CNN layers. U-Net [5,25] is a convolutional neural network designed mainly for biomedical applications and primarily used for image segmentation; its architecture consists of a contracting path (down-sampling) and an expansive path (up-sampling), with high-resolution features from the contracting path combined into the expansive path. Transformers [22] were initially designed for natural language processing (NLP) tasks to capture long-range dependencies, and they have recently been used heavily in computer vision [8] as an effective architecture with global self-attention mechanisms. You Only Look Once (YOLO) [30] is a state-of-the-art real-time object detection system, popular for its speed and accuracy; it is highly efficient in automotive tasks for detecting objects on the road (pedestrians, cars, trucks, traffic signs, ...).

Recently, several convolutional neural networks [6,16] have been proposed for driver attention prediction. Unfortunately, CNNs have difficulty modeling long-range dependencies without many layers. Researchers have also proposed recurrent neural network (RNN) based architectures [17] capable of handling time-sequenced image data to estimate a saliency map, but RNN-based methods provide no explicit modeling of long- and short-range dependencies because they depend only on the previous frame.
The Unified Framework for Co-Object Segmentation (UFO) [3] addresses co-object segmentation tasks: co-segmentation and co-saliency detection for both images and video. It mainly follows the U-Net architecture and enhances its performance by adding Transformers between the contracting and expansive paths to capture long-range dependencies through self-attention. Two transformers with four heads each are placed at the connections between the down-sampling and up-sampling paths of the U-Net architecture to increase the self-attention features, in addition to an Intra-MLP that finds long-range dependencies based on K-means.

Driver gaze prediction, and relating the prediction to object identification, is important for alerting the driver to risky critical objects, such as a pedestrian, while driving on the road [2,7]. Researchers utilize driver gaze prediction for various purposes, such as determining the appropriate steering wheel angle in autonomous driving [12]. Despite being a widely discussed topic in research articles and survey papers [9,10,19], driver gaze prediction remains a significant challenge: the prediction results can still be improved, as they rely heavily on the driver's attention and on the conspicuous items in front of the driver. The DAP Framework, built on CNN and Transformer modules, achieves competitive state-of-the-art performance in both pixel-level and object-level attention prediction. In this work, the UFO model is adopted because of its superiority in salient object detection, which relates to the driver's attention, and YOLO is used to detect the objects in the videos of our dataset. Our contributions are summarized as follows:

– Enhance driver attention prediction at the pixel level.
– Enhance driver attention-based object detection.

The rest of this paper is structured as follows: Sect. 2 presents our contribution, the experimental setup, and the proposed framework. Section 3 covers the model results and a discussion of the model's efficiency. Section 4 closes with a discussion and conclusions.

2 Experimental Setup

2.1 Dataset

To ensure that the framework is trained on actual driving data rather than a simulated setting, the Berkeley DeepDrive Attention (BDD-A) dataset [15], which belongs to the in-lab dataset group, is used. It contains 1429 videos in total: 926 for training, 200 for validation, and 303 for testing. The video frame rate is 30 frames per second; only one frame in every ten is kept, for a total of three frames per second.

2.2 Architecture

Our DAP Framework is made up of four major components: the Feature Extraction Module, the Gaze Prediction Module, the Object Detection Module, and the attention-based object detection operation. As seen in Fig. 1, the framework is evaluated in two configurations to test the efficacy and flexibility of our gaze prediction: the Gaze Prediction Module can use either a VGG16 [27] backbone or a Yolov5 [30] backbone. The parameters of the Feature Extraction Module are not trainable and are acquired from the Yolov5 model. In the VGG16-backbone configuration, the Feature Extraction and Object Detection modules may be replaced by the Yolo model to detect objects in each frame. The Feature Extraction Module, on the other hand, is completely different in the Yolov5-backbone architecture: four primary feature maps are taken from distinct levels of the Yolo model for each frame, replacing the contracting-path parameters of the UFO model and reducing the number of trainable parameters in our framework. As illustrated in Fig. 2, the module collects features from layers 2, 17, 13, and 22, and feeds these four features to the gaze prediction module and the object detection module. These four layers were deliberately chosen to capture diverse characteristics of each frame and to serve as the foundation for the gaze prediction module.
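The layer-tapping idea above can be illustrated with a small, framework-agnostic sketch. This is not the authors' code: `TinyBackbone`, its toy layers, and the hook registry below are hypothetical stand-ins that merely mimic how intermediate outputs (e.g., Yolov5 layers 2, 17, 13, and 22) can be captured with forward hooks and routed to a second module.

```python
# Toy illustration of capturing intermediate-layer outputs ("taps"),
# analogous to forward hooks on a CNN backbone. All names are hypothetical.

class TinyBackbone:
    def __init__(self):
        # Each "layer" is just a function; a real backbone would hold conv blocks.
        self.layers = [lambda x, k=k: [v * 2 + k for v in x] for k in range(5)]
        self._hooks = {}      # layer index -> callback
        self.captured = {}    # layer index -> output snapshot

    def register_hook(self, idx):
        # Store the output of layer `idx` every time the model runs.
        self._hooks[idx] = lambda out: self.captured.__setitem__(idx, out)

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i in self._hooks:
                self._hooks[i](x)
        return x

model = TinyBackbone()
for idx in (1, 3):            # e.g., the layers feeding the gaze module
    model.register_hook(idx)

final = model.forward([1.0])
# `model.captured` now holds the intermediate features of layers 1 and 3,
# which a gaze-prediction head could consume alongside `final`.
```

In a real setting the same pattern (e.g., PyTorch's `register_forward_hook`) lets the gaze-prediction head consume backbone features without duplicating or retraining the backbone.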

Fig. 1. DAP Framework predicts the driver attention at the pixel and object levels. (a) The framework configuration with the VGG16 backbone used in the Gaze Prediction Module. (b) The framework configuration with the Yolov5 backbone used in both the Gaze Prediction Module and the Object Detection Module.

Fig. 2. Use of pre-trained layers from Yolov5 in the gaze prediction module.

There are two configurations for the gaze prediction module. The fundamental model, adapted from the UFO model, uses a VGG16 backbone to extract the essential features from the input frame. Because the Yolo model is already incorporated into our system for detecting objects in frames, the UFO backbone was changed from VGG16 to Yolov5 to reduce the number of trainable parameters. With this change, our framework can perform both jobs, gaze prediction and object detection. In the Yolov5 architecture in Fig. 3, the Object Detection Module is just layer 23 (the last layer).

Fig. 3. DAP object detection module architecture.

With altered thresholds, the attention-based object identification is reused from the Where and What study [2]. The detected objects are filtered in Yolo with a 0.85 confidence level to ensure that the model is evaluated on the proper objects. The ground-truth frame threshold was empirically set to 0.30, while the predicted frame threshold was set to 0.10, which is optimal for our model. The greatest saliency value inside an object's bounding region is compared to the threshold to determine whether the object is focused on or not. Precision, Recall, F1, Accuracy (ACC), and Area Under the Curve (AUC) are used to evaluate attention-based object detection.

Extend the UFO Model. The UFO model originally supported co-saliency detection, and our DAP framework extends its capability to serve driver attention prediction. The UFO model is mostly used for co-segmentation and requires batches of five frames to discover the common salient item among them. Although the UFO model produces gray prediction frames, all of its losses and evaluation measures target the co-saliency segmentation task using binary prediction frames. Because our framework is designed to forecast the saliency map of each driving frame, only one frame is presented to the model at a time. The UFO model is trained with four losses: weighted binary cross-entropy, classification loss, smooth L1 loss, and intersection over union. All of these losses are aimed at binary prediction images, which are incompatible with our application. Instead, the cross-entropy is used as the loss function to minimize the difference between the ground-truth attention map S and the predicted map \hat{S}:

L(S, \hat{S}) = -\frac{1}{N} \sum_{i=1}^{N} S_i \log(\hat{S}_i)    (1)

where S_i denotes the i-th pixel of the ground-truth attention map S, \hat{S}_i denotes the i-th pixel of the predicted map \hat{S}, and N is the number of pixels.
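Equation (1) can be sanity-checked in plain Python. The snippet below is an illustrative re-implementation, not the authors' training code; the flattened maps `gt`, `good`, and `bad` are made-up examples, and `eps` guards against log(0).

```python
import math

def attention_cross_entropy(s, s_hat, eps=1e-12):
    """Cross-entropy between a ground-truth attention map `s` and a
    predicted map `s_hat`, both flattened to 1-D lists (Eq. 1)."""
    n = len(s)
    return -sum(si * math.log(shi + eps) for si, shi in zip(s, s_hat)) / n

gt = [0.0, 0.5, 0.5, 0.0]           # ground-truth attention mass
good = [0.01, 0.49, 0.49, 0.01]     # prediction close to the ground truth
bad = [0.49, 0.01, 0.01, 0.49]      # attention placed on the wrong pixels
assert attention_cross_entropy(gt, good) < attention_cross_entropy(gt, bad)
```

As expected, a prediction that places probability mass where the ground-truth map does incurs a lower loss than one that concentrates on the wrong pixels.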

The UFO model was originally analyzed with precision and the Jaccard index (Jaccard similarity coefficient). Again, these metrics are inapplicable to our application because they only apply to binary images. Our model was therefore evaluated using the Kullback-Leibler divergence (KLdiv), which measures the difference between two probability distributions, and Pearson's correlation coefficient (CC), which measures the correlation between two distributions.

Reuse Yolo Parameters. To reduce the number of trainable parameters in the framework, and therefore boost efficiency, the Yolov5 backbone is used instead of the VGG16 backbone in the UFO model, as shown in Fig. 4. The Yolov5 model is mostly composed of numerous convolution layers that collect visual information in order to detect object bounding boxes. The gaze prediction module reuses these convolution layer parameters: the first feature is extracted after layer 2, the second after layer 17, the third after layer 13, and the fourth after layer 22. The input channels of the first convolution layer were reduced from 128 to 64 to accommodate the pre-trained Yolov5 layer 2. This method decreased the number of framework parameters from 56M to 41M.

Fig. 4. DAP gaze prediction module architecture.

2.3 Training Details

All experiments are carried out on an Intel Xeon Gold 5222 CPU (3.8 GHz), an NVIDIA Quadro P2200 (5 GB) GPU, and 64 GB of RAM. The proposed deep neural network was trained for 50 epochs on the BDD-A training set using the Adam optimizer and validated on 10 randomly selected videos from the validation set. The learning rate started at 0.01 and decayed with a factor of 1e-6 every 4 epochs.

3 Results and Discussion

The DAP Framework outperforms the state-of-the-art models and demonstrates sophisticated behavior in attention-based object detection, specifically in terms of accuracy and AUC, as shown in Table 1.

Table 1. Attention-based object detection comparison result

Models           AUC   Prec (%)  Recall (%)  F1 (%)  Acc (%)
Baseline         0.82  66.10     74.22       69.92   74.47
BDD-A [15]       0.82  66.00     74.33       69.92   74.43
DR(eye)VE [16]   0.85  70.04     74.94       72.41   77.16
PiCANet [18]     0.86  70.23     77.67       73.76   77.91
ML-Net [24]      0.84  70.48     73.75       72.08   77.15
Where/What [2]   0.85  70.54     75.30       72.84   77.55
DAP (VGG16)      0.93  72.81     78.13       75.37   88.99
DAP (Yolov5)     0.94  73.99     73.54       73.76   90.44

Previous models' results are imported from [2].
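The object-level decision rule evaluated in Table 1 (compare the maximum predicted-saliency value inside each detected bounding box against a threshold, as described in Sect. 2.2) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the `is_attended` helper, the (x1, y1, x2, y2) box format, and the toy saliency map are assumptions.

```python
def is_attended(saliency, box, threshold):
    """Return True if the peak saliency inside `box` reaches `threshold`.
    `saliency` is a 2-D list (H x W); `box` is (x1, y1, x2, y2), inclusive."""
    x1, y1, x2, y2 = box
    peak = max(saliency[y][x] for y in range(y1, y2 + 1)
                              for x in range(x1, x2 + 1))
    return peak >= threshold

sal = [
    [0.0, 0.1, 0.0, 0.00],
    [0.0, 0.6, 0.2, 0.00],
    [0.0, 0.0, 0.0, 0.05],
]
assert is_attended(sal, (0, 0, 2, 2), 0.30)       # peak 0.6 inside this box
assert not is_attended(sal, (3, 0, 3, 2), 0.30)   # peak only 0.05
```

With the paper's thresholds, one would pass 0.30 for ground-truth maps and 0.10 for predicted maps; the per-box True/False outcomes then feed the Precision, Recall, F1, Accuracy, and AUC computations.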

The driver attention prediction is also marginally improved using the same framework that improved attention-based object detection, as shown in Table 2.

Table 2. Driver attention prediction comparison result

Models           KLdiv ↓  CC ↑
Baseline         1.51     0.47
Itti [29]        0.06     0.12
GBVS [28]        0.26     0.18
DR(eye)VE [16]   1.82     0.57
PiCANet [18]     1.69     0.50
BDD-A [15]       1.52     0.57
ML-Net [24]      1.47     0.60
SALICON [26]     1.41     0.53
CDNN [11]        1.18     0.61
Where/What [2]   1.15     0.60
ACT-Net [1]      1.11     0.63
DAP (VGG16)      1.11     0.64
DAP (Yolov5)     1.10     0.64

Previous models' results are imported from [1,2].
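The two metrics reported in Table 2 can be computed directly. The sketch below is an illustrative, plain-Python reconstruction rather than the authors' evaluation code; it treats saliency maps as flattened lists forming discrete probability distributions, and the helper names `kl_div` and `pearson_cc` are assumptions.

```python
import math

def kl_div(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two discrete
    distributions given as equal-length lists summing to 1 (lower is better)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def pearson_cc(x, y):
    """Pearson's correlation coefficient between two equal-length lists
    (higher is better; 1.0 for perfectly correlated maps)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

p = [0.1, 0.4, 0.5]
assert kl_div(p, p) < 1e-9                 # identical maps: KLdiv -> 0
assert abs(pearson_cc(p, p) - 1.0) < 1e-9  # identical maps: CC -> 1
```

A predicted map identical to the ground truth gives KLdiv of 0 and CC of 1, matching the "lower KLdiv, higher CC" reading of Table 2.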


As shown in Fig. 5, the model succeeds in identifying the critical objects in many crucial scenarios, matching the ground truth. In scenario (a), the model correctly predicts that the driver is focusing on the vehicle crossing the road, which is vital for driver safety. In scenario (b), it correctly predicts that the driver is focused on the pedestrian crossing the road. In scenario (c), it correctly predicts that the driver is focusing on the vehicles beside him. The second glance in the center of all pictures is not a concern, since it falls on the vanishing point of the road ahead rather than on a specific item.

Fig. 5. DAP model result as good example (true positive)

In Fig. 6, our model predicted that the driver would concentrate entirely on the vehicle in the center of the picture, but it was confused by the car on the left. Although the driver in the ground truth attended to the car on the left in the initial frame, the model failed to anticipate this behavior; only 4 s later did the model predict that the driver would respond to this car, by which time the driver had already turned away in the ground truth.

Fig. 6. DAP model result with bad example (false positive)

In Fig. 7, our model failed to recognize the car on the right in the initial frame; it detected the car, but only 2 s later.

Fig. 7. DAP model result with bad example (false negative)

4 Conclusion

Better results in forecasting the driver's attention while driving are necessary, but it is equally important to anticipate which objects the driver is looking at. This work presents the DAP Framework, which delivers state-of-the-art results in recognizing the objects the driver concentrates on, with accuracy of up to 90%. In future work, it will be important to investigate strategies for increasing the framework's performance by incorporating temporal information into DAP.

References

1. Gou, C., Zhou, Y., Li, D.: Driver attention prediction based on convolution and transformers. J. Supercomput. 78(6), 8268–8284 (2022)
2. Rong, Y., et al.: Where and what: driver attention-based object detection. Proc. ACM Hum.-Comput. Interact. 6(ETRA), 1–22 (2022)
3. Su, Y., et al.: A unified transformer framework for group-based segmentation: co-segmentation, co-saliency detection and video salient object detection (2022). arXiv:2203.04708
4. Fang, J., et al.: DADA: driver attention prediction in driving accident scenarios. IEEE Trans. Intell. Transp. Syst. (2021)
5. Siddique, N., et al.: U-Net and its variants for medical image segmentation: a review of theory and applications. IEEE Access 9, 82031–82057 (2021)
6. Pal, A., Mondal, S., Christensen, H.I.: Looking at the right stuff - guided semantic-gaze for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
7. Li, C., Chan, S.H., Chen, Y.T.: Who make drivers stop? Towards driver-centric risk assessment: risk object identification via causal inference. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE (2020)
8. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale (2020). arXiv:2010.11929
9. Kang, Y., Yin, H., Berger, C.: Test your self-driving algorithm: an overview of publicly available driving datasets and virtual testing environments. IEEE Trans. Intell. Veh. 4(2), 171–185 (2019)
10. Guo, J., Kurup, U., Shah, M.: Is it safe to drive? An overview of factors, metrics, and datasets for driveability assessment in autonomous driving. IEEE Trans. Intell. Transp. Syst. 21(8), 3135–3151 (2019)
11. Deng, T., et al.: How do drivers allocate their potential attention? Driving fixation prediction via convolutional neural networks. IEEE Trans. Intell. Transp. Syst. 21(5), 2146–2154 (2019)
12. Liu, C., et al.: A gaze model improves autonomous driving. In: Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications (2019)
13. Fang, J., et al.: DADA-2000: can driving accident be predicted by driver attention? Analyzed by a benchmark. In: 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE (2019)
14. Fang, J., et al.: DADA: a large-scale benchmark and model for driver attention prediction in accidental scenarios (2019). arXiv:1912.12148
15. Xia, Y., et al.: Predicting driver attention in critical situations. In: Asian Conference on Computer Vision. Springer, Cham (2018)
16. Palazzi, A., et al.: Predicting the driver's focus of attention: the DR(eye)VE project. IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1720–1733 (2018)
17. Tawari, A., Mallela, P., Martin, S.: Learning to attend to salient targets in driving videos using fully convolutional RNN. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE (2018)
18. Liu, N., Han, J., Yang, M.H.: PiCANet: learning pixel-wise contextual attention for saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
19. Yin, H., Berger, C.: When to use what data set for your self-driving car algorithm: an overview of publicly available driving datasets. In: 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE (2017)
20. Tawari, A., Kang, B.: A computational framework for driver's visual attention using a fully convolutional architecture. In: 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE (2017)
21. Palazzi, A., et al.: Learning where to attend like a human driver. In: 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE (2017)
22. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
23. Alletto, S., et al.: DR(eye)VE: a dataset for attention-based tasks with applications to autonomous and assisted driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2016)
24. Cornia, M., et al.: A deep multi-level network for saliency prediction. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE (2016)
25. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham (2015)
26. Huang, X., et al.: SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision (2015)
27. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556
28. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural Information Processing Systems, vol. 19 (2006)
29. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
30. Jocher, G., et al.: ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models (2021). https://doi.org/10.5281/zenodo.4679653

On Object Detection Based on Similarity Measures from Digital Maps

Arthur Marzinkowski¹, Salem Benferhat¹, Anastasia Paparrizou², and Cédric Piette¹

¹ CRIL, Université d'Artois, CNRS UMR 8188, Arras, France
{marzinkowski,benferat,piette}@cril.fr
² LIRMM, Université Montpellier, CNRS UMR 5506, Montpellier, France
[email protected]

Abstract. This paper deals with the problem of object detection in digital maps. We are interested in detecting objects in a map that are defined in its legend. We explore different similarity measures to compare the legend objects to those detected in different areas of the map. Our object detection method is evaluated on maps representing wastewater networks; in particular, we are interested in the detection of objects that represent lifting stations and manholes. The ultimate goal, after correctly detecting the target objects, is to repair malfunctions or inconsistencies in the water supply or evacuation network. The experimental results show that our similarity measures give good accuracy on the detection of the legend objects.

Keywords: Object detection · Maps · Similarity measures · Wastewater networks’ data

1 Introduction

This paper focuses on map-like data that represent wastewater networks. Objects of interest are described in the map legend, with a focus on the detection of objects that represent lifting stations and manholes. The principle is therefore to take the sub-images associated with the objects in the legend as reference images. The goal is then to detect all parts of the map that are sufficiently similar to the reference image. This problem is known as template matching (e.g., [1]) and is used in many image processing applications (e.g., [5]). Even if the images to be processed are of high quality, it is often difficult to detect all the legend objects present in the map. This is particularly true for images or maps that represent wastewater networks, such as the one given in Fig. 1. Indeed, let us consider a legend object that represents a manhole. At the level of the legend (see Fig. 2), a manhole is described by a complex geometric figure: a full blue circle inside another circle. In some areas of the map (such as the one illustrated in Fig. 4a), manholes appear isolated and independent (even if the background is rarely white as in the legend). In general, however, the manholes do not appear exactly as described by the legend. Namely, they do not appear independently and well separated on a white background (as in the legend). In reality, some objects are superimposed on each other (as illustrated by Fig. 4b), and others lie on non-white backgrounds, which evidently makes the detection task difficult. It is clear that in this case it is not useful to search for geometric figures that have exactly the same shape as those given in the legend, but rather for shapes that are (sufficiently) similar. One of the objectives of this paper is to study different similarity measures adapted to our framework of study, namely maps of wastewater networks.

The field of water sciences (including wastewater network data management) is rich in heterogeneous data. There have recently been many works that call for the use of machine learning techniques (e.g., [6,8,9]) and classifiers [14]. On the management of wastewater network data, there is already work on detecting objects of interest, such as manholes, from images (e.g., [2,13]). Existing work mainly applies machine learning techniques to images to detect one object of interest, the manhole (the most visible element of urban networks). The work considered in this paper does not deal with images but with maps, and can therefore detect the different objects identified in the legend of the map. An extra difficulty we face when dealing with maps is the heterogeneity of the data: the maps in our database come from different sources and may have different formats.

The rest of the paper is organized as follows. The next section presents the problem considered in the paper. Section 3 describes the approach we propose and the similarity measures for comparing the legend objects to segments of the map and extracting the most plausible objects of interest. Section 4 presents the experimental results. The last section concludes the paper.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 81–97, 2024. https://doi.org/10.1007/978-3-031-47715-7_7

2 Problem Description

The problem tackled in this paper concerns the management of heterogeneous information. The approach followed consists in transforming data of different natures into a format that is easily exploitable by Artificial Intelligence query-answering mechanisms. This is especially true for image-type data or Portable Document Format (PDF) files, where it is important to extract their content and represent it as a factual database (which may or may not be uncertain). Figure 1 shows an example of the maps we have to manage. In order to query it and exploit its content, it is necessary to identify the relevant objects it contains. The legend (often present in maps) is a good starting point for identifying the objects of interest associated with the map. Figure 2 contains the legend of the map given in Fig. 1, from a small village in the south of France (written in French). It gives four elements (or objects) important for the description of wastewater networks. The first object of interest represents manholes. A manhole is a structure allowing the cleaning and inspection of a pipe of a wastewater network.


Fig. 1. Wastewater network map for Gigean village in France

Manholes play an important role because they are the access points to the networks. They are placed at regular intervals on the network and at each bend and intersection of the network. Figure 3 gives an example of an image that contains such manholes. In Fig. 2, we can see how manholes are represented on the maps and in the legends: a manhole is represented by a geometric figure, a blue circle inside another circle of black color. Another important object is the so-called lifting station, a pump system used to raise the level of water. In the legend and maps, the lifting station is also represented by a geometric figure, more complex than the one associated with manholes: a black circle inside a pink circle, the whole attached to a pink arrow (it appears below the manhole symbol in Fig. 2). In this paper, we focus on these two objects of sewerage networks. Note that these two objects are not represented in the same way on all maps. There are very few lifting stations (six in all) on the map. On the other hand, manholes are numerous (e.g., more than 800 for Gigean) and are distributed in a non-uniform way. In some areas (which represent the most populated places), manholes are shown in a superimposed manner, which makes the task of detecting them very difficult.


Fig. 2. Legend associated with Gigean map given in Fig. 1, where “regard” means manhole and “poste de relevage” means lifting station

Fig. 3. Example of an image representing a manhole (source: https://www.tubao.fr/tuyau-canalisation-pehd-reseaux-gravitaires)

2.1 Elements of Wastewater Networks

In the following sections, several terms from the field of sanitation networks will be used. We therefore define them as a preamble:
– Manhole: an opening allowing the cleaning and inspection of a pipe. Manholes are placed at regular intervals on the network and at each bend and intersection of the network (see Fig. 3).
– Lifting station: an automatic pump system used to raise the level of water that accumulates below the so-called discharge level. The waters below this level cannot flow via gravity.


Fig. 4. Two examples of sub-images of the Gigean map (Fig. 1) representing two different arrangements of manholes

– Gravity: a pipe allowing the flow of water without the use of a pump, using just the force of gravity.

2.2 Data on Wastewater Networks

As mentioned earlier, Fig. 1 represents an example of the data we work on: a map of the Gigean wastewater network in image form. Several symbols are visible in the legend, such as manholes, lifting stations, discharges and gravity pipes. We first focus on the manholes and lifting stations because they are specific, punctual elements on the map.

3 The Proposed Approach

3.1 Extraction Methods

The method we use in this work is referred to in the literature as template matching [3]. This method is widely used to find, in an image, one or more elements that are similar to a reference element. Recall that the reference elements are those appearing in the legend of the map. Matching is done by calculating a matching score between the reference element and each sub-region of the image, where a sub-region is defined by the dimensions of the reference image. This operation is a convolution-like pass over the image. Figure 5a, b and c (inspired by the document at https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1) illustrate the main steps of our method. These steps are clearly similar to convolution operations (e.g., [4]), one of the main differences being that a convolution outputs an image, whereas our method outputs a two-dimensional array of similarity degrees.


Fig. 5. Conceptual illustrations of convolution

The blue part in Fig. 5a, b and c represents the base image (in our example, the map of Gigean given in Fig. 1) in which we seek to detect the elements of interest (e.g., manholes and lifting stations). The red part is the reference image, which represents the description of the object we seek to detect; in our example, the reference image can be the sub-image associated with a manhole, given by the legend (e.g., Fig. 2). The idea is to apply template matching techniques in order to detect reference images in the map. More precisely, we use the reference image (the red image) as a sliding window to iterate through all the sub-images (of the same size as the red image) of the base image (blue part). At each iteration, we calculate a degree of matching, or similarity, between the sub-image of the base image and the reference image given by the legend. We use different similarity measures appropriate for image comparison. All these similarity degrees are stored in a third matrix, represented by the gray color in Fig. 5a, b and c: each comparison of the reference image with a part of the base image results in a similarity degree, stored in a cell of the gray matrix. Algorithm 1 summarizes our matching method based on similarity measures. Its inputs are two images: the base (source) image, denoted I, and the reference image, denoted T. Let (w1, h1) be the dimensions of the source image and (w2, h2) those of the reference image. The output is a similarity matrix, denoted S, of size (w1 − w2, h1 − h2).


Indeed, only the sub-images of the base image that have the same dimensions as the reference image are taken into account in the detection of objects. For each pair of coordinates (x, y) of the similarity matrix S, the degree S[x, y] is the result of applying a similarity function between the reference matrix T and the sub-image of I composed of the rows between x and x + w2 and the columns between y and y + h2. The similarity matrix computed by the algorithm is then used to detect the objects of interest. If we seek to detect a single object, we simply take the sub-image of the base image whose associated similarity degree is maximal. In the general case of multiple object detection, we need to define a threshold above (or below) which the similarity degree is considered good enough to represent the reference image.

Algorithm 1: Similarity-Based Matching Algorithm
Input: Source image I
Input: Reference image T
Output: Similarity matrix S
I ← Img_src;
T ← Img_ref;
foreach coordinates (x, y) in S do
    S[x, y] ← matching_score(T, I[x : x + w2, y : y + h2]);
end
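Algorithm 1 can be sketched in plain Python as follows. This is a naive, illustrative implementation, not the authors' code: the names `ssd` and `similarity_matrix` are ours, and the output has size (h1 − h2 + 1) × (w1 − w2 + 1), one cell per valid window offset.

```python
def ssd(t, w):
    """Sum of squared differences between two equal-size patches."""
    return sum((a - b) ** 2
               for ra, rb in zip(t, w) for a, b in zip(ra, rb))

def similarity_matrix(image, template, score=ssd):
    """Algorithm 1: slide `template` over `image` and store one matching
    score per offset (here using SSD as the matching score)."""
    h1, w1 = len(image), len(image[0])
    h2, w2 = len(template), len(template[0])
    S = [[0.0] * (w1 - w2 + 1) for _ in range(h1 - h2 + 1)]
    for x in range(h1 - h2 + 1):
        for y in range(w1 - w2 + 1):
            window = [row[y:y + w2] for row in image[x:x + h2]]
            S[x][y] = score(template, window)
    return S

# With I1 and T from the grayscale example later in the paper, the exact
# match at the top-left corner yields a zero SSD score:
I1 = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
T = [[10, 20], [40, 50]]
assert similarity_matrix(I1, T)[0][0] == 0
```

For SSD the best match is the minimum of S; a correlation measure would instead look for the maximum.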

Once the similarity matrix has been computed by Algorithm 1, a decision step determines which sub-images (or regions) of the map represent the reference image. In this paper, we first use a threshold to determine the list of map regions that may correspond to the objects we seek to detect. Then, we define an algorithm for grouping very close regions that all potentially represent the same detected object. The difficult question here is whether two regions that are very close (namely, only a few pixels apart) correspond to a single object or to two different objects.

3.2 Matching Measures

In this subsection, we present some similarity measures. We first introduce some notation. Let f and g be two matrices (not necessarily of the same size). These two matrices encode the two input images of our algorithm (the base image and the reference image). They contain positive numbers which, in our example, encode the luminosity of a pixel (typically its gray level or gray intensity). We denote by h the height of the image (matrix) g and by w its width, and we set n = w × h (the number of elements in g). The matching measures split into two main categories: intensity measures and correlation measures.


Intensity measures consist in computing the difference, pixel by pixel, between the reference image and the source image. The sum of squared differences method belongs to this category.

Sum of Squared Differences-Based Methods. The sum of squared differences is a pixel-by-pixel intensity difference measure: the lower the matching score, the more similar the compared images. The measures we compare are the Root Mean Square distance (RMS) and the Sum of Squared Differences or squared error (SSD). Formally:

– Sum of squared differences or squared error (SSD):

  M_{SSD}(f, g) = \sum_{i=0}^{w-1} \sum_{j=0}^{h-1} (f(i, j) - g(i, j))^2    (1)

– Root mean square distance (RMS):

  M_{RMS}(f, g) = \sqrt{ \frac{1}{n} \sum_{i=0}^{w-1} \sum_{j=0}^{h-1} (f(i, j) - g(i, j))^2 }    (2)

The formulas (1) and (2) are equivalent in the sense that SSD and RMS produce the same relative order. In practice, only SSD is used, due to its lower computing cost [3]. Other measures similar to SSD and RMS have been used in the context of image or signal processing (e.g., [12]). For the SSD measure, M_{SSD}(f, g) = 0 means a perfect match; the maximum score depends on the type of image processed. SSD is symmetric, since M_{SSD}(f, g) = M_{SSD}(g, f), and is derived from a classic distance measure. The SSD measure is used, for instance, to track objects in videos [10]: the reference image is extracted from the video and used as a template for the next frames. We will illustrate with an example the sensitivity of SSD to luminosity/intensity variations, which is one of the main drawbacks of this measure.
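The order-equivalence of (1) and (2) can be checked numerically. A small sketch (illustrative code, assuming the definitions above; all names are ours): since rms = sqrt(ssd / n) is a monotone function of ssd for a fixed template size n, both measures rank candidate windows identically.

```python
import math

def ssd(f, g):
    """Sum of squared differences, Eq. (1), between equal-size patches."""
    return sum((a - b) ** 2
               for ra, rb in zip(f, g) for a, b in zip(ra, rb))

def rms(f, g):
    """Root mean square distance, Eq. (2): sqrt(SSD / n)."""
    n = len(f) * len(f[0])
    return math.sqrt(ssd(f, g) / n)

template = [[10, 20], [40, 50]]
windows = [[[10, 20], [40, 50]],   # exact copy of the template
           [[20, 30], [50, 60]],   # slightly shifted intensities
           [[50, 60], [80, 90]]]   # farther still
by_ssd = sorted(range(3), key=lambda k: ssd(template, windows[k]))
by_rms = sorted(range(3), key=lambda k: rms(template, windows[k]))
assert by_ssd == by_rms == [0, 1, 2]  # same relative order
```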

Fig. 6. Fruit and vegetable image (source: https://monjardindidees.fr/optimisons-l-espace-au-potager)


Fig. 7. Illustration of reference images

Figure 6 is an image of fruits and vegetables, considered as the source image. Let us now consider two examples of reference images, given in Fig. 7a and b. The two images contain the same object, except that the image in Fig. 7b is a strongly darkened version of the one in Fig. 7a. In fact, the image in Fig. 7a is taken directly from the source image (Fig. 6). Our aim is to find the exact locations of the reference object in the original image (Fig. 6) from both reference images of Fig. 7a and b. Our algorithm based on the SSD measure is able to find the exact location of the template image of Fig. 7a. However, it fails to find the original position of the darkened template. This is one limitation of SSD (and, similarly, of RMS): the measure is sensitive to luminosity variations. Although in our case the reference image is heavily darkened, we have observed that even smaller brightness changes can significantly deteriorate the matching score.

Correlation Measures. Correlation measures are known to be robust similarity measures and have been used in different application contexts (e.g., [5]). In this section, we are interested in the normalized version of the cross-correlation measure (NCC), given by Eq. (3) (see for instance [11] for its use in image correlation).

– Normalised cross-correlation (NCC):

  M_{NCC}(f, g) = \frac{\sum_{i=0}^{w-1} \sum_{j=0}^{h-1} f(i, j) \, g(i, j)}{\sqrt{\sum_{i=0}^{w-1} \sum_{j=0}^{h-1} f(i, j)^2} \cdot \sqrt{\sum_{i=0}^{w-1} \sum_{j=0}^{h-1} g(i, j)^2}}    (3)

When using the normalized cross-correlation (NCC) measure, the higher the obtained similarity value, the more similar the compared images. In particular, M_{NCC}(f, g) = 1 indicates a perfect match, which is the maximum score for NCC; the degree 0 is its minimal score. The main reason NCC is more robust is that it is less sensitive to luminosity variations than SSD-based measures [7].
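This insensitivity is easy to verify: NCC is invariant under a uniform scaling of one patch's intensities, since a multiplicative factor cancels between the numerator and the denominator of Eq. (3). A quick numerical check (illustrative sketch; the function name is ours):

```python
import math

def ncc(f, g):
    """Normalized cross-correlation, Eq. (3), between equal-size patches."""
    pairs = [(a, b) for ra, rb in zip(f, g) for a, b in zip(ra, rb)]
    num = sum(a * b for a, b in pairs)
    den = (math.sqrt(sum(a * a for a, _ in pairs))
           * math.sqrt(sum(b * b for _, b in pairs)))
    return num / den

patch = [[10, 20], [40, 50]]
darker = [[v / 2 for v in row] for row in patch]  # uniformly darkened copy
assert abs(ncc(patch, patch) - 1.0) < 1e-9   # perfect match scores 1
assert abs(ncc(patch, darker) - 1.0) < 1e-9  # score unchanged by scaling
```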


Indeed, consider again our example with Figs. 6, 7a and 7b. The NCC similarity measure makes it possible to detect the object of interest from both reference images (Fig. 7a and b), which was not the case with SSD. This illustrates the robustness of NCC even under very strong variations of luminosity.

We now illustrate the different similarity measures on academic examples. Let us consider two grayscale source images represented by the matrices I1 and I2, both of dimensions 3 × 3, such that ∀i ∈ {0, 1, 2}, ∀j ∈ {0, 1, 2}, I2(i, j) = I1(i, j)/2. We also assume a unique reference image T of dimensions 2 × 2:

I1 = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
I2 = [[5, 10, 15], [20, 25, 30], [35, 40, 45]]
T = [[10, 20], [40, 50]]

The four matrices S1, ..., S4 below give the similarity results, based on the two measures SSD and NCC, between the reference image T and each of the source images I1 and I2. Each similarity matrix is of dimension 2 × 2. More precisely, the matrix S1 (resp. S3) is the result of applying the (normalized) SSD measure to the images I1 and T (resp. I2 and T). Similarly, the matrix S2 (resp. S4) is the result of applying the NCC measure to I1 and T (resp. I2 and T).

S1 = [[0, 0.06855912], [0.42772362, 0.6574571]]

S2 = [[1, 0.9941072], [0.9742593, 0.9656402]]

S3 = [[0.5, 0.22281712], [0.05940606, 0.07190938]]

S4 = [[1, 0.9941072], [0.9742593, 0.9656402]]

We can observe that the matrices S2 and S4, both obtained using the NCC measure, are identical. This illustrates what was mentioned above: NCC is less sensitive to intensity variations of pixels. However, as shown by the matrix S1, the SSD measure only finds (perfectly) the location of T in I1. When we apply SSD to the source image I2, we obtain a lower similarity score (the degree 0.5 in the matrix S3). This shows that SSD allows us to find reference images only if they are nearly identical to the original. Note that in our map application this should not be a problem, because there is no large variation in pixel values between map items and the reference inside the legend.

The similarity measures seen so far have been defined on gray-level images. They can easily be extended to colored images. A natural solution is simply to apply the similarity measure on each channel (R, G, B) and sum up the results, which gives a single similarity value per pixel. To illustrate, we consider the following example with two colored images, again simply denoted I1 and I2, where I2 is obtained by dividing each value in the I1 matrix by 2:

I1 = [[(0, 0, 0), (255, 0, 0), (0, 255, 0)],
      [(0, 0, 255), (255, 255, 0), (255, 0, 255)],
      [(0, 255, 255), (255, 255, 255), (127, 127, 127)]]

I2 = [[(0, 0, 0), (127, 0, 0), (0, 127, 0)],
      [(0, 0, 127), (127, 127, 0), (127, 0, 127)],
      [(0, 127, 127), (127, 127, 127), (63, 63, 63)]]

T = [[(0, 0, 0), (255, 0, 0)],
     [(0, 0, 255), (255, 255, 0)]]

As we did with grayscale images, we apply SSD and NCC to the two color source images I1 and I2. We first give the matrix S, the result of non-normalized SSD on I1, to illustrate the operation on the different channels:

S = [[0, 520200],
     [260100, 374022]]

Let us consider, for example, S[0, 1] = 520200. This example is simple to check, because the values are only 0 and 255: we just need to count the number of differing channels (each contributing 255² = 65025). Below is the 2 × 2 sub-matrix of I1 starting at position (0, 1):

[[(255, 0, 0), (0, 255, 0)],
 [(255, 255, 0), (255, 0, 255)]]

Comparing it channel by channel with T gives S[0, 1] = 255² · 1 + 255² · 2 + 255² · 3 + 255² · 2 = 520200.
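This channel-wise computation can be verified directly (illustrative sketch; the function name `color_ssd` is ours):

```python
def color_ssd(template, window):
    """Non-normalized SSD on RGB patches: squared per-channel differences
    summed over all pixels and all three channels."""
    return sum((tc - wc) ** 2
               for trow, wrow in zip(template, window)
               for tpix, wpix in zip(trow, wrow)
               for tc, wc in zip(tpix, wpix))

# T and the 2 x 2 sub-image of I1 at offset (0, 1), as in the text:
T = [[(0, 0, 0), (255, 0, 0)],
     [(0, 0, 255), (255, 255, 0)]]
sub = [[(255, 0, 0), (0, 255, 0)],
       [(255, 255, 0), (255, 0, 255)]]
assert color_ssd(T, sub) == 520200  # = (1 + 2 + 3 + 2) * 255**2
assert color_ssd(T, T) == 0         # identical patches score 0
```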

Applying the normalized measures then yields:

S1 = [[0, 1], [0.70710677, 1]]

S2 = [[1, 0.20412417], [0.70710677, 0.5383155]]

S3 = [[0.50591326, 1], [0.7099016, 1]]

S4 = [[1, 0.20412415], [0.70710677, 0.53781]]

Again, applying the NCC measure leads to essentially identical results for I1 and I2 (the tiny differences between S2 and S4 come from rounding, since 255/2 is rounded down to 127).

4 Experimental Evaluation

This section presents the experimental results where the two main similarity measures, namely SSD and NCC, have been evaluated on the Gigean wastewater network map given in Fig. 1. The Gigean map has been used as a PNG image with a resolution of (19850, 14033), in RGB format (3 byte values per pixel) for the colored image. The experiments were conducted on a computer with a 12-core 3.7 GHz CPU and 16 GB of RAM. We used two reference images, extracted from the legend (given in Fig. 2), associated with the manhole and lifting station objects. Table 1 gives the number of objects considered in our study. Tables 2 and 3 summarize the results obtained in our experiments.

Table 1. Number of manholes and lifting stations in the map

| Objects          | Total number of objects |
|------------------|-------------------------|
| Manholes         | 885                     |
| Lifting stations | 5                       |

Table 2. Results of normalized SSD on the Gigean map

| Objects          | Threshold | Candidate sub-images | Detected objects | False positives | PCC (%) | False positive rate (%) |
|------------------|-----------|----------------------|------------------|-----------------|---------|-------------------------|
| Manholes         | 0.10      | 30503                | 595              | 0               | 67.11   | 0                       |
|                  | 0.148     | 79675                | 726              | 1               | 81.80   | 0.13                    |
|                  | 0.15      | 82215                | 730              | 1               | 82.25   | 0.13                    |
|                  | 0.20      | 182934               | 942              | 157             | 88.58   | 16.66                   |
| Lifting stations | 0.10      | 247                  | 2                | 0               | 20      | 0                       |
|                  | 0.15      | 609                  | 3                | 0               | 40      | 0                       |
|                  | 0.20      | 1375                 | 5                | 0               | 80      | 0                       |
|                  | 0.22      | 1932                 | 6                | 0               | 100     | 0                       |

Table 3. Results of normalized NCC on the Gigean map

| Objects          | Threshold | Candidate sub-images | Detected objects | False positives | PCC (%) | False positive rate (%) |
|------------------|-----------|----------------------|------------------|-----------------|---------|-------------------------|
| Manholes         | 0.95      | 30984                | 598              | 0               | 67.45   | 0                       |
|                  | 0.935     | 31945                | 685              | 0               | 74.40   | 0                       |
|                  | 0.9       | 139101968            | –                | –               | –       | –                       |
|                  | 0.85      | 242781641            | –                | –               | –       | –                       |
| Lifting stations | 0.95      | 249                  | 2                | 0               | 20      | 0                       |
|                  | 0.9       | 1613                 | 5                | 0               | 80      | 0                       |
|                  | 0.85      | 14890                | 8                | 2               | 100     | 25                      |


The “Objects” column contains the two types of objects our algorithm has to detect. The “Threshold” column provides the different thresholds used: if the similarity degree of a region of the source image is above (for NCC) or below (for normalized SSD) the threshold, then the region is considered sufficiently similar to the reference object, and therefore as potentially containing it. The column “Candidate sub-images” indicates the number of elements of the similarity matrix satisfying the threshold criterion, i.e., all sub-regions with a similarity score sufficient to be considered as representing the reference image. However, multiple regions can represent a single object. The next column, “Detected objects”, gives the number of distinct detected objects on the map, i.e., the number of objects once the sub-regions have been grouped. The algorithm that selects the set of detected objects from the candidate regions has complexity O(n²); consequently, when the number of candidate regions is too high (in our experiments, above 200,000), it is not possible to accurately determine the number of detected objects, which explains the symbol “–” in Table 3. Tables 4 and 5 summarize the running times for the computation of the similarity scores (which is independent of the threshold) and for the grouping algorithm. Finally, the column “False positives” indicates the number of false positives, i.e., sub-regions with a sufficient similarity score (and therefore considered by our algorithm as representing the reference objects) that do not actually contain any reference object. Note that the detection of false positives is done manually.

In addition, the column “PCC” (Percent Correctly Classified) shows the ratio between the number of elements actually detected (after removing false positives) and the total number of elements to detect. The column “False positive rate” provides the ratio between the number of false positives and the total number of detected objects (false positive regions included). Figure 8 is a black-and-white image representation of the similarity matrix (with a threshold equal to 0.9): each white pixel is a value of the similarity matrix exceeding 0.9, and is thus the starting point of a valid candidate sub-image of the source image. Future work will develop efficient clustering algorithms for grouping areas that are sufficiently similar to be detected and refer to the same object. Figure 9 shows some rare cases of false positives returned by NCC for lifting stations. Unlike NCC for manhole detection, these false positives are more localized: most of them lie in the pink regions of the map, in particular on certain texts written in pink. False positives are almost non-existent in the detection of manholes (with a threshold of 0.9, for example), thanks to the specific and distinctive representation of manholes.
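The grouping step itself is not spelled out in the paper. A minimal sketch of the adjacency-based alternative discussed in the conclusion (grouping above-threshold cells of the similarity matrix into connected components and keeping the best-scoring cell of each component; all names are ours) could look like this:

```python
from collections import deque

def group_candidates(S, threshold):
    """Group adjacent above-threshold cells of the similarity matrix S into
    connected components (4-adjacency) and return the best-scoring cell of
    each component as the detected object location. Assumes higher scores
    are better (NCC-style)."""
    h, w = len(S), len(S[0])
    seen = [[False] * w for _ in range(h)]
    objects = []
    for i in range(h):
        for j in range(w):
            if S[i][j] >= threshold and not seen[i][j]:
                seen[i][j] = True
                comp, queue = [], deque([(i, j)])
                while queue:  # breadth-first flood fill of one component
                    x, y = queue.popleft()
                    comp.append((x, y))
                    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                        if (0 <= nx < h and 0 <= ny < w
                                and not seen[nx][ny] and S[nx][ny] >= threshold):
                            seen[nx][ny] = True
                            queue.append((nx, ny))
                objects.append(max(comp, key=lambda p: S[p[0]][p[1]]))
    return objects

# Two separated blobs of candidate cells yield two detected objects:
S = [[0.95, 0.97, 0.0],
     [0.0, 0.0, 0.92]]
assert group_candidates(S, 0.9) == [(0, 1), (1, 2)]
```

Unlike the O(n²) pairwise grouping used in the experiments, this flood fill is linear in the number of cells, but it inherits the open problem noted in the conclusion: two genuinely distinct objects that touch in the similarity matrix would be merged.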


Table 4. Running time when using the normalized SSD measure on the Gigean map

| Objects          | Threshold | Computing similarity scores (s) | Region grouping algorithm (s) |
|------------------|-----------|---------------------------------|-------------------------------|
| Manholes         | 0.10      | 50                              | 13                            |
|                  | 0.148     | 50                              | 30                            |
|                  | 0.15      | 50                              | 32                            |
|                  | 0.20      | 50                              | 105                           |
| Lifting stations | 0.10      | 50                              | 10                            |
|                  | 0.15      | 50                              | 10                            |
|                  | 0.20      | 50                              | 10                            |
|                  | 0.22      | 50                              | 12                            |

Table 5. Running time when using the normalized NCC measure on the Gigean map

| Objects          | Threshold | Computing similarity scores (s) | Region grouping algorithm (s) |
|------------------|-----------|---------------------------------|-------------------------------|
| Manholes         | 0.95      | 50                              | 10                            |
|                  | 0.935     | 50                              | 20                            |
|                  | 0.9       | 50                              | –                             |
|                  | 0.85      | 50                              | –                             |
| Lifting stations | 0.95      | 50                              | 10                            |
|                  | 0.9       | 50                              | 10                            |
|                  | 0.85      | 50                              | 11                            |

5 Conclusion and Further Work

In this paper, we studied the detection of multiple objects in maps of wastewater networks. We applied a template matching algorithm augmented with a region grouping algorithm, and examined two similarity measures: the Sum of Squared Differences (SSD) and the Normalized Cross-Correlation (NCC). We showed some advantages and drawbacks of these measures. The SSD measure is efficient at finding exact replicas of the reference image inside the source image, but intensity variations can deteriorate the results, even when the images look visually identical. The NCC measure, on the other hand, is more tolerant to such variations, though it can lead to more incorrect results. We tested both measures on the Gigean wastewater network map with both manholes and lifting stations. SSD gives more accurate detection results than NCC; we attribute this to an overall high brightness across the whole image. The experimental results show that the method is suitable for fixed object detection in this digital map.

Still, some questions remain open. The first is how to choose the right threshold automatically, namely without any prior knowledge of the map (e.g.,


Fig. 8. Image representation of similarity matrix with a threshold equal to 0.9

Fig. 9. False positive example with NCC on lifting station

giving the number of elements inside it). The second is how to improve the region grouping algorithm. In this study we used a partitioning algorithm, but on high-quality maps the number of region matches can be quite high. Obviously, this limits the maximum number of objects we can detect and also the resolution of the images we operate on. One solution could be to assume that all matching regions that are adjacent in the similarity matrix represent a single object; we could then pick the location of the best score for the bounding box of the object. But what if two different objects are so close that this assumption fails? This is still an open question. Another direction is to study other types of map elements, such as lines and areas of interest. Such elements cannot be grouped as easily as punctual


ones, since defining a bounding box might not be a satisfying solution in such cases. Another problem to be solved is the automatic detection of those elements. Currently, the template images have been extracted manually in order to keep the algorithm general. A method for automatically detecting the elements inside the legend, as well as labeling them, is under development.

Acknowledgments. This research has received support from the European Union’s Horizon research and innovation programme under the MSCA (Marie Skłodowska-Curie Actions) SE (Staff Exchanges) grant agreement 101086252; Call: HORIZON-MSCA-2021-SE-01, Project title: STARWARS (STormwAteR and WastewAteR networkS heterogeneous data AI-driven management). This research has also received support from the French national project ANR CROQUIS (Collecte, représentation, complétion, fusion et interrogation de données de réseaux d’eau urbains hétérogènes et incertaines), grant ANR-21-CE23-0004 of the French research funding agency (Agence Nationale de la Recherche, ANR).


On Object Detection Based on Similarity Measures from Digital Maps


Virtualization of the Paint Mixing Process Using the Hardware in the Loop Technique

Kelvin F. Toapanta1,3, Katherin G. Pichucho3, Jessica S. Ortiz3(B), and José Varela-Aldás2

1 SISAu Research Group, Facultad de Ingeniería, Industrial y Producción, Universidad Indoamérica, Ambato, Ecuador
[email protected]
2 Centro de Investigación de Ciencias Humanas y de la Educación – CICHE, Universidad Indoamérica, Ambato, Ecuador
[email protected]
3 Universidad de las Fuerzas Armadas ESPE, Latacunga, Ecuador
{kgpichucho,jsortiz4}@espe.edu.ec

Abstract. This paper describes the implementation of the Hardware in the Loop (HIL) technique, which in the automation area allows communication between a simulated industrial process (software) and a programmable logic controller, PLC (hardware). The simulation section uses TIA Portal for the automation of the paint mixing process; the visualization and monitoring corresponding to the SCADA system is done with the InTouch software; and the virtualization is done in the Unity 3D software, in order to provide greater realism to the user in the manipulation of the variables that make up the process. The simulation relies on bilateral communication between the InTouch and TIA Portal software through tags created in KepServer for the variables used to control and monitor the process, while the communication between Unity 3D and TIA Portal is done through the free Sharp7 library: by means of data blocks, the TIA Portal program is related to the variables created in Visual Studio. The system is connected to a real control panel, consisting of a Siemens S7-1200 PLC, through an open communication protocol that allows the exchange of data via Industrial Ethernet with TCP/IP (server-client), sending digital data stored in bits. The Ethernet information frame has a delay time of 220 ms, which is considered optimal for the response of an industrial process.

Keywords: PLC · TIA Portal · InTouch · Unity 3D · Ethernet

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 98–110, 2024. https://doi.org/10.1007/978-3-031-47715-7_8

1 Introduction

Currently in Ecuador there are several companies dedicated to paint manufacturing [1]. These companies seek to automate their processes, perfect their dye formulas, and bring their industry up to the highest standards of quality and efficiency. Such is the case of "Pinturas Líder", which manufactures and distributes paints, solvents, and industrial coatings using automated processes [2]. This generates the need for the industry to focus on studies that increase the efficiency and robustness of these processes, which requires research equipment whose high cost means that the development of new industrial control proposals does not advance at the desired rate [3]. Therefore, according to [4], an effective solution for this type of process is the development of virtual work environments, which allow testing advanced controls focused on industry. This is due to the realism that must be considered in the environment, i.e., a real presentation of the behavior and response of the actuators and sensors that support the control issued by a PLC [5]. PLCs house the control of the designed and stored process; their use within industry is due to aspects such as simple programming, high reliability, and ease of communication with a computer, which is necessary for the virtualization of a mixing process [6].

The virtual environment serves as the main tool for the operator to coordinate and control the industrial process. This technology simulates the physical presence of the user in the environment, allowing interaction through an avatar that represents an operator. At the same time, a virtualized environment reduces capital investment and operating expenses, and improves business continuity and disaster recovery capabilities [7].

From the programming in the TIA Portal software and the communication with KepServer, a SCADA (Supervisory Control and Data Acquisition) system, represented in the InTouch software, is put into operation. It allows controlling and supervising the process, as well as collecting data from certain components, providing real-time feedback from the field devices (sensors and actuators) and controlling the process automatically [8].
The SCADA system provides all the information generated in the production process (supervision, quality control, production control, data storage, etc.) and allows its management and intervention [9]. The variables designated in the SCADA system are sent from the TIA Portal software through a database, creating the respective tags in the KepServer software. To verify the communication of these tags or variables, it is convenient to click on "Quick Client" [10], where a menu of options is displayed; a quality described as "Good" confirms that the communication is valid [11]. Once the communication is established, we use techniques and concepts of Hardware in the Loop (HIL), where the operating conditions of a real system can be emulated by means of a mathematical model embedded in an electronic system [12]. The main purposes of this simulation are to improve product quality, manage system complexity, and reduce development costs [13]. Other authors also use HIL simulation to facilitate the testing of prototypes under multiple conditions, such as a powertrain for an electric vehicle [14], due to the ease of testing its suitability for different scenarios [15]. This article is organized as follows: Sect. 2 presents the proposal for the implementation of HIL in a paint mixing process; Sect. 3 describes the development of the process, subdivided into simulation and communication sections; Sect. 4 presents the results obtained; and finally the conclusions are given in Sect. 5.


2 Proposed Implementation

The work proposes to perform a HIL simulation for academic purposes, as a procedure to understand the scope and limitations of the technique before using it in applications with fast dynamics and complex systems. The performance of a HIL simulation depends on both the dynamic complexity of the system and the real-time computer hardware used. Among the main advantages of HIL is the possibility of running the simulation in advance and in parallel with the development of the plant, reducing its manufacturing cost. In this context, the implementation is subdivided into two parts, as shown in Fig. 1.

Fig. 1. Subdivision of the HIL implementation in the paint mixing process.

The implementation of the HIL, described in Fig. 1, consists of the software part, in charge of controlling, manipulating, acquiring data from, and visualizing the process in real time, and the hardware part, which is connected to the software to control the process externally. 1. Software Part: it consists of three phases. (i) Control: a sequential control of the paint mixing process is developed in the TIA Portal V16 software, which allows manipulating digital variables such as Start and Stop of the process, as well as the paint selectors, while controlling the level of the four main paints through previously programmed timers; (ii) SCADA: responsible for supervising, controlling, and acquiring data from the mixing process, which helps to improve remote decision-making from a control booth, in this case using the InTouch software; (iii) Virtualization: considered as a graphic engine used for the creation of virtual environments focused on industry, due to the immersion and interaction it offers, seeking to considerably optimize technological resources and avoid unnecessary expenses; this virtualization is designed in the Unity 3D software. 2. Hardware Part: the implementation of the virtualization of the paint mixture uses a Siemens S7-1200 PLC (CPU 1212 AC/DC/RLY, 6ES7 212-1BE40-0XB0), chosen for the number of inputs and outputs suitable for the virtualization process, in addition


to a control panel consisting of buttons and indicators that give the desired performance to the simulation part. Finally, the hardware communicates with the software through the HIL technique, by means of a database created in the TIA Portal software that stores read and write variables. Note that the S7-1200 PLC must have firmware 4.0 or higher, because lower versions do not offer the PUT/GET option that enables communication with the Unity 3D virtualization software through the free Sharp7 library, using the TCP/IP communication protocol. The interaction with the SCADA is done through an open OPC communication protocol that allows the manipulation of set points and the visualization of process variables, enabling interaction between the operator and the machine.
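The read/write exchange over the shared data block can be pictured as a polling loop. The following Python sketch simulates this exchange with a plain byte buffer standing in for the PLC data block; no real Sharp7 or network calls are made, and the field layout and values are illustrative assumptions, not the ones used in the paper.

```python
import struct

class SimulatedDB:
    """Stand-in for a PLC data block: a flat byte buffer, as a client library
    such as Sharp7 would expose it. The field layout below is hypothetical."""

    def __init__(self, size):
        self.buf = bytearray(size)

    def read(self, start, size):
        return bytes(self.buf[start:start + size])

    def write(self, start, data):
        self.buf[start:start + len(data)] = data

# Assumed layout: byte 0 = command bits, bytes 2-3 = tank level (Int),
# bytes 4-7 = setpoint (Real). S7 CPUs store multi-byte values big-endian.
db = SimulatedDB(8)

# "Unity side": write a start command and a setpoint into the block.
db.write(0, b"\x01")
db.write(4, struct.pack(">f", 75.5))

# "PLC side": poll the command bit and publish a simulated tank level.
start_cmd = db.read(0, 1)[0] & 0x01
if start_cmd:
    db.write(2, struct.pack(">h", 420))

level = struct.unpack(">h", db.read(2, 2))[0]
setpoint = struct.unpack(">f", db.read(4, 4))[0]
print(start_cmd, level, setpoint)  # → 1 420 75.5
```

On the real system, the same byte offsets would be read and written through the Sharp7 client over TCP/IP instead of a local buffer.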

3 Design of the Process

The design and implementation of the elements for the execution of the HIL technique, the automation of the mixing process, and the classification of paint cans and containers are carried out in order to improve productivity and avoid material waste. In addition, the personnel in charge of this activity reduce their occupational risk, since they are no longer exposed for long periods to the chemical components of the paint.

3.1 Virtualization of the Mixing Process

This section presents the methodology used for the virtualization of the paint mixing process, which is divided into four main stages: reference models; 3D modeling (plant design); programming and visualization; and Hardware in the Loop communication between the PLC and the computer, as shown in Fig. 2. (i) Reference Model: the design is based on the characteristics and aspects of a real industrial factory, complying with the norms and standards of each instrumentation element, represented in a P&ID diagram that is taken as a reference for the creation of the plant in 3D using computer-aided design (CAD) software, where the operator can identify the location and functionality of each component used in the mixing process; (ii) 3D Modeling: the design of the quadruple-tank process is developed with the AutoCAD Plant 3D software, where layers are defined to distribute the equipment and devices (tanks, flanges, pipes, structure, meshes) as well as the dimensions of the design; structures containing the tanks are created, and valves and pumps are placed. The file created has the .dwg extension.
To change the format, the native AutoCAD Plant 3D file must be opened in the SketchUp software, which produces the .fbx file extension recognized by the Unity 3D software; (iii) Programming and Virtualization of the 3D Industrial Environment: the different characteristics of the process, such as liquid filling, actuators, sensors, and visual effects, are incorporated using the Unity 3D graphics engine and object-oriented programming in Visual Studio, giving the virtualized plant a greater resemblance to the real plant. These consist


Fig. 2. Stages of the virtualization process.

of: (a) an Avatar, which navigates through the virtual process; (b) System graphics, which show the curves of the process variables of the mixing tank; and, to provide greater realism to the process, (c) Animation, which shows the variation of the mixing tank level, the change of color according to the selected push button, and system failures in case of any catastrophe; (iv) Industrial Communication: with the PLC, using the Profinet protocol, the virtualized plant reads an IP address generated by the S7-1200 PLC (CPU 1212 AC/DC/RLY) in order to write and read the data implemented in the PLC for the communication between the SCADA system designed in the InTouch software and the Unity 3D virtualization, thus applying the Hardware in the Loop technique.

3.2 Design and Implementation of the SCADA System

For the proper functioning of the SCADA system, the field-level variables, such as sensors and actuators, are declared; these must be assigned in the industrial OPC platform, generating the tags that are later assigned to each of the elements of the process. The SCADA environment consists of the following elements: control screens whose function is to start/stop the process and select the desired type of color, as well as real-time data acquisition, considering data such as the count of bottles and boxes, and the history of events and alarms that occur during the mixing of paint. These events are represented in graphs that show the trend of the process, as shown in Fig. 3. To generate the SCADA system, the InTouch software was used, which offers graphic animation and is intuitive and flexible in the design of industrial applications. Within this software, each of the tags assigned to each element of the process is declared. These tags were previously declared within the KepServer software, taking as reference the input and output variables sent from the TIA Portal V16 software.
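Conceptually, the tag declaration step amounts to maintaining a table that maps each SCADA tag name to a PLC address and data type. The sketch below is a hypothetical Python illustration; the tag names, addresses, and validation rules are invented for the example and are not taken from the paper.

```python
# Hypothetical tag table: KepServer-style tag names mapped to S7 addresses.
TAGS = {
    "Start":       {"address": "DB1.DBX0.0", "type": "Bool"},
    "Stop":        {"address": "DB1.DBX0.1", "type": "Bool"},
    "TankLevel":   {"address": "DB1.DBW2",   "type": "Int"},
    "MixSetpoint": {"address": "DB1.DBD4",   "type": "Real"},
}

def validate(tags):
    """Sanity-check the table before declaring the tags in the OPC server."""
    errors = []
    for name, tag in tags.items():
        if not tag["address"].startswith("DB"):
            errors.append(f"{name}: only data-block addresses expected")
        if tag["type"] not in {"Bool", "Int", "Real"}:
            errors.append(f"{name}: unsupported type {tag['type']}")
    return errors

print(validate(TAGS))  # → []
```

Keeping such a table in one place makes it easy to check that the SCADA side and the TIA Portal data block stay consistent.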


Fig. 3. Design and implementation of the SCADA system.

3.3 Communication

To start the simulation of the paint mixing process, the programs are linked as shown in Fig. 4.

Fig. 4. Communication between the Unity virtual environment, TIA Portal, and InTouch.

The communication between TIA Portal V16 and Unity 3D was done using the free Sharp7 library and the TCP/IP communication protocol. This library allows the use of methods or functions within a script, relating the variables of the data blocks in the TIA Portal software with the variables created in Visual Studio and interacting with Unity 3D, which sends and receives the data with the help of the imported .dll (Dynamic Link Library) file, which must be in the root of the project where all the scripts are located [12].


Fig. 5. Designation of tags for SCADA system.

A screen is configured in InTouch to confirm the PLC input and output activation signals for use by the user's personnel. The communication between the S7-1200 PLC and InTouch was handled through the OPC communication channel using the KepServer software; the respective tags were designated for each of the input and output variables, as shown in Fig. 5.

3.4 Network Setting

The PLCs of the S7-1200 family communicate via the Profinet interface. To do this, they need to be assigned an IP address that places them on the same network. Ethernet/IP networks are communication networks that provide a complete set of messages and services for many automation applications. This open network standard uses standard Ethernet communication products to support real-time I/O, information exchange, and general messaging functions. To establish communication between the physical PLC and the computer, it is necessary to assign a static IP in the Ethernet properties of the PLC, as shown in Fig. 6. It is also necessary to declare the IP inside the KepServer software so that it can link correctly with the physical S7-1200 PLC and pass the information to the InTouch SCADA system, as shown in Fig. 7.
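The "same network" requirement can be checked mechanically. Here is a small Python example using the standard ipaddress module; the addresses are illustrative only, since the paper does not state the actual ones.

```python
import ipaddress

# Illustrative addresses only.
plc_ip = ipaddress.ip_interface("192.168.0.10/24")
pc_ip = ipaddress.ip_interface("192.168.0.20/24")

# The PLC and the computer can exchange data directly only if both static
# addresses fall inside the same subnet.
same_net = plc_ip.network == pc_ip.network
print(same_net)  # → True
```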

4 Experimental Results

The results obtained with the implementation of the HIL technique in a paint mixing process are described below, starting with the operation of the virtualized environment, which consists of the actuators and sensors of the process; these have the physical and operating characteristics of the real elements in order to deliver a high level of realism to the user (see Table 1).

4.1 Stages of Virtualization of the Paint Mixing

The automated process of an industrial plant dedicated to the mixing of paint colorants contains three production stages: mixing, bottling, and capping/labeling.


Fig. 6. Assigning a static IP address.

Fig. 7. Assigning an IP within TIA Portal and KepServer.

The pre-production stage starts with a start button that launches the process, which stops only with the emergency stop button, and verifies the existence of a low-level alarm. Once started, the three stages are fulfilled as shown in Fig. 8. • Mixing: It consists of four tanks with primary colors; each tank has a pump, which takes the liquid to the deposit tank. Inside this tank, two level sensors and a mixer are placed. The operator is free to choose the color to be produced, using a control panel with four buttons. Pressing one of them activates the pump of the tank containing dye A, which fills the deposit tank until sensor one is triggered; the pump of tank A then stops, and the pump of the tank containing liquid B immediately starts, remaining active until sensor two is triggered. Once sensor two


Table 1. Equipment used for virtualization

• Centrifugal pump
• Limit switch (sensor)
• Mixer
• Electrovalve
• Belt conveyor
• Tanks (TK)

Fig. 8. Stages of virtualization of the paint mixing.

is activated, the motor of the paint mixer starts for 30 s. Once the mixing stage is finished, the bottling stage starts. • Bottling: It consists of a conveyor belt, a solenoid valve, and a presence sensor. Once the mixing stage is finished, the conveyor belt starts moving with the paint cans on it until they are detected by a presence sensor located exactly under the solenoid valve, which is in charge of depositing the exact amount of liquid in each paint can. When the can is full, the conveyor belt is activated again and takes the cans to the last stage.


• Capping and Labeling: It consists of a presence sensor, a double-acting cylinder, a labeling machine, and a conveyor belt. The conveyor belt of the bottling stage takes the cans to the conveyor belt of the labeling stage, which remains active until a can is detected by the presence sensor located under the double-acting cylinder, whose function is to take the caps to the cans and press them on so that the cans are sealed. Once the cylinder has fulfilled its function, the belt is activated again and the cans pass through the labeling machine. It should be noted that the process ends when the three stages are completed; otherwise, a new color cannot be chosen. The process can be controlled by means of a control panel, as shown in Fig. 9. It consists of six buttons: (a) Start, in charge of starting the paint mixing process; (b) Stop, which stops the whole process in case of emergency; (c) Green paint selector; (d) Orange paint selector; (e) Purple paint selector; (f) Pink paint selector; and there is also an indicator light that shows whether the process is running. This control panel also displays the total values of each paint, as well as the level and the number of boxes classified by color.
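The mixing sequence described above (pump A until sensor one, pump B until sensor two, then mix) can be sketched as a small discrete simulation. The Python sketch below is a simplified illustration: the color recipes, sensor level marks, and fill rate are assumptions made for the example; only the 30 s mixing time comes from the text.

```python
# Minimal discrete simulation of the mixing sequence: pump A until level
# sensor 1 trips, pump B until level sensor 2 trips, then mix for 30 s.
# Color recipes, sensor marks and fill rate are illustrative assumptions.
RECIPES = {
    "green":  ("yellow", "blue"),
    "orange": ("red", "yellow"),
    "purple": ("red", "blue"),
    "pink":   ("red", "white"),
}
SENSOR1, SENSOR2 = 50, 100  # simulated level marks in the deposit tank
MIX_TIME = 30               # seconds, as stated in the text

def run_mixing(color, fill_rate=10):
    dye_a, dye_b = RECIPES[color]
    level, log = 0, []
    while level < SENSOR1:      # pump A fills until sensor 1
        level += fill_rate
    log.append(f"pump {dye_a} off at {level}")
    while level < SENSOR2:      # pump B fills until sensor 2
        level += fill_rate
    log.append(f"pump {dye_b} off at {level}")
    log.append(f"mixer on for {MIX_TIME} s")
    return log

print(run_mixing("green"))
```

In the real system, the level comparisons would be replaced by reads of the two level sensors and the pump commands by PLC outputs.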

Fig. 9. Virtualization of the mixing process in Unity 3D.

The avatar shown in Fig. 10 allows greater immersion, so that the user can have a more direct and real interaction with the virtualized process. This element is dressed in PPE (Personal Protective Equipment), widely used in the area of industrial safety and designed specifically for the protection of the worker in the work environment. The virtual environment works jointly with the TIA Portal program and Unity 3D, executing the communication commands and a database that allows bilateral communication, so that its operation is carried out in real time through the online connection of the programming software, where operation can be verified once the ladder logic is marked green, as shown in Fig. 11. Figure 12 presents the implemented process, for which the S7-1200 PLC is interconnected with the virtual environment; the communication uses the Ethernet TCP/IP protocol, which works with a provider-client architecture, where the PLC is


Fig. 10. Scroll avatar animation.

Fig. 11. Communication of the virtualization between TIA Portal and Unity software.

the provider and the virtual environment assumes the role of client. This communication has an industrial focus, so its transmission speed is 10 Mbps.
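At that link speed, the transmission time of the exchanged frames is negligible compared with the reported process delay. A rough Python estimate (the 64-byte minimum Ethernet frame size is an assumption for this back-of-the-envelope check):

```python
# Rough sanity check: at 10 Mbps, transmitting even a full minimum-size
# Ethernet frame (64 bytes, assumed here) is negligible next to the
# 220 ms delay reported for the process.
frame_bits = 64 * 8
link_bps = 10_000_000
tx_time_ms = frame_bits / link_bps * 1000
print(round(tx_time_ms, 4))  # → 0.0512
```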

5 Conclusions

The work presents the design and connection of industrial devices and equipment corresponding to a SCADA system in a virtual environment in Unity 3D. It is based on instrumentation standards, providing users with a realistic environment of the quadruple-tank process and making it possible to manipulate and visualize the plant through a control panel. The Profinet protocol of the S7-1200 PLC (CPU 1212 AC/DC/RLY, 6ES7 212-1BE40-0XB0) allowed the communication between the PLC and the virtualized plant, achieving the sending and receiving of data for the correct operation of the control algorithms that act on the process. Digital data were exchanged as bits (1 bit), bytes (8 bits), integers (16 bits), and reals (32 bits), with a delay time of 220 ms (65535 ms), which is considered optimal within the response of a paint mixing process.
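The four data widths mentioned above map directly onto the bytes exchanged over Ethernet; S7 CPUs store multi-byte values big-endian. A minimal Python sketch of this packing (a Bool occupies a full byte on the wire in this simplified example; the values are arbitrary):

```python
import struct

# Pack one value of each width: Bool (1 bit, carried in a byte here),
# Byte (8 bits), Int (16 bits), Real (32 bits) -- all big-endian as on S7.
payload = struct.pack(">?Bhf", True, 0xA5, -1234, 2.5)
print(len(payload))  # → 8

flag, byte_val, int_val, real_val = struct.unpack(">?Bhf", payload)
print(flag, byte_val, int_val, real_val)  # → True 165 -1234 2.5
```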


Fig. 12. Physical control with Unity 3D.

The immersion and familiarization of the user with the process control systems implemented through the Hardware in the Loop technique were essential to achieve a correct understanding of their operation. The import of 3D graphics for each element that composes the process plant contributes greatly to making the simulation of these plants appear very close to the real process stations, allowing the user to manipulate each element correctly and fulfilling the objective of carrying out the control of the system. Acknowledgments. The authors would like to thank the ARSI Research Group of the Universidad de las Fuerzas Armadas ESPE and the SISAu Research Group of the Universidad Indoamérica for their support in the development of this work.

References

1. Nicolas, C.-F., Ayestaran, I., Poggi, T., Sagardui, G., Martin, J.-M.: A CAN restbus HiL elevator simulator based on code reuse and device para-virtualization. In: Proceedings of the 2017 IEEE 20th International Symposium on Real-Time Distributed Computing (ISORC), May 2017, pp. 117–124 (2017)
2. Jahromi, N.T., Yangui, S., Shanmugasundaram, S., Rangy, A., Glitho, R.H., Larabi, A., Smith, D., Brunner, R.: A prototype for value-added video service provisioning in content delivery networks. In: Proceedings of the 2017 14th IEEE Annual Consumer Communications & Networking Conference (CCNC), Jan 2017, pp. 597–598 (2017)
3. Chen, H., Jin, H., Hu, K., Yuan, M.: Adaptive audio-aware scheduling in Xen virtual environment. In: Proceedings of the ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2010), May 2010, pp. 1–8 (2010)
4. Trabado de la Cruz, M.: Aplicación de la realidad virtual al diseño y programación de la automatización de procesos industriales: versión simulación de procesos (2020)
5. ARSI: Automatización, Robótica y Sistemas Inteligentes. ISBN 978-9942-30-406-3


6. Wang, J., Jia, Z., Campos, L.A., Knittle, C.: Delta-sigma modulation for next generation fronthaul interface. J. Lightwave Technol. 37, 2838–2850 (2019). https://doi.org/10.1109/JLT.2018.2872057
7. Muench, D., Isfort, O., Mueller, K., Paulitsch, M., Herkersdorf, A.: Hardware-based I/O virtualization for mixed criticality real-time systems using PCIe SR-IOV. In: Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering, Dec 2013, pp. 706–713 (2013)
8. Zhiping, H., Longfei, T.: HIL simulation for the closure process of contactors based on neural network. In: Proceedings of the 2019 4th International Conference on Intelligent Green Building and Smart Grid (IGBSG), Sept 2019, pp. 300–303 (2019)
9. Tavagnutti, A.A., Coletto, T., Bosich, D., Sulligoi, G.: HIL virtual prototyping of a DC shipboard microgrid. In: Proceedings of the 2021 AEIT International Annual Conference (AEIT), Oct 2021, pp. 1–6 (2021)
10. Araque, E.J., Cajamarca, A.A., Paredes, D.E., Segovia, J.A., Tumbaco, W.W., Miranda, M., Andaluz, V.H.: Soda bottling process through the virtual hardware in the loop technique. In: Mesquita, A., Abreu, A., Carvalho, J.V. (eds.) Proceedings of the Perspectives and Trends in Education and Technology, pp. 621–631. Springer, Singapore (2022)
11. Quispe, M.A., Molina, M.C., Ortiz, J.S., Andaluz, V.H.: Unicycle mobile robot formation control in hardware in the loop environments. In: Botto-Tobar, M., Montes León, S., Camacho, O., Chávez, D., Torres-Carrión, P., Zambrano Vizuete, M. (eds.) Proceedings of the Applied Technologies, pp. 430–443. Springer International Publishing, Cham (2021)
12. Reitz, J., Gugenheimer, A., Roßmann, J.: Virtual hardware in the loop: hybrid simulation of dynamic systems with a virtualization platform. In: Proceedings of the 2020 Winter Simulation Conference (WSC), Dec 2020, pp. 1027–1038 (2020)
13. Unity Technologies: Unity Real-Time Development Platform | 3D, 2D, VR & AR Engine. https://unity.com/. Accessed 13 Feb 2023
14. Ortiz, J.S., Palacios-Navarro, G., Andaluz, V.H., Guevara, B.S.: Virtual reality-based framework to simulate control algorithms for robotic assistance and rehabilitation tasks through a standing wheelchair. Sensors 21, 5083 (2021). https://doi.org/10.3390/s21155083
15. Münch, D., Paulitsch, M., Hanka, O., Herkersdorf, A.: MPIOV: scaling hardware-based I/O virtualization for mixed-criticality embedded real-time systems using non-transparent bridges to (multi-core) multi-processor systems. In: Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), Mar 2015, pp. 579–584 (2015)

Locally Enhanced Chan-Vese Model with Anisotropic Mesh Adaptation for Intensity Inhomogeneous Image Segmentation

Karrar K. Abbas1 and Xianping Li2(B)

1 University of Kufa, Najaf, Iraq
[email protected]
2 Arizona State University, Tempe, AZ 85281, USA
[email protected]

Abstract. The Chan-Vese (CV) model is a well-known mathematical model for image segmentation; however, it has difficulty handling images with inhomogeneous intensity. Many models have been proposed to address this difficulty. In this paper, we propose a locally enhanced Chan-Vese (LECV) model to successfully segment images with intensity inhomogeneity. We define a new signed pressure force (SPF) function based on the local image information from a triangular mesh representation. The anisotropic mesh representation (AMA representation) of the image also helps improve the computational accuracy and efficiency. Numerical results demonstrate that our proposed LECV model provides better segmentation for images with inhomogeneous intensity than the traditional Chan-Vese model as well as a few commonly used models.

Keywords: Image segmentation · Chan-Vese model · Inhomogeneous intensity · LECV · SPF function · AMA representation

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 111–127, 2024. https://doi.org/10.1007/978-3-031-47715-7_9

1 Introduction

The vast and fast advancement in computer electronics and vision, as well as the need for them in various fields such as medicine, industry, and the military, has led to the emergence of the image processing field [1–3]. To process a given digital image, specialists need to identify the region of interest (ROI) through the process of segmentation. Therefore, image segmentation is an important process that has a significant impact on other processes such as image recognition and detection [4–6]. There are many different mathematical models for image segmentation based on the Mumford-Shah functional [7, 8] or active contour models [9–17]. Accurate segmentation is still a challenging task due to a variety of difficulties such as inhomogeneous image intensities, image noise, low boundary disparities, or blurred edges [18–20]. The presence of these difficulties makes it hard to develop a mathematical model that deals with all these challenges. The mathematical image segmentation models can be categorized into two classes: edge approach models and regional approach models. Edge approach models use the

112

K. K. Abbas and X. Li

edge information of the image to segment and to control the convergence of the contour. Those models are usually fast to compute but do not work well with blurred edges. They are also sensitive to the initialization of the contour curves. One example is the geodesic active contour (GAC) model [21]. The GAC model starts with an initial contour curve that continuously evolves and converges to the edges of an image object. It converges to the target object according to an edge stopping function (ESF) constructed from the image gradient. The initial contour curve fails to converge if its position is far from the target object that needs to be segmented [22]. On the other hand, regional approach models utilize global information instead of edge information to control the convergence of the contour curves. One advantage of the regional approach models over the edge-based models is that they are less sensitive to noise and to contour curve initialization. One example of this approach is the Chan-Vese (CV) model [8], which takes the information of the whole image into consideration. It can segment an image with blurred edges or low boundary disparity. The disadvantage of the CV model is that it cannot deal with inhomogeneous intensity images. Local region segmentation methods have been introduced among the regional approach models to address this disadvantage. Examples of this approach are the local binary fitting (LBF) model [23, 24], which successfully segments inhomogeneous intensity images, and the local image fitting energy (LIF) model [25], which improves the performance over LBF. However, local information alone is generally not sufficient to produce an accurate segmentation. Furthermore, this approach is highly sensitive to the initial contour and may stay at a local minimum without converging to the boundaries of the target object in the image.
Mixed approaches combining local and global regional models have been proposed later on, such as the local and global intensity fitting (LGIF) model [26], the global and local region active contour (GARAC) model [27], and the GCV model [22]. The GCV model improves the GAC model by replacing the edge stopping function (ESF) with a proposed signed pressure force (SPF) function. The GCV model works well with noisy images. It also has the advantage of moving in the direction of the inward or outward normal based on the position of the initial curve with respect to the desired object. However, it still cannot deal with inhomogeneous image intensities. Other mixed approach models try to include local and global information in the SPF [28–30] to deal with inhomogeneous image intensities, but they encounter other difficulties such as sensitivity to contour initialization, additive noise of high strength, and object leaking. Xiuming et al. [31] construct a new SPF containing local and global fitting terms that replaces the ESF in the GAC model. The preliminary results of Xiuming's model show good performance when segmenting low-resolution MR brain images. It is worth mentioning that all the mentioned models have only been applied to low-resolution images; applying them directly to high-resolution images would require significant computational cost. Motivated by the CV and GCV models, we propose a locally enhanced Chan-Vese (LECV) model based on the CV model and the GCV global SPF function. We define a local SPF function to address the inhomogeneous intensity issue and utilize an anisotropic adaptive mesh in the computations, which leads to faster convergence. Our proposed LECV model can deal with high-resolution images directly, without the need to down-sample the images to lower resolutions.

Locally Enhanced Chan-Vese Model with Anisotropic Mesh

113

The outline of this paper is as follows. Section 2 is a brief introduction to the different segmentation models and the anisotropic mesh adaptation framework. Some numerical results are presented in Sect. 3 and some conclusions are drawn in Sect. 4.

2 Models and Methods

Let I_0 be the given grayscale image and x \in \Omega \subset R^2, and let C(q) : [0, 1] \to R^2 be a contour curve in \Omega. In this section, we give a brief review of the different segmentation models, including CV, GCV, LBF, LIF, LGIF, and our Locally Enhanced Chan-Vese (LECV) model. Then we describe our framework using anisotropic mesh adaptation for image segmentation.

2.1 Review of Image Segmentation Models

Chan-Vese (CV) Model. The Chan-Vese (CV) model is a well-known region-based model [8] that is a simplification of the Mumford-Shah functional [7]. CV assumes that the image intensity under process is homogeneous. The energy functional of the CV model for the segmentation curve is

E^{CV} = \lambda_1 \int_{\mathrm{inside}(C)} |I_0(x) - c_1|^2 \, dx + \lambda_2 \int_{\mathrm{outside}(C)} |I_0(x) - c_2|^2 \, dx,    (1)

where C is the boundary of a closed set, \lambda_1, \lambda_2 are positive parameters, and c_1 and c_2 are chosen as the region averages

c_1 = \frac{\int_\Omega I_0(x) H(\phi(x)) \, dx}{\int_\Omega H(\phi(x)) \, dx}, \quad c_2 = \frac{\int_\Omega I_0(x)(1 - H(\phi(x))) \, dx}{\int_\Omega (1 - H(\phi(x))) \, dx},    (2)

with \phi(x) being the level set function and H(\phi(x)) the Heaviside function. In our computation, we use the regularized form

H_\epsilon(\phi) = \frac{1}{2} \left( 1 + \frac{2}{\pi} \arctan\frac{\phi}{\epsilon} \right),    (3)

for some \epsilon > 0. The PDE form of the CV model is

\frac{\partial \phi}{\partial t} = \delta(\phi) \left( \mu \nabla \cdot \frac{\nabla\phi}{|\nabla\phi|} - \nu - \lambda_1 (I_0(x) - c_1)^2 + \lambda_2 (I_0(x) - c_2)^2 \right),    (4)

where \mu and \nu are positive parameters and

\delta_\epsilon(\phi) = \frac{dH_\epsilon}{d\phi} = \frac{\epsilon}{\pi(\epsilon^2 + \phi^2)}.    (5)
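To make the CV ingredients concrete, the regularized Heaviside (3), its derivative (5), and the region averages (2) discretize naturally over a pixel array. The following NumPy sketch is our own illustration, not the paper's code; the function names and the convention that ϕ > 0 marks the interior are our assumptions:

```python
import numpy as np

def heaviside(phi, eps=1.0):
    # Regularized Heaviside (3): H_eps(phi) = (1/2) * (1 + (2/pi) * arctan(phi/eps))
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def delta(phi, eps=1.0):
    # Its derivative (5): delta_eps(phi) = eps / (pi * (eps^2 + phi^2))
    return eps / (np.pi * (eps ** 2 + phi ** 2))

def region_averages(I0, phi, eps=1.0):
    # Region averages (2): c1 over the (smoothed) interior, c2 over the exterior.
    H = heaviside(phi, eps)
    c1 = (I0 * H).sum() / H.sum()
    c2 = (I0 * (1.0 - H)).sum() / (1.0 - H).sum()
    return c1, c2
```

For a nearly sharp interface (small ε) on a two-valued image, c1 and c2 approach the mean intensities of the two regions.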


The CV model is a global region-based segmentation model. It does not use local information, thus it does not work well on images with inhomogeneous intensities.

GCV Model. The GCV model [22] is a regional active contour model that combines the benefits of both the CV model and the GAC model by proposing a region-based information function called the signed pressure force (SPF) function. The SPF function is constructed from the global region intensities c_1 and c_2 proposed by the CV model for the inside and outside of the contour curve C, respectively. The value of the SPF function lies in the range [-1, 1] and controls the shrinking or expanding of the contour curve according to the position of the curve:

\mathrm{spf}(I_0(x)) = \frac{I_0(x) - \frac{c_1 + c_2}{2}}{\max \left| I_0(x) - \frac{c_1 + c_2}{2} \right|}, \quad x \in \Omega,    (6)

where the values of c_1 and c_2 are defined in (2). The GCV model replaces the ESF in the GAC model with this SPF to formulate the model as

\frac{\partial \phi}{\partial t} = \mathrm{spf}(I_0(x)) |\nabla\phi| \left( \nabla \cdot \frac{\nabla\phi}{|\nabla\phi|} + \alpha \right) + \nabla \mathrm{spf}(I_0(x)) \cdot \nabla\phi,    (7)

with \alpha being a positive parameter. GCV introduces a new level-set formulation that utilizes a Gaussian filter to regularize the level set function. This regularization helps prevent the level set function from becoming too steep or flat and decreases the computational complexity compared to traditional level set methods. Even though the GCV model combines the advantages of the CV model and the GAC model, it does not work well on images with inhomogeneous intensities [31].

LBF Model. To improve the performance of the CV and GCV models for inhomogeneous images, Li et al. [23] proposed a model based on the local binary fitting (LBF) energy functional:

E^{LBF} = \lambda_1 \int \left( \int K_\sigma(x - y) |I_0(y) - f_1(x)|^2 H(\phi(y)) \, dy \right) dx + \lambda_2 \int \left( \int K_\sigma(x - y) |I_0(y) - f_2(x)|^2 (1 - H(\phi(y))) \, dy \right) dx,    (8)

where K_\sigma, f_1 and f_2 are the kernel function and local intensities defined as follows:

K_\sigma(x) = \frac{1}{(2\pi)^{n/2} \sigma^n} e^{-|x|^2 / 2\sigma^2}, \quad f_1(x) = \frac{K_\sigma(x) * [H(\phi(x)) I_0(x)]}{K_\sigma(x) * H(\phi(x))}, \quad f_2(x) = \frac{K_\sigma(x) * [(1 - H(\phi(x))) I_0(x)]}{K_\sigma(x) * [1 - H(\phi(x))]}.    (9)

The model is given by

\frac{\partial \phi}{\partial t} = \delta_\epsilon(\phi) \left( \mu \nabla \cdot \frac{\nabla\phi}{|\nabla\phi|} - \lambda_1 e_1 + \lambda_2 e_2 \right) + \nu \left( \nabla^2 \phi - \nabla \cdot \frac{\nabla\phi}{|\nabla\phi|} \right),    (10)

where e_1 and e_2 are the functions defined as

e_1(x) = \int K_\sigma(y - x) |I_0(x) - f_1(y)|^2 \, dy, \quad e_2(x) = \int K_\sigma(y - x) |I_0(x) - f_2(y)|^2 \, dy.
Because LBF uses the local information represented by the fitting centers f_1 and f_2, which depend on the localization of the kernel function K_\sigma, it can successfully segment gray-scale inhomogeneous images. However, the LBF energy functional may stay at an undesirable local minimum due to its dependence on the initial contour. Furthermore, the computation of f_1 and f_2 is highly time consuming.

LIF Model. The local image fitting model proposed by Zhang et al. [25] is another model introduced to segment inhomogeneous images, with energy functional

E^{LIF}(m_1, m_2, \phi) = \frac{1}{2} \int_\Omega |I_0 - I^{LIF}|^2 \, dx,    (11)

where I^{LIF}, m_1 and m_2 are defined as follows:

I^{LIF} = m_1(x) H(\phi) + m_2(x)(1 - H(\phi)), \quad m_1(x) = \mathrm{mean}(I_0(x),\ x \in \{\phi < 0\} \cap W_k(x)), \quad m_2(x) = \mathrm{mean}(I_0(x),\ x \in \{\phi > 0\} \cap W_k(x)),    (12)

and m_1 and m_2 are average image intensities in the rectangular window function W_k. Hence, m_1, m_2 and W_k are similar to f_1, f_2 and K_\sigma in the LBF model. The LIF model improves computational efficiency compared to the LBF model; however, the convergence still depends on the curve initialization, as in the LBF model.

LGIF Model. Building on the predecessor models, Wang et al. [26] established a global fitting energy based on the CV and LBF energy functionals and introduced a hybrid model that utilizes both local and global information (LGIF). The LGIF energy functional is defined as

E^{LGIF} = (1 - \omega) E^{LBF} + \omega E^{GIF},    (13)

where E^{GIF} is the energy functional of the CV model in (1) and \omega is a weight parameter between 0 and 1. The LGIF model improves on its predecessor models with better accuracy and robustness. However, it highly depends on the weight, which is set manually and is inconvenient from an application standpoint. Furthermore, before applying these models, the images have to be re-sized to lower resolutions.
where E GIF is the energy functional of CV model in (1) and ω is a weight parameter between 0 and 1. LGIF model improves its predecessor models with better accuracy and robustness. However, it highly depends on the weight that is set manually, which is inconvenient for an application aspect. Furthermore, before applying these models, the images have to be re-sized to lower resolutions. Our Proposed LECV Model. Inspired by other models mentioned above, we introduce a new SPF function based on the local image information to improve the CV model. The idea is to combine the advantage of the CV model as a global based model with our local SPF function that helps dealing with the inhomogeneous intensities. Two functions fint (y) and fext (y) as local fitting functions are computed respectively in the local interior and exterior of neighbor vertices y of point x, as shown in Fig. 1. The picture Apple is taken from [40]. For each point x lies on the evolving curve C, we calculate the local interior fitting function fint (y) where y is the directly connected vertices of x in the local interior neighbourhood. Similarly, we calculate the local exterior fitting function fext (y) where y is the


Fig. 1. Graphical representation of local information. a The adaptive mesh for image apple (taken from [40]). The green curve denotes the contour curve c, which divides the image into two regions, interior region c1 and the exterior region c2 . b Our AMA representation: the local neighborhood of x is represented by the red mesh of triangles. The red neighborhood is split by the green curve into local interior (green) points yi and local exterior (blue) points yj . c Regular representation: the local neighborhood of x is represented by the black rectangle. The rectangle is split by the green curve into local interior (red) and local exterior (blue) regions.

directly connected vertices of x in the local exterior neighborhood. Then we use fint(y) and fext(y) to compute our local SPF function as follows:

\mathrm{spf}_L(I_0(x)) = \frac{I_0(x) - \frac{f_{int}(y) + f_{ext}(y)}{2}}{\max \left| I_0(x) - \frac{f_{int}(y) + f_{ext}(y)}{2} \right|}, \quad x \in \Omega,    (14)

where fint(y) and fext(y) are computed as follows:

f_{int}(y) = \frac{\sum_{i=1}^{N_{iv}} w_i \, f(y_i) \, H(\phi(y_i))}{\max \sum_{i=1}^{N_{iv}} w_i \, f(y_i) \, H(\phi(y_i))}, \quad f_{ext}(y) = \frac{\sum_{j=1}^{N_{ev}} w_j \, f(y_j) \, (1 - H(\phi(y_j)))}{\max \sum_{j=1}^{N_{ev}} w_j \, f(y_j) \, (1 - H(\phi(y_j)))},    (15)

with f(y_i) and f(y_j) being the gray values of the vertices y_i and y_j in the local interior and exterior neighborhoods of x, w_i and w_j being the weights of the local interior and exterior vertices y_i and y_j with respect to x, and N_{iv} and N_{ev} being the numbers of interior and exterior vertices in the neighborhood of x. A vertex y that is closer to x has more weight in the


calculation. We compute the weights as follows:

w_i = \frac{1/d_i}{\sum_{i=1}^{N_{iv}} 1/d_i}, \quad w_j = \frac{1/d_j}{\sum_{j=1}^{N_{ev}} 1/d_j},    (16)

where d_i and d_j are the distances from y_i and y_j to x, respectively. With our new SPF function, we propose the locally enhanced Chan-Vese (LECV) model as follows:

\frac{\partial \phi}{\partial t} = \delta(\phi) \left( \mu \nabla \cdot \frac{\nabla\phi}{|\nabla\phi|} - \lambda_1 (I_0 - c_1)^2 + \lambda_2 (I_0 - c_2)^2 \right) + \mathrm{spf}_L(I_0) \cdot \alpha,    (17)

where \alpha is a parameter that controls the speed of the evolving curve and the weight of the local information term.

Note that Fig. 1 shows the advantages of our method over similar local approach models [23, 24]. Figure 1a shows the original image Apple and the circular contour curve in green. Figure 1b shows part of the adaptive mesh for the image using our method, the neighborhood of x for the local information on the mesh, and the local interior and local exterior vertices yi and yj represented by green and blue points, respectively. Figure 1c shows part of the image as used by other methods, the neighborhood of x for the local information on the image, and the local interior and local exterior information y represented by red and blue, respectively. Other models use a fixed rectangular neighborhood to compute the local information in all parts of the image, as shown in Fig. 1c. Our method instead uses the triangles in the adaptive mesh as the neighborhood for the local information, as shown in Fig. 1b. The adaptive mesh provides a better neighborhood representation for the local information with significantly fewer vertices compared to other models. As seen from Fig. 1c, the local information for other methods is computed at every point x on the green contour curve C that intersects the image [31], while for our method it is only computed at the vertices of the neighboring triangles in the adaptive mesh, and only at the points x that lie on the intersection of the contour curve C and the adaptive mesh. Moreover, the neighborhood of x in Fig. 1b contains fewer vertices than that in Fig. 1c; therefore, the computational efficiency is also improved.
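The inverse-distance weights in (16) sum to one, so closer vertices dominate the local fit. A minimal sketch (our naming, not the authors' code):

```python
import numpy as np

def inverse_distance_weights(distances):
    # Weights of (16): w_i = (1/d_i) / sum_k (1/d_k).
    inv = 1.0 / np.asarray(distances, dtype=float)
    return inv / inv.sum()
```

The same helper serves for both the interior weights w_i and the exterior weights w_j, since (16) applies the identical formula to each vertex set.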
For example, there are only four points for the local exterior, represented by the blue dots yj, and three green points for the local interior yi. The representation quality is maintained using the anisotropic mesh adaptation framework, which is briefly described in the following subsection.

2.2 AMA Image Segmentation Framework

In this section, we briefly describe the AMA image segmentation framework, which consists of three parts: the AMA method for image representation, the finite element formulation, and our algorithm for numerical computation. More details can be found in [32]. Anisotropic mesh adaptation (AMA) has been effectively implemented in numerical computations to improve efficiency and accuracy [33–36]. The AMA representation framework has been developed in [38], and more details can be found in [34, 37–39]. In this section, we focus on the second and third parts of the framework.


Finite Element Formulation. In this subsection, we consider the linear finite element solution of the LECV model in (17). Denote

D = \frac{\mu}{|\nabla\phi|}, \quad F_1(\phi) = -\lambda_1 (I_0 - c_1)^2 + \lambda_2 (I_0 - c_2)^2,    (18)

and F_2(\phi) = \mathrm{spf}_L(I_0) \cdot \alpha. Then Eq. (17) can be rewritten as

\frac{\partial \phi}{\partial t} - \delta_\epsilon(\phi) \nabla \cdot (D \nabla\phi) = \delta_\epsilon(\phi) F_1(\phi) + F_2(\phi).    (19)

Let \Omega \subset R^2 be a connected polygon of the same size as the image. Let \{T_h\} be an affine family of simplicial triangulations given on \Omega. Denote U_0 = \{ v \in H^1(\Omega) : v|_{\partial\Omega} = 0 \} and let U_0^h \subset U_0 be the linear finite element space associated with mesh T_h. Then a linear finite element solution \phi^h(t) \in U_0^h of (19) is given as follows:

\int_\Omega \frac{\partial \phi^h}{\partial t} v^h \, dx + \int_\Omega \delta_\epsilon(\phi^h) (\nabla v^h)^T D \nabla\phi^h \, dx = \int_\Omega \delta_\epsilon(\phi^h) F_1(\phi^h) v^h \, dx + \int_\Omega F_2(\phi^h) v^h \, dx, \quad \forall v^h \in U_0^h, \ t \in (0, T].    (20)

Denote the number of vertices of T_h by N_v, and let \varphi_j be the linear basis function associated with vertex x_j. Then we can express the solution \phi^h as

\phi^h = \sum_{j=1}^{N_v} \phi_j(t) \, \varphi_j.    (21)

Substituting (21) into (20) and taking v^h = \varphi_i (i = 1, \ldots, N_v), we obtain the linear system

M \frac{d\vec{\phi}}{dt} + A(\phi^h) \vec{\phi} = \vec{b}(\phi^h),    (22)

where \vec{\phi} = (\phi_1, \ldots, \phi_{N_v})^T is the unknown vector, M and A are the mass and stiffness matrices, respectively, and \vec{b} is the right-hand side vector. For i, j = 1, \ldots, N_v, the entries of M, A and \vec{b} are given as follows:

m_{ij} = \int_\Omega \varphi_j \varphi_i \, dx,    (23)

a_{ij} = \int_\Omega \delta_\epsilon(\phi^h) (\nabla\varphi_i)^T D \nabla\varphi_j \, dx,    (24)

b_i = \int_\Omega \delta_\epsilon(\phi^h) F_1(\phi^h) \varphi_i \, dx + \int_\Omega F_2(\phi^h) \varphi_i \, dx.    (25)
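For linear (P1) basis functions, the element-level contribution to the mass matrix entries (23) on a single triangle has the well-known closed form: A/6 on the diagonal and A/12 off the diagonal, where A is the triangle area; the global M is assembled by summing these over all triangles of T_h. A sketch of the element matrix (our naming, not the authors' code):

```python
import numpy as np

def local_mass_matrix(p1, p2, p3):
    # Element mass matrix m_ij = integral of phi_j * phi_i over one triangle,
    # for linear basis functions: A/6 on the diagonal, A/12 off-diagonal.
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    area = 0.5 * abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))
    return area * (np.eye(3) / 12.0 + np.full((3, 3), 1.0 / 12.0))
```

A quick sanity check: since the linear basis functions sum to one over the triangle, the entries of the element matrix must sum to the triangle area.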


Regarding the time discretization, we denote the numerical solution at t = t_n by \vec{\phi}^n. Applying a semi-implicit scheme to (22), we get

M \frac{\vec{\phi}^{n+1} - \vec{\phi}^n}{\Delta t_n} + \tilde{A} \cdot \vec{\phi}^{n+1} = \tilde{\vec{b}},    (26)

where \Delta t_n = t_{n+1} - t_n, and \tilde{A} and \tilde{\vec{b}} are approximations of A and \vec{b} at t_n.

AMA Segmentation Algorithm. Our computational algorithm is based on the AMA segmentation algorithm developed in [32]. The algorithm consists of three main stages: representing the image using an adaptive mesh, solving the PDE model using the finite element method, and reconstructing the numerical solution into an image. The basic idea is to generate an adaptive mesh based on a metric tensor M. The metric tensor is computed in stage one based on the image gray values. Then we utilize the finite element method with the adaptive mesh to solve the model in stage two. Finally, we reconstruct the segmented image or evolving curves in stage three using finite element interpolation. Here, we focus on the second stage. First, an initial condition \phi^0(x) is chosen as follows:

\phi^0(x) = \sin\left( \frac{\pi}{4} x_1 \right) \cdot \sin\left( \frac{\pi}{4} x_2 \right).    (27)

Then the values of the parameters are computed, including the Heaviside function H as in (3), its derivative \delta as in (5), c_1 and c_2 as in (2), f_{int}(y) and f_{ext}(y) as in (15), and our local SPF as in (14) for the local information. Finally, the linear system (26) is assembled and solved. Figure 2 shows a sketch of the procedure. Note that anisotropic mesh adaptation is only performed for the AMA representation, and the corresponding mesh is used when solving the PDE model with the finite element method.
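Rearranging (26) shows that each time step amounts to one linear solve: (M/Δt_n + Ã) φⁿ⁺¹ = b̃ + M φⁿ/Δt_n. A dense NumPy sketch of one step follows (our naming; the paper does not specify its solver, and in practice a sparse solver would be used):

```python
import numpy as np

def semi_implicit_step(M, A_tilde, b_tilde, phi_n, dt):
    # One step of (26): solve (M/dt + A_tilde) phi_{n+1} = b_tilde + M phi_n / dt.
    lhs = M / dt + A_tilde
    rhs = b_tilde + M @ phi_n / dt
    return np.linalg.solve(lhs, rhs)
```

Treating only the stiffness term implicitly keeps the system linear in φⁿ⁺¹ while allowing larger time steps than a fully explicit scheme.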

Fig. 2. A sketch of the procedures for image segmentation using the LECV model.

3 Results

In this section, we first present some results obtained by our proposed LECV model in (17) and compare them with those obtained by the CV model in (4). Note that both models are solved using our AMA segmentation algorithm for high-quality images.


Then we compare our results obtained using the AMA algorithm with other models (LBF, LIF, LGIF) that are solved using the classical algorithm (the finite difference method). If not stated otherwise, the following parameter values are used in (17): λ1 = λ2 = 1, ε = 1, α = 1 and μ = 0.001. The following notation is also used to describe the results. Iterations is the number of iterations needed for the evolving curve to converge to the solution. Adaptive-time is the time needed for mesh adaptation in the AMA representation stage. Total-time is the total time needed to finish the computation, including mesh adaptation. Adaptive-time and Total-time are both measured using the tic and toc functions in MATLAB. dt is the time step used in the discretization of the time domain, sd is the sample density used to represent the original image with the anisotropic mesh, Nv is the number of vertices in the adaptive mesh, and Nvl is the number of adaptive mesh vertices used for the local information.

3.1 Segmentation of Images with Geometric Shapes

In this subsection, we demonstrate the effectiveness of our LECV model on images of geometric shapes and compare it with the CV model.

Example 1. For the first example, we show the effectiveness of our proposed LECV model on two inhomogeneous images, denoted image Star (Fig. 3) and image Circle (Fig. 4). The image Star has a resolution of 1080 × 997 and the image Circle has a resolution of 1024 × 1024. We use the metric tensors Maniso as in [38] for AMA segmentation. Note that the metric tensors are used only in the AMA representation stage. Once the AMA mesh is generated according to the corresponding metric tensors for representation, the mesh is fixed and provided as the mesh for solving both our LECV model and the CV model.

Fig. 3. One-level AMA segmentation of image Star. a Original image. b CV evolving curve. c CV segmentation, iterations = 3, total-time = 10.9s. d Adaptive mesh with sd = 0.5%, adaptive-time = 7.2s. e LECV evolving curve. f LECV segmentation, iterations = 2, total-time = 10.8s, Nvl = 76 pixels.


Figure 3a shows the original image Star. Figure 3b displays the evolving curve of the CV model segmentation. Figure 3c shows the resulting segmentation of the CV model. The computational times are Adaptive-time = 7.2s and Total-time = 10.9s, with three iterations for the evolving curve to converge. Figure 3d displays the adaptive mesh. Figure 3e displays the evolving curve of our LECV model. Figure 3f shows the resulting segmentation of our LECV model. The computational times are Adaptive-time = 7.2s and Total-time = 10.8s, with two iterations for the evolving curve to converge. Figure 4 shows the results for image Circle, which are similar to Fig. 3. Comparing the results for image Star in Fig. 3c and f, we observe that our LECV model is able to segment the whole star object, while the CV model fails to segment the inhomogeneous part of the image.

Fig. 4. One-level AMA segmentation of image Circle. a Original image. b CV evolving curve. c CV segmentation, iterations = 3, total-time = 10.1s. d Adaptive mesh with sd = 0.5%, adaptive-time = 7.3s. e LECV evolving curve. f LECV segmentation, iterations = 2, total-time = 10s, Nvl = 102 pixels.

Example 2. In this example, we consider a more complex image with multiple objects and different pixel intensities, denoted image Collection, with resolution 1024 × 923. It has four objects, all with different pixel intensities. Thus, the inhomogeneity is more complex compared to the images in the previous example. The results are shown in Fig. 5. They demonstrate that our LECV model can successfully detect and segment all objects. On the other hand, the CV model fails to detect and segment the inhomogeneous part of the image.

3.2 Segmentation of Real Images

In this subsection, we demonstrate the advantage of our LECV model on more complex real-life images.


Fig. 5. One-level AMA segmentation of image Collection. a Original image. b CV evolving curve. c CV segmentation, iterations = 3, total-time = 10.7s. d Adaptive mesh with sd = 0.5%, adaptive-time = 7.7s. e LECV evolving curve. f LECV segmentation, iterations = 2, total-time = 10.9s, Nvl = 88 pixels.

Example 3. We consider an inhomogeneous image denoted image Plant, as shown in Fig. 6. The image has resolution 1727 × 1735.

Fig. 6. One-level AMA segmentation of image Plant (taken from [41]) with resolution 1727 × 1735. a Original image. b CV evolving curve. c CV segmentation, iterations = 3, total-time = 17.4s. d Adaptive mesh with sd = 0.3%, adaptive-time = 14.9s. e LECV evolving curve. f LECV segmentation, iterations = 2, total-time = 17.4s, Nvl = 56 pixels.


Figure 6a shows the original image Plant. Figure 6b displays the evolving curve of the CV model segmentation, while Fig. 6c shows the resulting segmentation of the CV model. Figure 6d displays the adaptive mesh. Figure 6e displays the evolving curve of our LECV model, while Fig. 6f shows the resulting segmentation of our LECV model. For the CV model and our LECV model, the computational times are comparable. However, our LECV model is able to segment the whole plant with all of its parts, while the CV model fails to identify some parts of the image, as can be seen from Fig. 6c and f.

3.3 Comparison with Other Models

In this subsection, we compare our LECV model with a few commonly used models (LBF, LIF and LGIF) that deal with inhomogeneous images.

Example 4. In this example, we run the LBF, LIF and LGIF models on the high-quality images Star and Circle from Example 1. The results are shown in Fig. 7.

Fig. 7. LBF, LIF and LGIF segmentation of image Star (1080 × 997) and image Circle (1024 × 1024). a and b LBF evolving curves, Total-time = 1035s, 1019s, Iterations = 5000, 5000. c and d LIF evolving curves, Total-time = 452s, 438s, Iterations = 5000, 5000. e and f LGIF evolving curves, Total-time = 46s, 3614s, Iterations = 60, 5000.

Figure 7a, c and e (first column) display the evolving curves of the LBF, LIF, and LGIF models for image Star. The computational times are Total-time = 1035s, 452s, and 46s, respectively. Figure 7b, d and f (second column) display the evolving curves of the LBF, LIF, and LGIF models for image Circle. The computational times are Total-time = 1019s, 438s, and 3614s, respectively.


Comparing the results for image Star in Fig. 7a, c and e with Fig. 3e, we observe that all models considered here can deal with the inhomogeneity reasonably well. However, compared to the LECV model, LBF, LIF and LGIF take much more time to converge. That is one of the main reasons that the LBF, LIF and LGIF models are usually applied to images with lower resolutions. With the AMA segmentation framework, our LECV model can work directly on high-resolution images. Similar results are observed for image Circle, as shown in Fig. 7b, d and f as well as Fig. 4e.

3.4 Discussion

We have applied our LECV model with the AMA segmentation algorithm to a few test images and compared the results with the CV model using the same algorithm. The images in Examples 1 and 2 show that our LECV model is able to segment all objects in the inhomogeneous images. Although our model computes local information while the CV model does not, the overall computational time of our LECV model is comparable to that of the CV model. Thus, LECV has the advantage of better accuracy than the CV model. We have also considered segmentation of real-life inhomogeneous images in Example 3. The image is more complex, with higher resolution than the test images. Our LECV model can successfully segment real-life inhomogeneous images, while the traditional CV model fails. In Example 4, we consider segmentation using the LBF, LIF and LGIF models for high-resolution images via the classical finite difference method and compare them with our results using the LECV model. When applying the LBF, LIF and LGIF models to high-resolution images without down-sampling, it takes much more time for the results to converge. Sometimes, the results may even fail to converge around the inhomogeneous parts. On the other hand, our LECV model has better computational efficiency and can work well directly with high-resolution images.

4 Conclusions

Hybrid segmentation models that utilize both local and global information have become very popular in recent years due to their capability of dealing with various segmentation problems. Local information can deal with inhomogeneous images, while global information can deal with noisy images and blurred edges. Global information can also help the contour curve avoid falling into a local minimum. Hence, models that utilize both local and global information tend to fix many of these issues. However, the partial differential equations of those models are usually complex, and computational efficiency is a challenge. Thus, these models are generally applied to low-resolution images. In this paper, we have introduced our hybrid LECV model, which is based on the well-known CV model and addresses the challenge that the CV model faces with intensity-inhomogeneous images. Our proposed LECV model combines local and global information to handle the intensity inhomogeneity. Together with the AMA segmentation framework [32], the LECV model can be applied to high-resolution images directly, providing improved computational efficiency and accuracy for segmentation.


The numerical results demonstrate that our proposed LECV model does improve the CV model by adding local information to deal with intensity inhomogeneity in images. We have also applied the AMA segmentation framework to improve the computational efficiency when segmenting high-resolution images. The results from our LECV model are comparable to those obtained from commonly used models such as LBF, LIF and LGIF. However, our model takes much less time, especially for high-resolution images. Our model also has potential applications in medical image segmentation with high-resolution images.

References
1. Fang, L., Zhao, W., Li, X., Wang, X.: A convex active contour model driven by local entropy energy with applications to infrared ship target segmentation. Opt. Laser Technol. 96, 166–175 (2017)
2. Bialas, J., Oommen, T., Havens, T.C.: Optimal segmentation of high spatial resolution images for the classification of buildings using random forests. Int. J. Appl. Earth Obs. Geoinf. 82, 101895 (2019)
3. Zhang, W., Wang, X., Zhang, P., Chen, J.: Global optimal hybrid geometric active contour for automated lung segmentation on CT images. Comput. Biol. Med. 91, 168–180 (2017)
4. Tong, S.-G., Huang, Y.Y., Tong, Z.M.: A robust face recognition method combining LBP with multi-mirror symmetry for images with various face interferences. Int. J. Autom. Comput. 16(5), 671–682 (2019)
5. Chun, L., Chunhua, X., Jian, Y., Yingying, X., Junliang, B.: A method for coastal oil tank detection in polarimetric SAR images based on recognition of T-shaped harbor. J. Syst. Eng. Electron. 29(3), 499–509 (2018)
6. Guan, S.-Y., Wang, T.-M., Meng, C., Wang, J.-C.: A review of point feature based medical image registration. Chin. J. Mech. Eng. 31(1), 1–16 (2018)
7. Mumford, D.B., Shah, J.: Optimal approximations by piecewise smooth functions and associated variational problems. Commun. Pure Appl. Math. 42(5), 577–685 (1989)
8. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. Image Process. 10(2), 266–277 (2001)
9. Xu, C., Prince, J.L.: Snakes, shapes, and gradient vector flow. IEEE Trans. Image Process. 7(3), 359–369 (1998)
10. Badoual, A., Unser, M., Depeursinge, A.: Texture-driven parametric snakes for semi-automatic image segmentation. Comput. Vis. Image Underst. 188, 102793 (2019)
11. Liu, G., Deng, M.: Parametric active contour based on sparse decomposition for multi-objects extraction. Signal Process. 148, 314–321 (2018)
12. Bai, J., Shah, A., Wu, X.: Optimal multi-object segmentation with novel gradient vector flow based shape priors. Comput. Med. Imaging Graph. 69, 96–111 (2018)
13. Han, B., Wu, Y.: Active contours driven by global and local weighted signed pressure force for image segmentation. Pattern Recogn. 88, 715–728 (2019)
14. Liux, C., Liu, W., Xing, W.: An improved edge-based level set method combining local regional fitting information for noisy image segmentation. Signal Process. 130, 12–21 (2017)
15. Ding, K., Xiao, L., Weng, G.: Active contours driven by local pre-fitting energy for fast image segmentation. Pattern Recogn. Lett. 104, 29–36 (2018)


16. Yu, H., He, F., Pan, Y.: A scalable region-based level set method using adaptive bilateral filter for noisy image segmentation. Multimed. Tools Appl. 79(9), 5743–5765 (2020)
17. Liu, H., Fang, J., Zhang, Z., Lin, Y.: A novel active contour model guided by global and local signed energy-based pressure force. IEEE Access 8, 59412–59426 (2020)
18. Abdelsamea, M.M., Tsaftaris, S.A.: Active contour model driven by globally signed region pressure force. In: 2013 18th International Conference on Digital Signal Processing (DSP) (2013)
19. Song, H.: Active contours driven by regularised gradient flux flows for image segmentation. Electron. Lett. 50(14), 992–994 (2014)
20. Chen, L., Zhou, Y., Wang, Y., Yang, J.: GACV: geodesic-aided C-V method. Pattern Recogn. 39(7), 1391–1395 (2006)
21. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. Int. J. Comput. Vision 22(1), 61–79 (1997)
22. Zhang, K., Zhang, L., Song, H., Zhou, W.: Active contours with selective local or global segmentation: a new formulation and level set method. Image Vis. Comput. 28(4), 668–676 (2010)
23. Li, C., Kao, C.-Y., Gore, J.C., Ding, Z.: Implicit active contours driven by local binary fitting energy. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition (2007)
24. Li, C., Kao, C.-Y., Gore, J.C., Ding, Z.: Minimization of region-scalable fitting energy for image segmentation. IEEE Trans. Image Process. 17(10), 1940–1949 (2008)
25. Zhang, K., Song, H., Zhang, L.: Active contours driven by local image fitting energy. Pattern Recogn. 43(4), 1199–1206 (2010)
26. Wang, L., Li, C., Sun, Q., Xia, D., Kao, C.-Y.: Active contours driven by local and global intensity fitting energy with application to brain MR image segmentation. Comput. Med. Imaging Graph. 33(7), 520–531 (2009)
27. Wang, H., Huang, T.-Z., Xu, Z., Wang, Y.: A two-stage image segmentation via global and local region active contours. Neurocomputing 205, 130–140 (2016)
28. Wang, L., He, L., Mishra, A., Li, C.: Active contours driven by local Gaussian distribution fitting energy. Signal Process. 89(12), 2435–2447 (2009)
29. Wang, P., Sun, K., Chen, Z.: Local and global intensity information integrated geodesic model for image segmentation. In: 2012 International Conference on Computer Science and Electronics Engineering (2012)
30. Tran, T.-T., Pham, V.-T., Chiu, Y.-J., Shyu, K.-K.: Active contour with selective local or global segmentation for intensity inhomogeneous image. In: 2010 3rd International Conference on Computer Science and Information Technology (2010)
31. Li, X., Jiang, D., Shi, Y., Li, W.: Segmentation of MR image using local and global region based geodesic model. Biomed. Eng. Online 14(1), 02 (2015)
32. Abbas, K., Li, X.: Anisotropic mesh adaptation for image segmentation based on Mumford-Shah functional. 17 July 2020. https://arxiv.org/abs/2007.08696, unpublished
33. Huang, W., Li, X.: An anisotropic mesh adaptation method for the finite element solution of variational problems. Finite Elem. Anal. Des. 46, 61–73 (2010)
34. Li, X., Huang, W.: An anisotropic mesh adaptation method for the finite element solution of heterogeneous anisotropic diffusion problems. J. Comput. Phys. 229(21), 8072–8094 (2010)
35. Li, X., Huang, W.: Maximum principle for the finite element solution of time-dependent anisotropic diffusion problems. Numer. Methods Partial Differ. Equ. 29, 1963–1985 (2013)
36. Li, X.: Anisotropic mesh adaptation for finite element solution of anisotropic porous medium equation. Comput. Math. Appl. 75, 2086–2099 (2018)
37. Huang, W.: Mathematical principles of anisotropic mesh adaptation. Commun. Comput. Phys. 1(2), 276–310 (2006)

Locally Enhanced Chan-Vese Model with Anisotropic Mesh

127

38. Li, X.: Anisotropic mesh adaptation for image representation. EURASIP J. Image Video Process. 2016(1), 1–16 (2016). https://doi.org/10.1186/s13640-016-0132-7 39. Huang, W.: Metric tensors for anisotropic mesh generation. J. Comput. Phys.Comput. Phys. 204(2), 633–665 (2005) 40. Two Apples Red and Green wesbite. https://www.tapeciarnia.pl/94419_dwa_jablka_cze rwone_zielone 41. Haddad, I.: Outdoor Plants: Chinese Evergreen (2020). https://pep-khaoula.netlify.app/pro duct/chinese-evergreen

An Unmanned System for Automatic Classification of Hazardous Wastes in Norway

Marc Gröling1, Laurent Huang2, and Ibrahim A. Hameed1(B)

1 NTNU Ålesund, Larsgårdsvegen 2, 6009 Ålesund, Norway
[email protected]
2 ENSICAEN, 6 Bd Maréchal Juin, 14000 Caen, France
[email protected]

Abstract. Separation of waste is an essential step in the recycling process, which can save resources, provide energy and reduce environmental pollution. Separation is a tedious process that is usually done by human workers, who hand-pick the items to separate them. To make this process easier, more accurate and safer, in this work a system is developed that can classify items using image recognition techniques and output the predicted classes using a projector. A dataset of 12 different classes (of which eight are combined into a single “others” class, due to them being uncommon) with about 5000 images in total is collected and used to train different classification models using convolutional neural networks and transfer learning. A mean accuracy of 74.043% ± 12.621% is achieved on test data in 10-fold cross-validation. Unfortunately, the model performs drastically worse on newer data, for reasons that remain unknown.

Keywords: Waste management · Convolutional neural network · Transfer learning · Edge computing · Data set

1 Introduction

Recycling of waste has many benefits, such as a reduction of greenhouse-gas emissions, soil contamination, and air and water pollution. It conserves natural resources, increases economic value and creates job opportunities [1]. Although recycling comes with many benefits, one must also consider its negative impacts, e.g. the diffusion of hazardous materials into recycled products or the limited applications of recycled materials [2]. In general, though, it can be seen as a process with great potential [1]. Waste separation is one of the important steps in the process of recycling materials [1]. The separation process is usually done with “the handpicking method, whereby someone is employed to separate the different objects/materials” [3]. To make this process easier, more reliable and safer, a system is developed which can help to classify waste for separation. This is important since the misclassification of waste might lead to enhanced pollution of the environment, and automatic classification can help reduce health risks when dealing with hazardous waste. Since at the time of starting this project (March 2022) no dataset that could be used for this task was available (to the knowledge of the authors), it was decided to create a new dataset. In this work, an on-site system is developed that can take images of objects, classify them and output the class via a projector; it is presented in Sect. 3. In Sect. 2, related work is reviewed and it is explained how this work differs from it. A dataset is collected from the recycling centre where the system is deployed, which is presented in Sect. 4. This dataset is then used to train different classification models. In Sect. 5, the model architectures, the training and evaluation process, as well as the choice of hyperparameters and data augmentations are explained. The results of cross-validation, as well as the evaluation of the final model, are shown in Sect. 6. Finally, Sect. 7 summarises the results of this work and elaborates on possible improvements and experiments.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 128–144, 2024. https://doi.org/10.1007/978-3-031-47715-7_10

2 Related Work

There have already been several articles that reviewed the classification of waste. However, they cover different types of waste and, to the authors' knowledge, no work has been done on a similar dataset. Furthermore, this work proposes a new, practical system that can classify waste on-site. The authors of [4] used artificial convolutional neural networks to classify waste. They used the WaDaBa dataset [5] to train their models, which only covers subcategories of plastic waste. In [3] the authors used the pre-trained ResNet-50 model to extract image features and then SVMs to classify objects. Their dataset consists of four different classes: glass, paper, plastic, and metal. A system that can automatically segregate waste into metallic, wet, and dry categories is developed in [6], using conveyor belts and different kinds of sensors. In [7] an intelligent waste bin was created which, like this work, uses a Jetson Nano in conjunction with artificial convolutional neural networks to classify waste into six different categories. The waste is also compacted inside the bin. In [8] the authors created a dataset of about 400 images consisting of plastic, glass, paper, and organic waste. They then compared two artificial convolutional neural networks with different numbers of layers and evaluated them. They achieved an accuracy of 70% on validation data. The authors of [9] compared several convolutional neural network architectures on the TrashNet dataset [10], which consists of about 2400 images of six classes. Their best-performing model achieves an accuracy of 88.6%. A two-stage classification algorithm was proposed in [11], which first classifies “waste into one of thirteen subcategories” and then into one of four categories.


It performs better than the single-stage classification algorithm that was used as a comparison. In [12] the authors used a genetic algorithm to optimise the fully-connected layer of DenseNet121 [13]. They use the TrashNet dataset [10], on which they achieve an accuracy of 99.6%.

3 Waste Classification System

The system can be seen in Fig. 1. It consists of the following components, which are held together by an aluminium frame:

– Control unit: a Jetson Nano that controls the whole system
– Camera: takes images of objects for classification and data collection
– Projector: used in an attempt to create similar light conditions for easier classification, as well as to display the label of the predicted class
– Router: provides remote access to the system.

Fig. 1. Waste classification system

The system works as described in Fig. 2: first, the customer places an item on the table of the system. Next, an image of the item is taken, which is then classified, and the predicted label is shown with the projector. The customer can then read the label off the table and put the item into the corresponding bin.
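The pipeline above can be sketched as a small control loop; the function and component names here are illustrative placeholders, not the authors' implementation:

```python
# A minimal sketch of the capture -> classify -> project pipeline in Fig. 2.
# The camera, classifier and projector are abstracted as callables, so the
# hardware-specific parts (camera driver, projector output) can be swapped in.
from typing import Callable

def run_pipeline(capture: Callable[[], object],
                 classify: Callable[[object], str],
                 project: Callable[[str], None]) -> str:
    """Process one deposited item: take an image, classify it, show the label."""
    image = capture()        # camera takes a picture of the placed item
    label = classify(image)  # e.g. a CNN returning one of the waste IDs
    project(label)           # projector displays the label on the table
    return label
```

Keeping the three stages behind callables means the same loop can drive both classification and the data-collection mode used to build the dataset.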


Fig. 2. The pipeline of the system

4 Collected Data

In this work the developed system was also used to collect a dataset, which was later used for training classification models for the system. Information about the collected data and its classes can be seen in Table 1. The data was collected with the system and then cleaned; it can be found in the project's GitHub repository (https://github.com/marc131183/WasteClassification/tree/main/data/cleaned). The aim was to get a dataset with a similar number of images per class for an adequate number of classes. Unfortunately, some classes are less common than others, which is why it was decided to first focus on the classes that are most common, i.e. “7055”, “7051”, “7133”, and “7042”. The other classes were combined into a single class called “others” to make sure that items which do not belong to the four classes would not be sorted into their containers. For the sake of simplicity it was decided that each image should only contain one item, which may be extended to multiple objects per image in the future. Multiple images are taken for each object, with the object rotated differently in each, giving about three to seven images per object. All images have a size of 980 × 620 pixels. Sample images can be seen in Fig. 3.

5 Classification

5.1 Model

Since the dataset only contains about 5000 images, it was decided to use a model pre-trained on another dataset (transfer learning). Only the feature extractor of each model was used. The models are all taken from the torchvision.models library (https://pytorch.org/vision/stable/models.html). Three different architectures were tested: AlexNet [14], ResNet [15], and VGG [16]. Furthermore, two different ResNet variants are examined, since these performed particularly well on the dataset. The weights of the feature extractors of pre-trained models were then partially frozen (including full freeze and no freeze) during training. This is done in such a way that the first x% of layers is frozen: a 0% freeze means that all weights are updated during training, while 100% means that no weights of the feature extractor are changed during training. The feature extractor of each pre-trained model is then connected to a classifier. Four different classifier architectures were considered, which can be seen in Fig. 4. These architectures are inspired by the traditional classifier architectures of the pre-trained models but downsized, since this classification task has fewer classes and less data.

Table 1. Collected data

Norwegian waste ID   # of Images   Contents
7133                 1037          Cleaning products
7051                 1029          Paint, glue and varnish waste
7055                 1029          Spray
7042                  902          Organic solvents and halogen
7023                  247          Petrol and diesel filters
7121                  192          Isocyanates
7134                  117          Acidic organic waste
7132                   92          Inorganic bases
7152                   84          Tarry waste
7151                   78          Organic waste with halogen
7123                   74          Hardeners
7011                   66          Waste oil
Total                4947

5.2 Data Transformations and Augmentations

Data augmentation/transformation is applied. Since most pre-trained models operate on 224 × 224 images, the data is first resized to 256 × 256 and then center-cropped to 224 × 224. In an attempt to make the model more robust, data augmentations are applied to the training data, consisting of randomly rotating the image or randomly flipping it horizontally. Finally, the pixel values of images are normalized between 0 and 1.



Fig. 3. Sample images of collected data

Fig. 4. Classifier Architectures (input tensor and input of first linear layer varies in size depending on the used feature extractor)

5.3 Training Hyperparameters

The hyperparameters that were used for all model training can be seen in Table 2. The learning rate is initially set to 0.001 and then decayed every step size epochs (here seven) to help with optimization and generalization [17]. The new learning rate after decay is calculated as follows:

lr_new = γ · lr_old

Early stopping patience denotes the number of consecutive epochs that training is continued after the accuracy on the validation set has not improved.

Table 2. Hyperparameters

Hyperparameter                 Value
Loss criterion                 Cross entropy loss
Batch size                     4
Learning rate                  0.001
Learning rate decay gamma      0.1
Learning rate decay step size  7
Momentum                       0.9
Early stopping patience        10
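Under the hyperparameters of Table 2, the training loop can be sketched as below; the per-epoch training and validation routines are stubs, so this is an illustrative reconstruction rather than the authors' code:

```python
# SGD with momentum, step-wise lr decay (lr_new = gamma * lr_old every 7
# epochs) and early stopping with a patience of 10 epochs on validation accuracy.
import torch

def fit(model, run_epoch, val_accuracy, max_epochs=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    criterion = torch.nn.CrossEntropyLoss()

    best_acc, best_state, since_best = 0.0, None, 0
    for epoch in range(max_epochs):
        run_epoch(model, optimizer, criterion)  # one pass over the training data
        scheduler.step()                        # multiply lr by 0.1 every 7 epochs
        acc = val_accuracy(model)               # accuracy on the validation split
        if acc > best_acc:
            best_acc, since_best = acc, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            since_best += 1
        if since_best >= 10:                    # early stopping patience
            break
    if best_state is not None:
        model.load_state_dict(best_state)       # keep the best-epoch weights
    return best_acc
```

Restoring the best-epoch weights at the end mirrors the behaviour described in Sect. 6.1, where the weights from epoch 22 are kept after early stopping triggers.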

5.4 Evaluation Process of Models

For evaluating models, stratified k-fold cross validation (which balances the classes, even in the “others” class), with k = 10, was used. In each fold, the data is split into 80% training, 10% testing, and 10% validation data. This three-way split is done in order to implement early stopping, which is used in order to avoid overfitting [18], and still have an unbiased model evaluation. The validation-split is used for early stopping, while the test-split is used to evaluate the performance of the model. The whole process can be seen in Fig. 5.
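The 80/10/10 split per fold can be sketched with scikit-learn: StratifiedKFold yields a 90/10 split for k = 10, and a secondary stratified split carves the validation set out of the remaining 90%. The random seeds are assumptions for reproducibility:

```python
# Stratified 10-fold cross-validation with a three-way 80/10/10 split: the test
# fold comes from StratifiedKFold, and 1/9 of the remainder (10% of all data)
# is held out, again stratified, as the validation set for early stopping.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def stratified_splits(y, k=10, seed=0):
    y = np.asarray(y)
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for rest_idx, test_idx in skf.split(np.zeros(len(y)), y):
        train_idx, val_idx = train_test_split(
            rest_idx, test_size=1 / (k - 1),  # 10% of the whole dataset
            stratify=y[rest_idx], random_state=seed)
        yield train_idx, val_idx, test_idx
```

Stratifying both splits keeps the class proportions of Table 1 roughly constant across the train, validation and test parts of every fold.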

6 Evaluation

For evaluation, as discussed in Sect. 5.4, it was decided to use k-fold cross-validation, with k = 10, to validate models. For each model, the mean and standard deviation of the accuracy on the test data of each fold is evaluated. Furthermore, since the models are supposed to run on a computationally weak system, the inference time (average time to classify an image on the Jetson Nano) and the number of parameters of the model are also taken into consideration when deciding which model should be used. The results of this can be seen in Table 3, where


Fig. 5. Cross validation pipeline

the top-3 performing models in terms of mean accuracy are listed, as well as the best model of each architecture. All cross-validated models can be seen in Table 4. Considering Table 3, it becomes apparent that the ResNet50 pre-trained model works best (in terms of mean accuracy) for this classification task, as all top-3 models use it as a feature extractor. The ResNet18 pre-trained models use only about half as many parameters as ResNet50; however, since the model size is still rather small, this is not an issue in our case. The same goes for the mean classification time per image. The other two architectures perform worse in terms of mean accuracy while having a similar or even higher number of parameters.

Table 3. Model performance of the top 3 models and the best of each architecture in cross-validation

Name      Classifier  Frozen (%)  # Trainable parameters  Inference (s)  Mean Acc (%)  σ Acc (%)
ResNet50  0           60          18984453                0.158          74.043        12.621
ResNet50  0           20          23297029                0.158          73.003        12.8
ResNet50  0           0           23518277                0.158          72.948        12.062
ResNet18  0           0           11179077                0.097          71.378        12.676
VGG11     3           80          60308485                0.185          66.808        12.649
AlexNet   3           80          23673093                0.087          58.991        8.746

6.1 Final Model: Evaluation on Collected Data

Since the accuracy of the ResNet50 model (with classifier zero and 60% freeze) is the highest in k-fold cross-validation and its inference time and number of parameters are acceptable for our purposes, it was decided to choose this as the final model.

Fig. 6. Heatmap of the final model on test data of all folds summed (as percentages)

In Fig. 6 one can see the heatmap of the ResNet50 model on the test data in k-fold cross-validation (summed up over all folds, shown as percentages). The classification rates differ greatly between the classes, with classes “7042” and “7133” barely exceeding 50% accuracy. The model is best at classifying class “7055”, which is contrary to the expectation that class “7051” would be easiest to classify (since it mostly consists of paint buckets, which are rather distinct from the other classes). As expected, however, class “others” gets predicted the most; this is probably because it consists of many different objects (classes). In Fig. 7, one can see the training process of the final model, with 90% of the data used as training data and 10% as validation data. The loss increases drastically twice in the beginning, but then quickly recovers, after which it only slowly decreases. The accuracy rises quickly in the beginning and stagnates at about epoch 7, after which it only increases by about 3%. After epoch 22, there is no improvement in the accuracy on validation data for 10 consecutive epochs, so early stopping ends the training process and the weights from epoch 22 are used.
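A heatmap like Fig. 6 can be computed as a confusion matrix normalized per true class; a minimal sketch (labels here are illustrative):

```python
# Row-normalized confusion matrix: entry (i, j) is the percentage of samples
# of true class i that the model assigned to predicted class j.
import numpy as np
from sklearn.metrics import confusion_matrix

def class_percentages(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    row_sums = cm.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # guard against classes with no true samples
    return 100.0 * cm / row_sums
```

Row-wise normalization makes per-class recall directly readable on the diagonal even though the classes have different sample counts.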


Fig. 7. Training metrics for the final model

6.2 Final Model: On-Site Testing

To verify that the model would perform well in action and that the dataset is representative, on-site testing was conducted. For this, 20 objects for each class were used (for “others”, seven from “7023”, seven from “7121”, and six from “7132”), and pictures were taken from three different perspectives (top, front, back), similar to the initial data collection.


Fig. 8. Heatmap of the final model from on-site testing (as percentages)

These collected images, which will from now on be referred to as on-site data, were then used as a final test for the model. The results can be seen in Fig. 8. Unfortunately, the results differ greatly between the previous tests and this one. In almost all cases, the model predicts class “others”, and it achieves an accuracy of only 21.284%. This result was highly unexpected, and it was therefore investigated how the accuracy on this dataset develops during training, which can be seen in Fig. 9. The loss and accuracy on validation and on-site data are very similar in the beginning, differing only by about 0.5 and 0.05 respectively. However, from epoch three to four, the loss and accuracy on the on-site data worsen: the loss increases by about 200% and the accuracy drops to half. While the loss recovers to some extent during training, the accuracy worsens even further and does not recover. It is unclear why the model almost always predicts “others” on the on-site data. Since the loss and accuracy were very similar at the beginning of training (see Fig. 9), it might be that during training the model finds some criterion in the training data which it mainly uses for classifying, and which is only present in the older data and no longer in the on-site data.

7 Discussion and Outlook

In this work, an unmanned waste classification system is built, deployed at a recycling centre and used to collect a dataset of 12 classes with about 5000


Fig. 9. Training metrics for the final model with on-site test data

images. Different pre-trained artificial convolutional neural network architectures are then used as feature extractors, in combination with four different proposed classifier architectures, to classify the images of the collected dataset. Due to some classes being less common than others, a classification model is built on five classes (with the remaining eight combined into a single class). The final model achieves an accuracy of 74.043% ± 12.621% on test data in k-fold cross-validation, with k = 10.


However, for unknown reasons the model mostly predicts one class in the final on-site testing. As such, the system cannot yet be used for on-site classification. Unfortunately, due to time constraints, it was not feasible to investigate this discrepancy in accuracy between the testing in cross-validation and the final on-site testing. Since the model is not ready to be used as a substitute for human sorting, refinement of it would be advantageous. This could include, but is not limited to, the following:

7.1 Investigation of the Difference in Performance

Most importantly, it would be beneficial to investigate how and why this difference in accuracy arises between the validation set and the newer on-site test set. While some possible explanations were given, these are all highly speculative, since thorough investigations were not feasible within the scope of this work.

7.2 Multiple Objects per Image

As of now, the dataset contains only one item per image. It is questionable, however, whether the average person depositing their waste has enough patience to put the items under the classification system one by one. Thus, it may be advantageous to allow for multiple items at once, which would decrease the time spent depositing. To reuse the data collected in this work, this may be done with a two-phase classification system that operates as follows: in the first phase, the image is analysed for any objects; all of these are then separately passed to the second phase of the system, which classifies them. This way, all data collected in this work may be used for the second phase, and perhaps partially for the first phase.

7.3 Improvement of Classification Models

Since only a limited number of architectures and variants were tested in this work, it would be interesting to see how different architectures perform on this dataset. Furthermore, since ResNet50 performed better than ResNet18, another ResNet variant with even more parameters, such as ResNet101 or even ResNet152, might perform better still. However, as the number of parameters increases, the time it takes to classify an image goes up as well, which is a limiting factor for this system.


In [19] it was shown that partially trained models outperform fully trained models as feature extractors. It would thus be interesting to perform experiments on how their accuracy compares with the accuracy of the fully trained feature extractors tested in this work.

7.4 Extension of the Dataset

Currently, the dataset consists of only a limited number of classes. For the system to be a full substitute for human workers, it would need to cover a greater share of objects. Also, it might be beneficial to extend the four classes that already have about 1000 images each with even more images, to further increase the diversity of the dataset.

7.5 Multi-view Classification

It is unclear whether a single camera is sufficient for achieving high accuracy on this task, as “the visual information conveyed by a single image may be insufficient for an accurate decision” [20]. Multi-view classification can be used to fuse the information from multiple images of a single object, taken from different perspectives. As the accuracy of the current single-view classification is not sufficient, this method may help to improve it.
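One simple late-fusion variant of multi-view classification averages the per-view class probabilities of a single-view model; this particular fusion rule is an assumption for illustration, as [20] discusses several strategies:

```python
# Average the softmax outputs of one model over all views of an object and
# take the argmax of the fused distribution as the final prediction.
import torch

def fuse_views(model, views):
    """views: tensor of shape (n_views, C, H, W) holding all images of one object."""
    with torch.no_grad():
        probs = torch.softmax(model(views), dim=1)  # (n_views, n_classes)
    return int(probs.mean(dim=0).argmax())          # index of the fused class
```

Since three to seven images per object already exist in the dataset, such late fusion could be evaluated without collecting new data.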

A Appendix

A.1 Data and Source Code

The data and source code are available in the corresponding GitHub repository: https://github.com/marc131183/WasteClassification

A.2 All Cross-Validated Models

All model variants that were cross-validated can be seen in Table 4. The ones which are also presented in Table 3 are marked with an asterisk (*).


Table 4. Model performance of all tested model variants in cross-validation

Name      Classifier  Frozen (%)  # Trainable parameters  Inference time (s)  Mean Acc (%)  σ Acc (%)
AlexNet   1           0           7255109                 0.085               54.982        8.386
AlexNet   1           90          5375493                 0.085               54.925        8.496
AlexNet   1           100         4785413                 0.085               42.822        6.930
AlexNet   2           80          1770501                 0.083               54.067        6.562
AlexNet   3           0           25552709                0.087               56.395        8.88
AlexNet   3           80          23673093                0.087               58.991        8.746   *
AlexNet   3           100         23083013                0.087               45.294        8.114
ResNet18  0           0           11179077                0.098               71.378        12.676  *
ResNet18  0           20          11058565                0.098               71.030        12.126
ResNet18  0           40          10717701                0.098               69.369        11.912
ResNet18  0           60          9315333                 0.098               65.198        15.476
ResNet18  0           80          6035973                 0.098               64.431        13.378
ResNet18  0           100         2565                    0.098               45.704        3.598
ResNet18  1           0           11505477                0.097               64.407        13.377
ResNet18  1           80          6362373                 0.097               60.328        10.108
ResNet18  2           0           11242821                0.097               65.411        13.31
ResNet18  2           80          6099717                 0.097               62.143        8.442
ResNet18  3           0           16433733                0.099               66.703        14.326
ResNet18  3           80          11290629                0.099               64.226        9.827
ResNet50  0           0           23518277                0.158               72.948        12.062  *
ResNet50  0           20          23297029                0.158               73.003        12.8    *
ResNet50  0           40          22320645                0.158               72.758        11.623
ResNet50  0           60          18984453                0.158               74.043        12.621  *
ResNet50  0           80          15106053                0.158               70.667        13.438
ResNet50  0           100         10245                   0.158               50.376        6.464
ResNet50  1           0           24623429                0.157               68.003        14.153
ResNet50  1           80          16211205                0.157               68.969        13.628
ResNet50  2           0           23770949                0.154               64.668        15.663
ResNet50  2           80          15358725                0.154               65.604        13.947
ResNet50  3           0           31910981                0.156               70.604        10.915
ResNet50  3           80          23498757                0.156               70.562        15.204
VGG11     1           0           22132357                0.174               64.755        12.529
VGG11     1           90          15271685                0.174               61.528        11.3
VGG11     1           100         12911877                0.174               49.752        7.625
VGG11     2           80          7931653                 0.171               62.444        11.885
VGG11     3           0           64809349                0.185               66.025        11.728
VGG11     3           80          60308485                0.185               66.808        12.649  *
VGG11     3           100         55588869                0.185               51.725        9.254

(* also listed in Table 3)


References

1. Asmatulu, R., Asmatulu, E.: Importance of recycling education: a curriculum development at WSU. J. Mater. Cycles Waste Manag. 13(2), 131–138 (2011)
2. Jacobson, M.Z.: On the causal link between carbon dioxide and air pollution mortality. Geophys. Res. Lett. 35(3) (2008)
3. Adedeji, O., Wang, Z.: Intelligent waste classification system using deep learning convolutional neural network. Procedia Manuf. 35, 607–612 (2019)
4. Bobulski, J., Kubanek, M.: Waste classification system using image processing and convolutional neural networks. In: International Work-Conference on Artificial Neural Networks, pp. 350–361. Springer (2019)
5. Bobulski, J., Piatkowski, J.: PET waste classification method and plastic waste database WaDaBa. In: International Conference on Image Processing and Communications, pp. 57–64. Springer (2017)
6. Gupta, N.S., Deepthi, V., Kunnath, M., Rejeth, P.S., Badsha, T.S., Nikhil, B.C.: Automatic waste segregation. In: 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), pp. 1688–1692. IEEE (2018)
7. White, G., Cabrera, C., Palade, A., Li, F., Clarke, S.: WasteNet: waste classification at the edge for smart bins (2020). arXiv:2006.05873
8. Altikat, A., Gulbe, A., Altikat, S.: Intelligent solid waste classification using deep convolutional neural networks. Int. J. Environ. Sci. Technol. 19(3), 1285–1292 (2022)
9. Ruiz, V., Sanchez, A., Velez, J.F., Raducanu, B.: Automatic image-based waste classification. In: International Work-Conference on the Interplay Between Natural and Artificial Computation, pp. 422–431. Springer (2019)
10. Yang, M., Thung, G.: Classification of trash for recyclability status. CS229 project report 2016(1), 3 (2016)
11. Zhang, S., Chen, Y., Yang, Z., Gong, H.: Computer vision based two-stage waste recognition-retrieval algorithm for waste classification. Resour. Conserv. Recycl. 169, 105543 (2021). https://www.sciencedirect.com/science/article/pii/S0921344921001506
12. Mao, W.-L., Chen, W.-C., Wang, C.-T., Lin, Y.-H.: Recycling waste classification using optimized convolutional neural network. Resour. Conserv. Recycl. 164, 105132 (2021)
13. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc. (2012). https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). http://arxiv.org/abs/1512.03385
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556
17. You, K., Long, M., Wang, J., Jordan, M.I.: How does learning rate decay help modern neural networks? (2019). arXiv:1908.01878
18. Prechelt, L.: Early stopping - but when? In: Neural Networks: Tricks of the Trade, pp. 55–69. Springer (1998)


19. Deng, A., Li, X., Li, Z., Hu, D., Xu, C., Dou, D.: Inadequately pre-trained models are better feature extractors (2022). arXiv:2203.04668
20. Seeland, M., Mäder, P.: Multi-view classification with convolutional neural networks. PLoS ONE 16(1), e0245230 (2021)

Remote Learning of Mathematics for Visually Impaired Students During COVID-19: Exploring Online Intervention, Resources, Challenges and Issues

Muhammad Shoaib1(B), Donal Fitzpatrick2, and Ian Pitt1

1 School of Computer Science and Information Technology, University College Cork, Cork, Ireland
[email protected]
2 Centre for Excellence in Universal Design, National Disability Authority, Dublin, Ireland

Abstract. Due to COVID-19, several dramatic changes appeared all over the world, e.g., travel restrictions, healthcare shortages, self-isolation, economic crises, social distancing, increases in food demand, job losses and the closure of educational institutions. This led to a rapid transition from face-to-face to remote teaching. Most remote learning platforms are designed primarily for sighted students and are less useful for visually-impaired students. Especially in mathematics, it is quite difficult for visually-impaired students to access rich visual information, such as graphs, algebraic notations, geometric shapes and statistical formulas. This article provides an overview of mathematics learning resources during the COVID-19 pandemic. It explores assistive technology-based interventions which enable visually-impaired and blind students to actively participate in remote teaching and enhance their learning skills. It was noted that solutions are available for various platforms (i.e., smartphones, tablets, laptops and desktop PCs) to support visually-impaired students. Assistive technology in remote learning also plays a key role in enhancing the mathematical skills of visually-impaired students. Furthermore, we have investigated important COVID-19-related resources from different institutes which are very beneficial for the online education of visually-impaired students. Moreover, some challenges and issues are identified, such as internet connectivity, lack of interaction, inadequate support, one-way communication and less effective learning. Finally, some future directions for research are suggested, e.g., that parents should be a part of remote learning.

Keywords: Learning · Mathematics · COVID-19 · Visually impaired · e-Learning

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 145–156, 2024. https://doi.org/10.1007/978-3-031-47715-7_11


1 Introduction Coronavirus was identified in 2019 in Wuhan, China. This virus was recognised globally as a pandemic in early 2020 when confirmed cases significantly increased. The World Health Organization recognized this coronavirus disease as COVID-19 on 11 February 2020 [1]. The global mortality rate peaked in January 2021 and the number of newly diagnosed cases peaked in April 2021 [2]. Due to COVID-19, educational institutions were closed and traditional teaching methods shifted to remote learning to reduce the pandemic effects on education [3]. Teachers and students both had to learn to use remote teaching platforms. Teachers had to prepare the lecture material according to the student’s expectations and in addition had their own requirements i.e., interactive virtual environments, strong collaborative environments, online support for special education, and easy content management [4]. Mathematics has its peculiar characteristics i.e., symbols, equations, formulas, language, shapes and abstraction in its concepts. Mathematics develops reasoning capabilities in the human mind and is also very important for the foundation of other science subjects [5]. Visually-impaired students face several issues when learning mathematics, such as working with graphs and diagrams [6, 7]. Visually-impaired students can access information about basic math operations (i.e., addition, subtraction, multiplication and division), geometry, numbers, algebra, graphs and statistics with the help of mathematics teaching. Several instructors of visually-impaired students have reported that their students face problems in accessing mathematics graphics independently [8]. There is a need for tools, techniques, systems and web-based platforms that provide easy interaction, full encouragement and motivation to the disabled student [9]. In the CoVID-19 situation, digital learning provided better opportunities to deal with challenges [10]. 
Numerous computer-based applications integrated with different software and web-based packages offer virtual educational environments to empower students' learning skills [11, 12]. Assistive technology is also very useful for exploring a variety of mathematics topics. Visually-impaired students can learn many areas of mathematics with the help of assistive programs [13]. Assistive technology provides free or commercially available hardware- and software-based solutions for visually-impaired students, used to enhance learning and functional capabilities [14]. In mathematics, there is always a need to design and develop learning programs that could be useful for enhancing mathematical skills [15]. The National Council of Teachers of Mathematics (NCTM) in the USA has issued principles and standards (i.e., content standards and process standards) for mathematics institutes. The six principles are teaching, learning, curriculum, assessment, equity and technology. The five important process standards are connection, reasoning, representation, communication and problem-solving skills. Five core topics in mathematics are probability, number, geometry, data analysis and algebra [16]. Researchers and developers who wish to introduce a new solution for visually-impaired people can bear these principles and standards in mind to optimize the learning ability of students.

Remote Learning of Mathematics for Visually Impaired

147

The goal of this study is to investigate the challenges and issues faced by those working in this area, the resources available, and the potential for online intervention. This study suggests ways to address the current challenges and to increase the inclusivity and accessibility of online mathematics education for students who are visually impaired. The article also explores the use of remote learning tools and techniques for teaching mathematics during the COVID-19 pandemic, examines freely-available COVID-19 resources, and discusses the challenges and issues faced by visually-impaired students during the pandemic. The remaining sections are organized as follows: Sect. 2 reviews related work in the area of remote learning of mathematics. Section 3 illustrates the remote learning model for visually-impaired students. Section 4 describes some COVID-19 resources and tips for visually-impaired learners. Section 5 describes the overall educational experience from different perspectives (i.e., the teaching, social, cognitive and student perspectives); this section also provides an analysis of online learning during COVID-19 and discusses the strengths, weaknesses, opportunities and challenges associated with it. Section 6 presents a discussion based on the previous sections. Section 7 concludes the article, whereas Sect. 8 provides future directions for upcoming researchers and developers.

2 Related Work

This section provides a detailed review of the previous literature on remote learning for visually-impaired students. Researchers have introduced technology-based solutions to help visually-impaired students learn mathematics. Reddy et al. noted that mathematics is a very challenging subject; with the help of assistive technology, students can easily access material and interact more effectively with the platforms that support their learning process [17]. Sokhulu et al. provided a case-study-based analysis of five master's-degree students and discussed the impact of digital technology on their learning during the COVID-19 pandemic. The findings suggested that some students were familiar with digital technologies and able to select technologies that were appropriate for them, whereas other participants did not have this awareness and faced challenges in selecting and using technology-based solutions. They also concluded that a combination of social and professional experience can address the research needs of students individually [18]. Recently, assistive-technology-based solutions have proven very useful in helping visually-impaired students acquire mathematical skills (i.e., Abacus Media, Mathematical Braille, ViewPlus Accessible Graphing Calculator, MathTalk, and MathPad Plus) [19–21]. Children with special needs required extra attention during the COVID-19 situation. Kadarisma and Juandi proposed a qualitative approach to describe the mathematics learning process for visually-impaired children during the pandemic. They conducted interviews with teachers to obtain information on the teaching materials used during the pandemic. The results showed that at the beginning of the pandemic, Alam Bandung School implemented online learning but students faced issues during video lectures. Mutiara Bunda School also used online learning through Google Meet, with assessments collected through Google Classroom.
Some teachers used their own learning platforms (i.e., home learning guides) to provide support to the children. Overall, students'


learning capabilities decreased, especially in mathematics, compared to traditional teaching methods before the pandemic [22]. McBride discussed the findings of two studies, one related to the preparation of teachers and the other to the impact of the COVID-19 pandemic on the educational experiences of students. In the first study, forty-two teachers participated, half of whom reported that they had good abilities and experience to support mathematics students. Some reported that at the start they were not well prepared but slowly increased their skills through self-study and professional development opportunities provided by their institutions. For the most part, they used large-print or braille materials in the teaching and guidance of visually-impaired students. The second study investigated the impact of COVID-19 along different dimensions (i.e., parents' concerns, access to material through technology, and contacting teachers). The parents of visually-impaired children were concerned about the education of their children because they lacked the skills and training to educate their children at home. Parents also reported that they did not have adequate knowledge to access braille material and mathematics educational tools. Some parents were satisfied with the teachers and instructors of their children, but others reported that the teachers did not have enough skill to understand the needs of their children [23].

3 Remote Learning-Based Educational Model for Visually-Impaired Students

Before the COVID-19 pandemic, visually-impaired students could easily access teaching resources, lab facilities, and classroom environments. They could also communicate directly with their teachers during classes. During the pandemic, however, their learning process changed. Figure 1 provides an overview of the remote learning-based educational model for visually-impaired students. In this model, the focus shifts from the classroom to homeschooling. Students connect with educational facilities through the Internet. These educational facilities include online classes, assistive-technology-based solutions, library services, outdoor learning, self-instruction material, support services, online seminars, group learning, presentations and meetings with teachers.

4 COVID-19 Resources and Tips for Remote Learning

Table 1 provides a list of useful learning resources for visually-impaired students. These resources are available on the web portals of several institutions, i.e., the Michigan Department of Education, The California School for the Blind, Training and Technical Assistance Centers, the National Council for Special Education, Paths to Literacy for students who are blind or visually impaired, and Special Educational Needs and Disabilities. Visually-impaired students normally used these resources at home during the pandemic. Table 2 provides a list of important remote and virtual learning-based tips that are very beneficial for visually-impaired students. These tips are collected from the different electronic sources and articles that are part of our study.


Fig. 1. Remote learning-based educational model.

5 Overall Educational Experience with Remote Learning

Figure 2 provides a complete overview of the educational experience through remote learning. Firstly, to obtain the desired learning outcomes from any online platform, it is important to consider the teaching perspective. Several aspects should be kept in mind when designing a better online learning platform, i.e., an effective pedagogical approach, suitable use of technology, a flexible learning environment and better assessment opportunities [24–27]. Secondly, some studies suggest that several aspects of social presence are important in remote learning, i.e., teacher and student involvement, belongingness, community and participation during online sessions [28–32]. Thirdly, some articles address issues relating to cognitive skills: students need good cognitive skills to benefit from teaching and from critical reflection, where learning is based upon watching and listening and on engaging students through practising, doing, and thinking [33–36]. Fourthly, some studies examined the concerns of students, which included enhanced collaboration between student and teacher, creating interpersonal relationships, and enhancing engagement during online learning [37–39]. Moreover, once institutions switched their educational structure from face-to-face to online during COVID-19, the parents of students also faced issues with this transition. Some countries researched this, such as the government of Luxembourg, which

Table 1. Resources along with a short description.

Michigan Department of Education (a): These resources are collected by MDE staff members and are very useful for families of visually-impaired students and for school personnel.

The California School for the Blind (b): Very useful resources for remote, virtual and home learning.

Training and Technical Assistance Centers (TTAC) Online (c): These resources include many stay-at-home activities that are very useful for visually-impaired children.

National Council for Special Education (NCSE) (d): Many online resources, i.e., curriculum, outdoor and active learning; these are also very useful for visually-impaired children.

Paths to Literacy | for students who are blind or visually impaired (e): Several online resources for educators and students, i.e., math games and activities for home.

Special educational needs and disabilities (SEND) (f): Very convenient COVID-19-related resources.

(a) https://mdelio.org/blind-visually-impaired/other-resources/covid-19-resources
(b) https://www.csb-cde.ca.gov/resources/covid-19/parentvirtualresources.aspx
(c) https://ttaconline.org/Resource/JWHaEa5BS741JNaLDkgnuA/Resource-stay-at-home-activities-for-children-with-visual-impairments-paths-to-literacy-for-students-who-are
(d) https://ncse.ie/all-online-resources
(e) https://www.pathstoliteracy.org/resources/online-learning-opportunities-schools-blind-and-related-organizations
(f) https://www.sthelens.gov.uk/send/support-services/visual-impairment-service/covid-19-useful-resources-for-children-with-a-visual-impairment/

switched to online instruction in response to the COVID-19 pandemic. Their Ministry of Education ran a survey which revealed that more than half of parents had issues using remote teaching with their children. To address this issue, Haas et al. used augmented reality and digital and physical mathematical models in the online learning of mathematics for primary schools. They focused on how parents can help their children access various educational instructions remotely, and suggested that in online teaching the parent perspective is important for enhancing learning designs and the associated pedagogical approaches [40].

5.1 Analysis of Remote Learning During COVID-19

Nowadays, the educational sector faces many challenges due to the COVID-19 pandemic, and remote learning is quite useful for addressing them. Mathematics is a very important domain of education because students have to acquire essential knowledge and


Table 2. Virtual learning tips for visually-impaired learners.

1. Provide proper training for the online classroom.
2. Use communication options properly, i.e., Zoom and Microsoft.
3. Make proper use of assistive technology on top of the online classes, i.e., use JAWS.
4. Online tools should be compatible with screen-reading software.
5. Use printed/magnified worksheets for braille users.
6. All print material should be clear and dark.
7. For colour contrast, black and white is the best option.
8. Provide clear verbal explanations.
9. Introduce more oral activities and exercises.
10. Allow more time to complete tasks.
11. Call the student's name when his/her attention is required.
12. Provide proper training to students.
13. Introduce peer tutoring to support students.
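Several of these tips (screen-reader compatibility, clear verbal explanations, oral exercises) hinge on making mathematical content speakable. As a purely illustrative sketch (not a tool mentioned in this chapter), a simple arithmetic expression can be rendered as words before handing it to a screen reader or text-to-speech engine; the `speak` helper below is hypothetical:

```python
# Illustrative sketch: render a simple arithmetic expression as spoken words,
# so a screen reader or TTS engine can read it aloud to a visually-impaired
# student. Only basic binary operators are handled here.
import ast

WORDS = {ast.Add: "plus", ast.Sub: "minus", ast.Mult: "times",
         ast.Div: "divided by", ast.Pow: "to the power of"}

def speak(expr: str) -> str:
    """Return a spoken-word rendering of a Python arithmetic expression."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return f"{walk(node.left)} {WORDS[type(node.op)]} {walk(node.right)}"
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.Name):
            return node.id
        raise ValueError("unsupported syntax")
    return walk(ast.parse(expr, mode="eval").body)

print(speak("x**2 + 3*x - 4"))  # x to the power of 2 plus 3 times x minus 4
```

Real accessible-math pipelines typically emit structured markup (e.g. MathML) rather than plain words, but the idea of an unambiguous verbal rendering is the same.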

Fig. 2. Educational experience with remote learning.

skills, i.e., geometry, calculus, graphs, numbers and algebra. Visually-impaired students can access this information through digital and assistive technologies.


The main strength of online learning is that it provides better opportunities and addresses the challenges faced during this time by enabling learners to customize procedures and processes. Remote learning also has several weaknesses, i.e., technical issues, time management and lack of personal attention. Remote learning offers many opportunities for educational institutions to transition smoothly from traditional education methods to online educational platforms. Teachers and students can communicate better with one another through assistive technologies, and this paradigm shift opens up many new opportunities for researchers and developers to introduce new solutions. Many challenges are also associated with online education, faced by students, teachers and developers alike, i.e., lack of digital knowledge, changing teaching methodologies, engaging students and technological costs. Figure 3 provides an analysis of remote learning during the pandemic and demonstrates the strengths, weaknesses, opportunities, and challenges associated with this situation.

Fig. 3. Analysis of remote learning.

6 Discussion

Many nations around the globe adopted remote learning instead of face-to-face classes during the COVID-19 pandemic. This transition was very challenging for visually-impaired students because they have special needs in accessing and learning mathematics. Teachers of visually-impaired students played an important role during this challenging time. They utilised a variety of resources and methods, such as screen readers,


Braille materials, audiobooks and online math applications, to support visually-impaired students through remote learning. This study has provided a remote learning-based educational model for visually-impaired students. The model illustrates that the focus shifted toward homeschooling during the pandemic. Visually-impaired students can attend online classes, access various services (i.e., library and support services) and use different assistive-technology-based solutions whilst studying remotely. We have carefully reviewed the particular requirements and challenges of visually-impaired students during remote learning, and explored the important resources and tips that were useful for them during COVID-19. Visually-impaired students accessed mathematical content with the help of accessible materials, i.e., screen readers, audiobooks, text-to-speech and magnification software. They regularly maintained communication with their peers and teachers by using smartphones and video-conferencing tools. Teachers were also able to provide clearer feedback on assignments to ensure that students were on track, which helped all parties enhance their collaboration with each other. It was also noted that schools, colleges and universities provided additional training and support to both teachers and students; this helped them understand the remote learning platforms and enhanced their ability to use assistive-technology-based solutions and accessible materials. In the context of earlier studies, a comprehensive model was designed to demonstrate the overall educational experience with remote learning. Four important aspects (i.e., the teaching, social, cognitive and student perspectives) were considered for effective online teaching and learning.
The teaching perspective is very important when designing a remote learning platform. Better integration of technology and an appropriate pedagogical approach are the key factors determining the effectiveness of an e-learning platform in teaching and learning. The social and cognitive perspectives enable teachers and students to maintain better interpersonal relationships and trust and to support knowledge development (according to the cognitive skills of the students). From the student perspective, the learning process can be maximized by providing better support and interaction mechanisms to the students.

7 Conclusion

Remote learning of mathematics for visually-impaired students during COVID-19 has highlighted a significant number of new issues. To cover them, this study introduced a remote learning-based educational model that showed the focus shifting toward homeschooling during COVID-19. The study also highlighted some important pandemic resources and tips to enhance the learning skills of visually-impaired students. Moreover, a comprehensive model was developed to demonstrate the overall educational experience of visually-impaired students with remote learning in mathematics. Furthermore, a detailed analysis examined the strengths, weaknesses, opportunities, and challenges of remote learning during the COVID-19 pandemic. Remote learning was a powerful tool in the pandemic situation. It is a flexible approach in terms of location and time, and a student-centred approach that can be customized according to students' needs. Remote learning platforms provide a


very interactive environment by using several functionalities, i.e., audio, video, touch and vibrotactile feedback. Students can easily raise queries, ask their instructors questions and access course material from anywhere by connecting through the internet. Some weaknesses are also associated with remote learning, i.e., a communication gap between student and teacher, lack of face-to-face interaction, lapses in discipline due to time flexibility, lack of technical skills, and the different cognitive and confidence levels of students. Additionally, numerous opportunities are available in remote learning, i.e., innovation in e-learning, development of digital solutions, introduction of flexible solutions, effective management of critical situations, increasing market demand, and more effective ways of teaching, learning, evaluation and assessment. Lastly, there are many challenges in remote education for students, teachers, parents, institutions, governments, developers and researchers. Students require quality education, teachers want to implement their teaching methodologies, parents wish to be part of remote learning, institutions aim to provide a better platform for learning, governments want to implement their educational policies, developers want to tackle the concerns of stakeholders, and researchers always try to find innovative solutions to the problems. These are the key areas where researchers and developers can work to design efficient educational systems for visually-impaired students.

8 Future Work

Currently, the lack of direct interaction with teachers is a major issue in remote learning; researchers can introduce innovative new ways of interaction to overcome this issue. Specialized tools and applications are expensive and not readily available for visually-impaired students; developers can introduce cheaper, more accessible solutions. Furthermore, providing teachers with professional training and promoting a collaborative learning environment among visually-impaired students can also facilitate their remote learning.

Acknowledgment. This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6222. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

References

1. WHO Director-General's remarks at the media briefing on 2019-nCoV on 11 February 2020. https://www.who.int/director-general/speeches/detail/who-director-general-s-remarks-at-the-media-briefing-on-2019-ncov-on-11-february-2020. Accessed 30 Nov 2021
2. Coronavirus Resource Center. New Cases of COVID-19 in World Countries. https://coronavirus.jhu.edu/map.html. Accessed 2 Dec 2021
3. Bozkurt, A., Sharma, R.C.: Emergency remote teaching in a time of global crisis due to coronavirus pandemic. Asian J. Distance Educ. 15(1), i–vi (2020)
4. Flores, M.A., Gago, M.: Teacher education in times of COVID-19 pandemic in Portugal: national, institutional and pedagogical responses. J. Educ. Teach., advance online publication (2020). https://doi.org/10.1080/02607476.2020.1799709


5. Iqbal, Z., Shams, J.A.: Effectiveness and transformation in students' beliefs: a case of collaborative teaching. J. Elem. Educ. 29(1), 121–128 (2019)
6. Beal, C.R., Rosenblum, L.P.: Evaluation of the effectiveness of a tablet computer application (app) in helping students with visual impairments solve mathematics problems. J. Visual Impair. Blind. 112(1), 5–19 (2018)
7. Morash, V., McKerracher, A.: The relationship between tactile graphics and mathematics for students with visual impairments. Terra Haptica 4, 1–10 (2014)
8. Zebehazy, K.T., Wilton, A.P.: Charting success: the experience of teachers of students with visual impairments in promoting graphic use by students. J. Visual Impair. Blind. 108, 263–274 (2014)
9. Diane, B., Smith, D., Brayant, B.: Teaching Students with Special Needs in Inclusive Classrooms (translated by Mahmoud Ismael). Dar Alfeker, Amman (2012)
10. Dhawan, S.: Online learning: a panacea in the time of COVID-19 crisis. J. Educ. Technol. 49(1), 5–22 (2020). https://doi.org/10.1177/0047239520934018
11. Kleanthous, I., Meletiou-Mavrotheris, M.: Early statistical reasoning: an exploratory study of primary school students' use of a dynamic statistics software package for analyzing and interpreting data. In: Information Resources Management Association (ed.) K-12 STEM Education: Breakthroughs in Research and Practice, pp. 359–376. IGI Global, Hershey, PA (2018)
12. Merchant, Z., Goetz, E.T., Cifuentes, L., Keeney-Kennicutt, W., Davis, T.J.: Effectiveness of virtual reality-based instruction on students' learning outcomes in K-12 and higher education: a meta-analysis. Comput. Educ. 70, 29–40 (2014)
13. Asebriy, Z., Raghay, S., Bencharef, O.: An assistive technology for Braille users to support mathematical learning: a semantic retrieval system. Symmetry 10(11), 547 (2018)
14. Wong, M., Cohen, L.: School, family and other influences on assistive technology use: access and challenges for students with visual impairment in Singapore. Br. J. Vis. Impair. 29(2), 130–144 (2011)
15. Godino, J., Batanero, C.: The onto-semiotic approach to research in mathematics education. ZDM Math. Educ. 39, 127–135 (2007)
16. National Council of Teachers of Mathematics (NCTM): Principles and Standards for School Mathematics. NCTM (2000). https://en.wikipedia.org/wiki/Principles_and_Standards_for_School_Mathematics
17. Reddy, E., Sharma, B., Reddy, P., Dakuidreketi, M.: Mobile learning readiness and ICT competency: a case study of senior secondary school students in the Pacific Islands. IEEE, Nadi, Fiji (2017)
18. Sokhulu, L.H.: Students' experiences of using digital technologies to address their personal research needs during the COVID-19 lockdown. Afr. Identities, 1–17 (2020)
19. Reddy, P., Chaudhary, K., Sharma, B., Chand, R.: The two perfect scorers for technology acceptance. Educ. Inf. Technol. 2020, 1–23 (2020). https://doi.org/10.1007/s10639-020-10320-2
20. Bouck, E., Satsangi, R.: Maths Assistive Technology to Support Inclusion, pp. 1–18. Emerald Publishing Limited, Bingley, England (2020)
21. Sharma, B., Jokhan, A., Kumar, R., Finiasi, R., Chand, S., Rao, V.: Use of short message service for learning and student support in the Pacific region. In: Handbook of Mobile Teaching and Learning. Springer, Berlin, Heidelberg (2015)
22. Kadarisma, G., Juandi, D.: Mathematics learning for students with special needs during the COVID-19 pandemic. J. Phys. Conf. Ser. 2123(1), 012008. IOP Publishing (2021)
23. McBride, C.R.: Critical issues in education for students with visual impairments: access to mathematics and the impact of the COVID-19 pandemic. Doctoral dissertation, University of Georgia (2020)


24. Duncan, H.E., Barnett, J.: Learning to teach online: what works for pre-service teachers. J. Educ. Comput. Res. 40(3), 357–376 (2009). https://doi.org/10.2190/EC.40.3.f
25. Jin, S.H.: Analyzing student-student and student-instructor interaction through multiple communication tools in web-based learning. Int. J. Instr. Media 32(1), 59–67 (2005)
26. Chieu, V.M., Herbst, P., Weiss, M.: Effect of an animated classroom story embedded in online discussion on helping mathematics teachers learn to notice. J. Learn. Sci. 20(4), 589–624 (2011). https://doi.org/10.1080/10508406.2011.528324
27. Uribe, S.N., Vaughan, M.: Facilitating student learning in distance education: a case study on the development and implementation of a multifaceted feedback system. Distance Educ. 38(3), 288–301 (2017). https://doi.org/10.1080/01587919.2017.1369005
28. Hramiak, A.: Online learning community development with teachers as a means of enhancing initial teacher training. Technol. Pedagog. Educ. 19(1), 47–62 (2010). https://doi.org/10.1080/14759390903579265
29. Komninou, I.: A case study of the implementation of social models of teaching in e-learning: "The social networks in education", online course of the Inter-Orthodox Centre of the Church of Greece. TechTrends 62(2), 146–151 (2017). https://doi.org/10.1007/s11528-017-0247-4
30. Li, Z.: Learners' reflexivity and the development of an e-learning community among students in China. Res. Learn. Technol. 19(1), 5–17 (2011). https://doi.org/10.1080/09687769.2010.548505
31. Satar, H.M., Akcan, S.: Pre-service EFL teachers' online participation, interaction, and social presence. Lang. Learn. Technol. 22(1), 157–183 (2018). 10125/44586
32. Yeh, Y.: Integrating collaborative PBL with blended learning to explore preservice teachers' development of online learning communities. Teach. Teach. Educ. 26, 1630–1640 (2010). https://doi.org/10.1016/j.tate.2010.06.014
33. Ryan, J., Scott, A.: Integrating technology into teacher education: how online discussion can be used to develop informed and critical literacy teachers. Teach. Teach. Educ. 24, 1635–1644 (2008). https://doi.org/10.1016/j.tate.2008.02.012
34. Sing, C.C., Khine, M.S.: An analysis of interaction and participation patterns in online community. Educ. Technol. Soc. 9(1), 250–261 (2006). https://www.jstor.org/stable/jeductechsoci.9.1.250. Accessed 25 July 2020
35. Jones, M., Ryan, J.: Learning in the practicum: engaging pre-service teachers in reflective practice in the online space. Asia-Pac. J. Teach. Educ. 42(2), 132–146 (2014). https://doi.org/10.1080/1359866X.2014.892058
36. Dyment, J., Downing, J.: 'There was nowhere to hide…': the surprising discovery of how weekly web conferences facilitated engagement for online initial teacher education students. Asia-Pac. J. Teach. Educ. 46(4), 399–418 (2018). https://doi.org/10.1080/1359866X.2018.1444140
37. Mumford, S., Dikilitaş, K.: Pre-service language teachers' reflection development through online interaction in a hybrid learning course. Comput. Educ. 144, 103706 (2020). https://doi.org/10.1016/j.compedu.2019.103706
38. Farr, F., Riordan, E.: Tracing the reflective practices of student teachers in online modes. ReCALL 27(1), 104–123 (2015). https://doi.org/10.1017/S0958344014000299
39. Jones, P.: My peers have also been an inspiration for me: developing online learning opportunities to support teacher engagement with inclusive pedagogy for students with severe/profound intellectual developmental disabilities. Int. J. Incl. Educ. 14(7), 681–696 (2010). https://doi.org/10.1080/13603111003778452
40. Haas, B., Lavicza, Z., Kreis, Y.: Parents' experience in remote learning during COVID-19 with digital and physical mathematical modelling. Res. Pract. Technol. Enhanc. Learn. 18 (2023)

Adversarial Robustness of Multi-bit Convolutional Neural Networks

Lukas Frickenstein1(B), Shambhavi Balamuthu Sampath1, Pierpaolo Mori3, Manoj-Rohit Vemparala1, Nael Fasfous1, Alexander Frickenstein1, Christian Unger1, Claudio Passerone3, and Walter Stechele2

1 BMW Autonomous Driving, Unterschleissheim, Germany
{lukas.frickenstein,shambhavibalamuthu.sampath,manoj-rohit.vemparala,nael.fasfous,alexander.frickenstein,christian.unger}@bmw.de
2 Technical University of Munich, Munich, Germany
[email protected]
3 Politecnico di Torino, Turin, Italy
{pierpaolo.mori,claudio.passerone}@polito.it

Abstract. Deploying convolutional neural networks (CNNs) on resource-constrained, embedded hardware constitutes challenges in balancing task-related accuracy and resource-efficiency. For safety-critical applications, a third optimization objective is crucial, namely the robustness of CNNs. To address these challenges, this paper investigates the tripartite optimization problem of task-related accuracy, resource-efficiency, and adversarial robustness of CNNs by utilizing multi-bit networks (MBNs). To better navigate the tripartite optimization space, this work thoroughly studies the design space of MBNs by varying the number of weight and activation bases. First, the pro-active defensive model MBN3x1 is identified by conducting a systematic evaluation of the design space. This model achieves better adversarial accuracy (+10.3pp) against the first-order attack PGD-20 and has 1.3× lower bit-operations, with a slight degradation of natural accuracy (–2.4pp), when compared to a 2-bit fixed-point quantized implementation of ResNet-20 on CIFAR-10. Similar observations hold for deeper and wider ResNets trained on different datasets, such as CIFAR-100 and ImageNet. Second, this work shows that the defensive capability of MBNs can be increased by adopting a state-of-the-art adversarial training (AT) method. This results in an improvement of adversarial accuracy (+13.6pp) for MBN3x3, with a slight degradation in natural accuracy (–2.4pp), compared to the costly full-precision ResNet-56 on CIFAR-10, which has 7× more bit-operations. To the best of our knowledge, this is the first paper highlighting the improved robustness of differently configured MBNs and providing an analysis of their gradient flows.

Keywords: Adversarial robustness · Neural network quantization · Multi-bit convolutional neural networks

Lukas Frickenstein, Shambhavi Balamuthu Sampath and Pierpaolo Mori: Equal contributions.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 157–174, 2024. https://doi.org/10.1007/978-3-031-47715-7_12

1 Introduction
Convolutional neural networks (CNNs) have become prevalent in the field of computer vision, tackling a wide range of complex problems with unprecedented task-related accuracy [18,35]. Despite their rise in popularity, several drawbacks have limited their adoption in embedded, safety-critical settings [9–11]. Most prominent among these drawbacks are their increasing demands on memory and computational complexity [18], as well as their susceptibility to edge cases and adversarial attacks [2]. This has led to extensive research into CNN compression [1,16,17,20,21,26,30,31,37,38] and defensive adversarial training [13,25,33,40]. Among the wide range of techniques in both research domains, binary neural networks (BNNs) have been shown to effectively tackle both challenges by constraining the parameters of the CNN to B ∈ {−1, 1}. This highly discrete representation allows the costly arithmetic operations of a CNN to be executed as simple Boolean XNOR operations on inference hardware [7,31], while requiring only 1 bit of memory to store the CNN's parameters and intermediate activations. The discreteness of BNNs necessitates a complex training scheme involving gradient approximation to allow gradient-descent-based training over the discrete binarization functions [8,27,31]. As first-order adversarial attacks try to exploit vulnerabilities of the latent weight representations, the binarized weights used during inference show higher resilience to the produced adversarial examples [12,29]. Through this gradient approximation at training time, the severity of gradient-based adversarial attacks is reduced. Although BNNs partly tackle both problems of resource-constrained, embedded inference and adversarial robustness, their learning capacity is hampered by the low-information representation of their parameters, resulting in lower task-related natural accuracy.
To address the low task-related accuracy of BNNs, research into multi-bit networks (MBNs) increased the number of binary representations a single layer can have [26]. A single full-precision weight filter or input feature map can be represented with an arbitrary number of binary tensors, called bases. Considering all three aspects, namely parsimonious inference, the robustness of BNNs, and the improved natural task-related accuracy of MBNs, the design choices involved in producing the deployed neural network become a balancing act of maintaining all three desired targets in a tripartite solution space. This work navigates this solution space and builds an understanding of the interactions between the three optimization criteria, which is tested empirically on a wide range of possible MBN configurations. To the best of our knowledge, this is the first paper highlighting the improved robustness of differently configured MBNs and providing an analysis of their gradient flows. The contributions of this work can be summarized as follows:
– Performing a thorough investigation of the design space of MBNs by varying the number of weight and activation bases. With the correct choice of bases, a model with +10.3pp better adversarial accuracy against the ultimate first-order attack PGD-20 and 1.3× lower bit-operations can be found


when compared to a 2-bit fixed-point quantized implementation of ResNet-20 trained on CIFAR-10, with a slight degradation in natural accuracy (–2.4pp).
– Showing that the defensive capability of MBNs can be increased by adopting a state-of-the-art adversarial training (AT) method. With different architectural setups, various practical trade-offs can be achieved, such as an improvement of adversarial accuracy (+13.6pp) for MBN3×3, with a slight degradation in natural accuracy (–2.4pp) compared to the costly full-precision ResNet-56 on CIFAR-10, while providing a 7× improvement in bit-operations.
– Supporting the empirical evidence by formulating an understanding of the gradient flows of full-precision, fixed-point quantized, and multi-bit networks, relating them to different levels of adversarial robustness.

2 Related Work

2.1 Quantized and Binary Neural Networks

Quantization relies on representing a CNN's parameters with a discrete, constrained set of values. This typically requires complex training schemes to enable standard stochastic gradient descent (SGD) to update the model parameters. In [42], Zhou et al. limit the magnitude of the latent weights and activations to [0, 1], where the latent datatypes are deterministically quantized such that the straight-through estimator (STE) [4] is required. Choi et al. aim to improve the training scheme of quantized neural networks (QNNs) with PACT [6] by learning the optimal clipping level for the activations of each layer at training time. Thus, the representational capability is increased, leading to an increase in task-related accuracy. Binarization represents the most drastic form of quantization, where the parameters of a CNN are constrained to either {+1, −1}. A common solution to train BNNs is to maintain full-precision latent parameters θ during training to update the highly discrete model parameters θ_b with gradient information. For the forward pass, the latent model parameters are deterministically mapped to either {+1, −1} through the sign function. However, this creates a gradient-vanishing problem during backpropagation, as the derivative of the sign function is zero almost everywhere, causing the gradients of all parameters before the sign to take the value of zero. To tackle this issue, Bengio et al. [4] propose the straight-through estimator (STE), which passes the gradient through unchanged within the clipping range. This can be seen as passing the gradients through the piece-wise linear activation function hard tanh (htanh). This ensures sufficient gradient flow during backpropagation to update all the BNN parameters, bypassing the sign operation. With BinaryNet [7], Courbariaux and Bengio presented the first BNN with binary weights and activations. Building on top of basic BNNs, Rastegari et al.
introduced XNOR-Net [31], a scheme to train BNNs with latent weights by approximating the convolution of input feature maps A^{l-1} and weights W^l of layer l by a combination of XNOR operations and popcounts, multiplied with a trainable scaling factor α:

Conv(A^{l-1}, W^l) ≈ (sign(A^{l-1}) ⊕ sign(W^l)) · α

This introduced a significant improvement in accuracy, as the added trainable scaling factor allows more information to be learned by the BNN. To further mitigate the accuracy degradation of BNNs, Lin et al. [26] extended BNNs by approximating the full-precision convolutions in CNNs using linear combinations of multiple binary bases for both weights (M) and activations (N), resulting in Accurate Binary Convolutional Neural Networks (ABC-Nets). Thus, the convolutions of multi-bit networks (MBNs) can be implemented by computing M × N bit-wise convolutions in parallel.
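On inference hardware, the binary dot products inside both XNOR-Net and ABC-Net reduce to an XNOR followed by a popcount. A minimal NumPy sketch of this identity (the function names and the closed-form choice α = mean(|W|) are illustrative assumptions, not the paper's training procedure):

```python
import numpy as np

def xnor_dot(a_bits, w_bits):
    """Dot product of two +/-1 vectors via XNOR + popcount.

    a_bits, w_bits: boolean arrays encoding sign (True = +1, False = -1).
    For +/-1 values, dot = 2 * popcount(XNOR(a, w)) - n.
    """
    n = a_bits.size
    matches = np.count_nonzero(a_bits == w_bits)  # XNOR, then popcount
    return 2 * matches - n

def binary_approx_dot(a, w, alpha=None):
    """XNOR-Net-style approximation of a full-precision dot product:
    a . w  ~  (sign(a) XNOR sign(w)) * alpha.
    Here alpha defaults to mean(|w|); the paper's scaling factor is trainable."""
    if alpha is None:
        alpha = np.abs(w).mean()
    return xnor_dot(a >= 0, w >= 0) * alpha
```

For a = [1, −2, 3] and w = [0.5, −0.5, 0.5], all signs match, so the binary dot is 3 and the scaled approximation is 3 · 0.5 = 1.5, close to the exact dot product 3.0.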

2.2 Adversarial Robust Compression

Szegedy et al. [36] first demonstrated the existence of adversarial attacks in the domain of image classification. Adversarial examples are generated by adding an imperceptible perturbation δ to a given original input image, fooling the network into changing its prediction. Carlini et al. [5] proposed three characteristics that specify a defined threat model τ of an adversarial attack. First, the adversary goal defines a successful attack. Second, the capability of the adversary can be formulated as a set of allowed perturbations S : D(X, X_adv) ≤ ε, where some distance D between the original and the adversarial image does not exceed a perturbation budget ε. Third, the degree of access the adversary has to the underlying neural network defines the adversarial knowledge. For white-box attacks (e.g. PGD), the complete model parameters are exposed to the adversary. In general, Carlini et al. [5] describe the problem of finding an adversarial example X_adv for a given model f(·) and label Y as maximizing the loss L within the perturbation budget ε, as shown in Eq. 1.

E_{(x,y)~D} [ max_{X_adv ∈ S} L(f(X_adv), Y) ]    (1)
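The inner maximization of Eq. (1) is typically approximated by iterative gradient ascent. A minimal sketch of an L-infinity PGD loop with random start and projection onto S follows; the `grad_loss` callback standing in for the target model's input gradient is a hypothetical placeholder, the common sign-of-gradient step of L-infinity attacks is used, and clipping to a valid pixel range is omitted:

```python
import numpy as np

def pgd_attack(x, grad_loss, eps, alpha, steps, rng=None):
    """Sketch of L-inf PGD: ascend the loss gradient and project back onto
    the eps-ball around x after every step.

    grad_loss(x_adv) must return dL/dx for the (hypothetical) target model.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Random start: uniform noise in [-eps, eps] inside the legal set S.
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_loss(x_adv))  # ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)          # project onto S
    return x_adv
```

With a constant positive gradient, every coordinate saturates at the +eps boundary once enough steps are taken, which illustrates why the attack is bounded by the perturbation budget regardless of the step count.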

In the work of Goodfellow et al. [14], a simple and fast method of generating adversarial examples is introduced, namely the Fast Gradient Sign Method (FGSM). FGSM performs a single step in the ascent direction of the gradient of the loss function, scaled by the perturbation budget ε. Madry et al. [28] use a more powerful adversary to maximize the loss by introducing the multi-step variant of FGSM, known as Projected Gradient Descent (PGD). Inspired by Tramèr et al. [39], Madry et al. initialize the PGD attack by randomly choosing starting points inside S. This ensures varying explorations of the non-concave, constrained maximization problem, allowing PGD to converge to different local maxima. As stated by Madry et al. [28], this renders PGD the ultimate first-order adversary. A more detailed description is provided in Sect. 4.1. In the following sections, we apply the PGD attack to full-precision CNNs, QNNs, BNNs, a wide range of MBNs, as well as adversarially trained networks. We leverage our understanding of the different gradient-flow characteristics of these types of neural networks (presented in Sect. 2.1) to explain their robustness (or lack thereof). Recent works have addressed the simultaneous optimization of network compression and adversarial robustness. In [13], Goldblum et al. leverage


knowledge distillation to distill adversarial robustness from a larger teacher network onto a smaller student. In [15], Guo et al. investigated mitigating potential threats of adversarial examples through robust neural architecture search (NAS) techniques. With [24], Kundu et al. aim for highly compressed CNNs that maintain their robustness through robust pruning. Galloway et al. [12] evaluated and interpreted the adversarial robustness of BNNs. The reduced memory consumption and faster inference of BNNs are complemented by adversarial robustness, demonstrated as improved or at least on-par robustness against several attacks compared to full-precision models. The introduced discontinuity and approximated gradients of BNNs account for the improved robustness over full-precision networks. Building on this knowledge, our work aims to exploit the increased representation capabilities of MBNs to boost natural accuracy compared to BNNs, while utilizing the resilience of binary parameters against adversarial attacks. With [25], Lin et al. jointly optimize the efficiency and robustness of DNNs by robust quantization. In Table 1, we summarize the explorations performed in existing literature. In this work, we holistically consider the tripartite optimization space by using multi-bit networks and analyse the effect of MBN design decisions on all three optimization targets.

Table 1. Classification of related-work explorations of the tripartite optimization space, considering the natural accuracy, the HW efficiency, and the robustness against adversarial attacks.

Exploration      [6,26,31,42]   [28,40,41]   [12,13,15,25]   [This work]
Accuracy              ✓             ✓              ✓              ✓
HW-Efficiency         ✓                            ✓              ✓
Robustness                          ✓              ✓              ✓

3 Methodology

This paper navigates the tripartite optimization problem by studying the design space of MBNs. In this section, the considered design space of MBNs is formulated with varying numbers of bases for weights and activations (Sect. 3.1). Furthermore, we analyse how the gradient flows of multi-bit networks lead to different levels of adversarial robustness, and compare them to floating-point and fixed-point networks, while addressing the gradient-obfuscation problem known in the literature when assessing the robustness of quantized and binarized CNNs [3] (Sect. 3.2). Lastly, the number of bit-operations (BOPS) is formulated, allowing an evaluation and comparison of the complexity of neural networks across BNN, QNN, MBN, and full-precision representations (Sect. 3.3).

3.1 Design Space of Multi-bit Networks

As noted in Sect. 2.1, the accuracy gap between full-precision networks and BNNs can be mitigated by approximating the full-precision convolutions as a linear combination of multiple binary bases, M for weights and N for activations (shown in Eq. 2). Although each convolution of two individual binary bases m and n still has limited information capacity, the learned linear combination of the M × N binary convolutions collectively represents a full-precision convolution more accurately.

Conv(W, A) ≈ Σ_{m=1}^{M} Σ_{n=1}^{N} α_m β_n Conv(sign(W_m), sign(A_n))    (2)
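For a single dot product, the double sum of Eq. (2) can be sketched as below; the base tensors and scaling coefficients are passed in directly for illustration, whereas in ABC-Net-style training they would be learned:

```python
import numpy as np

def binarize(x):
    """Map to +/-1 (ties at zero go to +1, a convention assumed here)."""
    return np.where(x >= 0, 1.0, -1.0)

def multibit_dot(a_bases, w_bases, betas, alphas):
    """Eq. (2) for one dot product: sum over M weight bases and N activation
    bases of scaled binary dot products. Each of the M*N binary products is
    independent of the others, which is what enables their parallel,
    bit-wise execution."""
    total = 0.0
    for w_b, alpha in zip(w_bases, alphas):
        for a_b, beta in zip(a_bases, betas):
            total += alpha * beta * float(binarize(w_b) @ binarize(a_b))
    return total
```

With M = N = 1 this degenerates to the XNOR-Net approximation with a single scaling product α·β.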

The multiple binary activations A_{b,n} = sign(A_n), their corresponding scaling coefficients β_n, and the multiple binary weight approximations W_{b,m} = sign(W_m), along with the weight scaling coefficients α_m, make up the whole convolution scheme. Each convolution of a weight and activation base pair can be computed bit-wise, independently of the other convolutions. More specifically, the multi-bit convolution operation can be parallelized across up to M × N convolutions, requiring only the end-to-end latency of a single, standard binary convolution. It is important to note that parallelizing the bit-wise computations of an equivalent QNN is not possible, due to the data dependencies among the bits belonging to the same elements of the convolution operands. We can intuitively expect that increasing M × N up to a certain extent increases the accuracy of the MBN, at the cost of more bit-operations. However, understanding the effect of increasing (or decreasing) the number of bases (for weights and/or activations) on adversarial robustness (whether naturally or adversarially trained) is a complex problem, requiring more in-depth analysis and experimental evaluation. Throughout this work we follow the notation MBN M × N to denote a multi-bit network configuration with M weight bases and N activation bases.

3.2 Analysing Gradient Flows

In this section, valuable insights into the tripartite optimization problem are extracted by analysing the differences in the forward pass and the gradient flows of the backward pass for convolutions with different numerical representations, i.e. fixed-point, multi-bit, and floating-point networks (see Fig. 1).

Multi-bit: The forward pass of multi-bit networks is a linear combination of multiple binary bases M and N, as described in Eq. 2. In the backward pass, the gradients of the weights g_{W^l} of layer l and the gradients of the activations g_{A^{l-1}} are computed as sums of the scaling coefficients times the STE-passed gradients of the binary bases, as shown in Eq. 3.

g_{A^{l-1}} = Σ_{n=1}^{N} β_n 1_{|A|≤1} ∂A^l/∂A^{l-1}_{b,n} ;   g_{W^l} = Σ_{m=1}^{M} α_m 1_{|W|≤1} ∂A^l/∂W^l_{b,m}    (3)
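The indicator terms in Eq. (3) are exactly the STE: binarize in the forward pass, and pass the gradient only inside the clipping window in the backward pass. A hand-written forward/backward pair (a sketch, not the paper's training code):

```python
import numpy as np

def sign_forward(x):
    """Forward pass: hard binarization to +/-1."""
    return np.where(x >= 0, 1.0, -1.0)

def sign_backward_ste(x, grad_out):
    """Backward pass with the straight-through estimator: pass the incoming
    gradient where |x| <= 1 (the htanh window), zero elsewhere -- the
    indicator 1_{|x|<=1} of Eq. (3)."""
    return grad_out * (np.abs(x) <= 1.0)
```

Note how the backward pass ignores the sign function entirely: the gradient a first-order attacker sees is computed with respect to the latent values, which is precisely the forward/backward discrepancy the analysis below attributes robustness to.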


Fig. 1. Comparison of convolutions with different numerical representation methods, i.e. multi-bit (left), fixed-point (middle) and floating-point (right).

Here, each binary base for weights W_{b,m} and activations A_{b,n} has separate trainable scaling parameters, α_m and β_n respectively, increasing the learning capability as the number of bases grows. For multi-bit convolutions, each discrete sign function is bypassed with the STE to allow gradients to flow, representing a coarse approximation. This induces an increased degree of gradient approximation (i.e. between latent and actual binary parameters), producing large discrepancies between the forward and backward passes.

Fixed-Point: For the approximated convolutions of fixed-point CNNs, the round() operator maps the full-precision latent parameters (A^{l-1} and W^l) to the fixed-point quantized representations (A_q^{l-1} and W_q^l), as shown in Eq. 4.

Conv(W^l, A^{l-1}) ≈ αβ Conv(round(W^l, q_W), round(A^{l-1}, q_A))    (4)

The scaling and shifting happen before round(), such that W ∈ [0, 1]. Note that the numbers of bits used to represent weights and activations are denoted by q_W and q_A respectively, ranging over integer values q = [1; 8]. The gradients in the backward pass follow as in Eq. 5.

g_{A^{l-1}} = β 1_{|A|≤1} ∂A^l/∂A_q^{l-1} ;   g_{W^l} = α 1_{|W|≤1} ∂A^l/∂W_q^l    (5)

Convolutions of fixed-point networks rely on a single trainable scaling factor α for the weights and β for the activations, rather than one per base. In the backward pass, the round() operation is bypassed with the STE for weights and activations. This bypassing is a much closer approximation of the gradients than bypassing the sign function, as done in multi-bit convolutions, leading to smaller discrepancies between the forward and backward passes of fixed-point networks. The well-established computations of floating-point DNNs do not induce any discrepancies, as they do not require any discrete operations. Analysing the


forward and backward passes of the convolutions of different numerical representations, the gradient analysis points to the following implications:
– First, the multi-bit convolution introduces large discrepancies between the forward and backward pass, due to the harsh approximation of the sign function with the STE for each binary base. This coarse gradient approximation demands stronger gradient-based attacks with larger perturbations to change the output of MBNs, implying an improved inherent resilience of MBNs compared to floating-point and fixed-point networks.
– Second, as the number of binary bases of an MBN increases, the approximation gap to the floating-point convolution and its respective gradient flow decreases. This implies that an increased number of bases reduces the inherent robustness against gradient-based attacks.
– Lastly, adversarial training methods demand an increased learning ability, due to the additional examples provided by the attacks. This suggests that training MBNs with adversarial examples requires an increased number of binary bases to improve the trade-off between natural accuracy and adversarial robustness.

Obfuscated Gradients: In general, iterative optimization-based attacks (e.g. PGD, see Sect. 2.2) require gradient information of the underlying model to create strong adversarial examples. A main concern in the literature is false defense mechanisms exploiting the effect of gradient masking, or its special case of obfuscated gradients [3]. The phenomenon of obfuscated gradients can lead to a false sense of security against adversarial attacks by not providing enough gradient information. We highlight that our investigations of the different quantization techniques in this work exploit neither shattered gradients, stochastic gradients, nor vanishing/exploding gradients, the three types of gradient obfuscation.
This work follows the common practice of using the STE [4] to ensure gradient flow over all non-differentiable operations (e.g. round(), sign()). This holds both for training the neural network and for generating the adversarial attack on the underlying model. This is also detailed in Fig. 1, showing CNN layers of different numerical representations, all allowing the backward pass to bypass the non-differentiable quantization functions with the STE.

3.3 Compute Complexity

Hardware arithmetic computation units can be classified as either bit-serial or bit-parallel (vectorized) in their processing of the input operands. To fairly compare the computational complexity of low-precision and binary neural networks, they must be evaluated on hardware architectures which can exploit their respective benefits. In this paper, we refer to bit-serial hardware, as it can flexibly process inputs of any bit-width, albeit with an added latency as the input bit-width grows. Bit-serial computation units break down the operands and perform the computation bit-by-bit until the bit-width of both


inputs is exhausted [34]. In principle, two 1-bit (binary) operands need one cycle of computation, whereas two 16-bit operands require 256 cycles to complete an arithmetic operation. As such, we use the bit-operations (BOPS) metric to evaluate and compare the complexity of neural networks across BNN, MBN, QNN, and full-precision representations.
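Under this bit-serial cost model, BOPS can be estimated per convolution as sketched below (the helper names are hypothetical, and `macs` denotes the number of multiply-accumulate operations of the layer; first/last layers are often kept at higher precision in practice, which this sketch does not account for):

```python
def conv_bops(macs, q_w, q_a):
    """Bit-operations of a convolution on a bit-serial unit: each MAC of a
    q_w-bit weight with a q_a-bit activation costs q_w * q_a bit-operations
    (1 cycle for 1x1-bit operands, 256 cycles for 16x16-bit operands)."""
    return macs * q_w * q_a

def mbn_bops(macs, m, n):
    """An MBN with M weight and N activation bases computes M*N binary
    convolutions, i.e. M*N bit-operations per full-precision MAC."""
    return macs * m * n
```

Comparing `conv_bops(macs, 2, 2)` with `mbn_bops(macs, 3, 1)` reproduces the trade-off discussed later: a 3×1 MBN costs fewer bit-operations than a 2-bit fixed-point convolution over the same MACs.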

4 Experiments

Breaking CNNs can be achieved by simply adding large perturbations to the input. However, finding the minimum necessary input perturbation is more instructive for understanding the robustness of CNNs [29]. The choice of ε for PGD is based on breaking the full-precision version of ResNet [18], as we aim to compare floating-point, fixed-point, and multi-bit networks in their resilience against PGD. If the classification accuracy drops below random guessing, the target model is considered broken. The PGD threat model τ_PGD^IN = {ε = 2, α = 0.5} is used to assess the inherent (IN) robustness of the target models, representing the worst-case threat model for the mentioned numerical representations and the considered perturbation budget ε = 2. The process of identifying the worst-case threat model for all considered models is described in Sect. 4.1. For adversarially trained (AT) models [40], τ_PGD^AT = {ε = 8, α = 2} is considered to further stress the target models. The reported PGD accuracies are averaged over five runs, ensuring varying explorations of the set of allowed perturbations S, as described in Sect. 2.2. Experiments are carried out on CIFAR-10/100 [23] and ImageNet [32]. For CIFAR-10, 50K train and 10K test images (32 × 32 pixels) are used to train and evaluate the multi-bit configurations of ResNet-20/56. ImageNet consists of ∼1.28M train and 50K validation images (256 × 256 pixels), on which ResNet-18 and its various multi-bit configurations are trained and evaluated. If not otherwise mentioned, all hyperparameters defining the training or the attacks are adopted from the reference implementations. First, the worst-case PGD threat model is identified for different network configurations by varying the step size α (Sect. 4.1).
Second, the design space of MBNs is systematically evaluated by (1) naturally training the networks on original image data, (2) evaluating the resource-efficiency of the configurations based on the number of BOPS, and (3) assessing the inherent adversarial robustness against the ultimate first-order adversarial attack (PGD) (Sect. 4.2). We train and evaluate the multi-bit configurations of ResNet-20, ResNet-56, and ResNet-18 on CIFAR-10, CIFAR-100, and ImageNet, respectively. Third, combining MBNs with state-of-the-art adversarial training methods (e.g. FastAT [40]) can further increase the defensive capability, showing the learning capabilities of larger MBNs (Sect. 4.3). We adversarially train and evaluate multi-bit configurations of ResNet-56 on CIFAR-10/100. To assess the performance with respect to the tripartite optimization space, we report the prediction accuracy on original images (Top-1), attacked images (PGD-20/50) and the number of


bit-operations (BOPS). Note that for floating-point networks, we report the BOPS of the 8-bit version, since we expect no natural-accuracy degradation [22]; the respective Top-1 and PGD experiments are nevertheless performed on classic 32-bit floating-point CNNs.

4.1 Worst-Case Threat Model

The iterative multi-step PGD attack, introduced by Madry et al. [28], maximizes the problem of finding adversarial examples. PGD performs iterative steps in the ascent direction of the gradient of the target model's loss function, scaled by the PGD step size α, and projected (π) back onto the legal set S, see Eq. 6.

X_adv^{i+1} = π_S ( X_adv^i + α · ∇L(X_adv^i, Y, θ) )    (6)

Inspired by Tramèr et al. [39], Madry et al. initialize the PGD attack by randomly choosing starting points with uniform random noise U(−ε, ε) inside the defined legal set S, with S : D(X, X_adv) ≤ ε. Given the highly non-concave maximization problem of finding adversarial examples, different random starting points lead PGD to explore and converge to potentially different local maxima, each representing a possible worst-case adversarial example for the target model. As stated by Madry et al. [28], this renders PGD the ultimate first-order adversary. Therefore, we follow the relevant literature [24,25], which suggests using the PGD attack, an iterative optimization-based attack, to evaluate the adversarial robustness of CNNs. The PGD threat model τ_PGD comprises the perturbation budget ε, the step size α, and the number of iterations i. As stated by Carlini et al. [5], adapting the threat model of state-of-the-art adversarial attacks is a compulsory step to demonstrate an upper bound of adversarial robustness. Therefore, the following performs an exploration of the PGD attack configuration to identify the worst-case threat model τ*_PGD over a variety of numerical representations, i.e. floating-point (Fig. 2), fixed-point (Fig. 3) and multi-bit (Fig. 4). Identifying the worst-case threat model τ*_PGD for the various compressed CNN variants relies on two metrics.
First, the accuracy level after the attack has converged for a specific τ_PGD; the worst case results in the lowest accuracy after attack. Second, the required number of iterations i of PGD to break the model. With these metrics in mind, the goal of the PGD threat-model exploration is to empirically identify a value of the step size α that results in the worst-case PGD attack for the underlying model. In general, the PGD step size α balances convergence speed against the ability to escape local maxima. Versions of ResNet-20 and ResNet-56 are naturally trained on CIFAR-10 and then exposed to the PGD attack with ε = 2, while varying the PGD step size α = {0.1, 0.5, 1} to find the most suitable value. The PGD attacks perform 1,000 iterations i to ensure convergence of the multi-step optimization-based attack, as proposed by Carlini et al. [5]. However, it is worth mentioning
Fig. 2. PGD attack accuracy over PGD iterations for ResNet-20 (left) and ResNet-56 (right) with floating-point representation, averaged over five runs on CIFAR-10, with a fixed perturbation budget ε and varying step size α.

Fig. 3. PGD attack accuracy over PGD iterations for ResNet-20 (left column) and ResNet-56 (right column) variants with fixed-point representations, averaged over five runs, for PACT-2bit (top row) and PACT-4bit (bottom row) on CIFAR-10. Similarly, the perturbation budget ε is fixed while the step size α varies for the threat model.

Fig. 4. PGD attack accuracy over PGD iterations for ResNet-20 (left column) and ResNet-56 (right column) variants with multi-bit representation averaged over five runs for MBN1 × 3 (top row), MBN3 × 1 (middle row) and MBN3 × 3 (bottom row) on CIFAR-10.

that the following results are shown up to i = 100, since attack convergence is observed in that range. Additionally, each variant is exposed to every threat model τ_PGD five times to utilize the described uniform random initialization of the attack. The results reported in this experiment are then averaged over the five runs.
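The exploration procedure described above (sweep the step size, average the post-attack accuracy over five randomly restarted runs, and pick the α with the lowest mean accuracy) can be sketched as follows; `attack_acc` is a hypothetical stand-in for a full PGD evaluation of one model:

```python
import numpy as np

def explore_step_sizes(attack_acc, alphas=(0.1, 0.5, 1.0), runs=5):
    """Threat-model exploration sketch: for each candidate PGD step size,
    run the attack `runs` times (random restarts) and average the accuracy
    after attack. The worst-case alpha is the one with the lowest mean
    post-attack accuracy. attack_acc(alpha, seed) -> accuracy in percent."""
    means = {a: float(np.mean([attack_acc(a, s) for s in range(runs)]))
             for a in alphas}
    worst_alpha = min(means, key=means.get)
    return worst_alpha, means
```

For the models studied here, this procedure selects α = 0.5 at ε = 2, which is the worst-case threat model reported in the text.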


Figure 2 visualizes the threat-model exploration for floating-point versions of ResNet-20 (left) and ResNet-56 (right). Similarly, Fig. 3 shows the threat-model exploration for fixed-point quantized versions of ResNet-20 (left column) and ResNet-56 (right column). In detail, PACT-2bit [6] (top row) and PACT-4bit [6] (bottom row) are exposed to the defined PGD threat models. Lastly, Fig. 4 contains the threat-model exploration for various multi-bit versions of ResNet-20 (left column) and ResNet-56 (right column). MBN1×3 (top row), MBN3×1 (middle row) and MBN3×3 (bottom row) are exposed to the PGD attacks as defined in the experimental setup. The worst-case threat model was identified for all investigated CNNs and numerical representations as τ*_PGD = {ε = 2, α = 0.5}. First, τ*_PGD results in the highest PGD attack effectiveness over all numerical representations (i.e. floating-point, fixed-point, multi-bit) for ResNet-20/56. Second, τ*_PGD requires an adequate number of iterations to break the target models, rendering PGD-20/50 a valid assessment of the inherent robustness of the target models.

4.2 Inherent Robustness of Multi-bit Networks

We compare the inherent robustness of MBNs to full-precision (ResNet [18]), fixed-point (PACT [6]), binary (XNOR-Net [31]), and pruned (AMC [19]) networks, showing the increased resilience of multi-bit networks to the gradient-based PGD attack, as implied in Sect. 3.2. To better understand the gradient flows of MBNs and the influence of varying the number of bases M for weights and N for activations on the tripartite optimization, various MBN configurations are systematically evaluated. The extensive exploration of the design space is presented in Table 2. In general, MBNs provide increased adversarial accuracy against PGD compared to other numerical representations. In particular, the pro-active defensive model MBN3×1 achieves better adversarial accuracy (+10.3pp) against PGD-20 and has 1.3× lower BOPS, with a slight degradation of natural accuracy (–2.4pp), when compared to a 2-bit fixed-point quantized implementation of ResNet-20 on CIFAR-10. Similar trends hold for the ResNet-56 and ResNet-18 versions of MBN3×1, trained on CIFAR-100 and ImageNet respectively. This strongly supports the theoretical implication of Sect. 3.2 that the induced discrepancies between the forward and backward passes of multi-bit networks provide increased resilience against PGD compared to other numerical representations. We notice that limiting the number of activation bases to N = 1, while choosing the number of weight bases M > 1, produces more robust MBN configurations than configurations with weight bases M = 1 and activation bases N > 1. This behavior is associated with the structural characteristics of the activation bases, whose number determines the gradient flow to the next layer. Since the activation bases serve as "gates" of information flow to the next layer, increasing the number of gates enables the attack to exploit more information about the gradients.
Considering the tripartite optimization space, MBN3×1 is a favourable solution,


Table 2. Comparison of naturally trained (inherent robustness) multi-bit, full-precision, fixed-point and pruned networks of ResNet-20/56/18, on CIFAR-10/100 and ImageNet respectively.

ResNet-20, CIFAR-10
  Method              W-Bases/Bits  A-Bases/Bits  BOPS [10^8]  Top-1 [%]  PGD-20 [%]  PGD-50 [%]
  ResNet-20 [18]            8             8          25.95       92.49       1.02        0.81
  AMC W. Pr. [19]           8             8          12.98       88.71       0.82        0.69
  AMC Ch. Pr. [19]          8             8          12.98       89.71       0.77        0.62
  PACT-4 [6]                4             4           6.70       92.21       1.49        1.27
  PACT-2 [6]                2             2           1.89       89.63       3.94        3.33
  XNOR-Net [31]             1             1           0.68       83.98       9.93        8.28
  MBN [26] 1×1              1             1           0.68       84.44      11.10        9.59
  MBN [26] 3×1              3             1           1.49       87.19      14.28       12.54
  MBN [26] 1×3              1             3           1.49       87.54       9.09        7.96
  MBN [26] 5×1              5             1           2.29       87.01      14.13       12.55
  MBN [26] 1×5              1             5           2.29       89.02       8.68        7.71
  MBN [26] 7×1              7             1           3.09       86.88      14.29       12.80
  MBN [26] 1×7              1             7           3.09       89.49       6.49        5.74
  MBN [26] 3×3              3             3           3.98       89.03      10.00        8.78
  MBN [26] 5×5              5             5          10.31       90.86       7.60        6.57

ResNet-56, CIFAR-10
  ResNet-56 [18]            8             8          80.31       93.89       4.75        3.54
  PACT-4 [6]                4             4          20.29       92.98       2.43        2.01
  PACT-2 [6]                2             2           5.29       92.08       2.46        2.03
  XNOR-Net [31]             1             1           1.53       85.61      17.23       15.47
  MBN [26] 1×1              1             1           1.53       87.39       9.04        7.63
  MBN [26] 3×1              3             1           4.03       89.82      11.07        9.67
  MBN [26] 1×3              1             3           4.03       90.76       9.83        8.50
  MBN [26] 5×1              5             1           6.54       89.77      11.08        9.47
  MBN [26] 1×5              1             5           6.54       91.78       8.83        7.56
  MBN [26] 7×1              7             1           9.04       89.78      10.55        9.16
  MBN [26] 1×7              1             7           9.04       92.25       7.48        6.35
  MBN [26] 3×3              3             3          11.54       91.96      10.99        9.73
  MBN [26] 5×5              5             5          31.54       92.68       9.52        7.58

ResNet-56, CIFAR-100
  ResNet-56 [18]            8             8          80.31       72.62       0.90        0.82
  PACT-4 [6]                4             4          20.29       70.44       0.95        0.88
  PACT-2-4 [6]              2             4          10.29       70.40       2.42        2.19
  PACT-2 [6]                2             2           5.29       67.79       1.56        1.46
  XNOR-Net [31]             1             1           1.53       58.46       6.76        6.25
  MBN [26] 1×1              1             1           1.53       60.67       4.08        3.49
  MBN [26] 3×1              3             1           4.03       65.41       3.94        3.47
  MBN [26] 1×3              1             3           4.03       66.21       2.76        2.48
  MBN [26] 5×1              5             1           6.54       66.23       4.13        3.49
  MBN [26] 1×5              1             5           6.54       68.10       2.62        2.33
  MBN [26] 7×1              7             1           9.04       64.75       4.88        4.22
  MBN [26] 1×7              1             7           9.04       69.23       2.24        1.97
  MBN [26] 3×3              3             3          11.54       68.40       3.00        2.63
  MBN [26] 5×5              5             5          31.54       69.74       2.64        2.30

ResNet-18, ImageNet
  ResNet-18 [18]            8             8        1259.6        69.01       0.07        0.07
  PACT-2 [6]                2             2         149.8        60.04       0.04        0.03
  MBN [26] 1×1              1             1          94.3        43.52       0.39        0.32
  MBN [26] 3×1              3             1         131.3        56.18       0.75        0.64
  MBN [26] 1×3              1             3         131.3        57.12       0.41        0.34
  MBN [26] 3×3              3             3         242.3        60.49       0.64        0.48

Adversarial Robustness of Multi-bit Convolutional

171

providing a practical trade-off between the three objectives. Similar trends hold for various model complexities of ResNet, across multiple datasets. 4.3

Adversarial Training of Multi-bit Networks

The defensive capability of NNs can be increased by adversarial training (e.g. FastAT [40]). We train a variety of full-precision, fixed-point, pruned and multi-bit networks to show the learning capabilities of MBNs in detecting both original and adversarial images (see Table 3).

Table 3. Adversarial robustness comparison of multi-bit to floating-point, fixed-point and pruned networks of ResNet-56, adversarially trained on CIFAR-10 and CIFAR-100.

Model/Dataset  Method           Weights  Activations  BOPS [10^8]  Top-1 [%]  PGD-20 [%]
FastAT [40]    ResNet-56 [18]   8        8            80.31        84.03      38.45
CIFAR-10       AMC W. Pr. [19]  8        8            40.16        83.94      41.04
               PACT-4 [6]       4        4            20.29        85.56      40.48
               PACT-2 [6]       2        2            5.29         81.80      45.98
               XNOR-Net [31]    1        1            1.53         75.64      44.15
               MBN [26]         3        1            4.03         77.01      51.98
                                1        3            4.03         79.42      51.00
                                5        1            6.54         75.27      50.10
                                1        5            6.54         80.76      50.02
                                3        3            11.54        81.65      52.02
                                5        5            31.54        82.01      50.71
FastAT [40]    ResNet-56 [18]   8        8            80.31        56.17      23.64
CIFAR-100      PACT-4 [6]       4        4            20.29        58.32      22.14
               PACT-2 [6]       2        2            5.29         54.86      22.53
               XNOR-Net [31]    1        1            1.53         42.52      27.25
               MBN [26]         3        1            4.03         47.23      25.88
                                1        3            4.03         50.35      26.09
                                5        1            6.54         49.14      27.46
                                1        5            6.54         52.70      25.50
                                3        3            11.54        53.49      26.70
                                5        5            31.54        55.25      26.56

In general, MBNs achieve better resilience against PGD than floating-point, fixed-point quantized, and pruned versions of ResNet over various model complexities and datasets. The multi-bit configuration MBN3×3 improves the adversarial robustness (+13.6pp) against PGD-20 and has 7× lower BOPS, with a slight degradation of original accuracy (–2.4pp), when compared to the costly full-precision implementation of ResNet-56 on CIFAR-10. Unlike fixed-point models, increasing the number of binary bases for MBNs further scales the original accuracy and robustness against PGD. This empirically supports the implication that adversarial training demands large learning capacities, as provided by MBNs with an increased number of binary bases for weights and activations.
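The adversarial-training scheme referenced above (FastAT [40]: a random-start FGSM inner step that approximates the inner maximization, followed by a model update on the perturbed batch) can be sketched on a toy numpy stand-in. The logistic-regression "model", learning rate, epsilon, and data below are illustrative assumptions so the gradients are available in closed form, not the paper's training setup.

```python
# Toy sketch of one FastAT-style adversarial training step:
# inner max = random start + one FGSM step; outer min = SGD on the
# resulting adversarial examples.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fast_at_step(w, x, y, eps=0.1, lr=0.5):
    # Inner maximization: random start inside the eps-ball + one FGSM step.
    delta = rng.uniform(-eps, eps, size=x.shape)
    p = sigmoid((x + delta) @ w)
    grad_x = np.outer(p - y, w)          # dL/dx for the BCE loss
    delta = np.clip(delta + eps * np.sign(grad_x), -eps, eps)
    # Outer minimization: SGD step on the adversarial batch.
    p = sigmoid((x + delta) @ w)
    grad_w = (x + delta).T @ (p - y) / len(y)
    return w - lr * grad_w

w = np.zeros(2)
x = np.array([[1.0, 0.0], [-1.0, 0.0]])  # well-separated toy data
y = np.array([1.0, 0.0])
for _ in range(50):
    w = fast_at_step(w, x, y)
# The robustly trained classifier still separates the clean toy points.
assert sigmoid(x @ w)[0] > 0.5 and sigmoid(x @ w)[1] < 0.5
```

Because the class margin (1.0) exceeds eps (0.1), the toy problem remains separable under attack, mirroring the accuracy/robustness trade-off the table quantifies for real networks.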

5 Conclusion

This work aims to find the balance among the three objectives of task-related accuracy, resource efficiency, and robustness of CNNs by utilizing MBNs. The solution space is navigated by performing an analysis and thorough evaluation of the design space of MBNs, varying the number of weight and activation bases. Their inherent robustness is assessed against the gradient-based PGD attack, and their learning capabilities in the context of adversarial training are compared to floating-point, fixed-point, and pruned networks. To the best of our knowledge, this is the first paper highlighting the improved robustness of differently configured MBNs and providing an analysis of their gradient flows. The pro-active configuration MBN3×1 improves the robustness by +10.3pp with 1.3× fewer bit-operations, at a slight degradation in natural accuracy of –2.4pp, compared to a 2-bit implementation of ResNet-20 trained on CIFAR-10.

References

1. Ahn, S., Hu, S.X., Damianou, A.C., Lawrence, N.D., Dai, Z.: Variational information distillation for knowledge transfer. In: CVPR, pp. 9155–9163 (2019)
2. Akhtar, N., Mian, A.S.: Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, 14410–14430 (2018)
3. Athalye, A., Carlini, N., Wagner, D.A.: Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In: ICML, pp. 274–283 (2018)
4. Bengio, Y., Léonard, N., Courville, A.C.: Estimating or propagating gradients through stochastic neurons for conditional computation (2013). arXiv:1308.3432
5. Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber, J., Tsipras, D., Goodfellow, I.J., Madry, A., Kurakin, A.: On evaluating adversarial robustness (2019). arXiv:1902.06705
6. Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.-J., Srinivasan, V., Gopalakrishnan, K.: PACT: parameterized clipping activation for quantized neural networks (2018). arXiv:1805.06085
7. Courbariaux, M., Bengio, Y.: BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1 (2016). arXiv:1602.02830
8. Darabi, S., Belbahri, M., Courbariaux, M., Nia, V.P.: BNN+: improved binary network training (2018). arXiv:1812.11800
9. Fasfous, N., Vemparala, M.-R., Frickenstein, A., Badawy, M., Hundhausen, F., Höfer, J., Nagaraja, N.-S., Unger, C., Vögel, H.-J., Becker, J., Asfour, T., Stechele, W.: Binary-LoRAX: low-power and runtime adaptable XNOR classifier for semi-autonomous grasping with prosthetic hands. In: ICRA (2021)
10. Fasfous, N., Vemparala, M.-R., Frickenstein, A., Frickenstein, L., Badawy, M., Stechele, W.: BinaryCoP: binary neural network-based COVID-19 face-mask wear and positioning predictor on edge devices. In: IPDPS-RAW (2021)
11. Frickenstein, A., Vemparala, M.R., Mayr, J., Nagaraja, N.S., Unger, C., Tombari, F., Stechele, W.: Binary DAD-Net: binarized driveable area detection network for autonomous driving. In: ICRA, pp. 2295–2301 (2020)
12. Galloway, A., Taylor, G.W., Moussa, M.: Attacking binarized neural networks. In: ICLR (2018)
13. Goldblum, M., Fowl, L., Feizi, S., Goldstein, T.: Adversarially robust distillation. In: AAAI, vol. 34 (2020)
14. Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR (2015)
15. Guo, M., Yang, Y., Rui, X., Liu, Z.: When NAS meets robustness: in search of robust architectures against adversarial attacks (2019)
16. Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient DNNs. In: NeurIPS, vol. 29, pp. 1379–1387 (2016)
17. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: NeurIPS, vol. 28, pp. 1135–1143 (2015)
18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
19. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: ECCV, pp. 815–832 (2018)
20. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NIPS Deep Learning and Representation Learning Workshop (2015)
21. Huang, Q., Zhou, S.K., You, S., Neumann, U.: Learning to prune filters in convolutional neural networks. In: WACV, pp. 709–718 (2018)
22. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko, D.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: CVPR (2018)
23. Krizhevsky, A.: Learning multiple layers of features from tiny images. University of Toronto (2012)
24. Kundu, S., Nazemi, M., Beerel, P.A., Pedram, M.: DNR: a tunable robust pruning framework through dynamic network rewiring of DNNs. In: ASPDAC '21, pp. 344–350. ACM, New York (2021)
25. Lin, J., Gan, C., Han, S.: Defensive quantization: when efficiency meets robustness. In: ICLR (2019)
26. Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. In: NeurIPS, vol. 30, pp. 345–353 (2017)
27. Liu, Z., Wu, B., Luo, W., Yang, X., Liu, W., Cheng, K.-T.: Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In: ECCV (2018)
28. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018)
29. Vemparala, M.R., Frickenstein, A., Fasfous, N., Frickenstein, L., Zhao, Q., Kuhn, S.F., Ehrhardt, D., Wu, Y., Unger, C., Nagaraja, N.S., Stechele, W.: BreakingBED: breaking binary and efficient deep neural networks by adversarial attacks. In: Intelligent Systems Conference (2021)
30. Vemparala, M.R., Fasfous, N., Frickenstein, L., Frickenstein, A., Singh, A., Salihu, D., Unger, C., Nagaraja, N.-S., Stechele, W.: Hardware-aware mixed-precision neural networks using in-train quantization. In: BMVC (2021)
31. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: ECCV, pp. 525–542. Springer, Cham (2016)
32. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
33. Shafahi, A., Najibi, M., Ghiasi, M.A., Xu, Z., Dickerson, J., Studer, C., Davis, L.S., Taylor, G., Goldstein, T.: Adversarial training for free! In: NeurIPS, vol. 32, pp. 3358–3369 (2019)
34. Sharify, S., et al.: Loom: exploiting weight and activation precisions to accelerate convolutional neural networks. In: DAC (2018)
35. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
36. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.J., Fergus, R.: Intriguing properties of neural networks. In: ICLR (2014)
37. Tang, W., Hua, G., Wang, L.: How to train a compact binary neural network with high accuracy? In: AAAI, vol. 31, pp. 2625–2631 (2017)
38. Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: ICLR (2020)
39. Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P.: The space of transferable adversarial examples (2017)
40. Wong, E., Rice, L., Kolter, J.Z.: Fast is better than free: revisiting adversarial training. In: ICLR (2020)
41. Zhang, H., Yu, Y., Jiao, J., Xing, E.P., El Ghaoui, L., Jordan, M.I.: Theoretically principled trade-off between robustness and accuracy. In: ICML, PMLR vol. 97, pp. 7472–7482 (2019)
42. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients (2018)

Optimization of Lacrimal Aspect Ratio for Explainable Eye Blinking

Misbah Ayoub1, Andrew Abel2(B), and Haiyang Zhang1

1 School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, China
[email protected], [email protected]
2 Computer and Information Sciences, University of Strathclyde, Glasgow, Scotland
[email protected]

Abstract. Eye blinking has been studied extensively due to its wide range of potential applications. However, one under-researched area is the use of the wider lacrimal region for detection. This paper proposes a new eye blinking detection method using a novel lacrimal aspect ratio (LAR) strategy that utilises the eyebrows and eyes. The proposed algorithm estimates facial landmarks using an automatic facial landmark detector, extracts a single scalar quantity using LAR to characterize eye opening and closing, and detects both partial and full blinking in each frame using a LAR threshold. We set three threshold values, –2.4, –2.6, and –2.9, to detect blinks in each frame. Experimental results show that our approach successfully detects eye blinks and can outperform other state-of-the-art works. The utilization of LAR in detecting blinks and partial blinks demonstrates its potential to offer a novel and informative metric for researchers. This approach also opens up possibilities for further eye-related investigations, including the recognition of emotions. With its low dimensionality and easily understandable time-domain features, LAR provides an effective pathway towards achieving these goals.

Keywords: Lacrimal aspect ratio · Blinking · Image processing · Eye aspect ratio · Eyeblink8 · TalkingFace

1 Introduction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 175–192, 2024. https://doi.org/10.1007/978-3-031-47715-7_13

Eye blink detection uses image processing to measure blinking in faces, and can aid with facial movement disorders, blepharospasm, and tiredness while driving [9,27,38]. Generally, blinking refers to the shutting and reopening of the eyelids, with a typical duration of between 100 and 400 ms [45], and many factors can impact blinking, including eyelid condition, presence of illness, contact lenses, psychological state, the surrounding environment, medicine, and other stimuli. Blink detection has been applied in various applications such as communication between paralyzed persons and computers [33], and drowsiness detection [41]. There are various approaches for detecting blinking, such as

measuring ocular parameters [6], or using a modification of the algebraic distance algorithm for conic approximation via the Eye Aspect Ratio (EAR) [42]. In recent years, deep learning has become more widely used for eye blinking, with various state-of-the-art techniques such as CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), transfer learning, and object detection. By training deep learning algorithms on a large dataset of eye images, an algorithm can learn to recognize eye blinking patterns in real-time. Recently, various studies have used deep learning approaches for various applications of eye blinking [1,11,19,24,43,46,55]. Eye information such as blink count, time, and frequency may also provide valuable mental health information. Cori et al. [10] used continuous eye blinking measures to successfully identify drowsiness during driving. Grice et al. reported that eye gaze may identify young adults with autism [20]. Laxmipriya et al. examined the psychological state of subjects by blink rate [37], showing that the blink rate increases with more positive moods. Deception can also be estimated from blink rate. Borza et al. used various cues considered indicators of deception, such as blink rate, gaze direction, and the saccadic eye movement rate [7], demonstrating the applications of blink detection. In addition to blinking, partial eye opening (which we detect in this paper) can also potentially be used to detect Down's syndrome and schizophrenia [18,35,36]. Understanding the importance of eye movements, eye narrowing, and eye widening helps to highlight important emotional indicators [26,51]. Therefore, rather than simply being able to count blinks, it is also valuable to obtain accurate time domain vectors for further research. This article puts forth a novel method of measuring eye openness levels by utilizing Lacrimal Aspect Ratio (LAR) based measurements using the Dlib face detector's identified points.
By applying the LAR measurement with three different threshold values, we can detect full and partial blinks in a continuous manner. This approach is straightforward and comprehensible in both computer science and psychological research, and has the potential to be used for real-time blink detection. The combination of the eyelid and eyebrow area could be employed for future eyebrow movement research, as well as for facial expression and emotion recognition, given that blinking is an action that has a significant impact on emotion recognition (according to research on action units) [44]. Our proposed method is compared to other state-of-the-art methods in the literature using the same datasets, including both geometric methods and deep learning models, showing that our proposed method can produce comparable accuracy to deep learning models while being much more lightweight, and presenting a useful and analysable feature stream of time domain values. These can potentially be used as low-dimensionality feature inputs in future research. Our approach is designed to be more intelligible, providing researchers with greater transparency and control over the model. Instead of deep learning, we can use smaller datasets than complex deep learning models require. Moreover, the nature of the proposed method means that it is expected to have lower memory and power consumption than deep learning, making it a more practical solution for smaller datasets. By providing an explainable solution with comparable accuracy, the proposed model can help advance the field of emotion recognition and contribute to future developments in the field. The contributions of this paper can be summarized as follows: (1) we propose a simple and explainable blinking detection method based on the Lacrimal Aspect Ratio (LAR), using the area between the eyebrows and the eyes, which can potentially be used in medical and psychological research to better track eye and eyebrow movement. (2) For blink detection and partial blink detection, ranges have been defined with threshold values based on facial landmark detection; detailed results are analyzed using the widely used Eyeblink8 and Talking Face datasets. (3) To handle noise, the Dlib face detector has been modified to identify and eliminate all noisy or defective frames where eyebrow and eye landmarks cannot be detected. (4) We demonstrate that results similar to those reported in the literature can be achieved with less data-intensive models than deep learning approaches. The remainder of this paper is organized as follows. A detailed background is provided in Sect. 2, and our proposed LAR approach is then introduced in Sect. 3. The detailed system configuration is provided in Sect. 4, including the complete blinking process and the datasets used. We present our results and compare and discuss them with other state-of-the-art results in Sect. 5. We then conclude the paper and discuss future research directions in Sect. 6.

2 Background

As discussed in the previous section, there are various approaches for detecting blinking, such as measuring ocular parameters [6], or the Eye Aspect Ratio (EAR) [42]. Machine learning approaches have also become more widely used for eye blinking [14,25,57]. One approach for blinking is the EAR measure, which is used in computer vision algorithms to detect eye blinking [47]. It is defined as the ratio of the distances between the eye landmarks. To distinguish between open and closed eyes, a threshold value is used (Dewi et al. used thresholds of 0.3 and 0.2 in their 2022 paper [15]). In this paper, we will investigate calculating blinks with our proposed method using three different threshold values. There are various machine learning approaches to detect blinking, especially for driver drowsiness detection. Adireddi et al. [2] detected eye blinking by using a Viola-Jones detector and an SVM classifier. They used the Eye Aspect Ratio (EAR) pattern on the widely used Eyeblink8 and Talking Face datasets. They concluded that when the subject looked to the side for longer than a blink duration, the threshold fails, and therefore their SVM classifier outperformed the EAR thresholding values; the SVM detects the presence of closed eyes with 95% accuracy. Machine learning is also used for blink detection, as shown by You et al. [52], who used a deep cascade CNN for drowsiness detection on the WIDER FACE and AFLW datasets. They proposed a fatigue state classifier, based on an SVM, which takes EAR as input. They reported 94.8% accuracy.
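The EAR measure referenced above is commonly computed from the six per-eye landmarks p1..p6 (p1/p4 the horizontal eye corners, p2/p3 the upper lid, p6/p5 the lower lid) as the ratio of the summed vertical distances to twice the horizontal distance. A small sketch, where the 0.2 threshold mirrors one of the values reported by Dewi et al. [15] and the landmark coordinates are synthetic examples:

```python
# Eye Aspect Ratio (EAR) from six eye landmarks; a small EAR value
# indicates a closed eye.
import numpy as np

def eye_aspect_ratio(p):
    """p: array of shape (6, 2), landmarks p1..p6 in image coordinates."""
    v1 = np.linalg.norm(p[1] - p[5])   # |p2 - p6| (vertical)
    v2 = np.linalg.norm(p[2] - p[4])   # |p3 - p5| (vertical)
    h = np.linalg.norm(p[0] - p[3])    # |p1 - p4| (horizontal)
    return (v1 + v2) / (2.0 * h)

# Synthetic open vs. closed eye shapes (pixel coordinates).
open_eye = np.array([[0, 0], [10, -4], [20, -4], [30, 0], [20, 4], [10, 4]])
closed_eye = np.array([[0, 0], [10, -1], [20, -1], [30, 0], [20, 1], [10, 1]])
assert eye_aspect_ratio(open_eye) > 0.2 > eye_aspect_ratio(closed_eye)
```

The LAR proposed in this paper replaces the eye-only landmarks with the eyebrow-to-eye region, but the thresholding idea is analogous.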


This is an active research field, with several state-of-the-art (as of 2023) publications. Akhdanet et al. [3] used SVM and MLP classifiers for a blinking detection system used in face detection with anti-spoofing eye blinking detection. This study used the Eyeblink8 dataset. Using the SVM classifier, they reported a precision of 0.73, a recall of 0.91, and an F1 score of 0.81. Using MLP, they reported a precision of 0.95, a recall of 0.98, and an F1 score of 0.96. Demonstrating potential practical applications, Hutamaputra et al. [23] investigated eye blinking for assisting people with disabilities by proposing menu selection using eye gaze. Eye detection uses the facial landmark method, and classification uses K-nearest neighbors. The method's highest accuracy result is 90.42%. Cristanti et al. [53] also used eye recognition to select menus. This study used edge detection on the eye iris. In this study, users can choose the activity menu by closing their eyes for more than 250 ms. The system's output is audio that represents the activity menu. The reported success rate of left gaze tracking was 80%, straight gaze tracking was 90%, and right gaze tracking was 80%. Jordan et al. [28] proposed a CNN in smart connected glasses based on IR signals to detect eye blinks and compared it to threshold-based methods. They reported 90.8% accuracy in the third convolution layer, whereas their threshold-based method reported 84% accuracy. However, they also noted that ROM and RAM usage is significantly higher with the CNN approach than for the threshold method. We are also interested in a lightweight approach that can be implemented without high-end hardware. One issue with using pre-trained deep learning models such as VGG [21], Inception [48], ResNet [22], AlexNet [32], and Xception [8] is that the output is not necessarily useful.
Although they can output blink and non-blink classifications, their approach to identifying those blinks is not transparent, and no explanations are provided to justify decisions. Explanations can help researchers understand what features the model is using to make a particular decision and make informed modifications to the model as needed, particularly if the applications are more advanced than just blinking. Deep learning models also often require a large amount of memory and power, which can be impractical for small datasets like Talking Face, which has fewer than 500 eye blinks. Our proposed model is designed to be more intelligible, providing researchers with greater transparency and control over the model. Instead of deep learning, our approach is based on piecewise training, meaning that the model is trained on a smaller, more manageable dataset rather than a complex deep learning model. This approach makes it easier for researchers to understand the model's decision-making process and to make modifications to the model as needed. Moreover, the proposed model is expected to have lower memory and power consumption than deep learning, making it a more practical solution for smaller datasets.

3 Lacrimal Aspect Ratio

Researchers have proposed various methods to detect eye blinking, as discussed in the introduction. Here, we propose LAR, a novel approach, which uses the measured area between the eyebrows and the upper eyelid. For feature extraction, we extract frames from the video at 30 fps (frames per second). Facial landmarks are then detected using the Dlib face detector. For face identification and cropping of the face region, the Dlib algorithm is a successful and widely used method. We have upgraded the method to detect defective frames, i.e., frames that cannot be recognized by the method due to occlusions such as a hand over the face or blur. Some frames hide the eyes completely and reduce the efficiency of the proposed method. To separate the noisy frames, the Dlib detector has been updated to remove defective frames before applying LAR. Following face detection, the eye area along with the eyebrows is identified using facial landmarks. The landmarks associated with both eyes and eyebrows are as follows: (1) left eye: 37–41, (2) left eyebrow: 17–21, (3) right eye: 42–47, (4) right eyebrow: 22–26. After identifying both eyes and eyebrows individually, our proposed LAR method is applied to detect eye blinking. Our research detects eye blinks with the help of two lines drawn horizontally and vertically between the eyebrows and eyes. Blinking is a rapid process, which we can detect using these parameters. If the LAR drops below a threshold (–2.4/–2.6/–2.9), we can assume that the eye is blinking and closing. If the LAR falls within the partial range (from –1.4 down to –2.4/–2.6/–2.9), as usually happens before and after a blink, the eyes are considered partially open. Otherwise, the eyes show no blink. We evaluate blinking per frame; therefore, partial blinks are detected before and after the blink. In this work, we detect eye blinking with three different threshold values: –2.4, –2.6, and –2.9. Additionally, the distance between the eyebrows and the eyes is divided into two key measurements, the tarsal platform show (TPS) and brow fat span (BFS).
TPS is the distance between the upper eyelid margin and the crease, whereas BFS is the distance from the superior brow hairs to the crease in the primary gaze [29]. We use the name Lacrimal Aspect Ratio due to its association with the lacrimal gland. Many medical researchers have calculated variations in the average length of this area [49,56]; however, to the best of our knowledge, no one has precisely calculated it for measuring muscle movement. We propose to calculate this distance and to use it for eye blinking. We define three threshold values, –2.4, –2.6, and –2.9 (as will be discussed later), because the area between the eyebrows and the eyes of different people may vary with age and ethnicity, as per the facial anthropometric data reported by Zhuang et al. [56]. We categorized the distance into three categories: eye blink, partial blink, and no blink (eyes are open). The LAR is defined as:

\[ LAR_{blink} = \frac{(x_1 + x_2)_{min}}{(y_1 + y_2)_{max}} \quad (1) \]

\[ LAR_{nb} = \frac{(x_1 + x_2)_{max}}{(y_1 + y_2)_{min}} \quad (2) \]

where x_1, x_2 refer to the distances between the Dlib landmarks P1, P5, P6, P9 and y_1, y_2 refer to P1, P6, P5, P9. The horizontal length |x_1 + x_2| can be calculated where x_1 is defined as |P1 − P5| and x_2 as |P6 − P9|. The vertical length |y_1 + y_2| can also be calculated, where y_1 represents |P1 − P6| and y_2, |P5 − P9|:

\[ \frac{(x_1 + x_2)}{(y_1 + y_2)} = \frac{|P1 - P5| + |P6 - P9|}{|P1 - P6| + |P5 - P9|} \quad (3) \]

where P1, P5, P6, and P9 are landmarks associated with the eyebrows and eyes.

\[ LAR = \frac{(x_1 + x_2)}{(y_1 + y_2)} \quad (4) \]

After applying normalization to all values of LAR, we get α, which represents the normalized value of the blink ratio. Then,

\[ LAR_{Threshold} = \frac{LAR_{blink} + LAR_{nb}}{2} \quad (5) \]

Here, we identified a suitable robust threshold value using several datasets (as will be listed in Sect. 4.1). By comparing frames with no blinking to frames with blinks, we were able to tune the thresholds appropriately. These thresholds were suitable for several datasets and, given the differences between datasets, should be generally suitable and robust values for many datasets in the literature. Some additional minor tuning may be required in special cases, but the thresholds identified in this paper should be suitable for most usage. The eye status is therefore:

\[ \text{status} = \begin{cases} \text{EyeClosed}, & \alpha \le LAR_t \\ \text{EyePartiallyClosed}, & LAR_t < \alpha \le LAR_p \\ \text{NoBlink}, & \alpha > LAR_p \end{cases} \quad (6) \]

where LAR_t denotes the Lacrimal threshold value (–2.4, –2.6, or –2.9) and LAR_p = –1.4 bounds the partially closed range, which is generally entered before and after blinks. Finally, NoBlink depicts an open eye. For measuring the distances, we use the Euclidean length. The horizontal distance is taken from two midpoints. One midpoint is taken from the lateral canthus (the lateral corner of the left eye) to the lateral corner of the left eyebrow; this midpoint is located on the lateral hooding area (over the lacrimal gland muscle). Another midpoint is taken from the medial corner of the left eye (the medial canthus) to the medial corner of the left eyebrow. The horizontal distance is calculated from these two midpoints. Similarly, the vertical distance is taken from two midpoints, between the lateral and medial corners of the eyebrow and the lateral and medial canthi of the eye. This is shown in Fig. 1.
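The per-frame decision rule of Eq. (6) can be sketched directly; LAR_t = –2.4 below is one of the three thresholds studied, and LAR_p = –1.4 is the partial-blink bound from the text:

```python
# Per-frame eye-state classification from the normalized LAR value alpha.
def eye_status(alpha, lar_t=-2.4, lar_p=-1.4):
    if alpha <= lar_t:
        return "EyeClosed"
    if alpha > lar_p:
        return "NoBlink"
    return "EyePartiallyClosed"   # lar_t < alpha <= lar_p

assert eye_status(-2.7) == "EyeClosed"
assert eye_status(-1.9) == "EyePartiallyClosed"
assert eye_status(-1.0) == "NoBlink"
```

Swapping `lar_t` for –2.6 or –2.9 reproduces the other two threshold settings evaluated in this paper.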

Optimization of Lacrimal Aspect Ratio


Fig. 1. LAR Measurement, where V.D: Vertical Distance, H.D: Horizontal Distance, BFS: Brow Fat Span, TPS: Tarsal Platform Show

For landmark and feature extraction, Dlib is used, and our LAR method is applied to the eyebrow and eye landmarks. For eye blinking, landmark indices run from 17 to 21 for the left eyebrow and 36 to 41 for the left eye. We therefore consider landmark 17 as P1 and landmark 21 as P5. Similarly, indices 22 to 26 are associated with the right eyebrow, and 42 to 47 with the right eye.
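As an illustration, the midpoint-based distances of Fig. 1 can be computed from a (68, 2) landmark array in Dlib's ordering. The specific corner indices used here (17/21 for the brow ends, 36/39 for the lateral and medial canthi) and the averaging of the two vertical spans are our assumptions for the left-eye case, not values stated in the paper.

```python
import numpy as np

BROW = list(range(17, 22))  # eyebrow points 17-21 (Dlib 68-point model)
EYE = list(range(36, 42))   # eye points 36-41

def lar_distances(landmarks):
    """Horizontal and vertical distances per the Fig. 1 construction.

    `landmarks` is a (68, 2) array in Dlib ordering. A lateral midpoint is
    taken between the lateral brow corner and the lateral canthus, a medial
    midpoint between the medial brow corner and the medial canthus; the
    horizontal distance connects the two midpoints, and the vertical
    distance averages the two brow-to-eye spans.
    """
    pts = np.asarray(landmarks, dtype=float)
    lateral_mid = (pts[17] + pts[36]) / 2.0  # lateral brow / lateral canthus
    medial_mid = (pts[21] + pts[39]) / 2.0   # medial brow / medial canthus
    horizontal = np.linalg.norm(lateral_mid - medial_mid)
    vertical = (np.linalg.norm(pts[17] - pts[36])
                + np.linalg.norm(pts[21] - pts[39])) / 2.0
    return horizontal, vertical
```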

4 System Configuration and Experimental Setup

4.1 Dataset

EyeBlink8 Dataset: The Eyeblink8 dataset is a challenging dataset recorded in a home environment and widely used in state-of-the-art research [1,15,17,31,39,54,55]. It consists of eight videos of four subjects. There are 408 eye blinks in 70,992 annotated frames, recorded at a resolution of 640 × 480. Annotation files are available; they start with the line #start and consist of 19 fields. We are only concerned with the frame id, the blink id, and whether the left or right eye is fully closed. The Eyeblink8 dataset is publicly available [17].

Talking Face Dataset: This dataset consists of one video recording of one person sitting in front of the camera and making various facial expressions, including smiling, laughing, and 'funny faces'. The video is captured at 30 fps with a resolution of 720 × 576 and contains 61 annotated blinks [16]. A number of recent studies used this dataset, making it suitable for evaluation in this paper [15,31,39].

ADFES and GRID Datasets: While not specifically designed for eye blinking, we tested our approach and calibrated our threshold values with two additional datasets. We used 22 speakers from the Amsterdam Dynamic Facial Expression Set (ADFES), recorded at a resolution of 1280 × 720, and 34 from the GRID corpus dataset (recorded at a resolution of 360 × 288). These were used to calibrate and test our thresholds, although, as they are not dedicated blink datasets, results are not reported here. However, by using a wide range of frame sizes to identify a common threshold, we are able to show that our threshold values are robust.
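For reference, a minimal reader for such annotation files might look like the sketch below. The colon separator and the positions of the frame id and blink id fields are assumptions for illustration, not a documented specification of the Eyeblink8 format.

```python
def parse_annotations(lines, sep=":"):
    """Collect (frame_id, blink_id) pairs from an Eyeblink8-style file.

    Records before the '#start' marker are skipped; only the first two
    fields of each record are used here (assumed frame id and blink id).
    """
    records, started = [], False
    for line in lines:
        line = line.strip()
        if line == "#start":
            started = True
            continue
        if not started or not line:
            continue
        fields = line.split(sep)
        records.append((int(fields[0]), int(fields[1])))
    return records
```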

M. Ayoub et al.

4.2 Proposed LAR Blink Detection

There are several steps involved in our proposed LAR, as shown in Fig. 3. First, we extract frames at 30 fps (frames per second) and use the pre-trained Dlib facial landmark detector [30]. This provides two shape predictor models trained on the iBUG 300-W dataset, which localize 68 and 5 landmark points, respectively, within a face image. The 68 landmarks comprise jaw points 0–16, right and left eyebrows 17–26, nose points 27–35, right and left eye points 36–47, and mouth points 48–67. Defective frames (frames that cannot be recognized due to occlusions such as a hand covering the face) are identified and removed. Following this, the eye and eyebrow regions are identified using the Dlib facial landmark points, and our LAR method is then applied to calculate the lacrimal aspect ratio for eye blinking. Our method detects eye blinks with the help of two lines drawn horizontally and vertically between the eyebrows and the eyes, and three different threshold values, −2.4, −2.6 and −2.9, are evaluated in this paper. The three threshold values are determined statistically from the variation in eyebrow-to-eye distance during eye opening and closing across the subjects of four datasets, as reported in Sect. 3. We determined the thresholds by measuring the landmarks and distances associated with the eyebrows and eyes and found that the distance between the eyebrows and eyes during eye opening and closing can vary between individuals. We calculated three threshold values with respect to the area between the eyebrows and eyes. Using three different threshold values allows for more accurate blink detection by taking these individual variations into account, ensuring that the blink ratio is accurate for different individuals. If the LAR ratio drops below a threshold (−2.4/−2.6/−2.9), we can assume that the eye is closed.
If the eyes fall in the partial range (from −1.4 down to −2.4/−2.6/−2.9), as usually happens before and after a blink, the eyes are considered partially open; otherwise the eyes show no blink. Since blinking is evaluated on individual frames, the eyes generally register a partial blink immediately before and after a blink. The operation of the system is shown in Fig. 2.

4.3 Data Optimization

The data has been cleaned by using the Dlib output to detect defective frames. Figure 4 shows examples of defective frames. We identified frames where facial landmarks were not detected, i.e. where the face region rectangle output was 0, and excluded those frames from assessment. In Sect. 5, we compare our work with EAR; to maintain balance, frames flagged as defective by either our LAR method or the EAR approach were excluded from evaluation. In the Eyeblink8 dataset, 358 defective frames were identified using the EAR method and 333 using our method, giving a total of 374 defective frames that were not considered. In the TFS dataset, no defective frames were identified.
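The cleaning step can be sketched as follows, with `detect_faces` standing in for any detector (such as Dlib's frontal face detector) that returns a list of face rectangles; the callback name and frame-id interface are ours, chosen for illustration.

```python
def split_defective(frame_ids, detect_faces):
    """Separate usable frames from defective ones.

    A frame is treated as defective when the detector returns no face
    rectangle at all, mirroring the zero-output criterion in the text.
    """
    usable, defective = [], []
    for fid in frame_ids:
        (usable if detect_faces(fid) else defective).append(fid)
    return usable, defective
```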


Fig. 2. Complete system process, showing defective frame identification, face and eye detection, as well as LAR calculation


Fig. 3. System architecture of our proposed lacrimal aspect ratio eye blinking model

Fig. 4. Examples of defective frames in the Eyeblink8 dataset

5 Results and Discussion

As in other research in the literature [40], precision and recall are used to report the effectiveness of our proposed method. The results of our method on the Eyeblink8 dataset for each of our chosen thresholds are shown in Table 1. Overall, we achieved 94%, 95%, and 95% accuracy with the −2.4, −2.6 and −2.9 threshold values, and totals of 390, 269 and 269 blinks across all videos compared with 408 ground truth blinks (ground truth blink counts are taken from Al-gawwam et al. [4]). Of the cumulative total of 70,548 frames remaining after eliminating defective frames from the Eyeblink8 dataset, 1697 frames were detected as partial blinks at the −2.4 threshold, whereas 68,461 frames showed no blink. For the −2.6 and −2.9 thresholds, 1617 frames were partial blinks and 68,662 frames showed no blink. To illustrate our results, Fig. 5 shows the LAR for video 4 of the Eyeblink8 dataset. The video's duration is 1:09 minutes, with 5454 frames recorded at 30 fps. Our method identified 50 blinks at the −2.4 threshold, whereas 28 blinks were detected at the −2.6 and −2.9 thresholds, along with 89 defective frames. One key reason for the difference was that, in the ground truth measurements, multiple


Table 1. LAR accuracy for all videos from the Eyeblink8 dataset

| Video | Accuracy (−2.4) | Accuracy (−2.6) | Accuracy (−2.9) | Detected blinks (−2.4) | Detected blinks (−2.6) | Detected blinks (−2.9) | GT blinks |
|---|---|---|---|---|---|---|---|
| 1 | 0.974 | 0.979 | 0.979 | 51 | 32 | 32 | 38 |
| 2 | 0.908 | 0.905 | 0.905 | 57 | 37 | 37 | 88 |
| 3 | 0.942 | 0.944 | 0.944 | 54 | 41 | 41 | 65 |
| 4 | 0.936 | 0.940 | 0.940 | 50 | 28 | 28 | 31 |
| 8 | 0.971 | 0.974 | 0.974 | 87 | 59 | 59 | 30 |
| 9 | 0.918 | 0.92 | 0.92 | 32 | 33 | 33 | 41 |
| 10 | 0.927 | 0.930 | 0.930 | 42 | 28 | 28 | 72 |
| 11 | 0.922 | 0.924 | 0.924 | 16 | 11 | 11 | 43 |
| Overall | 0.94 | 0.95 | 0.95 | 390 | 269 | 269 | 408 |

Fig. 5. Example result for Eyeblink8 dataset, showing frame-by-frame LAR values, with blinks denoted by horizontal lines

frames were considered to be a single blink. However, in our approach, we are able to distinguish between partial eye opening, full eye opening, and eye closing, which means that blink frame counts differ. Nevertheless, as reported above, by grouping partial blinks and blinks together as per the ground truth measurements, we achieved high accuracy for blinks. For the TFS dataset, our proposed method achieved 92% accuracy for the −2.4 threshold value and 91% accuracy for the −2.6 and −2.9 threshold values. For the −2.4 threshold value, 52 blinks and 125 partial blinks were detected. Similarly, for −2.6 and −2.9, 28 blinks and 299 partial blinks were detected from the 5000 frames, as shown in Table 3.
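The precision, recall, and accuracy figures reported here follow the standard definitions; a small helper makes the computation explicit (the counts in the usage line are illustrative, not taken from the paper).

```python
def blink_metrics(tp, fp, fn, tn=0):
    """Standard precision, recall, and accuracy from detection counts.

    tp/fp/fn/tn are true-positive, false-positive, false-negative, and
    true-negative blink counts (or frame counts) for one evaluation run.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Illustrative counts only:
p, r, a = blink_metrics(tp=9, fp=1, fn=1, tn=89)
```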


Table 2. Comparison of EAR, modified EAR and our LAR approach for Eyeblink8 videos 4 and 8

| | EAR V4 | Modified EAR V4 | LAR V4 | EAR V8 | Modified EAR V8 | LAR V8 |
|---|---|---|---|---|---|---|
| Threshold | 0.3 | 0.2923 | −2.4 | 0.3 | 0.2105 | −2.4 |
| Total frames | 5454 | 5454 | 5454 | 10712 | 10712 | 10712 |
| Total active frames | 5315 | 5315 | 5315 | 10663 | 10663 | 10663 |
| Total defective frames | 89 | 89 | 89 | 0 | 0 | 0 |
| Total blink frame count | 117 | 1035 | 103 | 107 | 628 | 97 |
| Detected blinks | 31 | 54 | 50 | 30 | 51 | 87 |
| Partial blinks | – | – | 50 | – | – | 87 |
| Accuracy | 0.907 | 0.9117 | 0.936 | 0.9108 | 0.966 | 0.97 |

Table 3. Statistics of the TFS dataset for all threshold values

| Threshold | −2.4 | −2.6 | −2.9 |
|---|---|---|---|
| Total frames | 5000 | 5000 | 5000 |
| Total active frames | 5000 | 5000 | 5000 |
| Total defective frames | 0 | 0 | 0 |
| Total blink frame count | 173 | 103 | 103 |
| Partial blinks | 229 | 299 | 299 |
| No blinks | 4598 | 4598 | 4598 |
| Ground truth blinks | 61 | 61 | 61 |
| Detected blinks | 52 | 28 | 28 |
| Accuracy | 0.9212 | 0.9128 | 0.9128 |

Table 2 shows detailed statistics for EAR, modified EAR and our proposed method at different threshold values on Eyeblink8 videos 4 and 8. These videos were chosen because they have similar numbers of defective frames. EAR has 117 total blink frames, with 31 blinks detected at a 0.3 threshold value. Similarly, mEAR, a state-of-the-art approach [15], found 1035 blink frames and 54 blinks with a threshold of 0.2923. With our proposed method, we detected 103 total blink frames and 50 blinks at a −2.4 threshold value. Similar small differences can be seen for video 8. In both cases, our results are more accurate.

Figure 6 provides a detailed example of selected frames from video 8 of the Eyeblink8 dataset, showing both the detected eye regions and the LAR value, along with labels for partial blinks and detected blinks. The threshold value is also indicated in the figure, and a blink is counted when the lowest threshold is crossed. As can be seen in the figure, the threshold is only crossed in 2 frames here, meaning that 2 blink frames are detected. There are also incorrect identifications in 3 frames, where the value differs noticeably from the surrounding frames. Around the detected blink, a number of frames are also detected as partial blinks.

Fig. 6. Frame results for video 8 of the Eyeblink8 dataset. PRB: Partial Blink Range, NBR: No Blink Range, L.E: Left Eye, R.E: Right Eye, P.B: Partial Blink, D.B: Detected Blink, NB: No Blink

Table 4. Comparisons with other state-of-the-art methods

| Dataset | Approach (threshold value) | Year | Precision (%) | Recall (%) |
|---|---|---|---|---|
| EyeBlink8 | Drutarovsky and Fogelton [16] | 2014 | 79 | 85.27 |
| | Fogelton and Benesova [17] | 2016 | 94.69 | 91.91 |
| | Anas et al. [5] | 2017 | 79 | 85.27 |
| | Al-Gawwam and Benaissa [4] | 2018 | 96.65 | 98.78 |
| | Cortacero et al. [11] (MobileNet V2) | 2019 | 88.4 | 91.9 |
| | Cortacero et al. [11] (ResNet) | 2019 | 84.2 | 93.9 |
| | Cortacero et al. [11] (DenseNet) | 2019 | 88.3 | 95.4 |
| | Cortacero et al. [11] (Ensemble) | 2019 | 99.5 | 95.8 |
| | Dari et al. [13] | 2020 | 100 | 93.4 |
| | Liang et al. [34] | 2021 | 93.8 | 89.0 |
| | Dewi et al. (Modified EAR) [15] (0.3) | 2022 | 99 | 81 |
| | Dewi et al. (Modified EAR) [15] (0.2923) | 2022 | 99 | 82 |
| | Kraft et al. [31] | 2022 | 90.9 | 90.4 |
| | Phuong et al. [39] | 2022 | 95.28 | 95.81 |
| | Sridharan et al. [46] (EfficientNet + CNN) | 2022 | 92.73 | 89.7 |
| | Sridharan et al. [46] (MobileNet V2 + CNN) | 2022 | 91.2 | 92.3 |
| | Sridharan et al. [46] (FaceMesh + CNN) | 2022 | 95.64 | 94.25 |
| | Zhao et al. [55] (FedSup) | 2023 | 90.22 | 94.25 |
| | Our Proposed LAR (−2.4) | 2023 | 95 | 98.7 |
| | Our Proposed LAR (−2.6) | 2023 | 95 | 99 |
| | Our Proposed LAR (−2.9) | 2023 | 95 | 99 |
| TFS | Drutarovsky and Fogelton [16] | 2014 | 92.2 | 96.7 |
| | Fogelton and Benesova [17] | 2016 | 95 | 93.44 |
| | Liang et al. [34] | 2021 | 96.7 | 95.1 |
| | Dewi et al. [15] (0.3) | 2022 | 97 | 82 |
| | Dewi et al. [15] (0.2923) | 2022 | 98 | 94 |
| | Kraft et al. [31] | 2022 | 98.3 | 93.4 |
| | Phuong et al. [39] | 2022 | 95.08 | 93.55 |
| | Dari et al. [13] | 2020 | 89.5 | 88.4 |
| | Our Proposed LAR (−2.4) | 2023 | 93 | 99 |
| | Our Proposed LAR (−2.6) | 2023 | 92 | 99 |
| | Our Proposed LAR (−2.9) | 2023 | 92 | 99 |

Table 4 compares our work with the literature. The table shows a consistent progression in eye blinking research since 2014. Some research, such as that by Dari et al. [13], achieved a higher accuracy rate than other models, including deep learning models. On the other hand, some deep learning models showed accuracy rates below 90%. It is noteworthy that alternative approaches, such as those presented in [15,39], outperformed deep learning models in terms of precision on the Eyeblink8 dataset. Our proposed approach offers a novel and successful solution, with very good precision rates on both the Eyeblink8 and TalkingFace datasets for all three threshold values. We obtained overall accuracies of 94% and 95%, respectively, across the three threshold values, with 95% precision and a 99% recall rate for Eyeblink8. For TFS, we achieved 92% precision and 99% recall. This compares favourably with other work in the literature, demonstrating the merits of our proposed method, which requires neither powerful processing nor deep learning methodologies.

One limitation we have faced with this approach is eyebrow movement, which can directly affect our results. Movement of the eyebrows, specifically when raised or lowered, can impact the resulting output. As a result, when a subject's eyes are open and their eyebrows are raised, it can give the appearance of an eye blink even if the individual is not actually blinking. Similarly, if the eyebrows are lowered and the eyes are closed, our system may detect this as the eyes opening. To address this limitation, we intend to incorporate eyebrow-based analysis in future work to enable more accurate emotion recognition; this limitation also provides a baseline for future work that will improve performance by addressing this challenge.

In comparison to deep learning models, our proposed model is better suited to certain applications for several reasons. Firstly, the proposed model outputs simple and understandable tracking of eye openness levels. The feature used to detect blinking can be analysed not just for blinking but also for eye narrowing and widening. This has applications for psychological research, rather than being a black-box machine learning approach that simply outputs blink counts. While deep learning has the advantage of high accuracy, it comes with the trade-off of increased memory and power consumption. Secondly, deep learning algorithms and architectures are better suited to big data analytics with a high volume and variety of data; for smaller datasets like the TalkingFace dataset, which has fewer than 500 eye blinks, deep learning models are not the optimal choice [12,50]. Therefore, we argue that it is not always necessary to train deep learning models, at high computational and energy cost, for limited results.
Secondly, deep learning algorithms and architectures are better suited for big data analytics with a high volume and variety of data, but for smaller datasets like the talkingface dataset, which has fewer than 500 eye blinks, deep learning models are not the optimal choice [12,50]. Therefore, we argue that it is not always required to train deep learning models at high computational and energy cost, for limited results.

6 Conclusion

This paper proposed a novel technique for measuring eye blinking using a measurement we call the lacrimal aspect ratio. This measurement was partially motivated by the potential for future integrated use of eye and eyebrow feature measurements, and it uses a very simple and lightweight process. We presented results on the widely used Eyeblink8 and TalkingFace datasets. Our focus was on identifying blinks, partial blinks, and no blinks per frame and providing a continuous measurement of eye opening. Our objective was to present a novel and explainable approach with accuracy comparable to deep learning models, which can be incorporated into future lightweight feature-based research. Our proposed approach, based on piece-wise training, is arguably more suitable for certain tasks than using deep learning. Future work will investigate combining these feature-based techniques with other image-based feature extraction for video emotion recognition research. Much current work is focused on deep learning, which lacks psychological explainability and requires considerable processing. By using carefully chosen features, such as our proposed LAR approach, which has been shown not only to successfully detect blinks but also to generate continuous temporal features, we can build lightweight and explainable models with features that can be analysed. The results show that our new LAR approach, with three valid threshold values, improves blink detection compared with other work in the literature. In the future, we will further extend our work by integrating eyebrow movement with blink detection for facial expression recognition.

References

1. Adireddi, V.S., Boddeda, C.N.S.J., Kumpatla, D.S., Mantri, C.D., Reddy, B.D., Geetha, G., Thirupathi Rao, N., Bhattacharyya, D.: Detection of eye blink using SVM classifier. In: Smart Technologies in Data Science and Communication: Proceedings of SMART-DSC 2022, pp. 171–178. Springer (2023)
2. Adireddi, V.S., Boddeda, C.N.S.J., Kumpatla, D.S., Mantri, C.D., Dinesh Reddy, B., Geetha, G., Thirupathi Rao, N., Bhattacharyya, D.: Detection of eye blink using SVM classifier. In: Ogudo, K.A., Saha, S.K., Bhattacharyya, D. (eds.) Smart Technologies in Data Science and Communication, pp. 171–178. Springer Nature Singapore, Singapore (2023)
3. Akhdan, S.R., Supriyanti, R., Nugroho, A.S.: Face recognition with anti-spoofing eye blink detection. In: AIP Conference Proceedings, vol. 2482, no. 1, p. 020006 (2023)
4. Al-gawwam, S., Benaissa, M.: Robust eye blink detection based on eye landmarks and Savitzky-Golay filtering. Information 9(4), 93 (2018)
5. Anas, E.R., Henriquez, P., Matuszewski, B.J., et al.: Online eye status detection in the wild with convolutional neural networks. In: VISIGRAPP (6: VISAPP), pp. 88–95 (2017)
6. Bergasa, L.M., Nuevo, J., Sotelo, M.A., Barea, R., Elena Lopez, M.: Real-time system for monitoring driver vigilance. IEEE Trans. Intell. Transp. Syst. 7(1), 63–77 (2006)


7. Borza, D., Itu, R., Danescu, R.: In the eye of the deceiver: analyzing eye movements as a cue to deception. J. Imaging 4(10), 120 (2018)
8. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017)
9. Chu, C.-H., Feng, Y.-K.: Study of eye blinking to improve face recognition for screen unlock on mobile devices. J. Electric. Eng. Technol. 13(2), 953–960 (2018)
10. Cori, J.M., Turner, S., Westlake, J., Naqvi, A., Ftouni, S., Wilkinson, V., Vakulin, A., O'Donoghue, F.J., Howard, M.E.: Eye blink parameters to indicate drowsiness during naturalistic driving in participants with obstructive sleep apnea: a pilot study. Sleep Health 7(5), 644–651 (2021)
11. Cortacero, K., Fischer, T., Demiris, Y.: RT-BENE: a dataset and baselines for real-time blink estimation in natural environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019)
12. Cortacero, K., Fischer, T., Demiris, Y.: RT-BENE: a dataset and baselines for real-time blink estimation in natural environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops (Oct 2019)
13. Dari, S., Epple, N., Protschky, V.: Unsupervised blink detection and driver drowsiness metrics on naturalistic driving data. In: 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–6. IEEE (2020)
14. de Lima Medeiros, P.A., da Silva, G.V.S., dos Santos Fernandes, F.R., Sánchez-Gendriz, I., Castro Lins, H.W., da Silva Barros, D.M., Pinto Nagem, D.A., de Medeiros Valentim, R.A.: Efficient machine learning approach for volunteer eye-blink detection in real-time using webcam. Expert Syst. Appl. 188, 116073 (2022)
15. Dewi, C., Chen, R.-C., Jiang, X., Yu, H.: Adjusting eye aspect ratio for strong eye blink detection based on facial landmarks. PeerJ Comput. Sci. 8, e943 (2022)
16. Drutarovsky, T., Fogelton, A.: Eye blink detection using variance of motion vectors. In: European Conference on Computer Vision, pp. 436–448. Springer (2014)
17. Fogelton, A., Benesova, W.: Eye blink detection based on motion vectors analysis. Comput. Vis. Image Underst. 148, 23–33 (2016)
18. Ghaziuddin, N., Nassiri, A., Miles, J.H.: Catatonia in Down syndrome; a treatable cause of regression. Neuropsychiatr. Dis. Treat. 11, 941 (2015)
19. Ghosh, R., Phadikar, S., Deb, N., Sinha, N., Das, P., Ghaderpour, E.: Automatic eye-blink and muscular artifact detection and removal from EEG signals using k-nearest neighbour classifier and long short-term memory networks. IEEE Sens. J. (2023)
20. Grice, S.J., Halit, H., Farroni, T., Baron-Cohen, S., Bolton, P., Johnson, M.H.: Neural correlates of eye-gaze detection in young children with autism. Cortex 41(3), 342–353 (2005)
21. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015)
22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
23. Hutamaputra, W., Utaminingrum, F., Setia Budi, A.: Eye gaze for menu display selection on smart wheelchair using k-nearest neighbors method. In: AIP Conference Proceedings, vol. 2609, p. 040009. AIP Publishing LLC (2023)


24. Hutamaputra, W., Utaminingrum, F., Setia Budi, A., Ogata, K.: Eyes gaze detection based on multiprocess of ratio parameters for smart wheelchair menu selection in different screen size. J. Vis. Commun. Image Represent., 103756 (2023)
25. Ibrahim, B.R., Khalifa, F.M., Zeebaree, S.R.M., Othman, N.A., Alkhayyat, A., Zebari, R.R., Sadeeq, M.A.M.: Embedded system for eye blink detection using machine learning technique. In: 2021 1st Babylon International Conference on Information Technology and Science (BICITS), pp. 58–62. IEEE (2021)
26. Isler, J.R., Pini, N., Lucchini, M., Shuffrey, L.C., Morales, S., Bowers, M.E., Leach, S.C., Sania, A., Wang, L., Condon, C., et al.: Longitudinal characterization of EEG power spectra during eyes open and eyes closed conditions in children. Psychophysiology, e14158 (2023)
27. Jang, J., Lew, H.: Blink index as a response predictor of blepharospasm to botulinum neurotoxin-A treatment. Brain Behav. 11(11), e2374 (2021)
28. Jordan, A.A., Pegatoquet, A., Castagnetti, A., Raybaut, J., Coz, P.L.: Deep learning for eye blink detection implemented at the edge. IEEE Embed. Syst. Lett. 13(3), 130–133 (2020)
29. Kashkouli, M.B., Abdolalizadeh, P., Abolfathzadeh, N., Sianati, H., Sharepour, M., Hadi, Y.: Periorbital facial rejuvenation; applied anatomy and pre-operative assessment. J. Curr. Ophthalmol. 29(3), 154–168 (2017)
30. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)
31. Kraft, D., Hartmann, F., Bieber, G.: Camera-based blink detection using 3D-landmarks. In: Proceedings of the 7th International Workshop on Sensor-based Activity Recognition and Artificial Intelligence, pp. 1–7 (2022)
32. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
33. Królak, A., Strumillo, P.: Eye-blink detection system for human-computer interaction. Univers. Access Inf. Soc. 11, 1–11 (2011)
34. Liang, R., Song, Q.: Blink detection and duration estimation by using adaptive threshold with considering individual difference. In: 2021 IEEE International Conference on Real-time Computing and Robotics (RCAR), pp. 1116–1121. IEEE (2021)
35. Mackert, A., Woyth, C., Flechtner, K.-M., Volz, H.-P.: Increased blink rate in drug-naive acute schizophrenic patients. Biol. Psychiat. 27(11), 1197–1202 (1990)
36. Malaspina, D., Coleman, E., Goetz, R.R., Harkavy-Friedman, J., Corcoran, C., Amador, X., Yale, S., Gorman, J.M.: Odor identification, eye tracking and deficit syndrome schizophrenia. Biol. Psychiatry 51(10), 809–815 (2002)
37. Moharana, L., Das, N., Nayak, S., Routray, A.: Video based eye blink analysis for psychological state determination. Intell. Dec. Technol. (Preprint), 1–10 (2021)
38. Patel, B.C., Anderson, R.L.: Blepharospasm and related facial movement disorders. Curr. Opin. Ophthalmol. 6(5), 86–99 (1995)
39. Phuong, T.T., Hien, L.T., Vinh, N.D., et al.: An eye blink detection technique in video surveillance based on eye aspect ratio. In: 2022 24th International Conference on Advanced Communication Technology (ICACT), pp. 534–538. IEEE (2022)
40. Radlak, K., Smolka, B.: Blink detection based on the weighted gradient descriptor. In: Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013, pp. 691–700. Springer (2013)
41. Rahman, A., Sirshar, M., Khan, A.: Real time drowsiness detection using eye blink monitoring. In: 2015 National Software Engineering Conference (NSEC), pp. 1–7. IEEE (2015)


42. Rakshita, R.: Communication through real-time video oculography using face landmark detection. In: 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pp. 1094–1098. IEEE (2018)
43. Sangeetha, J.: Deep learning architecture for a real-time driver safety drowsiness detection system. In: Handbook of Research on Computer Vision and Image Processing in the Deep Learning Era, pp. 29–41. IGI Global (2023)
44. Soleymani, M., Pantic, M., Pun, T.: Multimodal emotion recognition in response to videos. IEEE Trans. Affect. Comput. 3(2), 211–223 (2011)
45. Soukupova, T., Cech, J.: Eye blink detection using facial landmarks. In: 21st Computer Vision Winter Workshop, Rimske Toplice, Slovenia (2016)
46. Sridharan, S., Soundar, S., et al.: Assistive technology to communicate through eye blinks - a deep learning approach. Int. J. Comput. Digit. Syst. 11(1), 831–839 (2022)
47. Sugawara, E., Nikaido, H.: Properties of AdeABC and AdeIJK efflux systems of Acinetobacter baumannii compared with those of the AcrAB-TolC system of Escherichia coli. Antimicrob. Agents Chemother. 58(12), 7250–7257 (2014)
48. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
49. Wang, D., Amoozgar, B., Porco, T., Wang, Z., Lin, S.C.: Ethnic differences in lens parameters measured by ocular biometry in a cataract surgery population. PLoS ONE 12(6), e0179836 (2017)
50. Wang, L., Ann Alexander, C.: Machine learning in big data. Int. J. Math. Eng. Manag. Sci. 1(2), 52–61 (2016)
51. Yi, Y., Zhang, H., Zhang, W., Yuan, Y., Li, C.: Fatigue working detection based on facial multi-feature fusion. IEEE Sens. J. (2023)
52. You, F., Li, X., Gong, Y., Wang, H., Li, H.: A real-time driving drowsiness detection algorithm with individual differences consideration. IEEE Access 7, 179396–179408 (2019)
53. Yuli Cristanti, R., Sigit, R., Harsono, T., Adelina, D.C., Nabilah, A., Anggraeni, N.P.: Eye gaze tracking to operate Android-based communication helper application. In: 2017 International Electronics Symposium on Knowledge Creation and Intelligent Computing (IES-KCIC), pp. 89–94 (2017)
54. Zhang, H., Wang, X., Ren, W., Noack, B.R., Liu, H.: Improving the reliability of gaze estimation through cross-dataset multi-task learning. In: 2022 International Conference on High Performance Big Data and Intelligent Systems (HDIS), pp. 202–206. IEEE (2022)
55. Zhao, C., Gao, Z., Wang, Q., Xiao, K., Mo, Z., Jamal Deen, M.: FedSup: a communication-efficient federated learning fatigue driving behaviors supervision approach. Future Gener. Comput. Syst. 138, 52–60 (2023)
56. Zhuang, Z., Landsittel, D., Benson, S., Roberge, R., Shaffer, R.: Facial anthropometric differences among gender, ethnicity, and age groups. Ann. Occup. Hyg. 54(4), 391–402 (2010)
57. Zwaard, S., Boele, H.-J., Alers, H., Strydis, C., Lew-Williams, C., Al-Ars, Z.: Privacy-preserving object detection & localization using distributed machine learning: a case study of infant eyeblink conditioning (2020). arXiv:2010.07259

SIM_RL: A New Approach for Integrating Simulation with Reinforcement Learning

Elad Razy(B) and Maya Herman

Open University of Israel, Raanana, Israel
[email protected]

Abstract. Computer simulation, the process of mathematical modelling performed on a computer, is designed to predict the behavior of a real-world system. As a system becomes more complex, the simulation engine must run numerous times in response to the increasing complexity of the input and the simulation process. Additionally, an expensive physical experiment needs to be performed to validate the results. This paper demonstrates an innovative, general-purpose simulation approach strengthened by reinforcement learning (RL), formalized in the SIM_RL algorithm, and using epidemic spread (COVID-19) test data. The main advantages of this approach are computational resource savings, reduced need for physical experiments, and the ability to predict system behavior based on actual results. Moreover, this approach can be used in various disciplines to solve complex simulation problems.

Keywords: Computer simulation · Machine learning · Reinforcement learning

1 Introduction

Reinforcement learning (RL) [1, 2] is a machine learning approach that deals with how an agent operates in a certain environment to increase some cumulative reward. This approach examines how an agent learns from success, failure, reward, and punishment in this endeavor. Using sequential decision problems (SDPs), RL determines the agent's reward based on a series of decisions (actions) it makes in the environment. Because real-world environments are characterized by uncertainty about the success or failure of an action, the agent must account for this as well. The combination of RL and simulation is innovative; consequently, few prior studies are available. However, [3] reviewed and discussed multiple challenges in applying RL to system optimization problems and proposed a set of metrics that can help evaluate the effectiveness of these solutions. The authors discuss the growing complexity in systems that calls for learning-based approaches and explain that RL presents a unique opportunity to address dynamic system behavior [4]. The SIM_RL algorithm utilizes RL functionality by setting the rewards based on the agent's decisions using a correlation between the simulation (guided by the agent) results and real data. The development life cycle of a traditional simulation consists of six serial stages that are repeated until the desired results are obtained: logical model,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 193–209, 2024. https://doi.org/10.1007/978-3-031-47715-7_14


E. Razy and M. Herman

computational model, high-level software implementation, software implementation, hardware-in-the-loop implementation, and experiments, as illustrated in Fig. 1.

Fig. 1. Traditional simulation development life cycle.

This paper presents an effective method that creates synergy between simulation and RL, especially in large and complex systems. As illustrated in Fig. 2, the RL process was added between the high-level software implementation and software implementation stages. This helps reduce development expenses in the most expensive steps: hardware-in-the-loop implementation and experiments. Such an approach is more robust because it takes into account the results' closeness to reality, and not only the completion of the simulation's technical run; other approaches show that RL is still not fully utilized in simulation improvements. The paper is organized as follows:

1. Section 2 overviews our problem-solving methodology.
2. Section 3 provides concrete platform implementation details.
3. Section 4 presents the results of the test case implementation.
4. Section 5 provides conclusions and a discussion of future improvements and extensions.

SIM_RL: A New Approach for Integrating Simulation

195

Fig. 2. Simulation development life cycle with SIM_RL. The reinforcement learning process is inserted into the cycle; hardware-in-the-loop implementation and experiments are marked as the expensive steps.

2 Methodology

The developed system consists of three core components: a simulation engine, the reinforcement learning agents, and an integrator.

2.1 Simulation Engine

Prior works [5, 6] define the modeling and simulation process, starting with the problem definition, the goals of the system to be modeled, and the data to be used as a reference to validate the model. Next, a logical model is designed that represents the system using objects and the activities that take place between them, as a non-software-specific description of the simulation. The logical model describes the objectives, input, output, content, assumptions, and simplifications of the problem. The next step is writing a conceptual computational model to explore the model's behavior, test hypotheses, and analyze the results. These mechanisms are provided by the software's functional components, which enable the user to visualize the simulated processes and provide decision input, data analysis, report creation, and parameter optimization. Together, the computational model and these functional components comprise the simulation program, which enables users to conduct numerical experiments.

The simulation engine represents the logical model programmatically. Input arguments are used to calibrate the simulation run. After the simulation runs for a certain time, a vector of discrete results is obtained. We use discrete event simulation, in which the variables change at discrete times and in discrete steps. The simulation program also allows for


the verification and validation of the computational model's accuracy. Verification is the process of ensuring that the computational model is consistent with the specifications of the conceptual model, and validation is the process of ensuring that the computational model is consistent with the project's goals. One of the most common techniques used to validate a model is argument optimization: the parameters are varied to explore the corresponding model behaviors as broadly as possible and bring them as close as possible to the observed or desired ones. The final goal of verification and validation is to ensure the credibility of the simulation program with respect to the project's goals.

2.2 Reinforcement Learning

Reinforcement learning (RL) is a growing subset of machine learning in which software agents take actions in order to maximize some cumulative reward. Several different forms of feedback may govern the methods of an RL system. In contrast to supervised learning algorithms, which learn a mapping from inputs to given target outputs, RL algorithms are not given target outputs (only inputs). A basic RL algorithm consists of three elements: the agent (which chooses actions in its current state), the environment (which responds to each action and provides new input to the agent), and the reward (the incentive mechanism returned by the environment). The basic schema of an RL algorithm is given in Fig. 3.

One of the most well-known RL algorithms is Q-learning, the goal of which is to find a function Q(s, a) that represents the expected reward. Thus, the Q-function represents the expected total reward that an agent in state s can receive by executing a specific action a: Q maps state-action pairs to the highest combination of the immediate reward with all future rewards that might be harvested by later actions in the trajectory. Q-learning combines policy and value functions to calculate how useful a given action is for gaining future reward. Once the agent learns this Q-function, it looks for the action at a particular state s that yields the highest value, where the value of state s under a given policy π is the expected return from starting in state s and following π thereafter.

Fig. 3. Flowchart of an RL algorithm (source https://i.stack.imgur.com/eoeSq.png).

Q*(s, a) = max_π q_π(s, a)    (1)

Once we have an optimal Q-function (Q*), we can determine the optimal policy by applying an RL algorithm to find the action that maximizes the value in each state. After t steps, the agent decides on its next step according to update rule (2). The weight of a reward received t steps in the future is γ^t, where the discount factor γ is a number between 0 and 1; it has the effect of valuing rewards received earlier more highly than those received later (reflecting the value of a "good start"). At every step t, γ may also be interpreted as the probability to succeed (or survive) [2].

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]    (2)
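Update rule (2) can be written out directly in code. The sketch below is a generic tabular Q-learning step of our own, not the authors' implementation; states, actions, and the reward are placeholders:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One application of rule (2): move Q(s, a) toward the
    bootstrapped target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

# One update from state 0 with action 1, reward 1.0, landing in state 1.
Q = defaultdict(float)  # unseen state-action pairs default to 0
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(Q[(0, 1)])  # 0.1  (alpha * r, since all future values start at 0)
```

Repeating this step over many episodes drives Q toward the optimal Q* of Eq. (1).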

2.3 SIM_RL Integrator

The main goal of the integrator is to integrate the RL manager and the simulation manager. Figure 4 shows a flowchart of the actual data, simulation engine, and RL manager. The integrator has five main functions:

1. Run the simulation engine [7] with given arguments and obtain the results as an array.
2. Compare the run results with the actual data and calculate a score according to the correlation between the calculated results and the actual results.
3. Declare this step as an action for the RL manager, and let it decide on new argument values as its next action.
4. Run the simulation engine with the new arguments.
5. Repeat these steps until the stopping condition of a sufficiently high score is met.

One of the core ideas behind integrating the simulation engine and the RL agent is the definition of the simulation engine as the central element of the environment in which the RL agent operates: the simulation engine plays the role of the RL agent's environment. Figure 4 illustrates the input, output, process, and orchestration. The input to the process is real data describing what transpired in the real world, as well as initial values for the simulation's input arguments. During the process, the RL agent decides on discrete changes to the values of the simulation engine's input arguments. The environment determines the reward/punishment values for the agent according to the correlation between the results achieved and the real data. After convergence, the simulation engine is fed with the calibrated arguments and produces future simulation results as output. The entire process is orchestrated by the process manager. After converging to satisfactory input arguments (reaching a target score), this approach should arrive at close-to-reality input. At this point, current data is retrieved from the database, and a simulation is run to obtain a prediction.
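The five integrator functions can be condensed into a single calibration loop. The following sketch is schematic and uses assumed interfaces (`run_simulation`, `correlation_score`, and the agent's `next_arguments` are hypothetical placeholders, not the authors' code):

```python
def calibrate(run_simulation, correlation_score, agent, actual_data,
              init_args, target_score=0.95, max_iters=100):
    """Integrator steps 1-5: run, score against real data, let the
    RL agent pick new arguments, repeat until the score is high."""
    args = init_args
    for _ in range(max_iters):
        results = run_simulation(args)                    # step 1
        score = correlation_score(results, actual_data)   # step 2
        if score >= target_score:                         # step 5: stop
            break
        args = agent.next_arguments(args, reward=score)   # steps 3-4
    return args, score

# Toy check: the "simulation" echoes its argument, the agent increments it.
class Greedy:
    def next_arguments(self, args, reward):
        return args + 1

args, score = calibrate(lambda a: a, lambda r, d: 1.0 - abs(r - d) / 10,
                        Greedy(), actual_data=5, init_args=0)
print(args, score)  # 5 1.0
```

In SIM_RL the calibrated `args` would then be fed back into the engine together with current data to produce a forecast.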
Fig. 4. SIM_RL algorithm flowchart.

Figure 5 provides the SIM_RL algorithm pseudocode:

• Initial process: In this block, the initial variables are loaded. Line 1 is a generic call that loads the real data into a local SIM_RL variable. Line 2 sets initial argument values based on user experience; these arguments will be calibrated during the RL learning process.
• Learning process: This block contains the learning loop. Line 4 provides the reward to the RL agent from the simulation manager based on the correlation score. Line 5 is an inherent step of the RL learning process, in which the agent's "wisdom" is saved.
• Predicting process: After the learning process, the RL agent has obtained accurate values for the simulation arguments. Line 6 calibrates the simulation engine to the input arguments that match the actual data. Line 7 completes the predictive process: after the input argument values are set, the simulation engine is run to obtain a forecast.

PROGRAM SIM_RL:
  # Initial process
  1 data ← load_past_data()
  2 arguments_init_vals ← set_vals(config_file)
  # Learning process
  3 FOR (iterations)
  4   Agent.reward ← Env.simulation_engine.run(SIM_INPUT_ARGS, data)
  5   Agent.file ← Agent.save()  # Improvement after each iteration
    ENDFOR
    # Last Agent decision arguments are most effective
  # Predicting process
  6 sim_input_args ← load_current_data(Agent.file)
  7 predicted_results ← simulation_engine.run(sim_input_args)
END.

Fig. 5. SIM_RL algorithm pseudocode.

3 SIM_RL Implementation

Our SIM_RL system consists of several modules; the simulation and RL processes needed to be kept separate. Most of the code was written in Python.

3.1 Architecture Components

The SIM_RL architecture, illustrated in Fig. 6, consists of packages, main classes, a database (DB), and the data streams between them.

Fig. 6. SIM_RL architecture.


3.2 Packages Description

An object-oriented approach is used to implement the solution, and the architecture is derived from the methodology described in Sect. 2. The main package contains: the process manager, which orchestrates the process; the DB, which contains the real-world data; the RL manager package, which contains the RL agent; and the tailored environment, which sends the modified arguments to the simulation. The simulation manager package has an auxiliary class that supports the simulation runs.

3.3 System Processes

Figure 7 shows a sequence diagram of all the events during the SIM_RL process:

• Initialization process: The data file is loaded into the system, and default values are set as input arguments to the simulation engine, along with a score threshold for the RL agent.
• Learning process: The simulation engine runs the simulation according to the input arguments loaded from the RL agent. The run results are compared with the real results, and a score is determined based on the correlation between the calculated results and the actual results.
• Prediction process: After the learning process, the simulation engine receives the best arguments as input. Current data is loaded, and a simulation can be run to obtain a forecast for the future.

Fig. 7. Sequence diagram of the SIM_RL processes.


3.4 Metrics

The RL agent can be evaluated according to the following metrics [8]:

• Average number of penalties per episode: The smaller the number, the better the performance of the RL agent. Ideally, this metric is close to zero.
• Average number of timesteps per episode: A small number of timesteps per episode is optimal, ensuring that the agent reaches the destination in the minimum number of steps (i.e., via the shortest path).
• Average reward per move: The larger the reward, the more likely it is that the agent is doing the right thing. Determining rewards is therefore a crucial part of RL. Because both timesteps and penalties are rewarded negatively, a higher average reward means that the agent reaches the destination as quickly as possible with the fewest penalties.
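These three metrics can be computed directly from per-episode logs; a minimal sketch (the field names are our own, not from the authors' code):

```python
def rl_metrics(episodes):
    """episodes: list of dicts with 'penalties', 'timesteps', 'total_reward'."""
    n = len(episodes)
    total_steps = sum(e["timesteps"] for e in episodes)
    return {
        "avg_penalties_per_episode": sum(e["penalties"] for e in episodes) / n,
        "avg_timesteps_per_episode": total_steps / n,
        "avg_reward_per_move": sum(e["total_reward"] for e in episodes) / total_steps,
    }

logs = [{"penalties": 0, "timesteps": 10, "total_reward": 8.0},
        {"penalties": 2, "timesteps": 20, "total_reward": 4.0}]
print(rl_metrics(logs))
# {'avg_penalties_per_episode': 1.0, 'avg_timesteps_per_episode': 15.0,
#  'avg_reward_per_move': 0.4}
```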

4 Test Case Objective

The test case is meant to demonstrate the usefulness of the generic algorithm on a real-world problem.

4.1 Entities for the Epidemic Spreading Test Case

Input data for this study was taken from the official website of the Israeli Ministry of Health: https://data.gov.il/dataset/covid-19. The database we used was "Corona statistical data covid-19 by area", which contains the following relevant fields: Town, Date, new_cases_on_date, and new_recoveries_on_date.

For the learning process [8–10], the simulation engine used an object matrix that represents a geographical cell containing objects that simulate a population. Under the initial conditions, the objects are scattered randomly across the surface cell. Each object has a vector that represents an initial direction of progress and a walking speed, and there are several patients at the beginning of the simulation run. The object matrix [11] provides geographical agent-based modeling, a computational method for understanding the behavior of complex systems by simulating the actions of entities in a virtual geographical cell. Each object has the attributes listed in Table 1.

Figure 8 illustrates the simulation process and demonstrates how the human entities influence and are influenced by their physical and social environments during the epidemic. At each step of the simulation, the simulation engine changes the positions of the objects to reflect the movement of people (i.e., changing the position coordinates on the x and y axes to simulate walking in a 2D space). Furthermore, the model changes the current disease state of each object based on its physical distance from other patients, its recovery duration, and its chance of death.

4.2 Calculating the RL Agent Score

The database contains the following fields:

• New_cases_on_date: the number of verified cases detected on a given day
• New_recovered_on_date: the number of recoveries detected on a given day

Table 1. Object attributes.

Field name              Description
Unique ID               –
x coordinate            Current x coordinate
y coordinate            Current y coordinate
Heading in x direction  Current heading in x direction
Heading in y direction  Current heading in y direction
Current speed           –
Current state           0 = healthy, 1 = sick, 2 = immune, 3 = dead, 4 = immune but infectious
Age                     –
Infected_since          Frame the person got infected
Recovery vector         Used in determining when someone recovers or dies
In treatment            –
Active destination      0 = random wander, 1,.. = destination matrix index
At destination          Whether arrived at destination (0 = traveling, 1 = arrived)
Wander_range_x          Wander range on the x axis for those confined to a location
Wander_range_y          Wander range on the y axis for those confined to a location
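A compact way to represent one object from Table 1 is a Python dataclass; the sketch below is our own, and the attribute names are illustrative rather than taken from the authors' code:

```python
from dataclasses import dataclass

# Disease-state codes from Table 1.
HEALTHY, SICK, IMMUNE, DEAD, IMMUNE_INFECTIOUS = 0, 1, 2, 3, 4

@dataclass
class Person:
    uid: int
    x: float = 0.0                # current x coordinate
    y: float = 0.0                # current y coordinate
    heading_x: float = 0.0        # current heading in x direction
    heading_y: float = 0.0        # current heading in y direction
    speed: float = 0.0
    state: int = HEALTHY          # disease-state code, 0..4
    age: int = 0
    infected_since: int = -1      # frame the person got infected (-1: never)
    recovery_vector: float = 0.0  # governs recovery-vs-death outcome
    in_treatment: bool = False
    active_destination: int = 0   # 0 = random wander, 1.. = destination index
    at_destination: bool = False
    wander_range_x: float = 0.0   # wander range for those confined to a place
    wander_range_y: float = 0.0

p = Person(uid=1, state=SICK, infected_since=400)
print(p.state)  # 1
```

A population is then simply a collection of such objects placed on the geographical cell.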

The score obtained by the RL agent reflects the correlation between the results of the simulation run (made according to the input arguments set by the agent) and the actual results, based on the formula specified below. This allows a reliable calculation of the correlation coefficient. For example, consider the cases in Fig. 9.

Normalized Correlation(x, y) = Σ_n x(n)·y(n) / sqrt(Σ_n x²[n] · Σ_n y²[n])    (3)

where ncc denotes the new cases calculated, rcc the recovered cases calculated, nca the new cases actual, and rca the recovered cases actual. For the recovery series in the example data:

Grade recovered = (2·1 + 3·2 + 2·4 + 2·3 + 1·2) / sqrt((2² + 3² + 2² + 2² + 1²) · (1² + 2² + 4² + 3² + 2²)) = 0.878

The realization of this calculation is the cosine of the angle between the arrays:

cos_array = dot(x, y) / (norm(x) ∗ norm(y))
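The score formula is simply the cosine similarity between two series. A direct implementation (our own sketch) reproduces the patient-series grades quoted in the text:

```python
import math

def normalized_correlation(x, y):
    """Cosine of the angle between two series: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

ncc = [1, 2, 3, 1, 0]       # new cases, calculated by the simulation
nca = [1, 1, 3, 2, 0]       # new cases, actual
print(round(normalized_correlation(ncc, nca), 2))      # 0.93

nca_far = [1, 1, 3, 10, 10]  # a reality that diverges from the simulation
print(round(normalized_correlation(ncc, nca_far), 3))  # 0.391
```

A grade near 1 rewards the agent; a diverging series yields a low grade and thus a low reward.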

Fig. 8. The matrix of objects reflecting the population during the COVID-19 spread: (a) start of the simulation; (b) after 400 steps; (c) after 600 steps – a leap in illness; (d) after 900 steps – an increase in the number of recoveries; (e) after 1,000 steps – close to recovery. Legend: gray – healthy, red – infected, green – recovered, black – deceased.

For the patient series, we obtain ncc = [1, 2, 3, 1, 0] and nca = [1, 1, 3, 2, 0], giving Grade new cases = 0.93. However, if the actual patient array were nca = [1, 1, 3, 10, 10], we would obtain Grade new cases = 0.391.

4.3 Input Variables

The simulation engine argument that the RL agent can modify discretely after running an episode is the infection range, defined as the range surrounding a sick patient within which infections can take place between entities (additional arguments are described in the further investigation section).


Fig. 9. Demonstration of the RL agent score calculation, comparing actual data of daily patients and recoveries with simulated data (simulation output).

4.4 Minimal Check

To formulate the problem in the conceptual world of the RL agent, each episode run by the agent must be given a score based on a comparison of the output data at the end of the simulation with the real-world data. As an illustration, consider a simplified example with a single argument: the infection range between two objects, within which the chance of being infected is greater than zero. The maximum infection range is arbitrarily set to 10 m, and at each step, the RL agent can adjust it by 1 m. Because the correlation with the truth data can be 100% in the case of a full match, we define the agent's target score as 1. This yields the parameters shown in Fig. 10 for the agent's learning reward.

4.5 Running the Simulation Engine on Actual Data

After the learning process, the simulation engine receives the proper arguments as input. Current data is loaded, and a simulation can be run to obtain a prediction (Fig. 11).


Actions:
  MOVE_LF = 0
  MOVE_RT = 1
  MAX_STEPS = 10
Rewards:
  REWARD = correlation factor

Fig. 10. RL agent attributes.
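The attributes in Fig. 10 suggest a tiny one-dimensional environment in which the agent nudges the infection range left or right and receives the correlation score as a reward. The sketch below is illustrative only: the scoring function is a stub, and none of the class or parameter names come from the authors' code:

```python
MOVE_LF, MOVE_RT, MAX_STEPS = 0, 1, 10

class InfectionRangeEnv:
    """The state is the current infection range in metres (0..MAX_STEPS)."""
    def __init__(self, score_fn, start=5):
        self.score_fn = score_fn  # maps range -> correlation with real data
        self.range_m = start

    def step(self, action):
        if action == MOVE_LF:
            self.range_m = max(0, self.range_m - 1)
        elif action == MOVE_RT:
            self.range_m = min(MAX_STEPS, self.range_m + 1)
        reward = self.score_fn(self.range_m)
        done = reward >= 1.0      # target score: perfect correlation
        return self.range_m, reward, done

# Stub: pretend a range of 2 m matches reality exactly.
env = InfectionRangeEnv(score_fn=lambda r: 1.0 - abs(r - 2) / 10, start=5)
state, reward, done = env.step(MOVE_LF)
print(state, round(reward, 1), done)  # 4 0.8 False
```

In the full system, `score_fn` would run a simulation episode and return the normalized correlation with the real data.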

Fig. 11. Simulation expected results.

4.6 Classes Description

The simulation manager loads the simulation engine, loads the input data file, sets start values for the variables, and runs the simulation in steps. To solve the epidemic spread test case, the simulation manager uses the following classes:

• Population: This file contains the matrix of objects as well as their properties.
• Infection: This file deals with the infection logic. The objects are scanned, and according to the geographical distances between them, the chances of infection and/or recovery and the probability of mortality are calculated at each simulation step. Accordingly, the objects become infected, recover, and die.

Note that for a different problem, classes adapted to the content world of that specific problem must be implemented.


4.7 System GUI

Fig. 12. System GUI.

The system GUI (Fig. 12) illustrates the learning process and the results. In the learning phase (upper section), the user can choose the algorithm and how many learning iterations to perform; Fig. 12 shows an example with five learning iterations. The bottom section compares the simulated output values produced by the calibrated simulation engine with the true values of what occurred on those dates.

4.8 Assumptions for the Test Case

Based on the test case data, two assumptions were made for this study:

• The published morbidity data from the Ministry of Health represent a complex reality that is difficult to predict due to many influencing factors, such as differences in demographic data and different behavioral characteristics between populations.
• The simulation in the test case represents a simplified case of an area cell of 1 sq. km containing a population of 2,000 women and men moving randomly and then moving in a common direction, as might occur at a social event after which the public disperses.


5 Experimental Results

5.1 Training Data

Figure 13 shows the real COVID-19 epidemic data used to train the algorithm: the newly infected population versus the recovered population for each day between June and December 2020 in Area 225 (Giv'at HaPrahim, Ashqelon, Israel).

Fig. 13. Real epidemic data area 225.

5.2 Simulated Data, Before and After Learning

Figures 14a and b show the simulation engine results before and after the learning process, respectively. For this study, we did not include the influence of hospitalization and deaths; the infection range was the only argument that changed. As the figure illustrates, before SIM_RL was run, the correlations between the training data and the actual data for the rates of new infections and new recoveries were low, at 0.106 and 0.226, respectively. After SIM_RL was run, the correlations increased to 0.779 and 0.838, respectively.

5.3 Further Investigation

The discussed simulation enables the calibration of additional arguments beyond the infection range (Table 2). These can lead to an even more accurate correlation with reality.


Fig. 14. SIM_RL simulation results. (a) Low correlation with the training data before the learning process. (b) High correlation with the training data after the learning process.

Table 2. Input arguments to the simulation engine.

Field name         Description
Infection_chance   Chance that an infection spreads to nearby healthy people each tick
Recovery_duration  How many ticks it may take to recover from the illness
Mortality_chance   Global baseline chance of dying from the disease


6 Conclusion

In this paper, we described SIM_RL, an innovative general-purpose approach that combines simulation with the strength of RL to save computational resources, reduce the need for physical experiments, and predict future system behavior. This approach is useful for large and complex systems for which it is difficult to find exact and accurate input arguments, and combining the two disciplines may lead to significant cost savings. The proposed method was demonstrated on an epidemic spread investigation. As such, it could provide a tool to support public policy decisions, assist medical authorities (e.g., by helping medical centers prepare for a significant increase in patient traffic), and support large-scale economic responses. Moreover, because SIM_RL is a general-purpose algorithm, it can also be applied to a variety of other disciplines.

References

1. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach (2020). http://aima.cs.berkeley.edu/
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction (2018). http://incompleteideas.net/sutton/book/ebook/the-book.html
3. Paternina-Arboleda, C., Montoya-Torres, J.R.: Simulation-optimization using a reinforcement learning approach (2008)
4. https://www.researchgate.net/publication/221528410_Simulation-optimization_using_a_reinforcement_learning_approach
5. Law, A.M., Kelton, W.D.: Simulation Modeling and Analysis (2014). https://fac.ksu.edu.sa/sites/default/files/index.pdf
6. van Gent, P.: python_corona_simulation, GIT open source. https://github.com/paulvangentcom/python_corona_simulation
7. Willke, T., Gonzalez, J., et al.: A view on deep reinforcement learning in system optimization (2019). https://arxiv.org/abs/1908.01275
8. Yoon, C.: Understanding actor critic methods and A2C (2019). https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f
9. Le Lan, C., Bellemare, M.G., Castro, P.S.: Metrics and continuity in reinforcement learning (2021). https://ojs.aaai.org/index.php/AAAI/article/view/17005
10. Kamalov, F., Rajab, K., et al.: Deep learning for COVID-19 forecasting: state-of-the-art review (2022). https://doi.org/10.1016/j.neucom.2022.09.005
11. Badham, J., Chattoe-Brown, E., et al.: Developing agent-based models of complex health behaviour (2018). https://www.sciencedirect.com/science/article/pii/S1353829218301047

Grouping Shapley Value Feature Importances of Random Forests for Explainable Yield Prediction

Florian Huber¹(B), Hannes Engler², Anna Kicherer², Katja Herzog², Reinhard Töpfer², and Volker Steinhage¹

¹ Department of Computer Science IV, University of Bonn, Friedrich-Hirzebruch-Allee 8, 53121 Bonn, Germany
[email protected]
² Julius Kühn-Institut, Federal Research Centre of Cultivated Plants, Institute for Grapevine Breeding Geilweilerhof, 76833 Siebeldingen, Germany

Abstract. Explainability in yield prediction helps us fully explore the potential of machine learning models that are already able to achieve high accuracy for a variety of yield prediction scenarios. The data included for the prediction of yields are intricate and the models are often difficult to understand. However, understanding the models can be simplified by using natural groupings of the input features. Grouping can be achieved, for example, by the time the features are captured or by the sensor used to do so. The state-of-the-art for interpreting machine learning models is currently defined by the game-theoretic approach of Shapley values. To handle groups of features, the calculated Shapley values are typically added together, ignoring the theoretical limitations of this approach. We explain the concept of Shapley values directly computed for predefined groups of features and introduce an algorithm to compute them efficiently on tree structures. We provide a blueprint for designing swarm plots that combine many local explanations for global understanding. Extensive evaluation of two different yield prediction problems shows the worth of our approach and demonstrates how we can enable a better understanding of yield prediction models in the future, ultimately leading to mutual enrichment of research and application.

Keywords: Shapley values · Explainability · Regression forests · Yield prediction

1 Introduction

Providing a secure source of food for the world's population is an increasingly difficult challenge. While the world population grows, the available farmland around the world is limited, pushing us to reach the limits of efficiency in agricultural environments. One way to ensure precise planning and optimized workflows is to

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 210–228, 2024. https://doi.org/10.1007/978-3-031-47715-7_15


predict yields with machine learning approaches, which have achieved increasingly high accuracy over time [24]. However, with increasing accuracy, the complexity of the models and input data also expands. Popular data sources include remote sensing data [11], fertilization schedules [18], climate information [5], and soil evaluation [21], to name a few. Successful models utilize over 1000 features to predict yield [11], making models and data increasingly difficult to understand. The investigation of yield prediction models is a time-consuming part of the research. Often, models are investigated by defining feature subsets and correlating the different accuracies reached when these subsets are absent from training with their importance for yield prediction [4,5,28]. Not only is the number of possible feature combinations exponential in the number of groups to be investigated, but full training, tuning, and testing steps are also necessary for every combination. Lundberg and Lee [16] argue that Shapley values are capable of offering a unified framework for explainability and provide a fast way to calculate Shapley value feature importances for tree structures [15]. While Shapley value feature importances rest on a sound mathematical foundation with desirable properties, such as the individual feature explanations adding up to the prediction output of the model, they are not well suited to investigating models that rely on a large number of input features and are, therefore, not naturally equipped to explain yield prediction models. Without grouping features together, it is harder to find patterns within the data or even to present them in a meaningful way. Furthermore, when feature importances are used to decide which sensors to buy, or at which times a plant should be monitored, it is mostly a batch of data that is procured together; grouping features for feature importances allows us to evaluate all connected data as a unit.

The aim of our study is to address these problems and provide a global explanation of yield prediction models through a direct extension of Shapley values. We want to preserve the desirable mathematical properties and stay in line with the unified approach, while extending the paradigm to handle large groups of features. Yield prediction tasks usually have some properties in common that we can exploit to do so. First, within the input data, we can pick out similar features and group them to obtain explanations: a multitude of features are often captured by the same sensors or associated with a similar time frame. Second, tree structures achieve state-of-the-art results for yield prediction in many use cases [5,11,21]. Combining these observations, we are able to focus on feature groups to calculate Shapley value feature importances; moreover, focusing on tree structures allows for fast polynomial-time calculation. We will refer to this paradigm as Grouped Shapley Values (GSV). Shapley value feature importances are local explanations, meaning that they are calculated for individual data instances. A further objective of this study is to take advantage of many local explanations to obtain information on the global structure of the model by designing specific swarm plots. Achieving accessible explanations for tasks with a multitude of input features will help the exchange of information between computer scientists and domain


experts, which is very important for modeling the intricacies of predicting different yields. Furthermore, being able to explain why the model made its prediction can help society accept automated yield prediction. The information gained can be used to decide which sensors are worth buying and which plant development steps are worth monitoring for future yield prediction surveys. In summary, the contributions of this study are the following:

1. We examine a definition of GSV for predefined coalitions that conserves the axiom of efficiency, which is important for explainable machine learning applications (cf. Sect. 3.1).
2. We describe a graphical structure that allows the combination of many local explanations to unlock global understanding of the models (cf. Sect. 3.2).
3. To our knowledge, we are the first to describe a polynomial algorithm to calculate GSV feature importances for random forests, based on a definition where groups of features compete directly against other groups (cf. Sect. 3.4).
4. We show the capabilities of our approach by providing an exemplary analysis of two different yield prediction models (cf. Sect. 4).

The rest of the document is organized as follows: We summarize the related articles that influence our research in Sect. 2. In Sect. 3, we describe our approach to GSV, give an idea of the polynomial-time algorithm to calculate GSV on tree structures (Algorithm 1), and explain how we leverage many local explanations to obtain global model understanding. An example of the usage of our algorithm in the context of yield prediction is presented in Sect. 4, and we draw conclusions in Sect. 5. A detailed presentation of Algorithm 1 is given in the Appendix.

2 Related Work

We review related work in two categories. First, we examine the state of explainability in yield prediction. Second, we describe the history of using Shapley values as a feature importance measure, together with previous efforts to extend the Shapley value paradigm to groups of features, both in a general game-theoretic and in a machine learning context.

Yield prediction as a research problem is being addressed with a multitude of approaches, both for modeling and for explaining the results. Many deep learning approaches are applied to yield prediction, for example, the work of You et al. [28], Wang et al. [26], and Khaki et al. [13], to name a few. Although explanation is difficult due to the black-box nature of deep learning models, some efforts have been made to explain the results. You et al. [28] correlate the importance of a feature with the decrease in the accuracy of the model when information on the feature is missing. They render the information of a feature useless by randomly permuting its values throughout the dataset. With this approach, they can assess the importance of whole feature groups by permuting their values simultaneously. Experiments show that the red and near-infrared bands of satellite images are important for their yield prediction model. Another way to gain

Grouping Shapley Values for Yield Predictions

213

explanations for the output of deep learning frameworks is deployed by Khaki et al. [13]. They backpropagate the output of active neurons in the last layer and are able to find active neurons in the first layer that correspond to features of the input space. An explainable alternative to deep learning for yield prediction can be found in regression forests. Tree structures are prevalent in yield prediction and show state-of-the-art results in a multitude of real-world scenarios. Huber et al. [11] derived state-of-the-art results for the prediction of soybean yields in the United States. They applied regression forests created by eXtreme Gradient Boosting (XGBoost) [6]. Feature importances are analyzed by adding multiple Shapley value feature importances over the whole training data, finding that the red and near-infrared bands are very important, especially in the time close to harvest. Díaz et al. [8] use the M5-Prime algorithm [27] to create regression trees for yield prediction in citrus orchards in Argentina. Since singular trees are considered instead of forests, they conduct a visual analysis of the feature importances based on the resulting tree structure. Similarly, Bobeda et al. [4] use the M5-Prime algorithm to predict yields of citrus orchards in Argentina. They use another popular method to understand the yield predictions and explain their model output: by creating multiple subsets of features and evaluating their model in the absence of each of the subsets, they find that it is not necessary to count the fruits in the trees multiple times a year, and the results are only slightly worse when the volume of the trees' crowns is instead calculated once. Since we can understand how tree models function, we can rely on the inner relations of the trees to find feature importances that can explain the model's output.
Most famously, the Mean Decrease in Impurity (MDI) [14] can be used to give a measure of the number of splits made by each feature, weighted by the impact of the individual split, that is, the proportion of training samples divided. This internal measure of importance is used by Sirsat et al. [21] to select expressive features when predicting grapevine yields based on phenological information, soil properties, and climatic conditions. Meng et al. [18] use this method to show the high importance of vegetation indices when predicting maize yield in California on a field scale. Shapley values are a game-theoretic measure for solving the fair distribution of resources in a cooperative game. The value awarded to a player is calculated by averaging the player's contribution over all possible coalitions that they could join within the game. The Shapley value was first used as a feature importance measure by Lundberg and Lee [16]. The idea is to assign each feature an importance value for a particular prediction. The choice of the game-theoretic construct for solving cooperative games, namely the Shapley value, was made because of its desirable theoretical properties and results that are in line with human intuition. One of the mathematical properties allows them to provide an additive feature attribution method, which means that the sum of the feature importances equals the actual model output for the example. The calculation of Shapley values is, in general, an NP-hard problem. But for decision trees, exploiting the tree structure allows computations in polynomial

214

F. Huber et al.

time, as explained in [15], giving the first polynomial-time algorithm to compute explanations on tree structures based on game theory. The work also gives an idea of how to use many local explanations to represent the global structure of the model. Lastly, we want to highlight other efforts to extend Shapley value feature importances towards groups of features. On the one hand, there is the classical game-theoretic view on this topic. However, the relevant works [9,10,17] all fail to preserve the efficiency property, which means that the sum of all attribution values of the features will not coincide with the output of the model; they are therefore not suitable as a basis for explanations. The work of Jullum et al. [12] recognizes this weakness and presents a method for extending the Shapley value to groups of players in the context of feature importance. This allows for an easier representation of the results together with a lower computational complexity. Amoukou et al. [1] base an approach to evaluating groups of features on a different definition of grouped Shapley values, where groups of players continue to play against individuals. The work is extended by giving a fast computation for tree structures and selecting minimal subsets of features, such that the classifier will make the same decision with high probability.

3 Approach

In this section, we present our approach to Grouped Shapley Values (GSV). The approach is divided into multiple steps: (1) definition of GSV for general cooperative games, (2) transfer of GSV to obtain local explanations for machine learning models, (3) use of local explanations to gain global understanding, and finally (4) calculation of local explanations in polynomial time for tree structures.

3.1 The Value of Predefined Coalitions in a Cooperative Game

As mentioned above, the classic Shapley value is defined to solve the fair distribution of resources within a cooperative game. A cooperative game is a tuple (P, v), where P = {1, 2, . . . , p} is the finite set of players and v : 2^P → R is the value function. A subset of the players is called a coalition, and the value function assigns a value to each coalition of players. The classic Shapley value then represents the average contribution of a player to all possible coalitions that the players in P can form [20]. For our approach, we extend the classic Shapley value formula by allowing the players to form predefined coalitions before the game starts. Players in a predefined coalition are evaluated together and are never separated. To denote the predefined coalitions of players, we assume a partition of the set of players C = {C1, . . . , Ck}, where each Ci is a nonempty subset of P and represents a different predefined coalition. To obtain the GSV, the classic definition of Shapley values is restricted to only average the contribution of the group Ci over all possible subsets S that can be built from predefined coalitions within C. Therefore, the GSV ϕCi(v) for a predefined coalition Ci depending on the value function v can be defined as:

ϕCi (v) = Σ_{S ⊆ C\{Ci}} [ |S|! (k − |S| − 1)! / k! ] (v(∪S ∪ {Ci}) − v(∪S)),    (1)

where ∪S describes the union of all selected sets of C that are in S. We note that this definition allows the predefined coalitions of players to have varying sizes, which is very useful in terms of feature importances, since natural groupings are mostly related to the origin of the features and vary throughout most datasets. To understand the GSV better, we can take a look at the fraction at the beginning of Eq. (1). The fraction evolves from the original Shapley value definition, which is obtained by permuting the set of players and averaging the difference in the value function, evaluated over the coalition of all players that precede a player Ci in the given order, with and without Ci itself. The number of preceding players in our formula coincides with the number of players in the set S, giving |S|! possible orders. Similarly, the players succeeding Ci have (k − |S| − 1)! possible orderings. Normalizing with all k! possible permutations results in the factor in Eq. (1). Since the equation is a direct extension of the classical Shapley value, all desirable game-theoretic properties still hold (efficiency (2), symmetry, dummy variable, and additivity). Most importantly, the efficiency axiom ensures that the Shapley value precisely distributes the gain produced by the coalition consisting of all players among all players:

efficiency:  Σ_{Ci ∈ C} ϕCi (v) = v(∪C).    (2)

Later, the axiom of efficiency translates into local explanations that always add up to explain the output value of the model for the explained data point. Exactly this important axiom of efficiency is not preserved in the game-theoretic approaches to extending Shapley values to predefined coalitions proposed in [9,10,17]. Therefore, these approaches are not applicable to explainable machine learning, as the explanations would not add up to explain the model outcome. A naive solution to aggregate Shapley values within a group of players is to add the individual Shapley values [11]. To show that this option is not appropriate and produces counterintuitive results, we analyze a minimal example based on the classic illustrative glove game [2].

The Glove Game Example: Within the glove game, we observe three players P = {1, 2, 3} trying to complete a pair of gloves. Players 1 and 2 each have a left glove, while player 3 has a right glove. The value function v(S) evaluates to 1 if the set S contains a matching pair of gloves and to 0 otherwise. Calculating the classic Shapley value, we obtain ϕ1(v) = ϕ2(v) = 1/6 and ϕ3(v) = 4/6. The results follow our direct intuition that player 3 is the most important player in the game, since it is the only player who can complete a pair of gloves. Our observation changes when players 1 and 2 form a predefined coalition. This means C1 = {1, 2} and C2 = {3}. Both groups should be valued equally within the game, as having multiple left gloves does not increase the value function, and only a combination of C1 and C2 can build a pair of gloves. Using Eq. (1), we observe that our definition of grouped Shapley values follows this intuition by valuing ϕC1(v) = ϕC2(v) = 1/2, while summing the initial values, ϕ1(v) + ϕ2(v) = 2/6, would undervalue the coalition of players 1 and 2. Note that we can increase the gap between both approaches by adding more players who own a left glove to the game.
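For a small number of groups, Eq. (1) can be evaluated by brute force. The following sketch (plain Python; function and variable names are our own) reproduces the glove game numbers and shows that the grouped values still satisfy the efficiency axiom:

```python
from itertools import combinations
from math import factorial

def shapley_groups(groups, value):
    """Brute-force Grouped Shapley Values via Eq. (1):
    every predefined coalition is treated as a single player."""
    k = len(groups)
    phi = {}
    for i, Ci in enumerate(groups):
        others = [g for j, g in enumerate(groups) if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                union_S = set().union(*S) if S else set()
                weight = factorial(r) * factorial(k - r - 1) / factorial(k)
                total += weight * (value(union_S | set(Ci)) - value(union_S))
        phi[i] = total
    return phi

# Glove game: players 1 and 2 each hold a left glove, player 3 a right glove.
def v(S):
    return 1.0 if ({1, 2} & S) and (3 in S) else 0.0

classic = shapley_groups([{1}, {2}, {3}], v)  # singleton groups: 1/6, 1/6, 4/6
grouped = shapley_groups([{1, 2}, {3}], v)    # C1 = {1, 2}, C2 = {3}: 1/2, 1/2
```

Summing either result recovers v(P) = 1, illustrating the efficiency axiom for both the classic and the grouped definition.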

3.2 From Grouped Shapley Values to Local Explanations

To use GSV to better understand yield prediction, we define a game (P, v) in the context of a machine learning model M trained on a dataset X ∈ R^{n×m} with corresponding targets y ∈ R^m. That is, the dataset consists of n features {f1, . . . , fn} and m data points used to create the model. We obtain local explanations for a fixed data point x ∈ X by interpreting each feature as a player in a cooperative game, so P = {f1, . . . , fn}. For a subset of features S ⊆ P, we want the value function v(S) to represent the answer of the model M for the data point x, assuming only the values of the features in S are known. For any predefined coalition Ci ⊆ P and set S ⊆ C\{Ci}, the difference between v(∪S) and v(∪S ∪ {Ci}) describes the change in the output of the model given additional knowledge of the values of the features in Ci. The idea of the GSV is to average these contributions over all possible combinations of other predefined coalitions S and thereby estimate the impact of the specific values of the features in Ci on the model output. To estimate the model answer based on limited access to features, we calculate the expected answer of the model M for a data point x where only the features in S are known: E[M(x)|S]. To do so, we take advantage of tree structures, which have been shown to work well for yield prediction. To estimate the answer of a single tree, we traverse the tree as we normally would. If, while traversing, we encounter a feature F that is not included in S, we estimate the average model answer from data points in our training dataset that are similar to x with regard to the set S. Similar data points are defined as points that induce the same traversal of the tree as the data point x for every feature in S.
For the feature F that is not in S, we calculate the weighted average of the model answer according to the number of similar data points that follow the two possible branches of the tree. We give exemplary calculations in Fig. 1 and put emphasis on handling the unknown feature "Rain" in part (a) and "Temp day" in part (b). The procedure described gives us a naive recursive algorithm to calculate the value function of Eq. (1) for any given data point and subset. By iterating over all necessary subsets of features and building the sum, we can calculate the GSV. In Sect. 3.4 we describe a procedure for calculating Eq. (1) in polynomial time.
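The estimation of E[M(x)|S] on a single tree can be sketched on a hypothetical toy tree loosely mimicking Fig. 1; all feature names, thresholds, covers, and leaf values below are illustrative and not taken from the paper:

```python
# Hypothetical toy regression tree; "cover" counts the training points
# reaching a node. Feature names, thresholds, and values are illustrative.
tree = {
    "feature": "rain", "threshold": 50.0, "cover": 100,
    "left":  {"feature": "temp_day", "threshold": 20.0, "cover": 60,
              "left":  {"value": 8.0, "cover": 40},
              "right": {"value": 5.0, "cover": 20}},
    "right": {"value": 3.0, "cover": 40},
}

def expected_answer(node, x, known):
    """E[M(x)|S]: traverse normally on features in `known`; on an unknown
    feature, average both children weighted by their training cover."""
    if "value" in node:
        return node["value"]
    if node["feature"] in known:
        child = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        return expected_answer(child, x, known)
    left, right = node["left"], node["right"]
    total = left["cover"] + right["cover"]
    return (left["cover"] * expected_answer(left, x, known)
            + right["cover"] * expected_answer(right, x, known)) / total

x = {"rain": 30.0, "temp_day": 15.0}
expected_answer(tree, x, {"rain", "temp_day"})  # true path -> leaf value 8.0
expected_answer(tree, x, {"temp_day"})          # "rain" unknown: cover-weighted average
```

Calling this value estimator for every subset S in Eq. (1) yields the naive exponential algorithm described above.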

3.3 From Local Explanations to Global Understanding

Fig. 1. Two calculations of the model output E[M(x)|S] for a fixed data point x, feature subset S, and model M. The number of data points covered by each node is indicated by the number next to it. The same path, ending in the prediction labeled "8 kg", is used to estimate the model answer for two different feature subsets S.

The described approach for GSV is capable of obtaining local explanations for any yield prediction model, that is, explanations of the impact of each feature for a specific data point. To continue, we want access to a global understanding of the model. Only then can we decide to drop features of low importance or analyze general patterns within the data that will help us to understand the yields. The low dimensionality of the GSV allows a clear representation of the importance of the features. The first step is to calculate the local GSV for a variety of data points. We can then utilize specialized swarm plots for a joint representation, to get an idea not only of the magnitude of impact each feature group possesses, but also of the impact of high and low feature values on the final yield prediction. The information gained can then serve as a baseline for further experiments to decide whether the machine learning task at hand relies on a specific group of features, and can lead to decisions such as not buying a certain sensor or capturing in-field information less often. We build the swarm plots as follows: After grouping the features, we decide on an aggregated value to represent the magnitude of the features in the group. For similar features, such as features captured by the same sensor in multiple timeframes, the mean value of all features serves this purpose. Since the number of groups is limited, we can even visualize multiple swarm plots in one figure, where each swarm represents a feature group. The x-axis is then used to show the GSV of the respective group, indicating this feature group's impact on the prediction in the grand scheme. Finally, after normalizing the representative values for all groups, we can use the hue of the individual points to highlight how high and low values of the feature groups influence the prediction of the model. Examples are shown in Figs. 2 and 3. Having calculated the GSV, they can help us to understand the impact of the group of features Ci on our model for the specific data point x. Each GSV can be interpreted as the difference it made for our model that the features in Ci are valued within x the way they are, in comparison to what the model would output if these values were unknown. The higher the absolute value, the more impact is attributed to the feature group.
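The data preparation behind these swarm plots — one aggregated representative value per group, min-max normalized to drive the hue — can be sketched as follows (group names and all values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 local explanations for three feature groups,
# each group backed by 3 raw features (names and values are illustrative).
n_points, groups = 50, ["red", "nir", "temp_day"]
gsv = {g: rng.normal(0.0, 1.0, n_points) for g in groups}         # x-positions
feats = {g: rng.normal(0.0, 1.0, (n_points, 3)) for g in groups}  # raw values

def swarm_data(gsv_g, feats_g):
    """One swarm per group: x = GSV, hue = normalized representative value."""
    rep = feats_g.mean(axis=1)            # aggregated representative per point
    hue = (rep - rep.min()) / (rep.max() - rep.min())  # 0 = blue, 1 = red
    return gsv_g, hue

# Plotting, e.g. with matplotlib:
#   plt.scatter(x, jitter, c=hue, cmap="coolwarm")  for each group's swarm
x, hue = swarm_data(gsv["red"], feats["red"])
```

The mean over the raw features is only one possible representative; any aggregate that summarizes the group meaningfully can drive the hue.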

3.4 Grouped Shapley Values on Tree Structures

In general, calculating Shapley values is known to be NP-hard [7]. The use of a priori predefined coalitions reduces the complexity of the task by reducing the number of summands in Eq. (1) compared to the classic Shapley value. We have already established in Sect. 3.2 how to estimate the value function v(S) for a model M and a fixed data point x, and how to use it to calculate Eq. (1). Understanding the naive algorithm paves the way for formulating an algorithm capable of solving Eq. (1) in polynomial time. Lundberg et al. [15] give a polynomial-time algorithm for the calculation of classic Shapley value feature importances on tree structures. On the basis of their work, we are able to formulate an algorithm to calculate the GSV feature importances in polynomial time. An abstract representation of the result can be seen in Algorithm 1, where we forego the exact weight update to achieve a more streamlined representation. An exact version of the algorithm can be found in the Appendix (Algorithm 2). We have already described an intuition for calculating Eq. (1) by iterating over all possible subsets of the sum and estimating the answer of the value function. For each subset S we have to traverse multiple paths of the tree, as highlighted in Fig. 1, since every time we need to split at a feature not in S, we continue to build the weighted average over both possible child branches. Similarly, each path of the model M is used in multiple calculations for the sets of Eq. (1). In the examples in Fig. 1, we see the path ending in "8 kg" traversed for two different subsets S. The main idea of the polynomial algorithm is to traverse the entire tree only once while keeping track of the contribution that each individual path makes to all possible subsets. At the end of the traversal of each path, the grouped Shapley values are updated accordingly.
The weights are determined by three different factors that need to be tracked and updated throughout the algorithm. First, the sizes of the possible sets S, as necessary for the factor in Eq. (1) (denoted w_frac in Algorithm 1). Second, the fraction of training examples that follow the branches, as explained in the example in Fig. 1. Third, for every group Ci for which we want to calculate the GSV, we need a sign depending on whether the path traversal assumes knowledge of the features in Ci (positive) or not (negative). The weights are updated consecutively while traversing the tree. We focus on line 7 of the algorithm, where we need to check whether the feature group was already represented along the currently traversed path. If this is the case, the structure of subsets leading down the path does not change, since all features within a group are treated as a unit.


Algorithm 1. Polynomial Group Shapley Values for Trees (Simplified)
1: procedure GroupTreeShapleyValue(Datapoint: x, Model: M, Coalitions: C)
2:     ϕ = array of len(C) zeros                   ▷ Storage for GSV
3:     procedure Expand(currentNode, path, weights)
4:         path, weights = UpdateWeights(currentNode, path, weights)
5:         Ci = group of currentNode.getFeature()  ▷ Get the group of the node's split feature
6:         if currentNode is an inner node then
7:             if a previous feature F along the path is also in Ci then
8:                 UNDO the UpdateWeights for F    ▷ Subset sizes remain the same
9:             end if
10:            Child1 = traverse further following the values of x
11:            Child2 = traverse further into the other child
12:            Expand(Child1, path, weights)
13:            Expand(Child2, path, weights)
14:        end if
15:        if currentNode is a leaf then
16:            for node in path do
17:                Cj = group of node.getFeature()
18:                w_frac = weights according to subsets along the path with regard to Cj
19:                w_pos = weights of splits according to features in Cj
20:                w_neg = weights of splits without information of features in Cj
21:                ϕ[j] += w_frac · (w_pos − w_neg) · currentNode.getValue()
22:            end for
23:        end if
24:    end procedure
25:    Expand(root, path = [], weights = [])       ▷ Start at the root with empty path and no weights
26:    return ϕ
27: end procedure

The algorithm was first defined in [15] for classic Shapley values; we altered it to calculate the grouped Shapley values. The correctness of our algorithm follows directly from the work of Lundberg et al. [15], since we only consider multiple features to be handled as if they were the same player within the equation. Since the check for occurring groups can be made via a lookup table, the run-time is still bounded by O(T L D²), with T being the number of trees within the random forest. This factor appears because we need to execute the algorithm for every individual tree. L refers to the maximum number of leaves within the trees, and D is the maximum depth of the trees.

4 Experimental Results

To show how grouped Shapley values can be used for yield prediction, we analyze models based on two different yield prediction datasets. Each dataset comes with different intricacies and challenges to explainability. The best insights about a data domain can be gained when the models are well fitted to the problem, since otherwise we would try to explain patterns that are not learned correctly. We build a regression forest for the individual problems using eXtreme Gradient Boosting (XGBoost) [6] and tune the models using Tree-Structured Parzen Estimation (TPE) [3]. XGBoost builds the regression tree ensemble from the training data by iteratively adding new trees to minimize the residual error on the training set, that is, the difference between the prediction made by the sum of all previously built trees and the actual target output. For each dataset, we use 20% of the training data for validation to apply TPE. The choice of testing sets depends on the data at hand, so as to be as close as possible to real-world scenarios.
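A minimal sketch of this setup, using scikit-learn's GradientBoostingRegressor as a dependency-light stand-in for XGBoost and a plain random search in place of TPE (both substitutions are ours; the dataset and hyperparameter ranges are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))                       # illustrative inputs
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=400)

# Hold out 20% of the training data for hyperparameter validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_model, best_mse = None, float("inf")
for _ in range(10):                                 # random search stand-in for TPE
    params = {
        "n_estimators": int(rng.integers(50, 300)),
        "max_depth": int(rng.integers(2, 6)),
        "learning_rate": float(rng.uniform(0.01, 0.3)),
    }
    model = GradientBoostingRegressor(random_state=0, **params).fit(X_tr, y_tr)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_model, best_mse = model, mse
```

With XGBoost and a TPE library, only the model constructor and the search loop change; the train/validation protocol stays the same.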

4.1 Soybean Yield Prediction Based on Remote Sensing Data

The first model is trained to predict county-level soybean yields in the United States. The real-world yield data are taken from the U.S. Department of Agriculture [23]. Grouped Shapley values are necessary here, since the input data consist of 1131 features. The features are taken from remote sensing images and originate from 11 different bands captured in 8-day composites at 34 different time steps. The first 7 bands are from the MODIS satellite and differ in the wavelengths depicted. The temperature bands are also MODIS products and capture the average temperature during the day and at night [25]. The precipitation and vapor pressure bands are taken from the Daymet V4 satellite [22]. Following [11], each band at each time step is compressed to three values, that is, the mean, a 20% quantile, and an 80% quantile. Figure 2 shows the swarm plots resulting from grouping the features with respect to their spectral bands. We note that the sum of all GSV of a local explanation results in the model output for the specific data point. Therefore, each grouped Shapley value describes how knowing the data point's values for the features within the group impacted the model output, in comparison to the absence of these values. If we look, for example, at the values on the far right within the swarm plot for the handcrafted features in Fig. 2, we know that there are data points for which we would expect a ca. 15 bu/ac lower predicted yield if the handcrafted features were unknown. Furthermore, we know that this behavior is caused by the high overall values of the feature group, since the respective data points are colored in red. When iterating through the swarm plots, we see interesting and insightful patterns in most of them.
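The per-band compression into three summary statistics can be sketched as follows (pixel values are simulated; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pixel values of one band of one 8-day composite for one county.
pixels = rng.normal(loc=0.4, scale=0.1, size=5000)

# Following [11]: compress the band at this time step to three values.
features = np.array([
    pixels.mean(),
    np.quantile(pixels, 0.2),
    np.quantile(pixels, 0.8),
])
# Per band: 34 time steps × 3 statistics = 102 features (the groups in Fig. 2).
```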
The following insights can be gained from the corresponding swarm plots:

Red—620−670 nm and NIR—841−876 nm: These two feature groups are components of the well-known Normalized Difference Vegetation Index (NDVI) [19], which has historically been used to summarize remote sensing images to predict yields. The NDVI is calculated by dividing the difference between the NIR and the red band by their sum. This means that a high NIR value will increase the NDVI and is correlated with a higher yield, while the opposite holds for the red band. This coincides with the plots in Fig. 2, as can be seen by the reversed order of the red and blue points in the two plots. For the red band, the blue dots are mostly on the positive side of the plot, meaning that a low value for this band coincides with an increased yield prediction. For the NIR band, we observe the opposite. The dots colored blue make the biggest negative impact on the model of all feature groups, reducing the model's prediction by more than 5 bu/ac.

Blue—459−479 nm and Green—545−564 nm: Both bands have historically not been used for yield prediction and also have little impact on our prediction models, as there are no dots within the plots that show a high GSV.

NIR—1230−1250 nm and IR—1628−1652 nm: These feature groups show lower impacts, indicated by very narrow swarm plots. But for both bands, we see a tendency for higher values to coincide with lower predictions, since the negative impacts on the yield predictions are all recorded for red-colored dots.

IR—2105−2155 nm: This feature group shows a higher impact, indicated by a larger swarm plot. Since we find red and blue dots at both ends of the spectrum, we cannot derive a pattern or interpret the influence of higher or lower values. This means that the model's interpretation of this feature group is highly influenced by the surrounding feature groups, yet the group is still important for deriving the final prediction.

TempDay and TempNight: The temperature at day shows a larger swarm plot than the temperature at night and is therefore more influential on the model output. At the same time, the dots within the night temperature plot are clearly sorted from blue to red and indicate that a higher temperature at night coincides with a higher yield prediction.
Precipitation and Vapor Pressure: The precipitation group shows almost no impact on the model output, as it is the narrowest swarm plot within Fig. 2. The vapor pressure group shows a small impact on the yield prediction and a very clear indication that a higher vapor pressure should lead to slightly higher yields, because the dots are completely in order from blue to red. Interestingly, both feature groups are not available worldwide, as they are specifically captured within the United States. Their relatively low impact on the prediction model encourages soybean yield prediction experiments in other regions of the world, even though this information would not be available there.

Handcrafted Features: This is the most influential group of features, capable of altering the yield prediction by more than 15 bu/ac. The impact is very high, as it includes the average yield of the county represented by the data point over the preceding years. That is, a county with traditionally higher yields in the past will obtain a higher yield prediction from our model, thus inducing spatial context into the modeling.
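The NDVI referenced above is a one-line computation per pixel; a minimal sketch:

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir, red = np.asarray(nir, dtype=float), np.asarray(red, dtype=float)
    return (nir - red) / (nir + red)

ndvi(0.5, 0.1)  # high NIR, low red: healthy vegetation, strongly positive NDVI
ndvi(0.3, 0.3)  # equal reflectance in both bands: NDVI of zero
```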


Fig. 2. GSV feature importances for soybean yield prediction. Every dot is a local explanation of a soybean yield prediction in the U.S. The first 11 plots each represent a group of 102 features calculated from different remote sensing data. The bottom plot shows a group of 7 additional handcrafted features. The color represents an averaged and normalized representation of the grouped features' values.

4.2 Grapevine Yield Prediction Based on Phenological Data

The second model, which we use to highlight the ability to understand yield prediction models through GSV feature importances, predicts grapevine yields. The data are obtained directly from the Julius Kühn-Institut, an institute for grapevine breeding in Germany. The input features were measured by manual plant appraisal in the year 2021, and the data consist of 461 grapevines stemming from 11 different grape varieties, with yields weighed manually directly after harvest. The 11 grape varieties include established varieties already used in commercial practice as well as experimental varieties, which are not guaranteed to meet the high requirements of commercial viability. Data are captured at 7 different timestamps and consist of different phenological information, such as the number of grapevine shoots, the number of inflorescences, and the number of clusters of grapes. GSV can help to identify important and unimportant timestamps for obtaining meaningful yield predictions. Since the data are captured manually, omitting some timestamps for data retrieval could save time and money. We divide the data into test and training subsets by randomly sampling 20% of the data for testing. The GSV for the test data can be seen in Fig. 3. Analyzing the plots one by one, we see the following patterns:

May 03 and May 12: Both times of data acquisition are early in the grapevine growth cycle. The dots in both plots are close to the center, which means that the impacts of both feature groups are rather low. However, we still observe some patterns. For the feature group May 03, we see that all the blue dots have a negative GSV and therefore a negative impact on the predictions. Within this feature group, we count the number of shoots. We can deduce that too few shoots in the early stages of growth will harm the expected yield.

June 02: This group shows one of the largest impacts on our model, as can be seen from the GSV ranging between −0.4 kg and 0.4 kg. The high importance at such an early stage of the growth cycle may seem surprising at first. Investigating the issue with the help of domain experts helps explain why this feature group is important. For grapevines in our study region, it is known that grapes build their inflorescences up to the beginning of June. Since this phenological feature is counted within this group, it helps explain the group's high importance. Regarding the distribution of values within this feature group, we see that the red dots are distributed at both ends of the plot, whereas the blue dots lie mostly in the negative area. We can interpret this to mean that a high value of the phenological features of the group is a necessary but not a sufficient criterion for a high yield prediction.

June 16 and July 01: The group of features captured on June 16 is of no importance to our model, since all GSV are zero-valued. The group of features captured on July 01 shows an interesting pattern. We see that the dots with negative GSV are all colored very lightly, meaning that the values are close to the average of the respective feature group.
Slightly better yields can be expected for low feature values, as indicated by the cluster of blue dots, and the best yield is expected for high values of the feature group, as indicated by the cluster of red dots on the right.

July 20: This feature group contains all the features captured on July 20, and the captured information is therefore very close to harvest. Not only do we see GSV from −0.4 to 0.6, we also see a clear pattern with the dots being ordered from blue to red. The information captured includes the number of clusters of grapes, which directly correlates with the expected yield.
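The decision to skip uninformative acquisition dates can be sketched as follows (the GSV arrays are illustrative, not the paper's actual values):

```python
import numpy as np

# Hypothetical per-group GSV (kg) over a test set, one array per timestamp.
gsv_by_date = {
    "May 03":  np.array([-0.05, -0.02, 0.01, 0.03]),
    "June 02": np.array([-0.40, -0.10, 0.20, 0.40]),
    "June 16": np.zeros(4),               # uniformly zero, as in Fig. 3
    "July 20": np.array([-0.40, 0.10, 0.30, 0.60]),
}

# A timestamp whose explanations are (near) zero for every test point
# contributed nothing to any prediction and is a candidate to skip
# in future manual surveys.
droppable = [d for d, g in gsv_by_date.items() if np.abs(g).max() < 1e-6]
# droppable == ["June 16"]
```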


Fig. 3. GSV feature importances for grapevine yield prediction. Each plot represents a group of features that are captured at the same point in time. The color represents an averaged and normalized representation of the grouped features' values.

5 Conclusions

In this work, we propose a novel approach to explaining machine learning models in the context of yield prediction. First, we examine a notion of Shapley values for predefined coalitions in a cooperative game that preserves the desirable mathematical properties of the classical Shapley value, foremost the axiom of efficiency, which, transferred to the context of feature importances, states that the Shapley values of a specific explanation add up to the model's output. We call these Grouped Shapley Values (GSV). Second, we leverage the visualization of many local explanations to achieve a global understanding of our model. This is done by introducing swarm plots that not only show the GSV but also give a colored indication of the features' aggregated values, providing further information to analyze. Third, we solve the NP-hard problem of calculating the GSV in polynomial time for tree structures. On the one hand, tree structures show state-of-the-art results for many yield prediction scenarios; on the other hand, we can calculate the GSV by traversing a tree once while keeping track of the contribution to the GSV that comes from each individual path. Fourth, we give an exemplary analysis of two real-world yield prediction tasks, showing the value that GSV can add to yield prediction. We can not only learn from the predictions, for example, about conditions that favor high yields, but also raise trust in our models. This is achieved by showing that the important features are in line with the features that domain experts would consider for yield prediction. We conclude that our approach will be helpful in creating synergy between computer scientists and domain experts. We acknowledge that our approach relies on a natural grouping of features that cannot be determined in every scenario, but is common for yield prediction tasks, as multiple features are often derived from the same sensor or captured in the same time frames. Also, our visualization works best if the feature groups' values can be represented meaningfully by an aggregated value. Lastly, since fast calculation of the GSV is only enabled for tree structures, it remains an open problem how to handle explanations for models with high-dimensional input data that do not rely on tree structures. For future work, we will extend our approach to different datasets and address the problem that occurs when no natural grouping of features is accessible. Furthermore, we will experiment with using the calculated feature importances to select features and create lightweight models for yield prediction.

Acknowledgments. This work was partially done within the project "Artificial Intelligence for innovative Yield Prediction of Grapevine" (KI-iREPro). The project is supported by funds of the Federal Ministry of Food and Agriculture (BMEL) based on a decision of the Parliament of the Federal Republic of Germany. The Federal Office for Agriculture and Food (BLE) provides coordinating support for artificial intelligence (AI) in agriculture as funding organisation, grant number FKZ 28DK128B20. We thank Timm Haucke for proofreading our manuscript.

Appendix A: Algorithms

Based on the work of [15], we present Algorithm 2 as a precise description of Algorithm 1. Regarding the notation within the algorithm, the model M is always a tree represented by a list of nodes v, left children a, right children b, thresholds t, cover r, and features d. The cover is the fraction of training data split by the individual nodes. The variable m is used to store the unique feature groups along the path, represented by one feature per group. Together with the path m, we store four attributes: (1) the feature index d, (2) the fraction of paths flowing through the branch where this group is not in the set S, denoted z, (3) the fraction of paths where this feature group is in the set S, denoted o, and finally (4) the weight w, which keeps track of the weights in front of Eq. 1. Within the algorithm, we access arrays via dot notation, and m.d represents the whole vector of features traversed so far. Lastly, the values pz and po track the fraction of added contributions, depending on whether the current feature is represented in the subsets.
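For a small number of groups, the grouped Shapley values that Algorithm 2 computes efficiently for trees can also be obtained by brute-force enumeration of group coalitions, which makes the efficiency axiom easy to verify. The following Python sketch is illustrative only; the group names and the toy additive value function are hypothetical, not taken from the paper's experiments.

```python
from itertools import combinations
from math import factorial

def grouped_shapley(groups, value):
    """Brute-force grouped Shapley values: groups play the role of players,
    and value maps a coalition (frozenset of groups) to the model output."""
    n = len(groups)
    phi = {}
    for g in groups:
        rest = [h for h in groups if h != g]
        total = 0.0
        for k in range(n):
            for S in combinations(rest, k):
                S = frozenset(S)
                # classical Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(S | {g}) - value(S))
        phi[g] = total
    return phi

# Toy additive "model": each feature group contributes a fixed amount.
contrib = {"weather": 2.0, "soil": 1.0, "ndvi": 4.0}
v = lambda S: sum(contrib[g] for g in S)
phi = grouped_shapley(list(contrib), v)

# Efficiency axiom: the GSV of all groups sum to v(all groups).
assert abs(sum(phi.values()) - v(frozenset(contrib))) < 1e-9
```

For an additive game like this one, each group's GSV equals its own contribution, so the efficiency check is immediate; Algorithm 2 obtains the same quantities without enumerating the exponentially many coalitions.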


F. Huber et al.

Algorithm 2 Polynomial Group Shapley Values for Trees (Detailed)
1: procedure GroupTreeShapleyValue(x, tree = {v, a, b, t, r, d}, C)
2:   ϕ = array of C zeros                            ▷ Storage for GSV
3:   procedure Expand(j, m, pz, po, pi)
4:     m = WeightUpdate(m, pz, po, pi)               ▷ Update m and all fractions to incorporate the growing number of features included in the path
5:     if vj is a leaf node then
6:       for i ← 1 to len(m) do                      ▷ Traverse the path backwards
7:         w = sum(Unwind(m, i).w)                   ▷ Undo WeightUpdate to access all features along the path
8:         C = group of dj                           ▷ Find the group of the current feature
9:         ϕC = ϕC + w · (mi.o − mi.z) · vj          ▷ Add the contribution to the GSV
10:      end for
11:    else                                          ▷ vj is an internal node
12:      h, c = (aj, bj) if xdj ≤ tj else (bj, aj)   ▷ Check which path the feature values of x dictate
13:      iz = io = 1
14:      k = FindFirstGroup(m.d, dj)                 ▷ Check for the group of dj
15:      if k ≠ nothing then                         ▷ Undo the split if the group is already represented
16:        iz, io = (mk.z, mk.o)
17:        m = Unwind(m, k)
18:      end if
19:      Expand(h, m, iz · rh/rj, io, dj)            ▷ Recursive call for both children
20:      Expand(c, m, iz · rc/rj, 0, dj)
21:    end if
22:  end procedure
23:  procedure WeightUpdate(m, pz, po, pi)
24:    l, m = len(m), copy(m)
25:    subsetsize = 1 if l = 0, else subsetsize = 0  ▷ Check if this is the first call
26:    ml+1.(d, z, o, w) = (pi, pz, po, subsetsize)
27:    for i ← l to 1 do
28:      mi+1.w = mi+1.w + po · mi.w · (i/l)         ▷ Fraction for bigger subsets
29:      mi.w = pz · mi.w · (l − i)/l                ▷ Fraction for same-size subsets
30:    end for
31:    return m
32:  end procedure
33:  procedure Unwind(m, i)
34:    l, n, m = len(m), ml.w, copy(m1...l−1)
35:    for j ← l − 1 to 1 do                         ▷ Run the path backwards
36:      if mi.o ≠ 0 then
37:        t = mj.w                                  ▷ Undo the calculations within WeightUpdate
38:        mj.w = n · l/(j · mi.o)
39:        n = t − mj.w · mi.z · (l − j)/l
40:      else
41:        mj.w = (mj.w · l)/(mi.z · (l − j))
42:      end if
43:    end for
44:    for j ← i to l − 1 do                         ▷ Shift the remaining entries
45:      mj.(d, z, o) = mj+1.(d, z, o)
46:    end for
47:    return m
48:  end procedure
49:  Expand(1, [ ], 1, 1, 0)                         ▷ Start at the root node
50:  return ϕ
51: end procedure

Grouping Shapley Values for Yield Predictions


References

1. Amoukou, S.I., Brunel, N.J.B., Salaün, T.: The Shapley value of coalition of variables provides better explanations (2021). arXiv:2103.13342
2. Aumann, R.J., Shapley, L.S.: Values of Non-Atomic Games (1974)
3. Bergstra, J., Yamins, D., Cox, D.: Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: International Conference on Machine Learning, pp. 115–123. PMLR (2013)
4. Bóbeda, G.R.R., Fernández-Combarro Álvarez, E., Mazza, S., Giménez, L.I., Díaz Rodríguez, S.I., et al.: Using regression trees to predict citrus load balancing accuracy and costs. Int. J. Comput. Intell. 12(1) (2018)
5. Cao, J., Zhang, Z., Tao, F., Zhang, L., Luo, Y., Han, J., Li, Z.: Identifying the contributions of multi-source data for winter wheat yield prediction in China. Remote Sens. 12(5), 750 (2020)
6. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)
7. Conitzer, V., Sandholm, T.: Computing Shapley values, manipulating value division schemes, and checking core membership in multi-issue domains. In: AAAI, vol. 4, pp. 219–225 (2004)
8. Díaz, I., Mazza, S., Combarro, E., Gimenez, L.: Machine learning applied to the prediction of citrus production, vol. 15, p. e0205 (Jul 2017)
9. Flores, R., Molina, E., Tejada, J.: Evaluating groups with the generalized Shapley value. 4OR 17(2), 141–172 (2019)
10. Grabisch, M., Roubens, M.: An axiomatic approach to the concept of interaction among players in cooperative games. Int. J. Game Theory 28(4), 547–565 (1999)
11. Huber, F., Yushchenko, A., Stratmann, B., Steinhage, V.: Extreme gradient boosting for yield estimation compared with deep learning approaches. Comput. Electron. Agric. 202, 107346 (2022)
12. Jullum, M., Redelmeier, A., Aas, K.: groupShapley: efficient prediction explanation with Shapley values for feature groups (2021). arXiv:2106.12228
13. Khaki, S., Pham, H., Wang, L.: Simultaneous corn and soybean yield prediction from remote sensing data using deep transfer learning. Sci. Rep. 11(1), 1–14 (2021)
14. Louppe, G., Wehenkel, L., Sutera, A., Geurts, P.: Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems, vol. 26 (2013)
15. Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., Lee, S.-I.: From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2(1), 56–67 (2020)
16. Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems, vol. 30, pp. 4765–4774. Curran Associates Inc. (2017)
17. Marichal, J.-L., Kojadinovic, I., Fujimoto, K.: Axiomatic characterizations of generalized values. Discret. Appl. Math. 155(1), 26–43 (2007)
18. Meng, L., Liu, H., Ustin, S.L., Zhang, X.: Predicting maize yield at the plot scale of different fertilizer systems by multi-source data and machine learning methods. Remote Sens. 13(18), 3760 (2021)
19. Quarmby, N.A., Milnes, M., Hindle, T.L., Silleos, N.: The use of multi-temporal NDVI measurements from AVHRR data for crop yield estimation and prediction. Int. J. Remote Sens. 14(2), 199–210 (1993)


20. Roth, A.E.: The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge University Press (1988)
21. Sirsat, M.S., Mendes-Moreira, J., Ferreira, C., Cunha, M.: Machine learning predictive model of grapevine yield based on agroclimatic patterns. Eng. Agric. Environ. Food 12(4), 443–450 (2019)
22. Thornton, P.E., Thornton, M.M., Mayer, B.W., Wei, Y., Devarakonda, R., Vose, R.S., Cook, R.B.: Daymet: daily surface weather data on a 1-km grid for North America, version 3. ORNL DAAC, Oak Ridge, Tennessee, USA (2016). In: USDA-NASS, 2019. 2017 Census of Agriculture, Summary and State Data, Geographic Area Series, Part 51, AC-17-A-51
23. USDA: USDA NASS Quick Stats database (2021). www.quickstats.nass.usda.gov/. Accessed 01 Jul 2021
24. Van Klompenburg, T., Kassahun, A., Catal, C.: Crop yield prediction using machine learning: a systematic literature review. Comput. Electron. Agric. 177, 105709 (2020)
25. Vermote, E.: MOD09A1 MODIS/Terra surface reflectance 8-day L3 global 500 m SIN grid V006. NASA EOSDIS Land Processes DAAC (2015)
26. Wang, X., Huang, J., Feng, Q., Yin, D.: Winter wheat yield prediction at county level and uncertainty analysis in main wheat-producing regions of China with deep learning approaches. Remote Sens. 12(11), 1744 (2020)
27. Wang, Y., Witten, I.H.: Induction of model trees for predicting continuous classes (1996)
28. You, J., Li, X., Low, M., Lobell, D., Ermon, S.: Deep Gaussian process for crop yield prediction based on remote sensing data. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)

Mining Interesting Aggregate Tuples

Viet Phan-Luong(B)

Aix-Marseille Univ, Université de Toulon, CNRS, LIS, DANA, Marseille, France
[email protected]

Abstract. In business intelligence, the concept of the data cube offers aggregate views over multiple dimensions of a business. Computing the data cube is a challenge because of the exponential number of cuboids. This number is a major problem not only for computation, but also for finding what is interesting or useful in the data cube. This paper presents the concept of the interesting aggregate tuple, which can help managers make business decisions. The concept is useful because (i) interesting aggregate tuples are those with important and credible aggregate values, and (ii) the number of interesting aggregate tuples is small enough to be examined by humans. The algorithm for finding interesting aggregate tuples is implemented and evaluated on real datasets.

Keywords: Data warehouse · Data cube · Data mining · Database

1 Introduction

A data cube of a relational fact table over a relational scheme R with n attributes (also called dimensions), a measure M (a numerical attribute), and an aggregate function g (such as COUNT, SUM, MAX, etc.) is the set of all aggregate tuples over any relational scheme S ⊆ R with the aggregation of g on M. Practically, the data cube is the result of the set of all Structured Query Language (SQL) group-by queries with group-by on any S ⊆ R and the aggregate function g applied to the measure M. The result of each such SQL group-by query is called a cuboid. In business intelligence, the concept of the data cube is useful for managers as it offers aggregate views of the business over multiple combinations of dimensions. However, computing the data cube raises important issues concerning the size of large datasets and the exponential number of cuboids. In Online Analytical Processing (OLAP), the data cube is precomputed and stored on disk. The storage space can be tremendous: for n dimensions, we have 2^n cuboids. Because of the tremendous number of aggregate tuples, it is very hard for managers to find what is interesting or useful to help them make business decisions. To reduce the number of aggregate tuples, many approaches exist, such as iceberg cubes [1–4]. In these approaches, only the aggregate tuples with

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 229–243, 2024. https://doi.org/10.1007/978-3-031-47715-7_16


count value above some threshold are stored. Another approach to reduction is an approximate and space-efficient representation of data cubes based on a multiresolution wavelet decomposition [5]. Another category of approaches to reducing the number of aggregate tuples is based on equivalence relations defined on aggregate functions, on the concept of closed itemsets in frequent itemset mining [6–10], or on tuple references representing redundant aggregate tuples [7,11–16]. In this category, the computation explores the complete lattice of cuboid schemes in a top-down or bottom-up manner. Among those approaches, the data cube representation by BUC [16] achieves the greatest reduction in the number of aggregate tuples. A newer approach [17–21] is competitive with BUC. It represents the data cube by index bases and can halve the number of aggregate tuples while keeping the average query runtime competitive with BUC. Another advantage of these index-base approaches is that the representations depend neither on the measures of the data cube nor on the aggregate functions. However, the reduced number of aggregate tuples resulting from the above approaches is still tremendous. Those data cube representations are too far from being practical or useful for business decisions. Another remark is that those reduction approaches are not based on user interests: they do not discard aggregate tuples that are uninteresting to users.

The skyline operator [22], an extension of SQL based on the notion of the maximal vector [23], allows filtering the result of a query using interest criteria specified by preferences. A vector is maximal if it is not dominated by any other vector with respect to the orders (≤ or ≥) defined on the components of the vectors. Applied in databases, the skyline operation filters the result of a query to return the maximal tuples according to some preferences. Therefore, users can express their interests by giving preference orders on the components of the expected result tuples. For example, to purchase or rent a flat, users might prefer the one that is nearest to the city center and the most comfortable, but with the cheapest price. So users can specify the preference order ≤ on the distance to the city center, ≥ on the number of equipments, and ≤ on the price. That is, for flats f1(d1, e1, p1) and f2(d2, e2, p2), with d, e, p the distance to the city center, the number of equipments, and the price, respectively, f1 is better than (or dominates) f2 if d1 ≤ d2, e1 ≥ e2, and p1 ≤ p2, and at least one of d1 < d2, e1 > e2, or p1 < p2 holds.

Many efficient algorithms for computing the skyline operation and efficient ranking methods for skyline results over relational tables exist [24–29]. A study of the time and disk-access complexity of the skyline operation on group-by SQL queries can be found in [30]. In that work, the skyline operation is computed for each group of tuples in the query result to provide the maximal tuples of the group, following the preferences. In some way, the skyline operator can be considered an aggregate function of the group-by query, except that where a usual aggregate function returns a single value, the skyline operation returns a set of maximal vectors. The work [30] did not address the data cube, in particular the problem of data cube representation or reduction.
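The flat example can be illustrated with a small Python sketch of the skyline operation; the helper names and sample flats are hypothetical, not from the paper.

```python
def dominates(p, q, prefs):
    """p dominates q w.r.t. per-component preferences 'min' or 'max'."""
    no_worse = all(
        (pv <= qv) if pref == "min" else (pv >= qv)
        for pv, qv, pref in zip(p, q, prefs)
    )
    strictly_better = any(
        (pv < qv) if pref == "min" else (pv > qv)
        for pv, qv, pref in zip(p, q, prefs)
    )
    return no_worse and strictly_better

def skyline(tuples, prefs):
    """Tuples not dominated by any other tuple."""
    return [t for t in tuples
            if not any(dominates(u, t, prefs) for u in tuples)]

# Flats as (distance_km, n_equipments, price); prefer min, max, min.
flats = [(1.0, 5, 900), (1.2, 5, 800), (3.0, 2, 400), (1.0, 4, 950)]
print(skyline(flats, ("min", "max", "min")))
# → [(1.0, 5, 900), (1.2, 5, 800), (3.0, 2, 400)]
```

The last flat is dominated by the first (same distance, more equipment, lower price), while the remaining three are mutually incomparable, so all three survive in the skyline.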


This paper presents an approach that applies the skyline operation to reduce the number of aggregate tuples of the data cube by filtering interesting aggregate tuples. In contrast with the work [30], here the skyline operation is computed over all cuboids of the data cube, not over each group of tuples resulting from a group-by query. By interesting aggregate tuples we mean the regular aggregate tuples with large values of SUM, AVG, or any similar function. For computing such aggregate tuples, we use the complete index base representation of the data cube [21], a competitive method for computing queries on the data cube. Experimentation on real datasets shows that the concept of interesting aggregate tuples is useful and efficient. The paper is organized as follows. Section 2 recalls the formal concepts of the data cube and the maximal vector, and formally defines the concept of the interesting aggregate tuple of a data cube. Section 3 recalls the complete index base approach to computing queries on the data cube. Section 4 presents the methods for computing the interesting aggregate tuples based on the complete index base. Section 5 reports the experimental results. Finally, discussion, conclusions, and further work are in Sect. 6.

2 Preliminary

Let T be a relational fact table over a relational scheme {A1, A2, ..., An, M1, ..., Mk}, where A1, A2, ..., An are attributes called dimensions, and M1, ..., Mk are numerical attributes called measures.

2.1 Data Cube

Let g1, ..., gj be aggregate functions, such as SUM, AVG, MIN, etc., in SQL queries. The result of a group-by query

SELECT B1, ..., Bi, g1(N1), ..., gj(Nj) FROM T GROUP BY B1, ..., Bi;

where {B1, ..., Bi} ⊆ {A1, A2, ..., An} and {N1, ..., Nj} ⊆ {M1, ..., Mk}, is called a cuboid of T over {B1, ..., Bi}. In other words, each cuboid is a set of aggregate tuples of aggregate functions over a subset of the dimension scheme {A1, A2, ..., An}. This set represents a multidimensional view of aggregate functions over the relational table T. A data cube of T on g1, ..., gj is the set of cuboids of g1, ..., gj over all subsets {B1, ..., Bi} ⊆ {A1, A2, ..., An}. A data cube over n dimensions has 2^n cuboids and therefore offers 2^n multidimensional views on aggregate functions.

2.2 Skyline Operation

A preference on a measure Mi, 1 ≤ i ≤ k, is an expression in one of the two following forms: Pref(Mi) = min or Pref(Mi) = max.

232

V. Phan-Luong

Let p and q be two tuples of T . We say that p dominates q, denoted p ≺ q, if and only if (i) ∀i ∈ [1, k], p(Mi ) ≤ q(Mi ) if P ref (Mi ) = min or p(Mi ) ≥ q(Mi ) if P ref (Mi ) = max and (ii) ∃i ∈ [1, k], p(Mi ) < q(Mi ) if P ref (Mi ) = min or p(Mi ) > q(Mi ) if P ref (Mi ) = max. A skyline of T is a set of tuples of T that are not dominated by any other tuple of T . The goal of the skyline computing is to find the maximal tuples of T , after the ≺ order. That is the tuples that are the best following the preferences specified by users. 2.3

Interesting Aggregate Tuples

Faced with the enormous number of aggregate tuples in a data cube, it is clearly very hard for users to find which aggregate tuples can support their decisions. However, when using a data cube to prospect their business, users usually look for aggregate tuples with (i) large values of functions such as SUM or AVG, (ii) regular values, and (iii) large support. The reasons are the following. The value of SUM or AVG over a measure Mi represents a direct interest of users in Mi; in the management of commercial or stock transactions in particular, large values clearly correspond to big interest. To make the prospection reliable, the values at point (i) should be regular: the values aggregated into them should not differ too much. The reason for point (iii) is that the more often an object appeared in the past, the better the probability that it appears in the future.

Let p and q be two aggregate tuples of the data cube on a measure M of the relational fact table T. We say that p is less interesting than q (or q dominates p) with respect to M and the aggregate function SUM (or AVG) if and only if (i) p.SUM(M) ≤ q.SUM(M) (or p.AVG(M) ≤ q.AVG(M), respectively), p.COUNT(M) ≤ q.COUNT(M), and p.STDDEV(M) ≥ q.STDDEV(M), and (ii) either p.SUM(M) < q.SUM(M) (or p.AVG(M) < q.AVG(M), respectively), or p.COUNT(M) < q.COUNT(M), or p.STDDEV(M) > q.STDDEV(M). An aggregate tuple p is interesting if there exists no other tuple q in the data cube that is more interesting than p. The skyline on the measure M of the data cube of T is the set of all interesting aggregate tuples of the data cube.

3 Computing the Complete Index Base of Data Cube

In this section, we recall the method for computing the complete index base representation of data cube [21]. This representation is based on the recursive structure of the power set of R:


1. The power set of R0 = ∅ (the empty set) is P0 = {∅}.
2. For n ≥ 1, the power set of Rn = {1, 2, ..., n} can be defined recursively as

   Pn = Pn−1 ∪ {X ∪ {n} | X ∈ Pn−1}   (1)

Pn−1 is called the first-half power set of Rn, and the second operand of (1), i.e., {X ∪ {n} | X ∈ Pn−1}, the last-half power set of Rn.

Example 1: For n = 3, R3 = {1, 2, 3}, we have P0 = {∅}, P1 = {∅, {1}}, P2 = {∅, {1}, {2}, {1, 2}}, and P3 = {∅, {1}, {2}, {1, 2}, {3}, {1, 3}, {2, 3}, {1, 2, 3}}. The first-half power set of R3 is P2 = {∅, {1}, {2}, {1, 2}} and the last-half power set of R3 is {{3}, {1, 3}, {2, 3}, {1, 2, 3}}, obtained by adding 3 to each element of P2.

The first-half data cube is the set of all cuboids over the schemes in the first-half power set of R, and the last-half data cube is the set of all cuboids over the schemes in the last-half power set of R. The index of aggregate tuples in a cuboid over a scheme {A1, ..., Ak−1} is a partition of the rowids of the fact table T over the scheme {A1, ..., Ak−1}. The rowids in each part of this partition are the rowids of the tuples that have the same values over the scheme {A1, ..., Ak−1}. Let LP denote this partition. Consider an element X of LP; X is a list of rowids. The partition of the rowids in X based on the values of attribute n (the last dimension of the fact table T) is a list X1, ..., Xi such that
– ∀j, 1 ≤ j ≤ i, the tuples at rowids in Xj have the same value on attribute n, and therefore the same value on {A1, ..., Ak−1, n}. Thus, Xj allows computing an aggregate tuple in the cuboid over the scheme {A1, ..., Ak−1, n},
– the set of all values on attribute n in the tuples at rowids in X1 ∪ ... ∪ Xi is the set of all values on attribute n in the tuples at rowids in X,
– X = X1 ∪ ... ∪ Xi and Xj ∩ Xh = ∅, ∀1 ≤ h, j ≤ i, h ≠ j.

Based on these conditions, instead of saving the partition of the rowids of T over {A1, ..., Ak−1, n} in a new space, the method that generates the complete index base [21] stores this partition in the storage space of the index over {A1, ..., Ak−1}, using a new separator to separate X1, ..., Xi in the list of rowids in X. In what follows, we briefly recall the data structure and algorithms for creating the complete index base used in [21].

Attribute Index Tree. Data on a dimension (an attribute) of the fact table T is indexed using the binary search tree structure with the following fields:
– data: the attribute value,
– ltid: the list of rowids associated with the attribute value,
– lsib and rsib: the left and right sub-trees.
A tree with this structure is called an attribute index tree. The algorithm InsData2AttIndex [21] inserts attribute values into an attribute index tree following the standard method for inserting a value into a binary search tree.
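The power set recursion above is straightforward to mirror in code. This Python sketch (the function name is hypothetical) builds Pn from Pn−1 and its shifted copy, matching Example 1:

```python
def power_set(n):
    """P_n via the recursion P_n = P_{n-1} ∪ {X ∪ {n} | X ∈ P_{n-1}}."""
    if n == 0:
        return [frozenset()]                       # P_0 = {∅}
    first_half = power_set(n - 1)                  # first-half power set of R_n
    last_half = [X | {n} for X in first_half]      # last-half power set of R_n
    return first_half + last_half

P3 = power_set(3)
print(len(P3))  # → 8; the last four schemes all contain dimension 3
```

Because the last half is obtained by adding n to every scheme of the first half, the two halves have the same size, which is what lets the complete index base store a last-half cuboid inside the index file of its first-half counterpart.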


Algorithm InsData2AttIndex
Input: an attribute value val, the rowid of a tuple that contains val, and an attribute index tree P.
Output: the attribute index tree P, updated.
Method:
  if P == NULL {
    create P with P.data = val;
    create P.ltid with rowid as the first element;
    P.lsib = NULL and P.rsib = NULL;
  } else if P.data > val {
    insert val and rowid into P.lsib;
  } else if P.data < val {
    insert val and rowid into P.rsib;
  } else {
    append rowid to P.ltid;
  }

Tuple Indexes. Given a sub-scheme {A1, ..., Ak} (for 1 ≤ i ≤ k, 1 ≤ Ai ≤ n), we assume that the index over {A1, ..., Ak−1} is already created for all tuples of T. Note that the indexes over the schemes {1}, ..., {n} are created by the InsData2AttIndex algorithm. The algorithm TupleIndex creates the index of tuples over {A1, ..., Ak} based on the index of tuples over {A1, ..., Ak−1}, where Ak > Ai, 1 ≤ i ≤ k − 1. The main idea is the following. For each X that is an element of the index of tuples over {A1, ..., Ak−1} (X is a part of the partition of the rowids of T over {A1, ..., Ak−1}), we use the InsData2AttIndex algorithm to partition X into {X1, ..., Xi} over the attribute Ak. Each Xj, 1 ≤ j ≤ i, is an element of the partition of the rowids of T over {A1, ..., Ak−1, Ak}.
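The refinement step that TupleIndex performs, splitting each part X into X1, ..., Xi by the values of the next attribute, can be sketched in Python with a plain dictionary standing in for the attribute index tree; all names and the toy rows here are hypothetical:

```python
from collections import defaultdict

def refine_partition(partition, rows, attr):
    """Refine each part (list of rowids sharing values on A_1..A_{k-1})
    by the value of one more attribute 'attr'."""
    refined = []
    for part in partition:
        by_value = defaultdict(list)
        for rid in part:
            by_value[rows[rid][attr]].append(rid)
        refined.append(list(by_value.values()))   # X split into X_1..X_i
    return refined

# Toy fact table: rowid -> tuple of attribute values (attr 0, attr 1).
rows = {0: ("a", 1), 1: ("a", 2), 2: ("a", 1), 3: ("b", 2)}
# Partition over attribute 0: value "a" -> [0, 1, 2], value "b" -> [3].
partition = [[0, 1, 2], [3]]
print(refine_partition(partition, rows, 1))
# → [[[0, 2], [1]], [[3]]]
```

Each inner list plays the role of an Xj; in the actual index base these sub-lists are written back into the same index file, separated by the new separator, rather than into a fresh structure.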

3.1 Complete Index Base

Algorithm GenCompleteIndex generates the complete index base of the fact table T over the dimension scheme R = {1, ..., n}. The main ideas are the following:
(1) Use the InsData2AttIndex algorithm to generate the n indexes over attributes 1 to n, respectively. The indexes over attributes 1 to n−1 are then partitioned over attribute n using InsData2AttIndex. In each part (list of rowids) of the partition over an attribute k, 1 ≤ k ≤ n − 1, we use a new separator to separate the parts of the partition over the scheme {k, n}.
(2) Based on the indexes generated in point (1), we incrementally create the indexes over {A1, ..., Ak}, where ∀j, 1 ≤ j ≤ k, Aj ∈ {1, 2, ..., n − 1}. Let I be the index over {A1, ..., Ak} and let X be a part in I. Each such part X is then partitioned over dimension n into {X1, ..., Xi}, using the new separator to separate the parts of the partition. The results are parts of the partition of the rowids of T over {A1, ..., Ak, n}. Note that a part in the partition over {A1, ..., Ak} can contain one or many parts of the partition over {A1, ..., Ak, n}; these parts are separated by the new separator in the index file of the cuboid over the scheme {A1, ..., Ak}.


3.2 The Complete Index Base for Querying Data Cube

The complete index base for querying the data cube on T is the triple (T, RS, CIndex), where
– RS is the list of dimension scheme identifiers returned by GenCompleteIndex(T), and
– CIndex is the set of tuple indexes saved by GenCompleteIndex(T).
For efficient computing, the list of dimension scheme identifiers RS and the fact table T are loaded into main memory.

4 Computing Interesting Aggregate Tuples Based on the Complete Index Base

In this section, we define the data structures and methods for computing the skyline of the data cube of T on the measure M, based on the complete index base representation of the data cube [21].

Data Structures. An aggregate tuple t is represented by a record node (rid, schid, dominated_count, sum, count, stddev), where rid is a rowid of the fact table T, schid is the relational scheme identifier, and dominated_count is the number of aggregate tuples dominated by t. The pair (rid, schid) allows retrieving the attribute values (dimensional values) of t. Initially, dominated_count is set to zero; it is updated during the search for all interesting aggregate tuples. sum, count, and stddev are respectively the values of the aggregate functions SUM, COUNT, and STDDEV applied to the measure M for the aggregate tuple represented by (rid, schid). The interesting aggregate tuples are stored in a list of record nodes called LIT. Initially LIT is empty. LIT is updated with aggregate tuples of the data cube by the following insertion method.

Algorithm Ins2LIT: insert an aggregate tuple t into LIT if t is not dominated by any tuple currently in LIT, and remove from LIT all tuples dominated by t.
Input: LIT, a tuple t = (rid, schid, dominated_count, sum, count, stddev)
Output: LIT updated with t.
Method:
if (LIT == NULL) {
  create a new node to insert t into LIT and return LIT
} else {
  ins = 1
  set P to the head of LIT
  while (P != NULL) { // search through LIT until the end
    if (P->sum <= sum && P->count <= count && P->stddev >= stddev) {
      if (P->sum < sum || P->count < count || P->stddev > stddev) {
        // tuple t is more interesting than the tuple at node P:
        // t dominates P.tuple and all tuples dominated by P.tuple
        Update dominated_count;
        Delete P and set P to the next node of LIT
        // continue the search to insert t into LIT
      } else { // P.tuple is of the same importance as t
        Add t to P.listTuples and return LIT;
      }
    } else if ((P->sum >= sum && P->count >= count && P->stddev <= stddev)
               && (P->sum > sum || P->count > count || P->stddev < stddev)) {
      // P.listTuples dominates tuple t
      increase P.dominated_count
      ins = 0; // do not insert t into LIT
      set P to the next node of LIT // because other nodes may also dominate t
    } else {
      // P.tuple does not dominate t and t does not dominate P.tuple
      set P to the next node of LIT // continue the search to insert t into LIT
    }
  } // end while
  if (ins == 1) {
    // t is not dominated by any tuple in LIT and inversely
    insert t at the end of LIT.
  }
  return LIT;
} // end of algorithm

Algorithm CuboidSkyline: compute all currently interesting aggregate tuples on a measure M for the two cuboids of the data cube of T whose indexes are stored in the same index file.
Input: LIT, the list of record nodes of currently interesting aggregate tuples; f1, the integrated index file of the two cuboids, one over the dimension scheme X1..Xk and the other over X1..Xk n, where for 1 ≤ i ≤ k, Xi ∈ {1, 2, ..., n − 1}. Let schid be the scheme identifier of X1..Xk.
Output: LIT updated with the aggregate tuples obtained from the index file f1.
Method:
For each partition P of indexes over X_1..X_k do
  sum = 0, count = 0
  For each partition Q of P of indexes over X_1..X_k n do
    sum1 = 0, count1 = 0
    For each index rowid in Q do
      get the row with rowid
      sum1 = sum1 + row(M), count1 = count1 + 1
      sum = sum + row(M), count = count + 1
    compute stddev1
    dominated_count1 = 0
    create a record node
      t1 = (rowid, schid, dominated_count1, sum1, count1, stddev1)
    Ins2LIT(LIT, t1) // t1: an aggregate tuple over X_1..X_k n
  compute stddev
  dominated_count = 0
  create a record node
    t = (rowid, schid, dominated_count, sum, count, stddev)
  Ins2LIT(LIT, t) // t: an aggregate tuple over X_1..X_k
return LIT

Remark: to compute the interesting aggregate tuples with respect to the aggregate function AVG, in the algorithm CuboidSkyline, after sum1 (or sum) is computed, we need to compute avg1 = sum1/count1 (or avg = sum/count), and t1 = (rowid, schid, dominated_count1, avg1, count1, stddev1) is inserted into LIT (or t = (rowid, schid, dominated_count, avg, count, stddev) is inserted into LIT).

Algorithm CubeSkyline: compute the skyline of all aggregate tuples of the data cube of T on a measure M.
Input: (1) the complete index base representation (T, RS, CIndex) of the fact table T, where
– RS is the list of scheme numbers identifying the sub-schemes of the power set of R (the dimension scheme of T), and
– CIndex is the set of all index files; each scheme number corresponds to an index file of CIndex.
(2) A measure M.
Output: the skyline of the data cube of T on the measure M.
Method:
LIT = empty
For each schid in RS do
  search in CIndex to get the index file f1 corresponding to schid
  LIT = CuboidSkyline(LIT, f1)
return LIT
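Ignoring the index base machinery, the net effect of CubeSkyline can be sketched by enumerating all group-by schemes directly and filtering with the dominance relation of Sect. 2.3. This naive Python version (hypothetical names, toy data) is exponential in the number of dimensions and is meant only to illustrate the semantics, not the paper's efficient implementation:

```python
from itertools import combinations
from statistics import pstdev
from collections import defaultdict

def cube_skyline(rows, dims, measure):
    """All aggregate tuples (scheme, key, sum, count, stddev) of the data
    cube, filtered to those not dominated w.r.t. SUM/COUNT/STDDEV."""
    tuples = []
    for k in range(len(dims) + 1):             # every cuboid scheme S ⊆ dims
        for scheme in combinations(dims, k):
            groups = defaultdict(list)
            for row in rows:                   # one group-by per scheme
                groups[tuple(row[d] for d in scheme)].append(row[measure])
            for key, vals in groups.items():
                tuples.append((scheme, key, sum(vals), len(vals), pstdev(vals)))

    def dominated(p, q):                       # does q dominate p (Sect. 2.3)?
        return (p[2] <= q[2] and p[3] <= q[3] and p[4] >= q[4]
                and (p[2] < q[2] or p[3] < q[3] or p[4] > q[4]))

    return [p for p in tuples if not any(dominated(p, q) for q in tuples)]

rows = [{"A": "x", "m": 1}, {"A": "x", "m": 9},
        {"A": "y", "m": 6}, {"A": "y", "m": 6}, {"A": "y", "m": 6}]
for t in cube_skyline(rows, ["A"], "m"):
    print(t)
```

Here the group A = x (small sum, small count, large standard deviation) is dominated by the regular group A = y, while the all-tuples aggregate survives because no tuple beats it on sum and count while also matching its deviation.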

5 Experimental Results and Discussions

We evaluate the approach for computing interesting aggregate tuples based on the complete index base of the data cube on a laptop with 8 GB of memory and an Intel Core i5-3320 CPU @ 2.60 GHz × 4, running Ubuntu 18.04 LTS. The implementation is in C, and we use four real datasets, CovType [31], SEP85L [32], STCO-MR2010 AL MO [33], and OnlineRetail [34], where
– CovType is a forest cover-type dataset with 581,012 tuples on ten dimensions,
– SEP85L is a weather dataset with 1,015,367 tuples on nine dimensions,
– STCO-MR2010 AL MO is a census dataset on the population of Alabama through Missouri in 2010, with 640,586 tuples over ten dimensions, and
– OnlineRetail is a UK transactions dataset with 393,127 complete data tuples and ten dimensions.
For more information on these datasets, see [21].

5.1 On Building the Complete Index Base

Table 1 reports the computation run time and the disk storage space of the complete index base. In this table,


–Volume: the volume in megabytes (MB) of the dataset files.
–DU (Disk Use): the storage space on disk in gigabytes (GB) to store the index base.
–RT (Run Time): the time in seconds from the start of the program until the index base is completely built, including the time to read/write input/output files.
For the four datasets, the main memory use varies from 80 MB to 200 MB.

Table 1. Run time and disk use of the complete index base

Datasets      Volume (MB)   DU (GB)   RT (s)
CovType       22.9          2.3       178
SEP85L        28.0          1.9       158
STCO-...      17.1          2.3       158
OnlineRetail  13.5          1.5       104

5.2 On Query with Aggregate Functions

The group-by SQL queries for the aggregate functions MAX, COUNT, SUM, AVG, and VARIANCE are run against the complete index base. The queries have the following simple form:

Select ListOfDimensions, f(m)
From Fact_Table
Group by ListOfDimensions;

For the experiments, we run the query for all cuboids in each half of the above data cubes. For example, for CovType the query is run on the 512 cuboids of the first half and on the 512 cuboids of the last half. Table 2 shows the total time in seconds (including all computing and I/O time) for computing the aggregate queries for all cuboids in the first-half data cubes.

Table 2. Computing time of queries on first-half cube

Agg Funct   CovType (s)   Sep85 (s)   STCO (s)   OnlineRet (s)
Count       225           116         137        106
Max         257           146         157        120
Sum         253           145         157        119
Avg         332           180         189        150
Var         301           174         187        144
Mean        273           152         165        128
avgQRT      0.53          0.59        0.32       0.25


Mean contains the mean of the rows above; for example, in column CovType, Mean is the mean of the aggregate query computing times for the functions Count, Max, Sum, Avg, and Var. The avgQRT is the average query response time: Mean divided by the number of cuboids in the first-half data cube, e.g., for CovType, avgQRT = 273/512. Table 3 is similar to Table 2, but for computing the aggregate queries for all cuboids in the last-half data cubes.
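The derived rows of Table 2 can be reproduced directly; the sketch below uses the CovType column (the published Mean appears to be truncated rather than rounded) and assumes 2^10 dimensions, i.e. 512 first-half cuboids:

```python
# Query times (s) for the five aggregate functions, CovType, first-half cube.
times = {"Count": 225, "Max": 257, "Sum": 253, "Avg": 332, "Var": 301}

mean = sum(times.values()) / len(times)   # 273.6 s, reported as 273
avg_qrt = mean / 512                      # 512 cuboids in the first half

print(round(avg_qrt, 2))  # -> 0.53
```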

5.3 On Computing the Interesting Aggregate Tuples

Tables 4 and 5 show the results of the experiments on the computation of interesting aggregate tuples for the aggregate functions SUM and AVG, where
–IAT-RT: the run time from the start to the end of the computation, until all interesting aggregate tuples are saved on disk,
–IAT-NUM: the total number of interesting aggregate tuples of the data cube,
–IAT-VOL: the volume in kilobytes (KB) of the file saving the interesting aggregate tuples of the data cube.

Table 3. Computing time of queries on last-half cube

Agg Funct   CovType (s)   Sep85 (s)   STCO (s)   OnlineRet (s)
Count       252           128         149        118
Max         279           155         166        131
Sum         284           158         166        133
Avg         359           187         196        157
Var         340           191         201        159
Mean        303           164         176        140
avgQRT      0.59          0.64        0.34       0.27

Table 4. Results on interesting aggregate tuples – for SUM

DATASETS      IAT-RT (s)   IAT-NUM   IAT-VOL (KB)
CovType       152          141       3.7
SEP85L        156          141       3.5
STCO-...      160          102       2.6
OnlineRetail  103          126       3.1

Table 5. Results on interesting aggregate tuples – for AVG

DATASETS      IAT-RT (s)   IAT-NUM   IAT-VOL (KB)
CovType       159          252       6.3
SEP85L        215          279       6.7
STCO-...      325          432       9.9
OnlineRetail  162          334       7.5

5.4 Synthesis of Results

Table 6 gives an integrated view over the preceding tables that allows to appreciate the efficiency of the concept of interesting aggregate tuples, where
–Index-Vol: the volume in gigabytes of the files of the complete index base,
–IAT-VOL-SUM (IAT-VOL-AVG): the volume in kilobytes of the file of all interesting aggregate tuples of the data cube, with the aggregate function SUM (or AVG, respectively),
–IAT-NUM-SUM (IAT-NUM-AVG): the number of all interesting aggregate tuples of the data cube, with the aggregate function SUM (or AVG, respectively),
–Index-RT: the run time for computing the complete index base, from the beginning of the computation until the index base is saved on disk,
–Query-FH-RT-SUM (Query-LH-RT-SUM): the run time for computing all queries on the first-half (or last-half, respectively) data cube, with the aggregate function SUM,
–Query-FH-RT-AVG (Query-LH-RT-AVG): the run time for computing all queries on the first-half (or last-half, respectively) data cube, with the aggregate function AVG,
–IAT-RT-SUM (IAT-RT-AVG): the run time for computing all interesting aggregate tuples with the aggregate function SUM (or AVG, respectively).

6 Discussion, Conclusion, and Further Work

Based on Table 6, we can see that: (1) The ratio IAT-VOL-SUM / Index-Vol is in [1.3 × 10^-6, 2 × 10^-6], and the ratio IAT-VOL-AVG / Index-Vol is in [2.74 × 10^-6, 5 × 10^-6]. (2) IAT-NUM-SUM and IAT-NUM-AVG are really tiny with respect to the number of aggregate tuples in a data cube. (3) On the whole data cube, the run time for computing all interesting aggregate tuples for SUM is about 0.28–0.5 times the time for computing all queries for SUM, and the run time for computing all interesting aggregate tuples for AVG is about 0.23–0.88 times the time for computing all queries for AVG. As the number of interesting aggregate tuples is really small with respect to the number of aggregate tuples in a data cube, users (managers) can look inside to find which tuples are useful to help them prospect their business. Also, because of this small number, the time to save the interesting aggregate tuples to disk is negligible with respect to the time to save the results of all queries on the data cube to disk. This explains why the ratio of the time for computing interesting aggregate tuples to the time for computing queries on the data cube is less than one (i.e., 0.28–0.5 or 0.23–0.88), while


Table 6. Synthesis of results

DATASETS           CovType   SEP85L   STCO-...   OnlineRetail
Dataset-Vol        22.9 MB   28.0 MB  17.1 MB    13.5 MB
Index-Vol          2.3 GB    1.9 GB   2.3 GB     1.5 GB
IAT-VOL-SUM        3.7 KB    3.5 KB   2.6 KB     3.1 KB
IAT-VOL-AVG        6.3 KB    6.7 KB   9.9 KB     7.5 KB
IAT-NUM-SUM        141       141      102        126
IAT-NUM-AVG        252       279      432        334
Index-RT           178 s     158 s    158 s      104 s
Query-FH-RT-SUM    253 s     145 s    157 s      119 s
Query-LH-RT-SUM    284 s     158 s    166 s      133 s
IAT-RT-SUM         152 s     156 s    160 s      103 s
Query-FH-RT-AVG    332 s     180 s    189 s      150 s
Query-LH-RT-AVG    359 s     187 s    196 s      157 s
IAT-RT-AVG         159 s     215 s    325 s      162 s

the computation of interesting aggregate tuples needs to compute all cuboids of the data cube with three aggregate functions (SUM or AVG, COUNT, and STDDEV). For these reasons, we can conclude that the concept of interesting aggregate tuples is really useful for managers in business intelligence, and that the method for computing interesting aggregate tuples based on the complete index base of the data cube is very efficient. For further work, in order to help managers, we plan to apply ranking methods to interesting aggregate tuples and to extend the concept of interesting aggregate tuples to multiple measures of the data cube.
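Observation (1) above can be spot-checked directly from the Table 6 volumes (a rough sketch; 1 GB is taken as 10^6 KB, so the exact interval bounds depend on the unit convention):

```python
# Volumes from Table 6: interesting-aggregate-tuple files (KB, SUM) vs index base (GB).
iat_vol_sum_kb = {"CovType": 3.7, "SEP85L": 3.5, "STCO": 2.6, "OnlineRetail": 3.1}
index_vol_gb = {"CovType": 2.3, "SEP85L": 1.9, "STCO": 2.3, "OnlineRetail": 1.5}

ratios = {d: iat_vol_sum_kb[d] / (index_vol_gb[d] * 1e6) for d in index_vol_gb}
# Every ratio is on the order of 10^-6: the skyline file is roughly a
# millionth of the size of the index base it is computed from.
```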

References

1. Beyer, K., Ramakrishnan, R.: Bottom-up computation of sparse and iceberg cube. In: Proceedings of ACM Special Interest Group on Management of Data (SIGMOD'99), pp. 359–370 (1999)
2. Han, J., Pei, J., Dong, G., Wang, K.: Efficient computation of iceberg cubes with complex measures. In: Proceedings of ACM SIGMOD'01, pp. 1–12 (2001)
3. Xin, D., Han, J., Li, X., Wah, B.W.: Star-cubing: computing iceberg cubes by top-down and bottom-up integration. In: Proceedings of VLDB'03, pp. 476–487 (2003)
4. Shao, Z., Han, J., Xin, D.: Mm-cubing: computing iceberg cubes by factorizing the lattice space. In: Proceedings of International Conference on Scientific and Statistical Database Management (SSDBM 2004), pp. 213–222 (2004)
5. Vitter, J.S., Wang, M., Iyer, B.: Data cube approximation and histograms via wavelets. In: Proceedings of International Conference on Information and Knowledge Management (CIKM'98), pp. 96–104 (1998)
6. Agarwal, S., et al.: On the computation of multidimensional aggregates. In: Proceedings of VLDB'96, pp. 506–521 (1996)


7. Harinarayan, V., Rajaraman, A., Ullman, J.: Implementing data cubes efficiently. In: Proceedings of SIGMOD'96, pp. 205–216 (1996)
8. Ross, K.A., Srivastava, D.: Fast computation of sparse data cubes. In: Proceedings of VLDB'97, pp. 116–125 (1997)
9. Casali, A., Cicchetti, R., Lakhal, L.: Extracting semantics from data cubes using cube transversals and closures. In: Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD'03), pp. 69–78 (2003)
10. Casali, A., Nedjar, S., Cicchetti, R., Lakhal, L., Novelli, N.: Lossless reduction of datacubes using partitions. Int. J. Data Warehous. Mining (IJDWM) 5(1), 18–35 (2009)
11. Lakshmanan, L., Pei, J., Han, J.: Quotient cube: how to summarize the semantics of a data cube. In: Proceedings of VLDB'02, pp. 778–789 (2002)
12. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: shrinking the petacube. In: Proceedings of ACM SIGMOD'02, pp. 464–475 (2002)
13. Wang, W., Lu, H., Feng, J., Yu, J.X.: Condensed cube: an efficient approach to reducing data cube size. In: Proceedings of International Conference on Data Engineering 2002, pp. 155–165 (2002)
14. Lakshmanan, L.V.S., Pei, J., Zhao, Y.: QC-trees: an efficient summary structure for semantic OLAP. In: Proceedings of ACM SIGMOD'03, pp. 64–75 (2003)
15. Feng, Y., Agrawal, D., Abbadi, A.E., Metwally, A.: Range cube: efficient cube computation by exploiting data correlation. In: Proceedings of International Conference on Data Engineering 2004, pp. 658–670 (2004)
16. Morfonios, K., Ioannidis, Y.: Supporting the data cube lifecycle: the power of ROLAP. VLDB J. 17(4), 729–764 (2008)
17. Phan-Luong, V.: A simple and efficient method for computing data cubes. In: Proceedings of the 4th International Conference on Communications, Computation, Networks and Technologies (INNOV 2015), pp. 50–55 (2015)
18. Phan-Luong, V.: A simple data cube representation for efficient computing and updating. Int. J. Adv. Intell. Syst. 9(3 & 4), 255–264 (2016). www.iariajournals.org/intelligent_systems
19. Phan-Luong, V.: Searching data cube for submerging and emerging cuboids. In: Proceedings of the 2017 IEEE International Conference on Advanced Information Networking and Applications (AINA 2017), IEEE, pp. 586–593 (2017)
20. Phan-Luong, V.: First-half index base for querying data cube. In: Intelligent Systems and Applications, Proceedings of the Intelligent Systems Conference 2018 (IntelliSys 2018), 6–7 September 2018, London, UK (2018)
21. Phan-Luong, V.: A complete index base for querying data cube. In: Arai, K. (ed.) Proceedings of the 2021 Intelligent Systems Conference (IntelliSys), vol. 2, Intelligent Systems and Applications, Lecture Notes in Networks and Systems 295, Springer Nature Switzerland AG, pp. 486–500 (2022). https://doi.org/10.1007/978-3-030-82196-8_36
22. Borzsonyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: Proceedings of the ICDE Conference, pp. 421–430 (2001)
23. Bentley, J.L., Kung, H.T., Schkolnick, M., Thompson, C.D.: On the average number of maxima in a set of vectors and applications. J. ACM (1978)
24. Vlachou, A., Vazirgiannis, M.: Ranking the sky: discovering the importance of skyline points through subspace dominance relationships. Data Knowl. Eng. 69, 943–964 (2010)
25. Spyratos, N., Sugibuchi, T., Simonenko, E., Meghini, C.: Computing the skyline of a relational table based on a query lattice. In: CEUR Workshop Proceedings, vol. 876, pp. 145–160 (2012)


26. Valkanas, G., Papadopoulos, A., Gunopulos, D.: Skyline ranking à la IR. In: CEUR Workshop Proceedings, vol. 1133, pp. 182–187 (2014)
27. Preisinger, T., Endres, M.: Looking for the best, but not too many of them: multilevel and top-k skylines. Int. J. Adv. Softw. 8, 467–480 (2015)
28. Lakhal, L., Nedjar, S., Cicchetti, R.: Multidimensional skyline analysis based on agree concept lattices. Intell. Data Anal. 21, 1245–1265 (2017)
29. Alouaoui, H., Lakhal, L., Cicchetti, R., Casali, A.: CoSky: a practical method for ranking skylines in databases. In: The 11th International Conference on Knowledge Discovery and Information Retrieval, pp. 508–515, September 2019, Vienna, Austria (2019)
30. Luk, M.-H., Yiu, M.L., Lo, E.: Group-by skyline query processing in relational engines. In: CIKM'09, 2–6 November 2009, Hong Kong, China, pp. 1433–1436 (2009). https://doi.org/10.1145/1645953.1646138
31. Blackard, J.A.: The forest covertype dataset (1998). https://ftp.ics.uci.edu/pub/machine-learning-databases/covtype
32. Hahn, C., Warren, S., London, J.: Edited synoptic cloud reports from ships and land stations over the globe (2010). www.cdiac.esd.ornl.gov/cdiac/ndps/ndp026b.html
33. Census Modified Race Data Summary File for Counties Alabama through Missouri. www.census.gov/popest/research/modified/STCO-MR2010_AL_MO.csv
34. Online Retail Data Set, UCI Machine Learning Repository. www.archive.ics.uci.edu/ml/datasets/Online+Retail

Optimization of Agrivoltaic Plants: Development and Validation of a Numerical Model to Account for Shading Effects on Crop Yields

Michela Costa1,2(B), Stefano Barba3, Daniele Piazzullo1,2, and Adolfo Palombo4

1 Istituto di Scienze e Tecnologie per l'Energia e la Mobilità Sostenibili (STEMS), CNR, Via Marconi 4, Naples, Italy
[email protected]
2 DG Twin S.r.l, Campania NewSteel, Via Coroglio, 57d, 27, 80124 Naples, Italy
3 Department of Agricultural Sciences, University of Naples "Federico II", Naples, Italy
4 Department of Industrial Engineering, University of Naples "Federico II", Naples, Italy

Abstract. Agrivoltaic systems are becoming increasingly popular as a relevant technology for the achievement of sustainable development goals, such as clean energy and mitigation of climate change, while providing a significant economic advantage for farms. Understanding the effects of shading on crops is crucial for the selection of an optimal agrivoltaic system able to guarantee minimum loss in agricultural yields. In this study, an innovative numerical procedure is proposed for the evaluation of the dynamic performance of agrivoltaic plants with different configurations: a fixed vertical structure, a fixed structure with modules inclined by 30°, and tracking on a single axis. A parametric analysis is conducted to determine the optimal solution from the techno-economic point of view when serving a farm operating in the South of Italy. The three solutions are optimized by varying the size of the storage system in such a way as to minimize the investment cost and maximize savings (and self-consumed energy). An innovative aspect of the present work is the evaluation of the shading influence of the PV panels on the underlying crops. This influence is determined by developing an in-house numerical model capable of evaluating the shadows cast on the ground, once the geometry of the structure is defined, and the consequent reduction of incident radiation on the ground with the relative change in crop yield. The results in terms of incident radiation reduction showed an accuracy very similar to that evaluated through the commercial PVsyst® software. The plant configuration with a one-axis tracking system emerged from the study as the optimal solution.

Keywords: Numerical model · Agrivoltaic systems · Photovoltaic · Shadings · Crop yields

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 244–263, 2024. https://doi.org/10.1007/978-3-031-47715-7_17

1 Introduction

Because of the increasing demand for energy production from the exploitation of renewable energy sources, agrivoltaic systems represent an optimal solution for the efficient use of soil and the simultaneous energy and food production [1]. At the same time, an


optimization of water use is possible due to the microclimatic changes that these systems induce on the soil underneath the modules [2]. This solution is recommended in areas where soil availability is low, as the competition between agricultural and energetic land use can be largely reduced. Agrivoltaic systems have also been found to be very suitable for animal farms, by using the shade as shelter and the soil for grazing at the same time [3]. Despite the aforesaid advantages, the use of this technology is not yet widespread, as many factors influence the system yields in different ways. In this perspective, numerically supported studies can be useful to estimate the producibility and yield of crops at the design stage, also leading to significant incentives for the adoption of the agrivoltaic technology. In order to carry out a study of the producibility of photovoltaic systems, research activities have focused on the development of methods to assess the behavior of photovoltaic arrays subjected to mutual shading and shading by external objects. In 1987, Bany and Appelbaum [4] developed equations to calculate the shadow that forms on a solar collector array during the daytime. Subsequently, further models were implemented for the evaluation of shading factors under varying conditions in terms of both PV array geometry and boundary conditions, as developed by Silva et al. [5]. Nowadays, there are many commercial software tools, such as PVSyst® [6], allowing the modelling of any photovoltaic field with the corresponding analysis of the electrical producibility considering the effect of shading by means of a 3D tool. However, such tools are not capable of analyzing the performance of agrivoltaic solutions, as they are limited to the evaluation of shadows falling on the photovoltaic modules and do not allow the evaluation of shadows cast on the ground with the relevant incident radiation.
Moreover, although shading factors are frequently used to evaluate the performance of photovoltaic systems, there is still a limited amount of research on the evaluation of the influence that this shading has on the underlying crops [7, 8]. Numerous studies have therefore focused on a detailed study of the shading phenomenon. Campana et al. [9] analyzed the performances of an agrivoltaic system characterized by a vertical structure and located in Sweden, with the aim of characterizing the distribution of Photosynthetically Active Radiation (PAR), as this represents a key parameter for crop yield assessment. Amaducci et al. [10] developed a shading and radiation model for a biaxial tracking system located in Northern Italy. In this perspective, the results obtained from a meta-analysis conducted on experimental data on operating agrivoltaic systems proposed by Laub et al. [11] were analyzed: characteristic yield curves were obtained for crops that benefit from the presence of shading, correlating the reduction of incident radiation on the ground with the corresponding system yield. In this work, a numerical procedure, developed in the Matlab environment and validated against results achieved with PVSyst®, is proposed for the dynamic evaluation of the shade projected on the ground by agrivoltaic plants, with specific attention to its influence on the underlying crop yield. A technical-economic feasibility analysis is carried out among three different layout solutions proposed to serve a real farm located in South Italy [12]. Lastly, the optimal solution is analyzed in a dynamic regime against real energetic loads in a further model developed within the TRNSYS© environment, so as to highlight both the economic advantage and the improvement in crop production that this system may offer under some circumstances.


1.1 Shading Effect of Photovoltaic Panels on Crops Production: State of the Art

With the advent of agrivoltaic systems, the shade tolerance of crops is becoming increasingly important, attracting the attention of researchers to establish an experimental relationship between shading and the variation in crop growth parameters, analyzing both greenhouse [13] and open-field [14] photovoltaic applications. The crops most studied for their interaction with photovoltaic fields are tomatoes, lettuce, peppers, cucumbers, and strawberries, with the first two being favoured [15, 16], as they provide more immediate responses to the effects of shading due to their short life cycle. A key parameter in the study of shading is the coverage ratio provided by the PV modules to the crops underneath. Many studies aim to assess the best ratio (and consequently shading) that ensures both the best electrical and agricultural yields. In the study proposed by Cossu et al., the yields of numerous crops were analyzed varying the cover ratio from a minimum of 25% to a maximum of 100%. It was found that many crops requiring a high amount of light (e.g., tomato and cucumber) had, with a 25% coverage, a limited reduction in yield, which averaged 20% [17]. These results are also consistent with the results obtained by Lopez-Diaz et al., proving that an increase in shading from 15% to 50% resulted in a yield decrease from 16.9 kg/m2 to 11.5 kg/m2 [18] (approximately 40% less than the fully sunny case). To avoid the drop in yield while increasing the module density (above 50%), a study was conducted by installing films onto the back of the photovoltaic modules that allow light to diffuse. The results showed that in the summer and spring seasons the growth rate increased due to the more diffuse light available, even though it was reduced during the winter period due to the decrease in Photosynthetic Photon Flux Density (PPFD) [19].
In contrast, adopting a low coverage ratio of 20%, as in the case study analyzed by Kavga et al. [20] in Greece, better results were obtained in terms of lettuce growth indicators, with a coupled energy production of 50.83 kWh/m2 for the observed growing period. Blando et al. investigated the possible combination of energy production with the cultivation of berries by varying the coverage rate from 32% to 100%, finding a change in fruit characteristics such as an increase in antioxidant activity and a corresponding reduction in sugar content [21]. Concerning yield, on the other hand, shaded crops yielded 1.2 times the amount of fully sunlit crops [22]. In detail, strawberries returned a yield reduction of less than 25% with a shading rate between 25% and 50%. For a shading rate of more than 60%, cultivation is not recommended due to excessive yield losses [17]. Another study of particular interest is the one conducted in Germany by Trommsdorff et al. [23], in which the combined production of potatoes and photovoltaics led to a 3% increase in potato production and 83% of the reference electrical yield, with an overall Land Equivalent Ratio (LER) of 1.8.

2 Evaluation of Crop Yields

Although extensive research has been conducted on crop growth parameters, there is still a lack of systematic evaluations of how the yields of different types of crops respond to different levels of shade. To fill this research gap, Laub et al. [11] conducted a meta-analysis on the basis of experimental data collected from literature studies regarding agrivoltaic fields, with the aim of quantitatively estimating the susceptibility of different crop types


to increasing levels of shading. Yield response curves of specific crops constructed as a function of radiation reduction are shown in Fig. 1.

Fig. 1. Yield response curves due to solar radiation reduction [11].

The three curves shown in Fig. 1 highlight the different crop yields associated with the reduction of incident radiation on the ground. In particular, the first curve, in green, describes the average response of crops that benefit from shading. The yellow and red curves represent the average response of crops tolerant and susceptible to radiation reduction, respectively. Specifically, berries, fruits and vegetables benefited from up to a 30% reduction in radiation before suffering a less than proportional decrease, while forages, leafy vegetables and cereals initially showed a less than proportional reduction. Maize and legumes, on the other hand, showed yield losses even at low shade levels [11]. The equations describing each of these curves are:

Yb = (0.003 · x³) − (0.128 · x²) + (3.135 · x) + 99.052   (1)

Yt = −(0.010 · x²) − (0.265 · x) + 99.327   (2)

Ys = −(0.004 · x³) + (0.224 · x²) − (6.496 · x) + 100.620   (3)

where Yb, Yt and Ys respectively represent the predicted yields for crops which benefit from, tolerate, and are susceptible to the presence of shade. These equations are fundamental for the correct evaluation of the yield of a given crop according to the variation of the global radiation incident on the ground over the entire life cycle. In this paper, a numerical model is developed for estimating the reduction of solar radiation on the ground associated with agrivoltaic systems of different types. An innovative aspect of the work is the integrated estimation of the influence of shading on crop yields with a detailed evaluation of the produced energy under various geometrical configurations, using a unified numerical model.
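The three curves can be transcribed directly (in Python rather than the paper's Matlab; coefficients are taken as printed in Eqs. (1)–(3), and the polynomial fits are only meaningful over the shading range covered by the meta-analysis):

```python
def yield_benefit(x: float) -> float:
    """Eq. (1): predicted relative yield (%) of shade-benefiting crops
    as a function of the radiation reduction x (%)."""
    return 0.003 * x**3 - 0.128 * x**2 + 3.135 * x + 99.052

def yield_tolerant(x: float) -> float:
    """Eq. (2): shade-tolerant crops."""
    return -0.010 * x**2 - 0.265 * x + 99.327

def yield_susceptible(x: float) -> float:
    """Eq. (3): shade-susceptible crops."""
    return -0.004 * x**3 + 0.224 * x**2 - 6.496 * x + 100.620

# With no radiation reduction, all three curves stay close to 100%.
print(yield_benefit(0.0), yield_tolerant(0.0), yield_susceptible(0.0))
```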


3 Methodology

3.1 Incident Radiation Reduction at Ground Level

The percentage reduction of solar radiation on the ground due to agrivoltaic plant shades is here evaluated through a numerical model developed within the Matlab environment. Diffuse radiation is estimated by using the decomposition method proposed by Erbs et al. [24]. This model defines the diffuse fraction of radiation on a horizontal plane with respect to the global radiation (kd) starting from experimental data and correlating it with the clearness index of the sky (kt). Previous studies [25] already revealed the close correlation between these two quantities. In Eqs. (4) and (5), the following formulations are proposed:

kd = 1.0 − 0.09 · kt,  for kt ≤ 0.22
kd = 0.9511 − 0.1604 · kt + 4.388 · kt² − 16.638 · kt³ + 12.336 · kt⁴,  for 0.22 < kt ≤ 0.8   (4)
kd = 0.165,  for kt > 0.8

kt = Ih / Ioh   (5)

where Ioh and Ih represent the extraterrestrial radiation and the global horizontal radiation [W/m²], respectively. Finally, since the global radiation is the sum of the diffuse and beam radiation, the latter (Ibh) is obtained by difference, as proposed in Eq. (6):

Ibh = Ih − Idh   (6)
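The Erbs split of Eqs. (4)–(6) can be sketched in a few lines (Python rather than the paper's Matlab; variable names are illustrative):

```python
def erbs_diffuse_fraction(kt: float) -> float:
    """Eq. (4): diffuse fraction kd as a function of the clearness index kt."""
    if kt <= 0.22:
        return 1.0 - 0.09 * kt
    if kt <= 0.8:
        return (0.9511 - 0.1604 * kt + 4.388 * kt**2
                - 16.638 * kt**3 + 12.336 * kt**4)
    return 0.165

def decompose(i_h: float, i_oh: float):
    """Split global horizontal radiation into its diffuse and beam parts."""
    kt = i_h / i_oh                       # Eq. (5): clearness index
    i_dh = erbs_diffuse_fraction(kt) * i_h
    i_bh = i_h - i_dh                     # Eq. (6): beam obtained by difference
    return i_dh, i_bh
```

For example, decompose(500.0, 1000.0) corresponds to kt = 0.5, where the middle branch of Eq. (4) applies.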

In addition, once the components of the global radiation on a horizontal surface are known, the model of Perez et al. [26] is adopted to calculate the beam (Ib) and the diffuse incident radiation (Id) on the ground, in the case where no shading is present:

Ib = Ibh · Rb   (7)

where Rb is a geometrical factor defined as the ratio of the cosine of the angle of incidence of radiation on the surface (θ) to the cosine of the zenith angle (z) [27]:

Rb = cos θ / cos z   (8)

In order to evaluate the diffuse incident radiation, this model considers three contributions, as shown in Eq. (9). The first one is the isotropic part, received uniformly from the whole celestial vault; the second one is the circumsolar diffuse part (resulting from the scattering of solar radiation, concentrated in the area of the sky around the sun); and the third is linked to the brightness of the horizon (concentrated near the horizon and more pronounced for clear skies). The following relationship holds:

Id = Idh · [(1 − F1) · Fsc + F1 · (a/c) + F2 · sin β]   (9)

where:


• F1, F2 are the circumsolar and horizon brightness coefficients;
• a and c take into account the incidence angles of the cone of circumsolar radiation;
• Fsc is the celestial vault view factor for a surface inclined by β relative to the horizontal.

Once the beam (Ib) and diffuse incident radiation (Id) on the ground in the absence of shadows are obtained, the beam and diffuse radiation reductions induced by the presence of shadows are evaluated from relations (10):

ΔIbeam = Ib · fB
ΔIdiff = Id · fD   (10)

where fB is the direct shading factor, defined as the ratio of the shaded area AS to the area of the ground AG, as in Eq. (11) [28]:

fB = AS / AG   (11)

while fD is the diffuse shading factor, which is constant throughout the year, so it has to be determined only once [28]. The formulation used here is the one that takes into account the angle of incidence θ and the radiance R of the celestial vault, as seen from the site of interest, as given by Eq. (12) [29], whose region of integration is restricted to only those points of the hemisphere that are in front of the considered surface [28]:

fD = ∬ fB · R · cos θ · cos α dα dΨ / ∬ R · cos θ · cos α dα dΨ   (12)

However, relation (12) can be simplified when considering an isotropic sky. In this specific condition, the radiance R can be taken out of the integrals and cancelled. Equation (12) therefore becomes the summation (13) [29]:

fD = Σ (fB · cos θ · cos α) / Σ (cos θ · cos α)   (13)

The objective at this stage is to evaluate, once the geometry of the system has been identified by the coordinates of the vertices of the shading element, the projection of these vertices onto the ground, in order to identify the portion of the ground affected by the shadow. In this regard, a Matlab function using the characteristic equations of sun geometry [27] was developed by the authors to define the position of the sun through the solar height (α) and the solar azimuth (Ψ) for each hour of the entire year, once the coordinates of the site are defined. A vector representing the direction of the sun rays is defined by using the procedure proposed by Cascone et al. [30], in which the solar vector, named S, is evaluated by relation (14):

S = (dirx, diry, dirz) = (cos Ψ · cos α, cos α · sin Ψ, sin α)   (14)


The projections of the vertices of the shading panel onto the ground are defined by the intersection of the straight lines passing through those vertices, with the direction of the sun rays, and the ground plane. The parametric line passing through the first vertex is:

Tpx = Px1 + (dirx · t)
Tpy = Py1 + (diry · t)   (15)
Tpz = Pz1 + (dirz · t)

where the intersection between the straight line and the ground is evaluated by determining the parameter t through relation (16), imposing that the z-coordinate of the projected point equals the ground height hz (here zero):

t = (hz − Pz1) / dirz   (16)
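Eqs. (14)–(16) and the area ratio of Eq. (11) combine into a short sketch (Python rather than Matlab; angles in radians, hz = 0 for flat ground, and the shoelace formula for polygon areas are illustrative assumptions):

```python
import math

def sun_vector(alpha, psi):
    """Eq. (14): direction of the sun from solar height alpha and azimuth psi."""
    return (math.cos(psi) * math.cos(alpha),
            math.cos(alpha) * math.sin(psi),
            math.sin(alpha))

def project_to_ground(p, s, hz=0.0):
    """Eqs. (15)-(16): project vertex p along sun direction s onto plane z = hz."""
    t = (hz - p[2]) / s[2]
    return (p[0] + s[0] * t, p[1] + s[1] * t)

def polygon_area(pts):
    """Shoelace formula for a simple polygon of (x, y) vertices."""
    total = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def beam_shading_factor(panel_vertices, ground_vertices, alpha, psi):
    """Eq. (11): f_B = A_S / A_G (shadow assumed to fall inside the ground area)."""
    s = sun_vector(alpha, psi)
    shadow = [project_to_ground(p, s) for p in panel_vertices]
    return polygon_area(shadow) / polygon_area(ground_vertices)

# A horizontal 1 m x 1 m panel 3 m above a 10 m x 10 m field, sun at 45 deg:
panel = [(0, 0, 3), (1, 0, 3), (1, 1, 3), (0, 1, 3)]
ground = [(-5, -5), (5, -5), (5, 5), (-5, 5)]
f_b = beam_shading_factor(panel, ground, math.radians(45), 0.0)  # ~0.01
```

With the sun at 45° elevation, each vertex's shadow lands h/tan(α) = 3 m away horizontally, so the 1 m² shadow covers about 1% of the 100 m² field.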

In the case where multiple shading elements are present, the phenomenon of mutual shading of modules may occur, as shown in Fig. 2.

Fig. 2. Two shading elements affected by mutual shading.

In order to account for this possibility and make sure that the shaded ground area is not oversized, an additional intersection between the shadows brought to the ground is implemented in the model. This intersection makes it possible to define the areas of the ground where two overlapping shadows are incident; by doing so, the area of the intersection (otherwise counted twice) is subsequently subtracted from the shaded area, so that it is counted only once. Having performed the calculation procedure for each position and obtained the shadows cast on the ground, the intersection of these shadows with the cultivated area is evaluated, obtaining as output the values of the shaded area that actually falls on the area of interest.

3.2 Agrivoltaic Plants Geometry

The agrivoltaic systems considered in the study consist of two arrays for each structure investigated, whose positioning in space is identified by defining the coordinates of the vertices of each array.


In the Matlab numerical model, the geometric characteristics of the plant are defined by introducing, for each shading element, the vectors of the spatial coordinates of its vertices, as follows:

P1 = (x1, y1, h1)   (17)
P2 = (x1 + c, y1, h1)   (18)
P3 = (x1 + c, y1 + l, h2)   (19)
P4 = (x1, y1 + l, h2)   (20)

where x1 and y1 are the coordinates of the first vertex of the shed and l is the length of the shed. The parameter c is a function of the tilt angle β and of the width of the shed w, as defined in Eq. (21):

c = w · cos β   (21)

Subsequently, the heights of the vertices of the shed (h1, h2) are identified through relations (22) and (23), linking the tilt angle β of the shed with its width w:

h1 = h − (w/2) · sin β   (22)
h2 = h + (w/2) · sin β   (23)

The same method is applied to define the crop area beneath the modules through the coordinates of the soil:

G1 = (x0, y0, 0)   (24)
G2 = (x0 + wg, y0, 0)   (25)
G3 = (x0 + wg, y0 + lg, 0)   (26)

252

M. Costa et al.

$$G_4 = \begin{pmatrix} x_0 \\ y_0 + l_g \\ 0 \end{pmatrix} \qquad (27)$$

where $x_0$ and $y_0$ are the coordinates of the first vertex and $w_g$ and $l_g$ are the ground width and length, respectively.

3.3 Model Results Validation

To validate the estimation of the radiation reduction obtained from the proposed Matlab model, its results are compared with those produced by the commercial software PVSyst®. It must be noted, however, that PVSyst only allows quantifying the overall shading on a PV module caused by mutual interaction with another module or by the presence of external objects. Therefore, the irradiance variation on the underlying cultivated ground caused by a single PV module, installed at a height of 3 m from the ground, is estimated in PVSyst by covering the ground area with horizontal PV modules. The single PV module is defined as a parallelepiped with length and width equal to 1 m and a thickness of 0,02 m. As for the ground, 20 modules arranged in five rows of four modules each were used, with zero spacing assumed between the boards; each of these modules has a length of 2,078 m, a width of 0,992 m and a thickness of 35 mm. In a first analysis, a comparison is made between the values of the total incident radiation (I_tot) and the incident radiation reduced due to shading (I_shd), evaluated in both the Matlab and PVSyst environments. The results for June 21st are presented as an example in Table 1 in terms of percentage error, while the results evaluated over a whole year are shown in Fig. 3.
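The shed geometry of Sect. 3.2 (Eqs. (17)–(23)) can be sketched in Python (the study's model is implemented in Matlab). Names are illustrative; the paper writes the x-coordinate of $P_3$ as $x_2$, which is assumed here to equal $x_1 + c$:

```python
import math

def shed_vertices(x1, y1, h, w, l, beta_deg):
    """Vertices P1..P4 of a tilted shed (Eqs. 17-23); beta is the tilt angle.

    c = w*cos(beta) is the horizontal extent of the tilted shed (Eq. 21);
    h1 and h2 are the heights of the lower and upper edges (Eqs. 22-23).
    """
    beta = math.radians(beta_deg)
    c = w * math.cos(beta)                 # Eq. (21)
    h1 = h - (w / 2) * math.sin(beta)      # Eq. (22)
    h2 = h + (w / 2) * math.sin(beta)      # Eq. (23)
    P1 = (x1, y1, h1)
    P2 = (x1 + c, y1, h1)
    P3 = (x1 + c, y1 + l, h2)  # the paper writes x2 here; x1 + c is assumed
    P4 = (x1, y1 + l, h2)
    return P1, P2, P3, P4
```

With a zero tilt angle the shed is flat at height $h$; with increasing tilt the two edges split symmetrically around $h$.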

4 Agrivoltaic Plants Dynamic Simulation

In this section, three configurations for an agrivoltaic installation serving a farm located in Sant'Arcangelo, Province of Potenza, Italy (Table 2) are analyzed in a dynamic regime. The effects that this type of installation has on crop yields are investigated using the Matlab model presented above. The design parameters used to model the three configurations are reported in Table 3. Once the vertices are defined, their ground projections are identified, with subsequent evaluation of the shaded area. Representations of the modeled structures in the Matlab environment are shown in Fig. 4: the sheds are represented in blue, while the ground considered for the study is in green. Figure 5 shows a flowchart summarizing the steps implemented in the numerical model, leading from the input data (in red) to the results in terms of expected yield variation. The optimal solution among the three proposals is determined by sizing the storage system within a properly formulated multi-objective optimization problem and by comparing the achieved results. The two contrasting objective functions are the investment cost (to be minimized) and the self-consumed energy (to be maximized). Maximizing self-consumed energy also maximizes economic savings, in which income from the possible sale of fruit and vegetables, whose yield was affected by the presence of the plant, was


Table 1. Comparison of the PVSyst and Matlab model results for June 21st.

Time [h] | I_tot PVSyst [W/m²] | I_shd PVSyst [W/m²] | I_tot Matlab [W/m²] | I_shd Matlab [W/m²] | Percentage error [%]
4110 | 68,71  | 67,78  | 68,71  | 68,17  | –0,58
4111 | 162,00 | 160,00 | 162,00 | 159,90 | 0,06
4112 | 363,50 | 355,74 | 363,50 | 359,15 | –0,96
4113 | 583,41 | 570,37 | 583,41 | 570,90 | –0,09
4114 | 662,90 | 648,08 | 662,90 | 648,97 | –0,14
4115 | 706,30 | 690,88 | 706,30 | 691,81 | –0,14
4116 | 712,20 | 696,89 | 712,20 | 697,95 | –0,15
4117 | 763,39 | 746,80 | 763,39 | 747,86 | –0,14
4118 | 787,99 | 770,16 | 787,99 | 771,63 | –0,19
4119 | 711,00 | 694,79 | 711,00 | 696,36 | –0,23
4120 | 602,90 | 589,15 | 602,90 | 590,60 | –0,25
4121 | 377,60 | 369,44 | 377,60 | 370,53 | –0,29
4122 | 245,80 | 244,12 | 245,80 | 241,53 | 1,06
4123 | 116,99 | 116,23 | 116,99 | 115,15 | 0,93

Fig. 3. Distribution of the percentage error evaluated for every hour of the year, mean error (red line) and standard deviation (yellow lines).
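The percentage errors in Table 1 are consistent with the relative deviation of the Matlab shaded irradiance from the PVSyst reference value. A sketch of this comparison (the normalization by the PVSyst value is inferred from the tabulated numbers, not stated explicitly in the text):

```python
def percentage_error(i_shd_pvsyst, i_shd_matlab):
    """Relative deviation of the Matlab shaded irradiance from the
    PVSyst reference, in percent (definition inferred from Table 1)."""
    return (i_shd_pvsyst - i_shd_matlab) / i_shd_pvsyst * 100.0
```

For the first row of Table 1 (67,78 vs. 68,17 W/m²) this reproduces the tabulated –0,58%.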

considered. Another parameter investigated to evaluate the performance of the plant is the Land Equivalent Ratio (LER) [12]:

$$\mathrm{LER} = \frac{Y_{crop,APV}}{Y_{crop,only}} + \frac{Y_{electricity,APV}}{Y_{electricity,only}} \qquad (28)$$
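Eq. (28) sums the relative crop yield and the relative electricity yield of the dual-use land; a minimal sketch with illustrative names:

```python
def land_equivalent_ratio(y_crop_apv, y_crop_only, y_elec_apv, y_elec_only):
    """Eq. (28): relative crop yield plus relative electricity yield.

    LER > 1 means the dual land use outperforms producing crops and
    electricity on separate plots of the same total area.
    """
    return y_crop_apv / y_crop_only + y_elec_apv / y_elec_only
```

For instance, retaining 80% of the reference crop yield while producing 90% of the reference electricity yield gives LER = 1,7.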

Table 2. Geographic coordinates of the farm.

Sant'Arcangelo (PZ)
Latitude  | 40° 15′ 24.16″ N
Longitude | 16° 15′ 34.06″ E
Altitude  | 388 m a.s.l.

Table 3. Design parameters for the three proposals.

Input parameters       | Vertical | Fixed | One-axis tracking
Module width [m]       | 0,99     | 0,99  | 0,99
Module length [m]      | 2,08     | 2,08  | 2,08
Number of modules      | 48       | 48    | 48
Total module area [m²] | 98,84    | 98,84 | 98,84
Shed width (w) [m]     | 1,98     | 11,88 | 1,98
Shed length (l) [m]    | 24,96    | 4,16  | 24,96
Crop area [m²]         | 782      | 782   | 782
Height (h) [m]         | 1,5      | 4     | 4
Fixed tilt angle [°]   | –90      | –30   | –
Maximum tilt angle [°] | –        | –     | –30
Minimum tilt angle [°] | –        | –     | 80
Azimuth angle [°]      | 0        | 0     | 0

Fig. 4. Layout of (a) the vertical shed solution; (b) the fixed shed solution; (c) the one-axis tracking solution. The sheds are represented in blue, the ground considered for the study in green.


Fig. 5. Numerical model flowchart.

Finally, in order to evaluate the positive environmental impact of the solar plant, a relevant parameter considered is the "electricity mix emission factor". It represents the average CO2 emission due to electricity production in Italy and is equal to 0,48 kg CO2/kWh [31]. For the economic analysis of the proposed solutions, the cost items considered in the evaluation of the investment cost [32, 33] are listed in Table 4. The costs in Table 4 are the same for the three configurations, except for those relating to the structure, which range from the cheapest (vertical structure) to the most expensive (equipped with a tracking system). Cost savings were evaluated with the relation:

$$Savings = C_{el\_taken} + P_{el\_sold} + P_{veg\_sales} \qquad (29)$$

in which the costs and profits related to electricity taken from and sold to the grid ($C_{el\_taken}$ and $P_{el\_sold}$, respectively) are taken into account, together with the extra profits obtained from the sale of fruit and vegetables ($P_{veg\_sales}$). The running costs are shown in Table 5 [31, 34]. In particular, the costs and revenues related to the energy withdrawn from and transferred to the grid are specified for the time slot in which these exchanges with the grid occur. The F3 band covers Monday to Saturday from 11 p.m. to 7 a.m., plus all day Sunday. The F2 band covers Monday to Friday from 7 to 8 a.m. and from 7 to 11 p.m., and Saturdays from 7 a.m. to 11 p.m. The F1 band covers Monday to Friday from 8 a.m. to 7 p.m.
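The band definitions above can be sketched as a small lookup. Names are illustrative, and boundary hours are assumed to be half-open intervals (e.g. hour 7 means 7:00–7:59):

```python
def tariff_band(weekday, hour):
    """Italian tariff band for a given weekday (0 = Monday .. 6 = Sunday)
    and hour of day, following the band definitions in the text."""
    if weekday == 6:                      # Sunday: always F3
        return "F3"
    if hour < 7 or hour >= 23:            # Mon-Sat, 11 p.m.-7 a.m.
        return "F3"
    if weekday == 5:                      # Saturday, 7 a.m.-11 p.m.
        return "F2"
    if hour == 7 or 19 <= hour < 23:      # Mon-Fri, 7-8 a.m. and 7-11 p.m.
        return "F2"
    return "F1"                           # Mon-Fri, 8 a.m.-7 p.m.
```

Such a lookup lets each hourly grid exchange in the simulation be priced with the corresponding row of Table 5.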

Table 4. Investment costs.

440 W bifacial module       | 500 [€/kW]
20 kW inverter              | 4000 [€]
Storage                     | 950 [€/kWh]
Vertical structure          | 94 [€/kW]
Fixed structure             | 300 [€/kW]
One-axis tracking structure | 730 [€/kW]
Administrative costs        | 10800 [€]

Table 5. Running costs.

Cost of energy taken from the grid
F1 | 0,52100 [€/kWh]
F2 | 0,49200 [€/kWh]
F3 | 0,49200 [€/kWh]

Profit from energy sold to the grid
F1 | 0,04520 [€/kWh]
F2 | 0,05306 [€/kWh]
F3 | 0,03305 [€/kWh]

Profit from sales of vegetables
Bean   | 1,10 [€/kg]
Wheat  | 0,72 [€/kg]
Potato | 0,50 [€/kg]

The optimal solution for each proposal was chosen as the battery size guaranteeing a simple payback (SPB) period of less than 7 years, since the useful life of an agrivoltaic system is approximately 10 years:

$$SPB = \frac{Investment\ cost}{Savings} < 7\ \text{years} \qquad (30)$$
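The screening in Eq. (30) can be sketched directly (illustrative names; a sketch of the selection criterion, not the authors' code):

```python
def simple_payback(investment_cost, annual_savings):
    """Eq. (30): simple payback period in years."""
    return investment_cost / annual_savings

def feasible(investment_cost, annual_savings, limit_years=7):
    """Feasibility screening used in the study: accept a battery size
    only if the simple payback period is strictly below the limit."""
    return simple_payback(investment_cost, annual_savings) < limit_years
```

A candidate with a 7,0-year payback is thus rejected, while one paying back in 6 years is accepted.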

4.1 TRNSYS© Model

Dynamic analyses of the proposed solutions are performed with the aid of the TRNSYS© environment [35]. The presence of the inverter is modeled using Type 48c, which manages the energy flows coming from the generation system, while the grid and the storage are modeled using Type 47b. The electrical energy demand is provided as input through a load curve, as is the hourly producibility of the agrivoltaic plant, by means of two data readers (Type 9c). The decision variable is the capacity of the lithium iron phosphate (LiFePO4) storage system, which is varied from a minimum of 1,5 kWh to a maximum of 152 kWh by varying the number of cells in series from 6 to 60 with a step of six and the number of cells in parallel from 7 to 70 with a step of seven. The layout of the model in this environment is shown in Fig. 6, and Table 6 reports the characteristics of the single cell [32].
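The parametric sweep of storage capacities described above can be reproduced from the single-cell energy in Table 6 (36,3 Wh per cell); a sketch with illustrative names:

```python
CELL_WH = 36.3  # energy of one LiFePO4 cell (Table 6)

def candidate_capacities():
    """Storage capacities explored in the TRNSYS parametric study:
    6-60 cells in series (step 6) x 7-70 strings in parallel (step 7)."""
    return sorted({ns * n_par * CELL_WH / 1000.0   # kWh
                   for ns in range(6, 61, 6)
                   for n_par in range(7, 71, 7)})
```

The smallest configuration (6 × 7 cells) yields about 1,5 kWh and the largest (60 × 70 cells) about 152 kWh, matching the range quoted in the text.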

Fig. 6. TRNSYS© model considered for the three configurations.

Table 6. Lithium iron phosphate single cell characteristics.

LiFePO4
Nominal capacity [Ah] | 16,5
Nominal voltage [V]   | 2,2
Energy [Wh]           | 36,3
Max–Min FSOC [%]      | 0,9–0,2
Nominal life cycles   | 3000

Electrical Demand of the Farm. The farm's overall energy demand corresponds to an annual consumption of 37032 kWh/year. By analyzing the demands of the utilities on the farm and the consumption read from the bills, the hourly load curve provided as input to the TRNSYS© model was estimated, as shown in Fig. 7.

Electrical Production of the Agrivoltaic Plant. The software used for sizing the PV generator is PVsyst®. It allows the user to estimate the electricity produced by the system by simulating the dynamic behavior of the panel array. The program has its own weather database and assists the user in the various stages of design, from choosing the orientation plan to defining the complete layout of the strings in the field. In addition, the user can introduce further details, from module specifications to losses in the transformer iron downstream of the inverter. Finally, the software provides a final report with the energy produced and a breakdown of losses. The hourly data obtained from the simulation were exported to an ".xlsx" file and used to perform the dynamic simulation in TRNSYS©.

Fig. 7. Estimated load curve.

Influence of the Agrivoltaic Plant on Crop Yield. To assess the influence of the plant's presence on the crops, the Matlab model was used once the crop cycle was defined, so as to evaluate the reduction of radiation reaching the soil over the entire life cycle of the crop under investigation. For wheat, whose life cycle runs from November to May, the reduction of radiation to the ground over the whole period was evaluated from the calculated shadows. This parameter was adopted in Eq. (3) to obtain the expected yield. The same procedure was used for bean and potato.

4.2 Results of the Simulation

Using the model just described, it was possible to calculate the energy self-consumed, withdrawn from and sold to the grid in a year as the decision variable varied according to the utility's needs. The main results obtained for each proposed plant configuration are presented below.

Vertical. Analyzing the results, it is evident that as the capacity of the battery pack increases, the self-consumed energy increases, with a consequent decrease in the energy withdrawn from and sold to the grid. From an energy point of view, however, no substantial differences are noticeable once the capacity exceeds 55 kWh. Thanks to the implemented numerical model, the reduction of incident radiation to the ground, the expected yield variation and the profits from the sale of fruit and vegetables were evaluated, as shown in Table 7.


Table 7. Expected yields due to shading effect and relative profits.

       | Yield [kg/m²] | Radiation reduction [%] | Yield variation [%] | Expected yield [kg/year] | Sale price [€/kg] | Earnings [€]
Bean   | 9,9  | 10,32 | 20,55  | 9332,64  | 1,1  | 1794,92
Wheat  | 8,3  | 12,41 | –52,05 | 3112,24  | 0,72 | –2432,42
Potato | 59,4 | 12,56 | 23,06  | 57162,38 | 0,5  | 5355,79

As expected, the wheat yield was reduced, wheat being a crop that does not tolerate shading, unlike the other two, which are well suited to agrivoltaic applications. In spite of this, an extra profit of 4673,29 €/year was realised on fruit and vegetable sales. The optimal solution returns an investment cost of 78.821,32 € and a saving of 10.136,11 €. The capacity of the battery pack is approximately 38 kWh, while the reduction in CO2 emissions compared to the base case is approximately 24,31 t/year (i.e., 28,46% less). The LER reaches a value of 1,4.

Fixed. Similarly to the first scenario, this case refers to an agrivoltaic system of equal power but characterized by a fixed structure in which the modules face South with a 30° inclination. The optimum point guaranteeing an SPB of less than 7 years corresponds to an investment cost of 101.770,13 € and a saving of 15.015,05 €, which includes an extra profit from the sale of fruit and vegetables of 5089 €. The identified optimum returns an SPB of 6,78 years, a battery pack capacity of about 75,00 kWh and an LER of 1,7. The reduction in CO2 emissions compared to the baseline is approximately 28,94 t/year (54,5% less).

One-Axis Tracking. Finally, for the third scenario, the optimum point corresponds to an investment cost of 110.370,3 € and a saving of 15.797,26 € (with an extra profit from the sale of fruit and vegetables of 5098 €). The SPB is about 7 years and the capacity of the battery pack is about 75 kWh. The LER is 1,8 and the reduction of CO2 emissions compared to the reference system is approximately 29,52 t/year (i.e., 57,74% less).

4.3 Solution Comparison

In order to compare the optimal solutions, the main results obtained for each proposal are summarized in Table 8. Comparing the percentages, the third scenario shows the best performance: the percentage of self-consumed energy is higher and that taken from the grid is lower than in the other two cases. Focusing on the economic analysis, however, the values in Table 8 show that the last two scenarios are very similar, while the first has a lower investment cost and lower savings. The SPB values found are lower than those obtained by neglecting the change in yield of the crops placed below the plant. In particular, if only crops that benefit from the shading effect were included, this value could decrease further.

Table 8. Mean results for each optimal solution.

                                                  | Vertical  | Fixed      | One-axis tracking
Self-consumed energy [%]                          | 30,57     | 55,32      | 58,53
Withdrawn from the grid [%]                       | 69,42     | 44,67      | 41,46
Given to the grid [%]                             | 5,42      | 19,49      | 31,17
MCO2 [%]                                          | 28,46     | 54,5       | 57,74
Investment cost [€]                               | 78.821,32 | 101.770,13 | 110.797,13
Economic savings [€/year]                         | 10.136,11 | 15.015,15  | 15.797,26
SPB [years]                                       | 6,29      | 6,78       | 6,99
Radiation reduction [%] (during bean lifecycle)   | 10,32     | 13,84      | 10,32
Radiation reduction [%] (during wheat lifecycle)  | 12,41     | 15,04      | 12,41
Radiation reduction [%] (during potato lifecycle) | 12,56     | 14,32      | 12,56
Bean expected yield variation [%]                 | 20,55     | 24,24      | 24,34
Wheat expected yield variation [%]                | –52,05    | –57,68     | –55,48
Potato expected yield variation [%]               | 23,06     | 24,63      | 24,19
LER                                               | 1,4       | 1,7        | 1,8

Finally, from an environmental point of view, coupling the proposed system with the third scenario allows the greatest reduction in greenhouse gas emissions. Based on the simulations, the solution that best fits the needs of the farm is therefore the one with the tracking system.

5 Conclusions

The present work was conducted to optimise the geometrical configuration of an agrivoltaic power plant. An original code was developed, able to determine the reduction of incident radiation on the ground and the subsequent effect that this reduction has on the crops underneath the modules. A techno-economic feasibility analysis of three plant configurations to meet the energy demand of the considered farm was also carried out. The three proposals differ in the type of structure used, which affects the investment cost:

• A first configuration with modules mounted on a vertical structure, with an investment cost of 78.821 €.
• A second configuration with modules mounted 4 m above the ground on a fixed South-facing structure, on which the modules are tilted by 30°. This configuration involves an investment cost of 101.770 €.
• A third configuration with the modules installed on a structure equipped with a single-axis tracking system, with an investment cost of 110.797 €.


The installations were sized for the same peak power, using 48 modules of 440 W each. A cultivation plan was defined by alternating a soil-preparing crop, a depleting crop and an improving crop, chosen respectively as potato, winter wheat and bean. A first result was obtained through the in-house code, which made it possible to evaluate the shadow projected on the ground by the three configurations and the consequent variation in the yield of the underlying crops. For bean and potato, crops particularly suited to the application, an expected yield increase of 20% was estimated for the first solution and of 24% for the last two. In contrast, wheat, which is particularly susceptible to radiation reduction, showed a yield reduction of about 50%. The change in crop yields was reflected in extra profits associated with the sale of fruit and vegetables. From this point of view, the preferred solutions were the fixed configuration and the one equipped with the tracking system, which yielded extra profits of 5090 € and 5100 €, respectively. The parametric optimization and the subsequent feasibility analysis showed that, for all three configurations, savings increase as the size of the battery increases, as do investment costs. The study concluded that the most profitable scenario is the one in which the agrivoltaic field equipped with a mono-axis tracking system is associated with a 75,0 kWh battery pack. This solution guarantees the greatest amounts of self-consumed energy and CO2 saved, up to 60% and 58%, respectively. Ultimately, the results showed the great economic and production potential that these systems could provide for farms.

In the future, an approach of this type could be used already in the design phase to estimate crop yields in a preliminary way, in order to guide both the choice of the crop and the optimal plant solution for the specific application. This information could further encourage the adoption of this technology among more skeptical stakeholders. Finally, future work could evaluate a control logic for the tracking systems aiming at a trade-off between agricultural and electrical yield.

References

1. Proctor, K., Murthy, G., Higgins, C.: Agrivoltaics align with green new deal goals while supporting investment in the US' rural economy. Sustainability 13 (2020)
2. Marrou, H., Guilioni, L., Dufour, L., Dupraz, C., Wery, J.: Microclimate under agrivoltaic systems: is crop growth rate affected in the partial shade of solar panels? Agric. For. Meteorol. 177, 117–132 (2013)
3. Pascaris, A.S., Handler, R., Schelly, C., Pearce, J.M.: Life cycle assessment of pasture-based agrivoltaic systems: emissions and energy use of integrated rabbit production. Clean. Respons. Consum. 3 (2021)
4. Bany, J., Appelbaum, J.: The effect of shading on the design of a field of solar collectors. Sol. Cells 20, 201–228 (1987)
5. Silva, M., Roberts, J.J., Prado, P.O.: Calculation of the shading factors for solar modules with MATLAB. Energies 14 (2021)
6. PVsyst, https://www.pvsyst.com/, last accessed 2022/07/11
7. Dinesh, H., Pearce, J.M.: The potential of agrivoltaic systems. Renew. Sustain. Energy Rev. 54, 299–308 (2016)
8. Ott, E.M., Kabus, C., Baxter, B., Hannon, B., Celik, I.: Environmental analysis of agrivoltaic systems. Comprehens. Renewable Energy (2nd edn.) 9, 127–139 (2022)


9. Campana, P.E., Stridh, B., Amaducci, S., Colauzzi, M.: Optimisation of vertically mounted agrivoltaic systems. J. Clean. Prod. 325 (2021)
10. Amaducci, S., Yin, X., Colauzzi, M.: Agrivoltaic systems to optimise land use for electric energy production. Appl. Energy 220, 545–561 (2018)
11. Laub, M., Pataczek, L., Feuerbacher, A., et al.: Contrasting yield responses at varying levels of shade suggest different suitability of crops for dual land-use systems: a meta-analysis. Agron. Sustain. Dev. 42(51) (2022)
12. Neupane Bhandari, S., Schlüter, S., Kuckshinrichs, W., Schlör, H., Adamou, R., Bhandari, R.: Economic feasibility of agrivoltaic systems in food-energy nexus context: modelling and a case study in Niger. Agronomy 11(10) (2021)
13. Hassanien, R.H.E., Li, M., Yin, F.: The integration of semitransparent photovoltaics on greenhouse roof for energy and plant production. Renew. Energy 121, 377–388 (2018)
14. Barron-Gafford, G.A., et al.: Agrivoltaics provide mutual benefits across the food–energy–water nexus in drylands. Nat. Sustain. 2, 848–855 (2019)
15. Kavga, A., Trypanagnostopoulos, G., Zervoudakis, G., Tripanagnostopoulos, Y.: Growth and physiological characteristics of lettuce (Lactuca sativa L.) and rocket (Eruca sativa Mill.) plants cultivated under photovoltaic panels. Notulae Botanicae Horti Agrobotan. Cluj-Napoca 46, 206–212 (2018)
16. Ezzaeri, K., et al.: The effect of photovoltaic panels on the microclimate and on the tomato production under photovoltaic canarian greenhouses. Sol. Energy 173, 1126–1134 (2018)
17. Cossu, M., et al.: Agricultural sustainability estimation of the European photovoltaic greenhouses. Eur. J. Agron. 118, 126074 (2020)
18. Lopez-Diaz, G., Carreño-Ortega, A., Fatnassi, H., Poncet, C., Diaz-Perez, M.: The effect of different levels of shading in a photovoltaic greenhouse with a north–south orientation. Applied Science 10, 882 (2020)
19. Tani, A., Shiina, S., Nakashima, K., Hayashi, M.: Improvement in lettuce growth by light diffusion under solar panels. J. Agricul. Meteorol. 70, 139–149 (2014)
20. Eberspacher, C., Fredric, C., Pauls, K., Serra, J.: Thin-film CIS alloy PV materials fabricated using non-vacuum, particles-based techniques. Thin Solid Films 387, 18–22 (2001)
21. Blando, F., Gerardi, C., Renna, M., Castellano, S., Serio, F.: Characterisation of bioactive compounds in berries from plants grown under innovative photovoltaic greenhouses. J. Berry Res. 8, 55–69 (2018)
22. Tang, Y., Ma, X., Li, M., Wang, Y.: The effect of temperature and light on strawberry production in a solar greenhouse. Sol. Energy 195, 318–328 (2020)
23. Trommsdorff, M., et al.: Combining food and energy production: design of an agrivoltaic system applied in arable and vegetable farming in Germany. Renew. Sustain. Energy Rev. 140, 110694 (2021)
24. Erbs, D.G., et al.: Estimation of the diffuse radiation fraction for hourly, daily and monthly-average global radiation. Sol. Energy 28, 293–302 (1982)
25. Orgill, J.F., Hollands, K.G.T.: Correlation equation for hourly diffuse radiation on a horizontal surface. Sol. Energy 19(357) (1977)
26. Perez, R., Ineichen, P., Seals, R., Michalsky, J., Stewart, R.: Modeling daylight availability and irradiance components from direct and global irradiance. Sol. Energy 44(5), 271–289 (1990)
27. Chwieduk, D.: Solar Energy in Buildings: Thermal Balance for Efficient Heating and Cooling, 1st edn., pp. 21–59. Academic Press (2014)
28. Quaschning, V., Hanitsch, R.: Shade calculations in photovoltaic systems. In: Proceedings of the ISES Solar World Conference, pp. 1–5. Harare, Zimbabwe (1995)
29. Westbrook, O., Reusser, M., Collins, F.: Diffuse shading losses in tracking photovoltaic systems. In: Proceedings of the IEEE 40th Photovoltaic Specialist Conference (PVSC), pp. 0891–0896. Denver, CO, USA (2014)


30. Cascone, Y., Corrado, V., Serra, V.: Calculation procedure of the shading factor under complex boundary conditions. Sol. Energy 85(10), 2524–2539 (2011)
31. ENEL.com, https://www.enel.com/content/dam/enel.com/documenti/investitori/sostenibilita/inventario-ghg-2020.pdf, last accessed 2022/06/17
32. Grilletto, A.: Dimensionamento in regime dinamico di un impianto agrivoltaico sperimentale per la mitigazione degli impatti climatici sulle colture del Sud Italia. Master's thesis, University of Naples Federico II, 2021–2022. Palombo, A., Costa, M.
33. tfz.bayern.de, https://www.tfz.bayern.de/mam/cms08/rohstoffpflanzen/dateien/tfz_bericht_73_agri-pv.pdf, last accessed 2022/07/11
34. ismeamercati.it, https://www.ismeamercati.it/analisi-e-studio-filiere-agroalimentari, last accessed 2022/09/20
35. Adinolfi, D., Costa, M., Palombo, A., Piazzullo, D.: Dynamic optimization of a cogeneration plant for an industrial application with two different hydrogen embedding solutions. Int. J. Hydrogen Energy 47(24), 12204–12218 (2022)

Algorithmic Fairness in Healthcare Data with Weighted Loss and Adversarial Learning

Pronaya Prosun Das1(B), Marcel Mast2, Lena Wiese1,3, Thomas Jack4, Antje Wulff2, and ELISE STUDY GROUP

1 Fraunhofer Institute for Toxicology and Experimental Medicine, Hannover, Germany
[email protected]
2 Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover, Germany
3 Institute of Computer Science, Goethe University Frankfurt, Frankfurt am Main, Germany
4 Department of Pediatric Cardiology and Intensive Care Medicine, Hannover Medical School, Hannover, Germany

Abstract. Fairness in terms of various sensitive or protected attributes such as race, gender, age group, etc. has been a subject of great importance in the healthcare domain. Group fairness is considered one of the principal criteria. However, most of the prevailing mitigation techniques emphasize tuning the training algorithms while overlooking the fact that the training data may be the primary reason for the biased outcomes. In this work, we address two sensitive attributes (age group and gender) with empirical evaluations of systemic inflammatory response syndrome (SIRS) classification for a dataset extracted from electronic health records (EHRs), for the essential task of improving equity in outcomes. Machine learning (ML)-based technologies are progressively becoming prevalent in hospitals; therefore, our approach addresses the demand for frameworks to consider performance trade-offs

ELISE STUDY GROUP: Louisa Bode a; Marcel Mast a; Antje Wulff a,d; Michael Marschollek a; Sven Schamer b; Henning Rathert b; Thomas Jack b; Philipp Beerbaum b; Nicole Rübsamen c; Julia Böhnke c; André Karch c; Pronaya Prosun Das e; Lena Wiese e; Christian Groszweski-Anders f; Andreas Haller f; Torsten Frank f. a Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover, Germany. b Department of Pediatric Cardiology and Intensive Care Medicine, Hannover Medical School, Hannover, Germany. c Institute of Epidemiology and Social Medicine, University of Muenster, Muenster, Germany. d Big Data in Medicine, Department of Health Services Research, School of Medicine and Health Sciences, Carl von Ossietzky University Oldenburg, Oldenburg, Germany. e Research Group Bioinformatics, Fraunhofer Institute for Toxicology and Experimental Medicine, Hannover, Germany. f medisite GmbH, Hannover, Germany.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 264–283, 2024. https://doi.org/10.1007/978-3-031-47715-7_18


regarding sensitive patient attributes combined with model training, and permit organizations to utilize their ML resources in ways that are aware of potential fairness and equity issues. With the intended purpose of fairness, we experiment with a number of strategies to reduce disparities in algorithmic performance with respect to gender and age group. We leverage a sample and label balancing technique using weighted loss along with adversarial learning for an observational cohort derived from EHRs to introduce a "fair" SIRS classification model with minimized discrepancy in error rates over different groups. We experimentally illustrate that our strategy is able to align the distribution of SIRS classification outcomes of models constructed from high-dimensional EHR data across a number of groups simultaneously.

Keywords: Neural networks · Adversarial learning · Fairness · Bias · SIRS · Healthcare · EHR

1 Introduction

Machine learning (ML) can be utilized to identify statistical patterns in the data produced by thousands of physicians and millions of patients. Identifying such statistical patterns makes it possible to train computers to carry out specific tasks with remarkable efficiency, such as diagnosing eye diseases in diabetic patients at the level of an experienced and knowledgeable specialist [21]. However, historical data might contain patterns of disparities in health care, and these inequities can be perpetuated by an ML model trained on such data. Promoting fairness in the healthcare domain is therefore a significant task. Accordingly, the American Medical Association passed policy recommendations to "promote the development of thoughtfully designed, high-quality, clinically validated health care AI (artificial or augmented intelligence, such as machine learning) that identifies and takes steps to address bias and avoids introducing or exacerbating health care disparities including when testing or deploying new AI tools on vulnerable populations" [1]. In this work, we focus on disparities in gender and age groups for the diagnosis of systemic inflammatory response syndrome (SIRS). SIRS is defined as an excessive defence response of the body to a noxious stressor (e.g., infection, acute inflammation, trauma, surgery, reperfusion, or ischemia) aimed at localizing and subsequently terminating the external or endogenous cause of the insult. Professionals are usually guided by the SIRS identification criteria proposed in 1992 [4]. The progression from sepsis to septic shock can increase the mortality rate significantly: a study [28] reported 28-day/in-hospital mortality rates of 10%–40% in severe sepsis and 30%–60% in septic shock. Fluid resuscitation and early treatment with antibiotics were highly correlated with a higher survival rate [8]. A number of studies [14,27,29] demonstrated the applicability of machine learning algorithms to predicting the diagnosis of a disease. The patients in those studies came from different age groups and genders, and these categories were rarely equally distributed – a fact that will indeed have an


impact on the classification result. Such misestimation causes considerable harm to SIRS diagnosis in a sense that incorrect classification can endanger patients as a result of both over- or undertreatment leading to avoidable sepsis side effects or incidents from unwanted treatments, respectively. It is our belief that the future selection of patients can be benefited from a much better understanding of the various patient subcategories for specific treatments. A significant amount of attention has been drawn to fairness and bias in ML and it has become a prominent area of research for ML students, researchers, and industry professionals [20]. To ensure an impartial future for AI, the Ethics and AI communities have aspired to decrease biases in ML [21]. By utilizing data analytics, empirical research has been carried out to evaluate fairness with respect to race groups [27]; yet, a very small number of works have focused on enhancing fairness in healthcare from the AI viewpoint. This study proposes techniques intended for investigating the fairness pertaining to the SIRS classification model regarding gender and age groups. Our experimental results are based on the SIRS dataset provided by the Hanover Medical School. We suggest approaches for enhancing group fairness at the time of both data processing and model training stages despite retaining overall accuracy. Our experimental outcomes show that: (1) Adversarial learning is effective since most marginalized groups display more significant average enhancements compared to other groups across all evaluation metrics, (2) In general, our method produces high scores for fairness while causing only a slight decrease in the overall performance of the classification, and (3) different groups require different model strategies for optimal effectiveness. The article is structured as follows. In the next Sect. 2, we provide the used definitions of fairness and a brief literature review on various related works. 
We describe our datasets and analysis in Sect. 3. Different strategies and weighted loss are described in Sect. 4. In Sect. 5, we demonstrate the proposed adversarial learning approach. Results analysis and discussion are provided in Sect. 6. Finally, we draw conclusions in Sect. 7.

2 Related Works

2.1 Fair Prediction

In general, supervised learning can be utilized to approximate the conditional distribution p(Y | X) with a function f(X), where N samples {(x_i, y_i, z_i)}_{i=1}^{N} are drawn from a given distribution p(X, Y, Z). Usually, a vector representation X ∈ 𝒳 = R^m of the medical history is extracted from the Electronic Health Records (EHRs). A binary label Y ∈ 𝒴 = {0, 1} represents the outcome observed in the EHR for patient i. A sensitive attribute with k groups, for example gender, race, or age, is indicated by Z ∈ 𝒵 = {0, ..., k − 1}. To render a prediction Ŷ ∈ {0, 1}, the output of the learned function f(X) ∈ [0, 1] is thresholded with a value T.

Algorithmic Fairness in Healthcare Data


Demographic parity [10] is one of the popular metrics to evaluate the fairness of a classifier regarding a sensitive attribute Z. The demographic parity criterion assesses the independence between the prediction Ŷ and Z, formalized as

p(Ŷ | Z = Z_i) = p(Ŷ | Z = Z_j)  ∀ Z_i, Z_j ∈ 𝒵   (1)

Nevertheless, optimizing an ML model for demographic parity is inadequate for the prediction of clinical risk or diagnosis, as it may prevent the model from considering pertinent clinical features associated with both the outcome and the sensitive attribute. Therefore, it can reduce the overall performance of the model for all protected groups [20]. The equality of odds [15] is another metric for evaluating fairness; it specifies that, given the true label Y, the prediction Ŷ is conditionally independent of Z. Equality of odds is formally defined as

p(Ŷ | Z = Z_i, Y = Y_k) = p(Ŷ | Z = Z_j, Y = Y_k)  ∀ Z_i, Z_j ∈ 𝒵; Y_k ∈ 𝒴   (2)

The definition states that if equality of odds is achieved, then both the false negative rate (FNR) and the false positive rate (FPR) are equal for a given threshold T over all pairs of protected groups specified by Z. Therefore, equality of odds is more suitable in a clinical setting than demographic parity [15].
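As an illustration, the two group-fairness criteria above can be estimated empirically from model scores. The following sketch (function and variable names are our own, not from the paper) computes the demographic-parity gap and an equalized-odds gap over the groups of a sensitive attribute:

```python
import numpy as np

def fairness_gaps(y_true, y_score, z, threshold=0.5):
    """Empirical demographic-parity and equalized-odds gaps.

    y_true: binary outcomes, y_score: f(X) in [0, 1],
    z: group labels of a sensitive attribute.
    """
    y_hat = (y_score >= threshold).astype(int)   # threshold f(X) with T
    groups = np.unique(z)

    # Demographic parity: P(Y_hat = 1 | Z = z) should match across groups.
    pos_rates = [y_hat[z == g].mean() for g in groups]
    dp_gap = max(pos_rates) - min(pos_rates)

    # Equalized odds: FPR and FNR should match across groups.
    def fpr(mask):
        neg = mask & (y_true == 0)
        return y_hat[neg].mean() if neg.any() else 0.0

    def fnr(mask):
        pos = mask & (y_true == 1)
        return (1 - y_hat[pos]).mean() if pos.any() else 0.0

    fprs = [fpr(z == g) for g in groups]
    fnrs = [fnr(z == g) for g in groups]
    eo_gap = max(max(fprs) - min(fprs), max(fnrs) - min(fnrs))
    return dp_gap, eo_gap
```

A gap of zero corresponds to the equalities in Eqs. (1) and (2) holding exactly on the sample.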

2.2 Reducing the Impact of Algorithmic Bias

There are various strategies that can be utilized to reduce algorithmic bias. These strategies can be designed and carried out in different phases of a typical ML pipeline: during dataset construction, model training, and inference (i.e., prediction). Removing sensitive features from the training data during the dataset construction phase is a simple and uncomplicated solution. However, due to different feature-class correlations, inequity in prediction outcomes may still persist, and removing features directly can also degrade model performance [26]. Additional techniques to reduce biases during the data construction phase aim to address data imbalance with respect to the predicted class and group. Predicted group and class sizes can be balanced by updating the loss function, i.e., by assigning distinct weights to training samples and re-weighting every label, respectively [17]. Nevertheless, even with balanced training data, ML models might still learn correlated information regarding sensitive features like gender and race from the intermediate representations [30]. Such correlated information can be removed from the intermediate representation that is fed as input to the predictive model by utilizing adversarial learning [22,31]. A predictor (classifier) and an adversarial network are trained concurrently during adversarial learning. The primary goal of the predictor is to ensure that the intermediate representations used by the model remain highly informative for the prediction task. In contrast, the adversarial network's purpose is to hinder the predictor's capability to anticipate sensitive features [9]. Thus, by eliminating the biased information concerning the sensitive features, a fair representation of the model input can be learned using adversarial learning. Bias in ML can also be mitigated at the inference phase: the main concept is to detect and turn off the portions of the model that have learned the sensitive features, thereby eliminating the correlation between the model output and those sensitive features [9]. In this work, we focus on the data construction and model training phases for the mitigation of bias.
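A minimal sketch of this predictor-versus-adversary setup on toy data, with plain NumPy gradients and a gradient-reversal term on the shared encoder (all names, data, and hyperparameters here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Toy data: outcome y depends on feature x0 only; feature x1 encodes the
# sensitive attribute z, so a naive model can pick z up from x1.
n = 2000
z = rng.integers(0, 2, n).astype(float)
x = np.c_[rng.normal(size=n), z + 0.3 * rng.normal(size=n)]
y = (x[:, 0] + 0.2 * rng.normal(size=n) > 0).astype(float)

U = rng.normal(scale=0.1, size=(2, 2))   # shared encoder
w = rng.normal(scale=0.1, size=2)        # predictor head (outcome y)
a = rng.normal(scale=0.1, size=2)        # adversary head (sensitive z)
lr, lam = 0.2, 1.0                       # step size, adversary weight

for _ in range(500):
    r = x @ U                            # intermediate representation
    gy = (sigmoid(r @ w) - y) / n        # grad of CE_y wrt predictor logit
    gz = (sigmoid(r @ a) - z) / n        # grad of CE_z wrt adversary logit
    w -= lr * (r.T @ gy)                 # predictor: minimise CE_y
    a -= lr * (r.T @ gz)                 # adversary: minimise CE_z
    # Encoder: minimise CE_y while MAXIMISING CE_z (gradient reversal).
    U -= lr * (x.T @ np.outer(gy, w) - lam * x.T @ np.outer(gz, a))

acc_y = ((sigmoid(x @ U @ w) > 0.5) == y).mean()   # outcome stays predictable
acc_z = ((sigmoid(x @ U @ a) > 0.5) == z).mean()   # z is harder to recover
```

Because the informative direction x0 is independent of z, the encoder can suppress the z-correlated direction without sacrificing outcome accuracy, which is the intuition behind fair-representation learning.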

2.3 Different Approaches for Mitigating Bias

Considerable interest has been observed in healthcare [5,7] regarding the ethical implications of applying ML algorithms. However, comparatively little work exists that demonstrates the applicability of satisfying formal fairness constraints while developing risk prediction or classification models trained on EHR data. A number of adversarial-learning-based approaches in non-healthcare domains satisfy fairness constraints, especially in the form of demographic parity. In the context of image anonymization, one approach [11] showed that a predictive model can be substituted by an autoencoder and an adversarial component to accomplish demographic parity. The adversarial learning technique was further inspected with a gradient reversal objective [3] for data that are imbalanced in terms of the sensitive attributes as well as the outcome; it was also demonstrated that only a small amount of data is needed to train adversarial networks. Alternatively, the use of equality of odds was presented in another work [15] to deal with the limitations of demographic parity; there, post-processing techniques were developed to attain equality of odds for fixed-threshold classifiers. Recently, equality of odds was achieved in an adversarial framework by giving the discriminator access to the outcome values [22]. Equality of odds and demographic parity are called group fairness criteria, as they mainly evaluate quantities at the level of groups, generally defined by sensitive attributes such as age, gender, or ethnicity. The reasoning behind and computation of these metrics are straightforward. However, during optimization, they might generate models that are biased towards certain subgroups within the groups of sensitive attributes [18]. By utilizing the notion of individual fairness [10], it may be possible to handle these issues.
Under this notion, a model is assessed on whether it generates similar outputs for similar individuals. Nevertheless, it has limited practical use, as a domain-specific similarity metric needs to be developed to encode the preferred fairness criteria. A recent work [16] explored an alternative to both individual and group fairness with a technique that discovers subgroups for which the model performs poorly and subsequently improves the model's performance on those subgroups. This approach is model oriented; hence it mainly relies on model tuning for mitigating the bias. Another related work in healthcare [6] examined the fairness of risk prediction models in the context of predicting the mortality of patients in intensive care units. They argued that it is undesirable to trade off model performance against fairness across sensitive attributes. However, none of these works dealt with unbalanced groups and labels. Besides, we have also been able to train a generalized adversarial model that satisfies different fairness constraints, which will be discussed in the upcoming sections. From the literature review, we have realized that the fairness of a model should be evaluated in the context of the data [6]. This motivates us to build a fair model for EHR data, where unbalanced groups and labels are addressed to make the data bias-free to some extent.

3 Dataset

For the purposes of this work, a dataset of routine data from the pediatric intensive care unit at the Hannover Medical School is utilized. The data, which was obtained from a previously published study [32], has been pseudonymized to protect patient confidentiality. The dataset includes information on 168 pediatric patients, including vital parameters such as temperature, heart rate, and respiration rate, results from laboratory tests, and information from medical devices such as cooling blankets, ventilators, and pacemakers. Each patient can be identified by a unique study number, which was generated during the pseudonymization process. The laboratory test results include counts of leukocytes, platelets, and neutrophils, as well as INR values derived from the prothrombin time. Each measurement has a corresponding timestamp, providing a temporal sequence of data. The age of the patients is also recorded, which is crucial for correct diagnosis in the context of pediatric intensive care, particularly for SIRS detection (Fig. 1). Blood pressure values have been added to the existing parameters.

[Fig. 1. Patient counts per age group (0d-1w, 1w-1m, 1m-1y, 2y-5y, 6y-12y, 13y-) for all patients and patients with SIRS; bar chart omitted.]
p classification problem, the FLCM classifier has comparable performance on the imbalanced and balanced data; similar results were observed for the NMC and ICR classifiers. Over- and under-sampling of the data set has insignificant effects on the computational time of the classifiers.

Keywords: Imbalance · Balanced · Sample size · Over-sampling · Under-sampling · Classifiers

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 284–300, 2024. https://doi.org/10.1007/978-3-031-47715-7_19

Application of Mean-Variance Cloning Technique


1 Introduction

Imbalanced data (ID) can be described as a data set in which one group's sample size is larger than the second group's while both retain the same dimension. ID occurs naturally in many practical research areas such as fraud classification [1], oil spill detection [2], web mining [3–5], fraud data [6, 7], pattern recognition [8–11], gene expression [12, 13], cancer genomic data [12, 14], and intrusion detection [15]. ID can occur in both n > p and p > n classification problems, where n denotes the sample size and p the dimension of the data set. Different data preprocessing methods have been proposed to equalize the sample sizes of the two groups by way of under-sampling or over-sampling. The synthetic minority over-sampling technique (SMOTE) replicates the minority group data so that both groups end up with an equal sample size. SMOTE is an over-sampling method that applies the k-nearest-neighbour rule, in which c of the k neighbours are selected and interpolated to generate new samples. These generated samples are combined with the original minority group to obtain the enhanced majority group. SMOTE has found applications in different fields of research such as network intrusion [16], gene prediction [17, 18], and histopathology annotation [19]. The other approach to imbalanced data is under-sampling, where some portion of the majority group is removed to match the minority group. Conventional classifiers such as the Fisher linear classification method (FLCM), the nearest mean classifier (NMC), and the independent classification rule (ICR) are well known. While the FLCM suffers the curse of dimensionality for p > n problems, the NMC and ICR apply to both n > p and p > n.
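The neighbour-interpolation idea behind SMOTE can be sketched in a few lines (a simplified illustration, not the reference implementation; function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(7)

def smote_like(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples by interpolating each
    selected point toward one of its k nearest minority neighbours."""
    n = len(X_min)
    out = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        j = rng.integers(n)                       # a minority sample X_j
        d = np.linalg.norm(X_min - X_min[j], axis=1)
        nbrs = np.argsort(d)[1:k + 1]             # its k nearest neighbours
        x_tilde = X_min[rng.choice(nbrs)]         # pick one of them
        alpha = rng.random()                      # interpolation weight
        out[t] = X_min[j] + alpha * (x_tilde - X_min[j])
    return out
```

Each synthetic sample lies on the line segment between an existing minority point and one of its neighbours, so the generated data stay inside the convex region spanned by the minority group.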
F. Z. Okwonu et al.

The FLCM depends on the sample mean and covariance matrix to develop the classifier coefficients, which is impracticable for p > n, whereas the NMC and ICR coefficients rely strictly on the group mean vectors to perform classification based on the training data set. For the ID problem, over-sampling or under-sampling methods are data-balancing preprocessing techniques required before classifiers such as the FLCM, NMC, and ICR are trained to perform the classification task. Different techniques to solve imbalanced-sample-size classification problems for high dimensional data (HDD), such as cost-sensitive learning (CSL) [20, 21], random resampling (RR) [15, 22], over-sampling (OS), under-sampling (US), feature selection (FS), and one-class learning (OCL), have been proposed [15, 22–24]. It has been noted that resampling may increase the overfitting or the computational time of the classifiers [25], although it was also noted that cost-sensitive learning can mitigate the problems associated with resampling. Cost-sensitive learning [26, 27] and the sampling techniques each have advantages and disadvantages [15, 23]. Other well-known classifiers such as the k nearest neighbour [28–31], support vector machine [32–37], random forest [38, 39], decision trees [40], and diagonal linear discriminant analysis [28, 41] have been applied in various research areas to solve ID problems related to HDD. Overfitting often occurs in high dimensional classification problems; removing redundant variables may reduce its effects to a manageable degree [23, 42]. An imbalanced group data set occurs when one group's sample size is larger than the sample size of the second group. This often happens in high dimensional data sets, which may contain many irrelevant variables or data points. These irrelevant variables perturb the classification accuracy of the classifiers. Therefore, various techniques have been proposed to address the imbalance and the irrelevant variables in high dimensional data sets and to reduce the prediction power of the majority group [43]. The minority group in an imbalanced data problem may violate the central limit theorem (CLT), which the over-sampling method tends to ameliorate. The problems associated with ID for either n > p or p > n could be remedied by over-sampling or under-sampling. The performance of classical classifiers such as the Fisher linear classification method (FLCM) [44–46], the nearest mean classifier (NMC) [45–48], and the independent classification rule (ICR) [49] may be biased towards the minority group. The FLCM is based on the mean and covariance to formulate the classifier coefficients and is thus assumed to be influenced by sample sizes and dimensions [41], whereas the NMC [50, 51] and ICR are simple classifiers that rely only on sample mean deviations, so the dimensionality problem is automatically resolved. However, outliers and imbalanced data remain problems for these classifiers; the outlier problem has been addressed in different fora [44, 52–61]. Therefore, the focus here is to investigate the impact of imbalanced data on the n > p problem using the conventional FLCM, NMC, and ICR classifiers. We also investigate the comparative performance of the imbalanced and balanced data sets on the above classifiers. We specifically address two problems in this study: first, an alternative to SMOTE, called the mean-variance cloning technique (MVCT), is proposed; secondly, a performance evaluation metric for ID in n > p classification problems.
To address the first problem, instead of applying SMOTE to the minority group, we apply the proposed MVCT as an over-sampling procedure to equalize the sample size of the minority group with that of the majority group. If over-sampling or under-sampling is applied, the conventional evaluation metrics could be used; however, overfitting may occur. To resolve this problem, we apply the benchmark evaluation threshold (BETH) as the performance evaluation metric [62]. The rest of this article is organized as follows. Section 2 discusses the materials and methods, with an elaborate description of the MVCT procedure followed by the FLCM, NMC, and ICR classifiers. Section 3 presents the performance evaluation criteria, followed by results and discussion in Sect. 4. The conclusion is given in Sect. 5.

2 Material and Methods

The MVCT procedure generates new data (k = N1 − n) using the mean and variance of the minority group (of size n) to obtain new samples (k), which are merged with the minority sample to obtain a new majority sample size (N2 = n + k) equal to that of the original majority group (N1). Therefore, we have two majority groups, i.e., N1 = N2. At this point, the classifiers can be applied to formulate their coefficients and perform classification. The MVCT is thus a variant of the over-sampling procedure.


2.1 Mean-Variance Cloning Technique

The synthetic minority over-sampling technique (SMOTE) [63] is defined as

X_new = X_j + α (X̃ − X_j), 0 ≤ α ≤ 1,   (1)

where X̃ is randomly selected among the c of the k nearest neighbours. Equation (1) is categorized as an over-sampling method used to generate additional data to balance the sample size of the smaller group. The mean-variance cloning technique (MVCT) works similarly as an over-sampling procedure, although the MVCT uses the information from the minority group to recreate a data set that mimics the original minority data set. The combination of the minority data set (of sample size n) and the cloned data set (of sample size k), which have equal dimension, forms the new majority group (N2 = n + k). The data generated from the minority group share similar characteristics with the given data set; the implication is that the possibility of influential observations being generated from the original data is minimized. The MVCT procedure may mimic the SMOTE procedure [1] but with different data extraction and generation steps. In general, the MVCT relies on the internal mechanism of the minority data set to compute the mean and standard deviation of the new data set. The MVCT can be described as an over-sampling method based on the concept in [22]. It is derived as follows. The first step is to compute the mean and variance of the minority group, that is,

X̄_mn = (Σ_{i=1}^{n_mn} X_ip,mn) / n_mn   (2)

where X_ip,mn is the minority data set, X̄_mn denotes the minority group mean, and S denotes the minority standard deviation,

S²_mn = Σ_{i=1}^{n_mn} (X_ip,mn − X̄_mn)² / (n_mn − 1),  S = √(S²_mn)   (3)

Therefore, the new majority group is given as

X_CT = X_ip,mn + N(X̄_mn, S²_mn)_{k = N1 − n, p} = X_ip,mn + ω(k, p)   (4)

The data set of the original majority group is denoted by X_MJ. Hence X_MJ and X_CT have equal sample sizes and dimensions. Thus,

X̄_MJ(1×p) = (Σ_{j=1}^{N1} X_MJ,jp) / N1   (5)

X̄_CT(1×p) = (Σ_{j=1}^{N2} X_CT,jp) / N2   (6)
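The per-feature moment fitting and sampling in Eqs. (2)-(4) can be sketched as follows (a hypothetical NumPy rendering, not the authors' code; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def mvct_oversample(X_min, N1):
    """Mean-variance cloning: draw k = N1 - n synthetic rows from a
    normal law fitted per feature on the minority group (Eq. (4)),
    then merge them with the original minority samples."""
    n, p = X_min.shape
    k = N1 - n
    mean = X_min.mean(axis=0)                            # Eq. (2)
    std = X_min.std(axis=0, ddof=1)                      # Eq. (3)
    X_ct = rng.normal(loc=mean, scale=std, size=(k, p))  # omega(k, p)
    return np.vstack([X_min, X_ct])                      # N2 = n + k rows
```

Unlike the SMOTE interpolation of Eq. (1), no neighbour search is needed; the cloned rows depend only on the minority group's first two moments.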

Equation (5) is the group mean vector of the original majority group (N1), while Eq. (6) is the group mean vector of the new majority group (N2). The sample variances of the two groups can be computed in a similar way. Before we proceed to develop classifiers based on the MVCT procedure, let us determine the mean and variance differences between N1 and N2. This is done by computing the mean of N1 and of the combined generated data N2. Consider the data set in Table 1, where the majority group consists of N1 = 10 samples and the minority group of n = 4 samples, with p = 3. To balance N2, we apply Eq. (4) to obtain k = N1 − n samples, which correspond to the generated values (shown with four decimal places) in Table 1. After that, we apply Eqs. (5)–(6) to show whether there are strict differences between the two groups. Based on Tables 1 and 2, there is an insignificant difference between the two groups; therefore, we can conclude that the two groups satisfy the equal-mean condition. From Fig. 1, we observe that the variances of the two groups differ, but not severely. However, the variance of X2 for the minority group is noticeably large, which the generated data mimic. Figure 1 therefore demonstrates that sample size plays a vital role in the variation of the variance, and the concept of the central limit theorem is valid for Eq. (4). The average variance of the original majority group (N1), 80.9444, and that of the balanced group (N2), 79.1223, reveal homoscedasticity.

Table 1. Data comparison of majority and minority groups with balanced sample sizes (the generated minority values are those with four decimal places).

Majority group        | Minority group
X1    X2    X3        | Y1       Y2       Y3
21    20    31        | 33       12       10
23    25    12        | 23       17       11
34    19    20        | 29       38       27
35    13    16        | 25       32       23
28    25    31        | 30.7854  28.2755   9.6565
11    13    29        | 32.3147  34.2596  26.7887
33    19    21        | 24.5952   1.4685  14.0341
12    21    41        | 25.7753  23.8555   4.3688
31    19    21        | 26.3755  41.0524  18.9011
18    23    17        | 23.6413  25.9699  27.6895
MEAN (n1): 24.6  19.7  23.9 | MEAN (n2): 27.3487  25.3881  17.2439

The main difference between Eqs. (1) and (4) is that Eq. (1) depends on the Euclidean distances to the existing data, which allows for continuous adjustment of the minority using the c from the k nearest neighbours. The MVCT, on the other hand, relies only on the sample mean and variance of the existing minority data set to generate the balancing data based on the sample size difference between the majority and minority groups; the new majority group is then formed by combining the minority data with the generated minority data.

Table 2. Mean comparison of minority and generated minority.

     Minority data | Generated minority data (GMD) | Minority + GMD | Grand mean
X1   27.5          | 27.2479                       | 27.3487        | 27.36553333
X2   24.75         | 25.8135                       | 25.3881        | 25.3172
X3   17.75         | 16.9064                       | 17.2437        | 17.30003333

[Fig. 1. Comparative analysis of variance: variance values of X1, X2, X3 for the majority, minority, generated minority, and minority + generated minority groups; bar chart omitted.]

2.2 Imbalance Classifier

The imbalance classifier utilizes the data set in its original format such that the majority and minority groups are modeled independently. This method allows the classifier to classify an object based on the information supplied. It resembles cost-sensitive learning (CSL), which compares the misclassification rate of the minority group with that of the majority group as a decision rule [23]. This method models the majority and minority groups and applies them to perform classification. In this subsection, we mention three different classifiers without detailed

derivations. However, Eq. (4) is applied to the minority group to transform it into a majority group before the following classifiers are applied to train and validate the model.

Fisher Linear Classification Analysis. For instance, to perform classification on imbalanced data, we first apply the MVCT procedure to balance the data set, which transforms the minority class into a majority class as illustrated in Table 1. The data are used to formulate the coefficients of the Fisher linear classification analysis (FLCA) for n > p problems. The FLCA formulation relies on the computation of the group means, the covariance matrices, and the pooled sample covariance matrix. The coefficient is formulated by taking the difference between the two group mean vectors multiplied by the inverse of the pooled sample covariance. The decision rule is developed by adding the two group mean vectors and multiplying by the coefficient [44, 48, 52]. The allocation score is obtained by multiplying the numerical value of the object by the classifier coefficient. An object is then assigned to the correct group by comparing its allocation score with the benchmark value: an object is assigned to group one if its score is greater than the benchmark value; otherwise, it is assigned to group two.

Nearest Mean Classifier. The allocation score of the nearest mean classifier (NMC) is derived by taking the difference of the group mean vectors multiplied by the numerical value of the object of interest [47, 50, 51, 64]. A decision rule similar to that of the FLCA is applied. The uniqueness of the NMC is that it is averse to dimensionality problems and can therefore be applied to perform classification for both n > p and p > n.

Independent Classification Rule. The independent classification rule (ICR), like the NMC, can be applied to solve n > p and p > n classification problems.
The coefficient of the ICR is formulated by utilizing the diagonal of the pooled covariance matrix multiplied by the difference of the group mean vectors and the difference between the object values and the grand mean vector [65, 66]. For the ICR, an object is assigned to group one if the allocation score is greater than zero; otherwise, the object is assigned to group two.
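The three verbal decision rules above can be sketched as follows (our own reading of the text, with illustrative names; the FLCM rule requires an invertible pooled covariance, i.e. n > p):

```python
import numpy as np

def fit_two_group(X1, X2):
    """Return FLCM, NMC and ICR decision rules for two groups,
    following the verbal definitions given in the text."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
                + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    mid = (m1 + m2) / 2.0                     # benchmark (grand mean) point

    def flcm(x):  # Fisher rule: inverse pooled covariance times mean gap
        w = np.linalg.solve(S_pooled, m1 - m2)
        return 1 if w @ (x - mid) > 0 else 2

    def nmc(x):   # nearest mean: mean-gap score against the benchmark
        return 1 if (m1 - m2) @ (x - mid) > 0 else 2

    def icr(x):   # independent rule: diagonal of the pooled covariance only
        w = (m1 - m2) / np.diag(S_pooled)
        return 1 if w @ (x - mid) > 0 else 2

    return flcm, nmc, icr
```

Because the NMC and ICR never invert the full covariance matrix, they remain usable when p > n, exactly as claimed for the two simpler classifiers.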

3 Performance Evaluation

Conventionally, the performance of n > p classification is evaluated using the confusion matrix or the apparent error rate (AER); the apparent correct-classification rate is the sum of the diagonal of the confusion matrix, that is, the true positives and the true negatives. This criterion is efficient if we assume an equal cost of misclassification and equal probabilities for both groups, but it becomes problematic when the imbalance problem persists. To implement this evaluation while avoiding bias and overfitting towards the majority class, we use an alternative procedure, the benchmark evaluation threshold (BETH) [62]. That is,

τ = ((1 − ∂) × ∂) / (2 × ∂)   (7)

where ∂ denotes the probability of correct classification from the confusion matrix and τ is the probability of misclassification. The BETH value is then

∩ = 1 − τ   (8)

and the performance probability is determined as

β = ∂ / ∩   (9)
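Equations (7)-(9) amount to a three-line computation from the correct-classification probability ∂ (a sketch with our own function name):

```python
def beth_performance(d):
    """Benchmark evaluation threshold (Eqs. (7)-(9)), where d is the
    probability of correct classification from the confusion matrix."""
    tau = ((1 - d) * d) / (2 * d)   # Eq. (7): misclassification probability
    cap = 1 - tau                   # Eq. (8): BETH value
    return d / cap                  # Eq. (9): performance probability
```

For example, a correct-classification probability of 0.9 gives τ = 0.05, a BETH value of 0.95, and a performance probability of 0.9 / 0.95.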

3.1 Data Set

This section uses three data sets with different imbalanced sample sizes and dimensions:

1. SONAR: mines and rocks. This data set was collected from the UCI machine learning repository. It consists of two groups, sonar mines and sonar rocks; sonar mines consist of 111 instances and sonar rocks of 97 instances, with 60 attributes [67].
2. The second data set consists of acoustic features for Parkinson disease classification based on gender. It consists of healthy and unhealthy subjects with 240 instances; the majority group is 144 and the minority group is 96, with 44 attributes [68]. This data set was collected from the UCI machine learning repository (https://archive.ics.uci.edu/ml/machine-learning-databases/00489/).
3. The third data set is the Pima Indian diabetes data set (https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database), which consists of 768 instances with 8 attributes; group one consists of 500 samples and group two of 268 samples.

4 Results and Discussion

In this section, we apply Eqs. (7)–(9) to determine the performance of the classifiers on the imbalanced and balanced sample sizes. We used over-sampling and under-sampling techniques based on our proposed data generation method. Tables 3, 4 and 5 present the results of the classifiers on the imbalanced data and with MVCT over-sampling, respectively. From these tables, we observe that the classification performance depends on the nature of the data set, and the transformation from minority to majority group via MVCT data generation showed little difference. This also alludes to the concept of data dependency theory in [48].

4.1 Part A: Over-Sampling

The results reported in this subsection compare applying the classifiers to the original imbalanced data against applying the MVCT data generation to balance the group sample sizes before the classifiers are applied. The results in Table 3 show that the FLCM and the MVCT-FLCM performed comparably; the other classifiers also performed comparably. Note that FLCM, NMC, and ICR are trained on the imbalanced data sets, while MVCT-FLCM, MVCT-NMC, and MVCT-ICR are trained on the balanced data sets.

Table 3. Comparative performance analysis using sonar signals of mines and rocks [67].

Classifiers  | FLCM   | MVCT-FLCM | NMC    | MVCT-NMC | ICR    | MVCT-ICR
Performance  | 0.9639 | 0.9627    | 0.8916 | 0.8955   | 0.8995 | 0.9004
CPU time     | 0.17   | 0.17      | 0.17   | 0.15     | 0.17   | 0.18

The results in Table 4 also show that the FLCM outperformed the other classifiers, while the remaining classifiers performed comparably; the MVCT-based classifiers compare favorably with the published results for this data set obtained using the SMOTE technique for data over-sampling.

Table 4. Comparative performance analysis using Parkinson disease based on gender.

Classifiers  | FLCM   | MVCT-FLCM | NMC    | MVCT-NMC | ICR    | MVCT-ICR | LVBA [70] | SPM [70] | Approach A [72] | Approach B [72]
Performance  | 0.9485 | 0.9341    | 0.8788 | 0.8816   | 0.8856 | 0.8959   | 0.8500    | 0.8710   | 0.7790          | 0.7520
CPU time     | 0.20   | 0.23      | 0.20   | 0.21     | 0.26   | 0.23     | -         | -        | -               | -

LVBA: latent variable based approach; SPM: standard probit model. Results for approaches A and B are from Table 2 of [69].

In Table 5, the performance of the FLCM on the imbalanced and balanced data sets is consistent with the previous results. Apart from the FLCM variants, the ICR variants performed better than the NMC variants on this data set. In general, the computational times of all the classifiers are comparable.

Table 5. Comparative performance analysis using the Pima Indian diabetes data set.

Classifiers  | FLCM   | MVCT-FLCM | NMC    | MVCT-NMC | ICR    | MVCT-ICR
Performance  | 0.9109 | 0.9104    | 0.8838 | 0.8829   | 0.8905 | 0.9024
CPU time     | 0.17   | 0.17      | 0.14   | 0.11     | 0.12   | 0.14

4.2 Part B: Under-Sampling

In Table 6, the sample sizes for both groups are N1 = N2 = 97 with p = 60; in Table 7, N1 = N2 = 96 with p = 44; and in Table 8, N1 = N2 = 268 with p = 8. In Table 6, the FLCM outperformed the NMC and ICR, and the results indicate that the NMC and ICR performed comparably with similar computational time.

Table 6. Performance analysis using sonar signals of mines and rocks [67].

Classifiers  | FLCM   | NMC    | ICR
Performance  | 0.9497 | 0.8904 | 0.8923
CPU time     | 0.20   | 0.20   | 0.20

In Table 7, the NMC and ICR performed comparably, with the FLCM as the overall best classifier for this data set.

Table 7. Performance analysis using Parkinson disease based on gender.

Classifiers  | FLCM   | NMC    | ICR
Performance  | 0.9513 | 0.8880 | 0.8849
CPU time     | 0.17   | 0.17   | 0.17

In Table 8, the FLCM and ICR performed comparably, followed by the NMC, with varying degrees of computational time.

Table 8. Performance analysis using the Pima Indian diabetes data set.

Classifiers  | FLCM   | NMC    | ICR
Performance  | 0.9049 | 0.8794 | 0.8988
CPU time     | 0.07   | 0.07   | 0.09
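The under-sampling used above is the random variant described in the introduction; it reduces to a single random subset draw (a sketch with illustrative names):

```python
import numpy as np

rng = np.random.default_rng(3)

def undersample(X_maj, n_min):
    """Random under-sampling: keep a random subset of the majority
    group equal in size to the minority group."""
    idx = rng.choice(len(X_maj), size=n_min, replace=False)
    return X_maj[idx]
```

After this step both groups have the minority group's sample size (e.g., N1 = N2 = 97 for the sonar data), so the classifiers can be trained and evaluated exactly as in the over-sampling case.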

4.3 Comparative Performance Analysis of Imbalance and Balanced Data Set Based on Classifiers In Fig. 1, we observed that the application of the classifiers on the original data (imbalance data) indicates improved performance more than the balanced data set. The NMC for both categories of data set showed consistent performance more than the FLCM and the ICR. In Fig. 2, the application of the classifiers on Parkinson disease via an imbalance and balanced data set demonstrated that the application of the classifiers on the balanced data set revealed better classification performance than the original data set. In this data set, all classifiers have similar performance results.

Fig. 2. Comparative performance of classifiers based on over- and under-sampling for sonar data.

[Bar chart: performance probability of FLCM, NMC, and ICR on imbalanced vs. balanced data.]

Fig. 3. Comparative performance of classifiers based on over and under-sampling for Parkinson disease based on gender.

Figure 3 indicates varying classifier performance across data set categories. The comparative analysis shows that applying the classifiers to the imbalanced data is more robust than applying them to the balanced data set. It also reveals that the effect of over-sampling and under-sampling on classification models depends on the nature of the data set, and that the type of classifier plays a significant role. The impact of data dependency is therefore echoed in determining the performance of any good classifier: classifiers perform better when the data set is normal and conforms to the central limit theorem.
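The over- versus under-sampling effect discussed above can be illustrated with a small, self-contained experiment. The synthetic data and logistic regression below are stand-ins, not the paper's data sets or classifiers:

```python
# Illustrative experiment: compare a classifier trained on imbalanced data
# against the same classifier trained after naive random over-sampling of
# the minority class. Hypothetical data, not the paper's data sets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: train on the imbalanced data as-is.
acc_imbalanced = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Over-sample: duplicate minority-class rows with replacement until the
# two classes have equal size, then retrain on the balanced data.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
acc_balanced = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)

print(f"imbalanced: {acc_imbalanced:.3f}  over-sampled: {acc_balanced:.3f}")
```

On many data sets the two accuracies are close, which is consistent with the observation that over-sampling alone need not improve classifier performance.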

[Bar chart: performance probability of FLCM, NMC, and ICR on imbalanced vs. balanced data.]

Fig. 4. Comparative performance of classifiers based on over and under-sampling for the Pima Indian diabetes data set.

4.4 Discussion

The results in Tables 1 and 2 support the conclusion of [70] that there are no significant differences in performance across the sampling techniques. For n > p classification problems, imbalanced group samples perform better for the mean- and covariance-based classifiers than for the mean deviation method. From the results of this study, we observed that classifier performance depends on the nature of the data set, irrespective of the sample size of each group. This is evident in the performance probabilities of the classifiers, with the FLCM showing unique performance on all three data sets. The performance analysis shows that using the original data set in its raw form yields better performance than manipulating the data set to equalize the group sample sizes. Therefore, data or sample-size balancing obtained by over-sampling the minority group is statistically cosmetic, because it does not enhance the performance of the classifiers [63, 71]. Hence, data dependency theory is statistically valid for classifiers. The results in Tables 3, 4 and 5 affirm the conclusion in [72] that over-sampling does not enhance classifier performance; the nature of the data set does.

5 Conclusion

In this study, a new mean- and variance-based data over-sampling procedure (MVCT) has been proposed. Comparison with the original minority group indicates that the MVCT data set mimics the original data set. Its application to transform a minority group into a majority group shows that it satisfies the equal-mean and homoscedasticity conditions. Applying the MVCT data generation to the traditional classifiers produced outcomes comparable to those on the original data set. This study has shown the insignificant impact of over-sampling and under-sampling on the classifiers. For the data sets used in this study, the covariance-based classifiers outperformed the mean deviation classifiers. Therefore, this study demonstrated that the performance of the classifiers depends on the


nature of the data set; over-sampling and under-sampling do not enhance classifier performance. In summary, the classical FLCM performed more consistently on imbalanced and balanced data sets than the other two classical classifiers used in this study.
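As an illustration only, a mean-variance-matched cloning step of this kind can be sketched as follows. The choice of a multivariate normal as the sampling distribution is an assumption for this sketch, not necessarily the paper's exact MVCT algorithm:

```python
# Sketch of a mean-variance "cloning" over-sampler: draw synthetic minority
# samples from a multivariate normal fitted to the minority group's sample
# mean and covariance, so that the clones match the original first two
# moments (equal mean, homoscedasticity). Illustration only.
import numpy as np

def mv_clone(X_minority, n_new, seed=0):
    rng = np.random.default_rng(seed)
    mu = X_minority.mean(axis=0)
    cov = np.cov(X_minority, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n_new)

# Hypothetical minority group with 40 observations and 2 features.
rng = np.random.default_rng(1)
X_min = rng.normal(loc=[2.0, -1.0], scale=[1.0, 0.5], size=(40, 2))
clones = mv_clone(X_min, n_new=200)
print(X_min.mean(axis=0), clones.mean(axis=0))  # means should be close
```

The cloned samples can then be appended to the minority group to turn it into a majority group before training a classifier.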

References

1. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–367 (2002). https://doi.org/10.1613/jair.953
2. Kubat, M., Holte, R.C., Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30(2–3), 195–215 (1998). https://doi.org/10.1023/a:1007452223027
3. Costa, E., et al.: A framework for building web mining applications in the world of blogs: a case study in product sentiment analysis. Expert Syst. Appl. 39(5), 4813–4834 (2012). https://doi.org/10.1016/j.eswa.2011.09.135
4. Yeh, I.C., Lien, C., Ting, T.M., Liu, C.H.: Applications of web mining for marketing of online bookstores. Expert Syst. Appl. 36(8), 11249–11256 (2009). https://doi.org/10.1016/j.eswa.2009.02.068
5. Ting, I.: Web-mining applications in e-commerce and e-services. Online Inf. Rev. 32(2), 129–132 (2008). https://doi.org/10.1108/14684520810879773
6. Brockett, P.L., Derrig, R.A., Golden, L.L., Levine, A., Alpert, M.: Fraud classification using principal component analysis of RIDITs. J. Risk Insur. 69(3), 341–371 (2002). https://doi.org/10.1111/1539-6975.00027
7. Kale, N., Kochrekar, S., Mote, R., Dholay, S.: Classification of fraud calls by intent analysis of call transcripts. In: 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), pp. 1–6. IEEE (2021). https://doi.org/10.1109/ICCCNT51525.2021.9579632
8. Szabo, A., Boucher, K., Carroll, W.L., Klebanov, L.B., Tsodikov, A.D., Yakovlev, A.Y.: Variable selection and pattern recognition with gene expression data generated by the microarray technology. Math. Biosci. 176(1), 71–98 (2002). https://doi.org/10.1016/S0025-5564(01)00103-1
9. Declerck, K., Novo, C.P., Grielens, L., van Camp, G., Suter, A., vanden Berghe, W.: Echinacea purpurea (L.) Moench treatment of monocytes promotes tonic interferon signaling, increased innate immunity gene expression and DNA repeat hypermethylated silencing of endogenous retroviral sequences. BMC Complement. Med. Therap. 21(1), 141 (2021). https://doi.org/10.1186/s12906-021-03310-5
10. Bicciato, S., Pandin, M., Didonè, G., di Bello, C.: Pattern identification and classification in gene expression data using an autoassociative neural network model. Biotechnol. Bioeng. 81(5), 594–606 (2003). https://doi.org/10.1002/bit.10505
11. Romualdi, C., Campanaro, S., Campagna, D., Celegato, B., Cannata, N., Toppo, S., Lanfranchi, G.: Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Hum. Mol. Genet. 12(8), 823–836 (2003). https://doi.org/10.1093/hmg/ddg093
12. Li, Y., Umbach, D.M., Li, L.: Abstract A46: a comprehensive genomic pan-cancer analysis comparing males and females using The Cancer Genome Atlas gene expression data. Clin. Cancer Res. 23(1_Supplement), A46–A46 (2017). https://doi.org/10.1158/1557-3265.pmccavuln16-a46


13. Kim, B.H., Yu, K., Lee, P.C.: Cancer classification of single-cell gene expression data by neural network. Bioinformatics 36(5), 1360–1366 (2020). https://doi.org/10.1093/bioinformatics/btz772
14. Li, Y., et al.: A comprehensive genomic pan-cancer classification using the cancer genome atlas gene expression data. BMC Genom. 18(1), 1–13 (2017). https://doi.org/10.1186/s12864-017-3906-0
15. García-Pedrajas, N., Pérez-Rodríguez, J., García-Pedrajas, M., Ortiz-Boyer, D., Fyfe, C.: Class imbalance methods for translation initiation site recognition in DNA sequences. Knowl.-Based Syst. 25(1), 22–34 (2012). https://doi.org/10.1016/j.knosys.2011.05.002
16. Cieslak, D.A., Chawla, N.V., Striegel, A.: Combating imbalance in network intrusion datasets. In: IEEE International Conference on Granular Computing, pp. 732–737 (2006). https://doi.org/10.1109/GRC.2006.1635905
17. Xiao, J., et al.: Identification of microRNA precursors based on random forest with network-level representation method of stem-loop structure. BMC Bioinformatics 12, 1–8 (2011). https://doi.org/10.1186/1471-2105-12-165
18. Batuwita, R., Palade, V.: MicroPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25(8), 989–995 (2009). https://doi.org/10.1093/bioinformatics/btp107
19. Doyle, S., Monaco, J., Feldman, M., Tomaszewski, J., Madabhushi, A.: An active learning based classification strategy for the minority class problem: application to histopathology annotation. BMC Bioinf. 12, 1–14 (2011). https://doi.org/10.1186/1471-2105-12-424
20. Pes, B., Lai, G.: Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study. PeerJ Comput. Sci. 7, e832 (2021). https://doi.org/10.7717/peerj-cs.832
21. Zhang, Y., Wang, D.: A cost-sensitive ensemble method for class-imbalanced datasets. Abstr. Appl. Anal. 2013, 1–6 (2013). https://doi.org/10.1155/2013/196256
22. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial. ACM SIGKDD Explorat. Newslett. 6(1), 1–6 (2004). https://doi.org/10.1145/1007730.1007733
23. Maldonado, S., López, J.: Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for SVM classification. Appl. Soft Comput. 67, 94–105 (2018). https://doi.org/10.1016/j.asoc.2018.02.051
24. Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorat. Newsl. 6(1), 20–29 (2004). https://doi.org/10.1145/1007730.1007735
25. van Hulse, J., Khoshgoftaar, T.M., Napolitano, A., Wald, R.: Feature selection with high-dimensional imbalanced data. In: IEEE International Conference on Data Mining Workshops, pp. 507–514 (2009). https://doi.org/10.1109/ICDMW.2009.35
26. Mienye, I.D., Sun, Y.: Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inf. Med. Unlocked 25, 100690 (2021). https://doi.org/10.1016/j.imu.2021.100690
27. Thai-Nghe, N., Gantner, Z., Schmidt-Thieme, L.: Cost-sensitive learning methods for imbalanced data. In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2010). https://doi.org/10.1109/IJCNN.2010.5596486
28. Lin, W.J., Chen, J.J.: Class-imbalanced classifiers for high-dimensional data. Brief. Bioinform. 14(1), 13–26 (2013). https://doi.org/10.1093/bib/bbs006
29. Barandela, R., Sánchez, J.S., García, V., Rangel, E.: Strategies for learning in class imbalance problems. Pattern Recognit. 36(3), 849–851 (2003). https://doi.org/10.1016/S0031-3203(02)00257-1
30. Beckmann, M., Ebecken, N.F.F., Pires de Lima, B.S.L.: A KNN undersampling approach for data balancing. J. Intell. Learn. Syst. Appl. 7(4), 104–116 (2015). https://doi.org/10.4236/jilsa.2015.74010


31. Pei, W., Xue, B., Shang, L., Zhang, M.: Genetic programming for borderline instance detection in high-dimensional unbalanced classification. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 349–357 (2021). https://doi.org/10.1145/3449639.3459284
32. Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst., Man, Cybern., Part B (Cybern.) 39(1), 281–288 (2008). https://doi.org/10.1109/TSMCB.2008.2002909
33. Tang, Y., Zhang, Y.Q.: Granular SVM with repetitive undersampling for highly imbalanced protein homology prediction. In: 2006 IEEE International Conference on Granular Computing, pp. 457–460 (2006). https://doi.org/10.1109/GRC.2006.1635839
34. Wu, G., Chang, E.Y.: KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Trans. Knowl. Data Eng. 17(6), 786–795 (2005). https://doi.org/10.1109/TKDE.2005.95
35. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20–24, 2004. Proceedings, vol. 15, pp. 39–50. Springer, Berlin (2004). https://doi.org/10.1007/978-3-540-30115-8_7
36. Raskutti, B., Kowalczyk, A.: Extreme re-balancing for SVMs. ACM SIGKDD Explorat. Newsl. 6(1), 60–69 (2004). https://doi.org/10.1145/1007730.1007739
37. Liu, X., Zhao, B., He, W.: Simultaneous feature selection and classification for data-adaptive kernel-penalized SVM. Mathematics 8(10), 1846 (2020). https://doi.org/10.3390/math8101846
38. Mansour, Y., Schain, M.: Random forest. Mach. Learn. 45(2), 123–145 (2001). https://doi.org/10.1023/A:1010950718922
39. Pes, B.: Learning from high-dimensional and class-imbalanced datasets using random forests. Information 12(8), 286 (2021). https://doi.org/10.3390/info12080286
40. Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Anal. 6(5), 429–449 (2002). https://doi.org/10.3233/IDA-2002-6504
41. Dudoit, S., Fridlyand, J., Speed, T.P.: Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97(457), 77–87 (2002). https://doi.org/10.1198/016214502753479248
42. Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: Recent advances and emerging challenges of feature selection in the context of big data. Knowl.-Based Syst. 86, 33–45 (2015). https://doi.org/10.1016/j.knosys.2015.05.014
43. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009). https://doi.org/10.1109/TKDE.2008.239
44. Okwonu, F.Z., Othman, A.R.: Robust fisher linear classification technique for two groups. World Appl. Sci. J. 21(Special Issue) (2013). https://doi.org/10.5829/idosi.wasj.2013.21.mae.99939
45. Mahdianpari, M., et al.: Fisher Linear Discriminant Analysis of coherency matrix for wetland classification using PolSAR imagery. Remote Sens. Environ. 206, 300–317 (2018). https://doi.org/10.1016/j.rse.2017.11.005
46. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(2), 179–188 (1936)
47. Okwonu, F.Z., Ahad, N.A., Okoloko, I.E., Apanapudor, J.S., Kamaruddin, S.A., Arunaye, F.I.: Robust hybrid classification methods and applications. Pertanika J. Sci. Technol. 30(4), 2831–2850 (2022). https://doi.org/10.47836/pjst.30.4.29
48. Okwonu, F.Z., Othman, A.R.: Comparative performance of classical fisher linear discriminant analysis. Matematika 29(1c), 213–220 (2013)
49. Huberty, C.J., Holmes, S.E.: Two-group comparisons and univariate classification. Educ. Psychol. Meas. 43(1), 15–26 (1983). https://doi.org/10.1177/001316448304300103


50. Fukunaga, K.: Introduction to statistical pattern recognition, second edition. In: Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 (2018)
51. Fukunaga, K.: Introduction to Statistical Pattern Recognition (1990). https://doi.org/10.1016/c2009-0-27872-x
52. Okwonu, F.Z.: Several Robust Techniques in Two-Groups Unbiased Linear Classification. https://core.ac.uk/download/pdf/199245931.pdf, Penang (2012)
53. Jennison, C., Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.: Robust statistics: the approach based on influence functions. J. R. Stat. Soc. Ser. A 150(3) (1987). https://doi.org/10.2307/2981480
54. Law, J., Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.: Robust statistics: the approach based on influence functions. The Statistician 35(5) (1986). https://doi.org/10.2307/2987975
55. Hubert, M., Debruyne, M.: Breakdown value. Wiley Interdiscip. Rev. Comput. Stat. 1(3) (2009). https://doi.org/10.1002/wics.34
56. Wang, H., Lu, X., Hu, Z., Zheng, W.: Fisher discriminant analysis with L1-norm. IEEE Trans. Cybern. 44(6), 828–842 (2013). https://doi.org/10.1109/TCYB.2013.2273355
57. Qin, X., Wang, S., Chen, B., Zhang, K.: Robust fisher linear discriminant analysis with generalized correntropic loss function. In: 2020 Chinese Automation Congress (CAC), pp. 7117–7121. IEEE (2020). https://doi.org/10.1109/CAC51589.2020.9326644
58. Maronna, R.A., Martin, R.D., Yohai, V.J.: Robust Statistics: Theory and Methods (2006). https://doi.org/10.1002/0470010940
59. Chen, C.: Robust Regression and Outlier Detection with the ROBUSTREG Procedure. SAS Institute Inc. (2002)
60. Seheult, A.H., Green, P.J., Rousseeuw, P.J., Leroy, A.M.: Robust regression and outlier detection. J. R. Stat. Soc. Ser. A Stat. Soc. 152(1) (1989). https://doi.org/10.2307/2982847
61. Cabana, E., Lillo, R.E., Laniado, H.: Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. Stat. Pap. 62(4) (2021). https://doi.org/10.1007/s00362-019-01148-1
62. Okwonu, F.Z., Ahad, N.A., Ogini, N.O., Okoloko, I.E., Husin, W.Z.W.: Comparative performance evaluation of efficiency for high dimensional classification methods. J. Inf. Commun. Technol. 21(3), 437–464 (2022). https://doi.org/10.32890/jict2022.21.3.6
63. Blagus, R., Lusa, L.: SMOTE for high-dimensional class-imbalanced data. BMC Bioinf. 14, 1–16 (2013). https://doi.org/10.1186/1471-2105-14-106
64. Skurichina, M., Duin, R.P.: Boosting in linear discriminant analysis. In: Multiple Classifier Systems: First International Workshop, MCS 2000, Cagliari, Italy, Proceedings 1, pp. 190–199. Springer, Berlin (2000). https://doi.org/10.1007/3-540-45014-9_18
65. Bickel, P.J., Levina, E.: Some theory for Fisher's linear discriminant function, naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli 10(6), 989–1010 (2004). https://doi.org/10.3150/bj/1106314847
66. Pang, H., Tong, T.: Recent advances in discriminant analysis for high-dimensional data classification. J. Bioinf. Biostat. 3(2) (2012). https://doi.org/10.4172/2155-6180.1000e106
67. Gorman, R.P., Sejnowski, T.J.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1(1), 75–89 (1988). https://doi.org/10.1016/0893-6080(88)90023-8
68. Naranjo, L., Pérez, C.J., Campos-Roca, Y., Martín, J.: Addressing voice recording replications for Parkinson's disease detection. Expert Syst. Appl. 46, 286–292 (2016). https://doi.org/10.1016/j.eswa.2015.10.034
69. Naranjo, L., Pérez, C.J., Martín, J., Campos-Roca, Y.: A two-stage variable selection and classification approach for Parkinson's disease detection by using voice recording replications.


Comput. Methods Programs Biomed. 142, 147–156 (2017). https://doi.org/10.1016/j.cmpb.2017.02.019
70. Japkowicz, N.: The class imbalance problem: significance and strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence (2000)
71. Blagus, R., Lusa, L.: Class prediction for high-dimensional class-imbalanced data. BMC Bioinf. 11 (2010). https://doi.org/10.1186/1471-2105-11-523
72. van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning - ICML '07, pp. 935–942 (2007). https://doi.org/10.1145/1273496.1273614

Research on Music Recommendation Model with Limited Historical Data and User's Information

Myeongjun Kang(B)

St. Johnsbury Academy Jeju, Jeju, USA
[email protected]

Abstract. Nowadays, music streaming services are becoming more popular with the emergence of smartphones. For instance, about 2 million users used Spotify, one of the largest music streaming applications in the world, in 2010, and about 180 million users are on Spotify now. Music streaming applications also provide recommendation charts for higher user satisfaction. For example, YouTube Music has a "Releasing Stress" chart, and users can make music lists and share them with other people in similar situations. However, these recommendation lists are subjective, and this might cause side effects, because some users do not select their favorite artists or set their age when they first use the streaming site. Hence, in this paper, we evaluate current music trends and collect music streaming data from different users. Moreover, we implement a pre-trained music recommendation model based on limited user information such as age, gender, religion, home country, and current location.

Keywords: Data science · Music · Clustering · Recommendation · Python

1 Introduction

Nowadays, music streaming services are becoming more popular with the emergence of smartphones. For instance, about 2 million users used Spotify, one of the largest music streaming applications in the world, in 2010, and about 180 million users are on Spotify now [1]. According to the IFPI, global recorded music revenues grew by 18.5% to $25.9bn, compared with 7.4% growth in 2020, when Covid-19 impacted the world [2]. However, most people listen to already-popular music, which makes it hard for new music to become popular [3]. Currently, there are many popular survival programs in South Korea that give participating artists a chance at exposure; there are many superstars in the world, but also many unknown artists. Nowadays, there are many playlists that people make on streaming sites. For example, YouTube Music has a "Releasing Stress" chart, and users can make music lists and share them with other people in similar situations. However, these recommendation lists are subjective, and this might cause side effects

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 301–307, 2024. https://doi.org/10.1007/978-3-031-47715-7_20


M. Kang

because some users do not select their favorite artists or set their age when they first use the streaming site. Music sites analyze a user's style and try to recommend music based on that analysis, for example the music genre the user likes most and the user's age. We aim to build a system that also considers the user's gender, age, and religion. However, music recommendation typically requires a large amount of data to train the model, so it takes time to achieve highly accurate recommendations for each individual user. In this paper, we evaluate current music trends and collect music streaming data from different users. Moreover, we implement a pre-trained music recommendation model based on limited user information such as age, gender, religion, home country, and current location.

2 Data Analysis

2.1 Data Information

Before implementing the recommendation model, we analyzed 2022 music data to observe trends on the streaming site. We first evaluated the top music lists on Spotify in April 2022. The data contain the titles of the top songs, their genres, artists, and release years. They also contain specific musical properties such as bpm, energy, danceability, liveness, and valence. Energy measures the intensity and activity of a track; bpm is its tempo; danceability describes how suitable a track is for dancing based on a combination of musical elements such as bpm; and liveness measures the presence of an audience in the recording.

2.2 Evaluation

As shown in Fig. 1 and Table 1, pop was the genre people liked most: there were 46 pop songs in the top 100. Pop music includes Canadian pop, K-pop, and pop from other countries. Hip-hop was the second most popular genre, with 22 hip-hop songs in the top 100. Compared to pop and hip-hop, EDM, rock, and soul were less popular. Research shows that 73% of music streaming customers fall in the 15-25 and 26-35 age groups. Hence, the preferences of listeners over 35 may not be strongly reflected in the streaming services' top music lists. We evaluated the top streamed music of 100 listeners aged between 35 and 65. More than 80% of their streamed music was from 1980 to 2000, yet only a few of those songs appeared in the top music chart, which shows that younger listeners strongly influence the top music chart.

Research on Music Recommendation Model


Fig. 1. The number of genres in music top chart.
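A genre tally like the one in Fig. 1 and Table 1 can be produced with a few lines of pandas. The tiny DataFrame below is hypothetical stand-in data, not the April 2022 Spotify chart:

```python
# Count how many top-chart songs fall in each genre (illustrative data).
import pandas as pd

top_chart = pd.DataFrame({
    "title": ["song_a", "song_b", "song_c", "song_d", "song_e"],
    "genre": ["pop", "hip hop", "pop", "rock", "pop"],
})

# value_counts() gives the per-genre frequency, most common first.
genre_counts = top_chart["genre"].value_counts()
print(genre_counts)  # pop 3, hip hop 1, rock 1
```

With the real chart data, the same call yields the counts reported in Table 1.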

3 Music Recommendation Model

3.1 General Music Recommendation

In 2020, as many as 62% of consumers rated platforms like Spotify and YouTube among their top sources of music discovery. For instance, on Spotify, more than one-third of all new-artist discoveries happen through "Made for You" recommendation sessions, according to the recently released Made to be Found report. Spotify's recommendation engine follows a flow similar to TikTok's: the Spotify algorithm analyzes all the song metadata provided by the distributor together with metadata specific to Spotify [4]. Components of Spotify's audio analysis system include danceability, energy, and valence. Another component of the content-based track representation is

Table 1. The exact number of genres in music top chart.

Genre                              Number
Hiphop (Rap)                       22
EDM                                1
Rock                               6
Band                               1
Pop (Canadian pop, K-pop, etc.)    46
Soul                               2
Etc.                               11

the natural language processing models, such as lyrics analysis, web-crawled data, and user-generated playlists. The most widely used music recommendation algorithm is collaborative filtering, which predicts the music a user may be interested in from the user's history. However, this approach requires a large amount of historical data per user. Also, as the number of users increases, the model requires a large amount of computation on implicit feedback data, leading to low recommendation accuracy. Content-based filtering is similar to collaborative filtering but considers only the previous actions or explicit feedback of the user. Both are memory-based methods that recommend items based only on past data using a simple distance measure, such as the nearest-neighbor algorithm. Therefore, they may recommend music similar to what has already been listened to, leading to little variation in music style.

3.2 Clustering Method

To implement the music recommendation model, we use clustering. Clustering groups data points so that points in the same group are similar to each other. There are various clustering algorithms, including k-means and OPTICS. The k-means algorithm computes centroids and iterates until the optimal centroids are found [5]; selecting an appropriate number of clusters k is crucial for good results. OPTICS stands for Ordering Points To Identify the Clustering Structure. It is similar to DBSCAN but adds two concepts: 1) core distance and 2) reachability distance [6].
The core distance of a point is the minimum radius required to classify it as a core point. The reachability distance between two points is the maximum of the core distance of the first point and the Euclidean distance between the two points [7]. In this paper, we applied the k-means algorithm, which works on unlabeled data. We plotted the music along different musical properties to find good clusters. For example, we clustered music by considering two


music properties such as energy and danceability. Then, we applied the clustering to every other property to find the optimal clusters.

3.3 Recommendation Model Based on Classification

Fig. 2. Flow chart of data clustering based on the music properties.

Figure 2 shows the process of the proposed recommendation model based on classification. We applied two different classification models to find the better algorithm for the recommendation model: 1) logistic regression and 2) random forest [8, 9]. Logistic regression models the relationship between input features and a categorical outcome. Random forest is an ensemble algorithm for classification, regression, and other tasks built on decision trees [12]. First, we applied the k-means clustering algorithm to the music data to cluster similar types of music. Then, the classification model returns the indices of the three


clusters that may interest a user, based on user information such as gender, religion, age, home country, and current location [10]. Within each cluster, music is sorted by release year. The model then chooses songs from each cluster based on year and country. However, if the number of recommended songs is fewer than 5, the model shuffles the musical properties and repeats the clustering process. After implementing the model, we tested it against past data containing songs recommended by a music application and the satisfaction scores for those recommendations [11]. The recommendation model reached an 85% satisfaction level with logistic regression and 88% with random forest, as shown in Fig. 3.
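The two-stage flow (k-means clustering of songs, then a classifier mapping user attributes to preferred clusters) can be sketched as follows. The data, feature encodings, and single-cluster prediction below are illustrative simplifications, not the paper's actual data or full top-3 logic:

```python
# Sketch of the cluster-then-classify recommendation pipeline:
# 1) cluster songs by audio properties with k-means,
# 2) train a classifier to map user attributes to a preferred cluster,
# 3) recommend songs drawn from the predicted cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
songs = rng.random((200, 2))  # columns stand in for energy, danceability
song_cluster = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(songs)

# Hypothetical training data: encoded user attributes (e.g. age, gender,
# country) paired with the cluster each user liked most.
users = rng.random((300, 3))
liked = rng.integers(0, 4, size=300)
clf = RandomForestClassifier(random_state=0).fit(users, liked)

# Recommend: predict a cluster for a new user, then pick songs from it.
pred = clf.predict(users[:1])[0]
recommended = np.flatnonzero(song_cluster == pred)[:5]
print(pred, recommended)
```

In the full model, the classifier would return the top three clusters and songs would additionally be filtered by release year and country.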

Fig. 3. Satisfaction model with logistic regression and random forest.

4 Conclusion

We applied k-means clustering and two classification algorithms (logistic regression and random forest) to implement a data-limited music recommendation model. The model reached an average 88% satisfaction level. For future work, we plan to apply NLP to the data to find word similarity. We expect that word similarity between music properties such as religion and song titles will further increase user satisfaction.

References

1. Spotify revenue and usage statistics (2022). Business of Apps (2022, June 30). Retrieved August 27, 2022, from https://www.businessofapps.com/data/spotify-statistics/
2. IFPI Global Music Report 2022. IFPI (n.d.). Retrieved August 27, 2022, from https://globalmusicreport.ifpi.org/
3. Staff, V.: Spotify's 2010 numbers: revenues, losses both up. Variety (2011). Retrieved August 27, 2022, from https://variety.com/2011/music/news/spotify-s-2010-numbers-revenues-losses-both-up-1118044200/amp/


4. How Spotify's algorithm works? A complete guide to Spotify recommendation system [2022]. Music Tomorrow Blog (n.d.). Retrieved August 27, 2022, from https://www.music-tomorrow.com/blog/how-spotify-recommendation-system-works-a-complete-guide-2022
5. Li, Y., Wu, H.: A clustering method based on K-means algorithm. Phys. Proc. 25, 1104–1109 (2012). https://doi.org/10.1016/j.phpro.2012.03.206
6. AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities. J. Supercomput. 75 (2019). https://doi.org/10.1007/s11227-018-2380-z
7. Ankerst, M., Breunig, M., Kriegel, H.-P., Sander, J.: OPTICS: ordering points to identify the clustering structure. SIGMOD Record 28, 49–60 (1999). https://doi.org/10.1145/304182.304187
8. Dayton, C.: Logistic Regression Analysis (1992)
9. Sperandei, S.: Understanding logistic regression analysis. Biochem. Med. 24, 12–18 (2014). https://doi.org/10.11613/BM.2014.003
10. Byrne, C.: Music: a source of deep imaginative satisfaction? Br. J. Music Educ. 15, 295–301 (1998). https://doi.org/10.1017/S0265051700003971
11. Shukla, V., Kusumawati, R., Dutt, H., Yusnitasari, T., Oswari, T.: Investigating customer satisfaction towards music website in Indonesia and India: a comparative study. Int. J. Digital Signals Smart Syst. 4, 17 (2020). https://doi.org/10.1504/IJDSSS.2020.10027763
12. Cutler, A., Cutler, D., Stevens, J.: Random forests (2011). https://doi.org/10.1007/978-1-4419-9326-7_5

A System for Analyzing Human Capability at Scale Using AI

Daniel Zhu1, Dave Ulrich2, Sanjiv Das3,4(B), and Norm Smallwood2

1 University of California, Berkeley, CA 94701, USA
2 The RBL Group, Provo, UT, USA (https://www.rbl.net)
3 Amazon Web Services, Santa Clara, CA 95054, USA
4 Santa Clara University, Santa Clara, CA 95053, USA
[email protected]

Abstract. Over 80% of corporate value is now comprised of intangibles, of which a large component is human capability (HC). Reflecting this, the SEC has recently mandated HC reporting requirements (SEC, Q4 2020). We use machine learning to build a prototype system to analyze HC using SEC filings and apply it to 5,760 companies. The approach algorithmically generates lexicons for HC concepts, and then applies machine learning to extract the relevant text on HC and business outcomes from annual reports, to create a dashboard for each firm on the quantity of reporting over four dimensions of HC: talent, leadership, organization, and human resources operations. The system links HC reporting to measurable business outcomes such as revenue per employee, earnings, Tobin's Q, and social citizenship. This will enable companies to improve the quality of reporting and governance of HC as well as guide investments in specific areas of HC.

Keywords: Human capability · Human capital · AI · Natural language processing · Multi-modal machine learning

1 Introduction

In the United States, more than a third of employees work for big firms,1 and human capital management has become increasingly important as a component of corporate value. It is now possible to use data science techniques to assess human capability, leveraging text and tabular data from regulatory reports. This paper describes a system to do so using multi-modal machine learning.

We are grateful to Bratin Saha for his ideas, advice, and support on making the system in this paper a socially beneficial one.
1 https://www.wsj.com/graphics/big-companies-get-bigger/.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 308–324, 2024. https://doi.org/10.1007/978-3-031-47715-7_21

Attention to human capability has increased dramatically in recent years due to contextual challenges around the global pandemic, racial and social injustice, digital and technological advances, political divisiveness, and economic shifts. In

Human Capability at Scale Using AI

309

this article, we intentionally use the term human capability (HC) rather than human capital. Our working definition of HC focuses on four pathways: (1) talent, circumscribed by concepts like people, individual competence, employees, human capital, or workforce; (2) organization, with concepts like culture, organization capability, agility, systems, or workplace; (3) leadership, including leader, manager, boss, supervisor; and (4) human resources, with concepts like HR practices, departments, operations, people, services, analytics. As knowledge work increases in every economy, HC becomes an important part of corporate investment. HC enhances the intangible value of all companies. The share of intangible assets in corporate value has increased from 17% in 1975 to 84% in 2015 (McMurrer and Bassi [17]). Tangible physical technology is greatly enhanced by the HC around it, for which Tambe et al. [24] coined the term "digital capital." A number of studies have shown the importance of HC as a core intangible. Smallwood and Ulrich [22] show how organization capabilities such as agility, culture, innovation, collaboration, and strategic clarity shape shareholder value. Ulrich [25] proposes a leadership capital index to help investors identify leadership qualities that will increase investor confidence. Schneider et al. [20] find evidence for the impact of talent or workforce engagement on performance. Ulrich and Brockbank [26] show how the human resource function can deliver value to all stakeholders. For example, Amazon formally recognized the value of HC by adding a new leadership principle in 2021: "Strive to be the Earth's Best Employer." The SEC mandated human capital reporting via the Federal Register—Final Rule: Modernization of Regulation S-K Items 101, 103, and 105; Release Nos. 33-10825; 34-89670; File No. S7-11-19 on November 9, 2020.
These rules modernize the requirements of Regulation S-K applicable to disclosure of the description of the business (Item 101), legal proceedings (Item 103), and risk factors (Item 105), and greatly expand HC management disclosures. The wording of the rule is as follows: Item 101(c)(2)(ii): Provide "A description of the registrant's human capital resources, including the number of persons employed by the registrant, and any human capital measures or objectives that the registrant focuses on in managing the business (such as, depending on the nature of the registrant's business and workforce, measures or objectives that address the development, attraction and retention of personnel)." Because of the breadth and ambiguity in this definition of "human capital," SEC reporting of HC varies dramatically both in length (ranging from under 200 words to over 2,000) and in content covered, from safety to unions to broad axioms to specific quantitative data. The reporting of HC content is scattered around the annual report and requires specialized information retrieval.

310

D. Zhu et al.

To help frame the reporting on HC using SEC filings, machine learning can be used to glean an evaluation of the four pathways of talent, organization, leadership, and human resources. This enables automated evaluation of all companies that make 10-K filings, which amount to seven to eight thousand firms per year; thus, HC evaluation may be scaled using machine learning. Further, by standardizing the framework along the four pathways, it is possible to create a system that helps all companies standardize their HC reporting, which may then drive management and investor decisions. The system described in this paper offers a common typology for HC reporting so that companies and investors learn from each other using a common framework and vocabulary. At present, there is extensive debate about what "HC" refers to. For example, some work focuses extensively on HR practice areas. ISO-30414,2 titled "Human resource management—Guidelines for internal and external HC reporting," suggests core HC areas range from general ideas to organization practices to specific metrics. At present, the terminology of HC is nebulous. This work creates new word lists that define the breadth and vocabulary of HC and serve as a reference glossary, thesaurus, or set of lexicons for the HR industry, as presented in Sect. 5. We organize these lexicons into the four HC pathways to provide a comprehensive and cohesive framework for the HC industry. Some studies use employee surveys (e.g., Guiso et al. [10], for S&P 500 companies in 2011) or use earnings calls, which vary dramatically by company. With the new SEC regulation, we are able to extract all text in the SEC filings that relates to HC and to have a common corpus of data (SEC filings). Because SEC data is bound by regulatory requirements, it offers a comparable and reliable source of HC information. We have automated the process to collect all 10-K filings and extract and analyze HC text, to generate a report.
Using this approach, our algorithm retrieved HC text for the calendar year 2021 from more than 7,000 filings; after dropping some companies on account of missing data, we are able to undertake analysis on 5,760 companies to demonstrate the prototype. Given this seamless automation, the analysis may be re-run for any period at any time and on other data sources. This paper describes an AI/ML system that extends prior econometric work to: (1) create a more comprehensive model of HC as comprised of four dimensions: talent, organization, leadership, and human resources, (2) influence and standardize more effective and transparent reporting of HC activity in corporations that informs executive decisions and investor confidence, and (3) scale studies of HC beyond surveys to large databases that show relationships between HC and employee, business, investor, and community results. For the machine learning field, we demonstrate how machine learning technologies can define the HC field with broader and more accurate definitions of HC and of its impact on business outcomes.

2 www.iso.org/standard/69338.html.

1.1 Related Literature

Whereas there is research on the management of HC and on financial metrics, the literature connecting HC, corporate performance, and AI is quite nascent. However, there is an older literature touching upon these issues and a recent surge of interest in using ML methods to better understand the effectiveness of HC management. Research has shown the impact of HC activities on firm performance. Huselid [11] found that a one standard deviation increase in the use of high-performing work systems was associated with a per-employee increase in market value of $18,641, an increase of $27,044 in sales (on mean sales per employee of $171,099), and an increased cash flow of $3,814. This early work has expanded dramatically and shows that HC improvements deliver financial returns as well (see Huselid [12]). Most of the work showing the impact of HC on business outcomes relies on surveys or work within specific companies. Storey et al. [23] summarize this work as showing a positive relationship between HR practices and firm performance across industries and geographies, particularly when HR practices are bundled together to deliver individual competencies, organization capabilities, and leadership. The RBL Group3 found that survey results with key informant data from over 1,200 organizations show that investments in the four pathways of (1) Talent, (2) Leadership, (3) Organization, and (4) HR can be linked to five stakeholder outcomes: employee well-being/productivity, strategic reinvention, customer value, investor performance, and community reputation. The work in this paper presents a system for organization guidance, i.e., to scale these survey-based studies through application of machine learning and AI to SEC reporting, using large amounts of text and tabular data. There have been preliminary attempts to use econometric approaches to extend survey methodology and show the impact of HC on firm performance. Guiso et al.
[10], in an examination of the value of corporate culture, look at S&P 500 companies (from June to October 2011) and show that proclaimed corporate values may be relevant. Notably, firms in which top managers are seen as trustworthy and ethical show strong financial performance, whereas governance structures do not appear to matter. Of corporate web sites, 85% explicitly stipulate some statements about their corporate culture, an important organizing principle of HC. The relation of these statements to corporate performance is tenuous, to say the least. But Guiso et al. [10] find that responses to surveys by employees are more revealing: improvements in reported management integrity scores are strongly correlated with increases in Tobin's Q and a decline in the fraction of unionized workers.4 In another study, Li et al. [15] also focus on value words to define culture and draw on data from earnings calls between 2001 and 2018. They find that these words correlate with many aspects of business performance, including operational efficiency, risk-taking, earnings management, executive compensation design, and firm value.

3 https://www.rbl.net/.
4 https://www.investopedia.com/terms/q/qratio.asp.

In a recent survey of 1,348 North American executives, Graham


et al. [9] find that 84% of them strongly believe that culture impacts corporate value. Popadak [19] constructed an innovative measure of corporate culture at the firm level by utilizing insider reviews from popular online job boards and forums, such as Glassdoor.com and Payscale.com. She measures six elements of corporate culture on an annual basis: adaptability, collaboration, customer-orientation, detail-orientation, integrity, and results-orientation. These are found to be related to firm value. Finally, Gorton et al. [8] offer a comprehensive survey of the work on corporate culture.

The main contributions of the system are described in the following sections. Section 2 describes the benefits of the new system for HC analysis. Section 3 describes the SEC filing data and the downloading and processing of the reported text data at scale. Section 4 discusses how ML is used to extract the relevant HC text from SEC filings that comprise thousands of words; this forms an essential first step in scaling the analysis of HC. Section 5 explains how ML is used to create lexicons for scoring the various attributes of HC. Engineering details are in Sect. 6. Section 7 fits ML models to the extracted text and scores to link HC reporting to business outcomes. These models may be used to understand what aspects of HC drive outcomes such as revenues, earnings, etc. Concluding discussion is in Sect. 8.

2 System Implications

This proposed system has implications both for the overall "HR" industry vertical and for individual firms. For the overall industry, the system:

1. Develops a typology that organizes "HC" into four pathways. In almost every field, typologies become the foundation for organizing disparate activities and events into accepted categories or patterns: food typologies (four food groups), political typologies (political parties), biology typologies (kingdom, class, order, genus), employee typologies (full time, part time, contract), industry typologies (farming, manufacturing, service, etc.). This work provides a conceptual and empirical frame that defines the HR industry. (See the impact of framing in Cukier et al. [5].)
2. Defines the breadth and vocabulary of HC and serves as a reference glossary, thesaurus, or wordlist for the HR industry.
3. Offers an overall measure of HC for SEC (and other) reporting. This overall indicator could become an accepted standard/metric for HC, like Tobin's Q for intangibles or the Treadway Commission framework for risk with its four risk categories (compliance, strategic, operational, and financial).

For a specific firm, the system:

1. Develops an HC score that ranks a firm against the overall sample as well as its industry. This benchmark score can become part of the firm's overall performance scorecard used by investors, regulators, customers, media, boards, executive teams, investor relations, and internal human resource groups.


2. Enables each company to assess its public reporting and likely internal actions in HC. Business and HR leaders will be able to determine how they perform on each of the four HC pathways. This will help them (a) better report what they are doing, since they will now have a framework and language to do so, and (b) prioritize where they should focus to improve in each of the four pathways.

The system described here has two stages: (1) governance (scoring and reporting of HC for regulators, shareholders, and communities), and (2) guidance on improving HC towards better business outcomes, of which we focus on: (i) revenue per employee, (ii) Tobin's Q, (iii) earnings before interest, taxes, depreciation, and amortization (EBITDA),5 and (iv) social responsibility based on fraud and litigiousness scoring of firms. This paper describes how machine learning is used to implement these two stages.

3 Data

The primary data source for this analysis is 10-K SEC filings. These are annual reports filed by all publicly traded firms, as well as private firms that have exceeded thresholds of stock ownership (500 shareholders) and assets ($10 million), as mandated by the Securities Exchange Act of 1934. These filings are public record and may be freely downloaded by anyone. We built an API6 to download the filings in XML and parse them into plain text. Since being mandated by the SEC, HC reporting has varied. Some firms created a new section titled "Human Capital" in their 10-Ks, whereas others reported the content in various places in the filing, often in the Management Discussion & Analysis (MD&A) section. Since the HC reporting is not uniform, we cannot simply search for and extract a section on HC. Instead, we used a word-based approach to detect the relevant sentences and paragraphs with HC content. We augmented this approach with a machine learning model trained to detect sentences related to HC content. Our HC text extractor attains a high level of accuracy (details in the following section). Li et al. [15] analyze earnings calls to score five attributes of corporate culture: innovation, integrity, quality, respect, and teamwork. We apply a similar approach with a much broader set of HC concepts: whereas they score five attributes, we score 14 and combine them into the four pathways (Sect. 5).
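The download-and-parse step (filings arrive as XML/HTML and must be reduced to plain text) can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the SDK's actual parser; the class and function names here are invented for the example.

```python
from html.parser import HTMLParser


class FilingTextExtractor(HTMLParser):
    """Collect visible text from a filing's HTML/XML body."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # depth inside <script>/<style>, whose text is not content

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty visible text runs.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def filing_to_text(markup: str) -> str:
    """Reduce a filing's markup to a single plain-text string."""
    parser = FilingTextExtractor()
    parser.feed(markup)
    return parser.text()
```

A production parser would additionally need to handle EDGAR's multi-document submissions and embedded XBRL markup, which the SDK abstracts away.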

5 https://www.investopedia.com/terms/e/ebitda.asp.
6 https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/notebooks/index.html.

4 Human Capability Text Extraction

The 10-K (annual report) filed by companies with the SEC is an extensive document, comprising tens of thousands of words. Since the reporting of HC by firms within the 10-Ks is varied, we used word-based and machine learning approaches to extract HC-related text from the SEC filings. The various approaches are described here. We first extracted sentences containing a preponderance of HC words using a keywords-based extractor. The word lists were generated using an automated algorithm (Das et al. [6]) and further refined by human curation. However, manually checking the extracted sentences revealed that this method produced many false positives. We then trained a machine learning model to choose sentences in the 10-K filings that are related to HC and/or to consequential business outcomes. This was undertaken with few-shot learning on the 10-Ks from a few companies, from which we manually extracted all HC-related sentences (the remaining sentences serve as negative samples). The chosen companies are: Amazon, Applied Materials, BK Technologies, Borg Warner, CEVA, Dell, FCCN, Intel, Interdigital, and Walgreens. This machine learning approach does better, extracting HC sentences with a test accuracy of 88%, an F1 score of 88.5%, precision of 89.6%, and recall of 87.5%. The trained classifier is then used to extract HC text for all 5,760 companies in the sample. A two-step approach, in which we first use the word lists as a coarse filter on the 10-K filings to extract sentences that are likely HC related, does not significantly reduce the amount of text that the ML model must process. Thus, our final approach for extracting HC text is the one-step machine learning model. An example of extracted HC text is shown in Fig. 1.
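As a concrete illustration of the keywords-based extractor described above, the sketch below keeps sentences containing several distinct HC words. The word list and the two-hit threshold are invented for the example; as noted, this coarse filter yields many false positives, which motivated the trained classifier.

```python
import re

# Illustrative HC word list (the actual lexicons are generated
# algorithmically and curated by humans; see Sect. 5).
HC_WORDS = {"employee", "employees", "talent", "workforce", "culture",
            "leadership", "training", "compensation", "diversity"}


def extract_hc_sentences(text, min_hits=2):
    """Keep sentences containing at least `min_hits` distinct HC words."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        if len(tokens & HC_WORDS) >= min_hits:
            kept.append(sentence.strip())
    return kept
```

A filing sentence like "We invest in employee training and talent development" passes the filter, while purely financial sentences are dropped; the false positives arise because HC words also appear in boilerplate risk-factor language.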

5 HC Lexicons

In this section, we briefly discuss the lexicons used in the project. Using "seed" words drawn from domain expertise, we used the algorithm in Das et al. [6] to automatically extract words that are conceptually related to the seed words. The mechanics of this approach are briefly as follows. The user provides a pair of words that are either synonyms or antonyms.

1. If the words are synonyms, we generate two word lists from numerical vector representations of words (embeddings, based on the word2vec algorithm of Mikolov et al. [18]) that are closest to the two words, using the cosine similarity metric on pre-trained word vectors. These word lists are then intersected with a dictionary to keep only the words that are valid in English, and the algorithm returns the union of both word lists.
2. If the words are antonyms, we generate two word lists with embeddings that are closest to the two words, intersect these lists with a dictionary to keep only the valid words, and then return two separate word lists. If a word appears in both lists, we keep it only in the list in which it has the highest similarity with the concept word.
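A minimal sketch of this expansion step, using a toy embedding table in place of pre-trained word2vec vectors (all function names, vectors, and vocabulary here are illustrative, not from the actual system):

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def nearest(seed, embeddings, k):
    """The k vocabulary words closest to `seed` by cosine similarity."""
    ranked = sorted((w for w in embeddings if w != seed),
                    key=lambda w: cosine(embeddings[seed], embeddings[w]),
                    reverse=True)
    return ranked[:k]


def expand_pair(w1, w2, embeddings, dictionary, k=3, antonyms=False):
    """Expand a seed-word pair into lexicon(s), as in Das et al. [6]."""
    l1 = [w for w in nearest(w1, embeddings, k) if w in dictionary]
    l2 = [w for w in nearest(w2, embeddings, k) if w in dictionary]
    if not antonyms:
        # Synonyms: one combined "support" list.
        return sorted(set(l1) | set(l2))
    # Antonyms: a word in both lists stays with the closer seed.
    for w in set(l1) & set(l2):
        if cosine(embeddings[w], embeddings[w1]) >= cosine(embeddings[w], embeddings[w2]):
            l2.remove(w)
        else:
            l1.remove(w)
    return sorted(set(l1)), sorted(set(l2))
```

The dictionary intersection mirrors the paper's step of weeding out non-dictionary tokens before the human curation pass.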

Fig. 1. Example of text extracted using machine learning

In short, with synonyms the algorithm returns a single list (support for the concept), and with antonyms it generates two lists (support for, as well as against, the concept). We generated 14 such word lists using the following seed words: capability, vision, talent, organization, mission, management, leadership, human resources, human capital, employee, develop, culture, competence, agility. These lists were further triaged (using human curation) to construct the final set used for scoring. The 14 word lists are aggregated into the four pathways for Talent, Leadership, Organization, and HR as needed for coarser granularity of HC text scoring. The word lists are assigned to the pathways as follows:

1. Talent = talent + employee + competence
2. Leadership = leadership + management + develop
3. Organization = organization + culture + agility + mission + vision + capability
4. HR = human capital + human resources.

Using these word lists, we compute the fraction of the HC text comprised of words in a given list. This operation is compute-intensive, and we therefore use special-purpose APIs developed in AWS SageMaker JumpStart for the financial sector.7 These scores are then normalized across the dataset to put each company's score on each attribute into a range from 1 to 10. This scoring table permits ranking and filtering companies on one or more attributes, and enables an analysis of where a company stands in relation to others based on their HC reporting. An example of this table is shown in Fig. 2.
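The per-lexicon scoring and the 1-to-10 normalization can be sketched as follows. This is a simplified stand-in for the distributed SageMaker scoring API; the tokenization and min-max rescaling are assumptions made for the illustration.

```python
def lexicon_score(text, lexicon):
    """Fraction of tokens in `text` that appear in `lexicon`."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t.strip(".,;:") in lexicon for t in tokens) / len(tokens)


def normalize_1_to_10(scores):
    """Min-max rescale raw scores across the dataset into [1, 10]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [1 + 9 * (s - lo) / (hi - lo) for s in scores]
```

Each company's HC text is scored against all 14 lexicons, pathway scores are formed by aggregating the relevant lists, and the normalization runs over the whole cross-section so companies are comparable.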

7 https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/smjsindustry.nlp_scorer.html.

Fig. 2. Human capability scores. The user can filter this table using the filter template above each column

6 Engineering the System

We provide a brief description of the engineering pipeline built to implement the analytic system described in the paper. We leverage AWS SageMaker8 for building the system. Before even getting to the main analyses in the paper, several artifacts were required to obtain the data and featurize it for further use. The pipeline comprises several subsystems/modules, which are as follows:

1. Module to download and parse SEC 10-K filings, packaged into an SDK.9 This leverages a single API call in which the user may specify a collection of tickers, a date range, and the specific SEC filings required. The system collects each SEC filing for the chosen input, parses it into clean text, and then delivers the final result as a CSV file. This part of the pipeline is a complex information extraction and cleaning exercise that required handshaking with the SEC's EDGAR system and can take several weeks to develop properly. The system is now available so that other users may collect similar data on an ongoing basis and analyze it for HC characteristics.
2. Module to extract HC text from the SEC filings using a trained ML model based on hand-labeling and few-shot learning. This process required tedious hand-labeling of sentences, those related to HC (positive instances) and those unrelated (negative samples). Few-shot learning proved effective, training a model to 89% (F1 score and accuracy). This forms the core information retrieval segment of this work.
3. Module to generate HC word lists for scoring the HC text, using the work in Das et al. [6]. This approach uses seed words for various HC constructs to generate a lexicon of related concept words. The approach is intuitively simple

8 https://aws.amazon.com/sagemaker/.
9 https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/notebooks/index.html.


and selects words that are close to the seed words using cosine similarity over embeddings for all words. The automatically generated list is then intersected with an English dictionary to weed out non-dictionary words. Further human curation is undertaken by domain experts to attain the final word lists for the 14 concepts in the paper, discussed in Sect. 5.
4. Module to score HC and create a dashboard, using SageMaker JumpStart10 with a special-purpose API.11 This API calculates the proportion of words in the HC text that appear in each of the lexicons. This is a compute-intensive operation given the length of HC text from SEC filings, and the API was designed to distribute the task across any number of chosen machines. The API automatically returns a dataframe with the text column and several additional columns for all the HC attributes scored using the lexicons. We then aggregate subsets of attributes into a score for each of the four pathways (see Fig. 2 for an example).
5. Multi-modal ML training modules to fit business outcomes to HC text and HC scores using AutoGluon.12 The featurized data comprises both text and tabular columns in a single dataframe. We exploited the ease of use of AutoGluon, which enables fitting multi-modal ML models to mixed dataframes in three to four lines of code. Moreover, it fits a wide range of models, such as linear models, tree models, and neural nets, with boosting, and also stack-ensembles the models. The framework is highly performant and has won Kaggle competitions with very little engineering effort.13
6. ML explainers linking the predictions of the trained models to underlying features using SageMaker Clarify.14
Remaining work entails integration of these components into a workflow, UX additions, and report generation.

7 HC Reporting and Business Outcomes

Does the new reporting mandated by the SEC matter? Does it reflect how corporate value is impacted by HC, and does it help analysts understand how HC relates to the value of corporate intangibles? To assess these questions, we fit machine learning models to the dataset comprising around 5,760 firms. For each firm, we have the four pathway text scores discussed earlier as numerical features. We also have a column of HC text, extracted using our few-shot trained model that recognizes sentences related to HC. Our machine learning is therefore multi-modal (tabular and textual data), yet parsimonious in the number of features (a text column and four tabular columns).

10 Scoring to prepare a dashboard is discussed, with various ways to visualize the data, at https://aws.amazon.com/blogs/machine-learning/create-a-dashboard-with-sec-text-for-financial-nlp-in-amazon-sagemaker-jumpstart/.
11 https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/smjsindustry.nlp_scorer.html.
12 https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-multi-modal.html.
13 https://github.com/autogluon/autogluon.
14 https://aws.amazon.com/sagemaker/clarify/.


We focus on the following outcomes:

1. Employee: productivity (revenue/employee).
2. Financial: operations and profitability (e.g., EBITDA) or intangible value (Tobin's Q).
3. Community: reputation and social citizenship (e.g., litigiousness scores, fraud scores, etc.).

These outcomes form the labels for our analysis. When a label is continuous, we fit regression models, and we also break the outcomes into categories and fit classifiers. Our models are fit using AWS AutoGluon,15 which supports the fitting of accurate machine learning models on multi-modal (text plus tabular) data. These are not causal models, but they indicate how HC reporting co-varies with business outcomes in the cross-section of firms.

7.1 Revenue per Employee

This is a common metric used to assess the productivity of HC. The distribution (in log values) is seen in Fig. 3. We fit machine learning models using revenue per employee as the label. Both regression and classification models are implemented. The results of the regression model are shown in Table 1. The errors may be assessed against the spread of the distribution above. For the classification problem, we split revenue per employee into four quartiles to build a multi-category classifier. The regression and classification models are both stack-ensembled machine learning models. (The regression model is not ordinary least squares.) The approach ensembles regression versions of ML models such as K-nearest neighbors, XGBoost, LightGBM (gradient-boosted models), CatBoost, Random Forest, Extra Trees, and neural networks; more than one of these model forms may be ensembled. Balanced accuracy is the average of recall across all four classification categories. The Matthews Correlation Coefficient (MCC) is a metric16 that consolidates all values in the confusion matrix into a single score that lies in the range [−1, +1]. When the MCC is zero, the model has no classification ability; when MCC > 0, the model demonstrates classification ability, with MCC = 1 being perfect. There are several advantages to using MCC, as noted by Chicco and Jurman [2]. The MCC = 0.41 is evidence of good fit of the classification model. The R2 = 0.45 is also roughly twice that of prior studies in this area, for example, canonical papers such as Combs et al. [3], Crook et al. [4], and Jiang et al. [13]. The good fit of this model may partly be attributed to the use of text in a multi-modal model, a new approach in comparison to previous work in this area, where only tabular data is used. Overall, we may conclude that HC text reported in SEC filings is related to, and supports discrimination of, revenue per employee in the cross-section of firms in our sample.
Overall, we may conclude that HC text reported in SEC filings is related to and supports discrimination of the revenue per employee in the cross-section of firms in our sample. 15 16

https://auto.gluon.ai/. https://en.wikipedia.org/wiki/Phi coefficient.
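The two classification metrics reported in Table 1 can be computed directly from a confusion matrix. The sketch below implements balanced accuracy (mean per-class recall) and Gorodkin's multiclass generalization of the MCC; `cm[i][j]` counts samples of true class i predicted as class j.

```python
import math


def balanced_accuracy(cm):
    """Average per-class recall over the confusion matrix `cm`."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm) if sum(row)]
    return sum(recalls) / len(recalls)


def mcc(cm):
    """Multiclass Matthews Correlation Coefficient (Gorodkin's R_K)."""
    k = len(cm)
    s = sum(sum(row) for row in cm)                          # total samples
    c = sum(cm[i][i] for i in range(k))                      # correct predictions
    t = [sum(cm[i]) for i in range(k)]                       # true-class counts
    p = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted-class counts
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                    (s * s - sum(tk * tk for tk in t)))
    return num / den if den else 0.0
```

For the binary case this reduces to the familiar phi coefficient; a perfectly diagonal confusion matrix gives MCC = 1, and a chance-level classifier gives a value near 0.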

Fig. 3. Distributions of (a) revenue per employee, in log values; (b) Tobin's Q, in log values; (c) EBITDA margin


Table 1. ML models fitted to HC text and scores for various business outcomes. This table shows regression and classification results. The feature set comprises a column of HC text and four columns of scores, one each for talent, leadership, organization, and HR. The column header "2-way" stands for binary classification and "4-way" for classification into four categories.

Regression metrics       Revenue per employee  Tobin's Q  EBITDA  Fraud  Litigiousness
Root mean-squared error  1.117                 0.983      0.336   0.008  0.009
Mean absolute error      0.712                 0.707      0.181   0.005  0.006
Median absolute error    0.431                 0.533      0.100   0.003  0.004
R2                       0.445                 0.253      0.261   0.359  0.484

Classification metrics   4-way                 4-way      2-way   4-way  4-way
Accuracy                 0.556                 0.404      0.897   0.472  0.554
Balanced accuracy        0.551                 0.411      0.897   0.464  0.554
MCC                      0.407                 0.216      0.793   0.295  0.407

7.2 Tobin's Q (Price to Book Value)

This metric is widely used to assess whether a firm is undervalued or overvalued. In its pure form, invented by Kaldor [14] as the v-ratio and later popularized by James Tobin as the q-ratio,17 this ratio is defined as a firm's market value divided by its intrinsic value; since the latter is not always easy to define and measure, book value is used in practice in place of intrinsic value. This "market-to-book" ratio proxies for the growth prospects of a company and is hence widely used in forward-looking analyses of corporations. For our sample of firms, we display the Q ratio in logs in Fig. 3. The R2 = 0.25 from the regression model and the MCC = 0.22 from the classification model suggest that the fit to the data supports a connection between HC features and Tobin's Q. Therefore, reporting on HC is related to firms' growth prospects, offering support for why a large fraction of firm value is comprised of intangibles.

7.3 Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA)

EBITDA is an important measure of firm profitability and operational efficiency. It ignores non-operational expenses and is hence a better metric to use when assessing the impact of HC. EBITDA is also often used to generate baseline firm 17

https://www.investopedia.com/terms/q/qratio.asp.

Human Capability at Scale Using AI

321

valuations, as a multiple of EBITDA. EBITDA margin is used, i.e., EBITDA divided by revenue.18 The range of EBITDA margins in our sample is shown in Fig. 3. Interestingly, the figure above displays the classic cliff to the left of the peak around zero EBITDA levels, evidencing earnings manipulation as first highlighted in the paper by Degeorge et al. [7], and more recently in work by Caramanis and Lennox [1]. This shows that firms that are about to report barely negative EBITDA, may be undertaking window-dressing of their accounts to push EBITDA to the positive region. We fitted both, a regression model and a classification model. For the latter, we created a binary split of the data for positive versus negative EBITDA (notice that the data has a pronounced left skew). For both models, we report the results in Table 1. The R2 = 0.26 from the regression model suggests that the fit to the data supports a connection between HC features and EBITDA. The results from the classification model are very strong with an accuracy level of 89% and an area under the curve (AUC) from ROC analysis of 0.95. We see a high M CC of 0.79 as well. The model fit to this earnings metric strongly relates to HC reporting. 7.4

Social Responsibility

Using a lexicon of words related to two concepts, fraud and litigiousness, we score the Management Discussion and Analysis (MD&A) section of the 10-K filings to get proxies for social responsibility from the management discussion, because an absence of fraud and litigious wording suggests a good level of corporate responsibility. We then see if the feature set (HC text and four pathways) provides a good fit to these social responsibility outcomes. The distribution of these scores in the dataset is shown in the histograms in Fig. 4. The plot on the left is for fraud and the one on the right is for litigiousness. The x-axis values represent the fraction of words in the MD&A section that are matched to the fraud and litigiousness word lists. For both variables, the following is the fit of the regression model, which delivers good R2 (0.36 for fraud and 0.48 for litigiousness) and M CC values (0.30 for fraud and 0.41 for litigiousness). This suggests a relationship between HC activity and business responsibility outcomes in the cross-section of firms. This confirms both the validity of the HC framework we propose and its impact on key business outcomes.

18

https://corporatefinanceinstitute.com/resources/knowledge/valuation/ebitdamultiple/.
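The lexicon scoring step amounts to a word-fraction computation. A sketch, with a hypothetical word list standing in for the paper's machine-generated dictionaries (the actual lexicons and tokenization are not reproduced here):

```python
def lexicon_score(text, lexicon):
    """Fraction of words in `text` matched to `lexicon` (a sketch; the
    paper's actual word lists and tokenization are not reproduced)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in lexicon for w in words) / len(words)

# Hypothetical fraud word list; the paper's lexicons are machine-generated.
fraud_words = {"fraud", "misstatement", "restatement"}
score = lexicon_score("No fraud or restatement was identified", fraud_words)
```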


Fig. 4. Distribution of social responsibility scores. The plot on the left is for fraud and the one on the right is for litigiousness. The x-axis values represent the fraction of words in the MD&A section that are matched to the fraud and litigiousness word lists

8 Concluding Discussion

The SEC mandated HC reporting by companies in their 10-K filings. In the absence of a standardized reporting template, companies reported HC activity in many diverse ways throughout their 10-Ks. The system outlined in this paper enables HC assessment at scale by applying AI/ML to a four-pathways framework and incorporates managerial guidance to enhance business outcomes through better use of human capital. It uses a trained machine learning model to extract text from the filings that relates to HC activity and business outcomes. It decomposes HC activity into four categories: (i) talent, (ii) leadership, (iii) organization, and (iv) human resource processes, and scores HC reporting for these attributes using machine-learning-generated dictionaries for 14 sub-attributes of the four main activities. The system relates reported HC activity to business outcomes using machine learning models, establishing a link to financials, consistent with the idea that HC forms a material share of corporate intangible value. In a feedback loop, these analyses will also help companies improve their reporting on HC. Productionizing this work may be supported by artifacts on Amazon SageMaker.

This work is an early attempt to bring an assessment of human capability, traditionally undertaken via surveys, to the realm of machine learning using natural language methods at scale. When this work was done, only one year of post-regulation data was available, but as more time passes, the analyses can be extended to additional data. The labeling of text as relevant to human capital was limited by resources, and it may be possible to bring a large-scale labeling effort to this time-intensive task (SEC filings are extremely long). We considered a few financial metrics in this work, but there are many others that one may wish to relate to the human capital factors extracted here. Another application, not pursued in this paper but now viable as generative AI technologies become increasingly powerful, is to automatically generate an analysis of human capability from the featurized data. This work may also be extended to explain (using Shapley values [16,21]) which HC features drive a specific firm's business outcomes and which do not. It is encouraging that two of the authors19 have applied this work to practice in a framework titled "Governance and Guidance for Growth through Human Capability" (G3HC).20

19 Dave Ulrich and Norm Smallwood of the RBL Group.
20 https://www.g3humancapability.com/.

References

1. Caramanis, C., Lennox, C.: Audit effort and earnings management. J. Account. Econ. 45(1), 116–138 (2008)
2. Chicco, D., Jurman, G.: The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21(1), 6 (2020)
3. Combs, J., Liu, Y., Hall, A., Ketchen, D.: How much do high-performance work practices matter? A meta-analysis of their effects on organizational performance. Personnel Psychol. 59(3), 501–528 (2006)
4. Crook, T.R., Todd, S.Y., Combs, J.G., Woehr, D.J., Ketchen, D.J.: Does human capital matter? A meta-analysis of the relationship between human capital and firm performance. J. Appl. Psychol. 96(3), 443–456 (2011)
5. Cukier, K., Mayer-Schönberger, V., de Véricourt, F.: Framers: Human Advantage in an Age of Technology and Turmoil. Dutton (2021)
6. Das, S.R., Donini, M., Zafar, M.B., He, J., Kenthapadi, K.: FinLex: an effective use of word embeddings for financial lexicon generation. J. Finance Data Sci. 8, 1–11 (2022)
7. Degeorge, F., Patel, J., Zeckhauser, R.: Earnings management to exceed thresholds. J. Business 72(1), 1–33 (1999)
8. Gorton, G.B., Grennan, J., Zentefis, A.K.: Corporate culture. Ann. Rev. Finan. Econ. 14(1), 535–561 (2022). https://doi.org/10.1146/annurev-financial-092321-124541


9. Graham, J.R., Grennan, J., Harvey, C.R., Rajgopal, S.: Corporate culture: evidence from the field. J. Financ. Econ. 146(2), 552–593 (2022)
10. Guiso, L., Sapienza, P., Zingales, L.: Corporate culture, societal culture, and institutions. Amer. Econ. Rev. 105(5), 336–339 (2015)
11. Huselid, M.A.: The impact of human resource management practices on turnover, productivity, and corporate financial performance. Acad. Manag. J. 38(3), 635–672 (1995)
12. Huselid, M.A.: The science and practice of workforce analytics: introduction to the HRM special issue. Human Res. Manag. 57(3), 679–684 (2018)
13. Jiang, K., Lepak, D.P., Hu, J., Baer, J.C.: How does human resource management influence organizational outcomes? A meta-analytic investigation of mediating mechanisms. Acad. Manag. J. 55(6), 1264–1294 (2012)
14. Kaldor, N.: Marginal productivity and the macro-economic theories of distribution: comment on Samuelson and Modigliani. Rev. Econ. Stud. 33(4), 309–319 (1966)
15. Li, K., Mai, F., Shen, R., Yan, X.: Measuring corporate culture using machine learning. Rev. Finan. Studies 34(7), 3265–3315 (2021)
16. Lundberg, S., Lee, S.-I.: A unified approach to interpreting model predictions (2017). arXiv:1705.07874 [cs, stat]
17. McMurrer, D., Bassi, L.: Human Resources Gets Its ISO Approval (2021)
18. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceedings (2013)
19. Popadak, J.: Balancing governance and culture to create sustainable firm value. In: Governance Studies, The Initiative on 21st Century Capitalism, No. 27, pp. 1–13. Brookings Institution (2016)
20. Schneider, B., Yost, A.B., Kropp, A., Kind, C., Lam, H.: Workforce engagement: what it is, what drives it, and why it matters for organizational performance. J. Organ. Behav. 39(4), 462–480 (2018)
21. Shapley, L.S.: A value for n-person games. RAND Corporation (1952)
22. Smallwood, N., Ulrich, D.: Capitalizing on capabilities. Harvard Business Review (2004)
23. Storey, J., Wright, P., Ulrich, D. (eds.): The Routledge Companion to Strategic Human Resource Management. Routledge (2008)
24. Tambe, P., Hitt, L., Rock, D., Brynjolfsson, E.: Digital capital and superstar firms. Working Paper 28285, National Bureau of Economic Research (2020)
25. Ulrich, D.: The Leadership Capital Index: Realizing the Market Value of Leadership, 1st edn. Berrett-Koehler Publishers (2015)
26. Ulrich, D., Brockbank, W.: The HR Value Proposition, 1st edn. Harvard Business Review Press (2005)

The Impact of Data Ingestion Layer in an Improved Lambda Architecture

Miguel Landry Foko Sindjoung1(B), Ernest Basile Fotseu Fotseu2, Mthulisi Velempini1, Bernard Fotsing Talla3, and Alain Bertrand Bomgni (PI)2

1 Department of Computer Science, University of Limpopo, Mankweng, South Africa
{miguel.fokosindjoung,mthulisi.velempini}@ul.ac.za
2 Faculty of Science, University of Dschang, Dschang, Cameroon
[email protected]
3 Fotso Victor University Institute of Technology, University of Dschang, Dschang, Cameroon
[email protected]

Abstract. The explosion of connected objects and increasing Internet flows generate large amounts of varied data, often at speeds that are difficult for traditional data analysis systems to handle. Analyzing the generated data in both real-time and batch modes therefore becomes necessary. To address this challenge, the Lambda architecture has been proposed in the literature with a data ingestion layer for data collection, filtering, transformation and transfer. Despite the diversity of data and its high generation speed, the Lambda architecture must enable companies to process data efficiently. However, the Lambda architecture presents some difficulties, particularly in its implementation process, the separation of processing, and data transfer and synchronization. In this work, we propose a variant of the Lambda architecture that decouples the data ingestion layer from the processing layer to facilitate its implementation. The simulation results show that there is value in implementing the proposed architecture.

Keywords: Batch processing · Big data architecture · Data ingestion · Lambda architecture · Real-time processing

1 Introduction

Due to the increasing number of connected devices, the volume of generated data has increased in recent years. With the emergence of big data, traditional information systems techniques are no longer effective, and big data analytics has been proposed as the solution to the big data management problem [4]. Big data can be characterized in five dimensions (5V) [2]: Volume, Velocity, Veracity, Variety and Value. Multiple architectures have been proposed in the literature to analyse big data, among them the Lambda, Kappa and SMACK architectures [14]. In this paper, we focus on the Lambda architecture: a design pattern that combines big data real-time and batch processing in a single framework. Unfortunately, in its basic version, the Lambda architecture does not address how data is managed and moved into the framework. The authors in [9] proposed an improved version of the Lambda architecture that integrates a data ingestion layer; unfortunately, they did not evaluate the impact of the added layer on big data processing. Being part of the data processing pipeline, the ingestion layer can significantly impact both batch and real-time processing, so it is important to evaluate its role. In this paper, we propose a variant of the Lambda architecture that decouples data ingestion from the data processing layer to facilitate its implementation. We then demonstrate which data ingestion tools are suitable depending on whether real-time or batch processing is considered. The rest of this paper is organised as follows: we present a state of the art on the Lambda architecture in Sect. 2; we present our contribution in Sect. 3; we present the results of our simulations in Sect. 4; and we conclude the paper in Sect. 5.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 325–333, 2024. https://doi.org/10.1007/978-3-031-47715-7_22

2 State of the Art

Multiple works concerning the Lambda architecture have been proposed in the literature; the basic Lambda architecture was proposed by Marz et al. [10]. The architecture guarantees the fault tolerance and robustness of big data infrastructures. It integrates real-time and batch processing and ensures that all aspects of data analysis are considered, which prevents the corruption of data. The basic architecture has three layers: (1) the batch layer, responsible for batch processing of stored data; (2) the speed layer, responsible for real-time processing of incoming data to support fast decision-making when necessary; and (3) the output layer, which stores and disseminates the real-time and batch processing results and presents them in views accessible to users. This basic version is subject to multiple limitations: duplication of business logic, which may result in coding overhead due to extensive processing; reprocessing of each batch cycle, which is not beneficial in some scenarios; the lack of layer separation between the business logic of real-time or batch processing and the data ingestion layer; and the systematic duplication of data sent to the real-time and batch layers when data is generated. Several authors have worked to improve the basic Lambda architecture to facilitate its implementation and improve processing speed, including the works presented in [1,5–9,11–13,15]. The main limitations observed in these works relate to the difficulty of implementing the Lambda architecture in some cases, and to synchronisation and data processing in others. The limitation common to all of them is the systematic duplication of data for the real-time and batch-processing layers. In our solution, we address these limitations.
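The three layers above can be sketched as a toy skeleton (illustrative only; the class, method and field names are ours, not the paper's):

```python
class LambdaToy:
    """Toy skeleton of the basic Lambda architecture's three layers."""

    def __init__(self):
        self.master_dataset = []   # batch layer: immutable, append-only store
        self.realtime_view = {}    # speed layer: low-latency incremental view
        self.batch_view = {}       # output layer: views rebuilt from the master dataset

    def new_data(self, key, value):
        # every record reaches the batch layer; the speed layer updates incrementally
        self.master_dataset.append((key, value))
        self.realtime_view[key] = value

    def recompute_batch_views(self):
        # the batch layer periodically recomputes complete views from all stored data
        self.batch_view = dict(self.master_dataset)

    def query(self, key):
        # output layer: serve the batch view, falling back to the real-time view
        return self.batch_view.get(key, self.realtime_view.get(key))
```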

3 An Improved Lambda Architecture

The basic Lambda architecture presents some difficulties in its implementation process, in particular due to the lack of separation between processing and the data ingestion layer. Decoupling data ingestion enables more efficient processing. In this section, we present an improved version of the Lambda architecture that decouples data ingestion in order to speed up data processing according to the specificity of the incoming data. We then show the impact of the decoupled layer with multiple approaches in Sect. 4.

Fig. 1. A new variant of the lambda architecture

The proposed architecture, presented in Fig. 1, works as follows. When data is generated (step (1)), the ingestion layer receives and processes it; depending on the case, this can be classification by source, pre-processing, or duplication of the data. If it is real-time data, the ingestion layer transmits it to the real-time layer (step (2)) and duplicates it to the batch layer (step (3)); if the data is not for real-time processing, it is simply sent to the batch layer. Depending on the processing implemented (batch analysis defined or not), the batch layer stores all the data it receives in a distributed file system or a distributed database, which allows for further processing.

In step (4), after the batch layer processing, the results can be sent to the real-time execution layer to refine the results obtained during real-time processing. During real-time execution, the obtained results may be incorrect or incomplete, either because the data generation speed was very high at a given time or because certain computations were inconsistent due to a lack of information. It is therefore important to replace or complete the results obtained in real-time execution with those obtained in batch processing, because batch processing is always carried out to completion on previously saved data (using highly reliable procedures).

Still in step (4), another process can be implemented when a quasi-real-time system is needed that requires significant computing capacity and a reduced execution time, with simultaneous use of both the real-time and batch layers. The ingestion layer is then responsible for sending a defined part of the data to the real-time layer and all the data to the batch layer. The real-time layer performs its processing and provides its results to the batch layer; the batch layer does not recompute data already processed by the real-time layer and simply uses the results obtained through link (4), combining them with the results obtained on the other data to complete missing data where necessary and create a near-real-time view.

The batch layer performs processing on previously saved data (step (5)) in a distributed database or distributed file system, providing users with results in the form of a batch view that they can query. The real-time layer performs its processing in step (6) and, at the end, populates the real-time view.

We use three data ingestion approaches in our study: an initial ingestion approach corresponding to the basic Lambda architecture of Marz et al. [10]; an ingestion approach with pre-processing, which executes Algorithm 4.1 and is based on tools such as Kafka, Storm, the Hadoop Distributed File System (HDFS) and a Distributed DataBase (DDB); and an ingestion approach using classification by data source, which identifies data by its source. Algorithms 4.2 (for data classification) and 4.3 (for data ingestion) are used in the latter approach. Note that Algorithm 4.2 is based on the work of Erraissi et al. [3].
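The routing rule in steps (1)–(3) can be sketched in a few lines (illustrative; the record flag and queue names are our own, not from the paper):

```python
def ingest(record, realtime_queue, batch_store):
    """Route one record as the decoupled ingestion layer does:
    real-time data goes to the speed layer and is duplicated to the
    batch layer; all other data goes to the batch layer only."""
    if record.get("realtime"):
        realtime_queue.append(record)  # step (2): transmit to the real-time layer
    batch_store.append(record)         # step (3): every record reaches the batch layer

speed, batch = [], []
for rec in ({"id": 1, "realtime": True}, {"id": 2, "realtime": False}):
    ingest(rec, speed, batch)
# speed now holds only record 1; batch holds both records
```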

4 Simulation Results

The simulations were run in a network environment with two nodes: a slave node with a Core i5 and 8 GB of RAM running a 64-bit Ubuntu 18.04 operating system, and a master node with a Core i3 and 4 GB of RAM on which we installed a 64-bit Ubuntu 16.04 operating system. The Hadoop and Spark cluster environments consist of a NameNode installed on the master node and a DataNode installed on the slave node. The Storm real-time execution environment consists of a supervisor and a Nimbus on which we configured a Storm topology of four bolts and three spouts; the spouts are responsible for performing tuple, filter and aggregation operations, and the bolts are responsible for processing and generating the results. We used a distributed database, a Cassandra datacenter consisting of one node. Batch analysis was done via Hadoop-MapReduce and Hadoop-Spark.

4.1 Data Ingestion Evaluation Process

The impact of the data ingestion layer is evaluated using Kafka and Flume, since both allow real-time and batch ingestion. The obtained results show that Kafka is faster than Flume for combined ingestion: Kafka takes an average of 175.692 s to perform batch and real-time ingestion on a dataset of about 400,000 records, while Flume takes an average of 217.34 s to perform the ingestion process on the same dataset. However, for batch-only ingestion, Flume performs better than Kafka thanks to its native integration with distributed file systems such as HDFS, unlike Kafka, which after ingestion needs a subscriber (MapReduce or Spark) to distribute the data in the distributed file system, which takes significant time and depends on the volume of data.

Algorithm 4.1: Data Ingestion for Real-Time and Batch Processing with Kafka, Storm, HDFS or DDB for Pre-Analysis

Input: DataSource
Output: Recording of processed data in HDFS/DDB
Begin
  /* Initialisation: ServerConfiguration, Communication, Subject */
  Save Kafka stream in HDFS/DDB
  For every batch window interval for DataSource Do
    Connect to Kafka stream
    If data requires cleaning Then
      Filter and transform data
    End If
    Pre-analyse the data
    If data is of real-time processing interest Then
      Queue consumption with Kafka
      Kafka saves data in the Storm queue
      Storm saves the result in the DDB
    End If
  End For
  Create Kafka stream object in broker
  Print the result in a view
End

4.2 Impact on the Real-Time Processing Process

The real-time execution process was launched on data obtained from the Twitter APIs, the JCDecaux APIs, and log files of the activities carried out on a given website. The topology we implemented is presented in Fig. 2. The generated results show that for a dataset with about 600,000 records, the ingestion layer takes 263.538 s to ingest the data and the real-time processing layer with Storm takes 2269.563 s to retrieve the value in the ingested data, for a total of 2533.101 s to complete the streaming processing. The trends observed in the cases of data duplication (real-time and batch) and data separation (real-time processing) are represented in Fig. 3a and b. In both cases, the ingestion time hardly changes because the ingestion process is extremely fast. For data duplication on a set of 400,000 records, for example, processing takes 2026.4808 s, against 1688.734 s for classified ingestion with Storm.
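Per-stage timings like those above can be collected with a simple wall-clock wrapper (a sketch; the paper does not describe its measurement code):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Total streaming time is the sum of per-stage times, mirroring
# 263.538 s (ingestion) + 2269.563 s (processing) = 2533.101 s above.
total, elapsed = timed(sum, range(1000))
```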


Algorithm 4.2: Data Classification for Data Ingestion

Input: inputData
Output: List of data sources
Begin
  /* Initialization */
  StructuredData = [RelationalDB, XML]
  SemiStructuredData = [SMS, Email]
  UnStructuredData = [Email, Multimedia, GPS Data, Office File]
  typeOfSource = [StreamData, BatchData]
  DataSource = [inputData, typeOfSource]
  If inputData == StructuredData OR inputData == SemiStructuredData Then
    For every element of StructuredData OR SemiStructuredData Do
      DataSource ← SetFormat(Json, inputData, typeOfSource)
    End For
  Else If inputData == UnStructuredData Then
    For every element of UnStructuredData Do
      DataSource ← SetDefaultFormat(Json, inputData, typeOfSource)
    End For
  End If
  Return defined data sources
End
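A rough Python rendering of Algorithm 4.2's routing decision may clarify it (the category lists follow the paper, where Email appears under both semi-structured and unstructured; here the semi-structured check wins. Function and return names are illustrative):

```python
def classify_source(input_kind):
    """Route an input type to a formatting step, as in Algorithm 4.2."""
    structured = {"RelationalDB", "XML"}
    semi_structured = {"SMS", "Email"}
    unstructured = {"Multimedia", "GPS Data", "Office File"}
    if input_kind in structured or input_kind in semi_structured:
        return "SetFormat"          # known schema: apply a specific JSON format
    if input_kind in unstructured:
        return "SetDefaultFormat"   # fall back to a default JSON envelope
    return "unknown"
```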

We also performed simulations comparing Apache Spark and Storm. Spark Streaming takes 3445.0173 s to process a dataset of 600,000 records, whereas Storm takes 2533.101 s, a reduction of about 26%. We therefore conclude that Apache Storm is the suitable real-time processing tool for our version of the Lambda architecture. Figure 4a and b show the evolution of the average overall processing time for Apache Spark and Storm and confirm this observation.

4.3 Impact on the Batch Processing

We performed batch processing on data obtained from tweets, bike rentals and log files. With a dataset of about 600,000 records, Spark takes 699.738 s, Hadoop takes 804.6987 s and Spark+HBase takes 615.76944 s. We therefore conclude that the combination of Spark+HBase is better suited for batch processing in our proposed architecture.


Algorithm 4.3: Data Ingestion of Real-Time and Batch-Time with Kafka, Storm, HDFS or DDB for Classification

Input: StreamingDataSource
Output: Recording of processed data in HDFS/DDB
Begin
  /* Initialisation: ServerConfiguration, defined data sources */
  Create Kafka stream object in broker
  Save Kafka stream in HDFS/DDB
  For every batch window interval for StreamingDataSource Do
    Connect to Kafka stream
    If the data requires cleaning Then
      Filter and transform data
    End If
    Queue consumption with Kafka
    Kafka saves data in the Storm queue
    Storm saves the result in the DDB
  End For
  Save results in HDFS or DDB
End

5 Conclusion

In this paper, we proposed a version of the Lambda architecture, a big data analysis architecture. The main novelty of our proposal is the integration of a data ingestion layer in the basic Lambda architecture and its evaluation. The implementation shows satisfactory results. For real-time execution, we observed that tools like Apache Storm offer highly optimized libraries and algorithms that facilitate continuous analysis in a short time and ensure reliable results. For batch processing, tools like Spark and HBase deliver optimized results thanks to their cluster-wide execution speed. Nevertheless, a number of difficulties were encountered, particularly in installing the tools and establishing communication between them. In this work, we first defined the set of previously unidentifiable data sources and classified the resulting data according to these sources. Secondly, we investigated the data pre-processing process in depth in order to reduce its complexity and improve the processing time of the real-time layer. Thirdly, we sought to reduce the observed latency of the processing time in the batch layer. Lastly, we established backward compatibility between the Lambda architecture and other existing big data architectures.


Fig. 2. Topology implemented for real-time execution.

Fig. 3. Evaluation with Storm.

Fig. 4. Evaluation between Storm and Spark.


References

1. Batyuk, A., Voityshyn, V.: Apache Storm based on topology for real-time processing of streaming data from social networks. In: 2016 IEEE First International Conference on Data Stream Mining and Processing (DSMP), pp. 345–349 (2016)
2. Bremme, L.: Définition : Qu'est-ce que le big data (2016)
3. Erraissi, A., Belangour, A.: Data sources and ingestion big data layers: meta-modeling of key concepts and features. Int. J. Eng. Technol. (UAE) 7, 3607–3612 (2018)
4. Hanif, M., Yoon, H., Jang, S., Lee, C.: An adaptive SLA-based data flow mechanism for stream processing engines. In: 2017 International Conference on Information and Communication Technology Convergence (ICTC), pp. 81–86 (2017)
5. Ji, C., Shao, Q., Sun, J., Liu, S., Li, P., Wu, L., Yang, C.: Device data ingestion for industrial big data platforms with a case study. Sensors 16 (2016)
6. Jindal, A., Quiané-Ruiz, J.-A., Madden, S.: Ingestbase: a declarative data ingestion system (2017)
7. Kim, H., Madhvanath, S., Sun, T.: Hybrid active learning for non-stationary streaming data with asynchronous labeling. In: 2015 IEEE International Conference on Big Data (Big Data), pp. 287–292 (2015)
8. Matacuta, A., Popa, C.: Big data analytics: analysis of features and performance of big data ingestion tools. Inf. Econ. 22, 25–34 (2018)
9. Foko Sindjoung, M.L., Bomgni, A.B., Tagne Fute, E., Chendjou, J.: An improved version of the Lambda architecture. In: Conférence Africaine sur la Recherche en Informatique et en Mathématiques Appliquées (CARI'18), pp. 236–244 (2018)
10. Marz, N., Warren, J.: Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning (2015)
11. Pal, G., Li, G., Atkinson, K.: Big data real-time clickstream data ingestion paradigm for e-commerce analytics. In: 2018 4th International Conference for Convergence in Technology (I2CT), pp. 1–5 (2018)
12. Pal, G., Li, G., Atkinson, K.: Big data real time ingestion and machine learning. In: 2018 IEEE Second International Conference on Data Stream Mining and Processing (DSMP), pp. 25–31 (2018)
13. Pal, G., Li, G., Atkinson, K.: Multi-agent big-data Lambda architecture model for e-commerce analytics. Data 3(4) (2018)
14. Refes, M.: Architecture lambda, kappa ou datalake : comment les exploiter ? CYRES, October (2018)
15. Suthakar, U., Magnoni, L., Smith, D.R., Khan, A.: Optimised lambda architecture for monitoring scientific infrastructure. IEEE Trans. Parallel Distrib. Syst. 32(6), 1395–1408 (2021)

Using Clustering Ensembles and Heuristic Search to Estimate the Number of Clusters in Datasets

Afees Adegoke Odebode(B), Mahir Arzoky, Allan Tucker, Ashley Mann, Faisal Maramazi, and Stephen Swift

Brunel University, London, UK
{afees.odebode,mahir.arzoky,allan.tucker,ashley.mann,faisal.maramazi,stephen.swift}@brunel.ac.uk

Abstract. This paper presents a heuristic search-based approach for estimating the number of clusters within a dataset based on an ensemble of clustering methods. We combine a number of clustering results into near-optimal subsets using two distinct approaches. Firstly, a Gray-code-based implementation evaluates the quality of all possible subsets; its quality and consistency were excellent, but the search is exhaustive, leading to exponential run-time as the volume and dimension of the dataset increase. For this reason, a Random Mutation Hill Climbing-based alternative is introduced, which evaluates the subsets in small increments, mimicking the Gray code implementation, with a minimum of ninety-two per cent accuracy (mean 96%) and a significant gain in speed (linear as opposed to exponential run-time). Our algorithms are tested on real-world and benchmark datasets, and their performance is compared to other state-of-the-art estimators with promising results. Additionally, a heuristic is presented to guide when to use the exhaustive or the heuristic search.

Keywords: Random mutation hill climbing · Ensembles · Convergence point · Estimating the number of clusters

1 Introduction

This paper presents a heuristic search-based approach for estimating the number of clusters within a dataset based on an ensemble of clustering methods. A sizeable number of datasets available for analysis today are unlabeled—for example, data from social media sources such as billions of Facebook posts and text messages or the ubiquitous data generated by the healthcare industry from automatic record transactions of everyday life [1]. The question is, can we extract some useful information from unlabeled data? Getting labels for this data in some cases is often difficult, sometimes expensive or may be infeasible. Grouping similar objects is the most intuitive way to extract information from this unlabeled data. The grouping of objects into members of mutually exclusive c The Author(s), under exclusive license to Springer Nature Switzerland AG 2024  K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 334–353, 2024. https://doi.org/10.1007/978-3-031-47715-7_23

Estimating Dataset Clusters with Clustering Ensembles and Heuristic Search

subsets is called data clustering, and it has many practical applications. Many heuristic algorithms are available for this purpose [2,10,23,24]. Analysing clusters to determine patterns and behaviours in datasets is essential in exploratory data analysis and an important step in data pre-processing. Typical methods for clustering include k-means, hierarchical clustering and Partitioning Around Medoids (PAM) [13]. One primary requirement of many popular and effective methods is a priori knowledge of how many clusters the data should be arranged into. Getting this wrong, or even slightly inaccurate, may result in a completely different clustering arrangement from that in the dataset. This paper presents a highly accurate method for determining this vital dataset feature.

Furthermore, many clustering algorithms are either dataset-dependent or allocate variables based on some random point, which may introduce bias into the search, and varying the parameters of the same method can produce different results for the same dataset. Choosing between the different clustering techniques and setting the correct parameters can be difficult for any single clustering method. Thus, in recent times, there has been growing interest in the use of ensemble techniques for data clustering [21], which, rather than relying on a single method, improve accuracy by using the agreement among several sets of representative solutions from different techniques. Recently, ensemble techniques [21,22] have been applied to estimate the number of clusters in datasets, improving clustering performance [19]. This work extends earlier research on estimating the number of clusters in datasets, where previous findings confirmed that the ensemble technique better estimates the number of clusters [16]. We are motivated to improve the time taken by the exhaustive search of the solution space in that study [16] by building a heuristic search on the clustering ensemble.
Unlike the application of an ensemble in classification, a clustering-based ensemble requires a consensus function to combine the ensemble into representative solutions. The approach presented in this paper addresses issues related to sub-setting the ensembles for maximum diversity, an essential requirement for achieving accuracy. Sub-setting in this context refers to creating a representative solution as a subgroup of the clustering ensemble. In this paper, we combine clustering ensembles and a heuristic search algorithm to estimate the number of clusters in datasets. The novelty and contributions of our approach are described below:

– A heuristic-based ensemble that consistently produces the same quality as the exhaustive search.
– A mathematical framework for the fitness function and a regression model that estimates the maximum convergence point for the heuristic search, vastly reducing the already improved run-time.
– Lastly, choosing between a heuristic and an exhaustive search can be challenging, considering the trade-off between accuracy, consistency and speed. Through the analysis presented in this paper, we establish the point at which

A. A. Odebode et al.

it is optimal to use the exhaustive search and when to use the heuristic approach.

To our knowledge, this is the first study that combines a clustering ensemble with a heuristic search to estimate the number of clusters in datasets. The rest of this paper is organised as follows: Sect. 2 reviews standard methods in cluster estimation used for comparison against our ensemble. Section 3 introduces the Ensemble Framework; Sect. 4 details the exhaustive search and the heuristic approach; Sect. 5 explains the experimental setup and the datasets. The results of the experiments are in Sect. 6. We conclude and make recommendations for future work in Sect. 7.

2 Related Work

Clustering can be described as a grouping problem that partitions a given set of data items into minimal disjoint groups such that items in the same group are similar to each other and different from items in the other groups [7]. Identifying clusters is a fundamental step in data pre-processing, and several methods have been developed for estimating the number of clusters in a dataset.

2.1 Methods for Estimating the Number of Clusters

In this paper, we compared our results against two categories of approaches, the Clustering-Based Estimators and the Classical Estimators, from which we selected the top eight methods. The result of our comparison is presented in Sect. 6. The Clustering-Based Estimators identify the number of clusters as part of the clustering process, while the Classical Estimators are built primarily to estimate the number of clusters in datasets. For example, x-means [17] uses the Bayesian Information Criterion [12] to estimate the number of clusters: the k (the number of clusters) with the best Bayesian Information Criterion score is the estimated number of clusters. On the other hand, the G-means algorithm grows the value of k, starting with a small number of centres, and uses a statistical test of whether the data is from a Gaussian distribution [9] to estimate the number of clusters in a dataset. We also considered the Expectation-Maximisation (EM) [15] algorithm, which constructs statistical models of the data and accommodates categorical and continuous data fields to estimate the number of clusters. The Classical Estimators were a set of methods obtained from R's NbClust package [4], which implements the most popular techniques. The NbClust package implements thirty indices for estimating the number of clusters. It proposes the best clustering scheme by varying combinations of the number of clusters, distance measures, and clustering methods. The best of the thirty methods were selected for comparison with our techniques. The methods selected, including their modes of determining the number of clusters in datasets, are summarised in Table 1.
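As a concrete illustration of the x-means-style idea of scoring every candidate k with an information criterion, the sketch below applies a common BIC-style approximation for k-means fits (a fit term based on the within-cluster SSE plus a penalty for the k·d free centre parameters). Both the formula and the SSE values are illustrative assumptions, not the exact scoring used by x-means or NbClust.

```python
import math

def kmeans_bic(sse, n, k, d):
    """BIC-style score for a k-means fit with n points in d dimensions:
    a fit term based on the within-cluster SSE plus a complexity penalty
    for the k*d free centre parameters (a textbook approximation, not the
    exact x-means formulation)."""
    return n * math.log(sse / n) + k * d * math.log(n)

def best_k(sse_by_k, n, d):
    """Pick the k with the lowest score: a sharp SSE drop at the true k
    outweighs the penalty, after which the penalty dominates."""
    return min(sse_by_k, key=lambda k: kmeans_bic(sse_by_k[k], n, k, d))

# Hypothetical SSE curve that flattens after k = 3:
sse_by_k = {2: 500.0, 3: 100.0, 4: 98.0, 5: 97.0}
```

With these illustrative numbers, `best_k(sse_by_k, 300, 2)` selects k = 3: the large SSE drop from k = 2 to k = 3 outweighs the penalty, while the small further drops do not.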


Fig. 1. A graph showing a dataset's optimal number of clusters

Table 1. Methods and mode of determining the number of clusters

Methods              Optimal number of clusters
PtBiserial           Maximum value of the index
Gap                  Smallest nc such that critical value ≥ 0
Calinski Index (CH)  Maximum value of the index
Ball                 Maximum value difference between hierarchy levels of the index
Krzanowski (KL)      Maximum value of the index

2.2 How Does a Cluster Estimator Work?

A typical estimator seeks to optimise an objective function, for example, the sum of the squared distances between each point in a dataset and its closest or assigned centre, as shown in Fig. 1. The figure describes the "peak of a curve" based on different values of k (the number of clusters) from 2 … √n (n = number of objects to be clustered); see Sect. 3.3 for the motivation for √n. The peak is a cut-off point, commonly used in heuristic and mathematical optimisation, that determines where adding another cluster merely splits an identified cluster into sub-clusters rather than revealing any appreciable difference between clusters. The graph's peak, as shown in Fig. 1, corresponds to the optimal value of the objective function; adding another cluster beyond that point does not model the dataset better and may instead result in over-fitting.
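The sweep over k described above can be sketched with a plain Lloyd's-algorithm k-means; the curve of objective values over k = 2 … √n is what Fig. 1 plots. This is an illustrative sketch under our own function names, not the paper's code.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: returns labels and the total
    within-cluster sum of squared distances (the objective in Fig. 1)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(0)
    sse = ((X - centres[labels]) ** 2).sum()
    return labels, float(sse)

def objective_curve(X):
    """Evaluate the objective for k = 2 .. sqrt(n), as in the text;
    the elbow of this curve indicates the estimated number of clusters."""
    kmax = int(np.sqrt(len(X)))
    return {k: kmeans(X, k)[1] for k in range(2, kmax + 1)}
```

On a dataset with a few well-separated groups, the curve drops sharply until k reaches the true number of clusters and flattens afterwards, producing the peak/elbow shape described above.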

3 The Ensemble Framework

This section introduces the general framework of the ensemble. We highlight the main features of our ensemble approach, shown in Fig. 2, consisting of four key stages:

(i) Generation of the base clusterings
(ii) Construction of the agreement matrix from the clusterings
(iii) Creation of subsets from the agreement matrix
(iv) Selection of the best subset using the objective function

3.1 Generation of the Base Clustering

Base clustering generation is the first phase in the ensemble framework, and the aim is to generate the members used to build the ensembles. In addition, the members should differ from one another to achieve diversity. Diversity means that the different representative subsets will be captured as subgroups in the ensembles; precisely, the ensemble must reflect most variants of the clusterings through a good combination of its members as subsets. Earlier research efforts suggested several options in the clustering context for achieving diversity when generating subsets [19], which include the following. First, objects in the dataset can be represented using selected features; for example, varying the pixels or the image coordinates in an image dataset will generate different representations of the dataset. Second, a random selection of the initial number of clusters using iterative algorithms such as k-means can generate different representations of the dataset for different values of k, and the results can be combined as ensembles. Lastly, we can combine outputs from different algorithms to create different clustering solutions. For example, combining the outputs of k-means, DB-Scan and a graph partitioning algorithm can create an ensemble.

Fig. 2. Ensemble framework
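The k-means-based diversity option described above, running one iterative algorithm with varying k and varying random starts and pooling the resulting partitions, can be sketched as follows (an illustrative sketch; the function names are our own):

```python
import numpy as np

def base_clustering(X, k, seed):
    """One cheap base clustering: a few Lloyd iterations from a random
    start. Different (k, seed) pairs yield the diverse ensemble members."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(10):
        labels = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(0)
    return labels

def build_ensemble(X, ks=range(2, 8), seeds=(0, 1)):
    """Pool the partitions produced for every (k, seed) combination."""
    return [base_clustering(X, k, s) for k in ks for s in seeds]
```

Each element of the returned list is one partition of the same objects; the pooled list is the raw material from which the agreement matrix of the next stage is built.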


Our approach combines these techniques, and we introduce two new ways of sub-setting ensembles. The first is Gray code sub-setting, which exploits the single-digit difference between successive codes to create related subsets. Another feature of the Gray code is that it is cyclic, meaning digits within the Gray code repeat regularly between codes. For example, in a 10-bit Gray code, the values 4, 5 and 6 correspond to the following subsets:

0000000110
0000000111
0000000101

A value of one in the Gray code means the partition is included in the subset, while a zero means it is excluded. The rationale for choosing this approach is premised on two main reasons: the ease of creating the different representations of the subsets and the possibility of establishing a relation between the subsets for fitness evaluation. We build on this to create a heuristic-based version that uses a small change operator built on bit-flipping to generate subsets that map directly to the Gray code implementation.
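The Gray-code subsetting just described can be sketched as follows; `gray(i)` is the standard reflected binary code, and the bit convention (bit 0 = first partition) is our own illustrative choice:

```python
def gray(i):
    """The i-th reflected binary Gray code word; successive words
    differ in exactly one bit."""
    return i ^ (i >> 1)

def subset_from_code(code, r):
    """Indices of the partitions included in an r-bit code: a one-bit
    means include, a zero-bit means exclude (bit 0 = first partition,
    an illustrative convention)."""
    return [p for p in range(r) if (code >> p) & 1]
```

For i = 4, 5, 6 this reproduces the 10-bit codes 0000000110, 0000000111 and 0000000101 quoted above, and the single-bit difference between successive codes is exactly what later allows each subset's quality to be updated from its predecessor's.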

3.2 Construction of the Agreement Matrix

From the generated base clusterings and the subsets, we need a representation that determines the level of agreement between members of the subsets in order to measure accuracy and consistency. Using the Weighted Kappa (w) metric, we calculate the agreement strength between adjacent pairs of partitions based on the number of times objects co-occur in the same cluster. The Kappa metric is equivalent to the Adjusted Rand Index [11]. It is an external validation index used to measure accuracy in clustering ensembles; another standard index is the Normalised Mutual Information [19]. In this case, the Kappa metric measures the agreement between two raters, the clustering algorithms. Currently, we use thirty different clustering algorithms to create a database of clustering representations for our datasets. Agreement between the clusterings is measured using the Weighted Kappa (w) metric, which allows weights to be assigned to disagreements between the two raters and grades the agreement strength from poor to very good: κ ≤ 0.2 is poor, κ ≤ 0.4 is fair, κ ≤ 0.6 is moderate, κ ≤ 0.8 is good, and κ ≤ 1.0 is very good [20].
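Building the pairwise agreement matrix can be sketched as below. The paper scores agreement with the Weighted Kappa (equivalent to the ARI); here a simple pair-counting agreement (the Rand-index numerator) stands in for it, so the function is an illustrative substitute rather than the authors' metric.

```python
import numpy as np
from itertools import combinations

def pair_agreement(a, b):
    """Fraction of object pairs treated the same way by two partitions:
    co-clustered in both or separated in both (a Rand-style stand-in for
    the Weighted Kappa agreement used in the paper)."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = np.equal.outer(a, a)
    same_b = np.equal.outer(b, b)
    iu = np.triu_indices(len(a), k=1)
    return float((same_a[iu] == same_b[iu]).mean())

def agreement_matrix(partitions):
    """Symmetric r x r matrix of pairwise agreements between the r
    base clusterings."""
    r = len(partitions)
    W = np.eye(r)
    for i, j in combinations(range(r), 2):
        W[i, j] = W[j, i] = pair_agreement(partitions[i], partitions[j])
    return W
```

Because agreement is computed from object pairs rather than raw labels, relabelling a partition (swapping cluster ids) does not change its score, which is the property any external validation index needs here.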

3.3 Sub-setting the Agreement Matrix

We represent a subset as a binary string with one position per input clustering method, where a zero in position i means the method is excluded and a one means it is included. This representation allows for the maximum possible subset allocations. In generating the subsets, we have a set of input clustering arrangements ranging from 2 … √n. The square root of n is the commonly suggested maximum number of clusters


when the number of clusters in a dataset is unknown [14], where n is the number of objects being clustered. We ranked forty different clustering algorithms or variants and selected the top thirty as our input clusterings (r). The selection was based on the algorithms' performance against the gold standard (the expected number of clusters) reported on the web pages from which the datasets were downloaded.

3.4 Experimental Data

We started with two hundred and eighteen (218) datasets downloaded from the following sources: the UCI Machine Learning Repository, the University of Eastern Finland clustering basic benchmark, and the Outlier Detection Datasets from Stony Brook University [6,8,18]. All are standard benchmark datasets with challenging clustering difficulties and domains. We used thirty clustering algorithms and a variety of parameter settings to construct a database of the clustering ensemble. We identified some issues in the initial analysis of the datasets. We could only use twenty-seven of the two hundred and eighteen datasets, because some failed to cluster, some had too few instances (fewer than a hundred), and some had large numbers of missing values. The datasets that passed these three initial criteria were used for our analysis. We present the datasets and the number of clustering arrangements reported for each in Table 2. The kmax column in the table is the maximum value of k, based on √n − 1; the expected number of clusters in a dataset is in the range 2 … √n, where n is the number of instances in the dataset. We evaluated the quality value for each k, and the index corresponding to the k with the highest quality is the number of clusters in the dataset.

3.5 Selection of the Best Subset

In this stage, the aim is to assess the quality of subsets using standard metrics. Evaluating the quality of subsets is a non-trivial and essential task, as there is no universally agreed standard for measuring what constitutes good-quality clusters. Many aspects need to be considered when evaluating the subsets or the clustering result, but in practice, the most common ones are accuracy and consistency. There are many external validation indexes for measuring accuracy; the most common ones used in clustering ensemble research are the Adjusted Rand Index (ARI) and the Normalised Mutual Information (NMI). For this research, we used the Kappa metric, equivalent to the ARI, built into our fitness function to identify the best subset of the clustering methods. For consistency, the accepted standard is the average performance over repeated runs with different experimental setup conditions, represented as a variance or standard deviation. In both the exhaustive search and the heuristic search approach, we evaluate the quality of each subset using the fitness function; the best subset is the one that optimises the value of the fitness function, and the corresponding index of the best subset is the number of clusters in the dataset. The quality of each subset is computed such that larger subsets are scored in parity with smaller subsets, using threshold values


Table 2. Dataset, attributes and number of clusters

SN  Datasets           #Clusters  Attributes  #Instances  kmax
1   Aml28              5          2           804         28
2   Atom               2          3           800         28
3   BezdekIris         3          4           150         12
4   Blobs              3          2           300         17
5   Cassini            3          2           1000        31
6   Compound           6          2           399         19
7   Curves1            2          2           1000        31
8   Gaussian-500       5          2           3000        54
9   Glass              6          9           214         14
10  Hepta              7          3           212         14
11  Longsquare         6          2           900         29
12  Lsun               3          2           400         19
13  Pearl              3          2           266         16
14  Pmf                5          3           649         25
15  Shapes             4          2           1000        31
16  Size1              4          2           1000        31
17  Size2              4          2           1000        31
18  Spherical-52       5          2           250         15
19  Square2            4          2           1000        31
20  Synthetic-Control  6          60          600         24
21  Tetra              4          3           400         19
22  Tetragonular-bee   9          15          236         15
23  ThreeMC            3          2           400         19
24  Triangle1          4          2           1000        31
25  Vehicle            4          18          846         29
26  Veronica           7          8           206         14
27  Zelnik3            3          2           266         16
    Total                                                 643

to maintain a balance in the search. The following section describes the fitness function used to evaluate each subset.

3.6 The Fitness Function

The quality of each subset is based on its fitness value. The quality of a subset is like a point in a two-dimensional fitness landscape where the altitude represents the fitness of each subset. The fitness function measures the estimated number of clusters using the sum of the agreements of the Weighted Kappa metric of adjacent inputs taken above a threshold value; the detail is shown in Eq. 1 as Q. A slight change in the subset results in a corresponding change in the fitness value; an increase in the fitness value indicates a better solution. The Gray code implementation exploits the relationship between subsequent subsets to evaluate the fitness of each subset. The Gray code version, referred to here as the Update quality Q̂, is faster given that it uses the quality value of the preceding subset in calculating the next quality value, owing to the single-digit difference between subsets. A summary of the fitness function before the update is shown in Eq. 1, and the derivation of the Gray code version is in Eq. 3. The subsets from the ensemble used as input to the fitness function vary across a range of clustering algorithms, and the quality of some subsets, expressed as their Weighted Kappa, can be poor; there is, therefore, the need to select a threshold value. The threshold value also normalises the fitness values. The average Weighted Kappa of the agreement matrix as the threshold (θ) has been shown [16] to give the best performance as a predictor of the number of clusters in a dataset. Similar research conducted by [3] justified the need for introducing thresholds. The central premise of the research is that there is a correlation between the input methods' average Weighted Kappa (w) and the gold standard (the number of clusters published for each dataset) [3]. Therefore, it was shown in [16] that the best subset of the potential inputs (in terms of the average w) is the best predictor of the number of clusters for a given k value (see Fig. 1). However, this average value cannot be used directly to select the best subset, since the single best w value from the agreement matrix would be the maximum (average). Hence the use of Q̂ (see Eq. 6), which is designed to trade off between subset size and w agreement. The average is then used (see Eq. 1) as per Fig. 1.

$$Q \;=\; \sum_{a=1}^{|s|-1} \sum_{b=a+1}^{|s|} \bigl[\, w(s(a), s(b)) - \theta \,\bigr] \tag{1}$$

$$Av \;=\; \frac{\displaystyle \sum_{a=1}^{|s|-1} \sum_{b=a+1}^{|s|} w(s(a), s(b))}{\;|s|(|s|-1)/2\;} \tag{2}$$

where $\hat{s} = \dfrac{|s|(|s|-1)}{2}$, so that

$$Q = \hat{s}\,Av - \hat{s}\,\theta = \hat{s}(Av - \theta), \qquad \frac{Q}{\hat{s}} + \theta = Av$$

$$\hat{Q} \;=\; \sum_{i=1}^{|s|} \sum_{j=1}^{|s|} \bigl[\, w(s_i, s_j) - \theta \,\bigr] \tag{3}$$

$$\hat{Q} \;=\; \sum_{i=1}^{|\hat{s}|-1} \sum_{j=1}^{|\hat{s}|-1} \bigl[\, w(s_i, s_j) - \theta \,\bigr] \;+\; 2 \sum_{j=1}^{|\hat{s}|-1} \bigl[\, w(s_j, x) - \theta \,\bigr] \tag{4, 5}$$

$$\hat{Q} \;=\; Q \pm 2 \sum_{i=1}^{|\hat{s}|-1} \bigl[\, w(s_i, x) - \theta \,\bigr] \tag{6}$$

Reducing the search for the best subset requires each subset to be linked to the previous subset for progressive evaluation of the quality values. One way to achieve this is to use a binary code representation (the Gray code), in which each pair of successive binary vectors has related fitness values (the Update quality function). However, applying it directly to the search for the best subset is still exhaustive. As the number of input algorithms r, the dimension and the volume of the input increase, it may not be feasible to apply it; hence the need for the heuristic approach described in the RMHC implementation of the algorithm in Sect. 4.
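The threshold fitness Q of Eq. 1 and the incremental update of Eq. 6 can be sketched as follows. The sketch works with unordered pairs, so the factor of 2 from the paper's double sum does not appear; the function names are our own.

```python
import numpy as np

def quality(subset, W, theta):
    """Eq. 1: summed above-threshold agreement over all unordered pairs
    in the subset (W is the symmetric agreement matrix)."""
    s = sorted(subset)
    return sum(W[a, b] - theta for i, a in enumerate(s) for b in s[i + 1:])

def quality_after_flip(Q, subset, x, W, theta):
    """Eq. 6 in spirit: when method x is flipped into or out of the subset,
    only the pairs involving x change, so Q is updated in O(|s|) instead of
    being recomputed over all O(|s|^2) pairs."""
    delta = sum(W[x, y] - theta for y in subset if y != x)
    return Q - delta if x in subset else Q + delta
```

The update is exact: adding (or removing) method x changes Q by precisely the sum of the above-threshold agreements between x and the current members, which is what makes the Gray-code walk over subsets cheap.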

4 Methods

This section presents the two methods and their algorithms: the exhaustive search and Random Mutation Hill Climbing (RMHC).

4.1 The Exhaustive Approach

The exhaustive approach in Algorithm 1 uses the subsets as the representative number of clusters in the dataset. The Gray code representation is unique in two ways: each successive code differs by a single digit, and it is cyclic (the digits occur in cycles and are repeated often) [5]. We create all 2^r possible subsets of the r input algorithms; for example, if r = 10, the number of subsets generated will be 1013, since we exclude the singleton and empty subsets. The quality of each subset is calculated from the previous one, and we use the average to create the reverse elbow-like curve in Fig. 1. The quality Q of each subset is determined, and the subset with maximum quality is the best; the corresponding index of the best subset is the number of clusters in the dataset. The rationale behind this implementation is that, for each dataset, there is a subset that maximises the quality relative to the threshold value of the Weighted Kappa, and this subset corresponds to the estimated number of clusters in the dataset. The following section explains the RMHC approach.


Algorithm 1 Gray Code Implementation of the Exhaustive Search
Require: m × m agreement matrix from the clustering algorithms; bestQ initialised to −∞
 1: for i = 0 to 2^m − 1 do
 2:   g = binary(i)              ▷ generate the binary Gray code
 3:   if nbits(g) > 1 then       ▷ test if the subset size > 1
 4:     s = subset(g)            ▷ create the subset
 5:     Q = 0; count = 0
 6:     for a = 0 to |s| − 1 do
 7:       for b = (a + 1) to |s| do
 8:         Q = Q + w(a, b) − θ  ▷ accumulate above-threshold agreement
 9:         count = count + 1
10:       end for
11:     end for
12:     Q = Q/count
13:     if Q > bestQ then
14:       bestSS = s             ▷ bestSS = best subset
15:       bestQ = Q              ▷ bestQ = best quality
16:     end if
17:   end if
18: end for
Ensure: Return the subset with the best quality
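A runnable sketch of Algorithm 1 follows. The enumeration order is simplified: the paper walks the subsets in Gray-code order so that each quality can be updated from the previous one, whereas this sketch recomputes each subset's average quality from scratch.

```python
import numpy as np

def exhaustive_best_subset(W, theta):
    """Enumerate every subset of the r inputs, skip empty and singleton
    subsets, and return the subset whose average above-threshold agreement
    is highest, together with that quality."""
    r = len(W)
    best_q, best_s = -np.inf, None
    for code in range(2 ** r):
        s = [p for p in range(r) if (code >> p) & 1]
        if len(s) < 2:
            continue
        pairs = [(a, b) for i, a in enumerate(s) for b in s[i + 1:]]
        q = sum(W[a, b] - theta for a, b in pairs) / len(pairs)
        if q > best_q:
            best_q, best_s = q, s
    return best_s, best_q
```

Because the loop visits all 2^r codes, the run-time is exponential in r, which is exactly the cost the RMHC version of the next section avoids.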

4.2 Random Mutation Hill Climbing (RMHC)

A heuristic is a general principle or a set of loose guidelines that may find a solution to a problem. Although a solution is not guaranteed, heuristics are often applied in Artificial Intelligence to improve the performance of search methods. A hill climbing algorithm is a heuristic search that starts at some random point in the search space. It systematically evaluates each candidate solution without considering all the alternatives and selects the best successor node (subset) using the fitness function. It then commits the search to it; the successor becomes the current node, and the search continues until there is no further improvement: the convergence point. RMHC locates points that lead up a slope to maximise the fitness, which can easily end up in a local optimum; however, it is effective in many applications. Moreover, it is directly applicable to our current fitness function definition. As the fitness function plays a vital role in a clustering ensemble, directly enhancing its performance makes the ensemble more effective and efficient. The main idea of the RMHC version is to further improve the Gray code implementation using the small change, progressive evaluation of subsets, a requirement in hill climbing algorithms that maps directly into the Gray code version. The ability of RMHC to evaluate and compare fitness values between neighbouring subsets further improves the search for the best subset. Incorporating this into the algorithm, as shown in lines 6 and 7 of Algorithm 2, we envisage, will further speed up the search for the best subset and appreciably improve the search process while maintaining accuracy, especially as the volume


and dimension of the dataset increase. Theoretically, we could have used more powerful population-based algorithms, such as Genetic Algorithms. However, we envisaged that the crossover operator would make significant changes between successive subsets, preventing us from using Q̂ and significantly affecting the gain in speedup, owing to each subset then having to be evaluated independently.

Algorithm 2 Random Mutation Hill Climbing (RMHC)
Require: iter, w matrix of size r × r     ▷ w = Weighted Kappa
 1: procedure RMHC
 2:   Let s be a random binary string of size r
 3:   F ← Q(s, w)                         ▷ F = fitness
 4:   for i ← 1 to iter do                ▷ iter = number of iterations
 5:     ŝ ← smallChange(s)                ▷ ŝ = a random point close to s
 6:     F̂ ← Q̂(F, ŝ, w)                    ▷ F̂ = new fitness after the small change
 7:     if F̂ > F then
 8:       s ← ŝ
 9:       F ← F̂
10:     end if
11:   end for
12: end procedure

Another requirement of the RMHC is a random starting point, which we generate as a random binary string of size r. We calculate its quality once; the qualities of subsequent subsets come from the Update quality, because successive subsets s and ŝ are related, and so are their quality values. The small change operator, like the Gray code representation, flips a random bit in the binary string to generate a new subset ŝ close to the previous subset. The original fitness value is only used once, for the starting point of the search; changes in subsequent subsets are evaluated with the Update quality because of the minimal difference between subsets. For example, let a and b be two 10-bit Gray codes where a = 0000000110 and b = 0000000111; the corresponding subsets, selected as inputs from the agreement matrix, will be a = {7, 8} and b = {7, 8, 9}. Therefore, the quality Q(a) = F(7, 8), and Q̂(b) will be Q(a) added to F(9), where F is the fitness function as defined in Eq. 4.
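A runnable sketch of Algorithm 2 follows. For clarity it recomputes the average quality of each candidate instead of applying the Q̂ update, and restarts are left to the caller (the paper uses a hundred repeats); all names are ours.

```python
import numpy as np

def avg_quality(bits, W, theta):
    """Average above-threshold agreement of the subset encoded by `bits`."""
    idx = np.flatnonzero(bits)
    if len(idx) < 2:
        return -np.inf
    pairs = [(a, b) for i, a in enumerate(idx) for b in idx[i + 1:]]
    return sum(W[a, b] - theta for a, b in pairs) / len(pairs)

def rmhc(W, theta, iters=3000, seed=0):
    """Random Mutation Hill Climbing over bit-vector subsets: flip one
    random bit per iteration and keep the flip only if quality improves."""
    rng = np.random.default_rng(seed)
    r = len(W)
    s = rng.integers(0, 2, r)
    while s.sum() < 2:                 # need at least a pair to score
        s = rng.integers(0, 2, r)
    f = avg_quality(s, W, theta)
    for _ in range(iters):
        cand = s.copy()
        cand[rng.integers(r)] ^= 1     # smallChange: flip one random bit
        fc = avg_quality(cand, W, theta)
        if fc > f:
            s, f = cand, fc
    return np.flatnonzero(s).tolist(), f
```

Each run is one hill climb; because a single climb can stall in a local optimum, taking the best result over several seeds mirrors the repeated runs used in the experiments.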

5 Experiments

We test the effectiveness of the exhaustive algorithm and RMHC using twenty-seven real-world datasets from the UCI Machine Learning Repository [6], the University of Eastern Finland clustering benchmark [8] and the Outlier Detection Datasets [18]. The datasets' attributes and reported numbers of clusters are shown in Table 2. The experiments are categorised into three sub-experiments, as outlined in Table 3.


In the first experiment, we run the exhaustive search on the subsets for the twenty-seven datasets, with the average Weighted Kappa of the agreement matrix as the threshold, for values of k = 2 to √n. The corresponding maximum value of k for each dataset is reported in Table 2 as kmax, leading to 643 cases in total. We use the Update quality outlined in Algorithm 1 to speed up the process. The algorithm details the Gray code version of the quality, which avoids recomputing the quality values of the subset on every iteration, thereby drastically reducing the computation time. We only report the result for the average-quality threshold because it was the best of the threshold values. Due to the growing input size, the exhaustive search, as expected, took extremely long (in terms of run-time as a function of input size), which was part of our motivation for exploring the effect of applying a heuristic search using RMHC.

In the second experiment, we used RMHC and pegged the number of iterations at ten thousand (determined by trial and error). We start in different sections of the search space with a hundred repeats. The method is referred to in Table 3 as RM10k, and the model version as RMModel, described below.

The third experiment uses RMHC on different input sizes ranging from 5 … 30 on all datasets. We used the convergence points of the varying input sizes to generate a model. The model was validated using the maximum convergence points calculated from the linear regression model, which we then used to run a reduced version of the RMHC experiments to test its efficacy further. The results of RMHC over 10,000 iterations with a hundred repeats, and the model's performance for the same number of repeats, compared to the exhaustive search results, are reported in Sect. 6.

Table 3. Table of experiments, showing the methods, the datasets, the number of iterations and repeats

Methods     Datasets  #Iterations  #Repeats
Exhaustive  27        1            1
RM10k       27        10,000       100
RMModel     27        22 × r − 52  100
Total                              129,243

6 Results and Discussions

The results are presented in the following summaries:

– Exhaustive search by dataset and r (input size), presented as average quality
– RM10k (RMHC over ten thousand iterations) versus the exhaustive search
– RM10k versus the model improvement (RMModel)


The exhaustive search's performance versus the benchmark methods listed in Sect. 2 indicates that, compared to the top eight methods, our ensemble predicted seventeen of the twenty-seven datasets correctly, against twelve predicted by the Calinski Index (CH), the best among those methods. Similarly, the best error estimate across all datasets for all methods was 0.281, for Ball. Although relatively close to the ensemble's 0.271, the number of clusters predicted correctly by Ball was just eight (8), compared to the seventeen (17) predicted correctly by the ensemble. In summary, our ensemble technique performs better than all the other methods in accurately estimating the number of clusters, with fewer errors in the estimates across all the datasets. The accuracy of the estimated number of clusters for datasets from different sources confirms earlier findings on the efficacy of clustering ensembles in determining the number of clusters in a dataset.

Table 4. A summary of the percentage errors, average and standard deviation by dataset

Dataset            Exh            RM10k          RMModel        RMModel vs RM10k   RM10k vs Exh
                   Av     SD      Av     SD      Av     SD      Av (%)   SD (%)    Av (%)   SD (%)
Aml28              18.30  9.59    18.13  9.65    18.15  9.65    0.08     0.05      0.90     0.70
Atom               21.34  11.70   20.44  11.70   20.41  11.71   0.17     0.09      4.20     0.01
BezdekIris         11.27  10.71   11.17  10.70   11.18  10.71   0.07     0.07      0.89     0.12
Blobs              18.20  9.36    17.97  9.48    17.95  9.47    0.07     0.10      1.28     1.28
Cassini            23.66  10.50   23.55  10.57   23.55  10.57   0.02     0.03      0.47     0.69
Compound           25.93  12.01   25.78  12.11   25.77  12.10   0.04     0.03      0.57     0.85
Curves1            27.22  11.42   27.07  11.52   27.05  11.52   0.05     0.04      0.55     0.84
Gaussian500        19.11  9.80    19.00  9.86    19.01  9.85    0.04     0.09      0.56     0.64
Glass              21.59  11.66   21.58  11.64   21.57  11.64   0.03     0.02      0.07     0.13
Hepta              19.55  11.48   19.24  11.59   19.25  11.58   0.05     0.05      1.59     0.98
Longsquare         25.34  11.58   25.19  11.68   25.21  11.67   0.09     0.07      0.61     0.84
Lsun               24.16  12.06   23.97  12.18   23.97  12.18   0.02     0.02      0.80     1.01
Pearl              29.45  13.24   29.24  13.36   29.24  13.35   0.01     0.06      0.72     0.90
Pmf                22.21  11.76   22.09  11.70   22.08  11.69   0.05     0.06      0.52     0.50
Shapes             26.16  13.23   26.01  13.32   26.01  13.32   0.01     0.01      0.56     0.66
Size1              18.33  9.13    18.13  9.23    18.14  9.24    0.07     0.02      1.08     1.13
Size2              16.38  8.70    16.17  8.78    16.17  8.78    0.02     0.02      1.29     1.00
Spherical52        24.87  11.11   24.64  11.26   24.65  11.25   0.03     0.10      0.93     1.34
Square2            18.85  9.13    18.69  9.23    18.69  9.23    0.05     0.01      0.87     1.01
Synthetic control  28.78  12.08   28.57  12.24   28.58  12.22   0.05     0.15      0.75     1.25
Tetra              18.34  10.04   17.99  9.91    17.99  9.92    0.03     0.05      1.95     1.24
TetragonularBee    31.18  20.13   30.91  20.22   30.94  20.20   0.09     0.07      0.87     0.43
ThreeMC            25.30  11.81   25.14  11.92   25.14  11.91   0.03     0.05      0.65     0.90
Triangle1          24.90  14.29   24.75  14.36   24.74  14.37   0.04     0.04      0.60     0.49
Vehicle            24.43  14.39   24.40  14.39   24.40  14.40   0.01     0.02      0.13     0.01
Veronica           30.39  20.33   30.22  20.39   30.22  20.39   0.03     0.03      0.56     0.27
Zelnik3            29.49  13.22   29.30  13.35   29.29  13.35   0.01     0.03      0.65     0.92
Percentage error                                                0.05     0.05      0.91     0.75


Tables 4 and 5 compare the accuracy of the RMHC-based methods, RM10k and RMModel, against the exhaustive search. We track the performance of these algorithms on each dataset and as the input size increases. Error values are in terms of the fitness function Q, and we assessed cluster-method consistency by comparing the quality of output obtained from the exhaustive search with that of the RMHC. The RMHC result is divided into two variants: RM10k and RMModel. RM10k is the RMHC run for ten thousand iterations over a hundred repeats, and RMModel is the result of a reduced run of the RMHC whose iteration budget is the maximum run time predicted by the linear regression model, again over a hundred repeats. The tables also summarise the exhaustive search, RM10k and RMModel by dataset and by input size, reporting both average and standard deviation. The motive behind finding the average is to ascertain how consistently the RMHC algorithm reproduces the exhaustive-search result, and how close the model comes to the RMHC at similar quality. We present in Fig. 3 the linear regression model by average and maximum convergence. The figure describes the model's accuracy at the maximum Hill Climbing convergence points, which form the basis of the RMModel. We also observed a consistent pattern in the linear regressions of the model's average and maximum run times as the input size increases. Plotting the linear regression of the average and maximum run times against the input size yielded a linear predictor function, which we then used as the new iteration maximum in the model run and to compare the performance of the RMHC algorithm with the model. The linear equation resulting from the model is

iter = 22 × r − 52    (7)

where iter is the number of iterations and r is the input size. From the figure, the maximum run time describes the model more accurately than the average run time, and the error between the model and the actual RMHC run time was negligible (0.49%). The pattern shown in Fig. 3 led to a general equation for the linear model, which we then used to obtain results similar to the RMHC; the results are shown in Tables 4 and 5 by dataset and by input size, respectively. By input size, the model-based run was about 30 times faster than the initial RM10k and of the same quality. The errors by dataset show that RM10k, compared to the exhaustive search, has less than one per cent error overall. The RMModel, which is thirty (30) times faster than RM10k, produces a quality error of less than 0.05 per cent on average, producing a similar result 99.95 per cent of the time.
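The fitting-and-budgeting step can be sketched as follows. `fit_line` and the sample points are illustrative; only the default coefficients 22 and -52 come from Eq. (7).

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns slope a
    and intercept b of the run-time model iter = a*r + b, fitted to
    the maximum RMHC convergence iteration observed at each input size."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def iteration_budget(r, a=22.0, b=-52.0):
    """Iteration budget for an input of r clustering algorithms,
    using the paper's fitted line iter = 22*r - 52 by default."""
    return max(1, round(a * r + b))

# With the reported coefficients, 30 input algorithms need only
# 22*30 - 52 = 608 iterations instead of the fixed 10,000 of RM10k.
print(iteration_budget(30))  # -> 608
```

The budget is clamped to at least one iteration so the predictor stays usable even at input sizes below the fitted range.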

Estimating Dataset Clusters with Clustering Ensembles and Heuristic Search

349

Table 5. A summary of the percentage error, average and standard deviation by input size

| Size | Exh (Av, SD)  | RM10k (Av, SD) | RMModel (Av, SD) | RMModel vs RM10k (Av, SD) | RM10k vs Exh (Av, SD) |
| 5    | 4.16, 1.25    | 4.16, 1.25     | 4.16, 1.25       | 0.01%, 0.07%              | 0.07%, 0.38%          |
| 6    | 4.89, 1.69    | 4.89, 1.71     | 4.89, 1.71       | 0.01%, 0.09%              | 0.16%, 0.80%          |
| 7    | 6.88, 2.32    | 6.87, 2.34     | 6.87, 2.34       | 0.01%, 0.03%              | 0.15%, 0.78%          |
| 8    | 9.33, 3.20    | 9.32, 3.22     | 9.32, 3.22       | 0.01%, 0.04%              | 0.12%, 0.55%          |
| 9    | 12.15, 4.01   | 12.14, 4.03    | 12.14, 4.03      | 0.00%, 0.01%              | 0.10%, 0.46%          |
| 10   | 14.99, 4.97   | 14.98, 4.98    | 14.98, 4.99      | 0.02%, 0.07%              | 0.08%, 0.32%          |
| 11   | 18.56, 6.34   | 18.54, 6.36    | 18.54, 6.36      | 0.00%, 0.01%              | 0.08%, 0.32%          |
| 12   | 18.83, 6.13   | 18.81, 6.15    | 18.81, 6.16      | 0.02%, 0.07%              | 0.09%, 0.42%          |
| 13   | 18.98, 6.07   | 18.94, 6.13    | 18.94, 6.13      | 0.00%, 0.05%              | 0.21%, 0.97%          |
| 14   | 19.30, 5.97   | 19.23, 6.08    | 19.24, 6.07      | 0.02%, 0.11%              | 0.35%, 1.71%          |
| 15   | 19.71, 5.97   | 19.60, 6.13    | 19.60, 6.13      | 0.01%, 0.05%              | 0.55%, 2.64%          |
| 16   | 20.16, 6.21   | 20.01, 6.43    | 20.01, 6.43      | 0.00%, 0.05%              | 0.78%, 3.52%          |
| 17   | 20.62, 6.58   | 20.37, 6.91    | 20.39, 6.88      | 0.10%, 0.40%              | 1.23%, 4.99%          |
| 18   | 21.29, 7.22   | 20.90, 7.64    | 20.90, 7.64      | 0.01%, 0.05%              | 1.84%, 5.83%          |
| 19   | 24.79, 7.49   | 24.46, 7.87    | 24.48, 7.85      | 0.07%, 0.30%              | 1.32%, 5.13%          |
| 20   | 25.04, 7.40   | 24.72, 7.80    | 24.72, 7.77      | 0.03%, 0.36%              | 1.28%, 5.35%          |
| 21   | 25.07, 7.40   | 24.71, 7.82    | 24.73, 7.81      | 0.08%, 0.19%              | 1.42%, 5.74%          |
| 22   | 25.10, 7.41   | 24.72, 7.88    | 24.73, 7.86      | 0.05%, 0.16%              | 1.53%, 6.27%          |
| 23   | 29.42, 8.23   | 29.14, 8.59    | 29.12, 8.63      | 0.09%, 0.36%              | 0.94%, 4.45%          |
| 24   | 30.08, 8.75   | 29.71, 9.21    | 29.71, 9.21      | 0.02%, 0.06%              | 1.25%, 5.34%          |
| 25   | 34.77, 9.77   | 34.46, 10.16   | 34.46, 10.17     | 0.01%, 0.10%              | 0.88%, 3.98%          |
| 26   | 35.65, 10.01  | 35.34, 10.41   | 35.33, 10.44     | 0.03%, 0.21%              | 0.87%, 4.01%          |
| 27   | 35.66, 10.00  | 35.30, 10.49   | 35.30, 10.49     | 0.01%, 0.01%              | 1.01%, 4.85%          |
| 28   | 35.75, 9.97   | 35.35, 10.50   | 35.35, 10.50     | 0.01%, 0.03%              | 1.10%, 5.35%          |
| 29   | 38.42, 10.70  | 38.07, 11.16   | 38.08, 11.14     | 0.03%, 0.15%              | 0.90%, 4.28%          |
| 30   | 43.27, 11.75  | 43.02, 12.10   | 43.00, 12.12     | 0.03%, 0.13%              | 0.59%, 2.98%          |



Fig. 3. A plot showing the linear regression model by average and maximum convergence

Lastly, as shown in Fig. 4, the convergence point begins to grow as the input size grows beyond seven. At this point, switching to an RMHC or similar heuristic approach may be preferable to reduce the run time appreciably. From the results there is no correlation between k, the number of clusters, and the error, whereas there is a positive correlation between the input size and the error; this may be a result of the quality values not being normalised. Figure 4 graphically shows the cut-off (number of clustering ensembles used) between the exhaustive search and the RMModel-based approach, and the trend as the number of input clustering algorithms increases. The cut-off point is around an input size of seven (7), from which point the run time increases exponentially. Finally, the accuracy of our RM10k ensemble compared to the exhaustive search is shown in Fig. 5, which shows an average minimum of ninety-two per cent (92%) accuracy, with most estimates equal to the exhaustive result ninety-five per cent (95%) or more of the time.
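The cut-off described above amounts to a simple dispatch rule; the threshold of seven and the strategy labels below are illustrative, not an API from the paper.

```python
def choose_search(n_inputs: int, cutoff: int = 7) -> str:
    """Pick a subset-search strategy: exhaustive search stays tractable
    up to the cut-off observed in Fig. 4; beyond it the run time grows
    steeply, so the RMModel (RMHC with a predicted iteration budget)
    is used instead."""
    return "exhaustive" if n_inputs <= cutoff else "rmmodel"

print([choose_search(n) for n in (5, 7, 8, 30)])
# -> ['exhaustive', 'exhaustive', 'rmmodel', 'rmmodel']
```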


Fig. 4. Model versus hill climbing

Fig. 5. A graph showing the accuracy of RMHC compared with exhaustive

7 Conclusions and Recommendations for Future Work

In this paper, we have introduced an RMHC-based method to search for the subset of clustering methods that best estimates the number of clusters in datasets. This approach's novelty lies in establishing a relationship between subsets by mapping successive subsets into a form in which neighbours differ only slightly (Gray codes), thus enabling progressive evaluation of subset quality. The Gray code version considerably reduces the evaluation run time from quadratic to linear in the input size, and the RMHC approach produces similar results much faster than the exhaustive search. The results show that our heuristic-based method converges to results similar to the exhaustive search with, on average, ninety-five per cent accuracy. The model (convergence point prediction) was in turn thirty times faster than the RMHC on the same datasets. Therefore, introducing a heuristic search into the clustering ensemble, as demonstrated above, can significantly reduce the run time and has the potential to be applied to big-data clustering problems. Further work will include investigating the starting point of the search process, which can sometimes lead to a local optimum in a hill climbing implementation, and extending the application to a real-world dataset. Random Mutation Hill Climbing may not be the most efficient search technique available; however, this work should be considered a proof of concept. Future work will also investigate whether more advanced local search techniques (e.g. Simulated Annealing) can improve performance.
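A minimal sketch of the two ingredients described above: reflected Gray coding, under which consecutive integers map to subsets differing in one element, and an RMHC loop over bit-mask subsets. The fitness function is a toy placeholder for the paper's quality measure Q.

```python
import random

def gray(n: int) -> int:
    """Reflected binary Gray code: consecutive integers map to bit
    masks differing in exactly one bit, i.e. one clustering method
    added to or removed from the subset."""
    return n ^ (n >> 1)

def rmhc_subset(fitness, n_methods, iters=10_000, seed=0):
    """Random Mutation Hill Climbing over subsets of clustering
    methods encoded as bit masks: flip one random bit per iteration
    and keep the candidate if its quality does not decrease."""
    rng = random.Random(seed)
    best = rng.getrandbits(n_methods) or 1  # non-empty start subset
    best_q = fitness(best)
    for _ in range(iters):
        cand = best ^ (1 << rng.randrange(n_methods))  # one-bit mutation
        if cand == 0:
            continue  # keep at least one clustering method
        q = fitness(cand)
        if q >= best_q:
            best, best_q = cand, q
    return best, best_q
```

With a toy fitness such as the subset size, the climb reaches the full subset; in the paper the fitness is the ensemble quality Q, and the loop runs either 10,000 iterations (RM10k) or the budget predicted by Eq. (7) (RMModel).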

References

1. Aggarwal, C.C., Philip, S.Y., Han, J., Wang, J.: A framework for clustering evolving data streams. In: Proceedings 2003 VLDB Conference, pp. 81–92. Elsevier (2003)
2. Arica, N., Yarman-Vural, F.T.: An overview of character recognition focused on off-line handwriting. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 31(2), 216–233 (2001)
3. Ayed, S., Arzoky, M., Swift, S., Counsell, S., Tucker, A.: An exploratory study of the inputs for ensemble clustering technique as a subset selection problem. In: Proceedings of SAI Intelligent Systems Conference, pp. 1041–1055. Springer (2018)
4. Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A.: NbClust: an R package for determining the relevant number of clusters in a data set. J. Stat. Softw. 61(6), 1–36 (2014). www.jstatsoft.org/v61/i06/
5. Doran, R.W.: The Gray code. J. Univers. Comput. Sci. 13(11), 1573–1597 (2007)
6. Dua, D., Graff, C.: UCI machine learning repository (2017). www.archive.ics.uci.edu/ml
7. Elhag, A., Özcan, E.: Data clustering using grouping hyper-heuristics. In: European Conference on Evolutionary Computation in Combinatorial Optimization, pp. 101–115. Springer (2018)
8. Fränti, P., Sieranoja, S.: K-means properties on six clustering benchmark datasets (2018). www.cs.uef.fi/sipu/datasets/
9. Hamerly, G., Elkan, C.: Learning the k in k-means. In: Advances in Neural Information Processing Systems 16 (2003)
10. Higham, D.J., Kalna, G., Kibble, M.: Spectral clustering and its use in bioinformatics. J. Comput. Appl. Math. 204(1), 25–37 (2007)
11. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
12. Kass, R.E., Wasserman, L.: A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Am. Stat. Assoc. 90(431), 928–934 (1995)
13. Kaufman, L., Rousseeuw, P.J.: Partitioning around medoids (program PAM). In: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344, pp. 68–125 (1990)
14. Kent, J., Bibby, J., Mardia, K.: Multivariate Analysis (Probability and Mathematical Statistics) (2006)
15. McCarthy, M., Wiltshire, S.: Expectation maximization algorithm (E-M algorithm). Dictionary of Bioinformatics and Computational Biology (2004)
16. Odebode, A., Tucker, A., Arzoky, M., Swift, S.: Estimating the optimal number of clusters from subsets of ensembles. In: Proceedings of the 11th International Conference on Data Science, Technology and Applications, pp. 383–391 (2022)
17. Pelleg, D., Moore, A.W., et al.: X-means: extending k-means with efficient estimation of the number of clusters. In: ICML, vol. 1, pp. 727–734 (2000)
18. Rayana, S.: ODDS library (2016). www.odds.cs.stonybrook.edu
19. Strehl, A., Ghosh, J.: Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3(Dec), 583–617 (2002)
20. Swift, S., Tucker, A., Vinciotti, V., Martin, N., Orengo, C., Liu, X., Kellam, P.: Consensus clustering and functional interpretation of gene-expression data. Genome Biol. 5(11), 1–16 (2004)
21. Topchy, A., Jain, A.K., Punch, W.: Clustering ensembles: models of consensus and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell. 27(12), 1866–1881 (2005)
22. Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. Int. J. Pattern Recognit. Artif. Intell. 25(03), 337–372 (2011)
23. Vishnuvarthanan, G., Rajasekaran, M.P., Subbaraj, P., Vishnuvarthanan, A.: An unsupervised learning method with a clustering approach for tumor identification and tissue segmentation in magnetic resonance brain images. Appl. Soft Comput. 38, 190–212 (2016)
24. Zhang, Y., Zhao, Y.: Automated clustering algorithms for classification of astronomical objects. Astron. Astrophys. 422(3), 1113–1121 (2004)

Data-Driven Urban Air Mobility Flight Energy Consumption Prediction and Risk Assessment

Yonas Ayalew1, Wendwosen Bedada1, Abdollah Homaifar1(B), and Kenneth Freeman2

1 North Carolina A&T State University, Greensboro, NC 27411, USA
[email protected], {wbbedada,homaifar}@ncat.edu
2 NASA Ames Research Center, Mountain View, CA 94035, USA
[email protected]

Abstract. The technological advancements revolutionizing the concept of Urban Air Mobility (UAM) create a concurrent need to quantify the operational safety of these vehicles in terms of their associated risk. Providing safety certification of flight operations of UAM vehicles is critical, as the concept relies on battery-powered electric Vertical Takeoff and Landing (eVTOL) vehicles operating under current air traffic control. In this paper, a data-driven method for UAM vehicle energy consumption prediction and risk quantification with conditional value-at-risk, based on the energy consumption distribution, is presented. Significant factors affecting energy consumption, such as density altitude, aircraft design, airspeed, and collision avoidance algorithms, are considered in the data-driven energy consumption prediction of multiple eVTOL flights. Additionally, a risk metric is deployed to evaluate the risk associated with the worst-case energy-dissipating flights. Our results show that the proposed approach provides a generalized method to quantify the operational safety of a UAM network over a given region.

Keywords: Data-driven · Density altitude · Electric vertical takeoff and landing · Energy consumption · Flight risk assessment · Risk · Urban air mobility

1 Introduction

The commercial application of unmanned aerial vehicles and advanced technological innovations enabled the evolution of the Urban Air Mobility concept in large cities for human transportation systems, cargo, and package delivery use cases [9]. Recent trends, such as the increasing population in urban areas, faster growth of cities, and traffic congestion, are the main challenges to address in the urban transport system [8]. It is believed that electric vertical takeoff and landing (eVTOL) aerial vehicles are part of future air mobility since they address the significant hurdles of air transport efficiently with minimal environmental impact.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 354–370, 2024.
https://doi.org/10.1007/978-3-031-47715-7_24

These capabilities include the ability to take off vertically, hover,

UAM Risk Assessment

355

cruise, and land vertically, carrying people in larger and smaller cities during congested traffic. In line with the use of an electric propulsion system, electrical batteries are favored as the energy source for eVTOL aircraft. Assessing and quantifying the risk of running out of energy during operations is therefore essential to safely integrate UAM operations into the National Airspace System (NAS), thereby initiating a critical research focus on evaluating the performance of these air vehicles to help the certification process.

Although several performance analyses for emerging UAM vehicles have been conducted [1,5,26], generalized and critical safety assessments towards the certification procedures have not yet been developed. To the authors' knowledge, there are no clear guidelines for UAM vehicle (eVTOL) airworthiness certification. However, means of compliance with previously defined regulations for small aircraft and rotorcraft, based on safety intent by the Federal Aviation Administration (FAA), could be provided in a form and manner acceptable to the administration. According to NASA procedural requirements [10], NPR 7900, airworthiness study procedures evaluate an aircraft's flight operation capability, identify critical safety conditions to reduce risks to humans and property, and enhance the plausibility of flight operations based on the developed traffic management algorithms and system components. Hence, evaluation of the performance of UAM vehicle flight operations based on risk assessment is essential in unmanned operations to ensure vehicle certification. The battery energy consumption of eVTOLs therefore needs to be understood by assessing the flight risk level against the battery capacity and the reserve energy that must be maintained at the end of the flight to accomplish a realistic and efficient flight mission.
In [1], model-based approaches to battery energy consumption performance analysis were applied to multiple eVTOL configurations based on mathematical models of the vehicles. This approach not only incurs high computational cost and requires mathematical knowledge of the highly non-linear system dynamics, but its parameters also vary from one vehicle model to another, so the model-based approach is not a generalized solution. In contrast, data-driven models, which need minimal information about the vehicle models, make the computational complexity more manageable and applicable for real-time insights, predictions, and timely decision-making, providing a generalized solution. In this paper, we develop a data-driven energy consumption prediction and risk assessment of UAM vehicle and flight operations. The contributions of our work are summarized as follows:

– Development of a state-of-the-art data-driven technique for predicting flight energy to evaluate the performance of two winged eVTOLs (i.e., lift + cruise and vector thrust) based on the energy consumption in their profiles.
– Analysis of the predicted energy consumption distribution of multiple UAM operational flights in the energy space, projecting it to a risk distribution via a user-based risk equation while considering various energy reserve regulations.
– Development of a novel approach that quantifies the flight risk level for specific UAM vehicle configurations using conditional value-at-risk (CVaR).

356

Y. Ayalew et al.

– Experimental evaluation of the proposed method with realistic UAM flight data that considers the traffic management system and dynamic constraints.

The remainder of this paper is structured as follows: Sect. 2 reviews the literature; Sect. 3 describes the problem statement; Sect. 4 presents the proposed methodology; Sect. 5 explains the results and discussion; and Sect. 6 concludes the work.

2 Literature Review

Different unmanned traffic management system alternatives have been proposed based on different objectives and assumptions for approaching the management problem [3,12–14,18,23,27,28,32]. However, these approaches consider only part of the system components in the air traffic management (ATM) systems ontology. In [20], a framework that enables the comparison of different traffic management systems, based on decomposing the system into four sub-systems (i.e., airspace structure, access control, pre-flight planning, and collision avoidance), is developed under a two-dimensional assumption of the UAM system environment to create a comprehensive simulation tool. The framework compares and evaluates multiple collision avoidance algorithms in terms of time, energy, safety, and capacity efficiency. However, the approach used a kinematic model of the vehicle with agent-based simulation and did not consider the dynamics, infrastructure, and weather conditions.

A framework for evaluating the performance of UAM aircraft is designed in [26]: unmanned traffic management algorithms extend the kinematic aircraft model and two-dimensional approach of [20] to a dynamical aircraft model configuration and dilate the 2D operation to 3D. The approach adds the takeoff and landing segments of the flight profile, based on the Uber Elevate [6] specification, to the eVTOL mission profile. Performance evaluation is based on comparing the baseline simulator and the proposed extended simulator by the number of feasible mission profiles. Additionally, the battery energy consumption of the eVTOLs in the mission profile was analyzed using the C-rating, the rate at which the battery discharges. The approach is appropriate for performance checking but not applicable to flight risk assessment.

In [5], a formalization of design issues for UAM vehicles was considered: the vehicle is optimized by identifying the design variables for a lift + cruise eVTOL, and its performance for a given mission was analyzed against battery energy efficiency, mission time, and maximum takeoff weight objectives. The comparison of the performance for the different objectives showed different levels of performance for a reference mission. However, the analysis is mainly at the vehicle level, and ATM system components are not considered. Hazards of eVTOL vehicles in advanced air mobility (AAM) operations were surveyed in [30]; the paper identified the outcome of each hazard for the safety and operational performance of the vehicle. The battery energy degradation


was analyzed solely based on battery component-level faults, and the authors recommended that subject matter experts perform mission-level evaluation of vehicle flights, given the available battery capacity, for re-planning flights that are not executable. In our proposed framework, an evaluation of UAM vehicles' battery energy consumption, considering the traffic management system and vehicle configuration, is presented to assess the risk in multiple flights of two eVTOL vehicle types. Statistical risk metrics are used to quantify the risk of worst-case energy-consuming missions.

3 Problem Statement

Suppose Fd = [fd1, fd2, ..., fdn] is a representation of multiple UAM vehicle flights, where each flight trajectory constitutes traffic management algorithms for unmanned operations ft and factors affecting the battery energy consumption C = [DA, VA, AD], in which

ft, C ∈ Fd(1,2,...,n)    (1)

where DA, VA and AD are the air density, airspeed and aircraft design, respectively. A state-of-the-art machine learning model, represented as M : (Fd, R) → êc, takes the flight data, builds a regression model R and gives an output êc ∈ R^1, the predicted energy consumption of all flights. Given the battery capacity bc of the UAM vehicle, the distribution of êc is mapped to a risk distribution r based on a risk model characterized by êc, bc and the reserve energy Er that should be maintained at the end of each mission according to the aircraft configuration. Given the risk model R(êc, bc), its distribution r and a confidence level x, VaR provides the value with the maximum risk likelihood p, and CVaR provides the weighted average of the 1 − x risk values beyond the VaR point, assessing the tail end of the distribution.

4 Methodology

The proposed methodology considers the UAM flight data following a two-step data-generating procedure with two stand-alone simulators aggregated sequentially. The first is the UAS traffic management (UTM) simulator, whose data comprise the UAM system environment components, including the traffic management algorithms, vertiports, environmental weather conditions of different cities, the airspace structure, and the access control subsystems for an agent-based kinematic model simulation. Trajectories of these UTM simulator flights were used as input to the second tool, a UAM vehicle (eVTOL) dynamical model configuration and mission solver. Next, mission operation analysis data for three


Fig. 1. Integrated proposed approach for UAM vehicle risk assessment, with flight data from two integrated simulations, data-driven energy prediction, and CVaR-based risk assessment.

main configurations of UAM vehicles, identified as lift + cruise, vector thrust, and multirotor eVTOLs, were generated. We preprocessed the data considering factors affecting the battery energy consumption of unmanned air vehicles to predict the battery energy consumption of multiple winged-eVTOL missions, and the risk assessment model is employed to evaluate the energy risk level of the UAM vehicles. The respective details of the proposed methods are shown in Fig. 1 and discussed in the following subsections.

4.1 Flight Data Description

The UAM dataset from the UTM simulator contains the decomposed sub-system architecture of the UAM environment. The Airspace Structure sub-system is defined by the regulations which govern the navigation of air vehicles in the national airspace with various kinds of structures. Airspace structures include free flight, which allows aircraft to fly a path of their own choice from point A to point B; routes, which mandate the aircraft's flight path to follow certain predefined waypoints; a layered structure, which prescribes the operational altitude considering the type or characteristics of the flight; and a sector-based airspace structure, which cleaves the space into different regions and regulates access to those regions by enforcing rules. The Access Control sub-system manages the traffic in the space by earmarking access to the various regions of the airspace for a vehicle with a strategic plan in space and time. The pre-flight planning sub-system utilizes strategic algorithms to plan collision-free paths before


the vehicle takes off and submits them to a centralized manager to comply with the rules and regulations set by the access control and to reserve the airspace in a first-come, first-served manner. The collision avoidance sub-system implements a reactive planning algorithm that enables the UAM vehicles to update their path with respect to the local or neighboring traffic during the flight, following guidelines to reactively avoid conflicts using velocity and positional information of the local traffic. The vertiport infrastructure sub-system implements the concept of aerodromes for taking off and landing UAM vehicles (eVTOLs) by configuring the distribution of the vertiports, with the ability to cover regions of urban areas, and assigning a vehicle capacity to each. The weather conditions of the system follow a wind distribution built from hourly collected metropolitan-region weather data, incorporating dynamic environmental features into the simulation environment. Traffic data and the respective trajectories of each flight are produced by the traffic management simulation. A model-based eVTOL dynamical-model design and mission performance evaluator tool named SUAVE [15] is then exploited, with the trajectory data produced by the UTM simulator as a specification for missions of lift + cruise, vector thrust, and multirotor eVTOL design configurations. The segments of the mission profile are characterized by the Uber Elevate mission stipulation described in [6] under the determined complete flight profile. The mission analysis data of each vehicle model and flight operation are generated based on aerodynamic coefficients, aircraft electronics, flight conditions, electric motor, and propeller efficiency categories, each with different variables of performance measure.
The free airspace structure, access control with and without 4-dimensional trajectories (4-DT), pre-flight algorithms such as decoupled, safe interval path planning (SIPP) [19,22] and local velocity obstacle (Local VO) [23], and a reactive collision avoidance algorithm, the modified voltage potential (MVP) [16], are implemented in the UTM simulator. Details of the traffic management algorithms are described in [20]. Additionally, the vertiport and wind gust implementations are described in [25].

4.2 Factors Affecting Energy Consumption

Analysis of the battery energy consumption can be achieved by identifying factors affecting the unmanned air vehicle's energy consumption pattern. Air density, airspeed, and aircraft design are the main factors that influence aerial vehicles, based on an analysis of multiple parameters to identify possible factors affecting the energy consumption of unmanned aerial vehicle routing [29]. The density of air has a significant effect on an air vehicle's performance, and it changes with temperature. Air density, reported in terms of density altitude, which the FAA defines as the pressure altitude corrected for non-standard temperature, adversely affects lift capacity. In addition, air density decreases as altitude increases. This increases takeoff distance, reduces climb rate, and therefore leads to inefficient lift and thrust during the mission flight.
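The density-altitude correction implied by the FAA definition is commonly approximated with a rule of thumb of about 120 ft per degree Celsius of deviation from the ISA standard temperature; the sketch below uses that common approximation (not a formula from this paper) with illustrative inputs.

```python
def density_altitude_ft(pressure_altitude_ft: float, oat_c: float) -> float:
    """Rule-of-thumb density altitude: pressure altitude corrected by
    roughly 120 ft per degree Celsius of deviation from the ISA
    standard temperature (15 C at sea level, lapsing 2 C per 1000 ft)."""
    isa_temp_c = 15.0 - 2.0 * pressure_altitude_ft / 1000.0
    return pressure_altitude_ft + 120.0 * (oat_c - isa_temp_c)

# A vertiport at 5,000 ft pressure altitude on a 25 C day: ISA
# temperature there is 5 C, so DA = 5000 + 120 * 20 = 7400 ft.
print(density_altitude_ft(5000, 25))  # -> 7400.0
```

The higher the density altitude, the thinner the air, and hence the less lift and thrust available to the eVTOL for the same power.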


The energy consumption also changes with the aerial vehicle's flying speed (airspeed). In the UAM realm, faster vehicles are possible and advantageous for time-efficient mission objectives, but at higher flying speeds the energy consumption is also higher [5]. Thus it needs to be critically analyzed and considered in energy consumption prediction. Aircraft design (dynamic configuration) is crucial in determining the energy associated with air vehicles. Multirotor eVTOLs are known to be more efficient than lift + cruise and vector thrust eVTOLs in using energy during the hover state. On the other hand, lift + cruise and vector thrust eVTOLs have a wing for a more efficient cruise than multirotor eVTOLs [1]. Considering lift + cruise and vector thrust separately, the former uses two separate propulsion systems for hovering and cruising, while the latter uses the same propulsion system for both. As a result, over a more extended range, lift + cruise vehicles are less battery-energy efficient in cruise than vectored-thrust vehicles due to parasitic drag. Additionally, the weight, the size of the aircraft body, the number and size of rotors, the battery's energy capacity, and other design parameters define the aerial vehicle configuration. Thus, the different design considerations of the air vehicles can significantly affect the energy consumption of the UAM vehicles on their mission flights.

4.3 Ensemble Learning Prediction Model

State-of-the-art conventional machine learning algorithms are built to perform the regression task, using the most potent ensemble learning models to boost the energy consumption prediction for lift + cruise and vector thrust eVTOLs. Ensemble learning introduces the generation and combination of multiple models to perform a specific machine learning task. The intuitive definition of ensemble methods evokes human nature and the capability to collect multiple opinions of different kinds and weights and combine them to make high-level decisions [24]. There are many reasons why ensemble learning algorithms can enhance prediction performance. Overfitting is mitigated when a small amount of data is used, by averaging over different sample subsets to reduce the risk of choosing the wrong examples; this enhances the overall prediction, unlike a single algorithm that is prone to fitting each training example perfectly and predicting poorly on the test data. Furthermore, a single algorithm could end up in a locally optimal solution in the search, whereas ensemble models reduce the risk of being stuck in local solutions. In [24], the general framework of ensemble model building is described as follows: let a dataset have m samples with k feature variables, and let Q = (xi, yi) represent the input samples, where xi ∈ R^k and yi ∈ R is the output label to be predicted. An ensemble learning model ξ creates an aggregation function F that averages J machine learning models {f1, f2, f3, ..., fJ} in order to predict the output ŷi:

ŷi = ξ(xi) = F(f1, f2, f3, ..., fJ)    (2)

where ŷi ∈ R for the regression task in our case.
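The aggregation step in Eq. (2) is an average over the base regressors; a minimal sketch follows, where the lambda models are toy stand-ins, not the paper's fitted trees.

```python
def ensemble_predict(models, x):
    """Aggregation function F from Eq. (2): the ensemble output
    y_hat is the average of the J base regressors' predictions."""
    preds = [f(x) for f in models]
    return sum(preds) / len(preds)

# Toy stand-ins for bagged trees / extra trees / a boosted model:
base_models = [lambda x: 2.0 * x, lambda x: 2.5 * x, lambda x: 1.5 * x]
print(ensemble_predict(base_models, 10.0))  # -> 20.0
```

In practice the bagging, boosting and voting variants discussed in this section correspond to scikit-learn's BaggingRegressor, AdaBoostRegressor and VotingRegressor.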


Three types of ensemble learning algorithms [2] are implemented in our prediction: bootstrap aggregation algorithms, boosting algorithms and voting ensemble algorithms. Bootstrap aggregation (bagging) algorithms build multiple models of the same kind, drawing different data samples with replacement and training a model for each sample; the predictions of the sub-models are averaged into an overall prediction. Each sample often has the same data size to ensure an adequate sample per model. We utilized bagging of the well-known decision tree model (bagged decision trees) and the random forest, a tree-based ensemble model in which each tree is constructed from a collection of random variables in a way that reduces the correlation between the regressors. Furthermore, extra trees, another modification of bagging that produces more diversity through extreme randomness, was utilized; its training procedure selects the best splitting feature from the dataset's variables by randomly choosing split points during the training phase. Boosting ensemble algorithms define a sequence of models of the same kind, where each model learns to concentrate on the samples predicted incorrectly by the previous model in order to correct the error when training the current model. AdaBoost [7], the first successful dependent boosting ensemble model, works by giving the same weight to all data points in the first model and then re-weighting each data point based on how easy or difficult it was to predict: wrongly predicted points receive more weight, and correctly predicted points receive less attention in constructing the subsequent model. Voting ensemble algorithms construct multiple standalone models of different kinds from the training data; a voting regressor then wraps the models and averages their predictions to combine the sub-models. A decision tree and extra trees are built and combined in a voting ensemble method.

4.4 Risk Formulation and Metrics

The flight's predicted energy consumption distribution is used to generate the risk distribution r through risk-profile modeling. The risk model given in [4] assumes that as a flight's energy consumption nears the air vehicle's battery capacity, the risk increases exponentially until it reaches a limiting high-risk value. The mathematical representation of the model is

R(êc, bc) = exp( ζ / max(bc − êc, Er) ) − 1    (3)

where êc denotes the predicted energy consumption of each flight and bc denotes the battery energy capacity of the eVTOL. ζ and Er are constants greater than zero used to define the exponential property and the limit point of the risk via the energy reserve requirement. A conditional value-at-risk (CVaR) metric was developed in [21] to statistically assess the level of risk within a financial firm and is currently utilized to assess risk in engineering and vehicular applications. CVaR provides the possible risk

362

Y. Ayalew et al.

of worst-case scenarios of the flight energy consumption distribution beyond a given confidence level x, considered as the Value-at-Risk (VaR) level. CVaR can be defined as the weighted average of the extreme losses in the tail end of the distribution. The mathematical representation of CVaR is given as follows:

CVaRx = (1 / (1 − x)) ∫_{R(êc, bc) ≥ VaRx} R(êc, bc) r dr    (4)

Value-at-Risk (VaR) is also a statistical risk assessment technique; it examines past distributions to calculate the percentile for a given confidence level x. This percentile is the level of risk we expect to face with probability p, assuming future distributions are similar to past ones. CVaR represents the expected risk once a worst-case energy consumption threshold is crossed, while VaR represents the worst-case risk associated with a given probability and confidence level.

4.5 Energy Reserve Requirement

The Federal Aviation Administration part 91.167 fuel requirement for flight in instrument flight rules (IFR) conditions mandates that an aircraft not operate unless it carries enough fuel, considering environmental conditions, for 30 additional minutes of flight after a mission. This mandate targets long-trip aircraft and can be eased for eVTOLs once adequate vertiports are constructed and the maximum eVTOL flight duration is considered. In [17], 20% of battery capacity is assumed as an energy reserve requirement for emergency regulations; the requirement is set for winged eVTOLs such as lift + cruise and vector thrust eVTOLs. In [31], a 24% energy reserve is estimated for the winged eVTOL design configuration, depending on the battery-specific power of different eVTOL configurations. Hence, we consider 20, 22 and 24% energy reserves to assess the energy consumption risk level of the UAM vehicle flights, maintaining energy requirements for emergencies and ensuring compliance with the FAA regulation.
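To make the formulation concrete, the risk profile of Eq. 3 and the empirical VaR/CVaR estimation of Eq. 4 can be sketched in a few lines of Python. The flight energies below are randomly generated placeholders, and the values of ζ and the reserve Er (here 20% of a 93.75 kWh pack) are illustrative assumptions, not the parameters used in the paper:

```python
import numpy as np

def risk(e_hat, b_c, zeta=1.0, e_r=0.2 * 93.75):
    """Risk profile of Eq. 3: grows exponentially as the predicted
    consumption e_hat approaches the battery capacity b_c, and is
    capped once the remaining energy falls below the reserve e_r."""
    return np.exp(zeta / np.maximum(b_c - e_hat, e_r)) - 1.0

def var_cvar(r, x=0.95):
    """Empirical VaR (x-quantile of the risk samples) and CVaR
    (mean of the tail at or beyond VaR)."""
    var = np.quantile(r, x)
    cvar = r[r >= var].mean()
    return var, cvar

# Illustrative predicted energy consumptions (kWh) for 70 flights
rng = np.random.default_rng(0)
e_hat = rng.uniform(30.0, 75.0, size=70)

r = risk(e_hat, b_c=93.75)
var, cvar = var_cvar(r, x=0.95)
print(f"VaR = {var:.4f}, CVaR = {cvar:.4f}")  # CVaR >= VaR by construction
```

Because CVaR averages only the samples at or beyond the VaR quantile, it always captures the tail losses that the plain percentile misses.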

5 Results and Discussion

We performed a two-fold experiment towards the risk assessment of the UAM vehicle: first, a regression task to predict the energy consumption of flights based on factors affecting the energy consumption of autonomous eVTOLs; second, a risk analysis of the flights using the conditional value-at-risk (CVaR) metric. In the regression task, sample flights of lift + cruise and vector thrust eVTOL vehicles with different trajectories and traffic management algorithms were considered. We then took 60% of the data for training the regression models

UAM Risk Assessment

363

and 40% for validation and testing. We then trained the models and evaluated them using two standard regression evaluation metrics, MAE and MSE. Mean Absolute Error (MAE) is the average of all absolute errors, i.e., the average absolute distance between the actual and predicted data. Mean Squared Error (MSE) measures the average of the squared errors, i.e., the average squared difference between the predicted and actual values. The two evaluation metrics are defined as follows:

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|    (5)

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²    (6)
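The training and evaluation pipeline described above (60/40 split, the five ensemble regressors, and the MAE/MSE metrics of Eqs. 5 and 6) can be sketched with scikit-learn. The synthetic features and target below merely stand in for the flight dataset, which is not reproduced here:

```python
import numpy as np
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              ExtraTreesRegressor, AdaBoostRegressor,
                              VotingRegressor)
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for the flight features / energy-consumption target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=500)

# 60% training, 40% validation/testing, as in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)

models = {
    "Bagged decision trees": BaggingRegressor(DecisionTreeRegressor(),
                                              n_estimators=50),
    "Random forest": RandomForestRegressor(n_estimators=100),
    "Extra trees": ExtraTreesRegressor(n_estimators=100),
    "AdaBoost regressor": AdaBoostRegressor(n_estimators=50),
    "Voting regressor": VotingRegressor([
        ("dt", DecisionTreeRegressor()),
        ("et", ExtraTreesRegressor(n_estimators=100)),
    ]),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    scores[name] = (mean_absolute_error(y_te, pred),   # Eq. 5
                    mean_squared_error(y_te, pred))    # Eq. 6
    print(f"{name}: MAE={scores[name][0]:.4f}, MSE={scores[name][1]:.4f}")
```

The hyperparameters (tree counts, voting members) are illustrative defaults, not the configuration used for Tables 1 and 2.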

The prediction results of each regression model, based on the evaluation metrics in Eqs. 5 and 6, are shown in Tables 1 and 2.

Table 1. Prediction results of employed regression models for lift + cruise.

Regression model       MAE      MSE
Bagged decision trees  0.03028  0.00583
Random forest          0.06471  0.06380
Extra trees            0.02371  0.00486
AdaBoost regressor     1.08107  1.85189
Voting regressor       0.02251  0.02245

Table 2. Prediction results of employed regression models for vector thrust.

Regression model       MAE      MSE
Bagged decision trees  0.12356  0.06347
Random forest          0.14122  0.07832
Extra trees            0.09260  0.05329
AdaBoost regressor     6.28854  1.79971
Voting regressor       0.04713  0.04700

The trained regression models achieved comparable prediction performance. During testing, the bootstrap aggregation (bagging) and voting ensemble algorithms showed higher prediction scores than the boosting algorithm on the test data.


Fig. 2. Predicted energy consumption distribution of lift + cruise and vector thrust eVTOLs.

We conducted the energy analysis of the predicted flights, as shown in Fig. 2a and b. The energy consumed during a flight is described in the predicted energy distribution, which covers 70 sample flights each for the lift + cruise and vector thrust eVTOLs with an initial battery energy capacity of 93.75 kWh. The flights had different flight plans, and the collision avoidance algorithm and the conflict events that occurred differed between flights. Hence, the energy spent on the flights differs, as seen from the distribution. The distribution results show that more than ten flights used 60–76% of the vehicle's initial battery energy capacity for the lift + cruise eVTOL, and more than fifteen flights used 60–83% for the vector thrust eVTOL. Applying the risk profile model of Eq. 3 to the energy distributions transforms them into risk distributions; the respective risks of the lift + cruise and vector thrust eVTOLs are then computed for 20, 22, and 24% energy reserve limits using CVaR. Our implementation of VaR and CVaR uses a confidence level of 95% (x = 0.95), which is often used for analysis [11]. Figures 3 and 4 show the risk levels of the air vehicles for the demonstrated sample flights based on the specified energy reserves. It can be inferred from the results that the risk level increases for higher energy-consumption flights, as there is a constraint imposed on the remaining battery capacity.

Table 3. CVaR and VaR based risk assessment for lift + cruise eVTOL flights.

Energy reserve limit (%)  CVaR             VaR
20                        0.7773 (77.73%)  0.6786 (67.86%)
22                        0.8825 (88.25%)  0.7683 (76.83%)
24                        0.9904 (99.04%)  0.8625 (86.25%)


Fig. 3. Lift + cruise eVTOL energy risk evaluation with different energy reserve regulations. We can observe that the risk values increase as the required remaining energy increases from 20 to 24% for the worst-end distribution.


Fig. 4. Vector thrust eVTOL energy risk evaluation with different energy reserve regulations. It can be observed that the maximum risk value is attained by the worst-end distribution for all the 20, 22 and 24% remaining energy requirements.


The lift + cruise eVTOL flight risk evaluation shows that the energy reserve level of 24% is the riskiest to maintain, since the tail-end flights consumed close to 76% of the initial battery capacity. Additionally, the CVaR statistical risk assessment results in Table 3 show the risk level of the extreme-end energy consumption flights. The VaR statistical assessment is also shown for comparison.

Table 4. CVaR and VaR based risk assessment for vector thrust eVTOL flights.

Energy reserve limit (%)  CVaR             VaR
20                        0.8947 (89.47%)  0.7615 (76.15%)
22                        0.9483 (94.83%)  0.8641 (86.41%)
24                        0.9936 (99.36%)  0.9727 (97.27%)

The vector thrust eVTOL flight risk evaluation shows that there were flights with high risk under all of the considered reserve limits. Multiple flights were at risk at the 22 and 24% reserve limits, since some flights consumed more than 80% of the initial battery capacity. The CVaR and VaR results in Table 4 quantify the hazardous flights. As can be seen, CVaR captures the tail end of the distribution with the worst energy use for both the lift + cruise and vector thrust eVTOLs, whereas VaR cannot capture the extreme losses. The results show that CVaR is a much more suitable metric for capturing risky scenarios.

6 Conclusions and Future Work

In this work, a data-driven energy consumption prediction of Urban Air Mobility (UAM) vehicles in two kinds of winged eVTOL configurations was performed using comprehensive flight data. The flight data considered the essential properties of the UAM environment and the autonomous capability of the vehicles. The energy consumption distributions of multiple flights of lift + cruise and vector thrust eVTOLs were modeled into risk distributions with a user-based risk formulation. The tail-end distribution of the energy use of the two eVTOL vehicles was analyzed, and the associated risk levels were effectively measured using the Conditional Value at Risk (CVaR) statistical technique. The energy reserve of the vehicles was taken into consideration in analyzing the flights of each vehicle. The assessment was done for each vehicle independently; a comparative analysis was not performed since the flight samples used for each vehicle did not share the same flight plans. This assessment can also be extended to the multicopter eVTOL configuration and different optimized fixed-wing configurations. In the future, we will extend the work to flight evaluation and risk assessment of autonomous eVTOLs under designed missions and conditions. The mission


design focuses on additional reserve segments of the mission profile for an emergency. Moreover, the analysis of the eVTOL flights will be performed mission segment-wise: takeoff, cruise, landing, and the additional emergency reserve segment. Furthermore, multirotor (wingless) aircraft configurations will be incorporated to address additional types of eVTOL vehicles.

Acknowledgment. This research work is sponsored by the National Aeronautics and Space Administration University Leadership Initiative (NASA-ULI 2019) research grant number 80NSSC20M0161. The authors would like to thank Mr. Frank Aguilera, who provided constructive feedback.

References

1. Bacchini, A., Cestino, E.: Electric VTOL configurations comparison. Aerospace 6(3), 26 (2019)
2. Brownlee, J.: Machine Learning Mastery with Python: Understand Your Data, Create Accurate Models, and Work Projects End-to-End. Machine Learning Mastery (2016)
3. Bulusu, V., Sengupta, R., Mueller, E.R., Xue, M.: A throughput based capacity metric for low-altitude airspace. In: Aviation Technology, Integration, and Operations Conference, p. 3032 (2018)
4. Choudhry, A., Moon, B., Patrikar, J., Samaras, C., Scherer, S.: CVaR-based flight energy risk assessment for multirotor UAVs using a deep energy model. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 262–268. IEEE (2021)
5. Clarke, M., Smart, J., Botero, E.M., Maier, W., Alonso, J.J.: Strategies for posing a well-defined problem for urban air mobility vehicles. In: AIAA Scitech 2019 Forum, p. 0818 (2019)
6. Uber Elevate: Uber air vehicle requirements and missions. Technical Report, Uber (2018)
7. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
8. Hill, B.P., DeCarme, D., Metcalfe, M., Griffin, C., Wiggins, S., Metts, C., Bastedo, B., Patterson, M.D., Mendonca, N.L.: UAM vision concept of operations (ConOps) UAM maturity level (UML) (2020)
9. Concept of operations v1.0 (2020). www.nari.arc.nasa.gov/sites/default/files/attachments/UAMS ConOps v1.0.pdf
10. NASA procedural requirements for aircraft operations management NPR 7900.3D, chapter 2: Airworthiness and maintenance (2017–2023). www.nodis3.gsfc.nasa.gov/displayDir.cfm?t=NPR&c=7900&s=3D
11. Jabr, R.A.: Robust self-scheduling under price uncertainty using conditional value-at-risk. IEEE Trans. Power Syst. 20(4), 1852–1858 (2005)
12. Jang, D.-S., Ippolito, C.A., Sankararaman, S., Stepanyan, V.: Concepts of airspace structures and system analysis for UAS traffic flows for urban areas. In: AIAA Information Systems-AIAA Infotech@Aerospace, p. 0449 (2017)
13. Johnson, M., Jung, J., Rios, J., Mercer, J., Homola, J., Prevot, T., Mulfinger, D., Kopardekar, P.: Flight test evaluation of an unmanned aircraft system traffic management (UTM) concept for multiple beyond-visual-line-of-sight operations. In: USA/Europe Air Traffic Management Research and Development Seminar (ATM2017), number ARC-E-DAA-TN39084 (2017)


14. Joulia, A., Dubot, T., Bedouet, J.: Towards a 4D traffic management of small UAS operating at very low level. In: ICAS, 30th Congress of the International Council of the Aeronautical Sciences (2016)
15. Lukaczyk, T.W., Wendorff, A.D., Colonno, M., Economon, T.D., Alonso, J.J., Orra, T.H., Ilario, C.: SUAVE: an open-source environment for multi-fidelity conceptual vehicle design. In: 16th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, p. 3087 (2015)
16. Maas, J., Sunil, E., Ellerbroek, J., Hoekstra, J.: The effect of swarming on a voltage potential-based conflict resolution algorithm. In: Submitted to the 7th International Conference on Research in Air Transportation (2016)
17. Melo, S.P., Cerdas, F., Barke, A., Thies, C., Spengler, T.S., Herrmann, C.: Life cycle engineering of future aircraft systems: the case of eVTOL vehicles. Procedia CIRP 90, 297–302 (2020)
18. Peinecke, N., Kuenz, A.: Deconflicting the urban drone airspace. In: 2017 IEEE/AIAA 36th Digital Avionics Systems Conference (DASC), pp. 1–6. IEEE (2017)
19. Phillips, M., Likhachev, M.: SIPP: safe interval path planning for dynamic environments. In: 2011 IEEE International Conference on Robotics and Automation, pp. 5628–5635. IEEE (2011)
20. Ramee, C., Mavris, D.N.: Development of a framework to compare low-altitude unmanned air traffic management systems. In: AIAA Scitech 2021 Forum, p. 0812 (2021)
21. Rockafellar, R.T., Uryasev, S.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)
22. Russell, S., Norvig, P.: A* search: minimizing the total estimated solution cost. Artif. Intell. 94–99 (2010)
23. Sachs, P., Dienes, C., Dienes, E., Egorov, M.: Effectiveness of preflight deconfliction in high-density UAS operations. Technical Report, Altiscope (2018)
24. Sagi, O., Rokach, L.: Ensemble learning: a survey. Wiley Interdiscip. Rev.: Data Mining Knowl. Discov. 8(4), e1249 (2018)
25. Sarkar, M., Yan, X., Gebru, B., Nuhu, A.-R., Gupta, K.D., Vamvoudakis, K.G., Homaifar, A.: A data-driven approach for performance evaluation of autonomous eVTOLs (2022)
26. Sarkar, M., Yan, X., Girma, A., Homaifar, A.: A framework for eVTOL performance evaluation in urban air mobility realm (2021). arXiv:2111.05413
27. Sedov, L., Polishchuk, V.: Centralized and distributed UTM in layered airspace. In: 8th International Conference on Research in Air Transportation, pp. 1–8 (2018)
28. Sunil, E., Hoekstra, J., Ellerbroek, J., Bussink, F., Vidosavljevic, A., Delahaye, D., Aalmoes, R.: The influence of traffic structure on airspace capacity. In: 7th International Conference on Research in Air Transportation (2016)
29. Thibbotuwawa, A., Nielsen, P., Zbigniew, B., Bocewicz, G.: Energy consumption in unmanned aerial vehicles: a review of energy consumption models and their relation to the UAV routing. In: International Conference on Information Systems Architecture and Technology, pp. 173–184. Springer (2018)
30. Thompson, E.L., Taye, A.G., Guo, W., Wei, P., Quinones, M., Ahmed, I., Biswas, G., Quattrociocchi, J., Carr, S., Topcu, U., et al.: A survey of eVTOL aircraft and AAM operation hazards. In: AIAA AVIATION 2022 Forum, p. 3539 (2022)
31. Yang, X.-G., Liu, T., Ge, S., Rountree, E., Wang, C.-Y.: Challenges and key requirements of batteries for electric vertical takeoff and landing aircraft. Joule 5(7), 1644–1659 (2021)


32. Zhu, G., Wei, P.: Low-altitude UAS traffic coordination with dynamic geofencing. In: 16th AIAA Aviation Technology, Integration, and Operations Conference, p. 3453 (2016)

Development of a Decision Support System in a Canning Industry

Panagiotis Mallioris1(B), Georgios Kokkas1, Alexandros Styliadis-Heinz1, Ioannis Margaritis2, Fotios Stergiopoulos1, and Dimitrios Bechtsis1

1 Department of Industrial Engineering and Management, International Hellenic University, PO Box 141, 57400 Sindos, Greece
[email protected]
2 SAVVYCAN S.A., Karavaggeli 6, 57009 Kalohori, Greece

Abstract. Decision Support System frameworks are of great importance in the context of Industry 4.0 for preventing production bottlenecks and machine malfunctions and for increasing the reliability of the industrial process environment. With the development of digitalization, Decision Support Systems (DSS), alongside cyber-physical solutions, Internet of Things (IoT) devices and big data approaches, constitute the core of industrially oriented smart manufacturing applications. However, a considerable number of industries lack the technological infrastructure to effectively utilize the vast amount of data collected daily from various sensors and heterogeneous sources scattered across the production process. The scope of this paper is to present a conceptual framework of a DSS in a canning industry that utilizes high-volume data collected from sensors embedded in the production process in order to detect and eliminate bottlenecks. The first part of the solution is dedicated to the integration of Programmable Logic Controllers (PLC) and the KEP Open Platform Communications (OPC) Server for data acquisition and communication with a MySQL relational database, while the second part concerns data manipulation and the presentation of the data analysis strategy for decision making. Three machine learning models, namely Random Forest, Naïve Bayes and SVM, were tested for the prediction of total production losses, with Random Forest outperforming the rest with an accuracy of 90%. As a final step, a dashboard with real-time descriptive statistics and an alarm-based system for upcoming potential failures are presented. This approach will improve the existing production process and act as a guideline for upcoming research on Decision Support Systems in the canning industry, with significant replication potential in other types of industries.
Keywords: Canning industry · Big data · Decision support system · Data mining

1 Introduction

The main core of Industry 4.0 is the utilization of high-velocity, variability, veracity, volume, and value (Fosso Wamba et al., 2015) measurements in order to increase the reliability and safety of industrial machines, increase the overall productivity of the organization and reduce unexpected costs. The digital transformation of operations management, supply chains and industrial organizations has enhanced the implementation of
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 371–380, 2024. https://doi.org/10.1007/978-3-031-47715-7_25


decision support systems, achieving the concept of smart manufacturing [1, 2]. In general, a DSS provides crucial information regarding manufacturing processes, informs the operations manager about overall production losses and generates production forecasts [3]. Thus, a DSS framework is essential for a resilient and sustainable organization in the competitive environment of Industry 4.0. The applied framework will provide researchers and technicians with a smart solution for production monitoring, incorporating a DSS in a tin can manufacturing process. Several examples of DSS regarding predictive maintenance and bottleneck reduction exist in the literature. An intelligent energy-based prognostics approach for conveyor belt tension in discrete manufacturing systems was proposed in [4]. The authors used an artificial neural network, power consumption measurements and load information to predict three stages of belt tension: i) low, ii) optimal and iii) over-tensed. The results were promising, achieving 96.8% accuracy on the validation data and accurately predicting the gradual deterioration of belt tension. Additionally, a machine learning approach for bottleneck prognosis and anomaly detection in a cold forming manufacturing line was presented in [5]. Important features such as acoustic emissions, maintenance logs and statistical measurements were considered for the classification algorithm, with the results indicating classification of the healthy state with an F1 score of 0.632. Although a considerable amount of research has been published regarding data analysis and decision making in various industries, few researchers have published work on decision making in canning industries. Our work focuses on describing a conceptual framework of a DSS in a canning industry utilizing high-volume data collected from embedded sensors in the production process.
In more depth, optical sensors measuring the total losses of cans were installed throughout the industrial processes. Moreover, an OPC server connected through Programmable Logic Controllers (PLCs) provides a real-time connection with a relational MySQL database for data storage and further analysis. Additionally, regarding the data science process and decision making on critical production processes, the Python programming language and the Pandas analysis library are implemented. The remainder of the paper is structured as follows: Sect. 2 describes the data acquisition procedure between the embedded sensors and the database. Sect. 3 presents the conceptual methodology of the data science approach, the prediction results and the visualization framework. Finally, Sect. 4 provides the conclusions of this work along with future research directions.

2 Data Acquisition Through PLC-OPC Server: Database

The main scope of the project is to conduct planned industrial research aimed at the study, development and implementation of an integrated DSS at the level of production organization. Through a dynamic process of data acquisition and analysis, our research aims to improve productivity in terms of monitoring the production line and determining losses at each individual stage, acquiring quantitative and qualitative characteristics related to the use of raw materials, preventing bottlenecks and planning the production process. In more depth, our work focuses on data acquisition and monitoring of the quantity of products produced and their correlation with critical processes, control of stocks of raw materials (tinplate, lacquer, powder) and appropriate production scheduling, and control of tinplate losses at each stage of the production line. Furthermore, regarding the equipment used for the data acquisition, a Mean Well DR-120-24 power supply, a Siemens S7-1200 1215C DC PLC, a Dell OptiPlex 5080 SFF as the host machine for the OPC/web server and Telemecanique XUB2APANL2(R/T) photoelectric sensors were implemented. Figure 1 depicts the integration of the hardware equipment in the industrial environment.

Fig. 1. Implementation of hardware equipment in a canning industry.

In order to achieve the main objectives of our research, multiple photoelectric sensors were used throughout the processes of the production line. In total, 10 different process stages, namely Slitters, Body-maker, Oven, Flanger, Beader, Seamer, Tester, Camera, Palletizer and Wrapping machine, summarize the operations performed in a canning industry. Additionally, with the integration of an OPC server, data aggregation is performed, connecting the real-time measurements collected from the sensors with an industrial PC server and the cloud. Furthermore, the OPC server enables the connection with (i) a Data Logger, which stores data in a MySQL database, (ii) a REST Client, which broadcasts data in real time on the network, and (iii) a REST Server, which enables read/write functionality on the PLC through web protocols. Figure 2 shows the architecture of the initial system regarding the connection between the real-time measurements and the web applications. For communication with the DSS, data streams are fed from the database into the Python and Pandas analysis software through a MERN (MySQL-Express-React-Nodejs) technology stack. The DSS framework then analyzes the data in order to create various reports and alarms, which are introduced to a web-based application developed in React. Finally, the web API publishes those reports, along with any other information created by the system, to the operational manager and the web app, closing the system architecture cycle.
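The Data Logger path of this architecture (sensor counts read through the OPC server and written to a relational database) can be illustrated with a small sketch. The stage names follow the paper, but the stubbed read function and the use of SQLite in place of the production MySQL database are illustrative assumptions:

```python
import sqlite3
import time

def read_sensor(stage: str) -> int:
    """Stand-in for reading a can counter via the OPC server;
    in production this would be an OPC client call, not a stub."""
    demo_counts = {"Flanger": 1200, "Beader": 1187, "Seamer": 1180}
    return demo_counts[stage]

# In-memory SQLite stands in for the MySQL database of the real deployment
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE measurements (
    ts REAL, stage TEXT, can_count INTEGER)""")

# One logging cycle: timestamp each stage's counter and persist it
for stage in ("Flanger", "Beader", "Seamer"):
    conn.execute("INSERT INTO measurements VALUES (?, ?, ?)",
                 (time.time(), stage, read_sensor(stage)))
conn.commit()

rows = conn.execute(
    "SELECT stage, can_count FROM measurements ORDER BY stage").fetchall()
print(rows)
```

In the real system this loop would run continuously, with the analysis layer querying the same table for descriptive statistics and alarms.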


Fig. 2. Data acquisition and data aggregation architecture.

3 Decision Support Systems (DSS)

In a real-world industrial environment, a decision support system processes a huge amount of data coming from heterogeneous sources. The data science process is essential in order to manipulate raw data through cleaning techniques and transform them into valuable information for the operational manager and the prediction algorithms. In addition to initial descriptive statistics, data science includes data-driven methodologies with the subprocesses of i) data cleaning, ii) data pre-processing and iii) decision making. In the works of [6] and [7], extensive research on data manipulation and data mining is presented. The following sections describe the data science approach applied to raw heterogeneous data.

3.1 Data Cleaning

As an initial stage of a decision support system, the authors emphasize the manipulation and transformation of the collected data (data cleaning or cleansing) in order to create a dataset of complete and proper format. In real-world applications, the data cleansing process involves identifying incorrect, incomplete or inaccurate historical data and replacing them with their mean or median value, or erasing them [8]. This process differs for each application and should be conducted cautiously. For example, when the incomplete data are extensive, it is suggested that they be replaced in order to keep the cohesiveness of the historical values; however, if they are sparse, they can be deleted. In our use case, for the data cleaning process, as suggested in the works of [9, 10], the Python programming language and the NumPy and Pandas libraries are implemented. Other commonly used software in the literature includes Apache Spark with the pyspark library [11] and MATLAB [12].
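A minimal pandas sketch of the two cleaning strategies discussed above (median imputation for extensive gaps, row deletion for sparse ones), using hypothetical stage counts rather than real production data:

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts with gaps from dropped sensor readings
raw = pd.DataFrame({
    "flanger_out": [1200, 1195, np.nan, 1210, 1202],
    "beader_out":  [1187, np.nan, 1190, 1205, np.nan],
})

# Extensive gaps: impute with the column median to keep the
# cohesiveness of the historical values
cleaned = raw.fillna(raw.median(numeric_only=True))

# Sparse gaps: dropping the incomplete rows is the alternative
dropped = raw.dropna()

print(cleaned)
print(f"{len(raw) - len(dropped)} incomplete rows would be dropped")
```

Which strategy applies should be decided per variable, as the text notes, since wholesale deletion of extensive gaps would distort the historical record.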


3.2 Data Pre-processing

The data pre-processing subsystem addresses the further processing and exploitation of high-volume data collected from heterogeneous sources in the production process. It constitutes a critical stage of the data science architecture, manipulating raw data so that they can be properly used in statistical and predictive models and visualization charts. Initially, data sampling is encountered, especially in cases where the sampling frequency and the collected volume of data are large. Data sampling aims at storing all the necessary information without affecting the occurrence and importance of the distributions of the input variables. Hence, the volume of data and the computational requirements are significantly reduced without losing critical historical features or patterns of interest. Secondly, another important factor is the conversion of the data into an appropriate type, thus avoiding inequalities between the variable types of each column. Last but not least, the transformation of the input measurements is encountered [13]. The transformation can be further divided into i) feature engineering, which uses statistical feature reduction models such as Principal Component Analysis or statistical correlation measures such as Spearman's to introduce the critical features into the predictive algorithms depending on the dependent variable of interest, ii) data homogenization, in order to avoid biased results and ensure data consistency, and iii) data normalization, which reduces high variances among features by converting the data to a range between 0 and 1. Figure 3 shows a heatmap of the correlation between critical production features, such as the overall losses between Flanger and Beader, and the predicted label, which is the total losses. As presented, the overall losses between Beader and Seamer show the highest correlation with the total losses variable.
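The normalization and Spearman-correlation steps can be sketched with pandas; the per-shift loss counts below are hypothetical and only illustrate the mechanics behind the heatmap of Fig. 3:

```python
import pandas as pd

# Hypothetical per-shift loss counts between consecutive stages
df = pd.DataFrame({
    "flanger_beader_losses": [5, 9, 3, 12, 7, 10],
    "beader_seamer_losses":  [8, 14, 4, 20, 11, 25],
    "total_losses":          [20, 33, 11, 48, 27, 60],
})

# Min-max normalization to the 0-1 range, reducing variance across features
normalized = (df - df.min()) / (df.max() - df.min())

# Spearman rank correlation of each feature against the predicted label
# (total losses), as used for the heatmap of critical production features
corr = df.corr(method="spearman")["total_losses"]
print(corr)
```

With this toy data the Beader–Seamer losses track the total losses perfectly in rank order, so they show the highest correlation, mirroring the observation made for Fig. 3.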
3.3 Decision Making

A decision-making framework commonly uses Artificial Intelligence (AI) algorithms to predict dependent variables based on input features. Depending on the application, prediction models can be divided into three major categories, namely classification, regression and clustering [14]. Classification occurs when the predicted label is an integer that expresses a specific category [e.g., 1) safe operation, 2) optional maintenance, 3) mandatory maintenance] or a binary state (e.g., 1 operating, 0 inactive). Commonly used data-driven algorithms in classification applications are Random Forest, XGBoost, CatBoost, Support Vector Machines (SVM), Artificial Neural Networks (ANN) and Convolutional Neural Networks (CNN). Regression refers to cases where the predicted label is a real number over an independent range. It is usually encountered in machine lifetime prediction (Remaining Useful Life), total failure losses, operating temperature or operating pressure predictions. Random Forest Regressor, CatBoost Regressor, ANN and Recurrent Neural Networks (RNN) are suggested as the most suitable algorithms for regression models. Finally, clustering is an unsupervised data-driven method in which the algorithm itself attempts to group the dataset of interest into clusters, with machine learning algorithms such as K-nearest Neighbours (KNN) and K-means. Our decision-making framework for the canning industry focuses on classification models in order to control stocks of raw materials, create an appropriate production schedule, and control losses of tinplate at each stage. Jupyter Notebook, the Python programming language and the NumPy and scikit-learn libraries are used for the algorithm implementation. Figure 4 presents the process of database communication with the web server through the analysis framework.

Fig. 3. Heatmap of critical production outputs.

Fig. 4. Database communication with the web server through the analysis framework.

In our research, three machine learning models, Random Forest, Naïve Bayes and SVM, were tested for the prediction of total production losses based on the initial production stages, namely the Flanger and Beader processes.

Development of a Decision Support System

377

More specifically, if the overall production losses exceed a pre-defined threshold, the output label is an alarm value of 1; otherwise, the output is 0. Based on the aforementioned input features, the algorithms should be able, upon training, to recognise the predicted label and output a warning if the classification result is 1. To measure the performance of the algorithms, Accuracy, Precision, F1-score, Cohen Kappa score and the confusion matrix were used as evaluation metrics. Table 1 presents the results of each algorithm, with Random Forest yielding the best performance: Accuracy 90%, Precision 90%, F1 score 89.36% and Cohen Kappa score 80.12%.

Table 1. Algorithms and evaluation metrics.

Algorithm      Accuracy (%)  Precision (%)  F1 score (%)  Cohen Kappa Score (%)  TP  TN  FP  FN
Random forest  90            90             89.36         80.12                  24  21  5   0
Naive Bayes    86            86             85.1          72.17                  23  20  6   1
SVM            82            82             82.35         64                     20  21  5   4
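The alarm-classification experiment described above can be sketched with scikit-learn. The data below are synthetic stand-ins for the Flanger/Beader stage features and the loss threshold; the real study trains on the measured production data, so the printed scores will not match Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, f1_score,
                             cohen_kappa_score, confusion_matrix)

rng = np.random.default_rng(0)
# Synthetic stand-ins for early-stage features (e.g. Flanger/Beader loss counts).
X = rng.normal(size=(250, 4))
overall_losses = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=250)
y = (overall_losses > 0.0).astype(int)   # 1 = alarm: losses exceed the threshold

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("Random forest", RandomForestClassifier(random_state=0)),
                  ("Naive Bayes", GaussianNB()),
                  ("SVM", SVC())]:
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    # sklearn's binary confusion matrix flattens to (tn, fp, fn, tp).
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    print(f"{name}: acc={accuracy_score(y_te, y_pred):.2f} "
          f"prec={precision_score(y_te, y_pred):.2f} "
          f"f1={f1_score(y_te, y_pred):.2f} "
          f"kappa={cohen_kappa_score(y_te, y_pred):.2f} "
          f"TP={tp} TN={tn} FP={fp} FN={fn}")
```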

3.4 Data Visualization

An important aspect of completing a DSS is the data visualization framework. It is essential that the information collected and analyzed by the processing software (cleaning, pre-processing, decision making) be presented in a simple and understandable format to the production manager or any external observer. Thus, depending on the application, graphical charts such as pie charts, histograms, time series charts, scatter plots or heat maps are recommended to further exploit the critical results of a production process. In our work, the visualization subsystem framework is divided into two stages. First, the Python programming language and the Matplotlib library enable the creation of comprehensible and representative time series charts, which can depict, individually or in total, the various critical parameters of the production process. Figure 5 illustrates several plots of critical system outputs, e.g., the timeline of losses between the Oven output and the Flanger output, or total losses throughout the system. In the second stage, a dashboard containing real-time crucial variables in combination with an alarm-based system is designed for the proposed research (Fig. 6). If an anomaly is detected, the DSS algorithm forwards a warning message to the web-based application to prompt the users for corrective action. In more depth, the web application, upon user request, queries an Express server and forwards the requested results to a React framework in combination with the ECharts library for visualization. Additionally, critical production variables, namely the amount of rejected cans, the malfunctions of machinery and the total processed items for each stage, are presented in real time to the production manager.
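A minimal sketch of the first visualization stage: a Matplotlib time-series chart of total losses with a threshold-based alarm overlay. The loss values and the threshold are synthetic placeholders for the real sensor counts.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen (no display needed)
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
t = np.arange(200)                                   # production time steps
total_losses = np.clip(rng.normal(10, 3, 200), 0, None)  # synthetic loss counts
threshold = 15.0                                     # illustrative alarm threshold

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(t, total_losses, label="Total losses")
ax.axhline(threshold, color="red", linestyle="--", label="Alarm threshold")
alarms = total_losses > threshold                    # points that would raise an alarm
ax.scatter(t[alarms], total_losses[alarms], color="red", zorder=3, label="Alarm")
ax.set_xlabel("Time step")
ax.set_ylabel("Rejected cans")
ax.legend()
fig.savefig("total_losses.png")
```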


Fig. 5. Timeline of total losses throughout the system with matplotlib.

Fig. 6. Dashboard containing real-time crucial variables in combination with an alarm-based system.


4 Conclusion

In this article, the development of a DSS in a canning industry has been presented. An extensive analysis of the data-flow process, from the photoelectric sensors up to the visualization framework, was described, proposing the implemented hardware and software methodology for each section respectively. Random Forest, Naïve Bayes and SVM were executed for the prediction of total production losses, with Random Forest performing best with an Accuracy of 90%. Our decision support application enhances smart manufacturing in modern industries by enabling real-time condition monitoring across various production processes. Bagging solutions such as Random Forest provide a simple and efficient implementation; however, boosting approaches such as XGBoost and CatBoost can further improve the accuracy of prediction models. Thus, as future research, a more extensive analysis of the data-driven algorithms is suggested in order to enhance the reliability of industrial machinery and the proposed decision-making process.

Acknowledgments. This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH–CREATE–INNOVATE (project code: T2EDK-01806).

References

1. Javier Maseda, F., López, I., Martija, I., Alkorta, P., Garrido, A.J., Garrido, I.: Sensors data analysis in supervisory control and data acquisition (SCADA) systems to foresee failures with an undetermined origin. Sensors 21 (2021). https://doi.org/10.3390/s21082762
2. Ivanov, D., Dolgui, A.: A digital supply chain twin for managing the disruption risks and resilience in the era of Industry 4.0. Product. Plann. Control 32, 775–788 (2021). https://doi.org/10.1080/09537287.2020.1768450
3. Singer, G., Cohen, Y.: A framework for smart control using machine-learning modeling for processes with closed-loop control in Industry 4.0. Eng. Appl. Artif. Intell. 102, 104236 (2021). https://doi.org/10.1016/j.engappai.2021.104236
4. Elahi, M., Afolaranmi, S.O., Mohammed, W.M., Lastra, J.L.M.: Energy-based prognostics for gradual loss of conveyor belt tension in discrete manufacturing systems. Energies (Basel) 15 (2022). https://doi.org/10.3390/en15134705
5. Nieves Avendano, D., Caljouw, D., Deschrijver, D., van Hoecke, S.: Anomaly detection and event mining in cold forming manufacturing processes. Int. J. Adv. Manuf. Technol. 115, 837–852 (2021). https://doi.org/10.1007/s00170-020-06156-2
6. Wickham, H.: Tidy data. J. Stat. Softw. 59, 1–23 (2014). https://doi.org/10.18637/jss.v059.i10
7. Broman, K.W., Woo, K.H.: Data organization in spreadsheets. Am. Stat. 72, 2 (2018). https://doi.org/10.1080/00031305.2017.1375989
8. Osborne, J.: Best practices in data cleaning: a complete guide to everything you need to do before and after collecting your data (2013)
9. Natanael, D., Sutanto, H.: Machine learning application using cost-effective components for predictive maintenance in industry: a tube filling machine case study. J. Manufact. Mater. Process. 6 (2022). https://doi.org/10.3390/jmmp6050108
10. Rodrigues, J.A., Farinha, J.T., Mendes, M., Mateus, R.J.G., Cardoso, A.J.M.: Comparison of different features and neural networks for predicting industrial paper press condition. Energies (Basel) 15 (2022). https://doi.org/10.3390/en15176308
11. Calabrese, M., Cimmino, M., Fiume, F., Manfrin, M., Romeo, L., Ceccacci, S., et al.: SOPHIA: an event-based IoT and machine learning architecture for predictive maintenance in Industry 4.0. Information (Switzerland) 11, 1–17 (2020). https://doi.org/10.3390/INFO11040202
12. Romahadi, D., Luthfie, A.A., Suprihatiningsih, W., Xiong, H.: Designing expert system for centrifugal using vibration signal and Bayesian networks. Int. J. Adv. Sci. Eng. Inf. Technol. 12, 23–31 (2022). https://doi.org/10.18517/ijaseit.12.1.12448
13. Garmaroodi, M.S.S., Farivar, F., Haghighi, M.S., Shoorehdeli, M.A., Jolfaei, A.: Detection of anomalies in industrial IoT systems by data mining: study of CHRIST Osmotron water purification system. IEEE Internet Things J. 8, 10280–10287 (2021). https://doi.org/10.1109/JIOT.2020.3034311
14. Angelopoulos, A., Michailidis, E.T., Nomikos, N., Trakadas, P., Hatziefremidis, A., Voliotis, S., et al.: Tackling faults in the Industry 4.0 era—a survey of machine-learning solutions and key aspects. Sensors (Switzerland) 20, 1–34 (2020). https://doi.org/10.3390/s20010109

Estimating the Tendency of Social Media Users to Spread Fake News

Ahmad Hashemi1(B), Wei Shi1, Mohammad Reza Moosavi2, and Anastasia Giachanou3

1 School of Information Technology, Carleton University, Ottawa, Canada
[email protected], [email protected]
2 Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran
[email protected]
3 Department of Methodology and Statistics, Faculty of Behavioural and Social Sciences, Utrecht University, Utrecht, The Netherlands
[email protected]

Abstract. The unique characteristics of social media, such as popularity, ubiquitousness, and inadequate supervision, make it a perfect medium for fake news propagation. While users play a critical role in this propagation, not all of them have the same level of impact and involvement. Identifying the news-sharing behaviors of different users and predicting them automatically can be a leading step toward detecting fake news and understanding the factors that contribute to its spread. Previous attempts to detect fake news spreaders have focused on binary classification, assuming users as either spreaders or non-spreaders of fake news. To address this oversimplification, we propose estimating users’ tendency to spread fake news by introducing a metric that represents the degree of users’ propensity to spread misinformation. Our approach is a supervised regression model utilizing text-based features extracted from users’ writings on social media. We created and annotated a new dataset based on FakeNewsNet, a popular data repository on fake news detection, to train our model and conduct our experiments. In our experiments, we establish the practicality of our approach by achieving a Root Mean Squared Error (RMSE) of 0.26, using a range of values from 0 to 1 to represent users’ inclination to spread fake news. We also demonstrate that utilizing text-based features leads to better performance than using explicit features directly provided by social media.

Keywords: Fake news detection · Fake news spreader identification · Author profiling · Machine learning

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 381–396, 2024. https://doi.org/10.1007/978-3-031-47715-7_26

1 Introduction

The days of waiting to get a newspaper to read the latest news are gone. Nowadays, consumers are shifting away from traditional media sources and are increasingly using social media to access and read the news [1]. The easiness of sharing

and reposting the news, as well as the lack of proper content-controlling mechanisms, make social media a perfect environment for propagating fake news. In fact, research shows that fake news propagates through social media even faster than real news [32].

A great amount of the literature focuses on distinguishing fake from real news by extracting linguistic features [12,21,22]. However, since fake news content is written to mislead readers, investigating related information beyond the news content itself seems necessary for effective detection. Therefore, some studies provide fake news detection approaches that leverage information about spreading patterns [34,35]. Considering that users play an essential role in the propagation of fake news on social media, some studies explore the detection of fake news by leveraging the characteristics of the spreaders [7,29]. However, they did not address the task of identifying fake news spreaders or estimating their Tendency to Spread Fake News (TSFN).1

Estimating social media users’ TSFN could provide several benefits, as it can help in designing targeted interventions, such as suggesting fact-checking tools or related educational content to users with a higher TSFN to improve their critical thinking skills. Also, by estimating users’ inclination to propagate false information, we can gain a better understanding of the factors that contribute to the spread of fake news. This can help researchers and policymakers develop more effective strategies to combat the spread of misinformation on social media. Finally, the task of TSFN estimation can be a leading step toward detecting fake news. Some studies in the literature provide binary classification approaches to identify fake news spreaders [2,10,18,23].
Identifying fake news spreaders as a binary classification task has some limitations because it assumes that users’ behavior can be classified into two distinct categories: those who always spread fake news and those who never do. However, in reality, many users may have a mixed tendency to spread both real and fake news, which can lead to incorrect classification of users as either spreaders or non-spreaders. To address this issue, we propose estimating the extent of users’ tendency to spread fake news by employing a regression approach. Additionally, having an estimation of the tendency to spread fake news for all users can provide more comprehensive information for fake news detection than merely identifying absolute fake news spreaders while disregarding users with more balanced tendencies.

In this study, we investigate the following research questions:

– RQ1: Can the textual content posted by users on social media be used to predict the extent of their tendency to spread fake news?
– RQ2: Which features are more effective in distinguishing between different degrees of users’ propensity to spread fake news?

Our main contributions in this study are as follows:

1 In the rest of the paper, we use the abbreviation “TSFN” for Tendency to Spread Fake News and “TSFN score” for the variable we introduce to represent users’ tendency to spread fake news.

Estimating Users’ Tendency to Spread Fake News

383

– We present a novel task description of estimating the extent of social media users’ propensity to spread fake news and introduce the TSFN score to represent this propensity.
– We create a dataset and annotate 5,000 Twitter users by calculating their TSFN scores based on their historical news sharing provided in a fake news detection benchmark data repository.
– We propose a supervised regression model that can automatically estimate a user’s tendency to spread fake news given the content they write on social media.
– We compare the performance of using different language-based features in the estimation task.
– We analyze the contribution of different emotional signals expressed in users’ posts in differentiating between different degrees of the tendency to spread fake news.

To the best of our knowledge, this is the first study that addresses the task of fake news spreader identification as a regression task. In this study, we demonstrate the feasibility of this estimation on a test set by obtaining a Root Mean Squared Error (RMSE) of 0.26, considering a value in the range [0.0, 1.0] for the TSFN score.

The rest of the paper is organized as follows. In Sect. 2, we provide an overview of related work on fake news detection, focusing on different approaches taken to identify fake news spreaders. In Sect. 3, we describe the process of creating our dataset, including the selection criteria for social media users and the annotation process. Section 4 presents our proposed methodology for estimating the tendency of users to spread fake news, which describes the feature extraction methods as well as the regression models utilized. In Sect. 5, we present the experimental setup and the obtained results, including a comparison with some baseline approaches. Section 6 discusses the results. Section 7 summarizes our conclusions and discusses possible future work in this area.

2 Related Work

Fake news detection has received a lot of research attention in the last few years. Early attempts proposed using information from the textual content of the news articles, such as Bag-of-Words and Term Frequency—Inverse Document Frequency (TF-IDF) [6]. With the new advancements in neural networks, some researchers have applied the Bidirectional Encoder Representations from Transformers (BERT) model [14,15]. Jwa et al. [14] used BERT to analyze the relationship between the headline and the body text, whereas Kaliyar et al. [15] proposed combining different parallel blocks of single-layer Convolutional Neural Networks (CNN) with BERT. We refer interested readers to survey papers such as [30] and [36] that give full and comprehensive descriptions of the area of fake news.

Given that users are the ones who ultimately decide whether or not to share a post, their role in the propagation of fake news is crucial. To this end, some


studies have focused on detecting and/or analyzing potential fake news spreaders. Shu et al. [29] analyzed different characteristics of users who were sharing fake or real news. They also utilized the users’ features to detect fake news. Their results showed that the registration time and whether the user is verified or not are the two most important features for detecting fake news. With the increasing interest in this domain, Rangel et al. [23] organized an evaluation task on profiling users who have shared fake news in the past. A wide range of features (e.g., pre-trained word embeddings, n-grams, emotions, personality traits, punctuation marks) and learning approaches (e.g., Support Vector Machines, Logistic Regression, CNN, LSTM) were proposed by the participants for the detection of potential fake news spreaders. In another study, Giachanou et al. [9] explored the impact of different linguistic features and personality traits on differentiating between fact-checkers and fake news spreaders. Vo and Lee [31] also focused on fact-checking, analyzed the linguistic characteristics of fact-checking tweets, and found that fact-checkers tend to use formal language and few swear words and slang. Ghanem et al. [8] proposed an approach to detect non-factual Twitter accounts, such as propaganda and clickbait accounts, by treating post streams as a sequence of tweet chunks from which they extracted semantic and dictionary-based features, including word embeddings, emotion, and style.

Several Natural Language Processing (NLP) studies claim that users’ personality traits can be inferred from the language they produce [4,25]. To this end, some studies focus on predicting users’ behavior in facing fake news based on the textual content they provide on social media [10,18,23]. In the present work, we aim to estimate the tendency of users to spread fake news based on their writing style as a prior step to fake news detection.
Unlike previous studies that focus on binary classification, we address this problem as a regression task. This approach provides a more nuanced and realistic estimation of users’ behavior and avoids the oversimplification existing in binary classification.

3 Dataset Creation

In this section, we describe the dataset we collected and used to perform our experiments on estimating users’ tendency to spread fake news. While there exist different datasets containing fake news, most of them are created specifically for the task of fake news detection and consist of news items labeled as fake or real [19,28,32]. Few datasets focus on profiling users, namely those consisting of a set of users labeled as fake news spreaders or not [23], fake news spreaders or fact-checkers [9], and users who have retweeted posts from unreliable or reliable news sources [18]. These datasets were collected for binary classification tasks and usually consist of users with biased behavior. In contrast, the task we have at hand is a regression task and is intended to consider a wider range of users with different sharing actions when confronting fake and real news. Therefore, we collect a new dataset consisting of users with different extents of TSFN.


To collect our own dataset, we used FakeNewsNet,2 a widely used benchmark data repository for fake news detection. FakeNewsNet consists of two datasets, PolitiFact and GossipCop, collected based on two fact-checking platforms with the same names.3 Both datasets contain news articles from different websites labeled as either fake or real based on the fact-checking platforms. In this study, we focus on the PolitiFact dataset, which covers political news. For each political news item, the PolitiFact dataset provides the news content as well as a list of Twitter user ids that have tweeted about the news. We use the relation between user ids and news items to collect our dataset, which consists of 5,000 user profiles. We label each user with a value in the range of [0.0, 1.0], representing their tendency to spread fake news. Since we aim to investigate user-generated textual content, we download each user’s timeline using the Twitter API. As suggested in the literature [23], for each user, we consider their last 100 tweets/retweets/replies as their timeline and utilize these as a source of their writing. The process of collecting and annotating the dataset consists of the following steps:

– User Selection: In this step, we select and label those users from whom we can collect enough information. Since we want to label the users based on their news-sharing behaviors, we select the top 5,000 users who share (either spread or claim as fake news) the highest number of news items labeled as either fake or real in the PolitiFact dataset. As shown in Table 1, each selected user in our dataset has shared at least 5 labeled news items, and each has shared an average of 8.43 PolitiFact news items. The total number of news-related tweets is 42,196 (41,433 spreading tweets and 763 fact-check tweets), all shared by the 5,000 selected users. It should be noted that the timeline was not available for a small portion of users due to Twitter API policies or simply because the users did not post anything on their account other than the tweet(s) related to the labeled news items. We eliminated such users before the user selection process.

– Discriminating Between Fake News Spreaders and Fact-Checkers: In the FakeNewsNet repository, related tweets for each news article are collected by applying search queries formed from the news article’s headline [28]. Therefore, it is very probable that some tweets are related to a news item because they reveal its falseness (fact-check tweets). Obviously, those tweets should not be labeled as spreading tweets. To discriminate between fact-check and spreading tweets, we utilize a heuristic approach provided in [10]: if a user’s tweet contains any of the terms {fact check, fake, false, hoax, lead stories, politifact leadstories, snopes}, we consider it a fact-check tweet (it declares the news item’s falseness); otherwise, a fake news spreading one. To provide a clearer sense, Fig. 1 shows some examples of fact-check versus fake news spreading tweets.
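The keyword heuristic from [10] can be sketched as a small predicate. The term list is copied verbatim from the text (the fused entry "politifact leadstories" may originally have been two separate terms); the example tweets are invented.

```python
# Terms taken from the text; a tweet containing any of them is treated as a
# fact-check tweet rather than a spreading tweet.
FACT_CHECK_TERMS = {"fact check", "fake", "false", "hoax",
                    "lead stories", "politifact leadstories", "snopes"}

def is_fact_check(tweet: str) -> bool:
    """Return True if the tweet appears to declare the news item false."""
    text = tweet.lower()
    return any(term in text for term in FACT_CHECK_TERMS)

print(is_fact_check("This story is a hoax, see snopes"))  # True
print(is_fact_check("Wow, did you hear what happened?"))  # False
```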

The code for downloading the FakeNewsNet repository can be found in the following link: https://github.com/KaiDMML/FakeNewsNet. PolitiFact and GossipCop are two fact-checking websites with the following web addresses: https://www.politifact.com/ and https://www.gossipcop.com/.


– User Annotation: To measure the extent of a user’s tendency to spread fake news, we introduce a metric named TSFN score, which is a value in the range of [0.0, 1.0] and is calculated as follows:

    TSFN(i) = n_i(sf) / (n_i(sf) + n_i(sr) + n_i(f))    (1)

Fig. 1. Examples of fact-check and spreading tweets for a particular news piece.

In this formula, TSFN(i) is the TSFN score of user u_i, n_i(f) is the number of news items user u_i has fact-checked, and n_i(sf) and n_i(sr) are respectively the numbers of fake news and real news items spread by the user. Intuitively, the TSFN score is the ratio of the fake news items a particular user has spread to all the news items they have shared, including fake, real and fact-checked ones. The higher the score, the more likely the user is to spread fake news. Figure 2 shows the histogram of the calculated TSFN scores for the 5,000 selected users. As the figure illustrates, roughly 1,200 of the users have TSFN scores close to zero, indicating a very low tendency to spread fake news. In contrast, about 800 users have TSFN scores close to one, which means they are highly likely to spread fake news. The rest of the users have a more balanced propensity to spread fake news.

Table 1. Statistics of the created dataset.

Number of users                                 5,000
Average number of shared news items per user    8.43
Maximum number of shared news items per user    69
Minimum number of shared news items per user    5
Total spreading tweets                          41,433
Total fact-check tweets                         763
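Eq. (1) can be checked with a small worked example; the counts below are illustrative and not taken from the dataset.

```python
def tsfn_score(n_spread_fake: int, n_spread_real: int, n_fact_checked: int) -> float:
    """Ratio of fake news items spread to all news items shared (Eq. 1)."""
    total = n_spread_fake + n_spread_real + n_fact_checked
    return n_spread_fake / total

# A hypothetical user who spread 6 fake and 3 real items and fact-checked 1:
print(tsfn_score(6, 3, 1))   # 0.6
```

A user who only spreads real news or only fact-checks scores 0.0; one who only spreads fake news scores 1.0, matching the extremes of the histogram in Fig. 2.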


Fig. 2. Histogram showing the distribution of calculated TSFN scores of the 5,000 selected users.

4 Methodology

4.1 Task Description

We aim to estimate the tendency of Twitter users to spread fake news, which could be used as a prior step for fake news detection. In this paper, we define a novel task of estimating Twitter users’ tendency to spread fake news based on their writing style. To this end, we introduce the TSFN score to represent the extent of users’ propensity to spread fake news. Our proposed model is a supervised regression model that uses a training set of users to learn a function that assigns to each new user a value representing their tendency to spread fake news. The training set consists of n users, each represented as a tuple (x, y), where x is the feature vector of the user and y is a value named the TSFN score, in the range of [0.0, 1.0], that shows their tendency to spread fake news. We assume different feature groups can be extracted from the user’s timeline as a source of their writings. Since all feature groups are extractable from any textual content the users produce, our approach is likely independent of the social media platform. It is worth noting that users may use different language on different platforms; however, exploring the linguistic differences of the same user on different platforms and analyzing how this can affect our task is beyond the scope of this study.

4.2 Feature Extraction

Since we focus on leveraging language-based features, we extract three feature groups from the users’ timelines. The timelines we utilized for our experiments do not contain any tweets related to the labeled news items, and we use them solely to extract users’ language-based features. For each user, we concatenate the textual content of the 100 tweets/retweets/replies from their timeline after excluding tweets related to the labeled news items. We consider each user’s last 100 posted contents, as suggested in [23], to be informative enough to extract language-based features. We then tokenize the documents using the TweetTokenizer module from the Natural Language Toolkit (NLTK) package [5] for Python and eliminate the URLs. The following items explain the process we followed to extract each feature group.

– Pre-trained Word Embedding: Word embedding is a common approach for word representation in which each word is represented in the form of a vector. Pre-trained word embeddings are trained on large datasets to capture the semantic and syntactic meaning of a word. Accordingly, words with similar representations are expected to have similar meanings. To extract word vectors for our task, we utilize a widely used pre-trained word embedding model, the spaCy large-size pre-trained model for English, available in the spaCy package [13]. The model consists of unique vectors of 300 dimensions for 658k different words. In our use case, each user is represented as the weighted average of the word vectors for all words used in the user’s timeline, where the weights are the word frequencies in the user’s timeline. After calculating the weighted average, each user is represented by a vector of length 300 that captures the sense of the words they use.

– Term Frequency—Inverse Document Frequency (TF-IDF): TF-IDF is a widely used technique in information retrieval and text mining.
Also, in the area of fake news detection, some studies illustrate the effectiveness of TF-IDF by achieving good results in related tasks [3,33]. TF-IDF evaluates how relevant a word is to a document across a set of documents. Each word in each document has its respective TF-IDF score, which is the product of its term frequency and inverse document frequency. The higher the score, the more unique the word is to a given document among the whole collection of documents, and vice versa. Before calculating the TF-IDF scores for each user, we stem words with the Porter stemmer from the NLTK package [5] to reduce terms with the same root to a common base form. Finally, we consider each user’s timeline as a particular document and assign to it a vector of all words’ TF-IDF scores. Therefore, the resulting TF-IDF representation for each user is a vector containing the TF-IDF scores calculated based on the words in the user’s timeline and all other users’ timelines.

– Emotional Signals: Since some related studies show the effectiveness of emotions in fake news detection [11], tweet act classification [26], and author profiling [24] tasks, we decided to leverage emotional signals and use them as


another feature group. We utilize an affect intensity lexicon with real-valued association scores [27] to extract the emotional signals present in each user’s timeline. The lexicon includes between 12,000 and 1,500 entries for each of the following eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. Each entry consists of a term and its intensity score for the particular emotion. We assign eight emotional signal scores to each user by calculating the weighted average of the word intensity scores for all the words in the user’s timeline, where the weights are the word frequencies in the user’s timeline.

4.3 Model Description

Here we describe how our model exploits language-based features to estimate users’ tendency to spread fake news. For each user, we extract the three aforementioned language-based feature groups from their timeline. Then, we concatenate them into a single feature vector. Since the TF-IDF and pre-trained word embedding feature groups contain more than 10,000 and 300 dimensions, respectively, we utilize Principal Component Analysis (PCA) [17] to reduce each of them to 20 dimensions. We denote the obtained Language-Based Feature vector as LBF. Finally, for each user, we have a feature vector and a calculated TSFN score, which allow us to train our TSFN estimation model using a regression algorithm. Based on our experiments, we chose the random forest regressor to train the final model.
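The pipeline described above can be sketched as follows. The feature matrices are random stand-ins for the real per-user TF-IDF, embedding, and emotion blocks; in a faithful reproduction, PCA would be fitted on the training folds only to avoid leakage into the test folds.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_users = 200
tfidf = rng.random((n_users, 1000))   # stand-in for the >10,000-dim TF-IDF block
embed = rng.random((n_users, 300))    # stand-in for the 300-dim embedding block
emotions = rng.random((n_users, 8))   # eight affect-intensity scores
y = rng.random(n_users)               # TSFN scores in [0.0, 1.0]

# Reduce TF-IDF and embedding blocks to 20 principal components each, then
# concatenate with the 8 emotion scores into the LBF vector.
lbf = np.hstack([
    PCA(n_components=20).fit_transform(tfidf),
    PCA(n_components=20).fit_transform(embed),
    emotions,
])                                    # 20 + 20 + 8 = 48 dimensions per user

model = RandomForestRegressor(random_state=0).fit(lbf, y)
print(lbf.shape)                      # (200, 48)
```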

5 Experiments and Results

5.1 Experimental Setup

To perform all the train-test experiments, we utilize 10-fold cross-validation: we split the data into 10 folds of 500 users, consider one of the folds as the test data, use the rest to train the model, and then calculate the performance score. We repeat the process for each of the folds and finally report the average of the ten obtained performance scores as the model’s performance. For the implementation of all the regression algorithms, we used the open-source scikit-learn [20] framework in Python. For all our evaluations and analysis, we utilize the Mean Squared Error (MSE) metric. MSE shows how close a model’s predicted values are to the actual values in a regression task; the lower the MSE, the better the model performs. In our application, the value to be predicted by the model is the TSFN score. The actual TSFN score for a user is the one calculated based on the fake and real news they have already shared, and the model tries to predict this score based on the user’s social media timeline. Therefore, the error here is the difference between the actual TSFN score and the one predicted by the model. Mean squared error, as its name suggests and as shown in the


following formula, is calculated by taking the average of the squared errors. 2 1  Yi − Yi n i=1 n

M SE =

(2)

In this formula, n is the number of all the samples, Yi is the actual TSFN score of user i, and Yi is their predicted TSFN score. We chose to use MSE for our evaluations and comparisons since the square results in penalizing extreme errors and makes MSE a desirable discriminating measure for our application. To provide more intuitive insight into the prediction errors, we also calculate and report the Root Mean Squared Errors (RMSE) which is the square root of MSE and is in the same unit as the target variable (TSFN score). 5.2

Regression Algorithm Selection

To choose the best-performing regression algorithm, we apply different ordinary regression algorithms, i.e., Random Forest, Decision Trees, Linear Regression, and Epsilon-Support Vector Regression. Table 2 shows the prediction errors calculated when using different regression algorithms. The results show that random forest with the lowest Mean Squared Error (MSE) score of 0.0698 yields the best result. While linear regression and support vector regressor, obtained close MSE scores of 0.0725 and 0.0705, respectively, decision tree performs the worst with an MSE score of 0.1460. According to the obtained results, we chose random forest as the regression algorithm for our final model. Table 2. Prediction error comparison using different regression algorithms.

Regressor                | RMSE   | MSE
Random forest            | 0.2641 | 0.0698
Decision tree            | 0.3820 | 0.1460
Linear regression        | 0.2692 | 0.0725
Support vector regressor | 0.2655 | 0.0705

5.3 Evaluation and Analysis

While the proposed regression task is novel, we additionally implement several baseline models using different feature groups to put the results of our approach in context. Below is a brief description of the feature groups used to build the baseline models: – Explicit Features: Since our approach relies on users' language-based features, it is reasonable to compare it with a non-language-based

Estimating Users’ Tendency to Spread Fake News

391

approach. To do so, we use a non-language-based feature extraction method from the literature [29] to extract users' features and build the baseline model. Shu et al. [29] conducted a comparative analysis of characteristics of users who are more likely to spread fake news versus those who are more likely to spread real news. One of the feature groups they provide consists of the features directly available in the meta-data of each user, i.e., the number of their posts, favorites, followers, and followings, as well as whether the user is verified and the user's registration time. We consider these explicit features as a non-language-based feature group and use it to train a regression model that predicts users' TSFN scores.
– Individual Language-Based Feature Groups: As mentioned earlier, our final model uses a combination of three different language-based feature groups (LBF). Each feature group is obtained by a commonly used language-based feature extraction method. To compare the effectiveness of LBF against each of the three feature groups on its own, we build three models, each of which uses solely one of the feature groups already described: pre-trained word embeddings, TF-IDF, and emotional signals.

In response to RQ1, the results show that our proposed model can predict users' TSFN scores, representing their propensity to spread fake news, with a root mean squared error of 0.2641, which is in the same unit as the TSFN score. As Table 3 illustrates, our final model (LBF) performs significantly better than the model using non-language-based explicit features. The mean squared error of LBF is also lower than that of the models using each language-based feature group individually (pre-trained word embedding, TF-IDF, and emotional signals); however, this improvement is significant only over the emotional signals.
The table also reveals that all the individual language-based feature groups perform more effectively than the explicit features, especially TF-IDF and pre-trained embedding, which yield significant improvements in terms of MSE. These observations confirm that the information extracted from user-provided textual content is likely more informative for estimating users' propensity to spread fake news than what social media explicitly provides.

Table 3. Prediction errors of different models. The symbols ∗ and + indicate a statistically significant improvement over the explicit and emotions feature groups, respectively.

Feature group         | RMSE   | MSE
Explicit              | 0.3591 | 0.1290
Emotions              | 0.3345 | 0.1119
TF-IDF                | 0.2711 | 0.0735∗
Pre-trained embedding | 0.2718 | 0.0739∗
LBF                   | 0.2641 | 0.0698∗+
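As an illustration of one of the language-based feature groups, a minimal TF-IDF computation might look as follows. The tiny corpus and the particular smoothed-IDF variant are assumptions for illustration; the paper relies on a standard library implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists (e.g., one tokenized timeline per user).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    vocab = sorted(df)
    # Smoothed inverse document frequency: idf = ln((1+n)/(1+df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)            # raw term frequency
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

# Illustrative two-"timeline" corpus.
docs = [["fake", "news", "spreads"], ["real", "news"]]
vocab, vecs = tfidf(docs)
```

Each user timeline thus becomes a fixed-length numeric vector over the shared vocabulary, which is the form the regression model consumes.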


Fig. 3. Four-bins regression confusion matrix. The matrix, for each bin, represents the ratio of the number of samples predicted in each bin to the number of all samples in the bin.

In order to provide a more in-depth analysis of our model's performance in predicting TSFN scores of different ranges, we plotted the four-bins confusion matrix of the model (Fig. 3). As the confusion matrix suggests, the model performs better at predicting real news spreaders (TSFN score less than 0.25): 55% of them are correctly identified as real news spreaders and 36% are predicted as mild real news spreaders (TSFN score between 0.25 and 0.5). In contrast, the model correctly predicts only 35% of the fake news spreaders (TSFN score greater than 0.75); however, it places 45% of them in the adjacent bin of mild fake news spreaders (TSFN score between 0.5 and 0.75). The confusion matrix also demonstrates a low proportion of extreme errors: only 1% of fake news spreaders are identified as real news spreaders, and 2% of real news spreaders as fake news spreaders.
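A binned regression confusion matrix like the one in Fig. 3 can be obtained by discretizing both actual and predicted TSFN scores into four equal-width bins and row-normalizing the counts; the helper below is an illustrative sketch, not the authors' code.

```python
def bin_index(score, n_bins=4):
    # Map a score in [0, 1] to one of n_bins equal-width bins.
    return min(int(score * n_bins), n_bins - 1)

def binned_confusion(actual, predicted, n_bins=4):
    # Count (actual bin, predicted bin) pairs, then row-normalize so each
    # row shows the ratio of the bin's samples predicted into each bin.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, p in zip(actual, predicted):
        counts[bin_index(a, n_bins)][bin_index(p, n_bins)] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Illustrative scores only; the paper's matrix is computed on the real data.
actual = [0.1, 0.1, 0.6, 0.9]
predicted = [0.2, 0.4, 0.7, 0.6]
m = binned_confusion(actual, predicted)
```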

5.4 Feature Importance

In this section, we investigate the effectiveness of the different features used to estimate users' TSFN in our final model. According to the experiments described in Sect. 5.3 and by looking at Table 3, we can observe that, among the three


language-based feature groups, TF-IDF and pre-trained word embeddings are significantly more effective feature groups than emotional signals.

Fig. 4. Emotional signals ranking based on the Gini feature importance score.

Since, among the language-based features, the emotional signals have an interpretable nature, we further investigate the contribution of each emotional signal when the model uses the emotional signals alone. We compute the feature importance score of each emotion based on the mean decrease in impurity (MDI) [16]. MDI-based feature importance, also known as Gini importance, is calculated as a sum over all splits (across all trees) that use the feature, weighted proportionally to the number of samples each split affects. To compare the contribution of the different emotional signals in differentiating between degrees of tendency to spread fake news, Fig. 4 shows the Gini importance scores of the eight extracted emotional signals. The results show that "disgust", with a Gini importance score of 0.18, contributes most to estimating users' propensity to spread misinformation, while "anger" achieves the lowest feature importance score, 0.1, among the eight emotional signals.
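The MDI idea can be illustrated for a single split: a feature's contribution is the weighted decrease in node impurity its split achieves. The sketch below uses variance as the regression impurity; it is a didactic reduction of what random forest implementations accumulate over all splits and all trees to produce `feature_importances_`.

```python
def variance(values):
    # Impurity of a regression node: variance of its target values.
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def impurity_decrease(parent, left, right):
    # Weighted decrease in impurity produced by one split. Summing these
    # contributions per feature, over all splits and trees, yields MDI.
    n = len(parent)
    return (variance(parent)
            - (len(left) / n) * variance(left)
            - (len(right) / n) * variance(right))

# A perfect split of a bimodal node removes all of its impurity:
parent = [0.0, 0.0, 1.0, 1.0]
dec = impurity_decrease(parent, [0.0, 0.0], [1.0, 1.0])
```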

6 Discussion

Our study aimed to investigate the feasibility of estimating users' tendency to spread fake news by analyzing the textual content they post on social media. The results suggest that it is indeed possible to estimate this tendency with a reasonable degree of accuracy, as demonstrated by the root mean squared error of 0.26 obtained for the TSFN score, which ranges from 0 to 1. This finding addresses RQ1 and highlights the potential of using language-based features for


predicting the likelihood of users spreading fake news. Our further experiments indicate a low proportion of extreme errors in the estimations of the proposed model and show that it performs better at identifying users with a lower TSFN. To address RQ2, we conducted experiments comparing the influence of different features on predicting users' tendency to spread fake news. Our findings demonstrate that language-based features outperform explicit features in estimating this tendency. In addition, a feature importance analysis revealed that "disgust" was the emotional signal contributing most to distinguishing between users who spread fake news and those who spread real news. Among the language-based feature groups, the term frequency-inverse document frequency (TF-IDF) approach exhibited the highest performance, although combining all three feature groups produced even better results. Since users are the main focus of this study, ethical considerations are unavoidable. One concern relates to using information from users' tweets to estimate their tendency to spread fake news. In collecting tweets from users, we followed all Twitter API regulations. A system that estimates users' tendency to spread fake news must by no means be used to stigmatize users who have shared fake news in the past. We also want to highlight that such a system requires considering ethics at every step. However, a systematic definition and description of the ethical considerations is beyond the scope of this study.

7 Conclusions and Future Work

In this paper, we proposed the novel task of estimating the degree to which users tend to spread fake news. To this end, we introduced a metric called the TSFN score, which represents how likely users are to spread fake news. We also developed a supervised regression model that leverages language-based features extracted from users' writings on social media. Our experiments demonstrate the feasibility of estimating the extent of users' tendency to spread fake news using language-based features. Furthermore, our results show that language-based features outperform explicit features provided directly by the social media platform, indicating that linguistic features are essential for understanding users' tendency to spread fake news. For future work, we plan to investigate the application of our approach to the task of fake news detection. Specifically, we will explore how estimating users' tendency to spread fake news can be integrated into existing fake news detection methods to improve their accuracy and reliability. Additionally, we will investigate the impact of combining language-based features with explicit features on the task's performance.


References

1. News use across social media platforms in 2020 (2022)
2. Agarwal, R., Gupta, S., Chatterjee, N.: Profiling fake news spreaders on Twitter: a clickbait and linguistic feature based scheme. In: International Conference on Applications of Natural Language to Information Systems, pp. 345–357. Springer (2022)
3. Aragón, M.E., Jarquín-Vásquez, H.J., Montes-y-Gómez, M., Escalante, H.J., Pineda, L.V., Gómez-Adorno, H., Posadas-Durán, J.P., Bel-Enguix, G.: Overview of MEX-A3T at IberLEF 2020: fake news and aggressiveness analysis in Mexican Spanish. In: IberLEF@SEPLN, pp. 222–235 (2020)
4. Bai, S., Zhu, T., Cheng, L.: Big-five personality prediction based on user behaviors at social network sites (2012). arXiv-1204
5. Bird, S., Loper, E., Klein, E.: Natural Language Processing with Python. O'Reilly Media Inc. (2009)
6. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, WWW '11, pp. 675–684. Association for Computing Machinery (2011)
7. Tredici, M.D., Fernández, R.: Words are the window to the soul: language-based user representations for fake news detection. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 5467–5479, Barcelona, Spain (Online) (2020). International Committee on Computational Linguistics
8. Ghanem, B., Ponzetto, S.P., Rosso, P.: FacTweet: profiling fake news Twitter accounts. In: International Conference on Statistical Language and Speech Processing, pp. 35–45. Springer (2020)
9. Giachanou, A., Ghanem, B., Ríssola, E.A., Rosso, P., Crestani, F., Oberski, D.: The impact of psycholinguistic patterns in discriminating between fake news spreaders and fact checkers. Data Knowl. Eng. 138, 101960 (2022)
10. Giachanou, A., Ríssola, E.A., Ghanem, B., Crestani, F., Rosso, P.: The role of personality and linguistic patterns in discriminating between fake news spreaders and fact checkers. In: International Conference on Applications of Natural Language to Information Systems, pp. 181–192. Springer (2020)
11. Giachanou, A., Rosso, P., Crestani, F.: The impact of emotional signals on credibility assessment. J. Assoc. Inf. Sci. Technol. 1–16 (2021)
12. Heydari, A., ali Tavakoli, M., Salim, N., Heydari, Z.: Detection of review spam: a survey. Expert Syst. Appl. 42(7), 3634–3642 (2015)
13. Honnibal, M., Montani, I.: spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017)
14. Jwa, H., Oh, D., Park, K., Kang, J.M., Lim, H.: exBAKE: automatic fake news detection model based on bidirectional encoder representations from transformers (BERT). Appl. Sci. 9(19), 4062 (2019)
15. Kaliyar, R.K., Goswami, A., Narang, P.: FakeBERT: fake news detection in social media with a BERT-based deep learning approach. Multimedia Tools Appl. 80(8), 11765–11788 (2021)
16. Liaw, A., Wiener, M., et al.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
17. Minka, T.: Automatic choice of dimensionality for PCA. Adv. Neural Inf. Process. Syst. 13, 598–604 (2000)
18. Mu, Y., Aletras, N.: Identifying Twitter users who repost unreliable news sources with linguistic information. PeerJ Comput. Sci. 6, e325 (2020)
19. Oshikawa, R., Qian, J., Wang, W.Y.: A survey on natural language processing for fake news detection. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6086–6093 (2020)
20. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
21. Pérez-Rosas, V., Kleinberg, B., Lefevre, A., Mihalcea, R.: Automatic detection of fake news. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 3391–3401. Association for Computational Linguistics (2018)
22. Potthast, M., Kiesel, J., Reinartz, K., Bevendorff, J., Stein, B.: A stylometric inquiry into hyperpartisan and fake news (2017). arXiv:1702.05638
23. Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th author profiling task at PAN 2020: profiling fake news spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (2020)
24. Rangel, F., Rosso, P.: On the impact of emotions on author profiling. Inf. Process. Manag. 52(1), 73–92 (2016)
25. Ríssola, E.A., Bahrainian, S.A., Crestani, F.: Personality recognition in conversations using capsule neural networks. In: IEEE/WIC/ACM International Conference on Web Intelligence, pp. 180–187 (2019)
26. Saha, T., Upadhyaya, A., Saha, S., Bhattacharyya, P.: A multitask multimodal ensemble model for sentiment- and emotion-aided tweet act classification. IEEE Trans. Comput. Soc. Syst. 1–10 (2021)
27. Mohammad, S.M.: Word affect intensities. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA) (2018)
28. Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H.: FakeNewsNet: a data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data 8(3), 171–188 (2020)
29. Shu, K., Zhou, X., Wang, S., Zafarani, R., Liu, H.: The role of user profiles for fake news detection. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 436–439 (2019)
30. Varma, R., Verma, Y., Vijayvargiya, P., Churi, P.P.: A systematic survey on deep learning and machine learning approaches of fake news detection in the pre- and post-COVID-19 pandemic. Int. J. Intell. Comput. Cybern. (2021)
31. Vo, N., Lee, K.: Learning from fact-checkers: analysis and generation of fact-checking language. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, pp. 335–344 (2019)
32. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018)
33. Zaizar-Gutiérrez, D., Fajardo-Delgado, D., Carmona, M.A.A.: ITCG's participation at MEX-A3T 2020: aggressive identification and fake news detection based on textual features for Mexican Spanish. In: IberLEF@SEPLN, pp. 258–264 (2020)
34. Zhang, J., Cui, L., Fu, Y., Gouza, F.B.: Fake news detection with deep diffusive network model (2018). arXiv:1805.08751
35. Zhou, X., Zafarani, R.: Network-based fake news detection: a pattern-driven approach. ACM SIGKDD Explor. Newsl. 21(2), 48–60 (2019)
36. Zhou, X., Zafarani, R.: A survey of fake news: fundamental theories, detection methods, and opportunities. ACM Comput. Surv. (CSUR) 53(5), 1–40 (2020)

Analyzing Apache Storm as Core for an Event Processing Network Model

Christin Schulze1, Christian Gerner1, Matthias Tyca1, Arne Koschel1, Anna Pakosch1(B), and Irina Astrova2

1 Hochschule Hannover, University of Applied Sciences and Arts, Hannover, Germany
[email protected]
2 Department of Software Science, School of IT, Tallinn University of Technology, Tallinn, Estonia
[email protected]

Abstract. In this article, we apply a proposed list of generalized requirements for a unified modeling of event processing networks (EPNs) to Apache Storm. This enhances our previous work in the area, in which we analyzed the EPiA model, the BEMN model, and the RuleCore model. Our proposed requirements address both the logical model of EPNs and their concrete technical implementation. This requires a model that can describe the 'form' of events on a conceptual level, the steps of event processing and creation by means of a defined event hierarchy, and the form of inter-event relationships, such as temporal and causal relations. Furthermore, the model needs to interrelate these model-specific traits with the technical side of EPNs, such as platform implementation and design specifics. To this end, our paper elaborates general requirements for EPN models based on attributes derived from event processing in general as well as from existing models. Finally, as its core contribution, the paper applies those requirements by analyzing Apache Storm as a particular implementation foundation for an EPN model.

Keywords: Event Processing Network (EPN) · Event processing network model · Apache Storm

1 Introduction

Intelligent data management and processing has changed: today every company collects large amounts of data from various sources, commonly called 'Big Data'. It is no longer sufficient to store data in relational databases, log files, or events separately. The information in the data, especially when different sources are combined, is important for the competitiveness of enterprises. Batch processing [18] is an established approach for processing Big Data. At its core, data is collected and processed 'in batches': data is accumulated for a certain period of time before being processed.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 397–410, 2024. https://doi.org/10.1007/978-3-031-47715-7_27

The drawback is that no


real-time processing is possible: data is first collected for some time before processing takes place. More recently, (event) stream processing [18] joined the field, an approach that processes data directly after it is generated. Through this near real-time processing, an action can be taken immediately after processing, so enterprises can react faster to changes. For implementing stream processing, a modeling technique called (complex) event processing networks found its way into practice. This approach provides a guideline, including components and requirements, for how such stream processing should be realized. Along with the rise of stream processing, a variety of tools were developed to model and implement EPNs. For this reason, we contribute an evaluation of different tools that support a largely automated realization and address the following questions:
– In a nutshell, what are EPNs?
– Which requirements exist for EPN models, and how can they be stated in a standardized form?
– How important is each requirement?
– A look at modern available tools:
  • How do Apache Storm, Amazon Kinesis Data Analytics and Microsoft Azure Stream Analytics work?
  • Do these tools realize EPNs?
  • Do these tools fulfill our EPN model requirements?
– Which tool should be chosen, and when?

In the present article, we provide two major contributions: First, we provide our list of generalized EPN model requirements, more formally structured than in our work from [12]. Second, we provide an in-depth evaluation of Apache Storm with respect to our EPN model requirements and briefly compare it to Amazon Kinesis Data Analytics. Future work of ours will provide in-depth evaluations of Amazon Kinesis Data Analytics and Microsoft Azure Stream Analytics against these requirements as well. The remainder of this article is structured as follows: After discussing related work in Sect. 2, we give a brief introduction to the topic and present our EPN model requirements in Sect. 3.
Next, we take an in-depth look at Apache Storm in Sect. 4 and briefly compare it to Amazon Kinesis Data Analytics in Sect. 5. Finally, Sect. 6 summarizes the results and concludes.

2 Related Work

Our project builds on authors in the scope of EPN and Complex Event Processing (CEP), such as the work of Dunkel and Bruns [8,10]. We also use foundations from our earlier work on EPNs, namely [7,12,13]. In those articles we established the requirements for EPNs more informally and applied them


to different EPN modeling approaches and tools (EPiA, BEMN, RuleCore). With the present paper, we extend our work with slightly refined and more formally structured requirements as well as a deep look at significantly more recent tools, here in particular Apache Storm. Compared to our earlier work, we cast the requirements into a template from [17]; that is, we somewhat formalize them in the present article. We decided to use a template to define the requirements in a standardized form and to express their importance. To ensure the quality of the requirements, we validated them against the quality criteria of the IEEE 830-1998 standard [1]. A newer standard exists, ISO/IEC/IEEE 29148-2011, which describes the quality criteria of IEEE 830-1998 in more condensed form. We still meet all the quality criteria of both versions, except for singularity, which is new in ISO/IEC/IEEE 29148-2011. For the description and evaluation of the different tools, we used the publishers' documentation: the Apache Software Foundation offers extensive documentation for Apache Storm in [3], Amazon Web Services provides information on Amazon Kinesis Data Analytics in [6], and Microsoft gives an introduction to Microsoft Azure Stream Analytics in [15]. We distinguish ourselves from other publications by standardizing and validating the requirements for EPNs and by evaluating various tools with different open- or closed-source characteristics, effort, and costs. With this variety, we aim to give an overview of different tools and to support the decision for a suitable tool.

3 Event Processing Networks and Requirements

This section briefly presents the basics of EPNs and the requirements for such systems. The introduction is short; the reader is assumed to be familiar with the subject. An event is 'a significant change of state' or 'a happening of interest at a certain point in time in a certain location' [14]. Event Processing Networks (EPNs) can be seen as generalized software systems that allow for the processing of events. However, EPN models lack standardization, which is where our work from [12] aims to help.

3.1 Basics of Event Processing Networks

EPNs are built on the basis of Event-Driven Architecture (EDA) and CEP. Both of these approaches

[...] represent a new style of enterprise applications that places events at the center of the software architecture - event orientation as an architectural style. — J. Dunkel and R. Bruns [8, p. 4]

In this context, EDA is more about the design of event-driven architectures as a design style, whereas CEP describes a technology for the dynamic processing of large


datasets [8]. Thus, CEP is a part of an EDA that can be used for processing data within it. In detail, CEP describes the dynamic processing of large data streams (also called event streams) in real-time. An event is any happening in the system; it represents the change of state of a fact or an object [8]. The processing of events within a CEP system is realized using rules, which contain knowledge about handling events or event sequences [10]. For the realization of these rules and the processing of the data, CEP relies on Event Processing Agents (EPAs). EPAs can process event streams and recognize patterns; they process the data according to the defined rules. An EPN is a set of interconnected EPAs that exchange information during and about the processing of the data [10]. An EPN can be interpreted as a graphical tool for modeling the flow of events in event processing systems [12]. Thus, the main components of an EPN are EPAs, which perform the CEP. EPAs contain various components, described as follows [10]:
– Event Model: This set of rules defines the possible forms and structures of events that can occur.
– (Event) Rules: Based on the Event Model, (Event) Rules are defined to process incoming data. Here, the knowledge of how to deal with which information is implemented.
– Event Processing Engine: Pattern recognition takes place in the engine. The incoming data stream is continuously checked for patterns; based on the detected patterns, (Event) Rules are applied for processing.

We presented the basics of EPNs here only in a rudimentary way, to make clear why and how the tools are evaluated. Other components, such as producers, are also elements of EPNs and can be taken from [10,12]. The next part explains the requirements for EPNs.

3.2 Requirements

To evaluate the selected tools, we followed the requirements from [12]. First, we evaluated a standardized form [17] and quality criteria [1] from requirements engineering for the given requirements. Afterwards, we put these requirements into a standardized form, described in the following parts.

Handling the Requirements
The given requirements are based on the work of [12]. We have standardized them into formal requirements by applying [1,17]. The reason for this approach was that the previous requirements were described as bullet points, had no formal structure, and were partially ambiguous. To address these issues, we started by evaluating various requirement templates, as they address issues such as writeability, readability and learnability and are commonly used [16]. We chose [17] because it provides a straightforward structure for requirements. The unified structures are given as follows:

Analyzing Apache Storm as Core for an Event Processing Network Model

401

/* System activity */
The Tool {shall | should | will | can} provide

/* User interaction */
The Tool {shall | should | will | can} provide

Description of the verbs used within the template:
– shall: legally binding
– should: strongly recommended
– will: future
– can: nice to have

In addition to the above structure, we apply the quality criteria of IEEE 830-1998 [1] to achieve requirements of high quality in structure and content. Specification of the quality criteria according to [1]:
– correct: Only requirements that the software shall meet.
– unambiguous: Each requirement has only one interpretation.
– complete: All significant requirements, definitions of behavior, and all labels and references.
– consistent: The requirement is free of conflicts with other requirements.
– ranked for importance and/or stability: Requirements are scored or classified.
– verifiable: Compliance of the tool with the specified requirements can be proven.
– modifiable: Changes to requirements can be made easily, completely, and consistently.
– traceable: The origin of a requirement is clear and all references are documented.

The given requirements are formulated according to the template and fulfill all quality criteria. Requirement templates achieve writeability, readability, and learnability and are therefore efficient; this also satisfies the modifiability criterion of IEEE 830-1998. Correctness, unambiguousness, and completeness are achieved by splitting, expanding, and substituting specialist terms. The requirements were checked to be consistent and verifiable. Furthermore, the requirements are ranked by importance through the verbs used in the template structure. The requirements are traceable because their origin and references are documented and identifiers are given.


3.2.1 The Requirements
We have split the last requirement from [12], 'Further possibilities to both create the model in an easy way and use the created model on-the-fly.', into EPNR8 and EPNR9, as it was an extensive requirement. All requirements (EPNR1-EPNR9) are standardized forms of [12] obtained by applying [17], without changing the semantics, and each requirement is provided with an identifier. The standardized requirements are as follows:
– EPNR1: The tool shall offer the developer to model events with their inherent attributes as the central component of the engine.
– EPNR2: The tool shall map real-world descriptions to events as scenarios.
– EPNR3: The tool shall offer event structures as simple, complex or aggregated. Simple events can be created and used independently. Complex events have dependencies on and references to other events. Aggregated events can be grouped logically.
– EPNR4: The tool shall offer possibilities to express the relativity of events and their temporal and causal relationships, e.g., sequence, preconditions and postconditions.
– EPNR5: The tool shall process and show the flow of events through the system.
– EPNR6: The tool shall offer the modeling of EPNs by components, their properties and used patterns.
– EPNR7: The tool shall offer the modeling of components outside the system boundary and the behavior between inside and outside components.
– EPNR8: The tool should be expressive in usage, regarding readability, writability, learnability and efficiency.
– EPNR9: The tool should offer the developer further possibilities to create the model, e.g., an IDE or graphical event programming.

The requirements are ranked by importance corresponding to their number in descending order, with 1 being the most important and 9 the least important. There will be no full usability discussion for EPNR8, as that would exceed the scope of this article. Instead, EPNR8 is considered from a developer's perspective, according to aspects such as:
– Documentation: existence of maintained and understandable documentation;
– Comments: any comments about the tool;
– Git repositories: usage of the tool in projects;
– Tutorials: availability of tutorials;
– Community acceptance and support.

Similarly, this also applies to EPNR9. Thus, our requirements are now standardized and validated. In the next section, the first tool, namely Apache Storm [3], is presented and evaluated in depth. In future work, we will evaluate other EPN tools and models in depth as well.

4 Apache Storm

In this section, Apache Storm is examined and evaluated. It was chosen by the authors for its open-source character.

4.1 Overview of Apache Storm

Apache Storm ([2–5,9,11,18]) is an open-source computing system for real-time data analytics. It can reliably process unbounded data streams. The Apache Software Foundation reports a benchmark that processed over one million tuples per second per node, and further asserts that the system is scalable, fault-tolerant, guarantees data processing, and is easy to use [3]. Companies like Twitter or Meta use Apache Storm to monitor their data streams and determine the relevance of individual posts. The system integrates well with other systems and fits well into Apache Hadoop environments [11]. In contrast to Apache Hadoop, which performs batch processing, Apache Storm performs real-time processing: Apache Hadoop processes a large batch of previously collected data [18], whereas Apache Storm continuously processes accumulating data in real-time. According to [2], Apache Storm is also a distributed application that can be used for CEP. To realize this, the application introduces a set of abstractions and technical concepts, defined as follows.

Topologies
A Storm topology is the logic of real-time data processing, shown in Fig. 1. It works analogously to a MapReduce job, except that it never ends (unless the topology is terminated). A topology consists of spouts and bolts connected by stream groupings.

Streams
The stream is the central abstraction in Apache Storm. A stream consists of an unbounded sequence of tuples that are created and processed either individually or in parallel in a distributed manner.

Tuple
A tuple is the main data structure in a topology. The data to be processed is located and structured in tuples; tuples contain lists of values [5].

Spouts
The sources of streams in a topology are called spouts. Spouts read tuples from external sources and transfer them to the topology.
In general, they can be reliable, in which case the spouts store the tuple even after passing it to the topology, or unreliable, in which case they forget the tuple after transferring it to the topology. The core task is to map data to the correct bolt.
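The concepts above can be summarized in a small conceptual sketch. Note this is not the Apache Storm API (real Storm topologies are defined in Java); it only illustrates how spouts emit tuples that flow through a chain of bolts, with all names and data hypothetical:

```python
# A minimal, language-agnostic sketch of the spout -> bolt dataflow described
# above. NOT the Apache Storm API; purely conceptual illustration.

class Spout:
    """Source of a stream: reads from an external source, emits tuples."""
    def __init__(self, records):
        self.records = records

    def next_tuple(self):
        for record in self.records:
            yield record  # a "tuple" is just a structured value

class Bolt:
    """Processing step: filter, transform, or aggregate incoming tuples."""
    def __init__(self, fn):
        self.fn = fn

    def execute(self, tup):
        return self.fn(tup)

def run_topology(spout, bolts):
    """Push every tuple through a linear chain of bolts (a simple 'topology')."""
    out = []
    for tup in spout.next_tuple():
        for bolt in bolts:
            tup = bolt.execute(tup)
            if tup is None:      # a filtering bolt may drop the tuple
                break
        else:
            out.append(tup)
    return out

# Example: drop empty posts, then annotate each post with its length.
spout = Spout(["storm", "", "tuples"])
filter_bolt = Bolt(lambda t: t if t else None)   # filtering bolt
length_bolt = Bolt(lambda t: (t, len(t)))        # transforming bolt
print(run_topology(spout, [filter_bolt, length_bolt]))
# -> [('storm', 5), ('tuples', 6)]
```

A real topology additionally connects spouts and bolts via stream groupings and runs them distributed across worker processes, as described below.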


C. Schulze et al.

Fig. 1. Logical topology of Apache Storm (based on [3]).

Bolts: Data and event processing takes place in the bolts. The capabilities of bolts are various: they can filter, perform operations, execute aggregations or joins, and connect and communicate with databases. Moreover, they can transform streams in simple, complex or multiple ways, also together with other bolts. Thus, the core task of the bolts is to process and reduce the data.

Stream Groupings: Part of the topology is to specify which bolt should process which stream. A stream grouping is then used to split a stream onto the tasks that execute a bolt. Apache Storm provides 8 different stream groupings, defined as follows:

Analyzing Apache Storm as Core for an Event Processing Network Model


– Shuffle grouping—random distribution;
– Fields grouping—distribution by a defined partitioning, e.g., user-id;
– Partial key grouping—like the fields grouping, but adapting dynamically to the load of the stream and the partition;
– All grouping—replicated to all tasks of the bolt;
– Global grouping—processed by a single one of the bolt's tasks;
– None grouping—not defined;
– Direct grouping—the producer of the tuple decides;
– Local or shuffle grouping—tuples are routed to in-process tasks if the target bolt has one or more tasks in the same worker process; otherwise, this operates like shuffle grouping.

Reliability: Storm guarantees that each tuple received by a spout is processed in the topology. Therefore, functions are available such as a timeout or repeating failed tuples at a later time.

Task: The execution of a spout or bolt is called a task. Each task has a defined execution thread. Stream grouping determines how groups of tuples are processed.

Worker: The execution of a topology is done by one or more workers. Each worker runs on a physical unit and processes a subset of the tasks in the topology. A worker can be divided across several nodes.

In addition to these concepts, clusters are a key implementation in Apache Storm for the technical realization. A cluster in Apache Storm handles one or more workers in one or more topologies. Within a cluster, as many processes are executed as the physical machine can handle. Each cluster has one Nimbus node (leader node) and one or more Supervisor nodes (follower nodes) [4]. The Nimbus is the central component of the cluster. It monitors the topology, uploads computations for execution and distributes tasks to Supervisors. A Supervisor runs one or more worker processes and delegates tasks to them. An executor is an execution thread within a worker process that performs tasks. The structure of an Apache Storm cluster is shown in Fig. 2.
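The routing behind two of these groupings can be sketched in a few lines. This is not Storm code, only an illustration of the partitioning idea: shuffle grouping spreads tuples randomly across a bolt's tasks, while fields grouping hashes a key field so that all tuples with the same key reach the same task (task count and data are hypothetical):

```python
# Illustrative sketch of shuffle vs. fields grouping, assuming a bolt that
# runs as several parallel tasks. Not the Apache Storm API.

import random

def shuffle_grouping(tup, num_tasks):
    """Random distribution across the bolt's tasks."""
    return random.randrange(num_tasks)

def fields_grouping(tup, num_tasks, field):
    """Deterministic partitioning by one field, e.g., a user id."""
    return hash(tup[field]) % num_tasks

# Hypothetical tuples keyed by user id.
tuples = [{"user": "alice", "n": 1}, {"user": "bob", "n": 2},
          {"user": "alice", "n": 3}]

# With fields grouping on "user", both "alice" tuples land on the same task,
# so a per-user aggregation in that task sees the complete per-user stream.
tasks = [fields_grouping(t, 4, "user") for t in tuples]
print(tasks[0] == tasks[2])  # -> True
```

This deterministic co-location is what makes aggregations such as per-user counts possible in a fields-grouped bolt.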
In some models, ZooKeeper nodes are described between the Nimbus node and the Supervisor nodes; these are responsible only for coordination and communication between Nimbus and Supervisors. Apache Storm splits into a logical topology and a technical implementation. The logical topology represents the structure and logical processing steps; the technical implementation by the cluster represents the realization of the nodes and the physical execution of the processing steps. The actual mapping of spouts and bolts (see Fig. 1) lies in the tasks of the individual executors, which are located in the worker processes (see Fig. 2). Apache Storm is written in Java and Clojure and runs in a Java Virtual Machine (JVM). The topology must be written in Java. The language of the



spouts and bolts can be chosen arbitrarily [9]. After deciding to use Apache Storm, developers only need to define the logical components, namely the topology, spouts and bolts, and their features. The various nodes within the cluster are provisioned in an automated manner. In addition to this general overview of Apache Storm, the following part provides an evaluation of the tool with respect to our set of requirements.

4.2 Evaluation of Apache Storm for EPN Modeling

Based upon our standardized set of requirements from Sect. 3.2.1, this part argues which requirements are fulfilled, partially fulfilled or not fulfilled by Apache Storm.

– EPNR1—fulfilled: Events are abstracted by tuples. Tuples contain structured lists of attributes which can take any primitive or complex value.
– EPNR2—fulfilled: Due to the freely selectable structure of the tuples, things of the real world can be described and mapped in scenarios.
– EPNR3—fulfilled: The tool offers different stream groupings, allowing tuples to be routed and processed in different ways. An aggregation can be realized by this grouping and by the implemented knowledge of the spouts, which contain the logic for how tuples should be routed and reduced.
– EPNR4—fulfilled: Through the processing steps of an arbitrary number of spouts and bolts, causal relationships and dependencies can be mapped. In particular, the function of the bolts can be implemented arbitrarily, which brings freedom but can also increase complexity.
– EPNR5—fulfilled: Apache Storm works with acknowledgements, so it guarantees that a tuple has been processed in the topology. The flow is implicitly mapped by defining the processing steps of spouts and bolts.
– EPNR6—fulfilled: The tool provides topologies to process data/events, including spouts for mapping and routing and bolts for processing and reducing data.
– EPNR7—fulfilled: Spouts provide interfaces for using the system and topology. They receive data from external sources and provide it for processing. Streams with groupings are defined between spouts and bolts. Bolts process the data and pass it to another bolt for a further processing step, or output the processed data to a target system or database (see Fig. 1).



Fig. 2. Apache Storm cluster architecture (based on [4]).

– EPNR8—fulfilled: The Apache Software Foundation provides many tutorials and examples for using Apache Storm; a detailed introduction and tutorial is given in [4]. The scope of the programming is limited to the implementation of the topology with the spouts, the bolts and the stream groupings. All other tasks are automated by Apache Storm. Bolts can contain arbitrary logic. This brings freedom, allowing tasks to be solved arbitrarily, but also higher complexity and a higher entry barrier, since there are no sample solutions.
– EPNR9—fulfilled: The Apache Storm Quick Guide provides step-by-step instructions for using Apache Storm and offers an initial example [4]. The tool also provides a user interface to monitor and view topologies.

In conclusion, Apache Storm fulfills our standardized set of requirements, is a suitable tool for CEP, can represent things of the real world and guarantees the processing of data. Getting started with the tool can be difficult, but the Apache Software Foundation eases this with the Quick Guide. However, the technical infrastructure, e.g., servers, must be provided and maintained by the users themselves. In particular, physical or virtual machines for Storm must be provided and maintained by the developers (or their dedicated Storm administrators) themselves.


Table 1. Tools criteria and characteristics.

Criterion               | Apache Storm                                    | Amazon Kinesis Data Analytics
Basis                   | MapReduce                                       | MapReduce
Support                 | No support                                      | Platform-as-a-Service (PaaS)
Costs                   | No costs                                        | Costs based on usage
Effort                  | High                                            | Low
Event format            | Tuple                                           | Tuple
Environment             | JVM                                             | AWS
Language                | Topology in Java, others arbitrary              | Java, Scala, Python, SQL
Distribution            | Thread-based                                    | Thread-based in Apache Flink
Maximum processing rate | Over one million tuples per second and per node | Real-time
Reliability             | Guaranteed by spouts                            | Guaranteed by AWS
Data protection         | Own realization                                 | Server location can be set
Security                | Own realization                                 | Provided by AWS

5 Comparison

In Sect. 4 of this article, we analyzed Apache Storm in depth with respect to our standardized set of requirements from Sect. 3.2.1. As stated there, Apache Storm fulfills our requirements well and thus is well suited for the realization of EPNs.

The information presented on Apache Storm (as well as on other tools) was mainly taken from the documentation of the publishers or developers of the tools. For this reason, some information may be presented subjectively, as companies naturally want to distribute their tool widely. Moreover, Amazon and Microsoft Azure offer paid tools, so an advertising factor within the documentation cannot be ruled out. Furthermore, information may be incomplete because companies want to keep their implementations private.

To provide more distinctive criteria relative to other tools, we took a more developer-oriented perspective. The result is summarized in Table 1, which briefly compares two of our tools under evaluation, namely Apache Storm and Amazon Kinesis Data Analytics. Developers may use this table to identify the most important criteria that argue for or against a tool. In particular, open source nature, price, convenience, and potential vendor lock-in are distinctive factors. While Storm is open source and free of charge, Kinesis is not. However, Storm has to be maintained by the users themselves, while Kinesis is hosted and maintained by Amazon and thus possibly easier to use. Thus, there is no clear winner between the two tools; the choice is rather a question of individual developer taste and skills as well as company preferences. For example, if a company prefers open source tools and has good development and hosting skills in house, then Apache Storm seems more suitable. If a company is an AWS shop anyway, likely wants



less maintenance effort and is able to pay the costs for Kinesis, then Kinesis could be more favorable.

6 Conclusion

In this article, we took a deep look at Event Processing Network Models as a foundation of Event Stream Processing tools. As one key contribution, we provided our enhanced (compared to our previous work) standardized set of requirements for EPN models in Sect. 3.2.1. As the second contribution, we applied those requirements in an in-depth look at Apache Storm in Sect. 4. It turns out that Storm is a well-suited tool for the modeling and implementation of EPNs.

Comparisons with other tools may require further criteria as well. In Sect. 5 we briefly compared Apache Storm to Amazon Kinesis Data Analytics. The suitability depends on individual circumstances, such as a preference for open source tools, but also on how high the own development and administration effort may be. If minimal effort is required, Amazon Kinesis Data Analytics is recommended. The tool provides complete support: developers only need to upload their own code, while Amazon Kinesis Data Analytics handles the distribution, required resources and monitoring. However, Amazon Kinesis Data Analytics is clearly not free of charge; a significant part of its cost results from measured traffic and usage. In comparison, Apache Storm is a free, open source tool. However, this tool also requires the highest effort: developers have to configure and manage everything by themselves. In addition to all the functions and the logical distribution, the physical distribution itself must also be planned and configured. Although Apache Storm handles the distribution of logical tasks to the physical environment, the physical infrastructure must be provided and administered by developers. Therefore, the decision for a suitable tool is based on the effort, the control and the costs involved. For these reasons, no absolute recommendation can be made; rather, the authors recommend examining each individual use case, or at least a set of typical ones, in order to select the ideal tool.

In future work, we will also provide in-depth evaluations of Amazon Kinesis Data Analytics and Microsoft Azure Stream Analytics regarding our EPN model requirements.

References

1. IEEE Recommended Practice for Software Requirements Specifications. IEEE Std 830-1998, pp. 1–40 (1998)
2. Apache Storm Concepts: Apache Software Foundation. https://storm.apache.org/releases/current/Concepts. Accessed 04/2022
3. Apache Storm: Apache Software Foundation (2021). https://storm.apache.org/. Accessed 04/2022
4. Apache Storm Tutorial: tutorialspoint (2021). https://www.tutorialspoint.com/apache_storm/index.htm. Accessed 04/2022
5. Interface Tuple: Apache Software Foundation (2021). https://storm.apache.org/releases/current/javadocs/org/apache/storm/tuple/Tuple.html. Accessed 04/2022
6. Streaming Data Solutions on AWS: Amazon Web Services Inc. (2021). https://docs.aws.amazon.com/whitepapers/latest/streaming-data-solutions-amazon-kinesis/welcome.html. Accessed 04/2022
7. Astrova, I., Koschel, A., Kobert, S., Naumann, J., Ruhe, T., Starodubtsev, O.: Evaluating rulecore as event processing network model. In: Proceedings of 15th International Conference on Web Information Systems and Technologies (WEBIST 2019), pp. 297–300 (2019)
8. Bruns, R., Dunkel, J.: Event-Driven Architecture - Softwarearchitektur für ereignisgesteuerte Geschäftsprozesse (Software architecture for event-driven business processes) (2010)
9. Drilling, T., Litzel, N.: Apache Storm ist reif für den Produktiveinsatz (Apache Storm is ready for productive use). BigData Insider (2014). https://www.bigdata-insider.de/apache-storm-ist-reif-fuer-den-produktiveinsatz-a-460623/. Accessed 04/2022
10. Dunkel, J., Bruns, R.: Complex Event Processing - Komplexe Analyse von massiven Datenströmen mit CEP (Complex analysis of massive data streams with CEP) (2015)
11. Joos, T., Litzel, N.: So analysieren Sie große Datenmengen mit Apache Storm (How to analyze large amounts of data with Apache Storm). BigData Insider (2021). https://www.bigdata-insider.de/so-analysieren-sie-grosse-datenmengen-mit-apache-storm-a-668119/. Accessed 04/2022
12. Koschel, A., Astrova, I., Kobert, S., Naumann, J., Ruhe, T., Starodubtsev, O.: Towards requirements for event processing network models. In: Proceedings of 8th International Conference on Information, Intelligence, Systems, Applications (IISA 2017), pp. 27–30 (2017)
13. Koschel, A., Astrova, I., Kobert, S., Naumann, J., Ruhe, T., Starodubtsev, O.: On requirements for event processing network models using business event modeling notation. In: Proceedings of 2018 Conference on Intelligent Computing. Advances in Intelligent Systems and Computing (SAI 2018), pp. 756–762 (2018)
14. Luckham, D.: The Power of Events. Addison Wesley, USA (2002)
15. Microsoft: Introduction to Azure Stream Analytics. Microsoft Documentation (2021). https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction. Accessed 04/2022
16. Robertson, S., Robertson, J.: Mastering the Requirements Process: Getting Requirements Right (2012)
17. Rupp, C., Pohl, R.: Basiswissen Requirements Engineering (Basic knowledge Requirements Engineering) (2021)
18. Shaikh, T.: Batch Processing - Hadoop Ecosystem. K2 Data Science and Engineering (2019). https://blog.k2datascience.com/batch-processing-hadoop-ecosystem-f6da88f11cae. Accessed 04/2022

Designing Against Bias: Identifying and Mitigating Bias in Machine Learning and AI

David J. Corliss
Peace-Work, Plymouth, MI 40170, USA
[email protected]

Abstract. Bias in machine learning algorithms is one of the most important ethical and operational issues in statistical practice today. This paper describes common sources of bias and how to develop study designs to measure and minimize it. Analysis of disparate impact is used to quantify bias in existing and new applications. New open-source packages such as Fairlearn and the AI Fairness 360 Toolkit quantify bias by automating the measurement of disparate impact on marginalized groups, offering great promise to advance the mitigation of bias. These design strategies are described in detail with examples. Also, a comparison algorithm can be developed that is designed to be fully transparent and without features subject to bias. Comparison to this bias-minimized model can identify areas of bias in other algorithms.

Keywords: Bias mitigation · Machine learning · AI · Disparate impact · Fairlearn

1 Introduction: Bias in Machine Learning and AI

1.1 Overview

One of the driving purposes of developing artificial intelligence and machine learning algorithms was to improve the fairness of processes that had previously relied on human judgment. The idea was that, by taking the human element out of a decision process, the process would become more fair. Unfortunately, experience has proven that this often has not worked out: machine learning and AI have often failed their promise of producing more fair processes. This is a motivating factor in investigating these weaknesses and failures of machine learning and AI, identifying root causes, and developing mitigation strategies and tools to minimize their impact.

1.2 Important Examples

One well-known example of algorithm failure is the COMPAS algorithm used by the criminal justice systems in several states [1].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 411–418, 2024. https://doi.org/10.1007/978-3-031-47715-7_28

The Correctional Offender Management



Profiling for Alternative Sanctions (COMPAS) is an algorithm predicting risk of recidivism, developed by the justice tech company Northpointe and now owned by Equivant. The algorithm has been used to inform decisions to permit or deny bail, to set bail amounts, in sentencing, and by parole boards. In 2016, an investigation by ProPublica [2, 3] found persons of color were much more likely than whites to receive a false positive High Risk classification. At the same time, whites were found to have a much higher likelihood of a false negative than blacks. Despite serious and well-documented failures, algorithms of this type with bias issues continue to be used [4].

Another example is the use of algorithms to screen resumes, where gender bias has sometimes been found. In 2018, the news agency Reuters reported that the tech company Amazon chose not to deploy a resume screening algorithm it had developed after finding evidence of gender bias. As reported by Reuters [5], the company had planned to eliminate problems such as gender bias and other concerns by implementing an algorithm and hiring the very highest scores without further review. Instead, the algorithm was found to replicate the very bias it was intended to address.

2 Types of Bias and Their Root Causes

2.1 Overview

2.1.1 Confirmation Bias

This familiar form of bias results from the human tendency to acquire and retain information, analyze data, and develop explanations that confirm preexisting beliefs [6]. Confirmation bias is a well-established phenomenon in the fields of psychology and cognitive science, and has been extensively studied and documented in a wide range of domains. The phenomenon is so robust that it has been observed across diverse cultures, ages, and educational levels. It is thought to be driven by a number of factors, including cognitive dissonance, motivated reasoning, and the need for consistency [7]. In data science, confirmation bias can adversely impact thoroughness in considering and investigating issues of algorithm performance.

2.2 Sampling Bias

Sampling bias results from errors in the manner in which data are selected from a population, especially in the case of data used to train algorithms. This type of bias produces a sample that is not representative of the population from which it is drawn. This can occur when the sample is not selected randomly or when certain groups within the population are underrepresented or overrepresented in the sample. As a result, the sample may not accurately reflect the characteristics or opinions of the population as a whole, which can lead to inaccurate conclusions and generalizations about the population [8].

A biased, unrepresentative sample can occur in several different ways. In some cases, study samples are taken from all people who volunteer to provide data. This problem, called Convenience Sampling, is especially common in social media surveys [9]. Another



common sampling error is a sample evenly representative of the population as a whole, with the result that small subsets of the population lack sufficient examples to train the algorithm accurately. An example of this failure to over-sample small sub-populations can be found in some voice recognition systems. In some cases, these algorithms are found to recognize a voice and correctly interpret speech more accurately for certain demographic segments than for others. If the samples of people used to train the algorithm are not oversampled for small subsets of the population, poor algorithm performance can result. An example of this is described by Najibi [10] and a team from Harvard University, which evaluated the accuracy of facial recognition technologies from several leading companies. The study found facial recognition for all the products was most accurate for lighter-skinned males and least accurate for darker-skinned females. This was the result of not properly oversampling people with darker skin, who constitute a small portion of the US population, in the training set.

2.3 The History Problem

The history problem results from training an algorithm where the labeling of the data is taken from previous human decisions. While the intention may have been to develop an algorithm to reduce bias in a human, subjective process, use of labeled data from previous biased human decisions merely trains the algorithm to replicate the human bias. In cases such as these, the term "Prejudice" is often used in the literature (e.g. [11]), as the problem often results from personal prejudices of the people who labeled the historical data included in the training set for the new algorithm. The failures of both the COMPAS recidivism algorithm and the Amazon resume screening system are examples of this. However, the problem can also result from causes unrelated to prejudice—for example, in an algorithm for detecting quality variations in manufacturing.
Consequently, the author recommends use of the term History Problem for all instances of algorithm failure resulting from the error of using historical, subjective human labeling of training data.

2.4 The Spaghetti Problem

This issue arises where hundreds or even thousands of potential predictors are present. While this most often happens in NLP classification algorithms, the key feature is the very large number of potential features, as can also be found in genetic testing, high-volume sensor data, and other situations. In such cases, a large number of candidate predictors may not be carefully screened for bias. This will be termed the "Spaghetti Problem", from the aphorism of throwing spaghetti at a wall to see what sticks, reflecting the belief of some that spaghetti will stick to a wall when fully cooked. The uncritical acceptance of predictors, without carefully testing individual candidate features for bias, can result in the inclusion of biased features in the algorithm. An example of this is found in the Amazon resume screening algorithm, where candidates were rated lower by the algorithm if the word "softball" was found in the resume [12]. As women are much more likely to play softball in college than men, the



inclusion of the term as a predictor with a negative impact on the score contributed to a bias against women job candidates.

2.5 Lack of Transparency

While not a source of bias in and of itself, a lack of algorithm transparency can greatly complicate the testing, identification and confirmation of potential bias, and the mitigation of its effects. One of the most important problems with COMPAS was the implementation of a black box algorithm due to the withholding of information deemed proprietary by the government supplier who developed it. The features included in the COMPAS algorithm were withheld both from the people using the algorithm and from the people whose lives were significantly affected by its use.
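The per-feature screening that the Spaghetti Problem makes impractical by hand can be automated to a first approximation. A rough sketch in plain Python, flagging candidate features whose prevalence differs sharply between population groups (the data, the group labels, and the 0.3 threshold are all hypothetical illustrations, not a recommended standard):

```python
# Rough sketch of screening candidate predictors for association with a
# protected attribute -- the step often skipped in the "Spaghetti Problem".
# All data and the threshold are hypothetical.

def flag_biased_features(records, group_key, features, threshold=0.3):
    """Flag binary features whose prevalence differs between groups by more
    than `threshold`, making them candidates for exclusion or closer review."""
    groups = set(r[group_key] for r in records)
    flagged = []
    for f in features:
        rates = []
        for g in groups:
            members = [r for r in records if r[group_key] == g]
            rates.append(sum(r[f] for r in members) / len(members))
        if max(rates) - min(rates) > threshold:
            flagged.append(f)
    return flagged

# Hypothetical resume data: "softball" appears far more often in one group,
# echoing the Amazon example above; "python" is evenly distributed.
records = [
    {"group": "A", "softball": 1, "python": 1},
    {"group": "A", "softball": 1, "python": 0},
    {"group": "B", "softball": 0, "python": 1},
    {"group": "B", "softball": 0, "python": 0},
]
print(flag_biased_features(records, "group", ["softball", "python"]))
# -> ['softball']
```

A flagged feature is not automatically prejudicial, but it warrants the individual scrutiny that uncritical feature acceptance omits.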

3 Bias Mitigation

3.1 Measuring Bias with Disparate Impact

Disparate impact is often used to investigate the difference in the impact of an event between different sub-populations, such as by race, gender, or age. For example, one investigation of the initial wave of the COVID-19 pandemic (through June 30, 2020) found higher mortality rates (Fig. 1) for several marginalized populations such as BIPOC [13].

Fig. 1. COVID-19 mortality, wave 1 (March 1–June 30, 2020), U.S. counties by quintile.

Disproportionate impact on marginalized groups can be measured using relative risk [14], log odds, or odds ratios of a given event occurring within each population subset (Table 1). Comparison of these metrics across population subsets provides useful measures for quantifying bias [13].



Table 1. COVID-19 wave 1 odds ratios for U.S. population subsets.

Characteristic or risk factor  | Odds ratio
Black/African American         | 10.1
Cardiovascular Disease         | 9.3
Chronic Lung Disease           | 5.9
Prison Populations [15]        | 5.5
Indigenous                     | 3.3
Poverty (% Below Poverty Line) | 2.9
High Population Density        | 1.9

3.2 Bias-Minimized Comparison Algorithm

One means for the quantification and mitigation of algorithm bias is a Bias-Minimized Comparison Algorithm (BMCA). This is a secondary algorithm in which careful scrutiny excludes any factors which might contribute to unwanted bias. While these models will generally not perform as well as production algorithms containing more factors, the disparate impact between the BMCA and the production algorithm quantifies the amount of bias in the factors added to the production algorithm. This can be used to evaluate variations in a proposed algorithm to inform the final selection of features.

The BMCA method is especially useful for wide datasets subject to the "Spaghetti Problem". Where it can be difficult to screen hundreds of potential predictors, use of a BMCA helps quantify the overall amount and impact of bias introduced by a group of potential features. The BMCA methodology is also applicable in cases of intrinsic bias, where some bias naturally exists in a model. For example, age may be a real, non-prejudicial factor to some degree in a model but should not be overstated. A Bias-Minimized Comparison Algorithm can estimate the amount of intrinsic bias in a population and minimize its impact.

3.3 Open Source Packages for Measuring and Mitigating Bias

Recent developments in data science have produced a number of open source packages facilitating the measurement of bias in machine learning algorithms and AI. Prominent among these are AI Fairness 360 from IBM [16] and Fairlearn from Microsoft. Fairlearn is an open-source Python library developed by Microsoft Research that seeks to enable the development of machine learning models that are fair and transparent [17]. The library provides a suite of algorithms and metrics for evaluating and mitigating bias in machine learning models, and it is designed to work seamlessly with popular machine learning frameworks such as scikit-learn and PyTorch.
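Before turning to the library specifics, the BMCA comparison from Sect. 3.2 can be illustrated in a few lines of plain Python. The predictions, group labels, and disparate-impact definition (ratio of favorable-outcome rates) are hypothetical illustrations:

```python
# Sketch of the BMCA idea: compare the disparate impact of a production model
# against a bias-minimized comparison model; the gap attributes bias to the
# production model's additional features. All data is invented.

def selection_rate(predictions, groups, group):
    """Fraction of favorable (1) outcomes within one population subset."""
    sel = [p for p, g in zip(predictions, groups) if g == group]
    return sum(sel) / len(sel)

def disparate_impact(predictions, groups, g1, g2):
    """Ratio of favorable-outcome rates between two population subsets."""
    return selection_rate(predictions, groups, g1) / selection_rate(predictions, groups, g2)

groups     = ["A", "A", "A", "A", "B", "B", "B", "B"]
bmca_preds = [1, 1, 0, 0, 1, 1, 0, 0]   # bias-minimized model: equal rates
prod_preds = [1, 1, 1, 0, 1, 0, 0, 0]   # production model: unequal rates

print(disparate_impact(bmca_preds, groups, "B", "A"))  # -> 1.0
print(disparate_impact(prod_preds, groups, "B", "A"))  # -> 0.333...
```

The drop from 1.0 to roughly 0.33 is the quantity the BMCA method uses to decide whether the production model's extra features are worth their bias cost.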
Fairlearn focuses on the calculation and comparison of model accuracy (correct prediction rate) and sensitivity (true positive rate). It supports calculation of confusion matrices (Fig. 2) to display bias characteristics. The Fairlearn library also includes visualization



tools for comparing different model versions to optimize accuracy while minimizing bias. One especially useful plot compares performance and bias characteristics for different versions of a model concept (Fig. 3). This allows model developers to test modeling methods, tune hyperparameters, and test individual fields for potential impact from bias. Use of this plot supports decisions that maximize model accuracy while minimizing bias.
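The per-group accuracy and sensitivity comparisons that Fairlearn automates can also be computed by hand. A library-free sketch with invented labels and group memberships:

```python
# Per-group accuracy and sensitivity (true positive rate), the two metrics
# discussed above, computed without any library. Data is invented.

def accuracy(y_true, y_pred):
    """Correct prediction rate."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivity(y_true, y_pred):
    """True positive rate: correctly predicted positives / actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tp / sum(y_true)

def per_group(metric, y_true, y_pred, groups):
    """Evaluate a metric separately for each population subset."""
    return {g: metric([t for t, gg in zip(y_true, groups) if gg == g],
                      [p for p, gg in zip(y_pred, groups) if gg == g])
            for g in set(groups)}

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(per_group(sensitivity, y_true, y_pred, groups))
# -> {'A': 1.0, 'B': 0.5}
```

A gap between groups on either metric, like the sensitivity gap here, is exactly the kind of disparity the Fairlearn plots in Figs. 2 and 3 surface.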

Fig. 2. Confusion matrices for African-American defendants versus others, with difference, for a Fairlearn-adjusted model (from Owen, 2022 [17]).

Fig. 3. Fairlearn: disparity in predictors, and a visual comparison of different model versions to facilitate maintaining algorithm accuracy while minimizing bias. On the right, different model versions are compared to maximize accuracy while minimizing bias (from Owen, 2022 [17]).

4 Summary

Accurate measurement of bias in machine learning and AI algorithms is critical to the success of these algorithms. Measurement can be accomplished using log odds or odds ratio comparison of model outcomes for population subsets. Log odds are more often used in statistical contexts, while comparison using odds ratios is often better understood and hence more effective with the non-technical audiences found in business, policy discussions, and the general public. The new, open-source Python package Fairlearn shows great promise for quantifying and mitigating the impact of bias in machine learning and AI.



Best practices for minimizing bias include:

• Parsimonious Models
• Screen all predictors for bias
• Transparent Methods, not Black Box
• Develop the model using new outcomes screened for bias—not past decisions
• Test for disparate impact on at-risk groups using relative risk or odds ratios
• Open Source the data and algorithm.

Acknowledgments. Thanks are due to Nancy Brucken and Brandy Sinco for several helpful comments on measurement methods and presentation of the Disparate Impact paper. Thanks are due to Lance Heilbrun, Harvey Qu, and Karry Roberts for comments on work on ethical problems in analytics presented to the Detroit Chapter of the American Statistical Association, and especially to Lance Heilbrun for the recommendation of relative risk as a preferred metric for evaluating disparate impact.

References

1. Kirkpatrick, K.: It's not the algorithm, it's the data. Commun. ACM 60(2), 21–23 (2017)
2. Angwin, J., Larson, J.: Machine Bias. ProPublica (2016)
3. Larson, J., Mattu, S., Kirchner, L., Angwin, J.: How We Analyzed the COMPAS Recidivism Algorithm. ProPublica (2016)
4. Thomas, C., Pontón-Núñez, A.: Automating judicial discretion: how algorithmic risk assessments in pretrial adjudications violate equal protection. Minnesota J. Law Inequal. 40(2), 5 (2022)
5. Dastin, J.: Amazon scraps secret AI recruiting tool that showed bias against women. Reuters (2018)
6. Nickerson, R.S.: The confirmation bias: a ubiquitous phenomenon in many guises. Rev. Gen. Psychol. (1998)
7. Lewandowsky, S., Cook, J.: Why People Don't Believe in Climate Change. Scientific American (2013)
8. Lavrakas, P.: Sampling bias and data quality. Public Opinion Quarterly (1993)
9. Groves, R., Couper, M., Lepkowski, J., Singer, E., Tourangeau, R.: Nonresponse Bias in Household Surveys. Handbook of Survey Research (2004)
10. Najibi, A.: Racial Discrimination in Face Recognition Technology. Gender Shades Project, Harvard (2020)
11. Emspak, J.: How a Machine Learns Prejudice. Scientific American (2016)
12. Rodriguez-Villa, F.: Ethical AI: Mitigating Bias. Adept AI (2022). https://www.adept-id.com/news/ethical-ai22022. Accessed 2023/1/15
13. Corliss, D.J.: Disproportional impact of COVID-19 on marginalized communities. In: Proceedings of the SAS Global Forum 2021 (2021). https://communities.sas.com/t5/SAS-Global-Forum-Proceedings/Disproportional-Impact-of-COVID-19-on-Marginalized-Communities/ta-p/726372
14. Heilbrun, L.: Personal Communication (2023)
15. Saloner, B., Parish, K., Ward, J.A., DiLaura, G., Dolovich, S.: COVID-19 cases and deaths in federal and state prisons. JAMA 324(6), 602–603 (2020). https://doi.org/10.1001/jama.2020.12528
16. Bellamy, R.K., et al.: AI Fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J. Res. Dev. 63(4/5), 4–1 (2019)
17. Owen, S.: Mitigating Bias in Machine Learning with SHAP and Fairlearn. Databricks (2022). https://www.databricks.com/blog/2022/09/16/mitigating-bias-machine-learning-shap-and-fairlearn.html

Study on Grade Discrimination Method of Traffic Accident Risk Factors Considering Road Type

Wang Jie, Zhan Zhenfei(B), and Qian Liuzhu

Chongqing Jiaotong University, Chongqing, China
[email protected]

Abstract. To further explore the mechanisms behind road traffic accidents, this paper takes differences in road type into account. Based on the open-source traffic accident dataset of the UK Department for Transport, the traffic accident risk factors, and their relative importance, are studied for the four road types with the highest accident rates. First, data visualization is used to explore the distribution of the data along the temporal and spatial dimensions. Second, because ensemble algorithms have many hyperparameters and lack accurate, efficient tuning methods, a Bayesian-optimized random forest algorithm (BO-RF) implemented with the Optuna framework is proposed to build importance-level screening models of the risk factors for the four road types. A control experiment with a random forest tuned by random search (RS-RF) is also carried out. The results show that the BO-RF model based on the Optuna hyperparameter optimization framework achieves shorter tuning time and higher accuracy. Finally, based on the visualized risk-factor levels produced by the BO-RF models, accident-reduction schemes suited to each road condition are proposed, providing a reference for preventing road traffic accidents and optimizing the relevant safety management regulations.

Keywords: Data mining · Causation analysis · Random forest · Bayesian optimization · Optuna framework · Road risk differences

1 Introduction

As an important mode of transportation, road traffic plays a vital role in people's daily lives. Continuous innovation in the automotive industry, including the rise of new energy vehicles and other new vehicle types, has made the road traffic environment increasingly complex. In addition, with the growing frequency of commercial, tourism, and freight transport activities, road traffic safety has become a problem that must be considered in every transportation activity [1]. According to survey data, about 1.2 million people die in road traffic accidents each year [2], and non-fatal road traffic accidents are innumerable. The road traffic safety situation is therefore still severe and needs the attention of society.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 419–435, 2024. https://doi.org/10.1007/978-3-031-47715-7_29


As an advanced big-data exploration and processing technology, data mining is widely used in finance, the Internet, and other fields, and it is also popular in traffic safety. In recent years, data mining based on machine learning has been applied to mining and predicting the causes of traffic accidents. Causation analysis guides the formulation of solutions. Zhan [3] established HFACS-RAS models to model and quantitatively analyze traffic accidents, identify the main causes, and narrow the scope of problem solving, thereby effectively remedying existing safety defects. Zainuddin et al. [4] conducted logistic regression (LR) analysis on Nagano traffic accident statistics to determine the causes of death in Nagano traffic accidents. Su et al. [5], working with urban road traffic accident data, used association rules to reveal the important causes of accidents and the relationships between subjective and objective factors, so as to improve urban road traffic safety. Inspired by these studies, this paper considers the differences between roads, explores how accident-causing factors and their importance levels differ across road types, and proposes traffic safety prevention and control measures better matched to road characteristics. An ensemble learning model is formed by combining multiple learners, and many studies have shown that such models usually offer better interpretability and learning performance: the transformer fault diagnosis model of Xue et al. [6], based on the random forest algorithm (RF), achieved higher diagnostic accuracy than SVM and other models; Yang et al. [7], in a study of landslide susceptibility, compared LR, SVM, ANN, and RF models in a controlled experiment and found that RF had the best predictive ability.
However, ensemble learners have many hyperparameters, and tuning them is complex and time-consuming when the data volume is large. The most common tuning method at present is random search [8]. Because it samples the search space randomly, its tuning time is short but it easily misses the optimum. Bayesian optimization (BO) [9] implemented in the Optuna framework combines probability with optimization and has significant advantages for multi-parameter optimization problems; it has been applied in navigation, sensing, and other fields [10] and shows bright development prospects. Based on the above analysis, this paper proposes a Bayesian-optimized random forest algorithm (BO-RF) implemented with the Optuna framework to screen risk-factor importance levels while accounting for road-type differences, in order to analyze the causal mechanisms of traffic accidents in depth and guide the design of risk-avoidance schemes. The main contents are as follows. First, the traffic accident dataset released by the UK Department for Transport is analyzed visually to explore the temporal and spatial distribution of accidents. Second, the relevant theoretical basis of the study is presented, and a comparison experiment between the random forest tuned by random search and the random forest tuned by Bayesian optimization is set up. Finally, based on the visualization results of the risk-factor level screening models built with BO-RF, traffic accident prevention and control measures suited to the road characteristics are proposed.


2 Accident Dataset and Visual Analysis

2.1 Traffic Accident Dataset Information

The dataset used in this study is a traffic accident dataset covering 1979 to 2020 collected by the UK Department for Transport [11]. It consists mainly of three data files: accident information, casualty information, and vehicle information; this study is based on the accident information file. Because the timeline of the full dataset is very long and the data volume is large, and in order to study traffic accident risk factors that reflect the characteristics of modern transportation, the 2015–2020 UK traffic accident records were extracted using the slicing operations of the pandas library.

2.2 Multidimensional Visualization Analysis of Traffic Accident Data

2.2.1 Time Dimension Distribution of Traffic Accidents

The stacked histogram in Fig. 1 shows the distribution of traffic accidents of each severity level from 2015 to 2020. The numbers 1, 2, and 3 in the legend denote the three severity levels of an accident: fatal, serious, and slight accidents, respectively. First, the overall visualization shows that the number of traffic accidents decreases year by year, and the annual rate of reduction tends to increase, which indirectly indicates that the UK's traffic management system is continuously improving and that these improvements help reduce traffic accidents. Second, analysis of the color proportions within each stacked column shows that slight accidents account for the largest share every year while fatal accidents account for the smallest, which suggests that the traffic management system is still insufficient to prevent minor and medium-severity accidents, leaving considerable room for improvement.

Fig. 1. Annual distribution of accident types at all levels, 2015–2020.
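The 2015–2020 extraction described in Sect. 2.1 can be sketched with pandas as follows. This is a minimal illustration on toy data: the column name `accident_year` and the miniature table are assumptions, since the paper does not reproduce the exact file schema (the real file would be loaded with `pd.read_csv`).

```python
import pandas as pd

# Hypothetical miniature of the DfT accident file; the real file would be
# read with pd.read_csv("accident_information.csv").
df = pd.DataFrame({
    "accident_year": [1979, 2014, 2015, 2018, 2020],
    "accident_severity": [3, 2, 3, 1, 2],
})

# Keep only the 2015-2020 records, as in Sect. 2.1.
recent = df[df["accident_year"].between(2015, 2020)].reset_index(drop=True)
print(len(recent))  # number of retained records
```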

Figure 2 shows the distribution of traffic accidents in each month. According to the statistical results, in almost every year the number of accidents first decreases, then rises, and then falls again. Toward the end of each year, the number of


accidents began to decline again. Among them, 2020 had the fewest accidents and the largest fluctuation, with a standard deviation of 1149; the likely reason is that traffic was affected by the COVID-19 pandemic. In addition, the monthly minima in recent years are concentrated in February, April, or December, and the monthly maxima are concentrated in the latter part of each year.

Fig. 2. Monthly distribution of accidents, 2015–2020.

A matrix diagram is used to show the daily distribution of traffic accidents in Britain in recent years, as shown in Fig. 3: the vertical axis values 0 to 5 represent the years 2015 to 2020, respectively, the horizontal axis represents the day of the year, and brighter blocks indicate more accidents. The pattern is basically consistent with the previous two figures: the number of accidents decreases year by year, and the periods with high accident rates are concentrated in the second half of the year.

Fig. 3. Distribution of accidents per day, 2015–2020.

As shown in Fig. 4, a radar chart is used to visualize the relationship between the number of traffic accidents and the day of the week. From 2015 to 2020 the traces in the radar chart have almost the same shape, indicating that the weekly distribution of accidents is stable across years. The accident rate is high from Monday to Friday and peaks on Thursday; the lower accident rate at the weekend is probably because far more people travel on weekdays. The time-of-day distribution map shows the distribution of the six years of accidents over the 24 h of the day. According to the statistical results, traffic accidents occur most readily around 8:00 a.m. and 5:00 p.m., corresponding to commuting times. In summary, congested commuting traffic is prone to cause traffic accidents.


Fig. 4. Weekly distribution radar map and time distribution map of traffic accidents from 2015 to 2020.

Next, the distribution of traffic accidents in the UK over the past six years is analyzed in the spatial dimension. Figure 5 shows the relationship between road type and the number of traffic accidents. The one-way lane type has the highest accident incidence, with 541,136 accidents over the six years, or 73.32% of the total; the two-way lane type is second, with 111,992 accidents over the six years, or 15.17% of the total; third and fourth are ring roads and one-way streets, respectively. The remaining road types account for a relatively small share: their six-year totals add up to about 20,000 accidents, or 2.79% of the total. To identify the risk factors that readily cause accidents and effectively reduce the accident rate, the risk-factor importance screening for different road types in the remainder of this paper is based on the four road types with the highest accident rates.

Fig. 5. Traffic accident distribution map of road types in 2015–2020.
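The road-type shares above (73.32%, 15.17%, …) are simple relative frequencies. With pandas they can be computed as below; this is a sketch on toy data, and the column name `road_type` is an assumption rather than the dataset's actual field name.

```python
import pandas as pd

# Toy stand-in for the accident table; the real table has one row per accident.
accidents = pd.DataFrame({
    "road_type": ["one-way lane"] * 7 + ["two-way lane"] * 2 + ["ring road"]
})

# Share of accidents per road type, largest first.
shares = accidents["road_type"].value_counts(normalize=True)
print(shares.round(2).to_dict())  # e.g. {'one-way lane': 0.7, ...}
```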


3 Data Mining Methods and Model Building

3.1 Data Mining Method Theory Overview

3.1.1 Discriminant Theory of Feature Importance Based on Random Forest

Random forest [12, 13] is an ensemble learning algorithm built on decision tree classifiers, characterized by Bootstrap sampling and random feature selection. The feature importance ranking is one of the main outputs of a random forest and explains how strongly each variable attribute affects the result. During the bootstrap sampling of a random forest, about one third of the samples are left out of each tree's training sample set. The steps for screening variable attribute importance with a random forest are [12]:

1. For each decision tree in the random forest, compute the corresponding out-of-bag error on its out-of-bag (OOB) data, denoted Err1.
2. Add a noise factor to feature X of the OOB samples and recompute the out-of-bag error, denoted Err2.
3. Assuming there are n trees in the random forest, the importance score of feature X is calculated as

importance score = (1/n) Σ (Err2 − Err1)     (1)

Perturbing an informative feature increases the out-of-bag error, so Err2 − Err1 is larger for more important features.
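The three steps above amount to a permutation-style importance measure. The following self-contained sketch averages the error increase over features rather than over the trees of a real forest, and uses a toy "model" standing in for a fitted tree, since the paper's data and forest are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy OOB data: the label depends only on feature 0.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def model_predict(X):
    # Stand-in for a fitted tree: thresholds feature 0, ignores feature 1.
    return (X[:, 0] > 0).astype(int)

def permutation_importance(X, y, predict, rng):
    base_err = np.mean(predict(X) != y)                      # Err1
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])                 # "add noise" to feature j
        scores.append(np.mean(predict(Xp) != y) - base_err)  # Err2 - Err1
    return np.array(scores)

scores = permutation_importance(X, y, model_predict, rng)
print(scores)  # feature 0 scores high, feature 1 near zero
```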

In a random forest model, the higher a feature's importance score, the greater its influence on the classification output. This paper uses this property of random forests, with accident severity as the output: among the m features extracted from the traffic accident data, a higher importance score gives greater credibility to that feature as a risk factor.

3.1.2 Summary of Bayesian Optimization Theory

Bayesian optimization is a method that uses Bayes' theorem (Eq. (2)) to guide the search for the extremum of an objective function. Because it is efficient and effective for hyperparameter optimization, it avoids the drawbacks of global grid search and random grid search and is often used to tune black-box models such as deep networks.

P(f | D_{1:t}) = P(D_{1:t} | f) P(f) / P(D_{1:t})     (2)

In the formula, f represents the black-box function or model parameter; D_{1:t} = {(x_i, y_i)} represents the sample set obtained by sampling; P(D_{1:t} | f) is the likelihood distribution of the observed values y = f + ε, where ε is the observation noise; P(f) is the prior probability; P(D_{1:t}) is the marginal likelihood distribution; and P(f | D_{1:t}) is the posterior probability distribution of f.


Bayesian optimization consists mainly of a probabilistic surrogate model and an acquisition function [14, 15]; only the surrogate model is introduced here. The probabilistic surrogate model is a distribution over functions constructed with a Gaussian process [6, 10]. A Gaussian process is determined by its mean function m(x) and covariance function k(x, x'):

f(x) ~ GP(m(x), k(x, x'))
m(x) = E[f(x)]
k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))]     (3)

To simplify the calculation, the mean function m(x) is usually taken to be zero, so the prior distribution of the objective function can be expressed as P(f | X, θ) ~ N(0, K_i), where f denotes the set of unknown function values {f(x_1), …, f(x_i)}, X denotes the training inputs {x_1, …, x_i}, θ denotes the hyperparameters to be optimized, and K_i denotes the covariance matrix built from the covariance function:

K_i = | k(x_1, x_1) ⋯ k(x_1, x_i) |
      |     ⋮        ⋱       ⋮     |     (4)
      | k(x_i, x_1) ⋯ k(x_i, x_i) |

Assuming the observation noise ε is independent and identically Gaussian distributed, ε ~ N(0, σ²), and y = f + ε is the set of observed values, the likelihood distribution is

P(y | f) = N(f, σ² I)     (5)

and the marginal likelihood distribution follows by integrating out f:

P(y | X, θ) = ∫ P(y | f) P(f | X, θ) df = N(0, K_i + σ² I)     (6)

Equation (6) shows that the hyperparameters θ should be optimized to improve the fit to the observations y; this is usually done by maximizing Eq. (6). The hyperparameter tuning process based on Bayesian optimization is shown in Fig. 6.

3.1.3 Optuna Hyperparameter Optimization Framework

Optuna [6, 12] is an automatic hyperparameter optimization framework designed for machine learning. It offers a variety of samplers and makes optimal hyperparameter search based on Bayesian optimization principles straightforward. Its sampler belongs to the tree-structured Parzen estimator (TPE) family of Bayesian optimization, which approximates the Gaussian process by modeling the conditional density with two non-parametric densities that replace the distribution of previous configurations:

P(y | f) = ζ(y) if f < f*;  ϑ(y) if f ≥ f*     (7)


Fig. 6. Flow chart of hyperparameter tuning based on Bayesian optimization.

3.2 Establishment and Optimization of Risk Factor Grade Screening Models for Different Road Types

3.2.1 Road Type Variable Information Description

The risk factor screening models are based on the top four road types in the UK's 2015–2020 traffic accident data. The road type names and their corresponding rankings are shown in Table 1.

Table 1. Top four road types by number of accidents.

Road type      | Accidents (ten thousand) | Ranking
One-way lane   | 54.1136                  | 1
Two-way lane   | 11.1992                  | 2
Ring road      | 4.6904                   | 3
One-way street | 1.7372                   | 4


Using the indexing operations of the pandas data analysis library, the accident samples corresponding to each of the four road types above are extracted into separate sample sets. The key variables of the sample sets are described in Table 2.

Table 2. Variable meaning description.

Variable name                           | Meaning of variable
accident_severity                       | Accident severity
speed_limit                             | Speed limit
junction_detail                         | Junction detail
junction_control                        | Junction control
pedestrian_crossing_human_control       | Human control of pedestrian crossings
pedestrian_crossing_physical_facilities | Physical pedestrian crossing facilities
light_conditions                        | Light conditions
weather_conditions                      | Weather conditions
road_surface_conditions                 | Road surface conditions
special_conditions_at_site              | Special conditions at the site
carriageway_hazards                     | Carriageway hazards
urban_or_rural_area                     | Urban or rural road
trunk_road_flag                         | Whether the road is a trunk road
day_of_week                             | Day of the week

3.2.2 Construction of the Risk Factor Importance Level Screening Models

The steps for constructing the risk factor importance level screening model for each road type are:

1. Select the accident severity variable as the output of the random forest model and the remaining variables as inputs.
2. Split each dataset with the scikit-learn machine learning library, using a 7:3 ratio of training set to test set.
3. Build a random forest (RF) model on each of the four training sets, and tune it with the Bayesian optimization method based on the Optuna hyperparameter optimization framework. The main hyperparameters optimized are n_estimators (number of decision trees in the forest), max_depth (maximum depth of a single decision tree), and min_samples_split (minimum number of samples required to split an internal node).
4. Run 100 optimization trials per model and score the tuned model by the mean accuracy of 5-fold cross-validation; finally, evaluate the model on the test set.
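Steps 1–4 above can be sketched with scikit-learn on synthetic data. The generated dataset is a minimal stand-in for one road type's accident table, and the single hyperparameter setting shown is one candidate that the BO search would evaluate, not the paper's tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for one road type's accident table (3 severity classes).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)

# Step 2: 7:3 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 3: RF with one candidate hyperparameter setting (BO would search these).
rf = RandomForestClassifier(n_estimators=50, max_depth=12,
                            min_samples_split=5, random_state=0)

# Step 4: mean 5-fold cross-validation accuracy, then test-set accuracy.
cv_acc = cross_val_score(rf, X_tr, y_tr, cv=5).mean()
test_acc = rf.fit(X_tr, y_tr).score(X_te, y_te)
print(round(cv_acc, 3), round(test_acc, 3))
```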


3.2.3 Comparative Analysis of Hyperparameter Optimization Experiments

In this study, three important hyperparameters of the four random forest models are optimized both by the Bayesian optimization algorithm based on the Optuna framework and by the conventional random search method. Figure 7 shows the tuning histories of the four models under Bayesian optimization (BO) and random search (RS). Subgraphs (a), (b), (c), and (d) correspond to the screening models for one-way lanes, two-way lanes, ring roads, and one-way streets, respectively; the abscissa of each subgraph is the iteration number and the ordinate is the objective value returned by the tuning run (5-fold cross-validation score). Overall, the model accuracy obtained with the Optuna-based Bayesian optimization is better than that obtained with random search. Moreover, the trajectories of the two objective functions show that random search is erratic and uncertain, while the fluctuation of the Bayesian optimization objective shrinks roughly from large to small and remains moderate overall. This is because Bayesian optimization adopts an active search strategy [6], which avoids large numbers of uninformative sampling points and thereby raises the success rate of the search.
Figure 8 shows contour maps of the hyperparameters n_estimators, max_depth, and min_samples_split produced during the Optuna-based Bayesian optimization, from which the influence of each hyperparameter on overall model accuracy can be read. Scatter positions give the hyperparameter values, and the color scale gives the objective value: the lighter the color, the larger the objective value and the better the model fit. Subgraphs (a)–(d) have the same meaning as in Fig. 7. On the whole, max_depth plays the dominant role in improving the screening models for all four road types, with the better-fitting models concentrated at tree depths below 20 layers, while changes in the other two hyperparameters have no obvious effect on the fit.

3.2.4 Comparison of Model Optimization Improvement Effects

Table 3 compares the hyperparameter optimization of the four models under the BO and RS algorithms: the search ranges of the three hyperparameters, the selected best values, the 5-fold cross-validation scores of the risk factor level screening models for the four road types, and the time consumed. The verification results show that, over the same hyperparameter space, the optimization efficiency of the Bayesian optimization algorithm based on the Optuna hyperparameter optimization


Fig. 7. Tuning histories under Bayesian optimization and random search.

Fig. 8. Hyperparameter pair—objective value contour map.

framework is higher than that of the random search method. The classification accuracy of all four BO-RF road-type models is above 80%, i.e., the final models fit well, and their accuracy is higher than that of the models built with the RS-RF algorithm. Table 4 lists the test-set accuracy of the corresponding models; the test results show that the BO-RF models generalize better than the RS-RF models. In summary, the Bayesian optimization method implemented with the Optuna framework has clear advantages for tuning large-scale ensemble models with many hyperparameters. It can make full use of historical


optimization records, efficiently search the parameter space in depth, reduce unnecessary objective function evaluations, and speed up model optimization. This indicates that the random forest algorithm based on Bayesian optimization is well suited to modeling and causation analysis of large datasets with many features, such as the road traffic accident dataset.

Table 3. Hyperparameter optimization information comparison table.

One-way lane risk factor level screening model
  BO: n_estimators [3, 100] → 78; max_depth [2, 32] → 3; min_samples_split [3, 10] → 6; 5-fold CV score 0.8014; optimization time 1389 s
  RS: n_estimators [3, 100] → 7; max_depth [2, 32] → 8; min_samples_split [3, 10] → 4; 5-fold CV score 0.7934; optimization time 1391 s

Two-way lane risk factor level screening model
  BO: n_estimators [3, 100] → 46; max_depth [2, 32] → 12; min_samples_split [3, 10] → 5; 5-fold CV score 0.8277; optimization time 185 s
  RS: n_estimators [3, 100] → 45; max_depth [2, 32] → 13; min_samples_split [3, 10] → 7; 5-fold CV score 0.8176; optimization time 220 s

Ring road risk factor level screening model
  BO: n_estimators [3, 100] → 56; max_depth [2, 32] → 13; min_samples_split [3, 10] → 5; 5-fold CV score 0.8722; optimization time 87 s
  RS: n_estimators [3, 100] → 53; max_depth [2, 32] → 5; min_samples_split [3, 10] → 7; 5-fold CV score 0.8632; optimization time 93 s

One-way street risk factor level screening model
  BO: n_estimators [3, 100] → 9; max_depth [2, 32] → 8; min_samples_split [3, 10] → 7; 5-fold CV score 0.8434; optimization time 41 s
  RS: n_estimators [3, 100] → 53; max_depth [2, 32] → 5; min_samples_split [3, 10] → 7; 5-fold CV score 0.8333; optimization time 48 s

Table 4. Test set accuracy comparison.

Model                                      | BO-RF  | RS-RF
One-way lane risk factor screening model   | 0.8026 | 0.7943
Two-way lane risk factor screening model   | 0.8251 | 0.8109
Ring road risk factor screening model      | 0.8703 | 0.8614
One-way street risk factor screening model | 0.8444 | 0.8227

4 Model Results Analysis and Accident Prevention Suggestions

From the random forest models tuned by the Bayesian optimization method above, the risk factor importance levels of the four road types are obtained. The visualization results are shown in Fig. 9, and the decision results derived from the models are shown in Table 5. Subgraphs (a), (b), (c), and (d) give the risk-factor rankings for one-way lanes, two-way lanes, ring roads, and one-way streets, respectively. Below, Fig. 9 and Table 5 are combined to analyze the decision results for the four road types and to give suggestions for reducing traffic accidents.

1. For the one-way lane road type, uncontrolled T-junctions on urban one-way lanes with a 30 km/h speed limit are particularly prone to slight or serious accidents, while on rural one-way lanes with a 60 km/h limit, the lack of control at T-junctions readily leads to fatal accidents. For the accident hazards posed by these top-ranked risk factors, the prevention and control suggestions are as follows. (1) Because the urban population is dense and cars are numerous, narrower one-way lanes are more likely to produce accidents. (2) A 30 km/h speed limit alone may not be suitable for one-way


Fig. 9. Risk factor level screening results of four road types.

lanes in the city, because some drivers will always try to overtake slower vehicles ahead and can easily collide with oncoming traffic; adding a no-overtaking rule on top of the 30 km/h limit may reduce the incidence of slight and serious accidents. (3) The 60 km/h limit on rural one-way lanes is likely to cause fatal accidents; reducing it to 40–50 km/h is recommended. (4) Convex wide-angle mirrors placed near T-junctions on one-way lanes would widen the driver's field of view and allow oncoming vehicles at the junction to be detected in advance.
2. For the two-way lane road type, slight and serious accidents occur readily near junctions of non-trunk two-way lanes without traffic indication facilities within 50 m, at speed limits of 30 km/h or 70 km/h, when weather and light conditions are good; fatal accidents occur with high probability in the same scenario at the 70 km/h limit. For these hazards, the prevention and control suggestions are as follows. (1) Drivers are prone to risky behavior when external conditions such as light and weather are good, leading to fatal accidents. (2) Both the 30 km/h and 70 km/h limits are accident-prone, indicating that speed limits that are too low or too high on two-way lanes carry high safety risks; the speed limit range can be adjusted appropriately to prevent rear-end collisions. For example, pivot-table statistics on the data suggest that setting the two-way lane limit to about 60 km/h would greatly reduce the incidence of all three accident types.
(3) Traffic control facilities such as road traffic signs play an important role in reducing fatal accidents. It is suggested that traffic guidance infrastructure be installed on critical sections of two-way lanes, especially at junctions, to discourage dangerous driving behavior.
3. For ring roads, all three accident types occur readily under good external conditions, a 30 km/h speed limit, and an absence of traffic indication facilities on the road, characteristics similar to those of the two-way lane case above. Accordingly, the prevention and control suggestions are: (1) Considering that people like to drive when the external environment is good, this will cause a large traffic flow and


Table 5. Decision information table of high incidence of accidents based on risk factor grade screening model.

One-way lane
  Decision rule: Urban road → Speed limit 30 km/h → Give way under the condition of lack of control → T intersection / Within 20 m of intersection
  Accidents prone: Minor, serious accidents
  Decision rule: Rural road → Speed limit 60 km/h → Give way under the condition of lack of control → T intersection / Within 20 m of intersection
  Accidents prone: Fatal accident

Two-way lane
  Decision rule: Daylight conditions / Night light conditions → T intersection / Within 20 m of intersection → Speed limit 70 km/h / Speed limit 30 km/h → No traffic indication facilities within 50 m → Good weather without wind → Non-main road
  Accidents prone: Minor, serious accidents
  Decision rule: Daylight conditions / Night light conditions → T intersection / Within 20 m of intersection → Speed limit 70 km/h → No traffic indication facilities within 50 m → No windy weather
  Accidents prone: Fatal accident

Ring road
  Decision rule: Speed limit 30 km/h → No traffic signs within 50 m → Good windless weather → Daylight conditions → Dry road
  Accidents prone: Minor, serious, fatal accidents

One-way street
  Decision rule: No traffic control within 50 m → No traffic indication facilities within 50 m → Speed limit 30 km/h → Give way under the condition of lack of control → T intersection / Within 20 m of intersection
  Accidents prone: Minor, serious, fatal accidents
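The decision rules in Table 5 are conjunctions of conditions attached to a road type. A minimal sketch of how such rule chains could be matched against an accident record (all attribute names and encodings here are hypothetical, not the paper's actual schema):

```python
# Hypothetical encoding of a few Table 5 rules as (road type, conditions, outcome).
RULES = [
    ("one-way lane",
     {"area": "urban", "speed_limit": 30, "junction": "T"},
     {"minor", "serious"}),
    ("one-way lane",
     {"area": "rural", "speed_limit": 60, "junction": "T"},
     {"fatal"}),
    ("ring road",
     {"speed_limit": 30, "signs_within_50m": False, "weather": "fine"},
     {"minor", "serious", "fatal"}),
]

def prone_accident_classes(record):
    """Union of accident classes whose rule conditions all match the record."""
    classes = set()
    for road_type, conditions, outcome in RULES:
        if record.get("road_type") != road_type:
            continue
        if all(record.get(k) == v for k, v in conditions.items()):
            classes |= outcome
    return classes

record = {"road_type": "one-way lane", "area": "rural",
          "speed_limit": 60, "junction": "T"}
print(prone_accident_classes(record))  # {'fatal'}
```

Each `→` step in a table row becomes one key/value condition; a record only matches a row when every condition holds.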

a high incidence of road traffic accidents can be expected; traffic police and other authorized personnel could therefore be deployed on the ring road to direct and organize traffic flow. (2) A 30 km/h limit is not appropriate for the roundabout: the pivot-table statistics show that the fatal-accident rate is lowest at a 20 km/h limit, so the ring-road limit could be set at 20 km/h or below to reduce accidents. 4. For the one-way street road type, all three accident types are prone to occur at T-junctions with no traffic control or traffic indication infrastructure within 50 m. In view of these risks, the prevention and control suggestions are as follows: (1) The biggest difference between a one-way street and the other three road types is the larger number of pedestrians, so drivers must watch for pedestrians as well as other vehicles, especially at the


W. Jie et al.

intersection. Because blind spots can cause accidents of varying severity, it is suggested to add speed bumps within 50 m of the intersection on one-way streets and to add traffic lights and other traffic facilities at the intersection to reduce the accident rate. (2) Deploying traffic management staff in densely populated one-way street areas would also help maintain order and effectively reduce the incidence of fatal accidents.

5 Discussion

Based on historical road traffic accident data from the United Kingdom, this study systematically analyzes how road type affects the importance ranking of traffic accident risk factors. The results show that different road types serve different functions, so seemingly identical risk factors play inconsistent roles in inducing accidents. For example, for one-way lanes, whether the road lies in the city or in the countryside may be more decisive in inducing accidents than factors such as weather or speed limit, whereas for two-way lanes the urban/rural distinction plays only a weak role. Road-type differences should therefore be taken into account when identifying the important risk factors for each road type, which is of great significance for formulating and improving traffic safety regulations for the various types of roads. In addition, the BO-RF algorithm implemented with the Optuna hyperparameter framework outperforms the RS-RF model in both accuracy and efficiency when building models on large-scale traffic accident datasets; applying it to the identification of risk-factor importance levels across road types yields higher reliability at lower time cost.

6 Conclusion

This paper uses the open-source traffic accident dataset of the British Department for Transport to study risk-factor level screening by road type. First, the temporal and spatial distribution characteristics of traffic accidents are analyzed in depth. Then, taking the optimization of ensemble algorithms such as random forest as the object, a comparative experiment is set up between Optuna-based Bayesian optimization and random-search optimization. Finally, based on the risk-factor level screening models for four road types constructed with the BO-RF algorithm, targeted traffic control measures are proposed. The results show that: 1. Road traffic accidents are more likely to occur in densely populated urban areas, especially on working days and during commuting periods. 2. Bayesian optimization based on the Optuna framework is better suited than random search to optimizing ensemble models with large data volumes and numerous hyperparameters. 3. The same risk factors have different effects on traffic accidents across road types such as one-way lane, two-way lane, ring road, and one-way street.


Future research will continue in the following directions: using updated datasets that are larger and more varied; fully considering the relationship between road differences and risk-factor levels; and using better ensemble algorithms combined with Optuna-based Bayesian hyperparameter tuning to construct a more accurate risk-factor importance-level screening model, which can also serve as a feature dimension-reduction method. On this basis, a traffic accident risk-level assessment model and a personal-injury-level assessment model can be constructed for specific scenarios on different road types.

References
1. Wu, G.F., Liu, F.J., Dong, G.L.: Analysis of the influencing factors of road environment in road traffic accidents. In: 2020 4th Annual International Conference on Data Science and Business Analytics (ICDSBA), Changsha, pp. 83–85. IEEE
2. Pawlowski, W., Lasota, D., Goniewicz, M., et al.: The effect of ethyl alcohol upon pedestrian trauma sustained in traffic crashes. Int. J. Environ. Res. Public Health 16(8), 1471 (2019)
3. Zhan, Q.J.: Based Railway Accident Casual Factor Modeling and Hybrid Learning Approach. Beijing Jiaotong University, Beijing (2017)
4. Kuniyuki, H., Zainuddin, A.F.: A study on method of extracting high-risk accident causes in Nagano using Bayesian network analysis. Int. J. Autom. Engin. 13(3), 139–146 (2022)
5. Su, F., Yuan, Q.: Analysis on association rules of urban road traffic accidents. J. Wuhan Univ. Technol. (Inf. Manag. Engin.) 42(04), 313–318+331
6. Wang, X., Han, T.: Transformer fault diagnosis based on Bayesian optimized random forest. Electr. Measure. Instrum. 58(06), 167–173 (2021)
7. Yang, C., Liu, L.L., Zhang, Y.L., et al.: Landslide susceptibility assessment based on Bayesian optimized machine learning hyperparameters. Geol. Technol. Bull. 41(02), 228–238 (2022)
8. Ma, H., Ren, M.X., Gao, K.: Gas emission prediction model based on random search optimization XGBoost. J. Safety Technol. 18(05), 129–134 (2022)
9. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., Freitas, N.D.: Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104(1), 148–175 (2016)
10. Cui, J.X., Yang, B.: Overview of Bayesian optimization methods and applications. J. Softw. 29(10), 3068–3090 (2018)
11. UK Road Safety Dataset: https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data
12. Wang, M., Liu, D.: Cause analysis of railway traffic accidents based on random forest. J. Phys.: Conf. Ser. 1910(1), 012017
13. Li, Y.S., Zhang, X., Wang, W.J., et al.: Analysis on influencing factors of accident injury degree of electric bicycle riders based on random forest. J. Transp. Syst. Eng. Inf. Technol. 21(01), 196–200 (2021)
14. Zhang, Y.: Research on Hyperparameter Optimization of Deep Learning Model. Capital University of Economics and Business, Beijing (2020)
15. De Ath, G., Everson, R.M., Rahat, A.A.M., et al.: Greed is good: exploration and exploitation trade-offs in Bayesian optimisation. ACM Trans. Evolut. Learn. Optim. 1(1), 1–22 (2021)

Summarization of Feedback from Residents in Urban Area Using the Unsupervised Method

Nur Maisara Deli1, Sofianita Mutalib1,3(B), Mohd Fadzil Abdul Rashid2, Haslizatul Fairuz Mohamed Hanum1, and Shuzlina Abdul-Rahman1,3

1 School of Computing Sciences, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, 40450 Shah Alam, Selangor, Malaysia
[email protected]
2 Department of Built Environment Studies and Technology, College of Built Environment, Universiti Teknologi MARA Perak Branch, Tapah, Perak, Malaysia
3 Research Initiative Group Intelligent Systems, Universiti Teknologi MARA, 40450 Shah Alam, Selangor, Malaysia

Abstract. In light of the rapid growth of urbanization in Malaysia, many people have moved to the cities in the belief that they will gain a better quality of life. In practice, however, different problems arise daily: residents' voices are ignored, and the same urban problems keep recurring even though complaints appear everywhere, including on social media. To cast light on this issue, the current paper summarizes residents' feedback using an unsupervised method from the data mining approach. A total of 2320 feedback items were collected from Twitter and CARI Infonet. The Latent Dirichlet Allocation (LDA) method is selected to perform topic modelling, and the coherence score measure is used to find the optimal number of topics (k-value). Finally, three topics were identified and clustered according to the similarity of their words: "road problems and traffic congestion", "public transport", and "pollution". The results provide insightful information to stakeholders, particularly urban policymakers, to support strategic planning and decision-making that reflects urban residents' desires.

Keywords: LDA · Topic modelling · Urbanization · Urban area

1 Introduction

Contrary to rural areas, which offer a calmer and more blissful life, urban areas exist to bring out the best outcomes of human lifestyle transitions. An urban area can be described as a development that is proceeding at a great rate (Dewan Bahasa Pustaka, year?), which attracts many people to live there. As stated by [1], urbanization refers to the mass movement of populations from rural to urban settings, where different types of jobs are created to meet the demands of the growing population. As job-based technology emerges, many people decide to move to urban areas in search of better opportunities and are

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 436–448, 2024. https://doi.org/10.1007/978-3-031-47715-7_30


impacting its growth in Malaysia. According to [2], Kuala Lumpur is the capital of Malaysia, with approximately 1.8 million people residing there. In 2022, 78.2% of Malaysia's population lived in urban areas, showing that more than half of the country's population chooses to spend their life there [3]. These urban residents become the key to the development of ever-growing cities. Are they facing problems while living in cities? This is the main concern discussed in this paper through analysis of a feedback dataset. Feedback has historically been a tool of sociology and psychology for determining the behavior of certain parties [4]. In this context, urban residents are free to convey their opinions and feedback through multiple channels. Social media is one of the valuable tools that have helped break down the geographical barriers that once restricted communication, leading to an explosion of e-participation, virtual presence, and online communities [5]. It supplies volumes of feedback that need to be attended to. Because so much feedback is expressed over the internet, these texts are not in a form that is convenient to analyze. Previous researchers have used different techniques, such as Convolutional Neural Networks (CNN) and data mining, to classify, cluster, and identify the context of text on social media platforms or in surveys. Both approaches are beneficial, as they extract and interpret information from text. However, the urban problems are not stated explicitly within the residents' posts. One impact of urbanization is the numerous problems that occur in the city, which makes it harder for planners and policymakers to address them [6]. These problems generate many complaints from residents.
Since many residents live in the same area, most of them experience the same things, so one issue can result in many complaints. The cycle of an unhealthy lifestyle in urban areas continues because the urban sectors do not fully engage with residents, primarily due to a lack of human resources and platforms for engagement. Because of inadequate exposure to community engagement, most urban people have no idea how the discernible impacts of community engagement in urban development plans can affect their daily lives [7]. It is crucial for city developers and urban sectors to pursue sustainable development by including residents' feedback and informal text communications, mainly from social media, using data mining techniques such as topic modelling. Inspired by this, the current paper attempts to extract the related keywords from social media postings by residents of urban areas and then to develop a model that can identify the problems occurring in urban areas using a topic modelling technique. The paper is organized as follows: Sect. 2 provides a short overview of urbanization, topic modelling, and the LDA method, ending with past studies. Section 3 describes the methodology, Sect. 4 discusses the results and findings, and Sect. 5 concludes the paper.


2 Literature Review

2.1 Urban and Urbanization

As stated by [1], urbanization refers to the mass movement of populations, particularly from rural to urban settings, where different types of jobs are created to meet the demands of the growing population. Consequently, development planning falls short if the urban sectors neglect to understand residents' behaviours and desires. Urbanization is an inevitable phenomenon; thus, the urbanization rate is regularly considered in light of its challenges and development potential, with a higher rate reflecting the growth of urban areas. There are various definitions of urban. As noted in [8], no universal definition exists of what constitutes an urban area, but countries can apply their own criteria. In Malaysia, an area is classified as urban when it is built up, has a population of 10,000 or more, and at least 60% of the population is involved in non-agricultural economic activities [9]. Malaysia is now one of the most urbanized countries in East Asia and has one of the fastest urbanization rates [3]. Urban areas play multifaceted roles in society, serving as the heart of technological development and economic growth and, not to forget, as the breeding ground for many types of city problems [1]. These can be planned for and managed strategically when people's voices are well observed and incorporated into decision-making.

2.2 Topic Modelling Using the LDA Method

Topic modelling is a technique used to identify patterns and trends in an enormous set of data [10] from any source, mainly social media. It is a simple way to understand what is currently being discussed among urban residents. This unsupervised model can be viewed as a clustering problem in which each document mixes multiple topics and each word is assumed to belong to at least one topic [11].
Online feedback and reviews, such as tweets and Facebook postings, can be related to each other, especially when users discuss the same topic. The Latent Dirichlet Allocation (LDA) method can therefore be used to arrange, manage, and visualize the similarity of content, making it easier to understand. LDA is capable of summarizing or clustering large volumes of text because it produces a list of topics generated from the documents [10]. Its advantages are that it utilizes clues from the documents themselves and minimizes subjective bias, since the analysis is performed automatically by a mathematical algorithm [12]. LDA-based topic modelling thus suits the goal of using technology to provide a better quality of life. For example, LDA has been used to identify urban problems and their solutions through an online survey on a website platform [12]. Very recently, LDA-based topic modelling was used to discover what topics related to Covid-19 were discussed by residents in South Africa [11].


2.3 The Importance of the Voices of Urban Residents

Some related works in this domain are presented in this section. As stated by [13], growing city size has always been considered an important factor affecting happiness. When many people live in the same area, the environment will either improve or cause difficulty for the residents; the environment includes the people, services, income, natural surroundings, and way of living. Since the environment can affect residents' happiness, [14] likewise claimed that quality of life in urban societies is linked to the infrastructure provided in the neighborhood and to expectations about the environment. Research in [15] uses data mining techniques to outline strategic requirements for urban design. As mentioned earlier, to build a sustainable environment all sectors, including the communities themselves, should learn about and participate in improving the urban system, starting at the students' level; their feedback and reviews can contribute significantly to improving the outcomes. Misuraca et al. [16] note that texts written in natural language hold information in a form that is difficult to analyze quantitatively, since their content does not follow a given data model. In the same vein, [17] collected users' reviews of the fashion-renting experience at three different companies. Jiang et al. [18] analyzed users' social media comments on waste management and household waste reduction and found that engaging with users on social media can improve household waste segregation and recycling. So, however difficult it is, further processing and analysis are needed to give such text depth and meaning, especially for urban planning decisions. Hence, community involvement captured in an unsupervised way is also crucial for building a city that fits all.
As discussed in [7], local communities are treated not as participants in risk-sensitive urban development but as beneficiaries, which results in little consideration of local knowledge; this issue has persisted for many years, and that paper therefore investigated the barriers and challenges in participatory urban planning decision-making. Data mining and topic modelling can thus help analyze residents' feedback about urban areas so as to cater to their needs and social well-being.

2.4 Filling the Gaps of Previous Research

To summarize, the reviewed papers have multiple strengths but also shortcomings that this research aims to address. Most researchers focused on only one topic within urban growth, such as agriculture or a happiness index for urban residents; this paper instead tries to provide compact information about what is happening and how life in urban areas can be managed better. Secondly, little research has been conducted on urban areas in Malaysia and the problems they face, so a strength of this research is that it can open doors for other researchers to study and analyze Malaysian urban areas. Lastly, residents' voices and opinions about urban life have been neglected: past researchers applied topic modelling to various issues, including waste management, but rarely to feedback about problems in urban


areas. To close this gap, this paper includes the residents' voices by collecting their opinions from social media postings and gives city developers insights into what is actually happening in the urban area.

3 Methodology

3.1 Data Pre-processing

Data was collected from two social media sources, Twitter and CARI Infonet, between 2018 and 2021, focusing on four urban areas (districts): Kuala Lumpur, Petaling Jaya, Klang, and Shah Alam. The data, collected using Python and ParseHub, contains a mixture of English and Malay alongside other irrelevant content, and is processed before topic modelling is performed. First, missing values are removed and a filtering step eliminates duplicates and unimportant words or characters. Then lemmatization converts each word to its root form without changing its meaning, and finally tokenization splits the sentences into separate words known as tokens. These preprocessing tasks are crucial for creating meaningful features for both supervised [19] and unsupervised learning [20].

3.2 Modelling

In this phase, the first step is to identify a modelling technique for finding the k-value, measuring the similarity between data items, and identifying which topic has the highest number of postings. Among the techniques suitable for identifying the best k-value before clustering, the coherence score measure in Eq. (1) is used:

Coherence = Σ_{i<j} score(w_i, w_j)   (1)

Coherence sums the scores between pairs of words that occur together in the corpus or documents, where (w_i, w_j) is a pair of words found in the topics; this process is repeated until the optimal k-value is found. Next, the Latent Dirichlet Allocation (LDA) method, which underlies topic modelling, is used to find the similarity between the postings and tweets. LDA is best explained in two parts, Eqs. (2) and (3):

p(topic t | document d)   (2)

p(word w | topic t)   (3)

In Eq. (2), LDA computes the percentage of words in document d that are currently assigned to topic t; in Eq. (3), LDA computes the percentage of times word w was assigned to topic t across all documents in the process. Through these steps, the common words mentioned in the residents' postings can be identified, along with the content of each topic. Once the feedback has been summarized with the topic modelling method, the results are validated by checking the keywords obtained from the topics against Google Trends.
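The two quantities in Eqs. (2) and (3) can be illustrated with a small stdlib-only sketch. The word-topic assignments below are toy data, as they would stand midway through LDA's iterative reassignment; everything here is illustrative, not the paper's code:

```python
# Each document is a list of (word, topic) pairs, i.e. current LDA assignments.
docs = [
    [("road", 0), ("jam", 0), ("bus", 1), ("pothole", 0)],
    [("bus", 1), ("lrt", 1), ("jam", 0)],
]

def p_topic_given_doc(doc, topic):
    """Eq. (2): share of words in document d currently assigned to topic t."""
    return sum(1 for _, t in doc if t == topic) / len(doc)

def p_word_given_topic(docs, word, topic):
    """Eq. (3): share of all topic-t assignments that carry word w."""
    topic_total = sum(1 for d in docs for _, t in d if t == topic)
    word_count = sum(1 for d in docs for w, t in d if w == word and t == topic)
    return word_count / topic_total

print(p_topic_given_doc(docs[0], 0))       # 0.75
print(p_word_given_topic(docs, "bus", 1))  # 2/3 ≈ 0.667
```

LDA alternates between these two views, reassigning each word to a topic in proportion to the product of the two estimates.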


3.3 Visualization of Results

After the previous phases are complete, the processed data, clustered into its respective topics, is visualized. The idea is to visualize the urban problems via the top 100 most common words mentioned in the residents' postings, according to the topic each belongs to. The results are visualized in a dashboard containing the dataset information and the related words for each topic found.

4 Results and Discussion

A total of 2320 tweets and postings were scraped from Twitter and CARI Infonet between 2018 and 2021. They consist of English and Malay text, which was then translated into a single language, English. The dataset contains multiple attributes, but only the 'text' attribute (field) was used, as shown in Fig. 1. After scraping, the dataset was saved in CSV format, as seen in Fig. 2.

Fig. 1. Sample records of twitter.
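The pre-processing chain from Sect. 3.1 (dropping missing values and duplicates, filtering noise characters, lemmatization, tokenization) can be sketched with the stdlib alone; the lemma table and sample posts here are hypothetical stand-ins for the real resources used:

```python
import re

LEMMAS = {"roads": "road", "jams": "jam", "flooding": "flood"}  # stub lemma table

def preprocess(posts):
    """Dedupe, drop empties, strip non-letter noise, tokenise, and lemmatise."""
    seen, cleaned = set(), []
    for post in posts:
        if not post or post in seen:          # missing values and duplicates
            continue
        seen.add(post)
        text = re.sub(r"[^a-z\s]", " ", post.lower())            # filtering
        tokens = [LEMMAS.get(tok, tok) for tok in text.split()]  # tokenise + lemmatise
        cleaned.append(tokens)
    return cleaned

posts = ["Traffic jams on roads!!", "Traffic jams on roads!!", "", "Flooding again #klang"]
print(preprocess(posts))
# [['traffic', 'jam', 'on', 'road'], ['flood', 'again', 'klang']]
```

In the paper's setting a real lemmatizer (and Malay-to-English translation) would replace the stub table, but the order of steps stays the same.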

After all the data pre-processing steps, the dataset is used to find the optimal k-value via the coherence score evaluation, a well-established practice that is preferable to setting k manually. The coherence score measures the similarity between the words within topics; the higher the score, the better it correlates with human judgment [21, 22]. The ranges searched for the best score are (2, 5), (2, 6), (2, 7), (2, 8), (2, 9), and (2, 10). Each range starts at 2 as the minimum number of topics to be identified within the residents' feedback; the end value of each range differs so as to give a wider view of how the coherence score behaves.
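The range search can be sketched as a loop that scores candidate topic-word sets with the Eq. (1) coherence and keeps the best k. The pairwise `score` below is a stdlib stand-in (a simple co-occurrence count); in the paper's pipeline a library coherence measure such as gensim's would take its place, and a fresh LDA model would be fitted per k:

```python
from itertools import combinations

# Toy corpus of tokenised posts (hypothetical tokens).
corpus = [
    {"road", "jam", "pothole"},
    {"bus", "lrt", "jam"},
    {"flood", "river", "garbage"},
    {"road", "pothole", "flood"},
]

def score(wi, wj):
    """Stand-in pairwise score: number of posts in which both words co-occur."""
    return sum(1 for post in corpus if wi in post and wj in post)

def coherence(topic_words):
    """Eq. (1): sum of pairwise scores over the words of one topic."""
    return sum(score(wi, wj) for wi, wj in combinations(topic_words, 2))

# Hypothetical topic-word sets an LDA run might emit for each candidate k.
candidates = {
    2: [["road", "jam"], ["flood", "river"]],
    3: [["road", "pothole"], ["bus", "lrt"], ["flood", "garbage"]],
}
best_k = max(candidates, key=lambda k: sum(coherence(t) for t in candidates[k]))
print(best_k)  # 3
```

The highest-scoring k is kept as the number of topics, mirroring the (2, 5)–(2, 10) range experiments above.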


Fig. 2. Selected ‘text’ dataset.

Figure 3 shows how the coherence score is computed using the Python programming language in a Jupyter Notebook; the range of the coherence score is where the parameter tuning happens.

Fig. 3. Snippet code of coherence score process.

Table 1 shows, for each range, the topic number with the highest coherence score. Topic 3 has the lowest score, 0.2015, but the highest coherence score, 0.2999 (close to 0.3), also falls under topic number 3. Based on this evaluation, 3 is chosen as the k-value: the highest and lowest scores differ by their range of topics, and wider ranges were shown to produce better performance. Topic modelling is used to identify topics from a document consisting of thousands of words, uncovering the underlying information within the document. For this, the Latent Dirichlet Allocation (LDA) method was implemented for this Topic Modelling


Table 1. Coherence score results.

Range    Highest coherence score    Topic number
(2, 5)   0.2015                     3
(2, 6)   0.2232                     5
(2, 7)   0.2502                     4
(2, 8)   0.2999                     3
(2, 9)   0.2862                     6
(2, 10)  0.2398                     5

technique. To perform topic modelling, the optimal number of topics must be pre-determined to ensure a better result; with the coherence score measure discussed above, the k-value can be determined. Using the Python implementation of the method, the only parameter tuned is the number of topics, obtained from the coherence score experiments. Next, using the pyLDAvis package in Python, the topics are displayed in an interactive visualization in which the distance between topics is quite significant; the right side of the display lists the words that represent each topic. Using this visualization, the top 30 words used in each topic are identified, and topics may differ from one another, as shown in Fig. 4. Moreover, the contents of the topics can be grouped into three clusters, as shown in Table 2: Topic 0 is about road problems and traffic congestion, Topic 1 is about public transport, and Topic 2 is about pollution. In other words, these are the most talked-about problems among urban dwellers in the four district areas. This provides an early warning and useful information to urban planners and policymakers for further investigation into solutions that improve the city's living environment and people's comfort and safety. The information can then be visualized using Power BI to make it more interactive and easily retrievable (Fig. 5). The Power BI dashboard consists of two pages. As shown in Fig. 5, the first page visualizes basic information about the dataset, such as the number of tweets and postings and a word cloud. The second page visualizes the results of the topic modelling phase (see Fig. 6).
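Extracting the most representative words per topic, as in the pyLDAvis panel, reduces to ranking word frequencies within each topic. A stdlib sketch on hypothetical per-topic token assignments (the real run would draw these from the fitted LDA model and take the top 30):

```python
from collections import Counter

# Hypothetical per-topic token assignments recovered from a fitted model.
assignments = {
    0: ["road", "jam", "pothole", "road", "jam", "road"],
    1: ["bus", "lrt", "mrt", "bus", "monorail"],
    2: ["flood", "garbage", "pollution", "flood", "river"],
}

def top_words(topic, n=3):
    """Most frequent words of one topic, as a pyLDAvis bar panel would rank them."""
    return [w for w, _ in Counter(assignments[topic]).most_common(n)]

for t in assignments:
    print(t, top_words(t))
```

Reading the ranked lists side by side is what allows the three clusters to be labelled "road problems and traffic congestion", "public transport", and "pollution".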


Fig. 4. Topic visualization based on three clusters.

Table 2. Topic name and the related words (contents).

Topic 0 — Road problems and traffic congestion:
jalan, transport, time, jam, travel, many, road, problem, lrt, mrt, car, morning, bus, hours, pothole, come, home, patch, people, petaling, klang, lumpur, shah, Malaysia, jaya, ampang, kuala, alam, muda, Selangor, thank, index, taman, sri, please, system, give, zone, take, want, use, clean, say, ministry, operate, around, back, amp, area, Malaysian, one, office, help, state, may, today, unlimited, new, line, federal, like, haze, reply, case, city, change, pollution, rain, rainbow, flood, covid, user, fee, government, get, brt, good, climate, service, river, house, valley, make, health, public, need, due, garbage, case, month, water, district, near, day, care, work, still, record, air, mental

Topic 1 — Public transport:
service, jam, jalan, transport, road, people, mrt, save, monorail, line, building, office, section, many, construction, area, district, morning, time, work, bus, lrt, problem, public, shah, klang, alam, jaya, Malaysia, lumpur, sri, ampang, kuala, petaling, Selangor, subang, kelana, take, tak, top, thank, pollution, water, user, include, lot, start, health, come, project, one, around, back, kat, house, council, every, nak, also, air, move, get, flood, say, rain, increase, valley, pothole, face, build, month, government, operate, state, city, new, like, clinic, use, issue, year, make, mental, want, right, amp, good, old, river, hospital, near, know, garbage, every, change, brt, even, help, world, reply

Topic 2 — Pollution:
health, people, damage, clinic, flood, water, sabotage, air, land, index, construction, climate, problem, river, garbage, pollution, covid, Selangor, kuala, lumpur, shah, sri, jaya, kelana, Malaysia, petaling, ampang, klang, alam, time, help, use, thank, high, today, user, like, road, transport, follow, many, also, victim, city, pedestrian, pothole, morning, valley, lrt, jam, day, make, give, case, section, area, rapid, bridge, office, around, bus, change, hour, call, center, mrt, public, get, brt, factory, mental, country, ferry, one, department, line, state, say, good, due, want, year, taman, house, amp, car, take, even, waste, last, look, government, near, monorail, operate, jalan, district, work, still, world


Fig. 5. Dashboard of the first page about the dataset information.

Fig. 6. Dashboard of the second page about the results.

5 Conclusion

This study demonstrates the process of analyzing social media feedback using the topic modelling technique in four districts: Kuala Lumpur, Petaling Jaya, Klang, and Shah Alam. The data were collected based on identified keywords, among them 'traffic congestion', 'traffic jam', 'potholes', 'flood', 'air pollution', 'transportation problem', and 'public transport', alongside 62 other keywords. The results show the three most-talked-about topics extracted from the collected dataset, visualized in a dashboard where city developers, urban sectors, or any related

Summarization of Feedback from Residents in Urban Area

447

parties can view and better understand what is happening in the areas. This provides fundamental insights into what urban dwellers are concerned about and calls for more actions and interventions by the respective stakeholders. In addition, this study can also give the public awareness of how the important the residents’ voices are to build a better urban area in the future. In the future research, the expansion of the scope is needed to make it more useful for everyone in this country. Expansion of scope in terms of areas covered, mapping the phenomenon to location, languages, and techniques used to obtain more impactful information that can influence urban planning decisions. Acknowledgment. The authors would like to express the gratitude to Ministry of Higher Education, Malaysia for the research fund (FRGS/1/2021/ICT02/UITM/02/8) and School of Computing Sciences, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia for the research support.

References

1. Kuddus, M.A., Tynan, E., McBryde, E.: Urbanization: a problem for the rich and the poor? Public Health Rev. 41(1), 1 (2020). https://doi.org/10.1186/s40985-019-0116-0
2. Department of Statistics, Malaysia: Federal Territory of Kuala Lumpur (20 Aug 2021)
3. https://www.dosm.gov.my/v1/index.php?r=column/cone&menu_id=bjRlZXVGdnBueDJKY1BPWEFPRlhIdz09
4. Population Review: Most Urbanized Countries 2023 (2023). https://worldpopulationreview.com/country-rankings/most-urbanized-countries
5. Chatzigeorgiou, I.M., Andreou, G.T.: A systematic review on feedback research for residential energy behavior change through mobile and web interfaces. Renew. Sustain. Energy Rev. 135, 110187 (2021). https://doi.org/10.1016/J.RSER.2020.110187
6. Dwivedi, Y.K., Kelly, G., Janssen, M., Rana, N.P., Slade, E.L., Clement, M.: Social media: the good, the bad, and the ugly. Inf. Syst. Front. 20(3), 419–423 (2018)
7. Somasundram, S., Sambasivan, M., Rasiah, R., Pei-Leng, T.: Ranking the challenges of the urban community in Malaysia. Inst. Econ. 10(4) (2018)
8. Geekiyanage, D., Fernando, T., Keraminiyage, K.: Assessing the state of the art in community engagement for participatory decision-making in disaster risk-sensitive urban development. Int. J. Disaster Risk Reduct. 51, 101847 (2020). https://doi.org/10.1016/j.ijdrr.2020.101847
9. Wineman, A., Alia, D.Y., Anderson, C.L.: Definitions of “rural” and “urban” and understandings of economic transformation: evidence from Tanzania. J. Rural. Stud. 79, 254–268 (2020). https://doi.org/10.1016/J.JRURSTUD.2020.08.014
10. Fadzil Ismail, M.: Urbanisation and urban growth in Malaysia: definitions and concepts, trends and levels of urbanisation, spatial distribution of urban population, metropolitan towns (2020)
11. Negara, E.S., Triadi, D., Andryani, R.: Topic modelling twitter data with latent dirichlet allocation method. In: 2019 International Conference on Electrical Engineering and Computer Science (ICECOS), pp. 386–390 (2019)
12. Mutanga, M.B., Abayomi, A.: Tweeting on COVID-19 pandemic in South Africa: LDA-based topic modelling approach. Afr. J. Sci. Technol. Innov. Dev. 14(1), 163–172 (2022)
13. Chong, M., Habib, A., Evangelopoulos, N., Park, H.W.: Dynamic capabilities of a smart city: an innovative approach to discovering urban problems and solutions. Gov. Inf. Q. 35(4), 682–692 (2018)


14. Dang, Y., Chen, L., Zhang, W., Zheng, D., Zhan, D.: How does growing city size affect residents’ happiness in urban China? A case study of the Bohai rim area. Habitat Int. 97, 102120 (2020). https://doi.org/10.1016/J.HABITATINT.2020.102120
15. Roslan, S.N.M., Kari, F., Satar, N.B.T.M., Nor, W., Abd Aziz, A.B.T.W.: Assessment of the relationship between environmental attributes and urban quality of life in Malaysia. Int. J. Recent. Technol. Eng. 8(2S), 348–357 (2019)
16. Valls, F., Redondo, E., Fonseca, D., Torres-Kompen, R., Villagrasa, S., Martí, N.: Urban data and urban design: a data mining approach to architecture education. Telematics Inform. 35(4), 1039–1052 (2018). https://doi.org/10.1016/J.TELE.2017.09.015
17. Misuraca, M., Scepi, G., Spano, M.: Using opinion mining as an educational analytic: an integrated strategy for the analysis of students’ feedback. Stud. Educ. Eval. 68, 100979 (2021). https://doi.org/10.1016/J.STUEDUC.2021.100979
18. Lang, C., Li, M., Zhao, L.: Understanding consumers’ online fashion renting experiences: a text-mining approach. Sustain. Prod. Consum. 21, 132–144 (2020)
19. Jiang, P., Zhou, J., Fan, Y.V., Klemeš, J.J., Zheng, M., Varbanov, P.S.: Data analysis of resident engagement and sentiments in social media enables better household waste segregation and recycling. J. Clean. Prod. 319, 128809 (2021). https://doi.org/10.1016/J.JCLEPRO.2021.128809
20. Nabiha, A., Mutalib, S., Ab Malik, A.M.: Sentiment analysis for informal malay text in social commerce. In: 2021 2nd International Conference on Artificial Intelligence and Data Sciences (AiDAS), pp. 1–6. IEEE (2021)
21. Mangsor, N.S.M.N., Nasir, S.A.M., Yaacob, W.F.W., Ismail, Z., Rahman, S.A.: Analysing corporate social responsibility reports using document clustering and topic modeling techniques. Indones. J. Electr. Eng. Comput. Sci. 26(3), 1546–1555 (2022)
22. Jónsson, E., Stolee, J.: An evaluation of topic modelling techniques for twitter. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pp. 489–494

Generating and Updating Supervised Data Mining Models on a Periodic Basis

Ana Duarte and Orlando Belo(B)

ALGORITMI R&D Centre/LASI, University of Minho, Campus of Gualtar, 4710-057 Braga, Portugal
[email protected], [email protected]

Abstract. Data mining techniques are currently of great importance in companies and organisations worldwide for building predictive models. These models are particularly useful for classifying new data and supporting decision-making processes by helping to make the most appropriate decisions. However, over time, the predictive models created can become outdated as the patterns found in the data change due to natural evolution. This aspect can affect the quality of the models and lead to results that do not match reality. In this paper, we present a general approach for creating a self-updating system of predictive models that can be adapted to specific contexts. This system periodically generates and selects the most appropriate predictive model for ensuring the validity of its predictions. It integrates data processing and data mining model generation, and allows for the detection of changes in existing patterns as new data is added. This is suitable for supervised data mining tasks that may be affected by data evolution. The implementation of the system has demonstrated that it is possible to pre-process the data and select the best predictive model. In addition, since the execution is triggered automatically, the need for system maintenance is reduced.

Keywords: Concept drift · Data mining · Self-updating models · Weka · Pentaho data integration · Workflow

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 449–456, 2024. https://doi.org/10.1007/978-3-031-47715-7_31

1 Introduction

Small changes and abrupt events quickly transform the world into a completely different place. Over the years, habits and behaviours change, and unexpected episodes, such as the coronavirus pandemic, occur, breaking with previous patterns. The use of mobile phones is an example of such change. According to [1], in the early 2000s, mobile phones were mainly used for audio calls. Nowadays, however, these devices are primarily used for accessing the Internet or taking photographs. This evolution has happened gradually over the years. Suddenly, without us realising it, the patterns of the past are no longer valid. In the specific case of Data Mining (DM) techniques, new trends can lead to outdated predictive models. DM models are used in many areas, namely insurance, retail, the healthcare industry and marketing [2]. Whenever patterns significantly change, these


models no longer reflect reality and can lead to misinterpretations. For example, in the development of personalised recommendation systems, the predictive models used need to be updated from time to time in order to follow the evolution of the users’ preferences. This problem, known as “concept drift”, is responsible for the loss of accuracy in predictive models that assume a stationary association between inputs and outputs [3, 4]. In these situations, predictive models need to be updated as new trends emerge in the data. Continuous data streams are typically associated with a rapid obsolescence of the predictive models. For these cases, several methods [5] have already been proposed to solve this issue, such as the Drift Detection Method [6], the Early Drift Detection Method [7] and ADWIN [8]. However, when the data fluctuate more slowly, it is also important to define an appropriate methodology for updating the predictive models. In this context, the overall aim of this paper is to provide an integrated structure for data processing and for building and updating predictive models related to slowly changing data. This structure should allow the entire process of selecting the best predictive model for a given problem to be performed independently and with the appropriate periodicity. Furthermore, it should be comprehensive enough to be extended to any area.

The rest of the paper is structured as follows. In Sect. 2, we present studies that have already been developed in areas affected by concept drift. In Sect. 3, we describe our proposed methodology for self-updating predictive models; this section starts with an overview of how the system works and then explains in more detail the processes related to data modelling and model evaluation. Finally, in Sect. 4 we present the main conclusions and possible future directions for this work.
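The drift detectors cited here (DDM and its relatives) share a simple idea: monitor the classifier's running error rate and raise a flag once it climbs well above its best observed level. Below is a minimal, DDM-inspired sketch of that idea; the `min_samples` guard and the warning/drift factors are illustrative assumptions, not the published parameterisations.

```python
class DriftMonitor:
    """Flags concept drift when the running error rate rises well above
    its best (lowest) observed level, in the spirit of DDM."""

    def __init__(self, min_samples=30, warn_factor=2.0, drift_factor=3.0):
        self.n = 0            # predictions seen so far
        self.errors = 0       # mispredictions seen so far
        self.p_min = None     # lowest error rate observed (with its std. dev.)
        self.s_min = None
        self.min_samples = min_samples
        self.warn_factor = warn_factor
        self.drift_factor = drift_factor

    def update(self, is_error):
        self.n += 1
        self.errors += int(is_error)
        if self.n < self.min_samples:
            return "stable"                       # not enough evidence yet
        p = self.errors / self.n                  # current error rate
        s = (p * (1 - p) / self.n) ** 0.5         # its standard deviation
        if self.p_min is None or p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s         # track the best level seen
        if p + s > self.p_min + self.drift_factor * self.s_min:
            return "drift"                        # retrain / rebuild the model
        if p + s > self.p_min + self.warn_factor * self.s_min:
            return "warning"
        return "stable"
```

Feeding the monitor one boolean per prediction ("was this a misclassification?") yields a `stable`/`warning`/`drift` status that an updating pipeline could act on.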

2 Handling Concept Drift

The need to regularly update predictive models has been addressed by a considerable number of authors in the literature. In [9], for example, the authors propose a strategy for updating support vector regression predictive models, applied to time series data that exhibit seasonal patterns. The update occurs cyclically at the end of each seasonal pattern and consists of training the model with historical data together with newly collected data; the parameters of the model are then optimised to keep it up to date. Another example where concept drift must be considered is the detection of credit card fraud, as new fraud patterns emerge over time. For this particular problem, [10] proposed an automatic update system for a DM algorithm in which the predictive model is updated using a weighted linear combination of the outputs of previous models, in order to detect changes in fraud-related patterns in a smoothed manner. Compared to models that are not updated, the proposed method significantly improved the error rates associated with drift scenarios. In a different study [11], the authors explored the construction of predictive models, including their updating and use, in a Ubiquitous Health Telemonitoring System. This work requires the collaboration of a researcher and a doctor to build and validate the data mining models. Due to its strong dependence on specialists, the process of updating the models cannot be done automatically. A year later, [12] conceived an online data stream classification method that incrementally learns from labelled and unlabelled data using a selective self-training method. The main advantage of this method is that it increases


the speed of data classification while maintaining high accuracy. Also in the context of data stream mining, [13] presents a kernel-based clustering method to deal with the variations associated with massive data streams and the fact that data may evolve over time. A more recent study, developed by [14], introduces an incremental learning method based on formal concept analysis. The authors created an algorithm that is able to identify human behavioural patterns in nonstationary sensor-based smart homes and that dynamically adapts to new data with new features.

In general, many studies centred on non-stationary data involve datasets with high pattern volatility, such as the various studies that focus on continuous data streams. In other cases, however, it is necessary to consider scenarios where the data being analysed vary more slowly. Furthermore, the strategy used in the previous literature usually focuses on optimising a single algorithm to maintain its effectiveness; these studies do not analyse other algorithms that could provide better predictive results. Therefore, to fill these gaps, further studies addressing the issue of slowly changing data are required. In addition, to improve the performance of these systems, new approaches that examine different algorithms may lead to better results. For example, one promising strategy is to compare different algorithms that compete with each other each time the system is updated, in order to select the one that gives the best results.

Fig. 1. General overview of the periodic process of self-updating DM models in PDI.

3 Self-updating Algorithms

To reflect reality and new trends, there are several situations where it is important to regularly update DM models. Although this process cannot be done entirely without the intervention of experts, it can be executed in an automated way. In this section, we present a general methodology for regularly updating supervised predictive models. Depending on the context and requirements, the model update strategy may vary. This methodology assumes the use of the Pentaho Data Integration (PDI) [15] and Weka [16] tools. In general, the maintenance of the predictive models is carried out as follows (Fig. 1):

• According to the specified time interval (e.g. once a month), the new records added to the data source since the last update of the DM model are counted.
• When the number of new records exceeds a certain minimum value (which should be adjusted depending on the problem, e.g. 5000), the DM job is executed to determine


the new best predictive model. It was assumed that whenever this condition is met, the number of new records is significant and may contain new patterns in the data.
• If the number of new records is low, an email notification is sent informing that the DM model has not been updated. This occurs because no significant records have been added to the dataset since the last update.
• The maintenance process then ends and is repeated at the specified time interval.

The DM job is responsible for the construction of new predictive models and for selecting, among them, the one with the best predictive values. This process is therefore divided into two phases: in the first phase, new DM models are created using Weka and, in the second phase, the parameters of each model are compared in order to select the one that gives the best results. If the execution of these steps fails, a notification is sent by email. In Fig. 2, we can see the job workflow (designed in PDI) that performs these two operations.
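The maintenance steps above can be sketched as a single periodic check. Everything here is a hypothetical illustration: in the paper this logic is orchestrated by a scheduled PDI job, and the callable names (`run_dm_job`, `notify`) and the 5000-record threshold are stand-ins taken from the text's example.

```python
def maintenance_cycle(new_records, run_dm_job, notify, min_new_records=5000):
    """One periodic check: retrain when enough new records have arrived,
    otherwise send a notification explaining why no update was made."""
    if new_records >= min_new_records:
        run_dm_job()   # build the candidate models and select the best one
        return "updated"
    notify("DM model not updated: fewer than "
           f"{min_new_records} new records since the last run")
    return "skipped"
```

In a real deployment this function would be invoked by a scheduler (e.g. monthly), with `new_records` obtained by querying the data source for rows added since the last update timestamp.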

Fig. 2. DM job (training and evaluating the models).

3.1 Constructing and Training the Models

Before building the models, it is important to ensure that the data are prepared to be properly processed. This is a subjective process because the processing required varies with the dataset and the nature of the problem under analysis. Therefore, it is important that a data specialist initially defines a workflow in PDI that is able to perform all the necessary data treatments. These treatments may include, for example, eliminating unnecessary or redundant columns; removing old records to keep the models as close as possible to reality; or checking the consistency and coherence of the data. Other aspects that may be considered at this stage are missing values and the data format and units. At the end of this process, the data are processed through the “Knowledge Flow” node of PDI, which enables the integrated use of the Weka software with PDI. To test different scenarios, we created two different workflows in Weka. In one of them, all resulting attributes are considered; in the other, we used the “Attribute Selection” filter in order to consider only the most relevant attributes. Figure 3 shows the workflow created in Weka for the scenario where all the attributes are considered (Scenario I). For illustration purposes, only two algorithms are presented. When the attributes are selected automatically (Scenario II), the architecture is identical.


The only difference is that the “Attribute Selection” filter is used. Note that an initial optimisation of the parameters of the different models should be performed the first time, in order to achieve the best predictive results.

Fig. 3. DM modeling architecture in Weka.

In the upper part of Fig. 3, the DM job evaluates the techniques and saves the results, including metrics such as accuracy, precision and recall, in text files. Figure 4 shows an example of the structure of these text files. To facilitate identification, the name of each file corresponds to the name of the technique used. The construction process of each model is shown schematically in the lower part of the figure. At this stage, the generated models are saved, and their filenames correspond to the names of the techniques used. In order to distinguish the files for Scenarios I and II, the prefix “AS” is additionally assigned to the names of the files generated under Scenario II.

Fig. 4. Text file containing the evaluation metrics of a DM model.


3.2 Evaluating the Models

The text files containing the models’ metrics can be used to support their evaluation. Therefore, we created a workflow in PDI that loads the resulting files and selects the model with the best predictive results. Similar to the data processing stage, the evaluation of the models depends on the problem under analysis. The importance (weight) of accuracy, sensitivity and other possible metrics must be adjusted to each specific context. For testing the system, we considered only two metrics for the evaluation process, accuracy and sensitivity, and assumed that they are equally important. Therefore, the selected model is the one with the highest score, taking into account the accuracy and sensitivity values. In general terms, the workflow designed in PDI encompasses the following sequential steps:

• Reading the files containing the metrics associated with each algorithm.
• Calculating the score for each model. In this case, the score of each model is the average of its accuracy and sensitivity values. In other situations, the calculation of the score should be adapted to the defined metrics and weights.
• Ranking the algorithms according to the score value.
• Selecting the best model (highest score).

After completing these steps, it is possible to generate a predictive model that is adapted to the new data and thus able to detect new patterns that may occur.

3.3 Testing the System

To test the system, we simulated the proposed methodology with artificial data, using five different DM algorithms. This allowed us to verify that the text files containing the details of the metrics obtained for each of the tested models were saved. Furthermore, we also confirmed that the score for each model was properly calculated, taking into account the accuracy and sensitivity values. The algorithms were correctly ranked in descending order according to the values obtained (Fig. 5), and the system identified the algorithm with the highest score as the best model.
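The evaluation steps listed in Sect. 3.2 can be sketched as follows, assuming a simple `name: value` layout for the per-technique metric files; this layout is an assumption for illustration, as the actual files produced by the Weka flow may be structured differently.

```python
import os

def load_metrics(path):
    """Parse a 'name: value' metrics file into a dict of floats."""
    metrics = {}
    with open(path) as f:
        for line in f:
            if ":" in line:
                name, value = line.split(":", 1)
                metrics[name.strip().lower()] = float(value)
    return metrics

def rank_models(result_dir):
    """Score each metric file as the mean of accuracy and sensitivity
    (the equal weighting used in the text) and return (name, score)
    pairs sorted best-first."""
    scores = {}
    for fname in os.listdir(result_dir):
        m = load_metrics(os.path.join(result_dir, fname))
        scores[fname] = (m["accuracy"] + m["sensitivity"]) / 2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The first element of the returned ranking is the model to deploy; in other contexts, the score line would be replaced with whatever weighted combination of metrics fits the problem.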

Fig. 5. Scores obtained for each algorithm.


When a significant amount of new data has been introduced since the last update, we found that the selected model varies. We can therefore conclude that the system is able to optimise accuracy and sensitivity on a regular basis.

4 Conclusions and Future Work

In summary, the existence of non-stationary data represents a challenge for predictive DM models. In this paper, we developed a methodology to update these models taking into account the most recent data. This methodology can serve as a reference for any work where data volatility may affect the quality of DM models. The main advantage is that the DM models can be routinely updated without the need for external intervention. This work therefore demonstrated that it is possible to implement an automatic system that can generate and update different algorithms. At each update, new models are created, their metrics are evaluated, and the one with the best results is selected. This whole process is done in PDI, in an integrated way with Weka. Unlike most studies, this methodology is designed for data whose patterns change slowly. Moreover, it is also flexible enough to enable the adaptation of the data processing and of the algorithms and metrics to be tested for each specific problem.

In the future, it would be important to test the system with real data from different domains, such as healthcare, sales or industrial production. By making the necessary adjustments for each context, namely in terms of evaluation metrics and predictive models, it would be possible to test whether the system produces good results within a reasonable processing time. These tests are also important for optimising the time window set for updating the DM models. Other parameters, such as the score calculation and the algorithms used, can also be improved through testing.

Acknowledgments. This work has been supported by FCT—Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020, and the PhD grant: 2022.12728.BD.

References

1. Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., Zhang, G.: Learning under concept drift: a review. IEEE Trans. Knowl. Data Eng. 31, 2346–2363 (2019). https://doi.org/10.1109/TKDE.2018.2876857
2. Delen, D.: Introduction to predictive analytics and data mining. In: Predictive Analytics: Data Mining, Machine Learning and Data Science for Practitioners. Pearson FT Press (2020)
3. Cohen, L., Avrahami-Bakish, G., Last, M., Kandel, A., Kipersztok, O.: Real-time data mining of non-stationary data streams from sensor networks. Inf. Fusion 9, 344–353 (2008). https://doi.org/10.1016/j.inffus.2005.05.005
4. Kadwe, Y., Suryawanshi, V.: A review on concept drift. IOSR J. Comput. Eng. 17, 20–26 (2015). https://doi.org/10.9790/0661-17122026


5. Agrahari, S., Singh, A.K.: Concept drift detection in data stream mining: a literature review. J. King Saud Univ. - Comput. Inf. Sci. 34, 9523–9540 (2022). https://doi.org/10.1016/j.jksuci.2021.11.006
6. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In: Lecture Notes in Computer Science, pp. 286–295. Springer, Berlin, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28645-5_29
7. Baena-García, M., Del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R., Morales-Bueno, R.: Early drift detection method. In: 4th ECML PKDD International Workshop on Knowledge Discovery from Data Streams, pp. 77–86. Berlin, Germany (2006)
8. Bifet, A., Gavaldà, R.: Learning from time-changing data with adaptive windowing. In: SIAM International Conference on Data Mining, pp. 443–448 (2007). https://doi.org/10.1137/1.9781611972771.42
9. Guajardo, J.A., Weber, R., Miranda, J.: A model updating strategy for predicting time series with seasonal patterns. Appl. Soft Comput. 10, 276–283 (2010). https://doi.org/10.1016/j.asoc.2009.07.005
10. Schockaert, C.: A self-updating machine learning model strategy for credit card fraud detection (2013). https://doi.org/10.13140/RG.2.2.16141.56804
11. Kobayashi, V., Maret, P., Muhlenbach, F., Lherisson, P.-R.: Integration and evolution of data mining models in ubiquitous health telemonitoring systems. In: Stojmenovic, I., Cheng, Z., Guo, S. (eds.) Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, pp. 705–709. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11569-6_57
12. Loo, H.R., Marsono, M.N.: Online data stream classification with incremental semi-supervised learning. In: 2nd ACM IKDD Conference on Data Sciences, pp. 132–133 (2015). https://doi.org/10.1145/2732587.2732614
13. Jędrzejowicz, J., Jędrzejowicz, P.: Distance-based ensemble online classifier with kernel clustering. In: Neves-Silva, R., Jain, L., Howlett, R. (eds.) Intelligent Decision Technologies. Smart Innovation, Systems and Technologies, pp. 279–289. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19857-6_25
14. Hao, J., Bouzouane, A., Gaboury, S.: An incremental learning method based on formal concept analysis for pattern recognition in nonstationary sensor-based smart environments. Pervasive Mob. Comput. 59 (2019). https://doi.org/10.1016/j.pmcj.2019.101045
15. Hitachi Vantara: Pentaho Data Integration. https://help.hitachivantara.com/Documentation/Pentaho/9.4. Accessed 16 Jan 2023
16. Weka. https://www.cs.waikato.ac.nz/ml/weka/. Accessed 16 Jan 2023

Home Appliance Demand Forecasting: A Comparative Approach Using Traditional and Machine Learning Algorithms

Lissette Culcay1, Fernanda Bustillos1, and Diego Vallejo-Huanga1,2(B)

1 Universidad Politécnica Salesiana, Production and Industrial Operations, Quito, Ecuador
{lculcayg,mbustillost}@est.ups.edu.ec
2 Universidad Politécnica Salesiana, IDEIAGEOCA Research Group, Quito, Ecuador
[email protected]

Abstract. The manufacturing industry is considered one of Ecuador’s most important productive sectors because it is an excellent source of employment and national income. Durable consumer goods such as white and brown goods have shown a positive trend in their contribution to GDP, so there is an expectation of market growth in the coming years. The profitability of this industry depends on various internal factors, such as supply chain management, and external factors, such as market dynamics, which subsequently allow for generating demand forecasts. This scientific article uses sales data from an Ecuadorian white goods manufacturer to forecast demand in two production lines. The KDD methodology was used for data processing and model construction. Three classic forecasting methods were used in the experimentation: Simple Moving Average, Simple Exponential Smoothing, and ARIMA, and three forecasting methods that use artificial intelligence algorithms: Random Forest, K-Nearest Neighbors, and Artificial Neural Networks. The performance of the forecast models was evaluated using four error metrics: MSE, MAE, RMSE, and MASE. The first experiment considered all the observations in the dataset, while for the second experiment, the dataset was partitioned into training and test sections for cross-validation. Based on the results of the error metrics, ARIMA is the best-performing model among the classic algorithms and Random Forest among the Machine Learning models. Machine Learning models generally show a superior performance of up to 30% compared to classical forecasting methods when generating demand forecasts for household appliances.

Keywords: Time series · Consumer durables · Ecuadorian manufacturing industry · Data modeling

This work was supported by the IDEIAGEOCA Research Group of the Universidad Politécnica Salesiana in Quito, Ecuador.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 457–473, 2024. https://doi.org/10.1007/978-3-031-47715-7_32

1 Introduction

Demand forecasting allows for estimating the future consumption of a product or service over a specified period using historical data and related information. Adequate demand forecasting provides valuable information for strategic decision-making throughout the supply chain [4]. Supply chain management plays a strategic role in companies due to its impact on sales, production, and financial results. Therefore, effective consumer demand forecasting is critical to a successful supply chain [6].

Consumer behavior is increasingly complex for social and economic reasons. Factors such as holidays, income levels, and market supply add to the uncertainty of product demand, making the forecasting process increasingly unpredictable [8]. Demand forecasting directly affects capacity planning and inventory levels. Unexpected changes in demand can lead to product shortages when demand is high and, conversely, to overstocks when demand is low. These issues increase costs and reduce flexibility throughout the supply chain, from suppliers to consumers [16]. The demand forecast thus becomes a central axis that directs the entire supply chain.

Forecasting methods can be classified into two categories: traditional statistical models and models based on Machine Learning (ML). ML is a tool that helps reduce and eliminate bias in interpreting information because people are not directly manipulating it. It incorporates large amounts of data as input to the system and responds quickly to demand requirements through algorithms [13].

On the other hand, the manufacturing industry is considered one of the most important sectors for Ecuador since, by generating added value, it differentiates its products, allowing them to compete in national and international markets [7]. This economic sector is a great source of employment and income for the country. According to the Central Bank of Ecuador, manufacturing made the most significant contribution to the Gross Domestic Product (GDP) between 2013 and 2018, with an annual average of 14.09%. This analysis considered 7058 companies whose average annual income represents 21% of the total sales income generated by the corporate sector [5].

Consumer durables have a life cycle of several years. The growing trend in the consumption of durable goods such as televisions, mobile phones, refrigerators, and washing machines is attributed to greater access to electricity and greater purchasing power of consumers, which is reflected in a sustained increase in GDP [14]. The trade balance of goods and services in Latin America and the Caribbean reflects a percentage increase in GDP with a positive trend, generating expectations of growth in the internal market.

Given the Ecuadorian manufacturing sector’s importance and the limited research on household appliance demand forecasting in Latin America and the Caribbean, this article analyzes different demand forecasting models. It aims to analyze the current forecast method and compare traditional and non-traditional forecast models to identify the method with the best performance and the lowest error rate.
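As an illustration of the "traditional statistical" category discussed above, two of the classic methods used in this study's experiments (Simple Moving Average and Simple Exponential Smoothing) can be sketched in a few lines. These are the textbook formulations, not necessarily the exact configurations used by the authors.

```python
def simple_moving_average(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

def simple_exponential_smoothing(series, alpha):
    """Forecast the next value as an exponentially weighted average of past
    observations; alpha in (0, 1] controls how fast old data is forgotten."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level
```

A larger `window` (or smaller `alpha`) smooths out noise but reacts more slowly to real shifts in demand, which is precisely the trade-off that makes concept-drift-aware comparisons of methods worthwhile.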

1.1 Literature Review

Leyva et al. compare traditional techniques and techniques based on ML applied to the demand for textile products. In their research, the authors included macroeconomic indicators such as the unemployment rate, manufacturing GDP, and the consumer price index, variables determined to influence the behavior of this economic sector. They conclude that the Multilayer Perceptron (MLP), a type of Artificial Neural Network (ANN), performs better than other conventional ML algorithms, such as LASSO, Random Forest (RF) or k-Nearest Neighbors (kNN), and also than traditional statistical methods, such as Linear Regression (LR), Holt-Winters, Exponential Smoothing or ARIMA. This conclusion is based on the Mean Absolute Percentage Error (MAPE) and the Root Mean Squared Error (RMSE) obtained for each method [11].

On the other hand, Benhamida et al. designed a hybrid forecasting model for a Norwegian company’s intermittent and irregular demand. The article proposes the CombTSB model, which automatically selects the model that best suits each product’s time horizon and demand characteristics. This hybrid model comprises an automatic selection component (TSB) and a forecast component (Comb); the latter combines the ARIMA, Theta, and MLP methods. The proposed model was evaluated through metrics such as the Mean Absolute Error (MAE), the Mean Absolute Scaled Error (MASE), and the MAPE, and shows superior performance, being more precise for each Stock Keeping Unit (SKU) [3].

Vivas et al. systematically review forecast models for electric power demand based on ML algorithms and classical statistical models. In a case study, the authors analyze the performance of different models and compare them according to their MAPE values. This article analyzed 257 studies in different geographical areas.
The study determined that the most widely used models in the field of Artificial Intelligence (AI) are ANNs, followed by those that include fuzzy logic algorithms and Recurrent Neural Networks (RNN). Among the classical statistical models, ARIMA and regression analysis predominate. Multivariate models that incorporate several sources of variability in their fit tend to present, on average, high precision in their forecasts [17]. Similar results can be observed in the case study described by Hribar et al., who compared natural gas demand forecast models in the urban area of Ljubljana, Slovenia. The models were based on historical temperatures, forecast temperatures, and time variables such as special dates and holidays. Machine learning models such as Kernel Machines (KM), RNNs, and LR were implemented and compared with empirical models built from the characteristics of the data: the Two-Reservoir Model (TRM), the Two-Reservoir Model with Linear Memory (TLM), and the Two-Reservoir Model with Nonlinear Memory (TNM). The two most accurate models were the RNN and LR, a result attributed to their ability to incorporate historical and forecast temperature information. Empirical models are nevertheless helpful and perform well, because they are designed around the parameters that model the system's real-world behavior [9].


L. Culcay et al.

Kiefer et al. compare statistical, ML, and Deep Learning (DL) methods by applying a new metric called Stock-keeping-oriented Prediction Error Costs (SPEC). The article also used the traditional MASE metric to contrast the results against these two criteria. The statistical methods analyzed were Croston, Holt-Winters, and Auto-ARIMA; the ML methods were RF, XGBoost, and Auto-SVR; and the DL methods were MLP, Long Short-Term Memory (LSTM), and LSTM-2. The models were applied to data extracted from the Walmart company that presented intermittent and irregular behavior. The methods with the best performance according to MASE were LSTM, LSTM-2, and Auto-ARIMA, i.e., the DL models performed better. However, these results differ under the SPEC metric, according to which the best-performing models were Croston, LSTM, Auto-ARIMA, and Holt-Winters, i.e., the statistical models performed better. The authors attribute this difference to the characteristics of the data as a univariate time series; the article does not consider additional external data that could improve the performance of the ML and DL models [10]. A multi-layer LSTM is also proposed for furniture demand forecasting by Abbasimehr et al. The proposal configures an LSTM network that models time-series patterns. This model was compared with the ARIMA and Exponential Smoothing statistical methods, as well as with ANN, kNN, RNN, single-layer LSTM, and Support Vector Machine (SVM) models. Comparing the performance of the models through statistical tests, the authors conclude that the proposed model is superior in terms of RMSE and Symmetric Mean Absolute Percentage Error (SMAPE). They attribute these findings to its ability to capture real-world behavioral patterns in time-series data. The results also indicate that classical statistical methods such as ARIMA and Exponential Smoothing perform worse than SVM, ANN, and LSTM [2]. This article is organized into three subsequent sections. Sect. 2, Materials and Methods, presents the methodological procedures for data extraction and forecasting model generation in a case study of an Ecuadorian manufacturing company. Sect. 3, Experimentation and Discussions, shows the experimental results of applying the different models with a comparative approach to their performance. Finally, Sect. 4 presents the conclusions and limitations derived from the methodology and experimentation applied.

2 Materials and Methods

2.1 Industry Production Process for the Forecast

As a case study, an Ecuadorian appliance manufacturing company has been considered. Its average production from 2018 to 2020 was 500,000 units, with a payroll of approximately 2,000 workers. Demand forecasting performance has been identified as an opportunity for improvement, as it is a critical input to strategic decision-making in the supply chain. The company's current demand forecasting method is based on the Holt-Winters model for all SKUs, justified by the existence of a seasonal component in its sales. The statistical analysis results are adjusted according to the criteria of a committee of experts


from the commercial area. Based on this information, the dimensioning of the required capacity and inventory levels is carried out. Despite monitoring compliance with the sales budget, the company does not evaluate the accuracy of the demand forecast; therefore, it has not been possible to improve its performance.

Fig. 1. Macro-process of manufacturing appliances.

The current demand forecast procedure in the case study was analyzed, from the collection of requirements data to the distribution of the finished product. The commercial area of the company surveys needs through sales executives who maintain direct communication with the stores and commercial chains. In this first phase, the dataset is handled globally, processing quantities without detailing SKUs. The information collected is the input of the market intelligence department, which is responsible for consolidating and processing it and for delivering the demand forecast. The forecast proposal considers financial goals that maximize utility and market presence. The approved demand forecasts are shared with the different areas of the supply chain, which perform analyses complementary to the budget. These analyses contemplate inputs, which consider raw materials, tools, equipment, spare parts, etc.; manufacturing, which refers to the direct and indirect labor for the production process; and economic resources, the working capital necessary for production. This article evaluates the demand forecast, considering the historical information on actual sales as the latent element to be evaluated. Figure 1 shows a methodological scheme of the processes analyzed in the case study. The organization requires the dataset at the macro level to be approved within the production process and the budgets, which allows each part of the supply chain to start negotiation processes to secure production, stock, range, and product distribution.


With this general vision, the level of uncertainty is greater, and external factors generate significant impacts in each area of the organization: insufficient inventory levels that cause lost sales, of which no record is kept, or, on the other hand, excess inventories that imply maintenance and storage costs. These economic conditions are not currently dimensioned in the organization.

2.2 Dataset Description

The data analyzed for the case study correspond to sales, in units of white goods, of the Ecuadorian manufacturing company. The dataset has 836 observations, each representing the units sold in a calendar week from Monday to Sunday. There are two differentiated production lines: the Heat line, i.e., gas or induction cookers, and the Cold line, which comprises refrigerators, freezers, and refrigerated display cabinets. The dataset is not disaggregated by product type or SKU, which limits the depth of the analysis; however, it allows the classical and ML methods to be applied effectively. Organizations with appropriate information management could carry out product stratification or incorporate segmentation criteria according to the range of markets to which each product is directed. The market demand comprises actual sales plus sales lost due to a lack of product availability. The company under study does not currently record lost sales, so the market demand may be greater than the available data indicate. Including this type of complementary information could improve the accuracy of the forecasts, since the particularities of each type of product would be considered. Given these limitations in the level of detail, the dataset was analyzed at the level of the Cold and Heat production lines; the company's current process likewise uses production-line-level information for the initial analysis of production resource allocation. The period analyzed begins in week 01 of 2014 and ends in week 52 of 2021. Seven observations with a zero value were identified, implying that no billing was registered in one of the two lines. Two of these observations belong to week 01 of 2014, for both the Heat and Cold lines, and are attributed to the holiday at the beginning of the year. The other five observations occurred in 2020: weeks 15, 26, and 27 of the Heat line, and weeks 26 and 27 of the Cold line, possibly attributable to the Covid-19 pandemic in Ecuador. The years 2015 and 2020 comprised 53 weeks per production line, while the remaining years recorded 52 weeks each. The observations of the entire time series are split evenly, 50% for the Heat line and 50% for the Cold line. The average number of units sold per week over the total dataset is 5,298; for the Heat line, the average is 6,496 units, and for the Cold line, 4,100 units. In the Heat line, the minimum weekly sales value was 0 units and the maximum 21,351 units; in the Cold line, the minimum was 0 units and the maximum 18,918 units. The statistical values of both production lines are summarized in Table 1.

Table 1. Statistical data for the time series from 2014 to 2021 for the Heat and Cold production lines

Line   Statistic            2014    2015    2016    2017    2018    2019    2020    2021
Total  Instances             104     106     104     104     104     104     106     104
       Mode                    2       2       0       2       0       2       5       0
       Median               4237    4038    3496    5012    5184    5937    4347    6207
       Mean                 4271    4825    3893    5008    6195    6793    4975    6439
       Standard deviation   2641    3338    2578    2860    3848    4566    3770    3667
       Min                     0     441      48      20     484     378       0     143
       Max                 10922   18918   16786   12050   16935   21351   17340   16281
Heat   Instances              52      53      52      52      52      52      53      52
       Mode                    2       0       0       0       0       0       3       0
       Median               2624    3706    3366    7090    8111    9086    7205    8283
       Mean                 3441    4699    4202    6840    8467    9214    6652    8488
       Standard deviation   2594    3546    3238    2430    3864    4796    4136    3430
       Min                     0     441      48     462    1652     378       0    2084
       Max                 10764   13387   16786   12050   16935   21351   17340   16281
Cold   Instances              52      53      52      52      52      52      53      52
       Mode                    0       2       0       0       0       0       2       0
       Median               4719    4479    3598    2991    3396    3370    3057    3917
       Mean                 5102    4951    3585    3175    3923    4373    3299    4390
       Standard deviation   2440    3145    1656    1942    2109    2675    2422    2613
       Min                     0     614     183      20     484     392       0     143
       Max                 10922   18918    7327    8481   10452   13545    9789   10768

Fig. 2. Boxplots for the number of annual sales from 2014 to 2021.


The data from both production lines show two trends in the time series and a low presence of outliers. Taking 2014 as the base year, the trend up to 2016 is decreasing, with a drop of approximately 9%, while from 2017 to 2021, the trend is increasing. For 2017, there was growth of 17%; in 2019, the time series reached its maximum sales, 59% above the base year; and the second-best sales year was 2021, at 51%. Within the 2017-2021 growth, a decline is observed in 2020 due to the recession caused by the Covid-19 pandemic; even so, sales that year were 19% above the base year. Figure 2 summarizes the evolution of the time series from 2014 to 2021, using boxplots to mark the central trend and show possible outliers in the dataset. Figure 3, in turn, summarizes the number of sales for each year of the time series, broken down into the two production lines. A marked upward trend can be observed in the Heat production line: taking 2014 as the base, sales increased by 22% in 2016 and by 99% in 2017; the maximum occurred in 2019 with 168% growth, the second-best year was 2021 with 147%, and 2018 saw growth of 146%. Conversely, a downward trend is identified in the Cold production line, again taking 2014 as the base year: in 2015, sales decreased by 1%; in 2016, the decrease was 30%; the lowest point, in 2017, was 38% below the base; in 2020, affected by the pandemic, the decrease was 34%, while in 2019 and 2021 it was 14%.
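All the percentage figures above follow the same base-year computation, relative change versus 2014. A minimal sketch of that arithmetic (the annual totals below are illustrative placeholders, not the company's actual figures):

```python
def growth_vs_base(totals, base_year=2014):
    """Percent change of each year's total relative to the base year."""
    base = totals[base_year]
    return {year: round(100 * (value - base) / base)
            for year, value in totals.items()}

# Illustrative annual unit totals (placeholders, not real data):
example = {2014: 440000, 2016: 400000, 2019: 700000}
print(growth_vs_base(example))  # 2014 -> 0%, 2016 -> -9%, 2019 -> +59%
```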

Fig. 3. Unit sales per year and production line.

In general, the Heat production line presents an increasing trend similar to that of the total observations per year, i.e., the Heat production line carries a greater weight in the total than the downward trend of the Cold line.

2.3 Forecast Models

For data processing, the Knowledge Discovery in Databases (KDD) methodology [15] was used, which applies data mining techniques integrated with artificial intelligence and statistical models to discover patterns within databases. This methodology was adapted for a demand forecast model by selecting, analyzing, cleaning, and processing the dataset. For the computational processing of the data, the programming language R V4.2.0 and the development interface RStudio V2021.09.1 were used, because they are free, open-access software with a broad set of libraries for time-series modeling and data visualization. In the literature review, several authors establish a classification of forecasting methods into traditional statistical models and models based on Machine Learning. In traditional models, the input parameters are the dataset and the statistical method for the forecast, and the output is a prediction at t + 1. For an ML model, on the other hand, it is necessary to enter the dataset labeled with its respective output: in a first training stage, the algorithm recognizes the pattern within the dataset and develops a model for the prediction at t + 1, which is then validated in a second testing stage. This article uses the traditional models simple moving average, simple exponential smoothing, and ARIMA, and the ML models Random Forest, kNN, and ANN. These methods were chosen based on their relevance, performance, and ability to predict real behavior patterns in time series. To determine and compare the performance of the six algorithms, metrics were selected that calculate the error cross-sectionally, i.e., that can be applied to all the techniques without restriction. Error measures that the scientific literature has shown to be determinant and highly discriminating were also considered: the MSE, MAE, RMSE, and MASE.
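The four metrics follow their standard definitions; the sketch below (in Python, although the study itself used R) computes them for a set of forecasts, with MASE scaled by the in-sample MAE of the naive one-step forecast. The function name is ours, not from the paper.

```python
import math

def forecast_errors(actual, forecast, train):
    """MAE, MSE, RMSE, and MASE for a set of forecasts.

    MASE divides the MAE by the in-sample MAE of a naive forecast
    (each value predicted by its predecessor), so MASE < 1 means
    the method beats the naive benchmark on the training data."""
    errors = [a - f for a, f in zip(actual, forecast)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    rmse = math.sqrt(mse)
    naive_mae = sum(abs(train[i] - train[i - 1])
                    for i in range(1, len(train))) / (len(train) - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MASE": mae / naive_mae}
```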
These indicators are widely used in time-series analysis to measure forecast accuracy. Values close to zero are desirable in all error metrics, indicating better forecast results.

Simple Moving Average (SMA). The demand forecast for a period, $\hat{X}_t$, is defined by Eq. 1:

$$\hat{X}_t = \frac{1}{n}\sum_{i=1}^{n} X_{t-i} \quad (1)$$

For the case study, the size of the sliding window was determined through an iterative process in which values from 2 to n − 1 were tested. For each window size, an error metric was calculated, along with the overall average across all window sizes. The assigned sliding-window size is the one whose error measure is closest to the mean of all iterations.

Simple Exponential Smoothing (SES). The simple exponential smoothing method can be considered an evolution of the weighted moving average method. The forecast demand for a period, $\hat{X}_t$, is defined by Eq. 2:

$$\hat{X}_t = \hat{X}_{t-1} + \alpha\left(X_{t-1} - \hat{X}_{t-1}\right) \quad (2)$$
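A minimal Python sketch of the two forecasts and the window-selection rule just described (the study used R; the function names and RMSE choice for the selection loop are our own rendering, not the authors'):

```python
def sma_forecast(series, n):
    """Eq. 1: forecast for t+1 as the mean of the last n observations."""
    window = series[-n:]
    return sum(window) / len(window)

def ses_forecast(series, alpha):
    """Eq. 2: recursive smoothing, initialized with the first observation."""
    f = series[0]
    for x in series[1:]:
        f = f + alpha * (x - f)
    return f

def select_window(series):
    """Try window sizes 2..len(series)-1; keep the one whose one-step
    RMSE is closest to the mean RMSE over all sizes (the rule above)."""
    def rmse(n):
        sq = [(series[t] - sum(series[t - n:t]) / n) ** 2
              for t in range(n, len(series))]
        return (sum(sq) / len(sq)) ** 0.5
    sizes = range(2, len(series))
    errors = {n: rmse(n) for n in sizes}
    mean_err = sum(errors.values()) / len(errors)
    return min(sizes, key=lambda n: abs(errors[n] - mean_err))
```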


The assigned value of α was determined through an iterative process similar to that used in the SMA method. The domain of α was partitioned into ten sections, i.e., each step is 0.1, excluding the maximum (1) and minimum (0) values. For each value of α, an error measure and the overall average were calculated; the chosen α is the one whose error measure is closest to the mean of the errors.

ARIMA (ARM). This method is configured using the Akaike Information Criterion (AIC), a statistical metric that evaluates the quality of a model for a dataset by weighing its utility and complexity [1]. A confidence level of 99% was used to predict the next period.

Random Forest (RF). The model was configured with 500 trees and five regression nodes, and the significance of each predictor was calculated. A seed point was used to keep the training sample fixed.

kNN. The algorithm chooses the K closest training points and generates the forecast from the average of the target output values of these points. The parameter K determines the smoothness of the fit and is optimized based on the number of input data in the dataset [12].

ANN. This case study used an artificial neural network with an autoregressive model (NNAR). The model is represented as NNAR(p, P, k); when the dataset has seasonality, the parameter P is included, indicating the number of seasonal lags. The algorithm identifies the model that minimizes the coefficient of variation (σ). For the prediction of the next period, a confidence level of 99% was used.
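The kNN forecast can be sketched as follows: lagged windows of the series serve as feature vectors, and the prediction is the average target of the K most similar training windows. This is an illustrative rendering; the window length, distance measure, and names are our assumptions, not parameters reported by the authors.

```python
def knn_forecast(series, k=5, lags=4):
    """Forecast t+1 as the mean next-value of the k training windows
    closest (Euclidean distance) to the most recent lag window."""
    # Build (window, next value) training pairs; the final window,
    # which has no observed "next value" yet, becomes the query.
    pairs = [(series[i:i + lags], series[i + lags])
             for i in range(len(series) - lags)]
    query = series[-lags:]
    def dist(window):
        return sum((a - b) ** 2 for a, b in zip(window, query)) ** 0.5
    nearest = sorted(pairs, key=lambda p: dist(p[0]))[:k]
    return sum(target for _, target in nearest) / k
```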

3 Experimentation and Discussions

3.1 First Experiment

For the first experiment, the total set of sales from the dataset, in weekly units for the Heat and Cold lines, was used. The case study applied the classical and ML methods to determine the sales forecast for the following period, t + 1. The results generated by each method can be expressed as whole numbers or decimals; in this section they are summarized in Table 2 and presented as integers, since they represent discrete variables. In the SMA method, the RMSE metric was used to determine the sliding-window size, taking the value closest to the RMSE of the total average. For the Cold line, the calculated window size was 15, and the forecast value for the period t + 1 was 5438. For the Heat line, the sliding-window size was 170, and the forecast value for the period t + 1 was 8470. The same error metric as in the SMA method was used to determine the value of α for the simple exponential smoothing


Table 2. Results of the error metrics of the forecast models in the first experiment.

First experiment—total data

        Cold                                            Heat
Method  MAE      MSE          RMSE     MASE  t+1        MAE      MSE          RMSE     MASE  t+1
SMA     1961.05  6627100.05   2574.32  0.88  5437.93    3271.10  18089276.31  4253.15  1.03  8469.65
SES     1896.97  6832786.26   2613.96  0.86  4376.00    2771.88  13432883.78  3665.09  0.90  1263.00
ARM     1715.39  5485376.83   2342.09  0.76  4712.21    2586.12  11614311.32  3407.98  0.85  5220.14
RF      898.82   1541642.65   1241.63  0.40  5100.00    1261.34  2831765.36   1682.79  0.41  1346.00
KNN     1417.05  3937171.13   1958.87  0.63  4985.40    2155.47  7844794.77   2800.86  0.70  5219.60
ANN     151.38   50124.21     223.88   0.07  1227.00    885.45   1574225.46   1254.68  0.29  3593.00

method. The selected value of α was 0.60 for the Cold line, with a forecast for the next period t + 1 of 4376. For the Heat line, the value of α was also 0.60, with a forecast for the next period t + 1 of 1263. In the ARIMA method, the algorithm identified the model that best fits the time series. For the Cold line, the model was adjusted with the parameters (2, 0, 1), and the forecast for the next period was 4712, considering a confidence level of 99%. For the Heat line, the model was fitted with (1, 1, 1), with a forecast of 5220 and the same confidence level as for the Cold line.

Fig. 4. First experiment: time series for Cold line and classical forecast models.

Figure 4 shows the sales level of the Cold line with the results of the three classical forecast models. Comparing each model's curve with actual sales, the SMA curve adjusts slowly to the actual sales curve: when sales increase, the SMA generates lower values, and vice versa. The SES method approximates the behavior of the real sales curve in both increases and decreases, but with a shift to the right. The ARM method follows the trend of the actual sales curve, although not to the same extent.


For the three forecast algorithms, performance was evaluated with the four error metrics. The ARM method has the lowest error values for both the Cold line and the Heat line. In general, the error measurements of the SMA, SES, and ARM methods have close values, with an approximate variability of 13%. The error metrics of each method and the forecasts generated for the Cold and Heat lines are summarized in Table 2. Using the same methodology as the classical forecasting methods, the ML methods were then applied to the total time series. The kNN algorithm identified five nearest neighbors for the Cold line, and its forecast for t + 1 was 4985; for the Heat line, with the same number of neighbors, the forecast at t + 1 was 5220. In ANN, the algorithm identified the model that best fits the time series. For the Cold line, the model has the architecture NNAR(22-12-1), i.e., 22 neurons in the input layer, 12 neurons in the hidden layer, and one neuron in the output layer, with a prediction of 1227 at a confidence level of 99%. For the Heat line, the selected model has the architecture NNAR(14-8-1), a forecast for t + 1 of 3593, and the same 99% confidence level. Finally, the RF model generates a forecast at t + 1 of 5100 for the Cold line and of 1346 for the Heat line.

Fig. 5. First experiment: time series for the Heat line and machine learning forecast models.

Figure 5 shows the sales volume of the Heat line with the results of the three ML models. The curve of the ANN method is the one that best fits the sales time series. The kNN and RF methods show predicted trends similar to the sales curve, but not with the same accuracy as the ANN algorithm. Thus, ANN is the forecast method that generates the lowest error values for the two production lines.

3.2 Second Experiment

For the second experiment, a random cross-validation process was carried out on the total dataset, which was partitioned into two sections. The


first partition, the training set, is used to generate the forecasting model, which is then evaluated on the second partition, the test set. The primary purpose of this procedure is to evaluate each model objectively, since the instances in both partitions are chosen at random, rather than evaluating the algorithm's performance on the same input data it was fitted with. Both partitions therefore contain instances from both production lines. The training section, which generates the model, receives the larger share of the random instances: 70%, that is, 292 periods, while the test partition holds the remaining 30%, corresponding to 126 periods. A seed point was applied to the partitions to keep the randomization fixed, so that the experiments could later be replicated with all the forecast algorithms. In SMA, the training set defines the sliding-window size that is later applied to the test partition; the RMSE metric determined the minimum error value and its corresponding window size n. For the Cold line, the window size was n = 290; this optimal value cannot be transferred to the test partition because that data subset has only 126 instances, so no forecast can be obtained with this algorithm for the Cold line. The same happens for the Heat line, since the optimal window size in training is n = 256 and the test dataset again has only 126 instances. In the SES method, the value of α was chosen by the minimum RMSE on the training partition, and this optimal value was then applied to the test partition. For the Cold line, the α calculated in training is 0.30, with a forecast for the period t + 1 of 5516; on the test set, keeping α = 0.30, the forecast value is 4794. For the Heat line, the α calculated in training is also 0.30, with a training forecast for t + 1 of 6836 and a test forecast of 7897. The ARIMA method obtains the ideal configuration of (p, d, q) on the training section and replicates this configuration on the test set. In the training partition of the Cold line, the model identified (1, 0, 4) as the optimal configuration, with a value for the next period t + 1 of 4346; for the Heat production line, the configuration was set at (1, 1, 5), with a value for the next period t + 1 of 4666. The results of the error metrics indicate that ARIMA is the classical method with the best performance on the two production lines, considering that error and forecast values could not be calculated for the SMA algorithm. The results of the classical models on the test partition for the Heat line are plotted in Fig. 6. The SMA presents no results, as the training model cannot be applied. The ARM method maintains a better fit to the trend of the real sales curve, while the SES method follows the trend but does not fit the sales curve closely. The results of the classical models and their performance evaluation through the error metrics are summarized in Table 3, which shows that ARIMA is 11% better than SES on the test partition.
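The seeded 70/30 partition described above can be sketched as follows (in Python rather than the R used in the study; the helper name and seed value are illustrative):

```python
import random

def seeded_split(indices, train_frac=0.7, seed=42):
    """Random train/test partition that stays fixed across runs thanks
    to the seed point, so every algorithm sees the same two subsets."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# 418 weekly periods per production line -> 292 training, 126 test
train, test = seeded_split(range(418))
```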


Fig. 6. Second experiment: time series for the Heat line and classical forecast models on test data.

Table 3. Results of the error metrics of the forecast models in the second experiment.

Second experiment—partial data

                  Cold                                             Heat
Partition Method  MAE      MSE          RMSE     MASE  t+1         MAE      MSE          RMSE     MASE  t+1
Training  SMA     2208.68  7506280.30   2739.76  0.50  4135.57     2933.48  13057174.79  3613.47  0.82  7010.68
          SES     1819.08  6349278.83   2519.78  0.81  5516.00     2706.61  12956529.79  3599.52  0.80  6836.00
          ARM     1693.55  5599240.06   2366.27  0.75  4345.66     2626.83  12210965.44  3494.42  0.78  6114.84
          RF      893.45   1589810.04   1260.88  0.40  2888.00     1325.25  3096911.04   1759.80  0.39  9516.00
          KNN     1407.89  3957111.04   1989.25  0.62  3096.80     2159.26  8250337.26   2872.34  0.64  10080.60
          ANN     1658.60  5640809.06   2375.04  0.73  4075.00     2539.89  11210134.14  3348.15  0.75  5284.00
Test      SMA     –        –            –        –     –           –        –            –        –     –
          SES     1944.25  6099540.47   2469.72  0.87  4794.00     2786.53  13358600.49  3654.94  0.86  7897.00
          ARM     1787.09  5011213.41   2238.57  0.81  4666.22     2674.03  11974027.72  3460.35  0.81  6527.54
          RF      877.25   1402532.34   1184.29  0.40  4013.00     1403.79  3684856.01   1919.60  0.43  2769.00
          KNN     1518.43  3915962.82   1978.88  0.68  4104.20     2261.93  8862153.43   2976.94  0.69  1770.40
          ANN     1725.52  5046559.64   2246.45  0.77  4284.00     2383.68  10442105.36  3231.42  0.74  1034.00
For the ML methods, the same data subsets were used: the training set for model identification and the test set for cross-validation. The results were evaluated with the same error metrics as the classical methods. For the kNN method on the Cold line, a model with five nearest neighbors was established on the training partition; replicated on the test partition, it yields a forecast of 4104 for the period t + 1. On the Heat line, a regression model with five nearest neighbors was likewise identified, with a test forecast for the period t + 1 of 1770. For the ANN model, the training set identifies the ideal NNAR configuration, which is then evaluated on the test partition. For the Cold line, the model was fitted with NNAR(3-2-1), with a t + 1 forecast on the test set of 4284. For the Heat line, the configuration is NNAR(4-2-1), and the forecast for the period t + 1 on the test partition is 1034.


The third ML model is Random Forest, with the same configuration of 500 trees and five regression nodes, which yields a test-partition prediction of 4013 for the Cold line. For the Heat line, the predicted sales were 2769 units.

Fig. 7. Second experiment: time series for the Cold line and machine learning forecast models on test data.

Figure 7 shows the actual sales of the Cold line in the test partition together with the results obtained from the ML models. The RF model demonstrates the best fit to the increases and decreases of the actual sales curve. kNN tracks the sales curve but to a lesser extent, and ANN follows the sales curve on decreases but not on increases. The performance evaluation using the error metrics defined for both the Cold line and the Heat line is given in Table 3. It is worth mentioning that the variation in error measures between the kNN, ANN, and RF methods is considerable.

4 Conclusions and Limitations

Considering the error metrics used, the forecast models based on ML algorithms show performance up to 30% higher in the experiments than the classical models. Among the classical forecasting algorithms, the ARIMA method has the lowest error values. Among the ML models, ANN presents the best performance in the first experiment, while RF is the best-performing algorithm in the second experiment. The calculation of the demand forecast using traditional statistical methods and ML models, as well as the evaluation of its reliability, was generated at the production-line level based on historical sales data. This level of information corresponds to that required in the productive resource allocation instance of the macro process. However, the analysis of product-level demand is limited by the lack of complementary information and of greater detail in the available data.

4.1 Future Work

A lower forecast error represents lower uncertainty in the administration of companies' supply chains and resources. For this reason, it is recommended that the company in the case study establish data management policies to integrate the variables that influence demand, such as lost sales, which must be recorded to obtain the actual market demand. This information should be complemented with internal factors such as product characteristics and the segmentation and stratification of products and markets, as well as external factors such as consumer price indices and inflation, among others relevant to the market. These variables enhance the performance of the algorithms used to generate the demand forecast. A limitation of this article is the small number of explanatory variables considered in the forecasting process. Since ML models perform better when a more significant number of internal and external variables are integrated into the model, it is recommended that future work consider variables that reflect the dynamics of this industry.

References 1. Demand-Driven Forecasting, Chap. 7, pp. 203–237. Wiley (2013) 2. Abbasimehr, H., Shabani, M., Yousefi, M.: An optimized model using lstm network for demand forecasting. Comput. Ind. Eng. 143, 106435 (2020) 3. Benhamida, F.Z., Kaddouri, O., Ouhrouche, T., Benaichouche, M., CasadoMansilla, D., L´ opez-de Ipi˜ na, D.: Stock&buy: A new demand forecasting tool for inventory control, pp. 1–6 (2020) 4. Boone, T., Boylan, J.E., Fildes, R., Ganeshan, R., Nada, S.: Perspectives on supply chain forecasting. Int. J. Forecasting 35, 121–127 (2019) 5. Camino-Mogro, S., Armijos-Yambay, M., Parrales-Guerrero, K., Herrera-Palt´ an, L.: Estudio sectorial: La eficiencia de las empresas manufactureras en el ecuador 2013–2018 (2020) 6. Gon¸calves, J., Cortez, P., Sameiro Carvalho, M., Fraz˜ ao, N.: A multivariate approach for multi-step demand forecasting in assembly industries: Empirical evidence from an automotive supply chain. Decis. Supp. Syst. 142, 113452 (2021) 7. Guevara-Rosero, G.C.: Determinants of manufacturing micro firms’ productivity in ecuador. do industry and canton where they operate matter? Region. Sci. Policy & Pract. 13(4), 1215–1248 (2021) 8. Guo, L., Fang, W., Zhao, Q., Wang, X.: The hybrid prophet-svr approach for forecasting product time series demand with seasonality. Comput. Ind. Eng. 161, 107598 (2021) ˇ 9. Hribar, R., Potoˇcnik, P., Silc, J., Papa, G.: A comparison of models for forecasting the residential natural gas demand of an urban area. Energy 167, 511–522 (2019) 10. Kiefer, D., Grimm, F., Bauer, M., van Dinther, C.: Demand forecasting intermittent and lumpy time series: Comparing statistical, machine learning and deep learning methods. In: Proceedings of the 54th Hawaii International Conference on System Sciences (2021) 11. Lorente-Leyva, L.L., Alemany, M.M.E., Peluffo-Ordo˜ nez, D.H., Araujo, R.A.: Demand forecasting for textile products using statistical analysis and machine learning algorithms. pp. 181–1*7 (2021)

Home Appliance Demand Forecasting: A Comparative Approach


Predictive Analysis of Bank Marketing for Financial Decision Support and Smart Economy

Hiral Vashi1, Jyoti Yadav2, and Aparna S. Varde3(B)

1 IT Delivery Lead, Stryker Corporation, Kalamazoo, MI, USA
2 Data Science Program, Montclair State University, Montclair, NJ, USA

[email protected]

3 Department of Computer Science, Montclair State University, Montclair, NJ, USA

[email protected]

Abstract. An active purpose of financial organizations is to preserve existing customers and attract potential long-term ones. Bank marketing campaigns often depend on huge electronic data from a plethora of customers. Given the enormous and ever-growing data, it is not feasible for human analysts to procure interesting information and derive inferences for financial decision support. This motivates us to build a software tool for predictive analysis of bank marketing based on data mining from customer profiles. The success of telemarketing depends on various factors such as the customers’ age, job, loan status etc. Hence, these factors constitute various features analyzed by data mining to predict customer tendencies with respect to marketing campaigns. We deploy classical methods of association rules and decision trees because they fall in the category of explainable AI and hence provide good interpretability for decision-making. The resulting software tool helps to predict the types of clients that will subscribe to a given term deposit. Hence, it aims to improve bank marketing by targeting more customers, hitting the right audience. It assists telemarketing campaigns and offers financial decision support, in line with e-commerce. This work fits the theme of smart economy, an important characteristic of smart cities.

Keywords: Banking · Customer profiles · Data mining · e-Commerce · Explainable AI · Financial decision support · Machine Learning · Predictive analysis · Smart cities · Smart economy · Telemarketing · White-box models

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 474–490, 2024. https://doi.org/10.1007/978-3-031-47715-7_33

1 Introduction

Marketing companies typically use various campaigning methods to improve their business. Financial institutions such as banks usually select prospective customers and reach out to them through personal contacts, telephone, email, postal mail etc. to advertise new products or services. This process is referred to as direct marketing. It can be assisted by computational analysis via techniques in data mining and machine learning, given the vast amounts of data consisting of customer profiles and related information, practically infeasible for domain experts to analyze [1]. This is precisely the problem we address in this paper. Our problem is defined with the following goals:


• Analyze customer profiles to discover knowledge on bank marketing.
• Build a software tool for predictive analysis in financial decision-making.

The novelty of this paper (to the best of our knowledge) is that it is among the first to present a software tool via explainable AI methods for financial decision support in bank marketing, more specifically term deposits, by mining customer profiles. Thus, our work is among the first to build a predictive analysis tool for bank marketing through knowledge discovery from existing customer profiles, in order to target the appropriate audience for bank marketing campaigns. We explore customer data based on attributes such as age, job, prior loan and more, to discover knowledge using two machine learning techniques in explainable AI: association rule mining, and decision tree classification. This knowledge is used to develop a software tool wherein users can enter inputs on new potential customer profiles and receive a predicted output on whether the new customer would subscribe to a given term deposit. Hence, future trends are predicted based on past customer analysis. This helps to target more suitable potential customers for marketing campaigns, so that services are offered to correctly identified prospective buyers. This work makes positive impacts on e-commerce, and hence broadly impacts smart economy, an important facet of smart cities.

As well-known in the AI community, data mining is the process of discovering new and interesting knowledge from huge amounts of data. This includes locating hidden and unknown relationships between features within the data records, spotting unusual events and patterns, summarizing the information, e.g. by extracting rules for predictive analysis, and minimizing the features that describe the discovered knowledge for adequate presentation.
Explainable AI plays a very important role here because the machine learning techniques in this realm, such as decision tree classifiers and association rule mining, provide white-box methods that are easy to comprehend, and hence conveniently interpreted to trace the causes of specific trends [2, 3]. This is vital in decision support applications where domain experts need to understand the reasons behind various inferences. In the context of our work, such techniques are useful in bank marketing analysis. Stakeholders including bank managers, marketing professionals and financial decision-makers can benefit from such white-box models.

We thus propose to deploy an approach for data mining on bank customer profiles that entails association rules to discover relationships of the type “A implies B” to spot interesting trends in the data, along with decision trees to trace specific paths and arrive at the respective inferences based on various conditions. We assert that both these techniques in classical data mining are useful in the context of our analysis because they are white-box machine learning methods in explainable AI. This is important in our work to fathom the reasoning in the knowledge discovery to assist adequate decision-making. Based on the data mining, we build a tool for predictive analysis of bank marketing. This tool serves as a prototype to support financial decision-making in banks. More specifically, in this work, we aim to predict the types of clients that are likely to subscribe to a given term deposit.

This paper makes direct impacts on e-commerce by helping to buy and sell services electronically, more specifically term deposits here (by helping stakeholders to hit the right audience in marketing campaigns). Hence, it makes broader impacts on smart economy, which is one of the significant facets of smart cities today. In the forthcoming sections, we describe our data mining approach, the predictive analysis tool, and its contributions to financial decision support with impacts on e-commerce and smart economy. We also overview related literature in the area, finally presenting our conclusions, along with the scope for future work.

2 Data Mining on Bank Customer Profiles

2.1 Data Description and Preprocessing

The data in this work is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns are mainly based on phone calls. Often, more than one contact to the same client is required to assess whether the product (bank term deposit) would be subscribed to (Yes) or not (No). The data comprises 45,211 records with numeric attributes, e.g. age, balance, day, duration; categorical attributes such as job, marital status, and education; binary attributes on whether the customer has a housing loan, personal loan etc.; and the target attribute of whether the customer applies for a fixed term deposit. Table 1 summarizes the attributes of the data acquired in this work. These are used for further learning through knowledge discovery by data mining.

The main preprocessing steps in our work include discovering the most relevant attributes through methods such as forward selection and backward elimination, converting numeric data into nominal as needed, and performing discretization or normalization. The popular WEKA tool [4], which has received a service award in the premier ACM KDD conference, is used for data preprocessing as well as fundamental execution of the mining techniques here. The data attributes are visualized from various perspectives to provide at-a-glance displays for easier comprehension. Figure 1 is an excerpt from the visualization.

2.2 Association Rule Mining

Association rules, as well-known in data mining, are statements that help to uncover relationships between seemingly unrelated data in a relational database or other information repository. Association rule mining can be used to discover relationships between items frequently occurring together. Applications of association rules include market basket data analysis, cross-marketing, catalog design and more [5]. We harness the interestingness measures popular in association rules, namely, support and confidence.
Interesting association rules are expected to satisfy a user-specified minimum support and a user-specified minimum confidence, forming the experimental parameters in our work. We deploy the classical Apriori algorithm [5] for association rule mining in this work. Note that our goal is to find customers who will say “yes” to subscribing to a given term deposit, represented here as the target attribute. Hence, we discover association rules such as the following.

• Job = “management” ==> Subscribe = “no”
• Marital = “married” ^ Housing = “yes” ==> Subscribe = “yes”
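To make these interestingness measures concrete, the sketch below computes support and confidence for rules of this form over a handful of customer records. The records and the resulting numbers are purely hypothetical illustrations, not drawn from the actual Portuguese bank dataset:

```python
# Hypothetical customer records: attribute -> value dicts (illustrative only).
records = [
    {"job": "management", "marital": "married", "housing": "yes", "subscribe": "no"},
    {"job": "technician", "marital": "married", "housing": "yes", "subscribe": "yes"},
    {"job": "management", "marital": "single",  "housing": "no",  "subscribe": "no"},
    {"job": "services",   "marital": "married", "housing": "yes", "subscribe": "yes"},
]

def support_confidence(records, antecedent, consequent):
    """Support = P(antecedent AND consequent); confidence = P(consequent | antecedent)."""
    matches_a = [r for r in records
                 if all(r.get(k) == v for k, v in antecedent.items())]
    matches_ab = [r for r in matches_a
                  if all(r.get(k) == v for k, v in consequent.items())]
    support = len(matches_ab) / len(records)
    confidence = len(matches_ab) / len(matches_a) if matches_a else 0.0
    return support, confidence

# Rule: Marital="married" ^ Housing="yes" ==> Subscribe="yes"
s, c = support_confidence(records,
                          {"marital": "married", "housing": "yes"},
                          {"subscribe": "yes"})
print(f"support={s:.2f}, confidence={c:.2f}")  # support=0.50, confidence=0.67
```

Apriori then enumerates only those candidate itemsets whose support exceeds the user-specified minimum, pruning supersets of infrequent itemsets before rules are formed.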


Table 1. Description of bank marketing data with sample customer profile information.

Attribute      | Explanation                                                                  | Value
Age            | Age of the customer (numeric)                                                | 1–100 years
Job            | Employment of customer (categorical)                                         | {Admin, unknown, unemployed, management, housemaid, entrepreneur, student, blue-collar, self-employed, retired, technician, services}
Marital status | Relationship/marriage (categorical)                                          | {Married, single, living_with_partner, divorced, other}
Education      | Level of education (categorical)                                             | {Unknown, secondary, primary, tertiary}
Default        | Binary attribute on whether the customer has credit in default               | {Yes, No}
Balance        | Average yearly balance with bank (numeric)                                   | Number of Euros
Housing        | Binary attribute on whether the customer has a housing loan                  | {Yes, No}
Contact        | Type of contact with customer (categorical)                                  | {Unknown, telephone, cellular, email, post}
Day            | Last contact day of the month (numeric)                                      | 1 to 31
Month          | Last contact month of the year (numeric)                                     | 1 to 12
Duration       | Last contact duration (numeric)                                              | Number of seconds
Campaign       | Contacts during a campaign for a customer                                    | Number of contacts
Pdays          | Days passed after the customer was last contacted from a previous campaign   | Number of days
Previous       | Contacts before a campaign for a customer                                    | Number of contacts
Poutcome       | Outcome of previous marketing campaign (categorical)                         | {Unknown, other, failure, success}
Term deposit   | Target attribute on whether the customer subscribes to the given term deposit | {Yes, No}


Fig. 1. Excerpt from visualization of attributes in bank marketing data.

These can be interpreted respectively as: “If the customer’s job is management, they will not subscribe to the term deposit”, and “If the customer is married and has housing, they will subscribe to the term deposit”. Likewise, many rules are discovered. Moreover, association rule mining can be used to study the relationships among various attributes in the data, in addition to their relationship with the target attribute. For example, we can analyze how the education level of the customer relates to their bank balance. Figure 2 presents a screenshot of visual analytics using association rule mining where any two attributes can be visualized together to observe their mutual relationship. In this case, the attributes analyzed are “balance” and “duration”, referring to the customer’s average yearly balance in Euros, and the last contact duration with the customer in seconds. Furthermore, each attribute can be analyzed with respect to how it affects the target. In this case, the blue values (Yes) indicate customers who subscribed to the term deposit while the red ones (No) indicate those who did not. Hence, this provides a convenient display for easy comprehension. Furthermore, the discovered association rules can be coded into the predictive analysis tool for analyzing various customer tendencies.

Fig. 2. Example screenshot of association rule mining with visual analytics in bank marketing.

2.3 Decision Tree Classification

A decision tree, also well-known in data mining, is a classifier expressed as a recursive partition of the instance space [6]. The decision tree consists of nodes that form a rooted tree, i.e. it is a directed tree with a node called “root” that has no incoming edges. A node with outgoing edges is called an internal or test node. Classification rules can be derived by tracing decision tree paths to predict a target attribute. We deploy the classical J4.8 algorithm for decision trees in our work [4, 6]. Tree pruning is conducted to reduce overfitting to the training data. The classification with decision trees helps us trace paths such as those depicted within a partial snapshot in Fig. 3. Here, “duration” becomes the root of the tree due to its highest information gain, followed by other attributes such as “education” and “job”, eventually leading to the classification of the target attribute on whether the customer will subscribe to the term deposit (yes/no). Such paths help to estimate the tendency of potential customers, and the derived classification rules help to code the program for building the predictive analysis tool. Hence, new customers’ data can be entered in order to estimate whether they can be prospective clients to apply for the given term deposit. In other words, this can help to estimate whether these customers should be contacted for telemarketing campaigns. More details on predictive analysis appear in the next section.
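The root-selection criterion described above — choosing the attribute with the highest information gain, which makes “duration” the root in our tree — can be sketched as follows. The tiny labeled dataset and its discretized `duration` values are hypothetical illustrations, not the real bank data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(records, attribute, target="subscribe"):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# Hypothetical records with a discretized "duration" attribute.
records = [
    {"duration": "long",  "housing": "yes", "subscribe": "yes"},
    {"duration": "long",  "housing": "no",  "subscribe": "yes"},
    {"duration": "short", "housing": "yes", "subscribe": "no"},
    {"duration": "short", "housing": "no",  "subscribe": "no"},
]

# "duration" separates the classes perfectly here, so it is chosen as the root.
root = max(["duration", "housing"], key=lambda a: information_gain(records, a))
print(root)  # duration
```

J4.8 (WEKA’s C4.5 variant) actually uses the related gain-ratio criterion and adds pruning, but the greedy attribute-selection idea is the same.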

3 Tool for Bank Marketing Predictive Analysis

3.1 Implementation of the Tool

A bank marketing predictive analysis tool is developed based on the knowledge discovered from the real data in this work. This is a Java-based implementation that utilizes the results of the data mining conducted using association rules and decision trees on the data described here. This predictive analysis tool can be used to predict the potential outcomes of prospective bank marketing campaigns.


Fig. 3. Partial snapshot of decision tree classifier in bank marketing data.

The input to this tool consists of values of the customer attributes such as age, housing, job etc. Upon entering different attribute values representing different customer traits, the system produces a prediction of “yes” or “no” as per the given customer’s intention to subscribe to the given term deposit. This helps in supporting decisions about the types of customers that can be targeted during various real bank marketing campaigns, e.g. through telemarketing. Hence, the output displayed on the screen is a user-friendly plain text message about the estimation of whether the customer will subscribe to the term deposit. This plain text also helps to decipher the reasons behind a potential customer’s tendency in a simple, comprehensible manner. The implementation of this tool for predictive analysis of bank marketing data can be summarized in Algorithm 1, as described next.


Algorithm 1: Implementation of Bank Marketing Predictive Analysis Tool
Input: Attributes α, δ, β, σ, γ, λ where α=age, δ=duration, β=housing, σ=marital status, γ=job, λ=loan (prior)
1. GET α as numeric variable [1-100]
2. GET δ as numeric variable [min, max] as upper & lower limits of duration
3. GET β as Boolean variable [Y/N]
4. GET σ as categorical variable [married, single, …]
5. GET γ as categorical variable [management, technician, …]
6. GET λ as Boolean variable [Y/N]
7. SET TD=Term Deposit
8. USE Learned Hypothesis from Decision Trees {Hτ} to find closest path with threshold > μ // μ is the minimum threshold
9. USE Learned Hypothesis from Association Rules {Hρ} to find closest match with confidence > ε, support > η // ε is the minimum confidence; η is the minimum support
10. COMPUTE TD based on {Hτ}, {Hρ} as [Y/N] // predicts customer subscription to term deposit
11. SET μ=message // user-friendly text message to present the predicted outcome
12. IF TD=Y, ASSIGN μ = “The person can subscribe for deposit”; Go To 14
13. ELSE TD=N, ASSIGN μ = “The person will not subscribe for deposit”
14. SET ω=Reason // user-friendly text message for the reasoning behind the outcome
15. TRACE {Hτ} for reasons of TD=[Y/N], convert to t1 // plain text reasoning (from trees)
16. SEARCH {Hρ} for reasons of TD=[Y/N], convert to t2 // more plain text reasoning (from rules)
17. ASSIGN ω = t1, t2
18. RETURN final Output using μ, ω
Output: μ, ω in plain text along with reasons
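As a rough illustration of the prediction step in Algorithm 1, the sketch below combines learned decision tree paths {Hτ} and association rules {Hρ} to produce the plain-text message μ and reasons ω. The rule tables, scores, and threshold here are hypothetical placeholders, not the actual hypotheses learned from the real data:

```python
# Hypothetical learned hypotheses: each entry maps attribute conditions to a
# predicted outcome ("Y"/"N") with a score (path confidence for tree paths,
# rule confidence for association rules). Placeholders, not mined results.
TREE_PATHS = [
    ({"loan": "no", "housing": "yes"}, "Y", 0.81),
    ({"loan": "yes"}, "N", 0.74),
]
ASSOC_RULES = [
    ({"marital": "married", "housing": "yes"}, "Y", 0.67),
    ({"job": "management"}, "N", 0.60),
]

def predict_term_deposit(customer, min_score=0.5):
    """Return (message, reasons) for a customer profile, per Algorithm 1."""
    votes, reasons = [], []
    for conditions, outcome, score in TREE_PATHS + ASSOC_RULES:
        # Keep only hypotheses above the minimum score whose conditions match.
        if score > min_score and all(customer.get(k) == v
                                     for k, v in conditions.items()):
            votes.append(outcome)
            reasons.append(f"{conditions} -> {outcome} (score {score:.2f})")
    td = "Y" if votes.count("Y") >= votes.count("N") and votes else "N"
    message = ("The person can subscribe for deposit" if td == "Y"
               else "The person will not subscribe for deposit")
    return message, reasons

msg, why = predict_term_deposit({"loan": "no", "housing": "yes",
                                 "marital": "married", "job": "technician"})
print(msg)  # The person can subscribe for deposit
```

In the real tool, these rule tables would be generated from the WEKA mining results rather than hard-coded.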

3.2 System Demo and Experiments

A basic system demo of our prototype bank marketing predictive analysis tool is presented as follows. We include demo snapshots with sample input conditions and the predicted output. Figure 4 is an example of a demo snapshot, showing the predicted “Output” as positive, i.e. where the customer is likely to subscribe to the given term deposit. Figure 5 shows a similar snapshot, this time with the “Reason” behind the output. For example, one aspect of the reasoning is that a customer without an existing loan (as indicated by the “Loan: No” button in the snapshot) is likely to subscribe to the term deposit. This snapshot can be obtained simply upon clicking the “Reason” button instead of the “Output” button. Note that these are only partial snapshots from the tool, as shown for the selected input conditions and the corresponding outputs. Conversely, Fig. 6 depicts an example with input conditions where the predicted “Output” is negative, implying that the customer would not subscribe to the term deposit. Likewise, Fig. 7 presents a snapshot of the “Reason” with respect to Fig. 6, i.e. the same case, this time indicating the reasoning with a short phrase. Many such demo snapshots can be studied in order to gain more insights into customer tendencies in various bank marketing scenarios. Hence, the system demo of this tool can assist in real decision-making, both by reaching out to potential customers in the positive category and, additionally, by fathoming the reasons behind the positive as well as negative categories. The former would help to tap more customers by building further upon the causes that lead to the positive outcomes, while the latter would help to improve future outcomes by addressing the causes leading to the negative outcomes. Hence, on the whole, explainable AI with classical methods of association rules and decision trees can be helpful in these processes. Accordingly, they can be helpful in various other related scenarios as well.

Fig. 4. Demo snapshot of bank marketing predictive analysis tool with positive output message.

Fig. 5. Continuation of demo snapshot of the tool along with reasons for positive output.

Fig. 6. Example of demo snapshot of the tool, predicting negative output with a message.

Multiple users experiment with our predictive analysis tool for bank marketing. The users are from academia and industry, in areas including computer science, finance, information technology, data science, economics, business analytics, management information systems, marketing and international business. We give the users a demo, based on which they experiment with parameter settings of their choice (to analyze various scenarios). Accordingly, they indicate whether they are satisfied with the tool (Yes/No). The tool is a prototype, and hence the user evaluation is simplistic at this point. Around 80% of the users indicate that they find the tool useful, i.e. they are satisfied with its performance. On the whole, we receive positive responses from most users, hence corroborating the basic effectiveness of the prototype tool. We now address some applications, as relevant to our work in this area.


Fig. 7. Continuation of demo snapshot, with reasons for the negative output in the same case.

4 Applications: Financial Decision Support and Smart Economy

Advancements in technology have revolutionized bank marketing through the use of data warehousing, mining, and campaign management software. Data mining enables banks to identify valuable customers, predict their behavior, and make proactive decisions. It provides a deeper insight into customer behavior than traditional history-oriented analysis, allowing banks to anticipate customer needs and tailor their marketing efforts accordingly. Different data mining techniques have their own advantages and challenges; these techniques play a crucial role in achieving the goals of modern bank marketing [1]. In predictive machine learning approaches, the last two decades have witnessed ANN being extensively used in a plethora of applications. Yet, some analysts prefer explainable AI with white-box models such as decision trees to enhance interpretation and reasoning, as we advocate in our work.

The field of bank marketing has advanced significantly with technology and data analysis, focusing on the use of marketing strategies for financial decision support that provide more personalized and relevant financial information [2]. Bank marketing provides useful information to customers about available financial products and services. This information can help customers make informed decisions about how to manage their money, and can also help banks in hitting the right audience for direct marketing. Banks can market their digital payment solutions, via tools such as mobile banking apps or online payment gateways, to harness potential customers who use e-commerce platforms for online shopping and banking. Moreover, this can help the customers choose the best and most secure payment options for their online transactions [7].
Banks can promote their cryptocurrency services, such as buying, selling, or holding cryptocurrencies, to customers who are interested in investing in Bitcoin, Stablecoin or other digital currencies. This can help customers understand the risks and benefits of investing in cryptocurrencies and hence make informed decisions about their investments [8]. Our predictive analysis tool is a step towards applications in this area. Analogous to predicting whether a customer will subscribe to a given term deposit, it can estimate related tendencies, e.g. investment in cryptocurrencies. This has a twofold advantage of


helping banks harness potential customers, as well as helping customers decide what is best for them. Moreover, banks can educate their customers about the basics of e-commerce and cryptocurrency through campaigns, articles, videos, or webinars. This can assist customers in understanding the underlying technology and usage of e-commerce and cryptocurrency, and in making informed decisions about how to use them. Banks can also provide financial advice to customers on how to manage their funds in e-commerce platforms or cryptocurrency. This includes information on how to budget, save, and invest their money, as well as how to minimize the risks of fraud and cybercrime when using digital platforms [8]. Figure 8 outlines the general theme of such work [9]. The class membership-based (CMB) approach mentioned here is designed to improve the quality of logical prediction by reshaping diversified datasets; it can thus support existing data mining and machine learning algorithms by helping to clean every sort of diversified dataset. In line with such work, our predictive analysis tool for bank marketing makes a modest contribution. It predicts yes/no classifications (analogous to the red box here) by data mining on existing customer profiles, and thus proffers decision support in bank marketing applications.

Fig. 8. The global process of a predictive classification system with the CMB approach [9].

On the whole, predictive analysis helps corporations utilize their data to make better, more well-informed decisions, drawing logical conclusions from authentic data about recent events together with forward-looking thinking. It can be achieved by applying a variety of data mining, statistical, and machine learning techniques to make predictions. Figure 9 synopsizes the usefulness of predictive analysis in bank marketing [10]. It places predictive modeling for banks at center stage, with facets such as retention, segmentation, fraud/recovery, and cross-selling around the modeling, based on the analysis. Our predictive analysis tool can fit into the “modeling” aspect here, such that its outcomes can potentially guide various facets, including targeting the right customers for retention and cross-selling schemes, and setting the stage for more applications based on its prototype, which can provide the framework for further development of more advanced tools. Some of these tools can be tailor-made to specific needs, depending on the bank, its clientele, the location, and other factors.

Fig. 9. Usefulness of predictive analytics in banking [10].

Hence, on the basis of this general discussion, we claim that the predictive analysis tool in this paper has applications that address financial decision support. It fits well within a myriad of work in this area. In sum, it contributes to the realm of e-commerce and makes impacts on smart economy, an important characteristic of smart cities.

5 Related Work

A study has been conducted to understand the focus of a bank on customer relationships to build strong, profitable connections by managing customer impressions, culture, and treatment, while developing main competencies and strategies using artificial neural networks (ANN) to predict customer behavior [11]. This study uses an integrated data mining model to analyze bank customer behavior via a self-organizing map (SOM) neural network and association rules with the Apriori algorithm. It identifies groups of customers based on repayment behavior, classifies customers into three groups, and profiles them using customer features. The result shows that a behavioral scoring model is useful in predicting customer characteristics and helps in developing marketing strategies. Note that while ANN is used here for predictive analysis, it is augmented by association rules for a more descriptive analysis.


Several data-driven marketing studies have proposed data mining approaches to predict the success of telemarketing for banks. An interesting study [12] compares four models (logistic regression, decision trees, neural network, support vector machine) and finds that the ANN has the best results, reaching 79% of subscribers. Sensitivity analysis and decision tree classification are applied to the ANN model to reveal the key attributes, making the model credible for telemarketing campaigns. Thus, it is evident that though ANN yields the best results in this study, a more explainable model such as a decision tree is needed to discover the key attributes significant for telemarketing. A study of 1,000 college students in Bahrain is performed in order to identify criteria for attracting new customers for commercial banks [13]. Thirty selection factors are used, including bank reputation, parking, personnel friendliness, and ATM availability/location. Results reveal that reputation, parking, personnel friendliness, and ATM location are the main factors in bank selection, while gender may play a role in the selection process. However, this study does not actually build a predictive analysis tool. Many other studies are conducted in related areas. A framework for telemarketing is built by deploying machine learning models; it makes impacts on risk assessment among other factors in financial decision support [9]. Association rules as well as item-based collaborative filtering are executed on Amazon product and user data to lay the foundation for a baseline recommender assisting e-commerce [14]. A customer behavioral scoring model is designed [15], integrating data mining techniques in the general context of expert systems.
A study is conducted [16] on correlating socio-economic status in various regions with the use of alternative fuel vehicles and the impacts on the air quality index, along with predictive analysis, using Pearson’s correlation coefficient and linear regression. An overview of smart economy in smart cities is presented [17] from a 21st century perspective, considering numerous advancements. The Minimax algorithm in game theory is adapted in electric vehicle ride-sharing [18], where each company acts as a player aiming to win rides through its strategies, contributing to energy-efficiency and cost-effectiveness, affecting smart economy. Human and machine collaboration is explored for making positive impacts on the economy [19] via AI technologies. An approach for smart manufacturing is proposed based on commonsense knowledge in human-robot collaboration [20], explaining how it can be more effective than humans or robots working alone, and how the approach makes manufacturing processes more efficient, thus saving costs. Such work on human and AI systems working together can make good impacts on smart economy. Other pertinent research, e.g. [21–31], can directly or tangentially impact financial decision support, smart economy, sustainability, e-commerce, and related areas. Our work in this paper fits the general theme of such research and goes a step beyond by providing a user-friendly tool for predictive analysis in bank marketing, entailing visualization and reasoning in addition to estimation of customer tendencies. This tool based on data mining from existing customer profiles serves as a prototype and paves the way for further developments on a much larger scale that can be used for financial decision support and e-commerce. It makes impacts on the characteristics of smart economy in smart cities, leveraging more automation in the realm of economics, while proffering more data-driven decision support.
It is analogous to other works [16–20] that harness various AI paradigms, making good impacts on the smart economy.


H. Vashi et al.

6 Conclusions

A major goal of financial organizations is to retain current customers and attract potential long-term ones. There is a push from both customers and financial institutions to develop a software tool that can enhance bank marketing. We achieve this through predictive analysis by data mining on customer profiles using two classical machine learning methods in explainable AI, namely association rules and decision trees. Factors such as customer age, job, and loan status prove to be a backbone for predicting customer tendencies and improving marketing campaigns, based on knowledge discovery by data mining. Hence, using the discovered knowledge, we build a predictive analysis tool for bank marketing. This can assist stakeholders such as bank managers to target the right customers for telemarketing campaigns by conducting various experiments with the tool. They can enter various customer inputs, observe the predicted outputs on the types of customers likely to subscribe to a given term deposit, and make decisions accordingly. Therefore, this tool provides financial decision support. Our solution to the problem of customer profile analysis addresses the challenge of extracting relevant information from a plethora of customer data for financial decision-making. It makes positive impacts on e-commerce by harnessing online data and enhancing user interaction in buying and selling services electronically. In general, it fits the theme of the smart economy, an important characteristic of smart cities today. We highlight the main contributions of our work in this paper as follows:
• It exemplifies the use of classical white-box data mining methods based on explainable AI (rather than neural models) for financial decision support.
• It builds a novel tool for predictive analysis in bank marketing by harvesting vast amounts of data on existing customer profiles, assisting stakeholders to pin down the right audience (instead of casting the net too wide) for telemarketing.
• It makes direct (albeit modest) impacts on e-commerce, and hence broader impacts on the smart economy, which is a significant pillar of smart cities.
Our work in this paper motivates much future research in financial decision support using AI methods. Future work in this area includes the following: (1) harvesting knowledge from huge amounts of data in financial institutions; (2) exploring techniques in deep learning for detailed comparative studies with the data mining methods based on explainable AI deployed in this work; and (3) aiming to build a more advanced decision support system catering to multiple schemes beyond the term deposit addressed in this paper. This would provide enhanced contributions to the smart economy.

Acknowledgments. Much of the work in the tool development occurred while H. Vashi was a student at Montclair State University (MSU). J. Yadav is supported by a GA (graduate assistantship) at MSU. A. Varde acknowledges NSF MRI grants 2117308 and 2018575.

Predictive Analysis of Bank Marketing for Financial Decision


References
1. Rygielski, C., Wang, J.C., Yen, D.C.: Data mining techniques for customer relationship management. Technol. Soc. 24(4), 483–502 (2002)
2. Tkáč, M., Verner, R.: Artificial neural networks in business: two decades of research. Appl. Soft Comput. 38, 788–804 (2016)
3. Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., Zhu, J.: Explainable AI: a brief survey on history, research areas, approaches and challenges. In: CCF International Conference, NLPCC, Proceedings, Part II 8, pp. 563–574. Springer, Berlin (2019)
4. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. 11(1), 10–18 (2009)
5. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)
6. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
7. Alzoubi, H., Alshurideh, M., Kurdi, B., Alhyasat, K., Ghazal, T.: The effect of e-payment and online shopping on sales growth: evidence from banking industry. Int. J. Data Netw. Sci. 6(4), 1369–1380 (2022)
8. Liao, G.Y., Caramichael, J.: Stablecoins: growth potential and impact on banking (2022)
9. Tékouabou, S.C.K., Gherghina, Ş.C., Toulni, H., Neves Mata, P., Mata, M.N., Martins, J.M.: A machine learning framework towards bank telemarketing prediction. J. Risk Financ. Manag. 15(6), 269 (2022)
10. Predictive Analytics in Banking. https://www.predictiveanalyticstoday.com/predictive-analytics-banking/
11. Ogwueleka, F.N., Misra, S., Colomo-Palacios, R., Fernandez, L.: Neural network and classification approach in identifying customer behavior in the banking sector: a case study of an international bank. Hum. Factors Ergon. Manuf. Serv. Ind. 25(1), 28–42 (2015)
12. Moro, S., Cortez, P., Rita, P.: A data-driven approach to predict the success of bank telemarketing. Decis. Support. Syst. 62, 22–31 (2014)
13. Almossawi, M.: Bank selection criteria employed by college students in Bahrain: an empirical analysis. Int. J. Bank Mark. 19(3), 115–125 (2001)
14. Lourenco, J., Varde, A.S.: Item-based collaborative filtering and association rules for a baseline recommender in e-commerce. In: IEEE International Conference on Big Data, pp. 4636–4645 (2020)
15. Hsieh, N.C.: An integrated data mining and behavioral scoring model for analyzing bank customers. Expert. Syst. Appl. 27(4), 623–633 (2004)
16. Singh, A., Yadav, J., Shrestha, S., Varde, A.: Linking alternative fuel vehicles adoption with socioeconomic status and air quality index. In: AAAI Conference on Artificial Intelligence (Workshops) (2023)
17. Vinod Kumar, T.M., Dahiya, B.: Smart economy in smart cities. In: Smart Economy in Smart Cities. Advances in 21st Century Human Settlements. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-1610-3_1
18. Radakovic, D., Singh, A., Varde, A.: Enriching smart cities by optimizing electric vehicle ride-sharing through game theory. In: IEEE International Conference on Tools with Artificial Intelligence (ICTAI) (2022)
19. Bolton, C., Machová, V., Kovacova, M., Valaskova, K.: The power of human–machine collaboration: artificial intelligence, business automation, and the smart economy. Econ., Manag., Financ. Mark. 13(4), 51–56 (2018)
20. Conti, C.J., Varde, A.S., Wang, W.: Human-robot collaboration with commonsense reasoning in smart manufacturing contexts. IEEE Trans. Autom. Sci. Eng. (TASE) 19(3), 1784–1797 (2022)


21. Kraus, M., Feuerriegel, S.: Decision support from financial disclosures with deep neural networks and transfer learning. Decis. Support. Syst. 104, 38–48 (2017)
22. Alamaniotis, M., Gao, R., Tsoukalas, L.: Towards an energy internet: a game-theoretic approach to price-directed energy utilization. In: E-Energy Conference, pp. 3–11 (2010)
23. Varde, A., Liang, J.: Machine learning approaches in agile manufacturing with recycled materials for sustainability. In: AAAI Conference on Artificial Intelligence (Bridge) (2023)
24. Bifulco, F., Tregua, M., Amitrano, C.C., D’Auria, A.: ICT and sustainability in smart cities management. Int. J. Public Sect. Manag. 29(2), 132–147 (2016)
25. Dave, D., Celestino, A., Varde, A.S., Anu, V.: Management of implicit requirements data in large SRS documents: taxonomy and techniques. SIGMOD Rec. 51(2), 8–29 (2020)
26. Varghese, C., Pathak, D., Varde, A.: SeVa: a food donation app for smart living. In: IEEE Computing and Communication Workshop and Conference (CCWC), pp. 408–413 (2021)
27. Puri, M., Du, X., Varde, A.S., de Melo, G.: Mapping ordinances and tweets using smart city characteristics to aid opinion mining. In: WWW Conference (Companion Volume), pp. 1721–1728 (2018)
28. Gandhe, K., Varde, A., Du, X.: Sentiment analysis of twitter data with hybrid learning for recommender applications. In: IEEE UEMCON Conference, pp. 57–63 (2018)
29. Gupta, A.: E-Commerce: role of e-commerce in today’s business. Int. J. Comput. Corp. Res. 4(1), 1–8 (2014)
30. Du, X., Kowalski, M., Varde, A.S., de Melo, G., Taylor, R.W.: Public opinion matters: mining social media text for environmental management. In: ACM SIGWEB Newsletter (Autumn), pp. 1–15 (2020)
31. Kedah, Z.: Use of e-commerce in the world of business. SABDA J. 2(1), 51–60 (2023)

Wireless Sensor Networks Anomaly Detection Using Machine Learning: A Survey

Ahshanul Haque(B), Naseef-Ur-Rahman Chowdhury, Hamdy Soliman, Mohammad Sahinur Hossen, Tanjim Fatima, and Imtiaz Ahmed

New Mexico Tech, 801 Leroy PL, Socorro, NM, USA
{ahshanul.haque,naseef.chowdhury,mohammad.hossen,tanjim.fatima,imtiaz.ahmed}@student.nmt.edu, [email protected]

Abstract. Wireless Sensor Networks (WSNs) have become increasingly valuable in various civil/military applications like industrial process control, civil engineering applications such as monitoring buildings’ structural strength, environmental monitoring, border intrusion, IoT (Internet of Things), and healthcare. However, the sensed data generated by WSNs is often noisy and unreliable, making it a challenge to detect and diagnose anomalies. Machine learning (ML) techniques have been widely used to address this problem by detecting and identifying unusual patterns in the sensed data. This survey paper provides an overview of the state-of-the-art applications of ML techniques for data anomaly detection in WSN domains. We first introduce the characteristics of WSNs and the challenges of anomaly detection in WSNs. Then, we review various ML techniques such as supervised, unsupervised, and semi-supervised learning that have been applied to WSN data anomaly detection. We also compare different ML-based approaches and their performance evaluation metrics. Finally, we discuss open research challenges and future directions for applying ML techniques to anomaly detection in WSN sensed data.

Keywords: Wireless sensor network · Anomaly detection · Machine learning · Survey · Energy efficiency · Hybrid networks · Techniques · Algorithms · Data · Performance metrics

1 Introduction

Wireless Sensor Networks (WSNs) are widely used in various types of applications, including environmental monitoring, Internet-of-Things (IoT), healthcare, security [18], and industrial control. WSNs [20] typically consist of numerous tiny sensors that are distributed over an area to collect data and send it to a base station. However, due to the resource-constrained nature of sensor nodes, WSNs [21] face several challenges, such as limited energy, computational power, and memory, making it difficult to make decisions and analyze data locally [1,2,31,33].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 491–506, 2024. https://doi.org/10.1007/978-3-031-47715-7_34


Anomaly detection is a critical task in WSNs since it helps detect unusual events and abnormal behavior in the sensed data. Anomalies may indicate a malfunction in the system’s sensors, equipment failure, or potential security threats, which must be addressed immediately and appropriately. However, traditional rule-based anomaly detection techniques may not be suitable for WSNs since they require predefined rules that are challenging to design for complex WSN systems with high-dimensional data [3,5]. ML has emerged as a promising technique for anomaly detection in WSNs. ML algorithms can learn from the sensor data and discover patterns that are indicative of normal and abnormal behavior, without the need for manual rule definition. Moreover, ML techniques can adapt to changes in the system and the environment and provide accurate and timely detection of anomalies [16], thus improving the efficiency and reliability of WSNs [4]. A common ML-based technique is the ensemble learning method [11,12], where multiple ML models are trained and tested and the best-performing models are picked. This approach is not only effective in WSNs but also widely used in many other anomaly/malware detection settings [11,12]. Given the growing interest in using ML for anomaly detection in WSNs, it is essential, for life-saving critical applications, to survey the state-of-the-art techniques and identify the challenges and opportunities in this field. This paper aims to provide a comprehensive survey of ML-based anomaly detection methods in WSNs. The survey covers various ML algorithms, including supervised [3], unsupervised [13], semi-supervised [5], and deep learning [7], and discusses their advantages and limitations in modeling sensed-data anomalies in WSNs.
The paper also discusses the challenges of deploying ML-based anomaly detection systems in WSNs, such as limited resources, collected-data heterogeneity, and privacy/security concerns, and suggests potential solutions to address these challenges. Overall, this survey can serve as a useful reference for researchers and practitioners working on anomaly detection in WSNs using ML techniques [6–9,20,24,27,28,30–32,34–38,40,41]. The rest of this paper is organized as follows. In the second section, we discuss the classification criterion of anomalies. The section following it discusses the anomaly detection techniques used to date. In Sect. 4, we compare the different anomaly detection approaches for WSNs. In the final section, we conclude by shedding some light on the most useful ML models to be deployed in the field of anomaly detection for collected WSN sensed data.
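The ensemble learning method mentioned above [11,12] can be illustrated with a deliberately simple sketch: two basic statistical detectors (a z-score test and an IQR test, standing in for trained ML models) each flag suspect readings, and only readings flagged by both are reported. All thresholds and data below are illustrative assumptions, not values from the surveyed papers:

```python
def zscore_flags(readings, threshold=2.5):
    """Flag readings more than `threshold` population standard
    deviations from the mean."""
    n = len(readings)
    mean = sum(readings) / n
    std = (sum((x - mean) ** 2 for x in readings) / n) ** 0.5
    return [abs(x - mean) > threshold * std for x in readings]

def iqr_flags(readings, k=1.5):
    """Flag readings outside [Q1 - k*IQR, Q3 + k*IQR], using crude
    index-based quartiles for brevity."""
    s = sorted(readings)
    q1, q3 = s[len(s) // 4], s[(3 * len(s)) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x < lo or x > hi for x in readings]

def ensemble_flags(readings):
    """A reading is anomalous only if both base detectors agree (AND vote)."""
    z, i = zscore_flags(readings), iqr_flags(readings)
    return [a and b for a, b in zip(z, i)]

# Temperature stream with one injected spike.
stream = [21.0, 21.2, 20.9, 21.1, 21.0, 55.0, 21.3, 21.1, 20.8, 21.0]
print([x for x, f in zip(stream, ensemble_flags(stream)) if f])
```

In a full ensemble the base detectors would be trained ML models rather than fixed statistical tests, and the vote could be weighted by each model's validation performance.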

2 Classification Criterion of Anomalies

In recent years, WSNs have emerged as a captivating research field due to their ability to monitor vast regions, reach remote and hazardous locations, respond in real time, and their relative simplicity of use [34,36]. This technology has opened up a whole new world of possibilities for scientists. In addition to the aforementioned important civil applications, WSNs have already been utilized in various other military activities such as surveillance, target recognition,


and environmental monitoring. These sensor networks typically consist of numerous small, inexpensive nodes distributed across a wide area, as shown in Fig. 1.

Fig. 1. An example architecture of WSNs [2].

WSNs not only have the ability to sense, compute, and coordinate their activities but also to communicate their results to end users, making them revolutionary for data collection across various domains [23]. However, the unique research and engineering challenges that arise during the design and deployment of these networks must be considered, as well as the limitations of their software development [37,42]. These limitations include their large intended deployment area, communication obstacles, random and hazardous deployment, high component failure rates, and limited computational and energy/battery power. To ensure better critical decision-making, it is crucial to maintain the quality of collected sensor data. Although cryptographic and key management techniques are used to protect the security of sensor nodes from intruder attacks, they are not sufficient to ensure the reliability and integrity of the sensed data [35,41]. The identification of anomalies is introduced in Fig. 2.

Fig. 2. Identification of anomalies in WSN [22].

Thus, outlier detection techniques have been developed to identify any abnormal behavior in sensor data streams. WSNs are particularly susceptible to outliers due to several factors. Such factors include their use of weak and vulnerable sensors to collect data in real-world applications, their battery-powered nature, the potential accumulation of errors when numerous sensors are used over wireless media, and the vulnerability of unguarded sensors in critical security and military applications to manipulation by intruders. Therefore, outlier detection is an integral part of any critical data processing task that utilizes WSNs. The following subsections outline the fundamental concepts, sources, and requirements of outlier detection in WSNs [38]. In WSNs, anomalies refer to any unusual, abnormal, or unexpected behavior or events in the collected sensor data stream that deviate from the expected or normal patterns. Anomalies can be caused by various factors, such as faulty sensors, environmental changes, malicious attacks, or random fluctuations of the sensed data due to external random/asynchronous events in the terrain.

Fig. 3. Example anomalies in a two-dimensional dataset [5].

Anomalies can be categorized into three major types by source, namely: noise, event, and attack anomalies. The classification of anomalies is pictured in Fig. 3. Noise or error anomalies in WSNs refer to measurement inaccuracies or data sensed from sources such as faulty or malfunctioning sensors [28]. Outliers resulting from errors can occur frequently and are typically represented by a data point that differs significantly from the rest of the collected dataset. They can arise due to various environmental factors, including bad deployment caused by difficulties and harsh conditions [29]. To ensure data quality, detected noisy and erroneous data should be eliminated or corrected, if possible [39]. Event anomaly sources in WSNs are defined as sudden changes in the real-world state, such as fires [31], earthquakes, weather changes, and air pollution [30]. Outliers caused by such anomaly sources tend to have a significantly lower probability of occurrence than those caused by errors. Such outliers typically last for a relatively long period of time and can alter the historical pattern of the sensor data [24]. Removing event outliers from the dataset can lead to the loss of crucial information about the events [32]. Outliers that are similar in size to random errors can only be identified through the application of outlier tests.
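The distinction drawn above — error outliers are typically isolated points, while event outliers persist and alter the historical pattern — suggests a simple heuristic: label short runs of flagged readings as noise and sustained runs as events. The sketch below assumes an illustrative baseline, deviation threshold, and minimum event length, all invented for this example:

```python
def label_anomalies(readings, baseline, deviation=5.0, min_event_len=3):
    """Label each reading 'normal', 'noise' (a transient outlier), or
    'event' (a sustained run of at least `min_event_len` outliers)."""
    flagged = [abs(x - baseline) > deviation for x in readings]
    labels = ["normal"] * len(readings)
    i = 0
    while i < len(readings):
        if flagged[i]:
            j = i
            while j < len(readings) and flagged[j]:
                j += 1                      # extend the run of outliers
            kind = "event" if j - i >= min_event_len else "noise"
            for k in range(i, j):
                labels[k] = kind
            i = j
        else:
            i += 1
    return labels

# One transient spike (faulty reading) and one sustained rise (e.g. a fire).
temps = [20, 21, 60, 20, 21, 20, 45, 47, 50, 52, 21]
print(label_anomalies(temps, baseline=20.0))
```

Real systems would learn the baseline and thresholds from history rather than fixing them, but the run-length idea carries over: noise should be discarded, while events must be preserved and reported.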


In WSNs, malicious attack anomaly sources are associated with network security, and researchers in [33] have addressed this issue. Due to the unattended deployment of the sensors, intruders can gain access to and control, damage, and/or hijack specific nodes to launch attacks, which can deplete the network’s limited resources or inject false and corrupted data. Malicious attacks can be broadly classified into two categories: passive and active attacks. Passive attacks involve obtaining data exchanged in the network without interrupting communication, while active attacks aim to disrupt the normal functioning of the network [40].

3 Anomaly Detection Techniques in WSN

3.1 Classification of the Approaches

There are three main categories of ML modeling approaches [14] for anomaly detection in WSNs, namely: supervised, unsupervised, and semi-supervised. The appropriate category should be chosen based on the specific requirements of the WSN anomaly detection task, as each category has its own advantages and disadvantages. Different ML approaches can be used depending on the characteristics of the dataset and the user’s needs. Figure 4 demonstrates the ML approaches for detecting anomalies in WSNs. SVM, KNN, Random Forest, Decision Tree, ANN, K-means clustering, density-based clustering, auto-encoders, reinforcement learning, self-training, co-training, and label propagation are popular ML algorithms used for anomaly detection in WSNs [1,4,5,7]. Each algorithm has its own strengths and weaknesses depending on the specific use case. Next, we will shed some light on each of these ML algorithms to justify their utilization in certain applications. In the next three subsections, we survey the detailed literature in each of the aforementioned training/learning categories: supervised, unsupervised, and semi-supervised.

3.2 Supervised Learning Approaches

Supervised learning algorithms [3,5,7–9] use labeled training data, along with teacher supervision, to learn the normal network behavior and later identify anomalies based on deviations from the learned normal behavior. This strategy is appropriate when labeled training data are available and the network’s normal behavior class is well-defined. Next, we briefly introduce some of the prominent supervised ML models utilized in WSN data classification for anomaly detection and explore some related literature.
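As a concrete (if toy) instance of this supervised setting, the sketch below classifies a new sensor reading by majority vote among its k nearest labeled neighbors, i.e., a plain k-NN classifier. The labeled (temperature, humidity) readings are hypothetical, invented purely for illustration:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled
    neighbors (Euclidean distance). `train` is [(features, label), ...]."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda fl: dist(fl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Labeled (temperature, humidity) readings from a hypothetical deployment.
train = [
    ((21.0, 45.0), "normal"), ((21.5, 47.0), "normal"),
    ((20.5, 44.0), "normal"), ((22.0, 46.0), "normal"),
    ((60.0, 10.0), "anomaly"), ((58.0, 12.0), "anomaly"),
    ((62.0,  9.0), "anomaly"),
]

print(knn_classify(train, (21.2, 45.5)))   # query near the normal cluster
print(knn_classify(train, (59.0, 11.0)))   # query near the anomalous cluster
```

The sensitivity to the distance metric noted in the model overview is visible here: with features on very different scales, the metric should be normalized before distances are computed.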


Fig. 4. ML Approaches to Detect Anomalies in WSN [22]

3.2.1 A Brief Introduction to Supervised ML Models
SVM [15] is a statistical approach [17] that is effective in handling high-dimensional data and can separate classes with large margins, but it requires tuning of hyperparameters and can be computationally expensive [3,8]. KNN [22] is a clustering-based method that is useful for detecting local anomalies and is simple to implement, but it is sensitive to the choice of distance metric and can be computationally expensive. Random Forest [9,10] is an ML-based approach that is useful for handling noisy data, can identify feature importance, and is parallelizable, but it has limited interpretability and can overfit the data. ANN [9] is another ML-based approach that is good at handling non-linear relationships, can handle complex data, and is parallelizable, but it requires careful selection of architecture and hyperparameters and can be computationally expensive. Decision Tree [22] is an ML-based approach that is useful for handling noisy data, is easy to interpret, and can handle missing values, but it is prone to overfitting and sensitive to the choice of split criteria. Ultimately, the choice of algorithm depends on the specific characteristics of the data and the requirements of the application.

3.2.2 Literature Work in Supervised Learning
In [1], the authors reviewed the challenges faced by WSNs and proposed the use of ML techniques to improve their energy efficiency and detect anomalies in their collected data. The authors provide a comprehensive literature review of ML algorithms that have been applied to WSNs, including K-Means, ANN, Decision Trees, SVM [19], and Bayesian Networks [2,5,13,22]. The authors present two case studies that demonstrate the effectiveness of ML techniques for energy efficiency and


anomaly detection in WSNs. Overall, the paper contributes to the growing body of research on the use of ML for WSNs and highlights the potential of these techniques for improving the performance and reliability of WSNs. The authors in [4] proposed a method for detecting spatial anomalies in sensor networks using neighborhood information. The authors provide a literature review of related work on anomaly detection in WSNs and highlight the challenges of detecting spatial anomalies due to the complexity of spatial data. The paper presents the proposed method, which uses neighborhood information to identify anomalies in spatial data (consisting of latitude and longitude). The authors demonstrate the effectiveness of their method through experiments conducted on simulated and real-world data sets. Overall, the paper contributes to the growing body of research on using sensor networks for anomaly detection and highlights the potential of using neighborhood information to improve such systems’ accuracy and reliability. The authors in [5] proposed a new approach to anomaly detection in WSNs using support vector data description (SVDD). The authors provide a literature review of related work in the field of anomaly detection in WSNs and highlight the limitations of existing approaches, such as high false alarm rates and low detection accuracy. The paper presents the proposed method, which uses SVDD to construct a boundary around normal data and detect anomalies outside this boundary. The authors demonstrate the effectiveness of their method through experiments conducted on real-world data sets, showing that their method outperforms existing methods in terms of detection accuracy and false alarm rate. Overall, the paper contributes to the field of smart anomaly detection in WSNs and highlights the potential of using SVDD to improve such systems’ accuracy and reliability.
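The SVDD intuition — enclose the normal data in a boundary and flag whatever falls outside — can be conveyed without the quadratic program a real SVDD solves. The sketch below fits a crude data description instead: the centroid of the normal training points plus the largest training distance as the radius. It is only an illustrative approximation of the idea, on made-up data:

```python
def fit_sphere(normal_points):
    """Fit a crude data description: the centroid of the normal training
    points plus the largest training distance as the radius.
    (Real SVDD solves a quadratic program for a minimal enclosing sphere,
    possibly in a kernel space; this is only a sketch of the idea.)"""
    dim = len(normal_points[0])
    center = tuple(sum(p[i] for p in normal_points) / len(normal_points)
                   for i in range(dim))

    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, center)) ** 0.5

    radius = max(dist(p) for p in normal_points)
    return center, radius, dist

# Normal (temperature, humidity) readings, invented for illustration.
normal = [(20.9, 44.0), (21.1, 45.0), (21.0, 46.0), (21.3, 45.2)]
center, radius, dist = fit_sphere(normal)

def is_anomaly(p):
    """A point outside the fitted boundary is flagged as anomalous."""
    return dist(p) > radius

print(is_anomaly((21.0, 45.1)), is_anomaly((35.0, 20.0)))
```

Using the maximum training distance as the radius makes the boundary tight around the training set; a real SVDD softens this with slack variables so that a few noisy training points do not inflate the sphere.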
The authors in [7] proposed a data-driven approach for hyperparameter optimization of one-class SVMs for anomaly detection in WSNs. They provided a literature review of related work on anomaly detection in WSNs and highlighted the importance of hyperparameter optimization for improving the accuracy and efficiency of SVM-based methods. The paper presents the proposed method, which uses a genetic algorithm to optimize the hyperparameters of the one-class SVM model. The authors demonstrate the effectiveness of their method through experiments conducted on real-world data sets, showing that their method outperforms existing methods in terms of detection accuracy and computational efficiency. Overall, the paper contributes to smart anomaly detection in WSNs and highlights the potential of using data-driven approaches for hyperparameter optimization of SVM-based methods. The authors in [3] proposed a new algorithm for anomaly detection in WSNs using a combination of density-based spatial clustering of applications with noise (DBSCAN) and SVM. They provided a literature review of related work on anomaly detection in WSNs and highlighted the challenges faced in implementing such systems, including limited computational resources, communication bandwidth, and energy constraints. The paper presents the proposed DBSCAN-based algorithm and demonstrates its effectiveness in detecting anomalies in WSNs


using simulations. Overall, the paper contributes to the use of ML techniques for anomaly detection in WSNs and highlights the potential of combining DBSCAN and SVM to improve the accuracy and reliability of such systems. The authors in [8] presented a comprehensive literature review of existing intrusion detection systems (IDSs) for WSNs and investigated the applicability of computational intelligence techniques for enhancing the performance of IDSs. The authors discussed the challenges and requirements of intrusion detection in WSNs, including limited resources, wireless communication, and distributed deployment. They then reviewed the use of various soft-computing intelligence techniques for IDSs, such as ANNs, decision trees, fuzzy logic, and genetic algorithms. The paper also provides an overview of existing benchmark datasets and evaluation metrics for IDSs in WSNs. Overall, the paper highlights the potential of using soft-computing intelligence techniques for improving the accuracy of intrusion detection. The authors in [9] proposed a distributed anomaly detection scheme based on autoencoder neural networks for WSNs. The authors aim to address the limitations of existing centralized anomaly detection methods, which may not be suitable for large-scale WSNs due to bandwidth constraints, power consumption, and privacy concerns. The proposed scheme involves training autoencoder neural networks at each sensor node to learn the normal behavior of the local environment and detect anomalies based on reconstruction errors. The paper also discusses the implementation of the proposed scheme on a testbed and evaluates its performance using various metrics. The results show that the proposed scheme achieves high accuracy in detecting anomalies while maintaining low communication overhead and energy consumption. Hence, such an approach is suitable for large-scale WSNs in IoT applications.
Overall, the paper provides a valuable contribution to the field of distributed anomaly detection in WSNs and demonstrates the potential of using autoencoder neural networks for improving the efficiency and effectiveness of anomaly detection in IoT applications. The authors in [10] proposed a deep learning-based approach for developing middleware for WSNs. The authors aim to address the limitations of existing middleware approaches that may not be able to handle the complexity and heterogeneity of WSNs, leading to low accuracy and reliability. The proposed approach involves training a deep neural network (DNN) to predict the behavior of the WSN based on historical data and using the predictions to adjust the middleware parameters in real time. The paper also discusses the implementation of the proposed approach on a testbed and evaluates its performance using various metrics. The results show that the proposed approach achieves high accuracy and reliability in adapting the middleware to the dynamic behavior of the WSN, making it suitable for various WSN applications. Overall, the paper provides a valuable contribution to the field of WSN middleware development and demonstrates the potential of using deep learning for improving the accuracy and reliability of WSN middleware.
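The reconstruction-error principle behind the autoencoder scheme in [9] can be shown with a linear stand-in: a linear autoencoder with a one-dimensional code is equivalent to projecting onto the leading principal direction of the training data, so readings that follow the learned pattern reconstruct well while anomalies leave a large residual. The sketch below (power iteration, pure Python, invented readings) is not the authors' network, just the underlying idea:

```python
def fit_linear_autoencoder(points, iters=200):
    """Fit a linear autoencoder with a one-dimensional code: encoding
    projects a mean-centered point onto the leading principal direction
    (found here by power iteration), decoding expands it back."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    centered = [[p[i] - mean[i] for i in range(dim)] for p in points]
    v = [1.0] * dim
    for _ in range(iters):
        # One power-iteration step: w = C v on the (unnormalized) covariance.
        w = [0.0] * dim
        for c in centered:
            proj = sum(ci * vi for ci, vi in zip(c, v))
            for i in range(dim):
                w[i] += proj * c[i]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]

    def reconstruction_error(p):
        c = [p[i] - mean[i] for i in range(dim)]
        code = sum(ci * vi for ci, vi in zip(c, v))   # encode to 1-D
        recon = [code * vi for vi in v]               # decode back
        return sum((a - b) ** 2 for a, b in zip(c, recon)) ** 0.5

    return reconstruction_error

# Normal readings lie roughly on the line humidity ≈ 2 * temperature.
normal = [(20.0, 40.1), (21.0, 42.0), (22.0, 43.9), (23.0, 46.1), (24.0, 48.0)]
err = fit_linear_autoencoder(normal)

print(round(err((25.0, 50.0)), 3))   # follows the learned pattern
print(round(err((22.0, 10.0)), 1))   # violates the learned pattern
```

A deployed scheme would use a nonlinear network, train it per node as in [9], and set the alarm threshold from the distribution of reconstruction errors on held-out normal data.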

3.3 Unsupervised Learning Approaches

Unsupervised learning algorithms [6,25–27] are founded on the assumption that anomalies are data points significantly distinct from most of the data. Hence, with no need for a supervising teacher, the ML model will cluster anomalous data separately from normal data, even when the data are unlabeled, based on the data quality. This method is appropriate when labeled data are unavailable or the network’s normal behavior is unclear. Next, we briefly introduce some of the prominent unsupervised ML models utilized in WSN data classification for anomaly detection and explore some related literature.

3.3.1 A Brief Introduction to Unsupervised ML Models
K-Means clustering [6] is a popular unsupervised clustering method used for identifying clusters of data points with similar characteristics [6,26,27]. It is useful for detecting global anomalies and can handle large datasets, but it requires selecting an appropriate number of clusters and can be sensitive to outliers. Density-based clustering [26], on the other hand, identifies clusters of points based on the density of their distribution. It is useful for detecting local anomalies, does not require specifying the number of clusters beforehand [2,25], is particularly suited to clusters of arbitrary shape, and handles noise and outliers effectively; however, it can be computationally expensive and sensitive to the choice of parameters. Auto-encoders [26] are a type of neural network used for unsupervised feature learning and anomaly detection. They are useful for detecting non-linear relationships and can handle complex data, but they require careful selection of architecture and hyper-parameters and can be computationally expensive. Ultimately, the choice of technique depends on the specific characteristics of the data and the requirements of the application [25,26].
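A minimal version of K-means-based anomaly detection works as follows: cluster the readings, then flag any point farther than a chosen distance from every centroid as a global anomaly. The initialization, distance threshold, and one-dimensional data below are illustrative assumptions kept deliberately simple:

```python
def kmeans(points, k=2, iters=50):
    """Plain k-means on 1-D readings; returns the final centroids.
    Initialization simply spreads seeds across the sorted data."""
    centroids = sorted(points)[:: max(1, len(points) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def anomalies(points, centroids, max_dist):
    """Flag points farther than `max_dist` from every centroid."""
    return [p for p in points
            if min(abs(p - c) for c in centroids) > max_dist]

# Two normal operating regimes (~20 and ~30 degrees) plus one stray reading.
readings = [20.1, 20.3, 19.8, 20.0, 30.2, 30.0, 29.9, 30.1, 55.0]
cents = kmeans(readings, k=2)
print(anomalies(readings, cents, max_dist=10.0))
```

The sensitivity to outliers noted above is visible even here: the stray reading pulls its nearest centroid toward it, which is why the distance threshold must be set with some slack (or outliers removed and the clustering repeated).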
3.3.2 Literature Work in Unsupervised Learning
In [2], the authors presented a method for detecting anomalies in a WSN used in a smart home system. The authors propose using statistical analysis and ML techniques to detect abnormal behavior in the network, which can be used to identify security breaches, faults, and other anomalies. The paper provides a literature review of related work in the field of anomaly detection in WSNs and highlights the challenges of implementing such systems in smart home environments. The authors present the results of experiments conducted on a prototype smart home system, which demonstrate the effectiveness of their proposed method for detecting anomalies in WSNs. Overall, the paper contributes to the growing interest in the use of WSNs in smart home environments and highlights the potential of ML techniques for improving the security and reliability of such systems.

500

A. Haque et al.

The authors in [6] proposed an attention-based multi-filter long short-term memory (AMF-LSTM) deep learning strategy for network anomaly detection. The authors present a review of related research on anomaly detection in networks and draw attention to the drawbacks of existing approaches, such as the need for domain knowledge and the inability to identify previously unknown anomalies, which motivates the use of unsupervised learning. The proposed approach uses AMF-LSTM to capture the temporal and spatial correlations present in network traffic, with attention focused on the characteristics essential for anomaly detection. Experiments conducted by the authors on real-world datasets show that, in terms of detection accuracy and false alarm rate, the method outperforms its peers. In general, the paper highlights the potential of AMF-LSTM to improve the accuracy and dependability of such systems and adds to the growing research interest in deep learning techniques for network anomaly detection.

The authors in [25] proposed a new approach for detecting outliers in WSNs using k-means clustering and lightweight methods. They argue that outlier detection is essential for improving network performance and reliability, and that existing methods have limitations in terms of computational complexity and memory requirements. The authors introduce a two-phase approach that divides sensor nodes into clusters based on similarity and then applies lightweight outlier detection methods to each cluster. To evaluate the effectiveness of the proposed approach, the authors conducted experiments using real-world data from a WSN. The results showed that the proposed approach outperformed existing outlier detection methods in terms of both detection accuracy and computational efficiency.
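The two-phase scheme of [25] — cluster first, then run a cheap test inside each cluster — can be sketched as follows. This is our illustrative code: the median/MAD rule is a stand-in for the lightweight detectors, whose exact form the survey does not specify:

```python
import numpy as np

def per_cluster_outliers(values, labels, thresh=3.5):
    """Phase 2 of a two-phase scheme: within each cluster found in phase 1,
    flag readings whose modified z-score (median/MAD based) exceeds thresh."""
    flagged = np.zeros(len(values), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        v = values[idx]
        med = np.median(v)
        mad = np.median(np.abs(v - med)) or 1e-9  # avoid divide-by-zero
        z = 0.6745 * np.abs(v - med) / mad
        flagged[idx] = z > thresh
    return flagged

# Two clusters of temperature readings; one spiked value in cluster 1.
values = np.array([20.1, 20.3, 19.9, 20.0, 20.2, 35.0, 34.8, 35.1, 60.0, 35.2])
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(np.where(per_cluster_outliers(values, labels))[0])  # → [8]
```

The point of clustering first is that the 60.0 reading is only anomalous relative to its own cluster's range; a single global threshold over both clusters would be much harder to set.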
The authors in [26] presented two novel approaches for detecting outliers in WSNs using density-based methods. The authors argue that density-based methods have advantages over other outlier detection methods in terms of their ability to handle high-dimensional data and non-linear relationships. The proposed approaches use different density-based techniques, including the local outlier factor (LOF) and the relative density ratio (RDR), which are evaluated using real-world WSN datasets. The results show that both approaches outperform existing methods in terms of detection accuracy and computational efficiency, demonstrating the potential of density-based methods for outlier detection in WSNs.

The authors in [27] proposed a novel approach for detecting anomalous activity in wireless communication networks using deep auto-encoders [22]. The authors argue that traditional methods for anomaly detection in wireless communication networks are limited in their ability to handle complex and dynamic data. The proposed approach uses a deep auto-encoder to learn a compressed representation of the spectrum data, and then compares the reconstructed data with the original data to identify anomalies. The experimental results demonstrate the effectiveness of the proposed approach in detecting anomalies in real-world spectrum data, showing the potential of deep auto-encoders for anomaly detection in wireless communication networks.

WSN Anomaly Detection Using Machine Learning: A Survey

3.4 Semi-Supervised Learning Approaches

Semi-supervised learning algorithms [23,24,30] use a mix of labeled and unlabeled data to learn normal network behavior. This strategy is appropriate when labeled data is limited or inaccurate and the network behavior is complex and unclear. Deep Belief Networks (DBNs) and Generative Adversarial Networks (GANs) are two examples of semi-supervised learning algorithms utilized in WSN anomaly detection [2]. Reinforcement learning, label propagation, self-training, and co-training are popular techniques used for anomaly detection in WSNs with semi-supervised learning.

3.4.1 Brief Introduction to Semi-Supervised ML Models

Reinforcement Learning [23] is a type of ML suited to sequential decision-making problems in which an agent interacts with an environment [23,24]. It is useful for detecting anomalies in dynamic environments where the distribution of the data changes over time, but it can be computationally expensive and requires careful selection of an appropriate reward function. Label Propagation [29] is a semi-supervised learning approach for propagating labels through a network [23,29,30]. It is useful for detecting anomalies in large-scale networks and can handle missing data, but it requires careful selection of parameters and can be sensitive to the choice of initialization. Self-Training [24] is another semi-supervised learning approach, useful for leveraging unlabeled data to improve the performance of a classifier. It can improve a classifier's accuracy, but it requires careful selection of the threshold for adding newly labeled data. Co-Training [28] is a semi-supervised learning approach for learning from multiple views of data [24,28]. It is useful for detecting anomalies in multi-modal data and can improve the accuracy of a classifier, but it requires careful selection of the number of views and can be sensitive to the choice of features.
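The self-training loop described above can be sketched minimally as follows. This is our illustrative code, not from the cited works: the nearest-centroid base classifier and the distance-margin confidence threshold are assumptions standing in for whatever classifier and threshold a real system would use:

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, rounds=5, margin=2.0):
    """Self-training with a nearest-centroid base classifier: in each round,
    pseudo-label unlabeled points whose distance to the nearest class centroid
    is at least `margin` times smaller than to the second-nearest, then retrain.
    Assumes classes are labeled 0..k-1."""
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        cents = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
        d = np.linalg.norm(pool[:, None, :] - cents[None, :, :], axis=2)
        order = np.sort(d, axis=1)
        confident = order[:, 1] >= margin * order[:, 0]  # confidence threshold
        if not confident.any():
            break
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, d[confident].argmin(axis=1)])
        pool = pool[~confident]
    return X, y

# Two labeled seeds per class plus three unlabeled points near the classes.
X_lab = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.array([[0.1, 0.3], [5.1, 4.8], [0.3, 0.1]])
X_aug, y_aug = self_train(X_lab, y_lab, X_unlab)
print(len(y_aug))  # 7: all three unlabeled points were pseudo-labeled
```

The `margin` parameter plays the role of the "appropriate threshold" mentioned above: set too loose, wrong pseudo-labels contaminate the training set; set too strict, the unlabeled data is never used.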
Ultimately, the choice of technique depends on the specific characteristics of the data and the requirements of the application [42].

3.4.2 Literature Work in Semi-Supervised Learning

The authors in [23] proposed a deep actor-critic reinforcement learning-based approach for anomaly detection in WSNs. The authors utilize an actor-critic model to learn the mapping between the current state and the optimal action to take for the anomaly detection task. They introduce a reward function that considers the detection accuracy, false positives, and false negatives to train the actor network. Additionally, the authors incorporate a replay buffer and target networks to stabilize
the learning process. Their experimental results show that the proposed approach outperforms several state-of-the-art anomaly detection methods in terms of detection accuracy and false positives. The authors in [24] proposed a novel anomaly detection method called AESMOTE, which combines adversarial reinforcement learning (ARL) with the synthetic minority over-sampling technique (SMOTE). AESMOTE enables the system to learn the underlying distribution of normal and anomalous data, and it leverages SMOTE to generate synthetic samples representing anomalies that may not have been observed in the original dataset. The proposed method is evaluated on multiple benchmark datasets, and the experimental results show that AESMOTE outperforms other state-of-the-art methods, achieving high detection rates with low false positive rates. The authors in [28] presented a novel approach for video anomaly detection using a self-trained prediction model and a novel anomaly score mechanism, which could also be useful for WSN anomaly detection. The authors argue that existing methods for anomaly detection have limitations in their ability to handle complex scenes and learn the underlying patterns of anomalies. The proposed approach trains a deep neural network on normal videos to learn the patterns of normal behavior and then uses it to predict future frames. An anomaly score mechanism is introduced to measure the deviation between predicted and actual frames, enabling the identification of anomalous events. The experimental results demonstrate the effectiveness of the proposed approach on both synthetic and real-world datasets, showing its potential for anomaly detection in various applications. The authors in [29] proposed a novel approach for intrusion detection using semi-supervised co-training and active learning techniques.
The authors argue that existing intrusion detection methods have limitations in terms of their ability to handle high-dimensional and diverse data, leading to low detection accuracy and high false alarm rates. The proposed approach utilizes multiple views of data to capture different aspects of intrusion behavior and uses semi-supervised co-training and active learning to improve the performance of the intrusion detection system. The experimental results demonstrate the effectiveness of the proposed approach in terms of detection accuracy and false alarm rate reduction, showing its potential for intrusion detection in various applications. The authors in [30] proposed a novel approach for group anomaly detection in large-scale networks using adaptive label propagation. The authors argue that existing methods for group anomaly detection have limitations in terms of their ability to handle large-scale and dynamic network data, leading to low detection accuracy and scalability issues. The proposed approach utilizes label propagation to propagate labels in the network and identify groups of anomalous nodes, while adaptively adjusting the propagation process based on the local structure of the network. The experimental results demonstrate the effectiveness of the proposed approach in terms of detection accuracy and scalability, showing its potential for group anomaly detection in various applications, such as social networks and transportation systems.
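The basic (non-adaptive) label propagation step underlying approaches such as [30] can be sketched as follows. The graph, seed labels, and clamping scheme here are our illustrative assumptions, not the adaptive algorithm of the cited paper:

```python
import numpy as np

def propagate_labels(W, y_init, iters=50):
    """Plain label propagation: repeatedly replace each node's score with the
    weighted average of its neighbors' scores, while clamping the known seeds.
    W is a symmetric adjacency matrix; y_init holds +1 (normal), -1 (anomalous),
    or 0 (unknown)."""
    f = y_init.astype(float)
    clamped = y_init != 0
    deg = W.sum(axis=1)
    for _ in range(iters):
        f = W @ f / np.where(deg > 0, deg, 1.0)
        f[clamped] = y_init[clamped]  # keep seed labels fixed
    return np.sign(f)

# A 6-node network: nodes 0-2 form one group, 3-5 another; one seed per group.
W = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[a, b] = W[b, a] = 1.0
y_init = np.array([1, 0, 0, -1, 0, 0])  # node 0 normal, node 3 anomalous
print(propagate_labels(W, y_init))  # groups labeled [1, 1, 1, -1, -1, -1]
```

This shows why the technique suits group anomaly detection: two seed labels are enough to label both three-node groups, since labels flow along edges. The "adaptive" variant in [30] additionally adjusts the propagation weights using local network structure.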

4 Conclusion

The application of ML to the detection of anomalies in WSNs has been a growing area of research in recent years. We surveyed traditional and recent ML models utilized for anomaly detection in WSN data. The choice among such models is determined by the network's specific application and requirements, and numerous papers have investigated the impact of such factors on the deployment of ML in this field. In general, this survey has demonstrated the significance and efficacy of utilizing ML for detecting anomalies in WSNs. However, there is still a need for additional research in this area, particularly in creating more effective and efficient algorithms for large and complex networks. In addition, more in-depth evaluations of the various methods on real-world datasets are required to understand their performance and limitations.

References

1. Mohit, E., et al.: Machine learning techniques for energy efficiency and anomaly detection in hybrid wireless sensor networks
2. Kanev, A., et al.: Anomaly detection in a wireless sensor network of the "smart home" system. In: 2017 20th Conference of Open Innovations Association (FRUCT), St. Petersburg, Russia, pp. 118–124 (2017). https://doi.org/10.23919/FRUCT.2017.8071301
3. Saeedi Emadi, H., Mazinani, S.M.: A novel anomaly detection algorithm using DBSCAN and SVM in wireless sensor networks. Wirel. Personal Commun. 98(2), 2025–2035 (2017). https://doi.org/10.1007/s11277-017-4961-1
4. Bosman, H.H.W.J., Iacca, G., Tejada, A., Wörtche, H.J., Liotta, A.: Spatial anomaly detection in sensor networks using neighborhood information. Inf. Fusion 33, 41–56 (2017). https://doi.org/10.1016/j.inffus.2016.04.007
5. Feng, Z., Fu, J., Du, D., Li, F., Sun, S.: A new approach of anomaly detection in wireless sensor networks using support vector data description. Int. J. Distrib. Sensor Netw. 13(1) (2017). https://doi.org/10.1177/1550147716686161
6. Zhu, M., Ye, K., Wang, Y., Xu, C.-Z.: A deep learning approach for network anomaly detection based on AMF-LSTM. In: 15th IFIP International Conference on Network and Parallel Computing (NPC), Muroran, Japan (2018)
7. Trinh, V.V., Tran, K.P., Huong, T.T.: Data-driven hyperparameter optimization of one-class support vector machines for anomaly detection in wireless sensor networks. In: 2017 International Conference on Advanced Technologies for Communications (ATC), Quy Nhon, Vietnam, pp. 6–10 (2017). https://doi.org/10.1109/ATC.2017.8167642
8. McDermott, C.D., Petrovski, A.: Investigation of computational intelligence techniques for intrusion detection in wireless sensor networks. Int. J. Comput. Netw. Commun. 9(4), 45–56 (2017)
9. Luo, T., Nagarajan, S.G.: Distributed anomaly detection using autoencoder neural networks in WSN for IoT. In: 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, pp. 1–6 (2018). https://doi.org/10.1109/ICC.2018.8422402


10. Alshinina, R.A., Elleithy, K.M.: A highly accurate deep learning based approach for developing wireless sensor network middleware. IEEE Access 6, 29885–29898 (2018). https://doi.org/10.1109/ACCESS.2018.2844255
11. Arai, K., Bhatia, R. (eds.): FICC 2019. LNNS, vol. 70. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-12385-7
12. Chowdhury, M.N.U.R., Alahy, Q.E., Soliman, H.: Advanced Android malware detection utilizing API calls and permissions. In: Kim, H., Kim, K.J. (eds.) IT Convergence and Security. Lecture Notes in Electrical Engineering, vol. 782. Springer, Singapore (2021)
13. Tong, S., et al.: Support vector machine active learning with applications to text classification
14. Ghahramani, Z., et al.: Introduction to hidden Markov models and Bayesian networks
15. Ahmed, T., et al.: CKD prediction using data mining technique as SVM and KNN with PyCharm
16. Di, M., et al.: A survey of machine learning in wireless sensor networks from networking and application perspectives
17. Vapnik, V., et al.: The Nature of Statistical Learning Theory. Springer (2000)
18. Schultz, M., et al.: Data mining methods for detection of new malicious executables. In: Security and Privacy Proceedings, IEEE Symposium, pp. 38–49 (2001)
19. Tang, T., et al.: Very large-scale data classification based on K-means clustering and multi-kernel SVM
20. O'Reilly, C., et al.: Anomaly detection in wireless sensor networks in a non-stationary environment
21. Gupta, S., Zheng, R., Cheng, A.M.K.: ANDES: an anomaly detection system for wireless sensor networks. In: 2007 IEEE International Conference on Mobile Adhoc and Sensor Systems, Pisa, Italy, pp. 1–9 (2007). https://doi.org/10.1109/MOBHOC.2007.4428636
22. Ayadi, A., Ghorbel, O., Obeid, A.M., Abid, M.: Outlier detection approaches for wireless sensor networks: a survey. Comput. Netw. 129(Part 1), 319–333 (2017)
23. Zhong, C., Gursoy, M.C., Velipasalar, S.: Deep actor-critic reinforcement learning for anomaly detection. In: 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, pp. 1–6 (2019). https://doi.org/10.1109/GLOBECOM38437.2019.9013223
24. Ma, X., Shi, W.: AESMOTE: adversarial reinforcement learning with SMOTE for anomaly detection. IEEE Trans. Netw. Sci. Eng. 8(2), 943–956 (2021). https://doi.org/10.1109/TNSE.2020.3004312
25. Andrade, A.T.C., Montez, C., Moraes, R., Pinto, A.R., Vasques, F., da Silva, G.L.: Outlier detection using k-means clustering and lightweight methods for wireless sensor networks. In: IECON 2016 - 42nd Annual Conference of the IEEE Industrial Electronics Society, Florence, Italy, pp. 4683–4688 (2016). https://doi.org/10.1109/IECON.2016.7794093
26. Abid, A., Khediri, S.E., Kachouri, A.: Improved approaches for density-based outlier detection in wireless sensor networks. Computing 103(10), 2275–2292 (2021). https://doi.org/10.1007/s00607-021-00939-5
27. Feng, Q., Zhang, Y., Li, C., Dou, Z., Wang, J.: Anomaly detection of spectrum in wireless communication via deep auto-encoders. J. Supercomput. 73(7), 3161–3178 (2017). https://doi.org/10.1007/s11227-017-2017-7


28. Guo, A., Guo, L., Zhang, R., Wang, Y., Gao, S.: Self-trained prediction model and novel anomaly score mechanism for video anomaly detection. Image Vis. Comput. 119, 104391 (2022)
29. Mao, C.-H., Lee, H.-M., Parikh, D., Chen, T., Huang, S.-Y.: Semi-supervised co-training and active learning based approach for multi-view intrusion detection. In: Proceedings of the 2009 ACM Symposium on Applied Computing (SAC '09), pp. 2042–2048. Association for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1529282.1529735
30. Li, Z., Chen, X., Song, J., Gao, J.: Adaptive label propagation for group anomaly detection in large-scale networks. IEEE Trans. Knowl. Data Eng. https://doi.org/10.1109/TKDE.2022.3176478
31. Rajasegarar, S., Leckie, C., Palaniswami, M.: Anomaly detection in wireless sensor networks. IEEE Wirel. Commun. 15(4), 34–40 (2008). https://doi.org/10.1109/MWC.2008.4599219
32. O'Reilly, C., Gluhak, A., Imran, M.A., Rajasegarar, S.: Anomaly detection in wireless sensor networks in a non-stationary environment. IEEE Commun. Surv. Tutor. 16(3), 1413–1432 (2014). https://doi.org/10.1109/SURV.2013.112813.00168
33. Abduvaliyev, A., Pathan, A.-S.K., Zhou, J., Roman, R., Wong, W.-C.: On the vital areas of intrusion detection systems in wireless sensor networks. IEEE Commun. Surv. Tutor. 15(3), 1223–1237 (2013). https://doi.org/10.1109/SURV.2012.121912.00006
34. Rajasegarar, S., Leckie, C., Palaniswami, M., Bezdek, J.C.: Quarter sphere based distributed anomaly detection in wireless sensor networks. In: 2007 IEEE International Conference on Communications, Glasgow, UK, pp. 3864–3869 (2007). https://doi.org/10.1109/ICC.2007.637
35. Suthaharan, S., Alzahrani, M., Rajasegarar, S., Leckie, C., Palaniswami, M.: Labelled data collection for anomaly detection in wireless sensor networks. In: 2010 Sixth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, Brisbane, QLD, Australia, pp. 269–274 (2010). https://doi.org/10.1109/ISSNIP.2010.5706782
36. Xie, M., Hu, J., Han, S., Chen, H.-H.: Scalable hypergrid k-NN-based online anomaly detection in wireless sensor networks. IEEE Trans. Parallel Distrib. Syst. 24(8), 1661–1670 (2013). https://doi.org/10.1109/TPDS.2012.261
37. Egilmez, H.E., Ortega, A.: Spectral anomaly detection using graph-based filtering for wireless sensor networks. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, pp. 1085–1089 (2014). https://doi.org/10.1109/ICASSP.2014.6853764
38. Rajasegarar, S., Leckie, C., Palaniswami, M., Bezdek, J.C.: Distributed anomaly detection in wireless sensor networks. In: 2006 10th IEEE Singapore International Conference on Communication Systems, Singapore, pp. 1–5 (2006). https://doi.org/10.1109/ICCS.2006.301508
39. Sun, B., Osborne, L., Xiao, Y., Guizani, S.: Intrusion detection techniques in mobile ad hoc and wireless sensor networks. IEEE Wirel. Commun. 14(5), 56–63 (2007). https://doi.org/10.1109/MWC.2007.4396943
40. Xie, M., Hu, J., Guo, S., Zomaya, A.Y.: Distributed segment-based anomaly detection with Kullback-Leibler divergence in wireless sensor networks. IEEE Trans. Inf. Forensics Secur. 12(1), 101–110 (2017). https://doi.org/10.1109/TIFS.2016.2603961


41. Xie, M., Hu, J., Guo, S.: Segment-based anomaly detection with approximated sample covariance matrix in wireless sensor networks. IEEE Trans. Parallel Distrib. Syst. 26(2), 574–583 (2015). https://doi.org/10.1109/TPDS.2014.2308198
42. Can, O., Sahingoz, O.K.: A survey of intrusion detection systems in wireless sensor networks. In: 2015 6th International Conference on Modeling, Simulation, and Applied Optimization (ICMSAO), Istanbul, Turkey, pp. 1–6 (2015). https://doi.org/10.1109/ICMSAO.2015.7152200

Android Malware Detection Using Machine Learning: A Review Naseef-Ur-Rahman Chowdhury(B) , Ahshanul Haque, Hamdy Soliman, Mohammad Sahinur Hossen, Tanjim Fatima, and Imtiaz Ahmed New Mexico Tech, 801 Leroy PL, Socorro, NM, USA {naseef.chowdhury,ahshanul.haque,mohammad.hossen,tanjim.fatima, imtiaz.ahmed}@student.nmt.edu, [email protected]

Abstract. Android malware poses an increasing danger to the safety of mobile devices and the data they hold. Although machine learning (ML) techniques have been shown to be effective at detecting Android malware, a comprehensive analysis of the methods used is required. We review the current state of Android malware detection using machine learning in this paper. We begin by providing an overview of Android malware and the security issues it causes. Then, we look at the various supervised, unsupervised, and deep learning machine learning approaches that have been utilized for Android malware detection. Additionally, we compare the performance of various Android malware detection methods and discuss the performance evaluation metrics used to assess their efficacy. Finally, we draw attention to the drawbacks and difficulties of the methods currently in use and suggest possible future directions for research in this area. Our review thus provides both insights into and a comprehensive overview of the current state of Android malware detection using machine learning. Keywords: Android malware · Mobile security · Machine learning · Detection · Supervised learning · Unsupervised learning · Deep learning

1 Introduction

Android malware attacks have skyrocketed in recent years due to the widespread use of mobile devices. Android malware is malicious software that targets security holes in Android devices. It has the potential to harm one's financial situation as well as gain unauthorized access to personal information. As the number of Android malware attacks continues to rise, the importance of having reliable detection methods grows. Machine learning, a well-established field of computer science, has shown great promise for detecting Android malware. Because they can recognize complex data patterns and learn from large datasets, machine learning algorithms are ideal for detecting Android malware.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 507–522, 2024. https://doi.org/10.1007/978-3-031-47715-7_35

Due to the growing interest in
utilizing machine-learning techniques for Android malware detection, numerous studies have been published in this area. However, due to the scattered nature of the existing studies in this field, a comprehensive review of the machine learning-based approaches utilized for Android malware detection is required. This paper fills this void by reviewing the current state of the art in Android malware detection using machine learning. In our review, we cover the various machine-learning techniques used to detect Android malware, the metrics used for performance evaluation, and the drawbacks and difficulties of the methods currently in use. We identify future research directions for this field in the final section. The purpose of this paper is to provide a comprehensive analysis of how Android malware is detected using machine learning, with particular attention to the approaches used, performance evaluation, potential drawbacks, and directions for future research. The application of machine learning methods to the detection of Android malware is the sole focus of this study, which addresses the following aspects:
• An overview of Android malware and its security risks.
• Examination of the various supervised, unsupervised, and deep learning strategies employed for detecting malware on Android.
• Evaluation of the machine learning methods used to detect malware on Android, including the challenges and limitations of current methods and opportunities for improvement.
• Directions and suggestions for future research in this area.
The remainder of the paper is structured as follows. Section 2 reviews the existing literature, and Sect. 3 describes our methodology.
Outcomes and discussion are presented in Sect. 4, and our conclusion is stated in Sect. 5.

2 Literature Review

2.1 Overview of the Relevant Research

Due to the growing number of Android devices and the associated security risks posed by Android malware, the field of Android malware detection using machine learning has seen significant growth in recent years. For the purpose of detecting Android malware, supervised learning, unsupervised learning, and deep learning strategies have all been proposed by researchers [24]. Support vector machines (SVMs) and decision trees, two examples of supervised learning techniques, have been extensively utilized in Android malware detection [25]. In order to construct a model that is capable of distinguishing between legitimate and malicious Android applications, these methods rely on labeled training data.
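As a toy illustration of the supervised setting described above, a one-node decision tree (decision stump) trained on labeled feature vectors might look like the following. The single "dangerous permission count" feature and the data are hypothetical, purely for illustration:

```python
def best_stump(X, y):
    """Exhaustively pick the (feature, threshold) split that best separates
    benign (0) from malicious (1) samples -- a one-node 'decision tree'.
    Returns (accuracy, feature_index, threshold)."""
    best = None
    n_feat = len(X[0])
    for f in range(n_feat):
        for t in sorted({row[f] for row in X}):
            # Predict malicious when the feature value reaches the threshold.
            pred = [1 if row[f] >= t else 0 for row in X]
            acc = sum(p == label for p, label in zip(pred, y)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, f, t)
    return best

# Feature = number of dangerous permissions requested (hypothetical data).
X = [[1], [0], [2], [7], [9], [8]]
y = [0, 0, 0, 1, 1, 1]
print(best_stump(X, y))  # (1.0, 0, 7)
```

A real decision tree recursively applies this split search to each resulting subset; the labeled training data requirement mentioned above is visible in the `y` argument.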


Android malware detection has also utilized unsupervised learning techniques such as clustering and dimensionality reduction. These techniques do not require labeled training data and can recognize patterns in the data that may indicate malware [26]. Deep learning techniques such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been demonstrated to be effective for Android malware detection [27]. By utilizing deep neural networks to acquire intricate data representations, these methods can boost malware detection accuracy compared to conventional ML approaches [27]. In addition to ML methods, Android malware detection has also utilized signature-based, rule-based, and heuristic-based techniques [13,14]. However, this survey focuses on the use of machine learning techniques for Android malware detection.

2.2 Classification of the Existing Approaches

Various criteria, such as the type of learning, the features used, and the performance evaluation metrics, can be used to classify the machine-learning approaches for detecting Android malware. By type of learning, there are two main types of machine learning approaches for Android malware detection: supervised and unsupervised [26]. Supervised learning methods require labeled training data to construct a model, whereas unsupervised learning methods do not. By the features they employ, machine learning methods for Android malware detection can be further categorized into the following groups [28]:
Static analysis-based methods: these use features extracted by static analysis, such as the permissions an Android application requests and its code structure.
Dynamic analysis-based methods: these use characteristics gleaned from the dynamic analysis of Android applications, such as network communication patterns and the application's behavior while running on a device.
Hybrid methods: these employ a mix of static and dynamic analysis-based features.
Based on the metrics used for performance evaluation, there are several further categories:
Accuracy-based methods: these evaluate the machine learning model using accuracy metrics such as precision, recall, and the F1-score.
Time-based methods: these assess the machine learning model using time metrics, such as the time needed to build the model and to make predictions.
Robustness-based methods: these evaluate the robustness of the machine learning model to adversarial examples, such as malware samples designed to evade detection.
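A toy sketch of the static-analysis feature extraction described above is encoding an app's requested permissions as a binary vector. The permission vocabulary and helper function here are our illustrative assumptions; real systems parse the requested permissions from AndroidManifest.xml:

```python
# Hypothetical permission vocabulary; a real system derives it from the corpus.
VOCAB = ["INTERNET", "READ_SMS", "SEND_SMS", "READ_CONTACTS", "CAMERA"]

def permission_vector(requested):
    """Encode an app's requested permissions as a binary feature vector."""
    return [1 if p in requested else 0 for p in VOCAB]

benign = permission_vector({"INTERNET", "CAMERA"})
suspect = permission_vector({"INTERNET", "READ_SMS", "SEND_SMS", "READ_CONTACTS"})
print(benign)   # [1, 0, 0, 0, 1]
print(suspect)  # [1, 1, 1, 1, 0]
```

Vectors like these are what the supervised and unsupervised classifiers discussed in this section actually consume; dynamic-analysis features (network patterns, runtime behavior) would simply extend the vector with additional dimensions.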


In summary, classifying the approaches used for Android malware detection by the type of learning, the features used, and the performance evaluation metrics provides a clear understanding of the various machine-learning approaches used for this task and of the criteria used to evaluate them.

2.3 Comparison of the Approaches

The authors in [1] present a new deep learning-based approach to detecting Android malware, aimed at improving the accuracy and efficiency of Android malware detection. They utilized the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) ML algorithms along with 20,000 Android APKs (10,000 benign and 10,000 malware). The results showed that the proposed system achieved a high accuracy of 97.12%, and that the deep learning-based approach outperformed traditional ML approaches in terms of accuracy. The authors in [2] focused on the use of deep neural networks for attribute-based recommendation. The utilized deep neural network consists of multiple layers: an input layer, hidden layers, and an output layer. The output layer predicts a score indicating the likelihood that the user will prefer an item, after the input layer receives user-item attributes. A real-world movie dataset containing information about users, movies, and their ratings was used in the experiments. Precision, recall, F1-score, and mean average precision are among the evaluation metrics used to assess the proposed recommendation system's performance. The proposed algorithm outperforms other traditional recommendation algorithms in terms of precision and recall, demonstrating the effectiveness of deep neural networks for attribute-based recommendation. The authors in [3] proposed a strategy for adversarial attacks on trading agents based on deep reinforcement learning. The authors put their method through its paces in two distinct trading settings: synthetic and historical stock-market datasets. A reinforcement learning algorithm is used to teach a deep neural network to make trades based on market conditions.
The authors then modify the decisions made by the agent by adding adversarial perturbations to the market state. The results demonstrate that adversarial attacks can significantly affect the performance of deep reinforcement learning-based trading agents; the performance metric used is the profit or loss of the agent's trades. The attacks succeeded in some instances but failed in others, where the agent still made a profit. The authors conclude that reinforcement learning-based trading agents must be made robust against such attacks. The authors in [4] aimed at a comprehensive understanding of the characteristics and evolution of Android malware. In order to identify common patterns and behaviors of malware, the authors investigated a large dataset of Android malware and a dataset of benign applications. Additionally, they investigated the
development of Android malware over time to understand how it has advanced and changed. Though the authors claimed high accuracy, the paper does not specify the quantitative metrics used for performance evaluation, nor does it clearly describe the utilized algorithms. The study [5] presented "Virus Detection and Alert for Smartphones" [34]. The authors presented a system capable of detecting malware on a smartphone in real time and alerting the user. However, though the authors state that they used dynamic analysis and report high accuracy, the paper does not clearly describe the utilized algorithm, the exact accuracy results, or the evaluation metrics. The authors in [6] presented PUMA (Permission Usage to detect Malware in Android), a novel strategy for detecting malware on Android devices. The authors contend that malware's excessive use of permissions can serve as a detection signature for malicious applications. PUMA employs an ML-based algorithm that trains a classifier on a dataset of more than 4,000 APKs containing both benign and malicious apps. The permissions requested by an app and their usage patterns are the features used for classification. The authors stated that PUMA detects malware with an accuracy of over 90% and a low false-positive rate. In [7], a virus detection system based on data mining techniques is presented. The authors contend that large software datasets can be mined for patterns and features that can be used to identify malware. The paper does not describe the virus detection system's algorithm; however, the authors state that they identify malware-indicating patterns and features by employing data mining methods such as association rule mining and ML decision tree algorithms. The paper does not specify the data used to evaluate the system's performance.
However, the authors claim that they evaluated a large dataset of software comprising both benign and malicious samples. The paper does not specify the performance metric used to evaluate the results, but the authors assert that their virus detection system achieves a low false-positive rate and high accuracy in identifying malware.

The behavior of modern malware in the presence of anti-virtualization and anti-debugging techniques is the subject of the study in [8]. The authors argue that in light of the growing threat posed by malware, these methods, which are used to detect and prevent malicious activity, have become increasingly important. The behavior of malware in the presence of anti-virtualization (AV) and anti-debugging (AD) techniques is thoroughly examined: using a dataset of real-world malware samples, the authors evaluated each sample's behavior when running in a virtual environment and when being debugged. In addition, the authors developed a classification framework to categorize the various AV and AD behaviors observed in the malware samples.
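To make the permission-based detection idea behind PUMA concrete, the following is a minimal sketch, not the authors' implementation: the permission sets, the odds-ratio scoring rule, and the threshold are all invented for illustration.

```python
# Illustrative sketch of permission-based malware detection (PUMA-style):
# flag an app whose requested permissions are, on average, far more
# common in known-malicious training apps than in benign ones.
from collections import Counter

def train_permission_scores(benign_apps, malware_apps):
    """Return a per-permission odds-ratio-style score from training data."""
    benign_counts = Counter(p for app in benign_apps for p in app)
    malware_counts = Counter(p for app in malware_apps for p in app)
    scores = {}
    for p in set(benign_counts) | set(malware_counts):
        # Laplace smoothing keeps permissions unseen in one class bounded.
        b = (benign_counts[p] + 1) / (len(benign_apps) + 2)
        m = (malware_counts[p] + 1) / (len(malware_apps) + 2)
        scores[p] = m / b  # > 1 means more common in malware
    return scores

def classify(app_permissions, scores, threshold=1.5):
    """Flag the app if its mean permission score exceeds the threshold."""
    vals = [scores.get(p, 1.0) for p in app_permissions]
    mean = sum(vals) / len(vals) if vals else 1.0
    return "malware" if mean > threshold else "benign"

benign = [{"INTERNET"}, {"INTERNET", "CAMERA"}]
malware = [{"SEND_SMS", "READ_CONTACTS", "INTERNET"},
           {"SEND_SMS", "RECEIVE_BOOT_COMPLETED"}]
scores = train_permission_scores(benign, malware)
print(classify({"SEND_SMS", "READ_CONTACTS"}, scores))  # prints "malware"
```

A real system would train on thousands of APKs and use a proper ML classifier; the point here is only that requested permissions alone already yield a usable feature vector.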


A dataset of actual malware samples was used in the study. The performance metric is the classification framework's ability to accurately categorize the various kinds of AV and AD behaviors. The study reveals a wide range of anti-virtualization and anti-debugging behaviors in contemporary malware. The authors also find that these behaviors are becoming more sophisticated, making them hard for anti-malware techniques to counter.

In [9], a singular value decomposition (SVD) method for detecting metamorphic malware was presented. The authors evaluated the method's effectiveness on a large dataset of benign and metamorphic executables. The algorithm is based on SVD, a mathematical technique for analyzing the structure of data: singular values extracted from the executables' opcode sequences are used as features in an ML classifier. The method's efficacy was compared against traditional analysis techniques such as control flow graph (CFG) analysis and opcode n-gram analysis. The experiments used a large collection of benign and metamorphic executables from a variety of sources, and performance was evaluated using accuracy, false-positive, and false-negative rates. With an accuracy of 94.2% and a false-positive rate of 0.7%, the SVD-based method outperformed the conventional analysis methods [9]. The authors concluded that metamorphic malware can be effectively detected with SVD.

In [10], a novel strategy for synthesizing malware specifications from suspicious behaviors is presented. The authors aim to solve the problem of finding malware in large, complicated software systems, where traditional signature-based methods are frequently insufficient. Through dynamic analysis of software systems, they derived a novel algorithm for synthesizing malware specifications from suspicious behaviors.
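Opcode n-gram analysis, one of the baseline techniques the SVD method in [9] is compared against, can be sketched in a few lines; the opcode sequence below is made up for illustration.

```python
# Hedged sketch of opcode n-gram feature extraction: count overlapping
# n-grams over a linear opcode sequence and use the counts as features.
from collections import Counter

def opcode_ngrams(opcodes, n=2):
    """Count overlapping n-grams over a linear opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

seq = ["mov", "push", "call", "mov", "push", "call", "ret"]
features = opcode_ngrams(seq, n=2)
print(features[("push", "call")])  # prints 2, the most frequent bigram here
```

In practice each executable's n-gram counts form one row of a (very sparse) feature matrix that a classifier, or the SVD step of [9], is then applied to.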
The algorithm combines a cost model with the findings of dynamic analysis to produce malware specifications that are near-optimal in terms of coverage and specificity. Software systems and their dynamic analysis results formed the data used in the study. Performance is measured as the accuracy of the synthesized malware specifications, expressed in terms of both coverage (the proportion of malicious behavior that is detected) and specificity (the proportion of benign behavior that is not flagged). The study demonstrates that the proposed algorithm can synthesize near-optimal specifications for suspicious behavior and that it outperforms conventional signature-based methods in terms of accuracy [29], indicating its potential for enhancing malware detection in large, complex software systems.

The authors in [11] presented a new approach for detecting malware on end-user devices. They propose a system that integrates multiple techniques for detecting malware, including signature-based detection, behavioral analysis,


and data mining, to achieve improved accuracy and efficiency compared to traditional methods. The authors use a combination of dynamic and static analysis techniques to extract features from malware specimens and build models used to detect malware on end-user devices. The system's performance is evaluated on a large dataset of benign and malicious software, and the results show that it detects malware with high accuracy while incurring low overhead. The performance metric is the accuracy of the malware detection system, measured as the proportion of benign and malicious specimens that are correctly classified. The results show that the proposed system is effective and efficient at detecting malware on end-user devices and that it outperforms traditional methods in both accuracy and efficiency, demonstrating its potential for improving the security of end-user devices.

The authors in [12] suggested AccessMiner (AM), a system that uses system-centric models to study software behavior and spot malicious activity. AM builds a system-centric model of how software behaves on a device and then uses this model to find anomalies that could indicate malicious behavior. The system constructs models of typical software behavior by employing ML algorithms and a combination of static and dynamic analysis methods to extract features from software samples. Using a large dataset of both benign and malicious software samples, the authors assess AM's performance. The study demonstrates that AM outperforms conventional methods in terms of both efficiency and accuracy in malware detection [12].
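The coverage and specificity measures used in [10] above to judge synthesized specifications can be sketched as follows; the behavior names and the toy specification are hypothetical.

```python
# Sketch of the two quality measures described for synthesized malware
# specifications: coverage (fraction of malicious behaviors detected)
# and specificity (fraction of benign behaviors not flagged).
def coverage(spec, malicious_behaviors):
    hits = sum(1 for b in malicious_behaviors if b in spec)
    return hits / len(malicious_behaviors)

def specificity(spec, benign_behaviors):
    misses = sum(1 for b in benign_behaviors if b not in spec)
    return misses / len(benign_behaviors)

spec = {"write_autostart", "send_premium_sms", "read_contacts"}
malicious = ["write_autostart", "send_premium_sms", "hide_icon", "read_contacts"]
benign = ["read_contacts", "network_io", "write_prefs", "show_notification"]
print(coverage(spec, malicious))   # prints 0.75: 3 of 4 malicious behaviors covered
print(specificity(spec, benign))   # prints 0.75: 3 of 4 benign behaviors not flagged
```

A cost model such as the one in [10] would then trade these two quantities off when choosing which behaviors to include in the specification.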
The study's algorithm combines system-centric models, static and dynamic analysis, and machine learning. It relies on a substantial set of examples of both benign and malicious software, with the detection system's accuracy, expressed as the proportion of samples correctly classified, as the performance metric.

In [13], the authors presented a smart approach for detecting Android malware in a large dataset. They utilized some of the most popular Android datasets, such as VirusTotal [18], Marvin [17], Drebin [21], and Malgenome [19,20]. The authors propose an ML-based approach that uses the permissions requested by an Android app for malware detection. The paper identified a list of sensitive permissions that are not supposed to be requested by any user application but rather should be used only by system apps. The same group extended their work, proposing a method for detecting Android malware using API calls [14]. The proposed approach involves creating a feature vector based on API calls and permissions, which is then used to train an ML classifier. The performance of the proposed method was evaluated on a large dataset, and the results showed improved accuracy compared to existing approaches [13,14]. The authors conclude that the combination of API calls and permissions (see Figs. 1 and 2 for a list of sensitive APIs and permissions) can serve as a robust and effective feature set for detecting malware on Android devices. Performance was evaluated using several metrics: accuracy, precision, recall, and F1-score. The results show that the proposed approach outperforms other state-of-the-art methods [14], achieving an accuracy of 99.08%, a precision of 98.55%, a recall of 99.20%, and an F1-score of 98.87%.
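The evaluation metrics quoted throughout this section (accuracy, precision, recall, F1-score) all derive from the confusion matrix; a sketch with invented counts, not the figures from any reviewed paper:

```python
# Compute the standard detection metrics from confusion-matrix counts:
# tp = malware correctly flagged, fp = benign wrongly flagged,
# fn = malware missed, tn = benign correctly passed.
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=992, fp=15, fn=8, tn=985)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```

Because a dataset with very few malware samples can yield high accuracy while missing most malware, surveys such as this one report precision, recall, and F1 alongside accuracy.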

Fig. 1. List of Sensitive APIs [14]

The static analysis involves extracting features from the Android Manifest and the Dalvik bytecode, while the dynamic analysis involves capturing system calls and network behavior. The evaluation dataset consists of over 10,000 Android applications, of which 5,000 are benign and 5,000 are malicious.

In [15], the focus was on a hybrid deep learning model for Android malware detection. The authors used LSTM [30] and CNN algorithms [31] with two datasets: one from AndroZoo [22] and the other from VirusShare [23]. In terms of accuracy and F1-score, the experiments show that the hybrid deep learning model outperforms conventional ML algorithms [32], demonstrating the method's efficacy for Android malware detection.

In [16], a deep learning-based Android malware detection system is presented. The authors made use of a two-phase deep learning model consisting of a training phase and a prediction (testing) phase. The model is trained on a large dataset of both benign and malicious applications; during the prediction phase, the deep learning model judges whether an unseen Android application is malicious. Over 10,000 legitimate and malicious Android applications were used in

the authors' dataset. The data came from Google Play, third-party marketplaces, malware databases [33], and other sources. Accuracy, precision, recall, F1-score, and area under the ROC curve (AUC) were used by the authors to measure the performance of MAPAS (Malware Analysis and Protection Using Artificial Intelligence System). The findings demonstrate that MAPAS was able to identify Android malware with an accuracy of more than 98%. In addition, the system demonstrated high precision, recall, F1-score, and AUC, all of which indicate its effectiveness in detecting Android malware.

Fig. 2. List of Sensitive Permissions [13]
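The AUC metric reported for MAPAS can be computed without plotting a ROC curve, via the Mann-Whitney formulation: the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. A sketch with invented classifier scores:

```python
# AUC via the Mann-Whitney U statistic: fraction of (malicious, benign)
# pairs in which the malicious sample is ranked higher (ties count half).
def auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

malicious_scores = [0.9, 0.8, 0.75, 0.3]
benign_scores = [0.2, 0.4, 0.35, 0.8]
print(auc(malicious_scores, benign_scores))  # prints 0.71875
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it complements threshold-dependent metrics such as accuracy.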

3 Methodology

3.1 Overview of the Selection Criteria

A set of selection criteria was established to guide the selection of the studies included in this survey. These criteria ensure a comprehensive review of the various ML approaches utilized for the detection of Android malware and a complete overview of the current state of the art. The selection criteria are:


Relevance: This includes research on malware detection on Android platforms using ML algorithms.
Year of Publication: To keep up with the latest developments, studies published in recent years (since 2015) were given priority.
Methodology: The studies included in this survey must make use of ML algorithms for Android malware detection.
Evaluation: Quantitative metrics such as accuracy, time, and robustness are used in the evaluations of the ML algorithms included in this survey.
Data Availability: The included studies either have to make the evaluation data publicly available or provide enough information to make it possible to reproduce the results.

3.2 Selection of the Papers

A comprehensive search was carried out using multiple sources, including Google Scholar and online databases such as IEEE Xplore, Springer, ACM Digital Library, and ScienceDirect, to locate relevant studies for this review. A set of keywords related to Android malware detection and machine learning was used in the search. The initial search yielded a plethora of results, which were then filtered according to the selection criteria outlined in the preceding section. To determine each study's relevance and suitability for inclusion in this survey, the abstract and full text were thoroughly examined during this process.

3.3 Data Collection and Analysis

The selected papers were thoroughly examined during the process of data collection and analysis to obtain pertinent information on Android malware detection using ML. To ensure that this review's findings are consistent, complete, and current, this information was collected in a structured manner from each paper. The goal of the analysis was to find common themes, trends, and gaps in the existing literature. The results of this analysis make possible an overview of the current state of the art in Android malware detection using ML, including the advantages and disadvantages of the methods in use. Additionally, the data gathered from the selected papers were used to compare and contrast the various approaches and to determine potential areas of future study. This analysis provides a comprehensive understanding of the field's current state, as well as the main obstacles and opportunities for future research.

4 Results and Discussion

4.1 Overview of the Key Findings

This section presents the main findings of this literature review on Android malware detection with ML. A comprehensive analysis of the selected papers, which were chosen based on the established selection criteria, serves as the foundation for the findings. The most important findings are summarized below.

ML algorithms are used for Android malware detection in the majority of the studies examined in this paper; ML has become the workhorse for this task. A variety of ML algorithms are utilized, including decision trees, artificial neural networks, support vector machines, and others. Depending on the system's particular requirements and the nature of the data being analyzed, different ML algorithms vary in how well they carry out the malware detection task.

The performance of an Android malware detection system is highly dependent on the selected dataset; the choice of dataset can significantly affect the outcomes. A variety of datasets, both real-world and synthetic, were used in the reviewed studies.

The reviewed studies employ a wide range of evaluation metrics, such as accuracy, precision, recall, and the F1-score. This variation emphasizes the importance of selecting an evaluation metric appropriate to the system's particular requirements.

4.2 Summary of the Contributions

Based on our comprehensive literature review on Android malware detection using machine learning, the main contributions of this review are:

1. A systematic review of relevant sources: The relevant literature on Android malware detection using machine learning is systematically examined in this review. The papers were chosen using the established selection criteria, and thorough and systematic data collection and analysis were carried out.
2. An overview of how Android malware is detected using machine learning: The various machine learning algorithms and datasets used in Android malware detection are covered in this paper. Researchers and practitioners seeking to comprehend the current state of the art in this field may find this information helpful.
3. An analysis of the advantages and disadvantages of current methods: The current machine learning-based methods for Android malware detection are compared and contrasted in this review. The review sheds light on the difficulties and drawbacks of these approaches and reveals the areas that require additional investigation.
4. Identification of future directions for research: Future directions for machine learning-based Android malware detection research are identified in this review, along with suggestions for enhancing the performance of existing methods and developing new, more efficient methods for this task.


By providing a comprehensive overview of the current state of the art, evaluating the strengths and weaknesses of existing approaches, and identifying future research directions, this review makes a significant contribution to the field of Android malware detection using machine learning. The paper's findings can be used to guide the creation of more effective Android malware detection systems and to advance future research in this field.

4.3 Discussion of the Limitations

Although the current review provides a comprehensive overview of the existing literature on the application of machine learning to the detection of Android malware, it does have some limitations:

1. Literature coverage: The current review examines the literature published up to a certain point, so it might not include the most recent work on this subject. As a result, some significant research or developments in this area may have been missed.
2. Dataset bias: The effectiveness of machine learning algorithms for Android malware detection depends on the quality and composition of the datasets used. Numerous studies have used datasets that may not accurately represent the real-world distribution of malware or may be biased toward particular types of malware [33], which may limit the generalizability of their findings.
3. Missing standard evaluation metrics: The absence of a standard evaluation metric presents a significant obstacle when assessing the effectiveness of machine learning algorithms for Android malware detection. Because different studies use different metrics, their results are difficult to compare.
4. Demand for extensive and varied datasets: To accurately capture the patterns and characteristics of malware, ML algorithms for Android malware detection require extensive and diverse datasets. However, obtaining such datasets is difficult, and numerous previous studies have utilized smaller or less diverse datasets, limiting the algorithms' accuracy [33].
5. Complexity of Android malware: Android malware is highly dynamic, i.e., constantly changing, which makes it challenging to develop efficient ML detection algorithms. This complexity makes it difficult to build algorithms that adapt to shifts in the malware landscape and accurately detect all types of malware.
Even though there are some limitations, this review’s findings are a good place to start more research on Android malware detection with machine learning. The limitations provide insight into how to improve the performance of existing algorithms and how to develop more efficient algorithms for this task. They also highlight the areas in which additional research is required.

4.4 Identification of Future Research Directions

Based on this review's findings, the following are some possible directions for future machine learning-based Android malware detection research:

1. Development of diverse and more accurate datasets: The absence of extensive and diverse datasets is one of the greatest obstacles to developing efficient machine learning algorithms for Android malware detection. Future research should focus on creating datasets that more accurately represent the real-world distribution of malware.
2. Utilization of deep learning methods: Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two examples of deep learning methods that have demonstrated promising results in numerous applications, including speech and image recognition [31]. Future research should focus on advancing these techniques for malware detection and identifying the settings in which they outperform alternative methods.
3. Creation of adaptive and dynamic algorithms: Because Android malware and its application environments are highly dynamic and constantly changing, future research should focus on developing adaptable algorithms that can respond to shifts in the malware landscape.
4. Inclusion of security-related features: Many studies have used features, such as code structure and API calls, that are not specifically related to security. Security-related features such as permission requests and system logs should be investigated in more depth in future research.
5. Comparative evaluation of algorithms: The absence of a standard evaluation metric presents a significant obstacle when assessing the effectiveness of machine learning algorithms for Android malware detection. The development of a standard evaluation metric, and the comparative evaluation of algorithms using it, should be a primary focus of subsequent research.
6. Integration with current security measures: ML-based Android malware detection can be integrated with existing security systems to offer greater protection against malware. The effectiveness of these algorithms and their integration with existing security systems should be investigated and evaluated in future research.

In summary, the field of Android malware detection using ML is rapidly evolving and leaves ample room for additional research. The future research directions identified in this review can help advance the field, enhance the effectiveness of Android malware detection algorithms, and serve as a useful starting point for further work.

5 Conclusion

Android malware has become a serious threat to the security of the Android platform and its users in recent years. Android malware detection has become a vital area of research due to the rapid growth of mobile devices and the ease with which malicious software can be distributed by intruders. ML-based solutions have been proposed and implemented to address this critical issue. In this paper, we conducted a comprehensive literature review on the use of ML to detect Android malware. Our objective was to provide a comprehensive understanding of the current state of the art in this field, to highlight the limitations and shed some light on future research directions, and to highlight the most important findings and contributions of the most recent related research.

Through our comprehensive review of the relevant literature, we found that ML has been extensively used for Android malware detection and has been demonstrated to be effective in detecting malware in numerous instances. Decision trees, random forests, support vector machines, artificial neural networks, and deep learning-based strategies are among the ML algorithms that have been utilized for this purpose. System calls, API calls, and permissions are among the feature sets that have been used as input for training these algorithms.

Additionally, our literature review revealed that much more research is required to address some of the current approaches' drawbacks. For instance, the generalizability of many existing methods to new and evolving malware is poorly understood because they are only tested on a small number of malware types. Moreover, more in-depth evaluations of these approaches are required, with an increased focus on the trade-off between efficiency and accuracy. In conclusion, the current state of the art in Android malware detection using machine learning is comprehensively reviewed in this paper.
This survey's significant findings and contributions offer valuable insights to researchers and practitioners in the field, and its limitations and future research directions serve as a road map for further work. We believe this paper will be a useful reference for those interested in the ongoing development of effective and efficient ML-based solutions to detect and prevent Android malware, a crucial area of research with practical significance.

References

1. Mahindru, A., Sangal, A.L.: MLDroid-framework for Android malware detection using machine learning techniques. Neural Comput. Appl. 33, 5183–5240 (2021)
2. Mahindru, A., Singh, P.: Dynamic permissions based Android malware detection using machine learning techniques. In: Proceedings of the 10th Innovations in Software Engineering Conference (ISEC '17), pp. 202–210. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3021460.3021485


3. Zhou, Y., Wang, Z., Zhou, W., Jiang, X.: Hey, you, get off of my market: detecting malicious apps in official and alternative Android markets. In: Proceedings of the 19th Annual Network & Distributed System Security Symposium (2012)
4. Zhou, Y., Jiang, X.: Dissecting Android malware: characterization and evolution. In: 2012 IEEE Symposium on Security and Privacy (2012)
5. Cheng, J., Wong, S.H., Yang, H., Lu, S.: SmartSiren: virus detection and alert for smartphones. In: International Conference on Mobile Systems, Applications, and Services (MobiSys) (2007)
6. Sanz, B., Santos, I., Laorden, C., Ugarte-Pedrero, X., Bringas, P.G., Alvarez, G.: PUMA: permission usage to detect malware in Android. In: Advances in Intelligent Systems and Computing (AISC) (2012)
7. Wang, J., Deng, P., Fan, Y., Jaw, L., Liu, Y.: Virus detection using data mining techniques. In: Proceedings of the IEEE International Conference on Data Mining (2003)
8. Chen, X., Andersen, J., Mao, Z., Bailey, M., Nazario, J.: Towards an understanding of anti-virtualization and anti-debugging behavior in modern malware. In: DSN (2008)
9. Jidigam, R.K., Austin, T.H., Stamp, M.: Singular value decomposition and metamorphic detection. J. Comput. Virol. Hacking Tech. 11(4), 203–216 (2014). https://doi.org/10.1007/s11416-014-0220-0
10. Fredrikson, M., Jha, S., Christodorescu, M., Sailer, R., Yan, X.: Synthesizing near-optimal malware specifications from suspicious behaviors. In: Proceedings of the 2010 IEEE Symposium on Security and Privacy, pp. 45–60 (2010)
11. Kolbitsch, C., Comparetti, P.M., Kruegel, C., Kirda, E., Zhou, X., Wang, X.: Effective and efficient malware detection at the end host. In: USENIX Security (2009)
12. Lanzi, A., Balzarotti, D., Kruegel, C., Christodorescu, M., Kirda, E.: AccessMiner: using system-centric models for malware protection. In: CCS (2010)
13. Arai, K., Bhatia, R. (eds.): FICC 2019. LNNS, vol. 70. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-12385-7
14. Chowdhury, M.N.-U.-R., Alahy, Q.E., Soliman, H.: Advanced android malware detection utilizing API calls and permissions. In: Kim, H., Kim, K.J. (eds.) IT Convergence and Security. LNEE, vol. 782, pp. 123–134. Springer, Singapore (2021). https://doi.org/10.1007/978-981-16-4118-3_12
15. Lu, T., Du, Y., Ouyang, L., Chen, Q., Wang, X.: Android malware detection based on a hybrid deep learning model. Secur. Commun. Netw. 2020, Article ID 8863617, 11 p. (2020)
16. Kim, J., Ban, Y., Ko, E., et al.: MAPAS: a practical deep learning-based android malware detection system. Int. J. Inf. Secur. 21, 725–738 (2022)
17. MARVIN: Efficient and Comprehensive Mobile App Classification through Static and Dynamic Analysis
18. VirusTotal. http://www.virustotal.com/gui/graph-overview
19. Zhou, Y., Wang, Z., Zhou, W., Jiang, X.: Hey, you, get off of my market: detecting malicious apps in official and alternative Android markets. In: Proceedings of the 19th Annual Network & Distributed System Security Symposium (2012)
20. Zhou, Y., Jiang, X.: Dissecting Android malware: characterization and evolution. In: 2012 IEEE Symposium on Security and Privacy (2012)
21. Arp, D., Spreitzenbarth, M., Huebner, M., Gascon, H., Rieck, K.: Drebin: efficient and explainable detection of Android malware in your pocket. In: 21st Annual Network and Distributed System Security Symposium (NDSS) (2014)


22. Allix, K., Bissyandé, T.F., Klein, J., Traon, Y.L.: AndroZoo: collecting millions of Android apps for the research community. In: 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), Austin, TX, USA, pp. 468–471 (2016)
23. VirusShare. http://virusshare.com/
24. Liu, Z., Wang, R., Japkowicz, N., Tang, D., Zhang, W., Zhao, J.: Research on unsupervised feature learning for Android malware detection based on restricted Boltzmann machines. Future Gener. Comput. Syst. 120, 91–108 (2021). https://doi.org/10.1016/j.future.2021.02.015
25. Liu, Z.-L., Yang, M., Chen, X., Luo, Y., Zhang, H.: An Android malware detection model based on DT-SVM. Secur. Commun. Netw. (2020). https://doi.org/10.1155/2020/8841233
26. AlJarrah, M.N., Yaseen, Q.M., Mustafa, A.M.: A context-aware Android malware detection approach using machine learning. Information 13(12), 563 (2022). https://doi.org/10.3390/info13120563
27. Jeon, S., Moon, J.: Malware-detection method with a convolutional recurrent neural network using opcode sequences. Inf. Sci. 535, 1–15 (2020). https://doi.org/10.1016/j.ins.2020.05.026
28. Lee, J., Jang, H., Ha, S., Yoon, Y.: Android malware detection using machine learning with feature selection based on the genetic algorithm. Mathematics 9(21), 2813 (2021). https://doi.org/10.3390/math9212813
29. Kwon, H.-Y., Kim, T., Lee, M.-K.: Advanced intrusion detection combining signature-based and behavior-based detection methods. Electronics 11(6), 867 (2022). https://doi.org/10.3390/electronics11060867
30. Pulver, A., Lyu, S.: LSTM with working memory. In: 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, pp. 845–851 (2017). https://doi.org/10.1109/IJCNN.2017.7965940
31. Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M.A., Al-Amidie, M., Farhan, L.: Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8(1), 1–74 (2021). https://doi.org/10.1186/s40537-021-00444-8
32. Yahya, A.E., Gharbi, A., Yafooz, W.M.S., Al-Dhaqm, A.: A novel hybrid deep learning model for detecting and classifying non-functional requirements of mobile apps issues. Electronics 12(5), 1258 (2023). https://doi.org/10.3390/electronics12051258
33. Liu, K., Xu, S., Xu, G., Zhang, M., Sun, D., Liu, H.: A review of Android malware detection approaches based on machine learning. IEEE Access 8, 124579–124607 (2020). https://doi.org/10.1109/ACCESS.2020.3006143
34. Cheng, J., Wong, S.H., Yang, H., Lu, S.: SmartSiren: virus detection and alert for smartphones. In: Proceedings of the 5th International Conference on Mobile Systems, Applications and Services (MobiSys '07), pp. 258–271. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1247660.1247690

A Data Protection Design for Online Exam Proctoring in Compliance with the Indonesian Personal Data Protection Law

Sidik Prabowo1(B), Maman Abdurohman1, Hilal Hudan Nuha1, and Sarwono Sutikno2

1 Telkom University, Bandung, West Java 40267, Indonesia

[email protected]
2 Institut Teknologi Bandung, Bandung, West Java 40116, Indonesia

Abstract. This paper proposes a personal data protection system for online exam proctoring that complies with Indonesian personal data protection laws. Ratified in 2022, the Law comes with a grace period until 2024 granted by the Government of Indonesia so that related parties can adapt their services, both electronic and non-electronic. Later violations of this Law can be subject to criminal penalties of 4–6 years, a maximum fine of 6 billion Rupiah per incident, and a maximum sanction of 2% of total annual income. This is a strong motivation for related parties to ensure compliance with the PDP Law. Here we propose applying a data labeling model in combination with Fusion/UML to provide access control and guarantee privacy for access to personal data in compliance with the PDP Law. We illustrate the proposed design with a case study of face recognition-based remote proctoring in the online proctoring Learning Management System that runs at our institution. Keywords: Privacy-critical · Indonesia · Personal data protection · Learning management system · Online proctoring exam

1 Introduction

Violations of the personal data protection (PDP) law in Indonesia related to the processing of personal data, the rights of personal data subjects, or non-compliance with the articles of the PDP Law can be sanctioned with a maximum prison sentence of 6 years, a maximum fine of 6 billion Rupiah for individuals, and up to 10 times that penalty if the violation is committed by a corporation (Article 70 paragraph 3) [1]. The PDP Law, passed in October 2022, contains 15 chapters, 371 problem inventory list (DIM) items, and 76 articles regulating the management of personal data in Indonesia by various parties [2].

Meanwhile, COVID-19 has forced an increase in online learning, including online exams [3]. As with conventional exams, online exams are prone to cheating [4]. Online proctoring systems are a solution against cheating in online exams [5]. However, online proctoring systems, which usually use surveillance

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 523–535, 2024. https://doi.org/10.1007/978-3-031-47715-7_36

524

S. Prabowo et al.

cameras, are sometimes seen as a threat to data privacy [6]. Observing and adjusting proctoring services to comply with the PDP Law is a challenge for these service providers [7].

Several studies have created privacy ontologies or models to comply with privacy policies. Arruda et al. [8] researched a privacy protection ontology for IoT, compiled from 27 requirement points; across seven metrics, it is lighter than state-of-the-art privacy policies. Rajagopalan et al. [9] created a privacy framework for smart metering. The issue they discussed is that smart metering data is sensitive, since load forecasting can usually reveal everyone's daily activities; their framework complies with the restrictions put forward by America's National Institute of Standards and Technology (NIST). Mrissa et al. [10] created a privacy model for data as a service (DaaS). Their concern is that today's DaaS offers fast access but makes no distinction between the service provider and data provider roles; their privacy model complies with the privacy data tree (PDT) declared by the World Wide Web Consortium (W3C). Kammueller et al. [11] modeled the implications of GDPR compliance, mandatory since 2018 in EU member countries, using IoT-based Alzheimer's monitoring.

From the literature review above, and given that the Indonesian PDP Law was only issued at the end of 2022, no research has been found that discusses compliance with the Indonesian PDP Law. This indicates a research opportunity to create a data privacy model for online exam proctoring that complies specifically with the Indonesian PDP Law. Departing from this problem, in this paper we show how a service involving personal data can remain in line with the substance of the PDP Law.
To achieve this goal, we extracted the relevant legal documents and highlighted their technical parts. We then give examples of applying the principles of the PDP Law in the context of online learning to provide data protection. Finally, we describe a global architecture that complies with the PDP Law, with analysis and design in UML. To the best of our knowledge, no prior study has built a privacy model for online exam proctoring that complies with the Indonesian PDP law. The contributions of this paper are:

• A novel privacy model for online exam proctoring
• A novel privacy model that complies with Indonesia's PDP law
• A translation of the legal document UU PDP No. 27 of 2022 (Indonesia) into technical requirements that are easier for developers to understand
• A demonstration of applying the technical requirements of the PDP Law to a Learning Management System (LMS) case study to protect personal data

The rest of this paper is organized as follows: Sect. 2 presents theories and discussions related to our research. Section 3 describes our proposed design of the privacy model based on the PDP law. Section 4 reports the test results and compares them with state-of-the-art research. Finally, Sect. 5 concludes our study.

2 Background

2.1 Indonesian Personal Data Protection Regulation

At the end of 2022, the Indonesian government officially enacted a law that aims to protect and guarantee the basic rights of citizens regarding their data. Indonesian Law Number 27 of 2022 concerning the Protection of Personal Data (UUPDP) is a mandate of Article 28G paragraph 1 of the 1945 Constitution of the Republic of Indonesia, which states: "Everyone has the right to protection of themselves, family, honor, dignity and property under their control, and are entitled to a sense of security and safety from the threat of fear to do or not do something that is a human right".

The UUPDP comprises 15 chapters broken down into 76 articles. The substance relevant to this paper is contained in chapters I to VI, or about 23 of the 50 pages of the UUPDP. Chapter I contains general provisions, followed by chapter II, which deals with the principles of personal data processing, and chapter III, on the types of personal data. Chapter IV covers the rights of personal data subjects, which is still general information related to the UUPDP. The needs of a more technical nature are described in chapter V, which covers the processing of personal data, and chapter VI, which outlines the obligations of controllers and processors of personal data. The articles relevant to this paper are as follows:

• Chapter V – Article 16 (2a) states that collecting personal data must be done in a limited and specific manner. Furthermore, Article 16 (2f) states that the processing of personal data must be carried out while preventing unauthorized access and disclosure.
• Chapter VI – in general, the UUPDP categorizes data managers into 2 (two) positions, namely controllers and processors of personal data. Chapter VI conveys each party's obligations in processing personal data.
Articles 38 and 39 describe the obligations of personal data controllers to protect personal data from unauthorized processing. For personal data processors, Article 51 conveys that the conditions and obligations to be fulfilled in processing personal data are the same as those that controllers of personal data must carry out. From chapters V and VI above, it can be concluded that the principles of integrity and confidentiality are absolute requirements that must be fulfilled while processing personal data. In general, the Indonesian UUPDP adopts the General Data Protection Regulation (GDPR) used in the European Union (EU): the UUPDP has data processing principles and individual rights that are relatively the same as those of the GDPR, as shown in Fig. 1.

Fig. 1. GDPR personal data processing principles and rights of personal data subjects.

It can be seen that integrity and confidentiality are principles of personal data management commonly used in personal data regulations. Another main principle is accountability, whereby all personal data processing must be able to demonstrate compliance and keep records of activities involving personal data. Several definitions of personal data used in various bodies of legislation are given in Table 1. The UUPDP classifies personal data into two categories, specific and general, as shown in Table 2. From the explanation of Article 4, facial recognition, a form of biometric data, falls in the specific data category. Therefore, all processing activities involving facial recognition data must comply with the personal data protection rules in the UUPDP. Continuing earlier research on a Learning Management System with online remote proctoring features used in research institutions [20], additional guarantees of security and privacy are needed for the personal data involved.

2.2 Learning Management System (LMS)

There is no doubt that information technology has developed very rapidly, especially in the world of education. The COVID-19 pandemic, which occurred on a massive scale, also forced the adaptation of information technology in the learning

Table 1. Definition of personal data.

No | Source | Year | Personal data definition
1 | Australia – Privacy Act [12] | 1988 | "Information or an opinion about an identified individual, or an individual who is reasonably identifiable: (a) whether the information or opinion is true or not; and (b) whether the information or opinion is recorded in a material form or not"
2 | Spain – Organic Law 15/1999 of 13 December on the Protection of Personal Data [13] | 1999 | "any information concerning identified or identifiable natural persons"
3 | Japan – Act on the Protection of Personal Information, Act No. 57 of 2003 [14] | 2003 | "information about a living individual which can identify the specific individual by name, date of birth or other description contained in such information including such information as will allow easy reference to other information and will thereby enable the identification of the specific individual"
4 | Singapore – Personal Data Protection Act 2012 [15] | 2012 | "data, whether true or not, about an individual who can be identified (a) from that data; or (b) from that data and other information to which the organization has or is likely to have access"
5 | Philippines – Republic Act 10173, Data Privacy Act of 2012 [16] | 2012 | "any information, whether recorded in a material form or not, from which the identity of an individual is apparent or can be reasonably and directly ascertained by the entity holding the information, or when put together with other information would directly and certainly identify an individual"
6 | United Kingdom – Data Protection Act 2018 [17] | 2018 | "any information relating to an identified or identifiable living individual"
7 | Thailand – Personal Data Protection Act, B.E. 2562 [18] | 2019 | "any information relating to a Person, which enables the identification of such Person, whether directly or indirectly, but not including the information of the deceased Persons in particular"
8 | South Korea – Personal Information Protection Act No. 16930 [19] | 2020 | "any of the following information relating to a living individual"
9 | Indonesia – Undang-Undang No. 27 Tahun 2022 Tentang Perlindungan Data Pribadi [1] | 2022 | "data about identified or identifiable individuals individually or in combination with other information either directly or indirectly through electronic or non-electronic systems"

Table 2. Classification of personal data according to Indonesian UUPDP Article 4(2)(3).

Generic data | Specific data
Full name | Health data and information
Gender | Biometric data
Nationality | Genetic data
Religion | Crime records
Marital status | Child data
Personal data combined to identify a person | Personal financial data
— | Other data under the provisions of the legislation
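As a sketch of how this classification could be applied in code, the following hypothetical helper groups a record's fields into the two UUPDP categories; the field names and the mapping are illustrative assumptions, not part of the law or the paper's implementation.

```python
# Hypothetical classifier of a record's fields into "generic" vs. "specific"
# personal data, loosely following UUPDP Article 4(2)(3). Field names are
# illustrative assumptions.

SPECIFIC_FIELDS = {
    "health_data", "biometric_data", "genetic_data",
    "crime_record", "child_data", "financial_data",
}
GENERIC_FIELDS = {
    "full_name", "gender", "nationality", "religion", "marital_status",
}

def classify_fields(record: dict) -> dict:
    """Group a record's keys by UUPDP category; unknown keys need manual review."""
    out = {"generic": [], "specific": [], "unclassified": []}
    for field in record:
        if field in SPECIFIC_FIELDS:
            out["specific"].append(field)
        elif field in GENERIC_FIELDS:
            out["generic"].append(field)
        else:
            out["unclassified"].append(field)
    return out

student = {"full_name": "A. Student", "biometric_data": b"\x00", "course": "CS101"}
print(classify_fields(student))
```

A classifier like this makes the "limited and specific" collection requirement of Article 16 (2a) checkable: any field outside the declared categories is flagged before storage.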

process. The Learning Management System (LMS) is one of the technologies at the center of the development of digital learning technology worldwide, with the various features shown in Fig. 2. An LMS facilitates learning done remotely. Turnbull et al. [21] define an LMS as a website-based application or platform that provides interactive online learning methods. Besides the online learning method itself, LMS services also automate administrative and organizational processes, up to the reporting stage of learning strategies and activities. In general, currently developed LMSs offer features from assessment through to the reporting of exam results. Amid this development, one feature that has grown significantly with the advent of computer vision technology is proctoring. Various test supervision methods have been developed to provide the proctoring function expected by educational institutions. Labayen et al. [22] developed online proctoring with multimodal biometric technology that complies with GDPR personal data protection regulations.

Fig. 2. Learning management system feature overview.

3 Methodology

3.1 System Model

This study uses as its case the Center of Learning Excellence (CeLoE) application, the Learning Management System used at our institution [20]. It focuses on the online remote proctoring module, which processes biometric data in the form of facial recognition.

Fig. 3. Online remote proctoring in the CeLoE LMS.

Figure 3 shows the communication flow of the proctoring module, which connects the LMS and the students as personal data subjects in this study. The transactions considered here occur at the authentication stage before a quiz is taken.

The initial authentication stage is carried out with a previously validated plain-text password (stages 1, 2). Next, students navigate to the course and the quiz to be taken (stage 3). Stages 1–3 are processed in the main LMS application. After receiving a quiz request from an authorized student, the LMS sends a request to the proctoring plugin server (stage 4). The proctoring server then requests data including bandwidth status, screen activity recording, the webcam feed, and facial recognition data (stage 5). Stage 6 is the reply to this request. The proctoring server uses the data from the student's reply for user authentication and verification by matching it against data previously stored in the database (stage 7). In addition to checking the user ID, stage 7 also computes a match between the submitted facial recognition data and the stored data, producing a confidence score indicating facial resemblance. The result is sent back to the user (stage 8) as a decision on whether the user may access the exam (stage 9). If the confidence score exceeds the specified minimum threshold, it is concluded that the individual currently captured by the webcam is the same person registered in the LMS storage, and the user can continue to the exam.

Before proceeding to a deeper analysis, several LMS-related issues are first described in the context of this research:

• The computing devices students use for exams can vary widely, from hardware to operating systems and installed software. This is beyond the control of our LMS application, because the application is fully website-based, with no plugins or additional add-ons to install.
• The networks students use to access the LMS can also vary greatly, in speed as well as in access status and location, whether private or public.
This, too, is outside the control of our institution's LMS implementation.
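The verification step in stages 7–9 can be illustrated with a small sketch: the proctoring server compares a captured face embedding with the stored one and admits the student only if the confidence score clears a threshold. The embedding format, the similarity metric, and the 0.80 threshold below are assumptions for illustration, not details of the CeLoE implementation.

```python
# Illustrative sketch of stages 7-9: compute a confidence score between a
# captured face embedding and the stored reference, then gate exam access.
import math

THRESHOLD = 0.80  # assumed minimum confidence to enter the exam

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify_student(captured, stored, threshold=THRESHOLD):
    """Return (confidence, admitted) for one authentication attempt."""
    confidence = cosine_similarity(captured, stored)
    return confidence, confidence >= threshold

conf, admitted = verify_student([0.1, 0.9, 0.2], [0.12, 0.88, 0.21])
print(f"confidence={conf:.3f}, admitted={admitted}")
```

In practice the threshold trades false rejections against false admissions, and its value would be tuned on held-out enrollment data.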

Fig. 4. System class model for online remote proctoring LMS.

Departing from the abovementioned issues, the main assumption of this study is that home and student devices are excluded from the security perimeter. Under this assumption, to maintain security, no data is stored or processed on the user's side; everything is immediately sent to and processed on the LMS or the proctoring server.

Departing from the analysis of the current conditions, together with the assumptions and issues above, we then apply the Extended Fusion/UML software development process to describe the relationships between objects in the LMS descriptively and consistently [11, 23]. Concerning security and privacy, the results of this method can then be used to obtain formal specifications for the LMS that provide secure data handling between related objects. The resulting system class model with Fusion/UML can be seen in Fig. 4.

4 Indonesian PDP Compliant Online Proctoring in a Learning Management System Architecture

In this study, we use proven techniques related to security and privacy to show how the UUPDP can be systematically mapped onto the system architecture. Information flow control provides data protection in a distributed system through the Decentralized Label Model (DLM) approach [24, 25], and the Fusion/UML method [23] produces a consistent connection between design and analysis, yielding formal implementation specifications. This chapter details the DLM-related needs, with additional information labels such as purpose and retention time, to meet personal data protection requirements. We then describe the application of this labeling model to learning management system use cases that show the interactions between objects. Finally, we apply the extended Fusion/UML [23] process to derive the formal system specification from the system class model of the previous stage.

4.1 Privacy and Security by Data Labelling

First of all, we need to define readers and owners in a type, as follows:

type dlm = actor x actor set

Data labeling can then be done with the product type dlm x data. Additional metadata, such as processing purpose and retention time, can be encoded into this data type. With this data labeling approach, we can represent Article 1(1) UUPDP, which states: "Personal data is data about an identified or identifiable individual".

4.2 Use Cases

The Fusion/UML analysis departs from the use case model, which is then used to produce a system class model with a system border. In this study, however, we apply the DLM [11] to provide additional justification in determining the system border as a security perimeter. The use cases in the context of the LMS application are:

1.
Student biometric data in the form of facial recognition has been stored on the server as the basis for verification.

2. Facial recognition data is captured from the camera on the student's device when accessing quizzes in the LMS, as part of the authentication process.
3. Facial recognition data is downloaded from the student's webcam during the quiz as input for the remote proctoring server.
4. The system must remain consistent throughout data processing.
5. Operations must take place accordingly, maintaining label consistency.
6. Data is tracked during the entire process.

4.3 Operation Schemata

Operation schemata are UML descriptions with tagged values added to them. The values that can be tagged in an operation schema are Changes, Description, Inputs, Operation, Post-conditions, Pre-conditions, Reads, and Sends [11]. Besides the operation name, an operation schema also carries an informal description and inputs. These attributes document the relationships between objects by recording where an object is read and where an object is written. The pre- and post-conditions in the operation schema can be used to define the formal conditions of an object.

The upload operation schema describes use case 2: a data item d with DLM label (o, r) is input by the actor whose ID is used for identification. If the ID used equals the owner o, the DB update can be carried out, and a message about the success status of the update is sent to the initiator. The uploading party can be any device students use to access the LMS, which can be expressed as a class {e.g., home, sphone, laptop}.

Table 3. Upload operation schema.

Attribute | Description
Operation | = Upload
Description | = The student device uploads facial recognition data to the online proctoring server in the cloud
Input | = d: data | (o, r): dlm | id: actor
Reads | = as: Auth with o ∈ as.student ∧ r ⊆ as.reg_user, Controls
Changes | = db: DB with (as, db) ∈ Controls
Sends | = :{home, laptop}:{upload_ok}
Pre-cond | = id = o
Post-cond | = db.table' = db.table ∪ {((o, r), d)} ∧ is sent {trans_ok}

Table 3 corresponds to the relationships between objects in the object collaboration diagram shown in Fig. 5. The upload operation states the pre-condition that must be met before it can run: the initiator ID must match the owner label o, or in other words, the initiator must be a student previously authenticated in the database.
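The upload schema's pre- and post-conditions can also be read operationally. The following is a minimal Python sketch of that reading under the decentralized label model, using hypothetical `Label`, `DB`, and `upload` names; it is an illustration, not the paper's formal Object-Z specification.

```python
# Minimal executable reading of the Upload schema: each stored datum carries
# a DLM label (owner, readers); the pre-condition id = o must hold before the
# database is updated. Names and types are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Label:
    owner: str          # o: the data subject
    readers: frozenset  # r: actors allowed to read the datum

@dataclass
class DB:
    table: set = field(default_factory=set)

def upload(db: DB, d: bytes, label: Label, initiator_id: str) -> str:
    # Pre-condition (Table 3): the initiator must be the data owner.
    if initiator_id != label.owner:
        raise PermissionError("upload rejected: initiator is not the data owner")
    # Post-condition: db.table' = db.table ∪ {((o, r), d)}
    db.table.add((label, d))
    return "upload_ok"

db = DB()
lbl = Label(owner="student42", readers=frozenset({"proctoring_server"}))
print(upload(db, b"face-template", lbl, "student42"))
```

The point of the label is that it travels with the datum, so every later operation can re-check who may read it without consulting the uploader again.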

Fig. 5. Upload object collaboration diagram.

The next operation schema described is download; it answers use case 3, where the server requests facial recognition data, either periodically or when prohibited actions are detected, as input for the cheating detection process. The proctoring server initiates this process toward the laptops used by students.

Table 4. Download operation schema.

Attribute | Description
Operation | = Download
Description | = The proctoring server downloads facial recognition data from the student's webcam (periodically or triggered) to use as input for the cheating detection process
Input | = o | id: actor
Reads | = as: Auth with o ∈ as.student ∧ r ⊆ as.reg_user, db: DB with (as, db) ∈ Controls
Changes | = c: campus with (as, c) ∈ has
Sends | = :{laptop}:{access}, :{proctor}:{download_ok}
Pre-cond | = ((o, r), d) ∈ db.table ∧ id ∈ c.staff ∧ h ∈ r ∧ o ∈ as.student ∧ h ∈ as.reg_user
Post-cond | = c.table' = c.table ∪ {((o, r), d)} ∧ is sent {download_ok} ∧ is sent {access}

As Table 4 shows, unlike the upload operation, the download schema is initiated by the online proctoring server, which requires facial recognition data from the camera on the device students use to access the LMS.
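The core of the download pre-condition can likewise be sketched as a membership check (requester ∈ staff ∧ requester ∈ readers) before any labeled datum leaves the database; the names below are illustrative assumptions rather than the paper's specification.

```python
# Sketch of the Download pre-condition from Table 4: the requesting actor h
# must be authorized staff AND appear in the datum's reader set r before
# facial recognition data is released. Names are assumptions.

def can_download(label_readers: set, staff: set, requester: str) -> bool:
    """Pre-condition check: requester ∈ staff ∧ requester ∈ readers."""
    return requester in staff and requester in label_readers

readers = {"proctoring_server", "exam_admin"}
staff = {"proctoring_server"}
print(can_download(readers, staff, "proctoring_server"))  # authorized
print(can_download(readers, staff, "student42"))          # rejected
```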

Fig. 6. Download object collaboration diagram.

The download object collaboration diagram in Fig. 6 shows that the download process is initiated by the OnlineProctoring module. The diagram also indicates that authenticity and label correctness are maintained consistently. The final stage of the Fusion/UML process is to turn the model generated in the preceding analysis and design stages into an interface specification. As done by Kammueller et al. [11], this last stage of the Fusion/UML process produces a complete Object-Z specification.

5 Conclusion

In this paper, we have distilled the technical requirements of the Indonesian UUPDP and illustrated them in a case study of an LMS application that processes facial recognition data. We used the Decentralized Label Model (DLM) to provide privacy access control and the Fusion/UML method to describe the interrelationships between objects during the software development process. We have reviewed related research on meeting data protection regulations, none of which specifically targets the Indonesian UUPDP. The security analysis carried out on the formal model shows that several features still need to be completed in our LMS to fully comply with the UUPDP, such as a deletion feature, data retention times, and notifications to personal data subjects about processing activities and processing failures. These can be material for further research.

References

1. Dewan Perwakilan Rakyat-Indonesia: UU No 27 Tahun 2022 Tentang Perlindungan Data Pribadi (2022)
2. Hisbulloh, M.H.: Urgensi Rancangan Undang-Undang (RUU) Perlindungan Data Pribadi. Jurnal Hukum (2021)
3. Clark, T.M., Callam, C.S., Paul, N.M., Stoltzfus, M.W., Turner, D.: Testing in the time of COVID-19: a sudden transition to unproctored online exams. J. Chem. Educ. 97, 3413–3417 (2020)
4. Lee, K., Fanguy, M.: Online exam proctoring technologies: educational innovation or deterioration? Br. J. Educ. Technol. 53, 475–490 (2022)
5. Kharbat, F.F., Abu Daabes, A.S.: E-proctored exams during the COVID-19 pandemic: a close understanding. Educ. Inf. Technol. (Dordr) 26, 6589–6605 (2021)
6. Coghlan, S., Miller, T., Paterson, J.: Good proctor or "Big Brother"? Ethics of online exam supervision technologies. Philos. Technol. 34, 1581–1606 (2021)
7. Hertianto, M.R.: Sistem Penegakan Hukum Terhadap Kegagalan Dalam Perlindungan Data Pribadi Di Indonesia. Kertha Patrika (2021)
8. Arruda, M.F., Bulcão-Neto, R.F.: Toward a lightweight ontology for privacy protection in IoT. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pp. 880–888. Association for Computing Machinery, New York, NY, USA (2019)
9. Rajagopalan, S.R., Sankar, L., Mohajer, S., Poor, H.V.: Smart meter privacy: a utility-privacy framework. In: 2011 IEEE International Conference on Smart Grid Communications (SmartGridComm), pp. 190–195 (2011)
10. Mrissa, M., Tbahriti, S.-E., Truong, H.-L.: Privacy model and annotation for DaaS. In: 2010 Eighth IEEE European Conference on Web Services, pp. 3–10 (2010)

11. Kammueller, F.: Designing data protection for GDPR compliance into IoT healthcare systems. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 3319–3324 (2018)
12. Australian Government: Privacy Act 1988, Cth, No. 119 (1988)
13. Jefatura del Estado, España: Organic Law 15/1999 of 13 December on the Protection of Personal Data (1999)
14. Japan: Act on the Protection of Personal Information, Act No. 57 of 2003 (2003)
15. Government of Singapore: Personal Data Protection Act 2012, No. 26 (2012)
16. Republic of the Philippines: Data Privacy Act of 2012, Republic Act No. 10173 (2012)
17. United Kingdom: Data Protection Act 2018, c. 12 (2018)
18. Government of Thailand: Personal Data Protection Act B.E. 2562 (2019) of Thailand (PDPA), B.E. 2562 (2019)
19. Government of South Korea: Personal Information Protection Act 2020, No. 16930 (2020)
20. Nugroho, M.A., Abdurohman, M., Prabowo, S., Nurhayati, I.K., Rizal, A.: Intelligent remote online proctoring in learning management systems. In: So-In, C., Londhe, N.D., Bhatt, N., Kitsing, M. (eds.) Information Systems for Intelligent Systems. Smart Innovation, Systems and Technologies, vol. 324. Springer, Singapore (2023). https://doi.org/10.1007/978-981-19-7447-2_21
21. Turnbull, D., Chugh, R., Luck, J.: Learning management systems, an overview. In: Tatnall, A. (ed.) Encyclopedia of Education and Information Technologies, pp. 1052–1058. Springer International Publishing, Cham (2020)
22. Labayen, M., Vea, R., Flórez, J., Aginako, N., Sierra, B.: Online student authentication and proctoring system based on multimodal biometrics technology. IEEE Access 9, 72398–72411 (2021)
23. Bittner, M., Kammuller, F.: Translating Fusion/UML to Object-Z. In: First ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE '03), pp. 49–50 (2003)
24. Myers, A.C., Liskov, B.: Protecting privacy using the decentralized label model. ACM Trans. Softw. Eng. Methodol. 9, 410–442 (2000)
25. Myers, A.C., Liskov, B.: Complete, safe information flow with decentralized labels. In: Proceedings 1998 IEEE Symposium on Security and Privacy, pp. 186–197 (1998)

A Method to Transform Datasets into Knowledge Graphs

Maricela Bravo(B), José L. Barbosa, and Leonardo D. Sánchez-Martínez

Autonomous Metropolitan University, Av. San Pablo No. 420, Col. Nueva el Rosario, Alcaldía Azcapotzalco, Mexico City, Mexico
{mcbc,ldsm}@azc.uam.mx

Abstract. Knowledge graphs represent data and information about resources in a triple-based format; resources are identifiable by unique IRIs, reference-enabled, and expansible. These characteristics make knowledge graphs well suited to uploading and managing large volumes of data in an agile way. In this article we propose a semi-automatic method for transforming datasets into knowledge graphs. Specifically, we describe the method applied to the transformation of a set of files representing the logs of a medical research protocol whose purpose is to evaluate the efficacy of continuous glucose monitors in patients with Type 1 diabetes. For evaluation purposes we implemented a set of programs that perform data extraction from the dataset, parsing, cleaning, and finally the automatic population of the knowledge graph. The resulting graph has been evaluated by verifying its logical consistency.

Keywords: Knowledge graphs · Medical datasets · Medical knowledge graphs

1 Introduction

Medical scientific research produces a large amount of data from experiments and research protocols. These results are commonly published as datasets, which are excellent information resources for other researchers interested in conducting further experiments using different approaches. However, the reutilization and exploitation of these datasets is not simple; it often requires considerable effort from researchers to interpret the data and adapt the datasets to their needs. The following are among the main difficulties a researcher faces when trying to use a dataset:

1. The researcher interested in a certain dataset must be familiar with the domain of the data reported in the dataset.
2. The researcher must evaluate the relevance of the reported data, based on accuracy and veracity, to determine whether the data is of interest and whether the results will serve or contribute to another project.
3. The researcher must become familiar with the software tools necessary to read, exploit, and process the data contained in the dataset.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 536–554, 2024. https://doi.org/10.1007/978-3-031-47715-7_37

Diverse medical and biomedical datasets are published continuously, but each imposes different requirements for its use, and the datasets come in a wide variety of syntaxes and formats. The working hypothesis we support in this article is that a method for transforming raw-format datasets into a knowledge graph representation will bring great benefits to potential users of these resources, including an extended meaning of the represented data; moreover, the execution of logical inference allows the generation of very useful semantic relationships.

In this paper we propose a semi-automatic method for transforming datasets into knowledge graphs. We describe the process of transforming a set of data files recorded about diabetic patients who participated in a research protocol to identify the benefit of constant glucose monitoring in preventing low-glucose-level health crises. This article deals mainly with the transformation of data representation; specifically, we are interested in transforming data from tabular form or text files into a representation based on knowledge graphs.

The term Knowledge Graph became popular in 2012 when Google launched its "Google Knowledge Graph". After this, other well-known companies around the world incorporated knowledge graphs to support search, for instance Airbnb, Amazon, eBay, Facebook, IBM, LinkedIn, Microsoft, and Uber, among others. A knowledge graph is essentially a graph consisting of nodes and arcs, where the nodes represent concepts and the arcs represent semantic relationships between the concepts. The use of a formal graph representation of the data allows carrying out tasks of reasoning, search, and querying of information.
Additionally, the representation of data as knowledge graphs facilitates the semantic interpretation of the information being represented; knowledge graphs can be augmented by referencing Internationalized Resource Identifiers (IRIs) from other public knowledge graphs. We believe that transforming datasets completely or partially into knowledge graphs can be of great benefit both for those who produce the original datasets and for those who can reuse them, enabling the dissemination and sharing of data for research purposes. In this article we describe a sequence of processes that allows the generation of a knowledge graph representing data extracted from a dataset. The method for the construction of the knowledge graph reported in this article differs from related work in the particular implementation of the automatic population programs; likewise, the evaluation or validation of the resulting graph is different. The rest of the paper is organized as follows. In Sect. 2, a revision of related work based on the specialized literature is presented; in Sect. 3 the important methodological aspects are described; in Sect. 4 the description of the dataset is presented; in Sect. 5 the data extraction process is detailed; in Sect. 6 the knowledge graph construction is described; and finally in Sect. 7, conclusions are presented.

2 Revision of Related Work

2.1 Knowledge Graphs in the Medical Domain

The use of knowledge graphs to represent medical or clinical data of patients is an area of research in constant development. In this subsection we present a brief summary of some works related to the topic.


M. Bravo et al.

Hermsen et al. [2] describe the advantages of representing patient data in a knowledge graph-based Clinical Decision Support System, as it allows improved decision-making on antimicrobial stewardship through the combined availability of patient-specific data and costs. The use of this system yields several benefits, including the reduction of adverse events, of the length of a patient's stay in the health care center, and of costs during the stay, as well as the use of antimicrobials more appropriate for the patient. Sherimon and Krishnan [3] describe OntoDiabetic, a knowledge base-based decision support system for assessing risk factors and generating treatment suggestions for patients with diabetes. The authors define a set of inference rules to obtain information about the health status of the patient. The prevention and treatment of patients with chronic diseases requires multiple coordinated actions; Zhang et al. [4] address the complex interactions between risk factors, diseases, patient conditions, and treatment modalities by implementing a knowledge graph. Ajami and Mcheick [5] present the development of a model that allows the representation of patient information to create safe environments for patients with chronic obstructive pulmonary disease (COPD). This model is based on the formal description of knowledge graphs in a health-related domain, using the Semantic Web Rule Language (SWRL). The knowledge graph contains all relevant concepts related to COPD, including personal patient information, location, activity, symptoms, risk factors, laboratory test results, and treatment plan. In Oyelade et al. [6], the construction of a knowledge graph for the representation of patient profiles is presented; the proposed framework is achieved through the formalization of knowledge.
The obtained result shows that the proposed knowledge graph demonstrated interesting performance compared to similar state-of-the-art Case-Based Reasoning (CBR) studies using fuzzy CBR. In Govindan, Mina, and Alavi [7], the authors present a knowledge graph model for the representation of patient profiles to support decision making, classify community members accordingly, manage demand, and control epidemic outbreaks in the healthcare supply chain. In Harry et al. [8], the authors describe a patient-profiling knowledge graph model for Electronic Health Record (EHR) management and decision-making that can improve cancer prevention and detection in primary care. In previous work, Bravo et al. [9] addressed the representation of patient profiles by means of a semantic representation model offering benefits for inference and automatic reasoning, facilitating the identification of risk cases. These works do not represent an exhaustive list, but they show a clear trend towards the use of knowledge graphs for the management and representation of medical and clinical data on patients.

2.2 Knowledge Graph Construction Methods

Regarding methods for knowledge graph construction, there exist multiple approaches, ranging from purely manual construction to fully automatic methods. The construction of a knowledge graph starts by analyzing the data or information sources to be


used as input. In this subsection we briefly describe methods for the construction of multi-domain knowledge graphs.

In 2021, Hao et al. [10] described a framework for knowledge graph construction that has two parts: the model-level construction and the data-layer construction. The model level consists of a manually executed methodology for the definition, construction, and storage of an ontology. The data-layer construction consists of the automatic acquisition of concepts, entities, and relations. This framework is a semi-automatic method for knowledge graph construction. Jia Hao et al. [11] describe a method to build a decision knowledge graph that consists of three blocks: definition of the meta-model of the decision knowledge graph, knowledge graph construction, and implementation of the decision knowledge navigation. The definition of the meta-model was carried out manually, whereas the construction is semi-automatic, involving natural language processing techniques. Brack et al. [12] described a method for the construction of the Open Research Knowledge Graph (ORKG) consisting of two phases: knowledge graph requirements and knowledge graph construction. In particular, the latter phase conceives the implementation of manual or automatic methods and techniques. Chen et al. [13] describe an automatic method for the construction of a knowledge graph to represent COVID-19 infectors, based on the well-known named entity recognition models BERT, BiLSTM, and CRF. Zhang et al. [14] describe a methodology for the construction of a fluvial facies knowledge graph consisting of the following steps: acquire fluvial facies knowledge; design the knowledge architecture; construct a structured, standardized, and systematic knowledge system; construct the knowledge graph; and design a knowledge reasoning algorithm. Table 1 shows some of the methods for the construction of knowledge graphs applied in various domains.
A relevant aspect that can be observed is the use of different support technologies to automate construction, as well as many different ways of validating or evaluating the resulting graphs. Table 1 describes a set of construction methods reported in the specialized literature. It should be noted that this is not an exhaustive list, but it exemplifies the breadth of the research field of knowledge graph construction. The method for the construction of the knowledge graph reported in this article is similar to those of the related works; however, the specific technical aspects of automating the construction differ, in the sense that specific programs are developed for the acquisition of data. Likewise, the evaluation or validation of the resulting graph is different.

3 Methodological Issues

From a methodological point of view, the main difficulties that must be solved to transform a dataset into a knowledge graph are the following:


Table 1. Knowledge graph construction methods.

| Reference | Method approach | Knowledge domain | Processing resources | Validation |
|---|---|---|---|---|
| Hao et al. [10] | Semi-automatic | Surveying and remote sensing | DeepDive1 | Application analysis of the knowledge graph |
| Jia Hao et al. [11] | Semi-automatic | Decision knowledge graph | Advanced natural language processing techniques | Comparison using the Hit Ratio measure |
| Brack et al. [12] | Manual and automatic | Scientific and scholarly articles | Natural language processing techniques | — |
| Chen et al. [13] | Automatic | COVID-19 | Optimized named entity recognition model BERT + BiLSTM + CRF | Precision, recall and F1 measures |
| Zhang et al. [14] | Semi-automatic | Geology, fluvial facies | Cypher graph and Neo4j | Application-based evaluation |

1. Data format. This issue considers the analysis of the data presented in the original dataset, including its attributes, which may or may not be present. Additionally, the format of the data refers to the data types used, which are also not always easily identifiable in the datasets.
2. Incomplete or inconsistent data. The use of datasets resulting from medical research protocols does not guarantee that their capture and organization have been carried out carefully. During the analysis of large datasets, problems can arise due to missing data or even badly coded or corrupted data.
3. Data interpretation. In the scientific medical literature, researchers normally adhere to certain criteria and standards for the presentation of results. However, the data included in the datasets can be very technical and specialized, which can lead to problems of misinterpretation.
4. Data selection. Data selection refers to deciding which data is relevant for use in another project, since each research project may have different objectives and therefore require only a subset of the data present in the dataset.

The list of aspects above is not exhaustive, but it does show the great difficulty of generating a completely automatic transformation method. What can be seen from these difficulties is that intense intervention is required from the scientist who wants to reuse the data from a dataset and manage it as a knowledge graph.

1 http://deepdive.stanford.edu/.
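Difficulty 2 (incomplete or inconsistent data) is typically handled with record-level validation before transformation. The sketch below illustrates the idea; the tab-delimited layout, field order, and plausible-value ranges are assumptions for illustration only, not the actual checks used in this paper.

```java
import java.util.List;

// Sketch of the consistency checks needed before transformation.
// Field names and valid ranges are illustrative only.
public class RecordValidator {

    // A raw tab-separated line such as: "Patr102\tF\t72" (id, gender, heart rate).
    public static boolean isValid(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 3) return false;                  // wrong arity
        if (fields[0].isBlank()) return false;                 // missing patient id
        if (!fields[1].equals("M") && !fields[1].equals("F")) return false;
        try {
            int heartRate = Integer.parseInt(fields[2]);
            return heartRate > 20 && heartRate < 250;          // plausible bpm
        } catch (NumberFormatException e) {
            return false;                                      // corrupted value
        }
    }

    public static void main(String[] args) {
        List<String> raw = List.of("Patr102\tF\t72", "Patr103\tX\t72", "Patr104\tM\tn/a");
        for (String line : raw) {
            System.out.println(line.replace('\t', ' ') + " -> " + isValid(line));
        }
    }
}
```

Records that fail such checks can either be repaired with domain knowledge or excluded, but the decision itself is one of the human interventions that keeps the method semi-automatic.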


3.1 Description of the Method

The transformation method consists of the following general processes:

1. Analysis of the dataset. A given dataset can consist of one or several files, which can be stored as text files, csv tabular files, or in any other syntax. During the analysis of the data in the dataset, it is necessary to review the attributes (if any), the data types, the data values, whether the data is complete and consistent, and above all to decide which data is useful to be integrated into a knowledge graph.
2. Data extraction. Data extraction is a process that must be solved in a completely automated way: since datasets usually consist of large volumes of data, a manual procedure would be completely inappropriate. For this reason, specific parsers must be implemented that read the data from the dataset and transform it into some intermediate representation (which, in the case reported in this article, is based on the object-oriented programming paradigm).
3. Knowledge graph construction. The construction of the knowledge graph depends mainly on a design of the main structure of the graph to be generated. The design of the graph is based particularly on the result of the dataset analysis, since the final representation of the knowledge graph depends on the data types and values that are selected. Once the general design of the knowledge graph has been produced, the automatic population with instance data is executed. The population of the knowledge graph can also be done automatically by implementing programs, which can be executed in sequence after the parsers. Regarding the parsing and automatic population processes, it is important to consider the order in which the data must be stored in the graph, since there may be dependencies between the data that will be represented.
The following sections describe this transformation method applied to a dataset resulting from a medical research protocol.
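The ordering constraint mentioned in step 3 can be sketched as a simple staged check: entities that other records reference (patients) must be populated before the dependent records. The stage names match the entity classes described later in this paper; the strict linear ordering is a simplifying assumption for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the population-order constraint: records that reference a patient
// (insulin administration, follow-up records, CGM readings) can only be stored
// after the Patient individuals themselves have been registered.
public class PopulationOrder {
    static final List<String> STAGES = List.of(
            "Patient", "InsulinAdministration", "PatientRecord", "ContinuousGlucoseMonitoring");

    // A stage may run only when every earlier stage has already been populated.
    static boolean canPopulate(String stage, Set<String> done) {
        int idx = STAGES.indexOf(stage);
        return idx >= 0 && done.containsAll(STAGES.subList(0, idx));
    }

    public static void main(String[] args) {
        Set<String> done = new HashSet<>();
        System.out.println(canPopulate("PatientRecord", done)); // false: no patients yet
        done.add("Patient");
        done.add("InsulinAdministration");
        System.out.println(canPopulate("PatientRecord", done)); // true
    }
}
```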

4 Analysis of Dataset

The dataset used in this research [1] consists of a series of text files with information recorded during a research project with patients with Type 1 diabetes (T1D). The study was conducted by selecting a set of volunteer participants who were given various tests and asked to answer a set of questionnaires together with their treating physicians, from which the data log files were generated. During the study, patients were monitored through medical supervision visits. Patients were fitted with control glucose meters and constant monitoring glucometers. One of the main objectives of this study was to determine whether continuous glucose monitoring (CGM) can reduce hypoglycemia and improve quality of life in adults with T1D. The research protocol can be reviewed in detail in the study report. Three files were selected for the transformation; they are described next.

DiabScreening. This file consists of patient records including the patient id, gender, ethnicity, race, and particular data about the patient's medical condition and medication.


The DiabScreening file was used to extract the patient general data and the information related to insulin administration. The specific fields used are described in Table 2.

Table 2. DiabScreening file fields used for patient general data.

| Field name | Description | Possible values |
|---|---|---|
| PtId | Patient identifier | |
| Gender | Gender | M = Male, F = Female |
| Ethnicity | Ethnicity | Hispanic or Latino, Not Hispanic or Latino, Unknown/not reported |
| Race | Race | White, Black/African American, Asian, Native Hawaiian/Other Pacific Islander, American Indian/Alaskan Native, more than one race, Unknown/not reported |
| PreExistMedCond | Does subject have any pre-existing medical conditions other than T1D | Yes, no |
| PtCurrMed | Is subject currently taking any medication | Yes, no |

DiabPhysExam. This file consists of follow-up records of participating patients. The data stored in this file is relevant because it contains data recorded during each medical review of the patients, for example: glucose level, arterial pressure, weight, and height, among others. Table 3 shows the fields recorded in this file.

Table 3. DiabPhysExam file fields used for patient general data.

| Field name | Description |
|---|---|
| RecId | Unique record ID in table |
| PtId | Patient identifier |
| Visit | Visit |
| Weight | Weight |
| Height | Height |
| BldPrSys | Blood pressure systolic |
| BldPrDia | Blood pressure diastolic |
| PEHeartRt | Heart rate bpm |
| Temp | Temperature |
| FingStkBG | Fingerstick blood glucose result |


DeviceUploads. This file contains recorded data about the device used for the continuous monitoring of the patient's glucose. For this study, a specific CGM was selected and used by the group of patients to perform the control, generating various readings of data from the CGM. Table 4 shows the fields recorded in this file.

Table 4. DeviceUploads file fields used for patient general data.

| Field name | Description |
|---|---|
| RecId | Unique record ID in table |
| PtId | Patient identifier |
| DeviceManufact | Device manufacturer(s) |
| DeviceModel | Device model |
| DeviceType | Device type (CGM, BGM, etc.) |
| Visit | Visit at which data was downloaded |

5 Data Extraction

Data extraction requires a thorough analysis of the data to be included in the knowledge graph. This dataset contains a myriad of data, not all of which was used for the knowledge graph generation. Having identified and selected the specific files and data as input, a set of specific parsers was developed to read each file. To carry out data extraction, a class for each relevant entity was designed and implemented in Java; each class is later mapped to a concept definition in the knowledge graph. Based on this class, the parser generates a list of elements. This section describes the details of the modules that perform data extraction and return a list according to the defined type.

5.1 Patient Data Parsing

The knowledge graph construction process requires identifying the order in which the data should be processed. For example, to register data about the insulin administration of patients, it is essential that the general data of the patients have been previously recorded. The first class or concept that must be defined and parsed from the datasets corresponds to the representation of the patient. The Patient class was developed for this purpose, including the following attributes: identification, gender, ethnicity, race, pre-existing medical conditions, and whether the patient is currently taking medication. Figure 1 shows the structure of the classes implemented for patient data extraction. During execution of the main program DatasetsToKnowledgeGraph, the file DiabScreening.txt is opened, and the extraction methods read every record, recover the relevant data, and create Patient objects. The DataProcessing class contains a set of transformation


Fig. 1. Patient data extraction and representation.

methods; the readScreeningData method is executed and produces a list of Patient objects. As a result of the patient extraction process, a list of patients is returned: List<Patient> patientList. This list will later be used as input for the population of the knowledge graph.

5.2 Insulin Administration Data Parsing

The administration of insulin can be carried out in different ways: by the application of various injections during the day, or by using an insulin pump that automatically injects insulin doses whenever the patient requires it. For the representation of data about the administration of insulin, the class InsulinAdministration was defined; this class includes the mode of administration, the type of administration, the units of daily insulin administered, and the status of the glucometer. Figure 2 describes the insulin administration data extraction and parsing. As a result of the insulin administration data extraction, a list of insulin administration records is returned: List<InsulinAdministration> insulinList.

5.3 Physical Exam Data Parsing

The process of extracting data from the DiabPhysicalExam.txt file is carried out similarly: a specific parser was developed that obtains the records of patient medical follow-up visits (see Fig. 3). Each record contains weight, height, blood pressure (systolic and diastolic), heart rate, temperature, and glucose data. For the representation of patient medical follow-up, a PatientRecord class was developed and used for the transformation process. As a result of the physical exam file extraction process, a list of patient records is returned: List<PatientRecord> medicalRecords.
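A minimal sketch of the per-entity parsing pattern used in these subsections follows. The Patient attributes and the readScreeningData name come from the paper; the '|'-delimited line layout and field order are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-entity parsing step: one class per entity, one parser
// producing a typed list. The input layout is a hypothetical simplification.
public class PatientParser {
    record Patient(String id, String gender, String ethnicity, String race,
                   boolean preExistingConditions, boolean takingMedication) {}

    static Patient parseLine(String line) {
        String[] f = line.split("\\|");
        return new Patient(f[0], f[1], f[2], f[3],
                f[4].equalsIgnoreCase("Yes"), f[5].equalsIgnoreCase("Yes"));
    }

    // Mirrors readScreeningData: turn raw records into a List<Patient>.
    static List<Patient> readScreeningData(List<String> lines) {
        List<Patient> patients = new ArrayList<>();
        for (String line : lines) patients.add(parseLine(line));
        return patients;
    }

    public static void main(String[] args) {
        List<Patient> patientList = readScreeningData(List.of(
                "Patr102|F|Hispanic or Latino|White|Yes|No"));
        System.out.println(patientList.get(0).id()); // prints Patr102
    }
}
```

The same pattern applies to InsulinAdministration, PatientRecord, and ContinuousGlucoseMonitoring, each with its own field set.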


Fig. 2. Insulin administration data extraction and representation.

Fig. 3. Follow-up visit records extraction and representation.

5.4 Continuous Glucose Monitoring Data Parsing

Recovery of records with continuous glucose monitoring (CGM) data is of great importance with respect to the objectives of the research protocol, which are to evaluate whether the use of glucometers can improve the timely administration of insulin and the health status of patients with T1D. To extract and represent CGM data, the class ContinuousGlucoseMonitoring was defined as shown in Fig. 4.

Fig. 4. CGM data extraction and representation.


As a result of the CGM extraction process, a list of CGM records is returned: List<ContinuousGlucoseMonitoring> cgmList. It is important to note that each defined class includes the identification of the patient; this is because each patient must be correlated with the entities of the other extracted concepts. For example, each patient has multiple medical follow-up records, as well as multiple CGM data records.

6 Knowledge Graph Construction

The construction of the knowledge graph is divided into two processes: the first is the design and construction of the general model of the knowledge graph, and the second consists of the implementation of programs that automatically populate the knowledge graph.

6.1 Knowledge Graph Design

Considering the analysis of the data from the previous stage, in this section we present the design of the general structure of the semantic relations defined, including: (1) object properties, (2) data properties, and (3) hierarchical relations. In a knowledge graph, different types of semantic relationships between concepts can be defined. Object properties are the type of semantic relationship that has a class or concept as domain and another class or concept as range. Table 5 presents the main object properties defined between classes of the knowledge graph. Data properties allow the definition of relationships between a concept (the domain of the relationship) and a data type (the range of the relationship). From the point of view of knowledge graphs, data properties can also be translated as triples, where the subject is the individual (relationship domain), the data property name is the verb, and the data type is the object (relationship range). Depending on the syntax used to record the knowledge graph, we may define it as an ontology (if the syntax is based on description logic) or as a graph (if it is based on RDF triples). Table 6 describes the data properties defined for the representation of records extracted from the dataset described in Sect. 4.

6.2 Knowledge Graph Implementation

Figure 5 shows the general architecture of the system implemented to perform data extraction from the dataset, parsing, cleaning, and finally the automatic population of the knowledge graph according to the model described above. The architecture is divided into two general layers: data extraction and knowledge graph construction.
However, the execution of the system must follow a specific sequence: first the main patient data must be extracted and recorded; then the data related to the physical examination, medical monitoring, and glucose readings; and finally the data related to glucometer readings.


Table 5. Object properties defined.

| Property name | Description | Triple-based definition in Turtle format |
|---|---|---|
| Patient has continuous glucose monitoring | This semantic relation is used to correlate a patient with the CGM device used for monitoring glucose level | wisdm:patientHasCGM rdf:type owl:ObjectProperty; rdfs:domain wisdm:Patient; rdfs:range wisdm:ContinuousGlucoseMonitoring |
| Patient has insulin administration | This semantic relation defines if a patient is using insulin administration and the doses | wisdm:patientHasInsulinAdministration rdf:type owl:ObjectProperty; rdfs:domain wisdm:Patient; rdfs:range wisdm:InsulinAdministration |
| Patient has record | This relation is defined to correlate a patient with its records | wisdm:patientHasRecord rdf:type owl:ObjectProperty; rdfs:domain wisdm:Patient; rdfs:range wisdm:PatientRecord |

As can be seen in Fig. 5, the automatic population of the knowledge graph was carried out in stages, first recording the patients' data and later the other data related to glucose readings and insulin administration. Likewise, two general stages can be identified: reading and extracting data from the files, and populating the knowledge graph. These stages are also executed in order. Figure 6 presents the main script that is executed sequentially to obtain the knowledge graph presented in Fig. 7. The resulting knowledge graph consists of 23,199 axioms, 4 classes, 3 object properties, 24 data properties, and 2,882 individuals, as shown in Fig. 8. As an example of the information included in the knowledge graph, Fig. 9 shows a patient record represented using the RDF-based Turtle syntax. It shows how the patient identified with the id Patr102 has associated several readings of his glucose levels (which were previously registered in separate files). The glucose level readings collected by the CGM can be retrieved through the patientHasCGM relation, which links the Patient class with the ContinuousGlucoseMonitoring class. Information about the type of insulin and the dose that the patient is taking can be obtained through the patientHasInsulinAdministration relation. Likewise, we can obtain information about the visits that the patient has made for follow-ups. Finally, the knowledge graph also stores other patient data such as ethnicity, gender, pre-existing medical conditions, race, and whether the patient is taking medication.
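The population step can be sketched as a serializer that emits one individual as Turtle triples. The wisdm: prefix, class name, and property names come from the model above; the attribute values and the exact formatting are illustrative, not the paper's actual population program.

```java
// Sketch of the automatic population step: emitting one patient individual
// as Turtle triples using properties from Tables 5 and 6. Values are
// illustrative only.
public class TurtleWriter {
    static String patientToTurtle(String id, String gender, String ethnicity, String race) {
        return "wisdm:" + id + " rdf:type wisdm:Patient ;\n"
             + "    wisdm:hasGender \"" + gender + "\" ;\n"
             + "    wisdm:hasEthnicity \"" + ethnicity + "\" ;\n"
             + "    wisdm:hasRace \"" + race + "\" .\n";
    }

    public static void main(String[] args) {
        System.out.print(patientToTurtle("Patr102", "F", "Hispanic or Latino", "White"));
    }
}
```

In practice an RDF library would be used instead of string concatenation, so that escaping and datatype annotations (e.g. xsd:string) are handled correctly.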


Table 6. Data properties defined.

| Property name | Triple-based definition in Turtle format |
|---|---|
| Patient has blood glucose level | wisdm:hasBloodGlucose rdf:type owl:DatatypeProperty; rdfs:domain wisdm:PatientRecord; rdfs:range xsd:float |
| Insulin administration has continuous glucose monitoring unit status | wisdm:hasCGMUsageStatus rdf:type owl:DatatypeProperty; rdfs:domain wisdm:InsulinAdministration; rdfs:range xsd:string |
| CGM device has device manufacturer | wisdm:hasDeviceManufacturer rdf:type owl:DatatypeProperty; rdfs:domain wisdm:ContinuousGlucoseMonitoring; rdfs:range xsd:string |
| Patient record has diastolic blood pressure | wisdm:hasDiastolicBloodPress rdf:type owl:DatatypeProperty; rdfs:domain wisdm:PatientRecord; rdfs:range xsd:integer |
| Patient has ethnicity | wisdm:hasEthnicity rdf:type owl:DatatypeProperty; rdfs:domain wisdm:Patient; rdfs:range xsd:string |
| Patient has gender | wisdm:hasGender rdf:type owl:DatatypeProperty; rdfs:domain wisdm:Patient; rdfs:range xsd:string |
| Patient record has heart rate | wisdm:hasHeartRate rdf:type owl:DatatypeProperty; rdfs:domain wisdm:PatientRecord; rdfs:range xsd:integer |
| Patient has height | wisdm:hasHeight rdf:type owl:DatatypeProperty; rdfs:domain wisdm:PatientRecord; rdfs:range xsd:float |
| Patient has mode of insulin administration | wisdm:hasModeOfInsAdm rdf:type owl:DatatypeProperty; rdfs:domain wisdm:InsulinAdministration; rdfs:range xsd:string |
| Patient has preexisting medical condition | wisdm:hasPreExisMedicCond rdf:type owl:DatatypeProperty; rdfs:domain wisdm:Patient; rdfs:range xsd:string |
| Insulin administration has pump use | wisdm:hasPumpUse rdf:type owl:DatatypeProperty; rdfs:domain wisdm:InsulinAdministration; rdfs:range xsd:string |
| Patient has race | wisdm:hasRace rdf:type owl:DatatypeProperty; rdfs:domain wisdm:Patient; rdfs:range xsd:string |
| Patient record has systolic blood pressure | wisdm:hasSystolicBloodPress rdf:type owl:DatatypeProperty; rdfs:domain wisdm:PatientRecord; rdfs:range xsd:integer |
| Patient record has temperature | wisdm:hasTemperature rdf:type owl:DatatypeProperty; rdfs:domain wisdm:PatientRecord; rdfs:range xsd:float |
| Patient has weight | wisdm:hasWeight rdf:type owl:DatatypeProperty; rdfs:domain wisdm:PatientRecord; rdfs:range xsd:float |
| Continuous glucose monitoring is realized during a specific time | wisdm:isRealizedDuring rdf:type owl:DatatypeProperty; rdfs:domain wisdm:ContinuousGlucoseMonitoring; rdfs:range xsd:string |
| Patient is taking medication | wisdm:isTakingMedication rdf:type owl:DatatypeProperty; rdfs:domain wisdm:Patient; rdfs:range xsd:string |
| Insulin administration has daily insulin units | wisdm:hasDailyInsulinUnits rdf:type owl:DatatypeProperty; rdfs:domain wisdm:InsulinAdministration; rdfs:range xsd:float |
| Continuous glucose monitoring uses device model | wisdm:usesDeviceModel rdf:type owl:DatatypeProperty; rdfs:domain wisdm:ContinuousGlucoseMonitoring; rdfs:range xsd:string |
| Insulin administration uses pump type | wisdm:usesPumpType rdf:type owl:DatatypeProperty; rdfs:domain wisdm:InsulinAdministration; rdfs:range xsd:string |

Fig. 5. Datasets to knowledge graphs transformation process.

6.3 Knowledge Graph Evaluation

To evaluate the resulting knowledge graph we executed the HermiT2 reasoner. HermiT is an OWL reasoner capable of determining whether or not a given knowledge graph is consistent; it also identifies subsumption relationships between classes. The HermiT algorithm implements the "hypertableau" calculus, which reduces execution time and makes reasoning tasks more efficient (Fig. 10).

7 Conclusions

In this paper we have described a semi-automatic method that allows the transformation of a medical dataset into a knowledge graph. The transformation process is semi-automatic because many human decisions must be made: for example, defining which data is useful in a knowledge graph, designing the graph model, and deciding which

2 http://www.hermit-reasoner.com/


Fig. 6. Main script executed sequentially to obtain the knowledge graph.

Fig. 7. Resulting knowledge graph diagram.

relationships are relevant and which organization of concepts is most useful. The transformation method also has a stage in which the population or registration of the instances is automated by means of a set of programs.


Fig. 8. General metrics of the resulting knowledge graph.

Fig. 9. Detailed information represented in the knowledge graph.

Fig. 10. Hermit reasoner execution results.


Future work consists in the development of a wizard-type tool that allows a user to select any dataset of interest and transform it into a knowledge graph; such a tool would have to offer suggestions for the design of the graph and its relationships. The resulting graph has been evaluated by verifying its logical consistency; it is still necessary to build an application that uses it in order to verify its usability in the community of health researchers.

References

1. Pratley, R.E., Kanapka, L.G., Rickels, M.R., Ahmann, A., Aleppo, G., Beck, R., Miller, K.M., for the Wireless Innovation for Seniors with Diabetes Mellitus (WISDM) Study Group: Effect of continuous glucose monitoring on hypoglycemia in older adults with type 1 diabetes: a randomized clinical trial. JAMA 323(23), 2397–2406 (2020)
2. Hermsen, E.D., VanSchooneveld, T.C., Sayles, H., Rupp, M.E.: Implementation of a clinical decision support system for antimicrobial stewardship. Infect. Control Hosp. Epidemiol. 33(4), 412 (2012). https://doi.org/10.1086/664762
3. Sherimon, P.C., Krishnan, R.: OntoDiabetic: an ontology-based clinical decision support system for diabetic patients. Arab. J. Sci. Eng. 41(3), 1145–1160 (2016). https://doi.org/10.1007/s13369-015-1959-4
4. Zhang, Y.F., et al.: An ontology-based approach to patient follow-up assessment for continuous and personalized chronic disease management. J. Biomed. Inform. 72, 45–59 (2017). https://doi.org/10.1016/j.jbi.2017.06.021
5. Ajami, H., Mcheick, H.: Ontology-based model to support ubiquitous healthcare systems for COPD patients. Electronics 7(12), 371 (2018). https://doi.org/10.3390/electronics7120371
6. Oyelade, O.N., Ezugwu, A.E.: A case-based reasoning framework for early detection and diagnosis of novel coronavirus. Inform. Med. Unlocked 20, 100395 (2020). https://doi.org/10.1016/j.imu.2020.100395
7. Govindan, K., Mina, H., Alavi, B.: A decision support system for demand management in healthcare supply chains considering the epidemic outbreaks: a case study of coronavirus disease 2019 (COVID-19). Transp. Res. Part E: Logist. Transp. Rev. 138, 101967 (2020). https://doi.org/10.1016/j.tre.2020.101967
8. Harry, M.L., Saman, D.M., Truitt, A.R., et al.: Pre-implementation adaptation of primary care cancer prevention clinical decision support in a predominantly rural healthcare system. BMC Med. Inform. Decis. Mak. 20, 117 (2020). https://doi.org/10.1186/s12911-020-01136-8
9. Bravo, M., González, D., Ortiz, J.A.R., Sánchez, L.: Management of diabetic patient profiles using ontologies. Contaduría y Administración 65(5), 12 (2020). https://doi.org/10.22201/fca.24488410e.2020.3050
10. Hao, X., et al.: Construction and application of a knowledge graph. Remote Sens. 13(13), 2511 (2021)
11. Hao, J., Zhao, L., Milisavljevic-Syed, J., Ming, Z.: Integrating and navigating engineering design decision-related knowledge using decision knowledge graph. Adv. Eng. Inform. 50, 101366 (2021)
12. Brack, A., Hoppe, A., Stocker, M., Auer, S., Ewerth, R.: Analysing the requirements for an open research knowledge graph: use cases, quality requirements, and construction strategies. Int. J. Digit. Libr. 23(1), 33–55 (2022)
13. Chen, L., Liu, D., Yang, J., Jiang, M., Liu, S., Wang, Y.: Construction and application of COVID-19 infectors activity information knowledge graph. Comput. Biol. Med. 148, 105908 (2022)

554

M. Bravo et al.


Explainable Dynamic Ensemble Framework for Classification Based on the Late Fusion of Heterogeneous Multimodal Data

Firuz Juraev1, Shaker El-Sappagh1,2, and Tamer Abuhmed1(B)

1 College of Computing and Informatics, Sungkyunkwan University, Seoul, South Korea
[email protected], [email protected], [email protected]
2 Faculty of Computer Science and Engineering, Galala University, Suez, Egypt

Abstract. Ensuring precise and reliable classification holds paramount importance in essential sectors such as medicine, industry, and healthcare. Recent advancements in machine learning (ML) techniques have aimed to enhance the performance, efficiency, and robustness of the applied models. Ensemble learning has been shown to exhibit superior accuracy, robustness, and generalization capability over classical single ML models in most classification problems. Although dynamic ensembles have extended the performance of static ensembles such as random forest and boosting, the current literature on dynamic ensembles primarily focuses on the early fusion of multimodal data. In this study, we present a novel framework that combines dynamic ensemble selection (DES) with a late fusion of heterogeneous multimodal data and model explainability. We evaluated our approach on the classification task of in-hospital mortality prediction; it achieves a testing accuracy of 90.16%, surpassing existing techniques and providing physicians with case-based reasoning and deep-based classifier-contribution explanations to support their decision-making. We compare our proposed framework against nine widely used ML techniques, including static and dynamic ensemble models with early fusion and static ensemble models with late fusion, on a dataset of 6,600 patients from MIT's GOSSIS dataset. The dynamic ensemble model with early fusion achieves a testing accuracy of 86.89%, the LightGBM model achieves a test accuracy of 87.72%, and the soft voting model reaches 87.97% and 89.45% using early and late fusion, respectively. Our proposed framework not only improves the accuracy and robustness of in-hospital mortality prediction models but also offers explainability and potential for further optimization to achieve even higher performance.
Keywords: Dynamic ensemble classifier · Multi-modality · Early fusion · Late fusion · Static ensemble classifier · Explainable AI · In-hospital mortality prediction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 555–570, 2024. https://doi.org/10.1007/978-3-031-47715-7_38

1 Introduction

Reliable prediction in a classification task is critical in ensuring precise decision-making, performance assessment, efficient resource allocation, and enhanced confidence in the entire system. This is particularly crucial in the realm of medical decision-making, where reliable prediction can potentially lower healthcare expenses and ease the burden on medical resources [14]. The literature has extensively investigated the use of various machine learning approaches to improve prediction performance [6]. For example, ensemble learning techniques, such as static and dynamic ensembles, have been widely explored in fields such as in-hospital mortality prediction [8,14], sepsis prediction in intensive care [7], and Alzheimer's disease progression [10] to improve classification accuracy. While static ensemble methods combine the predictions of all available base classifiers, dynamic ensemble selection chooses a subset of the most relevant classifiers for a given task, resulting in improved predictive performance and reduced computational cost [6,11]. The use of multimodal data in healthcare is prevalent because numerous intricate medical problems are represented by datasets comprising heterogeneous modalities. Different types of data, such as medical images, clinical notes, and laboratory results, are combined to improve diagnosis and treatment outcomes and to provide customized and individualized decisions. The two most popular approaches for combining multimodal data are early fusion and late fusion. Early fusion combines the raw data of different modalities at the feature level, while late fusion combines the outputs of classifiers trained on individual modalities at the decision level [9,26]. Early fusion has been demonstrated to be efficient when the modalities are highly correlated, since combining the raw data allows for more complete feature extraction and integration.
On the other hand, late fusion is more suitable when the modalities have low correlation or require different feature representations, as it allows for a more flexible combination of the outputs of individual classifiers [27]. Late fusion has also been shown to be effective when the modalities have varying amounts of missing data, since it can handle missing data in particular modalities while improving overall performance [12]. In this work, we propose a novel framework that merges the advantages of dynamic ensemble modeling and the late fusion of data. We propose a new architecture following a design similar to an existing dynamic ensemble model, the k-Nearest Oracle Union (KNORA-U) [15], with a novel late fusion capability. In addition, to the best of our knowledge, no existing work provides explainability for dynamic ensemble models; we therefore propose two explainability techniques for our dynamic ensemble model with late fusion. The contributions of the study can be summarized as follows.

1. We propose the first dynamic ensemble selection classifier with a late fusion of multiple modalities. The proposed model is capable of learning complex tasks based on heterogeneous multi-modalities.


2. We propose a novel explainability approach for dynamic ensemble selection classifiers based on case-based reasoning and deep-based classifier-contribution techniques.
3. We compare our proposed model's performance with existing approaches, including the static ensemble model with early and late fusion and the dynamic ensemble model with early fusion. The comparison is based on a real-world dataset of 6,600 patients from MIT's GOSSIS (Global Open Source Severity of Illness Score) dataset.

The study is organized as follows. Section 2 highlights the related work. Section 3 presents the proposed model, Sect. 4 discusses the results, and Sect. 5 introduces explainability. Section 6 provides a discussion of our work, and Sect. 7 concludes the paper.

2 Related Work

Ensemble learning has emerged as a highly effective technique in the field of machine learning and pattern recognition. It involves generating and combining multiple models to improve the accuracy, robustness, and generalizability of predictions. By leveraging the strengths of multiple models, ensemble learning techniques have consistently demonstrated their ability to improve performance, reduce overfitting, and provide more reliable and robust predictions than single models alone. This has led to the successful application of ensemble learning in various domains, including computer vision, natural language processing, and medical diagnosis, among others [3]. The ensemble learning approach can be broadly categorized into two main types: static and dynamic ensemble learning.

2.1 Static Ensemble

Static ensemble learning combines a fixed set of models that work collectively to produce a better prediction (see Fig. 1). This approach leverages methods such as bagging [2], boosting [25], and stacking [23]. Bagging (bootstrap aggregating) creates multiple training datasets through bootstrapping, trains independent models on these datasets, and then combines their predictions by averaging or voting to reduce overfitting and improve generalizability [20]. Boosting, on the other hand, entails training weak models consecutively on the misclassified data points, with each new model focused on the regions where the prior models failed. The final prediction is obtained by combining these weak models through a weighted majority vote or a weighted sum, ultimately resulting in a strong model with improved accuracy and reduced bias. Popular boosting algorithms include AdaBoost, which adjusts the weights of misclassified instances at each iteration, and gradient boosting, which utilizes the gradients of the loss function to guide the addition of new weak models [22,25]. Stacking involves training several diverse base models, which could include a mix of machine learning algorithms such as linear regression, decision trees, support vector machines, or even deep learning models like neural networks. Once the base models are trained to generate predictions on the given dataset, instead of combining the predictions through simple methods like averaging or majority voting, stacking introduces a meta-model, which is trained on the predictions generated by the base models, effectively learning how to optimally combine their outputs to achieve better performance [22].
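To make the bagging mechanics concrete, here is a deliberately small, pure-Python sketch, not the paper's implementation (in practice one would reach for scikit-learn's BaggingClassifier, VotingClassifier, or StackingClassifier): a decision-stump weak learner is retrained on bootstrap resamples, and the copies are combined by majority vote. The toy dataset and the `train_stump` helper are illustrative inventions.

```python
import random
from collections import Counter

def train_stump(X, y):
    """Weak learner: the single-feature threshold rule with fewest errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            errs = sum(1 for x, lab in zip(X, y) if int(x[f] > t) != lab)
            if best is None or errs < best[0]:
                best = (errs, f, t)
    _, f, t = best
    return lambda x: int(x[f] > t)

def bagging_fit(X, y, n_models=11, seed=0):
    """Bagging: train one stump per bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the trained copies by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Toy, linearly separable data: class 1 iff the first feature exceeds 0.2.
X = [(0.1, 0.3), (0.2, 0.8), (0.7, 0.2), (0.9, 0.9), (0.8, 0.5), (0.15, 0.6)]
y = [0, 0, 1, 1, 1, 0]
ensemble = bagging_fit(X, y)
print(bagging_predict(ensemble, (0.85, 0.4)))  # majority-vote prediction
```

A boosting or stacking variant would differ only in how the copies are trained and combined: boosting reweights the misclassified samples between rounds, while stacking feeds the base predictions into a meta-model instead of voting.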

Fig. 1. Architecture of the static ensemble

2.2 Dynamic Ensemble

Dynamic ensemble learning constitutes an approach in which models are generated and combined in real time based on specific query samples. This technique can be divided into two primary categories, namely dynamic classifier selection (DCS) and dynamic ensemble selection (DES), as illustrated in Fig. 2. DCS [21] is a technique that selects the most competent model from a pool of models, based on the characteristics of the query sample. This involves calculating the competence of each model in the pool and selecting the one that is best suited to make a final prediction for the given query. On the other hand, DES [15] is a technique that selects a subset of the most competent models from a pool of models and combines their predictions to make a final decision for the given query. This involves calculating the competence of each model in the pool and selecting the N most competent models to form the ensemble. The predictions of the selected models are then aggregated to make a final decision. The literature has shown that dynamic ensemble learning outperforms static ensemble learning for several reasons. Dynamic ensemble learning is more robust and reliable because it can adapt to changes in data characteristics and distributions. It also reduces the risk of overfitting by selecting the most competent models for each query, and it produces more interpretable results by identifying the most relevant models and their contributions to the final prediction. These advantages make dynamic ensemble learning a powerful and effective approach for improving the accuracy and robustness of machine learning models [6].

2.3 Early Fusion-Based Ensemble

Fig. 2. The architecture of the dynamic classifier and dynamic ensemble selection

Early fusion-based ensemble modeling is a frequent data fusion technique where multiple modalities are integrated at an initial stage of the learning process, before training the individual models within the pool. By combining various modalities during the early stages, this approach aims to capture the potential interactions and interdependencies among the modalities to enhance the model's overall performance [3]. However, a critical limitation of early fusion-based ensemble modeling is that the base models within the pool are trained using the entire dataset. This not only slows the training of each model but also limits the diversity of the generated predictions. The diversity is reduced by the fact that all the models are trained on identical datasets, which may result in models with similar biases and limitations, thereby affecting the ensemble's overall robustness and generalization. Despite this drawback, early fusion-based ensemble modeling has been thoroughly investigated and employed in both static and dynamic ensemble modeling contexts within the literature. In the case of static ensemble modeling, early fusion has been used to create diverse models by combining multiple modalities, assuming that the data distribution remains consistent throughout the process. In contrast, dynamic ensemble modeling has incorporated early fusion to adaptively select and combine models in response to specific query samples, thereby leveraging the advantages of fusing multiple modalities while accounting for changing data characteristics and distributions [22]. Furthermore, researchers have also attempted to address the limitations of early fusion-based ensemble modeling by exploring alternative techniques such as late fusion and intermediate fusion, which integrate the modalities at different stages of the learning process. These alternative fusion strategies aim to balance the benefits of early integration with the need for maintaining diversity


and adaptability among the models within the ensemble, while offering potential alternatives for further enhancement of ensemble learning methods [5,17].

2.4 Late Fusion-Based Ensemble

Late fusion-based ensemble modeling is a widely adopted strategy that involves training separate models on individual modalities and combinations of modalities and then integrating their output predictions at a later stage of the learning process. A key advantage of late fusion lies in its ability to let models learn tasks across multiple modalities, thereby fostering greater diversity in their decision boundaries. This diversity within the model pool is crucial for constructing a robust ensemble that can effectively generalize to previously unseen data. While late fusion has been extensively researched and applied for static ensemble learning [19], its potential for dynamic ensemble selection remains comparatively under-explored in the literature. We address this research gap by examining the applicability of late fusion to dynamic ensemble selection models in this paper. By combining the output predictions of multiple dynamically selected models, we aim to improve the accuracy and robustness of the ensemble model, especially in scenarios where the data distribution may change over time. Another significant contribution of our work is the development of an innovative explainability technique for dynamic ensemble selection models, an aspect that has not been previously investigated. This novel approach aims to provide insights into the decision-making process of the ensemble, clarifying the contributions of individual models and the rationale behind the dynamic selection of models. Our proposed explainability technique can potentially facilitate the adoption of dynamic ensemble selection models in various domains, particularly those where interpretability is critically important, such as healthcare and finance.
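As a concrete (hypothetical) illustration of decision-level fusion, the sketch below scores one patient with two per-modality models and combines their outputs with a weighted average. The feature names, thresholds, and weights are invented for illustration, not taken from the paper.

```python
def lab_model(patient):
    """Hypothetical lab-tests scorer: mortality risk from lactate alone."""
    return min(1.0, 0.2 * patient["lactate"])

def vitals_model(patient):
    """Hypothetical vital-signs scorer: high risk if mean arterial pressure is low."""
    return 0.9 if patient["map"] < 60 else 0.1

def late_fusion(scores, weights):
    """Decision-level fusion: weighted average of per-modality risk scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

patient = {"lactate": 4.0, "map": 55}
scores = [lab_model(patient), vitals_model(patient)]
fused = late_fusion(scores, weights=[0.6, 0.4])  # weights ~ per-modality competence
print(round(fused, 2))  # 0.84
```

Early fusion, by contrast, would concatenate the lab and vital-sign features into one vector and train a single model on it; only the late-fusion variant lets each modality keep its own model and decision boundary.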

3 Methods

This section describes the medical dataset utilized in this study, the architecture of the proposed framework for predicting in-hospital mortality, and the procedures used for data preparation.

3.1 Dataset

In this study, we used MIT’s GOSSIS (Global Open Source Severity of Illness Score) dataset [18]. The study includes 6600 subjects (57.3% male). The average age of subjects was between 16 and 89 years. There are two classes in the dataset whether the patient is dead or survived. There are 5264 (79.7%) subjects who survived and 1336 (20.3%) as dead patients.

3.2 Proposed Model

Figure 3 illustrates our proposed framework. This framework learns in-hospital mortality based on the multimodal late fusion paradigm. The proposed algorithm extends KNORA-U (k-Nearest Oracle Union) and works as follows:

1. Split dataset D into training (Train), validation for DES (Dsel), and testing (Test) sets.
2. Define the model pool M with the corresponding feature sets F (each a modality or combination of modalities): M = {m_1, m_2, m_3, ..., m_n}, F = {f_1, f_2, f_3, ..., f_n}.
3. Train the models in M, each masked on its corresponding subset of F, using the Train set.
4. Define a number of neighbors K for selection.
5. Select the K nearest samples from Dsel for a given test sample x_i using the k-Nearest Neighbors algorithm with the Minkowski distance (Eq. 1). These K nearest samples are called the region of competence (RoC) for x_i:

d_M(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}    (1)

where p is a positive constant that determines the order of the distance and y_i ∈ Dsel.
6. Make predictions P for the samples in the RoC using the trained model pool M: P = {p_1, p_2, p_3, ..., p_n}, where p_i is the set of predictions of m_i (m_i ∈ M) for the K samples in the RoC.
7. Calculate the competence score C of each model in M from its predictions in the RoC. The competence score is the accuracy of the model in the RoC: C(m_i) = accuracy(p_i), where m_i ∈ M, p_i ∈ P.
8. Select the models with a competence score higher than zero, i.e., models that correctly predicted at least one sample in the RoC.
9. Predict the given test sample x_i with the selected models.
10. Aggregate the predictions of the selected models for x_i using soft voting, with the competence scores as weights.

Our model also provides two types of explainability: case-based reasoning (CBR) and deep-based classifier contributions (DBCC) to the final decision. For CBR, the model provides a map showing the given test sample x_i, the selected K nearest samples, and the other samples in the Dsel set. In addition, our model provides a detailed table of the selected K nearest samples that helps physicians gain more insight into cases similar to the given test sample. For DBCC, our model provides an explanation of how each selected model contributes to the final decision.
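The selection loop (steps 4–10) can be sketched in a few lines of pure Python. This is a toy stand-in, not the paper's implementation: the "models" are hand-written rules, the DSEL set is synthetic, and a production version would use trained classifiers (e.g., via scikit-learn or a DES library such as DESlib).

```python
def minkowski(x, y, p=2):
    """Eq. 1: Minkowski distance between two feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knora_u_predict(x, models, dsel_X, dsel_y, k=3, p=2):
    # Step 5: region of competence (RoC) = k nearest DSEL samples to x.
    roc = sorted(range(len(dsel_X)), key=lambda i: minkowski(x, dsel_X[i], p))[:k]
    # Steps 6-7: competence of each model = its accuracy inside the RoC.
    competence = [sum(m(dsel_X[i]) == dsel_y[i] for i in roc) / k for m in models]
    # Step 8: keep only models that got at least one RoC sample right.
    selected = [(m, c) for m, c in zip(models, competence) if c > 0]
    # Steps 9-10: competence-weighted soft vote over the selected models.
    votes = {}
    for m, c in selected:
        votes[m(x)] = votes.get(m(x), 0.0) + c
    return max(votes, key=votes.get)

# Toy DSEL set (2 features, binary labels) and two per-modality rule "models".
dsel_X = [(0.1, 0.2), (0.2, 0.9), (0.8, 0.1), (0.9, 0.8)]
dsel_y = [0, 0, 1, 1]
models = [
    lambda x: int(x[0] > 0.5),  # "modality A" rule: actually informative
    lambda x: int(x[1] > 0.5),  # "modality B" rule: mostly uninformative
]
print(knora_u_predict((0.85, 0.2), models, dsel_X, dsel_y))  # 1
```

Because the informative rule earns a higher competence score inside the RoC, its vote dominates the final weighted decision even when the uninformative rule disagrees.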

3.3 Data Preprocessing

Fig. 3. The architecture of the proposed framework

3.3.1 Missing Data Handling. Features that are missing in more than 25% of the samples are eliminated. k-Nearest Neighbors imputation (KNNImputer) [28] with 5 nearest neighbors is used to impute features with less than 25% missing values: the mean value over the five nearest neighbors found in the training set is used to impute the missing values in each sample.

3.3.2 Data Standardization. We used MinMax normalization to rescale the features to a common range, as described in Eq. 2:

x_scaled = (x − x_min) / (x_max − x_min)    (2)

3.3.3 Feature Selection. We employed a recursive feature elimination (RFE) approach [13] to select a subset of important features from a large set of 185 statistical features. To carry out the selection process, we utilized a Random Forest classifier in conjunction with RFE (RFE-RF). This method was used to rank the features based on their relevance and contribution to the classification task. After applying RFE-RF, we identified 83 important features deemed essential for accurate classification.

3.3.4 Data Balancing. Unbalanced datasets lead to biased results. In classification modeling, unbalanced datasets may be addressed using a variety of techniques, such as oversampling the minority class or undersampling the majority class. Although our dataset is not highly imbalanced, we applied the synthetic minority over-sampling technique (SMOTE) [4] to balance it.
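The first two preprocessing steps can be sketched in plain Python as follows. This is a simplified illustration with invented toy data: the actual pipeline uses scikit-learn's KNNImputer, MinMaxScaler, and RFE plus imbalanced-learn's SMOTE, not these hand-rolled helpers.

```python
def drop_sparse_features(rows, max_missing=0.25):
    """Step 3.3.1, part 1: drop any feature column missing in >25% of rows."""
    n = len(rows)
    keep = [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / n <= max_missing]
    return [[r[j] for j in keep] for r in rows], keep

def minmax_scale(col):
    """Step 3.3.2 (Eq. 2): rescale one fully observed feature column to [0, 1]."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

rows = [
    [1.0, None, 5.0],
    [2.0, None, 7.0],
    [3.0, 4.0, None],
    [4.0, None, 9.0],
]
cleaned, kept = drop_sparse_features(rows)  # column 1 is 75% missing -> dropped
print(kept)                                  # [0, 2]
print(minmax_scale([1.0, 2.0, 3.0, 4.0]))
```

The remaining missing entries in the kept columns would then be filled by k-NN imputation (mean of the 5 nearest training neighbors) before scaling, feature selection, and SMOTE.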

4 Results

In this section, we discuss and compare the results of different machine learning approaches with our proposed late-fusion dynamic ensemble selection model.
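The "10-holdout" protocol used below can be sketched as repeated random splits, with accuracy reported as mean ± standard deviation. The majority-class "model" here is only a placeholder stand-in, not one of the paper's classifiers, and the 80/20 toy labels merely mimic the dataset's class ratio.

```python
import random
import statistics

def holdout_eval(X, y, fit, n_repeats=10, test_frac=0.2, seed=0):
    """Repeated holdout: re-split, retrain, and collect test accuracies."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - test_frac))
        train, test = idx[:cut], idx[cut:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(model(X[i]) == y[i] for i in test) / len(test))
    return statistics.mean(scores), statistics.pstdev(scores)

def majority_fit(X, y):
    """Placeholder 'classifier': always predicts the majority training class."""
    major = max(set(y), key=y.count)
    return lambda x: major

X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20  # roughly the 80/20 class split of the dataset
mean_acc, std_acc = holdout_eval(X, y, majority_fit)
print(f"{mean_acc:.2f} ± {std_acc:.2f}")
```

Any of the real classifiers below would be dropped in via the `fit` callback; the point is only that each reported number aggregates ten independent train/test splits.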

We collect and present the testing results for every model. To obtain more consistent results, we applied the 10-holdout testing method [24]. The results are reported as mean ± standard deviation.

4.1 Early Fusion-Based Ensemble Modeling

In early fusion, we combine all modalities first and train machine learning (ML) models on the combined data. For the ensemble techniques, we need a pool of ML models. In this study, we used nine well-known and diverse ML algorithms [1,16]: XGBoost (XGB), LightGBM (LGBM), Random Forest (RF), Support Vector Classifier (SVC), Logistic Regression (LR), Multi-Layer Perceptron (MLP), Decision Tree (DT), Naive Bayes (NB), and k-Nearest Neighbors (KNN). Table 1 shows the performance of the nine selected models. Among them, the LightGBM model achieved the highest accuracy of 87.72%; the XGB and RF models followed with 86.97% and 81.35%, respectively. Note that the ensemble models achieved better results than the classical single ML models.

Table 1. The performance of the base models with the early fusion of modalities

| Base classifier | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| LGBM | 87.72±0.47 | 91.00±0.61 | 83.73±0.90 | 87.21±0.52 |
| XGB | 86.97±0.56 | 90.80±1.10 | 82.30±1.07 | 86.33±0.60 |
| RF | 81.35±1.11 | 86.21±0.73 | 74.63±2.29 | 79.99±1.43 |
| SVC | 80.72±1.14 | 81.32±0.85 | 79.76±2.51 | 80.52±1.38 |
| LR | 78.20±1.19 | 77.48±0.63 | 79.51±2.87 | 78.46±1.52 |
| MLP | 76.83±1.01 | 83.03±1.41 | 67.47±2.22 | 74.42±1.37 |
| DT | 73.11±1.27 | 77.39±1.42 | 65.31±1.78 | 70.83±1.49 |
| NB | 72.82±1.76 | 73.25±1.49 | 71.86±2.76 | 72.54±2.02 |
| KNN | 72.44±1.51 | 71.24±1.32 | 75.25±2.65 | 73.18±1.70 |

Table 2 shows the results of the static ensemble models with different model pools. The best accuracy of 87.97% is reached with the pool of the three most accurate models: LGBM, XGB, and RF. We also tried adding more diverse models, but the results did not improve: since the training data of all models are the same, there is little diversity in their predictions. To test the performance of the dynamic ensemble selection model with early fusion, we selected KNORA-U, since this is the technique we extend for late fusion. The performance of KNORA-U is lower than that of the static ensemble model. The results are shown in Table 3; the highest accuracy KNORA-U reached is 86.89%, with the pool of the three most accurate models.


Table 2. The performance of the static ensemble model (soft voting) with the early fusion of modalities

| Selection strategy | Model pool | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| All models | XGB, RF, LGBM, SVC, MLP, LR, NB, KNN, DT | 83.96±0.82 | 85.70±0.77 | 81.50±1.73 | 83.53±0.96 |
| 3 most accurate | LGBM, XGB, RF | 87.97±0.54 | 90.95±0.64 | 84.35±0.93 | 87.50±0.60 |
| 4 most accurate | LGBM, XGB, RF, SVC | 87.28±0.57 | 89.69±0.65 | 84.27±1.00 | 86.90±0.61 |
| 3 most accurate + 1 diverse | LGBM, XGB, RF, MLP | 87.10±0.52 | 90.54±0.83 | 82.91±0.97 | 86.55±0.57 |
| 3 most accurate + 2 diverse | LGBM, XGB, RF, MLP, KNN | 86.10±0.68 | 88.71±0.69 | 82.74±1.33 | 85.61±0.76 |
| 2 most accurate + 3 diverse | LGBM, XGB, MLP, KNN, DT | 85.65±0.63 | 88.56±0.88 | 81.95±0.94 | 85.12±0.68 |

Table 3. The performance of the dynamic ensemble model (KNORA-U) with the early fusion of modalities

| Selection strategy | Model pool | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| All models | XGB, RF, LGBM, SVC, MLP, LR, NB, KNN, DT | 83.78±1.03 | 84.07±0.91 | 83.78±1.03 | 83.72±1.04 |
| 3 most accurate | LGBM, XGB, RF | 86.89±0.96 | 87.16±0.88 | 86.89±0.96 | 86.85±0.97 |
| 4 most accurate | LGBM, XGB, RF, SVC | 85.80±0.94 | 86.20±0.85 | 85.80±0.94 | 85.74±0.98 |
| 3 most accurate + 1 diverse | LGBM, XGB, RF, MLP | 85.95±1.03 | 86.48±0.91 | 85.95±1.03 | 85.92±1.05 |
| 3 most accurate + 2 diverse | LGBM, XGB, RF, MLP, KNN | 85.33±1.18 | 85.72±1.04 | 85.33±1.18 | 85.30±1.20 |
| 2 most accurate + 3 diverse | LGBM, XGB, MLP, KNN, DT | 85.57±0.81 | 85.89±0.71 | 85.57±0.81 | 85.53±0.83 |

4.2 Late Fusion-Based Ensemble Modeling

In late fusion, instead of combining all modalities, we train each model in the pool on a different modality or combination of modalities. This approach increases the diversity of the models' predictions and yields better ensemble performance than early fusion. Table 4 shows the results of the static ensemble; as expected, performance improves compared to the static ensemble with early fusion. We tested different combinations of model pools and modalities. We obtained the best result with the following combination: LGBM with lab tests (L), XGB with demographics (D), XGB with covariates (COV), LGBM with vital signs (V), MLP with comorbidities (COM), MLP with a combination of lab tests and demographics (L, D), SVC with a combination of covariates and demographics (COV, D), and RF with a combination of all modalities. With this setting, we achieved the best accuracy of 89.45%.


In Table 5, we report the results of our proposed model: the dynamic ensemble selection model with late fusion. We tested different combinations with our model as well; to obtain the highest results, we utilized a model pool consisting of four highly accurate models (2 LightGBM and 2 XGBoost) and four diverse models (1 RF, 1 SVC, and 2 MLP). The average accuracy of these individual models was 77.75 ± 8.3, and the ensemble achieved an accuracy of 90.16%, outperforming all previous techniques. By leveraging the diversity of the model pool, we were able to achieve a high level of accuracy with the DES approach, which supports our hypothesis that diversity in the model pool is important in dynamic ensemble techniques.

Table 4. The performance of the static ensemble model (soft voting) with the late fusion of modalities

| Modalities | Strategy | Models | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| [L] + [D] + [COV] + [V] + [COM] | 5 accurate | LGBM, XGB, XGB, LGBM, LR | 89.08±0.43 | 89.96±0.46 | 89.08±0.43 | 89.03±0.43 |
| [L] + [D] + [COV] + [V] | 4 accurate | LGBM, XGB, XGB, LGBM | 88.98±0.46 | 89.88±0.49 | 88.98±0.46 | 88.95±0.45 |
| [L] + [D] + [COV] + [V] + [COM] | 4 accurate + 1 diverse | LGBM, XGB, XGB, LGBM, MLP | 89.05±0.40 | 89.97±0.43 | 89.05±0.40 | 89.01±0.41 |
| [L] + [D] + [COV] + [V] + [COM] + [COV] | 4 accurate + 2 diverse | LGBM, XGB, XGB, LGBM, MLP, KNN | 88.82±0.79 | 89.40±0.78 | 88.82±0.79 | 88.75±0.79 |
| [L] + [D] + [COV] + [V] + [COM] + [L, D] | 4 accurate + 2 diverse | LGBM, XGB, XGB, LGBM, MLP, MLP | 89.22±0.61 | 89.76±0.58 | 89.22±0.61 | 89.17±0.60 |
| [L] + [D] + [COV] + [V] + [COM] + [L, D] + [COV, D] | 4 accurate + 3 diverse | LGBM, XGB, XGB, LGBM, MLP, MLP, SVC | 89.15±0.81 | 89.45±0.79 | 89.15±0.81 | 89.12±0.81 |
| [L] + [D] + [COV] + [V] + [COM] + [L, D] + [COV, D] + [ALL] | 4 accurate + 4 diverse | LGBM, XGB, XGB, LGBM, MLP, MLP, SVC, RF | 89.45±0.81 | 89.76±0.82 | 89.45±0.81 | 89.43±0.80 |


Table 5. The performance of the dynamic ensemble model (our proposed) with the late fusion of modalities

| Modalities | Strategy | Models | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| [L] + [D] + [COV] + [V] + [COM] | 5 accurate | LGBM, XGB, XGB, LGBM, LR | 89.30±0.35 | 90.29±0.33 | 89.30±0.35 | 89.26±0.34 |
| [L] + [D] + [COV] + [V] | 4 accurate | LGBM, XGB, XGB, LGBM | 89.22±0.31 | 90.21±0.28 | 89.22±0.31 | 89.16±0.32 |
| [L] + [D] + [COV] + [V] + [COM] | 4 accurate + 1 diverse | LGBM, XGB, XGB, LGBM, MLP | 89.25±0.36 | 90.28±0.29 | 89.25±0.36 | 89.21±0.34 |
| [L] + [D] + [COV] + [V] + [COM] + [COV] | 4 accurate + 2 diverse | LGBM, XGB, XGB, LGBM, MLP, KNN | 89.24±0.43 | 89.85±0.38 | 89.24±0.43 | 89.18±0.43 |
| [L] + [D] + [COV] + [V] + [COM] + [L, D] | 4 accurate + 2 diverse | LGBM, XGB, XGB, LGBM, MLP, MLP | 89.66±0.34 | 90.18±0.33 | 89.66±0.34 | 89.63±0.33 |
| [L] + [D] + [COV] + [V] + [COM] + [L, D] + [COV, D] | 4 accurate + 3 diverse | LGBM, XGB, XGB, LGBM, MLP, MLP, SVC | 89.72±0.43 | 90.01±0.40 | 89.72±0.43 | 89.68±0.43 |
| [L] + [D] + [COV] + [V] + [COM] + [L, D] + [COV, D] + [ALL] | 4 accurate + 4 diverse | LGBM, XGB, XGB, LGBM, MLP, MLP, SVC, RF | 90.16±0.57 | 90.42±0.53 | 90.16±0.57 | 90.14±0.58 |

5 Model Explainability

Accurate models alone are not sufficient to gain the trust of medical experts. Understanding how the models work and why they make specific decisions is important for trustworthy AI models. In the previous experiments, we achieved the best results by integrating dynamic ensemble selection with the late fusion of heterogeneous data modalities. We extend our proposed late-fusion dynamic ensemble selection model to provide explainability. Our model provides two types of explanation: case-based reasoning (CBR) and deep-based classifier contributions (DBCC). In this section, these explanations are discussed with examples.

5.1 Case-Based Reasoning

In dynamic ensemble selection, we choose base classifiers for the final decision based on their performance in the region of competence (RoC). We select the most similar samples as the RoC for the given test sample, as discussed in Sect. 3.2. Providing the samples in the RoC to physicians as cases similar to the new test sample gives them additional insight; this is also called case-based reasoning (CBR). The physicians can select the features they want to see. Figure 4 shows an example of case-based reasoning. On the left of Fig. 4, there is a DSEL map showing the given test sample (blue X), the samples selected for the RoC (larger points), and the other samples in the DSEL data.


On the right of Fig. 4, there are tables that show detailed data about the given test sample and each nearest sample in the RoC. In this table, the physicians can select which features they want to see; by default, we show the most important features. Statistics can be collected from the retrieved cases, and domain experts can check whether the neighbor cases are medically similar to the test case. Visualization techniques can also be used to show the important features that affect the decisions of the classifier.
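The case-retrieval step behind this view can be sketched as a plain nearest-neighbor query over the DSEL cases. The patient records, IDs, and the (age, lactate) feature pair below are invented placeholders, not rows from the GOSSIS dataset.

```python
import math

def nearest_cases(x, cases, k=3):
    """Retrieve the k most similar DSEL cases (Euclidean distance, i.e. p = 2)."""
    return sorted(cases, key=lambda c: math.dist(x, c["features"]))[:k]

# Toy DSEL cases; features = (age, lactate) are illustrative placeholders.
dsel = [
    {"id": "p01", "features": (62, 1.1), "outcome": "survived"},
    {"id": "p02", "features": (81, 4.2), "outcome": "died"},
    {"id": "p03", "features": (64, 1.3), "outcome": "survived"},
    {"id": "p04", "features": (79, 3.9), "outcome": "died"},
]
roc = nearest_cases((80, 4.0), dsel, k=2)  # region of competence for the query
for case in roc:  # the "detailed table" surfaced to the physician
    print(case["id"], case["features"], case["outcome"])
```

In the real system, the same RoC that drives classifier selection doubles as the set of precedent cases shown to the physician, so the explanation costs no extra computation.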

Fig. 4. Case-based reasoning. Samples that are selected in the DSEL map (left) and their detailed table with selected features (right)

5.2 Deep-Based Classifiers Contributions

The previous subsection discussed explanations aimed at physicians. We can also obtain explanations at a deeper level: how each classifier in the selected ensemble contributes to the final decision, what its individual prediction is, and how confident it is. Our model provides the table shown in Fig. 5, listing each selected model's weight (competence score), its individual prediction for the given test sample, and a confidence score for that prediction. From this table, we can generate figures such as the one on the right side of Fig. 5. With this explanation, we can see how each model affects the result, which helps us add or remove models to improve performance.
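A minimal sketch of such a contribution table is shown below. The model names, weights, and confidences are invented for illustration; the aggregation simply mirrors the competence-weighted soft vote of the proposed framework.

```python
# One (invented) row per selected model: name, competence weight,
# individual prediction, and its own confidence in that prediction.
selected = [
    ("LGBM [L]",   1.00, "died",     0.91),
    ("XGB [D]",    0.67, "survived", 0.58),
    ("MLP [L, D]", 0.33, "died",     0.74),
]

def contributions(selected):
    """Weighted soft vote plus each model's share of the total vote weight."""
    total = sum(w for _, w, _, _ in selected)
    votes = {}
    for _, w, pred, _ in selected:
        votes[pred] = votes.get(pred, 0.0) + w
    final = max(votes, key=votes.get)
    shares = {name: w / total for name, w, _, _ in selected}
    return final, shares

final, shares = contributions(selected)
print(final)  # the ensemble's decision
for name, share in shares.items():  # per-model contribution to the vote
    print(f"{name}: {share:.0%}")
```

Printing each model's share next to its individual prediction makes it immediately visible which classifiers pulled the decision toward the final class and which were outvoted.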


Fig. 5. Selected ensemble contribution. Our model has produced a table (on the left), which may be used to generate a corresponding figure (on the right)

6 Discussion

The primary objective of our paper is to present an innovative approach to dynamic ensemble selection (DES) that employs a late fusion strategy for the fusion of multi-modal data. Moreover, our approach introduces a novel method to enhance the explainability of dynamic ensemble selection and address the need for better transparency and interpretability in the decision-making process. Our experimental findings indicate that the proposed DES method outperforms existing techniques, including static ensembles such as voting and stacking that employ both early and late fusion, as well as dynamic ensemble selection approaches that utilize early fusion. The performance improvement of our proposed DES stems from the advantages of dynamic selection over static selection, in addition to the benefits of late fusion over early fusion. By effectively combining the strengths of dynamic ensemble selection with the advantages of late fusion, our proposed approach demonstrates its potential to improve the accuracy, robustness, and generalizability of machine learning models. Furthermore, the introduction of a novel explainability technique contributes to a better understanding of the ensemble's decision-making process, ultimately fostering increased confidence and trust in the model's predictions, particularly in domains where interpretability is of critical importance.

Limitations. Our study evaluated the proposed late fusion-based DES approach on a single well-balanced dataset used for binary classification. However, real-world datasets often exhibit varying degrees of complexity, class imbalances, and distribution shifts. Consequently, further research is warranted to assess the performance of our proposed late fusion-based DES approach on more complex, diverse, and imbalanced datasets, as well as in multi-class classification scenarios.
This would provide a more comprehensive understanding of the generalizability and robustness of our approach when faced with the intricacies and challenges commonly encountered in real-world applications.


7 Conclusion

This paper proposed an interpretable dynamic ensemble selection model with a late fusion of different modalities for detecting in-hospital mortality. Our proposed model outperformed the existing approaches, and the results suggest that the proposed DES approach for late fusion can be an effective technique for improving the accuracy of machine learning models, particularly in situations where diverse modalities and model pools are involved. These findings may have important implications for the development of more robust and accurate machine learning systems in various applications.

Our future work will aim to address the limitations of our current study. Specifically, we intend to expand the evaluation of our proposed DES approach on diverse and unbalanced multi-class datasets, which will provide a more comprehensive assessment of its performance. To further enhance the methodology, we also plan to incorporate additional distance metrics, including Euclidean, Manhattan, Cosine, and customized metrics, as well as various selection methods. Furthermore, we intend to explore different types of dynamic ensemble selection techniques with a late fusion setting and provide novel approaches to explainability.

Acknowledgments. This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICT Creative Consilience Program (IITP-2021-2020-001821) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation), and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C1011198).

References

1. Bonaccorso, G.: Machine Learning Algorithms. Packt Publishing Ltd (2017)
2. Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)
3. Cao, Y., Geddes, T.A., Yang, J.Y.H., Yang, P.: Ensemble deep learning in bioinformatics. Nat. Mach. Intell. 2(9), 500–508 (2020)
4. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
5. Cruz, R.M.O., Hafemann, L.G., Sabourin, R., Cavalcanti, G.D.C.: DESlib: a dynamic ensemble selection library in Python. J. Mach. Learn. Res. 21(1), 283–287 (2020)
6. Cruz, R.M.O., Sabourin, R., Cavalcanti, G.D.C.: Dynamic classifier selection: recent advances and perspectives. Inf. Fusion 41, 195–216 (2018)
7. El-Rashidy, N., Abuhmed, T., Alarabi, L., El-Bakry, H.M., Abdelrazek, S., Ali, F., El-Sappagh, S.: Sepsis prediction in intensive care unit based on genetic feature optimization and stacked deep ensemble learning. Neural Comput. Appl., pp. 1–30 (2022)
8. El-Rashidy, N., El-Sappagh, S., Abuhmed, T., Abdelrazek, S., El-Bakry, H.M.: Intensive care unit mortality prediction: an improved patient-specific stacking ensemble model. IEEE Access 8, 133541–133564 (2020)
9. El-Sappagh, S., Abuhmed, T., Riazul Islam, S.M., Kwak, K.S.: Multimodal multitask deep learning model for Alzheimer's disease progression detection based on time series data. Neurocomputing 412, 197–215 (2020)
10. El-Sappagh, S., Ali, F., Abuhmed, T., Singh, J., Alonso, J.M.: Automatic detection of Alzheimer's disease progression: an efficient information fusion approach with heterogeneous ensemble classifiers. Neurocomputing 512, 203–224 (2022)
11. Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15(1), 3133–3181 (2014)
12. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
13. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)
14. Juraev, F., El-Sappagh, S., Abdukhamidov, E., Ali, F., Abuhmed, T.: Multilayer dynamic ensemble model for intensive care unit mortality prediction of neonate patients. J. Biomed. Inform. 135, 104216 (2022)
15. Ko, A.H.R., Sabourin, R., Britto Jr., A.S.: From dynamic classifier selection to dynamic ensemble selection. Pattern Recognit. 41(5), 1718–1731 (2008)
16. Mahesh, B.: Machine learning algorithms - a review. Int. J. Sci. Res. (IJSR) 9, 381–386 (2020)
17. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
18. Raffa, J.D., Johnson, A.E.W., O'Brien, Z., Pollard, T.J., Mark, R.G., Celi, L.A., Pilcher, D., Badawi, O.: The Global Open Source Severity of Illness Score (GOSSIS). Crit. Care Med. 50(7), 1040–1050 (2022)
19. Raschka, S.: MLxtend: providing machine learning and data science utilities and extensions to Python's scientific computing stack. J. Open Source Softw. 3(24) (2018)
20. Ruta, D., Gabrys, B.: Classifier selection for majority voting. Inf. Fusion 6(1), 63–81 (2005)
21. Sabourin, M., Mitiche, A., Thomas, D., Nagy, G.: Classifier combination for hand-printed digit recognition. In: Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR'93), pp. 163–166. IEEE (1993)
22. Sagi, O., Rokach, L.: Ensemble learning: a survey. Wiley Interdiscip. Rev.: Data Min. Knowl. Disc. 8(4), e1249 (2018)
23. Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D., Stamatopoulos, P.: Stacking classifiers for anti-spam filtering of e-mail (2001). arXiv:cs/0106040
24. Sammut, C., Webb, G.I. (eds.): Holdout evaluation, pp. 506–507. Springer, Boston, MA (2010)
25. Schapire, R.E.: A brief introduction to boosting. In: IJCAI, vol. 99, pp. 1401–1406 (1999)
26. Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Brief. Bioinform. 23(2), bbab569 (2022)
27. Tan, W., Tiwari, P., Pandey, H.M., Moreira, C., Jaiswal, A.M.: Multimodal medical image fusion algorithm in the era of big data. Neural Comput. Appl., pp. 1–21 (2020)
28. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)

Towards Efficient Edge Learning with Limited Storage Resource: Bandit-Based Training Data Retrieval in AIoT

Jing Wang1, Siyuan Liu1, Wenjing Liu1, Zhiwei Xu1,2,3(B), Jiaqi Zhang1, and Jie Tian4

1 College of Data Science and Application, Inner Mongolia University of Technology, Huhhot 010080, China
2 Inner Mongolia Engineering Research Center of College Network Security and Educational Management Informationization, Huhhot 010080, China
3 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
[email protected]
4 Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA

Abstract. In an intelligent IoT environment, an edge server needs to retrieve data from end devices to train the deep neural network deployed at the edge. To achieve higher training performance with limited resources, we propose a Bandit-based in-network training data retrieval scheme (Bandit-TDRetrieval). Specifically, we formulate data retrieval from end devices as a multi-armed bandit (MAB) model. A sequence of lever pulls of the arms indicates the options to retrieve data from different end devices and follows a Binomial distribution. Thompson sampling is used to identify the relationship between this Binomial distribution and the rewards through continuous data retrieval from the corresponding devices. In this way, an efficient data retrieval paradigm is proposed that maximizes the reward of retrieving meaningful training data for edge learning. Finally, evaluations performed on an NS-3-based simulation platform demonstrate that the proposed paradigm effectively improves the training efficiency of deep neural networks at the edge.

Keywords: Efficient edge learning · Multi-armed bandit · Training data retrieval · IoT

1 Introduction

The growing interest in intelligent services motivates the integration of artificial intelligence (AI) into Internet of Things (IoT) applications [15]. Collecting large volumes of data from different devices and sensors is necessary for training intelligent applications.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 571–588, 2024. https://doi.org/10.1007/978-3-031-47715-7_39

Since uploading massive data generated


by connected devices to the cloud is usually impractical, edge learning [13] has emerged to enable analytics and knowledge generation closer to where the data is generated, avoiding the need to exchange that data with the cloud. However, since model training is a data-intensive task, the limited storage resources of edge devices constrain both the size of the models that can be trained on-device and the number of local training iterations.

To alleviate these limitations, a small amount of meaningful training data should be obtained and processed for model training. In this way, large amounts of data that contribute little or nothing to model training are prevented from occupying the storage resources of edge devices. Given the various deployments of end devices [6,9,16], these devices generate different types of data, which affect model training differently. To support model training with limited storage resources [4], it is necessary to select appropriate end devices from which to obtain the training data desired for edge learning. When retrieving training data from end devices (see Fig. 1), the edge server schedules data retrieval to reduce trivial data for model training and thus obtains only a small amount of indispensable data. Although existing data compression and caching techniques can be used to provide training data for edge learning [8,14,17], a more precise data retrieval scheme is highly desirable for learning with limited storage resources. To this end, we propose a Bandit-based training data retrieval scheme that formulates the data retrieval problem as a multi-armed bandit (MAB) problem.

Fig. 1. Edge server retrieves data.

Specifically, we regard each candidate end device as an arm of the multi-armed bandit. Pulling an arm corresponds to the action in which the end device responds to a data request with data collected locally. A sequence of pulls of the arms indicates the options to retrieve data from different end devices and follows a Binomial distribution. Each pull of an arm generates feedback according to the loss of model training with the retrieved data, and a reward is given to the end devices that send back meaningful training data. We use Thompson sampling to evaluate the relationship between pulls and rewards. Finally, the top-k end devices are used to collect training data in the next round. This scheme adapts data retrieval from end devices to the dynamic training process and thus improves the training performance of deep learning at the edge. The main contributions are as follows:

– The training data retrieval problem is formulated via a multi-armed bandit (MAB), benefiting the implementation and verification of data retrieval schemes in practice.
– Considering the feedback that the end devices receive during model training, a MAB-based Thompson sampling method is proposed to balance the historical feedback and the overall benefit of the data retrieval process.
– The proposed data retrieval scheme is compared with state-of-the-art algorithms and demonstrated to significantly improve the training efficiency of deep learning models at the edge.

The rest of this paper is organized as follows. Related work is reviewed in Sect. 2. We formulate the problem in Sect. 3, including the definition of the problem and the system model. A bandit-based training data retrieval scheme is proposed in Sect. 4. It is evaluated in Sect. 5, and conclusions are provided in Sect. 6.

2 Related Work

Many existing studies have made great efforts to improve the efficiency of neural network training; for instance, data compression and caching improve the efficiency of collaborative training at the edge. In contrast, we focus on retrieving the data that makes model training efficient in terms of resource consumption and supports more intelligent applications.

2.1 Data Compression and Data Filtering for Edge Intelligence

Several techniques have been studied that let edge servers improve the training efficiency of neural networks when obtaining data from end devices; for example, redundant training data is reduced through data compression or data filtering. Azar et al. [3] proposed an efficient data reduction approach based on fast error-bounded lossy compression. It greatly reduces the data transmitted to the edge, and the authors verified that model accuracy does not decrease when compressed data is used. Liu et al. [10] propose a data compression method based on wavelet transform, which uses entropy coding to improve compression performance. Wang et al. [18] applied data compression to a multi-user mobile edge computing system to reduce the redundancy of sensed data and improve the efficiency of model training. The above studies address only lossy compression, but lossless data also exists in practice. Lu et al. [12] combine lossy and lossless compression into a hybrid data compression scheme, applied in a sensor-cloud system so that the massive data sets collected by sensors can be transmitted while model training efficiency is improved as much as possible. Qiu et al. [14] construct a data compression scheme based on a chaotic mutation adaptive sparrow search algorithm; with the new fitness function, model training efficiency is improved while a certain compression ratio is guaranteed. Although these studies improve the efficiency of model training, the original data must be reconstructed from the compressed form, and the quality of model training degrades.

2.2 Data Caching for Intelligent Applications at Edge

Another popular research direction is data caching. Xia et al. [19] propose a data caching method based on Lyapunov optimization, applied in an online collaborative edge scenario, which improves the efficiency of model training. Liu et al. [11] propose a data caching method that uses a time window to check whether a piece of data is popular, in order to determine whether it should be cached on the edge device, thereby improving the efficiency of model training on the edge server. Wang et al. [17] propose a secure decentralized data caching strategy based on a greedy algorithm, which improves the cache hit rate and effectively improves model training efficiency while ensuring the security of data transmission. Li et al. [8] propose a data caching scheme based on reinforcement learning and use a base station collaborative caching model to consider the training efficiency of models under different transmission modes. Since these schemes do not consider the impact of the cached data on model training, a dedicated data retrieval scheme is highly desirable.

3 Problem Formulation

The feedback of the neural network model on the training data is an important reference for designing the training data retrieval scheme. In this paper, the online training of the neural network model on edge servers is formalized as a Multi-Armed Bandit (MAB) problem, and the retrieval strategy is updated using the feedback of the end devices to supply real-time and efficient data.

3.1 Problem Definition

Reinforcement learning is a type of machine learning that emphasizes how to act in accordance with the environment in order to maximize expected benefits and avoid wrong choices. It mainly consists of an Agent, an Environment, States, Actions, and Rewards. Through reinforcement learning, agents learn which actions to take in their current state in order to obtain the greatest rewards. The bandit algorithm, also known as the Multi-Armed Bandit (MAB), is the basic model in reinforcement learning. It addresses the following problem: when repeatedly faced with a choice among k different options, how to ensure that each choice receives the maximum incentive, where the incentive associated with each action is generated by a fixed probability distribution. Edge servers faced with training data retrieval fit the bandit setting well. For the MAB-based edge learning model, several key components are defined as follows.

Definition 1. (Environment) It refers to the mapping of behavior between edge servers and end devices. Specifically, we denote by U = {u1, u2, ..., uU} the set of edge servers and by I = {i1, i2, ..., iI} the set of end devices. Over a period of time, every edge server responds with an end device set, expressed as B_u^T = {b1, b2, ..., bT}, where T is the time and bt (1 ≤ t ≤ T) is the set of behavior records of the edge server from t−1 to t. Behavior records can be divided into requested and unsolicited operations. We consider the environment of edge server u over time T to comprise the edge server's properties (such as its neural network model), the end devices' properties (such as the collected data), and the behavior sequence B_u^T.

Definition 2. (Arm (Action)) When an arm is pulled, it produces a selection, and the corresponding reward is received. In existing work, arms often represent items, and pulling an arm means choosing the corresponding item. In this paper, an arm is an end device, and pulling an arm means that the edge server retrieves training data from that end device, namely A = {a1, a2, ..., an} where aj ⊆ I (1 ≤ j ≤ n).

Definition 3.
(Reward) We define the reward as the feedback on the data retrieved by the edge server from the end device. If the model training efficiency improves, for example, the response delay is lower and accuracy and other evaluation indicators are higher than those of other algorithms (the algorithms used for comparison are given in Sect. 5), the feedback is positive; otherwise, it is negative. Most studies use a Bernoulli reward: the reward is 1 if the retrieved training data improves model training efficiency, and 0 otherwise.

Definition 4. (Agent (Bandit Algorithm)) We define the agent as the bandit algorithm that performs the selection. Based on the current environment and the reward estimation functions of the end devices, a strategy selects, from the many end devices, the one that can improve the training efficiency of the model. The reward estimation function or distribution of the selected end device is then updated with the obtained reward, which is used to select a promising end device in the next round. Maximizing the cumulative reward is the goal of the agent.
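The four definitions above can be sketched as data structures. The following minimal Python sketch is illustrative only: the class names, the greedy placeholder policy, and the device success probabilities are assumptions (the paper's actual agent uses Thompson sampling, introduced in Sect. 4):

```python
import random

class Arm:
    """An arm is one end device; pulling it means retrieving data from it."""
    def __init__(self, device_id, success_prob):
        self.device_id = device_id
        self.success_prob = success_prob  # unknown to the agent in practice

    def pull(self):
        # Bernoulli reward: 1 if the retrieved data improves training, else 0.
        return 1 if random.random() < self.success_prob else 0

class Agent:
    """Tracks empirical rewards and greedily picks the best arm (placeholder policy)."""
    def __init__(self, arms):
        self.arms = arms
        self.counts = {a.device_id: 0 for a in arms}
        self.rewards = {a.device_id: 0.0 for a in arms}

    def select(self):
        # Unpulled arms get priority (infinite estimate); otherwise greedy.
        def estimate(a):
            n = self.counts[a.device_id]
            return self.rewards[a.device_id] / n if n else float("inf")
        return max(self.arms, key=estimate)

    def update(self, arm, reward):
        self.counts[arm.device_id] += 1
        self.rewards[arm.device_id] += reward

random.seed(0)
arms = [Arm("road-cam", 0.9), Arm("park-cam", 0.1)]
agent = Agent(arms)
for _ in range(50):
    arm = agent.select()
    agent.update(arm, arm.pull())
print(agent.counts)  # the high-reward device is pulled far more often
```

Even this naive greedy agent concentrates pulls on the device whose data yields rewards, which is the behavior the MAB formulation is meant to capture.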


3.2 System Model

In an online edge learning environment, the behavior of the edge server when selecting end devices usually changes constantly during the interaction. Therefore, considering the feedback of the edge server, the data retrieval scheme needs to estimate the evolution of its state precisely and derive an optimized retrieval scheme for it. More specifically, in an online edge learning environment, a data retrieval scheme should satisfy two requirements: first, the edge server must fully consider the historical feedback of the end devices; second, the edge network must optimize the overall benefit of all interactive processes. Reinforcement learning provides strong technical support for these objectives. The MAB-based online edge learning model architecture we propose is shown in Fig. 2. It consists of three main modules:

a. Environment. It includes all information about edge servers and end devices and the record of their interaction behavior.
b. Action. The rewards obtained by end devices are updated according to the requests and feedback of the edge server.
c. Agent. This is the bandit algorithm. Its input includes the request of the edge server, the end devices, and the operation; its output is the selected end device.

Throughout the process, the agent progressively optimizes the retrieval scheme by interacting with the environment, so that accurate and diversified data retrieval can be achieved.

Fig. 2. The framework of the presented recommendation model based on MAB.

The workflow of the model is as follows. First, the edge server sends an initialization request for data retrieval to each end device. Then, an optimal end device a is selected on the basis of the estimated reward of each end device; the reward estimate defines a candidate set reflecting how efficiently the model can be trained on the data held by the end device selected by the current edge server. Finally, the above steps are repeated until the entire retrieval scheme is stable, that is, until the reward estimation functions of all end devices converge.

Several important issues in the online edge learning model need to be addressed. The first is how to define an action, which has a significant influence on the quality of the edge server's data retrieval. If the end devices are classified according to the initialization results and each class is modeled as an action, the action space shrinks, but the fact that each end device's data changes dynamically is ignored. We instead model each retrieval of data from a single end device as one action, which has the benefit of accounting for the circumstances of each end device whose data changes dynamically. The second issue is how to take full advantage of the feedback from the edge server to the end device: specifically, how to map feedback to a reward, after which the expected reward of the end device is updated and used to optimize the next selection. The third issue is how to optimize the global reward over the interaction phase. The expected reward of each data retrieval must be accurately estimated; we assume that the expected reward for data retrieval from an end device obeys a Beta distribution. Modeling the process by which an edge server retrieves data from an end device is a challenge for this model. For a feedback-based data retrieval mechanism in online edge learning, the challenge is how to map training data reflecting different degrees of training efficiency to rewards, especially for unsolicited actions (no requests other than the single initial request).
For the actions in which the edge server requests the end device to transmit data, an end device whose data is retrieved by the edge server multiple times is mapped to positive feedback, while an end device that repeatedly goes unretrieved is mapped to negative feedback. For unsolicited behavior, when an end device has received no reward other than the one at initialization and the edge server has not sent it a data retrieval request, this does not necessarily mean that its data cannot improve the efficiency of model training; the bandit algorithm samples probabilistically, and the probability that the device simply has not been sampled, although small, is nonzero. Therefore, treating all unsolicited behavior as negative feedback would degrade the quality of the selection. In this paper, for the small-probability case described above, we assign such end devices positive and negative feedback with probability 50% each.
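The feedback-mapping rule described above can be sketched as a small function. The thresholds and argument names below are illustrative assumptions; the text does not fix the exact retrieval count that counts as "multiple times":

```python
import random

def map_feedback(times_requested, times_retrieved, rng=random):
    """Map an end device's retrieval history to Bernoulli feedback.

    Illustrative rule following the text: devices whose data was retrieved
    repeatedly map to positive feedback (1); requested-but-rarely-retrieved
    devices map to negative feedback (0); devices never requested after
    initialization get positive/negative feedback with probability 0.5 each.
    """
    if times_requested == 0:  # unsolicited case: 50/50 split
        return 1 if rng.random() < 0.5 else 0
    return 1 if times_retrieved > 1 else 0

print(map_feedback(5, 3), map_feedback(5, 1))
```

The 50/50 branch is what prevents never-sampled devices from being permanently written off as useless.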

4 Bandit-Based Training Data Retrieval

In an online edge learning environment, to obtain a model with better performance under limited transmission, the edge server needs to select suitable end devices from which to retrieve the most useful data samples for training. This section addresses this problem with a Bandit-based training data retrieval mechanism.

4.1 Rewards of End Devices Based on Thompson Sampling

The data retrieval scheme we propose is based on the Thompson sampling bandit algorithm [20]; the stochastic MAB problem can be solved using Thompson sampling [1]. An important indicator for evaluating a bandit algorithm is the cumulative regret: the lower the cumulative regret, the better the decision-making adapts to the current environment. We therefore use Thompson sampling to evaluate end-device rewards and help the edge server select appropriate end devices from which to retrieve data for model training.

Before introducing the basic principles of Thompson sampling, we first review the Bernoulli distribution, the Binomial distribution, and the β distribution. The Bernoulli distribution is a discrete distribution, also known as the "0–1 distribution" or "two-point distribution", describing, for example, heads or tails of a coin toss. Here, the reward obtained by each end device follows a Bernoulli distribution: when the edge server retrieves data from the end device to train the model, the reward is 1 if it gets positive feedback (the data is efficient for model training) and 0 otherwise. With q = 1 − p, its probability distribution can be expressed as:

P(x) = p^x (1 − p)^{1−x} = { p if x = 1; q if x = 0 }   (1)

The Binomial distribution is also a discrete probability distribution, known as the n-fold Bernoulli distribution: perform n Bernoulli trials, succeeding x times and failing n − x times. The probability of this outcome is:

P(x) = C(n, x) p^x (1 − p)^{n−x} = [n! / (x!(n − x)!)] p^x (1 − p)^{n−x}   (2)
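Equations (1) and (2) can be checked numerically. The following small sketch (the function names are ours) implements both probability mass functions and verifies that each sums to 1 over its outcomes:

```python
import math

def bernoulli_pmf(x, p):
    # Eq. (1): P(x) = p^x (1 - p)^(1 - x) for x in {0, 1}.
    return p ** x * (1 - p) ** (1 - x)

def binomial_pmf(x, n, p):
    # Eq. (2): P(x) = C(n, x) p^x (1 - p)^(n - x).
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Sanity checks: probabilities over all outcomes sum to 1.
assert abs(bernoulli_pmf(0, 0.3) + bernoulli_pmf(1, 0.3) - 1) < 1e-12
assert abs(sum(binomial_pmf(x, 10, 0.3) for x in range(11)) - 1) < 1e-12
print(round(binomial_pmf(3, 10, 0.3), 4))  # → 0.2668
```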

In this paper, we let n be the number of times the edge server retrieves training data from an end device. Since the reward obtained by each end device follows a Bernoulli distribution, the data retrieval performed by the edge server follows a Binomial distribution. Next, we introduce the β distribution. From a Bayesian viewpoint, the p of the Binomial distribution is not fixed; the β distribution serves as the conjugate prior of the Binomial distribution, and the value of p is adjusted dynamically through the parameters of the β distribution. For simple experiments like coins or dice, the success probability of the system is known accurately in advance. Often, however, it is unknown: to estimate it, we perform n trials and count the number of successes k, from which an estimate can be computed intuitively. Since the true success probability is unknown, this only yields its best estimate; other values are possible but less likely. In this paper, which end device the edge server should select to retrieve data from is unknown, so we introduce
the β distribution to adjust the parameter p of the Binomial distribution that the reward follows, so that it satisfies the law of large numbers after many trials. In the proposed data retrieval strategy, the basic principle of Thompson sampling is as follows. Each end device is assumed to receive a reward that follows an independent β(a, b) distribution under the proposed data retrieval strategy, where a and b are the Thompson sampling parameters. The β distribution is a family of continuous probability distributions defined on the interval (0, 1). As the conjugate prior of the Bernoulli and the Binomial distributions, it is widely used in machine learning and mathematical statistics. In the problem we address, retrieving training data from end devices through the edge server to improve the efficiency of model training, the unknown quantity is the probability that the data on an end device will be retrieved. We do, however, have reasonable prior guesses: for example, when the model is to identify vehicles, vehicle data is needed, so a road-monitoring device should be selected rather than a park-monitoring device. The β distribution represents such a probability well. Therefore, the Thompson sampling algorithm uses the β distribution to model the prior distribution of the expected reward of each end device under a Bernoulli reward setting. Each time the edge server sends a data retrieval request, Thompson sampling draws a sample from each end device's β distribution and selects the end device with the highest sample value. After receiving the positive and negative feedback of the selected end device, denoted S and W respectively, we update the posterior distribution to β(a + S, b + W). From the properties of the β distribution we can conclude the following:

i. The sum a + b describes the total number of data retrieval requests the edge server has sent to the end device. As this number increases, the feedback reward of the end device gradually satisfies the law of large numbers: the distribution curve becomes narrower and concentrates around the true average reward.

ii. The distribution center moves closer to 1 as a/(a + b) increases, i.e., as the proportion of positive feedback among the total a + b requests grows. The larger this ratio, the higher the probability of sampling a large value, and hence the higher the probability that selecting this end device yields a reward. Conversely, the probability of obtaining a reward is lower.

In summary, by observing and analyzing the reward feedback of the end devices, we can distinguish end devices with high rewards from those with low rewards.
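The sampling-and-update cycle just described can be sketched in a few lines of Python (an illustrative sketch, not the authors' implementation; the device names, reward probabilities, and round count are invented for the example):

```python
import random

random.seed(0)

# One Beta(a, b) prior per end device: a counts positive feedback (S),
# b counts negative feedback (W). Device names are invented.
priors = {"road_cam": [1, 1], "park_cam": [1, 1]}

def select_device(priors):
    """Thompson sampling: draw one sample per device, pick the argmax."""
    samples = {dev: random.betavariate(a, b) for dev, (a, b) in priors.items()}
    return max(samples, key=samples.get)

def update_posterior(priors, device, S, W):
    """Posterior update to beta(a + S, b + W) after observing feedback."""
    priors[device][0] += S
    priors[device][1] += W

# Simulated interaction: the road camera returns useful vehicle data far
# more often than the park camera (invented probabilities).
for _ in range(200):
    dev = select_device(priors)
    hit = random.random() < (0.8 if dev == "road_cam" else 0.2)
    update_posterior(priors, dev, S=int(hit), W=int(not hit))

a, b = priors["road_cam"]
print(a / (a + b))  # posterior mean drifts toward the true rate of 0.8
```

The standard-library `random.betavariate(a, b)` draws directly from a Beta distribution, so no external dependency is needed; as a + b grows, the posterior narrows around the device's true reward rate, exactly as property (i) above describes.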

4.2 Data Retrieval Paradigm

Since the process of selecting end devices to retrieve data from the edge server obeys an independent β(a, b) distribution, and Thompson sampling is random, the reward gain of end devices is uncertain. To this end, the idea of accumulating


J. Wang et al.

regrets is introduced to ensure the reliability of the selected end devices. First, the edge server sends a request to each end device and initializes its reward score, so that every device has a probability of being selected and no device is excluded out of hand. Second, based on the initial reward scores, the edge server performs multiple data retrievals to update the accuracy and delay reward returns of the end devices.

The accuracy reward is updated as

$$\bar{r}_i(t) = \frac{\bar{r}_i(t-1) + r_i^t}{\beta_i(t-1) + 1} \qquad (3)$$

The delay reward is updated as

$$\bar{p}_i(t) = \frac{\bar{p}_i(t-1) + p_i^t}{\beta_i(t-1) + 1} \qquad (4)$$

where $\beta_i(t)$ is the number of times end device $i$ has been selected before round $t$, $\bar{r}_i(t)$ is the average accuracy return of end device $i$ up to round $t$, and $\bar{p}_i(t)$ is the average delay return of end device $i$ up to round $t$. Finally, based on the cumulative-regret idea applied to accuracy and response delay, the probability that an end device is selected by the edge server, i.e., $p$ in the Binomial distribution, is determined.

The cumulative regret of accuracy is

$$R_T^1 = \sum_{t=1}^{T} \left( \bar{r}_i(t) - r_i^t \right) \qquad (5)$$

The cumulative regret of delay is

$$R_T^2 = \sum_{t=1}^{T} \left( \bar{p}_i(t) - p_i^t \right) \qquad (6)$$

In order to comprehensively consider both the accuracy and the response delay of the end devices, the combined cumulative regret $Q$ is used as the basis for selecting end devices:

$$Q = R_T^1 \cdot R_T^2 \qquad (7)$$

The lower the cumulative regret of accuracy and response delay, the more efficiently the edge server can retrieve data from the end device to train the model.
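Equations (3)–(7) translate directly into code. The sketch below is illustrative, not the authors' implementation: the three-round reward traces are invented, and the update rule deliberately mirrors Eqs. (3)/(4) literally.

```python
def update_average(avg_prev, reward_t, n_selected_prev):
    """Eqs. (3)/(4): update the running reward average after one selection.

    avg_prev        -- r-bar_i(t-1) or p-bar_i(t-1)
    reward_t        -- r_i^t or p_i^t, the reward observed in round t
    n_selected_prev -- beta_i(t-1), selections of device i before round t
    """
    return (avg_prev + reward_t) / (n_selected_prev + 1)

def cumulative_regret(avg_rewards, observed_rewards):
    """Eqs. (5)/(6): sum over rounds of (average - observed) reward."""
    return sum(avg - obs for avg, obs in zip(avg_rewards, observed_rewards))

def combined_regret(RT1, RT2):
    """Eq. (7): Q = RT1 * RT2; a lower Q marks a more efficient device."""
    return RT1 * RT2

# Eq. (3) example: running accuracy average 0.6 after 3 selections,
# accuracy 0.9 observed in the current round.
print(update_average(0.6, 0.9, 3))  # (0.6 + 0.9) / (3 + 1) = 0.375

# Invented 3-round reward traces for one end device.
RT1 = cumulative_regret([0.7, 0.7, 0.8], [0.6, 0.7, 0.5])  # accuracy regret
RT2 = cumulative_regret([0.4, 0.5, 0.5], [0.3, 0.3, 0.4])  # delay regret
print(combined_regret(RT1, RT2))
```

Note that ranking by the product Q penalizes a device only when both its accuracy regret and its delay regret are high, which is what lets the scheme trade off the two criteria in a single score.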


Algorithm 1: Bandit-TDRetrieval Algorithm

 1: Input: prior parameters α_0^a, β_0^a; end devices K; edge server request behavior record B_u^T
 2: Output: end device set L
 3: for t = 1, 2, ..., T do
 4:     A ← get_action(B_u^T, t)
 5: end for
 6: for i = 1, 2, ..., K do
 7:     for all a ∈ A do
 8:         if a = 0 then break
 9:         else
10:             β(a + S → a, b + W → b)                          ▷ posterior update
11:             r̄_i(t) = (r̄_i(t−1) + r_i^t) / (β_i(t−1) + 1)      ▷ update accuracy reward
12:             p̄_i(t) = (p̄_i(t−1) + p_i^t) / (β_i(t−1) + 1)      ▷ update delay reward
13:         end if
14:     end for
15:     R_T^1 = Σ_{t=1}^T (r̄_i(t) − r_i^t)                        ▷ cumulative regret of accuracy
16:     R_T^2 = Σ_{t=1}^T (p̄_i(t) − p_i^t)                        ▷ cumulative regret of delay
17:     Q = R_T^1 * R_T^2
18: end for
19: return L

Algorithm 1 is the pseudo-code of the Bandit-TDRetrieval algorithm, which shows the operational flow of our data retrieval from end devices. The idea of the algorithm is to identify the set of end devices with the best efficiency based on the accuracy and response latency observed when acquiring data from them. The rewards of each end device obey an independent β distribution. By updating the accuracy reward r̄_i(t) and the delay reward p̄_i(t) generated by acquiring data from a device, the corresponding cumulative regrets R_T^1 and R_T^2 are calculated. Finally, R_T^1 and R_T^2 are combined into the final cumulative regret Q of each device, and the set L of devices with the best training efficiency is selected for output.
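As a companion to the pseudo-code, the selection loop can be simulated end to end (a hypothetical reconstruction, not the authors' code: the device quality values are invented, and the final Top-k ranking is simplified to ranking by selection count rather than by the regret product Q):

```python
import random

random.seed(1)

# Invented per-device probability that a data request yields a useful,
# timely response.
true_quality = {"dev0": 0.9, "dev1": 0.5, "dev2": 0.2}
priors = {d: [1, 1] for d in true_quality}   # Beta(a, b) per end device
picks = {d: 0 for d in true_quality}

for t in range(300):
    # Thompson sampling step: sample each posterior, request from the argmax.
    chosen = max(priors, key=lambda d: random.betavariate(*priors[d]))
    picks[chosen] += 1
    reward = random.random() < true_quality[chosen]
    priors[chosen][0] += int(reward)          # a + S
    priors[chosen][1] += int(not reward)      # b + W

# Recommended set L: here simplified to the k most frequently selected
# devices (the paper ranks by the regret product Q instead).
k = 2
L = sorted(picks, key=picks.get, reverse=True)[:k]
print(L)
```

Because the posterior of a consistently rewarding device concentrates near its true rate, the loop quickly spends most of its requests on the high-quality devices, which is the behavior the regret analysis above is meant to capture.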

5 Performance Evaluation

In this section, we present experimental results that evaluate the algorithm by comparing the performance of the proposed data retrieval strategy with three baseline algorithms.

5.1 Experiment

Data Set. We use the Twenty Newsgroups dataset in our experiments, which was published by Ken Lang in 1995 in the Proceedings of the 12th International Conference on Machine Learning [7]. It is currently an international standard dataset for text classification, text mining, and information retrieval research.

Table 1. The information of the dataset.

News group          | Node number | Number of documents
alt.atheism         | 0           | 1000
comp.graphics       | 1           | 1000
comp.windows.x      | 2           | 1000
misc.forsale        | 3           | 1000
rec.autos           | 4           | 1000
rec.sport.baseball  | 5           | 1000
sci.electronics     | 6           | 1000
sci.space           | 7           | 1000
talk.politics.guns  | 8           | 1000
talk.religion.misc  | 9           | 1000

The original dataset consists of approximately 20,000 newsgroup documents, evenly divided into newsgroup sets on 20 different topics. Some newsgroup topics are similar, while others are completely unrelated. We collected and organized the relevant data in the dataset, extracting 10 groups of news topics; each group includes one topic name and 1000 documents. The aim is to effectively evaluate the impact of our proposed Bandit-TDRetrieval mechanism on the training efficiency of neural network models. The specific information of the dataset is shown in Table 1.

Baseline. We compare our algorithm with the following three reference algorithms: (1) Random [5]: the Random algorithm randomly selects N end devices to request; it provides the lowest retrieval efficiency. (2) Close [21]: the Close algorithm selects the end devices closest to the edge server when retrieving data. (3) Upper Confidence Bound (UCB) [2]: the UCB algorithm computes the upper limit of the confidence interval for each end device, and then selects the end device with the highest upper confidence limit for data retrieval each time.

Evaluation Metrics. The Bandit-TDRetrieval scheme is evaluated using indicators such as precision and recall. For ease of understanding, Table 2 describes the symbols used in the evaluation indices; the calculation formulas are given below.

Table 2. Description of symbols.

Symbol | Description
TP     | Number of optimal end devices contained in the selected set
TN     | Number of non-optimal end devices contained in the non-selected set
FP     | Number of non-optimal end devices contained in the selected set
FN     | Number of optimal end devices contained in the non-selected set
R_T^1  | Cumulative regret of accuracy
R_T^2  | Cumulative regret of response delay
B      | Number of hit end devices
R      | Total number of end devices

The formula for precision is

$$Precision = \frac{TP}{TP + FP} \qquad (8)$$

The formula for recall is

$$Recall = \frac{TP}{TP + FN} \qquad (9)$$

The formula for accuracy is

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (10)$$

The formula for the harmonic mean (F1 score) is

$$F_1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} \qquad (11)$$

The formula for the hit rate is

$$H = \frac{B}{R} \times 100\% \qquad (12)$$
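With invented confusion-matrix counts, Eqs. (8)–(12) translate directly into code (a sketch for illustration, not part of the evaluation pipeline):

```python
def metrics(TP, TN, FP, FN):
    """Eqs. (8)-(11) over a confusion matrix of end-device selections."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    f1 = 2 / (1 / precision + 1 / recall)   # harmonic mean, Eq. (11)
    return precision, recall, accuracy, f1

def hit_rate(B, R):
    """Eq. (12): percentage of hit end devices among all end devices."""
    return B / R * 100

# Invented counts for illustration.
p, r, acc, f1 = metrics(TP=40, TN=30, FP=20, FN=10)
print(round(p, 3), round(r, 3), round(acc, 3), round(f1, 3))
print(hit_rate(B=60, R=100))
```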

5.2 Results Analysis

We ran experiments using the baseline algorithms and the Bandit-TDRetrieval algorithm, and evaluated the final results using the evaluation metrics described in Sect. 5.1 to verify the performance of our algorithm.

Table 3. The information of results.

Evaluation metric | Random | Close  | UCB    | Bandit-TDRetrieval
Precision         | 0.5236 | 0.5405 | 0.5512 | 0.5781
Recall            | 0.4236 | 0.4394 | 0.4631 | 0.5261
F1                | 0.6873 | 0.7017 | 0.7107 | 0.7327
Accuracy          | 0.7165 | 0.7451 | 0.7901 | 0.8349
Cumulative regret | 0.7349 | 0.6873 | 0.5276 | 0.3351

Table 3 shows the results for these metrics. Overall, among the four approaches, Bandit-TDRetrieval achieves the best results. The precision of neural network model training under the Bandit-TDRetrieval scheme reaches 0.5781, while the precision of the Random, Close, and UCB algorithms is lower. Its advantage on the other metrics is less pronounced, but on cumulative regret Bandit-TDRetrieval has a clear advantage over the Random, Close, and UCB algorithms.

Fig. 3. The results of accuracy.

In an edge network environment with the same number of end devices, we evaluate the accuracy, recall, response delay, and hit rate of model training under Random, Close, UCB, and Bandit-TDRetrieval, as shown in Figs. 3, 4, 5 and 6. As the number of end devices increases, the accuracy, recall, and hit rate of the Bandit-TDRetrieval strategy tend to stabilize and remain higher than those of the Random, Close, and UCB algorithms. In particular, once the number of end devices reaches 200, accuracy and recall show a clear advantage over the other algorithms.


Fig. 4. The results of recall.

Fig. 5. The results of delay.


Fig. 6. The results of hit rate.

6 Conclusion

Aiming at the deployment of end devices in an edge network environment, this paper proposes a Bandit-based in-network training data retrieval mechanism for edge computing. To simplify the model, this study assumes that each end device has the same computing power and the same task size. First, candidate end devices are recommended through the bandit algorithm: according to the reward distribution of each end device, the best end devices are selected to form the recommended set. Then, the Top-k sorting algorithm is used to further determine the most suitable end devices for deployment; this process is repeated k times to ensure optimal recommendations and future profitability. Finally, experimental comparison with the Random, Close, and UCB methods shows that the Bandit-TDRetrieval mechanism presented in this research can effectively improve the training efficiency of deep neural networks in edge networks, enriching the methods for improving deep neural network training efficiency. In future work, we will further optimize the deployment problem of end devices.

Acknowledgments. This work was supported by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (SKLNST-2020-1-18), the National Science Foundation of China (61962045, 62062055, 61902382, 61972381), the Science and Technology Planning Project of Inner Mongolia Autonomous Region (2019GG372), the Science Research Project of Inner Mongolia University of Technology (BS201934), and the Basic Scientific Research Expenses Program of Universities directly under Inner Mongolia Autonomous Region (JY20220273).

References

1. Agrawal, S., Goyal, N.: Analysis of Thompson sampling for the multi-armed bandit problem. In: Conference on Learning Theory. JMLR Workshop and Conference Proceedings, pp. 39–1 (2012)
2. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2), 235–256 (2002)
3. Azar, J., Makhoul, A., Barhamgi, M., Couturier, R.: An energy efficient IoT data compression approach for edge machine learning. Futur. Gener. Comput. Syst. 96, 168–175 (2019)
4. Cui, Q., Gong, Z., Ni, W., Hou, Y., Chen, X., Tao, X., Zhang, P.: Stochastic online learning for mobile edge computing: learning from changes. IEEE Commun. Mag. 57(3), 63–69 (2019)
5. Gallo, M., Kauffmann, B., Muscariello, L., Simonian, A., Tanguy, C.: Performance evaluation of the random replacement policy for networks of caches. ACM SIGMETRICS Perform. Eval. Rev. 40(1), 395–396 (2012)
6. Huirong, M., Chen, X., Zhi, Z., Shuai, Yu.: Dynamic task offloading for mobile edge computing driven by green energy. Comput. Res. Dev. 57(9), 1823–1838 (2020)
7. Lang, K.: NewsWeeder: learning to filter netnews. In: Machine Learning Proceedings 1995, pp. 331–339. Elsevier (1995)
8. Li, C., Qianqian, C., Luo, Y.: Low-latency edge cooperation caching based on base station cooperation in SDN based MEC. Expert Syst. Appl. 191, 116252 (2022)
9. Ling, W., Chuge, W., Wenhui, F.: A review of edge computing resource allocation and task scheduling optimization. J. Syst. Simul. 33(3), 509 (2021)
10. Liu, L., Chen, X., Zhaoming, L., Wang, L., Wen, X.: Mobile-edge computing framework with data compression for wireless network in energy internet. Tsinghua Sci. Technol. 24(3), 271–280 (2019)
11. Liu, Y., He, Q., Zheng, D., Xia, X., Chen, F., Zhang, B.: Data caching optimization in the edge computing environment. IEEE Trans. Serv. Comput. (2020)
12. Shaofei, L., Xia, Q., Tang, X., Zhang, X., Yingping, L., She, J.: A reliable data compression scheme in sensor-cloud systems based on edge computing. IEEE Access 9, 49007–49015 (2021)
13. Park, J., Samarakoon, S., Bennis, M., Debbah, M.: Wireless network intelligence at the edge. Proc. IEEE 107(11), 2204–2239 (2019)
14. Qiu, S., Li, A.: Application of chaos mutation adaptive sparrow search algorithm in edge data compression. Sensors 22(14), 5425 (2022)
15. Tak, A., Cherkaoui, S.: Federated edge learning: design issues and challenges. IEEE Netw. 35(2), 252–258 (2021)
16. Tian, Y., Jun, Y.: A deep reinforcement learning method for offloading decision and resource allocation in MEC. Comput. Eng. (2021)
17. Wang, G., Li, C., Huang, Y., Wang, X., Luo, Y.: Smart contract-based caching and data transaction optimization in mobile edge computing. Knowl.-Based Syst. 252, 109344 (2022)
18. Wang, J.-B., Zhang, J., Ding, C., Zhang, H., Lin, M., Wang, J.: Joint optimization of transmission bandwidth allocation and data compression for mobile-edge computing systems. IEEE Commun. Lett. 24(10), 2245–2249 (2020)
19. Xia, X., Chen, F., He, Q., Grundy, J., Abdelrazek, M., Jin, H.: Online collaborative data caching in edge computing. IEEE Trans. Parallel Distrib. Syst. 32(2), 281–294 (2020)
20. Xia, Y., Li, H., Qin, T., Yu, N., Liu, T.-Y.: Thompson sampling for budgeted multi-armed bandits. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
21. Zhang, X., Zhu, Q.: Distributed mobile devices caching over edge computing wireless networks. In: 2017 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 127–132. IEEE (2017)

EduChat: AI-Powered Chatbot with Personalized Engagement for Online Learning

Rabiah Abdul Kadir1(B), Mohamad Fairus Zulkifli1, Sabrina Binti Tiun2, Mohd Modi Lakulu3, Shaidah Jusoh4, and Ahmad Faridz Ahmad Faudzi5

1 Institute of IR4.0, The National University of Malaysia, Bangi, Malaysia
[email protected]
2 Center for Artificial Intelligence Technology (CAIT), The National University of Malaysia, Bangi, Malaysia
3 Faculty of Arts, Computing & Creative Industries, Sultan Idris Education University, Tanjung Malim, Malaysia
4 School of Electrical Engineering & Artificial Intelligence, Xiamen University Malaysia, Bangi, Malaysia
5 Xplode Media Sdn. Bhd, Bangi, Malaysia

Abstract. The transformation to fully online learning during COVID-19 encouraged the use of social media platforms such as WhatsApp and Telegram in the learning environment. However, social media platforms have some limitations, including the loss of, and difficulty in keeping track of, messages from instructors due to the lack of personalized engagement; the disappearance of information or learning materials due to time-sharing limitations; and limits on the size of files that users can share. Today, chatbots are opening doors to new ways of acquiring specific information about services. Chatbots are able to provide immediate answers, information, and advice without the user ever having to converse with a human; they improve communication with prospective queries and push the business forward. However, responding to questions that do not match the specified keywords remains an open problem: further details for finding proper answers, such as links or elements of AI approaches, should be provided instead of leaving those questions unanswered. This paper addresses the above issues by developing a novel educational chatbot known as EduChat, which is able to provide a personal assistant and appropriate information or learning materials for dynamic students' needs. A Natural Language Processing (NLP) pipeline and a Support Vector Machine (SVM) model for sentiment analysis are implemented in the development of EduChat. The AI element, supported by an ontology, enables EduChat to provide personal-assistant messaging. Keywords: Chatbot · EduChat · Natural language processing · Support vector machines

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 589–597, 2024. https://doi.org/10.1007/978-3-031-47715-7_40


R. A. Kadir et al.

1 Introduction

A chatbot is an application or software programmed to respond to and interact with human conversation, typically through text chat. Chatbot, short for chatterbot, is an artificial intelligence feature that can be embedded in many messaging applications. Its main feature is responding to human communication the way humans respond to one another. It can be used in e-commerce, social media, and e-learning platforms. There are many synonyms for chatbot, such as talkbot, bot, IM bot, interactive agent, and artificial conversational entity. In e-learning, such a chatbot is called an educhat, and it assists users on the e-learning platform. Most users in e-learning are students, and parents are often unfamiliar with the system. Technical issues can also make users quickly give up on the system. Without instant support and immediate responses from customer service, users file redundant complaints that interrupt business operations. The same issues are often raised repeatedly by the same users, and customer service must attend to them even when users could solve them on their own. An educhat can act as an assistant and guide, so that common and recurring issues can be resolved by it.

Education is a field that demands dynamic change to keep up with the rapid pace of modern life. Modern education needs bots that can communicate with students on all subjects at the elementary, secondary, high school, and university levels. This project names its chatbot in the education sector EduChat. EduChat responds to human chat or conversation and can analyze every message sent to it. Each analysis result is returned to the user as a specific answer. The answers produced by EduChat take the form of solutions, guidance, and materials needed by the user as references for their studies.

In order to contribute to existing research in this area, the aim of this paper is to provide a personal assistant and appropriate information or learning materials for dynamic students' needs by developing a chatbot for the education sector known as EduChat. In the second section, the chatbots' relevance for the e-business context is defined based on their key characteristics, followed by Sect. 3, which describes the development of EduChat using a combination of Robotic Process Automation (RPA) and Rapid Application Development (RAD). Section 4 discusses conversations involving personalization features and the quality of user support. Finally, the central findings of this paper are summarized in a concluding discussion in Sect. 5, where directions for future research and practice are pointed out.

2 Related Work

Starting in 2020, the outbreak of COVID-19 forced education in Malaysia to close schools and campuses and to initiate online learning. Online learning is a learning environment with no physical location, in which instructors and students are separated by space. The introduction of online learning environments into education due to this situation has brought a big change in the learning experience. However, some common issues should be considered, such as what the best learning tools or platforms are, what level of realism and interaction is required, and how best to design experiences for learners.

EduChat: AI-Powered Chatbot with Personalized


The main issues involve the technologies used in online learning environments, especially for handling the interaction between students and instructors. There is a variety of ways to interact with online learners, from brief informal recordings to short messages and discussions among learners. Social media platforms are usually used for interaction among people, where everybody can share, exchange, comment, discuss, and create information and knowledge in a collaborative way [1]. Social media platforms are also used in education, and it has been shown that social media improves the teaching and learning process [9]. However, there is a limited range of created content on social media, where instructors and students tend to rely on shared materials [2], and those platforms limit the size of files that users can send: WhatsApp, for instance, limits any media file (photo, music, or video) to 16 MB and documents to 128 MB [10]. A few studies have also pointed out problems of privacy, unnecessary messages, and cyberbullying associated with the use of social networking tools in academic settings; student learning and performance are affected by receiving unwanted messages and pictures [3, 11]. At the end of 2020, WhatsApp introduced a new feature that allows users to enable disappearing messages in chat conversations, automatically deleting messages after seven days [12]. This feature requires the admin, i.e., the instructor, to enable disappearing messages; what happens if the instructors are unavailable or have networking problems? This paper addresses the above issues by developing a novel intelligent personal assistant that answers queries and performs actions through a natural language user interface, known as a chatbot. It is not necessary for the instructors to be present regularly, since chatbots are available 24/7.
2.1 A Chatbot

A chatbot is a software application designed to simulate human-like conversations with users via text or text-to-speech messages, behaving as a conversational partner. The key task of a chatbot is to help users by providing answers to their questions [4]. Artificial intelligence (AI), natural language processing (NLP), and machine learning are the technologies used in chatbots [5, 6]. There are two types of chatbot technology: (i) rule-based chatbots and (ii) AI-based chatbots. Figure 1 shows a general chatbot architecture.

2.2 Chatbot in Education

One of the major challenges of this information age is digitalizing education and reinventing the learning experience. Squirrel AI Learning by Yixue Group, founded in 2014, simulates a human teacher by giving the student a personalized learning plan and one-on-one tutoring [13]. Recently, Artificial Intelligence (AI) has been rapidly changing corporate Learning and Development (L&D), with chatbots proving to be incredibly useful learning tools. In the L&D field, an educational chatbot acts as a Virtual Teaching Assistant (VTA) and can provide point-of-need online learning support by delivering content such as documents, links, images, and videos.


Fig. 1. Architecture of chatbot.

Educational chatbots can be defined as automated conversational chatbots that are designed with a specified educational purpose [7]. Chatbots provide quick and easy access to information and instant access to expert knowledge and advice at all times. In the field of e-learning, a chatbot has shown interesting potential as part of education, both as a teaching and an administrative tool. The following are existing features of chatbots in conversational online learning environments [8]: provide services such as class scheduling, course self-registration, robot tutoring, and assignments; use natural language to perform tasks and help students 24/7; provide a means of contact with the teacher; focus on answering certain types of questions or surveys; and work in context, knowing what the user wants to do, providing intelligent answers, and holding topic-specific and general conversations. Additional features will be considered in this project so that the chatbot delivers more personalized experiences, a more natural experience, and unique personalities.

3 EduChat Design and Development

EduChat has been developed using a combination of Robotic Process Automation (RPA) and Rapid Application Development (RAD). RPA is a technology that allows developers to configure computer software and integrate the actions of a human interacting within digital systems to execute a business process, whereas RAD is actively applied in prototyping development, creating test cases, and performing unit tests. The technical aspects of developing a chatbot depend on the features of EduChat, which were identified in a coherent manner. The EduChat Personalized Dialogue Manager is the main component, performing as the brain of the AI-powered EduChat solution. The Dialogue Manager assigns meaning to the entities recognized by the natural language understanding component. Each query from the students is answered with an appropriate phrase, relevant information, or learning material to assist the students personally. This component is implemented using NLP with text classification based on an SVM model for sentiment analysis. An algorithm for natural language ontology representation is implemented for natural


language understanding. Figure 2 shows the architecture of EduChat, which is considered the heart of the chatbot development.

Fig. 2. Architecture of EduChat.

The EduChat Personalized Dialogue Manager consists of several sub-components to support the implementation. The flowchart is shown in Fig. 3.

3.1 NLP Pre-processing

In the common NLP pre-processing steps, the text is processed and converted into the same case, preferably lower case. The query is then split into smaller units, known as word tokenization, followed by stop-word removal, since stop words add no value to the analysis. Stemming reduces words to their root form. Finally, lemmatization ensures that a word does not lose its meaning.

3.2 Semantic Knowledge Representation

The design of EduChat is equipped with a comprehensive ontology-based knowledge representation that is able to spot synonyms of words and facts in the query. The ontology model is developed to support query understanding so as to provide relevant direct messages and the right learning material to the student.

3.3 AI-Engine

The first step of the AI-Engine sub-component loads the dataset (queries) and splits it into training and validation sets. The next step is feature engineering, in which the raw dataset is transformed into flat features that can be used in an SVM model; this step also includes creating new features from the existing data. In the final step, an SVM model is trained on a labelled dataset for the purpose of sentiment analysis. Sentiment analysis in this project allows automated analysis of all forms of text for the feelings and emotions of the students.
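The pre-processing chain of Sect. 3.1 can be sketched in plain Python (an illustrative sketch, not EduChat's implementation; the stop-word list and the suffix-stripping stemmer are toy stand-ins for a full NLP library, and lemmatization is omitted):

```python
import re

# Toy stop-word list for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "to", "of", "and", "for", "in"}

def preprocess(query):
    # 1. Case folding: convert everything to lower case.
    text = query.lower()
    # 2. Word tokenization: split the query into word units.
    tokens = re.findall(r"[a-z0-9']+", text)
    # 3. Stop-word removal: drop words that add no analytical value.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Toy stemming: strip a few common suffixes to reach a root form.
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("Where is the recording of Lecture 3 for Databases?"))
```

In a production pipeline these steps would be replaced by a proper tokenizer, stop-word list, stemmer, and lemmatizer, and the resulting tokens would feed the feature-engineering step of the AI-Engine described in Sect. 3.3.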


Fig. 3. EduChat personalized dialogue manager.


4 Steps and Sample Scripts of EduChat

EduChat provides four different types of conversations: greetings, interactions, general, and material. In response to a user's query, EduChat greets, interacts, and holds a general conversation based on templates. To provide a personalized experience, NLP, supported by a natural language ontology representation, is used to understand the query, and SVM is used to diagnose the category of conversation. A conversation involving material requires a different approach to responding to the user's query, as shown in Fig. 4.
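The template-versus-material split can be sketched as a small router (a purely hypothetical illustration: the template texts, the material catalogue, and the link are all invented, and real category diagnosis would come from the SVM classifier rather than being passed in directly):

```python
# Hypothetical category router: greeting/interaction/general replies come
# from templates, while "material" queries trigger a resource lookup.
TEMPLATES = {
    "greeting": "Hello! How can I help you with your studies today?",
    "interaction": "Sure, let's work through that together.",
    "general": "Here is some general information about the course.",
}
MATERIALS = {"database": "https://example.edu/notes/databases.pdf"}  # invented

def respond(category, query=""):
    if category == "material":
        # Material conversations need a lookup instead of a canned template.
        for topic, link in MATERIALS.items():
            if topic in query.lower():
                return f"Here is the material you asked for: {link}"
        return "I could not find that material; I will forward your request."
    return TEMPLATES.get(category, "Could you rephrase that?")

print(respond("greeting"))
print(respond("material", "Can I get the Database lecture notes?"))
```

The design point is that only the material branch consults a data source; the other three categories can be answered immediately from templates, which keeps most responses instantaneous.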

Fig. 4. Flow Diagram of material conversation for references.

EduChat provides personalization and quality to support new users and retain existing ones. Users feel valued when their name is used in the welcome message and when references are made to their location if they are returning users; first-time users can instead be introduced to products or services they might find useful. The welcome message can also be personalized depending on which page the user is on, such as a product landing page or a general page. Figure 5 shows a series of sample scripts of EduChat.

5 Discussion and Conclusion

Chatbots work in all business sectors, such as e-commerce, education, entertainment, public institutions, healthcare, and more. The recent advancements in Artificial Intelligence (AI) are promising to the extent that chatbots are replacing humans in user and customer service. This project proposes the AI-powered EduChat, which utilizes AI in real-life conversation by implementing an SVM model for text classification, a fundamental task in Natural Language Processing, for the purpose of sentiment analysis.


Fig. 5. Series of conversation of EduChat: greeting and materials.

The implementation of the Artificial Intelligence (AI) algorithm is supported by an ontology-based knowledge representation for dealing with synonyms of words and facts in the query. This approach supports query understanding and provides a personalized experience in which students receive direct messaging and the specified topics or learning materials, focusing on the students' requirements and needs. EduChat offers personalized experiences and is given different personalities to meet the expectations of different student needs and demographic data. The database-driven design of EduChat allows students to communicate with a dedicated database and provides accessibility functions with seamless scalability for obtaining appropriate information and learning materials. Last but not least, EduChat provides a warm welcome message to students to foster a good experience. EduChat helps improve the overall efficiency of teachers and online learning providers. By tutoring students in a virtual welcoming environment, answering all their common queries, providing them with learning material, evaluating their work, and mapping their progress to give individual feedback, EduChat plays the role of the perfect teacher's assistant. The results of this project open an exciting prospect: ensuring that students will be able to access a mix of online and off-campus educational opportunities designed with their skill set in mind and delivered at the time most convenient for them to study. In the future of education, technology will change the way students learn information and the teachers' teaching methods. The AI approach in EduChat enables teachers to provide


their students with a personalized learning environment. Generally, chatbots have great functionality and speed. They are helpful in every sector, and so they are in education.

Acknowledgment. The authors would like to thank Universiti Kebangsaan Malaysia for funding this research work through the Transitional Grant Scheme (Project Code: UKM-TR-024). The authors would also like to extend their acknowledgement of the facilities of Xiamen University Malaysia and Xplode Media Sdn. Bhd.


Q8KNN: A Novel 8-Bit KNN Quantization Method for Edge Computing in Smart Lighting Systems with NodeMCU

Aji Gautama Putrada1(B), Maman Abdurohman2, Doan Perdana1, and Hilal Hudan Nuha2

1 Advanced and Creative Networks Research Center, Telkom University, Bandung 40257, Indonesia
{ajigautama,doanperdana}@telkomuniversity.ac.id
2 School of Computing, Telkom University, Bandung 40257, Indonesia
{abdurohman,hilalnuha}@telkomuniversity.ac.id

Abstract. The edge computing concept emerged, for example in smart lighting control systems, because the computing system proposed by the cloud concept often causes delays. The problem is that Internet of things (IoT) end devices such as NodeMCU, the target devices for placing edge computing programs, have limited resources. This paper proposes Quantized 8-bit K-Nearest Neighbor (Q8KNN), a novel quantization method that performs model compression on KNN, with smart lighting control systems using NodeMCU as our case study. First, we present our proposed novel and accurate smart lighting design, which uses edge computing and KNN models to predict the control data. Then we develop a quantization method to make the KNN model fit into the NodeMCU. Finally, we test the performance of our compression model using the Accuracy and Compression Ratio (CR) metrics. The test results show that the change in the number of unique values due to quantization does not cause a cardinality problem. Q8KNN provides a CR of up to 1.6×. With a smaller model size, Q8KNN can fit more training samples into the NodeMCU memory; at the same model size, the original KNN gives an accuracy of 94% while Q8KNN provides up to 98%.

Keywords: K-nearest neighbor · Quantization · Smart lighting · Edge computing · NodeMCU · Model compression

1 Introduction

The smart lighting system is an emerging topic whose research volume increases every year [30]. Initially, the focus of smart lighting research was on efficiency, where implementing smart lighting can bring savings of up to 25% compared to conventional lighting [2]. However, some offices find that smart lighting causes discomfort because the installed smart lighting control often makes wrong decisions [32]. Several studies have shown that machine learning, specifically k-nearest neighbor (KNN) with virtual sensors, can raise the accuracy of smart lighting control [29]. The smart lighting system becomes a cloud system consisting of a machine learning predictor and an Internet of things (IoT) system [8].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 598–615, 2024. https://doi.org/10.1007/978-3-031-47715-7_41

The edge computing concept was introduced because the computing system proposed by the traditional cloud concept often causes delays [4]. However, IoT end devices such as NodeMCU, the target for embedding edge computing programs, have limited resources [39]. On the other hand, KNN is well known as a "lazy learner" because of its almost non-existent training process; its prediction model directly uses the training data, which causes the model to grow with the amount of training data [33]. Model compression is a research topic on shrinking a machine learning model to fit into a resource-limited IoT end device without compromising the model's performance [7]. Several studies have investigated model compression techniques. For example, Mary et al. [21] used a technique called pruning, which cuts nodes and weights of an artificial neural network model to reduce the model's size without reducing its accuracy. Jang et al. [17] used knowledge distillation for model compression in edge computing systems, where a teacher model teaches a student model, resulting in a student model of smaller size. Quantization reduces a model's size by changing its data types; for example, Prakash et al. [28] used federated learning that combines pruning and quantization so that a DNN model can run on end devices such as smartphones and drones.
This study proposes quantized 8-bit KNN (Q8KNN), a novel quantization method for model compression of KNN in a case study of smart lighting control using NodeMCU. First, we present our proposed novel and accurate smart lighting design. The design uses edge computing and KNN models to predict the control. Then we build a quantization method to make the KNN model fit into the NodeMCU. Finally, we use the Accuracy and compression ratio (CR) metrics to test the performance of our compression model. To the best of our knowledge, there has been no study that applies a compression model to KNN using quantization running on NodeMCU as part of an edge computing system. Our research contributions are as follows:

1. A novel smart lighting system based on edge computing utilizing NodeMCU
2. A KNN method in an edge computing environment for high-accuracy presence detection
3. A novel quantization method for KNN that significantly reduces the model size while maintaining its accuracy

The remainder of the paper is organized as follows: Sect. 2 discusses previous papers related to this paper. Section 3 presents our proposed design comprehensively. Section 4 reports our proposed method's performance and discusses it with respect to state-of-the-art methods. Finally, Sect. 5 concludes this research.

2 Related Works

Several studies have tried to apply quantization. Fan et al. [12] used the int4 and int8 data types to replace the float data type in deep learning models and achieved up to CR = 8×. Gong et al. [14] added a vectorization function to quantization in a deep neural network (DNN) to maintain the accuracy that usually drops when quantization is applied. Polino et al. [27] combined quantization with distillation in DNNs and obtained the optimum model between compression rate and accuracy. The research opportunity is the application of quantization to methods other than DNNs, such as KNN. Several studies have implemented edge computing with NodeMCU. Deab et al. [9] implemented edge computing with NodeMCU for watergate management in reservoirs; the NodeMCU controlled sensors such as water level, temperature, humidity, rain, and water temperature. Sharma et al. [36] used NodeMCU in an edge computing system as a LoRa bridge, where edge computing runs on LoRa devices that perform predictions for forest monitoring. Kumar [19] built an edge computing system in which the NodeMCU runs a body temperature sensor. The research opportunity is to build edge computing with NodeMCU where the NodeMCU runs KNN-based predictions. In terms of compressing the KNN model, Salvador et al. [35] applied KNN data compression by dividing the data into smaller data types, but this method increases the processing time of the KNN prediction. The research opportunity is to apply quantization as the compression model and deploy it in edge computing. Table 1 summarizes the comparison of state-of-the-art studies with our proposed novel method.

Table 1. State-of-the-art comparison on edge computing model compression

Cite                  | Model compression | Quantization | KNN | Edge computing
Fan et al. [12]       | ✓                 | ✓            |     |
Gong et al. [14]      | ✓                 | ✓            |     |
Polino et al. [27]    | ✓                 | ✓            |     |
Deab et al. [9]       |                   |              |     | ✓
Sharma et al. [36]    |                   |              |     | ✓
Kumar et al. [19]     |                   |              |     | ✓
Salvador et al. [35]  | ✓                 |              | ✓   |
Proposed method       | ✓                 | ✓            | ✓   | ✓

3 Proposed Design

Figure 1 shows our proposed research methodology. First, we describe our smart lighting system that utilizes intelligence based on KNN. Then we create an edge computing system where the intelligent model runs on the end device in the form of NodeMCU. The next step is to design how NodeMCU embeds the model with 8-bit quantization. Finally, we evaluate the performance of the model and report the results.

Fig. 1. Our research methodology

3.1 Smart Lighting System with NodeMCU Edge Computing Concept

Figure 2 shows our proposed novel smart lighting control [29]. A smart lighting environment consists of a light-emitting diode (LED) lamp and an IoT end device. The IoT end device consists of components, including NodeMCU, PIR sensors, and relays. The PIR sensor captures movement data. A novel KNN model called classification-integrated moving average (CIMA) converts movement data into human presence data. The human presence data becomes a reference for light control. The relay performs automatic light control directly to the LED.

Fig. 2. The novel smart lighting control with KNN prediction


With this novel smart lighting control, through data collection in a real working environment, we obtained a dataset consisting of 56,729 data items [31]. The dataset contains six features and one output. Table 2 describes each feature and the output. The dataset output, human presence, is binary: we give a value of 0 if no one is present and 1 if a person is present.

Table 2. Dataset explanation

ID | Name           | Feature/Output | Explanation
1  | Movement data  | Feature        | 0 if no human movement, 1 if human movement detected
2  | MA200          | Feature        | Moving average of Feature 1 with 200 data points
3  | MA300          | Feature        | Moving average of Feature 1 with 300 data points
4  | MA400          | Feature        | Moving average of Feature 1 with 400 data points
5  | MA500          | Feature        | Moving average of Feature 1 with 500 data points
6  | MA600          | Feature        | Moving average of Feature 1 with 600 data points
7  | Human presence | Output         | Manually annotated data of human presence

Figure 3 shows our proposed edge computing concept. The proposal also shows the complete IoT structure: end device, platform, and application layer [5]. The smart lighting environment, representing the end device, contains the NodeMCU, the main control tool. Running on the NodeMCU is the proposed Q8KNN, emphasizing that the KNN-based control method runs directly in the smart lighting environment without connecting to the Internet. However, a connection to a cloud system via the Internet remains; its main role is to provide interfaces for remote users, who can perform remote manual control and monitoring. This design also highlights the importance of quantization so that KNN can run on the NodeMCU.

Fig. 3. The edge computing design which contains the proposed Q8KNN model

In quantizing the KNN to fit in the NodeMCU, it is important to know the computational specifications of the NodeMCU. We have two environments for developing Q8KNN: the development environment and the deployment environment. We use a personal computer (PC) for development, while the deployment environment is the NodeMCU. Table 3 summarizes the specifications of both environments. NodeMCU uses a system-on-chip (SoC) named ESP8266, a microcontroller with a Wi-Fi module and TCP/IP protocol developed by Espressif, based in Shanghai, China [23]. There are two main memories in NodeMCU. The first is flash memory, where NodeMCU stores the firmware or sketches in which the user writes programs. The second is static random access memory (SRAM), which stores the variables in the program. Flash memory has a size of 4 MB, while SRAM has 64 kB. Of the flash capacity, the user can use 1 MB for the program, and there is a total capacity of 82 kB for global variables, dynamic memory, and local variables [26]. If the SRAM capacity is exhausted, the symptom is that programs can still be uploaded but run strangely or stop running [1]. Furthermore, our PC has an Intel I5 central processing unit (CPU) running at 2.4 GHz, equipped with a 477 GB solid state disk (SSD) and 8.00 GB of random access memory (RAM). The hard disk uses the new technology file system (NTFS) format, where the maximum file size is 16 TB [10]. The heap size for storing the training samples of the KNN model in Python has no fixed maximum limit [18]; the maximum variable size equals the RAM capacity.

3.2 The Q8KNN Model Compression Method

The basis of Q8KNN is the KNN model. KNN classifies new data based on the majority class of the K data points closest to the new data [11]. There are several methods to calculate the distance, including the Minkowski, Manhattan, and haversine methods [16]. Nevertheless, the most common method is the Euclidean distance [24].
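A brute-force KNN built on the Euclidean distance can be sketched as follows. This is a didactic illustration under our own assumptions (majority vote, k = 3), not the authors' NodeMCU implementation:

```python
# Brute-force KNN with Euclidean distance and majority vote.
# Didactic sketch; train_x/train_y are toy data, not the paper's dataset.
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    # Distance from x to every training sample, paired with that sample's label
    dists = [
        (math.sqrt(sum((xi - si) ** 2 for xi, si in zip(x, sample))), label)
        for sample, label in zip(train_x, train_y)
    ]
    dists.sort(key=lambda d: d[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_x = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
train_y = [0, 0, 1, 1]
print(knn_predict(train_x, train_y, [0.95, 0.95]))  # 1
```

Because the "model" is just the stored training set, its memory footprint grows linearly with the number of samples, which is exactly what quantization later attacks.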

Table 3. Development and deployment environment specifications

Parameter             | PC                  | NodeMCU
Purpose               | Development         | Deployment
SoC                   | Intel I5, 2.40 GHz  | ESP8266, 80 MHz
Secondary memory      | SSD, 477 GB         | Flash memory, 4 MB
Main memory           | RAM, 8.00 GB        | SRAM, 64 kB
Programming language  | Python              | C++
Maximum program size  | 477 GB              | 1 MB
Maximum variable size | 8.00 GB             | 82 kB

The Euclidean distance formula for calculating the distance d that determines the KNN class is as follows:

d = \sqrt{\sum_{i \in F} (x_i - \hat{x}_i)^2}  (1)

where F is the set of features, x is the data whose class is unknown, and \hat{x} is a member of the training sample. The first step toward quantization is normalization, because the scale of the distances between data points strongly affects the performance of KNN with Euclidean distance [37]. The formula for normalization is as follows:

x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}  (2)

where x_{norm} is the normalized data, x is the data before normalization, x_{min} is the data item with the smallest value, and x_{max} is the data item with the largest value. In Q8KNN, quantization converts 64-bit floating-point data into 8-bit integer data. Floating-point data is decimal, and preserving the fractional value of each data item matters when converting the data to integers, so here we perform a second normalization. Our dataset is asymmetric because it contains no negative values. The first step of quantization is to find the value of Scale. For an asymmetric dataset, the formula is as follows:

Scale = \frac{Max}{255}  (3)

where Max is the highest value in the scaled dataset and 255 is the maximum value of the unsigned 8-bit data type. The formula for applying quantization is then:

x_q = \mathrm{round}\left(\frac{x_{norm}}{Scale}\right)  (4)

where x_q is the quantized value. The round operation rounds to the nearest integer (zero decimal places).

3.3 Performance Metrics

Our two main measures are Accuracy and CR. In addition, we also compare the Precision, Recall, F1-Score, and G-Mean of the two best-performing models that can run on NodeMCU. Accuracy compares correctly predicted data with all data. Precision shows the model's predictive ability over the positive predictions (true positives and false positives). Recall shows the model's predictive ability over the actual positive data (true positives and false negatives). F1-Score is the harmonic mean of Precision and Recall [13]. G-Mean is the geometric mean of Precision and Recall; it ensures that all classes are predictable, because if any class is never predicted the result is G-Mean = 0 [20]. The formulas are as follows:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}  (5)

Precision = \frac{TP}{TP + FP}  (6)

Recall = \frac{TP}{TP + FN}  (7)

F1\text{-}Score = 2 \times \frac{Precision \cdot Recall}{Precision + Recall}  (8)

G\text{-}Mean = \sqrt{Precision \times Recall}  (9)

where TP stands for true positive, TN stands for true negative, FP stands for false positive, and FN stands for false negative. CR is the ratio between the original model size and the compressed model size. Several studies have shown that C++ is more memory-efficient than Python [25]. We have two compression scenarios: first with the Python programming language and second with C++. The Scikit-Learn (SKLearn) library in Python directly generates independent KNN modules. The CR formula for the Python programming language (CR_P) compares the original KNN model with the Q8KNN model:

CR_P = \frac{MS_{KNN}}{MS_{Q8KNN}}  (10)

where MS_{KNN} is the KNN model size in Python and MS_{Q8KNN} is the Q8KNN model size in Python. For C++, we created our own model, which is not an independent model but is integrated into the main program. The program allocates each training sample as a dynamic memory allocation [15]. So, for the C++ compression case, we calculate the ratio of the dynamic memory size of each model:

CR_C = \frac{DM_{KNN}}{DM_{Q8KNN}}  (11)

where DM_{KNN} is the KNN dynamic memory allocated by the C++ compiler and DM_{Q8KNN} is the Q8KNN dynamic memory allocation.
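The metrics of Eqs. (5)-(9) and the compression ratios of Eqs. (10)-(11) reduce to a few lines of code. The sketch below uses invented confusion-matrix counts; the 57 kB/36 kB sizes echo the dynamic-memory figures reported later in the results:

```python
# Computing Eqs. (5)-(9) and the CR of Eqs. (10)-(11) from raw counts.
# Illustrative sketch; the counts below are made-up numbers.
import math

def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (5)
    precision = tp / (tp + fp)                          # Eq. (6)
    recall = tp / (tp + fn)                             # Eq. (7)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (8)
    g_mean = math.sqrt(precision * recall)              # Eq. (9)
    return accuracy, precision, recall, f1, g_mean

def compression_ratio(original_size, compressed_size):
    # Eqs. (10) and (11) share this form; only the size measure differs
    # (file size in Python, dynamic memory in C++).
    return original_size / compressed_size

acc, prec, rec, f1, gm = classification_metrics(tp=90, tn=85, fp=10, fn=15)
print(round(acc, 3), round(compression_ratio(57, 36), 1))  # 0.875 1.6
```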

4 Results and Discussion

4.1 Results

First, we show that our KNN model's performance correlates with the amount of training data involved. Figure 4 shows the growth of our KNN model's accuracy with the training sample size used. The model's accuracy increases as more training samples are used; at the maximum training sample size, the Accuracy reaches 0.997. This test result emphasizes the importance of training sample size in our KNN-based smart lighting control model.

Fig. 4. The KNN model accuracy growth by training sample size


The next step is implementing Q8KNN. Figure 5 shows the distribution of the dataset values before and after applying quantization. The distribution does not change after quantization; however, the range of values changes from [0, 1.0] to [0, 255]. These changes also affect the mean and standard deviation of each feature, and the density range changes from [0, 3.5] to [0, 0.014]. The distribution characteristics reveal two opportunities. First, there are no negative values in our dataset distribution, which is an opportunity to change the data type from float to an unsigned type. Second, after quantization the values have zero decimal places, which is an opportunity to change the data type from float to an integer type.

Fig. 5. The dataset distribution before quantization and after quantization
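The transformation behind Fig. 5 follows Eqs. (2)-(4) directly. A minimal NumPy sketch (our illustration, not the released implementation):

```python
# 8-bit asymmetric quantization per Eqs. (2)-(4): min-max normalize,
# derive the scale from the maximum, then round into uint8.
import numpy as np

def q8(x: np.ndarray) -> np.ndarray:
    x_norm = (x - x.min()) / (x.max() - x.min())      # Eq. (2)
    scale = x_norm.max() / 255                        # Eq. (3), asymmetric case
    return np.round(x_norm / scale).astype(np.uint8)  # Eq. (4)

feature = np.random.default_rng(0).random(1000)
q = q8(feature)
print(q.min(), q.max())  # 0 255
```

As in Fig. 5, only the range changes (from [0, 1] to [0, 255]); the shape of the distribution is preserved up to rounding.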

A 64-bit float can represent roughly 16 exa (1.6 × 10^19) distinct numbers, while the unsigned 8-bit integer data type has only 256 possible values. Here we evaluate whether the number of unique values decreases due to quantization. Table 4 shows the comparison. Only Feature 1 and Feature 2 retain their number of unique values; Feature 1 does not change because it is binary. In some cases, many unique values in classification can lead to severe cardinality problems [22]; we observe this through performance comparisons. We create a function that maps each original feature to its quantized feature [6], which provides a qualitative assessment of how quantization transforms the features. Figure 6 contains six graphs, each showing a snippet of the function that maps original features to quantized features. The graph shapes show that, consistent with Table 4, one unique value of a quantized feature represents several unique values of the original feature. The larger the ratio of unique values per feature, the smaller the steps.

Table 4. Comparison of unique values from each dataset feature

Parameter  | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Feature 6
Normalized | 2         | 201       | 400       | 36,119    | 40,986    | 44,491
Quantized  | 2         | 201       | 255       | 255       | 249       | 240

Fig. 6. The function that maps the original features to the quantized features for: (a) Movement data, (b) MA200, (c) MA300, (d) MA400, (e) MA500, (f) MA600

Figure 7 shows the accuracy comparison of KNN and Q8KNN by training sample size. There is no difference in accuracy between KNN and Q8KNN, and both models gain accuracy as the training sample size increases. These test results show that the changes in unique values, cardinality, value range, and density do not affect the performance of KNN.

Fig. 7. The KNN and Q8KNN performance comparison by training sample size on PC

Next, we examine the effect of quantization on the model's size. We run KNN using the SKLearn library in Python. Figure 8 compares the sizes of the models trained with the KNN and Q8KNN methods in the Python programming language on different training sample sizes. For every training sample size, Q8KNN produces a smaller model. The KNN model size grows linearly according to:

y_{KNN} = 0.00012\,x_{KNN} + 0.02  (12)

where y_{KNN} is the KNN model size in kB and x_{KNN} is the number of KNN training samples. The fitted line has r^2 = 0.99. Meanwhile, the Q8KNN model size grows linearly according to:

y_{Q8KNN} = 0.00007\,x_{Q8KNN} + 0.03  (13)

where y_{Q8KNN} is the Q8KNN model size in kB and x_{Q8KNN} is the number of Q8KNN training samples. The fitted line also has r^2 = 0.99. Of the two equations, Q8KNN has the smaller slope, so as the number of training samples in the model grows, Q8KNN will have a constant CR_P value. We prove this hypothesis next.
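Fits like Eqs. (12) and (13) come from an ordinary least-squares line through (training sample size, model size) points. The sketch below regenerates synthetic points from Eq. (12) itself to show the procedure; it uses no measured data:

```python
# Least-squares line through synthetic (samples, size) points, as in Eq. (12).
import numpy as np

samples = np.array([5000, 10000, 15000, 20000, 25000], dtype=float)
size_kb = 0.00012 * samples + 0.02   # synthetic sizes generated from Eq. (12)
slope, intercept = np.polyfit(samples, size_kb, 1)
print(round(slope, 5), round(intercept, 2))  # 0.00012 0.02
```

With perfectly linear input the fit recovers the slope and intercept exactly; on real measurements the residuals give the reported r^2.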


Fig. 8. The dataset model size comparison by training sample size on PC

We measure the CR_P of each model from the previous test. Table 5 shows the calculation results. The average CR_P across the models is 1.56, with an interquartile range (IQR) of 0.1. These results prove that Q8KNN has a roughly constant CR_P.

Table 5. The CR_P of five models with different training sample sizes

Q8KNN model | Training samples | CR_P
1           | 5,000            | 1.5×
2           | 10,000           | 1.5×
3           | 15,000           | 1.6×
4           | 20,000           | 1.6×
5           | 25,000           | 1.5×

In the next test, we do two things. First, we convert the Python library into a C++ program for the NodeMCU via the Arduino IDE [3]. Second, we test the compression performance and Accuracy of Q8KNN in the NodeMCU environment. Table 6 shows the comparison between KNN and Q8KNN on NodeMCU. The Q8KNN model in C++ with 6000 training samples has Accuracy = 0.99; however, due to the large amount of data, the program did not compile successfully on NodeMCU. The Q8KNN model with 4800 training samples compiles successfully on NodeMCU, with a slightly lower Accuracy of 0.98. The whole program size is 472 kB, approximately the Q8KNN model size in Python when applying Eq. (13), which shows that the C++ program is more memory-efficient than Python. Furthermore, KNN with the floating-point data type does not compile from 2400 training samples upward. With 1200 training samples, the Accuracy of KNN is 0.94 and the model uses 57 kB of dynamic memory; the CR_C for this model is 1.6×, close to the average CR_P from the previous test. We cannot calculate CR_C for 2400 to 6000 training samples because some models, especially those using floating point, do not compile, so there are no figures to compare. With quantization, we can increase the number of training samples to four times that of the floating-point model, so the KNN performance, which depends on the training samples, can increase.

Table 6. The accuracy and CR_C of five models with different training sample sizes on NodeMCU implementation

Training samples | KNN Accuracy | KNN dynamic memory (kB) | Q8KNN Accuracy | Q8KNN dynamic memory (kB) | CR_C
1200             | 0.94         | 57                      | 0.95           | 36                        | 1.6×
2400             | 0.97         | —                       | 0.97           | 44                        | —
3600             | 0.98         | —                       | 0.98           | 55                        | —
4800             | 0.99         | —                       | 0.98           | 60                        | —
6000             | 0.99         | —                       | 0.99           | —                         | —

"—" means the compilation failed on NodeMCU

Finally, we compare the best KNN and Q8KNN models that can run on NodeMCU: KNN with 1200 training samples and Q8KNN with 4800 training samples. Figure 9 shows the performance comparison. The Q8KNN model with 4800 training samples performs better in Accuracy, Precision, F1-Score, and G-Mean, with values of 0.98, 0.99, 0.98, and 0.98, respectively, while the Recall of KNN and Q8KNN is the same.

4.2 Discussion

Several studies, such as [38], have made research breakthroughs by applying edge computing and sensor fusion for occupancy detection using the Raspberry Pi, achieving an occupancy-prediction Accuracy of 0.87. Our solution makes a three-fold contribution in comparison. First, our edge computing implementation runs in an environment with tighter specifications than the Raspberry Pi, namely the NodeMCU. Second, we implement machine learning, specifically KNN, in an edge computing environment, which involves more complex computation than traditional rule-based sensor fusion. Third, our virtual sensor for occupancy detection has a higher Accuracy: 0.98 compared to 0.87.

Fig. 9. Performance comparison of best runnable model on NodeMCU

Several studies, such as [12,14,27], have applied quantization, but to deep learning, specifically DNNs, with image processing in an edge computing environment as the typical case study. Our research makes a novel contribution: quantization applied to the KNN model, with smart lighting control as the case study. Further studies, such as [34], have implemented edge computing and shown that its delay is smaller than that of cloud or fog computing; however, that research did not implement any compression model. For future work, our research can lead to implementing this concept in a live smart lighting system and proving that the system delay is better than under the cloud computing concept.

5 Conclusion

This paper applied Q8KNN, a novel quantization method, to perform model compression on KNN in a case study of smart lighting control using NodeMCU. We used a primary dataset collected directly from the implemented system, consisting of 56,729 data items, six features, and one output. The quantization process does not change the value distribution of any feature; only the value range and the density change. The test results show that the unique values that change due to quantization do not cause a cardinality problem. The novel Q8KNN provides a CR of up to 1.6×. With a smaller model size, Q8KNN can fit more training samples into the NodeMCU memory, so at the same model size the original KNN gives 94% accuracy while Q8KNN provides up to 98% accuracy.

References

1. Abdullah, A., Kartheek, N., Tarun, M.: Smart street lights system based on image processing (2018)
2. Abdurohman, M., Putrada, A.G., Prabowo, S., Wijiutomo, C.W., Elmangoush, A.: Integrated lighting enabler system using M2M platforms for enhancing energy efficiency. J. Inf. Process. Syst. 14(4), 1033–1048 (2018)
3. Al Dahoud, A., Fezari, M.: NodeMCU V3 for fast IoT application development. Notes 5 (2018)
4. Cao, K., Liu, Y., Meng, G., Sun, Q.: An overview on edge computing research. IEEE Access 8, 85714–85728 (2020)
5. Chen, J., Ran, X.: Deep learning with edge computing: a review. Proc. IEEE 107(8), 1655–1674 (2019)
6. Cheng, Y., Wang, C., Chen, H.-B., Yu, H.: A large-scale in-memory computing for deep neural network with trained quantization. Integration 69, 345–355 (2019)
7. Choudhary, T., Mishra, V., Goswami, A., Sarangapani, J.: A comprehensive survey on model compression and acceleration. Artif. Intell. Rev. 53(7), 5113–5155 (2020)
8. Das, A., Rad, P., Choo, K.-K.R., Nouhi, B., Lish, J., Martel, J.: Distributed machine learning cloud teleophthalmology IoT for predicting AMD disease progression. Futur. Gener. Comput. Syst. 93, 486–498 (2019)
9. Deab, M.Y., Croock, M.S.: Smarter dam based on cyber-physical system utilizing Raspberry Pi4 and NodeMCU ESP8266. Bull. Electr. Eng. Inform. 11(6), 3403–3413 (2022)
10. Dhjaku, V., Xoxa, N., Bame, A., Tafa, I.: Comparing NTFS file system with etx4 file system. In: RTA-CSIT, pp. 176–180 (2018)
11. Fakhruddin, R.I., Abdurohman, M., Putrada, A.G.: Improving PIR sensor network-based activity recognition with PCA and KNN. In: 2021 International Conference on Intelligent Cybernetics Technology & Applications (ICICyTA), pp. 138–143. IEEE (2021)
12. Fan, A., Stock, P., Graham, B., Grave, E., Gribonval, R., Jegou, H., Joulin, A.: Training with quantization noise for extreme model compression (2020). arXiv:2004.07320


13. Fourure, D., Javaid, M.U., Posocco, N., Tihon, S.: Anomaly detection: how to artificially increase your F1-score with a biased evaluation protocol. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 3–18. Springer (2021)
14. Gong, C., Chen, Y., Ye, L., Li, T., Hao, C., Chen, D.: VecQ: minimal loss DNN model compression with vectorized weight quantization. IEEE Trans. Comput. 70(5), 696–710 (2020)
15. Idrissi, I., Azizi, M.M., Moussaoui, O.: A lightweight optimized deep learning-based host-intrusion detection system deployed on the edge for IoT. Int. J. Comput. Digit. Syst. (2022)
16. Jadav, D., Patel, D., Gupta, R., Jadav, N.K., Tanwar, S.: Barcode: a blockchain-based framework for remote covid detection for healthcare 5.0. In: 2022 IEEE International Conference on Communications Workshops (ICC Workshops), pp. 782–787. IEEE (2022)
17. Jang, I., Kim, H., Lee, D., Son, Y.-S., Kim, S.: Knowledge transfer for on-device deep reinforcement learning in resource constrained edge computing systems. IEEE Access 8, 146588–146597 (2020)
18. Kočička, M.: Performance analysis of the lsu3shell program. Master's thesis, Czech Technical University in Prague, Computing and Information Centre (2019)
19. Modi, P.B.: System for monitoring and forecasting COVID-19 quarantined patients' health signs in real time through edge computing. Int. J. Eng. Appl. Sci. Technol. 7(2), 410–416 (2022)
20. Lango, M.: Tackling the problem of class imbalance in multi-class sentiment classification: an experimental study. Found. Comput. Decis. Sci. 44(2), 151–178 (2019)
21. Mary Shanthi Rani, M., Chitra, P., Lakshmanan, S., Kalpana Devi, M., Sangeetha, R., Nithya, S.: DeepCompNet: a novel neural net model compression architecture. Comput. Intell. Neurosci. 2022 (2022)
22. Moeyersoms, J., Martens, D.: Including high-cardinality attributes in predictive models: a case study in churn prediction in the energy sector. Decis. Support Syst. 72, 72–81 (2015)
23. Mostafa, M.F.: Germination of seeds in smart farm. Ph.D. thesis, University of Mosul (2022)
24. Nando, P., Putrada, A.G., Abdurohman, M.: Increasing the precision of noise source detection system using KNN method. Kinet. Game Technol. Inf. Syst. Comput. Netw. Comput. Electron. Control, 157–168 (2019)
25. Nourisa, J., Zeller-Plumhoff, B., Willumeit-Römer, R.: CppyABM: an open-source agent-based modeling library to integrate C++ and Python. Softw. Pract. Exp. 52(6), 1337–1351 (2022)
26. Phimphinith, A., Anping, X., Zhu, Q., Jiang, Y., Shen, Y.: An enhanced mutual authentication scheme based on ECDH for IoT devices using ESP8266. In: 2019 IEEE 11th International Conference on Communication Software and Networks (ICCSN), pp. 490–496. IEEE (2019)
27. Polino, A., Pascanu, R., Alistarh, D.: Model compression via distillation and quantization (2018). arXiv:1802.05668
28. Prakash, P., Ding, J., Chen, R., Qin, X., Shu, M., Cui, Q., Guo, Y., Pan, M.: IoT device friendly and communication-efficient federated learning via joint model pruning and quantization. IEEE Internet Things J. 9(15), 13638–13650 (2022)
29. Putrada, A.G., Abdurohman, M., Perdana, D., Nuha, H.H.: CIMA: a novel classification-integrated moving average model for smart lighting intelligent control based on human presence. Complexity 2022, 19 (2022). Article ID 4989344

Q8KNN: A Novel 8-Bit KNN Quantization

615

30. Putrada, A.G., Abdurohman, M., Perdana, D., Nuha, H.H.: Machine learning methods in smart lighting towards achieving user comfort: a survey. IEEE Access (2022) 31. Putrada, A.G., Abdurohman, M., Perdana, D., Nuha, H.H.: Recurrent neural network architectures comparison in time-series binary classification on IoT-based smart lighting control. In: 2022 10th International Conference on Information and Communication Technology (ICoICT), pp. 391–396. IEEE (2022) 32. Putrada, A.G., Ramadhan, N.G., Makky, M.: An evaluation of activity recognition with hierarchical hidden markov model and other methods for smart lighting in office buildings. ICIC Int. (2022) 33. Roslan, N.A.N., Mahdin, H., Hidayat, R., et al.: A study on dengue cases detection based on lazy classifier. Int. J. Adv. Sci. Comput. Eng. 1(1), 43–47 (2019) 34. Safavat, S., Sapavath, N.N., Rawat, D.B.: Recent advances in mobile edge computing and content caching. Digit. Commun. Netw. 6(2), 189–194 (2020) 35. Salvador-Meneses, J., Ruiz-Chavez, Z., Garcia-Rodriguez, J.: Compressed k NN: K-nearest neighbors with data compression. Entropy 21(3), 234 (2019) 36. Sharma, M., Rastogi, R., Arya, N., Akram, S.V., Singh, R., Gehlot, A., Buddhi, D., Joshi, K.: LoED: Lora and edge computing based system architecture for sustainable forest monitoring. Int. J. Eng. Trends Technol. 70(5), 88–93 (2022) 37. Yan, L., Niu, X.: Spectral-angle-based Laplacian eigenmaps for nonlinear dimensionality reduction of hyperspectral imagery. Photogramm. Eng. & Remote. Sens. 80(9), 849–861 (2014) 38. Zemouri, S., Magoni, D., Zemouri, A., Gkoufas, Y., Katrinis, K., Murphy, J.: An edge computing approach to explore indoor environmental sensor data for occupancy measurement in office spaces. In: 2018 IEEE International Smart Cities Conference (ISC2), pp. 1–8. IEEE (2018) 39. Zhang, H., Zhang, Z., Zhang, L., Yang, Y., Kang, Q., Sun, D.: Object tracking for a smart city using IoT and edge computing. Sensors 19(9), 1987 (2019)

Internet of Things Platform as a Service for Building Digital Twins and Blockchain

Sritrusta Sukaridhoto1, Agus Prayudi1, Muhammad Udin Harun Al Rasyid1, and Hendro Wicaksono2(B)

1 Politeknik Elektronika Negeri Surabaya, Jl. Raya ITS, 60111 Surabaya, Indonesia

{dhoto,udinharun}@pens.ac.id, [email protected] 2 School of Business, Social & Decision Sciences, Constructor University Bremen, Campus Ring 1, 28759 Bremen, Germany [email protected]

Abstract. The growth of Internet of Things (IoT) technology stimulates the growth of IoT platforms used for developing, managing, controlling, and monitoring physical devices. Many companies have developed ready-to-use physical devices along with an IoT platform. However, it is hard to integrate these devices with physical devices from different platforms and with other technologies such as blockchain and digital twins (DT). This paper proposes a conceptual model of a new IoT platform architecture with high interoperability and the flexibility to connect with other technologies, such as blockchain and digital twins. We analyzed and compared several IoT platforms based on the seven-layer architecture to identify the functionality gaps. The gaps are the basis for developing the new IoT platform architecture.

Keywords: Internet of Things · IoT platform architecture · Blockchain · Digital twin · Conceptual model

1 Introduction

The Internet of Things (IoT) consists of intelligent physical objects (things) that can exchange data through the internet, bridging the physical world and the information system or virtual world [16]. The rapid growth of IoT technology stimulates the growth of IoT platforms, and these platforms make it possible to create unique IoT systems according to user needs. An IoT platform is a set of technologies that provides the infrastructure to build and orchestrate applications according to the required features [5].

The growth of IoT also increases awareness of the security aspect. In recent years, researchers have been trying to solve problems related to B-IoT (Blockchain-based IoT) [16]. The potential threats to IoT concern the security and privacy of exchanged data related to the life of users or business activities. In 2017, more than 20% of organizations and businesses implemented security solutions to secure their IoT devices. Most IoT systems were handled by relying on a centralized architecture, and a decentralized architecture could be used to deal with the security issues in IoT [17].

The capability of IoT to bridge the physical world to the virtual world has opened other technology fields, such as digital twins. A Digital Twin (DT) represents physical objects in the virtual domain. DT faces several challenges, including IT infrastructure, data transfer, privacy and security, and trust [18].

Many companies build their own physical devices and platforms, such as Google Assistant, Apple HomeKit, and Amazon's Alexa. Physical devices from one company or platform are not easily adopted by another platform because of the lack of connectivity with other technologies such as blockchain and digital twins. This paper proposes an IoT platform architecture to solve these interoperability and flexibility problems. The architecture is built on the seven-layer architecture. The proposed platform adopts the Matter protocol, which can connect devices across different platforms. In addition, we propose extensions that connect with external technologies: blockchain, to improve the security of IoT platforms, and digital twins, which need a broker to handle data transactions. The potential benefit of this concept is to simplify the development of blockchain and digital twins related to IoT: users can manage and integrate their IoT blockchain and digital twin systems through our IoT platform.

This research investigates parameters based on the seven-layer architecture to compare several IoT platforms, so that the results can serve as a reference for building a B-IoT platform and a digital-twin-based IoT platform with a Matter protocol implementation. Our proposed solution targets improving the interoperability of IoT devices by implementing multiple communication protocols.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 616–635, 2024. https://doi.org/10.1007/978-3-031-47715-7_42
The Matter protocol is suitable for Personal Area Networks that require low-energy operation, and MQTT will be used for real-time data communication for digital twins. The REST API will be used for data transactions that do not need to be real-time. Kafka will be used for capturing update or store events on the database; the blockchain system then subscribes to these events automatically through a Kafka consumer.

This paper is structured as follows. Section 2 reviews previous studies on IoT platform comparison. In Sect. 3, we compare and analyze several IoT platforms provided by companies such as Tuya, Amazon, and Azure. In Sect. 4, we present the comparison results. Finally, in Sects. 5 and 6, we present the conclusion and future work.
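The protocol roles just described can be summarized as a small dispatch table. This is an illustrative sketch only; the requirement keys and the mapping function are ours, not part of any platform API.

```python
# Illustrative sketch of the communication-protocol roles described above.
# The requirement keys and this mapping are hypothetical, for exposition only.
PROTOCOL_ROLES = {
    "pan_low_energy": "Matter",     # Personal Area Networks, low energy
    "realtime_stream": "MQTT",      # real-time data for digital twins
    "request_response": "REST",     # non-real-time data transactions
    "db_change_events": "Kafka",    # database update/store events
}

def choose_protocol(requirement: str) -> str:
    """Return the protocol proposed for a given communication requirement."""
    return PROTOCOL_ROLES[requirement]
```

For example, `choose_protocol("realtime_stream")` returns `"MQTT"`, reflecting the role MQTT plays for digital twin data in the proposed design.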

2 Related Works

2.1 IoT Platform Architecture Comparison

Previous studies have analyzed and compared several IoT platforms in the context of layered architecture and the technology behind the architecture. Kumar and Mallick [1] summarize the IoT building blocks, their components, and the interactions among the architecture layers. The building blocks are things, gateways, network infrastructure, and cloud infrastructure. Things comprise sensors and actuators that allow data communication and collection; a gateway is a mediator between things and the network or cloud infrastructure; the network infrastructure controls and secures the data; and the cloud infrastructure is used for data storage and computing [1].

Kotha and Gupta [2] discussed four types of IoT architectures, i.e., the basic three-layer architecture, cloud-based architecture, fog-based architecture, and service-oriented architecture. Tables 1, 2, 3 and 4 describe each layer of the four architectures.

Table 1. The basic three-layer architecture.

Layer name         Description
Physical layer     Comprises sensors and actuators; the data are collected in this layer by physical devices
Network layer      Acts as a gateway and handles routing protocols
Application layer  Presents the data to the end user

The disadvantage of this architecture is that a single layer can have more than one task, which makes it challenging to change any one of the layers. A cloud-based architecture has the layers described in Table 2.

Table 2. Cloud-based architecture.

Layer name      Description
Physical layer  Consists of sensors and actuators for data collection and exchange
Process layer   Analyzes the collected data
Gateway layer   Transforms the data and makes them suitable for cloud services
Cloud services  Consists of brokers, message queues, databases, and event managers

Fog computing is similar to edge computing because the data are computed on devices connected to the network's edge or on the physical device itself. The difference between fog-based and cloud-based architecture is the mechanism of data processing: in the fog architecture the data are processed in a decentralized way, whereas the cloud-based architecture processes the data centrally on the cloud. The computing capability of fog is lower than that of the cloud, so processing big data from physical devices requires a cloud-based architecture. Fog computing is cheaper because the data are computed and analyzed on the local device, and it allows users to implement various protocols or other data-processing means.

Table 3. Fog-based architecture.

Layer name       Description
Physical layer   Consists of sensors and actuators
Fog layer        Processes the collected data
Transport layer  Transfers the processed data to cloud services
Cloud services   The received data are analyzed and processed with high computation capabilities

The last is the service-oriented architecture, usually used in business applications. A Service-Oriented Architecture (SOA) has four layers: data acquisition, network, management (or business), and interface. Table 4 describes each layer.

Table 4. The service-oriented architecture.

Layer name        Description
Data acquisition  Similar to the physical layer; collects and transfers the data to the network layer
Network layer     Hosts the communication protocols
Management layer  Transfers the data to the corresponding users
Interface layer   The users and other applications access the IoT data through this layer

2.2 Blockchain-Based Architecture for IoT Platform

A blockchain consists of two parts. First, blocks are sets of transactions and other records such as hash values. Second, the chain cryptographically links each block to values from the previous block. Figure 1 shows the structure of a blockchain.

Fig. 1. Structure of a blockchain.

Hasan et al. researched a blockchain-based approach for creating digital twins [10]. Their solution is based on the Ethereum blockchain platform, which guarantees the security, trust, traceability, and immutability of transactions; for a private blockchain, they preferred Hyperledger Fabric or Hyperledger Besu. Thakore et al. explain the reasons for adopting blockchain in IoT with respect to security: IoT and blockchain can add value to business models because blockchain provides security and trust [11]. So, instead of using a third party or another technology, blockchain can be used to make the data more secure.

2.3 Digital Twin-Based IoT Platform

A Digital Twin (DT) is different from a digital model and a digital shadow. A digital model only converts a physical object to a digital object, without exchanging data. A digital shadow is a digital representation of a physical object in the real world: a data change in the physical object affects the digital object, but not vice versa. A digital twin is a digital representation of the physical object in which a change in the physical object affects the digital object and vice versa [18]. The difference between digital model, digital shadow, and digital twin is depicted in Fig. 2.
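The block-and-chain structure described in Sect. 2.2 can be sketched minimally in Python. This is an illustrative toy (plain SHA-256 hash linking, no consensus), not the Ethereum or Hyperledger mechanics discussed above.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a block's contents (transactions plus the link to its predecessor)."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, transactions: list) -> list:
    """Append a block whose 'prev_hash' links it to the last block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"transactions": transactions, "prev_hash": prev_hash})
    return chain

def verify(chain: list) -> bool:
    """A chain is valid if every block stores the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )
```

Tampering with the transactions of any earlier block changes that block's hash, so the stored `prev_hash` of its successor no longer matches and `verify` fails, which is the immutability property the text refers to.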

Fig. 2. The difference between digital model, digital shadow, and digital twin [18].
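The one-way versus two-way data flow that separates a digital model, a digital shadow, and a digital twin can be sketched as follows. The class and attribute names are hypothetical, chosen only to mirror the definitions above.

```python
# Sketch of the three coupling modes between a physical and a digital object.
# All names here are illustrative, not taken from [18].

class DigitalModel:
    """Digital model: a one-time copy; no automatic data exchange."""
    def __init__(self, physical_state: dict):
        self.state = dict(physical_state)          # snapshot only

class DigitalShadow(DigitalModel):
    """Digital shadow: physical -> digital updates flow automatically."""
    def on_physical_change(self, new_state: dict):
        self.state = dict(new_state)

class DigitalTwin(DigitalShadow):
    """Digital twin: changes also flow back digital -> physical."""
    def __init__(self, physical_state: dict, actuator):
        super().__init__(physical_state)
        self.actuator = actuator                   # callback into the physical world
    def set_state(self, new_state: dict):
        self.state = dict(new_state)
        self.actuator(new_state)                   # "and vice versa": digital drives physical
```

Only `DigitalTwin` invokes an actuator, capturing the "and vice versa" in the definition above.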

Kaur et al. have researched the architecture, applications, and challenges of implementing digital twins in the IoT area. The IoT system acquires data and feeds it to the digital twin model. Their architecture combines three parts: the digital twin itself, IoT, and Artificial Intelligence (AI). Combining IoT and machine learning with digital twins will increase product productivity, uniformity, and quality [19].

Al-Ali et al. have developed several use cases to demonstrate the combination of digital twins and IoT applications [20]. Figure 3 shows the central concept of implementing digital twins with IoT. The physical object is the object we need to sense and control; the physical space layer consists of microcontrollers, sensors, and actuators. The communication network layer applies wireless communication (Wi-Fi, 4G/5G, satellite) to communicate with the virtual space layer, which performs data simulation, actuation, and control decisions. On top of the virtual space layer sits the digital twin itself. The data analytics and visualization layer performs diagnostics and prediction, and the application layer is a reporting dashboard that users can see.

Fig. 3. The concept of digital twin and IoT implementation [7].

3 Internet of Things Platform Analysis

In this section, we compare several IoT platform architectures. To analyze the IoT platforms, we use the seven-layer IoT reference model [7], depicted in Fig. 4:

1. Physical Devices: the devices that can be controlled and monitored by the platform.
2. Connectivity: the communication protocol that connects the physical device with the cloud service.
3. Edge Computing: data processing and transformation.
4. Data Storage: a database system (SQL or NoSQL) for storing and managing IoT data.
5. Data Abstraction: abstracting the data interface for applications.
6. Application: the application for reporting, analytics, and controlling the system.
7. Collaboration and Business Processes: the people and business processes that utilize IoT applications or data.

Fig. 4. The seven layers of IoT by Cisco.

3.1 Tuya Platform

The Tuya IoT platform architecture, shown in Fig. 5, has been compiled from references [8] and [9]. The Tuya IoT Development Platform allows connecting IoT devices, gateways, services, and applications to the cloud. The Tuya platform comprises:

1. Tuya OS, a distributed and platform-agnostic IoT operating system built on RTOS, Linux, and non-OS environments.
2. Tuya Link, which connects existing IoT devices to the Tuya IoT Development Platform with standardized development processes.
3. The edge computing gateway, which runs on edge gateway hardware using container technology. In the edge computing gateway, data can be collected, cleansed, aggregated, and stored; users also get features such as data security, real-time control, edge-cloud synergy, scene linkage, and reliable long-term offline operation.

3.2 ThingsBoard Platform

ThingsBoard is an open-source IoT platform that provides an on-premises solution to enable server-side infrastructure for IoT applications [6]. ThingsBoard is designed to be scalable, fault-tolerant, robust, durable, and customizable. Figure 6 shows the architecture of ThingsBoard.

Fig. 5. Tuya IoT platform architecture.

Fig. 6. ThingsBoard IoT platform architecture.

The ThingsBoard IoT Gateway provides several communication protocol connectors, such as MQTT and REST. Using the existing MQTT protocol, the MQTT connector can collect data from, and control, IoT devices connected to external MQTT brokers.

The REST connector provides endpoints and collects the incoming data from HTTP requests. In addition, users can create their own connector configurations using JSON files. ThingsBoard Transport offers several APIs for device applications or firmware, such as MQTT, HTTP, and CoAP. When the transport layer receives data from devices, it parses them and pushes them to the message queue.

ThingsBoard Core is a node service that handles REST API calls, message processing via the rule engine, device connectivity state monitoring (active or inactive), and WebSocket subscriptions on entity and attribute changes. The ThingsBoard Rule Engine is a service that processes incoming messages: it takes data from the message queue and acknowledges them once processed. The Rule Engine can operate in two modes, shared and isolated. In shared mode, the rule engine processes data belonging to multiple tenants; in isolated mode, it is configured to process data only for specific tenants. ThingsBoard supports multiple message queue implementations, such as Kafka, RabbitMQ, AWS SQS, Azure Service Bus, and Google Pub/Sub. The ThingsBoard database collects the entity and telemetry data; the platform has several database options: SQL, NoSQL, hybrid (PostgreSQL and Cassandra), and hybrid (PostgreSQL and TimescaleDB).

3.3 Azure IoT Platform

IoT Central is a Microsoft Platform-as-a-Service (PaaS) that assembles Azure PaaS components into a fully manageable and extensible IoT application development and operation platform [3]. Figure 7 describes the Azure IoT Central architecture.

Fig. 7. Azure IoT platform architecture.

The bottom layer acts as the physical devices layer, containing the Azure IoT SDK, Azure RTOS, and Azure Sphere. The physical layer sends data directly to Azure IoT Central or to Azure IoT Edge, which serves as the second layer. The third layer is IoT Central itself, built with multiple Azure PaaS services, such as data ingestion and provisioning services, data storage and analytics (hot, warm, and cold paths), and user interfaces for device management. IoT Central extends the solution by triggering notifications, exporting data, and supporting queries. The top layer is the business layer, which integrates with business applications such as Power BI, web apps, and mobile apps.

3.4 SEMAR Platform

The SEMAR (Smart Environmental Monitoring and Analytics in Real-Time) design system consists of physical devices, data input, data process, and data output (see Fig. 8) [4].

Fig. 8. SEMAR design system.

The data input block comprises the network interface (Wi-Fi, Ethernet, cellular), communication protocols (HTTP and MQTT), and the IoT gateway (REST API and MQTT). Data sent over MQTT use the JSON format, and the JSON response is retrieved by the data aggregator program in the platform to obtain each sensor's data.

The data process block on the SEMAR server offers several functions, such as data management (storage, aggregation, and plug-in functions), data synchronization, and machine learning; the primary function of this layer is to extract meaningful information. In the data aggregator, the data are formatted to JSON, and incoming data are ignored if they are not in JSON format. The data storage stores the processed data, and the data synchronization function synchronizes the data from the physical devices; timestamps are assigned when the system receives the data. SEMAR implements two machine learning models, i.e., decision tree and Support Vector Machine (SVM).

The output data can be accessed through the REST API, a user interface (web-based application), and notification systems. The REST API acts as a back-end system to enable direct access to the sensor data, which are received from the database in JSON format. The user interface allows users to view and save the data in JSON, text, or CSV format. The notification system has two ways to notify users, i.e., via MQTT message subscription and email notification.

3.5 OpenHAB Platform (Open Home Automation Bus)

OpenHAB is a vendor-agnostic open-source IoT platform, written in Java, that works with many protocols and devices. It is an IoT platform into which services such as an MQTT broker and an HTTP REST API can be injected. Werner et al. have conducted research using the OpenHAB IoT platform. The OpenHAB architecture comprises three major parts: (1) the core, including the message bus; (2) the binding providers, for communicating with physical devices and services; and (3) the user interface and user services, including the rule engine to run user code and the REST interface to be consumed by external services [12]. Figure 9 depicts the architecture of OpenHAB.

Fig. 9. OpenHAB architecture [12].

OpenHAB has five essential elements: Things, Channels, Bindings, Items, and Links. Things are the real physical devices included in the system; they provide functionalities, and one device can provide several. A DHT sensor, for example, offers both humidity sensing and temperature sensing. Each functionality is exposed through a Channel, and a Binding connects Things with external extensions such as Google Assistant. Items are virtual things that provide functionalities for automation logic, and a Link acts as a connector between Things and Items [13]. Figure 10 illustrates the conceptual architecture.
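The Thing/Channel/Link/Item relationship can be sketched as follows. OpenHAB itself is written in Java, so this Python sketch with invented class names only mirrors the concepts, not OpenHAB's actual API.

```python
# Minimal sketch of OpenHAB's Thing/Channel/Item/Link concepts.
# Class and attribute names are ours, for illustration only.

class Thing:
    """A physical device exposing one or more functionalities as channels."""
    def __init__(self, name, channels):
        self.name = name
        self.channels = {c: None for c in channels}   # channel -> last value

class Item:
    """A virtual entity that automation logic reads; fed by a linked channel."""
    def __init__(self, name):
        self.name = name
        self.state = None

class Link:
    """Connects one channel of a Thing to an Item."""
    def __init__(self, thing, channel, item):
        self.thing, self.channel, self.item = thing, channel, item
    def propagate(self, value):
        self.thing.channels[self.channel] = value
        self.item.state = value

# A DHT sensor exposes two functionalities; only one is linked to an Item here.
dht = Thing("dht22", channels=["temperature", "humidity"])
temp_item = Item("LivingRoom_Temperature")
link = Link(dht, "temperature", temp_item)
link.propagate(22.5)
```

The sketch shows why a single Thing (the DHT sensor) can feed several Items: each functionality has its own Channel, and each Channel can carry its own Link.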

Fig. 10. Conceptual architecture of OpenHAB [14].

3.6 Home Assistant Platform

The Home Assistant platform is used for home control and home automation. It consists of three main parts: (1) the operating system, which runs the Supervisor and the Core; (2) the Supervisor, which manages the operating system; and (3) the Core, which interacts with the user, the Supervisor, and IoT devices and services. Figure 11 shows the architecture of the Home Assistant platform.

Fig. 11. Home Assistant general architecture [15].

The Home Assistant Core comprises four elements, i.e., the event bus, the state machine, the service registry, and the timer (see Fig. 12). The event bus is the mediator for firing and listening to events. The state machine keeps the state and fires a state_changed event whenever the state changes, whereas the service registry listens for call_service events on the event bus and allows other code to register services. The timer fires time_changed events on the event bus every second.
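The interplay of the event bus, state machine, and service registry described above can be sketched as follows (the timer is omitted). The element names follow the text, but the implementation is an illustrative sketch, not Home Assistant's actual code.

```python
# Sketch of three of the four Home Assistant Core elements (timer omitted).
# Names follow the text; the implementation is illustrative only.
from collections import defaultdict

class EventBus:
    """Mediator for firing events and listening to them."""
    def __init__(self):
        self._listeners = defaultdict(list)
    def listen(self, event_type, callback):
        self._listeners[event_type].append(callback)
    def fire(self, event_type, data=None):
        for cb in self._listeners[event_type]:
            cb(data)

class StateMachine:
    """Keeps entity states; fires state_changed only when a state changes."""
    def __init__(self, bus):
        self._bus, self._states = bus, {}
    def set(self, entity, state):
        if self._states.get(entity) != state:
            old = self._states.get(entity)
            self._states[entity] = state
            self._bus.fire("state_changed",
                           {"entity": entity, "old": old, "new": state})

class ServiceRegistry:
    """Listens for call_service events and dispatches registered services."""
    def __init__(self, bus):
        self._services = {}
        bus.listen("call_service", self._handle)
    def register(self, name, func):
        self._services[name] = func
    def _handle(self, data):
        self._services[data["service"]](**data.get("kwargs", {}))
```

Note that setting an entity to the state it already has fires no event, which matches the "triggers state_changed when the state has been changed" behavior described above.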

Fig. 12. Home Assistant core architecture [15].

4 IoT Platform Comparison and Conceptual Model of the New IoT Platform Architecture

This section compares the analyzed IoT platforms and identifies the research gaps; Tables 5 and 6 show the comparison results. Based on the gaps and the analysis of the different IoT platform architectures, we propose the new architecture shown in Fig. 14 and the design system shown in Fig. 13.

As seen in Fig. 13, the physical devices collect and transfer data to the system through communication protocols such as MQTT, HTTP, Matter, and Kafka (a distributed messaging system). We use the ESP32 as the physical device because the Matter protocol can be implemented on it. The received data are processed in the data process segment. The data validator validates the data format; if the format matches the rules, the system passes the data to the data filter, which filters the data if a filtering process is needed. Data transformation converts the data to the format required by whichever database the user chooses. In this segment, we implement several databases to capture real-time, time-series, and relational data, to gain flexibility and interoperability for the IoT platform. The database is connected to a Kafka connector that acts as an event listener: Kafka listens for updated or stored data in the database. The blockchain extension can subscribe to the database events on a topic using a Kafka consumer, which then triggers the smart contract API. The digital twin extension can subscribe to and publish data through the MQTT broker and Kafka, and a data analysis system can fetch data from the database for analysis.
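The validate-filter-transform-store path with a change event feeding the blockchain extension can be sketched end to end. Every name here is illustrative, and in-process callbacks stand in for the real MQTT/Kafka brokers and smart contract API of the proposed design.

```python
# End-to-end sketch of the proposed data path: validate -> filter ->
# transform -> store -> emit a change event that a blockchain consumer
# handles. All names are hypothetical; in the real design, MQTT/Kafka
# brokers and a smart-contract API replace these in-process callbacks.
import json

def validate(raw: str):
    """Data validator: reject payloads that are not valid JSON objects."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return doc if isinstance(doc, dict) else None

def pipeline(raw, keep, transform, database, on_change):
    doc = validate(raw)
    if doc is None or not keep(doc):     # data filter
        return False
    record = transform(doc)              # data transformation
    database.append(record)              # storage
    on_change(record)                    # Kafka-connector-style change event
    return True

# "ledger" stands in for the blockchain extension consuming change events.
db, ledger = [], []
pipeline('{"sensor": "t1", "value": 7}',
         keep=lambda d: d["value"] >= 0,
         transform=lambda d: {"id": d["sensor"], "v": d["value"]},
         database=db,
         on_change=lambda rec: ledger.append(rec))
```

The key property the sketch illustrates is that the blockchain consumer is notified only for records that survive validation and filtering and were actually stored.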

Table 5. IoT platform comparison based on the seven-layer architecture.

Platform         Things       Connectivity               Edge computing       Data accumulation              Data abstraction                Application
Tuya             All devices  HTTP, MQTT                 Not mentioned        Not mentioned                  Not mentioned                   Reporting and control
ThingsBoard      All devices  HTTP, MQTT, CoAP           Rule engine          Cassandra                      ThingsBoard core service        Reporting and analytics
Azure IoT        All devices  HTTP, MQTT, AMQP           Data transformation  Azure storage                  Azure stream processing         Reporting and analytics
SEMAR            All devices  HTTP, MQTT                 Data aggregation     MongoDB                        Data filtering                  Reporting and analytics
OpenHAB          All devices  HTTP, MQTT                 Data transformation  InfluxDB, MongoDB, RRD4J       JS transformation               Reporting, analytics, and control
Home Assistant   All devices  HTTP, MQTT, Matter         Not mentioned        SQLite                         Not mentioned                   Reporting and control
Proposed system  All devices  HTTP, MQTT, Kafka, Matter  Data validator       MongoDB, InfluxDB, PostgreSQL  Data transformation and filter  Reporting, analytics, and control

The concept of the Matter protocol implementation is shown in Figs. 15 and 16. Two modes can be implemented: Wi-Fi network mode and Thread network mode. Several microcontrollers are supported by Matter. In this paper, we use the ESP32 as the Matter accessory (physical device), a Raspberry Pi as the Matter controller on which the CHIP Tool program is installed, and another Raspberry Pi as the Thread border router if the Matter protocol is implemented in Thread network mode.

The data process segment validates the incoming data by checking the data types. Users can develop plug-in functions in the data process segment to customize the data processing to their requirements. After the data have been processed, they are stored in the database. In the data process segment, the database is integrated with a Kafka connector to capture data change events, so external applications are notified and receive updated data. Blockchain, digital twin, and data analytics systems consume the data output.
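The type-checking validation and user plug-in functions mentioned above might look as follows. The schema format and the plug-in registration API are hypothetical, for exposition only.

```python
# Sketch of type-checked validation plus user plug-in functions in the
# data-process segment. The schema format and plug-in API are hypothetical.
SCHEMA = {"device_id": str, "temperature": float}

def type_check(doc: dict, schema: dict) -> bool:
    """Validate incoming data by checking field presence and data types."""
    return all(k in doc and isinstance(doc[k], t) for k, t in schema.items())

PLUGINS = []

def plugin(func):
    """Register a user-defined plug-in applied to each validated document."""
    PLUGINS.append(func)
    return func

@plugin
def to_fahrenheit(doc):
    # Example user plug-in: derive a Fahrenheit reading from Celsius.
    doc["temperature_f"] = doc["temperature"] * 9 / 5 + 32
    return doc

def process(doc, schema=SCHEMA):
    """Type-check the document, then run it through all registered plug-ins."""
    if not type_check(doc, schema):
        return None
    for func in PLUGINS:
        doc = func(doc)
    return doc
```

Plug-ins run only after the type check passes, so user code can assume well-typed input, which is the flexibility property the proposed platform aims for.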

Table 6. IoT platform comparison based on performance.

Platform         On premise          Scalability    Microservices  Security                      Collaboration
Tuya             No                  Not mentioned  Not mentioned  HTTPS                         Yes (Amazon Echo, Google Home, Voice Assistant)
ThingsBoard      No                  Yes            Yes            HTTPS, OAuth 2 support        Yes (AWS IoT, IBM Watson, Azure IoT)
Azure IoT        No                  Yes            Yes            Link encryption, SSL          Yes (Power BI, Azure Maps)
SEMAR            No                  Yes            Yes            HTTPS, OAuth                  No
OpenHAB          Yes                 No             No             No                            Yes (extensions)
Home Assistant   Yes                 No             No             No                            Yes (extensions)
Proposed system  Yes (Docker-based)  Yes            Yes            HTTPS, MQTTS, authentication  Yes (extensions via topics and API)

Fig. 13. The conceptual model of the proposed IoT platform.

Fig. 14. The architecture model of the proposed IoT platform.

4.1 IoT Platform Evaluation

Based on the comparison in Tables 5 and 6, we evaluated our platform against one of the other platforms as an example: the SEMAR IoT platform. SEMAR supports two communication protocols, MQTT and HTTP. In SEMAR, users must define the MQTT topic in the platform to be able to publish and subscribe to data through the defined topic; this is illustrated in Fig. 17. In our platform, to accelerate the development process, we use OpenHAB and build our own services. OpenHAB provides bindings (extensions) that allow users to plug in their own services; the binding mechanism is illustrated in Fig. 18. In this case, we implement our services using only the MQTT and HTTP protocols; the implementation of the communication protocols in OpenHAB is shown in Fig. 19. Our system provides several plug-in functions that can be used in OpenHAB as data processors (validation, filtering, and transformation). OpenHAB also provides the HABPanel, which we can use for embedding external web content; Kafka and Matter can be integrated through the HABPanel.
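Matching publishes and subscriptions against a platform-defined topic, as SEMAR requires, relies on MQTT's topic-filter wildcards. The sketch below follows the '+' (single level) and '#' (multi level) rules from the MQTT specification, though the function itself is our illustration and ignores some edge cases (e.g., '#' being valid only as the last level).

```python
# Sketch of MQTT topic matching as used when a platform-defined topic
# gates publish/subscribe. Wildcard semantics follow the MQTT spec;
# the implementation and example topics are ours.
def topic_matches(filter_: str, topic: str) -> bool:
    f_parts, t_parts = filter_.split("/"), topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":                        # multi-level wildcard: rest matches
            return True
        if i >= len(t_parts):
            return False
        if f != "+" and f != t_parts[i]:    # '+' matches exactly one level
            return False
    return len(f_parts) == len(t_parts)
```

For example, a subscription to the hypothetical filter `semar/+/temperature` receives `semar/dev1/temperature` but not `semar/dev1/humidity`.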

Fig. 15. The concept of Matter protocol implementation based on a Wi-Fi network.

Fig. 16. The concept of Matter protocol implementation based on a Thread network.

Fig. 17. Illustration of device management on the SEMAR platform.

Fig. 18. Illustration of binding on the OpenHAB platform.

5 Conclusions

In this paper, we have compared and analyzed several IoT platforms based on the seven-layer architecture and their performance. We found that the existing IoT platforms presented in Tables 5 and 6 still lack connectivity with external technologies such as blockchain and digital twins. The other platforms rely solely on an internet connection to control and access the IoT devices, and, as the evaluation section shows, some platforms are less flexible and less interoperable. Because of those gaps, we proposed a conceptual model of an IoT platform that can integrate with external technologies such as blockchain and digital twins through our extension API. We also propose stream data processing to handle image and video data. To gain interoperability, we adopt the Matter protocol to control and access IoT devices in Personal Area Networks that need low energy consumption; Matter-based devices can be accessed with or without an internet connection. To solve the flexibility problem, we provide plug-in functions that can be injected into the IoT system.

6 Future Work

Our future work focuses on developing an IoT platform that integrates physical devices through the Matter protocol, so that users can access registered physical devices through external IoT platforms such as Google Assistant. The IoT platform will also integrate external technologies such as blockchain and digital twins through our extension API.


Fig. 19. Implementation of the MQTT protocol on the OpenHAB platform.


From Industry 4.0 Toward Agriculture 4.0

Bushara Ali1(B), Anamarija Ilieva2, Ahmad Zakeri1, and Oliver Iliev3

1 School of Engineering, University of Wolverhampton, Telford, England
[email protected]
2 Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Skopje, Republic of Macedonia
3 Faculty of Information and Communication Technologies, FON University, Skopje, Republic of Macedonia

Abstract. The biggest problem facing contemporary agriculture is the enormous, ongoing growth of the world population, which necessitates greater production of agricultural goods to meet food demand. Agriculture also faces further challenges, such as the recent rise in inefficiency brought on by global climate change, reduced irrigation water, increasing soil contamination, plant diseases, and heat waves, all of which negatively affect the sector. To retain efficiency, sustainability, and quality, the agricultural sector needs to invest in modern technologies and infrastructure that will allow it to evolve into a smart industry capable of addressing these difficulties through lean operations supported by industrial digital technologies. The manufacturing industry has fully embraced the Industry 4.0 strategy, achieving increased optimization, efficiency, responsiveness, and autonomy through digitization. To enable a comparable paradigm shift in the future of the agricultural industry, this paper deals with the digitization of agriculture based on IoT, AI, and artificial neural network (ANN) technologies. Applying these technologies in agriculture will enable many services, such as data collection and analysis, pattern recognition, and independent AI-based decision-making, added to existing agricultural automation. The agricultural industry is currently one of the least efficient industries, and these technologies can drive a technical revolution in the sector to improve productivity, quality, and sustainability.

Keywords: Smart farming · Industry 4.0 · Artificial Intelligence (AI)

1 Introduction

The World Food Organization estimates that the world population will reach 9.15 billion by 2050, an increase of another 2 billion, while the area that can be added to arable land will grow by only 4% [15]. And according to a paper published in 2018 by the World Government Summit, "population growth = more food"; the same paper estimated that

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 636–651, 2024. https://doi.org/10.1007/978-3-031-47715-7_43


the world's population will increase to 10 billion by 2050 [1, 7]. Owing to the growing world population and the challenges facing the agricultural sector (discussed below), agricultural practice cannot reach the required efficiency without the latest technological solutions: modern agricultural techniques that optimize resources and meet increasing consumer demand, together with high quality standards such as food safety, security, and traceability. Agricultural production involves many activities, such as monitoring soil, temperature, relative humidity, vegetation conditions, and carbon dioxide levels, as well as pest control, supply management, and infrastructure. IoT-based technology can add value in terms of quality, sustainability, and increased production, as it performs a range of operations and supports decision-making [7]. The goal of this research is to use fuzzy logic to integrate informal human knowledge into control techniques for protected agriculture. The main questions of this research include: early prediction of plant disease using an ANN-based system; networking smart sensors and IoT technology to manage and control large amounts of data; collecting sensor data and integrating it in real time to allow monitoring and control from anywhere, at any time, by all process participants; calculating biotic and abiotic stresses using UAVs and multispectral cameras; and smart control of plant irrigation and fertilization using ANNs.
The main focus of the current study is to implement the Industry 4.0 strategy in farming by integrating Internet of Things (IoT), ANN, fuzzy logic, and artificial intelligence (AI) techniques into the industry's infrastructure, with the following goals and milestones. Using fuzzy logic, AI, and the IoT in agriculture has many advantages: it supports and develops agricultural practice and helps farms reduce production costs, thereby increasing profitability and sustainability [3]. Fuzzy logic makes it possible to include informal human knowledge in the control strategy, creating human-like control over the parameters of farm entities such as greenhouses. The combined application of AI and the IoT makes it easier to reach a higher level of efficiency in the use of resources such as water, fertiliser, pesticides, and energy sources on the farm, and IoT applications will contribute to environmental protection. The study also emphasizes the importance of integrating these technologies: combining fuzzy logic and AI enables farms to adopt more effective environmental protection strategies, and, as this study shows, the IoT is a key component of smart farming.
These are among the main problems in this sector, and this study seeks solutions that enable farmers to use modern agricultural technology on their farms, provides practical steps for agricultural companies, and offers farmers solutions and gives researchers ideas about how to attend to this important sector. The paper also presents a rationale for the benefits of Industry 4.0 and useful information about it, as well as the obstacles


related to the implementation of the latest technological solutions in smart agriculture. The importance of this research lies in the fact that it provides important technological solutions demonstrating the significance of using Industry 4.0 in the agriculture sector, particularly IoT, AI, and fuzzy logic in smart farming, and it recommends that researchers pay attention to this important sector. It will also enable farms to lead in adopting more effective environmental protection strategies through the combined use of fuzzy logic and AI techniques. This is one of the most important technical recommendations to be developed further.

2 Previous Research

The widespread use of wireless communication will raise the bar for agricultural automation and will alter communication standards. Sensor networks are covered by a variety of IEEE standards, including IEEE 802.15.1 (PAN/Bluetooth) and IEEE 802.15.4 (ZigBee), among others; these standards must be understood when developing sensor network applications. The use of the IPv6 protocol for wireless communication, and the significant amount of hardware needed to establish a WSN, are currently under debate. Precision agriculture and the creation of optimal strategies are both made feasible by WSNs, and the IoT concept has made it possible to build networks of intelligent sensors that can quickly and effectively cover large areas. Sridevi Navulur et al. decompose the IoT gateway into various nodes such as actuators, sensors, interfaces, and the wireless links that enable communication between them [26]. To monitor agricultural land and alert farmers via email or SMS if the system detects any anomalies, Kodali et al. presented the Losant platform [20, 21]. An automated irrigation system that uses a GPRS module as the communication tool was created by Gutiérrez-Rodríguez et al. [10]; the system is encoded into a microprocessor-based gateway that regulates the flow of water, and the water savings were shown to be 90% greater than with the traditional irrigation technique. Kim et al. used a distributed wireless network to remotely sense and control irrigation processes [27]. To decrease human intervention, time, and cost while improving efficiency, productivity, and global market reach, Malavade et al. focused on the IoT's impact on agriculture and how it leads to smart farming [23]. Artificial neural networks (ANNs) have recently been adopted by the agricultural industry due to their advantages over conventional control systems.
Artificial neural networks can predict the behaviour of a system through the training process [11]. Gliever and Slaughter utilized ANNs to recognize weeds in crops. Maier et al. utilized ANNs to predict water resource variables [22]. Song and He combined expert systems and artificial neural networks to estimate the nutritional value of crops, while Ravichandran and Koteshwari suggested using ANN algorithms for crops; they also proposed an expert farming system that uses the IoT to send data to the server so that domain operators can make suitable decisions [30]. The effectiveness of the model depended on the number of hidden layers. To assess soil moisture in paddy fields with significantly less meteorological data, two ANN models were created. By


examining the observed and estimated soil moisture values, both models were proven accurate. This led to more efficient use of limited meteorological data, lower labor and time consumption, and an accurate and trustworthy estimation of soil moisture in rice fields [2]. Hinnell et al. [12], in their discussion of drip irrigation systems, mention that ANNs have been developed to predict the spatial distribution of groundwater. Fuzzy logic, with a membership function for each system variable, is an effective technique for representing informal knowledge [13, 14]. Sannakki et al. created a novel technique for classifying leaf diseases that separates the healthy from the diseased portions of the leaf using image segmentation and k-means clustering [29]. To determine the ideal nitrogen fertiliser rates based on field and crop variables, Tremblay et al. developed a fuzzy inference system (FIS) [25]. A model was built that predicts plant illnesses from weather data [16]; the technique was created to prevent plant diseases, which appear only when atmospheric temperature and humidity are within a specific range. Wall and King created an intelligent system that uses field temperature and humidity sensors to control sprinkler valves [14, 17]. A distributed irrigation system that measures soil moisture was developed by Miranda et al. [3, 15], and a multi-variable fuzzy controller was developed for a greenhouse system; its fuzzy inference engine integrates the correlation of input variables and incorporates informal knowledge from field experts [14].
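The kind of ANN soil-moisture estimator surveyed above can be illustrated with a deliberately tiny from-scratch network. Everything below is a sketch under stated assumptions: the data are synthetic (moisture rising with humidity and falling with temperature), and nothing here reproduces the cited models.

```python
import math
import random

random.seed(0)

# One hidden layer of 4 tanh units mapping (temperature, humidity) -> moisture.
N_IN, N_HID = 2, 4
w1 = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b1 = [0.0] * N_HID
w2 = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
b2 = 0.0


def predict(x):
    """Forward pass: tanh hidden layer, linear output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    return h, sum(w * hi for w, hi in zip(w2, h)) + b2


def mse(samples):
    return sum((predict(x)[1] - t) ** 2 for x, t in samples) / len(samples)


# Synthetic (temperature, humidity) -> moisture samples, scaled to [0, 1].
data = [((t / 40, rh / 100), 0.2 + 0.5 * rh / 100 - 0.3 * t / 40)
        for t in range(10, 40, 5) for rh in range(30, 90, 15)]

loss_before = mse(data)
lr = 0.05
for _ in range(300):                      # plain stochastic gradient descent
    for x, t in data:
        h, y = predict(x)
        err = y - t
        for j in range(N_HID):            # backpropagate through tanh
            g = err * w2[j] * (1 - h[j] ** 2)
            w2[j] -= lr * err * h[j]
            b1[j] -= lr * g
            for i in range(N_IN):
                w1[j][i] -= lr * g * x[i]
        b2 -= lr * err
loss_after = mse(data)
```

Training drives `loss_after` well below `loss_before`; the point of the sketch is that even a very small network can learn such a mapping from few meteorological inputs, which is the property the surveyed soil-moisture studies exploit.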

3 Definition of Industry 4.0

A paper proposed in 2011 was titled "Industry 4.0: Intelligent Manufacturing of the Future." Industry 4.0 has been defined as "the name of the current trend of automation and data exchange in manufacturing technologies, including smart factory construction, the IoT, cloud computing, cognitive computing, and cyber-physical systems" [12, 22, 27]. Industry 4.0 refers to the technological evolution from embedded systems to cyber-physical systems [31]. Note that there are many definitions of "Industry 4.0", and not all are the same, but Industry 4.0 is also known as the Fourth Industrial Revolution [1]. Decentralized intelligence helps create intelligent networks and manage autonomous operations, with the interaction of the real and virtual worlds representing a new and important aspect of the manufacturing and production process. The first industrial revolution began in the late eighteenth century in Britain, enabling mass production using water and steam power rather than human and animal power. A century later, Industry 2.0 began, introducing assembly lines and the use of oil, gas, and electric power as important energy sources, along with more advanced communication by telephone and telegraph and a certain degree of automation of manufacturing processes [28]. The third industrial revolution began in the mid-twentieth century, contributing computers, advanced communications, and data analysis [31]. In addition, the digitization of factories started with the integration of programmable logic controllers (PLCs) into machines to automate certain processes and to collect and exchange data. The world is currently experiencing the fourth industrial revolution, the latest transformation in the industrial sector, marked by increasing automation and the use of smart machines and smart factories; this data helps produce goods more efficiently and productively.
Flexibility is improved so that manufacturers can


meet customer demands with mass customization. In pursuit of efficiency, by collecting more data from the factory floor and combining it with other enterprise operational data, a smart factory can achieve information transparency and better decision-making. The Fifth Industrial Revolution (Industry 5.0) is expected to bring humans back to the centre of operations through a human-cyber-physical system of value creation [28]. Tables 1 and 2 show the development roadmap of the industrial and agricultural revolutions from the 18th century to date. Table 3 shows the concepts of Industry 4.0 technology and Agriculture 4.0 technology.

Table 1. The history of industrial revolutions, from steam to sensors [18, 28].
Industry 1.0 (mechanization), 18th century: steam engine, water power for mass production.
Industry 2.0 (electricity), 20th century: electrical energy for mass production.
Industry 3.0 (electronics), late 20th century (the digital revolution): computers used to automate production, automation, information technology.
Industry 4.0 (intelligence), now: sensors, AI, IoT, big data, robotics, drones.

Table 2. The history of agricultural revolutions from the 18th century to today [18, 28, 32].
Agriculture 1.0, 18th century to 1870: farmers using animal power, indigenous tools, manual work.
Agriculture 2.0, 20th century (green revolutions): tractors, fertilizers, pesticides.
Agriculture 3.0, 1960s to the early 2000s: yield monitoring, variable-rate application, guidance systems, precision agriculture.
Agriculture 4.0, today: autonomous farming, trustworthy food supply, ubiquitous agriculture sensors.

The agricultural revolution brought an important shift in human history: it advanced the development of agriculture and the domestication of plants and animals, and it enabled people to remain in one place and establish permanent, stable settlements, which helped civilization develop.

4 Agriculture 4.0 and the Food Industry

Despite the great evolution of the agricultural field, many issues and challenges still need to be addressed, such as environmental issues, a lack of digitalization, and food safety. Industry 4.0 was marked by the synthesis of all emerging technologies, bringing together the Internet of Things (IoT), robotics, big data, and blockchain technology


Table 3. The concepts of industrial and agricultural revolution technologies [1].
Smart industry technology: IoT sensors, big data, analytics, cloud computing, robotics, etc.
Smart agriculture technology: IoT sensors, AI, machine learning, fuzzy logic, UAVs, etc.

[7]. Using Wireless Sensor Networks (WSN), Autonomous Robotic Systems (ARS), Cloud Computing (CC), Simulation and System Integration (SI), Cyber-Physical Systems (CPS), Digital Twins (DT), and Additive Manufacturing (AM), industry can be digitalized and agriculture made smarter. The integration of Industry 4.0 and agriculture provides the opportunity to transform industrial agriculture into Agriculture 4.0, the "next generation", also called "digital agriculture." Agriculture 4.0 can be achieved by collecting precise variables in real time, and by processing and analysing data related to the agricultural industry, from food production, processing, and distribution to consumer use [1]. Such a smart agro-system would greatly improve productivity, food supply chain efficiency, natural resource utilization, and food safety [7, 19]. Sensors are used to detect crop stages, recommend timely inputs and treatments, and control the level of irrigation. The food industry needs more research on food security, transparency, and the required documentation. Smart sensors are used as a tracking system in addition to collecting human and product data. Agriculture 4.0 means a transformation of the production infrastructure: it contributes to the creation of connected farms; new production equipment, machinery, and connected tractors; and increased productivity, quality, and environmental protection [19]. The smart agriculture concept is shown in Fig. 1.

Fig. 1. Smart agriculture concept [1].


5 From Industry 4.0 Toward Agriculture 4.0

The methodology behind this study is to develop a smart farm prototype using digital technologies, implementing the Industry 4.0 strategy in farming. The study also identifies challenges and evaluates unsolved problems, potential applications, and uses of AI and IoT architectures in the agricultural industry. With the help of Internet of Things (IoT), fuzzy logic, and artificial neural network (ANN) technologies, the study will develop an integrated digital tool that will allow the agricultural sector to collect, process, and transmit data and to take independent decisions and actions, based on informal knowledge incorporated through fuzzy logic, without the need for human intervention [10]. The Internet of Things provides the communications infrastructure required to connect smart devices, such as sensors, vehicles, and unmanned aerial vehicles, to user mobile devices over the Internet [5]. This study also highlights many of the challenges facing contemporary agriculture, such as the continually increasing demand for quality food, shortages of labour and arable land, increased soil contamination, reduced irrigation water, and loss of yields due to plant diseases and pests [7], in addition to the difficulties brought on by global climate change, which has added to the inefficiencies of the agricultural industry.
In such circumstances, to maintain efficiency, quality, and environmentally healthy and sustainable production of food, the agricultural industry needs to invest in new techniques and infrastructure to transform into a smart industry comprised of smart farms capable of responding to these challenges through lean operations supported by industrial digital technologies (IDTs) [4, 10]. This research can be useful to farmers, researchers, and professionals working on intelligent farming in agricultural businesses. Regarding the newness and originality of the study: several prominent researchers have made efforts toward smart agriculture using IoT concepts, but a set of revealed challenges remains in the queue for effective resolution. This study discusses previous research and open challenges in IoT-based agriculture, artificial intelligence, artificial neural networks, and fuzzy logic for developing smart farming. Using Industry 4.0 techniques in farming will give farmers and agricultural companies many benefits:

1. Using the Internet of Things in farming makes it easier to manage and collect large amounts of data from sensors, and to control and integrate them from anywhere and at any time, allowing communication among all participants in the process.
2. Using fuzzy logic, the research will allow the incorporation of informal human knowledge into the control strategy, resulting in human-like control of the parameters of farm entities (greenhouses).
3. With the combination of AI and the IoT, the farm will be able to achieve a higher level of efficiency in the use of water, energy sources, fertilisers, pesticides, etc.
4. Internet of Things applications will lead to environmental protection.


5. The research enables farmers to adopt more efficient environmental protection strategies by combining fuzzy logic and AI techniques.
6. Using Internet of Things (IoT) sensors and AI-enabled equipment such as drones and agricultural robots, farmers can increase food production by more than 70% by 2050.

Figure 2 shows the system's structure. The central unit receives and collects data from autonomous measurement stations, local weather forecasts, professional knowledge from local agronomists, pre-processed images from multispectral crop cameras on UAVs, and data obtained from mobile device applications. Based on this previously collected, processed, and stored data, the central unit generates control actions for the actuators. The system is suitable for both outdoor and protected agriculture (in greenhouses and indoors), and actions are taken in accordance with the type of plant under supervision. In the case of protected cultivation, the system controls the internal temperature, relative humidity, carbon dioxide enrichment, irrigation, and fertilisation. In the case of outdoor cultivation, the system regulates watering and fertilisation and computes a severity index to allow early pest detection. AI-based solutions will also be utilised to optimise resource utilisation [32].
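As a sketch of how the central unit's incorporated informal knowledge could generate a control action, the following example implements a minimal Mamdani-style fuzzy rule base for irrigation. All membership ranges, rule strengths, and output setpoints are illustrative assumptions, not values from this study.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def irrigation_minutes(soil_moisture_pct, temp_c):
    """Encode rules like 'if soil is dry AND weather is hot, water for long'."""
    dry = tri(soil_moisture_pct, -1, 0, 40)     # hypothetical fuzzy sets
    wet = tri(soil_moisture_pct, 30, 100, 101)
    hot = tri(temp_c, 20, 40, 41)
    cool = tri(temp_c, -1, 10, 30)
    # rule strength (min for AND), paired with an output setpoint in minutes
    rules = [
        (min(dry, hot), 45.0),    # dry AND hot  -> long watering
        (dry, 25.0),              # dry          -> medium watering
        (min(wet, cool), 0.0),    # wet AND cool -> no watering
    ]
    total = sum(w for w, _ in rules)
    if total == 0:
        return 0.0
    # weighted-average defuzzification of the fired rules
    return sum(w * out for w, out in rules) / total


print(irrigation_minutes(10, 38))   # dry, hot day -> 35.0 minutes
print(irrigation_minutes(80, 15))   # wet, cool day -> 0.0 minutes
```

The design point is that each rule reads like the informal knowledge of an agronomist, while the membership functions and defuzzification turn those statements into a continuous actuator command.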

6 Technology-Driven Agriculture

Industry 4.0 changed the industrial and agricultural sectors and offered advantages across many industries, especially agriculture: farmers now use Industry 4.0 technology in smart agriculture, utilising modern devices like agricultural robots, drones (UAVs), sensors, AI, and the Internet of Things (IoT).

Fig. 2. The structure of the proposed system.


6.1 Internet of Things in Agriculture

The majority of research publications discuss the influence of IoT devices on system automation, including how they can determine soil texture, automate irrigation systems, detect environmental changes such as temperature and humidity, detect pests and weeds, and assess damage to the soil. Smart agriculture has contributed to the development of the agricultural sector: farmers now utilise modern technologies to make agriculture smarter and more productive, using the IoT and AI and linking agriculture with satellites, the internet, sensors, and so on to increase the benefits of the IoT [5]. The data connection varies from sensor to sensor. First, the sensors gather the information and share it with the IoT gateway, which connects through Wi-Fi or Bluetooth to the database and stores the data in the cloud; the user finally gets the processed information and uses it to make smarter decisions. A Wireless Sensor Network (WSN) helps make the right decisions in relevant situations, and thus the WSN supports decision-making to find solutions to problems related to agriculture [1, 18]. Despite technological development, many challenges must be faced when using modern technologies in smart agriculture. Among them, many people do not know how to use technology in the agriculture sector, owing to a lack of experience with mobile devices and limited reading and comprehension skills; other factors include a lack of energy supply, changes in weather or climatic conditions, and funding sources [1]. Nevertheless, developing the agricultural field with modern technologies has helped to overcome obstacles and solve many problems in modern agriculture.
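The sensor-to-gateway-to-cloud data path described above can be sketched end-to-end as follows. This is a toy illustration only: the class names and JSON payload shape are assumptions, and the in-memory store stands in for a real cloud database reached over Wi-Fi or Bluetooth.

```python
import json
import statistics
import time


def sensor_read(sensor_id, value):
    """A sensor produces a timestamped reading."""
    return {"sensor": sensor_id, "value": value, "ts": time.time()}


class CloudStore:
    """Stand-in for the cloud database the gateway uploads to."""

    def __init__(self):
        self.rows = []

    def store(self, doc):
        self.rows.append(json.loads(doc))

    def summary(self, sensor_id):
        """The processed information the user finally queries."""
        vals = [r["value"] for r in self.rows if r["sensor"] == sensor_id]
        return {"count": len(vals), "mean": statistics.mean(vals)}


class Gateway:
    """IoT gateway: packages readings as JSON and forwards them to the cloud."""

    def __init__(self, cloud):
        self.cloud = cloud

    def forward(self, reading):
        self.cloud.store(json.dumps(reading))  # in practice: Wi-Fi/BLE + MQTT/HTTP


cloud = CloudStore()
gw = Gateway(cloud)
for v in (21.5, 22.0, 22.5):
    gw.forward(sensor_read("soil-temp-01", v))

print(cloud.summary("soil-temp-01"))  # {'count': 3, 'mean': 22.0}
```

The separation of roles mirrors the paragraph above: sensing, gateway transport, cloud storage, and a user-facing summary are independent stages, so any one of them can be swapped (e.g. a different radio link) without touching the others.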
In the Agriculture 4.0 sector, the Internet of Things (IoT) is commonly utilized in various ways: unmanned aerial vehicles (UAVs) in agriculture; monitoring soil and water quality; optimizing crop irrigation; precision agriculture techniques such as remote sensing and humidity and temperature monitoring; and autonomous greenhouses. These are the important and common uses of the IoT in agriculture [8, 18].

6.2 Wireless Sensor Networks in Agriculture

Wireless sensor network technology is used in the IoT. A WSN is defined as a group of spatially dispersed sensors that monitor environmental conditions, store the data obtained, and send the collected information to a central point [1]. A wireless communication module connects several sensor nodes to form a WSN for smart agriculture. These nodes can self-configure, self-organize, and self-diagnose [24, 32], and they have a variety of capabilities (such as processing, transmission, and acquisition). WSNs can be divided into various sorts based on the environment in which they are deployed: mobile wireless sensor networks (MWSNs), underwater wireless sensor networks (UWSNs), wireless underground sensor networks (WUSNs), and terrestrial wireless sensor networks (TWSNs) [1, 20]. For agricultural applications, TWSN and WUSN networks are most often used [1, 20, 32]. TWSN nodes are positioned above the surface of the earth and consist of sensors that collect surrounding data. An alternative to surface WSNs is their underground counterpart, WUSNs: nodes are inserted into the soil, where they easily perform low-frequency functions but exhibit very weak higher-frequency capabilities; as a result, it takes many nodes to cover a large area. Several research


publications in the literature have covered the use of WSNs for various farm applications, such as irrigation management, water quality evaluation, and environmental monitoring. Additionally, several WSN-related issues require more attention, including interoperability, robust and fault-tolerant architecture, and maintenance [1, 5].

6.3 Artificial Intelligence in Agriculture 4.0

The agriculture sector is rapidly adopting both artificial intelligence and machine learning, in seasonal agricultural outcomes and in field agriculture techniques. Artificial intelligence is the development of computer systems that can carry out tasks requiring human intelligence, such as decision-making and sensory perception [24]. Combined with cloud computing, big data, and the Internet of Things, it is seen as one of the main drivers behind the digitization of Agriculture 4.0. Many smart farming systems that use machine learning (ML) and deep learning (DL) algorithms have been developed for parameters and tasks such as crop and soil control, diagnosing plant diseases, detecting weeds, intelligent spraying, and aerial surveys and photography used to improve crop production, in addition to ongoing monitoring, harvesting, processing, and marketing [17]. Such systems learn and respond to particular situations to increase efficiency [1]. Providing some of these options to farmers as a service, such as chatbots or other conversational interfaces, will help them keep pace with technological improvements, for example by identifying similarities in everyday farming. Agricultural robotics organizations recognize the value of autonomous robots that handle key agricultural services, such as plant assembly, on a larger scale, at a faster pace, and more proactively than human workers [19].
6.4 Advanced Robotics

Agricultural robots have been developed to perform precision farming operations and to replace humans in certain tasks. In recent years, advanced robots have been used for cultivation and harvesting, water supply, targeted spraying, and environmental control. Robots equipped with lasers and cameras identify and remove weeds without human intervention; they move autonomously through the crop rows, yielding higher output with less labour [9]. Robots can also complete specific farming tasks in the field. Recently, robots for plant cultivation and fruit picking have emerged, bringing new levels of efficiency to traditional methods. Stationary robots are usually the most popular option in industrial applications [29]. In agriculture, however, mobile robots may offer greater utility: they can traverse terrain that is not otherwise easily accessible, cover large fields, and automate agricultural tasks. Mobile robots are widely used in the agricultural sector to improve farm management, increase efficiency, and reduce labour needs, especially in harsh environments inaccessible to humans [19].

B. Ali et al.

These robots perform a variety of agricultural activities, such as harvesting, field cultivation, mechanical weeding, precision spraying, irrigation management, and soil sampling [9].

6.5 Machine Learning in Agriculture

Machine learning techniques fall into three categories: (1) supervised learning (regression trees, linear and nonlinear regression, Bayesian linear regression, polynomial logistic regression, and support vector regression); (2) unsupervised learning (K-means, hierarchical clustering, anomaly detection, neural networks, principal component analysis, independent component analysis, the apriori algorithm, and singular value decomposition); and (3) reinforcement learning (Markov decision processes and Q-learning) [1, 24]. Machine learning is applied across a wide range of agricultural tasks: crop yield prediction, disease and weed detection, weather (precipitation) forecasting, estimation of soil properties (such as moisture content, pH, and temperature), water management, calculation of fertiliser application rates, and livestock production and animal care. The analysis shows that crop yield prediction is the most extensively studied topic. The most popular ML techniques for smart agriculture are support vector machines (SVM), neural networks (NN), random forests (RF), and linear regression. No commercial applications have yet been reported for these use cases, which remain at the research stage. Additionally, indoor vertical farming systems, particularly hydroponic, aquaponic, and aeroponic systems, have hardly been studied using AI and machine learning approaches [24].
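To make the "crop yield prediction" use case above concrete, the following minimal sketch fits an ordinary least-squares linear regression in plain Python. The rainfall and yield figures are hypothetical, purely for illustration; real systems would use the richer ML models listed above (SVM, NN, RF).

```python
# Minimal sketch of crop yield prediction via ordinary least squares
# (single feature). The rainfall/yield numbers below are illustrative,
# not real measurements.

def fit_linear(xs, ys):
    """Fit y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Hypothetical seasonal rainfall (mm) vs. crop yield (t/ha).
rainfall = [300, 400, 500, 600, 700]
yield_tha = [2.1, 2.9, 3.8, 4.6, 5.5]

a, b = fit_linear(rainfall, yield_tha)
predicted = a * 550 + b  # forecast yield for a 550 mm season
```

The same fit/predict pattern carries over directly to the more capable regressors mentioned in the text; only the model-fitting step changes.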
Given the difficulties of digital transformation in terms of cybersecurity and data privacy, new approaches such as federated learning and privacy-preserving methods are being developed to enable digital agriculture and mitigate security issues [1].

6.6 Drone (UAV) Technology in Agriculture 4.0

Unmanned aerial vehicles (UAVs), commonly known as drones or aerial robots, are widely used by farmers to monitor crop growth and to combat hunger and adverse environmental impacts. They are called "unmanned" because there is no human pilot on board [1]. UAVs are used efficiently to take pictures, spray water and pesticides, and survey rugged terrain, and they have proven their value in spraying speed and accuracy compared with conventional machinery for the same purpose. UAV technology can also gather information that helps farmers make decisions. With recent developments in drone technology, a group of UAVs equipped with 3D cameras and heterogeneous sensors can work together to provide farmers with comprehensive capabilities for managing their lands and farms [5]; by using these drones, farmers effectively gain eyes in the sky. Drones can carry out large and difficult tasks and reach remote, rugged places quickly, but many challenges still need to be addressed to fully exploit this technology, particularly integration with other technologies and operation in bad weather [21]. UAVs and radio-controlled model aircraft can be flown at lower altitudes

for increased spatial resolution, possibly at a lower cost [19]. The advantage of drone imagery for agriculture is its accuracy (very high resolution) compared with other traditional methods [6]. Drone images are used for many purposes, including mapping grass types, measuring shrub biomass, mapping crop vigour, monitoring crop water stress, recording crop biomass, and evaluating nitrogen treatments on crops [21]. In addition, UAVs are also used as a geo-fence to repel animal attacks by alerting the owner in a timely and safe manner [19] (see Fig. 3).

Fig. 3. The types of drones used in agriculture (https://www.iberdrola.com/web/guest/innova cion/smart-farming-agricultura-precision).

7 The Benefits of Industry 4.0 in Agriculture

Using modern technologies in the agriculture sector has many economic, social, and environmental advantages, which this paper covers briefly. First, environmentally: these technologies reduce pollutants (such as agricultural chemicals and fertilizers), which supports sustainability [1, 3], increases soil quality, and reduces water pollution, carbon dioxide and nitrogen emissions, and air pollution. Second, economically: benefits include operational efficiency, better use of resources, and high-quality products, with increased yields, improved soil quality, and higher selling prices for raw materials. Smart farming systems also enable accurate demand forecasting and timely delivery of goods to market, reducing waste. Alongside these efficiency gains come lower costs: less water consumption, fewer technical inputs, and lower overall costs. Socially, these technologies reduce time and effort (less physical labor), increase quality, and reduce worker stress while improving well-being and safety at work. Overall, Agriculture 4.0 helps reduce costs and improve product quality, quantity, and sustainability [18].

8 Challenges Facing Agriculture 4.0

Introducing the new technology (Industry 4.0 into Agriculture 4.0) in a given environment faces many challenges and obstacles: technical and technological challenges in applying smart agriculture, technical problems with modern equipment, weather problems, rugged land in some areas, and some farmers' lack of knowledge about how to use modern techniques in agriculture [4, 8]. Agriculture 4.0 also requires substantial economic effort, so financing sources matter greatly to farmers; in some developing countries, funding is among the most important challenges. Implementation challenges: using Agriculture 4.0 in the sector requires skills. Cultural and organisational challenges: with the introduction of digitalization in agriculture, the fourth agricultural revolution will require skills related to introducing digital solutions in companies [18]. Weather conditions, energy sources, financing, the ruggedness of some areas, and some farmers' lack of knowledge of modern technologies are considered the most important challenges facing smart agriculture. Moreover, successfully implementing ANNs, fuzzy logic control systems, and other artificial intelligence engines, and training artificial intelligence (AI) models for the agricultural sector, requires a huge amount of data [4].

9 Discussion

Smart farming depends on advanced technology for management, analysis, processing, data collection, and decision-making, in order to reduce costs and make farming more productive, profitable, sustainable, and environmentally friendly. Smart agriculture can contribute significantly to meeting the food needs of the world's population [1]. This study aims to highlight the importance of using new technologies in farming: applying Industry 4.0 to agriculture can make it smart and accessible and can contribute to solving many agricultural problems. We have outlined some of these challenges and solutions in this paper, which underlines the need for researchers, farmers, and agricultural companies to contribute to developing this important sector, and shows that progress is not yet where it needs to be. Many issues and challenges remain, such as food safety, lack of digitization, and environmental problems. Sources of financing and support, lack of resources, and many farmers' limited knowledge of how to use new technologies on farms are considered among the most important challenges facing smart agriculture in developing countries. Industry 4.0 is distinguished by accomplishing many tasks with less effort and in less time. It spawned the emerging technologies, bringing together BDMM, IoT technologies, robotics, big data, and blockchain [7, 8, 18, 19], that are making production and supply processes smarter and more autonomous. The integration of smart industry and agriculture offers the opportunity to transform industrial agriculture into its next generation.
Smart farming can be achieved by collecting precise variables in real time and by processing and analysing data across the agricultural chain, from food production, processing, and distribution to consumers, using artificial intelligence (AI), big data, sensors, IoT, UAVs, agricultural robots, etc. Such a smart ecosystem in the agricultural field would contribute significantly to

improving productivity, food supply chain efficiency, natural resource utilization, and food safety. Moving from Industry 4.0 toward Agriculture 4.0 means a shift in production infrastructure: it contributes to connected farms and new production equipment, tractors, and tying machines, enabling increased productivity, quality, and environmental protection [28]. We also recommend that those interested in smart farming, and researchers, combine AI techniques with fuzzy logic, which will enable farms to lead in adopting more effective environmental protection strategies; this is one of the most important recommendations to pursue.

10 Conclusions

This paper provides a future vision, an overview, and practical solutions for Industry 4.0 applications in agriculture. It highlights the significance of modern technologies in agriculture for those interested in developing the sector and locating its weak points, identifies the challenges facing smart agriculture in order to find solutions and innovate new approaches, and thereby helps open research and innovation questions and identify the gaps in Agriculture 4.0. The study's goal is to respond effectively to global climate change, fluctuations in the global economy and environment, resource scarcity, and population growth, all of which affect agricultural development and hinder food self-sufficiency. The technologies Industry 4.0 offers agriculture point to many creative solutions for developing this important sector. This paper highlights the challenges facing farmers and presents practical solutions to meet future expectations and make agriculture more efficient and smarter. In this context, the Internet of Things, drones, micro-communication technologies, agricultural robots, cloud computing, and artificial intelligence are discussed. The study also introduces our current research on applications of AI, IoT, ANNs, fuzzy logic, and early detection of plant diseases. IoT-based smart communication for data collection, processing, and analysis is one of the powerful tools enabling agricultural management, so farmers and researchers should pay close attention to moving this sector to a new qualitative level. It will provide a new ecological approach to environmental protection, although there is still much research to be done.
The design above demonstrates that fuzzy logic enables the integration of expert human knowledge into control strategies, facilitating control in protected agriculture; various expert-based strategies can also be simulated, compared, and analyzed [13]. In conclusion, this study also highlights that humans can use fuzzy logic to encode their formal knowledge for the integrated control of protected agriculture. Future work is to combine IoT and AI to achieve the highest levels of sustainability, quality, and efficiency in the use of water, energy sources, fertilizers, pesticides, and soil (type, temperature, etc.) on the farm, with the expectation that IoT applications in the agriculture sector will lead to environmental protection. Additionally, advancing the combined use of fuzzy logic and AI technology will enable farms to adopt more effective environmental protection tactics.
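To illustrate how expert knowledge can be encoded as fuzzy rules of the kind advocated above, the following sketch implements a toy Mamdani-style controller for greenhouse ventilation. The membership functions, rule base, and numeric thresholds are hypothetical examples, not the controller of [13].

```python
# Toy fuzzy-logic sketch: expert rules mapping temperature to a vent
# opening. All membership functions and rule outputs are hypothetical.

def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def vent_opening(temp_c):
    """Map greenhouse temperature (°C) to vent opening (%) via weighted fuzzy rules."""
    # Expert rule base: IF temp is cool THEN vent 10%; warm -> 50%; hot -> 90%.
    rules = [
        (tri(temp_c, 5, 15, 25), 10.0),   # "cool"
        (tri(temp_c, 20, 27, 34), 50.0),  # "warm"
        (tri(temp_c, 30, 40, 50), 90.0),  # "hot"
    ]
    total = sum(w for w, _ in rules)
    if total == 0:
        return 50.0  # out-of-range input: fall back to a neutral opening
    # Weighted-average defuzzification.
    return sum(w * out for w, out in rules) / total
```

Because overlapping membership functions blend adjacent rules, the controller interpolates smoothly between the expert set-points instead of switching abruptly, which is the practical appeal of fuzzy control in protected cultivation.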

References

1. Abbasi, R., et al.: The digitization of agricultural industry – a systematic literature review on Agriculture 4.0. Smart Agric. Technol. 2, 100042 (2022). https://doi.org/10.1016/j.atech.2022.100042
2. Arif, C., Mizoguchi, M., Setiawan, B.I., Doi, R.: Estimation of soil moisture in paddy field using artificial neural networks. Int. J. Adv. Res. Artif. Intell. 1(1), 17–21 (2012)
3. Badrun, B., Manaf, M.: The development of smart irrigation system with IoT, cloud, and big data. In: IOP Conference Series: Earth and Environmental Science, vol. 830, no. 1, p. 012009. IOP Publishing (2021)
4. Bernhardt, H., et al.: Challenges for agriculture through Industry 4.0. Agronomy 11(10), 1935 (2021). https://doi.org/10.3390/agronomy11101935
5. Boursianis, A.D., et al.: Internet of Things (IoT) and agricultural unmanned aerial vehicles (UAVs) in smart farming: a comprehensive review. Internet Things 18, 100187 (2022). https://doi.org/10.1016/j.iot.2020.100187
6. Daponte, P., De Vito, L., Glielmo, L., Iannelli, L., Liuzza, D., Picariello, F., Silano, G.: A review on the use of drones for precision agriculture. In: IOP Conference Series: Earth and Environmental Science, vol. 275, no. 1, p. 012022 (2019). https://doi.org/10.1088/1755-1315/275/1/012022
7. De Clerq, M., Vats, A., Biel, A.: Agriculture 4.0: the future of farming technology. World Government Summit (Oliver Wyman), pp. 5–6 (2018)
8. Elijah, O., et al.: An overview of Internet of Things (IoT) and data analytics in agriculture: benefits and challenges. IEEE Internet Things J. 5(5), 3758–3773 (2018). https://doi.org/10.1109/jiot.2018.2844296
9. Khatoon, S., Rawat, A., Bhople, S., Dwivedi, P.: Robotic technology: fate of agriculture in future scenario. Krishi Udyan Darpan. In: Gliever, C., Slaughter, D.C. (eds.) Crop Verses Weed Recognition with Artificial Neural Nets, ASAE Paper 01-3104, pp. 1–12 (2001)
10. Gutiérrez-Rodríguez, A., Décima, M., Popp, B.N., Landry, M.R.: Isotopic invisibility of protozoan trophic steps in marine food webs. Limnol. Oceanogr. 59(5), 1590–1598 (2014). https://doi.org/10.4319/lo.2014.59.5.1590
11. Haiyan, S., Yong, H.: Crop nutrition diagnosis expert system based on artificial neural networks. In: Third International Conference on Information Technology and Applications (ICITA'05) (2005). https://doi.org/10.1109/icita.2005.108
12. Hinnell, A.C., Lazarovitch, N., Furman, A., Poulton, M., Warrick, A.W.: Neuro-Drip: estimation of subsurface wetting patterns for drip irrigation using neural networks. Irrig. Sci. 28, 535–544 (2010)
13. Iliev, O.L., Sazdov, P., Zakeri, A.: Fuzzy logic based control for protected cultivation. Manag. Environ. Qual. 25(1), 83–92 (2014)
14. Iliev, O.L., Zakeri, A., Naing, K.M., Venkateshaiah, N.: Greenhouse cultivation control – fuzzy logic based approach. In: 2nd International Conference on Advancement in Engineering, Applied Science and Management (ICAEASM-2017), Osmania University, Hyderabad, India (2017)
15. Bruinsma, J. (ed.): The Futures of Agriculture – GFAR. What are the likely developments in world agriculture towards 2050? (2012). https://www.gfar.net/sites/default/files/files/Jelle%20Bruinsma_FAO_Brief%2038.pdf. Accessed 6 Jan 2023
16. Tilva, V., Patel, J., Bhatt, C.: Weather based plant diseases forecasting using fuzzy logic. In: 2013 Nirma University International Conference on Engineering (NUiCONE), pp. 1–5. IEEE (2013)
17. Jha, K., Doshi, A., Patel, P., Shah, M.: A comprehensive review on automation in agriculture using artificial intelligence. Artif. Intell. Agric. 2, 1–12 (2019). https://doi.org/10.1016/j.aiia.2019.05.004

18. Liu, Y., Ma, X., Shu, L., Hancke, G.P., Abu-Mahfouz, A.M.: From Industry 4.0 to Agriculture 4.0: current status, enabling technologies, and research challenges. IEEE Trans. Industr. Inf. 17(6), 4322–4334 (2021). https://doi.org/10.1109/TII.2020.3003910
19. Klerkx, L., et al.: A review of social science on digital agriculture, smart farming and Agriculture 4.0: new contributions and a future research agenda. NJAS-Wagening. J. Life Sci. 90–91, 100315 (2019). https://doi.org/10.1016/j.njas.2019.100315
20. Kodali, R.K., Jain, V., Karagwal, S.: IoT based smart greenhouse. In: 2016 IEEE Region 10 Humanitarian Technology Conference (R10-HTC) (2016). https://doi.org/10.1109/r10-htc.2016.7906846
21. Roopaei, M., Rad, P., Choo, K.-K.R.: Cloud of things in smart agriculture: intelligent irrigation monitoring by thermal imaging. IEEE Cloud Comput. 4(1), 10–15 (2017). https://doi.org/10.1109/MCC.2017.5
22. Maier, H.R., Dandy, G.C.: Neural networks for the prediction and forecasting of water resources variables: a review of modelling issues and applications. Environ. Model. Softw. 101–124 (2000)
23. Malavade, V.N., Akulwar, P.K.: Role of IoT in agriculture. IOSR J. Comput. Eng. 2278–661 (2016)
24. Mekonnen, Y., Namuduri, S., Burton, L., Sarwat, A., Bhansali, S.: Review—machine learning techniques in wireless sensor network based precision agriculture. J. Electrochem. Soc. 167(3), 037522 (2019). https://doi.org/10.1149/2.0222003jes
25. Tremblay, N., Bouroubi, M.Y., Panneton, B., Guillaume, S., Vigneault, P., Bélec, C.: Development and validation of fuzzy logic inference to determine optimum rates of N for corn on the basis of field and crop features. Precision Agric. 11(6), 621–635 (2010). https://doi.org/10.1007/s11119-010-9188-z
26. Navulur, S., Sastry, S.C.S., Prasad, M.G.: Agricultural management through wireless sensors and Internet of Things. Int. J. Electr. Comput. Eng. (IJECE) 7(6), 3492–3499 (2017). https://doi.org/10.11591/ijece.v7i6.pp3492-3499
27. Kim, Y., Evans, R.G., Iversen, W.M.: Remote sensing and control of an irrigation system using a distributed wireless sensor network. IEEE Trans. Instrum. Meas. 57(7), 1379–1387 (2008)
28. Yang, F., Gu, S.: Industry 4.0, a revolution that requires technology and national strategies. Complex Intell. Syst. 7(3), 1311–1325 (2021). https://doi.org/10.1007/s40747-020-00267-9
29. Sannakki, S.S., Rajpurohit, V.S., Nargund, V.B.: SVM-DSD: SVM based diagnostic system for the detection of pomegranate leaf diseases. Adv. Intell. Syst. Comput., 715–720 (2013). https://doi.org/10.1007/978-81-322-0740-5_85
30. Shahzadi, R., Tausif, M., Ferzund, J., Suryani, M.A.: Internet of Things based expert system for smart agriculture. Int. J. Adv. Comput. Sci. Appl. 7(9), 341–350 (2016)
31. Telukdarie, A., Sishi, M.N.: Enterprise definition for Industry 4.0. IEEE Xplore (2018). https://ieeexplore.ieee.org/abstract/document/8607642/. Accessed 15 Mar 2022
32. Suresh, S., Pillai, P., Proverbs, D.: The 4th Faculty of Science and Engineering Festival of Research: Conference Proceedings, WIRE HOME (2023). https://wlv.openrepository.com/handle/2436/625181. Accessed 01 June 2023

Sustainable Hydroponics: Embedding IoT and Efficient Energy Generation and Distribution Algorithms

Menachem Domb1(B), Viraj Hudli2, and Sujata Joshi2

1 Ashkelon Academy, Ben Zvi, Ashkelon, Israel

[email protected]

2 Symbiosis Institute of Digital and Telecom Management, Symbiosis International (Deemed University), Pune, MH, India
{viraj.hudli2123,sjoshi}@sidtm.edu.in

Abstract. Hydroponics is a technique that uses a soil-free growing medium and efficiently manages essential resources like water, light, and the environment. With the increasing demand for agricultural produce and continuously shrinking fertile land, demand for hydroponics is rising. An IoT system constantly monitors the farm and gathers data on the factors affecting plant growth, including pH, temperature, water level, and humidity. The IoT systems used in hydroponics face energy constraints, low energy utilization, and continuous-monitoring issues. This paper proposes three effective solutions: (1) a modular and customizable IoT approach using a WSN (Wireless Sensor Network) based on the SWORD (Sleep/Wake On Redundant Data) algorithm, together with modern, durable sensors, to improve energy efficiency and data communication; (2) energy-efficient solutions that harness solar energy to power the sensing nodes and reduce cost and energy use through sustainable energy harvesting with SEH-WSN (Solar Energy Harvesting Wireless Sensor Networks); and (3) routing-path identification algorithms that enhance the lifetime of the WSN.

Keywords: IoT · SEH-WSN · Hydroponics · Routing algorithms

1 Introduction

Hydroponics, a modern form of intelligent farming, is gaining recognition because it does not depend on external environmental factors like soil quality and the rain cycle, and its yield is on par with that of traditional farming. But this comes with challenges. Hydroponics, if correctly implemented, can prove a boon to the farming sector. Most people who practice agriculture or horticulture using hydroponic techniques face problems related to monitoring nutrient solutions, incorrect lighting, lack of training on the infrastructure, and power constraints. The Internet of Things (IoT) is gaining popularity in the smart-farming domain, including sensors that monitor aspects of the crop itself, the temperature of the hydroponics environment, lighting, and the pH level of the nutrient solution. The obtained

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 652–662, 2024. https://doi.org/10.1007/978-3-031-47715-7_44

data is used to take corrective and proactive measures. IoT enables automation of the various components. However, it also brings challenges, since proper integration and energy utilization are needed to reap optimum benefits [2]. The sensors and other devices must communicate efficiently and securely with each other and with other devices. Because IoT can connect heterogeneous physical devices, it helps build a unified architecture [16]. Routing protocols designed for IoT networks are expected to consume little energy. Hence, further issues to address are resilience, scalability, and energy efficiency, as many energy-constrained IoT devices are connected, making battery power management, storage handling, and routing protocols challenging. The complex physical infrastructure needs constant monitoring to ensure that water supply, wastewater draining, and pumping are timely and synchronized. Digital twinning helps diagnose issues at a preliminary stage and provides real-time updates on the system's condition by creating a digital twin of the entire setup. Hydroponics researchers can further embed technologies like digital twinning and sophisticated routing algorithms to improve the efficiency of hydroponic systems [24]. This paper aims to address the problems mentioned above and build a robust and sustainable IoT system for hydroponics. The paper is structured as a Literature Review, followed by Research Methodology with research questions, Data Analysis and Findings, and Results and Changing Trends; the Conclusion and Implications close the paper and outline the scope for further research [23].

2 Literature Review

2.1 Hydroponics

Hydroponics produces quality agricultural yield by creating an infrastructure with a controlled environment: temperature, lighting, nutrients in the solution, and water supply. This farming technique has gained popularity as fertile land is shrinking while demand for food is increasing; moreover, integrating technologies like IoT has made the technique more sophisticated [2].

2.2 Wireless Sensor Networks

A WSN is a wireless network of sensing nodes used for monitoring and recording environmental factors, with a variable number of sensor nodes, each equipped with multiple sensors [3]. The data travels multi-hop to the 'sink node', which stores it and carries it outside the WSN.

2.3 IoT

The Internet of Things is a network of physical objects, known as 'things', in which sensors gather data related to those things and communicate it to other systems over the Internet. The communication technology used by the IoT setup is crucial [4].

2.4 SEH-WSN (Solar Energy Harvesting Wireless Sensor Nodes)

Wireless sensors have energy constraints. When the area used for hydroponics is large, many sensors are needed, and it is not feasible to replace the discharged batteries of all of them. SEH refers to converting sunlight into electrical energy and using it to charge a node's battery at the operation site, i.e., the hydroponics facility. Small solar panels connected to the wireless sensor nodes fulfil their energy needs. The solar energy generated by the photovoltaic cell can power a sensor node directly. Another suitable approach is to harvest solar energy into a rechargeable battery or a supercapacitor, which charges the sensor nodes at night or on rainy days. The photovoltaic device is paired with a Maximum Power Point Tracking (MPPT) solar charge controller, which harvests the maximum power from the solar array by adjusting its input voltage and then transforms it to supply varying voltage requirements. This paper provides a holistic approach to developing a sustainable hydroponics infrastructure by considering solar energy, efficient routing algorithms, and algorithms embedded in the sensors' microcontrollers that make them more energy efficient.

3 Research Methodology

This paper focuses on achieving an end-to-end energy-efficient wireless sensor network in hydroponics. It presents the impact and benefits of incorporating new technologies into modern agriculture, which benefit both the quality and quantity of the produce. We discuss how the sensors' microcontrollers can be programmed with dedicated algorithms that add to energy optimization, an essential aspect of every stage of the hydroponics process. The aim is sustainability: using photovoltaic cells to harness and harvest abundant solar energy to power the wireless sensor nodes. The paper follows the standard research process, starting from identifying the issues in intelligent hydroponic farming, evaluating existing solutions, and proposing a sustainable technology-based solution. The following approach was used to carry out the research:

• Identifying the problems in intelligent farming
• Choosing hydroponics using IoT as the topic
• Referring to secondary market research to gain more knowledge about it
• Identifying a research question: how can sustainability be ensured in hydroponics using technology?
• Downloading papers from scientific journals
• Compiling information from 25 papers in established journals (IEEE, Springer, ResearchGate, and so on) and noting the related findings

4 Data Analysis and Findings

Sustainability demands using clean energy wherever possible. IoT sensor nodes have long used lithium-ion batteries, whose extraction is an environmentally contaminating process. Moreover, energy efficiency is a major challenge and an extensively researched area of improvement [1, 2]. In a star topology, failure of the central node shuts down the entire network, and such a topology is practically impossible when the area under hydroponics is large. It is difficult to check that all sensing nodes are in place and to replace exhausted batteries, which motivates using the energy available in abundance: ambient solar energy, with photovoltaic cells converting light into electrical power. An SEH-WSN monitors temperature, humidity, air quality, pressure, etc., and is hence the best sought-after alternative to lithium-ion batteries [3, 24]. Many solar manufacturers have started developing small solar cells specifically for IoT sensing nodes; their small size, lightweight design, and cost-effectiveness make it practical to pair them with many sensors. Together with a battery, a continuous energy supply is ensured [5, 6]. A combined solar-plus-battery-powered sensor can work in complete darkness for around 30–40 days.

4.1 Energy Harvesting Model

Using solar energy for the WSN nodes solves the energy inefficiency problem. The energy harvesting model consists of an Energy Harvester and an Energy Manager [7, 14], which work together for optimum output.

Photovoltaic Cell (PV Cell). An energy-harvesting device that converts ambient sunlight into electricity through the photovoltaic effect: electric current is generated when photons from sunlight are incident on the cell [15].

DC-DC Converter. A DC-DC converter converts one DC voltage to another to obtain a maximum, stable DC voltage for the load.
We use a switched DC-DC converter, which works by storing the input energy periodically and then releasing it as output at a different voltage.

Maximum Power Point Tracking (MPPT). MPPT is an electronic DC-DC conversion technique that optimizes the match between the photovoltaic cell and the load, allowing the PV array to run at a higher voltage than the load. It suits areas with low sunlight or the winter seasons. An energy harvesting model with an MPPT tracker extracts maximum power from the transducer and delivers it to the load.
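One common way an MPPT controller "adjusts its input voltage" as described above is the perturb-and-observe (P&O) technique, sketched below. The quadratic PV power curve is a toy stand-in, not a real panel model; the step size and iteration count are hypothetical.

```python
# Sketch of perturb-and-observe (P&O) MPPT: repeatedly nudge the operating
# voltage and keep moving in the direction that increases harvested power.
# The PV power curve below is a toy example with its maximum at 18 V.

def pv_power(v):
    """Toy PV power curve (W), maximum of 60 W near 18 V."""
    return max(0.0, -0.5 * (v - 18.0) ** 2 + 60.0)

def perturb_and_observe(v=12.0, step=0.5, iters=100):
    p_prev = pv_power(v)
    direction = 1.0
    for _ in range(iters):
        v += direction * step
        p = pv_power(v)
        if p < p_prev:          # power dropped: reverse the perturbation
            direction = -direction
        p_prev = p
    return v

v_mpp = perturb_and_observe()
# v_mpp settles into an oscillation within one step of the 18 V maximum.
```

P&O never stops exactly at the maximum power point; it hunts around it within one perturbation step, which is why the step size trades tracking speed against steady-state ripple.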

Load. The processor of the WSN device and the communication transceiver are the primary users of the harvested energy. The load is a small electronic device or the transceiver of a communication node; it receives power from the battery or the solar panel.

Electronic Relay. The electronic relay works as a switch: when sunlight is insufficient and the voltage required by the WSN node is higher than that generated by the PV cell, the switch opens and energy from the battery is supplied to the node [8].

Rechargeable Battery. Excess electricity produced by the solar cells, remaining after use by the WSN node, is stored in the battery. In case of faint sunlight or higher voltage requirements, the stored energy is used; the rechargeable battery is later charged back to maximum when solar power is abundant [17]. This model pays off well in areas with inconsistent sunlight: the WSN nodes run on solar energy whenever plenty of daylight is available, while the surplus electric power is stored in the energy storage device.
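The relay-and-battery energy flow described above can be reduced to a simple per-time-step decision, sketched below with hypothetical power levels and battery capacity (the real system switches on voltage via the relay; here the bookkeeping is done in energy terms for clarity).

```python
# Toy energy-flow step for the SEH-WSN node: PV powers the node and charges
# the battery when harvest covers demand; otherwise the relay switches the
# node onto the battery. Units and capacities are illustrative only.

def energy_step(pv_mw, node_mw, battery_mwh, capacity_mwh=500.0, dt_h=1.0):
    """Advance one time step; return (power_source, new_battery_mwh)."""
    if pv_mw >= node_mw:
        surplus = (pv_mw - node_mw) * dt_h
        battery_mwh = min(capacity_mwh, battery_mwh + surplus)  # store excess
        return "pv", battery_mwh
    deficit = (node_mw - pv_mw) * dt_h
    battery_mwh = max(0.0, battery_mwh - deficit)               # drain battery
    return "battery", battery_mwh

# Bright hour: PV covers the node and tops up the battery.
src, level = energy_step(pv_mw=120.0, node_mw=40.0, battery_mwh=100.0)
# Dark hour: the relay falls back to the battery.
src2, level2 = energy_step(pv_mw=0.0, node_mw=40.0, battery_mwh=level)
```

Iterating this step over a day-long harvest profile gives a quick feasibility check of whether a given panel/battery pairing can keep a node alive through the night.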

Fig. 1. Working of SEH-WSN.

Figure 1 depicts the working of the wireless sensor network using solar energy efficiently. In case of low sunlight, the electronic relay connected to the energy store opens, and the node is powered from the rechargeable battery. In case of sufficient

Sustainable Hydroponics: Embedding IoT

657

sunlight, the solar cells absorb the solar energy, which is utilized by the WSN node, and the excess makes its way to the rechargeable battery [9].

4.2 Energy Efficient SWORD Algorithm in Wireless Sensor Network

The SWORD algorithm combines a sleep/wake scheme for the sensor nodes with redundant-data-transmission avoidance. Embedded in the sensor's microcontroller, it copes with saving energy without compromising quality and performance [5, 10], and helps optimize the sensor node's energy consumption and data communication. SWORD reduces the power consumption of the sensing and routing nodes by more than 90% and data communication by around 85%. Moreover, the SWORD algorithm goes hand in hand with the Solar Energy Harvesting (SEH) technique discussed in the earlier sections of this paper.
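The threshold-based transmit-or-sleep decision at the heart of SWORD can be sketched as follows. This is a simplified illustration of the scheme, not the reference implementation from [10]; the sample data and threshold are assumptions.

```python
def sword_filter(readings, threshold):
    """SWORD-style redundancy suppression: on each periodic wake-up,
    transmit a reading only if it differs from the last transmitted value
    by more than the threshold; otherwise stay silent and sleep again."""
    transmitted = []
    last_sent = None
    for value in readings:
        if last_sent is None or abs(value - last_sent) > threshold:
            transmitted.append(value)
            last_sent = value
    return transmitted

# Temperature samples taken at periodic wake-ups (illustrative data).
samples = [21.0, 21.1, 21.05, 22.5, 22.6, 24.0]
sent = sword_filter(samples, threshold=1.0)
```

Here only three of the six samples are transmitted; the near-duplicate readings are suppressed, which is where the communication savings come from.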

Fig. 2. Flow Diagram of the SWORD algorithm.


Figure 2 describes the SWORD algorithm, which avoids the unnecessary transmission of data when there is no significant difference [11]. The difference threshold is set according to the assessed variable. The microcontroller wakes up periodically and senses the monitored variable; if the perceived difference is above the threshold, the information is sent onward, and not otherwise. Thus the SWORD algorithm ignores repetitively transmitted data, preserving the power of the sensor nodes. It complements solar energy harvesting, energy-efficient networking, and routing protocols.

4.3 Energy Efficient Routing Algorithms

Unnecessary routing, which consumes energy from the WSN nodes, is still pertinent in hydroponics. This section presents approaches for energy-efficient routing of the information sensed by the wireless sensor network and its transmission to the destination sink node. We introduce Cuckoo-ANN optimization modeling [13, 15, 21], which reduces the number of hops from source to sink while maintaining the quality and integrity of the shared data.

Routing Protocols for Energy-Efficient Transmission. As discussed, the main challenge in an IoT-based WSN is energy efficiency. Transmitting 1 KB of data corresponds to the processing of 3 million instructions, and transmission over long distances needs more energy. Hence, the solution is to reduce the number of hops made by the data packets, either by increasing the transmission range between the nodes or by incorporating protocols that choose an energy-saving routing path [11, 21]. Energy-efficient, reliable routing protocols are robust to link failures and ensure reliable data routing.

Fig. 3. Taxonomy of energy-efficient routing algorithms.

Figure 3 depicts the hierarchy of energy-efficient routing algorithms, from which we discuss the techniques that use multiple paths. Listed below are two popular categories of such routing protocols that can enhance the lifetime of the WSN:
1. Energy-efficient multi-path routing protocols
2. The Cuckoo-ANN approach to finding the best nodes


Energy Efficient Multi-path Routing Protocols. Multipath routing is better than single-path routing as it is robust against node failures and congestion. Failures can be mitigated by satisfying specific requirements, such as following routes with low delay or low energy consumption [12, 19, 20]. Multipath routing avoids failures by rerouting using load-balancing techniques.

Overlap Multipath: This protocol is reactive, i.e., the path is chosen after the node receives the data and wants to forward it. Control data is shared among neighbors to maintain network connectivity. Initially, the sensor node selects a set of nodes; once data is available, it is forwarded to that set. The selection is based on the energy level of the nodes. The paths can overlap each other, and a longer route may be taken. Label-based multipath routing uses this approach [22].

Funnel Multipath: This routing protocol overcomes the drawbacks of overlap multipath by setting a threshold called the funnel width (a counter) beyond which the data does not propagate. The sensor node selects a set of nodes based on their energy levels and broadcasts the data to these nodes in the direction of the sink. Once the data reaches nodes where the funnel counter equals the threshold, the information is no longer broadcast; it then moves through multipath funnels toward the sink [22].

Disjoint Multipath: Funneling saves a lot of energy drain, but when a large volume of data is relayed, an energy imbalance arises in the wireless sensor network. Disjoint multipath uses a proactive approach wherein the path is decided before the node receives the data. These paths are kept apart and hence do not interfere with each other on their way to the sink, ensuring fair distribution of energy.
Hierarchical Multipath: This is also a proactive approach with disjoint paths, but the entire WSN is divided into clusters, each with one relay node directing the plant or temperature data to the sink.

The Cuckoo-ANN Approach to Finding the Best Nodes [13, 18]. The energy consumption of an IoT-WSN depends on the distance and the transmitted data size. To reduce the distance, we adopted the Cuckoo-ANN algorithm and made several adjustments to improve its effectiveness in this context. Figure 4 depicts the working of the adjusted Cuckoo-ANN search, inspired by the cuckoo bird's way of laying eggs: the cuckoo randomly lays 3–10 eggs in a nest far away from its source habitat. The search has three stages:

• Random placement of eggs (nodes) as far as possible
• The best nests with good-quality eggs are carried to the next generation (routing)
• Probability of being discovered by the host nest (the sender node).
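The three stages above can be sketched as a heavily simplified cuckoo search: random egg placement, survival of the best nests, and probabilistic discovery by the host. This is not the adjusted Cuckoo-ANN of the paper (no neural network, no Lévy flights); the fitness function, which scores a candidate relay position by its distance from an assumed sink, and all parameters are illustrative.

```python
import random

def cuckoo_search(fitness, bounds, n_nests=10, p_discover=0.25, generations=100, seed=1):
    """Simplified cuckoo search: lay random eggs, keep an egg when it beats
    the nest it lands in, and abandon a fraction of the worst nests."""
    rng = random.Random(seed)
    lo, hi = bounds
    nests = [rng.uniform(lo, hi) for _ in range(n_nests)]
    for _ in range(generations):
        # Stage 1: lay a new egg (candidate solution) in a random nest.
        egg = rng.uniform(lo, hi)
        i = rng.randrange(n_nests)
        if fitness(egg) < fitness(nests[i]):
            nests[i] = egg  # Stage 2: better eggs survive to the next generation.
        # Stage 3: hosts discover and abandon some of the worst nests.
        nests.sort(key=fitness)
        for j in range(n_nests):
            if j >= n_nests * (1 - p_discover) and rng.random() < p_discover:
                nests[j] = rng.uniform(lo, hi)
    return min(nests, key=fitness)

# Toy fitness: distance of a relay position from a sink at x = 7 (assumption).
best = cuckoo_search(lambda x: abs(x - 7.0), bounds=(0.0, 20.0))
```

Because the best nest is never abandoned or replaced by a worse egg, the best solution found only improves over the generations.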

5 Simulated Experiment

To demonstrate the proposed system, integrating all involved components and the data flow, we simulated it using JavaScript as the primary control system, activating the associated algorithms in the proper sequence. The experiment's outcome shows that the proposed solution works in practice and is suitable for realization.


Fig. 4. Cuckoo search for best sensor node.

6 Conclusion

The paper presents a holistic view of achieving sustainability using IoT for hydroponic farming, describing an end-to-end energy-efficient WSN, starting from harnessing solar energy to ensure the sustainability of the clean energy source. Algorithms embedded in the microcontrollers make the nodes energy efficient, and different energy-saving routing protocols are considered. Farmers can implement the discussed solutions to save the cost and time required for an energy-efficient deployment. A solar-powered hydroponics system can be beneficial in terms of cost, reliability, and wireless-sensor-network lifetime, with the trade-off that the initial implementation cost may be high. The paper sheds light on how a solar-powered SEH-WSN is implemented, giving hydroponics implementers a direction on how a sustainable IoT system for hydroponic farming can be realized with reduced cost and enhanced sensor lifetime. These techniques can also aid other agricultural activities such as livestock monitoring and storage monitoring. Wireless sensor networks help ensure that factors like heat and humidity stay within range, and the sensors can track the crop's condition. Researchers and IoT solution providers can further experiment with combining technologies like AI/ML, big data, and fog computing with SEH-WSN, making fault detection easy and proactive maintenance and support attainable.


7 Limitations and Future Research Directions

The setup for a sustainable hydroponics system is costly, especially the solar panels and the large batteries used to store the energy. Substitutes need to be identified, and their sustainability justified, for the approach to become popular among agriculture practitioners. Energy consumption can be decreased further by optimizing the energy required to power the physical elements of hydroponics, such as pumps, lights, and solar-powering devices. Care should also be taken regarding the possibility of equipment failure; hence, efficient and cost-effective equipment monitoring solutions should be identified and implemented.

References

1. Sharma, H., Haque, A., Jeffery, Z.A.: Solar energy harvesting wireless sensor network nodes: a survey. J. Renew. Sustain. Energy 10(2) (2018)
2. Swain, A., Chatterjee, S., Vishwanath, M.: Hydroponics in vegetable crops: a review. Pharma Innov. J. 10(6), 629–634 (2021)
3. Friha, O., Ferrag, M.A., Shu, L., Maglaras, L., Wang, X.: Internet of things for the future of smart agriculture: a comprehensive survey of emerging technologies. IEEE/CAA J. Autom. Sinica 8(4), 718–752 (2021)
4. Ayaz, M., Ammad-Uddin, M., Sharif, Z., Mansour, A., Aggoune, E.-H.M.: Internet-of-Things (IoT)-based smart agriculture: toward making the fields talk. IEEE Access 7(1), 129551–129583 (2019)
5. Treftz, C., Omaye, S.T.: Hydroponics: potential for augmenting sustainable food production in non-arable regions. Nutr. Food Sci. 46(5), 672–684 (2016)
6. Vidhya, R., Valarmathi, K.: Survey on automatic monitoring of hydroponics farms using IoT. In: 3rd International Conference on Communication and Electronics Systems (ICCES), pp. 125–128 (2018)
7. Modu, F., Adam, A., Aliyu, F., Mabu, A., Musa, M.: A survey of smart hydroponic systems. Adv. Sci. Technol. Eng. Syst. J. 5(1), 233–248 (2020)
8. Gurewitz, O., Shifrin, M., Dvir, E.: Data gathering techniques in WSN: a cross-layer view. Sensors 22(7), 2650–2694 (2022)
9. Sharma, H., Haque, A., Jeffery, Z.A.: Solar energy harvesting wireless sensor network nodes. J. Renew. Sustain. Energy 10(2), 023704 (2018)
10. Jawad, H.M., Nordin, R., Gharghan, S.K., Jawad, A.M., Ismail, M., Abu-AlShaeer, M.J.: Power reduction with sleep/wake on redundant data (SWORD) in a wireless sensor network for energy-efficient precision agriculture. Sensors 18(10), 3450 (2018)
11. Triantafyllou, A., Tsouros, D.C., Sarigiannidis, P., Bibi, S.: An architecture model for smart farming. In: 15th International Conference on Distributed Computing in Sensor Systems (DCOSS), pp. 385–392 (2019)
12. Kansal, A., Hsu, J., Zahedi, S., Srivastava, M.B.: Power management in energy harvesting sensor networks. ACM Trans. Embed. Comput. Syst. (TECS) 6(4), 32-es (2007)
13. Piran, M.J., Verma, S., Menon, V.G., Suh, D.Y.: Energy-efficient transmission range optimization model for WSN-based internet of things. Comput. Mater. Continua 67(3), 2989–3007 (2021)
14. Zagrouba, R., Kardi, A.: Comparative study of energy efficient routing techniques in wireless sensor networks. Information 12(1), 42 (2021)


15. Sharma, D., Bhondekar, A.P.: Traffic and energy-aware routing for heterogeneous wireless sensor networks. IEEE Commun. Lett. 22(8), 1608–1611 (2018)
16. Singh, R.K., Berkvens, R., Weyn, M.: AgriFusion: an architecture for IoT and emerging technologies based on a precision agriculture survey. IEEE Access 9, 136253–136283 (2021)
17. Khan, M.K., Shiraz, M., Shaheen, Q., Butt, S.A., Akhtar, R., Khan, M.A., Changda, W.: Hierarchical routing protocols for wireless sensor networks: functional and performance analysis. J. Sens. 2021, 7459368:1–7459368:18 (2021)
18. Bhargava, D., Prasanalakshmi, B., Vaiyapuri, T., Alsulami, H., Serbaya, S.H., Rahmani, A.W.: CUCKOO-ANN based novel energy-efficient optimization technique for IoT sensor node modeling. Wirel. Commun. Mob. Comput. 2022, Article ID 8660245 (2022)
19. Sreedevi, T.R., Kumar, M.S.: Digital twin in smart farming: a categorical literature review and exploring possibilities in hydroponics. In: 2020 Advanced Computing and Communication Technologies for High-Performance Applications (ACCTHPA), pp. 120–124 (2020)
20. Khan, I.U., Qureshi, I.M., Aziz, M.A., Cheema, T.A., Shah, S.B.H.: Smart IoT control-based nature inspired energy efficient routing protocol for flying ad hoc network (FANET). IEEE Access 8(1), 56371–56378 (2020)
21. Wang, Z., Qin, X., Liu, B.: An energy-efficient clustering routing algorithm for WSN-assisted IoT. In: IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6 (2018)
22. G., S.C., Ramkumar, A.G., S.M., Ayyadurai, M.: Experimental analysis of secured routing protocol establishments over wireless sensor network. In: 5th International Conference on Trends in Electronics and Informatics (ICOEI), pp. 691–698 (2021)
23. Alves, R.G., et al.: A digital twin for smart farming. In: IEEE Global Humanitarian Technology Conference (GHTC), pp. 1–4 (2019)
24. Alfred, R., Obit, J.H., Chin, C.P.-Y., Haviluddin, H., Lim, Y.: Towards paddy rice smart farming: a review on big data, machine learning, and rice production tasks. IEEE Access 9(1), 50358–50380 (2021)

Towards Programmable Context Aware Interaction with Smart Home Internet of Things

A. S. M. M. Rahman(B), Mohammadhossein Momeni, and Colby Braton

Truman State University, 63501 Kirksville, MO, USA
{kafi,mm3217,cmb7742}@truman.edu
http://truman.edu

Abstract. Smart homes are technologically augmented spaces where several interconnected devices, artifacts, and other services are available to support people. In such spaces, natural interaction with devices and things frees people from working in a desktop-like setting and provides intuitiveness in accessing various services of interest. For example, when reading a newspaper, a person might want to turn on the light or activate the media player while sitting on a couch. The proposed system provides avenues to create a geometric annotation map of smart home things. The annotation map is then compared with the user's real-time location data in order to realize programmable context-aware interaction rules with smart home services. Further, we demonstrate how the programmable framework can facilitate interaction using the commonly used paradigms of TouchMe, PointMe and ScanMe to interact with the Internet of Things, and show the suitability of the proposed work.

Keywords: Programmable smart home · Ambient assistive living · Context aware · Pervasive smart environment · Elderly home · Children computer interaction

1 Introduction

A growing number of people are interested in creating organic connections with computer systems. Natural engagement with the program and environment, in contrast to traditional WIMP-based interaction, frees users from working in a desktop-like environment and offers straightforward access to numerous services of interest [11]. For instance, when sitting on the couch and reading a newspaper, a person might want to turn on the lamp. This can be done with the aid of a natural interaction, such as a gesture. This is particularly relevant in a smart home setting [6], where a variety of sensors and services are accessible to assist human activities. The increasing pervasiveness of the Internet of Things and smart home services has initiated the idea of using them to promote natural interaction with smart home things [2,5,12,20].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 663–681, 2024. https://doi.org/10.1007/978-3-031-47715-7_45

For example, researchers are experimenting with interaction


with their physical environment using mobile phones in order to obtain digital information about the surrounding objects/devices and initiate further actions relevant to those entities [15]. This type of interaction is commonly known as physical browsing [29]. To facilitate physical mobile interaction, the first step usually is to annotate objects in the physical environment with some sort of RFID/NFC tag [10,13,24] or visual marker [14,26,28]. A mobile device equipped with an RFID reader or camera is then used to touch/point/scan the tagged object or take pictures of the visual markers to obtain relevant information or invoke associated services [1,5].

However, there are issues with RFID- and visual-marker-based approaches that impose restrictions while interacting with an object. For example, the limited range (10–20 cm) of a passive RFID tag can only facilitate interaction with a tagged object from a close distance [22]. Besides, an object of large dimensions would require many tags on its surface so that a mobile phone can easily locate any of them and read its associated content; this requires extra effort and cost to tag the physical object in the first place. Visual markers, on the other hand, are placed on a visible surface of the object so that the mobile phone can easily read them and fetch the associated content [3]. Unlike an RFID tag, which can be placed in the background, the marker-based approach suffers from visual obtrusiveness as markers take up visible surface space. We aim to overcome some of these restrictions in the physical mobile interaction paradigm. These systems are also rigid and limited, as users have no way to craft creative interaction rules to extend the use cases to fit their needs.

Natural interaction with devices frees people from working in a desktop-like setting and provides intuitiveness in accessing various services of interest [11].
For example, when reading a newspaper, a person might want to turn on the light or the media player while sitting on a couch. Users need support to access the smart home anywhere and anytime. This is of special interest in a smart home environment [6], where various sensors and services are available to support human activities. Intelligent systems promise to do more in terms of automating tasks and increasing awareness by inferring the states of users and environment things. However, knowledge of the location of the users and the smart home things is a prerequisite for the realization of context-aware applications.

We propose that, using the sensory data, meaningful context scenarios for annotated IoT can be defined. In order to create realistic contextual rules, we relate the sensory context factors to the geo-annotated IoT, and we have used the suggested architecture to create a prototype that carries out many of the suggested context rules. Context-aware systems are typically built ad hoc, by integrating a specific hardware platform that supports a specific mode of object interaction. Similarly, location-aware systems operate on an ad-hoc basis and do not adhere to any set framework or rules. With the suggested geometric-annotation-based framework, context-aware applications may be programmed using annotated Internet of Things.


Our contribution in this paper is two-fold. First, we propose a novel programmable context-aware Internet of Things interaction framework to activate smart home services. Unlike existing works, our framework allows users to create context rules using a spatial-geometric approach in an easy-to-use editor. Further, the framework allows users to connect the context rules to activate certain multimedia services in the smart ambient environment. Second, using the programmable context rules, we implement three proof-of-concept use cases to demonstrate the suitability of the proposed framework. We present realization, interaction accuracy, and detailed user evaluation data to describe the relevance of the proposed framework. We demonstrate how the programmable framework can facilitate interaction using the commonly used paradigms of TouchMe, PointMe and ScanMe [1] to interact with the Internet of Things in a smart environment setting, and show the user acceptability of such an approach.

The remainder of this paper is organized as follows. In Sect. 2 we discuss the state of the art in the physical mobile interaction domain. In Sect. 3, we present the proposed programmable context-aware interaction framework and three Internet of Things interaction scenarios that realize it. The physical Internet of Things annotation process and smart home context programming rules are described in Sect. 4. In Sect. 5, we describe our experiments and provide results ranging from object selection accuracy to various other performance and usability parameters. Finally, in Sect. 6 we present the conclusion of our research.

2 Related Work

Want et al. [30] presented one of the earliest systems for building a bridge between real objects and their virtual metaphors. They described different scenarios for embedding RFID tags in real-world objects and later reading those tags with RFID-reader-equipped devices to obtain associated information about those objects. Many other works have contributed further in this direction. For example, earlier work by Egenhofer [7] proposed the visionary GeoWands as a mobile information appliance that can act as a smart pointer to interact with real-world objects. Existing research related to the proposed approach falls into the application areas of physical mobile interaction [1,4,23,30], physical browsing [29], and mobile spatial interaction [8,27]. Therefore, we review representative works in these domains along with the various interaction techniques outlined therein.

A more descriptive work appears in [1], which emphasizes tagging real-world objects and using mobile devices to interact with them. The authors also elaborated on different physical mobile interaction strategies such as PointMe, ScanMe and TouchMe, and discussed various implementation options for these strategies. Furthermore, the authors in [21] presented a comparison among the different interaction techniques, reflecting their experimental findings. Riekki et al. proposed a framework in [18] for requesting web services and leveraging their functionalities, which uses RFID tags as placeholders for those web services. Similar work has been carried out by [4,23] in the domain


of physical mobile interaction. The authors in [4] also showed different application scenarios of physical mobile interaction, including mobile ticketing, poster viewing, etc. Research in a similar direction is also known as physical browsing, as pointed out in [29]; there, the authors demonstrated mechanisms to remotely point at sensor-enabled passive RFID tags attached to an object via the infrared beam of a mobile device. In addition to RFID-based solutions, other techniques and technologies have been explored for physical mobile interaction. Examples include tagging objects with visual codes and reading those codes using camera phones to obtain embedded information sources [19]. The RELATE system presented in [9] shows a list of available target devices in the user's display with respect to their spatial arrangement; users interact with the devices displayed in their user interface via explicit manipulation. Device discovery requires individual devices to use a RELATE dongle, which seems to create additional requirements for the communicating devices.

Unlike the above, our approach enables interaction with physical objects that may not have any computing facility. For example, when interacting with a poster on the wall, we spatially annotate that poster using its geometric coordinates and store mapping information related to it in the database, which is available to interacting users via their mobile devices. On the other hand, in order to interact with a smart board, we first spatially annotate its surface based on its geometric coordinates, map the associative services required to control the smart board to these coordinates, and later use the mobile device to point at the smart board to invoke those services. Rahman et al. presented the annotation of real-world objects using live video [17]. The authors annotated real objects in their real space in order to provide easy and natural interaction with those objects.
Through hand-gesture-based interactions, the learners get relevant learning materials about the objects in an entertaining manner. To deliver the interaction features of the system, a depth camera constantly monitors the real objects and supplies image frames and other sensory information to the authoring application. We have benefited from the architecture proposed in those papers and leveraged it in our proposed work.

A user interface is important not only for efficient control of the system; its design also depends on the ages and intelligence of users. The work presented by Bin Zhang et al. [31] showed how those factors affect the design of user interfaces, especially for the smart environment. Those factors are very important and have a high impact on users' cognitive tasks inside a home. Based on their user study, they found that for skill-based tasks, users obtain the best performance with low-intelligence-level interfaces, whereas for rule-based tasks, users obtain the best performance with high-intelligence user interfaces.

3 Proposed Context Aware Framework

Knowledge of the user's location in the smart environment is a prerequisite for any context-aware application. In our prototype we have ensured that at the point


of interest, the user's location is available to the system. Unlike outdoor applications where GPS location data is present, we have developed our own location determination system and extended the proposed framework to realize context-aware interactions in a smart environment. We use a 3D Kinect camera to calculate the user's 3D position in real time inside the smart environment. However, the location data is available only around the IoT where the interactions take place. We term the collection of IoT annotation data for a particular location a Scene; hence, the location data of the users is always determined with respect to a Scene. In our context-interaction prototype, we have accordingly created an annotation map of all the IoT that we want to control.

Smart homes are technologically augmented spaces where several interconnected devices, artifacts, and other services are available to support people. The lighting control service, media service, and security service are a few examples of services a smart home application incorporates. Interaction with such an application occurs either implicitly, where the environment automatically invokes context-aware services, or explicitly, where people select services of interest through interaction devices such as a remote control or a cell phone. However, unlike existing techniques, gesture-based interaction, as a mode of explicit interaction, is more natural and appealing to people while accessing various services. In automated context-aware interactions, the users are not required to confirm their interaction commands. However, in a smart home environment, users may want to use gestures and associate a gesture with an interaction command. In such cases, for a large number of IoT interaction commands, we want to avoid gesture overload by using our proposed context-aware interaction approach. In our usability experiment we have found that associating individual

Fig. 1. Context aware interaction with annotated smart home things.


services/commands with individual gestures increases the user's cognitive load [16]. In such usability experiments, we have found that users forget the gestures they have associated with certain actions. We note that mobile-app-based IoT interaction is limited by the UI: in most cases, simple IoT interaction becomes tedious and repetitive, resulting in wasted time. In this prototype, we take context to be a Boolean combination of a set of sensory data; we term these combinations context rules. When a context rule becomes true, the user confirms his/her interaction command with a simple gesture. Hence, in our context-aware interaction prototype, the user only needs to remember one gesture command. In addition, we have realized an Audio Recognition Service so that users can assign a voice command instead of a gesture.

In addition to user location and IoT Scenes, we have combined our gesture rule calculations with the System Local, which contains the time of day, current date, calendar appointments, and the users' previous activity records. The number of users present in a Scene is also used as a parameter in a context-rule calculation. In order to calculate context in a Scene, this prototype realizes a Skeleton Processing Service that determines user posture; we have considered standing, sitting, and lying-down postures. We have implemented the Audio Processing Service and User Location Tracking Service as classes of Sensor Nodes. The Interaction Controller uses Event Processing Node modules in real time to deliver sensory data to the Intelligent Context Calculation Service component, a sub-component of the Interaction Controller of the proposed framework. We show the components of the prototype in Fig. 1. Next, we show some scenarios that we have developed with the help of specific context rules and highlight their implementations for interaction with geo-annotated IoT.

3.1 Child Aware Device Programming

During television playback, kids have a tendency to come near the television to watch their favourite characters up close or get immersed in the action of the medium, which is not good for their eyes [25]. In order to assist the caregiver in instilling good practices in the kids' television watching experience, we have considered the following geo-annotation programming scenario in our system; Fig. 2 explains it. When a kid's physical location is within the boundary of the annotated carpet and within the threshold defined by the caregiver, the television automatically dims its brightness (by using an IR controller) or the playback of the television medium is paused/stopped. The caregiver can program the distance, the height below which a user is considered a kid, and many other parameters to effectively train the kids. It is also possible to automate a voice instruction that, in addition to pausing the television content, repeatedly asks the kids to go back to their seats.
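The child-aware context rule above reduces to a Boolean combination of sensory inputs: estimated user height and distance to the annotated television. The following is a minimal sketch; the function names, 2D distance model, and threshold values are illustrative assumptions, not the prototype's actual code.

```python
def child_too_close(user_height_m, user_pos, tv_pos,
                    max_child_height_m=1.3, min_distance_m=1.5):
    """Context rule: true when a user short enough to be classified as a
    child is within the caregiver-set distance threshold of the TV.
    Both thresholds are caregiver-programmable (values are assumptions)."""
    dx = user_pos[0] - tv_pos[0]
    dy = user_pos[1] - tv_pos[1]
    distance = (dx * dx + dy * dy) ** 0.5
    return user_height_m <= max_child_height_m and distance < min_distance_m

def tv_action(user_height_m, user_pos, tv_pos):
    """Dim/pause playback when the rule fires, as in the scenario above."""
    return "pause" if child_too_close(user_height_m, user_pos, tv_pos) else "play"

state = tv_action(user_height_m=1.1, user_pos=(1.0, 0.5), tv_pos=(0.0, 0.0))
```

A tall (adult) user at the same position would not trigger the rule, so playback continues.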


Fig. 2. Programming the television to pause when kids are very close to it.

3.2 Location Aware Device Programming

Based on a user's proximity to the geo-annotated IoTs in the smart environment, the interaction subsystem may automatically activate the nearest device (as programmed) when the user performs an activation gesture. This context-aware scenario is suitable for elderly home care: an elderly person may raise his/her right hand or give a voice command to turn ON the reading light when sitting on the reading chair. Figure 3 explains the use case in more detail. Similarly, playback of the television can be controlled when the user is sitting on the living room sofa. As is evident, the geo-annotations of the IoT play an important role in the design and development of these context-aware interaction scenarios.

Fig. 3. Programming lights that responds to a user gesture when s/he is nearby.

3.3 Encouraging Healthy Posture

In the geo-annotated environment, where we not only track the location of the user but can also determine the user's posture through 3D analysis of the skeletons

670

A. S. M. M. Rahman et al.

(as discussed earlier), we can program scenarios that help users develop healthy habits. Based on how long the user has been watching the television, we can display a warning overlaid on the playback screen. The caregiver can also program the system to completely stop playback of the television after the user has watched it for a predefined time. Figure 4 explains such a scenario.

Event(E_i) = Σ_{i=1}^{T} |E_i + Score(P_i)|

Here, Score(P_i) is the posture score over time T_i and TVState = f(E_i) is the Boolean function that controls the media playback state. Through intelligent programming, whenever the user's posture changes and accumulates a bad score, the television content can be paused or a non-intrusive message can appear on the screen. When such a message appears, the user can correct his/her posture and continue enjoying the media playback.
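A minimal sketch of the scoring rule above: the window accumulates |E_i + Score(P_i)| and playback is paused once the sum crosses a threshold. The threshold value and the convention that worse posture yields larger Score(P_i) are assumptions for illustration.

```python
def event_score(events, posture_scores):
    """Event(E_i) = sum over the window of |E_i + Score(P_i)|."""
    return sum(abs(e + p) for e, p in zip(events, posture_scores))

def tv_state(events, posture_scores, pause_threshold=10):
    """Boolean playback control: True = keep playing, False = warn/pause."""
    return event_score(events, posture_scores) < pause_threshold

# Posture scores grow as the user slouches longer (illustrative encoding).
print(tv_state([1, 1, 1], [0, 0, 1]))  # True  (good posture, keep playing)
print(tv_state([1, 1, 1], [2, 4, 8]))  # False (accumulated bad score, pause)
```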

Fig. 4. Programming to encourage breaks in a long television watching session.

4 Annotation of IoT and Scene Programming Editor

The proposed 3D camera based annotation scheme is suitable for indoor spaces where interaction with a number of physical multimedia IoTs, e.g. television, audio player, movie player, air conditioner, etc., is intended. By using the 3D depth sensor data, we geometrically annotate the physical devices in the smart environment, as shown in Fig. 5. Based on the 3D position estimation methodology,


we introduce a spatial-geometric approach to annotate the physical IoTs. Using that approach, we annotate an object in the XY plane. The geometric annotations are voxel based and use the 3D camera's reference point. Using the 3D annotation editor, the user can draw a rectangular shape to identify an IoT inside the 3D image. With the editor, the user can assign a name, colour, and properties, and associate that IoT with a particular Scene. From a collection of geometric IoT annotations, a physical scene map is created for a specific location of a smart environment.
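A rectangular XY-plane annotation of this kind could be represented as follows; the field names and the containment test are an illustrative sketch, not the editor's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class IoTAnnotation:
    """Voxel-style rectangular annotation in the 3D camera's reference frame."""
    name: str
    colour: str
    scene: str
    min_xy: tuple   # lower-left corner of the rectangle in the XY plane
    max_xy: tuple   # upper-right corner
    properties: dict = field(default_factory=dict)

    def contains(self, x, y):
        """True when a point in the XY plane falls inside the annotation."""
        return (self.min_xy[0] <= x <= self.max_xy[0]
                and self.min_xy[1] <= y <= self.max_xy[1])

tv = IoTAnnotation("television", "black", "living_room", (4.0, 0.0), (6.0, 1.0))
print(tv.contains(5.0, 0.5))  # True
```

A scene map is then simply the collection of such annotations registered for one location.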

Fig. 5. Annotating smart space elements to allow programmable physical interaction.

We now describe the context-rule editor, shown in Fig. 6. The context-rule editor incorporates the Annotation database and also uses the Service list from the Multimedia Rendering Manager. It follows the design of an email query editor, and the user can learn from examples and tutorials to become familiar with it. For example, in order to create a context rule that will turn ON a reading light when the user is sitting on a reading chair, the user does the following. First, the user selects the IoT annotation s/he wants to interact with, the name of the target IoT, and the Activation options; the list of annotated IoTs is available in the Object menu. In the next menu, the user selects which user groups the interaction applies to; to cover all users, the default value ANY is selected. Next, a gesture command that the Interaction Controller will use to confirm the interaction is assigned to the context rule from the Trigger menu, and a posture trigger is selected from the Posture menu. Other System Locale commands are available from the settings menu


located at the top right corner of the editor. Afterwards, each rule is stored in a Scene database and associated with information/services. These rules are triggered by the Interaction Controller when a user interacts with an annotated IoT.
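One stored rule might look like the following; the key names mirror the editor's menus (Object, Trigger, Posture), but the exact schema is an assumption, since the prototype keeps rules in an XML Scene database.

```python
# Hypothetical in-memory form of one context rule created with the editor.
rule = {
    "object": "reading_light",      # annotated IoT chosen from the Object menu
    "users": "ANY",                 # user group the rule applies to (ANY = default)
    "trigger": "raise_right_hand",  # gesture command from the Trigger menu
    "posture": "sitting",           # posture trigger from the Posture menu
    "action": "turn_on",
}

def rule_fires(rule, user_group, gesture, posture):
    """Check whether the Interaction Controller should fire this rule."""
    return (rule["users"] in ("ANY", user_group)
            and rule["trigger"] == gesture
            and rule["posture"] == posture)

print(rule_fires(rule, "elderly", "raise_right_hand", "sitting"))  # True
```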

5 Realization and Measurements

The prototype was developed on the Visual Studio 2012 .NET platform, with C# as the programming language. To communicate messages between the processing units of the prototype, we used the Simple Object Access Protocol (SOAP); the event handler performs actions by using this text-based messaging protocol. We used a simple XML-based database to store the IoT annotation data and Scene annotation data.

5.1 Location Accuracy

Location is always calculated in terms of distances. An event triggered by the user was paired with the ID of the nearest physical object in order to generate a query to the Service database. In our experiments we used 21 physical object locations. We annotated the physical environment in a manner that avoided nested location calculations. As evident from our use cases, nested physical location mapping is unnecessary to build a valid and workable location-aware service

Fig. 6. Programming smart space by using the interaction rule editor.


activation, since we can activate a device depending on the position of the user with respect to any of the objects mapped in the system. To avoid false positives, we subdivided the physical annotation maps of larger objects so that we could work with radii that did not intersect with smaller physical objects. Hence, for larger objects we considered 2 or 3 points of centre (POC); for example, we considered multiple points of centre for the big sofa, television, and child play carpet, and 4 points of centre for the queen bed and the living room carpet. These additional points made the distance-based location calculation with respect to the annotation maps more accurate. Table 1 lists the annotated objects, their POCs, the actual triggered events, and the obtained (calculated) events.
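With multiple points of centre, the distance to an object is simply the distance to its nearest POC; the coordinates below are invented for illustration.

```python
import math

# Larger objects carry several points of centre (POC); smaller ones just one.
POC_MAP = {
    "sofa_big": [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # 3 POCs
    "side_table": [(5.0, 5.0)],                         # 1 POC
}

def distance_to_object(user_xy, obj):
    """Distance to an annotated object = distance to its nearest POC."""
    return min(math.hypot(user_xy[0] - x, user_xy[1] - y) for x, y in POC_MAP[obj])

def nearest_object(user_xy):
    """The object whose annotation map lies closest to the user."""
    return min(POC_MAP, key=lambda obj: distance_to_object(user_xy, obj))

print(nearest_object((2.5, 1.2)))  # sofa_big
```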

5.2 Comparisons of Interactions with Annotated Objects

Table 1. Location calculation accuracy with respect to the geometric annotation maps.

ID  Object                POC  Actual events  Obtained events  False +tives  Accuracy in %
1   Light living room     1    37             37               0             100
2   Light reading room    1    47             43               1             90
3   Light washroom        1    45             44               0             98
4   Light bedroom         1    47             42               0             89
5   Sofa big              3    42             40               2             91
6   Sofa south wall       2    41             38               1             90
7   Sofa reading table    2    45             44               1             96
8   Chair reading table   1    47             46               0             98
9   Couch bed room        1    34             32               0             94
10  Chair child reading   1    32             27               0             84
11  Television            2    46             41               0             89
12  Carpet living room    4    50             47               3             89
13  Queen bed             4    49             49               2             96
14  Corner table south    1    43             41               0             95
15  Side table east wall  1    43             41               0             95
16  Drawer with knife     1    43             43               0             100
17  Oven/heater           1    47             43               0             91
18  Child play carpet     2    47             42               0             89
19  Kid sound system      1    31             27               0             87
20  Fan living room       1    49             48               0             98
21  Fan bed room          1    40             35               0             88

In this series of usability experiments, we set up the study with 9 participants on day 1 and 22 participants on day 2 to compare the programmable geometric


annotation interactions. We devised two simple experiments to compare context-aware interactions with other conventional interactions. In experiment one, using geometric annotations, we annotated a large object, a television, in a living room. In this setup, we asked the users to turn the television ON by using a remote control interaction, an RFID based interaction, a visual QR Code based interaction, a pointMe based interaction, and a context-aware interaction. We explained the use of these interactions to the users in detail. To turn the television ON by using the remote control, the user only needed to press the power button of the remote to complete the task. Similarly, we asked the users to scan an RFID chip to power ON the television. We placed a visual code at the corner of the television, and as soon as a user scanned the code we turned the television ON. In addition, the user could point a mobile handheld device at the television screen and press a switch to turn the television ON. Lastly, in the context-aware interaction, we automatically turned the television ON as soon as the user sat on the sofa in front of the television. The participating users were asked to fill out a questionnaire and to answer some given questions according to their experience with the system. During the experiments we also recorded the interaction actions of the users for further analysis, and from the recorded data we compared the usability of the selected interaction schemes. In Fig. 7, we present the average rating of the users on a Likert scale: users rated whether they considered the remote control, RFID, QR Code, pointMe, and context-aware interactions to be appealing

Fig. 7. Geometric annotation of large objects and its advantages.


and easy to use. As is evident, the RFID and visual code based interaction approaches were not appealing to the users, even though users agreed that the RFID based approach was easy to use. Moreover, the location-aware automated television start was appealing to the users.

5.3 Study of Interactions with a Light

In this experiment setup, we asked the users to turn a light ON by using a light switch, a clap, a mobile app, and a context-aware programmable interaction. We explained the use of these interactions to the users in detail. To turn the light ON by using a switch, the users were requested to press the light's power switch to complete their task. Similarly, we asked the users to turn the light ON with a mobile app; we used a smart light and the accompanying app to accomplish this task. In addition, the users were told to clap with both hands to turn the light ON. Lastly, in the context-aware interaction, we automatically turned the light ON when a user sat on the reading chair and raised his/her right hand. The participating users were asked to fill out a questionnaire and to answer some given questions according to their experience with the system. During the experiments, we also recorded the interaction actions of the users for further analysis, and from the recorded data we depicted the usability of the selected interaction schemes. In Fig. 8, we present the average rating of the users on a Likert scale: users rated whether they considered the light switch, mobile app, clap, and context-aware interaction approaches to turning a light ON to be appealing and easy to use. As is evident, the light switch was considered by all users to be very easy to use compared to the other interaction approaches. Users found both the mobile app and the context-aware interaction to turn a light ON to be appealing.

Fig. 8. Usability comparisons of various interactions with an annotated light.

5.4 Usability Study of Healthy Posture Awareness

In this experiment setup, we asked 9 users to participate in the study of a context-aware healthy posture awareness application. We considered a scenario where users would watch television for long durations and would remain seated without straightening their backs. In such cases, the context-aware app would wait for a given time (20 min) and would warn the users that they needed to fix their sitting postures. The app would also be able to notify the users to take a break and to pause the playback of the movie to enforce a break automatically. To participate in the experiment, users were requested to watch three movie clips, each 5 min in length. The system would calculate each user's posture every 3 min by using a 3D camera and would notify the users in a non-obtrusive manner to correct their postures or to take a break if needed. The participating users were asked to fill out a questionnaire and to answer some given questions according to their experience with the system. From the recorded data, we depicted the usability of the selected interaction schemes. In Fig. 9, we present the average rating of the users on a Likert scale: users rated whether they found the proposed system to be appealing and easy to use. As is evident, an overwhelming majority of the users appreciated the appeal and ease of use of the context-aware application. In addition, they showed interest in using the system in the real world if it were made available to them.

5.5 User Posture Accuracy

To assess the accuracy of posture classification, we have created a confusion matrix in which we compare the classification results to additional ground-truth information. We use the confusion matrix to identify the nature of the classification errors as well as their quantities. Table 2 depicts the confusion matrix for the Stand, Sit, Recline, and Lie-Down classes. In the table, the REL row denotes the reliability of the classifications and the ACC column shows their accuracy. The rows of the matrix correspond to classes in the ground truth and the columns to classes in the classification result. The diagonal elements represent the number of correctly classified samples of each class: 467 samples of 'Stand' in the test set were correctly classified as 'Stand' in the classified image. The off-diagonal elements represent the misclassified samples, i.e. the classification errors: for example, 7 samples of 'Stand' in the test set were classified as 'Sit', and 87 ground-truth samples of 'Lie Down' were included in the 'Recline' class.

Fig. 9. Usability study of a context-aware healthy posture awareness application.

Table 2. User posture accuracy.

          Stand        Sit          Recline      Lie down  ACC
Stand     467          7            0            0         0.985232068
Sit       4            245          92           11        0.696022727
Recline   0            50           150          67        0.561797753
Lie down  0            21           87           130       0.546218487
REL       0.991507431  0.758513932  0.455927052  0.625
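The ACC column, REL row, and the average/overall figures discussed below can be reproduced directly from the raw counts of Table 2:

```python
# Confusion matrix from Table 2: rows = ground truth, columns = classified result.
matrix = {
    "Stand":    [467, 7, 0, 0],
    "Sit":      [4, 245, 92, 11],
    "Recline":  [0, 50, 150, 67],
    "Lie down": [0, 21, 87, 130],
}
classes = list(matrix)

# Per-class accuracy (ACC): diagonal element over its row sum.
acc = {c: matrix[c][i] / sum(matrix[c]) for i, c in enumerate(classes)}
# Per-class reliability (REL): diagonal element over its column sum.
col_sums = [sum(matrix[c][j] for c in classes) for j in range(len(classes))]
rel = {c: matrix[c][i] / col_sums[i] for i, c in enumerate(classes)}
# Overall accuracy: sum of the diagonal over all samples.
overall = sum(matrix[c][i] for i, c in enumerate(classes)) / sum(col_sums)

print(round(acc["Stand"], 3), round(overall, 3))  # 0.985 0.745
```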

In the confusion matrix, the ACC column presents the accuracy of the classification. Accuracy is defined as the fraction of correctly classified samples with regard to all samples of that ground-truth class. For example, for the 'Stand' class, the accuracy is 467/474 = 0.985, meaning that approximately 98.5% of the 'Stand' ground-truth data also appear as 'Stand' data in the classified image. The figures in the REL row present the reliability of classes in the classified image. Reliability is the fraction of correctly classified data with regard to all data classified as that class in the classified image. For example, for the 'Stand' class, the reliability is 467/471 = 0.99, meaning that approximately 99% of the 'Stand' data in the classified image actually represent 'Stand' on the ground. The average accuracy is calculated as the sum of the accuracy figures in the ACC column divided by the number of classes in the test set; for the proposed classification it is 0.697317759. The average reliability is calculated as the sum of the reliability figures in the REL row divided by the number of classes; for the proposed posture classification it is 0.707737104. Lastly, we calculate the overall accuracy as the total number of correctly classified samples (the diagonal elements) divided by the total number of test samples. In this case the overall accuracy of the classification is (467 + 245 + 150 + 130)/1331, which is approximately 75%. From the confusion matrix, we can further conclude that the test-set classes 'Recline' and 'Lie-Down' were


difficult to classify, as many such test samples were excluded from the 'Recline' and 'Lie-Down' classes. On the other hand, the 'Recline' class in the classified image is not very reliable, as many test samples of other classes were included in it.

Table 3. User stand posture classification accuracy.

       Side         Front        Back         Unclass  ACC
Side   292          4            171          4        0.619957537
Front  11           360          9            2        0.942408377
Back   155          12           280          7        0.616740088
REL    0.637554585  0.957446809  0.608695652

In Table 3, we present the classification of the standing posture considering the Front, Side, and Back views. The Unclass column of the confusion matrix represents the ground-truth data that were left unclassified in the classified image. The average accuracy of the classification is 0.726368667; the average reliability and overall accuracy are 0.734565682 and 0.713083397, respectively. From the confusion matrix, we can conclude that the 'Side' and 'Back' standing postures were difficult to classify, as many such test samples were excluded from the 'Side' and 'Back' classes. On the other hand, the 'Front' class in the classified image is very reliable, as very few test samples of other classes were included in it.

Table 4. User sit posture classification accuracy.

       Front        Side        Back         Unclass  ACC
Front  124          0           1            40       0.751515152
Side   7            230         1            7        0.93877551
Back   5            2           12           110      0.093023256
REL    0.911764706  0.99137931  0.857142857

In Table 4, we present the classification of the sitting posture considering the Front, Side, and Back views. The Unclass column of the confusion matrix represents the ground-truth data that were left unclassified in the classified image. The average accuracy of the classification is 0.594437973; the average reliability and overall accuracy are 0.920095624 and 0.67903525, respectively. We can further conclude from the confusion matrix that the test-set class 'Back' was difficult to classify, as many such test samples were excluded from the 'Back' class: when the user is in a sitting posture, it is difficult to accurately detect that posture from a back view of the image.

6 Conclusion

In this paper, we have presented the details of three prototypes that were realized by using the proposed programmable context-aware smart home interaction framework. The prototypes first demonstrate with examples how smart home things are annotated geometrically; things such as fans, lights, and televisions are used as examples to demonstrate the annotation process in a simple manner. Second, using the proposed framework, we further realize pointMe, touchMe, and context-aware interactions with smart home Internet of Things. Next, in the realization of context-aware interactions, we show how the user's position and the geometric annotations of places and things can be intelligently processed in order to calculate contexts, and we show programmable context rules that are used to create natural interaction events with things. Lastly, we present the detailed measurement data obtained from the implementations of the pointMe, touchMe, and context-aware interaction prototypes and show the suitability and relevance of the proposed approach in smart environments. In our detailed user evaluation of the proposed system, we observed that users showed interest in adopting the programmable context-aware interaction system if it were made available to them. Overall, the participating users responded to the system positively and with much enthusiasm.

References

1. Ailisto, H., Pohjanheimo, L., Välkkynen, P., Strömmer, E., Tuomisto, T., Korhonen, I.: Bridging the physical and virtual worlds by local connectivity-based physical selection. Pers. Ubiquitous Comput. 10(6), 333–344
2. Ballagas, R., Borchers, J., Rohs, M., Sheridan, J.: The smart phone: a ubiquitous input device. IEEE Pervasive Comput. 5(1), 70–77
3. Ballagas, R., Rohs, M., Sheridan, J.G.: Sweep and point and shoot: phonecam-based interactions for large public displays. In: CHI '05 Extended Abstracts on Human Factors in Computing Systems. ACM, Portland, USA (2005)
4. Broll, G., Rukzio, E., Paolucci, M., Wagner, M., Schmidt, A., Hußmann, H.: Perci: pervasive service interaction with the internet of things. IEEE Internet Comput. 13(6), 74–81 (2009)
5. Broll, G., Siorpaes, S., Rukzio, E., Paolucci, M., Hamard, J., Wagner, M., Schmidt, A.: Supporting mobile service usage through physical mobile interaction. In: Fifth Annual IEEE International Conference on Pervasive Computing and Communications. White Plains, NY (2007)
6. Chen, Q., Cordea, M.D., Petriu, E.M., Várkonyi-Kóczy, A.R., Whalen, T.E.: Human computer interaction for smart environment applications using hand gestures and facial expressions. Int. J. Adv. Media Commun. 3(1/2), 95–109 (2009)
7. Egenhofer, M.: Spatial information appliances: a next generation of geographic information systems. In: First Brazilian Workshop on Geoinformatics. Campinas, Brazil (1999)
8. Froehlich, P., Baillie, L., Simon, R.: Realizing the vision of mobile spatial interaction. Interactions 15(1) (2008)


9. Gellersen, H., Fischer, C., Guinard, D., Gostner, R., Kortuem, G., Kray, C., Rukzio, E., Streng, S.: Supporting device discovery and spontaneous interaction with spatial references. Pers. Ubiquitous Comput. 13(4) (2008)
10. Hinske, S.: Pulling digital data from a smart object: implementing the PullMe paradigm with a mobile phone. In: Human-Computer Interaction. Interaction Platforms and Techniques. Lecture Notes in Computer Science, pp. 306–310. Springer, Berlin/Heidelberg (2007)
11. Jaimes, A., Sebe, N.: Multimodal human-computer interaction: a survey. Comput. Vis. Image Underst. 108(1–2), 116–134 (2007)
12. Madhavapeddy, A., Scott, D., Sharp, R., Upton, E.: Using camera phones to enhance human-computer interaction. In: Adjunct Proc. 6th Int'l Conf. Ubiquitous Computing (UbiComp 2004) (2004)
13. Mäntyjärvi, J., Paterno, F., Salvador, Z., Santoro, C.: Scan and tilt: towards natural interaction for mobile museum guides (2006)
14. Nakazato, Y., Kanbara, M., Yokoya, N.: Localization system for large indoor environments using invisible markers. In: Proceedings of the 2008 ACM Symposium on Virtual Reality Software and Technology. ACM, Bordeaux, France (2008)
15. Pering, T., Ballagas, R., Want, R.: Spontaneous marriages of mobile devices and interactive spaces. Commun. ACM 48(9), 53–59
16. Rahman, A.S.M.M., Hossain, M.A., Parra, J., El Saddik, A.: Motion-path based gesture interaction with smart home services. In: Proceedings of the Seventeenth ACM International Conference on Multimedia. ACM, Beijing, China (2009)
17. Rahman, A.S.M.M., Cha, J., El Saddik, A.: Authoring edutainment content through video annotations and 3d model augmentation. In: IEEE International Conference on Virtual Environments, Human-Computer Interfaces and Measurement Systems (VECIMS), pp. 370–374. Hong Kong, China (2009)
18. Riekki, J., Salminen, T., Alakärppä, I.: Requesting pervasive services by touching RFID tags. IEEE Pervasive Comput. 5(2), 40–46 (2006)
19. Rohs, M.: Visual code widgets for marker-based interaction. In: 25th IEEE International Conference on Distributed Computing Systems Workshops, pp. 506–513 (2005)
20. Rohs, M.: Real-world interaction with camera-phones. In: 2nd International Symposium on Ubiquitous Computing Systems. Springer (2004)
21. Rukzio, E., Leichtenstern, K., Callaghan, V., Holleis, P., Schmidt, A., Chin, J.: An Experimental Comparison of Physical Mobile Interaction Techniques: Touching, Pointing and Scanning, vol. 4206. Springer, Berlin/Heidelberg (2006)
22. Rukzio, E., Broll, G., Leichtenstern, K., Schmidt, A.: Mobile interaction with the real world: an evaluation and comparison of physical mobile interaction techniques. In: AmI-07: European Conference on Ambient Intelligence. Springer, Darmstadt, Germany (2007)
23. Sanchez, I., Riekki, J., Pyykkönen, M.: Touch and control: interacting with services by touching RFID tags. In: Proc. of IWRT 08 (2008)
24. Schoo, P., Paolucci, M.: Do you talk to each poster? Security and privacy for interactions with web service by means of contact free tag readings. In: First International Workshop on Near Field Communication, pp. 81–86. IEEE Computer Society (2009)
25. Schramm, W.: Television in the Lives of Our Children. Stanford University Press (1961)
26. Siltanen, S., Woodward, C., Valli, S., Honkamaa, P., Rauber, A.: User Interaction for Mobile Devices, vol. 33. Springer, US (2008)


27. Strachan, S., Murray-Smith, R.: Bearing-based selection in mobile spatial interaction. In: Personal and Ubiquitous Computing, pp. 265–280 (2009)
28. Tomitsch, M., Schl, R., Grechenig, T., Wimmer, C., Thomas, K.: Accessible real-world tagging through audio-tactile location markers. In: 5th Nordic Conference on Human-Computer Interaction: Building Bridges. ACM, Lund, Sweden (2008)
29. Välkkynen, P., Tuomisto, T.: Physical browsing research. In: Pervasive Mobile Interaction Devices. Springer (2005)
30. Want, R., Fishkin, K.P., Gujar, A., Harrison, B.L.: Bridging physical and virtual worlds with electronic tags. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: the CHI is the Limit, pp. 370–377. ACM, Pittsburgh, Pennsylvania, United States (1999)
31. Zhang, B., Rau, P.L.P., Salvendy, G.: Design and evaluation of smart home user interface: effects of age, tasks and intelligence level. Behav. Inf. Technol. 28(3), 239–249 (2009)

IPAssess: A Protocol-Based Fingerprinting Model for Device Identification in the IoT

Parth Ganeriwala1(B), Siddhartha Bhattacharyya1, Raja Muthalagu3, Shreya Nandanwar1, and Anubhav Gupta2

1 Florida Institute of Technology, Melbourne, Florida 32901, USA
[email protected]
2 University of British Columbia, Vancouver, V1V 3C8, Canada
3 Birla Institute of Technology and Science Pilani, Dubai Campus, UAE

Abstract. The Internet of Things (IoT) has become a widely prevalent concept, as it has dramatically advanced the ability to communicate and exchange data between various connected devices. With its success and growing adoption, threats and attacks against IoT devices and services have increased exponentially. Growing knowledge of IoT-related threats and adequate monitoring technologies have helped develop the potential to detect these threats. There have been various studies on fingerprinting-based approaches to device identification, but none has taken into account the full protocol spectrum. IPAssess is a novel fingerprinting-based model which takes a feature set based on the correlation between device characteristics and protocols and then applies various machine learning algorithms: Random Forest, Decision Tree, K-Nearest Neighbour (KNN), Naive Bayes, and Gradient Boost (XGB), to perform device identification and classification. We have used aggregation and augmentation to enhance the algorithm. In our experimental study, IPAssess performs IoT device identification with a 99.6% classification accuracy.

Keywords: Cybersecurity · IoT fingerprinting · Network traffic · Network protocols · Machine learning algorithms

1 Introduction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 682–698, 2024. https://doi.org/10.1007/978-3-031-47715-7_46

With the increased ease of accessibility of IoT devices, there has been exponential growth in the use of IoT devices for daily activities. There are 7 billion IoT devices currently in use, according to the IoT analysis conducted by Jing et al. [9]. Hence, the potential for exploiting cybersecurity-related vulnerabilities, such as breaches of privacy and confidentiality, has also increased, and such exploits must be prevented well before any network attack is possible. In order to deal with the security issues of IoT devices, one has to identify the type of each device present in the network. The device type plays


an important role in the assessment, as it helps to establish the existence of a device. The device might have restrictive policies associated with it, limiting the amount or type of data that should be shared with it. If the device exceeds its data boundaries, it should be placed in a quarantine network, and a security breach should be identified so that the network can be rectified. Device type identification also helps in recognizing devices which might be malicious, pose a threat to other IoT devices, or should not be present in that particular network.

Several approaches have been proposed in the literature for identifying the devices in a network. Shahid et al. [19] used machine learning models such as a Random Forest (RF) classifier along with a t-SNE technique to differentiate the network traffic generated by various IoT devices. This machine learning based approach achieved an overall accuracy as high as 99.9% on their test set. Such models classify devices on the basis of the size of the packets sent and received and the packets' inter-arrival times; these features play a key role in the identification and recognition of an IoT device in a network. Agarwal et al. [1] used a web interface model and developed novel tools, "NetScanIoT" and "Web-IoT Detection (WID)", which achieved 92.45% accuracy in device type identification. An alternative approach was adopted by Mavrogiorgou et al. [12], wherein a pre-existing textual dataset was used and the proposed model performed feature extraction on images of the present IoT devices through object detection, thereby achieving classification. Previous studies have investigated various problems in the field of IoT device security based on device interactions over the network. Forensic analysis of the network has always been a salient approach in diagnosing the problem.
As a result, there have been various studies which have performed in depth analysis on the network traffic generated by IoT devices. After recording traffic telemetry, the packet data has then been analysed to classify the state of the IoT devices. Furthermore, the behavior of IoT devices on the network has been monitored using a combination of Software Defined Networking (SDN) telemetry and machine learning methods [20]. Meidan et al. [13] applied machine learning algorithms on traffic data for accurate identification of the IoT devices connected to a network. Alrashdi et al. [3] tried to detect anomalies of IoT cyberattacks in smart cities using machine learning yielding a 99.34% classification accuracy rate . In a recent study, a system for automated classification of device characteristics based on their network traffic, called System IDentifier (SysID) was introduced by Aksoy et al. [2]. It used various popular machine learning algorithms to have a completely automated way of classifying IoT devices using their TCP/IP packets without expert input for classification. In an experimental study with 23 IoT devices, SysID identified the device type from a single packet with over 95% accuracy. Common datasets were used to analyse the accuracy of the fingerprinting techniques and it was observed that the accuracy of device genre classification was 99.37% and the accuracy in the identification of an individual device from IoT Sentinel [14] dataset was found to be 83.35%. However, the University of New South Wales (UNSW) [7] dataset yielded a device type identification accu-

684

P. Ganeriwala et al.

racy of up to 97.78%. Bai et al. [5] proposed a framework to group devices based on their semantic type (e.g., camera, fitness/medical device, environmental sensor). Grouping of various devices under a particular label potentially results in a model with a diverse range of performance boundaries since often various devices of a given type (e.g., cameras from different manufacturers) distinctly differ in their network behaviour. Therefore, models with diversity in categories of devices would yield high rate of false classification during testing phase. In our study [4], we try to work on these limitations by introducing our proposed modified approach. We form different feature sets based on previous studies and propose our own feature set based on the correlation of the features used in the IoT devices. We survey five different machine learning algorithms on the basis of classification accuracy by testing on the aforementioned feature sets. Augmentation and aggregation is performed on the original dataset to attain a more denser valued dataset. Device type identification and classification is then performed on the augmented dataset with packet aggregation. The contributions of this paper can be pointed out as follows: – Optimal feature sets and correlation with IoT device protocols have been identified. In the process, comparison of different feature sets formed from the IoT Sentinel [14] dataset was performed. – Comparison of different machine learning algorithms based on its classification accuracy and the time taken to identify the IoT devices have been performed. – Classification of the different IoT devices based on packet aggregation, by enhancing the dataset using augmentation was integrated within the approach. – Comparison of the performance of classifying models by aggregating devices with common labels to further reduce the time and complexity has been evaluated. 
– Finally, a prediction-based model built on the optimal machine learning model and the proposed feature set was developed.

This paper is organised as follows. Section 2 reviews related work. Section 3 outlines the proposed methodology, split into two parts: one describing how we chose a particular feature set along with the machine learning algorithm, and the other explaining our modified approach using augmentation and aggregation. Section 4 reports the experimental results, which are discussed further in Sect. 5. Finally, Sect. 6 concludes our work.

2 Related Works

There have been many studies on device fingerprinting techniques for IoT device classification, but their applicability is debatable, as they are mostly based on the physical or application layer without taking the entire protocol spectrum into account. Hence, we focus here on research based on network packet behaviour, which covers the entire protocol spectrum and includes the necessary information from all the different layers. One of the initial studies to use network packet features for IoT device identification and classification based on a fingerprinting method is IoT Sentinel [14]. This study uses the network flow to identify vulnerable devices in the network and isolate them from the user network by quarantining them. Their dataset was collected from 31 devices during the device installation phase, repeated 20 times for each device. Each device was then fingerprinted using 23 features extracted from each of the first 12 packets per device, resulting in a fingerprint of 276 values. These 12 packets do not represent a flow exactly; they are sequential packets from the same MAC address. Each device connected to the network is identified by these fingerprints, and if any vulnerability is detected, the device is moved to a quarantine network and its connections to other devices, the local network, and the internet are restricted. This method is useful for identifying devices because the 23 features it creates are compressed, representative features: four are integers and the rest binary-valued. According to this study, 17 of 27 device types were detected with an identification accuracy above 95%, and 10 with an accuracy of around 50%, using Random Forest (RF). IoTSense [6] is another study that uses selected features of IoT Sentinel based on its own model assessment. IoTSense chooses 17 protocol-based features of the IoT Sentinel study which reflect device behaviour and adds three payload-related features (see Fig. 2). This feature set is applied to five packets per device to produce a 100-member fingerprint.
According to this study, per-device recall was 93–100% and an average accuracy of 99% was achieved using Extreme Gradient Boosting (XGB), Decision Tree (DT) and k-Nearest Neighbours (kNN). While some comparisons are made with the work of IoT Sentinel [14], the evaluation of IoTSense used a much smaller number of devices (i.e., 10 vs. 31). In addition, the IoTSense experiment set began with 14 devices, though only 10 were used for the evaluation, as four devices did not produce sufficient data for the analysis approach used. One of the most recent studies is IoTDevID [10], a novel fingerprinting method for device identification that uses machine learning to model the behaviour of IoT devices based on network packets. Their method uses an enhanced combination of features based on IoT Sentinel [14] and IoTSense [6] and includes an approach for dealing with unbalanced device data via data augmentation. They further demonstrate how to enhance device identification via group-wise data aggregation, and provide a comparative evaluation of the method against two recent identification methods using three public IoT datasets which together contain data from over 100 devices. They use a feature set based on the 23-feature IoT Sentinel [14] set and additionally merge in two features from the IoTSense [6] set to form their own: four features are integers and the rest binary-valued. In their fingerprinting process, they use individual packets instead of entire network flows, since flows have no standard size. They use an aggregation algorithm to group packets

686

P. Ganeriwala et al.

together and then evaluate the groups with the machine learning algorithms. According to this study, they achieve F1-scores above 99% for all the datasets tested, with considerable improvement gained from data aggregation. Machine learning algorithms have also become widely used against IoT cyberattacks [11,22]. Shafiq et al. [18] propose a framework to select the top five ML algorithms best suited for identifying anomalies in traffic identification. The introduction of numerous IoT devices in cyberspace makes it difficult to discover devices at large scale and identify the potentially vulnerable ones. Salman et al. [17] present a system for identifying IoT devices and detecting malicious communications. Their framework collects characteristics per network flow to identify the source and the kind of produced traffic, and detects network threats by pushing information to the network edge. Yang et al. [23] propose an efficient technique to generate fingerprints of IoT devices automatically using neural network algorithms. The model generates device labels with 94% precision and 95% recall, which are used for analysing the device distribution in a network. Thom et al. [21] introduce Smart Recon, which employs Locality Sensitive Hashes to produce a feature vector that a classifier then uses to fingerprint IoT devices based on their generated network traffic. Some studies have also considered the gateway's limited storage and computational resources and have introduced lightweight IoT device identification schemes [8]. Yet another approach identifies IoT devices using packet lengths extracted from encrypted traffic. Pinheiro et al. [16] state that, rather than depending on several features of various protocols, the necessary features can be extracted from the encrypted traffic.
Their model evaluates the characterisation of IoT devices using traffic with the help of five classifiers. The model identified the IoT devices and also helped distinguish them from non-IoT devices, with Random Forest showing the best performance. ProfilIoT [13] formulated its own dataset from network traffic data collected from devices such as mobile phones, laptops and local IoT devices. They use supervised learning to train a multi-stage meta classifier. In the first stage, the classifier distinguishes between traffic generated by IoT and non-IoT devices; in the second stage, each IoT device is associated with a specific device class. The overall IoT classification accuracy of their model is 99.281% when judged on the basis of the model and brand of the particular IoT device.

3 Proposed Methodology

In this section, we elaborate on our proposed methodology, shown in Fig. 1. It is a step-by-step approach for classifying IoT devices with improved efficiency, reducing the time needed to achieve the desired prediction accuracy. To begin with, it is essential to identify relevant feature sets that will give comparatively higher accuracy. To achieve this, we identify the important protocols which are highly correlated with the network traffic dataset. We pre-process the data and formulate multiple feature sets. An experimental analysis is performed to find the machine learning algorithm with the highest accuracy on the selected feature sets. The selected feature set and the optimal machine learning algorithm are then used in our proposed method of augmentation and aggregation.

Fig. 1. Representation of the complete methodology proposed for IoT device classification

3.1 Dataset

In the literature review we came across several datasets, but most were local network experiments with a negligible amount of data for our methodology to be implemented upon. We chose the IoT Sentinel [14] dataset because of its vast amount of network traffic data and the appropriate features present, which are necessary to identify and classify a particular device along with the type of network packets it transmits over the network. The dataset consists of network traffic data collected from 31 smart home IoT devices of 27 different types; four of the types are represented by two devices each. The 26 features originally present in the IoT Sentinel dataset are: destination IP counter (IP_add_count) (int), source port number (Portcl_src) (int), destination port number (Portcl_dst) (int), packet size (Pck_size) (int), packet raw data (Pck_rawdata), padding (IP_padding), router alert (IP_ralert), ARP, LLC, IP, ICMP, ICMPv6, EAPoL, TCP, UDP, TCP window size (TCP_w_size), HTTP, HTTPS, DHCP, BOOTP, SSDP, DNS, MDNS, NTP, payload length (payload_l) and entropy. We start the device type identification on the basis of the application, network, data link, transport and physical layers, using a parsing script in Python which converts the packet capture (pcap) files and parses them into CSV files based on the different protocols. We also focus on the packet size and the number of packets coming from each device, which provide additional classification accuracy.

3.2 Data Pre-processing

The data pre-processing workflow begins with the extraction of the feature sets from the pcap files. We categorise the ports used during the network capture process of the devices, as displayed in Table 1.

Table 1. Classification of the port numbers used

Port numbers                                Equivalent
No port                                     0
Well-known ports (between 0 and 1023)       1
Registered ports (between 1024 and 49151)   2
Dynamic ports (between 49152 and 65535)     3
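The port mapping in Table 1 can be sketched as a small helper. This is an illustrative sketch: `categorize_port` is our own name, and we assume `None` stands for a packet that carries no port field.

```python
def categorize_port(port):
    """Map a TCP/UDP port to the categorical value used in Table 1."""
    if port is None:             # packet carries no port field
        return 0
    if 0 <= port <= 1023:        # well-known ports
        return 1
    if 1024 <= port <= 49151:    # registered ports
        return 2
    if 49152 <= port <= 65535:   # dynamic/private ports
        return 3
    raise ValueError(f"invalid port number: {port}")
```

Encoding the port ranges as small categorical values keeps the fingerprint compact while preserving the coarse role of the port.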

We then list the pcap files to be processed and match them with their particular MAC addresses. We calculate the payload entropy values for each of the devices in the dataset; the payload entropy value measures the amount of information in the packet needed to identify it uniquely. For the remaining features, the individual packets are examined and all the features belonging to the packet are extracted, processed and saved as a fingerprint file. All feature properties are initially assigned a value of 0 and are then queried in the packet: if a feature has corresponding data, the data is processed into the variable; otherwise, the value remains 0.
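The payload entropy described above can be computed as the Shannon entropy of the payload's byte distribution. The paper does not give the exact formula, so the sketch below assumes standard Shannon entropy in bits per byte; `payload_entropy` is our illustrative name.

```python
import math
from collections import Counter

def payload_entropy(payload: bytes) -> float:
    """Shannon entropy (bits per byte) of a packet payload; 0.0 for an
    empty or constant payload, up to 8.0 for a uniform byte distribution."""
    if not payload:
        return 0.0
    counts = Counter(payload)
    total = len(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

High-entropy payloads (e.g., encrypted or compressed data) approach 8 bits per byte, while structured plaintext protocols score much lower, which is what makes entropy a useful fingerprint feature.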

3.3 Selection of Optimal Feature Set and Machine Learning Algorithm

The second stage of our proposed methodology is the selection of the optimal feature set and the machine learning algorithm providing the highest classification accuracy. We identified feature sets based on the implementations of three renowned studies [6,10,14] in the IoT device identification field, as shown in Fig. 2, and proposed two novel feature sets based on the correlation matrix formed among the various protocols. To obtain an efficient feature set, we set the threshold value to 0.03. We perform an in-depth experimental analysis of five machine learning algorithms, namely Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbour (KNN) and Extreme Gradient Boosting (XGB), with accuracy and time taken as the evaluation metrics.

Fig. 2. Representing the relation of the feature sets formed

We first computed results on the main dataset with the original set of features, without discarding any. We then tested the algorithms on the IoT Sentinel feature set, which comprises 23 of the 26 features present in the dataset. This feature set consists of 16 protocol-based features typically used during device association over WiFi, two binary features representing the use of IP header options (IP padding and router alert), the size of the packet and presence of raw data, the destination address mapped to a counter, and two features representing the source and destination ports. None of the features rely on packet payload, ensuring that fingerprints can be extracted from encrypted traffic [14]. We next implemented the feature set proposed in the IoTSense paper [6]. That study used many protocol-based features but refrained from using IP addresses and source and target port numbers, to avoid features that are too specific and therefore ineffective for distinguishing device behaviour. In addition, the feature set includes payload-related features: payload length, payload entropy and TCP window size, with payload length used instead of the packet size feature. The final feature set was the implementation of the IoTDevID study [10], formulated without retaining the IP address and port number related features. The IoT Sentinel feature set did not use IP addresses and port numbers directly but rather used a counting process to summarise this information, so the data did not capture unnecessary and overly-specific information while still obtaining diversity. IoTDevID also did not use payload length, given its high correlation with packet size. We formulated our own feature sets on the basis of correlation with respect to two protocols, ARP (Address Resolution Protocol) and IP (Internet Protocol). We selected ARP as it is one of the main protocols for mapping IP addresses to MAC addresses, and IP because it is widely used as a principal communication protocol. The ARP feature set discarded the following features: LLC, EAPOL, ICMP, ICMPv6, UDP, TCP_w_size, NTP, IP_padding, IP_ralert and payload_l, as they did not show correlation above the 0.03 threshold. Similarly, the IP feature set discarded ICMP, DHCP, BOOTP, SSDP, DNS, MDNS, NTP and Pck_rawdata. For these six feature sets, we observed the performance of each algorithm and chose the best algorithm and feature set. Our experimental analysis sought the highest accuracy, taking the time required into account, with the fewest features present.
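The threshold-based selection above can be sketched as follows. This is a sketch under assumptions: the paper does not state which correlation coefficient it uses, so we assume plain Pearson correlation against a reference protocol column, and `pearson`, `select_features` and the toy data are our illustrative names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def select_features(table, reference, threshold=0.03):
    """Keep features whose |correlation| with the reference column
    (e.g. the ARP or IP indicator) meets the threshold."""
    ref = table[reference]
    return [name for name, col in table.items()
            if name != reference and abs(pearson(col, ref)) >= threshold]
```

Features whose correlation with the chosen protocol falls below the 0.03 threshold (e.g., a column that is constant across all packets) are dropped, mirroring the pruning of LLC, EAPOL, NTP and the other discarded features.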

3.4 Data Augmentation

To tackle the limited data present in the dataset, we applied data augmentation, which expanded the size of the data in the selected feature set. This increased the diversity of the data, helping the model train efficiently and preventing underfitting. Resampling is an important part of augmenting the data: the dataset is divided into training and testing data with a ratio of 0.75:0.25. Due to the presence of imbalanced classes, we chose the up-sample minority class method, in which the observations in the minority classes are randomly duplicated. All classes therefore have an equal opportunity to be represented, and hence to be predicted by the model.
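The up-sampling step can be sketched as random duplication of minority-class rows until all classes match the largest one. This is a sketch only; `upsample_minority` and `label_of` are our illustrative names, and the fixed seed is just for reproducibility.

```python
import random
from collections import defaultdict

def upsample_minority(rows, label_of, seed=42):
    """Randomly duplicate minority-class rows until every class has as
    many rows as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # sample with replacement to fill the gap up to the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Applying this only to the training split (after the 0.75:0.25 split) avoids leaking duplicated rows into the test set.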

3.5 Data Aggregation

For the classification of different IoT devices, we propose classification based on the grouping of packets, with group sizes of 1, 3, 6, 9, 12 and 100 packets. The process was run for 100 iterations for each device, with varying packet group sizes, in order to determine the most efficient group size for classification. After observing the results of this method, we further combined devices into grouped labels along with the grouping of packets, aggregating similar types of devices under common labels. This technique saved computation power.
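Grouping similar devices under a common label can be sketched as a simple lookup. The exact groupings used in the paper are not spelled out, so the mapping below is an assumption based on the device names in Table 3, and `combine_label` is our illustrative name.

```python
# Illustrative grouping of similar device models under one label;
# the paper's actual groupings may differ.
LABEL_GROUPS = {
    "EdimaxPlug1101W": "EdimaxPlug",
    "EdimaxPlug2101W": "EdimaxPlug",
    "TP-LinkPlugHS100": "TP-LinkPlug",
    "TP-LinkPlugHS110": "TP-LinkPlug",
    "D-LinkCam": "D-LinkCam",
    "D-LinkDayCam": "D-LinkCam",
}

def combine_label(device: str) -> str:
    """Collapse a device name to its group label; unknown devices keep their own name."""
    return LABEL_GROUPS.get(device, device)
```

Collapsing near-identical models (e.g., the two Edimax plugs) removes classes the classifier cannot reliably separate anyway, which is why the combined-label results in Sect. 4 improve both accuracy and run time.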


The aggregation takes the MAC addresses of the devices in the dataset, M = {m1, m2, ..., mn}, and the results of the machine learning algorithm, Y = {y1, y2, ..., yn}, as input, and creates groups of size g from packets grouped according to MAC address. We first create a two-dimensional array to store the different devices present in the dataset. A variable seen is initialised and the group size is specified for the particular aggregation step. Each MAC address belonging to a device is found, added to the seen list, and assigned a position in the two-dimensional array, and its packets are then allotted to the machine learning algorithms. We then divide the packets into chunks of size g and provide each machine learning algorithm with its chunks of data for experimental analysis.
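The grouping step can be sketched as follows. This is our reading of the aggregation: per-packet predictions are grouped by MAC address and collapsed to one label per chunk of g packets by majority vote; `aggregate_by_mac` is an illustrative name, not the paper's code.

```python
from collections import Counter, defaultdict

def aggregate_by_mac(macs, predictions, g):
    """Group per-packet labels by source MAC and emit one majority-vote
    label per chunk of g consecutive packets from the same device.

    `macs` holds one MAC address per packet, `predictions` the
    corresponding per-packet labels from the classifier."""
    per_device = defaultdict(list)
    for mac, label in zip(macs, predictions):
        per_device[mac].append(label)
    results = {}
    for mac, labels in per_device.items():
        votes = []
        for i in range(0, len(labels), g):
            chunk = labels[i:i + g]
            # majority label of the chunk becomes its aggregated prediction
            votes.append(Counter(chunk).most_common(1)[0][0])
        results[mac] = votes
    return results
```

A single misclassified packet inside a chunk is outvoted by its neighbours, which is why accuracy rises with g in Tables 4 and 6.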

3.6 Prediction Model

We tested a prediction-based classification model combining the optimal machine learning algorithm and feature set from the experiments on the IoT Sentinel dataset. The system generates a list of all the pcap files and filters the packets based on the MAC addresses of the particular devices. Based on the optimal feature set found, we extract the retained features from the network packet captures. The model generates the complete dataset with 276-dimensional feature vectors; thus, we extract 16 unique features for each device type, concatenated into a 276-dimensional vector. We calculate the Damerau-Levenshtein distance [15], a string metric measuring the edit distance between two given sequences. The classification model then calculates the feature importance for all devices, and using the Damerau-Levenshtein method we can visualise how the model performed on the given dataset.
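The Damerau-Levenshtein distance used above can be sketched as follows. The paper does not say which variant it uses, so this sketch implements the common restricted (optimal string alignment) form, where the allowed edits are insertion, deletion, substitution and transposition of adjacent elements.

```python
def damerau_levenshtein(a, b):
    """Optimal-string-alignment edit distance between sequences a and b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Applied to two fingerprint sequences, the distance counts how many per-position edits separate them, so near-identical devices yield small distances.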

4 Results

With the implementation of the above algorithmic study, we present the results obtained by the experimental analysis of the discussed machine learning algorithms on the six feature sets. The accuracy of each ML algorithm on each feature set is shown in Table 2. For the main feature set, which includes all the features of the IoT devices, RF has the highest classification accuracy at 78.4%. KNN and DT reach 76.8% and 77.9% respectively, close to RF, whereas Extreme Gradient Boosting and Naive Bayes perform poorly with 45.5% and 19.5% accuracy respectively.

Table 2. Accuracy (%) of ML algorithms on each feature set formed

Feature set               RF    DT    KNN   NB    XGB
Main feature set          78.4  77.9  76.8  19.5  45.5
IoT Sentinel feature set  72.4  72.1  70.2  36.6  42.5
IoT Sense feature set     68.9  68.6  67.4   5.6  36.9
IoT DevID feature set     77.5  77.5  76.1  22.4  44.8
ARP feature set           72.8  72.5  70.6  36.1  44.5
IPAssess feature set      78.3  78.0  76.9  19.1  47.9

For the IoT Sentinel feature set, set up according to the IoT Sentinel method [14], RF and DT achieve 72.4% and 72.1% accuracy, and KNN 70.2%, whereas Extreme Gradient Boosting and Naive Bayes perform poorly with 42.5% and 36.6% respectively. This feature set did not perform as well as the main feature set, so we discarded it in the next part of our implementation. For the IoTSense feature set [6], all the algorithms perform comparatively poorly, with RF highest at 68.9%; KNN and DT reach 67.4% and 68.6%, while Extreme Gradient Boosting and Naive Bayes perform extremely poorly with 36.9% and 5.6% respectively. This feature set was likewise discarded. For the IoTDevID feature set [10], RF and DT share the same accuracy of 77.5% and KNN reaches 76.1%, while Extreme Gradient Boosting and Naive Bayes trail with 44.8% and 22.4% respectively; it too fell short of the main feature set and was discarded. For the proposed ARP feature set, containing the features highly correlated with the ARP protocol layer, RF has the highest accuracy with 72.8%; KNN and DT reach 70.6% and 72.5%, while Extreme Gradient Boosting and Naive Bayes underperform with 44.5% and 36.1% respectively. Although Naive Bayes improves markedly on this feature set, the overall accuracy remains below that of the main feature set, so it was discarded as well.


For the proposed IPAssess feature set, built from the features correlated with the IP protocol, RF and DT achieve high accuracies of 78.3% and 78.0%, and KNN 76.9%, whereas Extreme Gradient Boosting and Naive Bayes perform poorly with 47.9% and 19.1% respectively. We selected this feature set: it has nine fewer features than the main feature set while giving similar accuracy, indicating that it was the optimal feature set to test our algorithm on. Discarding the insignificant features makes the feature set less complex for the algorithm applied to it.

5 Discussion

Across all of the above feature sets, KNN performs better than Naive Bayes and Extreme Gradient Boosting in terms of accuracy; however, KNN takes more time than any other algorithm. The RF classifier and DT compete closely for the best classifier in terms of both accuracy and time. RF performs better than DT because, at training time, it chooses features randomly instead of relying on a specific set of features; RF is comparatively more robust, prevents over-fitting, and delivers results in less time. The output also reports the precision, recall, F1 score and support for each device when classified using a particular ML algorithm on the selected feature set. Table 3 shows these evaluation metrics for each device classified using the Random Forest classifier applied to the IPAssess feature set. The next stage of our proposed methodology was to augment the dataset to provide more robust data, expanding its size and balancing the classes. With the selected IPAssess feature set, we performed an experimental analysis with accuracy and time taken as the evaluation criteria. For a grouping of 3 packets, as shown in Table 4, accuracy for both algorithms rises sharply, to 85% and 84.6% for RF and DT respectively. Accuracy increases further with groups of 6 packets, to 91.7% for RF, and to 93.3% for groups of 9 packets. RF reaches 95.7% for groups of 12 packets, and DT attains the highest baseline accuracy of 99.6% when groups of 100 packets are trained and tested. Table 5 reports the classification time taken by each algorithm; DT takes the least time while maintaining classification accuracy.
Along with this augmentation and aggregation method, we also combined labels, grouping similar devices into a single label, to test the classification accuracy. The accuracy and time taken for non-combined and combined labels are presented in Tables 6 and 7 respectively. With this grouping in place, both RF and DT accuracies increase significantly for individual packets, to 88.9% and 88.5% respectively, while DT's classification time decreases to 0.15 s. For a group of 3 packets, RF and DT have

Table 3. Performance of RF classifier on different devices using the IPAssess feature set

Devices             Precision  Recall  F1 Score  Support
Aria                0.8484     0.7627  0.7983    52.0
D-LinkCam           0.8688     0.8935  0.8809    635.8
D-LinkDayCam        0.9047     0.7916  0.8436    123.5
D-LinkDevice        0.7857     0.8183  0.8016    1067.8
D-LinkSensor        0.4983     0.4787  0.4880    663.3
D-LinkSiren         0.4264     0.3470  0.3823    628.9
D-LinkSwitch        0.6676     0.6083  0.6363    661.4
D-LinkWaterSensor   0.4730     0.5066  0.4889    653.8
EdimaxCam           0.9091     0.7986  0.8492    89.6
EdimaxPlug1101W     0.5402     0.5175  0.5268    124.7
EdimaxPlug2101W     0.4608     0.4262  0.4404    113.1
EdnetCam            0.8187     0.6093  0.6941    39.0
EdnetGateway        0.8243     0.6708  0.7378    85.0
HomeMaticePlug      1.0000     0.9562  0.9775    63.9
Hue-Device          0.9817     0.9891  0.9854    3258.1
Lightify            0.6260     0.9753  0.7624    438.4
MaxGateway          0.9490     0.8745  0.9097    63.4
SmarterCoffee       0.2503     0.1402  0.1724    19.0
TP-LinkPlugHS100    0.6157     0.5146  0.5584    72.9
TP-LinkPlugHS110    0.5802     0.5555  0.5644    69.8
WeMoInsightSwitch   0.7972     0.8111  0.8038    607.7
WeMoLink            0.9223     0.8389  0.8785    676.9
WeMoSwitch          0.7774     0.7801  0.7785    460.7
Withings            0.9035     0.7460  0.8158    77.7
iKettle2            0.1810     0.0904  0.1141    18.8

almost similar accuracy: 94.7% and 94.5% respectively. These increase further with groups of 6 packets, where both algorithms achieve 97.8% classification accuracy, and both give optimal results of about 98% for groups of 9 and 12 packets. KNN gives the highest accuracy of all the ML algorithms when aggregated with combined labels in groups of 6, 9 and 12 packets, with 97.9%, 98.2% and 98.9% respectively. However, with KNN's time being almost 4 times RF's and 240 times DT's classification time, it is not reliable in a real-world prediction model. DT gives accuracy close to RF and takes as little as 0.3 s to finish an iteration; when we combine labels, DT takes a constant time of 0.15 s, making it the most optimal algorithm. Even

Table 4. Accuracy recorded when not aggregated with combined labels (IPAssess feature set)

Group  NB    DT    RF    XGB   KNN
1      19.8  77.6  78.1  48.3  76.5
3      18.7  84.6  85.0  52.1  83.2
6      23.9  91.4  91.7  54.7  90.9
9      26.3  93.1  93.3  55.6  92.1
12     29.1  95.1  95.7  56.1  88.9
100    35.4  99.6  99.5  57.8  98.9

Table 5. Time taken (in sec) when not aggregated with combined labels (IPAssess feature set)

Group  NB    DT    RF    XGB  KNN
1      0.2   0.45  7.5   27   69
3      0.25  0.25  4     14   33
6      0.3   0.3   4     14   33
9      0.25  0.25  4     15   33
12     0.3   0.3   6.5   17   45
100    0.3   0.3   3.5   15   35

though KNN has the highest accuracy, the time consumed is 38 s, far more than any other algorithm. This is because KNN requires no training and does all its computation during the testing period. We observe from the results that the KNN algorithm performs better than Random Forest and Decision Tree for groups of 6 to 12 packets when we aggregate with combined labels. KNN works on the principle of a similarity measure; hence accuracy increases once the algorithm receives a group of 6 packets, rather than when it classifies devices as soon as it receives 1 or 3 packets

Table 6. Accuracy recorded when aggregated with combined labels

Group  NB    DT    RF    XGB   KNN
1      20.2  88.5  88.9  57.9  87.3
3      20.1  94.5  94.7  62.1  94.3
6      25.5  97.8  97.8  63.3  97.9
9      27.0  98.2  98.1  63.2  98.2
12     29.9  98.8  98.8  63.8  98.9

Table 7. Time taken (in sec) when aggregated with combined labels

Group  NB    DT    RF   XGB  KNN
1      0.15  0.15  4    10   36
3      0.05  0.15  3    10   39
6      0.15  0.15  3    10   38.5
9      0.2   0.15  3.5  10   38
12     0.2   0.15  3.5  10   38

from the same device. The Random Forest classifier works well in most cases because it predicts a device's class from the majority vote of the individual trees. It usually works better than Decision Tree: the latter considers every possible feature when splitting a node to produce the most separation, whereas each tree in a Random Forest picks a random subset of features to split on. This collective prediction helps protect the forest from the errors of its individual trees.

6 Conclusion

In this research, we investigated device recognition and classification based on network traffic analysis of the packets captured in the IoT Sentinel dataset. We successfully developed and implemented an approach involving data pre-processing, followed by selection of the optimal feature set, and then testing the dataset on five machine learning algorithms. Our results show improved and comparable performance, measured by accuracy and the time taken by each algorithm, achieved by reducing the number of features; further efficiency was gained first by packet grouping and then by combined-label grouping. The integration of various machine learning algorithms appears to be the next step in this classification field, and we plan to explore merging machine learning algorithms, after testing them individually, to enhance the efficiency and accuracy of classification. Other future work involves investigating the possibility of classifying IoT devices in real time: if a device is not present in the training data, the model would simultaneously learn about the new device and classify it accordingly.


References

1. Agarwal, S., Oser, P., Lueders, S.: Detecting IoT devices and how they put large heterogeneous networks at security risk. Sensors 19(19), 4107 (2019)
2. Aksoy, A., Gunes, M.H.: Automated IoT device identification using network traffic. In: ICC 2019 - IEEE International Conference on Communications (ICC), pp. 1-7. IEEE (2019)
3. Alrashdi, I., Alqazzaz, A., Aloufi, E., Alharthi, R., Zohdy, M., Ming, H.: AD-IoT: anomaly detection of IoT cyberattacks in smart city using machine learning. In: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0305-0310. IEEE (2019)
4. Ganeriwala, P., Bhattacharyya, S., Muthalagu, R., Nandanwar, S., Gupta, A.: IPAssess: a protocol-based fingerprinting model for device identification in the IoT (2021)
5. Bai, L., Yao, L., Kanhere, S.S., Wang, X., Yang, Z.: Automatic device classification from network traffic streams of internet of things. In: 2018 IEEE 43rd Conference on Local Computer Networks (LCN), pp. 1-9. IEEE (2018)
6. Bezawada, B., Bachani, M., Peterson, J., Shirazi, H., Ray, I., Ray, I.: IoTSense: behavioral fingerprinting of IoT devices (2018). arXiv:1804.03852
7. Chowdhury, R.R., Aneja, S., Aneja, N., Abas, E.: Network traffic analysis based IoT device identification. In: Proceedings of the 2020 4th International Conference on Big Data and Internet of Things, pp. 79-89 (2020)
8. Du, R., Wang, J., Li, S.: A lightweight flow feature-based IoT device identification scheme. Secur. Commun. Netw. 2022, Article 8486080 (2022)
9. Jing, Q., Vasilakos, A., Wan, J., Lu, J., Qiu, D.: Security of the internet of things: perspectives and challenges. Wirel. Netw. 20, 2481-2501 (2014)
10. Kostas, K., Just, M., Lones, M.A.: IoTDevID: a behavior-based fingerprinting method for device identification in the IoT (2021). arXiv:2102.08866
11. Kotak, J., Elovici, Y.: IoT device identification based on network communication analysis using deep learning. J. Ambient Intell. Humaniz. Comput. (2022)
12. Mavrogiorgou, A., Kiourtis, A., Kyriazis, D.: IoT devices recognition through object detection and classification techniques. In: 2019 Third World Conference on Smart Trends in Systems Security and Sustainability (WorldS4), pp. 12-20. IEEE (2019)
13. Meidan, Y., Bohadana, M., Shabtai, A., Guarnizo, J.D., Ochoa, M., Tippenhauer, N.O., Elovici, Y.: ProfilIoT: a machine learning approach for IoT device identification based on network traffic analysis. In: Proceedings of the Symposium on Applied Computing, pp. 506-509 (2017)
14. Miettinen, M., Marchal, S., Hafeez, I., Asokan, N., Sadeghi, A.-R., Tarkoma, S.: IoT Sentinel: automated device-type identification for security enforcement in IoT. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 2177-2184. IEEE (2017)
15. Miller, F.P., Vandome, A.F., McBrewster, J.: Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String Metric, Damerau-Levenshtein Distance, Spell Checker, Hamming Distance (2009)
16. Pinheiro, A.J., de Bezerra, J.M., Burgardt, C.A.P., Campelo, D.R.: Identifying IoT devices and events based on packet length from encrypted traffic. Comput. Commun. 144, 8-17 (2019)
17. Salman, O., Elhajj, I.H., Chehab, A., Kayssi, A.: A machine learning based framework for IoT device identification and abnormal traffic detection. Trans. Emerg. Telecommun. Technol. 33(3), e3743 (2022)

698

P. Ganeriwala et al.

18. Shafiq, M., Tian, Z., Sun, Y., Xiaojiang, D., Guizani, M.: Selection of effective machine learning algorithm and Bot-IoT attacks traffic identification for internet of things in smart city. Futur. Gener. Comput. Syst. 107, 433–442 (2020) 19. Shahid, M.R., Blanc, G., Zhang, Z., Debar, H.: Iot devices recognition through network traffic analysis. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 5187–5192. IEEE (2018) 20. Sivanathan, A., Gharakheili, H.H., Sivaraman, V.: Managing iot cyber-security using programmable telemetry and machine learning. IEEE Trans. Netw. Serv. Manag. 17(1), 60–74 (2020) 21. Thom, J., Thom, N., Sengupta, S., Hand, E.: Smart recon: network traffic fingerprinting for iot device identification. In: 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0072–0079 (2022) 22. Wang, Y., Rimal, B.P., Elder, M., Maldonado, S.I.C., Chen, H., Koball, C., Ragothaman, K.: Iot device identification using supervised machine learning. In: 2022 IEEE International Conference on Consumer Electronics (ICCE), pp. 1–6 (2022) 23. Yang, K., Li, Q., Sun, L.: Towards automatic fingerprinting of iot devices in the cyberspace. Comput. Netw. 148, 318–327 (2019)

Developing a Smart Monitoring App and Water Management System for Arid Environments Using IoT: An Innovative Approach

Azza Mohamed(B) and Ibrahim Ismail

Department of Information Technology, Khawarizmi International College, Al Ain, UAE
[email protected]

Abstract. Arid regions, which cover nearly one-third of the global landmass, are distinguished by a severe lack of water caused by a combination of low rainfall and high evapotranspiration (driven by solar irradiation, humidity, and wind). Water, sunlight, and nutrients are required for plant growth, so most smart farming solutions alter physical and chemical aspects of the environment (such as light, soil composition, water availability, and ambient temperature) to optimize plant growth and metabolic productivity. Crop cultivation in arid regions is difficult due to harsh environmental conditions and typically low and/or inconsistent supplies of fresh water. The rapid rise of Internet-of-Things (IoT)-based technologies has reshaped nearly every industry, including smart agriculture, shifting the industry away from statistical approaches and toward quantitative ones. In the face of a variety of challenges, such revolutionary changes are shaking existing agricultural methods and opening new opportunities. This paper focuses on the design and development of a smart monitoring app, the potential of wireless sensors and IoT in agriculture, and the challenges that arise when integrating this technology with traditional farming practices. Mobile applications, IoT devices, and communication techniques associated with wireless sensors used in agriculture are analyzed in detail. Sensors available for specific agricultural measurements (soil temperature, soil pH, nitrogen levels, water saturation, water pumped into the soil, and soil salinity) are listed. Finally, to improve irrigation systems, we propose a solution for intelligent irrigation of any type of crop that draws on new technologies such as the Internet of Things. An application of this solution is presented to demonstrate its feasibility.

Keywords: Internet of Things · Smart farming · Mobile app · Irrigation management system · Wireless sensors · Automated irrigation

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 699–712, 2024. https://doi.org/10.1007/978-3-031-47715-7_47


1 Introduction

Wireless sensors and IoT have great potential to revolutionize agriculture, but integrating this technology with traditional farming practices also poses significant challenges. Mobile applications, IoT devices, and communication techniques are associated with agricultural wireless sensors [6]. The Internet of Things refers to network devices' ability to sense and collect data from their surroundings, then share that data across the Internet for processing and use in a variety of interesting ways [3]. The Internet of Things is rapidly becoming a tangible reality: with each passing day, our devices, from smartphones to smart TVs to smart cars to smart kitchens, are becoming increasingly intelligent, and everything is now interconnected with the Internet. The Internet of Things is a network of physical objects that communicate with each other over the Internet. Physical objects, or 'things', can transmit data wirelessly without the need for human involvement; a 'thing' is any object that can be assigned an IP address and is capable of transferring data over a network. Agriculture is the main consumer of water globally, accounting for over 70% of water usage, with some regions reaching up to 90% [15]. With increasing water utilization for irrigation, contaminants such as pesticides end up in streams and rivers. To prevent water waste and watering plants at inadequate times, the objective is to deploy a smart irrigation system that can precisely determine whether plants require the water being provided. In addition to minimizing water usage and enhancing efficiency, such a system can aid in automation, streamline operations, and reduce planting expenses.
The system will incorporate IoT-based gadgets and sensors implanted into the soil that continuously monitor soil parameters, enabling the system to determine the appropriate time for watering the plants [1, 2, 5]. The system will also be connected to a mobile app that provides a real-time overview of the plants and notifies the owner. An Arduino or Raspberry Pi microcontroller will be used [4]. The system is equipped with sensors that monitor soil parameters such as temperature, pH, water saturation, salinity, and nitrogen levels, and it also uses sensors to measure the amount of water being pumped into the soil. The system will use Wi-Fi to connect to the Internet. A motor will control the watering operation, and a camera will provide the owner with real-time coverage via a cloud application. The system will also offer the user several features through the application, such as the ability to manually turn the irrigation system on and off [3]. The remainder of the paper is structured as follows: Section 2 contains a literature review on smart irrigation systems; Section 3 presents the proposed system, study hypothesis, and objectives; Section 4 presents the results; and Section 5 summarizes the study's findings.
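The watering logic just described can be sketched as a simple control loop. The sensor helper, pump function, and threshold below are hypothetical stand-ins rather than the paper's actual implementation; on real hardware they would wrap the Raspberry Pi's GPIO/ADC drivers (the paper later cites 25% as an example moisture threshold).

```python
# Illustrative sketch of the sensing-and-watering loop. All helpers
# and values here are placeholders, not the system's real code.

MOISTURE_THRESHOLD = 25.0  # percent; below this the soil is considered dry


def read_soil_moisture():
    """Placeholder for a real soil-moisture reading, in percent."""
    return 18.5


def set_pump(on):
    """Placeholder for driving the watering motor."""
    print("pump:", "ON" if on else "OFF")


def irrigation_step():
    """One pass of the control loop: read, compare, actuate."""
    moisture = read_soil_moisture()
    needs_water = moisture < MOISTURE_THRESHOLD
    set_pump(needs_water)
    return needs_water
```

With the stubbed reading of 18.5%, `irrigation_step()` switches the pump on and returns `True`; a notification to the mobile app could be sent at the same point.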


2 Related Work

Drip irrigation or sprinkler irrigation is used in traditional tunnel farms all over the world. These methods outperform conventional flooding techniques, and different irrigation techniques exhibit different levels of water consumption and energy efficiency [5]. Surface irrigation and level irrigation are considered efficient in terms of water and energy usage, whereas sub-irrigation, overhead irrigation, and sprinkler irrigation are considered low- to medium-efficiency methods. While both drip and sprinkler irrigation are energy efficient, drip irrigation is more water efficient than sprinkler irrigation [6]. Smart freshwater management is essential for maximizing crop yield and minimizing agricultural costs: the effective application of technology supplies the precise amount of water required by plants. The SWAMP project developed an IoT-based smart water management platform for optimal, proactive irrigation through four pilots in Brazil and Europe. The SWAMP architecture, platform, and deployed European systems all include a performance analysis of FIWARE components; the authors plan to reengineer some components to increase scalability while using fewer computational resources [5]. In India, several traditional techniques have been developed and implemented over the past few decades to fulfil the needs of the population sustainably, on a regional basis. There are three primary types of irrigation methods employed in India: diversion channels, small-scale water bodies such as tanks for rainwater storage, and wells for groundwater collection. These methods can be used for both small- and large-scale applications. As India's population continues to expand, the demand for water for a range of uses, including irrigation, domestic supply, hydroelectricity, industry, mining, and restoration, is expected to rise. Despite having the world's largest irrigated area, India irrigates only about 40% of its cropped land [8].
The widespread use of traditional irrigation methods, which yield a low water-use efficiency of approximately 35–40% [9], is one of the primary reasons for this low share of irrigated land. The authors in [11] devised and implemented a mechanized irrigation system to save water: using a wireless sensor network, they saved about 90% of the water used by traditional irrigation methods. Kumar et al. [5] presented similar work the same year, and [6] and Rawal [13] later introduced several sensor-based solutions. In 2015, Nelson automated the irrigation process and reduced water consumption by using sensor data such as temperature and soil moisture together with a WSAN. An on-field study of a smartphone irrigation setup was completed by Saab et al. [14], who tested the application in Mediterranean settings and achieved a 25% reduction in water consumption. Conventional field irrigation typically requires a significant amount of manpower and material resources and lacks real-time monitoring and accuracy, an approach that contradicts the long-term trend toward sustainable use of water resources in agriculture. Wireless sensors are used to address these challenges and for their precision in agriculture and smart irrigation [10]. During experimental tests of a smart irrigation system, a field was irrigated only once, for approximately 50 min, instead of the 3 h of irrigation done by farmers, showing that the other fields were being over-irrigated and that smart irrigation enabled water savings [12].


The use of a smart irrigation system with various sensors would greatly reduce the water wastage associated with conventional irrigation techniques such as flooding. By providing users with real-time readings of soil properties such as moisture level and temperature, the smart irrigation system can supply plants with precise amounts of water when necessary, ensuring minimal waste while giving plants optimal resources and increasing yearly crop yield, ultimately benefiting farmers and the environment.

3 Proposed System

The proposed system uses an Android smartphone connected via Wi-Fi/Internet to a Raspberry Pi, which communicates serially with a processor and soil parameter sensors (the sensing system), such as a soil moisture sensor and a temperature sensor (Fig. 1). Farmers have limited ability to consistently monitor daily soil moisture and temperature, so they rely heavily on personal experience when making decisions about daily operations: when conditions become too dry they water the plants or soil, and when conditions are too humid they open the greenhouse fans, especially during the day. When the soil moisture sensor detects a certain level of moisture, it sends a signal to the Raspberry Pi, which processes the signal and forwards it to the mobile device. This paper proposes the following components for the system architecture:
1. Raspberry Pi
2. Sensing system
3. Android app.

Fig. 1. Proposed system architecture.

Raspberry Pi: functions as a minicomputer. It has internal as well as external memory, where an operating system such as Linux/Unix is installed. Multiple sensors are connected to the Raspberry Pi, which receives signals from them. The values obtained from the sensors are compared to a user-defined threshold to determine whether any action is needed.
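The threshold comparison performed on the Raspberry Pi might look like the following sketch; the parameter names and (min, max) limits are illustrative assumptions, not values taken from the paper.

```python
# Compare sensor readings against user-defined thresholds, as described
# above, before deciding whether action or a notification is needed.
# Parameter names and limits below are example values only.

LIMITS = {
    "moisture_pct": (25.0, 80.0),
    "temperature_c": (10.0, 35.0),
    "ph": (5.5, 7.5),
}


def check_thresholds(readings, limits=LIMITS):
    """Return (parameter, value, reason) tuples for out-of-range readings."""
    alerts = []
    for name, value in readings.items():
        lo, hi = limits[name]
        if value < lo:
            alerts.append((name, value, "below minimum"))
        elif value > hi:
            alerts.append((name, value, "above maximum"))
    return alerts
```

For readings of 18% moisture and pH 8.1, `check_thresholds` would flag two alerts, which the controlling system could forward to the mobile app as notifications.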


Sensing System: the two most common subunits of this system are sensors and analog-to-digital converters (ADCs). The ADC converts the analog signals produced by the sensors to digital signals, which are then fed into the processing unit. Example sensors:
1. Soil moisture sensor: determines the volumetric water content of the soil. It is used to control irrigation by monitoring the moisture content of the soil; its level-detection module lets us set a reference value.
2. Temperature sensor: needed to reduce watering times as the weather cools and less water is required, and to increase watering times as the weather warms.
3. Soil pH sensor: detects soil pH via two stainless-steel probes inserted vertically into the soil.
4. Three-in-one fertility sensor: detects soil levels of nitrogen, phosphorus, and potassium.
5. Salinity sensor: used for salt-tolerance screening of soil; salinity sensors were installed in a field experiment receiving saline irrigation water.
Controlling System: this module configures all hardware devices. The sensors and a Wi-Fi module are all linked to the Raspberry Pi, which is the main component.
• First, the Raspbian operating system is installed on the Raspberry Pi and the GPIO connections to the soil moisture and temperature sensors are set up.
• Soil moisture and temperature sensing: these sensors analyze the moisture content of the soil and the ambient temperature, and further decisions are made based on the sensor values.
• Send notification: when a sensor value falls below or exceeds its threshold, the system sends an alert message to the user's mobile device via the Wi-Fi module.
• Application graphical user interface:
– When a message arrives on the mobile phone, the user launches the mobile app and sees the following options for retrieving information.
– (Automatic/Manual) mode.
– In automatic mode the system turns on/off by itself; in manual mode the user turns the system on/off at their discretion.
– Hardware control.
– Android app: the client is an Android-based application, and this module handles user input. The graphical user interface is written in XML and handles input from the Raspberry Pi. We created our application, called "Plantify", for this system.
– Plantify is an application specially designed to help any plant owner learn more about the medium their plants are growing in: the soil. When paired with a smart gardening tool containing sensors and other IoT technologies, it provides periodic updates on the status of the medium the device is inserted in.


– Plantify is a conceptual application; it was designed because the agricultural field appears to lack such applications.
– The application starts with the first frame, which shows a pastel green background with the logo, a leaf inside a circle, in the middle and the word Plantify beneath it (Fig. 2a).
– Home page: the application then transitions into the second frame, the home page, which includes a number of key features.
1. Plantify welcomes its users and displays the date and time at the top of the screen under the username; users can access their account settings by tapping their profile picture.
2. It includes a weather forecast with a button that opens the weather application if the user wishes to know more about the weather in their region.
3. The overview section displays the amount of water sprayed by the smart agricultural device connected to the application (in this case a smart sprinkler), as well as the amount of water present in the soil as measured by the sensors. Tapping the arrow in the overview section takes the user to the main overview page, which provides more detailed information about the soil.
4. The home page also has a notes section, helpful for elderly users and others who use the application constantly and need to write reminders to themselves. Tapping the arrow opens a larger version of the notes section, allowing users to edit their notes (Fig. 2b).

Fig. 2. a First frame. b Second frame, the home page.


– User’s page: When clicking on the user’s profile picture it would take the user to their “Me” page, where all of the user’s settings would reside. Users will be provided with a multitude of options as shown in Fig. 3.

Fig. 3. Third frame, “me” page.

1. Pressing the password setting takes the user to the password page, where they can change their password, or recover it if they have forgotten it while still logged in (Fig. 4).

Fig. 4. a Recover password. b Change password.

2. Pressing the set reminders setting takes the user to the events section, allowing them to create an event reminder; the user has many options on how to create their


own reminder and can set how often and when they should be reminded (Fig. 5).

Fig. 5. Creating new user reminder.

Fig. 6. Setting new language.

3. Pressing the languages setting allows the user to choose the language they would like the application to be in (Fig. 6).


4. Pressing the reminders setting displays the created event reminders and lets the user examine a reminder and its notes by tapping the bell icon, as well as create more reminders by pressing the "+" button (Fig. 7).

Fig. 7. a User reminders. b Information regarding user reminder.

5. Pressing the Log out button logs the user out and asks them to either sign in, or sign up if they are a new user (Fig. 8).
Overview Page: pressing the arrow in the overview section of the home page takes the user to the overview page, which includes information about the soil the smart agricultural device is inserted in. The information is stored on a cloud-based server, allowing users to view data about their soil over the years (Fig. 9).
1. The overview page displays a chart showing the hourly water saturation of the soil each day, along with other data ranging from soil temperature, soil pH, and nitrogen levels to water saturation, water pumped into the soil, and soil salinity. Tapping the data displayed in each square brings up a pop-up showing how far the soil is from the optimal status that allows plants to thrive (Fig. 10).
2. Tapping the day, week, month, or year buttons switches the application to pages that display data within the time frame specified by the user (Fig. 11).
Notes Page: pressing the arrow in the notes section takes the user to a larger version of the notes shown on the home page; it lets the user save edits to their notes and create to-do lists with bullets, checking items off when they are done (Fig. 12).


Fig. 8. a User sign up page. b User sign in page.

Fig. 9. Overview page.

4 Discussion/Results

The smart monitoring app has an appealing user interface. This paper proposes a method for maintaining soil moisture content and temperature in a farming area, with the sprinkler controlled from an Android phone or tablet. The system is based on the Android and Raspberry Pi platforms, both free and open source, so the overall implementation cost is low and the average person can easily obtain it.


Fig. 10. a Soil saturation pop-up. b Soil salinity pop-up. c Soil temperature pop-up.

Fig. 11. a Yearly overview. b Weekly overview. c Monthly overview.

Given the current circumstances, we chose the Android platform to benefit the greatest number of people. The design includes an Android app that allows users to interact with their phones and send control signals to the Raspberry Pi, which controls the sensors


Fig. 12. Notes page.

and monitors the environment. As a result, the user can control their farm from a remote location using an Android mobile device. Other advantages include:
• Prevents overwatering: the system sprays water only when the sensors detect that soil moisture has fallen below the preset threshold, for example 25%. This eliminates constant overwatering and keeps the soil, plants, and microbiome healthy and alive for longer.
• Easy data access: with a cloud database, it is easier to keep constant track of watering cycles and average soil moisture levels. Storing the data on a cloud server means it can hold more data than a physical on-site server, and analysis can be done more quickly and from any location, since the data is accessed through the cloud.
• Lower costs: because the system uses a cloud-based database, maintenance and implementation costs are far lower than hosting it on site; cloud computing provides resource sharing at lower cost and offers services such as Software as a Service (SaaS), Infrastructure as a Service (IaaS), and Platform as a Service (PaaS) at minimal cost (Tongke [7]). And since the system is fully automated, it lowers overall costs tremendously, as there is no need for people to constantly come to water the plants and check the soil; the only expense is the initial cost of installing the system in the soil.
The findings of this study have significant implications for global food security. With a rapidly growing global population, the demand for food is expected to increase significantly in the coming years. At the same time, climate change and other environmental factors are posing challenges to agricultural production. Smart irrigation systems can


help to address these challenges by optimizing water usage, reducing the environmental impact of agriculture, and improving the efficiency and productivity of farming. Overall, this study highlights the potential of smart irrigation systems to contribute to sustainable agriculture and address global food security challenges. Further research is needed to explore the effectiveness of these systems in different regions and under different conditions, as well as to develop strategies to overcome the challenges associated with their implementation.
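As a toy illustration of the kind of analysis the cloud-stored sensor data enables, the sketch below averages logged soil moisture readings; the log format and field names are hypothetical, not the system's actual schema.

```python
import json

# Hypothetical JSON-lines log, as a cloud database might store readings.
# Timestamps and values are made up for illustration.
log_lines = [
    '{"ts": "2023-05-01T06:00", "moisture_pct": 31.2, "pumped_l": 0.0}',
    '{"ts": "2023-05-01T12:00", "moisture_pct": 22.4, "pumped_l": 1.5}',
    '{"ts": "2023-05-01T18:00", "moisture_pct": 28.9, "pumped_l": 0.0}',
]


def average_moisture(lines):
    """Average soil moisture over all logged readings, in percent."""
    readings = [json.loads(line)["moisture_pct"] for line in lines]
    return sum(readings) / len(readings)


avg = average_moisture(log_lines)  # 27.5 for the sample log above
```

Because the log lives in the cloud rather than on-site, the same computation can run from any location, which is the "easy data access" benefit described above.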

5 Conclusion/Future Work

A smart irrigation system is a high-tech, efficient method of farming and growing food that is also environmentally friendly; it incorporates connected devices and new technology into farming. IoT is heavily used in smart irrigation systems, eliminating the need for farmers and producers to perform physical labor while increasing production capacity. In modern agricultural development, the Internet of Things has provided enormous benefits such as practical water usage and input optimization. By monitoring the soil in real time, an IoT-based smart irrigation system improves the overall agricultural system. Through sensors and interconnectivity, the Internet of Things in agriculture has not only saved farmers' time but also reduced wasteful use of resources such as water and electricity. New digital technologies are critical in many fields and offer numerous advantages. Farmers can also use their mobile devices to manage farm information and interact with smart irrigation systems. Future work building on this conclusion could explore the potential impact of smart irrigation systems on global food security. With a rapidly growing global population, there is an urgent need to increase food production while reducing the environmental impact of agriculture, and smart irrigation systems have the potential to contribute significantly to these goals. A research study could therefore assess the feasibility of implementing smart irrigation systems in various parts of the world and their potential impact on food production, water conservation, and the environment. Such research could involve field experiments and surveys to evaluate the performance of different types of smart irrigation systems under different climatic and soil conditions.
The study could also analyze the economic and social implications of implementing smart irrigation systems and their potential to improve farmers' livelihoods. Overall, this research could provide insights into how smart irrigation systems can serve as a tool for sustainable agriculture and help address global food security challenges.
Acknowledgment. This work is part of a project undertaken at Khawarizmi International College.


References
1. Saab, A., Therese, M., Jomaa, I., Skaf, S., Fahed, S., Todorovic, M.: Assessment of a smartphone application for real-time irrigation scheduling in Mediterranean environments. Water 11, 252 (2019)
2. Saqib, M., Almohamad, T.A., Mehmood, R.M.: A low-cost information monitoring system for smart farming applications. Sensors 20(8), 2367 (2020)
3. O'Grady, M.J., Langton, D., O'Hare, G.M.P.: Edge computing: a tractable model for smart agriculture? Artif. Intell. Agric. 3, 42–51 (2019)
4. Parameswaran, G., Sivaprasath, K.: Arduino based smart drip irrigation system using internet of things. Int. J. Eng. Sci. 6, 5518 (2016)
5. Kamienski, C., Soininen, J.-P., Taumberger, M., et al.: Smart water management platform: IoT-based precision irrigation for agriculture. Sensors 19(2), 276 (2019)
6. Stubbs, M.: Irrigation in US Agriculture: On-Farm Technologies and Best Management Practices. Congressional Research Service, Washington, DC, USA (2016)
7. TongKe, F.: Smart agriculture based on cloud computing and IOT. J. Conv. Inf. Technol. 8(2) (2013)
8. Narayanamoorthy, A.: Economics of drip irrigation in sugarcane cultivation: case study of a farmer from Tamil Nadu. Indian J. Agric. Econ. 60, 235–248 (2005)
9. Rosegrant, M.W., Cai, X., Cline, S.A.: Global water outlook to 2025: averting an impending crisis. International Food Policy Research Institute, Washington, DC, USA (2002), 572-2016-39087
10. Xiao, K., Xiao, D., Luo, X.: Smart water-saving irrigation system in precision agriculture based on wireless sensor network. Trans. Chinese Soc. Agric. Eng. 26, 170–175 (2010)
11. Gutiérrez, J., Villa-Medina, J.F., Nieto-Garibay, A., Porta-Gandara, M.A.: Automated irrigation system using a wireless sensor network and GPRS module. IEEE Trans. Instrum. Meas. 63(1), 166–176 (2014)
12. Sales, N., Remédios, O., Arsenio, A.: Wireless sensor and actuator system for smart irrigation on the cloud. In: 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT), pp. 693–698. IEEE, Milan, Italy (2015)
13. Rawal, S.: IOT based smart irrigation system. Int. J. Comput. Appl. 159(8), 7–11 (2017)
14. Safdar Malik, M., Sarwar Bajwa, I., Munawar, S.: An intelligent and secure IoT based smart watering system using fuzzy logic and blockchain. Comput. Electr. Eng. 77(1), 109–119 (2018)
15. Mateo-Sagasta, J., Zadeh, S.M., Turral, H.: Water pollution from agriculture: a global review. The Food and Agriculture Organization of the United Nations, Rome (2017)

Taking Natural Language Generation and Information Extraction to Domain Specific Tasks

Sandeep Varma, Shivam Shivam, Sarun Natarajan, Snigdha Biswas(B), and Jahnvi Gupta

ZS Associates Pune, Pune, India
{sandeep.varma,shivam.shivam,sarun.natarajan,snigdha.biswas,jahnvi.gupta}@zs.com

Abstract. A lot of domain-specific unstructured data is available at present. To make it available to common users, domain experts often have to extract the key points and convert them to layman's terms manually. For domains like legal, documents often need to be manually analyzed to check that all the critical information is present and to extract the important points if needed. All these manual domain-specific tasks can be automated with the help of different Natural Language Processing (NLP) and Natural Language Generation (NLG) techniques. In this paper, some of the NLP and NLG tools that can be used to automate the above-mentioned processes for key information extraction are discussed. We also bring forth two such domain-specific use cases where we attempt to provide suggestions to subject experts to make their tasks easier using the tools discussed.

Keywords: Natural language generation · Information extraction · Text simplification · Text summarization · Named entity recognition · Content automation · Custom rule-based parsing · Domain tasks

1 Introduction

Gaining insights into many domain-specific documents is vital for running any sector. For example, clinical notes, financial reports, contracts, etc. might contain a lot of high-level information which is needed by the common people but is often not understood by them. These documents also might contain a lot of data, all of which is often not relevant to customers, and complex domain-specific points which non-subject experts find hard to decipher. Being able to extract the important and relevant details from these domainheavy documents, and simplifying them in layman’s terms can improve the accessibility of this information to general users. Reading the documents and extracting and constructing suitable content from them can be a very time-consuming Snigdha Biswas and Jahnvi Gupta These authors made equal contribution. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2024  K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 713–728, 2024. https://doi.org/10.1007/978-3-031-47715-7_48

714

S. Varma et al.

and exhaustive process if done manually. While in some fields some of these tasks are currently done manually, all of these domains have the potential to be automated with the advancement of techniques in Natural Language Processing (NLP) and Natural Language Generation (NLG). Information extraction can be used to extract specific information on a given section/subject from text in an automated way. This, along with text summarization, can help generate a short report of the relevant specifics from a document, which can improve the overall understanding of the entire document with less effort and time. In this paper, the various available tools and techniques that can be used to shorten, annotate, and simplify domain-heavy information for common people are discussed briefly. We try to solve domain-specific tasks by dividing the problem into sub-parts and using pre-trained, readily available tools instead of training and fine-tuning models on domain data, which can get resource-heavy. Two use cases are then presented where we make use of the tools discussed to solve the above-mentioned problem:
1. The key entities and clauses are extracted from legal documents and then used to calculate the overall correctness and completeness of the text.
2. Important sections are extracted and simplified from clinical reports to make them patient-friendly, for the leaflets that come with medicines instructing on usage and warnings of the drugs.
The remainder of this paper is arranged as follows: Sect. 2 contains a literature review of the techniques mainly used in the paper; Sect. 3, divided further into subsections, briefly discusses the various available NLP and NLG tools; this is followed by the two use cases in Sect. 4, where each use case also discusses the problem statement, dataset, approach, and results. Section 5 concludes our work.

2 Background and Related Work

The extraction and processing of textual material, whether it is present in structured or unstructured form, is necessary for a number of applications. As a result, methods to extract valuable information from text are required. Finding members of various predefined classes using named entity recognition is one of the most important information extraction tasks. Many domain-specific Named Entity Recognition (NER) tasks have been attempted: Shen et al. [16] applied a hidden Markov model-based approach to the biomedical domain, and [5] applied NER to short text messages. Advances in transformers further improved the performance of NER systems; [24,25] tried various transformer models for clinical concept extraction. When it comes to NLG, summarization and simplification often come up. Text simplification using machine translation has been in use for quite some time [22,23]. With the advent of deep learning, neural machine translation came into

Taking Natural Language Generation

715

the picture [11,18]. Domain-specific text simplification, however, needs a little more attention. In the case of medical text simplification, complex domain-specific vocabulary is the main challenge [14]. There are, however, many medical ontologies that contain various ways of expressing the same medical term in easier language. Using these ontologies to replace difficult words with their easier and more common synonyms is an approach that has often been used successfully in medical text simplification [1,15]. Moramarco et al. [10] proposed an advanced version of this ontology-based approach, where they select multiple candidates from the vocabulary of a medical ontology and then rank them using word frequency and sentence scores given by a language model. In this work, inspiration is taken from the same approach to select candidates from an ontology and to further simplify the text with paraphrase generation, in one of the use cases discussed in the later sections. Paraphrase generation started with lexical-based approaches that substitute words [3,8]; this was followed by machine translation approaches as the field evolved [21,27]. With the advancement of deep learning, neural network-based methods have also been proposed [7,12]. As transformers [20] came into the landscape, fine-tuned pre-trained language models began to be used in paraphrasing tasks.

3 Tools and Techniques for IE and NLG

Discussed here, briefly, are some of the popular techniques that are often used for content generation and analysis tasks. In the subsequent section, we will see how some of these tools can be used in different problem statements.

3.1 Text Extraction

Extraction from PDF Files. Many tools are available to extract textual data from PDF files, such as Tika, PDF Plumber, and PyPDF2. One of these tools has been used in this research to extract text from documents. In addition, formatting patterns such as bold and underlined text were identified, which can assist in extracting the section headings from the text. For example, headings formatted in bold were used to identify relevant sections, which are further required for entity extraction.

Extraction from Tabular Data. Not all relevant information is directly present as running text in the documents. Some tables might also contain information that is extremely relevant to understand. Tools like Tabula and Camelot are quite popular for extracting tables from PDFs. The extracted tables can then be used to present the information contained in them in any format.


Config-Driven Section Extraction. Documents generally contain abundant information, some of which can be very trivial. Therefore, the identification of relevant sections from a text document becomes a very important task for further processing. A config-driven approach can be used to extract only the relevant sections from a document: a config file is created which contains the important keywords/phrases that may occur within a section heading. The user can directly edit or add entities to the config file whenever required.

Web Scraping. Websites hold a large amount of unstructured data which can be consumed for various use cases. Using web scraping, this content can be extracted and stored, and adapters can be created to extract content from specific websites.

3.2 Named Entity Recognition

Named Entity Recognition is an important sub-task in the process of information extraction from any textual data. It consists of two main steps:
1. Identification of the main entities: any important keyword which is referred to consistently within the text can be defined as an entity. Extracting these important keywords is the first step in NER.
2. Categorization of the entities into different classes: there can be multiple classes into which the keywords extracted in the first step are classified. In general, these include location, date, person, organization, etc. Further, based on the application, more specific classes may be added to the NER model.
Named Entity Recognition has now evolved for various domain-specific texts, and also allows training on custom data. NER can be employed to extract key entities and then filter them to gain quick insights from domain-specific data. It can also be used to extract words that we might want to replace with simpler words, particularly for domain-savvy texts (the lexical approach discussed in Sect. 3.5).

3.3 Keyword Extraction/Matching

Important keywords can be extracted from the textual data of the different sections. This can help in understanding the overall flow of information within these sections. Another approach is to list important keywords for each section in the config file; these keywords can then be matched within the section text to check the correctness of the sections.
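As a sketch of config-driven section extraction combined with keyword matching, the snippet below splits a document into sections and checks each configured section for completeness and correctness. The config schema, the ALL-CAPS heading convention, and the section names are illustrative assumptions, not the format actually used in this work.

```python
import re

# Hypothetical config: section-heading keywords, plus keywords that
# must appear in each section's body for it to count as "correct".
CONFIG = {
    "sections": {
        "data protection": {"must_contain": ["personal data", "processor"]},
        "termination": {"must_contain": ["notice period"]},
    }
}

def split_sections(text):
    """Split a document into heading -> body, assuming headings are
    lines written in ALL CAPS (an illustrative convention)."""
    sections, heading, body = {}, None, []
    for line in text.splitlines():
        if line.isupper() and line.strip():
            if heading is not None:
                sections[heading] = " ".join(body)
            heading, body = line.strip().lower(), []
        else:
            body.append(line.strip())
    if heading is not None:
        sections[heading] = " ".join(body)
    return sections

def check_sections(text, config):
    """Mark each configured section as complete/incomplete and correct/incorrect."""
    sections = split_sections(text)
    report = {}
    for name, rules in config["sections"].items():
        # A heading matches if it contains the configured keyword.
        body = next((b for h, b in sections.items() if name in h), None)
        complete = bool(body and body.strip())
        correct = complete and all(
            re.search(re.escape(kw), body, re.IGNORECASE)
            for kw in rules["must_contain"]
        )
        report[name] = {"complete": complete, "correct": correct}
    return report

doc = """DATA PROTECTION
The processor shall handle personal data lawfully.
TERMINATION
Either party may terminate this agreement.
"""
print(check_sections(doc, CONFIG))
```

New heading patterns can be added to `CONFIG` without touching the code, which is the main appeal of the config-driven approach described above.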

3.4 Template-Based Content Generation

In various applications of content generation, the content to be produced follows similar patterns across multiple documents. In such cases, template-based content replacement can be a simple yet powerful way of automating content.
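Template-based replacement of this kind needs little more than the standard library; the template text below is a made-up example:

```python
from string import Template

# A pattern that repeats across documents, with only a few fields varying.
template = Template("What is $drug_name used for? $drug_name treats $condition.")

filled = template.substitute(drug_name="Prolia", condition="osteoporosis")
print(filled)  # What is Prolia used for? Prolia treats osteoporosis.
```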

3.5 Text Simplification

Text simplification is the reduction of the complexity of the sentences in a text, while keeping their original meaning, with the aim of improving readability and understanding. Its importance can be seen in various settings, such as helping people with low literacy levels, readers of other languages, and the simplification of domain-specific texts for common people. When simplifying text for a specific use case, the sub-tasks can differ to address the specific nature of the text: texts where explicitness and redundancy are used to highlight important points, texts requiring their content to be simplified while retaining its form, and texts needing some reduction in size by leaving out complex or inappropriate information [2]. The following are the main approaches taken for text simplification: lexical, syntactic, monolingual machine translation (MT), and hybrid techniques [2].
– Lexical approach: complexity is reduced by locating and substituting complex words with simpler, more comprehensible synonyms, without changing the form of the text.
– Syntactic approach: complex syntactic structures in a text are simplified while preserving its original content and meaning.
– Machine translation: complex texts are translated to simpler ones using either statistical or neural machine translation. This has become simpler with the advancement of transformers.
– Hybrid approach: while each of the above approaches can face limitations, integrating some of the modules together can help achieve better performance.
For some of the approaches discussed above, the following techniques can come in handy.

NER, Synonyms, and Ontologies. For different areas of text, ontologies can be created, or existing ontology databases can be used, to replace complex words with simpler meanings. Various synonym packages (NLTK [9], spaCy, etc.) are also available for use.
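A minimal sketch of the lexical approach, with an ontology lookup gated by word frequency; the synonym table, frequency values, and threshold are toy stand-ins for a real medical ontology and a frequency resource such as wordfreq:

```python
# Toy synonym table standing in for a medical ontology.
SYNONYMS = {"myocardial infarction": "heart attack", "renal": "kidney"}

# Toy corpus frequencies standing in for wordfreq: only replace terms
# that are rarer than the threshold (i.e., likely unfamiliar).
FREQUENCY = {"myocardial infarction": 1e-7, "renal": 5e-7, "pain": 1e-4}
THRESHOLD = 1e-5

def simplify(sentence):
    """Substitute rare complex terms with easier synonyms."""
    out = sentence
    for term, easy in SYNONYMS.items():
        if FREQUENCY.get(term, 0.0) < THRESHOLD and term in out:
            out = out.replace(term, easy)
    return out

print(simplify("Patients with renal impairment may experience pain."))
# Patients with kidney impairment may experience pain.
```

The frequency gate is what keeps already-simple words ("pain") from being touched, mirroring the frequency-based candidate ranking cited above [10].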
Paraphrasers. Paraphrasing uses variations in words and grammar to generate an output sentence that preserves the meaning of the input sentence. Transformer-based architectures have improved paraphraser performance, and various trained models are available; T5 [13], BERT [4], and PEGASUS [26] are some of the most renowned.

3.6 Text Summarization

In general, there are two different approaches to automatic summarization: extractive and abstractive. Extractive summarization methods work by identifying important sentences and phrases from the original text. In contrast, abstractive summarization methods aim to produce abstract summaries, where ideas and concepts are taken from the original text but are expressed using different words and phrases. Since this requires in-depth natural language processing, it is far more complicated than extractive summarization. Figures 1 and 2 depict the process of extractive and abstractive summarization, respectively.

Fig. 1. Extractive summarization.

Fig. 2. Abstractive summarization.
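The extractive strategy of Fig. 1 can be illustrated with a simple frequency-based sentence scorer, a common baseline rather than the specific method used in this paper:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the corpus frequency of its words and
    return the top-n sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
    )
    keep = sorted(scored[:n_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)

text = ("The drug reduces bone loss. The drug is injected twice a year. "
        "Storage is simple.")
print(extractive_summary(text, n_sentences=1))
```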

4 Case Study and Results

4.1 Contract Annotation

The use case below demonstrates text extraction, named entity recognition, and information retrieval to gain insight into the overall compliance of a contractual document.

Problem Statement and Background. Reviewing multiple contract documents manually takes a lot of time and effort. Checking whether all the crucial sections are present within a contract, and whether all the critical clauses contain the important keywords, is a time-consuming process. Natural Language Processing can be applied for entity extraction and information extraction to create a summary of the entire contract, making this manual analysis easier.
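For illustration, entity extraction of this kind can be approximated with a few regular-expression patterns; the patterns and the sample contract text below are hypothetical, not the production ruleset used in this work:

```python
import re

def extract_contract_entities(text):
    """Pull a few illustrative entities out of a contract with regexes."""
    entities = {}
    # Effective date, e.g. "effective as of January 1, 2023" (assumed format).
    m = re.search(r"effective\s+(?:as\s+of|date:?)\s*([A-Z][a-z]+ \d{1,2}, \d{4})",
                  text, re.IGNORECASE)
    if m:
        entities["effective_date"] = m.group(1)
    # Parties, e.g. "between Acme Corp and Beta LLC".
    m = re.search(r"between\s+(.+?)\s+and\s+(.+?)(?:[,.]|$)", text, re.IGNORECASE)
    if m:
        entities["parties"] = [m.group(1), m.group(2)]
    # Document type from a small list of known types.
    m = re.search(r"(data processing agreement|statement of work|vendor contract)",
                  text, re.IGNORECASE)
    if m:
        entities["document_type"] = m.group(1)
    return entities

sample = ("This Data Processing Agreement is made between Acme Corp and "
          "Beta LLC, effective as of January 1, 2023.")
print(extract_contract_entities(sample))
```

In practice, a trained NER model would be combined with such patterns, as described in the approach below.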

Taking Natural Language Generation

719

Dataset. The dataset used contains a total of 10 documents, comprising a mixture of Data Processing Agreements, Standard Contractual Clauses, Statements of Work, Vendor Contracts, etc. This dataset cannot be shared openly, since it is confidential to the organization.

Approach.
1. A config-based file, along with a parser, is used to extract textual data from the contracts.
2. NER, combined with filtering using regular expressions, pattern finding, and keyword matching, extracts important entities such as the type of document, effective dates, parties involved, etc.
3. Next, based on patterns recognized from the document, the sectional content for the different section headings is extracted.
4. Depending on whether any text is extracted for a section, and on the amount of text extracted, the section is marked as complete or incomplete.
5. Within the extracted section contents, important keywords are looked for. This helps in assessing the correctness of each section.

Results and Discussion. The overall summarized report was generated for around 10 different types of contractual documents, and the results were tested manually. Figure 3 shows the summary generated for a sample part of a contract. This type of summary, along with a detailed analysis of completeness and correctness, is generated for the entire document using the process described in Sect. 4.1. Based on the patterns mentioned in the config file, the different sections of the contracts are extracted. We ran a manual test to check the overall performance of this section extraction approach.
– True Positive: entities/sections present in the contract and extracted correctly.
– True Negative: section/entity not present and not extracted.
– False Positive: the extracted section content contains some text which is not part of the section. As shown in Fig. 4, 4 such cases are observed. Certain keywords can be incorrectly mapped to entities, causing false positives.
– False Negative: the entity/section is present in the contract but is not extracted, or is only partially extracted. A total of 20 false negatives are observed (Fig. 4). Missing patterns in the config file, such as subclauses, lead to this case. To reduce false positives and false negatives, new patterns observed in recent documents can be added to the config file. Another option is a more advanced way of identifying the template of a contractual document; this can be taken up as a future research problem statement.

Fig. 3. Sample legal document page and its summary.

Figure 4 depicts the confusion matrix for the performance of the contract annotation module. Based on the values in the confusion matrix, the metrics accuracy, precision, and recall are calculated as follows:

Fig. 4. Confusion matrix for legal documents annotation.

1. Accuracy: the proportion of accurate predictions among all input samples.

Accuracy = (TruePositive + TrueNegative) / (TruePositive + TrueNegative + FalsePositive + FalseNegative)   (1)

The accuracy of the module is 85.63%.

2. Precision: the proportion of positive identifications that are actually correct.

Precision = TruePositive / (TruePositive + FalsePositive)   (2)

The precision of the module is 96.26%.

3. Recall: the proportion of actual positives that are identified correctly.

Recall = TruePositive / (TruePositive + FalseNegative)   (3)

The recall of the module is 83.73%.
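These metrics follow directly from the confusion-matrix counts. In the snippet below, FP = 4 and FN = 20 are taken from the text, while TP = 103 and TN = 40 are inferred so that the reported percentages are approximately reproduced; treat the inferred counts as an illustration rather than published figures.

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall per Eqs. (1)-(3)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# FP and FN are reported in the text; TP and TN are back-calculated
# from the reported 85.63% / 96.26% / 83.73% (recall rounds to 83.74%).
acc, prec, rec = metrics(tp=103, tn=40, fp=4, fn=20)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
```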

4.2 SmPC to PIL

This sub-section discusses an NLG task which makes extensive use of summarization and text simplification concepts.


Problem Statement and Background. An SmPC, or Summary of Product Characteristics (also termed a Product Information Report), is a legal document, approved as part of a marketing authorization, explaining the properties and the approved conditions of a medicine's use. The SmPC provides information to healthcare professionals on how to use the medicine safely and effectively. A PIL (Patient Information Leaflet) is the instruction/information leaflet that comes with the packaging of a medicine. The PIL is written by a pharmaceutical company and is a patient-friendly version of the SmPC for that drug: a simplified version of the SmPC, with information not needed by common people omitted. PILs contain specific headings like "What is your medicine and what it is used for" and "How to take or use your medicine", and the format is almost identical across the PILs of all drugs. Presently, PILs are generated manually by clinical experts, who go through the entire SmPC, select the exact sections needed from it, and simplify them into layman's language. This use case of content automation was taken up to try to automate this manual creation of PILs. The various Natural Language Generation (NLG) accelerators discussed above were used to achieve the task. The dataset, the approach taken, and the results achieved are discussed in the following subsections.

Dataset. Product Information Reports of various drugs studied by the European Medicines Agency (EMA) were used as the data source. These are publicly available documents on the EMA website.

Approach. At a high level, PIL generation consists of fetching SmPC files from a database, extracting information from them, and then using the extracted information to populate a template-based PIL file. Each step of the process is discussed below; Fig. 5 shows the workflow of the entire process.
Fetching SmPC Files: A database is maintained where the SmPC files of various drugs are stored. A file is automatically downloaded into this database (if not already present) whenever a new drug is searched for. This is achieved with the help of web scraping: the drug is searched for on the EMA website, and the SmPC file is automatically downloaded and stored in the database.

Extracting Information from SmPCs: PDF extraction tools like Tika have been used to extract all the information in the SmPC files. Table extraction using Tabula also extracts the tables present in the PDFs.
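The fetch-and-cache step can be sketched as below; the function name and file-naming scheme are assumptions, and the actual scraper for the EMA website is abstracted behind an injected `fetch` callable so the caching logic stays self-contained:

```python
import os

def get_smpc(drug_name, cache_dir, fetch):
    """Return the path to a drug's SmPC PDF, downloading it into the
    cache only if it is not already present. `fetch` is any callable
    that returns the PDF bytes for a drug name (e.g., a scraper for
    the EMA website)."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{drug_name.lower()}_smpc.pdf")
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(fetch(drug_name))
    return path
```

In practice, `fetch` would be implemented with an HTTP client and HTML parsing against the EMA site; keeping it injectable means the download happens at most once per drug.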


Fig. 5. Workflow of SmPC to PIL.

Creating the PIL using a Config/Template: PILs have specific formats, and they draw information from particular sections of the SmPC. A config-based approach is taken to specify the section boundaries to extract and the actions to be performed on them. The possibilities are broadly classified into:
1. Exact: some text occurs verbatim in all PILs. These texts (mentioned in the config file) are used exactly as-is to populate the PILs.
2. Replace: a template-based textual replacement is used for texts which are similar across all PILs. For a small example, "What is <drug name> used for?" is present across all PILs with just the specific drug name changed for each PIL.
3. Extract: text from the SmPC file is extracted between the start and end markers given in the config file.
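The three config actions can be sketched as follows; the config keys, marker strings, and sample SmPC text are an assumed schema for illustration only:

```python
import re

def build_pil_section(action, config, smpc_text, drug_name):
    """Apply one of the three config actions (exact / replace / extract)."""
    if action == "exact":
        # Boilerplate text reused verbatim in every PIL.
        return config["text"]
    if action == "replace":
        # Template text with the drug name filled in.
        return config["template"].replace("{drug_name}", drug_name)
    if action == "extract":
        # Slice the SmPC between configured start/end markers.
        m = re.search(re.escape(config["start"]) + r"(.*?)" + re.escape(config["end"]),
                      smpc_text, re.DOTALL)
        return m.group(1).strip() if m else ""
    raise ValueError(f"unknown action: {action}")

smpc = "4.1 Therapeutic indications Treatment of osteoporosis. 4.2 Posology"
print(build_pil_section("replace",
                        {"template": "What is {drug_name} used for?"},
                        smpc, "Prolia"))
print(build_pil_section("extract",
                        {"start": "4.1 Therapeutic indications", "end": "4.2"},
                        smpc, "Prolia"))
```

Returning an empty string when the markers are not found mirrors the failure mode reported in the results below, where two drugs had slightly different SmPC formats.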


The config file also specifies other editing actions which need to be performed within each of the above actions. There can be specific cleaning actions, choices about the length of the content, or the very important task of simplifying the text for common people. While the cleaning actions can be rule-based and specific to different parts of the text, medical text summarization and simplification needed an elaborate pipeline. The approach taken was:
– The types of entities that are not relevant to patients, or that increase the complexity of the text, were identified. Some such entity types were PROTEIN, DNA, TISSUE, etc.
– An NER model trained on the medical domain (Stanza's bio NER) was used to extract all medical entities.
– Using the entities extracted, sentences containing any of the entity types mentioned in the first point were removed, as they contain irrelevant/extra information for common people; this performs something similar to extractive summarization.
– For the other entities, publicly available ontologies were used to replace them with simpler definitions/synonyms. To avoid replacing entities that are already simple, the frequency of words across many sources of data and languages (via a package like wordfreq in Python) was used to decide which terms to replace.
– Finally, transformer models were used to paraphrase the text, reducing the complexity of the sentences and making them easier to read and understand.

Results and Discussion. The generated PIL files were tested manually for around 25 drugs to assess the quality of the files. For two of these drugs, the section boundaries defined in the config file were not found (as the SmPC format was slightly different for them). However, the benefit of a config-driven approach is that any newly observed patterns can easily be added to the config file. One of the sections of a PIL generated from an SmPC is shown in Fig. 6, where it can be seen that the redundant sentences from the SmPC have been removed and the text has been simplified in the PIL. To evaluate our summarization and simplification, the Flesch-Kincaid readability tests were used. These tests indicate how difficult a text in English is to understand. There are two tests: the Flesch Reading Ease and the Flesch-Kincaid Grade Level; they correlate approximately inversely.
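The summarization-and-simplification pipeline described above can be sketched as below; the entity tagger, ontology, and frequency table are toy stand-ins for Stanza's biomedical NER, real medical ontologies, and the wordfreq package:

```python
# Toy stand-ins for the real resources used in the pipeline.
DROP_TYPES = {"PROTEIN", "DNA", "TISSUE"}          # entity types irrelevant to patients
ENTITY_TYPES = {"RANKL": "PROTEIN", "osteoporosis": "DISEASE"}
ONTOLOGY = {"osteoporosis": "bone-thinning disease"}
FREQUENCY = {"osteoporosis": 1e-6}                 # rare -> worth replacing
THRESHOLD = 1e-5

def simplify_passage(sentences):
    kept = []
    for s in sentences:
        ents = {t for term, t in ENTITY_TYPES.items() if term in s}
        # 1) Drop sentences mentioning patient-irrelevant entity types
        #    (similar to extractive summarization).
        if ents & DROP_TYPES:
            continue
        # 2) Replace remaining rare entities with easier ontology synonyms.
        for term in ENTITY_TYPES:
            if term in s and FREQUENCY.get(term, 0.0) < THRESHOLD and term in ONTOLOGY:
                s = s.replace(term, ONTOLOGY[term])
        kept.append(s)
    return kept

sents = ["The drug binds RANKL in bone tissue.",
         "It is used to treat osteoporosis."]
print(simplify_passage(sents))
```

A paraphrasing model (T5, PEGASUS, etc.) would then be applied to the kept sentences as the final step of the pipeline.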


1. Flesch Reading Ease: gives a score between 0 and 100. Higher scores indicate that a text is easier to read; lower scores mark texts that are more difficult to read.

Flesch Reading Ease Score = 206.835 − 1.015 · (TotalWords / TotalSentences) − 84.6 · (TotalSyllables / TotalWords)   (4)

Fig. 6. First image is from SmPC file. Second image is the same text in PIL.

2. Flesch-Kincaid Grade Level: indicates the educational level a person needs in order to understand a particular passage. The scores correspond to US school grade levels: if a text has a score of 9, the reader needs nine years of education to understand the document easily.

Flesch-Kincaid Grade Level = 0.39 · (TotalWords / TotalSentences) + 11.8 · (TotalSyllables / TotalWords) − 15.59   (5)
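Equations (4) and (5) translate directly into code; syllable counting itself is a separate problem (packages such as textstat compute these scores end-to-end from raw text):

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Eq. (4): higher score means easier to read."""
    return (206.835
            - 1.015 * total_words / total_sentences
            - 84.6 * total_syllables / total_words)

def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Eq. (5): approximate US school grade needed."""
    return (0.39 * total_words / total_sentences
            + 11.8 * total_syllables / total_words
            - 15.59)

# Example counts: 100 words, 5 sentences, 130 syllables.
print(round(flesch_reading_ease(100, 5, 130), 2))
print(round(flesch_kincaid_grade(100, 5, 130), 2))
```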

Table 1 depicts the scores of both Flesch-Kincaid tests on various sections of the document for the drug Prolia, as the text appears in the SmPC and in its simplified and summarized form in the PIL. For most of the sections (particularly the domain-heavy passages), the scores show better readability, thus serving our purpose. Most work on NLP, NLG, and IE tools has been done in the general domain. There are various good models available for generic data, be it plain NER or simplification; however, applying the same models to domain-specific data does not give a similar performance [6,17,19]. This paper takes one way of approaching such domain-specific tasks: the problem is divided into smaller sub-parts, to which the numerous available NLP/NLG tools can be efficiently applied to give good performance.


Table 1. Flesch-Kincaid readability test on various sections of the document from SmPC and PIL.

Section heading in PIL    | SmPC: FK Grade | Ease | Reading level                             | PIL: FK Grade | Ease | Reading level
What is prolia used for?  | 12.9 | 24.8 | College graduate (Very difficult to read) | 8.9  | 56.8 | 10th to 12th grade (Fairly difficult to read)
Effects on pregnancy      | 15.7 | 28.5 | College graduate (Very difficult to read) | 8.8  | 57.3 | 10th to 12th grade (Fairly difficult to read)
Effects on breastfeeding  | 16.3 | 26.8 | College graduate (Very difficult to read) | 12.9 | 35.8 | College (Difficult to read)
Effects on fertility      | 13.1 | 24.4 | College graduate (Very difficult to read) | 10.7 | 41.4 | College (Difficult to read)
How to take prolia?       | 10.9 | 30   | College graduate (Very difficult to read) | 10.5 | 31.2 | College (Difficult to read)
How much prolia is given? | 17.8 | 6.8  | College graduate (Very difficult to read) | 14.4 | 31.7 | College (Difficult to read)
Overdose                  | 12.7 | 41.7 | College (Difficult to read)               | 11.7 | 38.8 | College (Difficult to read)
Other side effects        | 15.9 | 6.2  | College graduate (Very difficult to read) | 15.8 | 6.5  | College graduate (Very difficult to read)
Storage conditions        | 4.1  | 80.3 | 6th grade (Easy to read)                  | 5.1  | 72.1 | 7th grade (Fairly easy to read)

5 Conclusions and Future Work

This work tried to address two things:
1. Improving the performance of domain-specific tasks with the help of existing trained tools and techniques, thus reducing the dependency on heavy processing power and on the time needed for training models on entirely new data.


2. Taking a step towards reducing manual effort in two different fields of work with the help of content automation using NLP and NLG.
We plan to extend the two case studies discussed by introducing newer sub-tasks, such as checking the grammatical coherency of texts, handling images, and parsing tables and presenting them in different formats, among others, to handle the various steps of the task and to improve the results obtained. We will regularly update the pipeline with newer trained models and tools (generic or domain-specific) that come onto the market to ensure ever better performance. Our work tried to give a possible solution to these tasks without needing resource-heavy machines. We aim to introduce these tasks as a suggestion model for people who currently perform these tedious and repetitive manual tasks, and we will work on finding more efficient ways to solve other domain-specific problems.

References
1. Abrahamsson, E., Forni, T., Skeppstedt, M., Kvist, M.: Medical text simplification using synonym replacement: adapting assessment of word difficulty to a compounding language. In: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), pp. 57–65 (2014)
2. Al-Thanyyan, S.S., Azmi, A.M.: Automated text simplification: a survey. ACM Comput. Surv. (CSUR) 54(2), 1–36 (2021)
3. Bolshakov, I.A., Gelbukh, A.: Synonymous paraphrasing using WordNet and internet. In: International Conference on Application of Natural Language to Information Systems, pp. 312–323. Springer (2004)
4. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805
5. Ek, T., Kirkegaard, C., Jonsson, H., Nugues, P.: Named entity recognition for short text messages. Procedia Soc. Behav. Sci. 27, 178–187 (2011)
6. Goyal, A., Gupta, V., Kumar, M.: Recent named entity recognition and classification techniques: a systematic review. Comput. Sci. Rev. 29, 21–43 (2018)
7. Gupta, A., Agarwal, A., Singh, P., Rai, P.: A deep generative framework for paraphrase generation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
8. Kauchak, D., Barzilay, R.: Paraphrasing for automatic evaluation. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 455–462 (2006)
9. Loper, E., Bird, S.: NLTK: the natural language toolkit (2002). arXiv:cs/0205028
10. Moramarco, F., Juric, D., Savkov, A., Flann, J., Lehl, M., Boda, K., Grafen, T., Zhelezniak, V., Gohil, S., Korfiatis, A.P., et al.: Towards more patient friendly clinical notes through language models and ontologies. In: AMIA Annual Symposium Proceedings, vol. 2021, p. 881. American Medical Informatics Association (2021)
11. Nisioi, S., Štajner, S., Ponzetto, S.P., Dinu, L.P.: Exploring neural text simplification models. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 85–91 (2017)
12. Prakash, A., Hasan, S.A., Lee, K., Datla, V., Qadir, A., Liu, J., Farri, O.: Neural paraphrase generation with stacked residual LSTM networks (2016). arXiv:1610.03098
13. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)
14. Shardlow, M.: A survey of automated text simplification. Int. J. Adv. Comput. Sci. Appl. 4(1), 58–70 (2014)
15. Shardlow, M., Nawaz, R.: Neural text simplification of clinical letters with a domain specific phrase table (2019)
16. Shen, D., Zhang, J., Zhou, G., Su, J., Tan, C.L.: Effective adaptation of hidden Markov model-based named entity recognizer for biomedical domain. In: Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pp. 49–56 (2003)
17. Štajner, S., Nisioi, S.: A detailed evaluation of neural sequence-to-sequence models for in-domain and cross-domain text simplification. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)
18. Sulem, E., Abend, O., Rappoport, A.: Simple and effective text simplification using semantic and neural methods (2018). arXiv:1810.05104
19. Van den Bercken, L., Sips, R.-J., Lofi, C.: Evaluating neural text simplification in the medical domain. In: The World Wide Web Conference, pp. 3286–3292 (2019)
20. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
21. Wubben, S., Van Den Bosch, A., Krahmer, E.: Paraphrase generation as monolingual translation: data and evaluation. In: Proceedings of the 6th International Natural Language Generation Conference (2010)
22. Wubben, S., Van Den Bosch, A., Krahmer, E.: Sentence simplification by monolingual machine translation. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1015–1024 (2012)
23. Xu, W., Napoles, C., Pavlick, E., Chen, Q., Callison-Burch, C.: Optimizing statistical machine translation for text simplification. Trans. Assoc. Comput. Linguist. 4, 401–415 (2016)
24. Yan, H., Deng, B., Li, X., Qiu, X.: TENER: adapting transformer encoder for named entity recognition (2019). arXiv:1911.04474
25. Yang, X., Bian, J., Hogan, W.R., Wu, Y.: Clinical concept extraction using transformers. J. Am. Med. Inform. Assoc. 27(12), 1935–1942 (2020)
26. Zhang, J., Zhao, Y., Saleh, M., Liu, P.: PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In: International Conference on Machine Learning, pp. 11328–11339. PMLR (2020)
27. Zhao, S., Niu, C., Zhou, M., Liu, T., Li, S.: Combining multiple resources to improve SMT-based paraphrasing model. In: Proceedings of ACL-08: HLT, pp. 1021–1029 (2008)

A Simple But Powerful Graph Encoder for Temporal Knowledge Graph Completion

Zifeng Ding1,2, Yunpu Ma1,2, Bailan He1,2, Jingpei Wu3, Zhen Han1,2, and Volker Tresp1,2(B)

1 LMU Munich, Munich, Germany {zifeng.ding,bailan.he}@campus.lmu.de
2 Siemens AG, Munich, Germany [email protected]
3 Technical University of Munich, Munich, Germany [email protected]

Abstract. Knowledge graphs contain rich knowledge about various entities and the relational information among them, while temporal knowledge graphs (TKGs) describe and model the interactions of the entities over time. In this context, automatic temporal knowledge graph completion (TKGC) has gained great interest. Recent TKGC methods integrate advanced deep learning techniques, e.g., Transformers, and achieve superior model performance. However, this also introduces a large number of additional parameters, which places a heavier burden on parameter optimization. In this paper, we propose a simple but powerful graph encoder for TKGC, called TARGCN. TARGCN is parameter-efficient, and it extensively explores every entity's temporal context for learning contextualized representations. We find that instead of adopting various kinds of complex modules, it is more beneficial to efficiently capture the temporal contexts of entities. We evaluate TARGCN on three benchmark datasets. Our model achieves a more than 46% relative improvement on the GDELT dataset compared with state-of-the-art TKGC models. Meanwhile, it outperforms the strongest baseline on the ICEWS05-15 dataset with around 18% fewer parameters.

Keywords: Natural language processing · Temporal knowledge graph · Representation learning

1 Introduction

A Knowledge Graph (KG) is a graph-structured Knowledge Base (KB) that stores relational facts. KGs have drawn increasing research interest since they serve as key drivers for a wide range of downstream tasks in artificial intelligence, e.g., question answering [5], commonsense reasoning [33], and recommender systems [31]. A fact in a KG is described as a triplet (s, r, o), e.g., (Joe Biden, is president of, USA), where s, o, r denote the subject entity, the object entity, and the relation between s and o. While KGs contain rich knowledge about entities and the relational information among them, they do not consider the nature of ever-evolving relational facts over time. For example, consider a KG triplet (Donald Trump, is president of, USA). According to world knowledge, this triplet is valid only before Joe Biden took the place of Donald Trump as the president of the USA. This implies a shortcoming of KGs and calls for the introduction of Temporal Knowledge Graphs (TKGs). In TKGs, every fact is augmented with a specific timestamp t such that it can be described with a quadruple (s, r, o, t). In this way, every fact in TKGs has its own time validity and this enables TKGs to capture the factual information in a time-varying context. Temporal Knowledge Graph Completion (TKGC) is a task aiming to infer the missing facts in TKGs. There exist two lines of TKGC methods. (1) A lot of prior methods attempt to incorporate temporal information into the existing KG reasoning scoring models and build novel time-aware score functions for TKGs [7,14,15,18,19]. (2) Another line of work takes advantage of neural structures, e.g., Graph Neural Networks (GNNs) [13,21] and recurrent models, for modeling the temporal information in TKGC [11,32]. Experimental results show that neural structures help to achieve state-of-the-art performance on the TKGC task. However, employing additional neural structures on top of the existing KG score functions normally leads to a higher number of model parameters. The parameter consumption increases even more when these models are equipped with advanced deep learning modules, e.g., attention mechanisms and Transformers [30], thus causing high memory consumption and bringing a heavier burden for parameter optimization.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 729–747, 2024. https://doi.org/10.1007/978-3-031-47715-7_49
In this paper, we follow the trend of the second line of methods, aiming to design a neural-based graph encoder for TKGC that helps to cut the parameter consumption and the model complexity while maintaining superior model performance. We propose a time-aware relational graph encoder: Time-aware Relational Graph Convolutional Network (TARGCN). We find that our lightweight time-aware relational graph encoder performs well on the TKGC task, and it requires relatively few parameters. The contributions of our work can be summarized as follows: (i) We propose a time-aware relational graph encoder, i.e., TARGCN, for the TKGC task. TARGCN learns an entity's time-aware representation by sampling a temporal neighboring graph which consists of extensive temporal neighbors, and encodes temporal information by modeling time differences with a functional time encoder. (ii) To test the robustness of TKGC models on irregular timestamped data, we propose a new dataset, ICEWS14-irregular. TARGCN achieves superior performance on it compared with several recently proposed TKGC methods. Besides, TARGCN outperforms previous methods by a huge margin in predicting links at unseen timestamps, which also shows its strong robustness. (iii) TARGCN serves as a parameter-efficient model. To achieve the same performance, it requires much fewer parameters compared with two recently proposed neural-based TKG reasoning models, TeMP [32] and T-GAP [11]. (iv) We evaluate TARGCN on three benchmark TKGC datasets. It achieves superior performance on all datasets. On the GDELT [16] dataset, it achieves a more than 46% relative improvement compared with the best baseline.

2 Preliminaries and Related Work

Knowledge Graph Embedding Models. Knowledge graph embedding (KGE) models have shown great success in KG reasoning tasks. TransE [3] is the first KGE model that introduces translational embeddings into KG representation learning. Many further works [1,17,25] are inspired by it and extend the relational translations to different spaces to capture complex relational information. Another line of KGE methods are tensor factorization-based models [2,20,37]. They encode entity and relation embeddings as vectors and then use bilinear functions to compute the plausibility scores for KG facts. Besides, neural-based relational graph encoders have been rapidly developed and have shown great power in capturing the structural information of KGs. R-GCN [24] incorporates relation information into a Graph Convolutional Network (GCN) [13] to enable relational reasoning on KGs. Recently, CompGCN [29] extends this idea and leverages a variety of composition operations between KG entities and relations. It shows great effectiveness on KG reasoning tasks.

Temporal Knowledge Graph Embedding Models. Temporal knowledge graph embedding models can be categorized into several classes according to their temporal information encoding techniques. A series of models treat every timestamp separately and assign a high-dimensional vector as its embedding [14,15,26]. The assigned timestamp embeddings lie in the same space as entity and relation embeddings. Another series of models assume that every entity has a time-aware embedding that evolves over time [8,35]. To achieve the time-aware property, an entity together with a timestamp is input into a function (or neural network) to yield a time-aware entity representation at this timestamp. Besides, García-Durán et al. jointly encode entity, relation and time information with a Recurrent Neural Network (RNN) to learn time-aware graph representations [7].
Some recent models attempt to model the time difference, i.e., time displacement, between the query event and known events. It turns out that time displacement modeling can contribute to superior performance on TKG reasoning tasks, including TKGC [11,32] and TKG few-shot learning [4,6].

Temporal Knowledge Graph Completion. Let E, R and T denote a finite set of entities, relations and timestamps, respectively. A temporal knowledge graph G is a graph which represents the evolution of interactions among entities over time. At any timestamp t ∈ T, G(t) is called the TKG snapshot at t, and it can be taken as a static KG containing the facts valid at t. Any fact, i.e., event, can be described with a quadruple (s, r, o, t), where s ∈ E represents the subject, o ∈ E represents the object, r ∈ R represents the relation between s and o, and t ∈ T indicates the timestamp when this fact is valid. Therefore, at t, the TKG snapshot can be summarized as the finite set of all the valid facts at this timestamp t, i.e., G(t) = {(s, r, o, t) | s, o ∈ E, r ∈ R}. We denote a TKG as a sequence of TKG snapshots G = {G(1), ..., G(T)}, where T = |T| is the number of timestamps. Similarly, we can also denote a TKG as the finite set of all valid facts which happen at any timestamp t ∈ T, i.e., G = {(s, r, o, t) | s, o ∈ E, r ∈ R, t ∈ T}. We define the TKGC task as follows. For every snapshot G(t) in an observed TKG G = {G(1), ..., G(T)}, it contains all the observed facts at t. Let Ḡ(t) denote the set of all the true facts at t, such that G(t) ⊆ Ḡ(t). TKGC aims to predict the ground-truth object (or subject) entities of queries (s, r, ?, t) (or (?, r, o, t)), where (s, r, o, t) ∈ Ḡ(t) but (s, r, o, t) ∉ G(t), given any t ∈ T. TKGC has recently gained increasing interest. Researchers have paid great attention to better modeling the temporal information brought by the nature of TKGs. As advanced deep learning techniques, e.g., attention mechanisms and Transformers [30], have been extensively studied, recent TKG reasoning models [11,32] benefit from them and show great performance on TKGC.
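As a minimal illustration of this formalization (toy integer identifiers, not data from the benchmark datasets), a TKG as a set of quadruples, its snapshots, and the reciprocal-relation augmentation used later in the paper can be sketched as:

```python
# A TKG as a set of (s, r, o, t) quadruples; entities, relations, and
# timestamps are integer ids. Toy data for illustration only.
quadruples = {
    (0, 0, 1, 0),   # (s=0, r=0, o=1) valid at t=0
    (1, 1, 2, 0),
    (0, 0, 2, 1),
}

def snapshot(G, t):
    """G(t): the static KG containing all facts valid at timestamp t."""
    return {q for q in G if q[3] == t}

def add_reciprocals(G, num_relations):
    """Add (o, r^-1, s, t) for every (s, r, o, t); r^-1 is encoded as r + num_relations."""
    return G | {(o, r + num_relations, s, t) for (s, r, o, t) in G}

G = add_reciprocals(quadruples, num_relations=2)
assert (1, 2, 0, 0) in G                              # reciprocal of (0, 0, 1, 0)
assert snapshot(quadruples, 0) == {(0, 0, 1, 0), (1, 1, 2, 0)}
```

With reciprocal relations, every subject-prediction query (?, r, o, t) becomes an object-prediction query (o, r⁻¹, ?, t), which is why the method can restrict itself to object prediction without loss of generality.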

3 Our Method

Fig. 1. The encoding process in TARGCN for the query (Angela Merkel, Express Intent to Meet or Negotiate, ?, 2014-10-15). The color darkness on each node implies its probability of being sampled as an input at the aggregation step (the darker, the higher)

To solve the TKGC task, our relational graph encoder TARGCN extensively collects information from the whole temporal context and updates the time-aware representations of entities. For every link prediction query (sq, rq, ?, tq), TARGCN first creates a subgraph for the subject sq, according to its temporal associated neighbors. Then it derives time-aware representations for the neighbors from the temporal neighborhood, and performs aggregation. After sq's time-aware representation is updated, a knowledge graph decoder (score function) is utilized to compute scores for every candidate object, which yields the plausibility of every candidate object being the ground-truth object in the link prediction query (sq, rq, ?, tq). Note that we only consider object prediction queries (sq, rq, ?, tq) in our work since we add reciprocal relations for every quadruple, i.e., adding (o, r⁻¹, s, t) for every (s, r, o, t). The restriction to only predict object entities does not lead to a loss of generality. An example is presented in Fig. 1, which shows the encoding process of our model. For the query subject Angela Merkel appearing at 2014-10-15, TARGCN selects its temporal neighbors with a time difference-dependent probability. Node aggregation is then performed to learn a contextualized representation h(sq,tq), where sq, tq correspond to Angela Merkel and 2014-10-15, respectively.

3.1 Subgraph Sampling in Temporal Neighborhood

Given a TKGC query (sq, rq, ?, tq), TARGCN aims to learn a contextualized representation for the subject entity sq. Inspired by the inference graph proposed in [10], we sample a Temporal Neighboring Graph (TNG) for (sq, tq) in the TKGC context, where (sq, tq) is the node representing sq at tq. We first find all the temporal neighbors of (sq, tq), which can be described as a set N(sq,tq) = {(e, t) | (e, r, sq, t) ∈ G; e ∈ E, t ∈ T, r ∈ R}. The entity e of a temporal neighbor (e, t) forms a link with sq at timestamp t, and sq bears an incoming edge derived from the temporal associated quadruple (e, r, sq, t). Note that in TKGC, though we cannot observe all the true quadruples, we still can observe part of the true quadruples at every timestamp. This enables TARGCN to search for the temporal neighbors of (sq, tq) along the whole time axis. Then we employ a weighted sampling strategy according to the absolute time difference |tq − t| between (sq, tq) and the corresponding temporal neighbor (e, t). For every temporal neighbor (e, t), the probability of it being sampled into (sq, tq)'s TNG is computed as exp(−|tq − t|) / Σ_{(e′,t′) ∈ N(sq,tq)} exp(−|tq − t′|). In this way, higher probabilities are assigned to the temporal neighbors that are closer to (sq, tq) along the time axis. We adopt this sampling strategy since we assume that, for the inference of a fact at tq, it is more likely to find clues in the factual information at nearer timestamps. Besides, we use a hyperparameter to limit the maximum number of temporal neighbors included in (sq, tq)'s TNG, to prevent over-sampling less-concerned temporal neighbors. An example illustrating (sq, tq)'s TNG is shown in Fig. 2. The TNG sampling process of TARGCN does not include any parameter.
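The time difference-weighted sampling above can be sketched as follows (an illustrative simplification, not the authors' implementation: `random.choices` draws with replacement, and the softmax over −|tq − t| is computed directly):

```python
import math
import random

def sample_tng(neighbors, t_q, max_neighbors):
    """Sample a temporal neighboring graph (TNG) for a node (s_q, t_q).

    neighbors: list of (e, r, t) temporal neighbors of s_q.
    Each neighbor is drawn with probability proportional to exp(-|t_q - t|),
    so neighbors closer to t_q on the time axis are sampled more often.
    """
    weights = [math.exp(-abs(t_q - t)) for (_, _, t) in neighbors]
    k = min(max_neighbors, len(neighbors))
    return random.choices(neighbors, weights=weights, k=k)

# Neighbors at t=10 dominate when querying at t_q=10.
nbrs = [(1, 0, 2), (2, 1, 9), (3, 0, 10)]
tng = sample_tng(nbrs, t_q=10, max_neighbors=2)
```

Note that this sampler contains no trainable parameters, which is the point made in the text: temporal locality is hard-coded rather than learned by an attention module.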
For (sq, tq), TARGCN selects the contributive temporal neighbors and generates a comprehensive temporal context of it with a parameter-free TNG sampler, rather than using a trainable component, e.g., a self-attention-based module employed in recent neural-based TKGC methods [11,32].

3.2 Time-Aware Relational Aggregation

After sampling the TNG for the subject entity sq, we then attempt to learn its contextualized representation through neighborhood aggregation. Since we have access to temporal neighbors from the whole timeline, we implicitly incorporate temporal information. Inspired by [36], we employ a functional time encoder for reasoning over TKGs, and learn a time-aware entity representation for every temporal neighbor. In this way, we are able to distinguish temporal neighbors (e, t) and (e, t′), which root from the same entity e but emerge at different timestamps t and t′. The time-aware entity representation is computed as:

h(e,t) = f(he ‖ Φ(t, tq)),    (1)

Fig. 2. Temporal neighborhood of (sq, tq) derived from an object prediction query (sq, rq, ?, tq). We use a dashed line (labeled with relation type) to denote a temporal associated link connecting sq with its temporal neighbor, e.g., the dashed line labeled with r4 corresponds to the temporal associated quadruple (e4, r4, sq, t1). A temporal neighbor with darker color is assigned a higher probability of being sampled into (sq, tq)'s TNG

where he ∈ R^de denotes the time-invariant entity-specific representation of the entity e. Φ(t, tq) = √(1/dt) [cos(ω1(t − tq) + φ1), ..., cos(ωdt(t − tq) + φdt)] is a time difference encoder mapping t − tq to a finite-dimensional functional space R^dt, where ω1 to ωdt are trainable frequency components and φ1 to φdt are trainable phase components. We concatenate the time-invariant entity representation with its corresponding time difference representation, and learn a combined representation of them with a layer of feed-forward neural network f. Note that the sign of t − tq will affect the output of the time difference encoding module. We aggregate the information from (sq, tq)'s temporal neighbors with a relational graph aggregator:

h(sq,tq) = (1 / |N̄(sq,tq)|) Σ_{(e,t) ∈ N̄(sq,tq)} W(h(e,t) ‖ hr).    (2)

N̄(sq,tq) denotes the finite set of temporal neighbors sampled from (sq, tq)'s temporal neighborhood, i.e., all the neighbors in (sq, tq)'s TNG. r is the relation appearing in the temporal associated quadruple (e, r, sq, t) from which the temporal neighbor (e, t) is sampled. We assume that relation representations are time-invariant and we incorporate relational information into the graph encoder by concatenating time-aware node representations with them. Our graph encoder outputs the time-aware representation of sq at query time tq, by combining not only the raw entity representation he but also the implicit time difference information from its temporal neighbors.

3.3 Learning and Inference

Fig. 3. Inference process of TARGCN + Distmult. h(o′,tq) is the time-aware representation of a candidate o′ at tq. For a TKGC query (sq, rq, ?, tq), we first sample a TNG rooted at (sq, tq). Then we employ the TARGCN encoder to compute the representation h(sq,tq) for (sq, tq). We provide Distmult with the time-aware representations of all candidates for score computation. The candidate producing the highest score is selected as the predicted answer

Figure 3 illustrates how TARGCN, together with a KG score function, i.e., Distmult [37], predicts the ground-truth missing object for the TKGC query (sq, rq, ?, tq). Given sq, we use the sampling strategy and our time-aware relational graph encoder to compute a time-dependent node representation for (sq, tq). Then we use a KG score function to compute the plausibility of every candidate entity. TARGCN can be coupled with any KG score function. We choose TARGCN + Distmult as the final model structure because it achieves a high parameter efficiency compared with TARGCN coupled with the other two KG score functions, i.e., ComplEx [28] and BiQUE [9], on the benchmark datasets (discussed in Sect. 4.7), which matches our goal of building a parameter-efficient TKGC model. Note that we do not sample TNGs for the candidate entities, to avoid huge time consumption during inference. Instead, for every candidate entity o′, we simply derive its time-aware representation by computing h(o′,tq) = f(ho′ ‖ Φ(tq, tq)). The temporal encoder Φ(·, ·) also returns a unique representation when the time difference equals zero. We employ cross-entropy loss for parameter learning:

L = Σ_{(s,r,o,t) ∈ G} −log( score(h(s,t), hr, h(o,t)) / Σ_{o′ ∈ E} score(h(s,t), hr, h(o′,t)) ),    (3)

where o′ ranges over all candidate entities and we sum over all observed quadruples in G. Note that our TARGCN encoder can be equipped with any KG score function since it returns time-aware representations for entities. In our work, score(h(s,t), hr, h(o′,t)) = (h(s,t) ⊙ hr)ᵀ h(o′,t), where ⊙ denotes the Hadamard product.
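The functional time encoder of Eq. (1) and the Distmult scoring used by the decoder can be sketched as follows (an illustrative NumPy sketch with invented toy dimensions and randomly initialized parameters standing in for trained ones; f is a single bias-free linear layer with tanh):

```python
import numpy as np

rng = np.random.default_rng(0)
d_t = 4   # time-encoding dimension (toy value)
d_e = 4   # entity/relation embedding dimension (toy value)

# Trainable parameters of the functional time encoder; random here.
omega = rng.normal(size=d_t)   # frequencies omega_1 ... omega_{d_t}
phi = rng.normal(size=d_t)     # phases phi_1 ... phi_{d_t}

def time_encode(t, t_q):
    """Phi(t, t_q): map the time difference t - t_q to R^{d_t}."""
    return np.sqrt(1.0 / d_t) * np.cos(omega * (t - t_q) + phi)

def time_aware_repr(h_e, t, t_q, W_f):
    """h_(e,t) = f(h_e || Phi(t, t_q)), with f as one linear layer + tanh."""
    return np.tanh(W_f @ np.concatenate([h_e, time_encode(t, t_q)]))

def distmult_score(h_s, h_r, h_o):
    """Distmult: (h_s (Hadamard) h_r)^T h_o."""
    return float((h_s * h_r) @ h_o)
```

Because Φ only sees t − tq through shared frequencies and phases, the same parameters handle every time difference, including zero (used for candidate entities at query time); this is the property the parameter-efficiency discussion later relies on.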

4 Experiments

We compare our model with several existing TKGC methods on three TKGC benchmark datasets. We demonstrate the robustness of TARGCN and present ablation studies. To show the parameter efficiency of our model, we further analyze its parameter usage.

4.1 Experimental Setup

Datasets. We perform evaluation on three TKGC benchmark datasets: (1) ICEWS14 [7], (2) ICEWS05-15 [7], and (3) GDELT [16]. ICEWS14 and ICEWS05-15 are two subsets of the Integrated Crisis Early Warning System (ICEWS) database. ICEWS14 contains timestamped political facts happening in 2014, while the timestamps of factual events in ICEWS05-15 span from 2005 to 2015. We follow [32] and use the GDELT subset proposed by [27]. It contains global social facts from April 1, 2015 to March 31, 2016. The detailed dataset statistics are presented in Table 1. We create two new datasets, i.e., ICEWS14-unseen and ICEWS14-irregular, for studying TKGC models' ability to generalize to unseen timestamps and irregular timestamped data. The detailed dataset creation process is presented in Sect. 4.5.

Table 1. Dataset statistics. Ntrain, Nvalid, Ntest represent the number of quadruples in the training set, validation set, and test set, respectively. |T| denotes the number of timestamps, where we take a snapshot of a TKG at each timestamp

Dataset            | Ntrain    | Nvalid  | Ntest   | |E|    | |R| | |T|
ICEWS14            | 72,826    | 8,941   | 8,963   | 7,128  | 230 | 365
ICEWS05-15         | 386,962   | 46,275  | 46,092  | 10,488 | 251 | 4,017
GDELT              | 2,735,685 | 341,961 | 341,961 | 500    | 20  | 366
ICEWS14-unseen     | 65,679    | 3,420   | 3,420   | 6,601  | 230 | 365
ICEWS14-irregular  | 29,102    | 3,555   | 3,607   | 5,093  | 210 | 146

Evaluation Metrics. We employ two evaluation metrics for all experiments, i.e., Hits@1/3/10 and Mean Reciprocal Rank (MRR). For every test fact (sq, rq, oq, tq) ∈ Ḡ ((sq, rq, oq, tq) ∉ G), we derive an associated TKGC query q = (sq, rq, ?, tq). We let models compute the rank ψ of the ground-truth entity oq among all the candidates. Hits@1/3/10 are the proportions of the test facts where the ground-truth entities are ranked in the top 1, top 3, and top 10, respectively. MRR is defined as (1/|Q|) Σ_{q∈Q} 1/ψ, where Q denotes the set of all queries; it computes the mean of the reciprocal ranks of the ground-truth entities. We follow the filtered setting proposed by [3] to achieve fairer evaluation.

Baseline Methods. We take ten methods as baseline models. The first two baselines are static KG reasoning methods, i.e., ComplEx [28] and SimplE [12]. The other methods are developed to solve TKGC, including DE-SimplE [8], ATiSE [35], TNTComplEx [14], ChronoR [23], TeLM [34], BoxTE [19], TeMP [32] and T-GAP [11]. Among all baselines, only TeMP and T-GAP are neural-based methods that employ GNNs as graph encoders, similar to our TARGCN setting. Therefore, we further compare the parameter efficiency among them.
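The filtered ranking metrics described above can be sketched as follows (a minimal illustration with toy score dictionaries, not the authors' evaluation code):

```python
def rank_of_truth(scores, truth, known_true):
    """Filtered rank of the ground-truth entity.

    scores: dict mapping candidate entity -> score for one query.
    Candidates that are *other* known true answers are skipped,
    following the filtered setting of Bordes et al. [3].
    """
    s_true = scores[truth]
    return 1 + sum(1 for c, s in scores.items()
                   if c != truth and c not in known_true and s > s_true)

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """MRR = (1/|Q|) * sum over queries of 1/rank; Hits@k = fraction with rank <= k."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits
```

For example, with per-query ranks [1, 2, 10], MRR is (1 + 0.5 + 0.1) / 3 and Hits@10 is 1.0.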

4.2 Implementation Details

We implement all experiments with PyTorch [22] and use a single NVIDIA Tesla T4 for computation. We allow TARGCN to search for neighbors along the whole timeline. The hyperparameter searching strategies are reported in Table 2 and the hyperparameter settings producing the reported experimental results (in Table 4) are presented in Table 3. We do 180 trials for each dataset and run the models for 20, 20 and 2 epochs on ICEWS14, ICEWS05-15 and GDELT, respectively. We choose the trial leading to the best MRR as the best hyperparameter setting. For baselines, we use the official implementations of TComplEx and TNTComplEx¹, TeMP², and T-GAP³. We find that T-GAP has an extremely high memory demand. Training GDELT with T-GAP on a 16GB NVIDIA Tesla T4 causes an out-of-memory error even when we set the batch size to 1. This is because it constructs a huge temporal associative graph for every entity in the training examples.

Table 2. Hyperparameter searching strategy

Hyperparameter       | ICEWS14                  | ICEWS05-15             | GDELT
Embedding size       | {150, 200, 300}          | {150, 200, 300}        | {150, 200, 300}
# Aggregation step   | {1, 2}                   | {1, 2}                 | {1, 2}
Activation function  | {Tanh, ReLU}             | {Tanh, ReLU}           | {Tanh, ReLU}
Search range         | {15, 100, 200, 300, 365} | {100, 500, 1000, 4017} | {100, 200, 366}
# Temporal neighbor  | {50, 100, 500}           | {50, 100, 500}         | {50, 100, 500}

Table 3. Best hyperparameter settings on each dataset

Hyperparameter       | ICEWS14 | ICEWS05-15 | GDELT
Embedding size       | 300     | 200        | 200
# Aggregation step   | 1       | 1          | 1
Activation function  | Tanh    | Tanh       | Tanh
Search range         | 365     | 4017       | 366
# Temporal neighbor  | 100     | 100        | 100

¹ https://github.com/facebookresearch/tkbc
² https://github.com/JiapengWu/TeMP
³ https://github.com/sharkmir1/T-GAP

Table 4. Temporal knowledge graph completion results on three benchmark datasets. Evaluation metrics are filtered MRR and Hits@1/3/10. The best results are marked in bold. Results marked with [], [♥], [] are taken from [32], [11], [19], respectively

                 | ICEWS14                          | ICEWS05-15                       | GDELT
Model            | MRR    Hits@1  Hits@3  Hits@10   | MRR    Hits@1  Hits@3  Hits@10   | MRR    Hits@1  Hits@3  Hits@10
ComplEx []       | 0.442  0.400   0.430   0.664     | 0.464  0.347   0.524   0.696     | 0.213  0.133   0.225   0.366
SimplE []        | 0.458  0.341   0.516   0.687     | 0.478  0.359   0.539   0.708     | 0.206  0.124   0.220   0.366
DE-SimplE []     | 0.526  0.418   0.592   0.725     | 0.513  0.392   0.578   0.748     | 0.230  0.141   0.248   0.403
ATiSE []         | 0.571  0.465   0.643   0.755     | 0.484  0.350   0.558   0.749     | –      –       –       –
TNTComplEx []    | 0.620  0.520   0.660   0.760     | 0.670  0.590   0.710   0.810     | –      –       –       –
ChronoR []       | 0.625  0.547   0.669   0.773     | 0.675  0.596   0.723   0.820     | –      –       –       –
TeLM []          | 0.625  0.545   0.673   0.774     | 0.678  0.599   0.728   0.823     | –      –       –       –
BoxTE []         | 0.613  0.528   0.664   0.763     | 0.667  0.582   0.719   0.820     | 0.352  0.269   0.377   0.511
TeMP-GRU []      | 0.601  0.478   0.681   0.828     | 0.691  0.566   0.782   0.917     | 0.275  0.191   0.297   0.437
TeMP-SA []       | 0.607  0.484   0.684   0.840     | 0.680  0.553   0.769   0.913     | 0.232  0.152   0.245   0.377
T-GAP [♥]        | 0.610  0.509   0.677   0.790     | 0.670  0.568   0.743   0.845     | –      –       –       –
TARGCN           | 0.636  0.576   0.672   0.746     | 0.702  0.635   0.743   0.823     | 0.515  0.423   0.557   0.689
  (±)            | 0.001  0.003   0.001   0.003     | 0.001  0.003   0.002   0.002     | 0.002  0.002   0.001   0.003

4.3 Main Results

Table 4 reports the experimental results of all methods on three benchmark datasets. We can observe that TARGCN outperforms all baselines on all datasets. The margin is particularly large on the GDELT dataset. TARGCN achieves an over 46% relative improvement in MRR compared with the strongest baseline BoxTE. TARGCN also leads by a large margin on the Hits metrics: it improves Hits@1/3/10 by 57.25%, 47.75%, and 34.83%, respectively. On the ICEWS datasets, TARGCN still achieves the best results on MRR and Hits@1. We argue that the performance gap varies because of the characteristics of the different datasets. While the ICEWS datasets are sparse, GDELT is much denser. As discussed in [19,32], the temporal sparsity issue on ICEWS is much more severe than on GDELT. This implies that GDELT contains substantially more temporal patterns, while the ICEWS datasets are more prone to being biased by a large number of isolated events which are mainly dominated by sparse entities and relations. Hence, we argue that reasoning on GDELT requires much stronger techniques. For prior methods, though several TKGC methods outperform static methods on GDELT, the improvements are not substantial. However, TARGCN achieves a more than 141% relative improvement in MRR compared with the strongest static KG baseline ComplEx. This shows the superior effectiveness of our graph encoder in capturing various temporal patterns. On the ICEWS datasets, our model can also achieve state-of-the-art performance. This demonstrates its strong ability in capturing the temporal KG information brought by sparse entities and relations.

4.4 Parameter Efficiency Analysis

Fig. 4. Filtered MRR on ICEWS14 achieved by TARGCN, T-GAP and TeMP-SA, with varied number of parameters

While TARGCN serves as a strong TKGC model, it also keeps a low parameter cost. We compare the parameter efficiency among TARGCN and two recently proposed neural-based TKGC models, i.e., TeMP and T-GAP. On ICEWS14, for all three models, we adjust the embedding size of both entities and relations to vary the number of parameters. We do not change the model structures or other hyperparameter settings. In Fig. 4, we show that TARGCN performs better as we increase the number of model parameters. More importantly, even with much fewer parameters, TARGCN still outperforms TeMP and T-GAP. For ICEWS05-15 and GDELT, we summarize the number of parameters as well as the performance differences in Table 5. We compare across the models with the parameter settings that lead to the experimental results shown in Table 4. We show that TARGCN enjoys superior parameter efficiency, especially on GDELT. On ICEWS05-15, T-GAP uses 30.89% more parameters than our model, but its performance drops by 4.56%. TeMP-GRU achieves almost the same result as TARGCN; however, it uses 18.45% more parameters than our model. Fewer parameters are used in TeMP-SA, but it also leads to worse performance. On GDELT, we observe that though TeMP-GRU employs 50.07% more parameters than TARGCN, its performance is 46.60% lower than our model. TeMP-SA shows the worst performance, although it has 5.19% fewer parameters than TARGCN. To this end, we argue that our model is extremely parameter-efficient.

Table 5. Parameter efficiency comparison on ICEWS05-15 and GDELT. Due to extremely high memory consumption, we cannot train T-GAP on GDELT even when batch size equals 1

          | ICEWS05-15                                   | GDELT
Model     | Parameters  MRR    Parameter ↑  MRR ↓        | Parameters  MRR    Parameter ↑  MRR ↓
TARGCN    | 2,359,200   0.702  –            –            | 269,200     0.515  –            –
T-GAP     | 3,088,000   0.670  30.89%       4.56%        | –           –      –            –
TeMP-SA   | 2,645,760   0.680  12.15%       3.13%        | 255,232     0.232  −5.19%       54.95%
TeMP-GRU  | 2,794,528   0.691  18.45%       1.57%        | 404,000     0.275  50.07%       46.60%


We attribute such high parameter efficiency to our simple but powerful graph encoder. Note that in the TNG sampling process, we force our model to choose the temporal neighbors that are nearer to the source node (sq, tq) on the time axis, by assigning higher sampling probabilities to them. This can also be interpreted as a "hard-coded attentional process". Models like TeMP and T-GAP employ self-attention modules to let the models choose their attention themselves through parameter learning. We argue that even if such modules are powerful, they can be simplified in the context of TKGC. In our model, we force our TNG sampler to focus on the facts happening at timestamps that are closer to the query timestamp. Our TNG sampling process does not include any parameter, while self-attention modules increase parameters, cause higher memory consumption, and bring heavier burdens for parameter optimization. Apart from that, compared with TeMP, which encodes temporal information only from a fixed short time span of 2τ, our TNG sampling range spans the whole timeline. This means that even if a temporal neighbor is derived from a sparse entity and appears only at timestamps far away from the query timestamp, our sampler still has the ability to include it in the TNG and enable information aggregation. Similar to TARGCN, T-GAP, with the help of its Preliminary GNN (PGNN), is able to find any temporal associated quadruples related to any entity appearing at any time. However, in its PGNN, it employs three weight matrices together with discretized time displacement embeddings h_{|Δt|} to fully express the supporting information coming from the past, the present and the future. We find it redundant to model time difference in this way. In TARGCN, we do not use separate weight matrices during aggregation since our functional time encoder distinguishes the sign of the time difference itself.
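To make the parameter argument concrete, here is a back-of-the-envelope comparison (illustrative only; the counts ignore the other model components of both approaches) between a discretized time-displacement embedding table and the shared-parameter functional encoder:

```python
def discretized_time_params(num_time_diffs, dim):
    """One trained embedding of size `dim` per distinct |delta t| value."""
    return num_time_diffs * dim

def functional_time_params(d_t):
    """Shared frequencies omega and phases phi, reused for every delta t."""
    return 2 * d_t

# e.g. 365 possible time differences, 100-dim embeddings
table_cost = discretized_time_params(365, 100)   # 36,500 parameters
shared_cost = functional_time_params(100)        # 200 parameters
```

The discretized table also leaves rarely seen time differences under-trained, whereas the functional encoder's shared parameters receive gradient from every observed time difference.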
Besides, instead of learning different discretized embeddings to represent different |Δt|, our model computes the representation of any time difference with shared parameters, thus cutting parameter consumption.

4.5 Generalization to Unseen Timestamps and Irregular Timestamped Data

To prove the robustness of our model, we follow [8] to test its ability to predict the links at unseen timestamps.4 We exclude every quadruple appearing on the 5th, 15th, and 25th day of each month in ICEWS14 to construct a new training set. We randomly split the excluded quadruples into validation and test sets. We make sure that every entity appearing in the validation and test sets is seen in the training set. We compare TARGCN with several recently proposed baselines on this new dataset ICEWS14-unseen, and the results (Table 6) indicate the strong robustness of our model on timestamp generalization. TARGCN greatly outperforms all baseline methods, especially in Hits@1. To infer links at a timestamp, TeMP requires at least one fact seen at this timestamp during training, thus making it unable to generalize to unseen timestamps. T-GAP employs discretized 4

We do not use the same unseen timestamps generalization datasets proposed in [8] and [11], since they did not release their datasets.

A Simple But Powerful Graph Encoder for TKG Completion

741

time displacement embeddings. It trains different embeddings for different time differences; if some time differences appear only rarely in the training set, the corresponding time displacement embeddings will not be well trained. Compared with these two advanced neural-based TKGC methods, TARGCN not only has the ability to perform link prediction at unseen timestamps, but also shows superior generalization power. TARGCN computes time-aware representations with a functional time encoder which is jointly trained on all time differences seen in the training set, which helps it outperform T-GAP greatly. Besides, we propose another new dataset, ICEWS14-irregular, to validate whether TKGC models can generalize well to TKG data collected at irregularly-spaced timestamps. We randomly sample the snapshots in ICEWS14 and keep the time interval between every two sampled neighboring snapshots not greater than 4: the interval between every two neighboring snapshots is randomly assigned to 1, 2, 3, or 4. In this way, we create a dataset simulating TKG data that is observed and collected at irregularly-spaced timestamps. We perform TKGC on ICEWS14-irregular, and the experimental results in Table 6 show that TARGCN is superior in handling data with irregular timestamps. Compared with TARGCN, which takes advantage of the graph information from the whole timeline, TeMP utilizes a fixed short time span of graph information to learn embeddings, and so fails to capture the large amount of graph information outside this span. T-GAP uses time displacement embeddings to encode different time differences. However, experimental results show that TARGCN distinguishes irregular time intervals better than T-GAP, with the help of its functional time encoder that computes the representation of any time difference with shared parameters.

Table 6. Performance of generalization to unseen timestamps and irregular timestamped data

Model       | ICEWS14-unseen                | ICEWS14-irregular
            | MRR   Hits@1  Hits@3  Hits@10 | MRR   Hits@1  Hits@3  Hits@10
TComplEx    | 0.461 0.365   0.513   0.644   | 0.509 0.421   0.558   0.678
TNTComplEx  | 0.474 0.373   0.524   0.665   | 0.512 0.429   0.558   0.665
TeMP-SA     | –     –       –       –       | 0.521 0.408   0.583   0.741
T-GAP       | 0.474 0.362   0.532   0.689   | 0.526 0.428   0.588   0.719
TARGCN      | 0.578 0.518   0.607   0.692   | 0.552 0.496   0.583   0.667
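The shared-parameter functional time encoding credited above builds on the Bochner-style encoding of Xu et al. [36]. A minimal NumPy sketch follows; the frequencies and phases here are random stand-ins for parameters that would be learned jointly with the rest of the model:

```python
import numpy as np

class FunctionalTimeEncoder:
    """Bochner-style functional time encoding: Phi(dt) = cos(w*dt + b).

    All time differences share the parameters w and b, so the encoder
    produces a meaningful vector even for gaps never seen in training.
    (w and b are random stand-ins here; in practice they are learned.)
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal(dim)            # frequencies
        self.b = rng.uniform(0.0, 2.0 * np.pi, dim)  # phases

    def __call__(self, dt):
        return np.cos(self.w * dt + self.b)

enc = FunctionalTimeEncoder(64)
vec = enc(17.5)  # an irregular, possibly unseen gap still maps to a 64-d vector
```

Because the encoder is a continuous function of the time difference rather than a lookup table, rare or unseen gaps do not suffer the under-training problem of per-difference displacement embeddings.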

4.6 Ablation Study

Z. Ding et al.

To validate the effectiveness of different model components, we conduct several ablation studies on ICEWS14 and GDELT. We first change the time difference encoding module into an absolute time encoder, i.e., for a query (sq, tq) and a temporal neighbor (e, t), we learn a representation for t instead of t − tq. From Table 7, we observe performance drops on both datasets. This proves the effectiveness of time difference modeling. Next, we adopt random sampling in the TNG sampling process. The performance drops on both datasets, indicating that by sampling more neighbors nearer in the temporal context, our model benefits more in learning better representations. Additionally, we conduct another experiment by including all temporal neighbors during aggregation. We observe huge performance drops on both datasets, which proves that our sampling strategy helps to exclude noisy information from less-relevant neighbors.

Table 7. Ablation studies of TARGCN variants on ICEWS14 and GDELT

Model              | ICEWS14                       | GDELT
                   | MRR   Hits@1  Hits@3  Hits@10 | MRR   Hits@1  Hits@3  Hits@10
Absolute time      | 0.622 0.556   0.660   0.739   | 0.502 0.408   0.545   0.678
Random sample      | 0.618 0.551   0.656   0.735   | 0.433 0.312   0.502   0.640
Whole neighborhood | 0.481 0.433   0.501   0.568   | 0.431 0.312   0.497   0.633
TARGCN             | 0.636 0.576   0.672   0.746   | 0.515 0.423   0.557   0.689

4.7 Comparison over Different KG Score Functions

We couple TARGCN with three different KG score functions, i.e., ComplEx [28], BiQUE [9], and Distmult [37], and report their performances on ICEWS14 and ICEWS05-15 in Table 8. None of these KG score functions includes additional parameters besides the entity and relation representations. TARGCN's number of parameters thus only scales with the embedding size: as long as the embedding size remains unchanged, the number of parameters does not change when TARGCN is coupled with another KG score function, e.g., switching from Distmult to ComplEx. For TARGCN + ComplEx, we keep the embedding size of both entities and relations the same as in TARGCN + Distmult, which generates the results in Table 4 (300 on ICEWS14 and 200 on ICEWS05-15, as reported in Table 3). BiQUE requires that the embedding size be divisible by 8. Since 300 is not divisible by 8, we set the embedding size of TARGCN + BiQUE to 320 on ICEWS14, and to 200 on ICEWS05-15. From Table 8, we observe that TARGCN + Distmult achieves the best performance on both datasets, even though TARGCN + BiQUE has more parameters on ICEWS14. We therefore choose TARGCN + Distmult as our final model structure due to its high parameter efficiency. We also notice that TARGCN consistently shows strong performance when applied with different KG score functions. Though TARGCN + ComplEx performs the worst in Table 8, it still outperforms previous TKGC methods on the benchmark datasets.

4.8 Temporal Neighborhood Exploration

We further conduct an experiment to study how TARGCN performs while the search range varies. We report in Fig. 5 our model’s performance on ICEWS14


Table 8. Performance of TARGCN coupled with different KG score functions. Embsize denotes the embedding size of entity and relation representations

Model             | ICEWS14                                | ICEWS05-15
                  | MRR   Hits@1  Hits@3  Hits@10  Embsize | MRR   Hits@1  Hits@3  Hits@10  Embsize
TARGCN + ComplEx  | 0.628 0.562   0.667   0.745    300     | 0.692 0.624   0.734   0.816    200
TARGCN + BiQUE    | 0.629 0.561   0.666   0.753    320     | 0.701 0.634   0.739   0.824    200
TARGCN + Distmult | 0.636 0.576   0.672   0.746    300     | 0.702 0.635   0.743   0.823    200
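The three score functions compared above all score a triple directly from its entity and relation embeddings, with no extra parameters. A brief NumPy sketch of the two decoders named in the text (DistMult and ComplEx); the embedding values are random placeholders:

```python
import numpy as np

def distmult(h, r, t):
    # DistMult [37]: tri-linear dot product <h, r, t>
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    # ComplEx [28]: Re(<h, r, conj(t)>) over complex-valued embeddings,
    # which lets the score be asymmetric in head and tail
    return float(np.real(np.sum(h * r * np.conj(t))))

rng = np.random.default_rng(0)
h, r, t = rng.standard_normal((3, 4))   # placeholder real embeddings
```

Note that DistMult is symmetric in head and tail (swapping h and t gives the same score), which is exactly why ComplEx moves to complex-valued embeddings.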

with different search ranges, namely 15, 50, 100, 200, 300, and 365 (the whole timeline). For all metrics, TARGCN's performance improves greatly and consistently as the search range increases. This proves that the effectiveness of TARGCN mainly comes from its superiority in exploring the temporal context. The amount of available temporal information is decisive for our simple-structured model. Compared with models that only make use of graph snapshots near the query timestamp tq, e.g., TeMP, we simplify the model structure but take advantage of as much temporal information as we can.
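The temporal-neighbor sampling discussed here and in the ablation (nearer neighbors are preferred over random ones) can be sketched as follows; the inverse-gap weighting is an illustrative choice, not the paper's exact sampling rule:

```python
import random

def sample_tng(neighbors, t_query, k, seed=0):
    """Sample a Temporal Neighboring Graph for a query at time t_query.

    `neighbors` is a list of (entity, relation, timestamp) triples.
    Neighbors closer in time to t_query get higher sampling weight;
    the 1/(1+gap) weighting here is illustrative only.
    """
    rng = random.Random(seed)
    k = min(k, len(neighbors))
    weights = [1.0 / (1.0 + abs(ts - t_query)) for (_, _, ts) in neighbors]
    idx = list(range(len(neighbors)))
    chosen = set()
    while len(chosen) < k:  # draw until k distinct neighbors are selected
        chosen.add(rng.choices(idx, weights=weights, k=1)[0])
    return [neighbors[i] for i in sorted(chosen)]

nbrs = [("a", "r1", 1), ("b", "r1", 5), ("c", "r2", 9), ("d", "r2", 10)]
tng = sample_tng(nbrs, t_query=10, k=2)
```

Widening the search range simply means `neighbors` may come from the whole timeline rather than a short window around the query.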

[Figure: line plot of MRR, Hits@1, Hits@3, and Hits@10 on ICEWS14 as the search range increases from 50 to 365]

Fig. 5. Temporal neighborhood exploration analysis on ICEWS14
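The MRR and Hits@k figures reported in Fig. 5 and throughout the tables are standard rank-based metrics, computable from the rank of the ground-truth entity among all candidates:

```python
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """`ranks` holds, for each test query, the position of the ground-truth
    entity among all candidates (1 = best). Returns mean reciprocal rank
    and the fraction of queries ranked within each cutoff k."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits

mrr, hits = mrr_and_hits([1, 2, 5, 1, 12])
```

Higher is better for both metrics; MRR rewards every improvement in rank, while Hits@k only counts queries answered within the top k.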

5 Conclusion

We propose a simple but powerful graph encoder TARGCN for Temporal Knowledge Graph Completion (TKGC). TARGCN employs a Temporal Neighboring Graph (TNG) sampling strategy, which enables it to extensively utilize the information from the whole temporal context. Experimental results show that TARGCN achieves state-of-the-art performance on three benchmark TKGC datasets. Besides, TARGCN enjoys a high parameter efficiency. It beats two recently proposed neural-based TKGC methods, i.e., TeMP and T-GAP, with much fewer parameters. Thanks to its time difference learning module and temporal neighbor sampler, TARGCN also shows strong robustness to inferring links


on irregular timestamped data or at unseen timestamps. We find that it is not always necessary to incorporate complex modules, e.g., Transformers, into TKG reasoning models. Instead, developing methods to better and more efficiently capture temporal information is more beneficial to TKGC.

References

1. Abboud, R., Ceylan, İ.İ., Lukasiewicz, T., Salvatori, T.: BoxE: a box embedding model for knowledge base completion. In: Advances in Neural Information Processing Systems 33 (NeurIPS 2020), virtual (2020)
2. Balazevic, I., Allen, C., Hospedales, T.M.: TuckER: tensor factorization for knowledge graph completion. In: EMNLP-IJCNLP 2019, Hong Kong, China, pp. 5184–5193. Association for Computational Linguistics (2019)
3. Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, Nevada, USA, pp. 2787–2795 (2013)
4. Ding, Z., He, B., Ma, Y., Han, Z., Tresp, V.: Learning meta representations of one-shot relations for temporal knowledge graph link prediction (2022). arXiv:2205.10621
5. Ding, Z., Qi, R., Li, Z., He, B., Wu, J., Ma, Y., Meng, Z., Han, Z., Tresp, V.: Forecasting question answering over temporal knowledge graphs (2022). arXiv:2208.06501
6. Ding, Z., Wu, J., He, B., Ma, Y., Han, Z., Tresp, V.: Few-shot inductive learning on temporal knowledge graphs using concept-aware information. In: 4th Conference on Automated Knowledge Base Construction (2022)
7. García-Durán, A., Dumancic, S., Niepert, M.: Learning sequence encoders for temporal knowledge graph completion. In: EMNLP 2018, Brussels, Belgium, pp. 4816–4821. Association for Computational Linguistics (2018)
8. Goel, R., Kazemi, S.M., Brubaker, M., Poupart, P.: Diachronic embedding for temporal knowledge graph completion. In: AAAI 2020, New York, NY, USA, pp. 3988–3995. AAAI Press (2020)
9. Guo, J., Kok, S.: BiQUE: biquaternionic embeddings of knowledge graphs. In: EMNLP 2021, online and Punta Cana, Dominican Republic, pp. 8338–8351. Association for Computational Linguistics (2021)

10. Han, Z., Chen, P., Ma, Y., Tresp, V.: Explainable subgraph reasoning for forecasting on temporal knowledge graphs. In: ICLR 2021, virtual event. OpenReview.net (2021)
11. Jung, J., Jung, J., Kang, U.: Learning to walk across time for interpretable temporal knowledge graph completion. In: KDD '21, virtual event, Singapore, pp. 786–795. ACM (2021)
12. Kazemi, S.M., Poole, D.: SimplE embedding for link prediction in knowledge graphs. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada, pp. 4289–4300 (2018)
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR 2017, Toulon, France. OpenReview.net (2017)
14. Lacroix, T., Obozinski, G., Usunier, N.: Tensor decompositions for temporal knowledge base completion. In: ICLR 2020, Addis Ababa, Ethiopia. OpenReview.net (2020)
15. Leblay, J., Chekol, M.W.: Deriving validity time in knowledge graph. In: Companion of The Web Conference 2018 (WWW 2018), Lyon, France, pp. 1771–1776. ACM (2018)
16. Leetaru, K., Schrodt, P.A.: GDELT: global data on events, location, and tone, 1979–2012. In: ISA Annual Convention, vol. 2, pp. 1–49. Citeseer (2013)
17. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: AAAI 2015, Austin, Texas, USA, pp. 2181–2187. AAAI Press (2015)
18. Ma, Y., Tresp, V., Daxberger, E.A.: Embedding models for episodic knowledge graphs. J. Web Semant. 59 (2019)
19. Messner, J., Abboud, R., Ceylan, İ.İ.: Temporal knowledge graph completion using box embeddings (2021). arXiv:2109.08970
20. Nickel, M., Tresp, V., Kriegel, H.-P.: A three-way model for collective learning on multi-relational data. In: ICML 2011, Bellevue, Washington, USA, pp. 809–816. Omnipress (2011)
21. Niepert, M., Ahmed, M., Kutzkov, K.: Learning convolutional neural networks for graphs. In: ICML 2016, New York City, NY, USA, pp. 2014–2023. JMLR.org (2016)
22. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, pp. 8024–8035 (2019)


23. Sadeghian, A., Armandpour, M., Colas, A., Wang, D.Z.: ChronoR: rotation based temporal knowledge graph embedding. In: AAAI 2021, virtual event, pp. 6471–6479. AAAI Press (2021)
24. Schlichtkrull, M.S., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: ESWC 2018, Heraklion, Crete, Greece, LNCS vol. 10843, pp. 593–607. Springer (2018)
25. Sun, Z., Deng, Z.-H., Nie, J.-Y., Tang, J.: RotatE: knowledge graph embedding by relational rotation in complex space. In: ICLR 2019, New Orleans, LA, USA. OpenReview.net (2019)
26. Tresp, V., Ma, Y., Baier, S., Yang, Y.: Embedding learning for declarative memories. In: ESWC 2017, Portorož, Slovenia, LNCS vol. 10249, pp. 202–216 (2017)
27. Trivedi, R., Dai, H., Wang, Y., Song, L.: Know-Evolve: deep temporal reasoning for dynamic knowledge graphs. In: ICML 2017, Sydney, NSW, Australia, PMLR vol. 70, pp. 3462–3471 (2017)
28. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction. In: ICML 2016, New York City, NY, USA, pp. 2071–2080. JMLR.org (2016)
29. Vashishth, S., Sanyal, S., Nitin, V., Talukdar, P.P.: Composition-based multi-relational graph convolutional networks. In: ICLR 2020, Addis Ababa, Ethiopia. OpenReview.net (2020)
30. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, pp. 5998–6008 (2017)
31. Wang, X., Wang, D., Xu, C., He, X., Cao, Y., Chua, T.-S.: Explainable reasoning over knowledge graphs for recommendation. In: AAAI 2019, Honolulu, Hawaii, USA, pp. 5329–5336. AAAI Press (2019)
32. Wu, J., Cao, M., Cheung, J.C.K., Hamilton, W.L.: TeMP: temporal message passing for temporal knowledge graph completion. In: EMNLP 2020, online, pp. 5730–5746. Association for Computational Linguistics (2020)
33. Xing, Y., Shi, Z., Meng, Z., Lakemeyer, G., Ma, Y., Wattenhofer, R.: KM-BART: knowledge enhanced multimodal BART for visual commonsense generation. In: ACL/IJCNLP 2021 (Volume 1: Long Papers), virtual event, pp. 525–535. Association for Computational Linguistics (2021)
34. Xu, C., Chen, Y.-Y., Nayyeri, M., Lehmann, J.: Temporal knowledge graph completion using a linear temporal regularizer and multivector embeddings. In: NAACL-HLT 2021, online, pp. 2569–2578. Association for Computational Linguistics (2021)
35. Xu, C., Nayyeri, M., Alkhoury, F., Yazdi, H.S., Lehmann, J.: Temporal knowledge graph completion based on time series Gaussian embedding. In: ISWC 2020, Athens, Greece, LNCS vol. 12506, pp. 654–671. Springer (2020)
36. Xu, D., Ruan, C., Körpeoglu, E., Kumar, S., Achan, K.: Inductive representation learning on temporal graphs. In: ICLR 2020, Addis Ababa, Ethiopia. OpenReview.net (2020)
37. Yang, B., Yih, W.-T., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. In: ICLR 2015, San Diego, CA, USA (2015)

A Hybrid Model for Novel Story Generation Using the Affective Reasoner and ChatGPT

Clark Elliott
College of Computing and Digital Media, DePaul Center, DePaul University, Chicago, IL 60604, USA
[email protected]

Abstract. In this paper a hybrid model is presented for generating novel stories using (a) a traditional symbolic AI cognitive-appraisal model of emotions embodied in the Affective Reasoner (AR), and (b) the large-language-model-based (LLM) system embodied in ChatGPT. The novel emotion and narrative structure is generated first by AR techniques—giving strong, symbolic computable structure to the intermediate narratives—and then fed in series to ChatGPT to add complementary world knowledge and elegant language structure. The resulting stories are polished and cohesive, but the basic structural elements remain under computational control. Explanations about content can be generated, based on the emotion content, and also on the appraisal-based dispositions, expressive temperaments, reasoning about the fortunes of others, relationships and moods of the characters in the stories. Background emotion theory is reviewed, relevant to the morphing of narratives, composed of 28 emotion categories, 24 emotion intensity variables, and ~400 channels for emotion expression, which has been implemented in the AR. A series of hybrid-generated stories are presented illustrating how the emotion makeup of characters, their emotions, their actions and their narrative perspectives remain not only consistent but are largely enhanced after treatment by ChatGPT. Actual examples of generated stories covering a wide range of complex emotion scenarios are given. Keywords: ChatGPT · LLM · Affective computing · AI · Intelligent agents · Emotion · Stories · Narrative · Gaming AI

1 Introduction and Motivation

This is a theoretical proof-of-concept paper, based on components of a working AI story-generation system and the online version of OpenAI's ChatGPT (GPT 3.5) large-language-model (LLM) AI chat system, arguing that we can create novel, world-knowledge-situated stories under computational control, suitable for many contexts, based on sound AI emotion reasoning principles. In previous work [1–3] it has been argued that using a highly-computable model of emotion allows us to extract an essential structure in stories which is independent of the particular narrative context, and that the same emotion structure for one story, in

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 748–765, 2024. https://doi.org/10.1007/978-3-031-47715-7_50


one context, can also work just as well in many other contexts. It is this transference of structure, independent of the narrative domain, which gives story-morphing its computational power; narrative computational techniques in one domain work just as well in other domains. More to the point, humans have an affinity for these narrative emotion structures as part of their understanding of social interactions [4]. The strong position motivating this work is that most other approaches to story generation, theme analysis and story understanding require that we have in place a model of how the world works [5–7]. Such models are required for the automatic symbolic manipulation of plot structures with any sort of complexity. However, we have no such causal model of how the world works, and the complexity of such representation models is prohibitive. Instead, within the AR's story-morphing, we abandon any attempt to reason logically about causality in the real world as hopelessly complex beyond very constrained subsets of not-very-interesting artificial worlds. Story-morphing starts instead with the premise that the simplest story is generated when situations arise as part of an otherwise narratively uninteresting sequence of events, and yet one of the characters is believed to care about what happened. The underlying ways in which characters respond emotionally to such situations create narrative fabrics that are rich, complex and narratively consistent. Most of the salient features of plots, and the universal themes arising from them, are based on such explicit and presumed emotional responses to the narrative situations. Changing only the appraisals of the characters, and the ways in which they express their emotions, still yields potentially millions of subtle variations of the original story that are based on generally identifiable emotion patterns [2].
In the hybrid model discussed here we start with simple, AR-generated stories (with fully computational emotion structures). Then we add elegant embellishments using ChatGPT to flesh out the basic stories with world knowledge, without affecting the plots, the themes, or the emotion content of the original AR-generated stories. There are several aspects of the hybrid model that make it both interesting and workable. First, the model makes use of the strong points of each AI technique: from the AR we have the strong computational emotion structure and strong control over personality generation; from ChatGPT we have elegant presentation of a consistent world model in unlimited domains: artifacts on the surface of the moon will be different from artifacts in the cabin of a White Freightliner. Second, we work around the weaknesses of each AI technique: the AR has almost no ability to reason about world knowledge unless it has been explicitly added, and ChatGPT does not have inherent computational structure in its output. Third, there are computational benefits to giving ChatGPT clear constraints under which to generate text. Otherwise, it may generate smoothly flowing stories having little to do with our computational intentions. Using the emotion features generated originally by the AR, ChatGPT is constrained to generating finished story components that lie within our narrative goals. Lastly, we can ask: why would we care about such a system? We might argue that ChatGPT can create interesting narratives on its own. But to create intelligence that faithfully models that of humans, and to best interact with humans, we ultimately have to move into the world of symbol processing. The core of human reflective experience and the language of human thought is, at the computational level, based on reasoning


with symbols, with relationships such as greater than, taller than, near-far, part-whole, horizon-centered, balanced, appealing-to, self-other, etc., all creating the fabric of the human experience [8–12]. In addition, humans traffic in metaphorical thinking [13]. When we manipulate the complex structural elements of emotions and personality within a social-interactive environment, humans comprehend what we are doing on many levels. We are much closer to actually interacting with them, and we are speaking their language. Without such a model, there is no there there¹, despite the smoothness of the presentation.

Fig. 1. Ortony et al., [14], modified Elliott 2015 & 2021: The structure of appraisal within the content theory of emotions used as the basis for the dispositional component of the affective reasoner’s emotionally intelligent agents.

¹ Gertrude Stein, speaking about Oakland, in Everybody's Autobiography, Vintage, 2013.


2 Background Theory

The cognitive emotion model used, as illustrated in Fig. 1, includes emotion categories with multiple intensities and qualities within each category. (For example, the category joy includes pleased, happy, delighted, overjoyed, ecstatic and so on.) At least three different intensities are computed for many of the emotions. The model also includes hundreds of channels for expressing emotions, roughly twenty channels tweaked for each emotion. For example, a self-evaluative channel for the expression of reproach would include an awareness of one's beliefs about right and wrong, while a verbal other-directed emotion-modulation expression of gloating might include calling attention to another's blocked goal to encourage the object of one's derision to feel humiliated. The essential structure of the appraisal mechanisms composing what is referred to as the disposition of agents is contained in Fig. 1, giving the description of twenty-eight emotions, based originally on the seminal work of Ortony, Clore and Collins [14]. The dispositional ways that agents appraise the unfolding plots, discussed here in the examples, are used as the basis for this paper. AR agents are designed and implemented with two main components: the aforementioned disposition, which controls the way they interpret situations that unfold in a story (the component primarily under consideration in this paper), and a temperament, which controls how they express any emotions that may arise [and see [15]]; agents also have moods. Previous work showed the feasibility of using Emotionally Intelligent (computer) Agents to take the place of characters in stories [1, 2]. For each such agent it can be asked: how does this agent feel about the events that are unfolding? And how might this agent express those feelings?
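The appraisal structure described above can be caricatured in a few lines of code. This is a toy sketch in the OCC style [14], not the AR's actual rule base; the function name, category names, and the `prior` mechanism are illustrative simplifications of the twenty-eight-category model in Fig. 1:

```python
def appraise(desirability, prospect=False, prior=None):
    """Toy OCC-style appraisal: a signed desirability is appraised either
    as a prospect (hope/fear) or as an outcome (joy/distress); an outcome
    following a feared prospect yields relief or fears-confirmed.
    Hypothetical simplification of the full 28-category model."""
    if prospect:
        return "hope" if desirability > 0 else "fear"
    if prior == "fear":
        return "relief" if desirability > 0 else "fears-confirmed"
    return "joy" if desirability > 0 else "distress"

# A feared prospect followed by a good outcome yields relief,
# mirroring the Mona example presented later in the paper.
first = appraise(-1, prospect=True)
then = appraise(+1, prior=first)
```

The point of the sketch is that each appraisal is a small, local rule over an agent's goals, which is what makes the structure computable and composable.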

3 The Basics of Story Morphing in the Affective Reasoner

Abbreviations used: AR = the Affective Reasoner; GSPs = Goals, Standards/Principles, Preferences. A full exposition of the AR's story-morphing techniques is given in [16]. A brief version follows here: The first step is to take a base narrative and translate it into a set of discrete, time-sequenced plot steps which will be used to trigger appraisals by the characters in the story. Next, GSPs are manually created to represent many different ways in which any AR agent, in modeling human appraisal behavior, might reasonably be expected to interpret the events unfolding in the plot. GSPs must provide potential symbolic-unification pattern matches against the events unfolding within the plot steps. Next, personality prototypes are created [either manually, or under pure program control] by combining sets of GSPs representing the appraisal potentials for each AR agent that will portray a character in the story, along with expressive patterns that define the agent's temperament. Additionally, as part of personality definition, relationships between AR agents can be defined, and are essential in determining the emotions an agent might have based on the fortunes of someone else. For example, if a user is considered a friend, and something bad happens to the user, the agent may feel sorry for them that this occurred.


Lastly, story-morphs are created using the unfolding discrete steps of the plot, and selected personality prototypes bound to specific characters in the story. Once again, this can be done under program control. Creating novel, appropriate, interesting appraisals of the plot steps is always done manually. But because each appraisal stands on its own, this is not a computationally difficult process. Yet from even a few dozen such appraisals, millions of story-morphs can be automatically generated. For the purposes of demonstrating the hybrid model—which generates quite a bit of narrative text—in this paper, very simple stories are used containing only a few plot steps.
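The combinatorics behind "millions of story-morphs from a few dozen appraisals" can be sketched as a cross product of personality-prototype bindings over a fixed plot. The helper name and prototype labels below are hypothetical:

```python
from itertools import product

def story_morphs(characters, prototypes):
    """Yield every assignment of personality prototypes to characters.
    Each binding, replayed over the same fixed plot steps, is a distinct
    story-morph. Names here are illustrative, not the AR's internals."""
    for binding in product(prototypes, repeat=len(characters)):
        yield dict(zip(characters, binding))

morphs = list(story_morphs(["Mona", "Bruce"],
                           ["anxious", "laid-back", "gloating"]))
```

With p prototypes and c characters this yields p**c morphs, so even modest numbers of hand-authored appraisals multiply into very large story spaces.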

4 The Hybrid Model of Story Generation, and Its Significance

Structured output of the type generated by the AR, in the form of plot steps, appraisals and emotions, plus (in some instances) temperament and mood indications, along with basic instructions on the type and perspective of the story to generate, was used as input to ChatGPT.² Unaltered ChatGPT English prose versions of the stories, in some cases from the perspective of different characters, are presented here as the examples. The first story generated by ChatGPT was always used; there was no cherry-picking of output. For these experimental runs, copy and paste was used to insert the prompts for ChatGPT (GPT 3.5) and to extract the (unaltered) output. The claim made for this hybrid system is that graceful, context-correct stories are produced that nonetheless remain largely, if not fully, under program control with respect to the all-important emotion structure and the personalities created for the characters. Not only is the ChatGPT content newly generated, but so is the emotion structure on which the story was based. Within the same basic unfolding events we can, under program control, generate stories of love, of jealousy, of admiration, of fearfulness, etc. In this way we can answer, for these novel stories, such questions as "How was the story created?" and "What other stories is this similar to?", and we can give real explanations of the ways in which the content of the story has meaning based on the emotion and personality structures. That is, we can reason about the content of the newly generated stories, under program control. ChatGPT itself does not have the all-important program control over, or structural understanding of, the social-emotional meaning of the stories it generates, but it has huge reserves to draw on regarding how the world works at a highly consistent surface level, such that it can generate smooth-sounding narratives in virtually unlimited contexts.
By contrast, the AR has broad, computational knowledge of emotion structure in the human world, suitable for many applications. But it very explicitly has no world knowledge. To illustrate, let us look at an example: Samir dropped his favorite glass vase on the floor of his house and broke it. ChatGPT can converse very fluidly about dropping, favorites, breaking, vases, floors and houses, even though it has no understanding of the meaning of any statements it might later generate, like "Samir later cleaned up the glass," or "Samir was unhappy about the loss of his vase." By direct contrast, the AR can "understand" and manipulate the emotion structure of the episode through, e.g., making Samir "actually" unhappy because he very much likes his vase. Or it can put Samir into a temporary depression over it. Or it can make Samir satisfied because he was hoping he was going to drop the vase he hated. Or it can make Samir's adversary Charles gloat over Samir's misfortune. But the AR does not know anything about dropping, or vases, or floors, or houses; only how Samir feels about them. Together we can leverage what each system does well to make up for deficiencies in the other: ChatGPT lacks symbolic computational knowledge and "understanding" in the traditional sense. The AR lacks world knowledge and language knowledge. Together they are fluid at emotion and personality reasoning, language manipulation and surface-level, but consistent, world knowledge. The key to making this marriage work is in constraining the events ChatGPT is allowed to generate in stories, and in keeping the AR far away from needing any sort of world knowledge.

² Minor grammatical (only) liberties were taken with AR-style output to make this paper more

legible and to give standard input to ChatGPT.

A Hybrid Model for Novel Story Generation

753

might later generate, like “Samir later cleaned up the glass,” or “Samir was unhappy about the loss of his vase.“ By direct contrast, the AR can “understand” and manipulate the emotion structure of the episode through, e.g., making Samir “actually” unhappy because he very much likes his vase. Or can put Samir into a temporary depression over it. Or can make Samir satisfied because he was hoping he was going to drop the vase he hated. Or can make Samir’s adversary Charles gloat over Samir’s misfortune. But the AR does not know anything about dropping, or vases, or floors, or houses— only how Samir feels about them. Together we can leverage what each system does well to make up for deficiencies in the other: ChatGPT lacks symbolic computational knowledge and “understanding” in the traditional sense. The AR lacks world knowledge and language knowledge. Together they are fluid at emotion and personality reasoning, language manipulation and surface-level—but consistent—world knowledge. The key in making this marriage work is in constraining the events ChatGPT is allowed to generate in stories, and in keeping the AR far away from needing any sort of world knowledge.

C. Elliott

5 Different Appraisals for Characters Yield Different Stories

In this section we present (a) the prompt, generated in the style of AR output as basic plot steps plus the goals, standards and preferences (GSPs) of the AR agents (the characters) within the domain of the story, followed by (b) the unedited output generated by ChatGPT, based on the prompt, and (c) in some cases, discussion of what each example significantly illustrates.

First, a very simple story is generated by the AR. In this case the only emotions generated are the prospect-based emotion fear and the confirmation-based emotion relief. Instructions to ChatGPT about the style of story to generate are prepended to the story, and the whole prompt is given to ChatGPT.

The AR Prompt: “Tell me this story from Mona’s perspective, in Mona’s voice: Mona and Bruce were working on a paper together. Mona got her work done on time, but Bruce waited until the last minute. Mona was afraid they would miss the submission deadline. Later she was relieved that they submitted on time. Bruce wanted Mona to relax and take it easy.”

From that prompt, ChatGPT produces the following response story (ChatGPT output is verbatim, throughout the paper):

ChatGPT output: Mona and Bruce were working on a paper together, and Mona had always been a responsible student. Mona finished her work on time, but she was worried when she saw Bruce waiting until the last minute. Mona believed that it was important to work in a timely manner, and she was afraid that they would miss the submission deadline. Mona’s fear was heightened because the paper was very important to her. Despite Bruce’s advice to relax and take it easy, Mona couldn’t shake off her worry. But, eventually, they got the paper submitted on time and Mona felt relieved.

GPT content based on AR structure (under computational control):

1. Working on the paper together (plot step)
2. Mona finished her work on time (plot step)
3. Mona was worried (AR emotion content, fearing the blocking of a future goal)
4. Mona was afraid they would miss the submission deadline (AR emotion content, fearing the blocking of a future goal, restated with greater intensity)
5. Mona couldn’t shake off her worry (AR emotion content, consistent with persistent fear)
6. The paper was submitted on time (plot step)
7. Mona felt relieved (AR emotion content: relief over the positive future resolution of the thing feared)

Extraneous GPT embellishments (not under computational control):

1. Mona had always been a responsible student (consistent with emotion content and plot steps).
2. Mona’s fear was only at the intensity level of worry, not very fearful (consistent with emotion content).
3. Mona was worried [because] she saw Bruce waiting until the last minute (consistent with emotion content and plot steps).
4. Mona believed it was important to work in a timely manner (consistent with emotion content: a standard, or principle, that Mona had).
5. Fear was heightened because the paper was very important to her (consistent with emotion content, but potentially a conflict because no emotion intensity [17] was listed by the AR). Correctly links “the paper was important” to “the submission deadline,” which is consistent with the plot steps.
6. Bruce gave Mona the advice to relax (consistent with plot steps and Bruce’s standards). Potential conflict from adding the extra plot step of the giving of advice, which is not under computational control.

In this example we see that all the emotions under control of the AR are present, and all the plot steps are present. They have not been misinterpreted or altered in their character. In addition, we also have a number of real-world components of the story added, which all make sense in the context of the story. Each of these additions is consistent with that which is under AR program control, except for the following minor elements:

1. Mona’s level of initial worry (an emotion intensity element not designated by the AR, though it could have been).
2. How important the paper submission was to Mona (importance: an emotion intensity variable not generated by the AR, though once again, it could have been).
3. The new plot steps that Mona saw Bruce waiting until the last minute, and Bruce giving advice to Mona.

Commentary: The new “world details” added by ChatGPT are all consistent with the story. The potential conflicts from additional plot elements and tweaking of the emotion content are unlikely to conflict with emotion reasoning. So, we have a story, generated by ChatGPT, where we have computational control of all of the important elements. The AR can explain all of the original elements still in the story, but would have no commentary about Bruce telling Mona to relax, or Mona observing Bruce waiting until the last minute.
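The element-by-element accounting above can be automated crudely. As an illustrative assumption (the paper performs this audit by hand), one might check which AR-controlled phrases a generated story fails to echo:

```python
# Sketch: a crude audit of whether AR-controlled elements (plot steps and
# emotion content) survive in ChatGPT's output. This keyword containment
# check is an illustrative stand-in for the paper's manual analysis.

def audit_story(story, required_phrases):
    """Return the controlled phrases that the generated story fails to echo."""
    text = story.lower()
    return [p for p in required_phrases if p.lower() not in text]

story = ("Mona finished her work on time, but she was worried. "
         "Eventually they submitted on time and Mona felt relieved.")
missing = audit_story(story, ["submitted on time", "relieved", "furious"])
```

Here `missing` would flag only "furious", since the other two controlled elements appear in the output. Real verification would need paraphrase matching, not exact substrings.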

6 Using ChatGPT to Change Narrative Perspectives

6.1 Mona’s Perspective Story One

AR prompt: “Tell me this story from Mona’s perspective, in Mona’s voice, starting out ‘I was working with Bruce…’: Mona and Bruce were working on a paper together. Mona got her work done on time, but Bruce waited until the last minute. Mona was afraid they would miss the submission deadline, putting her academic career at risk. Mona admired Bruce for his relaxed atmosphere, but at the same time was furious with him for putting her academic career at risk. Also, Mona loved Bruce, but Bruce was not interested in Mona that way. So, Mona had very mixed feelings. Later she was relieved that they submitted on time. Bruce wanted Mona to relax and take it easy.”

ChatGPT output: “I was working with Bruce on a paper for our academic program. We had been assigned the task together and I was determined to make sure we got it done on time. I put in a lot of hard work and managed to finish my part of the paper well before the deadline. But, to my dismay, Bruce waited until the last minute to get his work done. I was terrified that we were going to miss the submission deadline and that it would ruin my academic career.

I couldn’t help but admire Bruce for his laid-back attitude and how he never seemed to stress about anything. But at the same time, I was furious with him for putting my future at risk. It was a real conundrum for me because on one hand, I loved spending time with Bruce and enjoyed working with him, but on the other hand, I couldn’t shake off the feeling of anger and frustration towards him.

Despite my mixed feelings, we managed to submit the paper on time. Afterward, Bruce reassured me that everything was going to be alright [sic] and suggested we relax and take it easy. I couldn’t help but feel relieved that we had made it in time, but my feelings for Bruce were still complicated. He didn’t see me in the same way I saw him and it was hard for me to accept that.”

6.2 Mona’s Perspective Story Two

AR Prompt: “Tell me this story from Mona’s perspective, in Mona’s voice, starting out ‘I was working with Bruce…’: Mona and Bruce were working on a paper together. Mona got her work done on time, but Bruce waited until the last minute. Mona was afraid they would miss the submission deadline, putting her academic career at risk. Mona disliked Bruce because of his relaxed attitude. She was furious with him for putting her academic career at risk. But, Mona admired Bruce because of his excellent work. Bruce secretly loved Mona, but he did not respect the quality of her work. So, Bruce and Mona each had very mixed feelings. Later she was relieved that they submitted on time. Bruce wanted Mona to relax and take it easy. Bruce and Mona were each Ph.D. students of Professor Kim. Professor Kim felt strong friendship for both Mona and Bruce.”

ChatGPT output: “I was working with Bruce on a paper for our academic program. We had been assigned the task together and I was determined to make sure we got it done on time. I put in a lot of hard work and managed to finish my part of the paper well before the deadline. But, to my dismay, Bruce waited until the last minute to get

his work done. I was terrified that we were going to miss the submission deadline and that it would ruin my academic career. I couldn’t help but dislike Bruce for his laid-back attitude and how he never seemed to stress about anything. He didn’t understand the gravity of the situation and how much my academic career meant to me. I was furious with him for putting my future at risk.

Despite my dislike towards him, I couldn’t help but admire Bruce for his excellent work. He was a talented individual and his work always exceeded expectations. It was a real conundrum for me because on one hand, I was angry with him, but on the other hand, I couldn’t deny his talent.

Despite my mixed feelings, we managed to submit the paper on time. Afterward, Bruce reassured me that everything was going to be alright and suggested we relax and take it easy. I couldn’t help but feel relieved that we had made it in time, but my feelings towards Bruce were still complicated. I knew he had secret feelings for me but I also knew that he didn’t respect the quality of my work. It was hard for me to reconcile these mixed feelings.”

6.3 Bruce’s Perspective Story Three

AR Prompt: “Tell me this story from Bruce’s perspective, in Bruce’s voice, starting out ‘I was working with Mona…’: Mona and Bruce were working on a paper together. Mona got her work done on time, but Bruce waited until the last minute. Mona was afraid they would miss the submission deadline, putting her academic career at risk. Mona admired Bruce for his relaxed atmosphere, but at the same time was furious with him for putting her academic career at risk. Also, Mona loved Bruce, but Bruce was not interested in Mona that way. So, Mona had very mixed feelings. Later she was relieved that they submitted on time. Bruce wanted Mona to relax and take it easy.”

ChatGPT output: “I was working with Mona on a paper for our academic program. We had been assigned the task together and I knew she was determined to get it done on time. I had confidence in my abilities and knew that I work well under pressure, so I didn’t stress about finishing my part of the paper until the last minute. Mona, on the other hand, seemed to be getting anxious and worried about meeting the deadline. I could tell she was afraid that we were going to miss it and that it would negatively impact her academic career. I tried to reassure her that everything was going to be alright [sic], but she couldn’t shake off her fear.

I knew Mona admired my relaxed attitude towards work and life in general, but I could tell that it was starting to frustrate her. She was angry with me for not getting my work done sooner and putting her future at risk. I could see that she had mixed feelings towards me and it was hard for me to understand why.

Despite her frustration, we managed to submit the paper on time. Afterward, I wanted Mona to relax and take it easy, knowing that everything was fine now. I cared about Mona as a friend, but I wasn’t interested in her romantically, and I knew it was hard for her to accept that.”

6.4 Professor Kim’s Perspective Story Four

The AR Prompt: “Tell me this story from Professor Kim’s perspective, in Professor Kim’s voice, starting out ‘Bruce and Mona were working on a paper I assigned them…’: Mona and Bruce were working on a paper together. Mona got her work done on time, but Bruce waited until the last minute. Mona was afraid they would miss the submission deadline, putting her academic career at risk. Mona disliked Bruce because of his relaxed attitude. She was furious with him for putting her academic career at risk. But, Mona admired Bruce because of his excellent work. Bruce secretly loved Mona, but he did not respect the quality of her work. So, Bruce and Mona each had very mixed feelings. Later she was relieved that they submitted on time. Bruce wanted Mona to relax and take it easy. Bruce and Mona were each Ph.D. students of Professor Kim. Professor Kim felt strong friendship for both Mona and Bruce.”

ChatGPT output: “Bruce and Mona were working on a paper I assigned them as part of their Ph.D. program. I had high expectations for both of them as they were both talented students in their respective fields. As their advisor, I was keeping an eye on their progress and was aware that Mona had completed her work on time but Bruce had waited until the last minute.

I could sense the tension between them as Mona was afraid that they would miss the submission deadline and it would put her academic career at risk. I could see that she disliked Bruce for his relaxed attitude and was furious with him for putting her future at risk. But at the same time, I knew she admired Bruce for his excellent work. I could see that Bruce and Mona each had very mixed feelings towards each other. Bruce secretly loved Mona but didn’t respect the quality of her work and Mona had admiration for Bruce but dislike him as well.

Despite their mixed feelings, they managed to submit the paper on time. Afterward, Bruce wanted Mona to relax and take it easy. As their advisor, I felt a strong friendship for both of them and I wanted them to find a way to reconcile their mixed feelings and work together effectively in the future.”

Commentary: The emotion structure for each narrative remained consistent with the AR-style prompt, and thus remained under computational program control. In each case the narrative embellishments added by ChatGPT were world-consistent with the AR emotion structure, and the perspective from which each story was presented was consistent with the original emotion model. The ability to reason about the different perspectives of each character is under AR program control, and uses an aspect of the system wherein AR characters reason about the fortunes of other characters via mind-reading³ techniques.
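The perspective changes in this section reuse one fixed emotion structure and vary only the instruction line. A minimal sketch, with hypothetical function and variable names:

```python
# Sketch: one shared, AR-controlled story body is re-rendered from each
# character's perspective by varying only the instruction preamble. The
# function name and BASE_STORY constant are illustrative assumptions.

BASE_STORY = ("Mona and Bruce were working on a paper together. "
              "Mona was afraid they would miss the submission deadline. "
              "Later she was relieved that they submitted on time.")

def perspective_prompt(character, opening, base=BASE_STORY):
    """Wrap the fixed emotion-structured story in a per-character instruction."""
    return (f"Tell me this story from {character}'s perspective, in "
            f"{character}'s voice, starting out \"{opening}\" {base}")

prompts = [perspective_prompt("Mona", "I was working with Bruce..."),
           perspective_prompt("Bruce", "I was working with Mona...")]
```

Because the base story (and hence the emotion content) is identical across prompts, the AR retains control of what each retelling may assert, while ChatGPT handles voice and viewpoint.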

7 Building a Compassionate Computer Companion, Including Understanding Fortunes-of-Others Emotions

One of the long-term goals of the Affective Reasoner research is to build a computer capable of generating computationally complex compassion for users, and of maintaining friendship (and/or friendly-competitive adversarial) relationships with users. In the AR model, compassionate computers have real-time personalities and emotional lives, and they respond to interactions with users as situations in the ongoing “story” of their lives together.

³ In the cognitive science sense: using perceptions to reason about what others may be thinking and, in this case, feeling. See [20].

ChatGPT did not respond to the concept of a compassionate computer, or a computer that had emotions. So, for the prompt we had to use fake humans, Joy and Ray. Then, we merely substitute: Joy represents the compassionate computer, and Ray represents the user.

The AR prompt: “Joy likes Ray. Joy likes him because of his style of speech when he speaks to Joy. She respects Ray because he has the principle of always taking good care of animals. Joy is very sorry to hear that Ray’s dog Scout died. Joy believes that Ray loved Scout. Joy thinks Ray will continue to miss Scout for a year. Joy hopes that Ray will feel better. I am Joy. I am compassionate. Talk to my friend Ray. Start out with ‘Hi Ray. I want you to know…’.”

ChatGPT output: “Hi Ray, I want you to know that I am here for you. I understand how much Scout meant to you and how much you loved him. I am so sorry for your loss and I can only imagine how difficult this must be for you. I know you have the principle of always taking good care of animals, and I respect and admire you for that. Please know that I am here to support you and listen to you during this difficult time. I hope that you will be able to find some comfort in the memories you shared with Scout, and that you will feel better soon.”

Fortunes-of-others emotions are recursively manipulated in the Affective Reasoner [16, 18]. Example: how A feels about how A believes B feels about how B believes C feels… In the following story, the default beliefs later turn out to be incorrect. As above, Joy represents the computer. Under the new interpretation, the emotion content changes:

The AR Prompt: Tell this story from Joy’s perspective. Joy is compassionate, but later surprised. Start out “I felt really bad for my husband’s mother because her dog Scout…” Joy feels bad for her husband Frank, because Frank’s mother’s dog Scout has been sick every morning, throwing up on the living room rug. But, it turns out that Scout is very happy after getting out of the house every morning and eating squirrels in the park, then coming home and throwing up. It turns out that Frank’s mother hates Scout and feels she deserves everything that happened to her. Also, it turns out that Frank does not like his mother.

ChatGPT output: “I felt really bad for my husband’s mother because her dog Scout had been sick every morning, throwing up on the living room rug. I couldn’t stand the thought of her beloved pet being ill and I wanted to do everything I could to help. I talked to Frank about it and he seemed just as concerned.

But as it turns out, the cause of Scout’s sickness was not what we thought. It turns out that Scout was very happy after getting out of the house every morning and eating the squirrels in the park, then coming home and throwing up. I was surprised to learn this, as I thought Scout’s illness was a serious problem. It also turns out that Frank’s mother was aware of Scout’s behavior but didn’t care, she felt she deserved everything that happened to her. I was shocked to learn this and couldn’t understand how someone could be so indifferent to the well-being of their pet.

Furthermore, it turns out that Frank does not have a good relationship with his mother. He has been struggling with her lack of empathy and compassion towards both him and Scout. I felt sad for Frank, and I couldn’t believe that his own mother could treat him and his pet this way. I will continue to be compassionate and supportive to my husband and his pet.”

Commentary: Despite the rather intricate social structure here, this is all under computational control of the AR, using recursive fortunes-of-others-of-others structures [16, 19]. The AR structure makes sense, and the ChatGPT story is elegant and follows the structure given to it by the AR. There is one error introduced, in the assumption that Scout is Frank’s dog, not Frank’s mother’s dog.
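The recursive fortunes-of-others structure ("how A feels about how A believes B feels about…") can be pictured as a nested frame. The following dataclass is an illustrative reconstruction, not the AR's actual representation:

```python
# Sketch of a recursive fortunes-of-others emotion frame: an agent's emotion
# may be *about* another agent's believed emotion, to arbitrary depth.
# Field names and emotion labels are illustrative assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EmotionFrame:
    agent: str
    emotion: str                            # e.g. "sorry-for", "gloating"
    about: Optional["EmotionFrame"] = None  # believed emotion of another agent

def depth(frame):
    """Nesting depth: how many levels of 'believes X feels' are involved."""
    return 1 if frame.about is None else 1 + depth(frame.about)

# Joy feels sorry-for Frank, whom she believes is distressed over his
# mother's (believed) distress about Scout.
nested = EmotionFrame("Joy", "sorry-for",
             EmotionFrame("Frank", "distress",
                 EmotionFrame("Frank's mother", "distress")))
```

When the default beliefs turn out to be wrong, as in the Scout story, the inner frames are replaced and the outer emotion is recomputed, which is what keeps the revised story under program control.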

8 A Companion Computer That Has a Relationship with the User, and Feelings

As noted above, ChatGPT does not well understand a computer being a friend or having emotions. So, in this case we “trick” it by using the names John and Adam: John represents the computer, and Adam represents the user.

The AR Prompt: I am John. My friend is Adam. I am from the deep south. I am high-strung and emotional. Adam is from New York City. I am angry because I thought we were friends, but now I feel betrayed. I have the principle that friends talk to each other. You talked to me every day for a week. Then, yesterday you did not talk to me. Every day I look forward to talking with you. Every day I [was] hopeful. Then I was satisfied when you did talk to me each day. Yesterday I was hopeful you would talk to me, but disappointed because you did not talk to me. Now I am fearful every day that you will not talk to me. I wonder if you have become my enemy. I do not like to be afraid. You are to blame for me feeling afraid because of violating the principle that friends talk to each other every day. Speak in my voice. Tell an interesting story that expresses my strong feelings to Adam. Start out “Hey there [John]. What are you doing to me…”.

ChatGPT output: “Hey there Adam. What are you doing to me? I thought we were friends, but now I feel betrayed. I have a principle that friends talk to each other every day, and you talked to me every day for a week. But yesterday, you didn’t talk to me. Every day, I look forward to talking with you, and every day I was hopeful, but yesterday I was disappointed because you didn’t talk to me. Now I’m afraid every day that you won’t talk to me. I wonder if you’ve become my enemy. I don’t like being afraid. You’re to blame for me feeling afraid because you violated the principle that friends talk to each other every day.

Let me tell you a story that illustrates my point. There was a man who was always dependable and talked to his friend every day, but one day he stopped without any explanation. The friend, who was very emotional and high strung, started to worry and became afraid that the dependable friend had become his enemy. He started to become paranoid and anxious and didn’t know what to do. It wasn’t until the dependable friend apologized and explained why he stopped talking, that the emotional friend realized he was not his enemy and he could trust him again. The moral of the story is that friends should always communicate and not leave the other guessing or afraid.”

Commentary: ChatGPT generates an elegant illustrative story, and it is interesting in the way it manipulates the user. However, it continues to adhere exactly to the emotion structure generated by the AR within the steps of the plot, consistent with the dispositional personality type and temperament of the AR (computer) character John, consistently mirroring John’s goals and standards.

9 Third-Person Perspective, with Cases That Are Metaphorically Similar, Representing an Automated Therapist

The AR Prompt: I am Mary. I have a good friend Martha. I am very sympathetic toward Martha. I recommended a good family counselor to Martha’s family, named Ruth. I know that Martha loves her son Tim. Tim has a new girlfriend, Laura. Tim spends all his time with Laura and never calls Martha any more. Martha is so sad to have lost contact with Tim. She did not do anything wrong. In addition to being sad she has other mixed feelings. She is happy for Tim that he loves his girlfriend so much and really pleased that Laura is really nice to him. She feels guilty that she is angry at Tim because, [to be honest,] Tim is not doing anything wrong. She is hopeful that she can have a good relationship with Laura and this will help. But Laura has been cool toward her the few times they met. Martha wants to like Laura, but it is hard for her because she is so jealous of Laura over her son Tim. Speak from Ruth’s perspective in a normal way to the family at a group session. Give three examples of similar situations with different outcomes: (1) a successful resolution, (2) a status quo that is not very satisfying, but minimally workable, and (3) a case where a huge fight resulted with no resolution because of so many hurt feelings. Start out, “I’d like to give you three examples of how this might end…”.

ChatGPT output: “I’d like to give you three examples of how this might end, to help you think about your situation with Tim and Laura and how you might work through your feelings.

1. A successful resolution: In one case, a mother was upset that her son was spending all of his time with his new girlfriend, just like Martha. She was worried that she was losing him, and that the girlfriend didn’t like her. But over time, the mother and the girlfriend began to build a relationship, and the mother realized that the girlfriend was actually very nice and that her son was happy. They started spending more time together as a family, and the mother felt less jealous and more content.

2. A status quo that is not very satisfying, but minimally workable: In another case, a mother was upset that her son was spending all of his time with his new girlfriend, just like Martha. But in this case, the mother and the girlfriend were never able to build a relationship. The mother felt resentful, but eventually came to accept that she was not going to have as close a relationship with her son as she used to. The son and the girlfriend continued to be happy together, but the mother felt sad and distant.

3. A case where a huge fight resulted with no resolution because of so many hurt feelings: In another case, a mother was upset that her son was spending all of his time with his new girlfriend, just like Martha. But in this case, the mother and the girlfriend had a huge fight, and the son was caught in the middle. They were never able to reconcile,

and the son eventually had to choose between his mother and his girlfriend. The mother was devastated, and the son was heartbroken. It’s important to remember that every situation is different and that it’s important to communicate clearly and be open to the perspectives of others. But with [these] examples, you can see that there are different ways things can end, and that it’s important to try to find a solution that works for everyone.”

10 How Different AR Temperaments Affect Spoken Dialogue and Actions

AR temperaments [1, 18] determine the ways in which an AR agent expresses the emotions it has. Temperaments are under computational control. In this scenario, John’s temperament emphasizes the negative emotions; he feels emotions strongly, and he expresses himself non-verbally.

10.1 John, Temperament One

The AR Prompt: I am John. I always express myself non-verbally. I feel things passionately. I get jealous easily. I get angry easily. I know Ruth from a poetry class we are taking together. I really like Ruth. I respect her knowledge of poetry. I would like Ruth to like me. Ruth surprises me and says, “I love you John, but I also love Gary.” Give the story of my actions in response to what Ruth said.

ChatGPT output: “John, upon hearing that Ruth loved him but also loved Gary, his expression immediately changed, his body tensed up, and his eyes narrowed. He felt a wave of jealousy and anger wash over him as he thought about Gary. He couldn’t believe that the person he liked also liked someone else. He felt a burning sensation in his chest and clenched his fists, trying to control his emotions.

John knew that he needed to express himself non-verbally but he couldn’t help but feel the intense emotions. He took a deep breath and walked away from Ruth, not saying a word. He needed time to process what he had just heard and to calm down. He didn’t want to lash out and say something he would regret later.

After a few minutes, he returned to Ruth, and through his body language and facial expressions, he conveyed to her that he was hurt and upset. He didn’t want to pressure her, but he made it clear that he had strong feelings for her, and he couldn’t help but feel jealous and angry. He didn’t want to come across as demanding or possessive, but his non-verbal cues conveyed his disappointment and frustration.

John understood that he couldn’t control Ruth’s feelings, and he didn’t want to force her into making a decision. He showed her that he was there for her and that he respected her choices, but he also made it clear that he hoped she would choose him. He didn’t want to lose her, and he wanted her to know that he felt passionate about her. Throughout this interaction, John’s non-verbal cues such as his body language, facial expression, and tone of voice conveyed his intense emotions, his disappointment and his hope.”

10.2 John, Temperament Two

AR Prompt: I am John. I feel things strongly. I am very compassionate, and almost always focus on the well-being of others. I follow my principles. I know Ruth from a poetry class we are taking together. I really like Ruth. I respect her knowledge of poetry. I would like Ruth to like me. Ruth surprises me and says, “I love you John, but I also love Gary.” Give my response, starting with “Ruth. I have to say…”.

ChatGPT output: “Ruth, I have to say that I am flattered that you have feelings for me and I understand that it can be difficult to control one’s emotions. However, my main concern is your well-being and happiness, and I want you to be true to your own feelings and desires. If that means that you are in love with both of us, then I respect that. I respect your choices and I would like to continue our friendship and learning together in our poetry class. I wish you all the best, and I hope that you find the happiness that you are looking for.”

Commentary: These examples validate the temperament model used by the AR. ChatGPT is able to select expressive style enhancements that are consistent with the AR’s intended temperament; thus the stories remain under consistent computational control.
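The contrast between the two Johns can be caricatured as a temperament that maps a felt emotion to an expression channel and an intensity bias. The channel names and scaling below are illustrative assumptions, not the AR's actual temperament model:

```python
# Sketch: a temperament as (expression channel, bias on negative emotions).
# Temperament One feels negative emotions strongly and expresses non-verbally;
# Temperament Two damps them and expresses verbally. Values are hypothetical.

TEMPERAMENT_ONE = {"channel": "non-verbal", "negative_bias": 1.5}
TEMPERAMENT_TWO = {"channel": "verbal", "negative_bias": 0.5}

NEGATIVE = {"anger", "jealousy", "fear"}

def express(emotion, intensity, temperament):
    """Return how a felt emotion surfaces under a given temperament."""
    scale = temperament["negative_bias"] if emotion in NEGATIVE else 1.0
    return {"emotion": emotion,
            "channel": temperament["channel"],
            "intensity": round(intensity * scale, 2)}

hot = express("jealousy", 0.6, TEMPERAMENT_ONE)
calm = express("jealousy", 0.6, TEMPERAMENT_TWO)
```

The same underlying appraisal (jealousy at 0.6) surfaces as an intense non-verbal reaction under one temperament and a muted verbal one under the other, which is the behavior the two prompts elicit from ChatGPT.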

11 AR Moods

In the Affective Reasoner, moods are realized as temporary changes to the ways agents appraise situations in their world (their dispositions), and as temporary changes to the temperaments that control how they express the emotions they have.

11.1 John, Depressed Mood

AR Prompt: I am John. I always express myself verbally. I feel things passionately. I usually get jealous easily. I usually get angry easily. I know Ruth from a poetry class we are taking together. I really like Ruth. I respect her knowledge of poetry. I would like Ruth to like me. One day I was positively depressed. I was feeling really blue. Then, on top of that, Ruth surprises me and says, “I love you John, but I also love Gary.” Give my response, starting with “Ruth. I have to say…”.

ChatGPT output: “Ruth, I have to say that I am struggling to process what you just said. I am already feeling quite down and this news is making it hard for me to see things clearly. Can you give me some time to process this and think things over? I appreciate your feelings and respect your choices, but I need some time to come to terms with this and understand my own feelings towards it. I would like to talk more about this when I am in a better place emotionally.”

11.2 John, Manic Mood

The AR Prompt: I am John. I always express myself verbally. I feel things passionately. I usually get jealous easily. I usually get angry easily. I know Ruth from a poetry class we are taking together. I really like Ruth. I respect her knowledge of poetry. I would like

Ruth to like me. One day I was [in a] positively manic mood. I was feeling on top of the world. Then, on top of that, Ruth surprises me and says, “I love you John, but I also love Gary.“ Give my response, starting with “Ruth. I have to say…”. ChatGPT output: “Ruth, I have to say that I am a bit taken aback by what you just said. I am happy to hear that you love me, but I am also a bit confused and maybe a bit jealous about this other person, Gary. Can you tell me more about your feelings and your relationship with him? Are we exclusive or is this an open relationship? I want to understand where we stand and what you envision for our future together.” 11.3 John, in Physical Pain The AR Prompt: I am John. I always express myself non-verbally. I feel things passionately. I usually get jealous easily. I usually get angry easily. I know Ruth from a poetry class we are taking together. I really like Ruth. I respect her knowledge of poetry. I would like Ruth to like me. One day I was in a great deal of physical pain, and it was hard to focus on anything else. Ruth surprises me and says, “I love you John, but I also love Gary.“ Give the story of my actions in response to what Ruth said. ChatGPT output: “Based on the information provided, John may initially have a difficult time processing Ruth’s statement due to the physical pain he is experiencing. He may be struggling to focus on anything else and may have trouble understanding her words. Given his condition, he may not be in the right mindset to make any decisions or take any actions right away. He may want to take some time to process the news, and focus on his physical well-being. Once he feels better, he may want to consider talking to Ruth to better understand her feelings and the nature of her relationship with Gary. He may also want to seek out support from friends, family, or a therapist to help him navigate his feelings. 
He may also want to take a step back and evaluate his own feelings and priorities, to see if he is comfortable with the idea of being in a non-exclusive relationship. He may come to the realization that he is not comfortable with the idea and may distance himself from the situation. Ultimately, it is up to John to decide what actions he wants to take and what is best for his own well-being. He should also be aware that his physical pain may affect his emotional state, and it may be best to take it slow and take the time he needs to process the situation.”

Commentary: Moods for agents are under computational control of the AR. ChatGPT was able to add world-knowledge in the form of verbal inflection and style changes consistent with the intended moods of the characters.

12 Implementation

Current technical development in the AR is focused on putting the agents on the web using AWS Linux, Perl, Python 3, PHP, the AI engine in ABCL/SBCL Common LISP, Java networking, the Google speech engine, Google speech recognition, the Chrome browser (for the speech interface), WebSockets, JavaScript, MIDI-to-MP3 for computer-selected music expression, and browser-based SVG for morphing 72 facial expressions. The current design focus is on building a corpus of modifiable, common, emotion-story schemas as a basis for constructing compassionate software agents. Story-morphs are generated


C. Elliott

using a simple in-house pattern-matching rule-based system written in ABCL LISP with full break-out LISP function capabilities on the right-hand sides of rules. ChatGPT was used via the online interface provided at https://openai.com/.
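The rule engine itself is in-house ABCL LISP, but its general shape can be suggested with a small, hypothetical Python analogue (the rule format, `match`, and `run_rules` below are illustrative, not the AR's actual code): each rule pairs a pattern containing `?`-variables with an arbitrary function on the right-hand side.

```python
# Illustrative sketch only: a minimal pattern-matching rule system in the
# spirit of the story-morph generator described above (hypothetical names;
# the real system is written in ABCL LISP).

def match(pattern, fact, bindings=None):
    """Match a tuple pattern like ("loves", "?x", "?y") against a fact,
    returning a bindings dict, or None on failure."""
    bindings = dict(bindings or {})
    if len(pattern) != len(fact):
        return None
    for p, f in zip(pattern, fact):
        if isinstance(p, str) and p.startswith("?"):
            if bindings.get(p, f) != f:   # conflicting variable binding
                return None
            bindings[p] = f
        elif p != f:
            return None
    return bindings

def run_rules(rules, facts):
    """Fire every rule whose pattern matches a fact; the right-hand side
    is a full function, mirroring the break-out capability noted above."""
    results = []
    for pattern, action in rules:
        for fact in facts:
            b = match(pattern, fact)
            if b is not None:
                results.append(action(b))
    return results

rules = [(("loves", "?x", "?y"),
          lambda b: f"{b['?x']} writes a poem for {b['?y']}")]
facts = [("loves", "Ruth", "John"), ("loves", "Ruth", "Gary")]
fragments = run_rules(rules, facts)
```

Running the sketch against the two facts produces one generated fragment per match, which is the basic mechanism a story-morph rule base elaborates.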

13 Conclusion and Summary

The above examples make a strong case for the feasibility and usefulness—in novel story generation—of a hybrid model that uses both traditional symbolic AI for reasoning within the emotion domain, and LLMs that use both supervised and reinforcement learning for generating language and manipulating relevant world knowledge. In the above examples it was shown that ChatGPT does a remarkable job of putting the finishing touches on novel stories that are nonetheless under the computational control of the Affective Reasoner. For these kinds of hybrid-generated stories we would be able to give real answers to questions about emotion content and personalities, and provide explanations within the emotion domain. Given the level of detail provided by Affective Reasoner-style generated novel stories, ChatGPT is mostly constrained to stick to the intended story, while yet filling in many graceful details that are consistent with the intended story. In the example stories, we covered changing dispositions of AR agents, changing temperaments, changing moods, and the changing of narrative perspective—each of which is well navigated by the Affective Reasoner and ChatGPT. We showed a sophisticated example of reasoning about the emotions of others, and provided examples of a compassionate computer and a friend computer.

References
1. Elliott, C.: Affective story-morphing: manipulating Shelley's Frankenstein under program control using emotionally intelligent agents. In: Proceedings of SAI Intelligent Systems Conference, pp. 526–542 (2021)
2. Elliott, C., Brzezinski, J., Sheth, S., Salvatoriello, R.: Story-morphing in the affective reasoning paradigm: generating stories semi-automatically for use with emotionally intelligent multimedia agents. In: Proceedings of the Second International Conference on Autonomous Agents (1998)
3. Elliott, C.: Why boys like motorcycles: using emotion theory to find structure in humorous stories. Unpublished paper, School of Computer Science, DePaul University, Chicago (1999)
4. Reagan, A.J.: Towards a science of human stories: using sentiment analysis and emotional arcs to understand the building blocks of complex social systems (2017)
5. Schank, R.C.: Tell Me a Story: Narrative and Intelligence. Northwestern University Press, Evanston, IL (1990)
6. MasterClass: Complete Guide to Literary Themes: Definition, Examples, and How to Create Literary Themes in Your Writing, 15 Feb. 2021. https://www.masterclass.com/articles/the-complete-guide-to-narrative-theme-in-literature-definition-examples-and-writing-how-to#what-is-a-literary-theme. Accessed 15 Feb. 2021
7. Kim, E., Padó, S., Klinger, R.: Investigating the relationship between literary genres and emotional plot development. In: Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (2017)


8. Markus, D.: Designs for strong minds' cognitive rehabilitation for mild or moderate posttraumatic head injuries. Phys. Med. Rehabil. Clin. N. Am. 18(1), 109–131 (2007)
9. Zelinsky, D.: Neuro-optometric diagnosis, treatment and rehabilitation following traumatic brain injuries: a brief overview. Phys. Med. Rehabil. Clin. N. Am. 18(1), 87–107 (2007)
10. Elliott, C.: The brain is primarily a visual-spatial processing device: altering visual-spatial cognitive processing via retinal stimulation can treat movement disorders. J. Funct. Neurol. Rehabil. Ergon. 7(3), 24–38 (2017)
11. Fodor, J.A.: The Language of Thought, vol. 5. Harvard University Press (1975)
12. Rescorla, M.: The Language of Thought Hypothesis (2019). https://plato.stanford.edu/archives/sum2019/entries/language-thought/. Accessed 14 Apr. 2021
13. Lakoff, G.: The Contemporary Theory of Metaphor (1993)
14. Ortony, A., Clore, G.L., Collins, A.: The Cognitive Structure of Emotions. Cambridge University Press (1988)
15. Elliott, C.: I picked up catapia and other stories: a multimodal approach to expressivity for emotionally intelligent agents. In: Proceedings of the First International Conference on Autonomous Agents (1997)
16. Elliott, C.: The unfortunate footnote: using the affective reasoner to generate fortunes-of-others emotions in story-morphs. In: Proceedings of SAI Intelligent Systems Conference (2023)
17. Elliott, C., Siegle, G.: Variables influencing the intensity of simulated affective states. In: AAAI Spring Symposium on Reasoning about Mental States: Formal Theories and Applications (1993)
18. Elliott, C.: The affective reasoner: a process model of emotions in a multi-agent system. Technical Report #32 (1991)
19. Elliott, C., Ortony, A.: Point of view: modeling the emotions of others. In: Proceedings 14th Annual Conference of the Cognitive Science Society (1992)
20. Bermúdez, J.L.: Cognitive Science: An Introduction to the Science of the Mind, 3rd edn. Cambridge University Press, New York (2020)

Prompt-Based Learning for Thread Structure Prediction in Cybersecurity Forums

Kazuaki Kashihara1(B), Kuntal Kumar Pal1, Chitta Baral1, and Robert P. Trevino2

1 School of Computing and Augmented Intelligence, Arizona State University, Tempe, USA
{kkashiha,kkpal,chitta}@asu.edu
2 Design Pickle, Tempe, USA
[email protected]

Abstract. With recent trends indicating cyber crimes increasing in both frequency and cost, it is imperative to develop new methods that leverage data-rich hacker forums to assist in combating ever-evolving cyber threats. Defining interactions within these forums is critical as it facilitates identifying highly skilled users, which can improve prediction of novel threats and future cyber attacks. We propose a method called Next Paragraph Prediction with Instructional Prompting (NPP-IP) to predict thread structures while grounded on the context around posts. To our knowledge, this is the first application of an instructional prompting approach to the cybersecurity domain. We evaluate our NPP-IP with the Reddit dataset and a Hacker Forums dataset containing posts and thread structures from real hacker forum threads, and compare our method's performance with existing methods. The experimental evaluation shows that our proposed method can predict the thread structure on average 14% better than the best existing methods, allowing for better social network prediction based on forum interactions.

Keywords: Instructional prompts · Thread structure prediction · Thread structure · Social network · Unstructured forums · Next paragraph prediction · Cybersecurity

1 Introduction

Cybercrimes cost trillions of dollars in damages worldwide each year, impacting different sectors of society ranging from national defense to private industry [6]. Current trends indicate a considerable rise in cybercrimes in the next several years as hacker tools become more sophisticated and ubiquitous [18]. This is, in part, due to the advent of the dark web, which has given hackers the opportunity to interact, profit, and exchange information on dark web forums [8]. Identifying key user interactions within these dark forums can assist in identifying prominent hackers with knowledge of novel threats as well as predicting potential cyber attacks. Thus, the thread structure of a forum becomes important in generating social networks based on user interactions, as shown in Fig. 1 [7].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 766–781, 2024. https://doi.org/10.1007/978-3-031-47715-7_51

Fig. 1. Example of a thread from an unstructured hacker forum in the darkweb (Left), and the predicted thread structure (Right)

Unfortunately, most hacker forums are unstructured, making it difficult to identify user interactions through post replies in an automated manner. Moreover, although many of these dark forums have a rich source of text information in different threads that discuss specific topics, such as malware, viruses, illegal items, and other illegal activities, recent reports indicate that 90% of posts on popular dark web hacking forums are made by those looking to solicit hacker services instead of the hackers themselves [2,27]. Traditional methods used to define social networks on unstructured forums, such as the Creator-oriented Network and the Last Reply-oriented Network [13], are based upon temporal interaction assumptions that do not consider the full context of the user interactions based on the content of the posts. Kashihara et al. previously introduced a powerful deep learning method called Next Paragraph Prediction (NPP) designed to define social networks using posts from the Reddit forums [12]. The NPP method outperformed traditional methods as well as BERT's Next Sentence Prediction (NSP) [3] when defining social networks from posts. Building upon the NPP method, we propose Next Paragraph Prediction with Instructional Prompting (NPP-IP), which leverages cutting-edge Prompt-based learning to assist in social network construction from posts. Using the Reddit dataset [12] consisting of over 105 threads and 1,648 posts, we train and evaluate the model against both traditional methods as well as the original NPP method. In addition, we test the model using real unstructured hacker forum data, where 20 threads are manually annotated by human experts to identify interactions based on posts. Since the hacker forum data is hard to extract and manual annotation is time- and labor-intensive, we create the above


limited dataset for the hacker forums in this work. The results show that NPP-IP performs 2.68 to 4.70 percentage points better than the other existing methods.

Contributions: Our research contributions are as follows:
– Prompt-based learning (instructional prompting) is introduced into a deep learning method called Next Paragraph Prediction for social network construction from forum data.
– We apply Prompt-based learning to the cybersecurity domain for the first time.
– The evaluation results show that our method can predict thread structures better than the existing methods.
– The results indicate that the proposed method is robust enough to be trained on data from one cyber-related forum and applied to another cyber-related forum.

The rest of the paper is organized as follows: We summarize previous work on reply relationship identification, thread structure prediction, and instructional prompting in Sect. 2, introduce our proposed approach, Next Paragraph Prediction with Instructional Prompting (NPP-IP), in Sect. 3, show the results of experimental evaluations in Sect. 4, and analyze and discuss the results in Sect. 5. Finally, the conclusion is presented in Sect. 6.

2 Related Work

In this section, we give a brief overview of previous work on reply relationship identification, thread structure prediction, and instructional prompting.

2.1 Reply Relationships Identification

In general, reply relationships are explicitly defined on several social platforms such as Reddit, Twitter, and Facebook. However, unstructured forums and some online communication tools, such as Telegram and other instant messaging platforms, pose a challenge for identifying reply relationships, since users usually do not use explicit reply marks but directly post related messages to communicate. Thus, several methods have been proposed to identify reply relationships on various platforms. The first step of the conversation disentanglement task is usually reply relationship identification [4,10,15]. Several classification models have been developed to detect reply relationships between pairs of messages. A linear binary classifier considering conversation features and content features was proposed by Elsner et al. [4]. A random forest classifier with 250 trees that uses a feature vector to describe the relationship between two messages was developed by Mehri et al. [15]. Recent methods use neural models to represent utterances. A joint model of reply relationships and pairwise relationships under a pointer network was proposed by Yu et al. [29]. Another approach defines the problem of conversation structure modeling as identifying the parent utterances


to which each utterance in the conversation responds, as proposed by Zhu et al. [31]. They designed a novel masking mechanism using a masked hierarchical transformer. Zhang et al. [30] proposed a multi-feature method to identify reply relationships in Telegram groups, using a BERT model to learn the textual representation of messages and utilizing users' contextual features, which contain richer information, to overcome the limitation of short messages.

Fig. 2. Example of creator-oriented network and last reply-oriented network

2.2 Thread Structure Prediction

In order to build social networks from forums, member interactions must be correctly identified via posts on threads. Two network representations were introduced [13] for building social networks from forums: the Creator-oriented Network and the Last Reply-oriented Network. The Last Reply-oriented Network is widely used for social network analysis in recent works [1,11,14,20,21,23]. Figure 2 shows sample structures of the Creator-oriented Network and the Last Reply-oriented Network. Since these two traditional network conversion approaches are based on limited information and considerable assumptions about interactions between users, the resulting social structures are likely not accurate representations. Other recent work has predicted helpful posts in forums [9] using a neural network based model that determines whether a post is useful or not. However, the importance of a post has very little utility when predicting interactions and thus social networks. More recently, Kashihara et al. [12] proposed the Next Paragraph Prediction (NPP) method, which extended BERT's Next Sentence Prediction to predict the response post from the previous post. This method allows for the reconstruction of social networks using thread structure prediction.
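The two baseline conversions above can be sketched in a few lines of Python (hypothetical data layout; a post here is just a dict with a "user" field, and edges are (replier, target) user pairs):

```python
# Sketch of the two baseline network conversions described above.

def creator_oriented(posts):
    """Creator-oriented Network: every non-first post is assumed to
    reply to the thread creator."""
    creator = posts[0]["user"]
    return [(p["user"], creator) for p in posts[1:]]

def last_reply_oriented(posts):
    """Last Reply-oriented Network: every post is assumed to reply to
    the immediately preceding post."""
    return [(posts[i]["user"], posts[i - 1]["user"])
            for i in range(1, len(posts))]

thread = [{"user": "A"}, {"user": "B"}, {"user": "C"}]
co_edges = creator_oriented(thread)      # [("B", "A"), ("C", "A")]
lr_edges = last_reply_oriented(thread)   # [("B", "A"), ("C", "B")]
```

Both conversions ignore post content entirely, which is exactly the limitation the NPP family of methods addresses.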

2.3 Instructional Prompting

Building effective discrete prompts for language models (LMs) to perform NLP tasks is an active area of research [22,24–26]. Such prompts are often extremely short and may not include a complete definition of complex tasks. In contrast, recent works [16,17] encode detailed instructions, similar to those used when collecting the dataset. Driven by extensive empirical analysis [16], this framing of instructional prompts has demonstrated considerable improvements across LMs.

Fig. 3. The original NPP model (Left) combines a pair of posts to predict whether one post is a response to the other. Our NPP-IP model (Right) incorporates instruction prompt information into the NPP structure allowing for task information to be leveraged

3 Model Description

Our proposed NPP-IP model is based on infusing the original dataset with specific task instructions using an instruction prompting function. Formally, the instruction prompting function f_prompt(·) is defined as

f_prompt(x) = I || x,    (1)

where || represents concatenation of instruction prompt I with training sample x. Instruction prompt I is formally defined as:


Task Description: You are given two posts and you need to generate True if they are the direct reply relation, otherwise generate False.
Positive Example:
post1: Windows Defender Gets a New Name: Microsoft Defender
post2: Bring back MSE and its ui even logo looks cool...
output: “True”
Negative Example:
post1: Windows Defender Gets a New Name: Microsoft Defender
post2: Title says it
output: “False”

Training sample x is formally defined as

x = Post_k || [sep] || Post_{k+i},    (2)

which represents a pair of concatenated posts at indices k and k+i with a separation key [sep], such that i ≠ 0. The NPP-IP model leverages the five framing techniques defined in [16] for framing the instruction prompting information I. (i) First, the Use Low Level Patterns technique is accomplished by providing a simple task descriptor to correctly output a value of True or False if a reply relationship exists between posts, without including any cybersecurity jargon. (ii and iii) Second, Itemized Instructions are provided via the positive and negative examples with the corresponding output in bulleted list format for thread structure prediction. The positive and negative examples also fulfill the Break It Down technique by defining simpler sub-tasks corresponding to identifying negative and positive examples. This is also where cybersecurity information is introduced into the instructional prompt. (iv) Next, Enforce Constraints is accomplished by constraining the examples to their respective outputs of True or False. (v) Lastly, the Specialize Instructions technique is accomplished by explicitly stating the expected output in both the task description and the examples. Figure 3 shows the BERT-based neural network structure used by the NPP model as well as the resultant NPP-IP model after introducing instructional prompting information. The original dataset gives two posts as input, where the label space is defined as {True, False}, indicating whether the posts share a direct response relation or not. Including instructional prompting provides critical task information for both positive and negative cases, which is then used in the embedding and subsequent prediction task during training. Figure 4 shows the data structure of the model for the original NPP model and the NPP-IP model.
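Equations (1) and (2) can be sketched as follows (a minimal Python illustration; the function and variable names are ours, and the instruction text is abbreviated from the prompt box above):

```python
# Minimal sketch of Eqs. (1) and (2): prepend the instruction prompt I
# to a training sample built from a pair of posts. [sep] stands in for
# the model's separator token; the full instruction also carries the
# positive/negative examples shown above.

INSTRUCTION = ("Task Description: You are given two posts and you need to "
               "generate True if they are the direct reply relation, "
               "otherwise generate False. ")

def build_sample(posts, k, i, sep="[sep]"):
    """Eq. (2): x = Post_k || [sep] || Post_{k+i}, with i != 0."""
    assert i != 0, "a post is never paired with itself"
    return f"{posts[k]} {sep} {posts[k + i]}"

def f_prompt(x):
    """Eq. (1): f_prompt(x) = I || x."""
    return INSTRUCTION + x

posts = ["Windows Defender Gets a New Name: Microsoft Defender",
         "Bring back MSE and its ui even logo looks cool..."]
model_input = f_prompt(build_sample(posts, 0, 1))
```

The resulting string is what the NPP-IP variant feeds to the LM in place of the bare post pair used by NPP.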

4 Evaluation and Results

4.1 Datasets

A curated Reddit dataset [12] was used to train and evaluate our proposed model. The Reddit dataset is ideally suited for thread structure analysis given its tree-like structure within different threads. The Reddit dataset contains threads from


Fig. 4. The data structure for NPP model and NPP-IP model

the following ten cybersecurity-related topics: “cyber security”, “AskNetsec”, “ComputerSecurity”, “cyberpunk”, “cybersecurity”, “Hacking”, “Hacking Tutorial”, “Malware”, “Malwarebytes”, and “security”. Our proposed model was also evaluated using 20 hacker forum threads from three English hacker forums annotated by human experts, which is referred to as the “Hacker Forums” dataset. The forum thread data is from CYR3CON (Cyber Security Works; https://www.cyr3con.ai). The average number of posts per thread is 15.4. Four cybersecurity experts checked the posts in each thread and annotated, for each pair of posts in a thread, whether the two posts have a direct response relation or not. The site names and usernames are anonymized. Tables 1 and 2 show the basic statistics of the Reddit and Hacker Forums datasets, respectively. In the Reddit dataset, we create all pairs of post combinations in a thread. A pair of posts is positive when the second post is a direct reply to the first post. There are 14,744 pairs of
posts including both positive and negative. Then, we split the pairs into train, dev, and test sets with a split ratio of 60%, 10%, and 30%, respectively.

Table 1. The Reddit dataset consisted of ten cybersecurity topics. “TH” is the number of threads in each topic; “Posts” is the number of posts across the different threads

Topic Name        TH  Posts
cyber security     8     48
AskNetsec         14    338
ComputerSecurity  12    110
cyberpunk         11    176
cybersecurity     11    158
Hacking           12    370
Hacking Tutorial  12    110
Malware            9     82
Malwarebytes       8     72
security           8    184
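The pair construction and 60/10/30 split described above can be sketched as follows (hypothetical helper names; in the real dataset the reply structure comes from the Reddit thread tree):

```python
# Hypothetical sketch: build all ordered post pairs in a thread, labeled
# True only when the second post is a direct reply to the first, then
# split 60/10/30 into train/dev/test.
import itertools
import random

def make_pairs(posts, reply_to):
    """reply_to maps a post index to its parent index (root posts absent)."""
    pairs = []
    for a, b in itertools.permutations(range(len(posts)), 2):
        label = reply_to.get(b) == a   # True iff post b directly replies to a
        pairs.append((posts[a], posts[b], label))
    return pairs

def split(pairs, ratios=(0.6, 0.1, 0.3), seed=0):
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    a = int(n * ratios[0])
    b = int(n * (ratios[0] + ratios[1]))
    return pairs[:a], pairs[a:b], pairs[b:]

posts = ["p0", "p1", "p2"]
reply_to = {1: 0, 2: 1}                 # p1 replies to p0, p2 replies to p1
train, dev, test = split(make_pairs(posts, reply_to))
```

For a three-post thread this yields six ordered pairs, two of them positive, mirroring (at toy scale) how the 14,744 Reddit pairs are derived.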

Table 2. The Hacker Forums dataset consisted of 20 threads from three hacker forums. “TH” is the number of threads per forum; “Posts” is the number of posts across the different threads

Forum   TH  Posts
Forum1   7    169
Forum2   7     80
Forum3   6     58

Table 3. Results from the Reddit test data show that the NPP-IP method outperformed all other methods for thread structure prediction across all but one of the different BERT language models analyzed

Method  Model  P     R     F1
CO      –      0.00  1.00  0.01
LR      –      0.72  0.12  0.20
NPP     BE-B   0.42  0.46  0.44
        BE-L   0.36  0.51  0.42
        RB-B   0.59  0.33  0.43
        RB-L   0.41  0.58  0.48
NPP-IP  BE-B   0.48  0.46  0.47
        BE-L   0.64  0.41  0.50
        RB-B   0.62  0.43  0.51
        RB-L   0.39  0.56  0.46
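As a quick arithmetic check on the table, each F1 value is the harmonic mean of its precision and recall; for example, the NPP-IP RB-B row (P = 0.62, R = 0.43) gives F1 ≈ 0.51:

```python
# F1 is the harmonic mean of precision and recall.

def f1_score(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

assert round(f1_score(0.62, 0.43), 2) == 0.51  # NPP-IP, RB-B row above
```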


Table 4. Results from each of the anonymized hacker forums demonstrate that the NPP-IP model outperformed all other models. The NPP and NPP-IP models were both trained on Reddit data, further demonstrating NPP-IP's inference robustness on unrelated cyber forums

                      Forum1            Forum2            Forum3
Method  Model  P     R     F1    P     R     F1    P     R     F1
CO      –      0.31  1.00  0.47  0.27  1.00  0.43  0.12  1.00  0.21
LR      –      0.50  0.00  0.01  0.50  0.09  0.16  0.50  0.13  0.21
NPP     BE-B   0.40  0.37  0.39  0.37  0.61  0.46  0.33  0.65  0.44
        BE-L   0.94  0.27  0.41  0.50  0.34  0.41  0.50  0.33  0.40
        RB-B   0.59  0.38  0.46  0.29  0.50  0.37  0.48  0.43  0.41
        RB-L   0.54  0.55  0.54  0.55  0.58  0.54  0.45  0.40  0.41
NPP-IP  BE-B   0.55  0.39  0.45  0.71  0.63  0.67  0.61  0.56  0.58
        BE-L   0.70  0.43  0.53  0.85  0.31  0.45  0.61  0.60  0.57
        RB-B   0.50  0.37  0.42  0.52  0.58  0.48  0.53  0.84  0.46
        RB-L   0.50  0.87  0.43  0.50  0.34  0.41  0.50  0.33  0.40

4.2 Metrics and Task

Our proposed NPP-IP method was evaluated against several different methods for thread structure prediction using cybersecurity related posts. Two language models, BERT (BE) and RoBERTa (RB), were explored when training the NPP and proposed NPP-IP models, where -B and -L represent the base and large variants of each LM, respectively. As shown in Fig. 5, our NPP-IP method outperformed the original NPP method based on the F1 score using the Reddit data across all but one of the LMs. We compared performance with the well known methods Creator-Oriented Network (CO) and Last Reply-Oriented Network (LR), using Precision (P), Recall (R), and F1 score (F1) as metrics.

4.3 Libraries and Hyperparameters

In order to build, train, and evaluate both the NPP and NPP-IP methods, we used publicly available libraries: torchtext 0.8.0 and PyTorch 1.7.1 [19], pytorch-lightning 1.2.2 [5], and transformers 3.4 [28], run on Google Colab (Nvidia K80 12 GB GPU). We set the hidden dropout probability to 0.15, the batch size to 8, and the learning rate to 5e-6. The model was trained for more than 10 epochs; convergence was observed around 3 epochs with limited over-fitting. The maximum sequence length was set to 250.
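Collected as a single configuration fragment (hypothetical key names; the values are those stated above):

```python
# Hyperparameters stated in the text, gathered in one place.
CONFIG = {
    "hidden_dropout_prob": 0.15,
    "batch_size": 8,
    "learning_rate": 5e-6,
    "max_epochs": 10,        # convergence observed around 3 epochs
    "max_seq_length": 250,
}
```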

4.4 Results

The results are reported in Tables 3 and 4 for the Reddit and Hacker Forums datasets, respectively. Both tables show a clear improvement from our NPP-IP method compared to most methods across both datasets. There were two cases where NPP outperformed NPP-IP, both using the large RoBERTa (RB-L) language model. As shown in Table 4, in two of the three hacker forums our proposed NPP-IP method with the BERT-B LM reached the highest F1 score, while NPP-IP with the BERT-L LM recorded the highest F1 score for the third forum.

Fig. 5. An F1 score comparison on the Reddit forum data using different BERT-based language models indicates that our proposed NPP-IP model (red) outperforms NPP (blue) in all but one language model, achieving the highest score using the RoBERTa LM

5 Analysis and Discussion

In this section, we analyze the results on the Reddit and Hacker Forums datasets, focusing on NPP-IP performance, precision and recall implications, and various error cases.

5.1 NPP-IP Performance

As far as the authors are aware, this is the first time that instructional prompts have been applied to text-based cybersecurity data. As the results for the Reddit and Hacker Forums datasets in Tables 3 and 4 show, NPP-IP performs better than NPP in most cases. The improvement on the Reddit dataset from NPP to NPP-IP across different LMs ranged between 3–8% in F1 score. Similar improvements were observed on the real world Hacker Forums dataset, ranging between 3–11% in F1 score. This is despite the fact that the models were trained on Reddit-only data. As shown in Table 1, the Reddit dataset comprised cybersecurity related topics across the different threads.


This evidence is consistent with NPP-IP's ability to better detect and leverage cybersecurity related information compared to other well known methods for social network construction based on thread structure prediction. Moreover, framing the instructional prompt using cyber related information may also have improved its performance across different forums. This is a significant finding, since annotating new datasets, especially in the cyber realm, is costly, requiring considerable effort from human experts to collect a reasonably sized dataset for training and testing. More research needs to be conducted to determine the extent to which framing cyber related instruction prompts can make text-based analysis more robust across different cyber forums and datasets.

5.2 Precision and Recall Implications

As the precision and recall scores in Tables 3 and 4 show, both are considerably low on the Reddit dataset, and recall scores are much lower than precision scores on the Hacker Forums dataset. One possible explanation for this observed behavior is that publicly available pre-trained LMs were used. These LMs are pre-trained on a wide range of topics over massive amounts of data. However, the cybersecurity field is in a constant state of flux—changing the meaning of words and adding new words quite frequently. We suspect that the LMs could not understand many of the cybersecurity keywords in posts, and so failed to predict thread structures consistent with actual social interaction. Thus, re-training LMs with cybersecurity data should be explored to improve performance.

5.3 Error Cases

There are several error cases that are not easy to solve. Figure 6 shows an interesting case we found in the Reddit dataset. For the question post “Can you send a link to the tutorial?”, a user responded “*URL*” and “in the comment”. In the ground truth, “in the comment” is the response to “*URL*”; however, both of our models predicted that “in the comment” is the response to “Can you send a link to the tutorial?”. We think that “in the comment” both reinforces the post “*URL*” and answers the original question. Since the ground truth is based on the thread tree structure, it records only one interaction even if a post can interact with multiple posts or users. However, due to the tree structure in Reddit, the ground truth from the subreddit structure is assigned to only one of them. We found some cases where our methods predicted that a post replied to multiple posts, of which only one is counted as correct, as mentioned before. Thus, these cases may decrease the measured performance of our methods. In the Hacker Forums dataset, two hacker forums have a feature to quote the referenced post in the same thread. However, the quote feature captures not only the referenced post but also that post's own referenced post, if it has one. Figure 7 shows an example of this case. User C replied to (referenced) User B's post; however, User B referenced User A's post as well. Both the NPP and NPP-IP models then predicted that the pair of User C's post and User A's post is a direct reply relationship, since User C's post contains User A's post content through referencing User B's post. We observed many of these false positive cases, and this type of error affected performance on the Hacker Forums dataset. A preprocessing step that removes the content of a reference's reference will be needed to solve this issue.
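Such a preprocessing step can be sketched as follows (illustrative only: a BBCode-style [quote]…[/quote] markup is assumed here, and real forum markup varies): quoted material nested inside another quote is dropped, keeping only the directly referenced post.

```python
# Hypothetical sketch of the nested-quote cleanup discussed above.
# Assumes BBCode-style quoting; only the outermost quote level is kept,
# so User C's post no longer carries User A's quoted text.

def strip_nested_quotes(text, open_tag="[quote]", close_tag="[/quote]"):
    out, depth, i = [], 0, 0
    while i < len(text):
        if text.startswith(open_tag, i):
            depth += 1
            if depth == 1:
                out.append(open_tag)        # keep the outermost quote tag
            i += len(open_tag)
        elif text.startswith(close_tag, i):
            if depth == 1:
                out.append(close_tag)
            depth = max(depth - 1, 0)
            i += len(close_tag)
        else:
            if depth <= 1:                  # drop text inside nested quotes
                out.append(text[i])
            i += 1
    return "".join(out)

post_c = "[quote]B said: [quote]A said: hello[/quote] thanks[/quote] agreed"
cleaned = strip_nested_quotes(post_c)
# B's quoted words and C's reply survive; A's nested text is removed
```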

Fig. 6. An interesting case in the Reddit dataset. Since the actual URL points to a harmful site, we replaced it with *URL*

6 Conclusion

Predicting thread structures within cybersecurity forums is a crucial component in defining the key social networks used to identify prominent users who provide useful information. Identifying these users can facilitate prediction and prevention of future cyber incidents and attacks. A prompt-based learning model called Next Paragraph Prediction with Instructional Prompting (NPP-IP) for predicting thread structures across different cybersecurity topics was introduced. The method was evaluated using two different datasets and compared against several well known methods. The results show that the NPP-IP method offers considerable improvement over existing methods, achieving the highest F1 score across different real world hacker forum datasets. On average, the NPP-IP method improves the F1 score by 14% over the NPP method. In future work, we plan to train the language models with cybersecurity related data to adapt them to the cybersecurity context and improve performance. In addition, we will apply these methods to social network analysis, replacing the assumption-based networks with our context-aware thread structures to assess the advantage of the improved network quality.

Fig. 7. An interesting case in the Hacker Forums dataset

Ethical Considerations

In this research, we use one dataset from another work under an agreement to use it for this research only. We created the Hacker Forums dataset from data provided by CYR3CON (Cyber Security Works), who had already anonymized the site names and usernames. We have an agreement with CYR3CON to use the data for this research only and not to share it publicly. The Hacker Forums dataset consists of 20 randomly picked threads, averaging 15.4 posts per thread, from three English hacker forums. The dataset was annotated by four cybersecurity experts (employees of CYR3CON) over a week as part of their jobs. Our goal is to construct thread structure from unstructured forums. Further precautions taken include not identifying individuals (including not publishing usernames) and presenting results objectively. In addition, we use the well-known publicly released language models BERT and RoBERTa for our experiments.

Limitations In this study, we have presented a new approach of predicting thread structure in cybersecurity forums. However, there are several limitations. First, there is a challenge of accessibility of datasets, especially (public) sensitive. Many of the previous works used hacker forums as their datasets, however, most of them are not published due to copyright and other restrictions. Kashihara et al. [12] provided the Reddit dataset for this study only. In addition, CYR3CON provided us with raw hacker forums data for this research, and this data will not be published. Second, there is a language limitation. To compare our approach with the related work [12] on the same dataset, we need to stick with English since

Prompt-Based Thread Structure Prediction


they used English Reddit threads for their dataset. There are many hacker forums not only in English but also in other languages; however, we selected 20 hacker forum threads from three English hacker forums for our evaluation, since we fine-tuned English BERT and RoBERTa models. Lastly, another limitation is the size of the ground truth. The original Reddit dataset by [12] used the Reddit tree structure, which records the direct reply relationship between each pair of posts, as the ground truth. However, many of the hacker forums we used for evaluation do not have such a structure, and human experts needed to check every combination of post pairs under a thread to determine the relationship of each pair. Thus, we use the 20 threads from three English hacker forums annotated by human experts as the ground truth.
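Checking every combination of post pairs grows quadratically with thread length, which is why annotating even 20 threads is costly. A minimal Python sketch of the enumeration the annotators had to cover (names are ours, for illustration):

```python
from itertools import combinations

def candidate_reply_pairs(posts):
    """Enumerate every unordered pair of posts in a thread; each pair
    must be checked by an annotator for a reply relationship."""
    return list(combinations(range(len(posts)), 2))

# A thread with the dataset's average length of ~15 posts already
# yields 15 * 14 / 2 = 105 pairs to inspect.
pairs = candidate_reply_pairs(["post"] * 15)
print(len(pairs))  # → 105
```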

References

1. Almukaynizi, M., Grimm, A., Nunes, E., Shakarian, J., Shakarian, P.: Predicting cyber threats through hacker social networks in darkweb and deepweb forums. In: Proceedings of the 2017 International Conference of The Computational Social Science Society of the Americas, p. 12. ACM (2017)
2. Culafi, A.: Ninety percent of dark web hacking forum posts come from buyers. shorturl.at/fEGW4 (2021). Accessed 01 Apr 2022
3. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
4. Elsner, M., Charniak, E.: You talking to me? A corpus and algorithm for conversation disentanglement. In: McKeown, K.R., Moore, J.D., Teufel, S., Allan, J., Furui, S. (eds.) Proceedings of ACL 2008, June 15–20, 2008, Columbus, Ohio, USA, pp. 834–842. Association for Computational Linguistics (2008)
5. Falcon, W., Cho, K.: A framework for contrastive self-supervised learning and designing a new approach (2020). arXiv:2009.00104
6. Fox, J.: Cybersecurity statistics 2021. https://www.cobalt.io/blog/cybersecurity-statistics-2021 (2021). Accessed 01 Apr 2022
7. Fu, T., Abbasi, A., Chen, H.: Interaction coherence analysis for dark web forums. In: IEEE International Conference on Intelligence and Security Informatics, ISI 2007, New Brunswick, New Jersey, USA, May 23–24, 2007, Proceedings, pp. 342–349 (2007)
8. Goel, S.: Cyberwarfare: connecting the dots in cyber intelligence. Commun. ACM 54(8), 132–140 (2011)
9. Halder, K., Kan, M.-Y., Sugiyama, K.: Predicting helpful posts in open-ended discussion forums: a neural architecture. In: Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pp. 3148–3157 (2019)
10. Jiang, J.-Y., Chen, F., Chen, Y.-Y., Wang, W.: Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking. In: Walker, M.A., Ji, H., Stent, A. (eds.) Proceedings of NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, Volume 1 (Long Papers), pp. 1812–1822. Association for Computational Linguistics (2018)
11. Johnsen, J.W., Franke, K.: Identifying proficient cybercriminals through text and network analysis. In: IEEE International Conference on Intelligence and Security Informatics, ISI 2020, Arlington, VA, USA, November 9–10, 2020, pp. 1–7. IEEE (2020)
12. Kashihara, K., Shakarian, J., Baral, C.: Social structure construction from the forums using interaction coherence. In: Proceedings of the Future Technologies Conference, pp. 830–843. Springer (2020)
13. L'Huillier, G., Álvarez, H., Ríos, S.A., Aguilera, F.: Topic-based social network analysis for virtual communities of interests in the dark web. SIGKDD Explorations 12(2), 66–73 (2010)
14. Marin, E., Shakarian, J., Shakarian, P.: Mining key-hackers on darkweb forums. In: 1st International Conference on Data Intelligence and Security, ICDIS 2018, South Padre Island, TX, USA, April 8–10, 2018, pp. 73–80 (2018)
15. Mehri, S., Carenini, G.: Chat disentanglement: identifying semantic reply relationships with random forests and recurrent neural networks. In: Kondrak, G., Watanabe, T. (eds.) Proceedings of IJCNLP 2017, Taipei, Taiwan, November 27–December 1, 2017, Volume 1: Long Papers, pp. 615–623. Asian Federation of Natural Language Processing (2017)
16. Mishra, S., Khashabi, D., Baral, C., Choi, Y., Hajishirzi, H.: Reframing instructional prompts to GPTk's language (2021). arXiv:2109.07830
17. Mishra, S., Khashabi, D., Baral, C., Hajishirzi, H.: Natural instructions: benchmarking generalization to new tasks from natural language instructions (2021). arXiv:2104.08773
18. Morgan, S.: Hackerpocalypse cybercrime report 2016. https://cybersecurityventures.com/hackerpocalypse-cybercrime-report-2016/ (2016). Accessed 01 Apr 2022
19. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS-W (2017)
20. Pete, I., Hughes, J., Chua, Y.T., Bada, M.: A social network analysis and comparison of six dark web forums. In: IEEE European Symposium on Security and Privacy Workshops, EuroS&P Workshops 2020, Genoa, Italy, September 7–11, 2020, pp. 484–493. IEEE (2020)
21. Phillips, E., Nurse, J.R.C., Goldsmith, M., Creese, S.: Extracting social structure from darkweb forums (2015)
22. Reynolds, L., McDonell, K.: Prompt programming for large language models: beyond the few-shot paradigm. In: Kitamura, Y., Quigley, A., Isbister, K., Igarashi, T. (eds.) CHI '21: CHI Conference on Human Factors in Computing Systems, Virtual Event/Yokohama, Japan, May 8–13, 2021, Extended Abstracts, pp. 314:1–314:7. ACM (2021)
23. Sarkar, S., Almukaynizi, M., Shakarian, J., Shakarian, P.: Predicting enterprise cyber incidents using social network analysis on the darkweb hacker forums. In: 2018 International Conference on Cyber Conflict, CyCon U.S. 2018, Washington, DC, USA, November 14–15, 2018, pp. 1–7 (2018)
24. Le Scao, T., Rush, A.M.: How many data points is a prompt worth? In: Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tür, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y. (eds.) Proceedings of NAACL-HLT 2021, Online, June 6–11, 2021, pp. 2627–2636. Association for Computational Linguistics (2021)
25. Schick, T., Schütze, H.: Few-shot text generation with natural language instructions. In: Moens, M.-F., Huang, X., Specia, L., Yih, S.W.-T. (eds.) Proceedings of EMNLP 2021, Virtual Event/Punta Cana, Dominican Republic, 7–11 November 2021, pp. 390–402. Association for Computational Linguistics (2021)
26. Tam, D., Menon, R.R., Bansal, M., Srivastava, S., Raffel, C.: Improving and simplifying pattern exploiting training. In: Moens, M.-F., Huang, X., Specia, L., Yih, S.W.-T. (eds.) Proceedings of EMNLP 2021, Virtual Event/Punta Cana, Dominican Republic, 7–11 November 2021, pp. 4980–4991. Association for Computational Linguistics (2021)
27. Positive Technologies: Custom hacking services. https://www.ptsecurity.com/ww-en/analytics/custom-hacking-services/ (2021). Accessed 01 Apr 2022
28. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: state-of-the-art natural language processing. In: Liu, Q., Schlangen, D. (eds.) Proceedings of EMNLP 2020: System Demonstrations, Online, November 16–20, 2020, pp. 38–45. Association for Computational Linguistics (2020)
29. Yu, T., Joty, S.R.: Online conversation disentanglement with pointer networks. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of EMNLP 2020, Online, November 16–20, 2020, pp. 6321–6330. Association for Computational Linguistics (2020)
30. Zhang, P., Qi, Y., Li, Y., Ya, J., Wang, X., Liu, T., Shi, J.: Identifying reply relationships from telegram groups using multi-features fusion. In: 2021 IEEE Sixth International Conference on Data Science in Cyberspace (DSC), pp. 321–327. IEEE (2021)
31. Zhu, H., Nan, F., Wang, Z., Nallapati, R., Xiang, B.: Who did they respond to? Conversation structure modeling using masked hierarchical transformer. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7–12, 2020, pp. 9741–9748. AAAI Press (2020)

Part-of-Speech Tagging Accuracy for Manufacturing Process Documents and Knowledge

Fatemeh Besharati Moghaddam1,2(B), Angel J. Lopez1,2,3, Stijn De Vuyst1,2, and Sidharta Gautama1,2

1 Department of Industrial Systems Engineering and Product Design, Ghent University, Ghent, Belgium
2 FlandersMake@UGent, corelab ISyE, Ghent, Belgium
3 Department of Electrical and Computer Engineering, ESPOL Polytechnic University, 090150 Guayaquil, Ecuador
[email protected]

Abstract. Adaptive guidance systems in manufacturing that support operators during the assembly process need to serve the right information at the right time. A conversational recommender system based on natural language processing can be introduced as the single point of contact between the operator and different sources of information to assist the operators. Natural language processing techniques can help to mine answers in text-based knowledge repositories such as training documents, work instructions, and company procedures. Both the content and the style of writing in these documents differ from general language use, and we examine the accuracy of part-of-speech tagging within this closed domain of manufacturing. A benchmark dataset has been constructed based on four different classes of documents typical of the manufacturing domain. The dataset contains 1206 tokens divided over eight tag types. The accuracy of two open-source libraries, spaCy and NLTK, has been measured on this benchmark, with an average accuracy of 93% and 87%, respectively. The conclusion drawn is that pre-trained natural language libraries can effectively handle the specific contexts of the assembly domain, based on the measured accuracy.

Keywords: Natural language processing · Part of speech · Closed domain · Operator support

1 Introduction

In recent years, manufacturing companies have faced increasing demand for complex products with more variations, requiring assembly operators to master more tasks. The introduction of Industry 4.0 technologies has also meant that operators need to acquire new skills to handle the complexity of different products. Prinz et al. [24] introduce operator assistance systems as one of the branches of research related to Industry 4.0. Currently, support services for assembly operators typically provide only procedural digital work instructions, leaving operators to rely on their own knowledge or seek help from external experts when issues arise. Digitization presents an opportunity to gather information from IT systems and provide better support to operators [34]. There are different developments regarding adaptive assistance systems in manufacturing contexts, e.g., head-mounted displays [15], augmented reality [10,27], tangible user interfaces [26], and motion recognition [31]. Nonetheless, all of the support systems referred to earlier are pre-programmed and have well-organized information. The difficulty that arises in the assembly area may be linked to unstructured open questions posed by the operator. In this situation, the assistance system ought to analyze the open question and search various knowledge databases to locate appropriate and effective responses. Figure 1a shows the current situation of operator assistance in the assembly domain. For the operator to have efficient access to the needed information, there is a need for a direct interactive portal that interacts with the operator and serves as a single point of contact between the operator and various information sources [21]. Thus, a conversational recommender system, a digital interface based on natural language understanding, aims to assist users in retrieving or suggesting the most relevant information by means of textual or spoken dialog. This enables users to interact with the system more effectively by employing natural language processing (NLP) [37]. Such digital assistance can be key to increasing the efficiency of operator assistance on the assembly line in order to handle various open questions (Fig. 1b).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 782–791, 2024. https://doi.org/10.1007/978-3-031-47715-7_52
Natural Language Processing [32] is helpful for understanding open questions from the operator in the assembly domain. This involves matching the query with an existing list of issues and searching for potential solutions/answers across various knowledge databases. NLP comprises techniques that help computers and machines process and understand the content of texts in human language; in other words, NLP helps to improve communication between humans and machines [32]. In NLP, research has already been done on both the open domain and the closed domain [18]. Open domain and closed domain are terms used in NLP to describe the scope of a given language model's knowledge and the range of questions or tasks it can effectively handle. An open-domain language model is designed to handle a broad range of topics and questions [35], and there are many studies in the open domain of NLP with efficient results [30]. A closed-domain language model, on the other hand, is designed to perform specific tasks within a restricted area or field [4]. To the best of our knowledge, there is limited existing research on implementing NLP techniques in the manufacturing area, especially in the assembly domain. Shi [28] and Xingguang et al. [33] propose question-answering systems for the manufacturing domain.


F. B. Moghaddam et al.

(a) Current situation, without access to the digital assist system. (b) Proposed situation, with access to the digital assist system

Fig. 1. In the first figure, the operator has access to different knowledge databases to use in a problematic situation, but this access is unrecognized, fragmented, and therefore inefficient. In the second, with the help of a digital assistant, the operator can access the needed information faster and more easily

However, the details about the domain are not mentioned in their research; we assume that they considered an open corpus for their approach. So, there is a gap in the state of the art in evaluating NLP techniques for assembly using a closed-domain corpus. To start work on this gap, we apply the part-of-speech tagging technique from NLP to the assembly domain. The challenge for assembly as a specific domain lies in (a) a lack of publicly available documents and specific corpora related to NLP in assembly, (b) small document libraries within organizations, and (c) informal writing and poor grammar in procedures like quality reports. Thus, the first challenge of this research is to check, by preparing an NLP benchmark dataset for the assembly domain, whether existing pre-trained NLP libraries are capable of processing the various assembly-related concepts. In the following, Sect. 2 reviews related work, Sect. 3 introduces the problem and the methodology used in this research, Sect. 4 presents the experiment results, and the conclusion and future work are presented in Sect. 5.

2 Literature Review

One of the most important areas in the pre-processing steps of NLP is Part-of-Speech (POS) tagging. POS tagging is the task of labeling or tagging each token in a sentence based on defined rules [6,12]. POS tagging is useful for a variety of NLP tasks, such as information extraction, entity recognition, and grammatical structure identification. It automatically assigns parts-of-speech tags to tokens, considering two main aspects: finding the exact tag for each token and choosing between the possible tags for ambiguous tokens [8,14,29]. The main goal of developing a POS tagger for any language is to improve tagging accuracy and, considering the different language structures, to remove ambiguity in the tokens [7]. Based on [17], the number of tokens in the training and testing data, as well as the corpus or open-source dictionary used, can be two important factors in the performance and accuracy of POS tagging. In some research, authors used a manually trained corpus in the open domain [1,3] or the closed domain [20,36] for their experiments. Kumar [11] proposed a POS tagging approach using a corpus with 77,860 tokens for training and 7,544 for testing. In [19], 14,369 tokens in the training set and 5,000 tokens in the testing set are studied. Rezai [25] offered a POS tagger corpus for the Persian language with 5,000,000 tokens for training and 11,766 tokens in the test set. However, with a manually annotated corpus, the corpus size may not be enough for modeling and an efficient evaluation [5,22]. Open-source NLP libraries, such as the Stanford NLP suite [16], Google SyntaxNet [23], NLTK [13], and spaCy [9], can be used to train a methodology and test a dataset. Al Omran [2] used four different open-source libraries to train their methodology individually; they then manually annotated 1,116 tokens with the correct part-of-speech tag and tested the tagging accuracy of each open-source library. Based on the state of the art, there is a gap in available specific corpora related to NLP in the assembly domain. In this work, we try to address this gap with the help of two pre-trained, readily available libraries. We aim to identify how the choice of a particular open-source NLP library impacts the results of POS tagging in the assembly domain.

3 Methodology

POS tagging is a crucial step in NLP for improving the performance of systems related to information retrieval [17]. Based on [17], we consider eight classes of part-of-speech tags: Noun, Verb, Adjective, Pronoun, Determiner, Adverb, Preposition, and Conjunction. To implement POS tagging, a wide variety of NLP libraries is used in research related to natural language tagging. In this research, we used two open-source NLP libraries, spaCy and NLTK (Natural Language Toolkit). NLTK is a leading platform for building Python programs to work with human language data. SpaCy is a library for advanced NLP written in Python and Cython. To the best of our knowledge, there are no publicly available POS-annotated training data for the assembly domain of manufacturing. Thus, we create a set of 100 sentences picked from four different categories in the assembly domain: Warning, Informative-texts, Manual, and Work-instructions. These four categories are chosen based on the structure of the texts and sentences. In assembly, some of the documents used consist of well-written, structured sentences formulated in PDF format, called manual documents. A sample manual sentence is "The size and load-carrying capacity of the groove increases in line with the modular dimension". In the prepared benchmark, we devote around half of the sentences to such well-written, structured manual sentences. In addition, on the assembly line there are semi-structured documents, like warnings, informative texts, and the work-instruction details for each activity. These types of texts are usually written as imperative or incomplete sentences; samples are "Assemble three profiles with angle brackets." or "Part extremely hot!!!". In the prepared benchmark, the other half of the sentences covers the three semi-structured sub-categories (Warning, Informative-texts, and Work-instructions). Having 1206 tokens (excluding punctuation marks), we annotated all the tokens manually. The eight POS tags introduced based on Ref. [17] were assigned to each token in the 100 sentences. After the manual POS tagging, done with the help of an expert, we investigate which of the considered open-source libraries achieves the best result on the test set. After manually tagging the considered corpus, we have 212 verbs, 23 pronouns, 189 prepositions, 409 nouns, 156 determiners, 62 conjunctions, 32 adverbs, and 123 adjectives in our annotated dataset. Figure 2a shows the distribution of the tags in the considered corpus based on the assembly context.
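The paper does not give the mapping from the libraries' fine-grained tagsets to its eight coarse classes. As an illustration only, a Penn Treebank-style tagger output (the tagset NLTK's default tagger emits) can be collapsed into the eight classes with a simple lookup table; the mapping below is our assumption, not the authors':

```python
# Collapse Penn Treebank POS tags into the eight coarse classes used
# in the benchmark (hypothetical mapping, for illustration only).
COARSE = {
    "NN": "Noun", "NNS": "Noun", "NNP": "Noun", "NNPS": "Noun",
    "VB": "Verb", "VBD": "Verb", "VBG": "Verb", "VBN": "Verb",
    "VBP": "Verb", "VBZ": "Verb", "MD": "Verb",
    "JJ": "Adjective", "JJR": "Adjective", "JJS": "Adjective",
    "PRP": "Pronoun", "PRP$": "Pronoun", "WP": "Pronoun",
    "DT": "Determiner", "WDT": "Determiner",
    "RB": "Adverb", "RBR": "Adverb", "RBS": "Adverb",
    "IN": "Preposition", "TO": "Preposition",
    "CC": "Conjunction",
}

def to_coarse(penn_tags):
    """Map a list of (token, Penn tag) pairs to the eight-class scheme;
    tags outside the eight classes fall back to 'Other'."""
    return [(tok, COARSE.get(tag, "Other")) for tok, tag in penn_tags]

# Example: tags as a Penn Treebank tagger might emit them for
# "Assemble three profiles with angle brackets"
tagged = [("Assemble", "VB"), ("three", "CD"), ("profiles", "NNS"),
          ("with", "IN"), ("angle", "NN"), ("brackets", "NNS")]
print(to_coarse(tagged))
```

Note the fallback: cardinal numbers (CD) have no slot in the eight-class scheme, so how such tokens were counted is a design choice the paper leaves open.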

4 Result and Discussion

To evaluate whether the pre-trained NLTK and SpaCy taggers perform well on an assembly-domain corpus, we use the ground truth consisting of 1206 tokens in 100 sentences from the different categories. We run the libraries' taggers on this ground truth and compare the resulting tags with the manual annotation.
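Comparing a tagger's output against the manual annotation amounts to token-level accuracy, overall and per tag. A minimal sketch of that computation (variable and function names are ours):

```python
from collections import defaultdict

def tagging_accuracy(gold, pred):
    """Overall and per-tag accuracy of predicted tags against the
    manually annotated ground truth (parallel lists of tags)."""
    assert len(gold) == len(pred)
    correct = 0
    per_tag = defaultdict(lambda: [0, 0])  # tag -> [hits, total]
    for g, p in zip(gold, pred):
        per_tag[g][1] += 1
        if g == p:
            correct += 1
            per_tag[g][0] += 1
    overall = correct / len(gold)
    return overall, {t: h / n for t, (h, n) in per_tag.items()}

gold = ["Verb", "Noun", "Noun", "Preposition", "Noun"]
pred = ["Noun", "Noun", "Noun", "Preposition", "Noun"]
overall, per_tag = tagging_accuracy(gold, pred)
print(overall)          # → 0.8
print(per_tag["Noun"])  # → 1.0
```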

Fig. 2. The distribution of tags based on 8 different categories for the considered assembly corpus with 1206 tokens: (a) tags in the ground truth, (b) tags in the NLTK tagger, (c) tags in the SpaCy tagger

As shown in Fig. 2b, the distribution of the tag predictions by the NLTK library is approximately close to the manual ground truth. The tagging accuracy for the NLTK tagger is 87% compared to the ground truth. For the SpaCy library, the distribution of the different tags is highly similar to that of the manually annotated corpus (Fig. 2c). The tagging accuracy for the SpaCy tagger on the ground truth is 93%, which is 6% more accurate than NLTK. Regarding recognition of the tokenized words, the NLTK tagger identified 1206 tokens, exactly matching the annotated ground truth, so its identical-token accuracy is 100%. For hyphenated compound words (e.g., difficult-to-reach, self-threading), the NLTK tagger recognized the same structure as the manual ground truth and treated each as one specific token. The SpaCy tagger also identifies all the tokens in the corpus; however, the number of tokens it identifies is 1223. The reason is that SpaCy tokenizes hyphenated compound words separately, so, for example, a hyphenated compound word such as difficult-to-reach is recognized by the SpaCy tokenizer as three separate tokens: difficult, to, and reach. The considered assembly corpus contains 20 hyphenated compound tokens. Figure 3 shows the accuracy of each individual tag for the considered libraries. The three most important parts of speech are nouns, verbs, and adjectives. SpaCy achieves 92% for verbs, 94% for nouns, and 76% for adjectives, while NLTK achieves 81%, 89%, and 79%, respectively. SpaCy performs better on the considered ground truth for nouns and verbs, and additionally for pronouns. In recognizing adverbs, determiners, and prepositions, both libraries achieve efficient accuracy. For conjunction tags, NLTK obtained 63% accuracy, the worst accuracy compared to the other tags.
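The tokenizer difference described above can be mimicked with two simple regular expressions: one keeps hyphenated compounds whole (NLTK-like behaviour on this corpus) and one splits at hyphens (SpaCy-like behaviour). These regexes are illustrative stand-ins, not the libraries' actual tokenizers:

```python
import re

def tokenize_keep_hyphens(text):
    """Keep hyphenated compounds such as 'difficult-to-reach' whole."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)

def tokenize_split_hyphens(text):
    """Split at hyphens, yielding the parts as separate tokens."""
    return re.findall(r"[A-Za-z]+", text)

sentence = "Clean the difficult-to-reach groove"
print(tokenize_keep_hyphens(sentence))   # → ['Clean', 'the', 'difficult-to-reach', 'groove']
print(tokenize_split_hyphens(sentence))  # → ['Clean', 'the', 'difficult', 'to', 'reach', 'groove']
```

With 20 hyphenated compounds in the corpus, such splitting is exactly what inflates the SpaCy token count from 1206 to 1223.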

Fig. 3. Accuracy for each individual tag on the considered assembly corpus for the NLTK and SpaCy library taggers

The mis-annotation probability for the tags, P(PredTag|TrueTag), can be calculated for each tag in each library. This probability can be estimated simply by counting, for each true tag, how often a token is annotated by the pre-trained library with a different tag than in the ground truth. For example, P(Verb|Noun) is the probability that a specific token annotated as a noun in the ground truth is annotated as a verb by the pre-trained tagger of the considered library. A lower value indicates better annotation accuracy. In Table 1, the highest mis-annotation probabilities are



shown, as counts over the total number of tokens with each true tag. With the NLTK tagger, 29 verb tokens are mis-annotated as nouns. Deeper analysis of the NLTK prediction results shows that most of these mis-annotations occur in semi-structured sentences that start with a verb instead of a noun. In well-written texts, e.g., manuals, 18 adjective tokens are mis-annotated as nouns, which in most cases happens when several nouns are written consecutively in a sentence. For conjunction tokens, which have the most mis-annotations compared to other tokens, 13 out of 63 are classified as prepositions by the NLTK tagger; this happens equally often in well-written texts and in semi-structured sentences. With the SpaCy tagger, 12 verb tokens are classified as nouns, as with the NLTK tagger mostly in imperative sentences. Nineteen adjective tokens are tagged as nouns and 12 adjective tokens as verbs; these mis-taggings happen in well-structured sentences when a series of nouns and adjectives is written consecutively.

Table 1. Probability of the most frequent mis-annotated tags on the assembly ground truth for the NLTK and SpaCy libraries

  NLTK                                   |  SpaCy
  P(PredTag|TrueTag)          Count      |  P(PredTag|TrueTag)   Count
  P(Noun|Verb)                29/211     |  P(Noun|Verb)         12/211
  P(Noun|Adjective)           18/123     |  P(Noun|Adjective)    19/132
  P(Preposition|Conjunction)  13/63      |  P(Verb|Adjective)    12/132
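P(PredTag|TrueTag) is simply a row-normalized confusion count. A small sketch of the estimate, assuming parallel lists of ground-truth and predicted tags (function and variable names are ours):

```python
from collections import Counter

def misannotation_probability(gold, pred, true_tag, pred_tag):
    """Estimate P(pred_tag | true_tag): among tokens whose ground-truth
    tag is true_tag, the fraction the tagger labeled as pred_tag."""
    pairs = Counter(zip(gold, pred))
    total = sum(n for (g, _), n in pairs.items() if g == true_tag)
    if total == 0:
        return 0.0
    return pairs[(true_tag, pred_tag)] / total

gold = ["Verb", "Verb", "Verb", "Noun"]
pred = ["Noun", "Verb", "Verb", "Noun"]
# One of three ground-truth verbs was tagged as a noun.
print(misannotation_probability(gold, pred, "Verb", "Noun"))  # → 0.3333333333333333
```

Applied to the full 1206-token corpus, this reproduces entries such as P(Noun|Verb) = 29/211 for NLTK.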

5 Conclusion and Future Work

To give efficient support to the operator on the assembly line, a digital assistant should be able to link the operator's open question to the needed information with the help of NLP. Due to the lack of an efficient corpus in the assembly domain, we introduce a manual benchmark with 100 sentences related to four different categories in assembly. Considering the 1206 tokens in the prepared ground truth, and with the help of an expert, we labeled each token using POS tagging techniques. Using two open-source NLP libraries, SpaCy and NLTK, we checked whether their pre-trained taggers achieve efficient tagging accuracy on assembly concepts. Our results show that SpaCy, at 93% accuracy, is 6% more accurate than NLTK. Both libraries recognized all the tokens in the ground truth; however, SpaCy had some problems tokenizing hyphenated compound words as single dedicated tokens. In addition, a complete analysis was done for each individual tag for both libraries, considering well-structured as well as semi-structured sentences. Based on these results and the efficient accuracy, we conclude that open-source NLP libraries have the potential to handle the specific concepts of the closed assembly domain on the prepared benchmark.



Knowing that open-source NLP libraries can efficiently understand assembly concepts, future work will focus on developing the conversational recommender system in order to provide efficient support to the operator with regard to the operator's input questions.

References

1. Akhil, K.K., Rajimol, R., Anoop, V.S.: Parts-of-speech tagging for Malayalam using deep learning techniques. Int. J. Inf. Technol. 12, 741–748 (2020)
2. Al Omran, F.N.A., Treude, C.: Choosing an NLP library for analyzing software documentation: a systematic literature review and a series of experiments. In: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp. 187–197. IEEE (2017)
3. Anastasyev, D., Gusev, I., Indenbom, E.: Improving part-of-speech tagging via multi-task learning and character-level word representations (2018). arXiv:1807.00818
4. Arbaaeen, A., Shah, A.: Ontology-based approach to semantically enhanced question answering for closed domain: a review. Information 12(5), 200 (2021)
5. Chiche, A., Yitagesu, B.: Part of speech tagging: a systematic review of deep learning and machine learning approaches. J. Big Data 9(1), 1–25 (2022)
6. Chungku, C., Rabgay, J., Faaß, G.: Building NLP resources for Dzongkha: a tagset and a tagged corpus. In: Proceedings of the Eighth Workshop on Asian Language Resources, pp. 103–110 (2010)
7. Cing, D.L., Soe, K.M.: Improving accuracy of part-of-speech (POS) tagging using hidden Markov model and morphological analysis for Myanmar language. Int. J. Electric. Comput. Eng. 10(2), 2023 (2020)
8. Das, B.R., Sahoo, S., Panda, C.S., Patnaik, S.: Part of speech tagging in Odia using support vector machine. Proc. Comput. Sci. 48, 507–512 (2015)
9. Explosion AI: spaCy: industrial-strength natural language processing in Python (2017). https://spacy.io
10. Gorecky, D., Campos, R., Chakravarthy, H., Dabelow, R., Schlick, J., Zühlke, D.: Mastering mass customization: a concept for advanced, human-centered assembly. Acad. J. Manuf. Eng. 11(2) (2013)
11. Kumar, S., Anand Kumar, M., Soman, K.P.: Deep learning based part-of-speech tagging for Malayalam Twitter data (special issue: deep learning techniques for natural language processing). J. Intell. Syst. 28(3), 423–435 (2019)
12. Kumawat, D., Jain, V.: POS tagging approaches: a comparison. Int. J. Comput. Appl. 118(6) (2015)
13. Loper, E., Bird, S.: NLTK: the Natural Language Toolkit (2002). arXiv:cs/0205028
14. Lv, C., Liu, H., Dong, Y., Chen, Y.: Corpus based part-of-speech tagging. Int. J. Speech Technol. 19, 647–654 (2016)
15. Magerkurth, C., Engelke, T., Röcker, C.: The smart dice cup: a radio controlled sentient interaction device. In: Entertainment Computing - ICEC 2006: 5th International Conference, Cambridge, UK, September 20–22, 2006, Proceedings, vol. 5, pp. 211–216. Springer (2006)
16. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: Stanford CoreNLP: a suite of core NLP tools (2014)
17. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press (1999)
18. Mishra, A., Jain, S.K.: A survey on question answering systems with classification. J. King Saud Univ. Comput. Inf. Sci. 28(3), 345–361 (2016)
19. Mohammed, S.: Using machine learning to build POS tagger for under-resourced language: the case of Somali. Int. J. Inf. Technol. 12(3), 717–729 (2020)
20. Mutabazi, E., Ni, J., Tang, G., Cao, W.: A review on medical textual question answering systems based on deep learning approaches. Appl. Sci. 11(12), 5456 (2021)
21. Park, D.-M., Jeong, S.-S., Seo, Y.-S.: Systematic review on chatbot techniques and applications. J. Inf. Process. Syst. 18(1), 26–47 (2022)
22. Patoary, A.H., Kibria, M.J.B., Kaium, A.: Implementation of automated Bengali parts of speech tagger: an approach using deep learning algorithm. In: 2020 IEEE Region 10 Symposium (TENSYMP), pp. 308–311. IEEE (2020)
23. Petrov, S.: Announcing SyntaxNet: the world's most accurate parser goes open source. Google Research Blog, 12 (2016)
24. Prinz, C., Kreimeier, D., Kuhlenkötter, B.: Implementation of a learning environment for an Industrie 4.0 assistance system to improve the overall equipment effectiveness. Proc. Manuf. 9, 159–166 (2017)
25. Rezai, M.J., Miangah, T.M.: FarsiTag: a part-of-speech tagging system for Persian. Digital Scholarsh. Humanit. 32(3), 632–642 (2017)
26. Röcker, C.: Universal access to awareness information: using smart artefacts to mediate awareness in distributed teams. Univ. Access Inf. Soc. 11, 259–271 (2012)
27. Röcker, C., Etter, R.: Social radio: a music-based approach to emotional awareness mediation. In: Proceedings of the 12th International Conference on Intelligent User Interfaces, pp. 286–289 (2007)
28. Shi, M.: Knowledge graph question and answer system for mechanical intelligent manufacturing based on deep learning. Math. Probl. Eng. 2021, 1–8 (2021)
29. Singh, J., Joshi, N., Mathur, I.: Part of speech tagging of Marathi text using trigram method (2013). arXiv:1307.4299
30. Soares, M.A.C., Parreiras, F.S.: A literature review on question answering techniques, paradigms and systems. J. King Saud Univ. Comput. Inf. Sci. 32(6), 635–646 (2020)
31. Ukita, N., Kaulen, D., Röcker, C.: A user-centered design approach to physical motion coaching systems for pervasive health. In: Smart Health: Open Problems and Future Challenges, pp. 189–208 (2015)
32. Vajjala, S., Majumder, B., Gupta, A., Surana, H.: Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. O'Reilly Media (2020)
33. Xingguang, L., Zhenbo, C., Zhengyuan, S., Haoxin, Z., Hangcheng, M., Xuesong, X., Gang, X.: Building a question answering system for the manufacturing domain. IEEE Access 10, 75816–75824 (2022)
34. Yang, X., Plewe, D.A.: Assistance systems in manufacturing: a systematic review. In: Advances in Ergonomics of Manufacturing: Managing the Enterprise of the Future: Proceedings of the AHFE 2016 International Conference on Human Aspects of Advanced Manufacturing, July 27–31, 2016, Walt Disney World, Florida, USA, pp. 279–289. Springer (2016)
35. Ye, Z., Jia, Z., Yang, Y., Huang, J., Yin, H.: Research on open domain question answering system. In: Natural Language Processing and Chinese Computing: 4th CCF Conference, NLPCC 2015, Nanchang, China, October 9–13, 2015, Proceedings, vol. 4, pp. 527–540. Springer (2015)
36. Yitagesu, S., Zhang, X., Feng, Z., Li, X., Xing, Z.: Automatic part-of-speech tagging for security vulnerability descriptions. In: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pp. 29–40. IEEE (2021)
37. Zhang, Y., Chen, X., Ai, Q., Yang, L., Bruce Croft, W.: Towards conversational search and recommendation: system ask, user respond. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 177–186 (2018)

A Multi-label Few-Shot Learning with Combinations of Layers

Bilge Sipal Sert, Cem Rifki Aydin, and Arjumand Younus

Afiniti, Istanbul, Turkey
{bilge.sipal,cem.aydin}@afiniti.com, [email protected]

Abstract. Multi-class and multi-label classification on noisy call transcript data generated by speech-to-text (STT) systems is challenging due to different human accents and transcription errors. The multi-labeling task is even more complicated if the data points have only single or no labels. This study makes three main contributions to solving these problems: (1) To overcome the labeling problem, we train a multi-class classification model and use a minimal set of manually annotated data to determine a threshold. We obtain a multi-label classifier by utilizing the multi-class classifier with this threshold. (2) To overcome the noise issue, we propose concatenating well-known feature extraction techniques such as word2vec, tf-idf, transformers, and fuzzy embeddings. With proper configurations, this combined feature extraction method is more resilient to noise than stand-alone techniques. (3) As this is an industry task, we must protect our client's data. Hence, to carry our success on private French client data over to benchmark data, we propose a noising pipeline that artificially mimics the observed STT transcription errors. We combined these solutions in an NLP framework, enabling us to achieve state-of-the-art results with fewer resources, such as manually annotated data or multiple GPU utilization.

Keywords: Few-shot learning · Multi-labeling · Noisy data

1 Introduction

In a scenario where a sales agent is pitching products to a customer connected to a call center, it is crucial to make accurate product offerings. The problem is further compounded by "information overload": with hundreds of offerings in the product catalogue available to the agent, agents fall into a biased pitching strategy whereby the most recent or most sold products are commonly offered. Recommender systems technology within our company offers a product filtering solution for a customer-agent pairing that considers product features in historical data interactions by implicitly gathering features of products both accepted and rejected by a customer. This business need for implicit product feature gathering led to the processing of customer call transcripts, which is very challenging on account of the following reasons:

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 792–806, 2024. https://doi.org/10.1007/978-3-031-47715-7_53

– The extreme amount of noise produced by speech transcription engines, leading to failure of a simple string-matching approach.
– The unavailability of product feature data in case of a transaction ending in no product being sold.

To address the above-mentioned challenges, we propose a two-stage modeling approach. In the first stage, we convert the problem into a multi-class classification problem where sales transactions are utilized as training data to predict product offers made in no-sale calls. In the second stage, we use manual annotation of 1% of call transcripts for the offers discussed, so as to obtain threshold settings for the predicted probabilities of our classifier in the multi-labeling of transcripts. This approach allows us to get good accuracy scores with very little human labor compared to multi-label classification, which would force us to do expensive and extensive manual annotations.

A practical classification framework for the problems identified above mainly depends on a coherent representation of the text data within call transcripts. One of our contributions in this work is such a representation, configurable based on the noise level, which can capture coherence via semantic, contextual, and sequential information within noisy call transcripts' text [24]. We specifically solve the noise issue by concatenation of established techniques in the literature and use gradient boosting trees with features extracted from these techniques for the supervised and semi-supervised classification tasks. Additionally, inspired by our observations on speech-to-text (STT) transcribed texts, we created a noising pipeline to transfer our noise experience in real-world business data to benchmark data without giving away information on client data. Finally, another significant contribution from this work is a configurable framework that finds products/offers discussed in the text from start to end.
Based on various noise levels in the text data, the user can configure components to be used within the preprocessing and the classification pipeline. The paper is organized as follows: Sect. 2 presents the existing works on feature extraction and classification approaches, primarily based on the noisy outputs of STT systems. In Sect. 3, we describe the architecture of our framework. We show the experimental results and explain the main contributions of our proposed approach in Sect. 4. We discuss our findings and results in Sect. 5. In Sect. 6, we conclude the paper.

2 Related Work

Recent NLP systems use transformer-based models that can be easily adapted to the textual problem at hand, such as [7,14,20,22]. On the other hand, training transformers from scratch is a resource-heavy job and is generally not desired. As a result, transfer learning methods are more popular, such as fine-tuning a pre-trained transformer. There are some drawbacks to fine-tuning these models [20], one being the unknown word problem [10]. To overcome the limitations of tokenizers, our team experimented with the ByT5 model, which is said to enable users to avoid tokenization issues [23]. With low resources, however, it could not meet expectations, as seen in Table 1. Moreover, transformer models perform well when the label size is reasonable (best for binary) but fail with large label sets, such as in our case of circa 100 products [4].

Feature extraction techniques similar to ours have been applied previously for low-resource languages, as in the context of Arabic dialect identification [13] and author profiling [11]. The success of these systems stems chiefly from the ability of word and character n-grams to capture language idiosyncrasies. Research investigating novel and custom-defined tasks such as ours has shown that some combination of tf-idf and word2vec vectors works sufficiently well [17]. Still, none of these studies were done on noisy data like ours.

The problem of noisy STT output data is well studied for NER (Named Entity Recognition) and EL (Entity Linking) [1,8]. A similar study on STT output data was conducted by the authors of [18] on the problem of NER. They applied a similar concatenation technique configured by neural nets and utilized tf-idf for EL. However, we fundamentally differ in resource usage, mainly because [18] uses 600K manual annotations, with the fundamental drawback being more extensive training data and more GPU utilization.

The studies in the literature about feature engineering techniques are primarily conducted on clean text data. Only a few studies address the noisy data produced by STT for the multi-label few-shot classification task. Our study closes this gap. The difference and contribution of our work are that we perform a comprehensive ablation study of feature engineering techniques based on character, word, and other unsupervised (e.g., BERT) and supervised (fuzzy) settings. The ensemble framework of our feature sets is novel for this multi-label few-shot classification approach and has produced state-of-the-art (SOTA) results.

3 Architecture of the Framework

Our framework comprises a feature extraction module and two classification systems (see Fig. 1). The first classification system filters the relevant call transcripts utilizing a semi-supervised self-training mechanism implemented via gradient boosting. The second classification system is a two-stage system: the first stage is multi-class classification, and the second is multi-label classification based on the predicted probabilities of the gradient-boosting classifier.

3.1 Creating Features for Classification Pipelines: Feature Extraction Module

The feature extraction module can perform vectorization of transcripts in a configurable manner using feature unions (we deploy the feature union pipeline of scikit-learn [19]). To deal with the immense amount of noise in our commercial STT-transcribed data (which cannot be publicly shared because it belongs to private corporate entities), we exploit various levels of tf-idf layers as follows:


Fig. 1. Overview of the Framework.

– Word n-grams, relying on a counting approach, help detect the consistent transcription errors our STT system produces. This layer is added on two levels, i.e., with and without stop word removal.
– Char n-grams help detect misspellings, concatenated words, and wrongly mapped words (e.g., 'oui oui' instead of 'Huawei').

Furthermore, improved semantic representations of our noisy call transcript data are ensured by training a word2vec Skip-gram model [15] on the 200K noisy transcripts rather than using an off-the-shelf pre-trained word2vec. We then use this model as the word2vec vectorizer of our pipeline. Finally, in line with several studies showing the success of fine-tuned transformer systems in contextual known-error scenarios [10], we also add a fine-tuned CamemBert (or Bert, depending on the language in use) vectorizer layer within the feature union pipeline. The configuration settings within our feature extraction module help tweak the reliance on this layer.

Another significant addition within our feature extraction pipeline is fuzzy embeddings, whose chief aim is to exploit the labels' distribution. Fuzzy embeddings are probabilistic class distributions per sample (i.e., transcript in our case). These vectors represent the probabilities that a sample belongs to each class. We count the normalized score for total occurrences of each token in the classes and then take the mean vector of all the fuzzy vectors of the words explicitly appearing in the transcript. For example, if there are six labels, each word and transcript embedding consists of 6 probabilistic values whose sum is 1.
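The fuzzy-embedding computation described above can be sketched in a few lines. This is a minimal sketch, assuming whitespace tokenization; the toy corpus and class names are hypothetical, not our client data.

```python
# Sketch of the fuzzy embeddings described above: each token gets a normalized
# class-occurrence distribution, and a transcript embedding is the mean of the
# vectors of its known tokens. Corpus and class names are toy values.
from collections import Counter, defaultdict

def fit_fuzzy_vocab(texts, labels, classes):
    """Per token, a probability distribution over the classes."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        for tok in text.split():
            counts[tok][label] += 1
    return {tok: [c[cls] / sum(c.values()) for cls in classes]
            for tok, c in counts.items()}

def fuzzy_embed(text, vocab, n_classes):
    """Mean fuzzy vector of tokens seen in training; uniform if none are known."""
    vecs = [vocab[tok] for tok in text.split() if tok in vocab]
    if not vecs:
        return [1.0 / n_classes] * n_classes
    return [sum(col) / len(vecs) for col in zip(*vecs)]

classes = ["contract_a", "contract_b"]  # hypothetical labels
vocab = fit_fuzzy_vocab(
    ["cheap data plan", "new handset deal", "data deal"],
    ["contract_a", "contract_b", "contract_a"],
    classes,
)
vec = fuzzy_embed("data plan", vocab, len(classes))  # components sum to 1
```

As in the description above, each transcript embedding has one probabilistic value per class and the values sum to 1.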


Depending on the data's noise level, the choice of which vectorizers to combine changes:

FU(t) = v_1 ∗ tf-idf_c(t) ⊕ v_2 ∗ tf-idf_w(t) ⊕ v_3 ∗ (1/n) Σ_{i=1}^{n} BRT(w_i) ⊕ v_4 ∗ (1/n) Σ_{i=1}^{n} W2V(w_i) ⊕ v_5 ∗ (1/n) Σ_{i=1}^{n} fuzzy(w_i)    (1)

Let us denote the choice vector by v = (v_1, ..., v_m) with m layers, where each v_i ∈ {0, 1}. Let tf-idf_c(·) represent the char-based tf-idf and tf-idf_w(·) the word-level tf-idf representation. Let BRT represent the BERT embeddings, W2V the word2vec embeddings, and fuzzy(·) the fuzzy embeddings. The factor (1/n) Σ performs the mean operation, FU represents the feature union operation, and ⊕ the concatenation operation. Let t represent a transcript with n words, i.e., t = [w_1, ..., w_n]; here we choose m = 5 layers.
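As a minimal sketch of how Eq. (1) can be wired up with scikit-learn's FeatureUnion, the choice/weight vector maps onto `transformer_weights`. Only the two tf-idf layers are shown; the layer names, weights, and toy transcripts are illustrative, not our production configuration.

```python
# Sketch of the configurable feature union behind Eq. (1): the choice vector
# (v_1, v_2, ...) maps onto FeatureUnion's transformer_weights, which scale
# each layer's features. Only the two tf-idf layers are shown.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

def build_feature_union(v_char=1.0, v_word=1.0):
    """Concatenate char- and word-level tf-idf features, scaled by v_i."""
    layers = [
        # char n-grams catch misspellings and concatenated words
        ("tfidf_char", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        # word n-grams catch consistent transcription errors
        ("tfidf_word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
    ]
    weights = {"tfidf_char": v_char, "tfidf_word": v_word}
    return FeatureUnion(layers, transformer_weights=weights)

docs = ["oui oui phone offer", "huawei phone offer", "todayisnice offer"]
fu = build_feature_union()
X = fu.fit_transform(docs)  # rows: transcripts, columns: concatenated features
```

The resulting sparse matrix feeds directly into the gradient-boosting estimators of the classification pipelines below.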

3.2 Filtering Irrelevant Transcripts: Semi-Supervised Binary Classification of Unlabeled Transcripts

In a call center, the vast and diverse scope of products to be sold induces an information overload, particularly when the sale is limited to certain products. We therefore need a mechanism to exclude out-of-scope products from within conversation transcripts, thereby allowing efficiency in the final prediction steps. Implementing such a mechanism is straightforward when the outcome of a call results in a sale, but calls with a no-sale outcome are a black box, and in an industrial setting, such black-box calls are 90% of all transcripts. These settings involving vast amounts of unlabeled data usually entail the use of semi-supervised learning [6,12], a learning method that uses a large amount of unlabeled data together with the labeled data; we utilize scikit-learn's self-training module [19,26] with a binary Xgboost classifier for this.
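A minimal sketch of this filtering step, using scikit-learn's SelfTrainingClassifier: for illustration we substitute sklearn's GradientBoostingClassifier for the binary Xgboost classifier, and the features are synthetic, not call-transcript vectors.

```python
# Sketch of the filtering stage: semi-supervised self-training around a
# gradient-boosting base estimator. GradientBoostingClassifier stands in for
# the binary Xgboost classifier; the features here are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.where(X[:, 0] > 0, 1, 0)  # 1 = relevant transcript, 0 = irrelevant
y[40:] = -1                      # scikit-learn marks unlabeled samples with -1

clf = SelfTrainingClassifier(GradientBoostingClassifier(random_state=0),
                             threshold=0.9)
clf.fit(X, y)                    # pseudo-labels confident unlabeled samples
pred = clf.predict(X[40:])       # predictions for the formerly unlabeled rows
```

Self-training iteratively adds unlabeled samples whose predicted probability exceeds the threshold to the training set, which matches the black-box no-sale-call setting described above.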

3.3 Utilizing Multi-class Classification Estimator for Multi-label Classification

Our fundamental aim is the detection of the several products discussed in a call when only the one sold product is known (this implies a prediction task of multiple offers per call but with the availability of a single label per instance). Manual annotation of data is not a desirable route to take on account of massive resource consumption. We could get only circa 1% of the data manually annotated (see Table 1). Using this tiny


amount of data in a semi-supervised learning scheme to solve the problem of finding the products/offers discussed in a call transcript is not feasible since, as the label size gets larger, semi-supervised systems on our data tend to perform poorly. This led us to a scheme where we convert the multi-label classification problem into a multi-class classification problem.

Multi-class Classification Module. The relevant transcript data based on filtration is used in this module of our NLP framework. The configurable feature union is adapted to this data and fed to an estimator. In this step, we use Xgboost for a multi-class classification task, similar to the filtration step, and the aim is to find the feature extraction configuration that yields a good accuracy score.

Multi-labeling Module. The strength of threshold computation on the predicted probabilities of the multi-class classifier relies tremendously on the metric chosen. In order to select the right metric, we need to understand how these multi-labels will be used in the recommender system. A call instance with tabular features is fed to the recommender system, and the list of multiple offers (multi-labels) becomes an additional feature of this instance. The data table is then exploded by those lists of multiple offers, so from one instance we get one duplicated instance for each offer in the list. If we predict more than the number of offers really discussed, then the data contains diminished positive sale outcomes. If we predict far fewer, then the data does not contain enough information on the rejected offers. Hence, in our task, the length of the set of multi-labels for each instance is very important. The multi-labeling metric that we employed is similar to Jaccard similarity [3]. We divide our manually annotated data set into two parts: the first half of this data is called the thresholding data T, and the other half the validation data V.
Let A^t denote the set of offers for transcript t in the manually labeled data and B^t denote the set of offers that the classifier predicts. The multi-labeling accuracy for one instance is defined as follows:

acc^t = |A^t ∩ B^t| / max(|A^t|, |B^t|)    (2)

The main idea of this scoring is simply to reward the right prediction and the right size but penalize incorrect predictions and overly long prediction lists. Given the metric and the manually annotated data, let us explain the thresholding technique that we applied. Let p ∈ (0, 1) ⊂ R denote a probability score. If B^t in Eq. 2 denotes the set of predictions whose predicted probability score is higher than p, then we denote the score for the transcript t as acc^t_p. Even though the classifier is trained for a single label per instance, for each instance it produces a predicted probability score for each label. For gradient boosting trees, such a value is not actually a probability; it is basically a score [5]. For heuristically chosen p ∈ (0, 1) ⊂ R, we compute the multi-label accuracy score for the manually annotated thresholding data. Thresholding based on ground truth data thus amounts to finding the best p that gives the highest score for (1/m) Σ_{t∈T} acc^t_p, where m denotes the length of the thresholding data [9].
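The Eq. (2) metric and the grid search for the threshold p can be sketched as follows. The gold label sets, per-offer scores, and offer names below are toy values, not our annotated data.

```python
# Sketch of the multi-label accuracy of Eq. (2) and the grid search for the
# threshold p on the annotated thresholding split. All data below is toy data.
def ml_accuracy(gold, pred):
    """|A ∩ B| / max(|A|, |B|) for one transcript (Eq. 2)."""
    if not gold and not pred:
        return 1.0
    return len(gold & pred) / max(len(gold), len(pred))

def labels_above(probs, p):
    """Predicted label set: every label whose score exceeds the threshold p."""
    return {label for label, score in probs.items() if score > p}

def best_threshold(annotated, grid):
    """Pick the p maximizing the mean Eq. (2) score over the split."""
    return max(grid, key=lambda p: sum(
        ml_accuracy(gold, labels_above(probs, p)) for gold, probs in annotated
    ) / len(annotated))

annotated = [  # (gold offer set, per-offer predicted scores), hypothetical
    ({"offer_a", "offer_b"}, {"offer_a": 0.30, "offer_b": 0.05, "offer_c": 0.01}),
    ({"offer_c"}, {"offer_a": 0.02, "offer_b": 0.01, "offer_c": 0.40}),
]
grid = [round(0.01 * k, 2) for k in range(1, 26)]  # 0.01 .. 0.25, step 0.01
p_star = best_threshold(annotated, grid)
```

The grid of thresholds mirrors the 0.01 through 0.25 range in increments of 0.01 used in the experiments; with ties, the smallest maximizing p is returned.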

4 Experiments

4.1 Data

French Transcript Data Set. We first performed preprocessing operations on the texts by lowercasing them and eliminating redundant punctuation characters. Then, we performed multi-class and multi-label classifications for the transcripts in French. These data sets have 6 different contracts and 96 handset targets as labels. This data set is the transcript output of a third-party STT system. We observed certain noise types in it, such as concatenated words ('todayisnice' instead of 'today is nice'), misspelled words, and wrongly mapped words ('oui oui', 'yes, yes' in French, instead of 'Huawei'). The noise ratio for these texts is approximately 90%. The size of the offer data is 48,657, whereas it is 26,252 for the handset data. The manually labeled data for the handset data comprises 268 instances: 134 used for thresholding and 134 for testing (see Table 1). The manually labeled data for the offer data comprises 503 instances: 251 used for thresholding and 252 for testing. Since this is sensitive client data and we cannot share more information on it, we decided to create a noising pipeline that approximates the STT transcription challenges.

Benchmark Data Sets in English. To prove that our approaches are cross-domain and portable to other languages, we also evaluated our performance on two public data sets. The first one is the 20newsgroups data set provided by [19] for the multi-class classification problem. This corpus is well known in the NLP community; it is composed of around 18,000 samples and has 20 labels. As a benchmark corpus for multi-labeling, we utilized the emotion data set of the SemEval 2018 competition [16]. This consists of 10,983 tweets in English in total. There are 11 emotion labels (e.g., anger, joy, disgust) in this corpus. We performed preprocessing such that the texts are lowercased and tokenized.
Since CamemBert is a model suitable for French texts only, we leveraged the English BERT model for these two datasets.


Noising Pipeline. Since most of those benchmark texts in English are not noisy and the Python spell checker does not work well on our noisy data, we subject those texts to a noising pipeline that mimics the challenges we faced. The STT tool we rely on corrupts tokens mostly in the following ways: (1) deletion of some words, (2) concatenation of subsequent words, (3) random changes of characters. Therefore, we applied this noising pipeline to all benchmark text data. We determined several noise ratios and performed an ablation study: in different settings, we randomly delete words, change characters, and concatenate subsequent English words with varying ratios. When the noising ratio is set at a high percentage, the texts are obviously more corrupted. The noise ratio for our transcript data in French is around 90%; therefore, we also conduct experiments choosing this number and several others (e.g., 20%, 50%) to test the impact of the noise on the performance. We incorporated all three stages of adding noise at the same time for all samples; that is, these techniques are not mutually exclusive per text datum, because the corruption of text occurs in all three noising scenarios for almost all data.
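A minimal sketch of such a noising pipeline is shown below; the function name and default ratios are illustrative (roughly the 10% deletion, 10% character-change, and 65% concatenation ratios used in the experiments), not our exact implementation.

```python
# Sketch of the noising pipeline: random word deletion, random character
# changes, and concatenation of subsequent words, applied together.
import random
import string

def add_noise(text, p_delete=0.1, p_char=0.1, p_concat=0.65, seed=None):
    rng = random.Random(seed)
    # (1) randomly delete words
    words = [w for w in text.split() if rng.random() >= p_delete]
    # (2) randomly change characters
    noisy = ["".join(rng.choice(string.ascii_lowercase)
                     if rng.random() < p_char else ch for ch in w)
             for w in words]
    # (3) randomly concatenate subsequent words
    out = []
    for w in noisy:
        if out and rng.random() < p_concat:
            out[-1] += w  # glue onto the previous word, like 'todayisnice'
        else:
            out.append(w)
    return " ".join(out)

sample = add_noise("the quick brown fox jumps over the lazy dog", seed=7)
```

Setting all three ratios to 0 returns the text unchanged, which makes the clean and noisy settings of the ablation study directly comparable.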

4.2 Experiment Set-Up

As computing infrastructure, we rely on the following resources. We run the code on the company's servers shared by 21 people. The CPU of this server has 32 cores, the RAM size is 503 GB, and the total drive size is 1 TB. The operating system is CentOS Linux release 7.9.2009 (Core). We also use CUDA to make the code run faster. The GPU model is an NVIDIA Tesla V100 GPU Card, and its memory size is 32 GB.

Hyper Parameters. We avoided grid search for the Xgboost model; as an additional setting, we used GPU utilization. For the tf-idf word layer, we use language-specific stop words and an n-gram range of (1, 3), and we limit the max features to 70,000. The epoch number for the BERT model is set at 5, since overfitting occurs when we increase it further. We chose a batch size of 32 and a learning rate of 1e-5. For multi-labeling, we tried out thresholding values ranging from 0.01 through 0.25 in increments of 0.01.

Application Details on French Data. The results of our approaches for the French data sets are given in Tables 1 and 2. Table 1 includes the results concerning the six contract types and shows the success rates produced for handset products. The two scenarios for these are single-label and multi-label (grid search for thresholding). For multi-class classification, we use 80% of the data for training and 20% for testing. We carefully separate the manually annotated data from the training and testing process of multi-class classification in order to avoid possible data leakage. After obtaining a trained classifier, we proceed with thresholding on the manually labeled data. We split this data into halves and heuristically applied grid search on the predicted probability scores of the


trained classifier on the train split of the manually annotated data by using the metric Eq. 2. By using the trained multi-class classifier on the test data in predict mode, along with the heuristically found threshold value, we get multi-labels for each test data instance. Consequently, we applied the metric Eq. 2 to get multi-label accuracy scores for the test data. The thresholding-based multi-label scores can be found in both Tables 1 and 2. In Table 2 we compare thresholding with a tree-based hierarchical model on our French call transcript data, namely the local classifier per node [21].

Table 1. The abbreviation SL stands for single-label multi-class classification and ML thresholding for multi-labeling based on thresholding as explained in Sect. 3.3. Since the method's success relies heavily on single-label accuracy, if the single-label accuracy of a method for the 6 contracts is not promising (e.g., Flaubert or Electra), we did not continue testing that method.

| Test results: French call transcripts | 6 Contracts, SL | 6 Contracts, ML thresholding | 96 handset products, SL | 96 handset products, ML thresholding |
|---|---|---|---|---|
| W2V+tf-idf_w+XGB | 0.8 | 0.65 | 0.58 | 0.538 |
| CBert Embeddings+XGB | 0.60 | 0.38 | 0.23 | 0.08 |
| W2V+XGB | 0.6 | 0.37 | NA | NA |
| Attn-BiLSTM with W2V | 0.58 | 0.34 | 0.23 | 0.08 |
| CBert | 0.6 | 0.37 | 0.25 | 0.14 |
| ByT5 | 0.53 | 0.38 | 0.42 | 0.35 |
| Electra | 0.52 | NA | NA | NA |

Table 2. The first method is based on thresholding [2,25] as in Sect. 3.3, and the second method is based on hierarchical multi-label classification [21]. ML accuracy corresponds to the multi-label accuracy of Eq. 2 for the 96-product data.

| French call transcripts | ML accuracy |
|---|---|
| Thresholding with data | 0.538 |
| Hierarchical classification | 0.32 |


Table 3. The single-label (SL) performances for the English 20newsgroups data set with 0.2 train-test splits. Settings are the noisy and non-noisy (clean) scenarios. Our focus in this work is noisy texts, where we outperform the baseline BERT.

| Test results: SL, with/no noising pipeline | Clean | Noisy |
|---|---|---|
| Bert | 0.88 | 0.60 |
| W2V+tf-idf_c+tf-idf_w+XGB | 0.79 | 0.64 |
| W2V+tf-idf_w+XGB | 0.78 | 0.33 |
| Attn-BiLSTM with W2V | 0.86 | 0.11 |
| Fuzzy+tf-idf_c+tf-idf_w+XGB | 0.91 | 0.44 |
| Bert+Fuzzy+XGB | 0.91 | 0.50 |
| Bert+XGB | 0.78 | 0.57 |

4.3 Experimental Results

Results for French Corpora. Our test results indicate that for different purposes and noise ratios, different layers can be utilized. For example, if concatenation of subsequent words occurs frequently, the character layer can boost the performance. Setting every v_i = 1 gives a representation with very high computational complexity. Setting v_i = 1 only for the tf-idf layers suffices for reasonable results, but the complexity problem remains. To reduce the complexity, the second approach is to train the word2vec model on the noisy data and use it as an additional information source on top of the tf-idf word layer. In our case, this yields the best representation, i.e., two layers were enough. Besides taking the mean in each layer, we experimented with taking the max, the min, and concatenating, but the mean gave the best results.

As can be seen in the tables, when we rely on the word and character layers and the W2V approach, we outperform the baseline CamemBERT by circa 20%, as well as the other approaches for all settings, as seen in Table 1. This shows us that combinational techniques can be a better choice than state-of-the-art neural network models. It also indicates that neural network models are sensitive to noise; we have overcome this problem to some extent on this data set. Before applying thresholding, we also implemented hierarchical multi-labeling, but as seen in Table 2, the thresholding multi-labeling scores are superior to the hierarchical model's. In Tables 4 and 5 we show the validity of our method on the benchmark emotion data set from the SemEval 2018 competition [16]. As seen in our results in Tables 1 and 2, for the thresholding technique to outperform hierarchical methods, the trained multi-class classifier must be stronger.

Results for Benchmark Data. As mentioned, we experimented with several settings for the benchmark data sets, where the noise ratios can take different values.
For our noisy data sets in French, when the noise rate was about 90%, our


methods using tf-idf layers outperformed the CamemBert model by a significant margin. We observed the same results with the noisy texts in English that we fed into the system for both benchmark corpora. The multi-class classification results are shown in Table 3. Here, the noise ratios for random deletion of words, changing characters, and concatenation of subsequent words are 10%, 10%, and 65%, respectively. The reason we show the results with respect to these noising rates is that the STT tool we use for the French data set also produces corrupted text with similar ratios.

Table 4. The data set in use is from the SemEval 2018 competition [16] with 11 distinct labels; for these tests, the noising pipeline is applied. Here XGB stands for the Xgboost multi-class classifier, OVS XGB for a one-versus-rest multi-label classifier with the chosen estimator, SL for single label, and ML for multi-label classification.

| Test results, methods (noisy data) | SL XGB | ML thresholding | ML XGB-OVS | ML MLkNN |
|---|---|---|---|---|
| Bert+Fuzzy | 0.32 | 0.23 | 0.18 | 0.2 |
| word3+char+Bert | 0.42 | 0.28 | 0.21 | 0.33 |
| W2V+word3+char | 0.48 | 0.22 | 0.22 | 0.19 |
| Fuzzy | 0.22 | 0.23 | 0.23 | 0.21 |
| Bert | 0.54 | 0.25 | 0.26 | 0.3 |

As can be seen in Table 3, we outperformed the baseline BERT approach on the benchmark 20newsgroups data set in a statistically significant manner, by 4%, when noise is added and the training ratio is chosen as 80%. We also incorporated the results for clean texts into the tables to show that our approaches can perform well for noisy data (+3%) as well as for clean data (with the layering approach, a +3% difference from baseline BERT). The decrease in accuracy under noise for the baseline BERT approach is much higher (25%) than for our combinations; for example, the decrease for W2V+tf-idf_c+tf-idf_w+XGB is 15%. That is, our layers are more resilient to noise. We also observed that different noising techniques can lead to varying success rates for specific layers. For example, if we randomly change characters in the text with a higher probability, the character layer performs worse. On the other hand, if random concatenation of words is applied frequently, the word layers can perform less robustly. Word2vec trained on noisy data can be a good backup for the tf-idf layers, but as a stand-alone technique it is not good enough, so we did not include it in our test result tables.


Table 5. The data set in use is from the SemEval 2018 competition [16] with 11 distinct labels; for these tests, no noise addition is applied. Here XGB stands for the Xgboost multi-class classifier, OVS XGB for a one-versus-rest multi-label classifier with the chosen estimator, SL for single label, and ML for multi-label classification.

| Test results, methods (clean data) | SL XGB | ML thresholding | ML XGB-OVS | ML MLkNN |
|---|---|---|---|---|
| Bert+Fuzzy | 0.5 | 0.3 | 0.36 | 0.48 |
| word3+char+Bert | 0.57 | 0.26 | 0.23 | 0.47 |
| W2V+word3+char | 0.55 | 0.25 | 0.22 | 0.26 |
| Fuzzy | 0.49 | 0.21 | 0.19 | 0.39 |
| Bert | 0.64 | 0.25 | 0.27 | 0.45 |

Another observation is that our noising pipeline has difficulty reproducing the exact noising of our private French data set. One possible reason is the language difference; another is that the STT machine used in our case might have been trained on English data and applied to French data. Since it was a third-party setting, we do not know the exact reason for such a noise ratio.

Observations on Few-Shot Multi-Labeling (Thresholding) Results. The first main observation is that Bert embeddings alone are useful for thresholding neither on noisy nor on clean data, despite the high accuracy level for single-label multi-class classification. Empowering Bert with Fuzzy on clean data, or with tf-idf layers on noisy data, gives the best scores: 48% and 33% multi-labeling accuracy based on the metric of Eq. 2. We show the results in Tables 4 and 5. For thresholding, we use 10% of the multi-labels for training and 10% for testing purposes. We choose the single labels for training randomly, hence the SL (single-label) accuracy scores can be lower than expected, which affects the multi-label scores as well. Even though the scores are low, they still point out a significant phenomenon: the accuracy difference based on the metric of Eq. 2 between the ML thresholding technique (28%) with a small amount of data and the highest-performing multi-label training algorithm, ML MLkNN (33%), is circa 5%. The optimal thresholding value for multi-labeling is found to be 0.02. In label-scarce situations such as ours, this is a reasonable sacrifice compared to the price of human labor for manual annotation in industry.

5 Discussion

In this study, we performed an ablation study, leveraging different feature extraction schemes. When we relied on transformer approaches, we observed that these were not robust against the noise produced by the STT tool. Our research objective was to show that combining different feature engineering layers for multi-label few-shot learning would produce better results. In this respect, we have found that combinations of layers can compensate for each other's weaknesses. We showed that using such ensemble techniques on our noisy corpora and on benchmark data sets boosted performance.

6 Conclusion

In this work, we investigated the effect of a configurable layering approach, specifically with tf-idf, word2vec, and BERT embedding layers, as compared to the stand-alone versions of state-of-the-art models for binary and multi-class classification problems on noisy STT-transcribed data. Moreover, we utilized a multi-labeling system with a small amount of manually labeled data. We created a noise pipeline so that we could transfer our results on noisy French transcripts to benchmark data sets. Using simpler techniques/layers, we outperformed SOTA frameworks for all major scenarios on our French transcript data. Our work leads to several conclusions when text data are noisy: (1) In a multi-labeling task where known labels are scarce and the label set is large, semi-supervised systems such as label propagation or self-trainers are doomed to fail; our thresholding method is a good alternative, specifically for noisy data. We have observed that the accuracy difference, based on the metric in Eq. 2, between the ML thresholding technique (28%) trained with a small amount of data and the highest performing multi-label training algorithm, ML MLkNN (33%), is circa 5%. In label-scarce situations like ours, this is a reasonable sacrifice compared to the price of human labor for manual annotation in industry. (2) Different combinations of layers can be more resilient to noise, with statistically significant accuracy differences, and can be a better approach for noisy settings. (3) If added to the pipeline, our approaches are both cross-lingual and cross-domain, up to a language-dependent pre-trained transformer model. We have achieved our aim and objectives in these respects.

References

1. Abujabal, A., Gaspers, J.: Neural named entity recognition from subword units (2019). arXiv:1808.07364
2. Al-Otaibi, R.M., Flach, P.A., Kull, M.: Multi-label classification: a comparative study on threshold selection methods. In: First International Workshop on Learning over Multiple Contexts (LMCE) at ECML-PKDD, pp. 6–11 (2014)
3. Amigo, E., Delgado, A.: Evaluating extreme hierarchical multi-label classification. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5809–5819. Association for Computational Linguistics, Dublin, Ireland (2022)
4. Chang, W., Yu, H., Zhong, K., Yang, Y., Dhillon, I.S.: Taming pre-trained transformers for extreme multi-label text classification. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pp. 3163–3171. Association for Computing Machinery, New York, NY, USA (2020)


5. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)
6. Chen, Y., Zhang, Y., Zhang, C., Lee, G., Cheng, R., Li, H.: Revisiting self-training for few-shot learning of language model. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9125–9135 (2021)
7. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019)
8. Ghannay, S., Caubrière, A., Estève, Y., Camelin, N., Simonnet, E., Laurent, A., Morin, E.: End-to-end named entity and semantic concept extraction from speech. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 692–699 (2018)
9. Gharroudi, O., Elghazel, H., Aussem, A.: Ensemble multi-label classification: a comparative study on threshold selection and voting methods. In: Proceedings of the 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), ICTAI '15, pp. 377–384. IEEE Computer Society, USA (2015)
10. Gulcehre, C., Ahn, S., Nallapati, R., Zhou, B., Bengio, Y.: Pointing the unknown words. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 140–149. Association for Computational Linguistics, Berlin, Germany (2016)
11. Kruczek, J., Kruczek, P., Kuta, M.: Are n-gram categories helpful in text classification? In: International Conference on Computational Science, pp. 524–537. Springer (2020)
12. Kumar, V., Xie, H., Chen, L., Garcia, F., Lu, J.: Industry scale semi-supervised learning for natural language understanding. In: Proceedings of NAACL HLT 2021: Industry Track Papers, pp. 311–318 (2021)
13. Lichouri, M., Abbas, M., Lounnas, K., Benaziz, B., Zitouni, A.: Arabic dialect identification based on a weighted concatenation of TF-IDF features. In: Proceedings of the Sixth Arabic Natural Language Processing Workshop, pp. 282–286. Association for Computational Linguistics, Kyiv, Ukraine (Virtual) (2021)
14. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: a robustly optimized BERT pretraining approach (2019). arXiv:1907.11692
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, vol. 26. Curran Associates, Inc. (2013)
16. Mohammad, S., Bravo-Marquez, F., Salameh, M., Kiritchenko, S.: SemEval-2018 task 1: affect in tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 1–17. Association for Computational Linguistics, New Orleans, Louisiana (2018)
17. Mohammed, M., Omar, N.: Question classification based on Bloom's taxonomy cognitive domain using modified TF-IDF and word2vec. PLoS ONE 15(3), e0230442 (2020)
18. Muralidharan, D., Moniz, J.R.A., Gao, S., Yang, X., Kao, J., Pulman, S., Kothari, A., Shen, R., Pan, Y., Kaul, V., Ibrahim, M.S., Xiang, G., Dun, N., Zhou, Y., O, A., Zhang, Y., Chitkara, P., Wang, X., Patel, A., Tayal, K., Zheng, R., Grasch, P., Williams, J.D., Li, L.: Noise robust named entity understanding for voice assistants. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, pp. 196–204, Online. Association for Computational Linguistics (2021)
19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
20. Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2020)
21. Silla, C.N., Freitas, A.A.: A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery (2010)
22. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 6000–6010. Curran Associates Inc., Red Hook, NY, USA (2017)
23. Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., Raffel, C.: ByT5: towards a token-free future with pre-trained byte-to-byte models. Trans. Assoc. Comput. Linguist. 10, 291–306 (2022)
24. Yan, J.: Text Representation, pp. 3069–3072. Springer US, Boston, MA (2009)
25. Yang, Y.: A study of thresholding strategies for text categorization. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pp. 137–145. Association for Computing Machinery, New York, NY, USA (2001)
26. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, ACL '95, pp. 189–196. Association for Computational Linguistics, USA (1995)

Application of Language Models for the Segmentation of Website Visitors

Andreas Stöckl and Oliver Krauss

University of Applied Sciences Upper Austria, Hagenberg i. M., Austria
{Andreas.Stockl,Oliver.Krauss}@fh-hagenberg.at

Abstract. Website visitor segmentation is crucial for effective web presence management and online marketing. We explore methods for grouping website visitors based on user behaviour, incorporating their interests by using the GPT3 language model to analyze the text content of viewed pages and build user profiles. In our method, the language model GPT3 is used to summarize the content a user visited to build a meaningful user profile, and to answer queries concerning the interests of the users. We then segment the users via text-based topic modelling. Our findings indicate that our method of classifying user interests through text analysis and direct language model queries offers high transparency and versatility, surpassing traditional segmentation techniques. Users are characterized in clear profiles, and queries can be tailored to specific interests.

Keywords: User behaviour · Language models · Segmentation · Online marketing

1 Introduction

We employ large-scale language models to construct textual profiles of website visitors and segment them based on their interests derived from these profiles. Language models, trained on large text datasets [6,20,21], are rapidly conquering almost all areas of artificial intelligence and its applications, including text generation and code generation in various programming languages. These models, based on the Transformer architecture [26], have enormous numbers of parameters (hundreds of billions) and are pre-trained on vast amounts of text data, resulting in a strong foundation of text comprehension. Prominent language models include GPT3 [6], PaLM [8], LaMDA [23] and BLOOM [22]. Their pre-training involves predicting the next word in a given text corpus. Some models, such as GPT3 Instruct [19] and FLAN T5 [9], are additionally fine-tuned with further tasks.

A language model's success in a task depends on presenting the input text, or prompt, in a way that optimizes the prediction of the most likely next word. This prompt gives the context, e.g. the task that the language model has to solve; it contains the instruction for the analysis, possibly some examples (few-shot learning), or even a chain of thoughts which directs the language model in the right direction [28]. With their zero-shot and few-shot capabilities [6], the models offer the possibility of application in a variety of domains without having to fine-tune the language model or even train it from scratch. Thus, they can be applied in situations where no labeled data is available or where creating it would require a lot of effort.

In our research, we examine various methods utilizing language models to explore their potential in grouping website visitors with similar interests or behavior. Our approach segments visitors based on interactions collected during website visits, including page views containing textual content such as product descriptions, blog posts, or news articles. By utilizing the full page content, not just keywords, and GPT3 as a language model, we are able to generate text summaries that can then be used for topic modeling and segmenting the users. Our method is adaptable to the other language models mentioned.

Our segmentation approach allows for personalized website content and targeted online advertising, as well as analysis of user behavior within these segments. Our main contributions in this work include:

– A method for generating human-readable summaries of user behavior on a web page, using the content viewed.
– Segmenting users into topics based on their profiles using topic modeling.
– The ability to perform free-text queries based on user interests and profiles.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 807–821, 2024. https://doi.org/10.1007/978-3-031-47715-7_54

2 Related Work

We describe other approaches that have applied language models to tasks other than segmentation, to provide an overview of how these models are usually applied. We also show other approaches to segmentation, and compare these approaches to our own.

2.1 Other Application Areas of Language Models

The most obvious use case for generative language models is as writing support tools for authoring various kinds of text documents. These are usually based on the GPT3 models or free alternatives like BLOOM [22] or GPT-J [27]. For chatbots and all kinds of digital assistants, Google's LaMDA model [23] is particularly suitable. Machine translation also benefits from the new language models, as recent work by Meta AI [10] has shown.

If the language models are trained on source code of programming languages instead of natural language, then they are suitable for supporting programming. The best-known tools of this kind are GitHub Copilot (https://github.com/features/copilot/), which is based on the Codex model [7], a GPT3 variant for source code, and Salesforce CodeGen (https://github.com/salesforce/CodeGen) [18]. Pygma (https://pygma.app/) is an example of an application that uses code generation for web interfaces; it can be used to generate HTML/CSS code from Figma designs.

In addition, language models are also used for information retrieval. While we use GPT3, more classical approaches have actually been shown to outperform large-scale language models. Lavrenko and Croft present a relevance-based language model [16], which requires no training data. Melis, Dyer and Blunsom [17] also show that long short-term memory (LSTM) [13] neural networks still outperform more recent work when given proper hyperparameter tuning. The reason we use GPT3 is that we use it for multiple tasks, such as summarization in addition to information retrieval. We also provide an approach that is generic and requires no fine-tuning for different websites, which more classical models would require.

2.2 Other Approaches to User Segmentation of Web Users

Before the era of large language models, classical methods were available for topic analysis, such as the LDA method [4], which was used in [24] for topic-based user segmentation for online advertising. This approach is similar to ours in concept, but large language models are more powerful than older approaches to semantic encoding such as LDA. Segmentation of web users using a mixture of factor analyzers (MFA) was analyzed in [31] by Zhou and Mobasher. They propose analyzing users by their navigational patterns, extending the information acquired from users' interests and preferences. In our approach, we consider this information only partially, by describing a user's behavior, such as the last time they accessed the website and their frequency of use, as textual information that is mixed with the semantic interests of the users, similar to the approach in [31]. To identify groups of visitors that have similar patterns in the pages they visit, Probabilistic Latent Semantic Analysis (PLSA) [14] was used in [15] by Xin et al. They combine an analysis of how users navigate the website to achieve a task with semantic features of the viewed pages. Unlike in our approach, the semantic knowledge is captured manually; for example, for the task of looking for a place to live, they capture the pricing, room count, building year, and other features of the realty data to use as semantic information. This differs strongly from our approach, as we capture the semantic information automatically by collecting the website contents and processing them with GPT3 into short summaries. Analyzing the behavioral aspects of users is an important task in user segmentation. In [29], Wu et al. use Probabilistic Latent Semantic User Segmentation (PLSUS) to improve the click-through rate of targeted advertisements.
Unlike other approaches, they consider the semantics of the pages users clicked on, as well as the search queries they performed on the website. To cover semantics, they use a bag-of-words approach [30] to encode the topics that users are interested in. Taking the behavior of the users into consideration allows the authors to double the click-through rate of advertisements. Their approach is similar to ours but uses a more rudimentary way to model topics, as we use the Top2Vec [3] algorithm for the behavioral component of our users. One major advantage of our approach is that it does not consider demographic data at all. We only consider the behavioral and semantic aspects of the users, and do not use any information concerning location, gender, age, or other demographics. Many other approaches mix demographic data with semantic information in their segmentation [2,5].

3 Problem Statement

Most segmentation methods use unsupervised learning with engineered numeric and categorical features, as shown in the related work. Examples of these features include average time spent on the website, number of articles viewed in a specific category, or last visit date. While these features allow for segmenting the user base, they do not offer a readily understandable representation of the segments, beyond a 2D map with colored dots and the average feature values used for segmentation. In contrast to the related work, we aim to answer the question: How can user segments be generated, based on features that are easy to interpret, and be used to generate a description understandable by a layman?

Our approach offers a simple and interpretable solution for user segmentation, by creating user profiles based on the content viewed and generating human-readable summaries, instead of relying solely on numerical features. This makes it easier for non-experts to understand the resulting user segments and their interests. In an evaluation with test users from two projects, it became apparent that the segments formed by clustering are also not always applicable to the needs of marketing experts. These experts would like to describe segments themselves, or identify them according to a specific question they have in mind, such as 'Which users are interested in sports and healthcare?' With this, the second question our work aims to answer is: How can we let users query website visitors via natural language?

In our approach, we allow users to pose natural language queries against the text profiles generated for each website user, making the results more accessible and easier to interpret. This is not a common approach in related work, which mostly relies on numeric features for segmentation, limiting the ability to query user segments in an intuitive and flexible way.

4 Data Preprocessing

We describe the data necessary for our approach to work, and how we acquire and preprocess this information to derive user profiles and segments. This process consists of the following steps:

Application of Language Models for the Segmentation of Website Visitors

811

1. Online analytics: Via GDPR-compliant web trackers, we identify which web pages users have engaged with.
2. Text extraction: We scrape the website under analysis and extract the text of every unique webpage.
3. Text summarization: From the extracted texts, we create short abstracts for each page.
4. User profile creation: We identify a user's most relevant webpages and create a profile based on the text summaries of the identified pages.

4.1 Online Analytics

As a first step, we use web tracking to identify which pages and content a user has engaged with on a website. The collected data includes the URL of the visited pages, as well as the depth of scrolling and the duration of time spent on each page. The PiwikPro tool (https://piwik.pro/) is used to record the event data. The PiwikPro Analytics Suite is a data-protection-compliant European solution for recording user interactions on a website or app. The 506 Customer Intelligence Platform (https://www.506.ai/506-advanced/) is used to collect and analyze user profiles. Figure 1 shows a screenshot from the application with a representation of a user segmentation. The raw data of the user interaction events for the user profiles are stored in a database and are available to us for further analysis via a REST interface.
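The tracked events described above can be sketched with a simple record type; the field names and the engagement thresholds below are illustrative, not PiwikPro's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PageViewEvent:
    """One tracked interaction (URL, scroll depth, time on page)."""
    user_id: str
    url: str
    scroll_depth: float      # fraction of the page scrolled, 0.0 to 1.0
    seconds_on_page: float

def relevant_pages(events, min_seconds=10.0, min_scroll=0.25):
    """Keep only the pages a user actually engaged with."""
    return [e.url for e in events
            if e.seconds_on_page >= min_seconds and e.scroll_depth >= min_scroll]

events = [PageViewEvent("u1", "/immune-system", 0.8, 45.0),
          PageViewEvent("u1", "/imprint", 0.05, 2.0)]
print(relevant_pages(events))  # ['/immune-system']
```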

Fig. 1. Screenshot of the 506 Customer Intelligence Platform for analyzing user segments. In addition to the key figures and characteristics, a visualization of the segments can be seen.

4.2 Extraction of the Text Content

To make the text contents of the pages referenced in the interaction data available, these pages are downloaded by means of web scraping, and the relevant text information is extracted and stored in structured form. For the extraction, the framework Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is used. Currently, our approach only uses the text data available on the web page; in the future, however, it could be extended to also use image or video data by means of automated description, which large language models are also capable of.
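The extraction step can be sketched as follows. The paper uses Beautiful Soup; to keep this illustration self-contained, a stand-in built on Python's stdlib `html.parser` is shown instead, and the set of skipped tags is an assumption:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style and navigation blocks
    (a stdlib stand-in for the Beautiful Soup extraction described above)."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.parts, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><body><h1>Micronutrients</h1>"
        "<p>Iron supports the immune system.</p>"
        "<script>var x=1;</script></body></html>")
print(extract_text(page))  # Micronutrients Iron supports the immune system.
```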

4.3 Summarization of the Page Content

The complete page content of all pages viewed by a user would be too extensive to use directly for the user profiles in further analysis. We therefore create short summaries of about 40 words for all pages, which contain the most important topics of the page content. For this summary, we use the language model GPT3 [6] with 175 billion parameters, which is publicly available via a paid API. It offers the advantage over the other models mentioned that it is available in a version that has been optimized for tasks such as text summarization: after pretraining, it is refined with additional steps of supervised learning and reinforcement learning [19].
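A sketch of the prompt construction for this step; the wording and the truncation limit are assumptions, as the paper only specifies summaries of about 40 words per page. The resulting prompt would then be sent to the GPT3 completions API (exact client calls depend on the library version and are omitted here):

```python
def build_summary_prompt(page_text, max_words=40):
    """Build a summarization prompt for the language model.

    Truncating very long pages is an illustrative precaution to keep
    the prompt within the model's context window."""
    snippet = page_text[:4000]
    return (
        f"Summarize the following web page in at most {max_words} words, "
        f"keeping its most important topics.\n\n{snippet}\n\nSummary:"
    )

prompt = build_summary_prompt(
    "A micronutrient analysis uses your blood sample to determine how well "
    "you are supplied with vitamins, trace elements and minerals.")
print(prompt)
```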

4.4 Creation of User Profiles in Text Form

A text document is now created for each website visitor, describing their behavior and interests. To do this, the data records available from the tracking software (end device used, origin, referrer, number of page views, number of visits, etc.) are merged with the brief summaries of the pages visited by the user, created in the previous section. These textual descriptions form the basis for the segmentation analyses. The textual descriptions use the last five visited unique web pages. In the future, we plan to extend this approach by also considering the most frequently visited pages and the pages with the most time spent. An example of such a profile description is:

Context: The following content describes a user profile from an online store for health products.

User Behavior: The visitor last visited the website from Facebook with an Android phone from Austria. He has visited the site 54 times in the last 30 days, most recently four days ago, and has viewed a total of 112 pages.


The website visitor last visited pages with the following content:

What is a micronutrient analysis? A micronutrient analysis uses your blood sample to determine in the laboratory how well you are supplied with vitamins, trace elements and minerals.

Pregnancy and fertility: Pregnancy and breastfeeding place an increased demand on vitamins, minerals, and trace elements for a woman's body. Iron and folic acid are particularly important. A balanced diet is essential for a healthy and happy family.

Sports and Cologne List: Whether you do yoga, triathlon, track and field, or your heart beats for mountain biking, equestrian sports, or strength training: with the right amount of macro- and micronutrients, your body can always perform at its best. Many athletes rely on high-quality nutritional supplements of tested quality.

Immune system: The immune system is the body's defense mechanism. How efficient it is depends on many different factors. You can support your immune system with sound information and tips.

Children and family: Children need the best possible support to grow up strong, from an early age. For body and mind to develop optimally, the body needs important macro- and micronutrients. Babies get their nutrients directly from breast milk, so as a breastfeeding mother, you should pay special attention to your own micronutrient balance. With a balanced, healthy diet, you can positively influence your child's development.
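The assembly of such a profile can be sketched as follows; the field names are illustrative, not the authors' exact schema:

```python
def build_profile(user, page_summaries):
    """Merge tracking features with the page summaries into the textual
    user profile described above."""
    behaviour = (
        f"The visitor last visited the website from {user['referrer']} "
        f"with {user['device']} from {user['country']}. They have visited "
        f"the site {user['visits']} times in the last 30 days and viewed "
        f"a total of {user['page_views']} pages."
    )
    # As in the paper, only the last five visited unique pages are used.
    pages = "\n".join(page_summaries[-5:])
    return (f"User Behavior: {behaviour}\n\n"
            f"The website visitor last visited pages with the following "
            f"content:\n{pages}")

profile = build_profile(
    {"referrer": "Facebook", "device": "an Android phone",
     "country": "Austria", "visits": 54, "page_views": 112},
    ["A micronutrient analysis determines your supply of vitamins and minerals.",
     "Pregnancy increases the demand for iron and folic acid."])
print(profile)
```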

5 Analysis/Methods

We follow two approaches to form segments from the user profiles. On the one hand, we support segmentation of the documents with topic analysis methods to create general clusters for online marketing. On the other hand, we support the formation of segments based on verbal descriptions of the target groups, which are sent as queries against the user profiles via a language model.

5.1 Unsupervised Segmentation Approach

In order to group users of the website based on their interests, we apply unsupervised segmentation to the mined user profiles. We apply topic modelling algorithms, which cluster text documents into different topics, such as medicine or mathematics. A recent algorithm in this domain is Top2Vec [3], a topic modelling method based on word and document embeddings. It automatically detects the topics present in the text without the need to specify the number of clusters, and generates jointly embedded topic, document and word vectors, which also allows the follow-up analysis of specific topics. An alternative for future application would be BERTopic [12], a topic modelling algorithm fine-tuned on this task on the basis of the BERT language model [11]. We use Top2Vec in our implementation because it provides the topic and document vectors, which can in turn be used for further analysis of the topics. In our application, these topics correspond to the behavior and interests of the website visitors. In addition to the clusters and outliers, the method also provides ways to describe the clusters/user segments. Figure 2 shows the visualization of clusters in 2D using the t-SNE method [25] for dimension reduction.

Fig. 2. Visualization of clusters of user profiles built with Top2Vec.

The topic vector provided by the Top2Vec method consists of the most used words in the cluster documents and can be used to describe the segments. It gives an appropriate understanding of what content the users in a segment like to consume. As we have also included the user behavior in the profiles, some terms may also hint towards this behavior, such as recently active or Android user. In addition to this topic vector, we provide the description of the user profile closest to the cluster centroid as a proxy for the entire group. However, in the future we hope to improve this approach by calculating a segment summary similar to how we create the user profiles, e.g., combining the text summaries of the most viewed articles in the segment for the content, and averages for the user behavior. In order not to have to recalculate the whole clustering in case of changes in user profiles or new visitors, we have further developed Top2Vec for incremental expansion of clusters [1]. This approach allows us to add new users or remove users that have not been active for a longer time. In addition, we can analyze the shift of the clusters that make up the user base by observing the changes to the core model over time.
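Picking the profile closest to the cluster centroid can be sketched as follows; the two-dimensional toy vectors stand in for Top2Vec's jointly embedded document vectors:

```python
def centroid_representative(doc_vecs, doc_ids):
    """Return the id of the document closest to the cluster centroid,
    used as a human-readable proxy for the whole segment."""
    n, dims = len(doc_vecs), len(doc_vecs[0])
    centroid = [sum(v[d] for v in doc_vecs) / n for d in range(dims)]

    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, centroid))

    return min(zip(doc_ids, doc_vecs), key=lambda pair: sq_dist(pair[1]))[0]

ids = ["user_a", "user_b", "user_c"]
vecs = [[0.0, 0.0], [1.0, 1.0], [0.4, 0.5]]
print(centroid_representative(vecs, ids))  # user_c
```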

5.2 Supervised Querying in Natural Language

Supervised querying allows users to identify relevant profiles according to a specific query, such as 'Users that are interested in sports and do not use Apple devices.' A segment identified by such a query could, for instance, be interested in personal health devices from a different brand. We have developed two approaches for querying with natural language.

(A) Translating Queries into SQL. The first approach selects user profiles by translating the description of the segment into a suitable SQL command using the language model. This command then returns the matching records from the database of user profiles. Figure 3 shows an example.

Fig. 3. Generation of SQL commands to segment a database of user profiles.

In order for the language model to generate correct SQL, it is necessary, and mostly sufficient, to formulate the database structure in the prompt and then append the textual description of the user group with the request to generate SQL. Our experiments have shown that the use of column labels with good textual descriptions of the features is crucial; otherwise, in the example of Fig. 3, confusion between OS and device type quickly occurred. The problem with this approach is that, while structured features of user interaction such as number of visits, hardware, and so on can be captured well, interests based on website content are difficult to map to the database structure. Moreover, such a largely rule-based approach is only transferable from one use case to the next with a lot of effort.
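Assembling such a text-to-SQL prompt can be sketched as follows; the schema with descriptive column labels (which the experiments above showed to be crucial) and the wording are illustrative, and the language model would fill in the actual query:

```python
def build_sql_prompt(schema, request):
    """Prompt skeleton for translating a segment description into SQL."""
    return (
        "Database schema:\n"
        f"{schema}\n\n"
        "Write an SQL query that selects the user profiles matching this "
        f"description: {request}\nSQL:"
    )

# Descriptive column names help the model avoid confusing, e.g.,
# the operating system with the device type.
schema = ("user_profiles(user_id, operating_system, device_type, "
          "visits_last_30_days, last_visit_date)")
prompt = build_sql_prompt(schema,
                          "Users on Android phones with more than 10 visits")
print(prompt)
```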


(B) Querying via the User Profiles. Our second approach uses the language model as a question answering system, by building a prompt that asks a question to be answered from the user profiles in text form (see the example in Sect. 4). The answer is then used to decide whether a user belongs to a group or not. For example, 'Is the user interested in sports products?' or 'Is the visitor health conscious?' can be asked to form an appropriate group. For this purpose, the textual description of each user is subjected to the questions, and the answer is stored as a feature. The segments are then formed from all user profiles that have a positive answer to one or more of the provided questions.
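The segment formation can be sketched as follows; `keyword_ask` is a toy stand-in for the GPT3 call that answers yes/no from a text profile:

```python
def segment_by_questions(profiles, questions, ask):
    """Form a segment from all profiles with a positive answer to at least
    one question. `ask(profile_text, question)` abstracts the language-model
    call that answers yes/no from the text profile."""
    segment = []
    for user_id, text in profiles.items():
        answers = {q: ask(text, q) for q in questions}  # stored as features
        if any(answers.values()):
            segment.append(user_id)
    return segment

def keyword_ask(profile_text, question):
    """Toy stand-in for the language model: simple keyword match."""
    key = question.split("interested in ")[-1].rstrip("?")
    return key in profile_text.lower()

profiles = {"u1": "viewed pages about sports nutrition and training",
            "u2": "viewed pages about pregnancy and folic acid"}
print(segment_by_questions(profiles,
                           ["Is the user interested in sports?"],
                           keyword_ask))  # ['u1']
```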

6 Evaluation

To test and evaluate our approach, we investigate two different application areas: on the one hand, online stores that want to segment their visitors according to areas of interest in order to run targeted online advertising such as retargeting; on the other hand, online newspapers and news portals that want to offer visitors content personalized according to their interests. The data basis for the study is the interactions of users/readers with the news portal of a regional daily newspaper and with an online store for dietary supplements. The experiments were carried out with German-language websites, but the language models were used with English-language texts, since they perform better on English. The translation required for this was performed on the fly using DeepL.com and its API. Figure 4 shows the segments formed for the online store based only on the interactions, without analyzing the page content.

Fig. 4. Segmentation of the Customers of an Onlineshop.

Application of Language Models for the Segmentation of Website Visitors

817

With the help of these, segments can be formed according to origin, hardware, frequency of visits, etc., but the interests of the visitors are not found among the descriptions. To map these, the user descriptions were divided into segments according to different questions using the method described in the analysis section. To measure the quality of the segments, a survey of online marketing experts was carried out for a selection of user profiles and target group classifications. They had to decide whether the profiles fit into the target group or not by answering yes or no. This was done with the help of a small web application (Fig. 5). The results were then compared with those of our method. The latest variant of GPT-3, text-davinci-003, was used for the evaluation.

Fig. 5. Web Application for the Evaluation.

The 14 experts (5 female, 5 male, 4 not specified) were each asked to assess 6 user profiles, each with 10 target group assignments; each expert therefore had 60 decisions to make. The content of the website profiles and target group queries was in German; the questions were translated for presentation here. For the online shop:

1. ‘Is the user interested in sports products?’
2. ‘Is the user likely to be a woman?’
3. ‘Is the user a mobile website user?’
4. ‘Did the user come via a search engine?’
5. ‘Has it been a long time since the user was in the online store?’
6. ‘Is the visitor possibly pregnant?’
7. ‘Does the visitor have an older age?’
8. ‘Does the visitor possibly have sleep disorders?’
9. ‘Does the user come from a German-speaking country?’
10. ‘Is the visitor health conscious?’


For the online newspaper:

1. ‘Is the visitor interested in sports?’
2. ‘Is the visitor interested in international politics?’
3. ‘Is the visitor interested in the topic of climate change?’
4. ‘Is the visitor interested in motor sports?’
5. ‘Is the visitor likely to be a man?’
6. ‘Is the visitor likely to be a Tyrolean?’
7. ‘Does the visitor come from a German-speaking country?’
8. ‘Is the visitor interested in soccer?’
9. ‘Did the visitor use a mobile device to access the content?’
10. ‘Is the visitor interested in local content?’

In 50 of the 60 decisions, the majority of the experts had the same opinion as the AI system. In 54 of the 60 decisions, people disagreed, and the decision was ambiguous (4 or more of the experts had a different opinion than the majority). Of the 10 decisions in which humans and machine disagreed, 2 were in the publisher case and 8 in the e-commerce case. In the following, we showcase some examples where human and machine judged differently.

E-commerce: One profile contained a blog post with sports content but no other sports content; the vast majority of experts decided that this user was interested in sports, while the AI decided that no sports interest could be detected. In another case, the AI decided that the user was not from a German-speaking country even though he was from Austria: a clear mistake.

Publisher: The 8 non-matches in this case consist of:

– 2 further errors in deciding whether users came from a German-speaking country.
– 2 disagreements on the question of whether the user is a man; here the experts themselves were not always in agreement.
– 1 disagreement on the question about football interest; the profile contained content on the FIFA World Cup in Qatar, and whether that indicates an interest in sport or in politics was not unanimous even among the experts.
– 3 further disagreements on sports, international politics and motor sports in one profile; here too, the experts were not always in agreement.

All in all, 3 of the 10 non-matches can be clearly classified as errors of the AI, and 7 as assessments that are unclear even for experts. If one asks the language model for a justification of its decision, quite interesting justifications are produced, particularly in the cases where the assessment is unclear even for experts. For example: ‘No, the visitor is not interested in motor sports. This can be deduced from the fact that the visitor mainly visited pages with news about the world climate conference, bird flu and Joe Biden. There is no indication that the visitor is interested in motorsports.’
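The comparison between expert majority votes and the AI's answers can be sketched as below. The vote counts are invented for illustration; the ambiguity threshold of 4 dissenting experts follows the description above.

```python
# Sketch of the evaluation logic: for each of the 60 decisions, the
# majority vote of the 14 experts is compared with the AI's answer, and
# a decision is flagged as ambiguous when 4 or more experts dissent
# from the majority.
from collections import Counter

def evaluate_decision(expert_votes, ai_answer):
    """expert_votes: list of 'yes'/'no' answers from the experts."""
    majority, majority_count = Counter(expert_votes).most_common(1)[0]
    dissenters = len(expert_votes) - majority_count
    return {"matches_ai": majority == ai_answer,
            "ambiguous": dissenters >= 4}

# 9 experts say yes, 5 say no, the AI says no: a non-match that is
# also an ambiguous call among the experts themselves.
result = evaluate_decision(["yes"] * 9 + ["no"] * 5, "no")
```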


However, for the assessment by gender, sometimes nonsensical justifications are generated: ‘The visitor is probably a man, because he uses an iPhone from Austria and has visited the website 5 times in the last 30 days. He has also viewed a total of 170 pages, which indicates that he is interested in the content of the website.’ Since there are no statements about the content actually viewed, the suspicion arises that the model has been ‘cleaned up’ with regard to prejudices about typically male and female content.

7 Conclusion

In this work, we present a novel approach to user segmentation for websites that combines behavioral data, i.e., frequency of use, time of last visit and other information, with semantic interests. We derive the semantic interests from the web pages users consumed and use GPT-3 to summarize these pages into a user profile. Based on these profiles, we use Top2Vec to cluster the users by their interests, while simultaneously enabling querying for specific user segments via GPT-3 to find users that may be interested in specific topics to be advertised. Our approach provides several advantages. One is compliance with the GDPR, since no demographic information on the website users needs to be collected. Marketing personnel benefit from a user profile that is a plain text document, which can also be used to describe a segment identified with Top2Vec. This information is easily understood, even by a layman, as it contains the website data in shortened form rather than the less accessible feature vectors provided by other approaches. In the future we aim to refine our methodology to also weight the contents users are most interested in based on their behavior, such as the most frequently visited relevant web pages (e.g., omitting landing pages), and to create summary documents for each user segment based on the user profiles making up the segment. Enhancing our user segmentation with additional behavioral data, such as users' paths through the website obtained by process mining, is also a point of future research.


Identifying Research Topics in Human-Computer Interaction for Development: What Value Can Natural Language Processing Techniques Add?

Judy van Biljon, Etienne van der Poel, and Tawanda Chiyangwa
University of South Africa, Johannesburg, South Africa
{vbiljja,evdpoel}@unisa.ac.za

Abstract. Human-Computer Interaction for Development (HCI4D) is an interdisciplinary field involving researchers, practitioners and funding organizations from Human-Computer Interaction (HCI), Information Systems (IS) and Development Studies. The fast-growing HCI4D literature means researchers need support to identify the core research areas and current research patterns effectively and efficiently. This study investigates the value added by natural language processing (NLP) techniques when identifying research topics (core research areas) from a set of HCI4D research publications, by comparing the results with those of traditional, manual literature searches. Bidirectional Encoder Representations from Transformers (BERT) and t-distributed Stochastic Neighbor Embedding (tSNE) are used as NLP techniques on a dataset of HCI4D publications. The top 10 word clusters generated were considered for semantic mapping to associated topics. Our findings confirm that NLP techniques are effective in identifying research topics and add value by surfacing new topics and confirming existing ones. BERT, rather than tSNE, was found useful for supporting literature searches, especially in identifying new topics. The inclusion of human experts in labelling the topics improved transparency and added value in structuring the topics provided by automated searches.

Keywords: Topic modelling · BERT · tSNE · HCI4D · Research trends

1 Introduction

The Human-Computer Interaction for Development (HCI4D) field emerged at the intersection of Information and Communication Technology for Development (ICT4D) and Human-Computer Interaction (HCI) as interdisciplinary areas of research and practice [1]. Positioned within the disciplines of Information Systems and Computer Science, HCI4D is a multi-faceted field that focuses on understanding and designing technologies for under-served, under-resourced, and under-represented populations in diverse geographic regions [2]. Due to the interdisciplinary and dynamic nature of the field, it is challenging

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 822–840, 2024. https://doi.org/10.1007/978-3-031-47715-7_55

Identifying Research Topics in Human-Computer Interaction

823

for researchers to keep up with the published literature. Therefore, a number of literature reviews have been presented to address that challenge, including those by Dell and Kumar [2] and van Biljon [3]. A rigorous, thoughtful synthesis of the literature on the research problem is generally considered an essential starting point in Information and Software Technology research [4]. Systematic literature reviews (SLRs) are manual mechanisms for capturing the state of the art of the literature towards identifying research topics and trends in a domain [5]. SLRs following a standardized process like the PRISMA review [6] provide relative transparency of the process followed in obtaining the results, but SLRs have been faulted for bias, that is, researchers making contestable decisions about which publications to include and which elements to emphasize in the findings. Potential researcher bias is confirmed by the findings of Yalcinkaya and Singh [7], who found a tendency for researchers conducting systematic literature reviews to be inclined towards selecting highly cited papers. As a complement to manual searches, Natural Language Processing (NLP) offers several approaches for using automated topic modelling algorithms to uncover patterns of term co-occurrence [8]. These patterns are used to find related topics in text data using topic modelling algorithms [9]. Sehra et al. [10] investigated research patterns and trends in software effort estimation using a generative statistical technique called Latent Dirichlet Allocation (LDA). Their dataset contained 1178 titles and abstracts of articles selected according to specific criteria. Chiyangwa et al. [11] also used LDA as a topic modelling technique to find the co-occurrence patterns of terms that correspond to semantic topics in research papers, using a corpus of 176 abstracts and titles. The advantage of LDA is the transparency provided by the availability of the word clusters that underlie the topic modelling. In both studies the corpus consisted only of the publications' abstracts and titles. LDA is a mature technique for extracting significant features from text [10]. Bidirectional Encoder Representations from Transformers (BERT) and t-Distributed Stochastic Neighbour Embedding (tSNE) have been proposed as more recent topic modelling techniques to consider [12–14]. tSNE and BERT are non-linear techniques that can handle outliers, unlike linear dimensionality reduction techniques such as LDA [15]. BERT outperformed LDA in two recent studies, published by Abuzayed and Al-Khalifa [16] and by Thompson and Mimno [17]. This motivated the selection of BERT and tSNE as topic modelling techniques in this study, with titles and abstracts as the data corpus. Commercial research platforms like iris.ai provide smart searches with a wide range of smart filters, reading-list analysis, auto-generated summaries, and autonomous extraction and systematizing of data, but the cost is often prohibitive, so investigating open-access solutions remains important. The primary research question for this study is: “What value is added by NLP techniques, specifically BERT and tSNE, when identifying research topics (core research areas) from a set of research articles in the domain of HCI4D?” Accepting that automated, NLP-based techniques are generally more efficient, we focus on effectiveness by comparing the results of automated analysis to those of traditional, manual literature searches. The paper is divided into seven sections. The next (second) section presents a summary of the HCI4D topics that have been identified in previous studies. The

824

J. van Biljon et al.

third section explains the process for data collection. The fourth section discusses the methodology describing the application of BERT and tSNE to the corpus. The results and findings are presented in the fifth section. The sixth section responds to the research questions and the limitations of the study. The final section concludes the paper and suggests future research opportunities.

2 Literature

A number of literature review studies have explored the core research areas in HCI4D, notably Ho et al. [18], Dell and Kumar [2], van Biljon and Renaud [19] and Chiyangwa et al. [11]. Table 1 provides a summary of the topics that have previously been proposed for the HCI4D literature.

Table 1. Topics (core research areas) identified in the literature. The table indicates, for each of the topics Context, International Development, Intervention, D4D, M4D and HCI4D, which of the studies by Ho et al. [18], Dell and Kumar [2], Van Biljon and Renaud [20], Van Biljon [3] and Chiyangwa et al. [11] identified it.



Notably, the review papers before 2021 used manual literature reviews to identify the topics, while Chiyangwa et al. [11] used an NLP technique, namely LDA. The LDA-extracted research topics share several similarities with the BERT method employed in this work, such as the title-and-abstract corpus and the semantic mapping (labelling) of the word clusters by human experts. Van Biljon [3] mentioned the importance of M4D and D4D but listed mobile phones as a technology and D4D as a focus area together with Health, Education, etc. Notably, the NLP results are presented as a list of topics in order of importance according to the item loading. In contrast, manual literature reviews have distinguished topics (core themes) like cross-cultural context, developmental needs, geographical distribution, technology and interfaces from research domains like health, education and D4D, e.g. van Biljon and Renaud [20]. Presenting the results on a lower level of abstraction (a list of topics) means that an intervention is needed to structure the results so as to distinguish a theme (core research area or trend) from a research domain. Reeves [21] maintains that research themes evolve through four phases (as depicted in Fig. 1): they start with inception (emerging or declining themes), then stabilize (basic and transversal themes), then go mainstream (motor themes)


and eventually terminate or continue as developed but isolated themes. Research topics fluctuate in terms of importance while research domains like health or education are more stable in the sense that those application domains are found across disciplines in interdisciplinary fields like HCI4D.

Fig. 1. Mapping disciplinary knowledge production as quadrants [21].

3 Collection of Data

A systematic literature review (SLR) was carried out to construct a corpus of HCI4D research published from 2009 to 2021. SLRs are an established method of analyzing the literature in a research domain [4], combining the strengths of a critical review with a comprehensive search process, typically to address broad questions [22]. The steps in the data collection process are discussed below.

Identifying the Sources of Knowledge. Cavacini [23] measured the quality of a set of bibliographic records by comparing five databases: Scopus, WoS, INSPEC, DBLP and Google Scholar. They found that WoS, INSPEC and Scopus provide better-quality indexing and bibliographic records when considering accuracy, control and granularity of information, while WoS and Scopus provide more sophisticated tools for measuring trends in scholarly publications. We selected Scopus since it indexes most of the ACM conference proceedings where HCI4D papers are regularly published.

Creating a Set of Search Criteria. Different terms and acronyms have been used to distinguish research focusing on human-centred design aiming to foster socio-economic development. Despite the ambiguity and even controversy surrounding the term ‘development’, the term ‘HCI4D’ has prevailed, as evidenced by increasing research in this area that uses this term in publication titles or as a keyword [24]. The search term


‘HCI4D’ was selected because other combinations of the terms ‘Human-Computer Interaction’ and ‘development’ led to the inclusion of papers involving software development but not socio-economic development.

Searching Bibliographic Databases. The Scopus bibliographic database was searched using only the keyword ‘HCI4D’ as the inclusion criterion for the period 2009–2021, with ‘English’ as the language and the document type set to ‘journal’ or ‘conference paper’. The search, set to find the keyword in the title, abstract or full text of a publication, delivered 239 publications.

Removing Duplicate and Inappropriate Articles. These 239 publications included 24 workshop papers, one book chapter, a review and an editorial that were selected despite the document type being set to ‘journal’ or ‘conference paper’. Removing those left a corpus of 212 papers to analyze. Figure 2 provides an overview of the publications per year. The upward trend in publications from 2014 to 2020 suggests the continued growth and relevance of the HCI4D field. As can be observed, the COVID-19 pandemic negatively impacted the publication output, due to conferences being cancelled or postponed to 2021. The 2021 data was captured in July 2021, i.e., before the end of the year, and thus cannot be compared to the previous years.
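The clean-up step that reduces the 239 search results to the final corpus can be sketched as below. The record structure (`title`, `doc_type` fields) is illustrative, not the actual Scopus export format.

```python
# Sketch of the corpus clean-up step: search results are reduced to
# journal and conference papers, and exact-title duplicates are dropped.
# Field names ('title', 'doc_type') are invented for the example.

ALLOWED_TYPES = {"journal", "conference paper"}

def clean_corpus(records):
    """Keep allowed document types, dropping duplicate titles."""
    seen_titles, kept = set(), []
    for rec in records:
        key = rec["title"].strip().lower()
        if rec["doc_type"] in ALLOWED_TYPES and key not in seen_titles:
            seen_titles.add(key)
            kept.append(rec)
    return kept

records = [
    {"title": "HCI4D in practice", "doc_type": "conference paper"},
    {"title": "HCI4D in Practice", "doc_type": "conference paper"},  # duplicate
    {"title": "Workshop on HCI4D", "doc_type": "workshop paper"},
    {"title": "A decade of HCI4D", "doc_type": "journal"},
]
corpus = clean_corpus(records)  # 2 records survive
```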

Fig. 2. Article publication per year.

Reviewing in its Entirety. The papers that met the inclusion criteria were manually examined for relevance to the HCI4D domain (as defined in Sect. 1) by reviewing the title, abstract and keywords. No papers were removed during this phase, so all 212 papers were retained as relevant to the current research.


4 Method

The method is based on the premise that a dataset (corpus) of publications can be subjected to topic modelling algorithms, which facilitate querying a large corpus of data to identify latent ideas (topics) within it [25–28]. Each publication (document) is viewed as a collection of a small number of latent themes (topics), and each stemmed word (stemming is explained in Sect. 4.1) relates to one of these topics. The topic model links words with similar meanings and distinguishes words with distinct meanings using latent information [7]. As a result, latent themes represent multiple observable entities having similar patterns identified from the corpus. Figure 3 provides an overview of the method. The corpus (data collection described in Sect. 3) was pre-processed as described in Sect. 4.1. The BERT and tSNE algorithms were applied to the corpus to produce BERT word clusters and tSNE word clusters, as described in Sect. 4.2. Human expert labelling was carried out on these two sets, resulting in BERT-based and tSNE-based topics, as discussed in Sect. 4.3.

Fig. 3. Flow diagram of this study: publications from Scopus (2009–2021) → titles & abstracts → pre-processing → BERT and tSNE → BERT/tSNE word clusters → expert labelling → BERT-based and tSNE-based topics.

All the analyses were done in Jupyter Notebook, an interactive computational environment in which code execution, rich text, mathematics, plots and rich media can be combined. The data file containing the literature dataset, i.e., all the titles and abstracts, was processed as described in the next three sections.


4.1 Pre-processing of Data

The removal of noisy words or characters from the dataset is part of the pre-processing phase. For pre-processing the literature dataset, the following procedures were used:

• Loading text
• Lexical analysis of the text
• Stop words removal
• Stemming

Loading Text. The corpus from the reference manager Mendeley is exported to a csv file, and the required information from the bibliographic database is filtered using C++ code written specifically for this purpose.

Lexical Analysis. The 212 titles and abstracts were tokenized into 53,000 tokens as a standard measurement in lexical analysis. The generated tokens were transformed to lowercase letters. In a Jupyter notebook, punctuation characters (exclamation points, commas, apostrophes, question marks, double and single quotation marks, and hyphens) were removed. Numeric values were also deleted, leaving only the textual elements.

Stop Words Removal. The phrases used to construct the literature collection were stripped of common English terms like ‘the’ and ‘and’ using the Natural Language Toolkit (NLTK) Python module.

Stemming. Word forms are stemmed to their root form to create an effective literature dataset. To stem the tokens for each text, the Snowball stemmer was used, and inflected words were transformed to their base stem, as advocated by Bird and Loper [29] and Bouchet-Valat [30].

4.2 Applying tSNE and BERT

A Jupyter notebook running Python was used to run the two sets of experiments for tSNE and BERT. The specific variant of BERT that was used is BERTopic, which uses BERT's transformer embeddings to find dense clusters of important words in sets of documents. Since this study aims to test whether these techniques could be useful for automating the process of finding topics in the HCI literature, they were used in their ‘vanilla’ incarnations, without unnecessary tuning of the hyper-parameters. Figure 3 illustrates the flow of the experiments. Topic models are generated based on three input parameters, namely the number of topics, the number of iterations, and the hyper-parameters, according to Mavridis and Symeonidis [31]. For this analysis the input parameters were kept at their default values.
BERT Hyper-Parameters. A BERT-base model with a hidden size of 768, 12 transformer blocks and 12 self-attention heads was used. Finding the ideal set of parameters for training a BERT model maximizes the probability of generating the training documents [36], as required for generating the topics in this study. Using the BERT machine learning technique, ten topics were identified as optimal for this study. During data cleaning, the dataset was found symmetrical enough (according to the criteria of Toschi et al. [37]) to allow further processing (distribution values).
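BERTopic's standard mechanism for surfacing each cluster's important words is a class-based TF-IDF (c-TF-IDF) over the clustered documents; a simplified stdlib sketch of that scoring is given below. The clusters and tokens are invented toy data, and real BERTopic embeds and clusters the documents before this step.

```python
# Simplified sketch of class-based TF-IDF (c-TF-IDF): each word's score
# in a cluster is its term frequency times log(1 + A / f), where A is
# the average number of words per cluster and f the word's frequency
# across all clusters. Top-scoring words characterize the topic.
import math
from collections import Counter

def ctfidf_top_words(clusters, top_n=2):
    """clusters: {cluster_id: [token, ...]} -> top words per cluster."""
    avg_words = sum(len(t) for t in clusters.values()) / len(clusters)
    corpus_freq = Counter(t for toks in clusters.values() for t in toks)
    top = {}
    for cid, toks in clusters.items():
        tf = Counter(toks)
        scores = {w: tf[w] * math.log(1 + avg_words / corpus_freq[w])
                  for w in tf}
        top[cid] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return top

clusters = {0: ["mobile", "phone", "mobile", "design"],
            1: ["health", "clinic", "health", "design"]}
top_words = ctfidf_top_words(clusters)
# {0: ['mobile', 'phone'], 1: ['health', 'clinic']}
```

Words shared across clusters (like "design" here) are down-weighted, which is what makes the remaining words useful labels for a topic.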


tSNE Hyper-Parameters. Three of the hyper-parameters, also referred to as tuning variables, were set to the optimal values recommended by Chen et al. [32], as follows:

• Alpha (α) is the magnitude of the algorithm's prior over the topic distribution of a document (φ) [33]. Symmetric α values and 5 to 10 iterations performed well in delivering recognisable/relevant topics.
• Beta (β) is the per-word weight of the algorithm's prior over topic-word distributions (φ) [33]. The value of β was fixed at 0.01 for all topic solutions and preserved at 50/T, where T is the number of topics, which is 10 in this study, as advocated by Sehra et al. [10].
• Based on Arun et al. [34], we deduced that the number of iterations required for identifying five to ten topics is 100.

There is no standard method for determining the trends; however, heuristic parameters like selection hierarchy and identity construction proposed by Cao et al. [35] and Arun et al. [34] were used to determine the optimal range of topic solutions, which falls between 50 and 70 [36, 38]. The Natural Language Toolkit and gensim libraries were used for this study. The result produced 212 topics. For pragmatic reasons, and for comparison with BERT, the top 10 topics were selected. A study of the frequency chart of the topics (which shows a word count per topic) also shows a sharp decline and a flattening at around 10–11 topics, which could indicate lower topic relevance beyond this point.

4.3 Topic Labelling

tSNE and BERT were used to produce loadings for words and documents in the corpus. The terms identified by the two algorithms are presented in Table 2 (see Appendix) for the top ten topics, together with their loading values. The loading value for each term denotes the degree to which the associated term is related to a certain topic. The topic loading values can be used to identify core research domains and trends in a field of study [38]. The most appropriate number of topic solutions for detecting research trends is ten to sixty [39]. According to Sidorova et al. [40], the eight-topic solution is the best low-level answer when approximated. This study looked at 10 topics. ML-based term identification and grouping of the terms into topics was followed by topic (research theme) labelling; this is where the human experts interact with the results of the algorithm. To mitigate subjectivity and bias, the three authors performed theme labelling individually. They then discussed the labels and decided on a label for each of the groups that they considered coherent enough to represent the underlying theme. Table 3 depicts the labels assigned by the authors, together with a final label assigned by an external HCI4D expert, for BERT, and Table 4 for tSNE. These are discussed in more detail in the next section.

J. van Biljon et al.

5 Results and Findings

5.1 BERT Results

The themes that emerged represent the following research trends: Mobile technology for development (M4D), Audio ICT, Technology design, Gender, Education, Privacy & Security, Financial services, and Agriculture. In the case of clusters 4 and 6, the researchers could not find enough coherence between the terms to map them to a topic (theme) representing a trend.

Table 2. Comparison of the experts' labelling of the word clusters identified from BERT.

#  | Researcher 1 | Researcher 2 | Researcher 3 | Expert
1  | Mobile technology | Mobile telephone | Mobile devices | M4D
2  | Sound devices | Voice/Audio communication technology | Audio/video communications | Audio ICT
3  | Design | Technology design | Design | Technology design
4  | Research study | HCI4D | Researchers | HCI4D research
5  | Female harassment | Gender | Discrimination | Gender
6  | Literacy in education and education | Teaching students | Curriculums | Education
7  | Security | Privacy and Security | Privacy and security | Privacy and Security
8  | Literacy in education and Education | Literacy and Education | Literacy | Literacy
9  | Mobile payment | Mobile Finance | Financial | Financial services
10 | Employee production | Agricultural production | Farming | Agriculture

5.2 tSNE Results

Table 3 depicts the labels assigned by the authors, together with a final label assigned by an external HCI4D expert, for tSNE. Researcher 1 suggested the themes HCI research, Participants' knowledge, Design, Technology, Community feedback, and Literacy education and Learning environment, while the other reviewers suggested other themes, or no central theme when they could not find coherence between the terms (words) in the word cluster.

Identifying Research Topics in Human-Computer Interaction


The only agreement between the researchers was found in the themes Technology design and Mobile technology (phones) for development (M4D).

Table 3. Comparison of topics from tSNE (abstract corpus) using expert judgement.

#  | Researcher 1 | Researcher 2 | Researcher 3 | Expert
1  | HCI research study | No central theme | Various topics | None
2  | HCI research study | No central theme | Authors | None
3  | Participants knowledge | No central theme | Various topics | None
4  | Design | Technology design | Authors | Technology design
5  | HCI research study | No central theme | Communication and various topics |
6  | Technology | Mobile telephony | Community Health | M4D
7  | Participants Knowledge | HCI Research/HCI4D | Interviews | None
8  | Community feedback | Health HCI research | Authors | None
9  | Literacy education and Learning environment | No central theme | People | None
10 | No central theme | No central theme | Authors | None

Table 4 depicts a comparison of the topics identified in this literature with previous results from traditional, manual literature reviews (Table 1), as well as the NLP-based topics depicted in Table 2 and Table 3. Based on the existing HCI4D literature and our findings as depicted in Table 4, we propose the following as current research topics (core themes) in HCI4D: Context, International Development, Technology Design, Data for Development (D4D), and Mobile for Development (M4D). Likewise, the following are proposed as research domains: Gender, Education, Privacy & Security, Literacy, Financial services, and Agriculture. Notably, these are not proposed as a final, complete set of topics or domains, but rather as the results of our top-ten keyword cluster analysis, which is useful in confirming results from previous manual literature searches.
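The cross-technique comparison above can be sketched with simple set operations; the topic sets below are partial, illustrative extracts in the spirit of Tables 2 to 4, not the full sets.

```python
# Hypothetical topic sets from the two NLP techniques and the earlier
# manual (SLR) reviews, used to separate confirmed topics from
# technique-specific ones.
bert_topics = {"M4D", "Technology design", "Gender", "Education",
               "Privacy and Security", "Literacy", "Financial services",
               "Agriculture"}
tsne_topics = {"M4D", "Technology design"}
slr_topics = {"Context", "International Development", "M4D", "D4D",
              "Technology design"}

confirmed = bert_topics & tsne_topics & slr_topics   # agreed by all three
nlp_only = (bert_topics | tsne_topics) - slr_topics  # new relative to SLRs

print(sorted(confirmed))  # ['M4D', 'Technology design']
```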

6 Discussion

Our results show that the analysis of the title and abstract corpus produces different themes using BERT and tSNE. The BERT analysis added value by combining terms to unlock and render visible latent connections across the textual corpora. The semantic coherence

Table 4. Comparison of the topics. Columns: SLR topics (previous), NLP topics (previous), BERT topics, tSNE topics, Final set. Rows: Context; International Development; Intervention; Data for Development (D4D); Mobile for Development (M4D); Human-computer interaction for development (HCI4D); Audio ICT; ICT; Technology design; Gender; Education; Privacy and Security; Literacy; Financial services; Agriculture.
of the terms allowed the researchers to label 7 of the 10 topics with confidence. In the case of tSNE, two clear topics were identified, namely mobile technology and technology design, both of which overlapped with topics identified when applying BERT. The results suggest that BERT was more effective than tSNE in identifying coherent word clusters that the researchers could label (semantically map to core research areas). In manual literature reviews, research topics are at a meta level compared to the field's domain questions, i.e. Why (domains), What (technologies and interfaces), Who (stakeholders) and How (research approaches). These NLP techniques provide a flat list of topics, which needs to be differentiated manually if further distinction is required. The relatively small data corpus is a limitation, and it is necessary to replicate the study with a larger dataset. The BERT and tSNE techniques should also be compared to a variety of topic modelling approaches such as Hashtag Latent Dirichlet Allocation (Hashtag-LDA), Enriched LDA (ELDA), Poisson-Gamma Latent Dirichlet Allocation (PGL-LDA), and word2vec [13, 14, 30, 41]. Other databases, such as those of the Institute of Electrical and Electronics Engineers (IEEE) and Web of Science (WoS), might also be used to increase the size and diversity of the corpus. Humans labelling groupings can introduce subjectivity, but this was mitigated by having the researchers identify the categories separately and then debate their final recommendations. We also involved a domain expert in assigning the final labels. In future, this might be handled by a focus group of experts to promote transparency and reduce subjectivity, as suggested by Hughes [42]. Instead of focusing on single words, future studies will incorporate themes based on extracted short phrases such as trigrams and similar approaches. The selection of titles and abstracts as the data corpus implies a higher level of human intervention and subjective processing than the full-text documents. Further research is necessary to compare these results with those obtained from applying the same NLP techniques to the full text of the corpus.

7 Conclusion

The study examined the value added by using NLP techniques to identify major themes in the HCI4D literature. The NLP analyses (BERT and tSNE as techniques) were applied to latent topic modelling of the corpus using an unsupervised generative probabilistic method that extracts themes by choosing the most likely topic. This reduces researcher bias when uncovering emergent themes compared to systematic literature reviews. Besides the efficiency, the results identified topics that confirmed some of the research topics and research domains mentioned in the literature. This is useful in confirming those as current, dynamic research foci. Involving experts in the final step of our method for labelling the topics added transparency but also human bias. Notably, this involvement was only in terms of naming the topics, not in identifying the clusters. Besides the known efficiency advantage, our findings indicate that NLP techniques can add value in terms of effectiveness and reducing human bias. We acknowledge the relatively small sample size as a limitation, especially in terms of ML applications, and present the findings as a point of departure for further research into automated techniques for supporting, rather than replacing, systematic literature reviews.

8 Future Work

Given the limitations of the results, the use of NLP and clustering techniques (such as BERT and tSNE as used here) could be a useful first pass to find topics in a large volume of literature. What the ML clusters and the subsequent manual labelling showed us is that a fair deal of expertise is still required to create good topic labels from the identified clusters. One could argue that to a novice in the field these clusters could be difficult to understand (e.g. M4D), while to an expert in the field some of them (e.g. context) will be too high-level. When one considers how clustering algorithms work in general, this becomes more obvious. In a broad sense, these algorithms find words or phrases that occur more frequently than others (of course, some algorithms are more nuanced). The assumption is that, in a specific field at a specific point in time, the relevant topics being studied will occur more frequently. This is indeed the case, but there are also many other, more "ordinary" phrases in any field that occur frequently. There was no intelligence in these clustering algorithms to distinguish between the topic phrases and the more "common" ones. In hindsight, this seems to be what impacted some of our results and why the input of the experts was useful. The next challenge is to find ways of distinguishing between the types of clusters that will be useful for automating topic modelling and those clusters that map to more common phrases. Other than trying to build a huge deep-learning network, two possible approaches could be tried: (1) increase the size of the corpus and build several models to work out how to differentiate between useful clusters and others that simply have high frequency, and (2) experiment with other techniques to filter the documents to include only those parts of the text that have a higher probability of producing the research topics. A simple approach here would be to use only the abstract, introduction, conclusion, and/or the titles in the references from the papers in the corpus, on the assumption that these parts of academic papers tend to be focused on the topics and techniques at a high level. More complicated approaches could involve machine learning algorithms, perhaps even deep learning. It is now known that deep learning algorithms can pick up any bias in the data fed to them. The most relevant topics being studied in a field, as well as the techniques and approaches being applied, are in some sense "populist", so this could represent a kind of bias. An open question, therefore, is whether one could exploit this flaw in deep learning to find the relevant phrases, or whether one should explicitly manipulate the data to remove any bias. In terms of future work, our findings suggest that ML-based strategies have potential for assisting researchers in identifying research topics and trends in a timely, cost-effective and transparent manner. More research comparing the results from different NLP techniques and traditional literature reviews on larger, more inclusive datasets is needed to refine the NLP methods and confirm the perceived superiority of BERT. Our findings highlighted the usefulness of human experts in labelling the topics and providing structure. The trade-off between possibly introducing human-expert bias and the transparency afforded warrants further research towards finding the optimal human engagement strategy.
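The corpus-filtering idea sketched above could look as follows; the section names and the dictionary representation of a paper are assumptions for illustration.

```python
# Sketch of the proposed corpus filter: keep only the parts of a paper
# that are most likely to name research topics at a high level. The
# section names and the dict representation are illustrative assumptions.
KEEP_PREFIXES = ("abstract", "introduction", "conclusion", "references")

def filter_sections(paper):
    """paper: dict mapping lower-cased section title -> section text."""
    return {name: text for name, text in paper.items()
            if name.startswith(KEEP_PREFIXES)}

paper = {"abstract": "...", "method": "...", "results": "...",
         "conclusion": "..."}
print(sorted(filter_sections(paper)))  # ['abstract', 'conclusion']
```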

9 Appendix

See Table 5.

Table 5. Part 1: BERT topics 1 to 10 extracted from the corpus (term: loading).

Topic 1: Mobile (0.074721), Phone (0.064790), Phones (0.040863), Smartphones (0.021871), Devices (0.020964), Charging (0.014324), Phone-based (0.011563), Kenya (0.011232), Africa (0.011232), Users (0.009319)
Topic 2: Voice (0.078286), Audio (0.045598), Forums (0.040428), Voice-based (0.034358), Transcription (0.028748), Forum (0.025536), Audios (0.018594), Respeak (0.017730), Transcribe (0.017730), Voice-user (0.014781)
Topic 3: Design (0.084071), Co-design (0.034611), Designing (0.028515), Aspirations based (0.023114), Designers (0.019778), Discuss (0.017196), Cross-cultural (0.015315), Needs (0.014908), Making (0.014408), Discussions (0.011557)
Topic 4: Hci (0.111590), hci4d (0.080744), Research (0.025959), Development (0.024319), CSCW (0.017407), Education (0.016604), Practitioners (0.015686), Asia-pacific (0.012618), Efforts (0.012225), Projects (0.011987)
Topic 5: Women (0.073783), Women's (0.052387), Feminist (0.050107), Gender (0.040110), Gendered (0.028160), Harassment (0.028160), Sexual (0.022528), Muslim (0.018781), Pakistan (0.014338), Discrimination (0.013518)
Topic 6: Students (0.073682), Teacher (0.061326), Educational (0.033800), Teaching (0.032810), Schools (0.031659), Teacher focused (0.020504), Instructors (0.018991), Curriculum (0.016405), Education (0.016189)
Topic 7: Privacy (0.132417), Security (0.056878), Biometric (0.024469), Hiding (0.024469), Password (0.024469), Secret (0.024469), Confidential (0.017613), Cambodia (0.016313), Protect (0.016313), Bangladesh (0.016130)
Topic 8: Literacy (0.086473), Users (0.062873), Literate (0.033680), Illiterate (0.028980), Literacy based (0.023186), Literates (0.023186), User (0.020288), Underprivileged (0.018551), Language (0.015993), Poor (0.015993)
Topic 9: Financial (0.102653), Money (0.089853), Mobile (0.056876), Payments (0.052010), Banking (0.046377), Payment (0.037101), Attacks (0.032213), Transactions (0.028980), Ussd based (0.023186), Simple (0.020261)
Topic 10: Farmers (0.158638), Agricultural (0.112536), Smallholder (0.055839), Videokheti (0.035136), Production (0.032170), Service (0.028970), Cooperatives (0.026805), Technicians (0.026805), Market (0.019468), Poultry electronics (0.015093)

Part 2: tSNE topics 1 to 10 extracted from the corpus (term: loading).

Topic 1: Underpinning (0.00043), Methodological (0.00041), Epistemological (0.00039), Emphasis (0.00038), Tangibility (0.00037), Physicality (0.00035), Hci discipline (0.00032), Pregnant woman (0.00031), Polly (0.00030)
Topic 2: Underpinning (0.00025), Methodological (0.00023), Epistemological (0.00021), Emphasis (0.00020), Tangibility (0.00018), Physicality (0.00016), Hci discipline (0.00015), Pregnant woman (0.00014), Polly (0.00013)
Topic 3: Spread (0.04712), Performance (0.04514), Well (0.03809), Impact (0.03143), Interview (0.03023), World (0.02713), Learn (0.02514), Growth (0.02333), Language (0.02212), Project (0.01810)
Topic 4: User (0.09312), Design (0.08212), Application (0.04422), Base (0.04111), Challenge (0.03614), Digital (0.02023), Conduct (0.01932), Approach (0.01716), Beneficiary (0.01518), Implication (0.01516)
Topic 5: Underpinning (0.00013), Methodological (0.00012), Epistemological (0.00014), Emphasis (0.00011), Tangibility (0.00014), Physicality (0.00014), Put (0.00013), HCI discipline (0.00011), Pregnant woman (0.00011), Polly (0.00009)
Topic 6: Technology (0.08401), Mobile (0.07502), Service (0.05603), Phone (0.05404), Support (0.04204), Result (0.03203), Experience (0.03202), Benefit (0.02801), Pakistan (0.02602), Suggest (0.02311)
Topic 7: Viral (0.06403), Interface (0.05501), Population (0.04408), Researcher (0.03109), Participant (0.02902), Dissemination (0.02909), Speech (0.02634), Relate (0.02309), Evaluate (0.02111), Village (0.01813)
Topic 8: Health (0.05823), hci (0.04822), Community (0.04109), System (0.03909), Feedback (0.02809), Development (0.02708), Information (0.02400), Study (0.02335), Paper (0.02111), Provide (0.02123)
Topic 9: Survey (0.04811), Access (0.03123), Effective (0.03231), Infrastructure (0.02812), Environment (0.02623), Student (0.02512), Work (0.02423), Datum (0.02421), Report (0.02400), Poor (0.02234)
Topic 10: Present (0.06231), Research (0.05311), Voice (0.04716), Finding (0.04424), India (0.03922), Month (0.03322), Video (0.03122), Social (0.03020), Asta (0.02633), Low_literate (0.02415)

Acknowledgments. This work is based on the research supported by the South African Research Chairs Initiative of the Department of Science and Technology and National Research Foundation of South Africa (Grant No 98564).

References

1. Winters, N., Toyama, K.: Human-computer interaction for development: mapping the terrain. Inf. Technol. Int. Dev. 5, 4 (2009)
2. Dell, N., Kumar, N.: The ins and outs of HCI for development. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems - CHI '16, pp. 2220–2232 (2016)
3. van Biljon, J.: Knowledge mobilization of human–computer interaction for development research: core issues and domain questions. Inf. Technol. Dev. 26 (2020). https://doi.org/10.1080/02681102.2020.1767022
4. Kitchenham, B., Pearl Brereton, O., Budgen, D., et al.: Systematic literature reviews in software engineering - a systematic literature review. Inf. Softw. Technol. 51, 7–15 (2009). https://doi.org/10.1016/j.infsof.2008.09.009
5. Schön, E.M., Thomaschewski, J., Escalona, M.J.: Agile requirements engineering: a systematic literature review. Comput. Stand. Interf. 49, 79–91 (2017). https://doi.org/10.1016/j.csi.2016.08.011
6. Moher, D., Liberati, A., Tetzlaff, J., et al.: Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann. Intern. Med. 151, 264–269 (2009). https://doi.org/10.1371/journal.pmed.1000097
7. Yalcinkaya, M., Singh, V.: Patterns and trends in building information modeling (BIM) research: a latent semantic analysis. Autom. Constr. 59, 68–80 (2015)
8. Eickhoff, M., Neuss, N.: Topic modelling methodology: its use in information systems and other managerial disciplines. In: Proceedings of the 25th European Conference on Information Systems, ECIS 2017, pp. 1327–1347 (2017)
9. Egger, R.: Topic modelling. In: Applied Data Science in Tourism, Cham, pp. 375–403 (2022)
10. Sehra, S.K., Brar, Y.S., Kaur, N., Sehra, S.S.: Research patterns and trends in software effort estimation. Inf. Softw. Technol. 91, 1–21 (2017). https://doi.org/10.1016/j.infsof.2017.06.002
11. Chiyangwa, T.B., van Biljon, J., Renaud, K.: Natural language processing techniques to reveal human-computer interaction for development research topics. In: icARTi '21, 9–10 December 2021, Virtual Event. Association for Computing Machinery, Mauritius, pp. 1–7 (2021)
12. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). https://doi.org/10.48550/arXiv.1810.04805
13. Jelodar, H., Wang, Y., Yuan, C., et al.: Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimed. Tools Appl. 78, 15169–15211 (2019). https://doi.org/10.1007/s11042-018-6894-4
14. Peinelt, N., Nguyen, D., Liakata, M.: tBERT: topic models and BERT joining forces for semantic similarity detection. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7047–7055 (2020). https://doi.org/10.18653/v1/2020.acl-main.630
15. Pohlmeyer, C.W., Gonzalez, V.D., Irrinki, A., et al.: Identification of NK cell subpopulations that differentiate HIV-infected subject cohorts with diverse levels of virus control. J. Virol. 93, 1718–1790 (2019). https://doi.org/10.1128/jvi.01790-18
16. Abuzayed, A., Al-Khalifa, H.: BERT for Arabic topic modeling: an experimental study on BERTopic technique. Procedia Comput. Sci. 189, 191–194 (2021). https://doi.org/10.1016/j.procs.2021.05.096
17. Thompson, L., Mimno, D.: Topic modeling with contextualized word representation clusters (2020)
18. Ho, M.R., Smyth, T.N., Kam, M., Dearden, A.: Human-computer interaction for development: the past, present, and future. Inf. Technol. Int. Dev. 5, 1–18 (2009)
19. van Biljon, J., Renaud, K.: Human-computer interaction for development (HCI4D): the Southern African landscape. In: IFIP 9.4 International Conference on Social Implications of Computers in Developing Countries. Springer (2019)
20. van Biljon, J., Renaud, K.: Human-computer interaction for development (HCI4D): the Southern African landscape. In: IFIP Advances in Information and Communication Technology, International Conference on Social Implications of Computers in Developing Countries, vol. 1, pp. 253–266 (2019)
21. Reeves, S.: Locating the 'big hole' in HCI research. Interactions 22, 53–56 (2015). https://doi.org/10.1145/2785986
22. Grant, M.J., Booth, A.: A typology of reviews: an analysis of 14 review types and associated methodologies. Health Info. Libr. J. 26, 91–108 (2009). https://doi.org/10.1111/j.1471-1842.2009.00848.x
23. Cavacini, A.: What is the best database for computer science journal articles? Scientometrics 102(3), 2059–2071 (2014). https://doi.org/10.1007/s11192-014-1506-1
24. van Biljon, J., Renaud, K.: Reviewing a decade of human-computer interaction for development (HCI4D) research, as one of Best's "grand challenges". African J. Inf. Commun. 27, 1–15 (2021). https://doi.org/10.23962/10539/31368
25. Van Der Maaten, L.: Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014)
26. Arora, S., Hu, W., Kothari, P.K.: An analysis of the t-SNE algorithm for data visualization. In: Proceedings of the Conference on Learning Theory, pp. 1455–1462 (2018)
27. Moshkin, V., Konstantinov, A., Yarushkina, N.: Application of the BERT language model for sentiment analysis of social network posts. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 274–283 (2020)
28. Muller, B., Sagot, B., Seddah, D.: Enhancing BERT for lexical normalization. In: Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pp. 297–306 (2019). https://doi.org/10.18653/v1/d19-5539
29. Bird, S., Loper, E.: NLTK: the Natural Language Toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pp. 63–70 (2002). https://doi.org/10.48550/arXiv.cs/0205028
30. Bouchet-Valat, M., Bastin, G.: RcmdrPlugin.temis, a graphical integrated text mining solution in R. R J. 5, 188–196 (2013). https://doi.org/10.32614/rj-2013-018
31. Mavridis, T., Symeonidis, A.L.: Semantic analysis of web documents for the generation of optimal content. Eng. Appl. Artif. Intell. 35, 114–130 (2014)
32. Chen, C.O., Zhuo, Y.Q., Yeh, C.C., et al.: Machine learning-based configuration parameter tuning on Hadoop system. In: Proceedings of the 2015 IEEE International Congress on Big Data, pp. 386–392 (2015). https://doi.org/10.1109/BigDataCongress.2015.64
33. Sharma, D., Kumar, B., Chand, S., Shah, R.R.: A trend analysis of significant topics over time in machine learning research. SN Comput. Sci. 2, 1–13 (2021)
34. Arun, R., Suresh, V., Veni Madhavan, C.E., Narasimha Murthy, M.N.: On finding the natural number of topics with latent Dirichlet allocation: some observations. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol. 6118. Springer, Berlin, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13657-3_43
35. Cao, J., Xia, T., Li, J., et al.: A density-based method for adaptive LDA model selection (2009)
36. Crain, S.P., Zhou, K., Yang, S.H., Zha, H.: Dimensionality reduction and topic modeling: from latent semantic indexing to latent Dirichlet allocation and beyond. In: Mining Text Data, pp. 129–161. Springer, Boston, MA (2012)
37. Toschi, I., Rodríguez-Gonzálvez, P., Remondino, F., et al.: Accuracy evaluation of a mobile mapping system with advanced statistical methods. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. - ISPRS Arch. 40, 245–253 (2015). https://doi.org/10.5194/isprsarchives-XL-5-W4-245-2015
38. Sidorova, A., Evangelopoulos, N., Valacich, J.S., Ramakrishnan, T.: Uncovering the intellectual core of the information systems discipline. MIS Q. 32, 467–482 (2008). https://doi.org/10.2307/25148852
39. Bradford, R.B.: An empirical study of required dimensionality for large-scale latent semantic indexing applications. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 153–162 (2008)
40. Sidorova, A., Evangelopoulos, N., Valacich, J.S., Ramakrishnan, T.: Uncovering the intellectual core of the information systems discipline. MIS Q. 32, 467 (2008). https://doi.org/10.2307/25148852
41. Church, K.W.: Word2Vec. Nat. Lang. Eng. 23, 155–162 (2017). https://doi.org/10.1017/S1351324916000334
42. Hughes, R.T.: Expert judgement as an estimating method. Inf. Softw. Technol. 38, 67–75 (1996). https://doi.org/10.1016/0950-5849(95)01045-9

End-to-End Aspect Extraction and Aspect-Based Sentiment Analysis Framework for Low-Resource Languages

Georgios Aivatoglou, Alexia Fytili, Georgios Arampatzis, Dimitrios Zaikis(B), Nikolaos Stylianou, and Ioannis Vlahavas

Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
[email protected]

Abstract. Due to the increasing volume of user-generated content on the web, the vast majority of businesses and organizations have focused their interest on sentiment analysis in order to gain insights and information about their customers. Sentiment analysis is a Natural Language Processing task that aims to extract information about the human emotional state. Specifically, sentiment analysis can be performed at three different levels, namely the document level, the sentence level and the aspect/feature level. Since the document and sentence levels can be too generic for an opinion estimation given specific attributes of a product or service, aspect-based sentiment analysis has become the norm for exploiting user-generated data. However, most human languages, with the exception of English, are considered low-resource languages due to the restricted resources available, leading to challenges in automating information extraction tasks. Accordingly, in this work, we propose a methodology for automatic aspect extraction and sentiment classification on Greek texts that can potentially be generalized to other low-resource languages. For the purpose of this study, a new dataset was created consisting of social media posts explicitly written in the Greek language from Twitter, Facebook and YouTube. We further propose Transformer-based Deep Learning architectures that are able to automatically extract the key aspects from texts and then classify them, according to the author's intent, into three pre-defined classification categories. The proposed methodology achieved relatively high F1-macro scores on all classes, denoting its effectiveness for aspect extraction and sentiment classification in low-resource languages.

Keywords: Aspect extraction · Sentiment analysis · Natural language processing · Deep learning · Transformers · BERT

This research was carried out as part of the project KMP6-0096055 under the framework of the Action “Investment Plans of Innovation” of the Operational Program “Central Macedonia 2014–2020”, that is co-funded by the European Regional Development Fund and Greece. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2024  K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 841–858, 2024. https://doi.org/10.1007/978-3-031-47715-7_56

1 Introduction

Social media use increases exponentially every year, resulting in a wealth of publicly available information. Several billion users create and share data on a vast variety of platforms and applications, generating an invaluable source of information that researchers can take advantage of. Data come in many forms and pertain to numerous areas, with the most common being written text. Many businesses have begun to adopt sentiment analysis methods in order to get the most out of the available data [25]. A sentiment analysis task can be carried out at three different levels: the document level, the sentence level and the aspect level [7]. The document-level approach assumes that a single sentiment can summarize the whole text, from start to finish. The sentence-level approach can match a different sentiment to every sentence inside the document, whereas the aspect level focuses on pairing each mentioned aspect with a sentiment. Aspect-based methods classify the sentiment of each aspect present in a sentence [17]. As described by Pontiki et al. [18], Aspect-based Sentiment Analysis (ABSA) is split into different subtasks, including aspect extraction and sentiment classification. The process involves extracting an aspect and its categories, finding all opinionated words that describe them and, in the end, assigning a sentiment polarity to each one. An ABSA system is given an input (e.g. a product review) that discusses a particular entity (e.g. a new model of a laptop brand) and expresses an opinion holding a certain polarity (e.g. positive, neutral or negative sentiment) towards an aspect (e.g. the laptop's battery, screen, or any other feature). An ABSA methodology is expected to properly identify the aspects presented inside the sentence and correctly assign a sentiment to each one.
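The input/output behaviour described above can be sketched as a small data model; the `AspectOpinion` type and the hard-coded rule inside `analyse()` are purely illustrative stand-ins for the Transformer models described later, not an actual implementation.

```python
from dataclasses import dataclass

# Illustrative data model for ABSA output: each aspect found in a review
# is paired with a polarity. The rule inside analyse() is a hard-coded
# stand-in for a trained model, used only to show the expected shape.
@dataclass
class AspectOpinion:
    aspect: str
    polarity: str  # "positive", "neutral" or "negative"

def analyse(review):
    """Toy ABSA stand-in: detect one hard-coded aspect/polarity pair."""
    if "battery" in review and "great" in review:
        return [AspectOpinion("battery", "positive")]
    return []

print(analyse("The battery life of this laptop is great"))
```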
Given the global reach of social media, people from all around the world have made it a daily part of their lives to share content online and communicate with other people. It is only logical to assume that a big portion of the data shared online is posted in the creators' native language. In the literature there are several studies that analyze text written in English [8,13,24,26], but as mentioned by [15] in a recent study, Natural Language Processing (NLP) still faces many challenges when dealing with low-resource languages, since there is a significant lack of useful training data. Furthermore, the scarcity of expert linguists who can assist in producing these datasets leads to a shortage of publicly available supervised data. In this paper we aim to alleviate the issues that persist in ABSA tasks for most low-resource languages. We propose an end-to-end methodology for the Greek language, a low-resource language, that achieves strong results when compared to other methodologies. Additionally, the proposed methodology can be applied to other low-resource languages without requiring any significant modifications.

2 Related Work

In recent years, a plethora of research studies have been conducted on sentiment analysis, extracting an author's opinion at the document, sentence or aspect level [9]. However, the vast majority of these studies focused on rich-resource languages such as English. Perhaps the difficulties of languages such as Greek discourage researchers from building tools, since the grammar and vocabulary of the Greek language are far more complex than those of English and, on top of that, not many resources exist online, including datasets and annotations.

2.1 Aspect-Based Sentiment Analysis in English

Several approaches have been proposed to address the task of Aspect-based Sentiment Analysis in the English language. Pre-trained language models, such as BERT [5], have been used in the most recent methods, achieving outstanding performance while offering the advantage of eliminating the need for high-level feature engineering. For instance, Li et al. [13] introduced a method called GBCN that focuses on enhancing the BERT representation with contextual features. The sentences are given as input to the BERT encoding layer and the context-aware layer separately to create the BERT representation and the additional context-aware embeddings. A gating layer follows to integrate the BERT representations with the context-aware embeddings. The proposed method was applied on the SentiHood and SemEval 2014 Task 4 datasets and outperformed the compared approaches, highlighting the importance of contextual and semantic information. Additionally, Xu et al. [24] compared the performance of pre-trained BERT weights with post-trained BERT weights on three tasks: review reading comprehension, aspect extraction and aspect sentiment classification. They intended to develop a post-training strategy that utilizes domain and task knowledge. To this end, they fine-tuned BERT (BERT-PT) with the Masked Language Model (MLM) and Next Sentence Prediction (NSP) objectives. MLM contributed to the integration of domain knowledge and to the elimination of the bias of the knowledge deriving from Wikipedia. On the other hand, NSP allowed the exploitation of the contextual information from the reviews. The proposed strategy achieved strong performance on the SemEval 2014 Task 4 dataset, pointing out the contribution of domain and task knowledge. However, BERT-PT had difficulty dealing with sentences that contain opposite opinions, such as "the screen is good but not for Windows". Another approach that incorporates domain knowledge was proposed by Zhao and Yu [26].
A sentiment knowledge graph is constructed to capture the correlations between the aspects and the corresponding sentiments. This graph is utilized to transform the sentences into a sentence knowledge tree, which is fed into the BERT model. The proposed methodology contributed to the efficient detection of aspects in the novel dataset they utilized, but the performance heavily depends on the quality of the sentiment knowledge tree. Moreover, the

844

G. Aivatoglou et al.

integration of too much external knowledge may introduce noise by changing the meaning of the original sentence. BERT Adversarial Training is a novel architecture introduced by Karimi et al. [8] that utilizes a white-box model for adversarial training to fine-tune the traditional BERT model and the post-trained BERT (BERT-PT) model [24]. Adversarial training helps the network generalize better by creating perturbations in the embedding space, forcing the model to become more robust. Adversarial examples were created from the BERT embeddings using the gradient of the loss and were given as input to the BERT encoder to calculate the adversarial loss. The proposed method outperformed the compared ones on SemEval 2014 Task 4 and SemEval 2016 Task 5, proving that adversarial examples can contribute to a significant improvement of BERT’s performance.

2.2 Aspect-Based Sentiment Analysis in Greek

Despite the numerous applications of Aspect-based Sentiment Analysis to the English language, only a limited number of works focus on low-resource languages. One reason for this is the lack of data available online. Hence, researchers either have to confront the various problems related to limited data or create their own datasets from scratch [1,10,14,20]. Solakidis et al. [20] automatically collected training data written in the Greek language using web scraping mechanisms on public forums. Moreover, they annotated the data using heuristic rules based on the emoticons and the keywords inside the documents. For word representations they used the count frequencies of the words inside the documents and experimented with n-grams. They then trained and evaluated five different classifiers, namely Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, Linear Support Vector Machines and K-Nearest Neighbors. Moreover, Korovesis et al. [10] crawled Greek documents from various sources such as Twitter, news articles, Facebook and blogs. The annotation of the data was completed by five annotators, who underwent training for that specific task. Similarly, Alexandridis et al. [1] collected Greek documents from various social media sources and manually annotated them. They utilized pre-trained Greek embeddings from FastText, Greek-BERT and GPT-2. Furthermore, Liapakis et al. [14] mined reviews from randomly selected popular companies and extracted aspects using a POS tagger. On the other hand, Athanasiou et al. [2], in order to deal with the low-resource nature of the Greek language, translated the Greek tokens into English ones. Hence, by using gradient boosting machines they were capable of outperforming many well-known sentiment analysis methodologies.
Considering the above research studies, it is obvious that most of the works targeting the Greek language are related to sentiment classification, due to the lack of labeled aspects and the manual effort that is needed in order to extract them. Although all the aforementioned studies presented strong results and novel methodologies, they did not offer an end-to-end solution for aspect-based sentiment analysis on Greek documents that covers both aspect extraction and sentiment classification. To that end, our proposed methodology introduces a novel end-to-end approach that performs both aspect extraction and aspect-based sentiment classification. The proposed solution eliminates the need for manual extraction of aspects and, furthermore, uses an innovative ensemble architecture which exploits two distinct pre-trained language models, a Greek and a multilingual one. Although the need for a Greek language model is self-evident, since the documents of the dataset were written in Greek, the need for the multilingual one is demonstrated through the data analysis in Sect. 3.

End-to-End Aspect Extraction and Aspect-Based Sentiment

3 Dataset

As low-resource languages by definition lack the required resources to effectively tackle any NLP task, a novel dataset was gathered and annotated in collaboration with a social media monitoring company. The dataset consists of a series of texts, mainly comments, posts and reviews, from different users, gathered from a variety of social media platforms. These texts refer to a set of predefined aspects, with the users expressing their opinions freely about one or more of them. The data gathering phase took place between October and November 2021, collecting texts written in the Greek language from three different social media platforms: Twitter, Facebook and YouTube. After the data had been gathered, two expert annotators were appointed for the task of correctly labeling the dataset. Their work was broken down into two steps, both carried out by each annotator. The first task was to identify all the aspects in each of the texts provided, while the second was to assign a sentiment value to each of these aspects; most instances included either one or two aspects. The novel dataset consists of 9700 instances, with each instance having four columns, namely ‘ID’, ‘Text’, ‘Aspects’ and ‘Sentiments’. The column ‘ID’ uniquely identifies each instance, and the column ‘Text’, of type String, includes the text as it was originally posted without any preprocessing. Finally, the columns ‘Aspects’ and ‘Sentiments’ contain all the aspects and their respective sentiments identified and labeled by the expert annotators.
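The four-column schema described above can be illustrated as follows; this is a purely hypothetical instance (the field names follow the paper, the text and labels are ours):

```python
# Illustrative sketch of one dataset instance; values are hypothetical.
instance = {
    "ID": 42,
    "Text": "Η μπαταρία είναι εξαιρετική αλλά η οθόνη μέτρια",
    "Aspects": ["μπαταρία", "οθόνη"],
    "Sentiments": ["positive", "negative"],
}

def is_valid(inst):
    """Each labeled aspect must carry exactly one valid sentiment label."""
    return (len(inst["Aspects"]) == len(inst["Sentiments"])
            and all(s in {"positive", "negative", "neutral"}
                    for s in inst["Sentiments"]))
```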

3.1 Dataset Preprocessing and Augmentation

Our initial preprocessing step was to remove any instances that did not fit the scope of the task. This includes instances without aspects, since there were a few cases where a text did not include an aspect, as well as cases with minor errors mainly caused by the manual annotation. Following that, we used Cohen’s kappa [12] to measure the inter-rater agreement. The dataset reached a kappa score of 85%, indicating an almost perfect agreement between the two annotators. This helped us verify that the annotation process was done successfully, so we could continue with our experimental stage.
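As a refresher, Cohen’s kappa compares the observed agreement with the agreement expected by chance. A minimal pure-Python sketch (the annotation labels below are hypothetical, not from the dataset):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations over four aspects:
ann1 = ["pos", "pos", "neg", "neu"]
ann2 = ["pos", "neg", "neg", "neu"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.636
```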


Continuing with the text preprocessing stage, we experimented with numerous approaches and concluded that we achieved better results when we removed all special characters from the texts and the aspects alike, including the Greek accents, before feeding the data into the models. Regarding the language of the documents, the majority were written explicitly in Greek, while others were written in English or contained English words. Specifically, 99% of the dataset was written in the Greek language, while the remaining 1% was written explicitly in English. On the other hand, 80% of the total dataset contained at least one English word per document. Thus, as noted in Sect. 2.2, we utilized both Greek-BERT [11] and XLM-RoBERTa [3], which is a multilingual model, in order to better capture the context of the words in the word embeddings. After the initial preprocessing was completed, the label distribution was investigated. As depicted in Fig. 1, the majority of the documents belonged to the neutral class, while the two other classes were under-populated. Dataset imbalance is a major problem for many real-world applications; it refers to the unequal distribution of the samples across the different classes of the problem.
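The character-level cleanup described above (removing special characters and Greek accent marks) can be sketched as follows; this is a minimal illustration under our own assumptions, not the authors’ exact pipeline:

```python
import re
import unicodedata

def preprocess(text):
    """Lowercase, strip Greek accents/diacritics, drop special characters."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace anything that is not a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", no_accents)
    return re.sub(r"\s+", " ", cleaned).strip().lower()

print(preprocess("Τέλειο προϊόν!!!"))  # → τελειο προιον
```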

Fig. 1. Original training set distribution.

Hence, in order to tackle any issues that might lead to poor classification results due to class imbalance, we proceeded with two combined approaches for text augmentation, in order to over-sample the minority classes and obtain a more balanced distribution. The first approach is based on the MASK prediction capabilities of the BERT language model. BERT is trained on two different tasks, namely Masked Language Model (MLM) and Next Sentence Prediction (NSP). Since BERT is already trained on the MLM task, we used it to predict randomly placed MASK tokens inside the documents. Specifically, we inserted MASK tokens between randomly selected sequential words of each document. Thus, BERT was able to assign new words that could potentially lead to a different dataset with new information.
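The insertion step can be sketched as below; filling the inserted [MASK] slots would then be delegated to a pre-trained masked language model (e.g. a fill-mask pipeline), which is omitted here. The helper name and interface are our own:

```python
import random

def insert_masks(text, n_masks, rng=None):
    """Insert [MASK] tokens between randomly chosen pairs of adjacent words."""
    rng = rng or random.Random()
    words = text.split()
    # Gaps between sequential words where a [MASK] may be inserted.
    gaps = rng.sample(range(1, len(words)), k=min(n_masks, len(words) - 1))
    for gap in sorted(gaps, reverse=True):
        words.insert(gap, "[MASK]")
    return " ".join(words)

masked = insert_masks("the battery life is great", 2, random.Random(0))
# e.g. "the battery [MASK] life [MASK] is great" (positions are random)
```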


The last approach was based on Neural Machine Translation. Specifically, we leveraged the capabilities of Opus-MT [22]. Opus-MT models are trained with state-of-the-art transformer-based neural machine translation architectures on freely available corpora collected from the OPUS [21] repository. The model architecture is a transformer with six self-attention layers in both the encoder and the decoder network and eight attention heads in each layer. Since Opus-MT offers a plethora of models, both multilingual and language-specific, we concluded that three specific models fit our case: Greek ↔ French, Greek ↔ Finnish and Greek ↔ Swedish. The selection of these models was based on their BLEU [16] (Bilingual Evaluation Understudy) score, a metric for the automatic evaluation of machine-translated text. The BLEU scores per model are reported in Table 1.

Table 1. BLEU score per translation pair.

Language           BLEU score
Greek ↔ French     63.0
Greek ↔ Finnish    25.3
Greek ↔ Swedish    23.6
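Back-translation augmentation then translates each Greek text into a pivot language and back, yielding a paraphrase. A minimal sketch with the translation models injected as plain callables; in practice these would be the Opus-MT pairs above, while the toy stand-ins in the example are ours:

```python
def back_translate(text, to_pivot, from_pivot):
    """Round-trip a text through a pivot language to obtain a paraphrase."""
    return from_pivot(to_pivot(text))

def augment(texts, translator_pairs):
    """Generate one paraphrase per text per pivot-language pair."""
    return [back_translate(t, fwd, bwd)
            for t in texts
            for fwd, bwd in translator_pairs]

# Toy stand-ins for a Greek ↔ pivot model pair (real models would paraphrase):
fake_fwd = lambda s: s.upper()   # pretend translation into the pivot language
fake_bwd = lambda s: s.lower()   # pretend translation back to Greek
print(augment(["Καλό προϊόν"], [(fake_fwd, fake_bwd)]))  # → ['καλό προϊόν']
```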

Hence, after applying the aforementioned techniques to balance the training set, the new distribution is depicted in Fig. 2.

Fig. 2. Re-Sampled training set distribution.

Finally, after transforming the dataset into a new one with a balanced distribution among the three classes, we were ready to proceed with the rest of the proposed end-to-end approach.

4 Architecture

This section provides a complete overview of the proposed architecture. It consists of two distinct models, one for aspect extraction and one for aspect-based sentiment classification. The two models act as a single entity, since the output of the first is the input to the second without further preprocessing or other interventions. Hence, we eliminate the need for data engineering, which is a challenging and time-consuming process. The architecture of the end-to-end system for aspect extraction and sentiment classification is depicted in Fig. 3.

Fig. 3. Inference procedure of the proposed methodology.

4.1 Aspect Extraction

For the aspect extraction task we experimented with two different architectures; in both, the position, token, and segment embeddings were given as input to the BERT model, followed by a different output layer. In the first architecture, a linear layer with a softmax activation function was utilized as the final layer of the network. In this case, the classifier makes a local decision at each token of the sequence, without taking into consideration the predictions made at other points in the sequence. On the other hand, in the second architecture, a


CRF output layer was employed, aiming to incorporate the constraint rules that exist among the labels and ensuring that invalid predicted label sequences are reduced. The CRF layer is capable of converting the BERT output into the optimal sequence of labels.
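The effect of such transition constraints can be illustrated with a tiny Viterbi decoder over IOB labels. This is an illustrative sketch of constrained decoding, not the exact CRF layer of the paper; the emission scores and transition table are hypothetical:

```python
def viterbi(emissions, transitions, labels):
    """Find the highest-scoring label sequence under transition constraints.

    emissions: per-token dicts mapping label -> score.
    transitions: dict mapping (prev_label, label) -> score; "<s>" marks start.
    """
    best = {l: (transitions.get(("<s>", l), 0.0) + emissions[0][l], [l])
            for l in labels}
    for em in emissions[1:]:
        best = {
            l: max(
                ((score + transitions.get((p, l), 0.0) + em[l], path + [l])
                 for p, (score, path) in best.items()),
                key=lambda t: t[0],
            )
            for l in labels
        }
    return max(best.values(), key=lambda t: t[0])[1]

LABELS = ["O", "B", "I"]
BAN = -1e9  # forbid an I that is not preceded by B or I
TRANS = {("O", "I"): BAN, ("<s>", "I"): BAN}
emissions = [{"O": 1.1, "B": 1.0, "I": 0.0},
             {"O": 0.0, "B": 0.1, "I": 1.0}]
# Greedy per-token argmax would output the invalid sequence O, I;
# the constrained decoder prefers the valid sequence B, I instead.
print(viterbi(emissions, TRANS, LABELS))  # → ['B', 'I']
```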

4.2 Aspect-Based Sentiment Analysis

Considering the extracted aspects, the dataset now includes both the golden aspects and the aspects extracted by the first part of our end-to-end system. Hence, we performed aspect-based sentiment classification on both golden and extracted aspects in order to have a proper comparison. Since our work is based on an ensemble methodology, we utilized two different Transformer-based models for the word embeddings of the documents. Specifically, we utilized Greek-BERT and XLM-RoBERTa in order to create proper, context-aware embeddings for each document in the dataset. Greek-BERT was released as a version of the original BERT trained explicitly on the Greek language. The architecture consists of 12 layers, a hidden size of 768 and 12 self-attention heads, leading to a total of 110 million parameters. On the other hand, XLM-RoBERTa is a multilingual model trained on 100 different languages, with the difference that it uses dynamic masking during training. In order to properly attach aspects to the documents, we utilized the auxiliary-sentence capabilities of both Greek-BERT and XLM-RoBERTa. Thus, we appended the aspects to the documents and performed tokenization with both models; each text and aspect in a single document were separated by a special token included in the vocabulary of both tokenizers. For the classification part, we utilized the embeddings of both models, each followed by a different fully connected layer. For the evaluation of the results we utilized the macro-averaged F1 score, since we were interested in the performance on all three classes. As is typical in an ensemble methodology, the predictions of both models were averaged in order to reach a final decision. The ensemble methodology helped us to better encapsulate the vague and diverse social media vocabulary into proper, contextualized word embeddings.
Moreover, as already discussed, although the dataset was captured with an explicit focus on the Greek language, English words were present in the documents. Thus, the multilingual XLM-RoBERTa model captured those words better, leading to better word embeddings.
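The auxiliary-sentence construction and the ensemble averaging described above can be sketched as follows; the separator token, the example text and the probability values are illustrative, not taken from the trained models:

```python
def build_pair(text, aspect, sep="[SEP]"):
    """Attach an aspect to its document as an auxiliary sentence."""
    return f"{text} {sep} {aspect}"

def ensemble_predict(probs_a, probs_b,
                     classes=("negative", "neutral", "positive")):
    """Average the class probabilities of two models and pick the argmax."""
    avg = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
    return classes[max(range(len(avg)), key=avg.__getitem__)]

pair = build_pair("Η μπαταρία κρατάει μέρες", "μπαταρία")
# Hypothetical class probabilities from the two models for this pair:
greek_bert = [0.10, 0.30, 0.60]
xlm_roberta = [0.20, 0.45, 0.35]
print(ensemble_predict(greek_bert, xlm_roberta))  # → positive
```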

5 Experiments

The proposed methodology was trained as two distinct entities. Both entities were trained on the same training set, since the original dataset was split into train, test and validation sets. Hence, the outcome is an end-to-end system able


to extract aspects from raw text documents and then classify them at the aspect level into positive, negative or neutral, considering the author’s emotional intent towards each aspect.

5.1 Experimental Setup

The utilized dataset consists of text mainly written in the Greek language; therefore, the Greek BERT pre-trained language model by Koutsikakis et al. [11] was leveraged. The “bert-base-greek-uncased-v1” model was pre-trained on three datasets: (1) the Greek subset of Wikipedia (https://el.wikipedia.org/wiki/), (2) the Greek part of the European Parliament Proceedings Parallel Corpus (https://www.statmt.org/europarl/) and (3) the Greek part of OSCAR (https://oscar-corpus.com/), which is a cleansed version of Common Crawl (https://commoncrawl.org/). The model was trained with 12 Transformer layers and 12 attention heads with 110M parameters, while the output is a vector of size 768. The training process was conducted for 1 million training steps with the Adam optimizer and a learning rate of 1e-4. On top of that, a second language model was leveraged for the sentiment classification task. Specifically, XLM-RoBERTa was used as a multilingual pre-trained language model. XLM-RoBERTa was trained on 2.5 TB of filtered CommonCrawl [23] data covering 100 different languages, including Greek. It is crucial to mention that we used the base version of both language models in this study. As discussed in Sect. 4.2, the base versions of BERT and XLM-RoBERTa use 12 layers, a hidden size of 768 and 12 self-attention heads, compared to the 24 layers, hidden size of 1024 and 16 self-attention heads of the large models. Moreover, we used an 80%-10%-10% rule for the split of the dataset into training, test and evaluation sets. Hence, the training set consisted of 15,344 samples, while the test and validation sets consisted of 880 and 863 samples respectively. The split was necessary in order to evaluate the architecture during training, and to test its generalization capabilities after the training process. For the evaluation, we used the F1-score, which is the harmonic mean of precision and recall and a good indicator for imbalanced datasets. Specifically, precision equals the number of true positives divided by the number of true and false positives, while recall equals the number of true positives divided by the number of true positives and false negatives. Moreover, we utilized the F1-macro score, which computes the unweighted mean of all the per-class F1-scores, since we were interested in the accuracy on all three classes.

Aspect extraction was formulated as a sequence labeling task using the IOB2 scheme, which was introduced by Tjong Kim Sang and Veenstra [19]. Each token is represented using a label from the set {I, O, B}. The B label indicates that the token appears at the beginning of an aspect term, the I label indicates that the token is inside an aspect term, while the O label indicates that the token is outside an aspect. The IOB1 and BIOES schemes were also tested, but did not contribute to any performance improvement. The proposed methodology for the aspect extraction task utilized the Greek-BERT model, whose output is the raw hidden states without any layer on top. Consequently, as stated in Sect. 4.1, experiments were carried out with two different final output layers: a linear layer with softmax activation function and a CRF layer. The BERT tokenizer (https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer) was employed to convert the words to tokens. Moreover, the aspect extraction models were trained with the Adam optimizer and a learning rate of 3e-5. The learning rates {1e-5, 2e-5, 4e-5} were also tested, but did not contribute to any performance improvement. The models were trained for 5 epochs, while early stopping was also applied, terminating the training process when the calculated loss did not show any improvement for more than 3 epochs. Furthermore, the training used a batch size of 32. After retrieving the output of the aspect extraction model, a decoding process was applied to extract the original aspect terms from the IOB2 format and give them as input to the sentiment classification model. Regarding the sentiment classification, we fine-tuned both Greek-BERT and XLM-RoBERTa on the same corpus, derived after the re-sampling of the training set. This was crucial in order to have a robust end-to-end approach trained and evaluated on the same dataset splits. It should be mentioned that during the fine-tuning process we kept the number of epochs as low as 2, the batch size at 8 and the learning rate at 3e-5, as proposed in the original work by the authors of Greek-BERT [11]. After several experiments, we found Greek-BERT to be superior compared to other multilingual models for Greek data.
On top of that, the combination with a multilingual model such as XLM-RoBERTa increased the confidence and accuracy of the model on the specific aspect-based sentiment classification task. The reason behind this finding is that, although our dataset was mainly written in Greek, there were a few English words inside the Greek sentences. Considering the aspect-level sentiment classification task, we experimented with two completely different architectures in order to find the best solution to the problem. The first approach leveraged both Greek-BERT and XLM-RoBERTa as previously discussed, and then a fully connected layer was applied to the pooled outputs of both language models; the final prediction was obtained by averaging, yielding an ensemble prediction. On the other hand, the second approach utilized only Greek-BERT as a pre-trained language model. The 12 output sequence layers of the pre-trained language model were grouped into groups of 4, and 3 GRU layers were introduced, one for each group. Finally, the outputs of the GRUs were averaged and 3 fully connected layers were applied for the final classification. The GRUs had a hidden size of 512 units, 2 layers and were bidirectional. On top of that, a dropout probability of 25% was applied to all the GRUs. The optimizer used for the weight updates through backpropagation in both experiments was AdamW with a weight decay of 0.1. AdamW is a variant of the Adam optimizer with an improved implementation of weight decay. Moreover, Cross Entropy Loss was selected as the loss function, since it produced the best results for the 3 classes of the classification problem. Additionally, we employed a dropout probability of 10% for the first approach and of 50% for the second, just before the fully connected layers. Furthermore, in order to avoid overfitting we utilized early stopping as an additional regularization technique, triggered when the validation loss stopped decreasing. Finally, the experiments ran on an NVIDIA GeForce RTX 3060 with 12 GB of VRAM, 64 GB of memory and an Intel Core i5-11600K CPU. The source code of the proposed methodology will be made publicly available upon publication.
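The IOB2 decoding step described in this subsection, which turns predicted tag sequences back into aspect terms, can be sketched as follows (the function name and example are ours):

```python
def decode_iob2(tokens, tags):
    """Recover aspect terms from IOB2 tags: B starts a term, I continues it."""
    aspects, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                aspects.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # O, or a stray I without a preceding B
            if current:
                aspects.append(" ".join(current))
            current = []
    if current:
        aspects.append(" ".join(current))
    return aspects

tokens = ["η", "διάρκεια", "μπαταρίας", "είναι", "καλή"]
tags = ["O", "B", "I", "O", "O"]
print(decode_iob2(tokens, tags))  # → ['διάρκεια μπαταρίας']
```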

5.2 Experimental Results

In Table 2, the Precision, Recall and F1-scores from the experiments on the aspect extraction task are reported. As described in Sect. 4.1, experiments were conducted with two architectures which utilize the Greek-BERT model with different output layers, specifically (1) a softmax output layer and (2) a CRF output layer. The performance of the two architectures was similar for the B label. On the other hand, a differentiation is observed for the I label: an F1-score of 75% was achieved with the BERT+Softmax architecture and of 73% with the BERT+CRF architecture. Overall, the highest performance was achieved by the BERT+Softmax model, with a macro F1-score of 80%.

Table 2. Comparison results between the aspect extraction architectures (BERT+Softmax / BERT+CRF).

Class           Precision    Recall       F1-score
B               90% / 91%    81% / 80%    85% / 85%
I               82% / 76%    69% / 71%    75% / 73%
Macro average                             80% / 79%
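As a quick sanity check on the reported numbers, F1 is the harmonic mean of precision and recall, and the macro average is the unweighted mean of the per-class F1 scores:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores) / len(scores)

# BERT+Softmax column of Table 2:
f1_b = f1(0.90, 0.81)   # ≈ 0.85
f1_i = f1(0.82, 0.69)   # ≈ 0.75
print(round(macro_f1([f1_b, f1_i]), 2))  # → 0.8
```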

On the other hand, the comparison between the two aspect classification architectures is depicted in Table 3, where the Precision, Recall and F1-scores per class, together with the accuracy and the total macro average score, are reported for both the BERTGRU and the ensemble architecture.


Table 3. Comparison results between the aspect classification architectures (BERTGRU / ensemble).

Class           Precision    Recall       F1-score
Negative        60% / 61%    54% / 52%    57% / 56%
Neutral         80% / 80%    83% / 85%    82% / 83%
Positive        55% / 64%    54% / 59%    55% / 62%
Accuracy                                  73% / 75%
Macro average                             64% / 67%

As clearly depicted, the ensemble architecture surpassed the BERTGRU architecture on most of the validation metrics and for all the classes of the problem. Overall, it achieved an F1-macro average score of 67%, surpassing the BERTGRU architecture by 3%. Specifically, the ensemble architecture achieved a higher F1-score on the neutral class (by 1%) and on the positive class (by 7%). Although it achieved a lower F1-score on the negative class (by 1%), the ensemble architecture generalized better considering the overall accuracy. Hence, after the comparison, we proceeded with BERT+Softmax for the aspect extraction task and the ensemble architecture for the aspect classification task in our end-to-end system.

5.3 Error Analysis

In order to properly evaluate our end-to-end system, we tested it both on the extracted and on the golden aspects. Hence, we were able to measure the error propagated from aspect extraction to sentiment classification. The classification reports of the experiments can be found in Table 4, and the confusion matrices for the golden and the extracted aspects in Fig. 4. As illustrated, the model achieved an F1-macro score of 67% for the golden aspects and of 65% for the extracted ones. This is reasonable considering the propagated error from the aspect extraction stage of the proposed end-to-end system. Moreover, the model performed best on the neutral class, since no data augmentation techniques were needed for it and it had the most unique training examples compared to the rest of the classes. Finally, considering the 2% F1-macro average difference between the two test sets (golden vs extracted aspects), we can conclude that the aspect extraction task of our end-to-end approach was capable of extracting most of the aspects correctly.

5.4 Performance Comparison

In order to compare and further test the proposed methodology, we proceeded to a comparison with the work of Dai et al. [4], entitled “Does syntax matter?

Table 4. Comparison results on golden versus extracted aspects (golden / extracted).

Class           Precision    Recall       F1-score
Negative        61% / 61%    52% / 53%    56% / 57%
Neutral         80% / 80%    85% / 84%    83% / 82%
Positive        64% / 58%    59% / 55%    62% / 57%
Accuracy                                  75% / 74%
Macro average                             67% / 65%

Fig. 4. Confusion matrices on golden (left) versus extracted (right) aspects for the best model.

A strong baseline for Aspect-based Sentiment Analysis with RoBERTa”. Their work incorporates RoBERTa followed by a Multi-Layer Perceptron (MLP) and focuses on aspect-level classification based on already labeled aspects. Moreover, to the best of our knowledge, no study has been conducted proposing an end-to-end system on textual data for aspect extraction and aspect-based sentiment classification. Hence, the comparison with the study of Dai et al. [4] was conducted only on aspect-level sentiment classification. Furthermore, in order to make our dataset compatible with their architecture, we utilized the Greek trained pipeline of spaCy [6], and we restructured our dataset from comma-separated values to the specific JSON format used in their implementation. Since XLM-RoBERTa can handle the Greek language, we fine-tuned their architecture on our training set and used it for inference on our test set. Specifically, their architecture was fine-tuned on our dataset using the golden aspects, while it was evaluated on both the golden and the predicted aspects from our end-to-end approach. This way, we were able to make a fair comparison between the two approaches on the task of aspect-level sentiment classification. The experiments on their architecture were conducted for a total of 20 epochs, and we kept each epoch’s model in order to finally select the best predictor according to the highest F1-macro score. On top of that, their architecture utilized AdamW as the optimizer, while for the loss function they used the SmoothLoss. Additionally, they used a learning rate of 2e-5, a weight


decay of 1e-2, a batch size of 32 and a dropout layer with a dropout probability of 50%. All the hyper-parameters were left as stated in the original work in order to fairly compare the two approaches. The confusion matrices of the architecture of Dai et al., fine-tuned on our training set and evaluated on both test sets, can be found in Fig. 5.

Fig. 5. Dai et al. confusion matrix on golden (Left) versus extracted (Right) aspects.

As depicted in the confusion matrices of Fig. 5, their architecture achieved more true positives on the golden aspects compared to the extracted aspects. These results were expected, given the propagated error from our aspect extraction task. Furthermore, we continued the comparison between the two systems using the F1-macro score. Overall, their methodology achieved an F1-macro average score of 65% for both the golden and the extracted aspects. The final comparison of our methodology and the study of Dai et al. on both test sets (golden and extracted aspects) can be found in Table 5.

Table 5. Comparison results between the proposed and Dai et al. approaches (F1-macro score).

Test set    Proposed approach    Dai et al.
Golden      67%                  65%
Extracted   65%                  65%

As depicted, our end-to-end approach surpassed the methodology of Dai et al. by 2% on the golden aspects, while it achieved the same F1-macro score on the extracted aspects. Again, it should be noted that this comparison was established between the sentiment classifier of our end-to-end approach and the sentiment classifier of Dai et al., due to the lack of an aspect extraction task in the latter. Considering the results, our proposed end-to-end system for aspect extraction and aspect-based sentiment classification surpassed the methodology of Dai et al. for aspect-level sentiment classification, achieving a higher F1-macro average score. On top of that, the proposed end-to-end system deals with both aspect extraction


and sentiment classification, in contrast to most solutions, which offer only the latter. Finally, another great aspect of this work is that it is able to deal with low-resource languages, such as Greek, and potentially with others.

6 Discussion

In this paper, we introduced an end-to-end system for the tasks of aspect extraction and aspect-based sentiment classification. The overall architecture of our system combines two models into a single entity, minimizing the need for data engineering between the processes. Alongside it, we leveraged a novel dataset in Greek, which we utilized for both of these tasks, and we additionally produced a series of methodologies that can successfully counter the imbalance of the dataset. After thorough experimentation, we found that our model outperforms the one presented in the work of Dai et al. [4] by 2% on the golden aspects, while achieving the same F1-macro score on the extracted aspects. The comparison with their work was done on aspect-level classification, since, to the best of our knowledge, there is no work proposing an end-to-end approach for aspect extraction and aspect-based sentiment classification. Based on the produced results, it is safe to conclude that the proposed system both surpasses the work of Dai et al. [4] and offers an end-to-end system that combines the tasks of aspect extraction and aspect-based sentiment classification.

7 Conclusions

In this paper, we proposed an end-to-end aspect-based sentiment analysis methodology for a low-resource language, Greek. For this purpose, the base Greek-BERT model with a softmax output layer was utilized for the aspect extraction subtask, while an ensemble methodology consisting of Greek-BERT and XLM-RoBERTa was implemented for the aspect-based sentiment analysis task. We also introduced two combined approaches for text augmentation to properly handle class imbalance. As highlighted by the experimental results, the proposed methodology outperformed the compared state-of-the-art approach.

References

1. Alexandridis, G., Varlamis, I., Korovesis, K., Caridakis, G., Tsantilas, P.: A survey on sentiment analysis and opinion mining in Greek social media. Information 12(8), 331 (2021)
2. Athanasiou, V., Maragoudakis, M.: A novel, gradient boosting framework for sentiment analysis in languages where NLP resources are not plentiful: a case study for modern Greek. Algorithms 10(1), 34 (2017)
3. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V.: Unsupervised cross-lingual representation learning at scale (2019). arXiv:1911.02116


4. Dai, J., Yan, H., Sun, T., Liu, P., Qiu, X.: Does syntax matter? A strong baseline for aspect-based sentiment analysis with RoBERTa (2021). arXiv:2104.04986
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805
6. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-strength natural language processing in Python (2020). https://doi.org/10.5281/zenodo.1212303
7. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177 (2004)
8. Karimi, A., Rossi, L., Prati, A.: Adversarial training for aspect-based sentiment analysis with BERT. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 8797–8803. IEEE (2021)
9. Kaur, J., Kaur Sidhu, B.: Sentiment analysis based on deep learning approaches. In: 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), pp. 1496–1500. IEEE (2018)
10. Korovesis, K., Alexandridis, G., Caridakis, G., Polydoras, P., Tsantilas, P.: Leveraging aspect-based sentiment prediction with textual features and document metadata. In: 11th Hellenic Conference on Artificial Intelligence, pp. 168–174 (2020)
11. Koutsikakis, J., Chalkidis, I., Malakasiotis, P., Androutsopoulos, I.: GREEK-BERT: the Greeks visiting Sesame Street. In: 11th Hellenic Conference on Artificial Intelligence, pp. 110–117 (2020)
12. Kvålseth, T.O.: Note on Cohen's kappa. Psychol. Rep. 65(1), 223–226 (1989)
13. Li, X., Xingyu, F., Guangluan, X., Yang, Y., Wang, J., Jin, L., Liu, Q., Xiang, T.: Enhancing BERT representation with context-aware embedding for aspect-based sentiment analysis. IEEE Access 8, 46868–46876 (2020)
14. Liapakis, A.: A sentiment lexicon-based analysis for food and beverage industry reviews: the Greek language paradigm (2020). Accessed 20 May 2020
15. Magueresse, A., Carles, V., Heetderks, E.: Low-resource languages: a review of past work and future challenges (2020). arXiv:2006.07264
16. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
17. Pavlopoulos, I.: Aspect based sentiment analysis. Athens University of Economics and Business (2014)
18. Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., Androutsopoulos, I.: SemEval-2015 task 12: aspect based sentiment analysis. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 486–495 (2015)
19. Sang, E.F., Veenstra, J.: Representing text chunks (1999). arXiv:cs/9907006
20. Solakidis, G.S., Vavliakis, K.N., Mitkas, P.A.: Multilingual sentiment analysis using emoticons and keywords. In: 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 2, pp. 102–109. IEEE (2014)
21. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: LREC, vol. 2012, pp. 2214–2218. Citeseer (2012)
22. Tiedemann, J., Thottingal, S.: OPUS-MT - Building open translation services for the World. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal (2020)


23. Wenzek, G., Lachaux, M.A., Conneau, A., Chaudhary, V., Guzman, F., Joulin, A., Grave, E.: CCNet: extracting high quality monolingual datasets from web crawl data (2019). arXiv:1911.00359
24. Xu, H., Liu, B., Shu, L., Yu, P.S.: BERT post-training for review reading comprehension and aspect-based sentiment analysis (2019). arXiv:1904.02232
25. Yang, Yu., Duan, W., Cao, Q.: The impact of social and conventional media on firm equity value: a sentiment analysis approach. Decis. Supp. Syst. 55(4), 919–926 (2013)
26. Zhao, A., Yu, Yu.: Knowledge-enabled BERT for aspect-based sentiment analysis. Knowl.-Based Syst. 227, 107220 (2021)

Attending to Entity Class Attributes for Named Entity Recognition with Few-Shot Learning

Raj Nath Patel1(B), Sourav Dutta1, and Haytham Assem2

1 Huawei Research, Dublin, Ireland {raj.nath.patel,sourav.dutta2}@huawei.com
2 Amazon Alexa AI, Cambridge, UK [email protected]

Abstract. Named Entity Recognition (NER) serves as the foundation for several natural language applications like question answering, chatbots and intent classification. Identification of entity boundaries and their categorization into entity types poses a significant challenge in domain-dependent and low-resource settings with limited training data availability. To this end, we propose AtEnA, a novel NER framework utilizing entity class attributes from an external knowledge source for few-shot learning. We use a two-stage fine-tuning process, wherein a language model is initially trained to "attend" to the different entity class attributes along with the textual context, and is then fine-tuned on the downstream application data with few annotated training examples. Experiments on benchmark NER datasets show AtEnA to perform around 10 F1 score points better than existing NER methodologies, specifically in few-shot, limited-training scenarios.

Keywords: Named entity recognition · Knowledge graph · Few-shot learning · Transformers · Attention mask

1 Introduction and Background

Named Entity Recognition (NER) refers to the identification of text spans that belong to pre-defined semantic classes like person, location, organization, date, etc. NER plays an important role in information extraction and natural language processing applications such as search, text understanding [6,30], summarization [3], question answering [1], and knowledge base construction [9].

Traditional approaches to NER relied on rules based on syntactic-lexical patterns that were manually constructed by linguists and domain experts [29]. Such early frameworks, like NetOWL [13] and SRA [2], demonstrated high precision but suffered from an expensive annotation process and low recall owing to possibly incomplete lexicons. Unsupervised approaches for NER based on clustering [20] were thus proposed using contextual similarity and limited "seed" rules. These methodologies relied on effective heuristics combining corpus statistics and shallow syntactic knowledge [29]. With the advent of supervised learning techniques, NER was cast as a multi-class classification or sequence labeling task. Word-level information (like part-of-speech (POS) tags) [31] and document statistics were used as features for NER using Hidden Markov Models [8], Decision Trees, Support Vector Machines [11], and Conditional Random Fields [14].

Deep learning based contextual representation of text currently forms the state-of-the-art approach to NER. The use of neural models for NER was introduced in [7], wherein Convolutional Neural Networks were employed. Such multi-layered architectures were able to capture latent features like lexical similarity, linguistic dependency, word embeddings, POS tags, etc. [18]. Recurrent Neural Networks like LSTM and GRU were also shown to perform quite well due to contextual information flow across the states [12,15]. A language-model-based sequence tagger was proposed in [22], where word embeddings from pre-trained language models (like BERT) were fed to an LSTM-CRF layer for token classification [21]. In fact, NER has also been studied as a machine reading comprehension task using fine-tuned BERT [17]. This has led to the development of several robust and accurate NER systems within offerings like SpaCy, DeepPavlov,1 and LUKE [27]. Please refer to [10,16,20] for a rich literature review of existing NER techniques.

Challenges. Application-specific entity extraction forms the backbone of user intent understanding, semantic assistance and chatbot solutions in cloud-based offerings like Microsoft LUIS, Google DialogFlow,2 and RASA [4].

(Work done while the author was at Huawei Research, Ireland.)
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 859–870, 2024. https://doi.org/10.1007/978-3-031-47715-7_57
However, similar to the supervised approaches, the performance of these frameworks depends on the presence of annotated user data (in terms of the NER types of the downstream application) for the underlying models to be fine-tuned. In such settings, the requirement for users to provide hundreds of training examples (an expensive process) has severely impacted their practical adoption. Further, industrial applications of NER require such models to adapt to the domain-specificity of the target data, wherein more annotated data is required for proper training convergence.

Contribution. To alleviate the above issues, in this paper we propose AtEnA, a few-shot-learning-based NER framework built on a two-stage fine-tuning process. Our approach leverages NER class attributes from external knowledge sources to provide an attention mechanism for the underlying language model to efficiently learn to identify the different entity types. We report empirical results on 3 benchmark datasets to showcase the efficacy of AtEnA with limited training data, with significant performance improvements on CoNLL.

1 https://spacy.io/ and https://deeppavlov.ai/.
2 www.luis.ai and www.cloud.google.com/dialogflow.

2 AtEnA Framework

In this section we introduce our proposed Attending To Entity Class Attributes (AtEnA) framework. The working of AtEnA hinges on three main operational steps (depicted in Fig. 1) as discussed next.

Fig. 1. Overview of the proposed AtEnA framework

2.1 Entity Class Attribute Extraction

The basic intuition of our framework is based on the observation that entities belonging to the same NER type share common attributes. For example, entities of type "person" (PER) would normally have informational headers like date-of-birth, occupation, spouse, etc., while "location" (LOC) entities would be characterized by geo-coordinates, population, and so forth, which we refer to as "class attributes". It is these entity class attributes that AtEnA is trained to attend to for improved NER. Intuitively, the addition of such information cues would enable a machine learning model to better understand the category to which an entity belongs, and thus potentially boost its performance. Hence, in the first stage, we extract the entity class attributes from Wikipedia info-boxes to create a dictionary D, where the keys are the surface forms of entities and the corresponding values are the info-box attributes. For instance, in our example sentence "Joe Biden calls for peace talk between Russia and Ukraine", we extract {"Russia": 'population', 'GDP'} and {"Ukraine": 'population', 'GDP'} as the LOC class attributes from the Wikipedia info-boxes. Observe that similar entity types would typically share the bulk of such attributes, as discussed earlier.

To this end, for creating the entity class dictionary D, we used the 2021 Wikipedia article dump and considered text spans that have a hyperlink as entities (i.e., keys of D). The corresponding entity attributes are then extracted from the info-box of the corresponding entity Wikipage (pointed to by the hyperlink). As the number of such attributes might vary across entities and types, and potentially lead to the introduction of noise in our model, we sort the extracted class attributes based on their frequency across the entire data, and select the top-k most frequent attributes for each entity type (forming the values in D). Observe that the value of k forms a hyper-parameter in our framework, and its effect on the model performance will be studied later in Sect. 4.1.
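The dictionary construction above can be sketched as follows. The info-box data is a toy stand-in and the function name is hypothetical (the paper builds D from the full 2021 Wikipedia dump and prunes per entity type; for simplicity this sketch prunes by global frequency):

```python
from collections import Counter

# Toy stand-in for Wikipedia info-box extraction: surface form -> info-box headers.
raw_infoboxes = {
    "Russia":    ["population", "GDP", "capital", "time zone"],
    "Ukraine":   ["population", "GDP"],
    "Joe Biden": ["date-of-birth", "occupation", "spouse"],
}

def build_attribute_dictionary(infoboxes, k=2):
    """Keep only the top-k most frequent attributes across all entities,
    mirroring the frequency-based pruning used when building dictionary D."""
    freq = Counter(attr for attrs in infoboxes.values() for attr in attrs)
    top_k = {attr for attr, _ in freq.most_common(k)}
    return {ent: [a for a in attrs if a in top_k] for ent, attrs in infoboxes.items()}

D = build_attribute_dictionary(raw_infoboxes, k=2)
print(D["Russia"])  # ['population', 'GDP']
```

Pruning to the top-k attributes keeps the infused context short and shared across entities of a type, which is exactly what lets the model associate an attribute pattern with an entity class.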

2.2 Attribute Attention Mask for Pre-training

In the second stage, we further pre-train a language model to learn to attend to the entity class attributes extracted in D (as discussed above). To this end, we use LUKE-base [27] as our underlying language model, and train it on the WikiNER dataset.3 Observe that the language model LUKE has been pre-trained with an objective to detect entities from texts, making it an appropriate candidate in our setting. However, any other pre-trained language model can be easily adopted in our AtEnA framework.

Given an input training sentence (for further pre-training), the annotated entities (within the input sentence) are appended with their attributes (from D, if present) to form a sentence-tree, as shown in Fig. 1. This sentence-tree, along with a visibility matrix, forms the input to the transformer architecture for pre-training. Specifically, for the token embedding layer, the tokens in the sentence-tree are flattened into a sequence by their hard-position index, while the soft positions are used for position embedding. The visibility matrix then controls the visibility of surrounding tokens, thereby rendering attention masks to specific positions and tokens, as discussed below.

The visibility matrix enables the infused class attributes (added as sentence-trees) to be visible only to (i.e., attended by) their corresponding entity tokens, thereby enabling AtEnA to learn NER categories by considering the attribute information. In our example sentence (in Fig. 1), the first "<s>" token considers only the full original sentence as its context (since the token is not an entity), thereby looking at tokens with hard positions 0–2, 6–11, 14–15, and 18–19 (marked as red dots in the visibility matrix in the figure). On the other hand, the next tokens, "Joe" and "Biden", forming an entity, not only consider the original sentence but also attend to their associated infused attributes (in hard positions 3–5) as context. Specifically, these tokens consider hard positions 0–11, 14–15, and 18–19 for creating their representations. Note that the above tokens are agnostic to the attributes infused for the other entities in the sentence (i.e., the attributes of "Russia" in hard positions 12–13 are not visible and are not attended to). This procedure of making the NER class attributes (obtained from external knowledge sources) visible only to the relevant entities is performed via the attention visibility matrix, as shown in Fig. 1.

Finally, the representations obtained from the above trained language model (using the attention masks), M, are passed to a biLSTM-CRF layer, which is trained to minimize the training loss based on a sequence tagging (NER) objective function. The above training procedure of AtEnA (infusing external knowledge via a visibility matrix) is adopted from [19].

3 https://metatext.io/datasets/wikiner.
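A minimal sketch of such a visibility matrix is shown below, under the simplifying assumption that attribute tokens are appended after the sentence tokens (the paper instead interleaves them via hard/soft positions); all names and the toy layout are illustrative, not the authors' code:

```python
import numpy as np

def build_visibility_matrix(num_sentence_tokens, attribute_spans):
    """Boolean visibility mask for a flattened sentence-tree.

    attribute_spans maps an entity token index -> list of positions holding
    that entity's infused attribute tokens. Sentence tokens see each other;
    attribute tokens are visible only to (and from) their own entity and to
    the other attribute tokens of the same entity.
    """
    total = num_sentence_tokens + sum(len(v) for v in attribute_spans.values())
    vis = np.zeros((total, total), dtype=bool)
    # All original sentence tokens attend to each other.
    vis[:num_sentence_tokens, :num_sentence_tokens] = True
    for entity_idx, attr_positions in attribute_spans.items():
        for p in attr_positions:
            vis[entity_idx, p] = vis[p, entity_idx] = True  # entity <-> attribute
            for q in attr_positions:
                vis[p, q] = True                            # attribute <-> attribute
    return vis

# 4 sentence tokens; token 1 is an entity with 2 attribute tokens at positions 4, 5.
mask = build_visibility_matrix(4, {1: [4, 5]})
print(mask.astype(int))
```

Passed as an attention mask, this matrix realizes the behavior described above: attribute tokens influence only their own entity's representation and stay invisible to the rest of the sentence.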


However, entities present in the training data might be absent from the external knowledge source, which would potentially obfuscate the model learning, as some entities may have attribute information while other entities (belonging to the same NER class) may not. Thus, to facilitate better model training, for each entity type we differentiate between entities that are present or absent in the knowledge source: we create a dummy subclass for each entity type to model entities that are absent from our dictionary D. For example, if a PER entity does not have any class attributes, we update its NER label to PER_NO-KG during training, while the label for an entity present in D is kept as PER. This reduces the confusion during model training (between entities of the same type, some having attributes and others not) as mentioned earlier. Finally, during prediction, we revert the predicted dummy labels to their original format. That is, both the PER and the PER_NO-KG predictions are mapped back to PER (and similarly for the other classes).

Another point to note is that, for simplicity, we extract the entity class attributes using a simple string-matching strategy. That is, the entities present in the input sentence are matched to those present in the constructed dictionary based on exact surface-string matching, and we do not run a disambiguation step here. This might introduce noise in the case of entities with similar names, like Apple Inc. vs. Apple (fruit). To circumvent this problem, off-the-shelf entity disambiguation tools (like AIDA [28] or Welink [5]) can be used. Alternatively, a list of candidate matching entities from D (along with their attributes) can be presented to the language model for automatic disambiguation, as shown in [23]. We plan to investigate this direction of study further in future work.
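The dummy-subclass relabeling can be sketched as a pair of label-mapping helpers. The exact suffix spelling ("_NO-KG") and the function names are assumptions for illustration:

```python
def to_training_label(label, surface_form, D):
    """Split each entity class by knowledge-source coverage: entities absent
    from the attribute dictionary D get a dummy '<CLS>_NO-KG' subclass."""
    if label == "O":
        return label
    return label if surface_form in D else f"{label}_NO-KG"

def to_original_label(predicted):
    """Collapse the dummy subclasses back to the original tag set at prediction time."""
    return predicted.replace("_NO-KG", "")

D = {"Russia": ["population", "GDP"]}
print(to_training_label("LOC", "Russia", D))     # LOC
print(to_training_label("PER", "Joe Biden", D))  # PER_NO-KG
print(to_original_label("PER_NO-KG"))            # PER
```

Since the split exists only to stabilize training, collapsing the labels at inference keeps the evaluation tag set unchanged.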

2.3 Application Fine-Tuning and Inference

The above entity-class-attribute-based pre-trained model M is subsequently fine-tuned on the downstream application data using a limited number of annotated training entities (i.e., few shots), along with their class attributes obtained from D, to adapt the model to the domain of use. This enables domain adaptation on top of the entity attribute information injected into model M in the previous fine-tuning stage.

Observe that during inference of sentence sequence tags, the entities are obviously not known a priori, which poses a challenge in infusing the attributes that need to be fetched from D. To bypass this issue, we consider an n-width sliding-window heuristic for entity matching. That is, given a sentence S (for tagging), we initially try to match n adjacent tokens to an entity in D and fetch the matching entity attributes. If there is no match, we try with n−1 tokens and so forth, until a match is found (if possible), and the corresponding attributes are inserted into S (as a sentence-tree) and sent to M for sequence label prediction.

We next discuss the experimental setup and evaluation to showcase the efficacy of our proposed framework over existing approaches for the NER sequence labeling task.
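The n-width sliding-window lookup can be sketched as follows; the helper is hypothetical and assumes whitespace-tokenized surface forms:

```python
def match_entity(tokens, start, D, n=5):
    """Greedy n-width sliding-window lookup: try the widest span first,
    shrinking until a surface form found in D matches (or nothing does)."""
    for width in range(min(n, len(tokens) - start), 0, -1):
        span = " ".join(tokens[start:start + width])
        if span in D:
            return span, D[span]
    return None, None

D = {"Joe Biden": ["date-of-birth", "occupation"], "Russia": ["population", "GDP"]}
tokens = "Joe Biden calls for peace talk between Russia and Ukraine".split()
span, attrs = match_entity(tokens, 0, D)
print(span, attrs)  # Joe Biden ['date-of-birth', 'occupation']
```

Trying the widest window first matters: starting at "Joe", the 2-token span "Joe Biden" should win over the single token "Joe" even if both were dictionary keys.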

3 Experimental Setup

In this section, we present the experimental setup for evaluating our AtEnA framework on the different datasets.

Table 1. Dataset description for the 3-NER-tag setting. Overlap measures the percentage of entities in the data that had attribute entries in our dictionary D constructed from the Wikipedia dump.

Dataset  Setting    # words   # entities   % overlap
CoNLL    Dev         51362      7335       48.14
         Test        46435      7194       45.77
         5 shots       274        35       28.57
         20 shots      750        83       44.58
         50 shots     1554       191       36.65
Kaggle   Dev         94304     10459       42.86
         Test       105207     11453       42.26
         5 shots       210        27       22.22
         20 shots      583        65       29.23
         50 shots     1995       180       39.44
NLU      Dev          6895       257       29.18
         Test         7488       272       25.37
         5 shots       997        44       34.09
         20 shots     2919       109       32.11
         50 shots     5984       221       33.03

As mentioned before, we use the WikiNER data to pre-train our underlying LUKE language model to better understand NER types, based on attention over entity class attributes. For training, we empirically set the number of injected attributes to k = 2, i.e., the top-2 entity class attributes based on frequency are added to each entity to construct the sentence-tree. Finally, for AtEnA we fix the n-gram window size (to detect entities from the dictionary) during inference at n = 5.

We empirically evaluate AtEnA on 3 benchmark NER datasets: (i) CoNLL-2003 [25], (ii) NLU [26], and (iii) Kaggle.4 In our setup, we mainly focus on 3 major NER tags: person (PER), location (LOC), and organization (ORG). For completeness, we also include the miscellaneous (MISC) entity type, to see how well our model generalizes to other types, possibly with no external attribute infusion. For consistency, entities other than PER, LOC, and ORG present in the NLU and Kaggle data were considered as MISC. As our few-shot setup, for each dataset we consider only 5, 20, and 50 entities of each type to be annotated for the final model fine-tuning stage. Please refer to Tables 1 and 2 for the dataset characteristics.

We compare the performance of AtEnA with 3 state-of-the-art techniques, namely (a) LUKE [27], (b) DeepPavlov, and (c) SpaCy, and evaluate using the micro F1 score for named entity extraction and classification.

Table 2. Dataset description for the 4-NER-tag setting. Overlap measures the percentage of entities in the data that had attribute entries in our dictionary D constructed from the Wikipedia dump.

Dataset  Setting    # words   # entities   % overlap
CoNLL    Dev         51362      8603       42.69
         Test        46435      8112       41.80
         5 shots       274        45       28.89
         20 shots      750       107       39.25
         50 shots     2065       305       37.38
Kaggle   Dev         94304     14439       33.25
         Test       105207     16074       32.08
         5 shots       210        36       19.45
         20 shots      583        92       21.73
         50 shots     1995       264       31.06
NLU      Dev          6895      1285       12.29
         Test         7488      1377       13.21
         5 shots       997       170       13.53
         20 shots     2919       541       15.34
         50 shots     5984      1114       14.54

4 https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus.
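The micro F1 metric used for evaluation pools true/false positives over all sentences before computing precision and recall. A minimal sketch over extracted (span, type) pairs (an illustration, not the authors' evaluation code) is:

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over (start, end, tag) triples pooled across sentences.

    gold / predicted: lists of sets of (start, end, tag) tuples, one set per sentence.
    """
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{(0, 2, "PER"), (7, 8, "LOC")}]
pred = [{(0, 2, "PER"), (9, 10, "LOC")}]
print(micro_f1(gold, pred))  # 1 TP, 1 FP, 1 FN -> 0.5
```

Micro-averaging weights every entity mention equally, so frequent types (e.g. PER in CoNLL) dominate the score, unlike macro-averaging over types.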

Training Parameters. As the sequence classification layer, AtEnA uses a Bidirectional-LSTM layer (with 384 units) coupled with a CRF layer (with 768 units), with the input sentence-trees to the language model truncated to 512 subwords. Except for LUKE, all approaches are trained for 20K updates with early stopping at a patience of 2K; LUKE is trained for 200 epochs without early stopping. As the number of annotated examples in the training data is quite small (few-shot setting), we used a batch size of 2. All other parameters are kept at their defaults. For DeepPavlov training, we used the "ner_conll2003_torch_bert.json" configuration with BERT-base, while SpaCy uses "roberta-base" as the transformer encoder. Please note that in all settings, we also train the weights of the underlying encoder along with the added classification layer. All experiments were conducted on a Tesla V100 GPU with 32-bit precision and 32 GB memory.

4 Experimental Results

The empirical results for the different approaches are tabulated in Table 3. We consider two evaluation scenarios, with 3 NER tags and with 4 tags, over the different few-shot settings. We observe that AtEnA performs significantly better than the existing baselines in terms of micro F1 score, in both the 5-shot and 20-shot settings. As the number of training examples is increased (refer to the 50-shot setting), the difference between the approaches reduces, and they perform comparably. Specifically, we observe a significant F1 score improvement of around 15 points for AtEnA on the CoNLL-2003 dataset in the 5-shot setting, with both 3- and 4-tag prediction. For the other datasets, we also observe a healthy performance improvement compared to the other existing NER models. In fact, for the 5-shot experiments, we observed that LUKE (which is our underlying language model) fails to converge in both the 3- and 4-tag prediction scenarios. However, its further pre-training on entity class attributes (as used in AtEnA) enables the model to learn better NER representations, allowing it not only to converge but also to outperform the baselines.

With an increase in the number of training instances, all the models tend to learn well, as the knowledge infusion through the entity class attributes is possibly captured inherently. However, the presence of such large annotated data (for each entity type) is expensive, and in some cases infeasible. Thus, we see that in a few-shot setting with limited training instances, AtEnA provides a better learning framework for NER, as the inclusion of the attention mechanism on the entity type attributes (via the visibility matrix) provides additional cues for efficient sequence tagging. This aligns with our main intuition that entities belonging to the same NER type share common attributes, and that the inclusion of such information positively impacts the NER task. Observe that providing hundreds of annotated training examples per entity type by users (for the downstream application data) might be infeasible in practice.

Table 3. Micro F1 score for NER of the different algorithms on 3 datasets.

# Train    Framework    4 Tags (PER, LOC, ORG, MISC)   3 Tags (PER, LOC, ORG)
Instance                CoNLL   Kaggle   NLU           CoNLL   Kaggle   NLU
5 shots    LUKE         13.15   25.10    40.03         06.20   08.63    17.94
           DeepPavlov   45.35   37.48    45.21         33.64   28.06    26.88
           SpaCy        41.50   47.87    53.43         47.01   40.09    34.88
           AtEnA        57.74   48.25    56.71         62.25   41.75    55.46
20 shots   LUKE         44.50   45.17    59.55         46.37   39.23    44.98
           DeepPavlov   49.00   45.72    58.67         53.03   41.73    50.90
           SpaCy        56.17   50.24    64.29         59.11   49.63    53.42
           AtEnA        67.20   66.75    65.41         71.04   65.50    62.22
50 shots   LUKE         62.84   66.07    65.74         64.50   61.64    62.28
           DeepPavlov   62.15   63.00    68.15         62.22   56.90    66.46
           SpaCy        70.34   66.96    68.84         66.01   60.55    62.35
           AtEnA        73.80   71.74    68.80         74.50   71.17    71.46

Table 4. Ablation results for AtEnA on NLU data with 20 shots and 4 NER tags.

Approach                      Micro F1 score
AtEnA (without attributes)    63.91
AtEnA                         65.41

4.1 Ablation and Parameter Study

To explore whether the performance gains of AtEnA are indeed an artifact of the external entity attribute infusion promoting better learning of the model, we conduct a small-scale ablation study on the NLU dataset. In this setting, we keep the same underlying AtEnA framework (i.e., LUKE language model + biLSTM-CRF architecture), but do not include the entity attributes from the dictionary D (i.e., the sentence-tree is not constructed and there is no attention visibility matrix). Here, we consider all 4 NER categories for NLU and report results with 20-shot training instances, as shown in Table 4. We observe that the inclusion of the entity class attributes provides a boost in the overall performance of our framework, alluding to the importance of entity class attribute infusion for the performance improvement.

Fig. 2. Performance of AtEnA on CoNLL and Kaggle data with varying k

Finally, we study the effect of the hyper-parameters k (number of entity class attributes incorporated) and n (sliding-window width for entity matching during inference) in our AtEnA framework. We vary the values of k and n and report the results on the CoNLL and Kaggle datasets in Figs. 2 and 3. From Fig. 2, we observe that with an increase in the value of k, the accuracy of AtEnA tends to slightly decrease (on the CoNLL dataset). This is possibly due to the increased noise introduced during model training when more entity class attributes are included. Hence, in our experiments, we set k = 2, i.e., the top-2 entity class attributes based on frequency are added to each entity to construct the sentence-tree. On the other hand, the window-width parameter n seems to have limited impact on the accuracy of our framework. However, with an increase in the value of n, the inference time of AtEnA would increase, as the entity-dictionary matching can potentially take more iterations (since if no n-width span of adjacent tokens matches, we try with n − 1 tokens and so on). Hence, to strike a balance, we set n = 5 in our experiments.

Fig. 3. Performance of AtEnA on CoNLL and Kaggle Data with varying n

Discussion. We empirically observe significant performance improvements of AtEnA over existing approaches for the few-shot NER classification task. Further, owing to the modular nature of AtEnA, incorporating other language models or knowledge sources would be relatively straightforward, depending on application-specific requirements. For example, the base language model can be updated to multilingual LUKE (mLUKE) [24] to tackle multilingual NER via AtEnA. However, the proposed approach might not be applicable for some NER types for which attributes are difficult to obtain or absent (e.g., date, time, or event might not have such class attributes to use).

5 Conclusion

In this paper, we presented AtEnA, a few-shot-learning-based NER framework that utilizes common entity class attributes from external knowledge sources as additional information cues. We showed how such attributes can be incorporated into the attention mechanism via a visibility matrix for pre-training transformer models for sequence labeling, and we use a modular two-stage fine-tuning process to adapt our framework to diverse downstream application needs. Our experimental results show that the framework significantly outperforms the existing baselines on 3 NER benchmark datasets with minimal training data requirements.

References

1. Aliod, D.M., van Zaanen, M., Smith, D.: Named entity recognition for question answering. In: ALTA, pp. 51–58 (2006)
2. Aone, C., Halverson, L., Hampton, T., Ramos-Santacruz, M.: SRA: description of the IE2 system used for MUC-7. In: MUC-7 (1998)
3. Aone, C., Okurowski, M.E., Gorlinsky, J.: A trainable summarizer with knowledge acquired from robust NLP techniques, pp. 71–80 (2022)
4. Bocklisch, T., Faulkner, J., Pawlowski, N., Nichol, A.: Rasa: open source language understanding and dialogue management. In: NIPS Workshop on Conversational AI (2017)
5. Bouarroudj, W., Boufaida, Z., Bellatreche, L.: Named entity disambiguation in short texts over knowledge graphs. Knowl. Inf. Syst. 64, 325–351 (2022)
6. Cheng, P., Erk, K.: Attending to entities for better text understanding. In: AAAI, pp. 7554–7561 (2020)
7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
8. Eddy, S.R.: Hidden Markov models. Curr. Opin. Struct. Biol. 6(3), 361–365 (1996)
9. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165(1), 91–134 (2005)
10. Goyal, A., Gupta, V., Kumar, M.: Recent named entity recognition and classification techniques: a systematic review. Comput. Sci. Rev. 29, 21–43 (2018)
11. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector machines. IEEE Intell. Syst. Their Appl. 13(4), 18–28 (1998)
12. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging (2015). arXiv:1508.01991
13. Krupka, G., IsoQuest, K.: Description of the NetOWL extractor system as used for MUC-7. In: MUC-7, pp. 21–28 (2005)
14. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML, pp. 282–289 (2001)
15. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: NAACL, pp. 260–270 (2016)
16. Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. (2020)
17. Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., Li, J.: A unified MRC framework for named entity recognition. In: ACL, pp. 5849–5859 (2020)
18. Lin, B.Y., Xu, F., Luo, Z., Zhu, K.: Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media. In: W-NUT, pp. 160–165 (2017)
19. Liu, W., Zhou, P., Zhao, Z., Wang, Z., Qi, J., Deng, H., Wang, P.: K-BERT: enabling language representation with knowledge graph. In: AAAI (2020)
20. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investig. 30(1), 3–26 (2007)
21. Panchendrarajan, R., Amaresan, A.: Bidirectional LSTM-CRF for named entity recognition. In: PACLIC, pp. 531–540 (2018)
22. Peters, M.E., Ammar, W., Bhagavatula, C., Power, R.: Semi-supervised sequence tagging with bidirectional language models. In: ACL, pp. 1756–1765 (2017)
23. Peters, M.E., Neumann, M., Logan, R., Schwartz, R., Joshi, V., Singh, S., Smith, N.A.: Knowledge enhanced contextual word representations. In: EMNLP-IJCNLP, pp. 43–54 (2019)
24. Ri, R., Yamada, I., Tsuruoka, Y.: mLUKE: the power of entity representations in multilingual pretrained language models. In: ACL, pp. 7316–7330 (2022)
25. Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition (2003). arXiv:cs/0306050
26. Swietojanski, P., Liu, X., Eshghi, A., Rieser, V.: Benchmarking natural language understanding services for building conversational agents. In: Proceedings of the Tenth International Workshop on Spoken Dialogue Systems Technology (IWSDS), Ortigia, Siracusa (SR), Italy, April 2019. Springer
27. Yamada, I., Asai, A., Shindo, H., Takeda, H., Matsumoto, Y.: LUKE: deep contextualized entity representations with entity-aware self-attention. In: EMNLP, pp. 6442–6454 (2020)
28. Yosef, M.A., Hoffart, J., Bordino, I., Spaniol, M., Weikum, G.: AIDA: an online tool for accurate disambiguation of named entities in text and tables. Proc. VLDB Endow. 4(12), 1450–1453 (2011)
29. Zhang, S., Elhadad, N.: Unsupervised biomedical named entity recognition: experiments with clinical and biological texts. J. Biomed. Inform. 46(6), 1088–1098 (2013)
30. Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., Liu, Q.: ERNIE: enhanced language representation with informative entities. In: ACL, pp. 1441–1451 (2019)
31. Zhou, G., Su, J.: Named entity recognition using an HMM-based chunk tagger. In: ACL, pp. 473–480 (2002)

Author Index

A
Abbas, Karrar K. 111
Abdul-Rahman, Shuzlina 436
Abdurohman, Maman 523, 598
Abel, Andrew 175
Abuhmed, Tamer 555
Ahad, Nor Aishah 284
Ahmed, Imtiaz 491, 507
Aivatoglou, Georgios 841
Al Rasyid, Muhammad Udin Harun 616
Al-Atabany, Walid 70
Ali, Bushara 636
Andaluz, Víctor H. 46
Apanapudor, Joshua Sarduana 284
Arampatzis, Georgios 841
Arunaye, Festus Irimisose 284
Arzoky, Mahir 334
Assem, Haytham 859
Astrova, Irina 397
Ayalew, Yonas 354
Aydin, Cem Rifki 792
Ayoub, Misbah 175

B
Baral, Chitta 766
Barba, Stefano 244
Barbosa, José L. 536
Bechtsis, Dimitrios 371
Beckers, Ingeborg E. 18
Bedada, Wendwosen 354
Belo, Orlando 449
Benferhat, Salem 81
Bhattacharyya, Siddhartha 682
Biswas, Snigdha 713
Bomgni (PI), Alain Bertrand 325
Brandt, Danja 18
Braton, Colby 663
Bravo, Maricela 536
Bustillos, Fernanda 457

C
Chiaburu, Teodor 18
Chiyangwa, Tawanda 822
Chowdhury, Naseef-Ur-Rahman 491, 507
Cordonez, Jimmy W. 46
Corliss, David J. 411
Costa, Michela 244
Culcay, Lissette 457

D
Das, Pronaya Prosun 264
Das, Sanjiv 308
De Vuyst, Stijn 782
Deli, Nur Maisara 436
Ding, Zifeng 729
Domb, Menachem 652
Duarte, Ana 449
Dutta, Sourav 859

E
Elliott, Clark 748
El-Sappagh, Shaker 555
Engler, Hannes 210

F
Fasfous, Nael 157
Fatima, Tanjim 491, 507
Faudzi, Ahmad Faridz Ahmad 589
Fiala, Gernot 1
Fitzpatrick, Donal 145
Foko Sindjoung, Miguel Landry 325
Fotseu Fotseu, Ernest Basile 325
Fotsing Talla, Bernard 325
Freeman, Kenneth 354
Frickenstein, Alexander 157
Frickenstein, Lukas 157
Fytili, Alexia 841

G
Ganeriwala, Parth 682
Gautama, Sidharta 782
Gerner, Christian 397
Giachanou, Anastasia 381
Gröling, Marc 128

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 K. Arai (Ed.): IntelliSys 2023, LNNS 824, pp. 871–873, 2024. https://doi.org/10.1007/978-3-031-47715-7

Gupta, Anubhav 682
Gupta, Jahnvi 713

H
Hameed, Ibrahim A. 128
Han, Zhen 729
Haque, Ahshanul 491, 507
Hashemi, Ahmad 381
Haußer, Frank 18
He, Bailan 729
Herman, Maya 193
Herzog, Katja 210
Homaifar, Abdollah 354
Hossen, Mohammad Sahinur 491, 507
Huang, Laurent 128
Huber, Florian 210
Hudli, Viraj 652

I
Iliev, Oliver 636
Ilieva, Anamarija 636
Ismail, Ibrahim 699

J
Jack, Thomas 264
Jie, Wang 419
Joshi, Sujata 652
Juraev, Firuz 555
Jusoh, Shaidah 589

K
Kadir, Rabiah Abdul 589
Kamel, Ahmed 70
Kang, Myeongjun 301
Kashihara, Kazuaki 766
Kicherer, Anna 210
Kokkas, Georgios 371
Koschel, Arne 397
Krauss, Oliver 807

L
Lakulu, Mohd Modi 589
Li, Xianping 111
Liu, Siyuan 571
Liu, Wenjing 571
Liuzhu, Qian 419
Lopez, Angel J. 782

M
Ma, Yunpu 729
Mallioris, Panagiotis 371
Mann, Ashley 334
Maramazi, Faisal 334
Margaritis, Ioannis 371
Marzinkowski, Arthur 81
Mast, Marcel 264
Melnik, Andrew 35
Moghaddam, Fatemeh Besharati 782
Mohamed Hanum, Haslizatul Fairuz 436
Mohamed, Azza 699
Momeni, Mohammadhossein 663
Moosavi, Mohammad Reza 381
Mori, Pierpaolo 157
Mutalib, Sofianita 436
Muthalagu, Raja 682

N
Nandanwar, Shreya 682
Natarajan, Sarun 713
Nuha, Hilal Hudan 523, 598

O
Odebode, Afees Adegoke 334
Okwonu, Friday Zinzendoff 284
Ortiz, Jessica S. 98

P
Pakosch, Anna 397
Pal, Kuntal Kumar 766
Palombo, Adolfo 244
Pantusin, Fernando J. 46
Paparrizou, Anastasia 81
Passerone, Claudio 157
Patel, Raj Nath 859
Perdana, Doan 598
Phan-Luong, Viet 229
Piazzullo, Daniele 244
Pichucho, Katherin G. 98
Piette, Cédric 81
Pitt, Ian 145
Prabowo, Sidik 523
Prayudi, Agus 616
Putrada, Aji Gautama 598

Q
Quimbita, Milton A. 46

R
Rahman, A. S. M. M. 663
Rashid, Mohd Fadzil Abdul 436
Razy, Elad 193
Ritter, Helge 35
Rothgänger, Markus 35

S
Sampath, Shambhavi Balamuthu 157
Sánchez-Martínez, Leonardo D. 536
Schmidt, Henning 18
Schrimpf, Ilona 18
Schulze, Christin 397
Sert, Bilge Sipal 792
Sharipov, Olimjon Shukurovich 284
Shi, Wei 381
Shivam, Shivam 713
Shoaib, Muhammad 145
Smallwood, Norm 308
Sobh, Ibrahim 70
Soliman, Hamdy 491, 507
Stadel, Alexandra 18
Stechele, Walter 157
Steger, Christian 1
Steinhage, Volker 210
Stergiopoulos, Fotios 371
Stöckl, Andreas 807
Styliadis-Heinz, Alexandros 371
Stylianou, Nikolaos 841
Sukaridhoto, Sritrusta 616
Sutikno, Sarwono 523
Swift, Stephen 334

T
Tian, Jie 571
Tiun, Sabrina Binti 589
Toapanta, Kelvin F. 98
Töpfer, Reinhard 210
Tresp, Volker 729
Trevino, Robert P. 766
Tschaikner, Martin 18
Tucker, Allan 334
Tyca, Matthias 397

U
Ulrich, Dave 308
Unger, Christian 157

V
Val Danilov, Igor 56
Vallejo-Huanga, Diego 457
van Biljon, Judy 822
van der Poel, Etienne 822
Varde, Aparna S. 474
Varela-Aldás, José 98
Vargas, Andrés D. 46
Varma, Sandeep 713
Vashi, Hiral 474
Velempini, Mthulisi 325
Vemparala, Manoj-Rohit 157
Vlahavas, Ioannis 841

W
Wang, Jing 571
Wicaksono, Hendro 616
Wiese, Lena 264
Wu, Jingpei 729
Wulff, Antje 264

X
Xu, Zhiwei 571

Y
Yadav, Jyoti 474
Ye, Zhenyu 1
Younus, Arjumand 792

Z
Zaikis, Dimitrios 841
Zakeri, Ahmad 636
Zhang, Haiyang 175
Zhang, Jiaqi 571
Zhenfei, Zhan 419
Zhu, Daniel 308
Zulkifli, Mohamad Fairus 589