This book gathers the proceedings of the eighth Future of Information and Communication Conference (FICC 2023), which was held successfully in virtual mode on 2 and 3 March 2023.
Table of contents:
Preface
Contents
The Disabled’s Learning Aspiration and E-Learning Participation
1 Introduction
2 Literature Review
2.1 Lifelong Education for the Disabled
2.2 E-Learning and the Disabled
3 Method
3.1 Sample
3.2 Measurement
3.3 Analysis
4 Results
4.1 E-Learning Participation Rate and Perceived Effectiveness of E-Learning
4.2 ICT Access and Use
4.3 The Results of Structural Equation Model for E-Learning Participation
5 Conclusion
References
Are CK Metrics Enough to Detect Design Patterns?
1 Introduction
2 Background and Related Works
2.1 Design Patterns Background
2.2 Design Patterns Detection
3 Methodology
3.1 Data Extraction
3.2 Data Preprocessing
3.3 Machine Learning Models
3.4 Performance Metrics
4 Implementation Details and Results
4.1 Dataset Description
4.2 Experiments
5 Discussion
6 Conclusion
References
Detecting Cyberbullying from Tweets Through Machine Learning Techniques with Sentiment Analysis
1 Introduction
2 Background
2.1 Natural Language Processing
2.2 Machine Learning Algorithms
3 Proposed Tweets SA Model
3.1 Evaluation Measures
3.2 Datasets
4 Experiments and Results
4.1 Results for Dataset-1
4.2 Results for Dataset-2
4.3 Comparing the Results of Dataset-1 and Dataset-2
5 Conclusion
References
DNA Genome Classification with Machine Learning and Image Descriptors
1 Introduction
2 Related Work
3 Materials and Methods
3.1 CNNs and Chaos Game Representation
3.2 Kameris and Castor
3.3 FOS, GLCM, LBP, and MLBP
4 Experiments and Results
4.1 Datasets
4.2 Metrics
4.3 Results
4.4 Processing Time
5 Discussion
6 Limitations
7 Conclusions
8 Future Work
References
A Review of Intrusion Detection Systems Using Machine Learning: Attacks, Algorithms and Challenges
1 Introduction
2 Intrusion Detection System
3 Machine Learning and Intrusion Detection
4 Metrics
5 Challenges
6 Conclusions
References
Head Orientation of Public Speakers: Variation with Emotion, Profession and Age
1 Introduction
2 Motivation and Literature Review
3 Method and System Design
4 Data and Results
4.1 The Data Used
4.2 Preliminary Data Profiling
4.3 Results and Discussion
5 Conclusions
References
Using Machine Learning to Identify Top Antecedents Affecting Crime in US Communities
1 Introduction
2 Related Work
3 Proposed Work
3.1 Dataset
3.2 Lasso Regression
3.3 Feature Normalization
3.4 Alpha Value Selection
3.5 Results
4 Conclusion
References
Hybrid Quantum Machine Learning Classifier with Classical Neural Network Transfer Learning
1 Introduction
2 The Classic Neural Network
3 Quantum Machine Learning Classifier
3.1 Circuit Design Principles
3.2 Gradient Parameter-Shift Rule
3.3 Unlearning Rate
4 Results
4.1 Expectations
4.2 Weights
4.3 Metrics
5 Future Work
6 Definitions
References
Repeated Potentiality Augmentation for Multi-layered Neural Networks
1 Introduction
1.1 Potentiality Reduction
1.2 Competitive Learning
1.3 Regularization
1.4 Comprehensive Interpretation
1.5 Paper Organization
2 Theory and Computational Methods
2.1 Repeated Potentiality Reduction and Augmentation
2.2 Total and Relative Information
2.3 Repeated Learning
2.4 Full and Partial Compression
3 Results and Discussion
3.1 Experimental Outline
3.2 Potentiality Computation
3.3 Collective Weights
3.4 Partially Collective Weights
3.5 Correlation and Generalization
4 Conclusion
References
SGAS-es: Avoiding Performance Collapse by Sequential Greedy Architecture Search with the Early Stopping Indicator
1 Introduction
1.1 Problems of DARTS and SGAS
1.2 Research Purposes and Main Contributions
2 Prior Knowledge
2.1 Neural Architecture Search (NAS)
2.2 Differentiable Architecture Search (DARTS)
2.3 Sequential Greedy Architecture Search (SGAS)
2.4 The Relation Among Flat Minima, λmax, and Performance Collapse
3 Approach
3.1 SGAS-es Overview
3.2 Bilevel Optimization
3.3 Early Stopping Indicator
3.4 Edge Decision Strategy
3.5 Fixing the Operation
3.6 Put-It-All-Together: SGAS-es
4 Experiment
4.1 NAS-Bench-201
4.2 Fashion-MNIST Dataset
4.3 EMNIST-Balanced Dataset
5 Conclusion
References
Artificial Intelligence in Forensic Science
1 Introduction
2 Related Works
3 Applications of AI in Forensic Science
3.1 Pattern Recognition
3.2 Data Analysis
3.3 Knowledge Discovery
3.4 Statistical Evidence
3.5 Providing Legal Solutions
3.6 Creating Repositories
3.7 Enhance Communication Between Forensic Team Members
4 Proposed Methodology
4.1 Data Collection and Pre-processing
4.2 Training the Model
5 Results
6 Conclusion
References
Deep Learning Based Approach for Human Intention Estimation in Lower-Back Exoskeleton
1 Introduction
2 Related Work
2.1 Vision-Based Methods
2.2 Data from Wearable Sensors
3 Methodology
3.1 Data Acquisition via Sensors
3.2 Kinematic Data Analysis
3.3 Experimental Protocol
3.4 Human Intention Estimation
3.5 Motion Prediction for the Exoskeleton Control with Deep Learning Approach
3.6 Long Short-Term Memory
4 Implementation
4.1 Hardware Implementation
4.2 Data Format
4.3 Feature Extraction
4.4 Performance Metric
5 Results and Discussion
6 Conclusion
References
TSEM: Temporally-Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series
1 Introduction
2 Related Work
2.1 Attention Neural Models
2.2 Post-Hoc Model-Specific Convolutional Neural Network-Based Models
2.3 Explanation Extraction by Class Activation Mapping
3 Methodology
4 Experiments and Evaluation
4.1 Baselines
4.2 Accuracy
4.3 Interpretability
4.4 Experiment Settings
4.5 Experiment Results
5 Conclusion and Outlook
References
AI in Cryptocurrency
1 Introduction
2 Related Work
3 Artificial Intelligence and Machine Learning
3.1 Artificial Intelligence
3.2 Machine Learning
4 Methodology
4.1 Data Collection
4.2 Preliminary Data Analysis
4.3 Data Preprocessing
4.4 Model Preparation
5 Results
6 Conclusions and Future Work
References
Short Term Solar Power Forecasting Using Deep Neural Networks
1 Introduction
2 Literature Review
3 Deep Neural Network
4 Proposed Methodology
5 Results and Discussion
6 Conclusion and Future Work
References
Convolutional Neural Networks for Fault Diagnosis and Condition Monitoring of Induction Motors
1 Introduction
2 Related Work
3 Data Description
4 Methodology
4.1 Experiment One: Raw Signal Data
4.2 Experiment Two: Statistical Features Data
5 Results
6 Conclusion
References
Huber Loss and Neural Networks Application in Property Price Prediction
1 Introduction
2 About the Data
2.1 Data Preparation
3 Neural Network Models
3.1 Deep Neural Network
3.2 Recurrent Neural Network
3.3 Hybrid Neural Network
4 Results
5 Conclusion
References
Text Regression Analysis for Predictive Intervals Using Gradient Boosting
1 Introduction
2 About the Data
2.1 Data Preparation
3 Literature Survey
4 Word Vectorization
4.1 Term Frequency-Inverse Document Frequency (TF-IDF)
4.2 Word2Vec
4.3 GloVe
5 The Intuition of Text Regression
5.1 Gradient Boosting Regressor
5.2 Quantile Regression Model
5.3 Benchmarking the Model
5.4 Model Comparisons
5.5 Hyperparameter Tuning
6 Result
6.1 Inference
7 Scope of Future Improvements
8 Conclusion
References
Chosen Methods of Improving Small Object Recognition with Weak Recognizable Features
1 Introduction
2 Related Works
3 Dataset Analysis
4 Data Augmentation with DCGAN
5 Augmentation Setup
5.1 Oversampling Strategies
5.2 Perceptual GAN
6 Results
7 Conclusions and Future Work
References
Factors Affecting the Adoption of Information Technology in the Context of Moroccan SMEs
1 Introduction
2 Research Question
3 Literature Review
3.1 Definition of SME
3.2 Adoption of Information Technology
4 Research Methodology
5 Findings and Results
5.1 Organizational Factors
5.2 Individual Factors
5.3 Technological Factors
5.4 Environmental Factors
6 Conclusion and Future Works
References
Aspects of the Central and Decentral Production Parameter Space, its Meta-Order and Industrial Application Simulation Example
1 Introduction
2 Metadescriptive Approach in Orgitonal Terms of the Central-Decentral Problem
3 Model in Witness and Comparison to Previous Approach
3.1 Traditional and Additive Manufacturing (AM)
3.2 Electric Motor Housing as Application Example for the Case Study
3.3 Simulation of the Case Study in Witness
3.4 Comparison to Previous Work
3.5 Discussion
4 Conclusion and Outlook
References
Gender Equality in Information Technology Processes: A Systematic Mapping Study
1 Introduction
2 Background
2.1 Gender Equality
2.2 Gender Equality in IT
3 Research Method
3.1 Research Questions
3.2 Search Strategy
3.3 Selection Criteria
3.4 Data Sources and Study Selection
3.5 Strategy for Data Extraction and Analysis
4 Results
4.1 RQ1: What Kind of Studies Exist on IT Processes and Gender Equality?
4.2 RQ2: What Gender Equality Targets are addressed by IT Processes?
4.3 RQ3: What are the Main Challenges to Achieve Gender Equality in IT Processes?
4.4 RQ4: What are the Best Practices Established to Address Gender Equality in IT Processes?
5 Discussion
5.1 Principal Findings
5.2 Limitations
5.3 Implications
6 Conclusions and Future Work
Appendix A. Selected Studies
Appendix B. Results Mapping
References
The P vs. NP Problem and Attempts to Settle It via Perfect Graphs State-of-the-Art Approach
1 Introduction
2 Literature Review
3 Transforming Any Graph into a Perfect Graph
4 Extracting the Independence Number from the Transformed Perfect Graph
4.1 Characterisation of the Transformed Graph T
4.2 Algorithms to Find the Independence Number for Special Graphs
4.3 Example Graphs
5 Results
6 Conclusion and Future Work
References
Hadoop Dataset for Job Estimation in the Cloud with Limited Bandwidth
1 Introduction
2 Experimental Setup
3 The Hadoop MapReduce Job Traces Database
3.1 Data Generation
3.2 Data Collection
4 Conclusion and Future Work
References
Survey of Schema Languages: On a Software Complexity Metric
1 Introduction
2 Review of Related Works
3 Materials and Methods
3.1 Experimental Setup of LIS Metric
4 Result and Discussion
5 Conclusion
References
Bohmian Quantum Field Theory and Quantum Computing
1 Introduction
2 The Abrams-Lloyd Theorem
3 Lagrangians and Yang-Mills Fields
4 Fields and Photons
5 Quantum Logic Gates
6 Bohmian Mechanics
7 A System of Equations
8 Conclusion
References
Service-Oriented Multidisciplinary Computing: From Code Providers to Transdisciplines
1 Introduction
2 Disciplines and Intents in TDML
2.1 The Service-Oriented Conceptualization of TDML
2.2 Three Pillars of Service Orientation
3 Discipline Instantiation and Initialization
3.1 Explicit Discipline Instantiation with Builder Signatures
3.2 Implicit Discipline Instantiation by Intents
4 Discipline Execution and Aggregations
5 An Example of a Distributed Transdiscipline in TDML
5.1 Specify the Sellar Intent with the MadoIntent Operator
5.2 Define the Sellar Model with Two Distributed Disciplines
5.3 Execute the Sellar Intent
6 Conclusions
References
Incompressible Fluid Simulation Parallelization with OpenMP, MPI and CUDA
1 Introduction
2 Implementation and Tuning Effort
2.1 OpenMP
2.2 MPI
2.3 CUDA
3 Experimental Data: Scaling and Performance Analysis and Interesting Inputs and Outputs
3.1 OpenMP
3.2 Strong Scaling
3.3 Weak Scaling
3.4 MPI
3.5 CUDA
4 Conclusion
4.1 Difficulty
4.2 Timeliness of the Contribution
References
Predictive Analysis of Solar Energy Production Using Neural Networks
1 Introduction
2 Related Work
3 Research Methodology
3.1 Data Preparation and Preprocessing
3.2 Feature Extraction
3.3 Machine Learning and Neural Network Algorithms Overview
4 Analysis of Results
4.1 Wavelet Transforms
4.2 Discussion of Results
5 Conclusion and Future Work
References
Implementation of a Tag Playing Robot for Entertainment
1 Introduction
2 Related Work
3 Design and Implementation
4 Design Hardware Component
5 Discussion and Results
6 Conclusion
References
English-Filipino Speech Topic Tagger Using Automatic Speech Recognition Modeling and Topic Modeling
1 Introduction
2 Literature
2.1 Latent Dirichlet Allocation
3 Materials and Methods
3.1 Dataset
3.2 Pre-processing
3.3 Modeling
3.4 Topic Modeling: Latent Dirichlet Allocation
3.5 Evaluation
4 Results and Discussion
4.1 English XLSR Wav2Vec2 Fine-Tuned to Filipino
4.2 Comparison of Transcriptions: Base XLSR-Wav2Vec2 Model Fine-Tuned to Filipino
4.3 Topic Tagger Results
4.4 Evaluation of the Topic Models
5 Conclusion and Future Work
References
A Novel Adaptive Fuzzy Logic Controller for DC-DC Buck Converters
1 Introduction
2 System Model
3 The Proposed DAFLC
4 Implementation of the DAFLC Using ARM Embedded Platform
4.1 Fuzzification and Defuzzification
4.2 Inference Rule Update
5 Experimental Results
5.1 Case 1: Experiment with Reference Voltage Value 200V, Constant Load R = 50 Ω
5.2 Case 2: Experiments with Steeply Rising Reference Voltage Values from 50V to 200V, and the Load Resistance Fixed at 50 Ω
5.3 Case 3: Experiment with Varying Load and Input Voltage
6 Conclusions
References
Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment
1 Introduction
2 Multi-stage Channel Attention Integrated Swin Transformer for IQA (MCAS-IQA)
3 Locally Shared Attention in Transformer for VQA (LSAT-VQA)
4 Experiments and Discussions
4.1 IQA and VQA Datasets
4.2 Training Approach
4.3 Comparison and Ablations of IQA Models
4.4 Experiments on VQA Models
5 Conclusion
References
A Neural Network Based Approach for Estimation of Real Estate Prices
1 Introduction
2 The Proposed Approach
2.1 Training
2.2 Generation of Output Vector by a New Unknown Input Vector
3 Software Realization
4 Conclusion
References
Contextualizing Artificially Intelligent Morality: A Meta-ethnography of Theoretical, Political and Applied Ethics
1 Introduction
2 Literature Review
3 Theoretical AI Morality: Top-Down vs Bottom-Up
3.1 The Example of Fairness in AI
4 Technical AI Morality: Top-Down vs Bottom-Up
5 Political AI Morality: Top-Down vs. Bottom-Up
6 The Bottom-Up Method of AI Being Taught Ethics Through Reinforcement Learning
6.1 Reinforcement Learning as a Methodology for Teaching AI Ethics
7 The Top-Down Method of AI Being Taught Ethics
7.1 Practical Principles for AI Ethics
8 The Hybrid of Bottom-Up and Top-Down Ethics for AI
8.1 Data Mining Case Study: The African Indigenous Context
8.2 Contact Tracing for COVID-19 Case Study
9 Discussion
10 Conclusion
References
An Analysis of Current Fall Detection Systems and the Role of Smart Devices and Machine Learning in Future Systems
1 Introduction
2 Literature Review
2.1 Current Fall Detection and Fall Prevention Systems
2.2 Sensors
2.3 Fall Detection and Prevention Algorithms
3 Methodology
3.1 Materials
3.2 Dataset
3.3 Program
3.4 Evaluation and Analysis Techniques
4 Analysis and Evaluation
4.1 Statistical Analysis to Test the Classifiers
5 Discussion
6 Conclusion
6.1 Future Work
References
Authentication Scheme Using Honey Sentences
1 Introduction
2 Previous Methods
2.1 Overview of Previous Methods
2.2 Security Analysis of Previous Methods
3 Proposed Method
3.1 Preprocessing Phase
3.2 Sign-Up Phase
3.3 Login Phase
4 Discussion
4.1 Naturalness Evaluation
4.2 Security Analysis
5 Conclusion and Future Research
References
HSDVS-DPoS: A Secure and Heterogeneous DPoS Consensus Protocol Using Heterogeneous Strong Designated Verifier Signature
1 Introduction
2 Related Work
3 Preliminaries
3.1 Heterogeneous Strong Designated Signature Based on PKI and IBC
4 Proposed HSDVS-DPoS Consensus Protocol
4.1 Roles
4.2 Relationship of Roles
4.3 Data Structure
4.4 Consensus Phases
5 Discussion
5.1 Counting Phase
5.2 Rules of Counting
6 Security Analysis
6.1 Authentication and Secrecy
6.2 Anonymity
6.3 Fairness
6.4 Verifiability
6.5 Receipt-Freeness
6.6 Coercion-Resistance
7 Experiment
7.1 Overall Design
7.2 Experimental Evaluation
7.3 Experimental Results and Analysis
8 Conclusion
References
Framework for Multi-factor Authentication with Dynamically Generated Passwords
1 Introduction
2 Literature Review
3 Challenges and Limitations of the Existing Approaches
4 Proposed Authentication Idea
5 Experimental Environment
6 Results
7 Analysis of the Experimental Results
8 Conclusion
9 Further Research
References
Randomness Testing on Strict Key Avalanche Data Category on Confusion Properties of 3D-AES Block Cipher Cryptography Algorithm
1 Introduction
2 Review of 3D-AES
3 Proposed Design of Enhanced 3D-AES Block Cipher
3.1 Confusion-Based Function
3.2 Proposed Design of Confusion-Based Functions
3.3 ConfuseP
3.4 ConfuseK
3.5 ConfuseP−1 and ConfuseK−1
4 Randomness Testing on Enhanced 3D-AES
5 Discussion on Test Results
6 Conclusion and Future Research
References
A Proof of P ! = NP: New Symmetric Encryption Algorithm Against Any Linear Attacks and Differential Attacks
1 Introduction
2 Introduction to Eagle Encoding Algorithm
3 Eagle Encryption Algorithm
3.1 Eagle Key Generator
3.2 Eagle Encryption Process
3.3 Eagle Decryption Process
4 Linear Attack Analysis to Eagle Encryption Algorithm
5 Differential Attack Analysis to Eagle Encryption Algorithm
6 One-Way Function Design
6.1 Introduction to One-Way Functions
6.2 Construction of One-Way Functions
7 Conclusion
References
Python Cryptographic Secure Scripting Concerns: A Study of Three Vulnerabilities
1 Introduction
2 Problem and Significance
3 Review of Literature
4 Methodology
4.1 ECB Mode Benchmark
4.2 Open-Source Repository Information for Four Projects
4.3 Analysis of Open-Source Repositories with Static Analysis Tools
5 Findings
6 Discussions
7 Conclusions and Future Work
References
Developing a GSM-GPS Based Tracking System: Vulnerable Nigerian School Children as a Case Study
1 Introduction
1.1 Vulnerability Points in Nigerian School Children as a Basis for Tracker Functionality
2 Literature Review
2.1 Review of Related Works
2.2 Review Results
3 Methodology
3.1 Hardware Components
3.2 Software Components
3.3 System Flowchart
4 Result and Discussion
4.1 Circuit Diagram
4.2 System Operation
4.3 Performance Tests
5 Future Scope and Conclusion
References
Standardization of Cybersecurity Concepts in Automotive Process Models: An Assessment Tool Proposal
1 Introduction
1.1 Scope
1.2 Background and Approaches
1.3 Motive
2 Case Study Methodology and Scope
3 Case Study Observations and Results Consolidation
3.1 Overview of Automotive SPICE Assessment Methodology Steps
3.2 Details of the 1st Automotive SPICE Assessment for Cybersecurity (Challenged/Lessons Learned)
4 Study Results and Final Conclusion
5 Conclusion and Recommendations for Future Work
References
Factors Affecting the Persistence of Deleted Files on Digital Storage Devices
1 Introduction
2 Related Work
3 Experimental Design
3.1 Virtual Machine Preserving and Archiving Process
4 Parameters Identification
4.1 Disk and System Factors
4.2 Post-user Activity
4.3 Deleted Files Properties
5 Approach and Methodology
6 Results and Analysis
6.1 User Activity and Number of Files
6.2 File Size
6.3 File Types
6.4 Media Free Space and Disk Fragmentation
7 Challenges and Limitations
8 Conclusion
9 Future Work
References
Taphonomical Security: DNA Information with a Foreseeable Lifespan
1 Introduction
2 Biochemical Preliminaries
2.1 NA Composition
2.2 NA Bonds Degradation Over Time
2.3 On the Synthesis of RNA-DNA Chimeric Oligonucleotides
2.4 RNA Degradation in Further Detail
3 The Proposed Method
3.1 Description of the Method
3.2 Encrypting Information
3.3 Key Reconstruction
3.4 Security
4 Controlling the Information Lifetime
4.1 Probabilistic Model
4.2 The Information Lifetime Bounds
5 Parameter Choice and Efficiency Analysis
5.1 Finding (n,k) for Target Times t and t'
5.2 Finding (n,k) with Lowest Agony Ratio
5.3 Finding (n,k) for Target Time ttarget with the Least Variance
6 Conclusion and Future Work
A Proofs
B Numerical Values
References
Evaluation and Analysis of Reversible Watermarking Techniques in WSN for Secure, Lightweight Design of IoT Applications: A Survey
1 Introduction
2 WSN and IoT Smart Meters in Summary
3 Digital Watermarking and Reversible Watermarking
4 Literature Review and Related Work
5 Comparison and Discussion
5.1 Design Purpose
5.2 Influential Factors on System Performance
5.3 Resistance Against Attacks
6 Suggestions for Improving Security Performance for the Future Lightweight IoT-Based Design
7 Conclusion and Future Work
References
Securing Personally Identifiable Information (PII) in Personal Financial Statements
1 Introduction
2 Review of Relevant Literature
3 Methodology
3.1 Qualifying Questions
3.2 Survey Contents
3.3 Sampling
4 Results
4.1 Preferences: Financial Institution
4.2 Preferences: Information on Statements
4.3 Changing View of PII
4.4 Discussion - Information on Statements
5 Conclusions
6 Discussions on Future Work
A Appendix
A.1 Example Statement: Full Account Number Obfuscated Only
A.2 Example Statement: Last 4 Digits Obfuscated
A.3 Example Statement: No Obfuscated Information
A.4 Survey Questions
References
Conceptual Mapping of the Cybersecurity Culture to Human Factor Domain Framework
1 Introduction
2 Research Method
3 Identified Cybersecurity Culture Factors
4 Human Factor
4.1 Factors of Human Problem
4.2 Human Factor Framework
5 Mapping of Cybersecurity Culture Factors and Human Factor Domain
5.1 Organisational and Individual Levels Factors
5.2 Discussions
6 Limitation
7 Conclusion and Future Work
References
Attacking Compressed Vision Transformers
1 Introduction
1.1 Related Work
1.2 Novel Contribution
2 Dataset
3 Metrics
4 Vision Transformers and Types
4.1 Transformers
4.2 Vision Transformers
4.3 Data-Efficient Image Transformers (DeiT)
5 Attacks
5.1 White Box Attacks
5.2 Black Box Attacks
6 Compression Techniques
6.1 Dynamic Quantization
6.2 Pruning: Dynamic DeiT
6.3 Weight Multiplexing + Distillation: Mini-DeiT
7 Experiments and Results
7.1 Compression Results
7.2 Quantization Attack Results
7.3 Pruning Attack Results
7.4 Weight Multiplexing + Distillation Attack Results
8 Limitations and Future Work
9 Conclusion
References
Analysis of SSH Honeypot Effectiveness
1 Introduction
2 Background
3 Methodology
4 Data and Results
4.1 Uncloaked Data
4.2 Cloaked Data
5 Comparison
6 Conclusion
7 Future Work
Appendix A
References
Human Violence Recognition in Video Surveillance in Real-Time
1 Introduction
2 Related Work
3 Proposal
3.1 Spatial Attention Module (SA)
3.2 Temporal Attention Module (TA)
4 Experiment and Results
4.1 Datasets
4.2 Configuration Model
4.3 Efficiency Evaluation
4.4 Accuracy Evaluation
4.5 Real-Time Evaluation
5 Discussion and Future Work
6 Conclusions
References
Establishing a Security Champion in Agile Software Teams: A Systematic Literature Review
1 Introduction
2 Related Work
2.1 Security Challenges in Agile Teams
2.2 The Champion Role and Security Champions
3 Methodology
3.1 The Search Protocol
3.2 The Systematic Search
3.3 The Adhoc Search
3.4 Data Extraction and Analysis
4 Results
4.1 RQ1 - Is There a Reportedly Consistent View on Security Champion in Agile Software Teams?
4.2 RQ2 - What Is Reported from the Software Engineering Literature About Establishing and Maintaining Security Champion Roles in Agile Software Teams?
4.3 RQ3: Which Challenges Have Been Reported Regarding the Establishment and Maintenance of Security Champion in Agile Software Teams?
5 Discussion
6 Conclusions
References
HTTPA: HTTPS Attestable Protocol
1 Introduction
2 Threat Modeling
3 Problem Statement
4 HTTPS Attestable (HTTPA) Protocol
4.1 Standard HTTP over TLS
4.2 Attestation over HTTPS
4.3 One-Way HTTPA
4.4 Mutual HTTPA (mHTTPA)
5 Summary
References
HTTPA/2: A Trusted End-to-End Protocol for Web Services
1 Introduction
2 Technical Preliminaries
2.1 Trusted Execution Environment (TEE)
2.2 Attest Quote (AtQ)
2.3 Attest Base (AtB)
2.4 Three Types of Request
2.5 Attest Ticket (AtT)
2.6 Attest Binder (AtBr)
2.7 Trusted Cargo (TrC)
2.8 Trusted Transport Layer Security (TrTLS)
3 Protocol Transactions
3.1 Preflight Check Phase
3.2 Attest Handshake (AtHS) Phase
3.3 Attest Secret Provisioning (AtSP) Phase
3.4 Trusted Communication Phase
3.5 Protocol Flow
4 Security Considerations
4.1 Layer 7 End-to-End Protection
4.2 Replay Protection
4.3 Downgrade Protection
4.4 Privacy Considerations
4.5 Roots of Trust (RoT)
5 Conclusion
6 Future Work
7 Notices and Disclaimers
References
Qualitative Analysis of Synthetic Computer Network Data Using UMAP
1 Introduction
1.1 Problem Statement
1.2 Motivation
1.3 Proposed Solution
1.4 Paper Roadmap
2 Background and Literature Review
2.1 Computer Network Traffic and Classification
2.2 Generative Machine Learning
2.3 Synthetic Traffic Evaluation
2.4 CTGAN
2.5 UMAP Embedding
3 Proposed Methodology
4 Case Study
4.1 Plan
4.2 Data Preparation
4.3 Generators
4.4 Real Data is Embedded
4.5 Synthetic Data is Embedded
5 Results and Discussion
5.1 Held-Out vs. GAN vs. Empirical Distribution
5.2 Comparison with Quantitative Measures
5.3 Behavior of UMAP
6 Future Work
6.1 Hyperparameter Search for Quality Embedding
6.2 Use in Traffic GAN Tuning
References
Device for People Detection and Tracking Using Combined Color and Thermal Camera
1 Introduction
2 Literature Review
2.1 Human Detection and Wearing Mask
2.2 Tracking Humans
2.3 Camera RGB-D
2.4 Thermal Camera
2.5 Calibration
3 Functional Results
4 Solution Architecture
5 Test Results
6 Conclusions
References
Robotic Process Automation for Reducing Food Wastage in Swedish Grocery Stores
1 Introduction
2 Robotic Process Automation
2.1 The History of RPA
2.2 Previous Studies
2.3 Related Studies in Food Wastage Reduction
3 Methodology
3.1 Methods for Data Collection
3.2 Thematic Analysis
3.3 Evaluation Methods
4 Results and Analysis
4.1 Advantages and Challenges of Adopting RPA
4.2 Advantages of Adopting RPA
4.3 Challenges of Adopting RPA
4.4 Possibilities of Implementing RPA in Current Business Processes
4.5 The Necessity of Physical Work in Certain Monotonous Tasks
5 Discussions and Conclusions
5.1 Identified Advantages of Adopting RPA
5.2 Identified Challenges of Adopting RPA
5.3 Limitations
5.4 Concluding Remarks and Future Directions
References
A Survey Study of Psybersecurity: An Emerging Topic and Research Area
1 Introduction, Scope and Motivation
2 Causes of Psybersecurity Attacks (PSA)
2.1 PSA Attack Vectors
2.2 Attack Target
2.3 Cialdini's Principles of Psychiatric and Social Engineering
3 The Effects of Psybersecurity Attacks (PSA)
4 Cyberpsychology and its Dimensions
5 Summarizing and Classifying Literature: Survey Study Results 1
6 Analyzing Psybersecurity Attacks (PSA): Survey Study Results 2
7 Psybersecurity Amidst the COVID Pandemic
8 Future Scope of Work and Potential Research Directions
9 Conclusion and Summary
References
Author Index
Lecture Notes in Networks and Systems 652
Kohei Arai Editor
Advances in Information and Communication Proceedings of the 2023 Future of Information and Communication Conference (FICC), Volume 2
Lecture Notes in Networks and Systems
652
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).
Kohei Arai Editor
Advances in Information and Communication Proceedings of the 2023 Future of Information and Communication Conference (FICC), Volume 2
Editor Kohei Arai Faculty of Science and Engineering Saga University Saga, Japan
ISSN 2367-3370   ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-3-031-28072-6   ISBN 978-3-031-28073-3 (eBook)
https://doi.org/10.1007/978-3-031-28073-3
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
We are extremely delighted to bring forth the eighth edition of the Future of Information and Communication Conference (FICC 2023), held successfully on 2 and 3 March 2023. We successfully leveraged the advantages of technology to organize the conference seamlessly in a virtual mode, which allowed 150+ attendees from 45+ nations across the globe to attend this phenomenal event. The conference allowed learned scholars, researchers and corporate tycoons to share their valuable and out-of-the-box studies related to communication, data science, computing and the Internet of Things. The thought-provoking keynote addresses, unique and informative paper presentations and useful virtual roundtables were the key attractions of this conference.

The astounding success of the conference can be gauged by the overwhelming response in terms of papers received. A total of 369 papers were received, out of which 143 were handpicked by careful review in terms of originality, applicability and presentation, and 119 are finally published in this edition. The papers not only presented various novel and innovative ways of dealing with mundane, time-consuming and repetitive tasks in a fool-proof manner but also provided a sneak peek into the future where technology would be an inseparable part of each one's life. The studies also gave an important thread for future research and beckoned all the bright minds to foray into those fields. The conference indeed brought about a scientific awakening amongst all its participants and viewers and is bound to bring about a renaissance in the field of communication and computing.

The conference could not have been successful without the hard work of many people on stage and backstage. The keen interest of authors along with the comprehensive evaluation of papers by technical committee members was the main driver of the conference. The session chairs committee's efforts were noteworthy. We would like to express our heartfelt gratitude to all the above stakeholders. A special note of thanks to our wonderful keynote speakers who added sparkle to the entire event. Last but certainly not least, we would extend our gratitude to the organizing committee who toiled hard to make this virtual event a grand success.

We sincerely hope to provide enriching and nourishing food for thought to our readers by means of the well-researched studies published in this edition. The overwhelming response by authors, participants and readers motivates us to better ourselves each time. We hope to receive continued support and enthusiastic participation from our distinguished scientific fraternity.

Regards,
Kohei Arai
Contents
The Disabled's Learning Aspiration and E-Learning Participation . . . . 1
Seonglim Lee, Jaehye Suk, Jinu Jung, Lu Tan, and Xinyu Wang

Are CK Metrics Enough to Detect Design Patterns? . . . . 11
Gcinizwe Dlamini, Swati Megha, and Sirojiddin Komolov

Detecting Cyberbullying from Tweets Through Machine Learning Techniques with Sentiment Analysis . . . . 25
Jalal Omer Atoum

DNA Genome Classification with Machine Learning and Image Descriptors . . . . 39
Daniel Prado Cussi and V. E. Machaca Arceda

A Review of Intrusion Detection Systems Using Machine Learning: Attacks, Algorithms and Challenges . . . . 59
Jose Luis Gutierrez-Garcia, Eddy Sanchez-DelaCruz, and Maria del Pilar Pozos-Parra

Head Orientation of Public Speakers: Variation with Emotion, Profession and Age . . . . 79
Yatheendra Pravan Kidambi Murali, Carl Vogel, and Khurshid Ahmad

Using Machine Learning to Identify Top Antecedents Affecting Crime in US Communities . . . . 96
Kamil Samara

Hybrid Quantum Machine Learning Classifier with Classical Neural Network Transfer Learning . . . . 102
Avery Leider, Gio Giorgio Abou Jaoude, and Pauline Mosley

Repeated Potentiality Augmentation for Multi-layered Neural Networks . . . . 117
Ryotaro Kamimura

SGAS-es: Avoiding Performance Collapse by Sequential Greedy Architecture Search with the Early Stopping Indicator . . . . 135
Shih-Ping Lin and Sheng-De Wang

Artificial Intelligence in Forensic Science . . . . 155
Nazneen Mansoor and Alexander Iliev
Deep Learning Based Approach for Human Intention Estimation in Lower-Back Exoskeleton . . . . 164
Valeriya Zanina, Gcinizwe Dlamini, and Vadim Palyonov

TSEM: Temporally-Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series . . . . 183
Anh-Duy Pham, Anastassia Kuestenmacher, and Paul G. Ploeger

AI in Cryptocurrency . . . . 205
Alexander I. Iliev and Malvika Panwar

Short Term Solar Power Forecasting Using Deep Neural Networks . . . . 218
Sana Mohsin Babbar and Lau Chee Yong

Convolutional Neural Networks for Fault Diagnosis and Condition Monitoring of Induction Motors . . . . 233
Fatemeh Davoudi Kakhki and Armin Moghadam

Huber Loss and Neural Networks Application in Property Price Prediction . . . . 242
Alexander I. Iliev and Amruth Anand

Text Regression Analysis for Predictive Intervals Using Gradient Boosting . . . . 257
Alexander I. Iliev and Ankitha Raksha

Chosen Methods of Improving Small Object Recognition with Weak Recognizable Features . . . . 270
Magdalena Stachoń and Marcin Pietroń

Factors Affecting the Adoption of Information Technology in the Context of Moroccan SMEs . . . . 286
Yassine Zouhair, Mustapha Belaissaoui, and Younous El Mrini

Aspects of the Central and Decentral Production Parameter Space, Its Meta-Order and Industrial Application Simulation Example . . . . 297
Bernhard Heiden, Ronja Krimm, Bianca Tonino-Heiden, and Volodymyr Alieksieiev

Gender Equality in Information Technology Processes: A Systematic Mapping Study . . . . 310
J. David Patón-Romero, Sunniva Block, Claudia Ayala, and Letizia Jaccheri

The P vs. NP Problem and Attempts to Settle It via Perfect Graphs State-of-the-Art Approach . . . . 328
Maher Heal, Kia Dashtipour, and Mandar Gogate
Hadoop Dataset for Job Estimation in the Cloud with Limited Bandwidth . . . . 341
Mohammed Bergui, Nikola S. Nikolov, and Said Najah

Survey of Schema Languages: On a Software Complexity Metric . . . . 349
Kehinde Sotonwa, Johnson Adeyiga, Michael Adenibuyan, and Moyinoluwa Dosunmu

Bohmian Quantum Field Theory and Quantum Computing . . . . 362
F. W. Roush

Service-Oriented Multidisciplinary Computing: From Code Providers to Transdisciplines . . . . 372
Michael Sobolewski

Incompressible Fluid Simulation Parallelization with OpenMP, MPI and CUDA . . . . 385
Xuan Jiang, Laurence Lu, and Linyue Song

Predictive Analysis of Solar Energy Production Using Neural Networks . . . . 396
Vinitha Hannah Subburaj, Nickolas Gallegos, Anitha Sarah Subburaj, Alexis Sopha, and Joshua MacFie

Implementation of a Tag Playing Robot for Entertainment . . . . 416
Mustafa Ayad, Jessica MacKay, and Tyrone Clarke

English-Filipino Speech Topic Tagger Using Automatic Speech Recognition Modeling and Topic Modeling . . . . 427
John Karl B. Tumpalan and Reginald Neil C. Recario

A Novel Adaptive Fuzzy Logic Controller for DC-DC Buck Converters . . . . 446
Thuc Kieu-Xuan and Duc-Cuong Quach

Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment . . . . 455
Junyong You and Zheng Zhang

A Neural Network Based Approach for Estimation of Real Estate Prices . . . . 474
Ventsislav Nikolov

Contextualizing Artificially Intelligent Morality: A Meta-ethnography of Theoretical, Political and Applied Ethics . . . . 482
Jennafer Shae Roberts and Laura N. Montoya
An Analysis of Current Fall Detection Systems and the Role of Smart Devices and Machine Learning in Future Systems . . . . 502
Edward R. Sykes

Authentication Scheme Using Honey Sentences . . . . 521
Nuril Kaunaini Rofiatunnajah and Ari Moesriami Barmawi

HSDVS-DPoS: A Secure and Heterogeneous DPoS Consensus Protocol Using Heterogeneous Strong Designated Verifier Signature . . . . 541
Can Zhao, XiaoXiao Wang, Zhengzhu Lu, Jiahui Wang, Dejun Wang, and Bo Meng

Framework for Multi-factor Authentication with Dynamically Generated Passwords . . . . 563
Ivaylo Chenchev

Randomness Testing on Strict Key Avalanche Data Category on Confusion Properties of 3D-AES Block Cipher Cryptography Algorithm . . . . 577
Nor Azeala Mohd Yusof and Suriyani Ariffin

A Proof of P ! = NP: New Symmetric Encryption Algorithm Against Any Linear Attacks and Differential Attacks . . . . 589
Gao Ming

Python Cryptographic Secure Scripting Concerns: A Study of Three Vulnerabilities . . . . 602
Grace LaMalva, Suzanna Schmeelk, and Dristi Dinesh

Developing a GSM-GPS Based Tracking System: Vulnerable Nigerian School Children as a Case Study . . . . 614
Afolayan Ifeoluwa and Idachaba Francis

Standardization of Cybersecurity Concepts in Automotive Process Models: An Assessment Tool Proposal . . . . 635
Noha Moselhy and Ahmed Adel Mahmoud

Factors Affecting the Persistence of Deleted Files on Digital Storage Devices . . . . 656
Tahir M. Khan, James H. Jones Jr., and Alex V. Mbazirra

Taphonomical Security: DNA Information with a Foreseeable Lifespan . . . . 674
Fatima-Ezzahra El Orche, Marcel Hollenstein, Sarah Houdaigoui, David Naccache, Daria Pchelina, Peter B. Rønne, Peter Y. A. Ryan, Julien Weibel, and Robert Weil
Evaluation and Analysis of Reversible Watermarking Techniques in WSN for Secure, Lightweight Design of IoT Applications: A Survey . . . . 695
Tanya Koohpayeh Araghi, David Megías, and Andrea Rosales

Securing Personally Identifiable Information (PII) in Personal Financial Statements . . . . 709
George Hamilton, Medina Williams, and Tahir M. Khan

Conceptual Mapping of the Cybersecurity Culture to Human Factor Domain Framework . . . . 729
Emilia N. Mwim, Jabu Mtsweni, and Bester Chimbo

Attacking Compressed Vision Transformers . . . . 743
Swapnil Parekh, Pratyush Shukla, and Devansh Shah

Analysis of SSH Honeypot Effectiveness . . . . 759
Connor Hetzler, Zachary Chen, and Tahir M. Khan

Human Violence Recognition in Video Surveillance in Real-Time . . . . 783
Herwin Alayn Huillcen Baca, Flor de Luz Palomino Valdivia, Ivan Soria Solis, Mario Aquino Cruz, and Juan Carlos Gutierrez Caceres

Establishing a Security Champion in Agile Software Teams: A Systematic Literature Review . . . . 796
Hege Aalvik, Anh Nguyen-Duc, Daniela Soares Cruzes, and Monica Iovan

HTTPA: HTTPS Attestable Protocol . . . . 811
Gordon King and Hans Wang

HTTPA/2: A Trusted End-to-End Protocol for Web Services . . . . 824
Gordon King and Hans Wang

Qualitative Analysis of Synthetic Computer Network Data Using UMAP . . . . 849
Pasquale A. T. Zingo and Andrew P. Novocin

Device for People Detection and Tracking Using Combined Color and Thermal Camera . . . . 862
Paweł Woronow, Karol Jedrasiak, Krzysztof Daniec, Hubert Podgorski, and Aleksander Nawrat

Robotic Process Automation for Reducing Food Wastage in Swedish Grocery Stores . . . . 875
Linus Leffler, Niclas Johansson Bräck, and Workneh Yilma Ayele
A Survey Study of Psybersecurity: An Emerging Topic and Research Area . . . . 893
Ankur Chattopadhyay and Nahom Beyene

Author Index . . . . 913
The Disabled's Learning Aspiration and E-Learning Participation

Seonglim Lee1, Jaehye Suk2(B), Jinu Jung1, Lu Tan1, and Xinyu Wang1

1 Department of Consumer Science, Convergence Program for Social Innovation, Sungkyunkwan University, Seoul, South Korea
2 Convergence Program for Social Innovation, Sungkyunkwan University, Seoul, South Korea
[email protected]
Abstract. This study attempted to examine the effects of ICT access/usage and learning aspiration on the disabled's e-learning participation, using data from the 2020 Survey on Digital Divide collected by the National Information Society Agency in Korea, a nationally representative sample of 4,606 individuals, including the disabled and non-disabled. Chi-square tests, analysis of variance, and Structural Equation Modeling (SEM) were conducted. The major findings of this study were as follows. First, the types of disability did not have a significant effect on learning aspirations but significantly affected whether to participate in e-learning. The visual and hearing/language disabled were more likely to participate in e-learning than the non-disabled. These findings suggest that the disabled have as much learning aspiration as the non-disabled. Rather, they have greater demand for e-learning. Second, access to a PC/notebook and internet usage were related to stronger learning aspirations, and those with stronger learning aspirations were more likely to participate in e-learning activities. Therefore, compared to those who cannot access a PC/notebook or use the Internet, access to a PC/notebook and internet usage not only directly affect e-learning participation but also indirectly affect it through learning aspirations. The findings suggest that ICT access and use are not merely tools necessary for e-learning; they also contribute to e-learning by stimulating learning aspirations.

Keywords: Disabled · E-learning · Learning aspiration · Access to ICT devices · Internet use
1 Introduction

With the outbreak of the COVID-19 pandemic, the quarantine and lockdown policies adopted worldwide have highlighted the strengths of e-learning, such as being cost-effective, time-efficient, and independent of time and place, leading to a rapidly growing trend of e-learning [1]. E-learning works well in breaking isolation and increasing social connections through learners' integration into a virtual learning community while they learn new knowledge [2]. With the rising popularity of e-learning, consumers can now access educational resources more efficiently than before. E-learning can be an effective mode for the disabled to improve their access to education and help them integrate into a knowledge-based economy and society [3]. Socially isolated disabled people in particular have a high demand for e-learning because of their limited educational opportunities.

Accessibility is the most crucial factor affecting e-learning participation [4]. Concerning e-learning for the disabled, the problems of accessibility and the application of assistive technologies have been consistently emphasized. Although the disabled face various difficulties in taking e-learning classes, including shortages and high costs of usable ICT devices and services, insufficient assistive technology, and a lack of skills to use devices [5–7], scant research has been conducted on their e-learning participation.

This study aims to examine the e-learning participation of the Korean disabled, focusing on adult lifelong education. The specific research questions are as follows:

1) What is the difference in the e-learning participation rate, perceived effectiveness of e-learning, and learning aspiration between the disabled and non-disabled by the types of disabilities?
2) What is the difference in ICT access and use between the disabled and non-disabled by the types of disabilities?
3) How do the types of disability, ICT access and use, and learning aspiration affect e-learning participation?
4) Does learning aspiration mediate the relationship between the types of disability and e-learning participation and between access to and use of ICT and e-learning participation?

This study will reveal the unique challenges for e-learning participation faced by the disabled and provide constructive suggestions to satisfy the disabled's demand for learning.
2 Literature Review

2.1 Lifelong Education for the Disabled

Lifelong education refers to learning and teaching activities conducted throughout life to improve the quality of human life [8]. It is a self-directed learning activity conducted by learners for self-realization and fulfillment throughout their lives [9]. Since lifelong education is learning for everyone, no one can be excluded from it. It satisfies humans' unique learning instincts, guarantees the right to learn, improves the quality of life, and increases national competitiveness.

The right to lifelong education for the disabled is not an optional requirement to compensate for the deficiency caused by the disability. Regardless of the presence or absence of disabilities, opportunities should be provided to guarantee the right to lifelong learning and lifelong education for all humans, and equity should be secured. Lifelong education for the disabled is not only a means of satisfying their right to lifelong learning but also fosters the ability to cope with many hardships, difficulties, and unexpected situations that the disabled can experience in their lives [9, 10]. Ultimately, it is instrumental in realizing the self-reliance of the disabled pursued by welfare for the disabled [9]. However, most of the educational activities for the disabled so far have been aimed mainly at school-age disabled people. To guarantee the right to lifelong learning of the disabled, not only the quantitative growth of lifelong education for the disabled but also its qualitative growth must be achieved. Therefore, it is necessary to investigate the demands of the disabled's participation in lifelong education and to develop programs based on them. In particular, e-learning provides effective educational opportunities as a tool that can offer teaching-learning opportunities that meet the individualized educational needs of the disabled.

2.2 E-Learning and the Disabled

E-learning is the use of online resources in both distance and conventional learning contexts, and it can be described as "learning environments created using digital technologies which are networked and collaborative via internet technologies" [11]. E-learning includes computer-based training, online learning, virtual learning, web-based learning, and so on [12]. E-learning is considered an instructive result of Information and Communication Technologies (ICTs) and essentially involves utilizing any electronic device, from PCs to cell phones [13]. The technology behind e-learning brings many benefits. For educational institutions, the major benefits of e-learning are that it compensates for the scarcity of academic personnel and enhances knowledge efficiency and competency through easy access to a vast bulk of information [14]. For individual learners, e-learning can remove the barriers associated with transportation, discrimination, and racism and promote a self-paced learning process through flexibility, accessibility, and convenience [15, 16].

All humans desire to learn, and the socially isolated disabled in particular have a strong desire for learning [17]. However, because of their relatively low accessibility to new technologies, disabled people may become alienated in education or learning. In this respect, the participation of the disabled in e-learning is essential. The use of e-learning by learners with disabilities can enhance independence, facilitate their learning needs, and enable them to use their additional specialist technology [18]. However, disabled individuals experience some difficulties in using e-learning. For example, learners with mental disabilities face problems with a professor's insufficient use of e-learning, learners with hearing disabilities experience issues with accessibility to audio and video materials, and learners with visual disabilities have difficulty accessing lecture notes and materials. For disabled learners, this limited accessibility negatively affects learning and acquiring new skills [3, 19, 20]. E-learning offers a number of opportunities for persons with disabilities by facilitating access to new services, knowledge, and work from any place and by breaking the isolation that disabled people feel in life [2].

Recent research on e-learning for learners with disabilities has focused mainly on examining e-learning usage and issues in specific countries. The main reasons consumers do not participate in e-learning activities are accessibility [4] and a severe lack of assistive technology in computer labs and libraries [7]. Specifically, compared with disabled people, non-disabled people have a better ability to use e-learning tools because disabled people are more mentally and physically stressed in social pressure tasks than non-disabled people [21].
3 Method

3.1 Sample

This study used data from the 2020 Survey on Digital Divide collected by the National Information Society Agency in Korea. This nationwide survey provided information on the e-learning experience of disabled and non-disabled people and on ICT use. Among the 9,200 nationally representative survey participants, a sample of 4,606 individuals in their 30s to 50s, consisting of 1,164 disabled and 3,442 non-disabled people, was selected for the analysis. The descriptive characteristics of the sample are shown in Table 1.

Table 1. Descriptive statistics of the sample (N = 4,606, weighted)

Variables                        Disabled (N = 1,164)     Non-disabled (N = 3,442)
                                 Freq.     (%)            Freq.     (%)
Gender       Male                846       72.68          1763      51.22
             Female              318       27.32          1679      48.78
Age          30s                 130       11.17          1030      29.92
             40s                 338       29.04          1183      34.37
             50s                 696       59.79          1229      35.71
Education    ≤Middle school      313       26.89          41        1.19
             High school         695       59.71          1562      45.38
             ≥College            156       13.40          1839      53.43
Occupation   White collar        115       9.88           1222      35.50
             Service             84        7.22           472       13.71
             Sales               95        8.16           690       20.05
             Blue collar         272       23.37          511       14.85
             House-keeping       177       15.21          538       15.63
             Not working         421       36.17          9         0.26
Income       Mean (SD)           270.30    (140.87)       442.84    (119.42)
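As a rough illustration of how a weighted sample profile like Table 1 can be produced, the sketch below filters respondents in their 30s to 50s and tabulates weighted frequencies and percentages by disability status. It is not the authors' code; the file name, the column names ("age_group", "disabled", "gender"), and the weight variable are all assumed placeholders.

```python
# Minimal sketch, assuming a flat survey file with hypothetical column names
# and a sampling weight column called "weight".
import pandas as pd

raw = pd.read_csv("digital_divide_2020.csv")                 # hypothetical file name
sample = raw[raw["age_group"].isin(["30s", "40s", "50s"])]   # respondents in their 30s to 50s

def weighted_dist(df, col, weight="weight"):
    """Weighted frequency and percentage distribution of one categorical variable."""
    w = df.groupby(col)[weight].sum()
    return pd.DataFrame({"Freq.": w.round(0), "(%)": (100 * w / w.sum()).round(2)})

# Separate profiles for the disabled and non-disabled groups, as in Table 1
for group, g in sample.groupby("disabled"):
    print(group)
    print(weighted_dist(g, "gender"))
```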
3.2 Measurement

The dependent variable was e-learning participation: e-learning participants were coded as one and zero otherwise. The independent variables included the types of disability and ICT use. Disability consisted of four dummy variables identifying physical, brain lesion, visual, and hearing/language disabilities. ICT use consisted of access to ICT devices and internet use, measured with three dummy variables indicating whether one could access desktop/notebook and mobile devices and whether one had used the Internet during the last month. The mediation variable was the aspiration for learning, which was measured with three items on a four-point Likert scale ranging from one ("not at all") to four ("very much"). Socio-demographic variables such as gender, education, age group, occupation, and the logarithm of family income were included as control variables.

3.3 Analysis

Chi-square tests, analysis of variance using the Generalized Linear Model (GLM), and Structural Equation Modeling (SEM) were conducted. STATA (version 17) was used.
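The models themselves were estimated in STATA 17. Purely as an illustration of the mediation logic tested in Sect. 4.3 (ICT access/use → learning aspiration → e-learning participation), a minimal Python sketch with hypothetical variable names is given below; it is not the authors' code and it simplifies the SEM to two regression equations with a bootstrap of the indirect effect.

```python
# Illustrative only: a two-equation approximation of the mediation model,
# with hypothetical column names; the paper fits a full SEM in STATA 17.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def indirect_effect(df: pd.DataFrame) -> float:
    # Path a: internet use -> learning aspiration (linear model)
    a = smf.ols("aspiration ~ internet_use + pc_access + C(disability)"
                " + male + C(age_group) + log_income", data=df).fit().params["internet_use"]
    # Path b: learning aspiration -> e-learning participation (logit)
    b = smf.logit("participation ~ aspiration + internet_use + pc_access"
                  " + C(disability) + male + C(age_group)", data=df).fit(disp=0).params["aspiration"]
    return a * b

def bootstrap_ci(df: pd.DataFrame, n_boot: int = 3000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the indirect effect, as in Table 5."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_boot)
    for i in range(n_boot):
        sample = df.sample(n=len(df), replace=True,
                           random_state=int(rng.integers(0, 2**31 - 1)))
        draws[i] = indirect_effect(sample)
    lower, upper = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return indirect_effect(df), (lower, upper)
```

The percentile interval mirrors the 3,000-sample bootstrap reported with the mediation results later in the paper.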
4 Results

4.1 E-Learning Participation Rate and Perceived Effectiveness of E-Learning

Table 2 shows the distribution of e-learning participation rates and the perceived effectiveness of e-learning by type of disability and for the non-disabled. The e-learning participation rates for the physical and hearing/language disabled were higher, but those for the brain lesion and visual disabled were lower, compared to the participation rate of the non-disabled. Perceived effectiveness of e-learning differed significantly among the types of disability. In the analysis using the whole sample, including non-participants in e-learning, the hearing/language disabled perceived e-learning to be as effective as the non-disabled did. In the analysis using participants only, the perceived effectiveness of e-learning did not differ significantly among the physical, visual, and hearing/language disabled and the non-disabled. Overall, the e-learning participant sample perceived e-learning as more effective than the whole sample including non-participants, which suggests that some non-participants expected e-learning to be less effective than those who had experienced it. The learning aspiration score was the highest for the non-disabled (2.82), followed by the hearing/language disabled (2.55), the physical disabled (2.52), the visual disabled (2.28), and the brain lesion disabled (2.24). Overall, the disabled had lower learning aspirations than the non-disabled; specifically, the brain lesion and visual disabled showed the two lowest learning aspiration scores.

Table 2. E-learning participants and perceived effectiveness of E-learning

| | Physical | Brain lesion | Visual | Hearing/Language | Non-disabled | Chi-sq/F value |
|---|---|---|---|---|---|---|
| Participation rate (%) | 37.70 | 15.93 | 15.82 | 35.71 | 28.73 | 37.42*** |
| Perceived effectiveness (total sample) | 2.62 (0.87) b | 2.4 (0.81) c | 2.65 (0.92) b | 2.82 (0.81) a | 2.82 (0.50) a | 16.08*** |
| Perceived effectiveness (participants) | 2.83 (0.78) ab | 2.72 (0.67) b | 2.84 (0.8) ab | 2.88 (0.72) ab | 3.07 (0.68) a | 7.55*** |
| Learning aspiration | 2.52 (0.65) b | 2.44 (0.71) c | 2.28 (0.69) c | 2.55 (0.64) b | 2.82 (0.57) a | 87.92*** |

* p < .05, ** p < .01, *** p < .001

4.2 ICT Access and Use

Table 3 shows the percent distribution of access to ICT devices and internet use rates. Overall, the non-disabled showed higher access and usage rates than the disabled. Almost all non-disabled accessed mobile devices and used the Internet. The rates of ICT access
and Internet use differed among the types of disability. Overall, the physical and hearing/language disabled were more likely to access and use ICT than the other disabled groups. Among the disabled, the rate of access to mobile devices was higher than the rate of access to desktops/notebooks. The visual disabled showed the lowest rates of internet use and of access to mobile devices. These findings suggest that the visual disabled may be the most disadvantaged in ICT access and use.

Table 3. ICT access and use

| (%) | Physical | Brain lesion | Visual | Hearing/Language | Non-disabled |
|---|---|---|---|---|---|
| Desktop/Notebook | 81.43 | 60.18 | 71.52 | 75.00 | 90.62 |
| Mobile | 94.37 | 84.96 | 82.28 | 96.43 | 99.74 |
| Internet use | 91.42 | 77.88 | 72.28 | 97.32 | 99.45 |
4.3 The Results of Structural Equation Model for E-Learning Participation

The results of the SEM analysis are presented in Table 4. The fit indices of the structural model were χ2 = 1114.282 (P = 0.000), CFI = 0.975, TLI = 0.852, RMSEA = .044, and SRMR = 0.007, indicating that the model was acceptable.
As shown in Table 4, access to a PC/notebook and internet use were significantly associated with learning aspiration, and those who accessed a PC/notebook and used the Internet were more likely to participate in e-learning. The types of disability were not significantly related to learning aspiration after controlling for ICT access and usage variables and socio-demographic variables. The results suggested that disability itself did not affect learning aspiration; rather, the ICT environment and socio-economic conditions were influential in having learning aspirations. Those in their 30s and 40s had significantly higher learning aspirations than those in their 50s. Higher family income and having white-collar, service, or sales jobs rather than not working were significantly associated with higher learning aspirations. The variables that significantly affected e-learning participation were learning aspiration, the types of disability, ICT access and use, gender, and age group. The physical and hearing/language disabled were more likely to participate in e-learning than the non-disabled.

Table 4. Result of structural equation model analysis for e-learning participation
Learning aspiration B 0.072
(0.062)
0.159 (0.047)***
Brain
–0.047
(0.085)
0.032 (0.055)
Visual
–0.041
(0.076)
0.035 (0.051)
0.079
(0.080)
0.147 (0.058)**
P.C./notebook
0.104
(0.108)***
0.050 (0.025)*
Mobile
Physical
Hearing/Language Access to devices
Bootstrap (S.E) 0.199 (0.014)***
Learning aspiration Disable (Non-disabled)1 )
E-learning participation
Bootstrap (S.E) B
0.491
(0.091
Internet use
1.426
(0.124)***
Male (Female)1 )
0.040
(0.027)
−0.024 (0.036) 0.129 (0.033)*** −0.044 (0.019)**
Education (Middle school)1 )
High school
Age group (the 50s)1 )
The 30s
0.2237 (0.031)***
0.178 (0.027)***
The 40s
0.072 (0.020)***
Log(income)
0.0964 (0.028)*** 0.123 (0.039)**
Occupation(Non)1) White
0.298
(0.040)***
Service
0.132
(0.046)***
Sales
0.218
(0.041)***
Blue
0.072
(0.039)
1.429
(0.121)***
Constant
College
−0.016 (0.026)
–0.356 (0.079)***
Notes: 1) The reference point for the dummy variable. Based on 3,000 bootstrap samples (n = 2,351). * p < .05, ** p < .01, *** p < .001
Those who had higher learning aspirations, accessed a PC/notebook, and used the Internet were more likely to participate in e-learning. As shown in Table 5, we found a positive mediation effect of learning aspiration in the relationship between ICT access/usage and e-learning participation. The results indicated that ICT access and usage not only directly affected e-learning participation but also had an indirect influence through enhancing learning aspiration.

Table 5. Mediation effect of learning aspiration on e-learning participation

| Path | Effect | Bootstrap S.E. | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| Access to PC/notebook → Learning aspiration → E-learning participation | .029*** | .008 | .013 | .045 |
| Internet use → Learning aspiration → E-learning participation | .098*** | .020 | .059 | .136 |

Notes: Based on 3,000 bootstrap samples. CI = confidence interval. * p < .05, ** p < .01, *** p < .001
5 Conclusion

This study examined the effects of ICT access/usage and learning aspiration on the disabled's e-learning participation, using a nationally representative sample of the Korean population. The major findings of this study were as follows. First, the types of disability did not have a significant effect on learning aspirations but significantly affected whether to participate in e-learning. The physical and hearing/language disabled were more likely to participate in e-learning than the non-disabled. These findings suggest that the disabled have as much learning aspiration as the non-disabled; rather, they have a greater demand for e-learning. Second, access to a PC/notebook and internet usage were related to stronger learning aspirations, and those with stronger learning aspirations were more likely to participate in e-learning activities. Therefore, access to a PC/notebook and internet usage not only directly affect e-learning participation but also indirectly affect it through learning aspirations, compared to those who can neither access a PC/notebook nor use the Internet. The findings suggest that ICT access and use are not merely tools necessary for e-learning; they also contribute to e-learning by stimulating learning aspirations. However, the disabled showed lower desktop/notebook access and internet use rates than the non-disabled. Notably, the brain lesion and visual disabled showed the lowest rates due to their difficulties in using desktop/notebook devices. Considering that access to a desktop/notebook and internet use are indispensable prerequisites for e-learning participation, the development of desktop/notebook devices satisfying their unique needs may be necessary.
References 1. Patzer, Y., Pinkwart, N.: Inclusive E-Learning–towards an integrated system design. Stud. Health Technol. Inform. 242, 878–885 (2017) 2. Hamburg, I., Lazea, M., Marin, M.: Open web-based learning environments and knowledge forums to support disabled people. In: Assoc Prof Pedro Isaias. International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, vol. 1, pp. 205–216. Springer, Berlin, Heidelberg (2003) 3. Cinquin, P.A., Guitton, P., Sauzéon, H.: Online e-learning and cognitive disabilities: a systematic review. Comput. Educ. 130, 152–167 (2019) 4. Lago, E.F., Acedo, S.O.: Factors affecting the participation of the deaf and hard of hearing in e-learning and their satisfaction: a quantitative study. Int. Rev. Res. Open Distrib. Learn. 18(7), 268–297 (2017) 5. Akcil, U., Ünlücan, Ç.: The problems disabled people face in mobile and web based elearning phases in a developing country. Qual. Quant. 52(2), 1201–1210 (2018). https://doi. org/10.1007/s11135-018-0683-z 6. Nganji, J.T.: Designing disability-aware e-learning systems: disabled students recommendations. Int. J. Adv. Sci. Technol. 48(6), 1–70 (2018) 7. Alsalem, G.M., Doush, I.A.: Access education: what is needed to have accessible higher education for students with disabilities in Jordan? Int. J. Spec. Educ. 33(3), 541–561 (2018) 8. Sun, J., Wang, T., Luo, M.: Research on the construction and innovation of lifelong education system under the background of big data. In: 2020 International Conference on Big Data and Informatization Education (ICBDIE), pp. 30–33. IEEE, Zhangjiajie, China (2020) 9. Kwon, Y.-S., Choi, H.-J.: A study on the importance and performance of service quality in lifelong education institutions-focused on adult learners. e-Bus. Stud. 23(1), 203–213 (2022) 10. Kwon, Y.-S., Ryu, K.-H., Song, W.-C., Choi, H.-J.: Using smartphone application to expand the participation in lifelong education for adult disabled. JP J. Heat Mass Trans. Spec. Issue, 111–118 (2020) 11. Lynch, U.: The use of mobile devices for learning in post-primary education and at university: student attitudes and perceptions. Doctoral dissertation, Queens University Belfast (2020) 12. Hussain, F.: E-Learning 3.0 = E-Learning 2.0 + Web 3.0?. In: International Association for Development of the Information Society (IADIS) International Conference on Cognition and Exploratory Learning in Digital Age (CELDA) 2012, pp. 11–18. IADIS, Madrid (2012) 13. Amarneh, B.M., Alshurideh, M.T., Al Kurdi, B.H., Obeidat, Z.: The impact of COVID-19 on e-learning: advantages and challenges. In: Hassanien, A.E., et al. (eds.) AICV 2021. AISC, vol. 1377, pp. 75–89. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-76346-6_8 14. Hosseindoost, S., Khan, Z.H., Majedi, H.: A shift from traditional learning to e-learning: advantages and disadvantages. Arch. Neurosci. 9(2), 1–3 (2022) 15. Coronel, R.S.: Disabled online learners: Benefits and drawbacks of online education and learning platforms when pursuing higher education. Doctoral dissertation, Capella University (2008) 16. Bruck, P.A., Buchholz, A., Karssen, Z., Zerfass, A.: E-Content: Technologies and Perspectives for the European Market. Springer Science & Business Media, Berlin (2006) 17. Oh, Y.-K.: A study on the current status and needs of lifelong education: focused on night school for people with disabilities. J. Soc. Sci. Res. 24, 85–119 (2015) 18. Wald, M., Draffan, E.A., Seale, J.: Disabled learners experiences of e-learning. J. Educ. 
Multimedia Hypermedia 18(3), 341–361 (2009) 19. Kim, J.Y., Fienup, D.M.: Increasing access to online learning for students with disabilities during the COVID-19 pandemic. J. Spec. Educ. 55(4), 213–221 (2022)
20. Vanderheiden, G.-C.: Ubiquitous accessibility, common technology core, and micro assistive technology: commentary on computers and people with disabilities. ACM Trans. Accessible Comput. (TACCESS) 1(2), 1–7 (2008) 21. Kotera, Y., Cockerill, V., Green, P., Hutchinson, L., Shaw, P., Bowskill, N.: Towards another kind of borderlessness: online students with disabilities. Distance Educ. 40(2), 170–186 (2019)
Are CK Metrics Enough to Detect Design Patterns? Gcinizwe Dlamini(B) , Swati Megha, and Sirojiddin Komolov Innopolis University, Innopolis, Tatarstan, Russian Federation [email protected]
Abstract. Design patterns are used to address common design problems in software systems. Several machine learning models have been proposed for detecting and recommending design patterns for a software system. However, limited research has been done on using machine learning models to establish a correlation between design patterns and code quality using CK metrics. In this paper, we first present a manually labelled dataset composed of 590 open-source software projects, primarily from GitHub and GitLab. Second, using CK metrics as input to nine popular machine learning models, we predict design patterns. Lastly, we evaluate the nine machine learning models using four standard metrics, namely Precision, Recall, Accuracy, and F1-Score. Our proposed approach showed noticeable improvement in precision and accuracy.

Keywords: Software design · Machine learning · Code quality · Design patterns

1 Introduction
In the software engineering domain, a design pattern refers to a general solution to a commonly occurring design problem [30]. The use and creation of design patterns became popular after the Gang of Four (GoF) published a book [20] in 1994. The authors proposed 23 design patterns to solve different software design problems. Since then, many new design patterns have emerged to solve several design-related problems. Evidence shows that design patterns solve design issues; however, the usefulness of design patterns for software products has been a continuous research topic ever since their inception. In the past decade, many studies have been carried out to investigate the usefulness of design patterns for different aspects of software products and the software development process. Studies have shown that the effect of patterns on software quality is not uniform: the same pattern can have both a positive and a negative effect on the quality of a software product [25]. Also, most of the studies are performed on specific software product cases and fail to present a generalized correlation [50]. Currently, software engineers naturally use several design patterns to develop software products. Software systems incorporate several small to large design
patterns aiming to avoid design problems and achieve better software quality. In software development, the code base gets modified by programmers in a distributed manner. As the code base incrementally increases in size and complexity, identifying the specific design patterns implemented and their impact on software quality becomes a complex task. The complexity of design pattern identification emerges for multiple reasons. Firstly, code is often poorly documented. Secondly, programmers implement design patterns according to their own understanding and coding styles. Thirdly, a design pattern can be broken or changed during a bug fix or software upgrade; hence documenting patterns is a challenging job, and software teams often refrain from doing so. Lastly, some design patterns are introduced into the code base through third-party libraries and open-source software without any explicit class names, comments, or documentation, making them hard for programmers to recognize. Therefore, to better understand the impact of using a specific design pattern on code quality (i.e., in terms of the number of software defects, energy consumption, etc.), it is crucial to first detect the design patterns implemented in the code base independently of documentation. Moreover, design pattern detection contributes to matching patterns with project requirements [16]. To reduce the complexity of manual design pattern identification and of understanding the relation to software quality, systems that can automatically detect and recommend design patterns in a software product should be developed. Sufficient studies have been conducted on using static analyzers for design pattern detection; however, static analyzers cannot be considered the optimal solution to the problem due to several drawbacks. Static analyzers use criteria such as class structure for pattern analysis. In cases where class structures are similar across patterns, static analyzers might not identify the patterns correctly. Also, imposing strict conditions such as class structure can lead to the elimination of a pattern in case of slight variations [11]. When the set of measured metrics grows, several static rules for the metrics need to be added; hence, with a growing metric set the number of rules also increases, and at some point the rule base can become difficult to handle. In recent years, research on using machine learning models for software pattern recommendation and detection has emerged as a reasonable solution to the issue. Hummel and Burger [24] propose a proof-of-concept recommendation system. Another paper offers the detection of anti-patterns using a machine learning classifier [1]. One more model proposes using graph matching and machine learning techniques for pattern detection. Source code metrics have also been used to detect design patterns [44]. The studies proposing machine learning models for design patterns mostly focus either on detecting patterns in the existing code or on recommending patterns based on the existing source code. However, limited research has been conducted on using machine learning to establish the correlation between design patterns and software quality, and thus at present machine learning models that can present a generalized matrix with evidence are lacking.
To generalize the correlation between design patterns and software quality, experiments must be conducted on several open-source projects rather than on specific case studies of individual software products. At present, we lack quantitative data and evidence on how design patterns can improve software quality. To fill this gap, in this paper we propose a machine learning approach to detect design patterns in open source projects and collect a dataset to be used as a benchmark. The rest of this paper is organized as follows. Section 2 overviews design patterns and machine learning approaches for design pattern detection. Section 3 overviews the methodology of our proposed approach to detect design patterns. Section 4 presents the obtained results, followed by a discussion in Sect. 5. In Sect. 6, the paper is concluded with future research directions.
2 Background and Related Works
This section presents background on design patterns, their categories, and design patterns detection approaches.

2.1 Design Patterns Background
Design patterns were created to provide proven solutions to reoccurring design problems in the world of Object-oriented systems. Gang of Four (GoF) itself has proposed 23 different design patterns. The 23 different design patterns are organized into three categories: Structural, Creational, and Behavioural. New design patterns started emerging after the popularity of GoF. A few years later in the year 1996, the authors [38] formalized and documented software architectural styles. They categorized architectural styles and created variants of those styles through careful discrimination. The purpose of these architectural styles was to guide the design of system structures [38]. Design patterns have evolved over the period and at present, we have many available design patterns, therefore organization and classification of design patterns are required for their effective use [21]. In our study, we will focus on analyzing the commonly used design patterns from the Gang of Four [20], the design patterns we will be focusing on are as follows: – Creational design Patterns: The patterns in this category provide ways to instantiate classes and to create objects in complex object-oriented software systems. Examples of Creational design Patterns are Singleton, Builder, Factory Method, Abstract Factory, Prototype, and Object Pool. – Structural design patterns: The patterns in this category deal with assembling classes and objects in large and complex systems keeping them flexible for future updation. Examples of Structural design patterns are Adapter, Bridge, Composite, Decorator, Facade, Flyweight, and proxy.
– Behavioral design patterns: The patterns in this category deal with interaction, communication, and responsibility assignment among objects at the time of execution. Some of the examples of Behavioural design patterns are Iterator, Memento, Observer, and State.
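As a concrete illustration of the creational category listed above, the Singleton pattern restricts a class to a single shared instance. The studied projects are written in Java; the short Python sketch below is only meant to convey the idea.

```python
class Singleton:
    """Creational pattern: at most one shared instance is ever created."""
    _instance = None

    def __new__(cls):
        # Reuse the existing instance instead of creating a new one.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


registry_a = Singleton()
registry_b = Singleton()
assert registry_a is registry_b  # both names point to the same object
```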
2.2 Design Patterns Detection
Machine Learning Approach: Over the past decades, several research domains have shown evidence towards the usefulness of ML models and their out-performance over the existing solutions ( [10,39,46]). Machine learning (ML) based approaches have also been adopted in the software engineering domain specifically in the context of design pattern detection. Zanoni et al. [49] proposed a machine learning-based technique to identify five specific design patterns namely, Singleton, Adapter, Composite, Decorator, and Factory method. The authors implemented a design patterns detection approach using Metrics and Architecture Reconstruction Plugin for Eclipse (MARPLE). The presented solution works on micro-structures and metrics that have been measured inside the system. The authors further implemented and evaluated different classification techniques for all five design patterns. The highest accuracy for all five patterns varies from 0.81 to 0.93. The proposed approach has four limitations. Firstly, the training sets used in the experiment are based on a manual design pattern. Secondly, the labeling is done using a limited (10) number of publicly available software projects. Thirdly, the contents of libraries are not included in them. And lastly, Classifier performances are estimated under the assumption that the Joiner has 100% recall. Another research [40] proposed a deep learning-driven approach to detect six design patterns (Singleton, Factory Method, Composite, Decorator, Strategy, and Template method). For detection, the researchers [40] retrieved the source code given as an abstract semantic graph and extracted features to be used as the input to the deep learning model. The procedure extracted micro-structures within the abstract semantic graph. These microstructures are used then for the candidate sampling process. Candidate mappings fulfill the basic structural properties that a pattern has hence qualify for the detailed analysis via the inference method. The detection step takes the candidate role mappings and decides whether they are a specific pattern or not. The authors in addition to deep learning methods used three more machine learning techniques (Nearest Neighbor, Decision Trees, and Support Vector Machines) to detect the design patterns. Nine open source repositories with design patterns as a data set. The accuracy of final models for each pattern varies from 0.754 to 0.985. Satoru et al. [45] proposed design patterns detection techniques using source code metrics and machine learning. The authors proposed an approach aimed at identifying five design patterns (Singleton, Template method, Adapter, State, and Strategy). The researchers [45] derived experimental data into small-scale and large-scale codes and found a different set of metrics for two types of data. For the classification of the design patterns, a neural network was used. The F-measure of the proposed technique varies from 0.67 to 0.69 depending on
the design pattern, but the solution shows some limitations, as it is unable to distinguish patterns in which the class structures are similar. Also, the number of patterns that could be recognized by the proposed technique is limited to five patterns. Graph Theory Approach: Graph theory is the study of graphs that are being formed by vertices and edges connecting the vertices [7]. It is one of the fundamental courses in the field of computer science. Graphs can be used to represent and visualize relationship(s) between components in a problem definition or solution. The concepts of graph theory have successfully been employed in solving complex problems in domains such as Linguistics [22], Social sciences [5], Physics [17] and chemistry [9]. In the same way, software engineers have adopted the principles of graph theory in solving tasks such as design patterns detection (i.e. [4,32,48]). One of the graph theory-based state-of-the-art approaches to detect design patterns was proposed by Bahareh et al. [4]. The authors proposed a new method for the detection. Their proposed model is based on the semantic graphs and the pattern signature that characterize each design pattern. The proposed two-phase approach was tested on three open-source benchmark projects: JHotDraw 5.1, JRefactory 2.6.24, and JUnit 3.7. To address the challenges faced by existing pattern detection methodologies, Bahareh et al. [43] proposed a graph-based approach for design patterns detection. The main challenges addressed by their proposed approach are the identification of modified pattern versions and search space explosion for large systems. Bahareh et al. [43] approach leverages similarity scoring between graph vertices. The researchers in their proposed approach [32] performed a similar series of experiments using a sub-graph isomorphism approach in FUJABA [31] environment. The main goal for the researchers was to address scalability problems that emerged because of variants of design patterns implementation. The proposed approach was validated on only two major Java libraries which is not quite enough the establish the effectiveness of the authors [32] proposed method. In light of all the aforementioned approaches proposed by researchers in the studies above there still exist a research gap in terms of lack of benchmark data for evaluation. Thus in this paper, we present the first step in benchmark dataset creation and comparison of machine learning methods evaluated on the dataset.
3 Methodology
Our proposed model pipeline, presented in Fig. 1, contains four main stages:

A. Data Extraction
B. Data Preprocessing
C. ML model definition and training
D. Performance Measuring

The following sections present the details of each stage.
Fig. 1. The proposed design patterns detection pipeline.
3.1 Data Extraction
A significant part of the dataset was extracted from two popular version control platforms, namely GitHub (https://github.com) and GitLab (https://gitlab.com). The repositories were manually retrieved by interns in the software engineering laboratory at Innopolis University and manually labeled by the laboratory researchers. The target programming language for the source code in all projects included in the dataset is Java. In addition to these version control platforms, student projects from Innopolis University were also added to the dataset. The student projects were produced in the scope of an upper-division bachelor course, "System Software Design" (SSD). In this course, the students were taught different design patterns and were required to design and implement a system applying at least one design pattern from GoF as part of the course evaluation. For each project's source code, the metrics proposed by Chidamber and Kemerer [12], now referred to as the CK metrics, were extracted and served as input data for the machine learning models. To extract the CK metrics, an open-source tool was used [3].
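A sketch of how such a per-project feature table could be assembled is shown below. The CK tool referenced in [3] is run on each Java project and writes class-level CSV output; the exact command-line arguments, the column names, and the choice to average class-level metrics into one row per project are illustrative assumptions, not a description of the authors' exact procedure.

```python
# Sketch only: assemble one CK-metric feature row per labelled project.
# The CK tool [3] is assumed to have been run beforehand, e.g. something like
#   java -jar ck.jar /path/to/project ... /path/to/output/
# (arguments differ between CK versions), producing a class-level class.csv.
import pandas as pd

CK_COLUMNS = ["cbo", "wmc", "dit", "noc", "rfc", "lcom"]  # Chidamber-Kemerer suite

def load_project_metrics(class_csv: str, pattern_label: str) -> pd.DataFrame:
    """Aggregate class-level CK metrics into a single row for one project."""
    classes = pd.read_csv(class_csv)
    row = classes[CK_COLUMNS].mean().to_frame().T  # averaging is an assumption
    row["pattern"] = pattern_label                 # creational/structural/behavioural
    return row

# labelled_projects = [("proj1/class.csv", "Creational"), ...]
# dataset = pd.concat([load_project_metrics(p, lbl) for p, lbl in labelled_projects],
#                     ignore_index=True)
```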
3.2 Data Preprocessing
Data preprocessing is an essential part of curating the data for further processing. All the CK metrics are numerical. The numerical attributes are normalized to zero mean and unit variance. The normalization procedure standardizes the input vectors and helps the machine learning model in the learning process.
Moreover, the mean and standard deviations are stored to transform the input data during the testing phase. This transformation and re-scaling of the input data is important for the performance of models such as neural networks and distance-based models (i.e., k-nearest neighbor, SVM) [19]. To improve the performance of the machine learning models, the original dataset was balanced using the Adaptive Synthetic (ADASYN) oversampling algorithm [23].
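A minimal sketch of this preprocessing step, assuming the features and labels have already been split into training and test sets (X_train, y_train, X_test):

```python
# Standardise the CK features (zero mean, unit variance), keep the fitted
# scaler so the same statistics can be reused at test time, and rebalance the
# training classes with ADASYN.
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import ADASYN

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # training mean/std stored in `scaler`
X_test_std = scaler.transform(X_test)        # reuse the training statistics

X_train_bal, y_train_bal = ADASYN(random_state=42).fit_resample(X_train_std, y_train)
```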
3.3 Machine Learning Models
To detect the type of design pattern, we employ machine learning and deep learning methods. For the deep learning approach we use a simple ANN, and for traditional ML we use models from different classifier families. The models used in this paper, along with the families they belong to, are as follows (see the sketch after this list):

– Tree Methods: Decision Tree (DT) [36]
– Ensemble Methods: Random Forest [8]
– Gradient Boosting: Catboost [14]
– Generalized Additive Model: Explainable Boosting Machine [28]
– Probabilistic Methods: Naive Bayes [41]
– Deep Learning: Artificial neural network [42]
– Linear models: Logistic Regression [26]
– Other: k-NN [18] and Support Vector Machine (SVM) [37]
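A sketch instantiating these nine classifiers with default parameters follows; the specific Naive Bayes variant (GaussianNB) is an assumption, and catboost and interpret are separate packages from scikit-learn.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from interpret.glassbox import ExplainableBoostingClassifier

# Default parameters throughout, matching the benchmark setup described below.
classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Catboost": CatBoostClassifier(verbose=0),
    "EBM": ExplainableBoostingClassifier(),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(),
    "Logistic Regression": LogisticRegression(),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
}
```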
The Python library sklearn (version 0.21.2) was used to implement the chosen ML models [34]. To set a benchmark performance for the classifiers, we used the default training parameters set by the sklearn library. The Python library interpretml [33] was used for the explainable gradient boosting classifier implementation.

3.4 Performance Metrics
Five standard performance metrics are used in this paper, namely: Precision, Recall, F1-score, weighted F1-score and Accuracy. They are calculated from the values of a confusion matrix as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$F1\text{-}score = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)$$

$$\mathrm{Weighted\ } F1\text{-}score = \frac{\sum_{i=1}^{K} Support_i \cdot F1_i}{Total} \qquad (4)$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$

where TN is the number of true negatives, TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, $Support_i$ is the number of samples with a given label $i$, $F1_i$ is the F1 score of a given label $i$, and $K$ is the number of classes.
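In practice these metrics can be computed directly from predictions with scikit-learn; the averaging mode below is an assumption, since the paper does not state how per-class scores are combined.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true, y_pred: labels and predictions from one of the fitted classifiers
precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")  # Eq. (4) when weighted
accuracy = accuracy_score(y_true, y_pred)
```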
4 Implementation Details and Results

4.1 Dataset Description
After retrieving the data from GitHub, GitLab, and the student projects, we compiled the dataset. The summary of the compiled dataset is presented in Table 1. We then split the full dataset into train and test sets.

Table 1. Train and test data distribution

| Class | Training set | Percentage | Test set | Percentage |
|---|---|---|---|---|
| Creational | 282 | 59.7% | 38 | 32.2% |
| Structural | 86 | 18.2% | 38 | 32.2% |
| Behavioural | 104 | 22.1% | 42 | 35.5% |
| Total | 472 | 100% | 118 | 100% |

4.2 Experiments
The experiments were carried out on an Intel Core i5 CPU with 8 GB RAM. A Python library called sklearn (version 0.24.1) [34] was used for training and testing the models. Furthermore, we used the default training parameters set by the sklearn library for all the machine learning models. The obtained results are summarized in Table 2 for creational design patterns, Table 3 for structural design patterns, and Table 4 for behavioral patterns. For each experiment conducted, the ML models were trained with data balanced using ADASYN.
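A sketch of one benchmark run follows: each pattern category is framed as a one-vs-rest binary task (as mentioned in the discussion), the training split is balanced with ADASYN, and the four metrics are computed on the untouched test split. It reuses the classifiers, scaled splits, and metric calls from the earlier sketches and is not the authors' exact script.

```python
from imblearn.over_sampling import ADASYN
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

results = {}
for category in ["Creational", "Structural", "Behavioural"]:
    y_tr = (y_train == category).astype(int)   # one-vs-rest target
    y_te = (y_test == category).astype(int)
    X_bal, y_bal = ADASYN(random_state=42).fit_resample(X_train_std, y_tr)
    for name, clf in classifiers.items():
        clf.fit(X_bal, y_bal)
        pred = clf.predict(X_test_std)
        results[(category, name)] = {
            "precision": precision_score(y_te, pred, average="weighted"),
            "recall": recall_score(y_te, pred, average="weighted"),
            "accuracy": accuracy_score(y_te, pred),
            "f1": f1_score(y_te, pred, average="weighted"),
        }
```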
Table 2. Detection of creational patterns

| Classifiers | Precision | Recall | Accuracy | F1-score |
|---|---|---|---|---|
| LR | 0.76 | 0.71 | 0.71 | 0.72 |
| Naive Bayes | 0.73 | 0.69 | 0.69 | 0.70 |
| SVM | 0.80 | 0.77 | 0.77 | 0.78 |
| Decision Tree | 0.72 | 0.64 | 0.64 | 0.66 |
| Random Forest | 0.75 | 0.68 | 0.68 | 0.69 |
| Neural Networks | 0.76 | 0.76 | 0.76 | 0.76 |
| k-NN | 0.82 | 0.80 | 0.80 | 0.80 |
| Catboost | 0.75 | 0.66 | 0.66 | 0.67 |
| Explainable Boosting Classifier | 0.76 | 0.66 | 0.66 | 0.67 |
Table 3. Detection of structural patterns

| Classifiers | Precision | Recall | Accuracy | F1-score |
|---|---|---|---|---|
| LR | 0.62 | 0.55 | 0.55 | 0.57 |
| Naive Bayes | 0.74 | 0.61 | 0.61 | 0.62 |
| SVM | 0.60 | 0.56 | 0.56 | 0.57 |
| Decision Tree | 0.56 | 0.57 | 0.57 | 0.57 |
| Random Forest | 0.54 | 0.62 | 0.62 | 0.56 |
| Neural Networks | 0.63 | 0.61 | 0.61 | 0.62 |
| k-NN | 0.56 | 0.52 | 0.52 | 0.53 |
| Catboost | 0.54 | 0.58 | 0.58 | 0.56 |
| Explainable Boosting Classifier | 0.67 | 0.68 | 0.68 | 0.67 |
Table 4. Detection of behavioral patterns

| Classifiers | Precision | Recall | Accuracy | F1-score |
|---|---|---|---|---|
| LR | 0.66 | 0.65 | 0.65 | 0.66 |
| Naive Bayes | 0.69 | 0.65 | 0.65 | 0.66 |
| SVM | 0.68 | 0.68 | 0.68 | 0.68 |
| Decision Tree | 0.61 | 0.63 | 0.63 | 0.62 |
| Random Forest | 0.67 | 0.68 | 0.68 | 0.67 |
| Neural Networks | 0.67 | 0.69 | 0.69 | 0.67 |
| k-NN | 0.70 | 0.64 | 0.64 | 0.65 |
| Catboost | 0.64 | 0.66 | 0.66 | 0.64 |
| Explainable Boosting Classifier | 0.60 | 0.64 | 0.64 | 0.60 |
Table 5. Multi-class classification accuracy

| Classifier | SMOTEENN | SMOTE | BorderlineSMOTE | ADASYN | Original | SVMSMOTE |
|---|---|---|---|---|---|---|
| LR | 0.53 | 0.51 | 0.52 | 0.52 | 0.39 | 0.47 |
| Decision Tree | 0.49 | 0.39 | 0.44 | 0.52 | 0.47 | 0.37 |
| EBM | 0.54 | 0.53 | 0.52 | 0.50 | 0.38 | 0.52 |
| Random Forest | 0.53 | 0.46 | 0.53 | 0.52 | 0.44 | 0.47 |
| Catboost | 0.48 | 0.45 | 0.48 | 0.43 | 0.40 | 0.42 |
| Neural Network | 0.35 | 0.41 | 0.44 | 0.42 | 0.52 | 0.39 |
| k-NN | 0.53 | 0.54 | 0.54 | 0.54 | 0.52 | 0.56 |
| SVM | 0.53 | 0.53 | 0.51 | 0.53 | 0.34 | 0.48 |
| Naive Bayes | 0.53 | 0.54 | 0.58 | 0.55 | 0.50 | 0.48 |
5 Discussion
This section presents a discussion of the results reported in Sect. 4, the limitations, and the threats to the validity of our proposed approach. For creational design patterns, using one-vs-rest all the methods achieve good performance, with k-NN showing outstanding performance on all selected evaluation metrics. Depending on the objective, an explainable boosting classifier is the best choice when a deeper understanding of why a piece of source code was classified as creational is needed. In some other cases, design patterns are implemented with modification. Based on the computational resource budget, other approaches can be selected (i.e., Naive Bayes is computationally cheaper than a neural network or SVM). The algorithms showed weaker results on structural and behavioral design patterns than on creational design patterns. While the Explainable Boosting Classifier outperformed the other methods on structural patterns, the algorithms did a relatively equal job on behavioral design patterns, except for Catboost and the Explainable Boosting Classifier; we can only highlight SVM and neural networks, which showed slightly better results. This tendency can be explained by data imbalance and the relatively smaller data samples for structural and behavioral patterns. Although this tendency might seem a limitation of the current paper, it also shows that the chosen method works, since the results improve on the bigger dataset (i.e., creational design patterns). Overall, the results show that more resources are required for small datasets, since SVM and neural networks need high computational effort. One might argue that machine learning and deep learning-based approaches are black boxes and might not help in understanding what leads to the detection of specific design patterns, in comparison to graph-based approaches. However, researchers have in recent years proposed a new research direction named interpretable/explainable machine learning. As a result, methods such as LIME [35], interpret [33], and SHAP [29] have been developed. In our experiments, we have employed the explainable boosting classifier (EBM) as a step
towards explainable design pattern detection methods. The use of models such as EBM can help in the identification of semantic relationships among design patterns. From Tables 1 and 5, the main disadvantage of our preliminary dataset is that it is imbalanced. To minimize the impact of this imbalance on the machine learning models, we balanced the data using ADASYN, which is one of the popular approaches for balancing datasets in the machine learning domain. The results of multi-class classification using balanced data, together with a comparison with other data balancing techniques, are presented in Table 5. We aim to increase the dataset in the future by collecting and labeling more Java projects from GitHub as well as from Innopolis University. We are also considering other techniques to increase the dataset, such as statistical and deep learning-based approaches [13, 15, 27, 47, 51].
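The comparison in Table 5 can be reproduced in outline as follows; the resamplers other than ADASYN come from the same imbalanced-learn package, and the loop reuses the splits and classifiers from the earlier sketches rather than the authors' exact code.

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN
from imblearn.combine import SMOTEENN
from sklearn.metrics import accuracy_score

samplers = {
    "SMOTEENN": SMOTEENN(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "BorderlineSMOTE": BorderlineSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "Original": None,                    # no-resampling baseline
    "SVMSMOTE": SVMSMOTE(random_state=42),
}

accuracy_table = {}
for s_name, sampler in samplers.items():
    if sampler is None:
        X_res, y_res = X_train_std, y_train
    else:
        X_res, y_res = sampler.fit_resample(X_train_std, y_train)
    for c_name, clf in classifiers.items():
        clf.fit(X_res, y_res)  # multi-class training on the resampled split
        accuracy_table[(c_name, s_name)] = accuracy_score(
            y_test, clf.predict(X_test_std))
```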
6 Conclusion
Drawing a relationship between software quality and design patterns is a complex task. In this paper, we presented the first step towards analyzing the impact of design patterns on software quality. Firstly, we proposed a machine learning-based approach for detecting design patterns from source code using CK metrics; the approach detects design patterns, and its performance was validated using precision, recall, accuracy, and F1-score. Secondly, we compiled a dataset that can serve as a benchmark for design pattern detection approaches. The dataset was compiled from openly available GitLab and GitHub repositories and Innopolis University third-year student projects. Overall, the dataset has 590 records which can be used to train and test machine learning models. The achieved results suggest that, using CK metrics as input to machine learning models, design patterns can be detected from source code. For future work, we plan to use code embeddings [2] and granular computing techniques such as fuzzy c-means [6], since many real-life projects do not implement only a single design pattern. We also plan to increase the size of the dataset by collecting more data and by using statistical and deep learning approaches [13, 27].
References 1. Akhter, N., Rahman, S., Taher, K.A.: An anti-pattern detection technique using machine learning to improve code quality. In: 2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), pp. 356–360 (2021) 2. Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: learning distributed representations of code. In: Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29 (2019) 3. Aniche, M.: Java code metrics calculator (CK) (2015). https://github.com/ mauricioaniche/ck/ 4. Bahareh Bafandeh Mayvan and Abbas Rasoolzadegan: Design pattern detection based on the graph theory. Knowl.-Based Syst. 120, 211–225 (2017)
5. Barnes, J.A.: Graph theory and social networks: a technical comment on connectedness and connectivity. Sociology 3(2), 215–232 (1969) 6. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy c-means clustering algorithm. Comput. Geosci. 10(2–3), 191–203 (1984) 7. Bollob´ as, B.: Modern Graph Theory, vol. 184. Springer, Heidelberg (2013) 8. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001) 9. Burch, K.J.: Chemical applications of graph theory. In: Blinder, S.M., House, J.E (eds.) Mathematical Physics in Theoretical Chemistry, Developments in Physical and Theoretical Chemistry, pp. 261–294. Elsevier (2019) 10. Carbonneau, R., Laframboise, K., Vahidov, R.: Application of machine learning techniques for supply chain demand forecasting. Eur. J. Oper. Res. 184(3), 1140– 1154 (2008) 11. Chaturvedi, S., Chaturvedi, A., Tiwari, A., Agarwal, S.: Design pattern detection using machine learning techniques. In: 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), pp. 1–6. IEEE (2018) 12. Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994) 13. Dlamini, G., Fahim, M.: Dgm: a data generative model to improve minority class presence in anomaly detection domain. Neural Comput. Appl. 33, 1–12 (2021) 14. Dorogush, A.V., Ershov, V., Gulin, A.: Catboost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018) 15. Douzas, G., Bacao, F.: Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. 91, 464–471 (2018) 16. Eckert, K., Fay, A., Hadlich, T., Diedrich, C., Frank, T., Vogel-Heuser, B.: Design patterns for distributed automation systems with consideration of non-functional requirements. In: Proceedings of 2012 IEEE 17th International Conference on Emerging Technologies & Factory Automation (ETFA 2012), pp. 1–9. IEEE (2012) 17. Essam, J.W.: Graph theory and statistical physics. Disc. Math. 1(1), 83–112 (1971) 18. Fix, E., Hodges, J.L.: Discriminatory analysis. nonparametric discrimination: consistency properties. Int. Stat. Rev./Revue Internationale de Statistique 57(3), 238– 247 (1989) 19. Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Heidelberg (2017) 20. Gamma, E.: Design Patterns: Elements of Reusable Object-Oriented Software. Pearson Education India (1995) 21. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design patterns: abstraction and reuse of object-oriented design. In: Nierstrasz, O.M. (ed.) ECOOP 1993. LNCS, vol. 707, pp. 406–431. Springer, Heidelberg (1993). https://doi.org/10.1007/3-54047910-4 21 22. Hale, S.A.: Multilinguals and wikipedia editing. In: Proceedings of the 2014 ACM Conference on Web Science, WebSci 2014, New York, NY, USA, pp. 99–108. Association for Computing Machinery (2014) 23. He, H., Bai, Y., Garcia, E.A., Li, S.: Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. IEEE (2008) 24. Hummel, O., Burger, S.: Analyzing source code for automated design pattern recommendation. In: Proceedings of the 3rd ACM SIGSOFT International Workshop on Software Analytics, SWAN 2017, New York, NY, USA, pp. 8–14. Association for Computing Machinery (2017)
25. Khomh, F., Gueheneuc, Y.G.: Do design patterns impact software quality positively? In: 2008 12th European Conference on Software Maintenance and Reengineering, pp. 274–278 (2008) 26. Kotu, V., Deshpande, B.: Regression methods. In: Kotu, V., Deshpande, B. (eds.) Predictive Analytics and Data Mining, pp. 165–193. Morgan Kaufmann, Boston (2015) 27. Lemaˆıtre, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(1), 559–563 (2017) 28. Lou, Y., Caruana, R., Gehrke, J., Hooker, G.: Accurate intelligible models with pairwise interactions. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 623–631 (2013) 29. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 4765–4774. Curran Associates, Inc. (2017) 30. Martin, R.C.: Design principles and design patterns. Obj. Mentor 1(34), 597 (2000) 31. Nickel, U., Niere, J., Z¨ undorf, A.: The fujaba environment. In: Proceedings of the 22nd International Conference on Software Engineering, pp. 742–745 (2000) 32. Niere, J., Sch¨ afer, W., Wadsack, J.P., Wendehals, L., Welsh, J.: Towards patternbased design recovery. In: Proceedings of the 24th International Conference on Software Engineering, pp. 338–348 (2002) 33. Nori, H., Jenkins, S., Koch, P., Caruana, R.: Interpretml: a unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223 (2019) 34. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 35. Ribeiro, M.T., Singh, S., Guestrin, C.: “why should i trust you?” explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016) 36. Safavian, R., Landgrebe, D.: A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21(3), 660–674 (1991) 37. Sch¨ olkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L.: New support vector algorithms. Neural Comput. 12(5), 1207–1245 (2000) 38. Shaw, M., Clements, P.: A field guide to boxology: preliminary classification of architectural styles for software systems. In: Proceedings Twenty-First Annual International Computer Software and Applications Conference (COMPSAC 1997), pp. 6–13 (1997) 39. Shoeb, A.H., Guttag, J.V.: Application of machine learning to epileptic seizure detection. In: ICML (2010) 40. Thaller, H.: Towards Deep Learning Driven Design Pattern Detection/submitted by Hannes Thaller. PhD thesis, Universit¨ at Linz (2016) 41. Theodoridis, S.: Bayesian learning: approximate inference and nonparametric models. In: Theodoridis, S. (ed.) Machine Learning, pp. 639–706. Academic Press, Oxford (2015) 42. Theodoridis, S.: Neural networks and deep learning. In: Theodoridis, S. (ed.) Machine Learning, pp. 875–936. Academic Press, Oxford (2015) 43. Tsantalis, N., Chatzigeorgiou, A., Stephanides, G., Halkidis, S.T.: Design pattern detection using similarity scoring. IEEE Trans. Softw. Eng. 32(11), 896–909 (2006) 44. Uchiyama, S., Kubo, A., Washizaki, H., Fukazawa, Y.: Detecting design patterns in object-oriented program source code by using metrics and machine learning. J. Softw. Eng. Appl. 7, 01 (2014)
45. Uchiyama, S., Washizaki, H., Fukazawa, Y., Kubo, A.: Design pattern detection using software metrics and machine learning. In: First International Workshop on Model-Driven Software Migration (MDSM 2011), p. 38 (2011) 46. Worden, K., Manson, G.: The application of machine learning to structural health monitoring. Phil. Trans. Royal Soc. A: Math. Phys. Eng. Sci. 365(1851), 515–537 (2007) 47. Yang, Y., Zheng, K., Chunhua, W., Yang, Y.: Improving the classification effectiveness of intrusion detection by using improved conditional variational autoencoder and deep neural network. Sensors 19(11), 2528 (2019) 48. Yu, D., Ge, J., Wu, W.: Detection of design pattern instances based on graph isomorphism. In: 2013 IEEE 4th International Conference on Software Engineering and Service Science, pp. 874–877 (2013) 49. Zanoni, M., Fontana, F.A., Stella, F.: On applying machine learning techniques for design pattern detection. J. Syst. Softw. 103, 102–117 (2015) 50. Zhang, C., Budgen, D.: What do we know about the effectiveness of software design patterns? IEEE Trans. Softw. Eng. 38(5), 1213–1231 (2012) 51. Zhu, T., Lin, Y., Liu, Y.: Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recogn. 72, 327–340 (2017)
Detecting Cyberbullying from Tweets Through Machine Learning Techniques with Sentiment Analysis Jalal Omer Atoum(B) Department of Computer Science, The University of Texas at Dallas, Dallas, USA [email protected]
Abstract. Technology advancement has resulted in a serious problem called cyberbullying. Bullying someone online, typically by sending ominous or threatening messages, is known as cyberbullying. On social networking sites, Twitter in particular is evolving into a venue for this kind of bullying. Machine learning (ML) algorithms have been widely used to detect cyberbullying by using particular language patterns that bullies use to attack their victims. Text Sentiment Analysis (SA) can provide beneficial features for identifying harmful or abusive content. The goal of this study is to create and refine an efficient method that utilizes SA and language models to detect cyberbullying from tweets. Various machine learning algorithms are analyzed and compared over two datasets of tweets. In this research, we have employed two different datasets of different sizes of tweets in our investigations. On both datasets, Convolutional Neural Network classifiers that are based on higher n-grams language models have outperformed other ML classifiers; namely, Decision Trees, Random Forest, Naïve Bayes, and Support-Vector Machines. Keywords: Cyberbullying detection · Machine learning sentiment analysis
1 Introduction The number of active USA social media users in 2022 is about 270 million (81% of the USA’s total population) of which 113 million of them are Twitter users [1]. As a result, social media has quickly changed how we obtain information and news, shop, etc. in our daily lives. Furthermore, Fig. 1 illustrates that by 2024, there will be over 340 million active Twitter users worldwide [2]. According to the age group, Fig. 2 shows the percentage of USA active Twitter users in August 2022, where 38% of adults between the ages of 18 and 29 are active Twitter users during that time [3]. Given how frequently young adults use online media, cyberbullying and other forms of online hostility have proven to be major problems for users of web-based media. As a result, there are more and more digital casualties who have suffered physically, mentally, or emotionally.
Fig. 1. Number of twitter users.
Fig. 2. Percentage of USA Twitter active users by age as of Aug. 2022.
Cyberbullying is a form of provocation that takes place online on social media platforms. Criminals rely on these networks to obtain information and data that enable them to carry out their wrongdoings. In addition, cyberbullying victims have a high propensity for mental and psychiatric diseases, as indicated by the American Academy of Child and Adolescent Psychiatry [4]. Suicides have been connected to cyberbullying in severe circumstances [5, 6]. The prevalence of the phenomenon is shown in Fig. 3, which displays survey data gathered from student respondents. These details demonstrate the substantial health risk that cyberbullying poses [7].

Fig. 3. Survey statistics on cyberbullying experiences.

Consequently, researchers have exerted much effort to identify methods and tactics that would detect and prevent cyberbullying. Monitoring systems for cyberbullying have received a lot of attention recently, and their main objective is to quickly find instances of cyberbullying [8]. The framework's key concept is the extraction of specific aspects
from web-based media communications, followed by the construction of classifier algorithms to identify cyberbullying based on these retrieved features. These traits could be influenced by content, emotion, users, and social networks. ML or filtration techniques have been used most frequently in studies on detecting cyberbullying. Filtration techniques must find profane phrases or idioms in texts in order to detect cyberbullying [9]. Filtration strategies typically make use of ML techniques to create classifiers that can spot cyberbullying using data corpora gathered from social networks such as Facebook and Twitter. As an illustration, data were obtained from Formspring and labelled utilizing Amazon Mechanical Turk [10]. Additionally, WEKA [11], a collection of ML tools, has been used to test various ML approaches. These approaches have failed to discriminate between direct and indirect linguistic harassment [12]. In order to identify potential bullies, Chen [13] suggested a technique to separate hostile language constructions from social media by looking at characteristics associated with the clients' writing styles, structures, and specific cyberbullying material. The main technique applied in that study is a lexical syntactic component that was effective and capable of distinguishing hostile content in communications posted by bullies. Their results revealed a remarkable precision rate of 98.24% and a recall of 94.34%. An approach for identifying cyberbullying by Nandhini and Sheeba [14] was based on a Naïve Bayes (NB) classifier and information gleaned from MySpace; they reported a precision of 91%. An improved NB classifier was used by Romsaiyud et al. [15] to separate cyberbullying terms and group piled samples. Utilizing a corpus from Kongregate, MySpace, and Slashdot, they achieved a precision of 95.79%. Based on our earlier research [16], which used the Sentiment Analysis (SA) technique to categorize tweets as positive, negative, or neutral cyberbullying, we investigate and evaluate different machine learning classifier models, namely Decision Tree (DT), Naïve Bayes (NB), Random Forest (RF), Support
Vector Machine (SVM), and Convolutional Neural Networks (CNN). We will conduct this investigation and evaluation by performing various experiments on two different datasets of tweets. In addition, we will use some language models (Normalization, Tokenization, Named Entity recognition (NER), and Stemming) to enhance the classification process. The context for this research is presented in Sect. 2. The suggested model for sentiment analysis of tweets is presented in Sect. 3. The experiments and findings of the suggested model are presented in Sect. 4. The results of this study are presented in Sect. 5 as a final section.
2 Background Artificial intelligence (AI) applications such as Machine Learning (ML) enable systems to automatically learn from their experiences and advance without explicit programming. ML algorithms are usually divided into supervised and unsupervised categories. In order to anticipate future events, supervised ML algorithms use labeled examples and past knowledge to analyze incoming data. Starting from the analysis of a well-known training dataset, the learning algorithm constructs an inferred function to predict the values of the outputs. Unsupervised ML is used when the input data is unlabeled or uncharacterized. Unsupervised learning looks into the possibility that systems could infer a function from unlabeled data to explain a hidden structure. Lastly, researchers have applied supervised learning techniques to data discovered via freely accessible corpora [17]. The cyberbullying detection framework consists of two main components as depicted in Fig. 5. NLP (Natural Language Processing), and ML (machine learning). The first stage involves gathering and employing natural language processing to prepare datasets of tweets for machine learning algorithms. The machine learning algorithms are then trained to find any harassing or bullying remarks from the tweets using the processed datasets. 2.1 Natural Language Processing An actual tweet or text may contain a number of extraneous characters or lines of text. For instance, punctuation or numbers have no bearing on the detection of bullying. We need to clean and prepare the tweets for the detection phase before applying the machine learning techniques to them. In this stage, various processing tasks are performed, such as tokenization, stemming, and the elimination of any unnecessary characters including stop words, punctuation, and digits. Collecting Tweets: To collect the tweets from Twitter, an application was created. After the hashtag, each tweet needs to have the feature words—words that tell the user whether it is a good, bad, or neutral cyberbullying tweet—extracted. The extraction of tweets is necessary for the analysis of the features vector and selection process (unigrams, bigrams, trigrams, etc.), as well as the classification of both the training and testing sets of tweets.
Cleaning and Annotation of Tweets: Tweets may include special symbols and characters that cause them to be classified differently from how the authors intended. Therefore, all special symbols, letters, and emoticons must be removed from the collected tweets. Also crucial to the classification process is the replacement of such special symbols, feelings, and emotional cues with their meanings. Figure 4 lists several special symbols we used together with their meanings and sentiments. The compiled tweets are annotated manually; each tweet is given a cyberbullying label—positive, negative, or neutral—as a result of this annotation.
Normalization: A list is kept of all non-standard words, such as those containing dates or numerals. These terms are mapped to particular built-in vocabularies. As a result, the tweet vocabulary is smaller and the classification process is performed with greater precision. Tweet extraction is also necessary when testing tweet collections.
Tokenization: Tokenization, which lessens typographical variance between words, is a crucial stage in SA. Tokenization is necessary both for the feature extraction procedure and for the bag of words. Words are converted into feature vectors or feature indices using a dictionary of features, and their index in the vocabulary is connected to their frequency over the entire training corpus.
Named Entity Recognition (NER): Named Entity Recognition (NER) can be used to find the proper nouns in unstructured text. The three types of named entities in NER are ENAMEX (person, organization, and country), TIMEX (date and time), and NUMEX (percentages and numbers).
Removing Stop Words: Some stop words can help convey a tweet's overall meaning, while others are merely superfluous tokens that should be deleted. Examples of stop words are "a", "and", "but", "how", "or", and "what". These stop words do not change a tweet's meaning and can be omitted.
Stemming: When stemming tweets, words are stripped of any suffixes, prefixes, and/or infixes that may have been added. A stemmed word conveys the core meaning of the original word and may also result in storage savings [18]. Stemmed tweets are reduced to the stems, bases, or roots of any derived or inflected words. Additionally, stemming helps combine all of a word's variations into a single bucket, effectively reducing entropy and providing better concepts for the data. The N-gram is a traditional technique that can detect common phrases by looking at the frequency of N words in a tweet [19]; hence, we employed N-grams in our SA. Weka [11] was used in this study to compute the term (word) frequency, which assigns a weight to each term in a document based on how many times the term appears in it. It gives keywords that appear more frequently in tweets more weight, because these terms indicate words and linguistic patterns that are more frequently used by tweeters.
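To make the preprocessing stages above concrete, the following is a minimal sketch of such a pipeline in Python using NLTK. It is an illustrative reconstruction rather than the implementation used in this study; the cleaning regular expression, the choice of the Porter stemmer, and the NLTK resources are assumptions.

```python
# Minimal sketch of the tweet-preprocessing steps described above (illustrative only).
# Requires: pip install nltk, then nltk.download('punkt') and nltk.download('stopwords').
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(tweet, n=2):
    # Cleaning: drop URLs, user mentions, digits, punctuation, and special symbols.
    text = re.sub(r"http\S+|@\w+|[^a-z\s]", " ", tweet.lower())
    # Tokenization.
    tokens = word_tokenize(text)
    # Stop-word removal followed by stemming.
    stems = [STEMMER.stem(t) for t in tokens if t and t not in STOP_WORDS]
    # N-gram extraction (2-, 3-, and 4-grams are the sizes used in the experiments).
    return [" ".join(g) for g in ngrams(stems, n)]

print(preprocess("Nobody likes you, just leave this group!!!", n=2))
```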
Fig. 4. Some special tweet symbols and their meanings.
Feature Selection: Feature selection techniques have been applied successfully in SA [20]. Features are ranked according to certain criteria, and unhelpful or uninformative features are eliminated to increase the precision and effectiveness of the classification process. To exclude such unimportant features in this investigation, we applied the chi-square and information gain strategies.
Sentiment Analysis: Sentiment analysis (SA) classifiers are often based on the projected classes and polarity, as well as the level of categorization (sentence or document). Semantic orientation polarity and strength are annotated via lexicon-based SA text extraction. Prior SA work has also shown how useful light stemming is for classification performance and accuracy [21].
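As an illustration of the feature-ranking step, the sketch below scores n-gram count features with the chi-square statistic and with information gain (mutual information) using scikit-learn; the tiny corpus and the value of k are placeholders, not the settings used in this study.

```python
# Hedged sketch of chi-square and information-gain feature selection over n-gram counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

tweets = ["you are pathetic", "great game last night", "nobody wants you here", "nice goal"]
labels = [1, 0, 1, 0]                     # 1 = cyberbullying, 0 = non-cyberbullying

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(tweets)

selector = SelectKBest(chi2, k=3)         # keep only the highest-ranked features
X_selected = selector.fit_transform(X, labels)

info_gain = mutual_info_classif(X, labels, discrete_features=True)
```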
2.2 Machine Learning Algorithms
The machine learning algorithms used in this study to build classifier models for detecting cyberbullying from tweets are explained in this subsection.
Decision Trees: Classification can be done using the Decision Tree (DT) classifier [22]. It can both assist in making a decision and represent that decision. In a decision tree, each leaf node denotes a decision, and each internal node denotes a condition. A classification tree outputs the class to which the target belongs, while a regression tree produces the predicted value for a particular input.
Random Forest: Multiple decision tree classifiers make up the Random Forest (RF) classifier [23]. A distinct class prediction is provided by each tree, and the final output is
the class predicted by the majority of the trees. This classifier is a supervised learning model that tends to yield accurate results because the output is created by combining numerous decision trees. Instead of relying on a single decision tree, the random forest takes the forecast from each tree and selects the outcome with the majority of votes. For instance, with two classes A and B, if most trees predict the class label B for a given instance, then RF will choose B: f(x) = majority vote of all trees = B.
Naive Bayes: Naive Bayes (NB) classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. NB classifiers are used as supervised learning models. When representing a document, NB frequently uses the "bag of words" method, gathering the most frequently used terms while ignoring less frequently used ones; the feature extraction approach depends on the bag of words to classify the data. Additionally, NB supports language modeling that separates each text into unigram, bigram, or n-gram representations and assesses the likelihood that a given query matches a certain document [24].
Support-Vector Machine (SVM): Another supervised learning model is the SVM, whose learning algorithm analyzes the data used for classification and regression. Given a set of training examples, an SVM training algorithm creates a model that assigns incoming examples to one of two categories, making it a non-probabilistic binary linear classifier (although there are ways to apply SVM in a probabilistic classification setting, such as Platt scaling) [25]. Linear and Radial Basis Function kernels are crucial for SVM text categorization. The dataset is often trained for linear classification before a classification or categorization model is created; the features are represented as points in space that are expected to belong to one of the designated classes. SVM performs well in a variety of classification tasks, but it is most frequently used for text and image recognition [25].
Convolutional Neural Network (CNN): A neural network is a software solution that uses algorithms loosely modeled on the functions of the human brain. Neural networks offer strong abilities for pattern detection and problem-solving, and neural network techniques have generally surpassed a number of contemporary techniques in machine translation, image recognition, and speech recognition. They may also perform better than other methods in developing classifiers for the identification of cyberbullying, because neural networks can automatically learn beneficial properties of complex structures from a high-dimensional input set [26]. Shallow classifiers are routinely outperformed by two specific neural network architectures, namely Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) [27, 28]. RNNs process their input one element at a time, such as a word or character, and store in their hidden units a state vector that contains the history of previous elements. Conversely, CNNs employ convolutions to identify small conjunctions of features from a previous layer and create new feature maps that can be used as input by further layers for classification.
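A compact sketch of how the four shallow classifiers above can be instantiated over term-frequency n-gram features is shown below; the scikit-learn hyper-parameters and the toy corpus are illustrative assumptions rather than the settings used in the experiments.

```python
# Illustrative sketch: DT, RF, NB, and SVM classifiers over term-frequency n-gram features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

train_tweets = ["you are pathetic", "great game last night", "nobody wants you here", "nice goal"]
train_labels = [1, 0, 1, 0]                        # 1 = cyberbullying, 0 = non-cyberbullying

X = TfidfVectorizer(ngram_range=(1, 4)).fit_transform(train_tweets)

models = {
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),  # majority vote over 100 trees
    "NB": MultinomialNB(),                           # bag-of-words Naive Bayes
    "SVM": SVC(kernel="linear"),                     # linear kernel, common for text
}
for name, model in models.items():
    model.fit(X, train_labels)
```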
A CNN sentence-level classifier was developed by Kim [29] using pretrained word vectors. The results showed an improvement over the state-of-the-art on four out of seven benchmark datasets that involve sentiment analysis. Using an unsupervised corpus of 50 million tweets with word embeddings, Severyn and Moschitti [30] presented a pre-trained CNN architecture; the findings showed that it would place first at the phrase level on the Twitter15 dataset of SemEval 2015. The Very Deep CNN classifier was created by Conneau et al. [31], and it increased performance by using up to 29 convolutional layers on eight publicly available text classification tasks, including sentiment analysis, subject classification, and news categorization.
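For comparison, a minimal sentence-level CNN text classifier of the kind described above can be sketched with Keras as follows; the vocabulary size, embedding dimension, and filter settings are assumptions made for illustration, not the configuration used in this work.

```python
# Minimal Kim-style CNN text classifier sketch (assumed settings, for illustration only).
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 20000, 128, 3   # 3 labels: positive / negative / neutral
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # could be initialised with pretrained vectors
    tf.keras.layers.Conv1D(100, 5, activation="relu"),  # convolutional filters over word windows
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```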
3 Proposed Tweets SA Model
The suggested Twitter bullying detection model, as shown in Fig. 5, employs various stages for analyzing, mining, and categorizing tweets. In order to improve the SA process, the gathered tweets must go through a number of preprocessing steps, as discussed in the preceding section. We used five machine learning algorithms—Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Support Vector Machines (SVM), and Convolutional Neural Networks (CNN)—to classify tweets as bullying or non-bullying.
3.1 Evaluation Measures
Several evaluation measures were used in this work to assess the effectiveness of the proposed classifier models in differentiating cyberbullying from non-cyberbullying. Five performance metrics—accuracy, precision, recall, F-measure, and receiver operating characteristic (ROC)—have been used in this study. These measures are defined as follows:
Accuracy: the proportion of correct predictions among all predictions.
Precision: the fraction of cases predicted as positive (true positives plus false positives) that are actually positive.
Recall (or Sensitivity): the fraction of actual positive cases (true positives plus false negatives) that are correctly identified as positive.
F-Measure: the harmonic mean of precision and recall, given by (2 × Precision × Recall)/(Precision + Recall).
Receiver Operating Characteristic (ROC): a curve that plots recall (also known as the true positive rate) against the false-positive rate. The area under the ROC curve (AUC) quantifies the total region beneath the curve; the higher the AUC, the better the classification model.
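The five measures can be computed directly with scikit-learn, as in the short sketch below; the prediction vectors are dummy values used only to illustrate the calls.

```python
# Short sketch of the five evaluation measures defined above (dummy predictions).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = cyberbullying, 0 = not
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3]   # scores used for the ROC curve

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))
```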
Fig. 5. Proposed Twitter bullying detection model.
3.2 Datasets
To evaluate the effectiveness of the machine learning methods used in this investigation, we gathered two datasets (Dataset-1 and Dataset-2) from Twitter on different dates (one month apart). Table 1 presents the details of these two datasets.

Table 1. Datasets tweets statistics

                                                 Dataset-1   Dataset-2
Number of tweets                                   9352        6438
Number of positive (cyberbullying) tweets          2521        1374
Number of negative (non-cyberbullying) tweets      3942        2347
Number of neutral tweets                           2889        2717
4 Experiments and Results
Before we could conduct our investigations, the collected tweets had to go through a number of steps, including cleaning, annotation, normalization, tokenization, named entity recognition, stop-word removal, and stemming, as covered in the prior section. The DT, RF, NB, SVM, and CNN classifiers are then trained and tested on these datasets using a (70, 30) train-test split. In addition, 10-fold cross-validation with equal-sized folds is applied.
Numerous experiments have been carried out on the above-mentioned two datasets of collected tweets to analyze and evaluate the DT, RF, NB, SVM, and CNN classifiers. The precision, accuracy, recall, F-measure, and ROC of these classifiers are assessed using tweets with n-gram sizes of 2, 3, and 4 in the main test.
4.1 Results for Dataset-1
The outcomes of the evaluation measures on Dataset-1 are shown in Fig. 6. It highlights the average estimates obtained using the different n-gram models for the DT, RF, NB, SVM, and CNN classifiers. This graph demonstrates that the CNN classifiers outperformed all other classifiers in all n-gram language models in terms of precision, accuracy, recall, F-measure, and ROC. CNN classifiers, for instance, obtained an average accuracy of 93.62% with the 4-gram language model, whereas the DT, RF, NB, and SVM classifiers only managed to achieve average accuracies of 63.2%, 65.9%, 82.3%, and 91.02%, respectively, using the same language model. Additionally, in all tests with all classifiers, the 4-gram language model outperformed the remaining n-gram language models, because the evaluation likelihood improves as the n-gram size increases.
Fig. 6. Comparisons of DT, RF, NB, SVM, and CNN measures for dataset-1.
4.2 Results for Dataset-2
Figure 7 displays the results of the same evaluation measures on Dataset-2 using the various n-gram models. This figure shows once more that the CNN classifiers outperform all other classifiers in terms of precision, accuracy, recall, F-measure, and ROC in all n-gram language models. In contrast to the DT, RF, NB, and SVM classifiers, which only managed to reach average accuracies of 62.7%, 64.8%, 81.2%, and 89.7%, respectively, using the same language model, CNN achieved an average accuracy of 91.03% with the 4-gram language model. Additionally, in all tests conducted with all classifiers, the 4-gram language model fared better than all other n-gram language models.
4.3 Comparing the Results of Dataset-1 and Dataset-2
As can be noticed from Fig. 8, when the findings from the two datasets (Dataset-1 and Dataset-2) are compared, we obtain slightly better results with Dataset-1 than with Dataset-2 for all assessment measures (using the averages of all language models: 2-gram, 3-gram, and 4-gram). This is because Dataset-1 has more tweets (9352) than Dataset-2 (6438). It follows that the results improve as the size of the dataset used to train and evaluate the machine learning classifiers increases. Additionally, all ML classifiers perform better with the 4-gram language model than with the 2-gram and 3-gram models on all assessment criteria, as shown in Fig. 6 and Fig. 7, because the evaluation likelihood improves as the n-gram size increases.
Fig. 7. Comparisons of DT, RF, NB, SVM, and CNN measures for dataset-2.
Fig. 8. Comparisons of evaluations measures of dataset-1 and dataset-2.
5 Conclusion
We have proposed a method for detecting cyberbullying on Twitter that relies on sentiment analysis using machine learning techniques, notably Decision Tree (DT), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Networks (CNN). The two collections of tweets included in this inquiry contain a variety of tweets that have been categorized as cyberbullying in one of three ways: positive, negative, or neutral. Before being trained on and tested with these machine learning algorithms, the collected sets of tweets went through a number of stages: cleaning, annotation, normalization, tokenization, named entity recognition, stop-word removal, stemming and n-gram extraction, and feature selection. According to the results of the conducted experiments, the CNN classifiers performed better than all other classifiers on both datasets across all language models (2-gram, 3-gram, and 4-gram). On Dataset-1 and Dataset-2, respectively, the CNN classifiers achieved average accuracy levels of 93.62% and 91.03%. Additionally, the CNN classifiers outperformed every other classifier (DT, RF, NB, and SVM) on every other evaluation metric (precision, recall, F-measure, and ROC). Furthermore, with the 4-gram language model these classifiers produced better results than with the other language models (2-gram and 3-gram). Finally, for future work on cyberbullying detection, we would like to investigate machine learning techniques to identify bullying on other social media sites such as Facebook, Instagram, and TikTok.
Acknowledgment. I want to express my gratitude to the University of Texas at Dallas for their assistance.
References 1. US Social Media Statistics | US Internet Mobil Stats. https://www.theglobalstatistics.com/uni ted-states-social-media-statistics/. Accessed 05 Aug 2022 2. Cyberbullying Research Center (http://cyberbullying.org/) 3. The 2022 Social Media Demographics Guide. https://khoros.com/resources/social-mediademographics-guide 4. American Academy of Child Adolescent Psychiatry. Facts for families guide. the American academy of child adolescent psychiatry. 2016. http://www.aacap.org/AACAP/Families_and_ Youth/Facts_for_Families/FFF-Guide/FFF-Guide-Home.aspx 5. Goldman, R.: Teens indicted after allegedly taunting girl who hanged herself (2010). http:// abcnews.go.com/Technology/TheLaw/ 6. Smith-Spark, L.: Hanna Smith suicide fuels call for action on ask.fm cyberbullying (2013). http://www.cnn.com/2013/08/07/world/europe/uk-social-media-bullying/ 7. Cyberbullying Research Center. http://cyberbullying.org/). Accessed 06 Aug 2022 8. Salawu, S., He, Y., Lumsden, J.: Approaches to automated detection of cyberbullying: a survey. IEEE Trans. Affect. Comput. 11(1), 3–24 (2020). https://doi.org/10.1109/TAFFC. 2017.2761757 9. Sartor, G., Loreggia, A.: Study: The impact of algorithms for online content filtering or moderation (upload filters). European Parliament (2020)
10. Amaon Mechanical Turk, 15 Aug 2014. http://ocs.aws.amazon.com/AWSMMechTurk/latest/ AWSMechanical-TurkGetingStartedGuide/SvcIntro.html. Accessed 3 July 2020 11. Garner, S.: Weka: the waikato environment for knowledge analysis. In: Proceedings of the New Zealand Computer Science Research Students Conference, pp. 57–64, New Zealand (1995) 12. Nahar, V., Li, X., Pang, C.: An effective approach for cyberbullying detection. Commun. Inf. Sci. Manag. Eng. 3(5), 238 (2013) 13. Chen, Y., Zhou, Y., Zhu, S., Xu, H.: Detecting offensive language in social media to protect adolescent online safety. In: privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on Social Computing (SocialCom), pp. 71–80 (2012) 14. Sri Nandhinia, B., Sheeba, J.I.: Online social network bullying detection using intelligence techniques international conference on advanced computing technologies and applications (ICACTA- 2015). Procedia Comput. Sci. 45, 485–492 (2015) 15. Romsaiyud, W., Nakornphanom, K., Prasertslip, P., Nurarak, P., Pirom, K.: Automated cyberbullying detection using clustering appearance pattern. In: 2017 9th International Conference on Knowledge and Smart Technology (KST), pp. 2–247. IEEE (2017) 16. Atoum, J.O.:Cyberbullying detection neural networks using sentiment analysis. In: 2021 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 158–164 (2021). https://doi.org/10.1109/CSCI54926.2021.00098 17. Bosco, C., Patti, V., Bolioli, A.: Developing corpora for sentiment analysis: the case of Irony and Senti–TUT. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), pp. 4158–4162 (2015) 18. Rajput, B.S., Khare, N.: A survey of stemming algorithms for information retrieval. IOSR J. Comput. Eng. (IOSR-JCE), 17(3), Ver. VI (May – Jun. 2015), 76–78. e-ISSN: 2278–0661, p-ISSN: 2278–8727 19. Chen, L., Wang, W., Nagaraja, M., Wang, S., Sheth, A.: Beyond positive/negative classification: automatic extraction of sentiment clues from microblogs. Kno.e.sis Center, Technical Report (2011) 20. Fattah, M.A.: A novel statistical feature selection approach for text categorization. J. Inf. Process. Syst. 13, 1397–1409 (2017) 21. Tian, L., Lai, C., Moore, J.D.: Polarity and intensity: the two aspects of sentiment analysis. In: Proceedings of the First Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pp. 40–47, Melbourne, Australia 20 July 2018. Association for Computational Linguistics (2018) 22. Safavian, S.R., Landgrebe, D.: A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21(3), 660–674 (1991) 23. Pal, M.: Random forest classifier for remote sensing classification. Int. J. Remote Sens. 26(1), 217–222 (2005) 24. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2– 3), 131–163 (1997) 25. Cortes, C., Vapnik, V.N.: Support-Vector Networks (PDF). Mach. Learn. 20(3), 273–297 (1995), Cutesier 10.1.1.15.9362. https://doi.org/10.1007/BF00994018 26. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015) 27. Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations (ICLR 2013), pp. 1–12 (2013) 28. Bengio, Y., Ducharme, R., Vincent, P.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003) 29. 
Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1746–1751 (2014)
30. Severyn, A., Moschitti, A.: Twitter sentiment analysis with deep convolutional neural networks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR 2015, pp. 959–962 (2015) 31. Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional networks for natural language processing. KI - Künstliche Intelligenz 26(4), 357–363 (2016)
DNA Genome Classification with Machine Learning and Image Descriptors Daniel Prado Cussi1(B) and V. E. Machaca Arceda2 1
Universidad Nacional de San Agustín, Arequipa, Peru [email protected] 2 Universidad La Salle, Mexico City, Mexico [email protected]
Abstract. Sequence alignment is the most used method in Bioinformatics; nevertheless, its processing time is high. For that reason, there are several methods not based on alignment for comparing sequences. In this work, we analyzed Kameris and Castor, two alignment-free methods for DNA genome classification, and compared them against the most popular CNN networks: VGG16, VGG19, Resnet-50, and Inception. Also, we compared them with image descriptor methods such as First-Order Statistics (FOS), Gray-Level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP), and Multi-resolution Local Binary Pattern (MLBP), combined with classifiers such as Support Vector Machine (SVM), Random Forest (RF), and k-nearest neighbors (KNN). In this comparison, we concluded that FOS, GLCM, LBP, and MLBP, all with SVM, got the best results in f1-score, followed by Castor and Kameris and finally by the CNNs. Furthermore, Castor had the lowest processing time. Finally, according to the experiments, 5-mer (used by Kameris and Castor) and 6-mer outperformed 7-mer. Keywords: Alignment-free methods · Frequency chaos game representation · Alignment-based methods · CNN · Kameris · Castor · FOS · GLCM · LBP · MLBP
1 Introduction
Sequence alignment is a fundamental procedure in Bioinformatics. The method is very important for discovering similar regions of DNA [15,42]. It has a relevant impact on many applications such as viral classification, phylogenetic analysis, drug discovery, etc. [40]. The main problems with sequence alignment are its high processing time and memory consumption [19,60]; despite the efforts of scientists to develop more efficient algorithms [9,30,56,59], this problem is not yet resolved. For that reason, there is another approach to comparing sequences, called alignment-free methods; these use DNA descriptors and machine learning models in order to classify and compare sequences.
We have used several methods based on k-mer frequencies; k-mers have been applied in genome computation and sequence analysis [46]. In this work, we analyzed two alignment-free methods (Kameris and Castor) against VGG16, VGG19, Resnet-50, Inception, First-Order Statistics (FOS), Gray-Level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP), and Multi-resolution Local Binary Pattern (MLBP). This is an extension of a previous work that compared Kameris and Castor against small CNN models [31]. Moreover, we included processing time comparisons in order to show the advantages of alignment-free methods. FOS, LBP, GLCM, and MLBP are image descriptors used in computer vision; in this work they are used to extract information from DNA, and we refer to them as "image descriptor methods". The work is structured as follows: in Sect. 2, we present the related work, and Sect. 3 explains the data, materials, and methods used. Section 4 details the experiments and results, and Sect. 5 discusses them; in Sect. 6, we present the limitations; in Sect. 7, we state the conclusions; and finally, in Sect. 8, we define the future work.
2 Related Work
Alignment-free methods are relevant. For example, in COVID studies, some researchers used DNA descriptors with machine learning models to find a class for the COVID virus; these studies concluded that the virus belongs to Betacoronavirus, inferring that it originated in bats [14,26]. There are four interesting alignment-free methods that are used in multiple fields of bioinformatics: First-Order Statistics (FOS), Gray-Level Co-occurrence Matrix (GLCM), Local Binary Patterns (LBP), and Multi-resolution Local Binary Patterns (MLBP). Delibaş proposed a new approach to DNA sequence similarity analysis using FOS to calculate the similarity of textures [10]. FOS was useful for the creation of a reliable and effective method that can be applied to computed tomography (CT) images of COVID-19 for the purpose of monitoring the disease [54]. Moreover, FOS was also applied to diagnose the same disease, but without emitting radiation as computed tomography does, in an automated system [17]. One of the most common methods used in image texture analysis is the GLCM [6]. It has had relevance in the healthcare field, being part of a novel solution to improve the diagnosis of COVID-19 [3]. Additionally, it was also employed to extract textural features from histopathological images [34]. On the other hand, fusing GLCM and LBP served to propose a texture feature extraction technique for volumetric images, improving discrimination with respect to works using deep learning networks [4]. Finally, GLCM features were applied to the problem of automatic online signature verification, concluding that they work optimally with the SVM model, having also been tested with other models [49]. In recent years, the LBP feature extraction method, which describes surface texture features [36], has been used in various applications,
making remarkable progress in texture classification [20,43] and facial recognition applications [22,33,47,55]. It has also had relevance in bioinformatics, being used to predict protein-protein interactions [28] and to predict the subcellular localization of proteins in images of human reproductive tissues [58]. Additionally, an improved version of LBP, called Multi-resolution Local Binary Pattern (MLBP), was developed [23]. This version was used to classify normal breast tissue and masses on mammograms, reducing false positive detections in the computer-assisted detection of breast masses [7]. Alignment-free methods are able to solve problems in different areas; for example, VirFinder, an alignment-free method, was developed to identify viral sequences using k-mer frequencies [41]. On the other hand, another robust alignment-free method was developed to discriminate between enzymes and non-enzymes, obtaining a remarkable result [8]. In genetics, alignment-free methods have far outperformed alignment-based methods in measuring the closeness between DNA sequences at the data preprocessing stage [46]. Protein sequence similarity is a relevant topic in bioinformatics: a new alignment-free method has been developed to perform similarity analysis between proteins of different species, allowing a global comparison of multiple proteins [12]. These methods have also been applied to topics such as the evolutionary relationship between species such as humans, chimpanzees, and orangutans [37]. Alignment-free methods can be divided into several categories, one of the most relevant being k-mer frequencies [1]—k-mers are sequences of k characters [27]—with many applications in metagenomic classification [57] and repeat classification [5]. In recent years, CNNs have become very relevant. For example, Inception-v3 was used to recognize the boundaries between exons and introns [45], and it was also used to predict skin cancer [44]. Moreover, VGG16 was employed for non-invasive detection of COVID-19 [35] and applied with the PILAE algorithm to achieve better performance in classifying DNA sequences of living organisms [32]. Finally, a method was developed that allows DNA comparisons: the authors transformed the sequences into FCGR images and then used SVD; this method outperformed BLAST [29]. The works mentioned have some shortcomings; for example, in [3,54] the accuracy should ideally approach 100% because people's health is at stake, although those works did improve on previous proposals. Additionally, we found very little research relating image descriptor methods to DNA genome classification, so in this paper we compare the four image descriptor methods (FOS, GLCM, LBP, and MLBP) with three classifiers (SVM, KNN, and RFC), noting that FOS has some interesting characteristics and that the SVM classifier has shown good performance [4]; the experiments will make this comparison explicit. The main idea of this work is to find new alternatives to the traditional alignment-based methods, which become very slow as the data grows. We also compare all of the proposed alignment-free methods, with the aim of discovering which of them achieve the best f1-score and the best processing time.
3 Materials and Methods
In this section, we describe the method based on CNNs, two alignment-free methods (Kameris and Castor), First-Order Statistics (FOS), Gray-Level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP), and Multi-resolution Local Binary Pattern (MLBP).
3.1 CNNs and Chaos Game Representation
In order to use CNNs, we need an image; for that reason, we used the Frequency Chaos Game Representation (FCGR). FCGR creates images that store the frequencies of k-mers according to the value of k [2,11]. There are several works inspired by FCGR; in our case, we have focused on the so-called Intelligent Icons, proposed by Kumar and Keogh, where these bitmaps provide a very good overview of the values and their order in the dataset [21,24]. Then, we used VGG16, VGG19, Resnet-50, and Inception in order to build a classifier. VGG16 is a deep learning architecture with 16 hidden layers composed of 13 convolutional layers and three fully connected layers; the model achieves 92.7% top-5 test accuracy on ImageNet. Additionally, VGG19 is a variant of VGG16 with more depth [48]. We have also applied Resnet-50, a convolutional neural network trained on more than one million images from the ImageNet database; the network is 50 layers deep and can classify images into 1000 object categories, so it has learned feature-rich representations for a wide range of images [18]. Finally, we used Inception-v3, a convolutional neural network architecture belonging to the Inception family, developed by Google. The model presents several improvements such as label smoothing, 7 × 7 factorized convolutions, and the use of an auxiliary classifier to spread label information to the lower layers of the network [52,53]. For data preprocessing, the models require numerical rather than categorical data; since the genomic sequences in the DNA datasets are categorical, one-hot encoding is used in our work to transform them into numerical data [16]. For the CNN architectures (VGG16, VGG19, Resnet-50, and Inception-v3), we utilized the Adam optimizer in all cases, a mini-batch size of 32, and 20 epochs.
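The following is a compact sketch of how a 2^k × 2^k frequency image in the spirit of FCGR / Intelligent Icons can be built from k-mer counts before being fed to a CNN. The corner assignment of the four bases and the normalisation are common conventions assumed here for illustration, not necessarily the exact choices made in this work.

```python
# Hedged sketch: a 2^k x 2^k k-mer frequency image (FCGR-style) from a DNA sequence.
import numpy as np

def fcgr(sequence, k=5):
    coords = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}  # assumed corner convention
    size = 2 ** k
    img = np.zeros((size, size), dtype=np.float64)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(b not in coords for b in kmer):   # skip ambiguous bases such as N
            continue
        x = y = 0
        for b in kmer:                           # each base contributes one bit to row and column
            dx, dy = coords[b]
            x = 2 * x + dx
            y = 2 * y + dy
        img[x, y] += 1
    return img / max(img.sum(), 1)               # normalise counts to frequencies

image = fcgr("ACGTACGTGGGCCATTTACGATCGATCGA", k=5)
```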
3.2 Kameris and Castor
Kameris is an open-source, supervised, alignment-free subtyping method that operates on k-mer frequencies in HIV-1 sequences [50]. Mainly, the approach taken by Kameris is that feature vectors expressing the k-mer frequencies of the virus sequences are passed to known supervised classifiers; beforehand, the genomic sequences are preprocessed by removing any ambiguous nucleotide codes [13].
Castor-KRFE is an alignment-free method that extracts discriminating k-mers from known pathogen sequences to detect and classify new sequences accurately. By detecting discriminating sub-sequences within known pathogenic sequences, CASTOR identifies a minimal set of features, which makes the job of classifying sequences much simpler and also increases performance [25].
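Both tools operate on k-mer frequency vectors. As an illustration of such vectors, the sketch below builds a normalised 4^k-dimensional frequency vector, silently dropping k-mers that contain ambiguous codes, and fits a linear SVM; the toy sequences and classifier settings are assumptions and do not reproduce the tools' own pipelines.

```python
# Hedged sketch of k-mer frequency feature vectors fed to a supervised classifier.
from itertools import product
import numpy as np
from sklearn.svm import LinearSVC

def kmer_vector(sequence, k=5):
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = np.zeros(len(index))
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:              # ambiguous codes (e.g. N) are simply dropped
            vec[index[kmer]] += 1
    total = vec.sum()
    return vec / total if total else vec

X = np.array([kmer_vector(s) for s in ["ACGTACGTAC" * 20, "TTGGCCAATT" * 20]])
y = [0, 1]
clf = LinearSVC().fit(X, y)            # any supervised classifier could be used here
```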
3.3 FOS, GLCM, LBP, and MLBP
3.3.1 First-Order Statistics
First-Order Statistics is applied to DNA sequence similarity analysis between two or more sequences using alignment-free methods, which contribute to developing new mathematical descriptors of DNA sequences [10]. In the approach proposed by Delibaş et al. [10], DNA sequences are converted into feature vectors: each sequence is represented as an image, each pair of bases (Eq. 1) is given a value from 0 to 15, and the resulting vector is scaled to values from 1 to 255. The next step is to compute the histogram, from which four attributes are computed to form a feature vector: skewness, kurtosis, energy, and entropy (see Eqs. 2, 3, 4, 5). Finally, this feature vector can be used for similarity analysis against other vectors using the Euclidean distance:

\alpha = \{AA, AG, AC, AT, GA, GG, GC, GT, CG, CC, CT, CA, TA, TG, TC, TT\}   (1)

Skewness = \sigma^{-3} \sum_{i=0}^{G-1} (i - \mu)^3 p(i)   (2)

Kurtosis = \sigma^{-4} \sum_{i=0}^{G-1} (i - \mu)^4 p(i) - 3   (3)

Energy = \sum_{i=0}^{G-1} p(i)^2   (4)

Entropy = -\sum_{i=0}^{G-1} p(i)\,\lg(p(i))   (5)

where p(i) = h(i)/(NM), h(i) is the histogram, N and M are the image's width and height, and \mu = \sum_{i=0}^{G-1} i\,p(i).
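A sketch of the first-order-statistics features of Eqs. (2)–(5), computed from the histogram of a pair-coded sequence, is given below. The pair coding and the rescaling to the 0–255 range follow the description above, while the non-overlapping pairing, the number of histogram bins, and the logarithm base are assumptions made for this illustration.

```python
# Hedged sketch of FOS features (skewness, kurtosis, energy, entropy) for a DNA sequence.
import numpy as np

def fos_features(sequence, bins=256):
    # Consecutive (non-overlapping) base pairs coded as integers 0..15 (see Eq. (1)).
    order = ["AA", "AG", "AC", "AT", "GA", "GG", "GC", "GT",
             "CG", "CC", "CT", "CA", "TA", "TG", "TC", "TT"]
    code = {p: i for i, p in enumerate(order)}
    pairs = [sequence[i:i + 2] for i in range(0, len(sequence) - 1, 2)]
    values = np.array([code[p] for p in pairs if p in code], dtype=float)
    values *= 255.0 / 15.0                                   # rescale to the 0..255 range
    hist, _ = np.histogram(values, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    i = np.arange(bins)
    mu = (i * p).sum()
    sigma = np.sqrt(((i - mu) ** 2 * p).sum())
    skewness = ((i - mu) ** 3 * p).sum() / sigma ** 3        # Eq. (2)
    kurtosis = ((i - mu) ** 4 * p).sum() / sigma ** 4 - 3    # Eq. (3)
    energy = (p ** 2).sum()                                  # Eq. (4)
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()          # Eq. (5), log base assumed
    return np.array([skewness, kurtosis, energy, entropy])

print(fos_features("ACGTACGTGGCCATTAGCTAGCTTACGAT" * 4))
```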
3.3.2 Gray-Level Co-occurrence Matrix
GLCM is a technique for evaluating textures that takes into account the spatial relationship of pixels. It is a method of image texture analysis which has also made its way into bioinformatics [38]. The texture features that GLCM exports are entropy, contrast, energy, correlation, and homogeneity, which are useful for describing image texture [6].
This algorithm was proposed by Chen et al. [6] for sequence similarity analysis; they convert the DNA sequences into feature vectors and compute the GLCM. Each base in the sequence S = {A, C, G, T} is mapped to a number in {1, 2, 3, 4}; then, the base position is added to each value, and finally the Gray-Level Co-occurrence Matrix (GLCM) is computed. The GLCM counts the occurrences of intensity changes between neighboring pixels. For example, in Fig. 1 (left) there is a 2D input matrix A with intensities from 1 to n = 5, so the GLCM will be an n × n matrix. In the output matrix, the cell [i, j] holds the number of occurrences in which a pixel with intensity i has a horizontal neighbor pixel with intensity j. Thus, cell [1, 1] has a value of 1 because there is just one occurrence in the input matrix where a pixel with intensity 1 has a horizontal neighbor with intensity 1, while cell [1, 2] has a value of 2 because there are two occurrences where a pixel with intensity 1 has a horizontal neighbor with intensity 2. More GLCM matrices could be obtained by considering vertical and diagonal neighbors, but in the work of Chen [6] only horizontal neighbors are considered. After the GLCM is computed, it is normalized to values between 0 and 1, and then the entropy, contrast, energy, correlation, and homogeneity are computed (see Eqs. 6, 7, 8, 9 and 10). These five features represent the sequence's feature vector.
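To make the computation concrete, the sketch below builds the co-occurrence matrix for a 1-D integer-coded sequence using horizontal (adjacent) neighbours, as in Fig. 1 (right), normalises it, and evaluates the five features of Eqs. (6)–(10). The position offset that Chen et al. add to each value is omitted here for brevity, so this is an illustration rather than their exact procedure.

```python
# Hedged sketch of GLCM features for an integer-coded DNA sequence (adjacent neighbours).
import numpy as np

def glcm_features(signal, levels):
    glcm = np.zeros((levels, levels))
    for a, b in zip(signal[:-1], signal[1:]):          # adjacent-neighbour co-occurrences
        glcm[a - 1, b - 1] += 1
    p = glcm / glcm.sum()                              # normalise to [0, 1]
    i, j = np.indices(p.shape) + 1                     # 1-based intensity indices
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * p).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * p).sum())
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()                     # Eq. (6)
    contrast = ((i - j) ** 2 * p).sum()                                # Eq. (7)
    energy = (p ** 2).sum()                                            # Eq. (8)
    correlation = ((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j)  # Eq. (9)
    homogeneity = (p / (1 + np.abs(i - j))).sum()                      # Eq. (10)
    return entropy, contrast, energy, correlation, homogeneity

coded = [{"A": 1, "C": 2, "G": 3, "T": 4}[b] for b in "ACGGTACCGTTA"]
print(glcm_features(coded, levels=4))
```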
Fig. 1. Examples of GLCM Algorithm. Left: GLCM Computed from a 2D Matrix with Intensities from 1 to 5. Right: GLCM Computed from a 1D Vector with Intensities from 1 to 4.
Entropy = -\sum_{i=1}^{L} \sum_{j=1}^{L} p(i,j)\,\ln(p(i,j))   (6)

Contrast = \sum_{i=1}^{L} \sum_{j=1}^{L} (i - j)^2\, p(i,j)   (7)

Energy = \sum_{i=1}^{L} \sum_{j=1}^{L} p(i,j)^2   (8)

Correlation = \sum_{i=1}^{L} \sum_{j=1}^{L} \frac{(i - \mu_i)(j - \mu_j)\, p(i,j)}{\sigma_i \sigma_j}   (9)

Homogeneity = \sum_{i=1}^{L} \sum_{j=1}^{L} \frac{p(i,j)}{1 + |i - j|}   (10)
where p(i,j) is the normalized GLCM matrix and L is the maximum intensity value.

3.3.3 Multi-resolution Local Binary Pattern
LBP, like the previous descriptors, is a texture descriptor [51]. More specifically, LBP is an algorithm that describes the local texture features of an image. It is relevant in image processing, image retrieval, and scene analysis [28]. Initially, LBP was proposed for image texture analysis, but it has been adapted to 1D signals; Eq. 11 presents the LBP descriptor:

LBP(x(t)) = \sum_{i=0}^{p/2-1} \left[ Sign(x(t+i-p/2) - x(t))\,2^i + Sign(x(t+i+1) - x(t))\,2^{i+p/2} \right]   (11)

where p is the number of neighbouring points and Sign is:

Sign(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}   (12)

h_k = \sum_{p/2 \leq i \leq N-p/2} \delta(LBP_p(x(i)), k)   (13)
where δ is the Kronecker delta function, k = 1, 2, ..., 2^p, and N is the sequence length. In Eq. 13, h_k represents the histogram, which is the feature vector of the LBP descriptor. MLBP is simply an extension of LBP that combines the results of LBP for several values of p. Kouchaki et al. proposed the use of MLBP for alignment-free sequence comparison: first, a sequence is mapped with the numeric values of Table 1, then MLBP is applied with p = 2, 3, 4, 5, and 6, and the result is used as a feature vector for clustering. MLBP is well suited to comparisons of numerically coded nucleotide sequences; it can capture genomic signature changes optimally, allowing non-aligned comparisons and clustering of related contigs [23]. A minimal sketch of this descriptor is given below.
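The sketch implements the 1-D LBP code of Eq. (11) and the concatenated multi-resolution histograms of Eq. (13). For simplicity it uses only even neighbourhood sizes and the "Real" mapping from Table 1; both are choices made for this illustration rather than the exact configuration used in the experiments.

```python
# Hedged sketch of 1-D LBP codes (Eq. 11) and a multi-resolution histogram (Eq. 13).
import numpy as np

def lbp_codes(x, p):
    half = p // 2
    codes = []
    for t in range(half, len(x) - half):
        code = 0
        for i in range(half):
            code += (x[t + i - half] >= x[t]) << i            # left neighbours
            code += (x[t + i + 1] >= x[t]) << (i + half)      # right neighbours
        codes.append(code)
    return codes

def mlbp_histogram(x, ps=(2, 4, 6)):
    # Concatenate the normalised 2^p-bin histograms for every neighbourhood size p.
    feats = []
    for p in ps:
        hist = np.bincount(lbp_codes(x, p), minlength=2 ** p).astype(float)
        feats.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(feats)

real = {"A": -1.5, "T": 1.5, "C": -0.5, "G": 0.5}             # "Real" column of Table 1
signal = [real[b] for b in "ACGGTTACGATCGGATCACGT"]
print(mlbp_histogram(signal).shape)
```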
4 Experiments and Results
In this section, we explain the datasets used, the metrics, and the results of our experiments.
Table 1. Numeric representation for each base used by Kouchaki et al. [23]

Base   Integer   EIIP     Atomic   Real
A      2         0.1260   70       -1.5
T      -2        0.1335   78       1.5
C      -1        0.1340   58       -0.5
G      2         0.0806   66       0.5

4.1 Datasets
We used the datasets from Castor and a group of datasets proposed by Randhawa et al. [39]. The first seven belong to Castor, and the rest to the second-mentioned database (see Table 2). HBVGENCG corresponds to the Hepatitis-B virus. HIVGRPCG, HIVSUBCG, and HIVSUBPOL belong to the first type of the human immunodeficiency virus (HIV-1); EBOSPECG handles a set of sequences of the deadly Ebola disease; RHISPECG, on the other hand, is related to a common cold (Rhinovirus), and HPVGENCG is related to Human papillomavirus, a sexually transmitted disease.

Table 2. Datasets used in the experiments

Datasets      Average seq. length   No. of classes   No. of instances
HIVGRPCG      9164                  4                76
HIVSUBCG      8992                  18               597
HIVSUBPOL     1211                  28               1352
EBOSCPECG     18917                 5                751
HBVGENCG      3189                  8                230
RHISPECG      369                   3                1316
HPVGENCG      7610                  3                125
Primates      16626                 2                148
Dengue        10595                 4                4721
Protists      31712                 3                159
Fungi         49178                 3                224
Plants        277931                2                174
Amphibians    17530                 3                290
Insects       15689                 7                898
Vertebrates   16806                 5                4322
4.2 Metrics
The performance of the proposed approach was assessed using popular classification performance metrics; we employed the f1-score as the main metric, presented in the equation shown below:

F1 = \frac{2TP}{2TP + FP + FN}   (14)

where TP, FP, and FN stand for true positives, false positives, and false negatives, respectively.

4.3 Results
In this section, we compare the performance of Castor, Kameris, VGG16, VGG19, Resnet-50, Inception-v3, and the image descriptor methods. In Table 3, we show the best f1-scores obtained with k = 5. We present Kameris and Castor without dimensionality reduction and feature elimination (they got better results this way). The CNNs had results quite close to Castor and Kameris. Moreover, in Tables 4 and 5, we present experiments with 5-mer, 6-mer, and 7-mer, where certain improvements appear; for instance, on the Plants database, Inception with k = 7 reached its highest f1-score on that dataset, similar to Castor and outperforming Kameris. In addition, we tested the feature vectors produced by the image descriptor methods and used them to make predictions with three popular machine learning methods: SVM, RFC, and KNN; these predictions did not use k-mers. Table 8, which gathers the best f1-scores from Tables 3, 6, and 7, clearly shows that the image descriptor methods work better with SVM and that they outperform Castor and Kameris on databases such as Plants, HIVSUBPOL, Vertebrates, and Insects, showing a promising outlook for DNA species classification. VGG19 (the best CNN) scored lower than Kameris and Castor, although it matched them on some databases; similarly, it underperformed the alignment-free methods FOS, GLCM, LBP, and MLBP. With respect to the previous paper that inspired this work, we can notice an improvement from the image descriptor methods using the SVM classifier, since they show better performance than Kameris and Castor. Additionally, with respect to the related works, we can deduce that image descriptor methods are the best alternative for DNA genome classification, FOS being a particularly interesting method. The reason is that this method is very similar to k-mer frequencies with k = 2, which is probably why FOS outperforms the rest even in processing time.
4.4 Processing Time
There are two kinds of processing time: feature vector generation time and the second one, which is performed by taking the mentioned vector or, in the case
Table 3. We present best f1-scores of the methods proposed. We use 5-mer for all methods Dataset
Kameris Castor VGG16 VGG19 Resnet-50 Inception-v3
HIVGRPCG
1
1
0.8378
1
HIVSUBCG
1
1
0.9654
0.9832 0.9490
HIVSUBPOL
0.993
0.7997
1 0.9183
0.993
0.9418 0.9251
0.9382
0.7520
EBOSCPECG 1
1
1
1
1
1
HBVGENCG
1
1
1
0.9558
0.9781
0.9779
RHISPECG
1
1
1
1
0.9962
0.9771
HPVGENCG
1
1
0.9595
1
0.9173
0.9200
Primates
1
1
1
0.9235
0.9234
0.9235
Dengue
1
1
1
1
1
1
Protists
1
1
0.9103
0.9375
0.8893
1
Fungi
1
1
0.9782
0.9782 0.9574
0.8792
Plants
0.882
0.972 0.8857
0.7901
0.9152
0.8823
Amphibians
1
1
0.9647
0.9617
0.9819
0.9116
Insects
0.994
0.994 0.9103
0.9312 0.9105
0.8801
Vertebrates
0.998
0.998 0.9802
0.9874 0.9837
0.9573
Table 4. We present f1-score with several k-mers for VGG16 and VGG19 Dataset
VGG16 k=5 k=6
HIVGRPCG
0.8378
HIVSUBCG
0.9654 0.9091
HIVSUBPOL
0.9418
k=7
0.9367 0.9125
VGG19 k=5 k=6 1
k=7
0.9295
0.6728
0.9286
0.9832 0.9117
0.9358
0.9525 0.9291
0.9251 0.7819
0.7796
EBOSCPECG 1
1
0.9940
1
0.9940
1
HBVGENCG
1
0.9779
1
0.9558
1
1
RHISPECG
1
0.9925
1
1
0.9962
0.9924
HPVGENCG
0.9595 0.8844
0.92
1
0.8786
0.8828
Primates
1
0.9333
0.8727
0.9235
0.9646 0.8727
Dengue
1
1
1
1
1
1
Protists
0.9103
0.8843
0.9701 0.9375
0.9382
0.9422
Fungi
0.9782 0.8866
0.9314
0.9782 0.8866
0.9314
Plants
0.8857
0.8501
0.9428 0.7901
0.9429 0.9152
Amphibians
0.9116
0.9502
1
0.9647
0.9827 0.9820
Insects
0.9103 0.8914
0.8705
0.9312 0.9117
0.8930
Vertebrates
0.9802 0.9798
0.9766
0.9874 0.9721
0.9716
Table 5. We present the f1-score with several k-mers for Resnet-50 and inception Dataset HIVGRPCG
Resnet-50 k=5 k=6
k=7
Inception k=5 k=6
k=7
0.7997
0.8487
1
1
0.7628
0.6578
HIVSUBCG
0.9490
1
0.9458
0.9183
0.9117
0.9359
HIVSUBPOL
0.9382
0.9525 0.9366
0.7520
0.9331
0.9405
EBOSCPECG 1
1
1
1
1
1
HBVGENCG
0.9579
0.9778 0.9779
1
0.9784
0.9924 0.9801
0.9781
RHISPECG
0.9962
0.9886
0.9962
HPVGENCG
0.9173
0.8844
0.9200 0.9200 0.8786
Primates
0.9234
0.9235 0.8727
0.9235
Dengue
1
1
1
0.999
0.998
Protists
0.8893
0.9679 0.8893
1
1
1
Fungi
0.9574 0.9108
0.8829
0.8792
0.9157 0.9110
Plants
0.9152 0.8857
0.9152
0.8823
0.9429
Amphibians
0.9617 0.9457
0.9277
0.9819 0.9381
0.8974
Insects
0.9105 0.9046
0.8656
0.8801 0.8316
0.8091
Vertebrates
0.984
0.981
0.9573
0.9289
1
0.965
0.9771
0.8828
0.9646 0.8727
0.9510
0.9717
Table 6. We show the f1-Score of LBP and MLBP, using Support Vector Machine (SVM), Random Forest Classifier (RFC), and K-Nearest Neighboors (KNN) Dataset
LBP-SVM LBP-RFC LBP-KNN MLBP-SVM MLBP-RFC MLBP-KNN
HIVGRPCG
1
0.8685
0.9892
1
0.7940
0.9776
HIVSUBCG
1
0.7932
0.9958
1
0.6060
0.9860
HIVSUBPOL
1
0.62966
0.9834
1
0.4000
0.9800
0.9966
0.9977
1
0.9777
1
EBOSCPECG 1 HBVGENCG
1
0.7933
0.9830
1
0.8324
1
RHISPECG
1
0.8421
0.9801
1
0.9141
0.9961
HPVGENCG
1
0.9797
1
1
0.9457
1
Primates
1
0.9601
1
1
0.9198
1
Dengue
1
0.7866
0.9929
1
0.9767
0.9985
Protists
1
0.9523
1
1
0.9578
0.9894
Fungi
1
0.9555
0.9888
1
0.9777
1
Plants
1
0.9904
1
1
0.9917
1
Amphibians
1
0.9300
1
1
0.9195
1
Insects
1
0.7960
0.9786
1
0.7035
0.9897
Vertebrates
1
0.7510
0.9862
1
0.8536
0.9930
Table 7. We show f1-Score of FOS and GLCM, using Support Vector Machine (SVM), Random Forest Classifier (RFC), and K-Nearest Neighboors (KNN) Dataset
FOS-SVM FOS-RFC FOS-KNN GLCM-SVM GLCM-RFC GLCM-KNN
HIVGRPCG
1
0.7954
1
1
0.6803
HIVSUBCG
1
1
0.9958
1
0.9820
0.9958
HIVSUBPOL
1
0.5316
0.9901
1
0.4000
0.9800
0.9891
EBOSCPECG 1
1
0.9988
1
0.9909
1
HBVGENCG
0.6657
0.9963
1
0.7140
0.9927 0.9917
1
RHISPECG
1
0.7982
0.9556
1
0.7830
HPVGENCG
1
0.8045
1
1
0.7751
0.9933
Primates
1
0.9887
0.9887
1
0.9712
0.9889 0.9944
Dengue
1
0.7866
0.9929
1
0.8820
Protists
1
0.7911
1
1
0.7975
1
Fungi
1
0.9555
0.9888
1
0.8228
0.9926
Plants
1
0.9904
1
1
0.8834
0.9806
Amphibians
1
0.9300
1
1
0.7862
0.9913
Insects
1
0.6539
0.9943
1
0.6593
0.9871
Vertebrates
0.9988
0.9949
0.9949
1
0.7786
0.9946
Table 8. We show the best results extracted from Tables 3, 6, and 7

Dataset       Kameris   Castor   FOS-SVM   GLCM-SVM   LBP-SVM   MLBP-SVM   VGG19
HIVGRPCG      1         1        1         1          1         1          1
HIVSUBCG      1         1        1         1          1         1          0.983
HIVSUBPOL     0.993     0.993    1         1          1         1          0.925
EBOSCPECG     1         1        1         1          1         1          1
HBVGENCG      1         1        1         1          1         1          0.955
RHISPECG      1         1        1         1          1         1          1
HPVGENCG      1         1        1         1          1         1          1
Primates      1         1        1         1          1         1          0.923
Dengue        1         1        1         1          1         1          1
Protists      1         1        1         1          1         1          0.937
Fungi         1         1        1         1          1         1          0.978
Plants        0.882     0.972    1         1          1         1          0.790
Amphibians    1         1        1         1          1         1          0.965
Insects       0.994     0.994    1         1          1         1          0.931
Vertebrates   0.998     0.998    0.998     1          1         1          0.987
of CNNs, the generated image; this is called the prediction time, i.e., the sum of the feature vector (or image) generation time and the time taken for the prediction itself. For Kameris and Castor, we measured the processing time needed to obtain the feature vectors. Additionally, to generate the FCGR image based on Intelligent Icons
we calculated the processing time with eight random sequences. Castor had the best processing time. In Table 9, we detail the prediction time: Kameris and Castor outperformed the CNNs, and they both got better times than the image descriptor methods, since the latter used SVM, KNN, and RFC as classifiers. Resnet-50 obtained times close to Castor and Kameris, and was better in some cases. Furthermore, in Table 13, we present the feature vector generation time. Tables 10, 11, and 12 report the prediction times using SVM, RFC, and KNN, respectively. FOS obtained the best prediction time compared to GLCM, LBP, and MLBP, with MLBP being by far the slowest. Likewise, FOS and GLCM obtained times similar to the CNNs and slightly better than Castor and Kameris. The database that consumed the most time and resources was Plants because, despite having a small number of sequences compared to the other databases, its genomes are large.
Table 9. We presented the time in milliseconds to perform predictions for Kameris, Castor, VGG16, VGG19, Resnet-50, and Inception-v3
Kameris Castor
VGG16 VGG19 Resnet-50 Inception v3 Sequence length
HIVGRPCG
319.781
141.021 517.062 629.199 149.530
372.776
8654
HIVSUBCG
297.834
133.746 526.036 632.170 148.749
370.247
8589
HIVSUBPOL
43.919
EBOSPECG
642.427
HBVGENCG
113.392 13.916
HPVGENCG Primates
RHISPECG
20.018 514.573 643.902 147.104
370.454
1017
368.756
18828
50.541 518.405 626.492 149.673
375.796
3182
6.625 516.140 634.218 147.387
370.164
369
266.930
117.776 512.843 633.415 157.982
369.680
7100
579.338
254.598
376.814
16499
283.368
522.682 633.308 146.739
524.119 628.240 146.379
Dengue
348.030
157.447
515.302 632.949 148.390
380.846
10313
Protists
1105.138
490.876
513.511 629.238 153.244
373.343
24932
Fungi
1642.195
717.410
511.470 627.220 147.211
370.570
190834
Plants
5515.720 2443.862
516.660 626.762 147.379
373.236
103830
Amphibians
567.987
252.171
516.356 634.881 153.083
374.721
16101
Insects
531.752
233.795
513.336 635.343 147.526
372.987
14711
Vertebrates
562.125
263.150
513.335 626.717 149.502
374.435
16442
5 Discussion
In this work, we have carried out several experiments to understand which alignment-free methods are better in two respects: f1-score and processing time. We were
Table 10. We presented the time in milliseconds to perform predictions for cases of FOS, GLCM, LBP and MLBP, using SVM algorithm Dataset
FOS-SVM GLCM-SVM LBP-SVM MLBP-SVM Sequence length
HIVGRPCG
407.311
668.111
3622.970
3724.500
8654
HIVSUBCG
398.828
680.570
3569.309
10125.360
8589
HIVSUBPOL
191.670
450.070
622.404
1540.680
1017
EBOSCPECG
665.550
910.982
7185.070
20750.720
18828
HBVGENCG
255.128
507.173
1343.540
3765.345
3182
RHISPECG
167.250
430.730
166.426
580.290
369
HPVGENCG
360.140
635.323
3022.705
8838.360
7100
Primates
580.250
860.828
6412.801
19360.160
16499
440.207
Dengue
709.300
4008.250
12140.751
10313
Protists
1005.820 1240.820
6412.820
36001.032
24932
Fungi
1480.750 1640.750
18105.190
53434.640
190834
Plants
4800.755
4711.190
61409.530
180510.82
103830
Amphibians
630.450
950.326
6270.190
18939.231
16101
Insects
571.800
847.900
6060.421
17859.780
14711
Vertebrates
455.203
711.365
6130.20
18750.111
16442
Table 11. We presented the time in milliseconds to perform predictions for cases of FOS, GLCM, LBP, and MLBP using RFC algorithm
Dataset
FOS-RFC GLCM-RFC LBP-RFC MLBP-RFC Sequence length
HIVGRPCG
407.310
665.107
3625.855
3712.500
8654
HIVSUBCG
398.815
679.470
3572.318
10112.360
8589
HIVSUBPOL
191.675
449.111
630.414
1545.680
1017
EBOSCPECG
665.555
908.979
7182.070
20758.720
18828
HBVGENCG
255.130
502.170
1344.540
3775.345
3182
RHISPECG
166.249
428.725
160.426
599.290
369
HPVGENCG
350.142
633.330
3025.710
8988.360
7100
Primates
575.255
859.825
6480.810
19356.162
16499
Dengue
435.214
704.315
4010.255
12142.751
10313
Protists
1001.814 1245.828
6412.608
36001.032
24932
Fungi
1475.715 1641.753
18115.290
53432.640
190834
Plants
4678.740 4718.195
61403.531 180508.82
103830
Amphibians
625.440
955.322
6230.190
18926.231
16101
Insects
565.798
843.920
6050.421
17840.780
14711
Vertebrates
455.201
713.368
6115.20
18725.110
16442
Table 12. We presented the time in milliseconds to perform predictions for cases of FOS, GLCM, LBP and MLBP, using KNN algorithm Dataset HIVGRPCG
FOS-KNN GLCM-KNN LBP-KNN MLBP-KNN Sequence length 407.320
665.117
3615.680
3726.500
8654
HIVSUBCG
398.818
679.495
3572.320
10131.360
8589
HIVSUBPOL
191.692
449.131
615.414
1559.680
1017
EBOSCPECG
665.585
908.985
7299.070
20798.720
18828
HBVGENCG
255.145
510.175
1515.540
3780.345
3182
RHISPECG
169.251
435.735
176.426
650.290
369
HPVGENCG
368.140
638.392
3049.710
9001.360
7100
Primates
575.260
859.801
6499.810
22152.312
16499
440.215
Dengue
692.307
4068.255
12590.746
10313
1015.822 1230.815
6445.608
36015.078
24932
Fungi
1480.710 1615.712
18215.290
53440.656
190834
Plants
4678.722 4722.184
61349.531
180512.815
103830
6267.190
19126.231
16101
Protists
Amphibians
615.440
912.398
Insects
592.805
898.915
6117.421
18440.640
14711
Vertebrates
465.215
765.334
6132.20
18712.220
16442
able to infer that the image descriptor methods are better in terms of results, although for processing time we noticed that Castor and Resnet-50 were the best, with the most optimal times. By contrast, MLBP got the slowest time by far; for example, in Table 10, where the image descriptor methods use SVM, MLBP took 180510.82 milliseconds on the Plants database, which has very large genomes, despite having obtained a good f1-score. Additionally, considering only the feature vector generation time in Table 13, we see that Castor has the best time in all cases. Finally, we noticed that FOS has the best times for performing predictions; this might be because the method is closely related to k-mer frequencies and processes the data more easily. Additionally, Castor and Kameris had very close results despite only using SVM as a classifier, but the image descriptor methods reached 100% f1-score in almost all cases, surpassing them. Moreover, the CNNs do not reach the SVM level, owing to the lack of samples or public databases with which to train them.
6 Limitations
In this study, we have noticed that FCGR image generation demands considerably more time as the k-mer length increases; specifically, at k = 5, a more powerful GPU and more RAM would be needed to generate the images and to train the CNNs. Likewise, for the second part of the work, which consists of training the CNNs with the FCGR images, more
computational capacity is needed, both in RAM and in the GPU, because beyond roughly 2000 images CNN training becomes much slower. Additionally, when generating the feature vectors for FOS, GLCM, LBP, and MLBP, we noticed that MLBP takes a considerably long time, taking into account that only 8 test sequences were used over 10 repetitions of the experiment; this suggests that if the complete databases were used, the time required would not be acceptable.
Table 13. Processing time in milliseconds. We show the processing time to generate a feature vector for Kameris and Castor, and the image generation time in the case of FCGR inspired by the Intelligent Icons paper. We also present the processing time to generate the feature vectors of FOS, GLCM, LBP, and MLBP. We used 8 random sequences in all cases, and 5-mers for Kameris, Castor, and the FCGR image generation. The processing time was obtained after executing each mentioned case 10 times
Kameris Castor
Imagen generation
FOS
GLCM
BLP
MLBP
Sequence length
HIVGRPCG
319.603 140.910
7962.560
267.606
556.720
3394.642 10176.645
8654
HIVSUBCG
297.660 133.637
8097.371
254.532
523.795
3244.866
9763.489
8589
HIVSUBPOL
43.790 19.893
7993.734
53.488
306.918
466.795
1363.745
1017
EBOSCPECG
642.282 283.257
8039.344
508.603
787.700
7005.136 20454.488
18828
HBVGENCG
113.221 50.438
7884.267
104.132
399.19
1184.879
3616.148
13.796 6.53064
8211.903
29.710
328.248
140.611
410.564
369
HPVGENCG
266.754 117.668
7980.977
222.328
509.008
2862.636
8559.799
7100
Primates
579.158 254.484
8047.256
457.480
709.668
6123.200 18092.78
Dengue
347.867 157.344
7868.328
298.502
586.059
3902.443 11804.980
Protists
1104.980 490.764
8131.914
851.886
Fingi
1642.013 717.287
8353.357
1269.872
1486.723 17747.163 52737.838
190834
Plants
5515.532 2443.748 9401.827
4412.888
4447.10
103830
RHISPECG
1127.297 11279.85
34617.4301
60150.70 178225.92
3182
16499 10313 24932
Amphibians
567.823 252.059
7996.048
475.382
719.989
6259.095 18317.007
Insects
531.588 233.682
8086.115
421.914
690.535
5665.195 17387.733
14711
Vertebrates
561.954 263.036
8024.314
458.204
712.370
6132.39
16442
18579.14
16101
Finally, the prediction times in Tables 10, 11, and 12, using SVM, RFC, and KNN, show slightly longer times for FOS and GLCM compared to the neural networks and to Castor and Kameris; in all these tables, LBP and MLBP are the ones that consumed the most time, in contrast to the excellent f1-scores they obtained.
7 Conclusions
In this work, we evaluated Kameris, Castor, VGG16, VGG19, Resnet-50, and Inception for DNA genome classification, and we evaluated them with different k-mers. Additionally, we experimented with FOS, GLCM, LBP, and MLBP, which do not use k-mers.
We inferred that the methods based on image descriptors are the best performers even though they did not use k-mers, obtaining 100% in almost all predictions with SVM, followed by Kameris and Castor. CNNs were relegated to the last position, with VGG19 being the most representative CNN. Finally, we observed that Castor had the best processing time. By contrast, FCGR image generation and MLBP feature extraction required a great deal of time. Moreover, as a relevant result, FOS (the best image descriptor) obtained a time close to that of Kameris.
8 Future Work
In this work, FOS, GLCM, LBP, and MLBP outperformed Kameris, Castor, and the CNNs in most cases. Nevertheless, the databases contain few samples, and the number of classes differs between databases. To perform better experiments, we will build a new dataset and evaluate more sophisticated deep learning techniques.
References 1. Abd-Alhalem, S.M., et al.: DNA sequences classification with deep learning: a survey. Menoufia J. Electron. Eng. Res. 30(1), 41–51 (2021) 2. Almeida, J.S., Carrico, J.A., Maretzek, A., Noble, P.A., Fletcher, M.: Analysis of genomic sequences by chaos game representation. Bioinformatics 17(5), 429–437 (2001) 3. Bakheet, S., Al-Hamadi, A.: Automatic detection of Covid-19 using pruned GLCMbased texture features and LDCRF classification. Comput. Biol. Med. 137, 104781 (2021) 4. Barburiceanu, S., Terebes, R., Meza, S.: 3D texture feature extraction and classification using GLCM and LBP-based descriptors. Appl. Sci. 11(5), 2332 (2021) 5. Campagna, D., et al.: Rap: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics 21(5), 582–588 (2005) 6. Chen, W., Liao, B., Li, W.: Use of image texture analysis to find DNA sequence similarities. J. Theor. Biol. 455, 1–6 (2018) 7. Choi, J.Y., Kim, D.H., Choi, S.H., Ro, Y.M.: Multiresolution local binary pattern texture analysis for false positive reduction in computerized detection of breast masses on mammograms. In: Medical Imaging 2012: Computer-Aided Diagnosis, vol. 8315, pp. 676–682. SPIE (2012) 8. Riccardo Concu and MNDS Cordeiro: Alignment-free method to predict enzyme classes and subclasses. Int. J. Molec. Sci. 20(21), 5389 (2019) 9. Cores, F., Guirado, F., Lerida, J.L.: High throughput blast algorithm using spark and cassandra. J. Supercomput. 77, 1879–1896 (2021) 10. Deliba¸s, E., Arslan, A.: DNA sequence similarity analysis using image texture analysis based on first-order statistics. J. Molec. Graph. Model. 99, 107603 (2020) 11. Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., Fertil, B.: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Molec. Biol. Evol. 16(10), 1391–1399 (1999)
12. Dogan, B.: An alignment-free method for bulk comparison of protein sequences from different species. Balkan J. Electr. Comput. Eng. 7(4), 405–416 (2019) 13. Fabija´ nska, A., Grabowski, S.: Viral genome deep classifier. IEEE Access 7, 81297– 81307 (2019) 14. Gao, Y., Li, T., Luo, L.: Phylogenetic study of 2019-ncov by using alignment-free method. arXiv preprint arXiv:2003.01324 (2020) 15. Gollery, M.: Bioinformatics: sequence and genome analysis. Clin. Chem. 51(11), 2219–2220 (2005) 16. Gunasekaran, H., Ramalakshmi, K., Arokiaraj, A.R.M., Kanmani, S.D., Venkatesan, C., Dhas, C.S.G.: Analysis of DNA sequence classification using CNN and hybrid models. Comput. Math. Methods Med. 2021 (2021) 17. Hammad, M.S., Ghoneim, V.F., Mabrouk, M.S.: Detection of Covid-19 using genomic image processing techniques. In: 2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES), pp. 83–86. IEEE (2021) 18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 19. He, L., Dong, R., He, R.L., Yau, S.S.-T.: A novel alignment-free method for hiv-1 subtype classification. Infect. Genet. Evol. 77, 104080 (2020) 20. Kaur, N., Nazir, N., et al.: A review of local binary pattern based texture feature extraction. In: 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions)(ICRITO), pp. 1–4. IEEE (2021) 21. Keogh, E., Wei, L., Xi, X., Lonardi, S., Shieh, J., Sirowy, S. Intelligent icons: integrating lite-weight data mining and visualization into gui operating systems. In: Sixth International Conference on Data Mining (ICDM 2006), pp. 912–916. IEEE (2006) 22. Kola, D.G.R., Samayamantula, S.K.: A novel approach for facial expression recognition using local binary pattern with adaptive window. Multimedia Tools Appl. 80(2), 2243–2262 (2021) 23. Kouchaki, S., Tapinos, A., Robertson, D.L.: A signal processing method for alignment-free metagenomic binning: multi-resolution genomic binary patterns. Sci. Rep. 9(1), 1–10 (2019) 24. Kumar, N., Lolla, V.N., Keogh, E., Lonardi, S., Ratanamahatana, C.A., Wei, L.: Time-series bitmaps: a practical visualization tool for working with large time series databases. In: Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 531–535. SIAM (2005) 25. Lebatteux, D., Remita, A.M., Diallo, A.B.: Toward an alignment-free method for feature extraction and accurate classification of viral sequences. J. Comput. Biol. 26(6), 519–535 (2019) 26. Lee, B., Smith, D.K., Guan, Y.: Alignment free sequence comparison methods and reservoir host prediction. Bioinformatics 37, 3337–3342 (2021) 27. Leinonen, M., Salmela, L.: Extraction of long k-mers using spaced seeds. arXiv preprint arXiv:2010.11592 (2020) 28. Li, Y., Li, L.-P., Wang, L., Chang-Qing, Yu., Wang, Z., You, Z.-H.: An ensemble classifier to predict protein-protein interactions by combining pssm-based evolutionary information with local binary pattern model. Int. J. Molec. Sci. 20(14), 3511 (2019) 29. Lichtblau, D.: Alignment-free genomic sequence comparison using fcgr and signal processing. BMC Bioinf. 20(1), 1–17 (2019)
30. Liu, Z., Gao, J., Shen, Z., Zhao, F.: Design and implementation of parallelization of blast algorithm based on spark. DEStech Trans. Comput. Sci. Eng. (IECE) (2018) 31. Arceda, V.E.M.: An analysis of k-mer frequency features with svm and cnn for viral subtyping classification. J. Comput. Sci. Technol. 20 (2020) 32. Mahmoud, M.A.B., Guo, P.: DNA sequence classification based on mlp with pilae algorithm. Soft Comput. 25(5), 4003–4014 (2021) 33. Mohan, N., Varshney, N.: Facial expression recognition using improved local binary pattern and min-max similarity with nearest neighbor algorithm. In: Tiwari, S., Trivedi, M.C., Mishra, K.K., Misra, A.K., Kumar, K.K., Suryani, E. (eds.) Smart Innovations in Communication and Computational Sciences. AISC, vol. 1168, pp. 309–319. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-53455 28 ¨ urk, S 34. Ozt¨ ¸ , Akdemir, B.: Application of feature extraction and classification methods for histopathological image using glcm, lbp, lbglcm, glrlm and sfta. Procedia Comput. Sci. 132, 40–46 (2018) 35. Panthakkan, A., Anzar, S.M., Al Mansoori, S., Al Ahmad, H.: Accurate prediction of covid-19 (+) using ai deep vgg16 model. In: 2020 3rd International Conference on Signal Processing and Information Security (ICSPIS), pp. 1–4. IEEE (2020) 36. Prakasa, E.: Texture feature extraction by using local binary pattern. INKOM J. 9(2), 45–48 (2016) 37. Pratas, D., Silva, R.M., Pinho, A.J., Ferreira, P.J.S.C.: An alignment-free method to find and visualise rearrangements between pairs of dna sequences. Sci. Rep. 5(1), 1–9 (2015) 38. Pratiwi, M., Harefa, J., Nanda, S., et al.: Mammograms classification using graylevel co-occurrence matrix and radial basis function neural network. Procedia Comput. Sci. 59, 83–91 (2015) 39. Randhawa, G.S., Hill, K.A., Kari, L.: Ml-dsp: machine learning with digital signal processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genom. 20(1), 1–21 (2019) 40. Ranganathan, S., Nakai, K., Schonbach, C.: Encyclopedia of Bioinformatics and Computational Biology. ABC of Bioinformatics. Elsevier (2018) 41. Ren, J., et al.: Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 1–14 (2020) 42. Rosenberg, M.S.: Sequence Alignment: Methods, Models, Concepts, and Strategies. University of California Press (2009) 43. Ruichek, Y., et al.: Attractive-and-repulsive center-symmetric local binary patterns for texture classification. Eng. Appl. Artif. Intell. 78, 158–172 (2019) 44. Bhavya, S.V., Narasimha, G.R., Ramya, M., Sujana, Y.S., Anuradha, T.: Classification of skin cancer images using tensorflow and inception v3. Int. J. Eng. Technol. 7, 717–721 (2018) 45. Santamar´ıa, L.A., Zu˜ niga, S., Pineda, I.H., Somodevilla, M.J., Rossainz, M.: Reconocimiento de genes en secuencias de adn por medio de im´ agenes. DNA sequence recognition using image representation. Res. Comput. Sci. 148, 105–114 (2019) 46. Shanan, N.A.A., Lafta, H.A., Alrashid, S.Z.: Using alignment-free methods as preprocessing stage to classification whole genomes. Int. J. Nonlinear Anal. Appl. 12(2), 1531–1539 (2021) 47. Sharifnejad, M., Shahbahrami, A., Akoushideh, A., Hassanpour, R.Z.: Facial expression recognition using a combination of enhanced local binary pattern and pyramid histogram of oriented gradients features extraction. IET Image Process. 15(2), 468–478 (2021)
48. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 49. Singh, P., Verma, P., Singh, N.: Offline signature verification: an application of glcm features in machine learning. Ann. Data Sci. 96, 1–13 (2021) 50. Solis-Reyes, S., Avino, M., Poon, A., Kari, L.: An open-source k-mer based machine learning tool for fast and accurate subtyping of hiv-1 genomes. PloS One 13(11), e0206409 (2018) 51. Sultana, M., Bhatti, M.N.A., Javed, S., Jung, S.-K.: Local binary pattern variantsbased adaptive texture features analysis for posed and nonposed facial expression recognition. J. Electron. Imaging 26(5), 053017 (2017) 52. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015) 53. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016) 54. Tello-Mijares, S., Woo, L.: Computed tomography image processing analysis in covid-19 patient follow-up assessment. J. Healthcare Eng. 2021 (2021) 55. Vu, H.N., Nguyen, M.H., Pham, C.: Masked face recognition with convolutional neural networks and local binary patterns. Appl. Intell. 52(5), 5497–5512 (2022) 56. Wang, H., Li, L., Zhou, C., Lin, H., Deng, D.: Spark-based parallelization of basic local alignment search tool. Int. J. Bioautom. 24(1), 87 (2020) 57. Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15(3), 1–12 (2014) 58. Yang, F., Ying-Ying, X., Wang, S.-T., Shen, H.-B.: Image-based classification of protein subcellular location patterns in human reproductive tissue by ensemble learning global and local features. Neurocomputing 131, 113–123 (2014) 59. Youssef, K., Feng, W.: Sparkleblast: scalable parallelization of blast sequence alignment using spark. In: 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 539–548. IEEE (2020) 60. Zielezinski, A., Vinga, S., Almeida, J., Karlowski, W.M.: Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 18(1), 1–17 (2017)
A Review of Intrusion Detection Systems Using Machine Learning: Attacks, Algorithms and Challenges Jose Luis Gutierrez-Garcia1 , Eddy Sanchez-DelaCruz2(B) , and Maria del Pilar Pozos-Parra3 1
National Technological of Mexico, Campus Misantla and Campus Teziutlan, Misantla, Mexico
2 National Technological of Mexico, Campus Misantla, Misantla, Mexico [email protected]
3 Autonomous University of Baja California, Mexicali, Mexico [email protected]

Abstract. Cybersecurity has become a priority concern of the digital society. Many attacks are becoming more sophisticated, requiring the strengthening of strategies for the identification, analysis, and management of vulnerabilities in order to stop threats. Intrusion Detection/Prevention Systems are the first security devices that protect systems. This paper presents a survey of several aspects to consider in machine learning-based intrusion detection systems. It presents the Intrusion Detection Systems taxonomy, the types of attacks they face, and the organizational infrastructure to respond to security incidents. The survey also describes several investigations that detect intrusions using Machine Learning, describing in detail the databases used. Moreover, the most accepted metrics to measure the performance of the algorithms are presented. Finally, the challenges are discussed, motivating future research.

Keywords: Intrusion detection systems · Machine learning · Cyber security

1 Introduction
Cybersecurity is the set of technologies and processes that protect computers, networks, programs and data from attacks, unauthorized access, modification or destruction, in order to ensure the confidentiality, integrity and availability of information [1]. Cybersecurity is critical because our society is digitally connected through the internet for everyday activities such as business, health, education, transportation and entertainment. The impacts of attacks on cybersecurity are observed in the economy and even in their influence on the democracy of a country [2]. Cyber attacks are part of the list of global risks in the report published by the World Economic Forum. This report indicates that 76.1% of respondents expect cyberattacks on critical infrastructures to increase and 75% expect an increase in attacks in search of money or data [3].
There are two types of attacks: passive and active. A passive attack attempts to obtain information in order to learn more about the system and then compromise other resources. An active attack is directed at the system to alter its resources and operations. Most attacks begin with (passive) reconnaissance, since it takes time to learn how the target system is organized, reaching up to 90% of the effort used for an intrusion. According to data reported in [4], U.S. companies in 2018 took 197 days to detect and 69 days to contain data breaches, which implies a deficiency in APT detection systems and explains why the defending party cannot properly distinguish between a legitimate user and an adversarial user at each of the stages. Proactive defense measures are required, as well as precautions, even though these can harm the user experience and reduce productivity. According to the report published by [5], cyber threats are a global problem and all sectors of daily life can be affected (Table 1).

Table 1. Global attack types and sources [5]

| Global | Top attack types | Top attack sources |
| Finance (17%) | Web Attacks (46%) | United States (42%) |
| | Service-Specific Attacks (28%) | China (9%) |
| | DoS/DDoS (8%) | United Kingdom (6%) |
| Technology (17%) | Reconnaissance (20%) | China (37%) |
| | Brute-Force Attack (17%) | United States (21%) |
| | Known Bad Source (14%) | Russia (5%) |
| Business (12%) | Web Attacks (42%) | United States (26%) |
| | DoS/DDoS (20%) | China (15%) |
| | Known Bad Source (15%) | France (10%) |
| Education (11%) | Brute-Force Attack (47%) | United States (25%) |
| | Web Attacks (18%) | Netherlands (16%) |
| | Reconnaissance (16%) | Vietnam (15%) |
| Government (9%) | Service-Specific Attacks (27%) | United States (37%) |
| | Reconnaissance (21%) | Germany (14%) |
| | DoS/DDoS (16%) | France (13%) |
The forms of attack have evolved, but the types of threats remain practically the same [6], and it is important to know them in order to understand how they work (Fig. 1). Tools like TajMahal [7] allow novice users to perform sophisticated attacks due to the simplicity of their modules; they generate alarm signals that must be taken very seriously by information technology professionals. As presented in [8], 78% of the attacks identified during 2019 were successful, showing the vulnerability of computing infrastructures. The study presented in [9] describes various types of threats that can appear at different levels of the OSI model, considering the protocols defined at each level. A classification of threats is also made according to where they originate: Network, Host, Software, Physical, Human.
Fig. 1. Taxonomy of cyberthreats [6].
2 Intrusion Detection System
An intrusion detection/prevention system (IDS/IPS) is a device or application software that monitors network or system activities for malicious activity or policy violations and performs the corresponding action. Intrusion detection systems are network-based (NIDS) or host-based (HIDS). A HIDS cannot observe network traffic patterns, but it is excellent at detecting specific malicious activities and stopping them in their tracks. A NIDS analyzes the flow of information and detects specific suspicious activity before a data breach has occurred. The main characteristics of each IDS scheme are shown in Table 2 [11]. There are three dominant approaches in the current research literature and in commercial systems: signature-based detection, supervised learning-based detection, and those using a hybrid model [10]. Hybrid techniques used in IDS combine signature and anomaly detection; they are used to reduce false positives and increase the ability to identify unknown intrusions. Signature-based detection systems are highly effective in detecting those actions for which they were programmed to alert, that is, attacks or intrusions of a type that has been seen before. For this to take place, system logs are checked to find scripts or actions that have been identified as malware. When a new anomaly is discovered and clearly described in detail, the associated signature is coded by human experts,
Table 2. Types of IDS technology according to their position

| Technology | Advantages | Disadvantages | Data source |
| HIDS | It can verify the behavior of end-to-end encrypted communications. No additional hardware is required. Detects intrusions by verifying the file system, system calls or network events | Delay in reporting attacks. Consumes host resources. Requires installation on each device. It can only monitor the equipment on which it is installed | Audit logs, log files, APIs, system calls and pattern rules |
| NIDS | Detects attacks by checking network packets. It can check several devices at the same time. Detects the widest range of network protocols | Identification of attacks from traffic. Aimed at network attacks only. Difficulty analyzing high-speed networks | SNMP, TCP/UDP/ICMP. Management Information Base. Router NetFlow logs |
which is then used to detect a new occurrence of the same action. These types of systems provide high precision for those attacks that have been previously identified; however, they have difficulty identifying Zero-Day intrusions, since there is no previous signature with which to carry out the comparison and, consequently, raise the corresponding warning or alarm, which makes these strategies less effective.

The detection of anomalies involves two essential elements: first, there is a way to identify normal behaviors; subsequently, any deviation from this behavior can be an indicator of unwanted activities (anomalies). Anomaly detection applies to various fields such as weather anomalies, flight routes, bank transactions, academic performance, internal traffic in an organization's network, system calls or functions, etc. The detection of anomalies in the network has become a vital component of any network on the current internet. From unexpected non-malicious events and sudden failures to network attacks and intrusions such as denial of service, network scanning, malware propagation, botnet activity, etc., network traffic anomalies can have serious negative effects on network performance and integrity. The main challenge in automatically detecting and characterizing traffic anomalies is that they are constantly changing: it is difficult to define precisely and continuously the set of possible anomalies that may arise, especially in the case of network attacks, as new attacks are continually emerging, as well as new variants of known attacks. An anomaly detection system should be able to detect a wide range of anomalies with various structures, without relying solely on prior knowledge and information. In this type of system, a model of the normal behavior of the system is created through machine-learning, statistical-based or knowledge-based methods, and any observed deviation from the system's behavior is considered an anomaly and, therefore, a possible intrusion. The main advantage of this type of system is that it can identify zero-day attacks by recognizing a user's abnormal activity without relying on a signature database. It also allows the identification of malicious internal activities by detecting deviations in common user behaviors. A disadvantage of these IDS is that they can have a high rate of false positive alerts.

IDS/IPS, firewalls, sandboxes, among others, are part of the group of low-level elements that protect connected devices and systems, establish a security perimeter and prevent unauthorized access. Security Operations Centers (SOC) are specialized centers that collect security events, perform threat/attack detection, share threat intelligence, and coordinate responses to incidents. Security Information and Event Management (SIEM) systems support SOC operations through the management of incoming events, information classification, correlation inference, visualization of intermediate results and the integration of automatic detections. SIEM systems feed on the logs of running systems, DNS servers or security systems such as firewalls, IDS, etc. (Fig. 2).
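As a concrete illustration of the anomaly-based approach described above, the sketch below (not taken from any of the surveyed systems) fits a model of normal traffic and flags deviations with scikit-learn's IsolationForest; the flow features and their values are invented for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical flow features: [bytes per second, packets per second, mean packet size]
normal_traffic = rng.normal(loc=[5000, 40, 120], scale=[500, 5, 10], size=(1000, 3))
new_traffic = np.array([
    [5100, 42, 118],    # looks like normal behaviour
    [90000, 900, 60],   # a flood-like deviation
])

# Fit the "normal behaviour" model, then score incoming observations.
detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal_traffic)

labels = detector.predict(new_traffic)  # +1 = normal, -1 = anomaly
for flow, label in zip(new_traffic, labels):
    print(flow, "anomaly" if label == -1 else "normal")
```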
Fig. 2. Business cyber security levels [12].
The lowest level (L0) holds the information of the organization's infrastructure. L1 provides basic security analysis and also executes some real-time responses to malicious events, such as access control and intrusion prevention. The responses at this level are based on what are known as Indicators of Compromise (IoC). These indicators serve to discover connections between events and to find missing, lost or hidden components, and thus obtain a global vision of the attack. In each of the layers of the pyramid there are programmed or analytical methods with which the information can be processed. These methods consider prior knowledge about malicious activities (signature-based antivirus, rule-based firewalls) or knowledge of normal activities (anomaly detection). The intermediate layers (L1 to L3) use anomaly detection mechanisms as described below:
L1. This level usually focuses on particular data, such as system calls, network packets or CPU usage. There are commercial implementations that carry out these activities, as shown in Table 3.

Table 3. Commercial implementations at level L1 [12]

| Supplier | Product | Anomaly |
| Avast | Antivirus | Program Behavior |
| Fortinet | IPS | Protocol |
| LogRhythm | EndPoint analysis | Behavior of users and entities |
| ManageEngine | Application Manager | Performance |
| RSA | Silver Tail | Network |
| Silicon Defense | Anomaly detection engine based on packet statistics | Network packets |
| SolarWinds | Log analysis of the IDS Snort | Network |
L2. SIEM systems connect activity fragments and infer behaviors from the flows of the L1-layer systems, as well as classifying behaviors from a discovered attack. Some SIEM implementations use time-series-based algorithms, considering a much longer time window than at the L1 layer. Users and entities can be identified from the flows provided by L1, building user models and verifying incoming user behavior against the model to detect compromised accounts, abuse of privileges, brute force, data exfiltration, etc. Table 4 shows commercial implementations.

Table 4. Commercial implementations at level L2 [12]

| Supplier | L2 SIEM product | Anomaly |
| EventTracker | Security Center | General behavior analysis |
| HPE | ArchSight | Peer group analysis |
| IBM | QRadar | Traffic behavior analysis |
| LogRythm | LogRythm | Users and entities analysis |
| MicroFocus | Sentinel Enterprise | Environment analysis |
| Splunk | Enterprise security | Statistics and behavioral analysis |
| Trustwave | SIEM Enterprise | Network behavior analysis |
L3. Although the systems at this level are considered automatic, with self-learning characteristics and high precision, the human analyst must intervene in the detection process due to the limited domain knowledge poured
into these systems and thereby discard false positives, remembering that an anomaly does not always indicate an attack or threat. The sources of information on security events include network traffic data, firewall logs, logs from web servers, systems, router access, databases, applications, etc. Additionally, the challenges imposed when handling security-related data must be considered: assurance of privacy, authenticity and integrity of event data, adversarial attacks, and the time in which the attack is carried out. Commercial IDS use datasets to carry out their operations; however, these datasets are not available for privacy reasons. There are public datasets such as DARPA, KDD, NSL-KDD and ADFA-LD, which are used to carry out the training, testing and validation of the different proposed models. Table 5 presents a summary of the most popular datasets.

Table 5. Public datasets

| DataSet | Description |
| DARPA/KDD Cup99 | Dataset developed by DARPA to strengthen the development of IDS. It consists of TCP packets captured over two months, with simulated attacks or intrusions interspersed. It has 4,900,000 records and 41 variables. In 1998, this dataset was used as a basis to form the KDD Cup99, which was used in the Third International Knowledge Discovery and Data Mining Tools Competition [13] |
| CAIDA | Dataset from 2007 that contains network traffic flows during DDoS attacks; this is a disadvantage since it covers only one type of attack |
| NSL-KDD | A dataset developed in 2009 to address problems of the KDD Cup99 regarding accuracy and a high percentage of duplicated packets, which influenced ML algorithms |
| ISCX 2012 | It contains realistic network traffic with different types of attacks. HTTP, SMTP, SSH, IMAP, POP3 and FTP protocol packets were captured |
| ADFA-LD/ADFA-WD | Datasets developed by the Australian Defense Force Academy containing records of the Linux and Windows operating systems for instances of Zero-day malware attacks |
| CICIDS 2018 | It includes benign behavior and details of recent malware attacks in the categories of Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet and DDoS [14]. It is a very complete dataset that includes 80 features of the captured traffic |
In the work presented by [15], various datasets that have been used for research purposes are shown, along with their characteristics and the types of attacks they contain. The CSE-CIC-IDS 2018 dataset was collected on one of Amazon's AWS LAN networks (and is thus also known as the CIC-AWS-2018 dataset) by the Canadian Institute for Cybersecurity (CIC). It is the evolution of CICIDS 2017 and is publicly available for research. The types of attacks included in this dataset are:
– Botnet attack: A botnet is a collection of internet-connected devices infected by malware that allows hackers to control them. Cyber criminals use botnets to instigate botnet attacks, which include malicious activities such as credential leaks, unauthorized access, data theft and DDoS attacks.
– FTP-BruteForce: FTP brute-force testing is a method of obtaining the user's authentication credentials, such as the username and password, to log in to an FTP server. File servers are repositories of any organization; attackers can use brute-force applications, such as password-guessing tools and scripts, to try all the combinations of well-known usernames and passwords.
– SSH-BruteForce: SSH is a high-security protocol. It uses strong cryptography to protect the connection against eavesdropping, hijacking and other attacks, but brute-force attacks are a major security threat against remote services such as SSH. The SSH brute-force attack attempts to gain unauthorized access by guessing username and password pairs.
– BruteForce-Web: Brute-force attacks take advantage of automation to try many more passwords than a human could, breaking into a system through trial and error. More targeted brute-force attacks use a list of common passwords to speed this up, and using this technique to check for weak passwords is often the first attack a hacker will try against a system.
– BruteForce-XSS: Some malicious scripts can be injected into trusted web sites. XSS attacks occur when an attacker sends malicious code, generally in the form of a browser-side script, to a different browser/visitor.
– SQL Injection: A type of injection attack that makes it possible to execute malicious SQL statements. These statements control a database server behind a web application. Attackers can use SQL injection vulnerabilities to bypass application security measures.
– DDoS-HOIC attack: Used for denial of service (DoS) and distributed denial of service (DDoS) attacks, it functions by flooding target systems with junk HTTP GET and POST requests, which aims to flood a victim's network with web traffic and shut down a web site or service.
– DDoS-LOIC-UDP attack: Floods the server with TCP/UDP packets, with the intention of interrupting the host service.
– DoS-Hulk attack: A typical HULK attack may attempt to launch a SYN flood against a target host in a single or distributed manner. In addition to a large number of legitimate HTTP requests, HULK will also generate a large number of uniquely crafted malicious HTTP requests.
– DoS-SlowHTTPTest attack: "Slow HTTP" attacks on web applications exploit the fact that the HTTP protocol, by design, requires requests to arrive complete before they can be processed. If an HTTP request is not complete, or if the transfer rate is very low, the server keeps its resources busy waiting for the rest of the data to arrive; if the server keeps many resources in use, a denial of service (DoS) can occur.
– DoS-GoldenEye attack: GoldenEye is similar to an HTTP flood and is a DDoS attack designed to overwhelm web servers' resources by continuously requesting single or multiple URLs from many attacking source machines.
– DoS-Slowloris attack: Slowloris is an application-layer DDoS attack which uses partial HTTP requests to open connections between a single computer and a targeted web server, then keeps those connections open for as long as possible, thus overwhelming and slowing down the target.
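To show how a labeled dataset such as CSE-CIC-IDS2018 is typically used with supervised learning, here is a small hedged sketch; the CSV file name is hypothetical, and a numeric feature matrix with a 'Label' column is assumed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical CSV export of CSE-CIC-IDS2018 flows; column names are assumptions.
df = pd.read_csv("cicids2018_sample.csv").dropna()

X = df.drop(columns=["Label"]).select_dtypes("number")  # numeric flow features
y = df["Label"]                                          # e.g. 'Benign', 'DDoS-HOIC', ...

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```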
3 Machine Learning and Intrusion Detection
Machine Learning (ML) refers to algorithms and processes that "learn" in the sense that they are able to generalize from past data and experiences in order to predict future results. Essentially, it is a set of mathematical techniques implemented on computer systems to mine data, discover patterns and make inferences from the data [6]. Algorithms can be classified according to learning style. In supervised learning, the input data has known labels or results that allow the model to be trained, making corrections when an incorrect result is produced until the desired level of precision is reached. In unsupervised learning, the input data is unlabeled and there is no known result; the model deduces the structure found in the input data. Finally, in semi-supervised learning, the input data is mixed, that is, both labeled and unlabeled. Algorithms can also be classified by the type of operation to which they belong; Table 6 shows this classification. The proposal made by [16] exposes a multilayer scheme for the timely identification of botnets (IRC, SPAM, Click Fraud, DDoS, FastFlux, Port Scan, Compiled and Controlled record by CTU, HTTP, Waledac, Storm and Zeus) that use different communication channels, making use of packet capture through specialized tools such as Wireshark to carry out packet filtering and only pay attention to the desired protocol, achieving an accuracy of 98%. A review by [18] notes that many models generated by AI are susceptible to adversarial attacks. It establishes that these attacks can occur during three stages: training, testing and deployment. Attacks can be either white box or black box, depending on the knowledge of the target model. That study was carried out on computer vision applications, image classification, semantic image segmentation and object detection, with a brief segment on cybersecurity; nevertheless, it clearly exposes the vulnerabilities of the different models used. A checklist has been proposed to evaluate the robustness of defenses against adversarial attacks, ranging from the identification of the most tangible threat to the identification of flags and traps that usually arise during an adversarial attack [19]. In the review carried out by [20], a classification of the main types of threats/attacks was obtained (Intrusion Detection System, Alert Correlation, DoS Detection, Botnet Detection, Forensic Analysis, APT Detection, Malware Detection, Phishing Detection) that the
Table 6. Taxonomy of machine learning algorithms

| Algorithms | Description | Examples |
| Regression algorithms | Models the relationship between variables, using a measure of error in the predictions made by the model | Ordinary Least Squares Regression (OLSR), Linear Regression, Logistic Regression, Stepwise Regression, Multivariate Adaptive Regression Splines (MARS), Locally Estimated Scatterplot Smoothing (LOESS) |
| Instance-based algorithms | A database is built with sample data; new incoming data is compared against the database using a similarity measure to find the best match and make a prediction | k-Nearest Neighbor (kNN), Learning Vector Quantization (LVQ), Self-Organizing Map (SOM), Locally Weighted Learning (LWL), Support Vector Machines (SVM) |
| Decision tree algorithms | A decision model is constructed from the values present in the attributes of the data. They are fast and are among the main algorithms used in ML | Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5 and C5.0, Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, M5, Conditional Decision Trees |
| Bayesian algorithms | They apply Bayes' theorem to regression and classification problems | Naive Bayes, Gaussian Naive Bayes, Multinomial Naive Bayes, Averaged One-Dependence Estimators (AODE), Bayesian Belief Network (BBN), Bayesian Network (BN) |
| Clustering algorithms | Data structures are used to obtain the best organization of the data in groups that are as homogeneous as possible | k-Means, k-Medians, Expectation Maximisation (EM), Hierarchical Clustering |
| Artificial neural network algorithms | Based on the functioning of biological neural networks. The input information passes through a series of operations, weight values and limiting functions | Perceptron, Multilayer Perceptrons (MLP), Back-Propagation, Stochastic Gradient Descent, Hopfield Network, Radial Basis Function Network (RBFN) |
| Deep learning algorithms | An evolution of artificial neural networks; more complex and extensive, suited to working with large labeled datasets of analog data such as text, image, audio and video | Convolutional Neural Network (CNN), Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), Stacked Auto-Encoders, Deep Boltzmann Machine (DBM), Deep Belief Networks (DBN) |
efforts to face them have been investigated and published, as well as the quality attributes that must be considered when making a proposal and the architectural aspects needed to satisfy the desired quality attributes. It exposes a great disconnect between academia and industry in terms of shared data, as well as in the implementation of analytical security systems in a business environment. Given the lack of information with which to carry out research on recent intrusion events, [21] proposes a cloud-based sharing scheme for research and collaborating institutions with 5 levels of trust and security; however, as the authors themselves say, it is still so complex and extensive that it requires security experts to start the process. The proposed framework is accompanied by agreements and policies to carry out the sharing of information on cyber threats. Regarding ransomware attacks, [22] proposes an implementation using supervised ML to detect ransomware from small datasets, showing an accuracy rate of 95%; it should be noted that no further details are provided on the characteristics of the datasets used. This type of malware executes a series of operations [Reconnaissance, Weaponization, Delivery, Exploitation, Installation, Command and Control (C2), Actions on Objectives] to carry out the operation for which it was designed [23]. Botnets are a great threat since they can be used for various malicious activities such as distributed denial of service (DDoS), mass spam mailing, phishing, DNS spoofing, adware installation, etc. The analysis presented by [24] indicates the main components that make up a botnet, the mechanisms used for communication as well as its architecture, noting that detection mechanisms based on signatures, anomalies and hybrids are the most used so far. The proposal made by [25] presents the identification of bots through graphs, considered more intuitive than flow-based approaches, due to the computational cost. The proposal made by [26] exposes a multilayer scheme for the timely identification of botnets (IRC, SPAM, Click Fraud, DDoS, FastFlux, Port Scan, Compiled and Controlled record by CTU, HTTP, Waledac, Storm and Zeus) that use different communication channels, making use of packet capture through specialized tools such as Wireshark to carry out packet filtering and only pay attention to the desired protocol, achieving an accuracy of 98% according to the published information. Tables 7, 8 and 9 show a summary of the proposals made in recent publications on the application of ML to the detection of intrusions across various types of attacks. The incidence of the main ML algorithms is reflected in Fig. 3, where the category of supervised learning is the most used.
4 Metrics
There are many metrics for measuring IDS performance, and some are more suitable than others. For example, when the data is not balanced, the accuracy metric is not recommended, as is the case with network traffic data for the classification of attacks [17]. The choice of metrics to evaluate the Machine
Table 7. Machine learning in intrusion detection. Year 2019

| Year | Description | DataSet | Algorithm | Accuracy |
| 2019 [27] | Apply RNN to obtain the best characteristics dynamically, based on the values that the variables present daily for each type of attack | ISCX 2012 | RNN | 86–90% |
| 2019 [28] | Performs an evaluation of various supervised learning algorithms on each of the attacks that are established in the dataset | CIC-AWS-2018 | RF, Gaussian naive bayes, Decision tree, MLP, K-NN | 96–100% |
| 2019 [35] | Flow-based anomaly detection in software-defined networks, optimizing feature selection through ANOVA F-Test and RFE and ML algorithms | NSL-KDD | RF, DNN | 88% |
| 2019 [36] | Classify network flows through the LSTM model | ISCX2012, USTC-TFC2016, IoT dataset from Robert Gordon University | LSTM | 99.46% |
Learning model is very important, since it influences the way in which the model is measured and compared.
Confusion Matrix: It is a tool within supervised learning for observing the behavior of the model. The test data allow us to see where the model confuses the classes when classifying them. In Table 10 you can see the layout of the matrix, where the columns indicate the actual values and the rows the values that the model predicts. Based on the information shown in the matrix, the following can be indicated:
TP: the number of actual positive values that were correctly classified as positive by the model.
TN: the number of actual negative values that were correctly classified as negative by the model.
FP: the number of actual negative values that were incorrectly classified as positive by the model.
FN: the number of actual positive values that were incorrectly classified as negative by the model.
Table 8. Machine learning in intrusion detection. Year 2020

| Year | Description | DataSet | Algorithm | Accuracy |
| 2020 [29] | A hybrid IDS is proposed that combines a C5 decision tree classifier to identify known attacks and a One-Class Support Vector Machine to identify intrusions by means of anomalies. The Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD) and Australian Defense Force Academy (ADFA) datasets are used | NSL-KDD, ADFA | Decision tree, SVM | 83.24, 97.4% |
| 2020 [30] | Proposes a new algorithm based on SVM, more resistant to noise and deviations, oriented to data on the sequence of system calls | UNM sendmail, UNM live lpr published by New Mexico university | FSVM based on SVDD | 86.92% |
| 2020 [31] | A DNN and association rules using the NSL-KDD dataset to mine network traffic and classify it; subsequently, through the Apriori algorithm, it seeks to eliminate false positives | NSL-KDD | DNN | 89.01–99.01% |
| 2020 [32] | Adaptation of the available datasets to balance them through the SMOTE technique to generate synthetically balanced data | CIDDS-001 and ISCXBot-2014 | KNN, DT, SVM, LR, RF | 93.37–98.84% |
| 2020 [33] | A multilayer hierarchical model to work in conjunction with knowledge-based models. First, a binary classification is carried out, then the type of attack is identified and finally the previous knowledge is extracted to update the detail about the type of attack and thus improve the performance of the classifier. The time spent for training is greater than in other proposals | KDD 99 | C4.5, RF, EM, FPA | 99.8% |
From this confusion matrix, various metrics have been generated, oriented to problems of classification (Table 11), Regression (Table 12) or clustering (Table 13).
Table 9. Machine learning in intrusion detection. Year 2021

| Year | Description | DataSet | Algorithm | Metrics |
| 2021 [44] | They propose a two-phase technique called LIO-IDS: the first phase is based on the Long Short-Term Memory (LSTM) classifier and the improved One-vs-One technique to handle frequent and infrequent intrusions; in the second phase, ensemble algorithms are used to identify the types of intrusion attacks detected in each of the datasets used | NSL-KDD, CIDDS-001, CICIDS2017 | LSTM | Recall: 31–98%, Precision: 64–95%, F1: 56–94% |
| 2021 [43] | It uses a 4-step method to detect DDoS attacks: preprocessing (coding, log2, PCA), model generation through Random Forest, contrast tests with Naive Bayes, and evaluation with accuracy, false alarm, detection, precision and F-measure rates | MIX (PORTMAP + LDAP) CICDDOS2019 | RF, NB | Accuracy = 99.97%, F1-score = 99.9% |
| 2021 [45] | They propose an oversampling technique based on Generative Adversarial Networks (GAN) applied to the class identified as attack, and feature selection through ANOVA, in order to balance the dataset and improve the effectiveness of intrusion detection | NSL-KDD, UNSW-NB15 and CICIDS-2017 | NB, DT, RF, GBDT, SVM, K-NN, ANN | Accuracy = 97.7–99.84%, F1-score = 91.89–99.6% |
| 2021 [46] | In this study, a benchmark of the main algorithms is carried out to evaluate the performance of the models in intrusion detection through various metrics, considering different percentages in the k-fold cross-validation. Only the Brute Force, XSS and SQL Injection attack types were considered | CICIDS2017 | ANN, DT, KNN, NB, RF, SVM, CNN, K-Means, SOM, EM | Accuracy: 75.21–99.52%, Precision: 99.01–99.49%, Recall: 76.6–99.52% |
| 2021 [34] | This study uses several ML algorithms to identify anomalous behavior of an IDS on the SDN controller. It first identifies when there is an attack and then classifies it into the corresponding attack type | NSL-KDD | DT, RF, XGBoost | Accuracy: 95.95%, Precision: 92%, Recall: 98%, F1-score: 95.55% |
Fig. 3. ML algorithms [2019–2020].

Table 10. Confusion matrix

|                     | Actual positives | Actual negatives |
| Predicted positives | TP               | FP               |
| Predicted negatives | FN               | TN               |
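The following illustrative sketch (not part of the survey) shows how the confusion-matrix counts in Table 10 translate into the accuracy, precision, recall and F1 formulas listed in Table 11; the labels and predictions are invented.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary labelling."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Toy example: 1 = attack, 0 = normal traffic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)          # also called TPR or sensitivity
f1        = 2 * precision * recall / (precision + recall)
print(f"acc={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```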
5 Challenges
Dataset: Most of the investigations reviewed use old and outdated datasets that do not reflect the current situation when identifying intrusions with different ML algorithms. This is due to the poor accessibility of datasets on the most recent cybersecurity events that companies are facing. As a way to deal with this situation, many researchers choose to generate their own datasets, but constant updating of these datasets is required, incorporating the latest attacks; this involves a constant collection of network and system data so as to keep training and testing as close to reality as possible.

Linking Academia and Industry: Research is conducted in simulated settings within academia; however, these settings are very different from the real scenarios within organizations, and the threats or attacks organizations face are very broad and complex, requiring analysis and immediate response. These situations create a distance between the two entities, causing outdated knowledge. Collaboration agreements with academic entities are required to train specialized personnel and to access updated data, in order to ensure that the results of the investigations have an effective impact and are implemented immediately, complying with all privacy and access responsibilities regarding the information.

Adversarial Attacks: Any strategy implemented through ML allows promising advances and results; however, the same ML technology can be used to generate attacks against the models in charge of identifying intrusions, which can bias the results towards the interests of the attackers. Much of the research reviewed does not consider this type of attack against the proposals presented, so it would be desirable to know the robustness of each proposal against adversarial attacks, since they can evade the model after it has been trained or contaminate the training data to force a misclassification by the model [37–39].
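As a hedged illustration of the evasion problem raised above (not a technique from the cited works), the sketch below trains a simple classifier on synthetic flow features and shows how a small feature manipulation of attack samples can lower its recall:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)

# Synthetic data: class 0 = normal flows, class 1 = attack flows (features invented).
normal = rng.normal([0, 0], 1.0, size=(500, 2))
attack = rng.normal([4, 4], 1.0, size=(500, 2))
X = np.vstack([normal, attack])
y = np.array([0] * 500 + [1] * 500)

clf = LogisticRegression().fit(X, y)

# Naive evasion: shift attack samples a little toward the "normal" region.
attack_test = rng.normal([4, 4], 1.0, size=(200, 2))
perturbed = attack_test - 2.5  # small feature manipulation by the attacker

print("recall on clean attacks:    ", recall_score(np.ones(200), clf.predict(attack_test)))
print("recall on perturbed attacks:", recall_score(np.ones(200), clf.predict(perturbed)))
```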
Table 11. Performance metrics to evaluate machine learning algorithms for classification.

| Metric | Formula |
| Classification rate or Accuracy (CR): the rate of correctly detected abnormal or normal behavior. It is normally used when the dataset is balanced, to avoid a false sense of good model performance | CR = (TP + TN) / (TP + TN + FP + FN)  (1) |
| Recall (R), True Positive Rate (TPR) or Sensitivity: the number of correctly identified attacks divided by the total number of attacks. It is the ability of the classifier to detect all positive cases (attacks) | TPR = TP / (TP + FN)  (2) |
| False Positive Rate (FPR): the rate of false alarms | FPR = FP / (FP + TN)  (3) |
| False Negative Rate (FNR): the rate at which the detector fails to identify an abnormality and classifies it as normal | FNR = FN / (FN + TP)  (4) |
| True Negative Rate (TNR): also known as specificity; the normal cases identified as such | TNR = TN / (TN + FP)  (5) |
| Precision (P): represents the confidence of attack detection. It is useful when Accuracy is not recommended because the dataset is not balanced | P = TP / (TP + FP)  (6) |
| F1-Score: the harmonic mean of Precision and Recall (TPR); when the two are disproportionate it stays close to the lower of them. A value close to 1 means that the classifier performance is optimal | F1 = 2PR / (P + R)  (7) |
| ROC Curve: a graph that shows the performance of a binary classifier as a function of a cutoff threshold | — |
| Area Under the Curve (AUC): the probability that the model ranks a random positive example higher than a random negative example | — |
Table 12. Performance metrics to evaluate machine learning algorithms for regression.

| Metric | Formula |
| Mean Squared Error (MSE): measures the mean squared error of the predictions. The higher this value, the worse the model | MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²  (8) |
| Mean Absolute Error (MAE): the average of the absolute differences between the original values and the predictions. It is more robust against anomalies than MSE because of the squaring that the latter performs | MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|  (9) |
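A short worked example of Eqs. (8) and (9) with invented values, included only to make the formulas concrete:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # invented observed values
y_pred = np.array([2.8, 5.4, 2.0, 8.0])   # invented model predictions

mse = np.mean((y_true - y_pred) ** 2)     # Eq. (8)
mae = np.mean(np.abs(y_true - y_pred))    # Eq. (9)
print(f"MSE={mse:.3f}  MAE={mae:.3f}")
```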
Table 13. Performance metrics to evaluate machine learning algorithms for clustering.

| Metric | Formula |
| Davies-Bouldin Index: the DB index assumes that clusters that are well separated from each other and heavily populated form a good grouping. It indicates the average "similarity" between groups, where similarity is a measure that compares the distance between groups with the size of the groups. Values closer to zero indicate better grouping | R_ij = (s_i + s_j) / d_ij,  DB = (1/k) Σ_{i=1}^{k} max_{i≠j} R_ij |
| Silhouette Coefficient: indicates how well an element was assigned to its cluster. If the value is closer to −1, it would have been better to assign the element to the other group; if s(i) is close to 1, the point is well assigned and can be interpreted as belonging to an "appropriate" cluster | s = (b − a) / max(a, b) |
| Fowlkes-Mallows Index (FMI): the geometric mean of precision and recall. The value is between 0 and 1, with a higher value indicating good similarity between the clusters | FMI = TP / √((TP + FP)(TP + FN)) |
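The clustering metrics in Table 13 are available in scikit-learn; the following hedged sketch computes them on invented two-dimensional data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (davies_bouldin_score, fowlkes_mallows_score,
                             silhouette_score)

rng = np.random.default_rng(0)

# Two invented blobs of points standing in for two traffic behaviours.
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
true_labels = np.array([0] * 100 + [1] * 100)

pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Davies-Bouldin: ", davies_bouldin_score(X, pred_labels))    # lower is better
print("Silhouette:     ", silhouette_score(X, pred_labels))        # closer to 1 is better
print("Fowlkes-Mallows:", fowlkes_mallows_score(true_labels, pred_labels))
```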
Algorithms: Given the recent implementation of algorithms based on Deep learning, which are generating outstanding results in other investigations [40–42], it is desirable that they be implemented in case studies for Intrusion Detection Systems, in order to corroborate performance.
6 Conclusions
Given the large amount of information generated daily and society's dependence on information technologies, the security of computer resources and of the information itself is of vital importance. The different techniques and products related to computer security have played a very important role in recent years, with IDS/IPS being one of the essential elements within any cybersecurity scheme. The application of Machine Learning as a support in the identification and prevention of intrusions has provided favorable and promising results. This survey reveals the need for constantly updated data in order to give results that respond to current threats; this is one of the challenges facing academia when proposing alternative techniques for identifying intrusions using ML. It also makes clear that a close link is required between academia and global security organizations to access specialized, up-to-date information and training on the threats, techniques and challenges facing modern society.
References 1. Bettina, J., Baudilio, M., Daniel, M., Alajandro, B., Michiel, S.: Challenges to effective EU cybersecurity policy. European Court of Auditors, pp. 1–74 (2019) 2. Gerling, R.: Cyber Attacks on Free Elections. MaxPlanckResearch, pp. 10–15 (2017) 3. World Economic Forum. The Global Risks Report 2020. Insight Report, pp. 1–114 (2020). 978-1-944835-15-6. http://wef.ch/risks2019 4. Ponemon Institute. 2015 Cost of Data Breach Study: Impact of Business Continuity Management (2018). https://www.ibm.com/downloads/cas/AEJYBPWA 5. Katsumi, N.: Global Threat Intelligence Report Note from our CEO. NTT Security (2019) 6. Chi, C., Freeman, D.: Machine Learning and Security. O’Reilly, Sebastopol (2018) 7. Kapersky. Project TajMahal a new sophisticated APT framework. Kapersky (2019). https://securelist.com/project-tajmahal/90240/ 8. CyberEdge Group. Cyberthreat Defense Report. CyberEdge Group (2019). https://cyber-edge.com/ 9. Hanan, H., et al.: A Taxonomy and Survey of Intrusion Detection System Design Techniques, Network Threats and Datasets. ACM (2018). http://arxiv.org/abs/ 1806.03517 10. Mazel, J., Casas, P., Fontugne, R., Fukuda, K., Owezarski, P.: Hunting attacks in the dark: clustering and correlation analysis for unsupervised anomaly detection. Int. J. Netw. Manag. 283–305 (2015). https://doi.org/10.1002/nem.1903 11. Khraisat, A., Gondal, I., Vamplew, P., Kamruzzaman, J.: Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity 2(1), 1–22 (2019). https://doi.org/10.1186/s42400-019-0038-7 12. Yao, D., Shu, X., Cheng, L., Stolfo, S.: Anomaly Detection as a Service: Challenges, Advances, and Opportunities. Morgan & Claypool Publishers, San Rafael (2018) 13. KDD. KDD-CUP-99 Task Description (1999). https://kdd.ics.uci.edu/databases/ kddcup99/task.html
14. Sharafaldin, I., Habibi, A., Ghorbani, A.: Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: ICISSP 2018 - Proceedings of the 4th International Conference on Information Systems Security and Privacy, pp. 108–116 (2018). https://doi.org/10.5220/0006639801080116 15. Ring, M., Wunderlich, S., Scheuring, D., Landes, D., Hotho, A.: A survey of network-based intrusion detection data sets. Comput. Secur. 147–167 (2019). https://arxiv.org/abs/1902.00053. https://doi.org/10.1016/j.cose.2019.06.005 16. Ullah, R., Zhang, X., Kumar, R., Amiri, N., Alazab, M.: An adaptive multi-layer botnet detection technique using machine learning classifiers. Appl. Sci. 9(11), 2375 (2019) 17. Mag´ an-Carri´ on, R., Urda, D., D´ıaz-Cano, I., Dorronsoro, B.: Towards a reliable comparison and evaluation of network intrusion detection systems based on machine learning. Appl. Sci. (2020). https://doi.org/10.3390/app10051775 18. Qiu, S., Liu, Q., Zhou, S., Wu, C.: Review of artificial intelligence adversarial attack and defense technologies. Appl. Sci. (2019). https://doi.org/10.3390/app9050909 19. Carlini, N., et al.: On Evaluating Adversarial Robustness (2019). https://arxiv. org/abs/1902.06705 20. Ullaha, F., Babara, M.: Architectural tactics for big data cybersecurity analytics systems: a review. J. Syst. Softw. 151, 81–118 (2019). https://doi.org/10.1016/j. jss.2019.01.051 21. Chadwick, D., et al.: A cloud-edge based data security architecture for sharing and analysing cyber threat information. Future Gener. Comput. Syst. 102, 710–722 (2020). https://doi.org/10.1016/j.future.2019.06.026 22. Menen, A., Gowtham, R.: An efficient ransomware detection system. Int. J. Recent Technol. Eng. 28–31 (2019) 23. Narayanan, S., Ganesan, S., Joshi, K., Oates, T., Joshi, A., Finin, T.: Cognitive Techniques for Early Detection of Cybersecurity Events (2018). http://arxiv.org/ abs/1808.00116 24. Ravi, S., Jassi, J., Avdhesh, S., Sharma, R.: Data-mining a mechanism against cyber threats: a review. In: 2016 1st International Conference on Innovation and Challenges in Cyber Security, ICICCS 2016, pp. 45–48 (2016). https://doi.org/10. 1109/ICICCS.2016.7542343 25. Daya, A., Salahuddin, M., Limam, N., Boutaba, R.: A graph-based machine learning approach for bot detection. In: 2019 IFIP/IEEE Symposium on Integrated Network and Service Management, IM 2019, pp. 144–152 (2019) 26. Ullah, R., Zhang, X., Kumar, R., Amiri, N., Alazab, M.: An adaptive multi-layer botnet detection technique using machine learning classifiers. Appl. Sci. 9(11), 2375 (2019). https://doi.org/10.3390/app9112375 27. Le, T., Kim, Y., Kim, H.: Network intrusion detection based on novel feature selection model and various recurrent neural networks. Appl. Sci. 9(7), 1392 (2019). https://doi.org/10.3390/app9071392 28. Zhou, Q.: Dimitrios Pezaros School. Evaluation of Machine Learning Classifiers for Zero-Day Intrusion Detection - An Analysis on CIC-AWS-2018 dataset (2019). https://arxiv.org/abs/1905.03685 29. Khraisat, A., Gondal, I., Vamplew, P., Kamruzzaman, J., Alazab, A.: Hybrid intrusion detection system based on the stacking ensemble of C5 decision tree classifier and one class support vector machine. Electronics 9(1), 173 (2020). https://doi. org/10.3390/electronics9010173 30. Liu, W., Ci, L., Liu, L.: A new method of fuzzy support vector machine algorithm for intrusion detection. Appl. Sci. 10(3), 1065 (2020). https://doi.org/10.3390/ app10031065
78
J. L. Gutierrez-Garcia et al.
31. Gao, M., Ma, L., Liu, H., Zhang, Z., Ning, Z., Xu, J.: Malicious network traffic detection based on deep neural networks and association analysis. Sensors 20, 1–14 (2020). https://doi.org/10.3390/s20051452 32. Gonzalez-Cuautle, D., et al.: Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets. Appl. Sci. 10(3), 794 (2020). https://doi.org/10.3390/app10030794 33. Sarnovsky, M., Paralic, J.: Hierarchical intrusion detection using machine learning and knowledge model. Symmetry 12, 1–14 (2020) 34. Wang, M., Lu, Y., Qin, J.: A dynamic MLP-based DDoS attack detection method using feature selection and feedback. Comput. Secur. 88, 1–14 (2020). https://doi. org/10.1016/j.cose.2019.101645 35. Kumar, S., Rahman, M.: Effects of machine learning approach in flow-based anomaly detection on software-defined networking. Symmetry 12(1), 7 (2019) 36. Hwang, R., Peng, M., Nguyen, V., Chang, Y.: An LSTM-based deep learning approach for classifying malicious traffic at the packet level. Appl. Sci. 9(16), 3414 (2019). https://doi.org/10.3390/app9163414 37. Kwon, H., Kim, Y., Yoon, H., Choi, D.: Random untargeted adversarial example on Deep neural network. Symmetry 10(12), 738 (2018). https://doi.org/10.3390/ sym10120738 38. Anirban, C., Manaar, A., Vishal, D., Anupam, C., Debdeep, M.: Adversarial attacks and defences: a survey. IEEE Access 35365–35381 (2018). https://doi.org/ 10.1109/ACCESS.2018.2836950 39. Ibitoye, O., Abou-Khamis, R., Matrawy, A., Shafi, M.: The Threat of Adversarial Attacks on Machine Learning in Network Security - A Survey (2019). https:// arxiv.org/abs/1911.02621 40. Niyaz, Q., Sun, W., Javaid, A., Alam, M.: A deep learning approach for network intrusion detection system. In: 9th EAI International Conference on Bio-Inspired Information and Communications Technologies, pp. 1–11, May 2016 41. Guo, W., Mu, D., Xu, J., Su, P., Wang, G., Xing, X.: Lemna: explaining deep learning based security applications. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15 October 2018, pp. 364–379 (2018) 42. Nathan, S., Tran, N., Vu, P., Qi, S.: A deep learning approach to network intrusion detection. IEEE Trans. Emerg. Top. Comput. Intell. 2, 41–50 (2018). https://doi. org/10.1109/TETCI.2017.2772792 43. Abbas, S.A., Almhanna, M.S.: Distributed denial of service attacks detection system by machine learning based on dimensionality reduction. J. Phys. Conf. Ser. 1804(1), 012136 (2021). https://doi.org/10.1088/1742-6596/1804/1/012136 44. Gupta, N., Jindal, V., Bedi, P.: LIO-IDS: handling class imbalance using LSTM and improved one-vs-one technique in intrusion detection system. Comput. Netw. 192, 108076 (2021). https://doi.org/10.1016/j.comnet.2021.108076 45. Liu, X., Li, T., Zhang, R., Wu, D., Liu, Y., Yang, Z.: A GAN and Feature SelectionBased Oversampling Technique for Intrusion Detection (2021) 46. Maseer, Z.K., Yusof, R., Bahaman, N., Mostafa, S.A., Foozy, C.F.M.: Benchmarking of machine learning for anomaly based intrusion detection systems in the CICIDS2017 dataset. IEEE Access 9, 22351–22370 (2021). https://doi.org/10. 1109/access.2021.3056614
Head Orientation of Public Speakers: Variation with Emotion, Profession and Age

Yatheendra Pravan Kidambi Murali, Carl Vogel, and Khurshid Ahmad(B)

Trinity College Dublin, The University of Dublin, Dublin, Ireland
{kidambiy,khurshid.ahmad}@tcd.ie, [email protected]

Abstract. The orientation of the head is known to be a conduit of emotions, and to intensify or tone down the emotion expressed by the face. We present a multimodal study of 162 videos comprising speeches of three types of professionals: politicians, CEOs and their spokespersons. We investigate the relationship of the three Euler angles (yaw, pitch, roll) that characterise head orientation with emotions, using two well known facial emotion expression recognition systems. The two systems (Emotient and Affectiva) give similar outputs for the Euler angles. However, the variation of the Euler angles with a given discrete emotion is different for the two systems given the same input. The gender of the person displaying a given emotion plays a key role in distinguishing the output of the two systems; consequently it appears that the correlation of the Euler angles is system dependent as well. We have introduced a combined vector which is the sum of the magnitudes of the three Euler angles per frame.
Keywords: Emotion recognition · Head orientation · Multi-modal communication · Gender

1 Introduction
Emotion recognition (ER) is a complex task that is usually performed in noisy conditions. Facial ER systems analyse moving images using a variety of computer vision algorithms and identify landmarks on the face. ER systems rely largely on statistical machine learning algorithms for recognizing emotions and crucially depend on the availability of training databases. These databases are themselves quasi-random samples and may have racial/gender/age bias. Facial landmarks help determine head pose. A head pose is essentially characterized by its yaw, pitch, and roll angles – respectively describing side-to-side turning, up/down nodding, and sideways tilting of the head. In this paper, we present a case study involving the analysis of 162 semi-spontaneous videos given by politicians, chief executive officers of major enterprises, and spokespersons for the politicians. Our data set comprises 12 nationalities, 33 males and 18 females. We have collected 162 videos, 90 from males and 72 videos from females. We have used the speeches of politicians and
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FICC 2023, LNNS 652, pp. 79–95, 2023. https://doi.org/10.1007/978-3-031-28073-3_6
CEOs; for spokespersons we have used the answers to questions from journalists. We have used two major ER systems, Emotient [3,10] and Affectiva [6,11], that detect emotions and head orientation from videos. We compare the output of the two systems for six 'discrete' emotions and for the three head orientation angles, and examine the correlation between orientation and emotion evidence, also according to various categories of race, gender, and age. We begin with a brief literature review (Sect. 2), followed by the methods used, which form the basis of a pipeline of programs that access video data, pre-process the data frame by frame to ensure that there is only one face in the frame, feed this data into emotion recognition systems, and then use tests of statistical significance to see the similarities and differences in the emotional state and head pose of a given subject as processed by the two emotion recognition systems (Sect. 4). Finally, we present a case study based on the videos of politicians, CEOs, and spokespersons, to see how the two recognition systems perform, and conclude the paper (Sect. 5). Our key contributions are that we have developed a method to understand the difference in estimation of head orientation in different automatic emotion recognition systems, and that we have estimated the relationship between head orientation and the emotions possibly expressed by politicians, CEOs and spokespersons.
2 Motivation and Literature Review
Public engagements of politicians, CEOs, and their spokespersons primarily involve the use of their first language or second language (which is frequently English). The written text forms the basis on which the politicians and others deliver a performance using language and non-verbal cues – facial expressions, voice modulations, and gestures involving hands, body and the head. These cues are used to emphasize a point, attack the opposition, support a friend, or befriend the undecided. It has been argued that hand, body, and head gestures may reinforce discrete emotions expressed verbally [8]. However, the gestural information may deliberately be at variance with the text – something like an 'in-between-the-lines' stratagem – or the emotion felt by the speaker may spontaneously leak (emotion leakage is a term coined some 60 years ago) [5]. Head postures – head bowed, head raised, or no head movement – are equally well used by a speaker to dominate her/his audience, show them physical strength, or indicate respect of social hierarchies [12]. In psychological experiments, subjects were shown stick figures, based on the facial landmarks of real politicians giving a speech, and asked their opinion about the surrogate politicians' conscientiousness and emotional stability. The subjects appear to agree that a pronounced head movement is an indicator of less conscientiousness and a lack of emotional stability [9]. Equally important is the role of body gestures as a whole in emotion expression: some believe that body gestures merely increase the intensity of emotion expressed through voice or face [4], whilst others believe that body gestures are influenced by emotions felt. Specific body gestures accompanied actors' poses for a given emotion: moving the head downwards accompanies the expression of disgust, whilst a head
backwards movement is more present during "elated joy" than any other emotion, and a head bent sideways may accompany anger [13]. The recognition of emotion directly from the movement of major joints in the body, including the head, has shown accuracy of up to 80% [2]. In an analysis [14] of a "3D video database of spontaneous facial expressions in a diverse group of young adults" ([15]: BP4D, comprising 41 subjects expressing 6 different emotions), the authors show that statistically significant means for yaw and roll were close to zero for all the discrete emotions – highest for anger and fear and lowest for sadness. The range of the three angles was significantly larger for all the emotions, ranging between 19.4° for pitch, 12.1° for yaw and 7.68° for roll; the highest values were for anger, joy and fear, followed by surprise and disgust – lowest for sadness. Despite the differences in magnitude and sign, our results are within one standard deviation of the mean of those of Werner et al. [14] (see Table 1). Note that the mean for all Euler angles, across the 6 emotions, is small in magnitude for Werner et al. as it is for our calculation.

Table 1. Comparison of mean Euler angles across emotions computed by [14] on their 41 subjects and the results of our computations on our data base of 162 videos using Emotient. Note that the results of Affectiva are similar. Results are in degrees.

| Emotion |               | Pitch | Yaw   | Roll  |
| Joy     | Werner et al. | 2.8   | 1     | 0.4   |
|         | Our work      | 2.24  | 0.13  | −0.67 |
| Anger   | Werner et al. | 2.9   | 2.7   | 0     |
|         | Our work      | 0.6   | 0.18  | 2.47  |
| Fear    | Werner et al. | 4.1   | 0.3   | −0.8  |
|         | Our work      | 1.4   | 1.6   | 2.45  |
| Disgust | Werner et al. | 4.9   | −0.2  | −0.2  |
|         | Our work      | 0.05  | 0.7   | 1.15  |
There has been interesting work on how to make humanoid robots express emotions and on how humans categorize the "facial" and "head" orientation of a robot in terms of discrete emotions felt. In one study, 132 volunteers saw a robot head placed in 9 different combinations of pitch and yaw: looking up, at gaze level, and down, combined with looking left, gazing straight ahead, and right [7]. These authors confirm the observation discussed above that the mean value of yaw is zero, that "anger, fear and sadness are associated with looking down, whereas happiness, surprise and disgust are associated with looking upward", and that pitch varies with emotions. These studies are important in that they do not rely on the analysis of facial emotion expression, as the "humanoid" robot has no facial muscles. Our literature review suggests the following questions: 1. Is the computation of head orientation independent of the facial emotion recognition systems used?
This question is important as the output of emotion evidence does vary from system to system [1]. 2. What is the relationship between head orientation/movement and emotions? (a) If there is a relationship, based largely on posed/semi-posed videos, then does this relationship also exist in spontaneous videos? (b) Is the relationship between head orientation and emotions independent of the gender and age of the video protagonist? We discuss the methods and systems used to investigate these questions.
3 Method and System Design
Facial emotion recognition systems compute the probability of the evidence of discrete emotions on a frame-by-frame basis; the systems usually also produce the three Euler angles (yaw, pitch, roll). Our videos are of politicians and CEOs giving speeches and of their spokespersons giving press conferences. The videos were edited to have only the politician, CEO or spokesperson in the frame; the videos were then trimmed to maximize the images in the frame. We then processed the edited videos through both Emotient and Affectiva. The emotion evidence is produced by the two systems, and descriptive statistics of all our videos are computed together with the correlation between the outputs of the two systems and how the outputs differ from each other. We compute the variation of the Euler angles with each of the six discrete emotions we study (anger, joy, sadness, surprise, disgust, and fear). The Euler angles are given the same statistical treatment. Typically, authors in the literature talk about small angles and large angles: this is a subjective judgment, and we use a fuzzy logic description with fuzzy boundaries between the value sets of large, medium and small angles. A number of programs were used in the computation of head orientation (and emotions) and in carrying out statistical analysis. The pipeline of such programs is shown in Fig. 1.
Fig. 1. Variation of anger in both the systems
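The paper does not specify the exact fuzzy membership functions; the following is a minimal sketch, assuming simple trapezoidal boundaries (the degree breakpoints are hypothetical), of how a per-frame angle magnitude could be mapped to fuzzy degrees of membership in "small", "medium" and "large" angle sets.

```python
import numpy as np

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises from a to b, flat from b to c, falls from c to d."""
    return np.clip(np.minimum((x - a) / (b - a + 1e-9),
                              (d - x) / (d - c + 1e-9)), 0.0, 1.0)

def angle_memberships(angle_deg):
    """Fuzzy membership of an absolute angle (degrees) in small/medium/large sets."""
    x = abs(angle_deg)
    return {
        "small":  trapezoid(x, -1.0, 0.0, 10.0, 20.0),    # fully 'small' below 10 degrees
        "medium": trapezoid(x, 10.0, 20.0, 30.0, 40.0),   # fully 'medium' between 20 and 30
        "large":  trapezoid(x, 30.0, 40.0, 180.0, 181.0)  # fully 'large' above 40 degrees
    }

print(angle_memberships(15.0))  # 15 degrees is partly 'small' and partly 'medium' here
```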
In analyzing head position and movement, we consider the yaw, pitch and roll measurements by both systems for each frame. In some of our analyses we consider these measurements directly. In others we look at derived measurements: change in yaw, pitch and roll from one frame to the next, aggregations of yaw, pitch and roll, and change in aggregations from one frame to the next. There are two sorts of aggregations. The first (see (1)) records the head pose angle sum (PAS) for each frame i as the sum of the magnitudes of each of the three angular measurements; absolute values are used, so that wide angles in one direction are not erased by wide angles with an opposite sign on a distinct rotation axis.

$\mathit{PAS}_i = |\mathit{yaw}_i| + |\mathit{pitch}_i| + |\mathit{roll}_i|$   (1)

A related value records the angular velocity of each of yaw, pitch and roll for each system – effectively the change in yaw, pitch and roll from one frame to the next. Thus, we also consider the sum of magnitudes of those values.

$\mathit{PAS.df}_i = |\mathit{yaw.velocity}_i| + |\mathit{pitch.velocity}_i| + |\mathit{roll.velocity}_i|$   (2)

The second aggregation understands the values of yaw, pitch and roll at each frame i as a vector of those values, YPR (3), where each component is normalized with respect to the minimum and maximum angle among all of the raw values for yaw, pitch and roll.

$\mathit{YPR}_i = \langle \mathit{yaw}_i, \mathit{pitch}_i, \mathit{roll}_i \rangle$   (3)

The measurement addressed is the change in that vector, computed as 1 minus the cosine similarity between YPR at one frame i and its preceding frame i − 1 – see (4).

$\mathit{YPR.df}_i = 1 - \cos(\mathit{YPR}_i, \mathit{YPR}_{i-1})$   (4)

Cosine similarity, in turn, is defined through the cross product of the normalized vectors, as in (5).¹

$\cos(x, y) = \dfrac{\mathrm{crossprod}(x, y)}{(\mathrm{crossprod}(x) * \mathrm{crossprod}(y))}$   (5)

¹ In (5), it is understood that the arity-one crossprod is identical to the arity-two crossprod of a vector with itself.
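As a concrete illustration of the aggregations in (1)–(5), the following is a minimal NumPy sketch (not the authors' code; the frame-indexed arrays `yaw`, `pitch`, `roll` are assumed inputs) computing the head pose angle sum and the cosine-based change in the yaw–pitch–roll vector between successive frames. Note that this sketch uses the standard cosine similarity, with the square root in the denominator, which may differ slightly from the crossprod form printed in (5).

```python
import numpy as np

def pose_angle_sum(yaw, pitch, roll):
    """Eq. (1): per-frame sum of absolute yaw, pitch and roll."""
    return np.abs(yaw) + np.abs(pitch) + np.abs(roll)

def ypr_change(yaw, pitch, roll):
    """Eq. (4): 1 - cosine similarity between successive yaw-pitch-roll vectors,
    each component min-max normalized over all raw angle values."""
    raw = np.stack([yaw, pitch, roll], axis=1).astype(float)
    lo, hi = raw.min(), raw.max()
    ypr = (raw - lo) / (hi - lo)                      # normalized YPR vector per frame
    num = np.sum(ypr[1:] * ypr[:-1], axis=1)          # dot product of frame i with frame i-1
    den = np.linalg.norm(ypr[1:], axis=1) * np.linalg.norm(ypr[:-1], axis=1)
    return 1.0 - num / den

yaw = np.array([1.0, 2.0, -3.0]); pitch = np.array([0.5, 0.4, 0.2]); roll = np.array([-1.0, -1.2, 0.3])
print(pose_angle_sum(yaw, pitch, roll))   # PAS per frame
print(ypr_change(yaw, pitch, roll))       # YPR.df for frames 2..n
```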
4 Data and Results

We describe the data set we have used, followed by the results organised according to the three questions outlined in the literature review section above.

4.1 The Data Used

The data used are described in Table 2 and Table 3. The two systems will generally produce emotion evidence for frames on the same timestamp. Sometimes, one or both of the frames does not generate any emotion evidence and returns blank values. We have only considered the values where both the systems provided evidence for the same timestamp.
Table 2. Data demographic profile: by nationality and gender, the count of individuals, videos and frames within videos recognized by Affectiva and Emotient

| Nationality    | Individuals F | M  | Videos F | M  | Affectiva frames F | M      | Emotient frames F | M      | Common frames F | M      |
| China          | 3             | 6  | 14       | 16 | 46548              | 100659 | 47263             | 102581 | 46413           | 100205 |
| France         | 1             | 0  | 2        | 0  | 7995               | 0      | 7364              | 0      | 7354            | 0      |
| Germany        | 1             | 1  | 5        | 5  | 19249              | 21353  | 19267             | 21529  | 19217           | 21228  |
| India          | 1             | 7  | 2        | 18 | 8707               | 91061  | 8757              | 103965 | 8094            | 77169  |
| Ireland        | 0             | 1  | 0        | 2  | 0                  | 19283  | 0                 | 19228  | 0               | 19200  |
| Italy          | 0             | 1  | 0        | 2  | 0                  | 8356   | 0                 | 8370   | 0               | 8313   |
| Japan          | 0             | 1  | 0        | 1  | 0                  | 20229  | 0                 | 20213  | 0               | 20206  |
| New Zealand    | 1             | 0  | 5        | 0  | 25128              | 0      | 25750             | 0      | 25002           | 0      |
| Pakistan       | 0             | 2  | 0        | 6  | 0                  | 72585  | 0                 | 59553  | 0               | 71887  |
| South Korea    | 0             | 2  | 0        | 4  | 0                  | 35184  | 0                 | 35237  | 0               | 35172  |
| United Kingdom | 1             | 1  | 5        | 5  | 46182              | 47068  | 43044             | 52387  | 20931           | 46470  |
| United States  | 10            | 11 | 39       | 31 | 239610             | 187580 | 248407            | 186610 | 238090          | 182361 |
| Total          | 18            | 33 | 72       | 90 | 393419             | 603358 | 399852            | 620616 | 365101          | 593139 |
Table 3. Age ranges and counts of individuals in each occupation, by nationality

| Nationality    | Age range | CEO | Politician | Spokesperson |
| China          | 49–73     | 0   | 6          | 3            |
| France         | 49–49     | 1   | 0          | 0            |
| Germany        | 62–67     | 0   | 1          | 1            |
| India          | 28–72     | 3   | 5          | 0            |
| Ireland        | 51–65     | 0   | 0          | 1            |
| Italy          | 46–46     | 1   | 0          | 0            |
| Japan          | 73–73     | 0   | 1          | 0            |
| New Zealand    | 41–41     | 0   | 1          | 0            |
| Pakistan       | 46–69     | 0   | 1          | 1            |
| South Korea    | 69–71     | 0   | 2          | 0            |
| United Kingdom | 51–57     | 0   | 1          | 1            |
| United States  | 64–91     | 5   | 13         | 3            |
| Total          | 28–91     | 10  | 31         | 10           |

4.2 Preliminary Data Profiling
First we consider agreement between Affectiva and Emotient on the basic underlying quantities of yaw, pitch and roll measurements (see plots of each of these in Fig. 2). The Pearson correlation coefficient for yaw is 0.8567581 (p < 2.2e−16); for pitch, 0.6941681 (p < 2.2e−16); for roll, 0.8164813 (p < 2.2e−16).² This seems to us to provide sufficient evidence to expect that there may be significant agreement between the systems in other measured quantities.

² Our inspection of histograms for each of the angle measurements suggested that a Pearson correlation would be reasonable. In many other cases, we use non-parametric tests and correlation coefficients.
Fig. 2. Comparison of the output of euler angles computed by affectiva versus the angles computed by emotient
The scatter plots (Fig. 2) appear to have points distributed around a straight line, and we have linearly regressed the angles observed (per frame for all our videos) generated by Emotient as a function of the angles observed by Affectiva:

$\mathrm{Yaw}_{Emotient} = 0.58 \cdot \mathrm{Yaw}_{Affectiva} + 0.72$
$\mathrm{Pitch}_{Emotient} = 0.44 \cdot \mathrm{Pitch}_{Affectiva} - 1.36$
$\mathrm{Roll}_{Emotient} = 0.54 \cdot \mathrm{Roll}_{Affectiva} + 0.76$

This is the basis of our observation that the angles generated by the two systems are in a degree of agreement. We are investigating these relationships further. On the other hand, the systems show little agreement with respect to the assessments made of the most probable emotion for each frame (Cohen's κ = 0.123, p = 0). Table 4 presents the confusion matrix of system classifications. Therefore, and because the data is collected "in the wild" rather than with 'gold standard' labels for each (or any) frame, we analyze the interacting effects with respect to each system's own judgment of the most likely emotion.
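A minimal sketch of these per-angle and per-label comparisons, assuming two time-aligned pandas DataFrames `aff` and `emo` with columns `yaw`, `pitch`, `roll` and a per-frame `probable_emotion` column (hypothetical names; this is not the authors' pipeline):

```python
from scipy.stats import pearsonr, linregress
from sklearn.metrics import cohen_kappa_score, confusion_matrix

for angle in ["yaw", "pitch", "roll"]:
    r, p = pearsonr(aff[angle], emo[angle])       # agreement on the raw angle
    fit = linregress(aff[angle], emo[angle])      # Emotient angle ~ Affectiva angle
    print(angle, round(r, 3), p, round(fit.slope, 2), round(fit.intercept, 2))

# Agreement on the per-frame most probable emotion label
kappa = cohen_kappa_score(aff["probable_emotion"], emo["probable_emotion"])
cm = confusion_matrix(aff["probable_emotion"], emo["probable_emotion"])
print("Cohen's kappa:", round(kappa, 3))
```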
Table 4. Cross classification of most probable emotion for each frame according to Affectiva (rows) and Emotient (columns): a 7 × 7 matrix of frame counts over Anger, Contempt, Disgust, Fear, Joy, Sadness and Surprise. The counts along the diagonal indicate agreements.

4.3 Results and Discussion
Is the Computation of Head Orientation Independent of the Facial Emotion Recognition Systems Used? There is a similarity in the variation of Euler angles in the videos of all our subjects – see, for example, the variation for two well known politicians of the last 10 or so years, one in the United Kingdom and the other in the USA (Fig. 3). We take the highest evidence of anger (HEA) in both the systems at the same timestamp and then note the variation of evidence around that time.
Fig. 3. Variation of anger intensity around HEA based on video of US politician (left) and UK (right)
We computed the correlation between the results for each of the angles produced by Affectiva and Emotient. The results showed that angles yaw and roll showed highest correlation (87% and 82% respectively, p < 0.05); the lowest correlation was for pitch – 69% (p < 0.05). The two systems show a degree of similarity between the computation of the three Euler angles. Figure 4 shows comparison of the variation in the estimated Euler angles by both systems for the same input video in a frame where the estimated anger intensity values are the highest. The blue trend line describes the variation of Euler angles with respect
to media time in Emotient, while the orange trend line represents the variation of Euler angles with respect to media time in Affectiva. The blue vertical line represents the highest evidence of anger where both the systems had the highest anger intensity value.
Fig. 4. Distribution of euler angles in both systems near highest evidence of anger (vertical blue line), for the time course of anger: the blue trend line describes the variation of euler angles with respect to media time in emotient, while the orange trend line represents the variation of euler angles with respect to media time in affectiva
We tested this hypothesis by performing a rank correlation test for the time-aligned outputs of both systems. If one computes the head pose angle sum (PAS) as the sum of absolute values of yaw, pitch and roll for any timestep, then one has a proxy measure of head orientation. It is possible to consider this measurement according to each of the systems plotted against each other (see Fig. 5): testing the correlation, Spearman's Rho is 0.62 (p < 2.2e−16); testing the difference with a directed, paired Wilcox test, one finds significantly greater values for Affectiva than Emotient (V = 4.0994e+11, p < 2.2e−16). Considering the YPR vectors and their frame-by-frame differences for each system, Spearman's Rho is 0.33 (p < 2.2e−16). Thus, the values computed are not completely independent, but have significant differences as well. Table 5 presents the results of the analysis of correspondence and difference between Affectiva and Emotient in measuring each of yaw, pitch and roll, for both genders. The Spearman correlations each indicate significant positive correlations between the two systems (with coefficients between 0.83 and 0.89, p < 0.001), yet significant difference is also identified through the Kruskal tests (p < 0.001). The statistical test results were computed using Python libraries.
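A sketch of these comparisons with SciPy, assuming per-frame NumPy arrays `pas_aff` and `pas_emo` (the PAS values from Affectiva and Emotient on the common, time-aligned frames) and an array `labels` of per-frame most-probable emotion labels; these names are placeholders rather than the authors' variables:

```python
import numpy as np
from scipy.stats import spearmanr, wilcoxon, kruskal

rho, p = spearmanr(pas_aff, pas_emo)               # monotonic agreement between systems
print("Spearman rho:", round(rho, 2), "p =", p)

# Directed, paired test of whether Affectiva PAS tends to exceed Emotient PAS
stat, p = wilcoxon(pas_aff, pas_emo, alternative="greater")
print("Wilcoxon V:", stat, "p =", p)

# Kruskal-Wallis: does PAS differ across the most-probable emotion labels?
groups = [pas_aff[labels == e] for e in np.unique(labels)]
h, p = kruskal(*groups)
print("Kruskal-Wallis H:", round(h, 1), "p =", p)
```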
Fig. 5. Head pose angle sums, computed as the sum of absolute values of yaw, pitch and roll at each timestep, for both systems, Emotient (E) and Affectiva (A), are plotted against each other
Table 5. Relationships between Affectiva and Emotient values for yaw, pitch and roll, for females and males

|        |             | Yaw Spearman | Yaw Kruskal | Pitch Spearman | Pitch Kruskal | Roll Spearman | Roll Kruskal |
| Female | Coefficient | 0.89         | 180.9       | 0.71           | 34465         | 0.83          | 1569.2       |
|        | pValue      | 0.00         | 0.00        | 0.00           | 0.00          | 0.00          | 0.00         |
| Male   | Coefficient | 0.87         | 2988.4      | 0.67           | 122116        | 0.83          | 6805.7       |
|        | pValue      | 0.00         | 0.00        | 0.00           | 0.00          | 0.00          | 0.00         |

Table 6 presents the results of analyzing the correspondence and difference between Affectiva and Emotient in measuring each of yaw, pitch and roll for each of the three occupations in the sample (CEO, Politician, Spokesperson). The systems demonstrate significant positive correlation (with Pearson coefficients between 0.82 and 0.89, p < 0.001) yet also significant differences (through Kruskal tests, p < 0.001).

What Is the Relationship Between Head Orientation/Movement and Emotions in Spontaneous Videos? Table 7 represents the results of regression analysis where emotion is considered as a function of Euler angles. This allows us to understand, for example, the effect of pitch degrees (movement of the head on the vertical axis) in relation to a specific emotion (joy, for instance)
Fig. 6. Changes in yaw-pitch-roll vectors (see (4)), at each timestep, for both systems, Emotient (E) and Affectiva (A), plotted against each other
and correlate with the results from the literature. In the table, c marks the estimated intercept and m the slope of the regression line (as in the standard form in (6)). The extremely low r² values indicate the poorness of fit to the data of the corresponding linear models.

$y = c + mx + \epsilon$   (6)
Within the two systems, emotion labels are derived values, based on values for head and facial action movements in posed data sets, where the emotion labels were assigned on the basis of the assumption that actors performed emotions as directed, and were able to perform those emotions authentically.³ Thus, even for the training data, head orientation will have supplied only part of the information used to infer the given label. This is shown in the mean value of head pose angle sums varying with emotion label, for both of the systems, and for the emotion labels supplied by each system. The PAS values within each emotion are greater as measured by Affectiva for Affectiva's most likely emotion label than as measured by Emotient for Emotient's most likely emotion label (Wilcox V = 21, p = 0.01563). For both Affectiva (Kruskal-Wallis χ² = 133523, df = 5, p < 2.2e−16) and Emotient (Kruskal-Wallis χ² = 4078, df = 5, p < 2.2e−16) there are significant differences in the PAS values as a function of the most probable emotion (see Table 8).

³ It is reasonable to suppose that performed emotions are exaggerated, even for "method" actors.
Table 6. Relationships between Affectiva and Emotient values for yaw, pitch and roll, by occupation

|              |             | Yaw Spearman | Yaw Kruskal | Pitch Spearman | Pitch Kruskal | Roll Spearman | Roll Kruskal |
| CEO          | Coefficient | 0.89         | 4963        | 0.54           | 23687         | 0.86          | 127.73       |
|              | pValue      | 0.00         | 0.00        | 0.00           | 0.00          | 0.00          | 0.00         |
| Politician   | Coefficient | 0.87         | 1338        | 0.67           | 92196         | 0.83          | 6982         |
|              | pValue      | 0.00         | 0.00        | 0.00           | 0.00          | 0.00          | 0.00         |
| Spokesperson | Coefficient | 0.89         | 7013        | 0.78           | 34270         | 0.82          | 3012         |
|              | pValue      | 0.00         | 0.00        | 0.00           | 0.00          | 0.00          | 0.00         |
Table 7. Regression analysis: "emotion" evidence as a function of Euler angles

| Panel |           | Emotient Yaw | Emotient Pitch | Emotient Roll | Affectiva Yaw | Affectiva Pitch | Affectiva Roll |
| Anger | c         | 0.1303       | 0.1339         | 0.1248        | 5.8073        | 5.7087          | 5.7847         |
|       | m         | 0.0004       | −0.0033        | 0.0073        | 0.0237        | 0.0133          | 0.1330         |
|       | r-squared | 0.000        | 0.007          | 0.026         | 0.0000        | 0.0000          | 0.002          |
| Joy   | c         | 0.2273       | 0.2176         | 0.2362        | 2.6300        | 2.1704          | 2.5911         |
|       | m         | −0.0009      | 0.0090         | −0.0119       | 0.0364        | 0.0757          | −0.0587        |
|       | r-squared | 0.001        | 0.0220         | 0.029         | 0.001         | 0.002           | 0.001          |
Table 8. Mean of head pose angle sum for each system, by most likely emotion, according to each label source

| System    | Label Source | Anger  | Contempt | Disgust | Fear   | Joy    | Sadness | Surprise |
| Affectiva | Affectiva    | 33.510 | 29.463   | 23.165  | 18.127 | 22.441 | 21.291  | 21.807   |
| Affectiva | Emotient     | 26.332 | 28.211   | 24.100  | 21.306 | 25.068 | 25.283  | 24.460   |
| Emotient  | Affectiva    | 18.327 | 17.099   | 16.341  | 14.126 | 15.006 | 14.508  | 15.554   |
| Emotient  | Emotient     | 16.363 | 17.778   | 16.482  | 15.127 | 15.814 | 16.766  | 16.463   |
Similarly, in relation to change in each angle (yaw, pitch and roll) from one timestep to the next, one may examine the total magnitude of change (the sum of absolute values of change at each timestep – recall (2)) across the three angles as measured by each of the systems (see Table 9). For both Affectiva (Kruskal-Wallis χ² = 1779.9, df = 6, p < 2.2e−16) and Emotient (Kruskal-Wallis χ² = 5035.5, df = 6, p < 2.2e−16) there are significant differences in this aggregated measure of angular movement depending on the emotion category determined most likely by the corresponding system.
Table 9. Mean of head pose angle change magnitudes (at each timestep, the sum of absolute values of change in yaw, pitch and roll; therefore, the mean of that aggregate angle change magnitude) in relation to each emotion label deemed most probable for each system

| System    | Anger | Contempt | Disgust | Fear  | Joy   | Sadness | Surprise |
| Affectiva | 0.098 | 0.103    | 0.087   | 0.082 | 0.084 | 0.092   | 0.084    |
| Emotient  | 0.048 | 0.043    | 0.045   | 0.042 | 0.043 | 0.048   | 0.052    |
Table 10 indicates the mean values of change in position between successive frames (the complement of cosine similarity, as in (4)) for the most probable emotion according to each system. Figure 6 depicts the change in YPR values derived from Emotient measurements plotted against the change in YPR values derived from Affectiva measurements. For both Affectiva (Kruskal-Wallis χ² = 3473.3, df = 6, p < 2.2e−16) and Emotient (Kruskal-Wallis χ² = 6223.8, df = 6, p < 2.2e−16), there are significant differences in the YPR.df values according to the most probable emotion selected by each system.

Table 10. Means of change in YPR vectors over successive frames, according to the most probable emotion for each system

| System    | Anger   | Contempt | Disgust | Fear    | Joy     | Sadness | Surprise |
| Affectiva | 0.00043 | 0.00042  | 0.00023 | 0.00017 | 0.00021 | 0.00021 | 0.00019  |
| Emotient  | 0.00042 | 0.00039  | 0.00032 | 0.00028 | 0.00032 | 0.00046 | 0.00050  |
Thus, there is evidence that both head pose and motion vary with the most likely emotion label accorded to each timestep, for both of the systems considered, even though the systems differ with respect to which emotion labels are implicated. Is the Relationship Between Head Orientation and Emotions Independent of the Gender and Age of the Videoed Person? There appear to be significant interactions among most likely emotion label, gender and age for each of the systems on the corresponding posed angle sum (PAS) values. Using the PAS measure as a response value and a linear model that predicts this on the basis of assigned emotion, gender and age, for Emotient, the corresponding linear model has an adjusted R-squared value of 0.1151 (F-statistic: 4616 on 27 and 958212 DF, p < 2.2e − 16) and all interactions are significant (p < 0.001). The matched linear model that predicts PAS for Affectiva assuming interactions of Affectiva-assigned emotion labels, gender and age has an adjusted R-squared value of 0.1907 (F-statistic: 8366 on 27 and 958212 DF, p < 2.2e−16) and reveals all interactions to be significant (p < 0.001). From the adjusted R-squared values, it seems safe to conclude that both models omit important factors that
explain the variation in PAS values; nonetheless, they both reveal significant interactions of gender and age on the projected emotion. Modelling YPR.df values, frame-by-frame change in yaw-pitch-roll vectors, similarly also produces significant interactions among gender, system-calculated emotion and age, for each system, but with much lower R-squared values. The effect interactions are all significant, with p < 0.001. For Emotient, adjusted R² is 0.002957 (F-statistic: 106.3 on 27 and 958211 DF, p < 2.2e−16); for Affectiva, adjusted R² is 0.005431 (F-statistic: 194.8 on 27 and 958211 DF, p < 2.2e−16). More so than for the PAS models, the linear models predicting YPR.df lack explanatory variables.⁴ Figure 7 illustrates the nature of these interactions: for Affectiva, females in the dataset studied have pose angle magnitude sums that are greater than or equal to those of males for each emotion label except joy, fear and surprise; for Emotient, females have PAS values greater than or equal to those of males for all emotion labels except surprise. Table 11 presents the mean PAS values underlying these interaction plots. Thus, Affectiva and Emotient both reach qualitatively different judgements between genders for measurements of the confluence of angular position and surprise.

Table 11. Means of posed angle sums for each system and its most probable emotion calculation, for each gender

| System    | Gender | Anger  | Contempt | Disgust | Fear   | Joy    | Sadness | Surprise |
| Affectiva | Female | 38.168 | 32.579   | 23.830  | 15.521 | 20.458 | 21.724  | 20.957   |
| Affectiva | Male   | 31.781 | 27.737   | 22.878  | 24.792 | 25.039 | 21.128  | 22.758   |
| Emotient  | Female | 19.461 | 18.899   | 17.256  | 15.293 | 17.305 | 17.816  | 15.981   |
| Emotient  | Male   | 15.979 | 17.106   | 16.339  | 14.939 | 14.191 | 15.269  | 17.127   |
Figure 8 and Table 12 present the interaction between gender and system-calculated most probable emotion on the change in YPR vectors.

Table 12. Means of change in YPR vectors for each system and its most probable emotion calculation, for each gender

| System    | Gender | Anger   | Contempt | Disgust | Fear    | Joy     | Sadness | Surprise |
| Affectiva | Female | 0.00074 | 0.00039  | 0.00021 | 0.00015 | 0.00019 | 0.00014 | 0.00018  |
| Affectiva | Male   | 0.00031 | 0.00044  | 0.00024 | 0.00023 | 0.00024 | 0.00023 | 0.00019  |
| Emotient  | Female | 0.00082 | 0.00050  | 0.00058 | 0.00030 | 0.00038 | 0.00054 | 0.00058  |
| Emotient  | Male   | 0.00037 | 0.00031  | 0.00027 | 0.00027 | 0.00025 | 0.00034 | 0.00039  |

⁴ Our task here is not to identify complete theories of variation in PAS or YPR.df, but rather to assess the interaction of particular factors on head position and motion.
Fig. 7. Interaction between system-assigned probable emotions and gender on pose angle magnitude sums. On the left, the values for Affectiva are shown (y = Affectiva) and on the right, values for Emotient are shown (x = Emotient).

Fig. 8. Interaction between system-assigned probable emotions and gender on change in YPR vectors. On the left, the values for Affectiva are shown (y = Affectiva) and on the right, values for Emotient are shown (x = Emotient).
For Affectiva, males show greater change in position vectors for all emotions but anger, while for Emotient, females show greater change in position vectors for all emotions. With focus upon age, it may be noted that there is a small (but significant) negative Spearman correlation between age and posed angle sum (ρ = −0.16483049; p < 2.2e−16) for Affectiva, and a small negative Spearman correlation of greater magnitude between age and posed angle sum for Emotient (ρ = −0.2899963; p < 2.2e−16). Table 13 illustrates the mean values of PAS for each system according to ordinal age categories derived from quartiles of the age values.
Table 13. Means of posed angle sums for each system, according to ordinal age categories derived from age quartiles

| System    | [28, 50] | (50, 61] | (61, 68] | (68, 91] |
| Affectiva | 24.449   | 30.573   | 25.829   | 18.780   |
| Emotient  | 18.184   | 17.683   | 16.239   | 12.634   |
These observations support the generalization that age is accompanied by smaller angles in head position, or more colloquially: with age, extreme poses are less likely.⁵ Table 14 shows the means of YPR.df values by age group. Both systems identify least movement for the greatest age category.

Table 14. Means of YPR.df for each system, according to ordinal age categories derived from age quartiles

| System    | [28, 50] | (50, 61] | (61, 68] | (68, 91] |
| Affectiva | 0.00022  | 0.00034  | 0.00031  | 0.00017  |
| Emotient  | 0.00030  | 0.00043  | 0.00046  | 0.00025  |
⁵ An alternative generalization is that with age, politicians, CEOs and spokespeople are more likely to focus more directly on the camera recording the utterances. Here, and independently of the age variable, data recording whether teleprompters were activated or not would be useful, inasmuch as gazing at a teleprompter to access speech content would thwart wide angular positions. Implicit here is an assumption that for all of the recordings, a single "face-on" camera was available and provided a primary focal point for the speaker. For speeches recorded with multiple-camera arrangements, it is natural that from the perspective of the video stream available, attention to other cameras would appear as angled positions.

5 Conclusions

We have demonstrated that although differences in critical measures constructed by Affectiva and Emotient are significant, within each, comparable patterns of interaction are visible in the judgments of emotion and aggregated quantities associated with head position and change in head position. The greatest agreement between systems on the interaction of gender and angular position (using the posed angle sum aggregation) with emotion labels is for the emotion surprise, where for both systems females have smaller angular magnitude summation values than males, against a contrasting trend of females having larger angular magnitude summation values than males for other emotion labels. It has been demonstrated that the two systems lead to different judgments of effects on the measurements made, depending on the measurement examined (e.g. posed angle sum vs YPR.df). The systems also differ in their calculation of probable emotions, but using each system's own determination of the likely emotion, the effects on measures of head position and movement vary.
Acknowledgments. We are grateful to the GEstures and Head Movement (GEHM) research network (Independent Research Fund Denmark grant 9055-00004B). We thank Subishi Chemmarathil for helpful efforts in data collection and curation.
References
1. Ahmad, K., Wang, S., Vogel, C., Jain, P., O'Neill, O., Sufi, B.H.: Comparing the performance of facial emotion recognition systems on real-life videos: gender, ethnicity and age. In: Arai, K. (ed.) FTC 2021. LNNS, vol. 358, pp. 193–210. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-89906-6_14
2. Ahmed, F., Bari, A.S.M.H., Gavrilova, M.L.: Emotion recognition from body movement. IEEE Access 8, 11761–11781 (2019)
3. Bartlett, M.S., Littlewort-Ford, G., Movellan, J., Fasel, I., Frank, M.: Automated facial action coding system, 27 December 2016. US Patent 9,530,048
4. Ekman, P., et al.: Universals and cultural differences in the judgments of facial expressions of emotion. J. Pers. Soc. Psychol. 53(4), 712 (1987)
5. Ekman, P., Oster, H.: Facial expressions of emotion. Annu. Rev. Psychol. 30(1), 527–554 (1979)
6. El Kaliouby, R.: Mind-reading machines: automated inference of complex mental states. Ph.D. thesis, The Computer Laboratory, University of Cambridge, 2005. Technical Report no. UCAM-CL-TR-636 (2005)
7. Johnson, D.O., Cuijpers, R.H.: Investigating the effect of a humanoid robot's head position on imitating human emotions. Int. J. Soc. Robot. 11(1), 65–74 (2019)
8. Keltner, D., Sauter, D., Tracy, J., Cowen, A.: Emotional expression: advances in basic emotion theory. J. Nonverbal Behav. 43(2), 133–160 (2019)
9. Koppensteiner, M., Grammer, K.: Motion patterns in political speech and their influence on personality ratings. J. Res. Pers. 44(3), 374–379 (2010)
10. Littlewort, G., et al.: The computer expression recognition toolbox (CERT). In: 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 298–305. IEEE (2011)
11. McDuff, D., Mahmoud, A., Mavadati, M., Amr, M., Turcot, J., Kaliouby, R.: AFFDEX SDK: a cross-platform real-time multi-face expression recognition toolkit. In: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 3723–3726 (2016)
12. Toscano, H., Schubert, T.W., Giessner, S.R.: Eye gaze and head posture jointly influence judgments of dominance, physical strength, and anger. J. Nonverbal Behav. 42(3), 285–309 (2018)
13. Wallbott, H.G.: Bodily expression of emotion. Eur. J. Soc. Psychol. 28(6), 879–896 (1998)
14. Werner, P., Al-Hamadi, A., Limbrecht-Ecklundt, K., Walter, S., Traue, H.C.: Head movements and postures as pain behavior. PLoS ONE 13(2), e0192767 (2018)
15. Zhang, X., et al.: BP4D-spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image Vis. Comput. 32(10), 692–706 (2014)
Using Machine Learning to Identify Top Antecedents Affecting Crime in US Communities

Kamil Samara(B)

University of Wisconsin-Parkside, Kenosha, USA
[email protected]

Abstract. Crime has always been one of the main concerns for countries. In recent years, with the development of data collection and analysis techniques, a massive number of data-related studies have been performed to analyze crime data. Studying indirect features is an important yet challenging task. In this work we use machine learning (ML) techniques to try to identify the top variables affecting crime rates in different US communities. The data used in this work was collected from the Bureau of the Census and the Bureau of Justice Statistics. Out of the 125 variables collected in this data, we try to identify the top factors that correlate with higher crime rates, either in a positive or a negative way. The analysis in this paper was done using the Lasso regression technique provided in the Python library Scikit-learn.

Keywords: Machine learning · Lasso regression · Crime
1 Introduction

Crime, as a socioeconomic complication, has shown multifaceted associations with socioeconomic and environmental aspects. Trying to recognize patterns and connections between crime and these factors is vital to understanding the root causes of criminal activities. By detecting the source causes, legislators can implement solutions for those source causes, eventually preventing most sources of crime [1]. In the age of information technology, crime statistics are recorded in databases for study and analysis. Manual analysis is impractical due to the vast size of the stored data. The suitable solution here is to use data science and machine learning techniques to analyse the data. Using the descriptive and predictive powers of those solutions, officials will be able to minimize crime. The descriptive and predictive powers of machine learning techniques can give longstanding crime prevention solutions. This predictive analysis could be done on two levels. First, predicting when and where crimes will happen; this type of prediction is hard to implement because predictions are highly sensitive to the complex distributions of crimes in time and space. Second, focusing predictions on identifying the correlations of crimes with socioeconomic and environmental aspects [2].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FICC 2023, LNNS 652, pp. 96–101, 2023. https://doi.org/10.1007/978-3-031-28073-3_7
Lately, machine learning methods have grown in popularity. Among the most popular approaches are Bayesian models, random forests, K-Nearest Neighbors (KNN), neural networks, and support vector machines (SVM) [3]. As a step toward crime prediction using machine learning techniques, the work proposed in this paper uses the Lasso regression technique to predict the top socioeconomic and environmental factors related to crime rates in US cities. The study was performed using data collected from the Bureau of the Census and the Bureau of Justice Statistics. The remainder of the paper is organized as follows: Sect. 2 is related work, Sect. 3 presents the work done in this study, and Sect. 4 concludes the work.

2 Related Work

Crime is a global problem, which has motivated many researchers to apply machine learning techniques to perform predictive analytics in an effort to detect crime factors. The performed studies range in complexity depending on the volume of the datasets used and the number of variables collected. A common crime prediction analysis focuses on temporal data. A common reason behind this emphasis is that crime data sets contain data collected over many years. An example of such analysis is the work of Linning in [4]. Linning studied the variation of crime throughout the year to detect a pattern over seasons. The main observation was that crime peaks in the hot summer season as compared to the cold winter season. In a similar study [5], the authors examined the crime data of two major US cities and compared the statistical investigation of the crimes in these cities. The main goal of the study was to use agent-based crime environment simulation to identify crime hotspots. In [6], Nguyen and his team used data from the Portland Police Bureau (PPB), augmented with census data from other public sources, to predict crime category in the city of Portland using Support Vector Machines (SVM), Random Forests, Gradient Boosting Machines, and Neural Networks. A unique approach to classifying crime from crime reports into many categories using textual analysis and classification was taken in [7]. The authors used five classification methods and concluded that Support Vector Machines (SVM) performed better than the other methods. The researchers in [8] used data extracted from neighborhoodscout.com, from the University of California-Irvine, for the state of Mississippi, to predict crime patterns using additive linear regression. Graph-based techniques were used in [9] to mine datasets for correlation. The objective was to identify top and bottom correlative crime patterns. In their final remarks, the authors conclude that the approach successfully discovers both positive and negative correlative relations among crime events and spatial factors and is computationally effective.
3 Proposed Work

The scope of this work is to analyze crime data in an effort to recognize the top factors affecting crime. The features considered in this work are socio-economic factors like race, the number of people living in the same house, and mean household income. Python was the programming language of choice in this work. To perform the regression part, the Lasso model in the Scikit-learn library was used. Scikit-learn is a free software machine learning library for the Python programming language.

3.1 Dataset

The dataset used in this paper is the "Communities and Crime Unnormalized Data Set" available at the University of California Irvine (UCI) Machine Learning Repository. This data set's main focus is communities in the United States, and it was combined from the following sources: the 1995 US FBI Uniform Crime Report, the 1990 United States Census, and the 1990 United States LEMAS (Law Enforcement Management and Administrative Statistics) Survey. In July 2009, this data set was presented to the UCI Machine Learning Repository [10]. The data set includes 2215 total examples and 125 features for different communities across all states. Features contain data blended from a diverse set of crime-related sources, ranging from the number of vacant households, to city density and percent of people foreign born, to average household income. Also included are measures of crimes considered violent, which are murder, rape, robbery, and assault. Only features that had a plausible connection to crime were included, so unrelated features were excluded [10].

3.2 Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is part of the linear regression family that utilizes shrinkage, where data values are contracted toward a central value, such as the average. The Lasso technique promotes models with fewer parameters, and is well suited to models exhibiting high levels of multicollinearity [11]. Lasso regression performs L1 regularization. As shown in Eq. (1), L1 regularization works by enforcing a penalty equivalent to the absolute value of the magnitude of the coefficients. L1 regularization encourages models with few coefficients. This can be achieved by reducing some coefficients to zero so that they are removed from the model. L1 regularization helps produce simpler models, since larger penalties result in coefficient values close to zero. On the other hand, Ridge regression (i.e., L2 regularization) doesn't result in removal of coefficients or sparse models. This makes L1 regularization far easier to interpret than L2 regularization [11].

$RSS_{LASSO}(w, b) = \sum_{i=1}^{N} \big(y_i - (w \cdot x_i + b)\big)^2 + \alpha \sum_{j=1}^{p} |w_j|$   (1)
where:
$y_i$: target value
$w \cdot x_i + b$: predicted value
$\alpha$: controls the amount of L1 regularization (default = 1.0)

3.3 Feature Normalization

Before applying the Lasso regression on the dataset, a MinMax scaling of the features was done. It is crucial in several machine learning techniques that all features are on the same scale (e.g. faster convergence in learning, more uniform or 'fair' influence for all weights) [12]. For each feature $x_i$: compute the minimum value $x_i^{MIN}$ and the maximum value $x_i^{MAX}$ achieved across all instances in the training set, and transform a given feature value $x_i$ to a scaled version $\tilde{x}_i$ using (2):

$\tilde{x}_i = \dfrac{x_i - x_i^{MIN}}{x_i^{MAX} - x_i^{MIN}}$   (2)
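A minimal illustration of this preprocessing and the Lasso fit with scikit-learn, assuming the features are in `X` and the crime-rate target in `y` (a sketch of the general approach, not the author's exact code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()                     # Eq. (2): scale each feature to [0, 1]
X_train_s = scaler.fit_transform(X_train)   # min/max computed on the training set only
X_test_s = scaler.transform(X_test)

lasso = Lasso(alpha=2.0, max_iter=10000).fit(X_train_s, y_train)
kept = np.sum(lasso.coef_ != 0)             # L1 penalty drives many coefficients to exactly zero
print("features kept:", kept, "R-squared:", round(lasso.score(X_test_s, y_test), 2))
```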
3.4 Alpha Value Selection

The parameter α controls the amount of L1 regularization done in the Lasso linear regression. The default value of alpha is 1.0. To use the Lasso regression efficiently, the appropriate value of alpha must be selected. To decide the appropriate value of alpha, a range of alpha values was compared using the r-squared value. R-squared (R²) is a statistical measurement which captures the proportion of the variation of a variable that is explained by another variable or variables in a regression setting [12]. The alpha values used in the comparison were: [0.5, 1, 2, 3, 5, 10, 20, 50]. The results of the comparison are shown in Table 1. As we can see in Table 1, the highest r-squared value was achieved at an alpha value of 2.

Table 1. Effect of alpha regularization

| Alpha value | Features kept | R-squared value |
| 0.5         | 35            | 0.58            |
| 1           | 25            | 0.60            |
| 2           | 20            | 0.63            |
| 3           | 17            | 0.62            |
| 5           | 12            | 0.61            |
| 10          | 6             | 0.58            |
| 20          | 2             | 0.50            |
| 50          | 1             | 0.30            |
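The alpha sweep can be reproduced with a simple loop of the kind sketched below, again assuming the scaled training/test splits from the earlier sketch (variable names are illustrative):

```python
from sklearn.linear_model import Lasso

for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train_s, y_train)
    kept = (model.coef_ != 0).sum()        # number of features surviving the L1 penalty
    r2 = model.score(X_test_s, y_test)     # R-squared on held-out data
    print(f"alpha={alpha:>4}: features kept={kept:>3}, R^2={r2:.2f}")
```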
3.5 Results

For alpha = 2.0, 20 out of the 125 features have non-zero weight. The top features (sorted by absolute magnitude) are shown in Table 2. Higher weights indicate higher importance and impact of the feature on crime rate. Positive weights for features mean a positive correlation between the feature value and crime rate. On the other hand, negative weights for features mean a negative correlation between the feature value and crime rate.

Table 2. Features with non-zero weight (sorted by absolute magnitude)

| Feature               | Weight    | Description                                          |
| PctKidsBornNeverMar   | 1488.365  | Kids born to never married parents                   |
| PctKids2Par           | −1188.740 | Kids in family with two parents                      |
| HousVacant            | 459.538   | Unoccupied households                                |
| PctPersDenseHous      | 339.045   | Persons in compact housing                           |
| NumInShelters         | 264.932   | People in homeless shelters                          |
| MalePctDivorce        | 259.329   | Divorced males                                       |
| PctWorkMom            | −231.423  | Moms of kids under 18 in labor force                 |
| pctWInvInc            | −169.676  | Households with investment                           |
| agePct12t29           | −168.183  | In age range 12–29                                   |
| PctVacantBoarded      | 122.692   | Vacant housing that is boarded up                    |
| pctUrban              | 119.694   | People living in urban areas                         |
| MedOwnCostPctIncNoMtg | 104.571   | Median owners' cost                                  |
| MedYrHousBuilt        | 91.412    | Median year housing units built                      |
| RentQrange            | 86.356    | Renting a house                                      |
| OwnOccHiQuart         | 73.144    | Owning a house                                       |
| PctEmplManu           | −57.530   | People 16 and over who are employed in manufacturing |
| PctBornSameState      | −49.394   | People born in the same state as currently living    |
| PctForeignBorn        | 23.449    | People foreign born                                  |
| PctLargHouseFam       | 20.144    | Family households that are large (6 or more)         |
| PctSameCity85         | 5.198     | People living in the same city                       |
Although there are many interesting features to discuss from the results shown in Table 2, we will focus our interest on the top two features. The top antecedent from the list of features was Kids Born to Never Married Parents, with a positive weight of 1488.365. The second antecedent was Kids in Family with Two Parents, with a negative weight of −1188.740. These two top antecedents indicate the importance of a stable family (two parents present) for raising kids who are less likely to commit crime in the future.
4 Conclusion

In this study, we employed machine learning through the use of Lasso linear regression in an effort to predict socio-economic antecedents that affect the crime rate in US cities. The regression model was applied to a data set that was sourced from the 1995 US FBI Uniform Crime Report, the 1990 United States Census, and the 1990 United States LEMAS (Law Enforcement Management and Administrative Statistics) Survey. The regression results pointed out that the topmost influential factors affecting crime in US cities are related to stable families. Kids born into families with two parents are less likely to commit crime. These findings should help policy makers to create strategies and dedicate funding to help minimize crime.
References
1. Melossi, D.: Controlling Crime, Controlling Society: Thinking about Crime in Europe and America, 1st edn. Polity (2008)
2. Herath, H.M.M.I.S.B., Dinalankara, D.M.R.: Envestigator: AI-based crime analysis and prediction platform. In: Proceedings of Peradeniya University International Research Sessions, vol. 23, no. 508, p. 525 (2021)
3. Baumgartner, K.C., Ferrari, S., Salfati, C.G.: Bayesian network modeling of offender behavior for criminal profiling. In: Proceedings of the 44th IEEE Conference on Decision and Control (CDC-ECC), pp. 2702–2709 (2005)
4. Linning, S.J., Andresen, M.A., Brantingham, P.J.: Crime seasonality: examining the temporal fluctuations of property crime in cities with varying climates. Int. J. Offender Ther. Comp. Criminol. 61(16), 1866–1891 (2017)
5. Almanie, T., Mirza, R., Lor, E.: Crime prediction based on crime types and using spatial and temporal criminal hotspots. arXiv preprint arXiv:1508.02050 (2015)
6. Nguyen, T.T., Hatua, A., Sung, A.H.: Building a learning machine classifier with inadequate data for crime prediction. J. Adv. Inf. Technol. 8(2) (2017)
7. Ghosh, D., Chun, S., Shafiq, B., Adam, N.R.: Big data-based smart city platform: real-time crime analysis. In: Proceedings of the 17th International Digital Government Research Conference on Digital Government Research, pp. 58–66. ACM (2016)
8. McClendon, L., Meghanathan, N.: Using machine learning algorithms to analyze crime data. Mach. Learn. Appl. Int. J. (MLAIJ) 2(1), 1–12 (2015)
9. Phillips, P., Lee, I.: Mining top-k and bottom-k correlative crime patterns through graph representations. In: 2009 IEEE International Conference on Intelligence and Security Informatics, pp. 25–30 (2009). https://doi.org/10.1109/ISI.2009.5137266
10. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, CA (2007). https://www.archive.ics.uci.edu/ml/datasets/Communities+and+Crime
11. Kumar, D.: A complete understanding of LASSO regression. The Great Learning (26 December 2021). https://www.mygreatlearning.com/blog/understanding-of-lasso-regression/
12. Kuhn, M., Johnson, K.: Feature Engineering and Selection: A Practical Approach for Predictive Models, 1st edn. CRC Press, New York (2019)
Hybrid Quantum Machine Learning Classifier with Classical Neural Network Transfer Learning

Avery Leider(B), Gio Giorgio Abou Jaoude, and Pauline Mosley

Pace University, Pleasantville, NY 10570, USA
{al43110n,ga97026n,pmosley}@pace.edu
Abstract. The hybrid model uses a minimally trained classical neural network as a pseudo-dimensional pool to reduce the number of features from a data set before using the output for forward and back propagation. This allows the quantum half of the model to train and classify data using a combination of the parameter shift rule and a new “unlearning rate” function. The quantum circuits were run using Penny-Lane simulators. The hybrid model was tested on the wine data set to promising results of up to 97% accuracy. The inclusion of quantum computing in machine learning represents great potential for advancing this area of scientific research because quantum computing offers the potential for vastly greater processing capability and speed. Quantum computing is currently primitive; however, this research takes advantage of the mathematical simulators for its processing that prepares this work to be used on actual quantum computers as soon as they become widely available for machine learning. Our research discusses a benchmark of a Classic Neural Network as made hybrid with the Quantum Machine Learning Classifier. The Quantum Machine Learning Classifier includes the topics of the circuit design principles, the gradient parameter-shift rule and the unlearning rate. The last section is the results obtained, illustrated with visual graphs, with subsections on expectations, weights and metrics. Keywords: Machine learning · Transfer learning · Deep learning Quantum Machine Learning Classifier · Quantum computing mathematics
1 Introduction
The Hybrid Quantum Machine Learning Classifier (HQMLC) was developed in Python, based on a previous Quantum Machine Learning Classifier [6] and an open-source classical network framework found on GitHub [7]. The HQMLC was trained and tested on the machine learning Wine dataset [10]. The code is available on Google Colab in two public Jupyter notebooks [4,5]. All links, comments and references for the code are given under the links and references tabs of the Colab notebooks.
The HQMLC is a combination model composed of a classical neural network and a quantum machine learning classifier. The hybrid model works by first training the classical network on the data for one epoch using a large learning rate. The small classical network is then folded to remove a layer in a process similar to transfer learning. All the data is propagated through the folded network. The output of this network is treated as data of reduced dimensionality. It is re-scaled and passed to the quantum machine learning classifier, which trains on this reduced-dimension data in a process similar to previous work [6].

The Wine data set [10] consists of 178 measurements. The data is the result of a chemical analysis done on wine from three cultivars in the same region of Italy. Cultivars are plant varieties produced in cultivation by selective breeding. There are 13 attributes of these cultivars: alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. All attributes are continuous. The dataset has three targets, referred to as targets zero, one and two. The dataset used in the previous Quantum Machine Learning Classifier [6] also had three targets (iris flower species) but only four attributes, so the Wine dataset is a significantly more challenging classification task.

Figure 1 shows a correlation matrix that illustrates the relationships between the measured characteristics in the data set. Red values are positive correlations and blue values are negative correlations; zero is a neutral relationship. The correlation matrix shows how strongly one attribute is directly correlated with another: a positive value (a red square) means that when one attribute takes higher values, the attribute on the corresponding row or column also tends to take higher values, and a negative value means the opposite. The statistics of the data are given in Fig. 2 and Fig. 3. As is customary in machine learning, the data was split into a training subset and a testing subset, proportioned 75% training and 25% testing. This means 45 measurements were used in every epoch for testing.

The rest of this paper discusses the Classic Neural Network and the Quantum Machine Learning Classifier. The Quantum Machine Learning Classifier covers the circuit design principles, the gradient parameter-shift rule and the unlearning rate. The last section presents the results obtained, illustrated with visual graphs, with subsections on expectations, weights and metrics. This is followed by a section on future work and a list of helpful definitions.
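As a minimal sketch of the data preparation described above (the exact preprocessing code lives in the Colab notebooks [4,5], so the random seed and the specific calls here are illustrative assumptions), the Wine data, the 75/25 split, and the correlation matrix of Fig. 1 can be reproduced with scikit-learn and NumPy:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# 178 measurements, 13 continuous attributes, 3 targets (cultivars)
X, y = load_wine(return_X_y=True)

# 75% training / 25% testing split; the seed is arbitrary, not from the paper
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_test.shape)  # (45, 13): the 45 test measurements mentioned above

# 13 x 13 correlation matrix of the attributes (the basis of Fig. 1)
corr = np.corrcoef(X, rowvar=False)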
2 The Classic Neural Network
The classical neural network was designed using the GitHub repository [7] with modifications done to better tackle a hybrid model. The network shown in Fig. 4 was used. The input layer contained 13 neurons, one for each feature. The hidden layer contained six neurons to match the number of features that would be accepted by the circuits in the second half of the hybrid model. The output
104
A. Leider et al.
Fig. 1. Correlation matrix of wine data set
Fig. 2. Statistics of wine data set 1/2
Hybrid Quantum Machine Learning Classifier with Transfer Learning
105
Fig. 3. Statistics of wine data set 2/2
layer contained three neurons, one for each target. Different activation functions were attempted for the final layer, with little difference; the sigmoid activation function performed the best by a small margin. The cost function used the sum of squared errors with a learning rate of 0.25 over one epoch. More testing is needed on the network design, including hyperparameters, but, based on limited results, larger-than-typical learning rates and linear activation functions proved most successful. The accuracy of the classical neural network after one epoch is appreciably low; it was tested only as a quality control measure. After the epoch, the neural network is folded to only two layers, as shown in Fig. 5. This folded network functioned as a means of reducing the dimensionality of the features. The entire data set was propagated through the network. The output of this propagation was treated as a dataset that was re-scaled using normalization and multiplied by π (pi), as was suggested in previous work [6]. This dataset now consists of six unit-less features and is referred to as the pooled data set.
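A minimal sketch of the folding and pooling step is given below, using scikit-learn's MLPClassifier as a stand-in for the GitHub network [7]; the actual code, cost function and activations are in the Colab notebooks [5], so the estimator and the hidden-layer activation chosen here are assumptions:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)

# One-epoch classical network: 13 inputs -> 6 hidden -> 3 outputs, large learning rate
clf = MLPClassifier(hidden_layer_sizes=(6,), activation="logistic",
                    learning_rate_init=0.25, max_iter=1)
clf.fit(X, y)

# "Fold" the network: keep only the input-to-hidden mapping so the six
# hidden activations become a reduced-dimensionality data set
hidden = 1.0 / (1.0 + np.exp(-(X @ clf.coefs_[0] + clf.intercepts_[0])))

# Re-scale to [0, 1] and multiply by pi to obtain the pooled data set
pooled = np.pi * (hidden - hidden.min(0)) / (hidden.max(0) - hidden.min(0))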
3 Quantum Machine Learning Classifier
The Quantum Machine Learning Classifier (QMLC) is an iteration of the previous model [6]. The QMLC rests on three key components. The quantum circuits used to make the classifier were based on the Cappelletti design principles [1], with modifications made to fit a multi-classification problem. The gradient parameter-shift rule used to determine the gradients has been extensively studied by other researchers, including the PennyLane developers [9]. Finally, the novel component in this iteration of the QMLC is the non-intuitive "unlearning rate", used as a mechanism to attune each circuit to the data that matches its target by decreasing its affinity to produce high expectation values with data from every other target. Each of these components is explained further in later sections. The basic design of the QMLC consists of identical quantum circuits for each target of the classification problem. Using tools described later, each circuit is attuned to the data relating to a particular target by encoding the data as parameters in the circuit and changing the weights of that circuit to produce
higher expectation values for that data. In principle, if a row of data describes a target, then all circuits should produce low expectation values except for the circuit corresponding to that target. This is the basic working principle of the classifier, and it has proven successful. The circuit design used to create the classifier, shown in Fig. 6, is drawn using Qiskit [8].
3.1 Circuit Design Principles
The work by Cappelletti et al. [1] gives general guidelines on the design of quantum circuits for machine learning. Their guidelines were tested on the simpler Iris data set and proved successful. Many of those design principles were employed here on the more complex Wine data set.
Fig. 4. Full classical neural network
Fig. 5. Folded classical neural network
Fig. 6. The QMLC circuit design
The first guideline followed was that the gates remain uni-dimensional. All the gates use rotations along the X-axis for the features and weights, which adds a level of simplicity to the circuitry. The CZ gates in the circuit alternate in direction until the measurement is taken at the right end of the circuit. The second guideline followed was the general shape of the circuit: emphasis was placed on keeping a long and thin shape, using minimal qubits. This may allow for later testing of the model on real quantum computers, with smaller topology designs even for larger numbers of targets. The final guideline followed was the proportion of weights to features and their placement in the circuit. As seen in Fig. 6, there are nearly twice as many weights as there are features. This is intentional: the design provides the best results of the QMLC without overtaxing the optimization process with too many weights.
By virtue of the design, an even number of features would result in a superfluous final weight in the bottom qubit if the same pattern of two weights after two qubits were used before measuring the expectation. This is why the final weight is omitted; adding a weight to the circuit at that location would serve no purpose. The output of the quantum circuit using a PennyLane [9] backend is an expectation value in [−1, 1]. For simplicity, when describing the results in later sections, the range of expectation values is shifted to [0, 1] by adding one and dividing by two. Changing the bounds in this manner allowed for easier analysis and explanation of the results.
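A minimal PennyLane sketch of a circuit in this spirit is shown below; the exact gate layout of Fig. 6 is not reproduced here, so the number of entangling layers, the wire count of six (one per pooled feature), and the measured qubit are assumptions for illustration only:

import pennylane as qml

n_qubits = 6  # one qubit per pooled feature (assumption)
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(features, weights):
    # Uni-dimensional encoding: every feature enters as an X-rotation
    for i in range(n_qubits):
        qml.RX(features[i], wires=i)
    # Alternating CZ entanglers keep the circuit "long and thin"
    for i in range(n_qubits - 1):
        qml.CZ(wires=[i, i + 1])
    # Trainable X-rotation weights (roughly twice as many weights as features)
    for i, w in enumerate(weights):
        qml.RX(w, wires=i % n_qubits)
    # Raw expectation value in [-1, 1]
    return qml.expval(qml.PauliZ(n_qubits - 1))

def rescaled_expectation(features, weights):
    # Shift the expectation from [-1, 1] to [0, 1], as described above
    return (circuit(features, weights) + 1.0) / 2.0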
3.2 Gradient Parameter-Shift Rule
Over the course of training the QMLC, the gradient is used to change the weights so that the output of each circuit increases when it is encoded with rows of data matching its target. The gradient is key to this process. It is calculated using the gradient parameter-shift rule, which has been studied by many, including PennyLane [9]. The rule is outlined in Eq. (1) with respect to an individual weight Θ_i (Theta), where B(Θ) denotes the circuit output (expectation value):

\nabla_{\Theta_i} B(\Theta) = \frac{1}{2}\left[ B\left(\Theta + \frac{\pi}{2} e_i\right) - B\left(\Theta - \frac{\pi}{2} e_i\right) \right]   (1)

During training, each circuit is encoded with the next row of data. The circuit gradients for each weight are calculated, which include direction (positive or negative) and magnitude. If the circuit number matches the target number for that row of data, the gradient is multiplied by the learning rate (alpha) and added to the current weight. If the circuit does not match the target, it is moved in the opposite direction with a magnitude modified by the unlearning rate explained below.
3.3 Unlearning Rate
In typical machine learning explanations, the multivariate depiction of the cost function is a three-dimensional surface with saddles, minima and maxima. The objective, in most models, is to find the combination of weight values that gives the global minimum. Re-purposing the gradient function to increase the expectation values can be thought of as moving the weights over a similar surface with the goal of finding the global maximum, so the visualization can also be re-purposed. There is one major difference: the weights of one circuit have no consequence on the output or gradient calculations of the other circuits. This means that the surfaces describing the optimal weights for producing higher expectation values in each circuit are disjoint from each other, which leads to poor QMLC results when each circuit navigates its own surface independently. The solution is the unlearning rate. When a row of data is used to optimize the classifier, the data is encoded into each circuit. For the circuit that matches the target corresponding to the row of
data, the weights are moved in the direction that increases the expectation value of the circuit, as described earlier. Every other circuit has its gradient multiplied by a factor proportional to the output of the circuits. This proportion is equal to the sum of the expectation values of the circuits that do not match the target divided by the sum of all the expectation values, as shown in the code below. Here target_expectation is the output of the circuit that matches the row of data, and sum_expect is the sum of all expectation values for that row of data. By using the expectation values as a proportion with which to modify the circuits, the surfaces described previously can be linked.

target_expectation = circuit(features, weights[target])
expectations = calc_expectations(features, weights, num_circuits=3)
sum_expect = sum(expectations)
beta = (sum_expect - target_expectation) / sum_expect
The two update rules below are paraphrased from the code [5]. The first shows the change made to the weights (called parameters, or params) when the weight belongs to the circuit corresponding to the features in a row of data. The second shows the change made when the weight belongs to a circuit that does not correspond to that row of data. The unlearning rate is denoted by beta, by analogy with the learning rate typically being denoted by alpha.

params += alpha * parameter_shift(features, params)
params += (-alpha * beta) * parameter_shift(features, params)

The use of the unlearning rate helps the weights of one circuit avoid a configuration that matches the behaviour of another circuit by adjusting the path created by gradient descent. This is normally not needed in models where the gradient is calculated with respect to all other weights, but it is needed here because circuit weights do not affect the expectation values of other circuits. It is a peculiar workaround that needs further study.
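Putting Eq. (1) and the unlearning rate together, a self-contained sketch of one optimization step might look as follows; the helper names, the learning-rate value and the generic circuit callable are illustrative assumptions rather than the notebook code [5]:

import numpy as np

ALPHA = 0.25  # learning rate; the value here is illustrative

def parameter_shift(features, params, circuit):
    # Gradient of the circuit expectation w.r.t. each weight via Eq. (1)
    grads = np.zeros_like(params)
    for i in range(len(params)):
        shift = np.zeros_like(params)
        shift[i] = np.pi / 2
        grads[i] = 0.5 * (circuit(features, params + shift)
                          - circuit(features, params - shift))
    return grads

def train_step(features, target, weights, circuit, num_circuits=3):
    # One optimization step over all circuits for a single row of data
    expectations = np.array([circuit(features, weights[c]) for c in range(num_circuits)])
    target_expectation = expectations[target]
    beta = (expectations.sum() - target_expectation) / expectations.sum()  # unlearning rate
    for c in range(num_circuits):
        grad = parameter_shift(features, weights[c], circuit)
        if c == target:
            weights[c] += ALPHA * grad             # learn: move toward higher expectation
        else:
            weights[c] += (-ALPHA * beta) * grad   # unlearn: move away for other targets
    return weights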
4 Results
All results were recorded at every optimization step of the classifier during training. This significantly slowed the QMLC but was necessary for the study of the model. The findings and results are discussed below.
4.1 Expectations
The QMLC produces an expectation value for each target in the classification problem. By design, the expectation values produced are randomized in the early epochs of training, and this improves over the course of training. If the circuit is working properly, it will produce low expectation values when looking
at rows of data not corresponding to its target, and vice versa. Each circuit becomes more attuned to its respective target and increases the variance of its output. In Fig. 7, we see that in this trial the expectations of the QMLC initially hover closer to the center.
Fig. 7. Initial expectations of the QMLC hovering closer to the center
As the training continues with each gradient step, the expectation values move to the extreme edges of [0, 1] more frequently. This can also be seen in Fig. 8. If the classifier is working properly, the circuit variances should mostly increase with training.
4.2 Weights
As suggested in previous work [6], the weights were randomly generated in [0, 2π]. This led to significant increases in training speed: the weights had smaller distances to move and fewer frantic turns in direction. Figure 9 shows the weights for all circuits. The repetitive, "static"-like motion that the weights exhibit comes from non-stochastic gradient descent. The classifiers are optimized with the same data in the same order every epoch, which means that the weights are pushed and pulled in the same order, resulting in the "static". As in previous work [6], momentum operators or stochastic gradient descent may help stabilize the motion of the weights. Figure 10 shows, by comparison, the weights for target number 0 only, Fig. 11 the weights for target number 1, and Fig. 12 for target number 2.
Fig. 8. Circuit variances increasing with training
4.3 Metrics
The metrics recorded, shown in Fig. 13, were an improvement on previous iterations of the QMLC [6]. Again, the model was able to overfit at an even faster speed. This is attributed to the changes made in weight generation and the unlearning rate equation. It should be noted that the classifier did not improve constantly.
5 Future Work
The continued success of the quantum machine learning model is promising. Future exploration is needed into the hyperparameters, network layers and
Fig. 9. Recordings of the weights over the course of training
Fig. 10. Recordings of the circuit associated with target 0
Fig. 11. Recordings of the circuit associated with target 1
other aspects of the model. With more advancements in quantum computing, it is reasonable to assume there will be more access to qubits for academic research. Therefore, the next goal is to study larger data sets using quantum circuits rather than simulating the mathematics. Further study is needed to understand the effect of the unlearning rate and the possible use of parallelization.
Fig. 12. Recordings of the circuit associated with target 2
Fig. 13. Recordings of the quantum machine learning classifier metrics over the course of training
6 Definitions
For a more robust list, including previous definitions, refer to our previous work [6]. Quantum computing is a cross-section of computer science and quantum mechanics, a new science that is growing with new terms. Included here is a section reviewing the terms of quantum computing that are directly relevant. This covers quantum bits ("qubits") and the specific gates used in this quantum circuit for the quantum machine learning classifier: the CNOT gate, the CZ gate and the RX gate.
Activation Function: An activation function is the function that determines if a neuron will output a signal based on the signals and weights of previous neurons, and if so by how much.

Backpropagation: The algorithm by which a neural network decreases the distance between the output of the neural network and the optimal possible output of the same neural network. Backpropagation is an amalgam of the phrase "backward propagation".

Bloch Sphere: The Bloch Sphere is the three-dimensional representation of the possible orientations a qubit can have with a radius of one, as described in the review of the lectures of Felix Bloch in [2]. It is analogous to the unit circle.

Circuit: A circuit is an ordered list of operations, gates, on a set of qubits to perform computational algorithms.

CNOT Gate: The CNOT gate, or CX gate, causes a rotation in a target qubit based on the value of a control qubit.

\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}
CZ Gate: The CZ gate, controlled-Z gate, causes a rotation in a target qubit on the Z-axis based on the control qubit's position on the Z-axis.

\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}

Dirac Notation: The Dirac notation, as described first by [3], is the standard notation for quantum computing by which vectors are represented through the use of bras and kets to denote, respectively, row and column vectors.

Epoch: Epoch is one iteration of training.

Expectation Value: The expectation value is the probabilistic value of a circuit.

Fold: The process of removing layers in a classical neural network.

Gate: A gate is an operation performed on a qubit. Mathematically the gate is described as a matrix that changes the values of a qubit. This value change corresponds with a rotation in three-dimensional space. The degree of the rotation can be a function of an input value; in this case, the gate is parameterized with that value (called a parameter).

Learning Rate: A hyperparameter that determines the responsiveness of a model in the direction opposite the gradient. It is multiplied by the negative of the gradient to determine the change in a weight.
Neural Network: A neural network is a learning algorithm designed to resemble the neural connections in the human brain.

Neuron: A neuron is the basic building block of a neural network. Connected neurons have weights to stimulate predetermined activation functions.

Pooled Data Set: The data set after being propagated through a folded classical neural network, re-scaled and multiplied by π.

Qubit: The qubit is the quantum computing analog of a classical bit, represented as a vector of two numbers. The two numbers can be represented as a vector in three-dimensional space. A qubit is represented as a wire in the graphical representation of a circuit.

RX Gate: The RX gate causes a rotation of a qubit about the X-axis to a degree specified by a parameter. The angle of rotation is specified in radians and can be positive or negative.

R_X(\theta) = \begin{pmatrix} \cos\frac{\theta}{2} & -i \sin\frac{\theta}{2} \\ -i \sin\frac{\theta}{2} & \cos\frac{\theta}{2} \end{pmatrix}

Testing Set: The set of data used to verify the accuracy of the trained neural network.

Training Set: The subset of data used to train the neural network.

Unlearning Rate: Unlike the learning rate, the "unlearning rate" is not a hyperparameter because it is determined by a function; its value is deterministic. It determines the responsiveness of the circuits in the direction of the gradient. It is multiplied by the product of the learning rate and the positive gradient to determine the change in a weight.

Weight: A weight can be thought of as the value denoting the strength of the connection between two neurons. It transforms the output signal of a neuron before it is fed into another neuron in the next layer.
References
1. Cappelletti, W., Erbanni, R., Keller, J.: Polyadic quantum classifier (2020). https://arxiv.org/pdf/2007.14044.pdf
2. Bloch, F., Walecka, J.D.: Fundamentals of statistical mechanics: manuscript and notes of Felix Bloch. World Scientific (2000)
3. Dirac, P.A.M.: A new notation for quantum mechanics. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol. 35, pp. 416–418. Cambridge University Press (1939)
4. Jaoude, G.A.: Second Colab Notebook (2022). https://colab.research.google.com/drive/1f5QkqpgSs1K5apArZjHyV gpZmGHfWDT?usp=sharing
5. Jaoude, G.A.: Google Colab Showing Code (2022). https://colab.research.google.com/drive/1s8B5rQh0dDb5yYgzmpgmWUf54YZ9PUju?usp=sharing
6. Leider, A., Jaoude, G.A., Strobel, A.E., Mosley, P.: Quantum machine learning classifier. In: Arai, K. (ed.) FICC 2022. LNNS, vol. 438, pp. 459–476. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98012-2_34
7. Aflack, O.: neural_network.ipynb (2018). https://colab.research.google.com/drive/10y6glU28-sa-OtkeL8BtAtRlOITGMnMw#scrollTo=oTrTMpTwtLXd
8. Open source quantum information kit. Qiskit (2022). https://qiskit.org/
9. PennyLane dev team: Quantum gradients with backpropagation (2021). https://pennylane.ai/qml/demos/tutorial_variational_classifier.html
10. Scikit-learn: Wine Dataset (2021). https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html
Repeated Potentiality Augmentation for Multi-layered Neural Networks Ryotaro Kamimura(B) Tokai University and Kumamoto Drone Technology and Development Foundation, 2880 Kamimatsuo Nishi-ku, Kumamoto 861-5289, Japan [email protected] Abstract. The present paper proposes a new method to augment the potentiality of components in neural networks. The basic hypothesis is that all components should have equal potentiality (equi-potentiality) to be used for learning. This equi-potentiality of components has implicitly played critical roles in improving multi-layered neural networks. We introduce here the total potentiality and relative potentiality for each hidden layer, and we try to force networks to increase the potentiality as much as possible to realize the equi-potentiality. In addition, the potentiality augmentation is repeated at any time the potentiality tends to decrease, which is used to increase the chance for any components to be used as equally as possible. We applied the method to the bankruptcy data set. By keeping the equi-potentiality of components by repeating the process of potentiality augmentation and reduction, we could see improved generalization. Then, by considering all possible representations by the repeated potentiality augmentation, we can interpret which inputs can contribute to the final performance of networks. Keywords: Equi-potentiality · Total potentiality · Relative potentiality · Collective interpretation · Partial interpretation
1 Introduction
This section explains why the basic property of equi-potentiality should be introduced to improve generalization and interpretation. The equi-potentiality means that all components should be used as equally as possible in learning. In particular, we try to show how this basic property has been explicitly or implicitly used in several methods in neural networks by surveying the literature of many existing and important methods and hypotheses, such as competitive learning, mutual information, lottery ticket hypothesis, and interpretation methods.
1.1 Potentiality Reduction
The present paper proposes a method of increasing the potentiality of components as much as possible. The potentiality means to what degree the corresponding components contribute to the inner activity of neural networks. At the initial stage of learning,
the potentiality of components should be large, because the inner activity is not constrained by any outer conditions. In the course of learning, this initial potentiality tends to decrease, and the potentiality in neural networks is transformed into a form of information actually used in learning. For example, the potentiality of components should be decreased to reduce errors between outputs and targets in supervised learning. Thus, the potentiality for representing the inner activity tends to lose its strength, implicitly and overtly, in neural learning. One of the main problems is, therefore, how to increase the prior potentiality, because larger potentiality means more adaptability and flexibility in learning. In this context, this paper proposes a method of augmenting the potentiality to realize the equi-potentiality at any time.
1.2 Competitive Learning
The importance of equi-potentiality of components has been well recognized in conventional competitive learning [1], self-organizing maps [2–8], and related information-theoretic methods [9]. Competitive learning aims to discover distinctive features in input patterns by determining output neurons maximally responding to specific inputs. In this method, the output neurons naturally should have the basic and specific features to win the competition. In addition, it is supposed that all neurons should be equally responsible for representing inputs, namely, equi-potentiality in this paper. Though this equi-potentiality has been recognized from the beginning, it has not necessarily been implemented with success. The problem of dead neurons without any potentiality has been one of the major problems in competitive learning [10–16]. Thus, though the importance of equi-potentiality has been considered significant, a method for achieving it has not been well established until now.
In addition, this equi-potentiality has been considered important in information-theoretic methods, though little attention has been paid to this property due to the complexity of computation [9, 17–23]. For example, this equi-potentiality has been expressed in the form of mutual information in neural networks. In terms of potentiality, mutual information can be decomposed into the equi-potentiality of components and the specification of each component. Thus, this information-theoretic approach deals not only with the acquisition of specific information but also with the equal use of all components, quite similar to the case of competitive learning. The importance of equi-potentiality has been recognized, but this property has not been fully considered due to the complexity of computing mutual information.
1.3 Regularization
The equi-potentiality should be related to the well-known regularization in neural networks [24–28], but it has not received due attention in explicit and implicit regularization. For example, to improve generalization, weight decay and related methods have been introduced to restrict the potentiality of weights. These may pose no problems in learning if we suppose that neural networks can immediately choose appropriate solutions among many. However, if not, the regularization methods have difficulty in seeking other solutions in the course of learning. This paper tries to show that we need to introduce the equi-potentiality of weights to find appropriate ones among many possibilities. Though random noise or similar
effects in conventional learning are expected to be effective in this restoration, those methods are insufficient to find appropriate solutions in learning. This paper shows that, for regularization to be effective, the concept of equi-potentiality should play a more active role in increasing the possibility of finding final and appropriate weights.
Related to the regularization discussed above and the equi-potentiality, a new hypothesis for learning has recently received due attention, namely, the lottery ticket hypothesis [29–36], where learning is considered not as a process of creating appropriate models but as a process of finding or mining them from already existing ones. Behind a process of finding appropriate models from many, in the case of huge multi-layered neural networks, there should, though implicitly, be a principle of equi-potentiality of any components or sub-networks. As mentioned above, learning is a process of reducing the potentiality in a neural network. By the effect of implicit and explicit regularization, error reduction between outputs and targets can only be realized by reducing the potentiality of some components. Thus, the lottery ticket hypothesis can be realized only on the basis of the equi-potentiality of all components or sub-networks. More strongly, we should say that the lottery ticket hypothesis cannot be valid unless the property of equi-potentiality of components or sub-networks is supposed. Thus, the lottery ticket hypothesis may be sound in principle. However, in actual learning, we need to develop a method to actively realize the equi-potentiality of all components and sub-networks.
1.4 Comprehensive Interpretation
Finally, because the present paper tries to interpret the final internal representations in neural networks, we should mention the relation of equi-potentiality to interpretation. As is well known, the interpretation of final internal representations created by neural networks has been one of the main problems in theoretical as well as practical studies [37–42]. Unless we understand the main mechanism of neural networks, the improvement of learning methods becomes difficult, and without explaining the inference mechanism of neural networks, it is almost impossible for neural networks to be used for specific and practical applications.
One of the main problems in interpreting neural networks lies in the naturally distributed characteristics of inference processing. The actual inference is supposed to be performed in collaboration with a number of different components inside and input patterns and initial conditions outside. In spite of this distributed property, the main interpretation methods so far developed, especially for convolutional neural networks, have focused on a specific instance of the inference mechanism, and this can be called "local interpretation" [43–57], to cite a few. At this point, we should introduce the equi-potentiality into the field of interpretation. This means that, though it may be useful to have a specific interpretation for a specific example of the interpretation problem, we need to consider all possible representations for the problem. More strongly, we should suppose that all possible representations have equal potentiality for appropriate interpretation. Returning to the above discussion on the equi-potentiality of components, we need to consider all possible representations and any different configurations of components as equally as possible.
Once again, we should stress that the majority of interpretation methods seem to be confined within a specific instance of possible interpretations. We need to develop a method to consider as many instances as possible for comprehensive interpretation.
In this paper, we interpret the inference mechanism, supposing that all instances have the same status or importance and all representations have equal potentiality for interpretation. Thus, the interpretation can be as comprehensive as possible.
1.5 Paper Organization
In Sect. 2, we try to explain the concepts of potentiality and how to define two types of potentiality: total and relative potentiality. After briefly explaining how to train neural networks with the potentiality, we introduce the collective interpretation in which all possible instances of interpretation have the same importance. In particular, we show how a single hidden neuron tries to detect features by this interpretation method. Finally, we apply the method to the bankruptcy data set. We try to show how the final representations can be changed by repeating and augmenting the potentiality of connection weights in hidden layers. The final results show that, for improving generalization, it is necessary to control the total potentiality and to increase the relative potentiality. By examining the whole sets of final representations, we can see which inputs are important in inferring the bankruptcy of companies.
2 Theory and Computational Methods
2.1 Repeated Potentiality Reduction and Augmentation
In this paper, it is supposed that the total potentiality of components such as connection weights, neurons, and layers tends to decrease. Learning is considered a process that reduces the initial potentiality for some specific objectives of learning. This reduction in potentiality can restrict the flexibility of learning, which is necessary for obtaining appropriate information for any components. Thus, we need to restore the potentiality to increase the possibility of obtaining appropriate components.
As shown in Fig. 1, we suppose that a network has maximum potentiality only when each connection weight is connected equally with all neurons. The maximum potentiality means that all neurons have the same potentiality to be connected with the other neurons. In a process of learning, error minimization between outputs and targets forces some connection weights to become stronger. If these connection weights are not well suited for error minimization, we need to move to other connection weights. This paper supposes that we need to restore the initial potentiality as much as possible so that neurons have an equal chance of being chosen at any time of learning, as in Fig. 1(b). In a process of learning, we should repeat this process of reduction and augmentation as many times as possible to obtain the final appropriate connection weights, shown in Fig. 1(d) and (e).
2.2 Total and Relative Information
In this paper, the potentiality in a hidden layer can be represented by the sum of all individual potentialities. The individual potentiality can be defined, as a first approximation, by the absolute weights. For simplicity, we consider weights from the second to the third layer, represented by (2, 3), and the individual potentiality is defined by
u_{jk}^{(2,3)} = \left| w_{jk}^{(2,3)} \right|   (1)

Fig. 1. Repeated potentiality reduction and augmentation.
Then, total information is the sum of all individual potentialities in the layer:

T^{(2,3)} = \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} u_{jk}^{(2,3)}   (2)
where n_2 and n_3 denote the number of neurons in each layer. As mentioned above, learning is considered as the reduction of this potentiality. One possible way for the reduction is to reduce the number of strong weights. Then, we should count the number of strong weights in a layer. The relative potentiality is introduced to represent the number of strong weights in a layer. The relative potentiality is the potentiality relative to the corresponding maximum potentiality:

r_{jk}^{(2,3)} = \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}}   (3)
where the max operation is over all connection weights between the layers. In addition, we define the complementary one by

\bar{r}_{jk}^{(2,3)} = 1 - \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}}   (4)
We simply call this absolute strength "potentiality" and "complementary potentiality". By using this potentiality, the relative potentiality can be computed by

R^{(2,3)} = \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}}   (5)

When all potentialities become equal, naturally, the relative potentiality becomes maximum. On the other hand, when only one potentiality becomes one while all the others are zero, the relative potentiality becomes minimum. For simplicity, we suppose that at least one connection weight should be larger than zero.
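The quantities in Eqs. (1)-(5) are straightforward to compute; a minimal sketch for a single weight matrix (the function name and NumPy formulation are ours, not the paper's code) is:

import numpy as np

def layer_potentialities(W):
    # W: weight matrix between two layers, shape (n2, n3)
    u = np.abs(W)                  # individual potentialities, Eq. (1)
    total = u.sum()                # total potentiality, Eq. (2)
    r = u / u.max()                # relative potentiality per weight, Eq. (3)
    r_bar = 1.0 - r                # complementary potentiality, Eq. (4)
    R = r.sum()                    # layer-wise relative potentiality, Eq. (5)
    return total, r, r_bar, R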
2.3 Repeated Learning
Learning is composed of two phases. In the first phase, we try to increase total potentiality and at the same time increase the relative potentiality. For the (n + 1)th learning step, weights are computed by

w_{jk}^{(2,3)}(n + 1) = \theta \, \bar{r}_{jk}^{(2,3)}(n) \, w_{jk}^{(2,3)}(n)   (6)
(2,3)
(2,3)
wjk (n + 1) = θ rjk (n) wjk (n)
(7)
As shown in Fig. 1, this process of reduction and augmentation is repeated several times in learning. 2.4
Full and Partial Compression
The interpretation in this paper tries to be as comprehensive as possible. For realizing this, we suppose that all components and all intermediate states have the same status and importance for interpretation. This is the equi-potentiality principle of interpretation, as mentioned above.
First, for interpreting multi-layered neural networks, we compress them into the simplest ones, as shown in Fig. 2(a). We trace all routes from inputs to the corresponding outputs by multiplying and summing all corresponding connection weights. We begin by compressing the connection weights from the first to the second layer, denoted by (1, 2), with those from the second to the third layer (2, 3), for an initial condition and a subset of a data set.
Fig. 2. Network compression from the initial state to the simplest and compressed network, and final collective weights by full compression (a) and partial compression (b).
Then, we have the compressed weights between the first and the third layer, denoted by (1, 3):

w_{ik}^{(1,3)} = \sum_{j=1}^{n_2} w_{ij}^{(1,2)} w_{jk}^{(2,3)}   (8)
Those compressed weights are further combined with weights from the third to the fourth layer (3, 4), and we have the compressed weights between the first and the fourth layer (1, 4).
w_{il}^{(1,4)} = \sum_{k=1}^{n_3} w_{ik}^{(1,3)} w_{kl}^{(3,4)}   (9)
By repeating these processes, we have the compressed weights between the first and fifth layer, denoted by w_{iq}^{(1,5)}. Using those connection weights, we have the final and fully compressed weights (1, 6):
w_{ir}^{(1,6)} = \sum_{q=1}^{n_5} w_{iq}^{(1,5)} w_{qr}^{(5,6)}   (10)
For the full compression, we compress all hidden layers as explained above. For the effect of a specific hidden layer, we compress a network only with one specific hidden layer, which can be called "partial" compression. By using this partial compression, we can examine what kind of features a hidden layer tries to deal with. In Fig. 2(b), we focus on the weights from the third to the fourth layer. We first combine the input and the hidden layer:

w_{il}^{(1,(3,4))} = \sum_{k=1}^{n_3} w_{ik}^{(1,2)} w_{kl}^{(3,4)}   (11)
where the notation (1,(3,4)) means that only weights from the third to the fourth layer are considered. Then, these compressed weights are combined immediately with the output layer:

w_{ir}^{(1,(3,4),6)} = \sum_{q=1}^{n_5} w_{iq}^{(1,(3,4))} w_{qr}^{(5,6)}   (12)
In this way, we can consider only the effect of a single hidden layer. In this paper, the number of neurons in all hidden layers is supposed to be the same, but this method can be applied to the case where the number of neurons is different for each hidden layer.
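As a minimal sketch (under the assumption, stated above, that all hidden layers have the same width), the full and partial compressions reduce to chained matrix products:

import numpy as np

def full_compression(weights):
    # weights: list of layer-to-layer weight matrices from input to output
    # Chain product implementing Eqs. (8)-(10)
    W = weights[0]
    for Wn in weights[1:]:
        W = W @ Wn
    return W

def partial_compression(weights, layer):
    # Combine the input weights with one chosen hidden-layer weight matrix
    # and the output weights only, as in Eqs. (11)-(12); valid because all
    # hidden layers are assumed to have the same number of neurons
    return weights[0] @ weights[layer] @ weights[-1]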
3 Results and Discussion
3.1 Experimental Outline
The experiment aimed to infer the possibility of bankruptcy of companies, based on six input variables [58]. In the social sciences, due to the inability and instability of interpretation, neural networks have been used only reluctantly in analytical processes. However, the final results of conventional statistical methods have limited the scope of interpretation, ironically due to the stability and reliability of those methods. On the contrary, neural networks are more unstable models compared with conventional statistical models, but this instability is related to flexibility of interpretation. Neural networks naturally depend much on the inputs and initial conditions, as well as on any other conditions. This seems to be a fatal shortcoming of neural networks, but it also means that neural networks have much potentiality to see an input from a number of different viewpoints. The only problem is that neural networks have so far tried to limit this potentiality as much as possible by using methods such as regularization. The present paper tries to
consider all these viewpoints by neural networks as much as possible for comprehensive interpretation. Thus, this experiment on bankruptcy may be simple, but it is a good benchmark to show the flexibility and potentiality of our method of considering as many conditions as possible for a practical problem.
The data set contained 130 companies, and we tried to classify whether or not the corresponding companies were in a state of bankruptcy. We used very redundant ten-layered neural networks with ten hidden layers. The potentiality of the hidden layers was controlled by our potentiality method, while the input and output layers remained untouched, because we had difficulty in controlling the input and output layers with the present method. For easy reproduction of the final results in this paper, we used the neural network package in the scikit-learn software package, where almost all parameters remained at their default values except for the activation function (changed to the tangent hyperbolic) and the number of learning epochs (determined by the potentiality method).
3.2 Potentiality Computation
The results show that when the parameter θ increased gradually, the total potentiality became larger, while oscillating greatly. The relative potentiality tended to be close to its maximum value by this effect. However, too-large parameter values caused extremely large potentiality.
In the first place, we show the final results of total potentiality and relative potentiality by the conventional method and the new methods. Figure 3 shows the total potentiality (left) and relative potentiality (right). By the conventional method in Fig. 3(a), the total potentiality (left) and relative potentiality (right) remained almost constant, independently of the learning steps. When the parameter θ was 1.0 in Fig. 3(b), the total potentiality (left) was small and tended to decrease as the number of learning steps increased. On the other hand, the relative potentiality (right) increased gradually as a function of the number of learning steps. However, the total potentiality could not come close to the maximum values. When the parameter θ increased to 1.1 in Fig. 3(c), the total potentiality was slightly larger, and the oscillation of the relative potentiality became clear, meaning that the relative potentiality tried to increase with an up-and-down movement. When the parameter θ increased to 1.2 in Fig. 3(d), the oscillation could be clearly seen in both the total potentiality and the relative potentiality. When the parameter θ increased to 1.3, with the best generalization in Table 1, the oscillation of total potentiality became the largest, and the total potentiality tended to increase. In addition, the relative potentiality, while oscillating, became close to the maximum value. When the parameter θ increased further to 1.4 (f) and 1.5 (g), the total potentiality tended to have extreme values, and the relative potentiality became close to the maximum values without oscillation.
3.3 Collective Weights
The results show that, when the parameter θ was relatively small, the collective weights and the ratio of the collective weights to the original correlation coefficients were similar to those by the conventional method. When the parameter θ was further increased to 1.3,
Fig. 3. Total potentiality (Left) and relative potentiality (Right) as a function of the number of steps by the conventional method (a), and by the new methods when the parameter θ increased from 1.0 (b) to 1.5 (g) for the bankruptcy data set. A boxed figure represents a state with maximum generalization.
the collective weights and the ratio became completely different from the original correlation coefficients and the ratio by the conventional method. This means that the neural networks tended to obtain weights close to the original correlation coefficients, but when generalization performance was forced to increase, completely different weights could be obtained. Figure 4(a) shows the collective weights (left) and ratio (middle) of absolute collective weights to the absolute correlation coefficients, and the original correlation coefficients between inputs and target (right) by using the conventional method. As can be seen in the left-hand figure, input No. 2 (capital adequacy ratio) and especially
No. 5 (sales C/F to current liabilities) were strongly negative. The ratio of the collective weights to the correlation coefficients (middle) shows that the strength increased gradually as the input number increased from 1 to 5. This means that input No. 5 played an important role both linearly and non-linearly. When the parameter θ was 1.0 in Fig. 4(b), almost the same tendencies could be seen in all measures except the ratio (middle). The ratio was different from that by the conventional method, where input No. 6 (sales per person) had larger strength. When the parameter θ increased to 1.1, this tendency was clearer, and input No. 6 had the largest strength. When the parameter θ was 1.2 in Fig. 4(d), the collective weights (left) were different from the correlation coefficients, but the ratio still tended to have the same characteristics. When the parameter θ increased to 1.3, with the best generalization, in Fig. 4(e), the collective weights were completely different from the original correlation coefficients, and input No. 3 (sales growth rates) took the highest strength. When the parameter θ increased to 1.4 in Fig. 4(f) and 1.5 in Fig. 4(g), all measures again became similar to those by the conventional method, though some differences could be detected in the ratio (middle).
The results show that neural networks tended to produce collective weights close to the original correlation coefficients. However, when the generalization was forced to increase by increasing the parameter, the collective weights became different from the correlation coefficients. Thus, neural networks could extract weights close to the correlation coefficients, and, in addition, the networks could produce weights different from the correlation coefficients, which tried to extract, roughly speaking, non-linear relations between inputs and outputs.
3.4 Partially Collective Weights
The partially collective weights show that the conventional method tended to use the hidden layers close to the input and output. On the contrary, the new methods tended to focus on the hidden layer close to the input or output layer. However, to obtain the best generalization performance, all hidden layers should be somehow used.
Figure 5 shows the collective weights computed by a single hidden layer. In the figure, from left to right and top to bottom, the collective weights were plotted by computing collective weights only with a single hidden layer. As shown in Fig. 5(a) by the conventional method, one of the most important characteristics is that the collective weights computed with the hidden layer close to the input (top left) and output layer (bottom right) were relatively larger, meaning that the hidden layers close to the input and output layer tended to have much information content. On the contrary, when the parameter θ was 1.0 in Fig. 5(b), only the collective weights with the hidden layer close to the output layer had larger strength. When the parameter θ increased to 1.5 in Fig. 5(d), only the collective weights with the hidden layer close to the input layer had larger strength. However, when the parameter θ increased to 1.3, with the best generalization performance, in Fig. 5(c), the collective weights for almost all hidden layers tended to have a somewhat large strength, meaning that to improve generalization performance, it is necessary to somehow use all possible hidden layers.
Fig. 4. The collective weights (Left), the ratio (Middle) of absolute collective weights to absolute correlation coefficients, and the original correlation coefficients (Right) by the conventional method (a) and when the parameter θ increased from 1.0 (b) to 1.5 (g) for the bankruptcy data set.
Fig. 5. Partially collective weights with only a single hidden layer for the bankruptcy data set.
Table 1. Summary of experimental results on average correlation coefficients and generalization performance for the bankruptcy data set. The numbers in the method column represent the values of the parameter θ used to control the potentialities. Bold type indicates the maximum values.

Method          Correlation   Accuracy
1.0              0.882        0.882
1.1              0.863        0.879
1.2             -0.510        0.900
1.3              0.362        0.908
1.4              0.788        0.897
1.5              0.879        0.862
Conventional     0.811        0.877
Logistic         0.979        0.813
Random forest   -0.905        0.779

3.5 Correlation and Generalization
The new method could produce the best generalization by using weights far from the original correlation coefficients, while conventional logistic regression naturally produced coefficients very close to the original correlation coefficients. Table 1 summarizes the results on correlation and generalization. The generalization accuracy increased from 0.882 (θ = 1.0) to the largest value of 0.908 (θ = 1.3). The correlation coefficient between the collective weights and the original correlation coefficients of the data set decreased from 0.882 (θ = 1.0) to 0.362 (θ = 1.3), and then the correlation increased again. When the generalization performance was the largest, with the parameter θ at 1.3, the strength of the correlation became the smallest. This means that, to improve generalization performance, we need to use some non-linear relations in a broad sense. The logistic regression analysis produced an almost perfect correlation of 0.979, but its generalization was the second worst at 0.813, with only the random forest performing worse. The random forest produced a high negative correlation, because its importance measure cannot take the sign of the inputs into account. Neural networks could improve generalization performance even when the weights represented linear relations between inputs and outputs. However, when the networks are forced to increase generalization, they try to use non-linear relations different from the correlation coefficients.
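As a minimal sketch of how the baseline correlations in Table 1 could be obtained (the bankruptcy data is not distributed with the paper, so X and y are hypothetical arrays of the six inputs and the bankruptcy labels, and the exact baseline settings are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Correlation of each input with the target (the "original correlation coefficients")
input_target_corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])

# Logistic regression coefficients track the input-target correlations closely
logit = LogisticRegression().fit(X, y)
corr_logit = np.corrcoef(logit.coef_[0], input_target_corr)[0, 1]

# Random forest importances are non-negative, so they cannot reflect the sign
# of an input's relation to the target, which can yield a negative correlation
forest = RandomForestClassifier().fit(X, y)
corr_forest = np.corrcoef(forest.feature_importances_, input_target_corr)[0, 1]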
4 Conclusion
The present paper showed the importance of equi-potentiality of components. The total potentiality can be defined in terms of the sum of absolute weights, and the relative potentiality can be defined with respect to the maximum potentiality. We suppose that all components in neural networks should be used as equally as possible, at least in the course of learning, for the final representations to consider as many situations as possible. However, this equi-potentiality tends to decrease in the course of learning, because
learning can be considered a process of transforming the potentiality into the real information to be used in actual learning. Thus, we need to restore the potentiality of components as much as possible at any time. Every time the potentiality decreases, we need to re-increase it so that all components have an equal chance to be chosen in learning. This consideration applies not only to the learning processes but also to the interpretation.
The same principle of equi-potentiality of representations can also be applied to interpretation. In the conventional interpretation methods, a specific and local interpretation is applied, even though we have a number of different representations. In our interpretation method, all those representations have the same status and importance, and they should be taken into account for the interpretation to be as comprehensive as possible.
The method was applied to the bankruptcy data set. By repeating and augmenting the potentiality, we could produce results with different generalization and correlation. By examining all those instances, we could infer the basic inference mechanism for bankruptcy. The results confirmed that the weights were quite close to the original correlation coefficients between inputs and targets when the potentiality was not sufficiently increased. Then, when the generalization was forced to increase by increasing the potentiality, collective weights far from the original correlation coefficients were obtained. This finding of better generalization can be attributed to the equal consideration of all representations created by neural networks.
One of the main problems is how to combine different computational procedures such as total and relative potentiality reduction and augmentation. We need to examine how to control those computational procedures to improve generalization and interpretation. Though some computational problems should be solved for practical application, the results on equi-potentiality can contribute to understanding the main inference mechanism of neural networks.
References
1. Rumelhart, D.E., Zipser, D.: Feature discovery by competitive learning. Cogn. Sci. 9, 75–112 (1985)
2. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995). https://doi.org/10.1007/978-3-642-97610-0
3. Himberg, J.: A SOM based cluster visualization and its application for false colouring. In: Proceedings of the International Joint Conference on Neural Networks, pp. 69–74 (2000)
4. Bogdan, M., Rosenstiel, W.: Detection of cluster in self-organizing maps for controlling a prostheses using nerve signals. In: 9th European Symposium on Artificial Neural Networks, ESANN 2001, Proceedings, pp. 131–136. D-Facto, Evere, Belgium (2001)
5. Yin, H.: ViSOM - a novel method for multivariate data projection and structure visualization. IEEE Trans. Neural Networks 13(1), 237–243 (2002)
6. Brugger, D., Bogdan, M., Rosenstiel, W.: Automatic cluster detection in Kohonen's SOM. IEEE Trans. Neural Networks 19(3), 442–459 (2008)
7. Xu, L., Xu, Y., Chow, T.W.S.: PolSOM: a new method for multidimensional data visualization. Pattern Recogn. 43(4), 1668–1675 (2010)
8. Xu, L., Xu, Y., Chow, T.W.S.: PolSOM - a new method for multidimensional data visualization. Pattern Recogn. 43, 1668–1675 (2010)
9. Linsker, R.: Self-organization in a perceptual network. Computer 21(3), 105–117 (1988)
10. DeSieno, D.: Adding a conscience to competitive learning. In: IEEE International Conference on Neural Networks, vol. 1, pp. 117–124. Institute of Electrical and Electronics Engineers, New York (1988)
11. Fritzke, B.: Vector quantization with a growing and splitting elastic net. In: Gielen, S., Kappen, B. (eds.) ICANN 1993, pp. 580–585. Springer, London (1993). https://doi.org/10.1007/978-1-4471-2063-6_161
12. Fritzke, B.: Automatic construction of radial basis function networks with the growing neural gas model and its relevance for fuzzy logic. In: Applied Computing 1996: Proceedings of the 1996 ACM Symposium on Applied Computing, Philadelphia, pp. 624–627. ACM (1996)
13. Choy, C.S., Siu, W.: A class of competitive learning models which avoids neuron underutilization problem. IEEE Trans. Neural Networks 9(6), 1258–1269 (1998)
14. Van Hulle, M.M.: Faithful representations with topographic maps. Neural Netw. 12(6), 803–823 (1999)
15. Banerjee, A., Ghosh, J.: Frequency-sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres. IEEE Trans. Neural Networks 15(3), 702–719 (2004)
16. Van Hulle, M.M.: Entropy-based kernel modeling for topographic map formation. IEEE Trans. Neural Networks 15(4), 850–858 (2004)
17. Linsker, R.: Self-organization in a perceptual network. Computer 21, 105–117 (1988)
18. Linsker, R.: How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comput. 1(3), 402–411 (1989)
19. Linsker, R.: Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comput. 4(5), 691–702 (1992)
20. Torkkola, K.: Feature extraction by non-parametric mutual information maximization. J. Mach. Learn. Res. 3, 1415–1438 (2003)
21. Leiva-Murillo, J.M., Artés-Rodríguez, A.: Maximization of mutual information for supervised linear feature extraction. IEEE Trans. Neural Networks 18(5), 1433–1441 (2007)
22. Van Hulle, M.M.: The formation of topographic maps that maximize the average mutual information of the output responses to noiseless input signals. Neural Comput. 9(3), 595–606 (1997)
23. Principe, J.C.: Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Springer, New York (2010). https://doi.org/10.1007/978-1-4419-1570-2
24. Moody, J., Hanson, S., Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. Adv. Neural. Inf. Process. Syst. 4, 950–957 (1995)
25. Kukačka, J., Golkov, V., Cremers, D.: Regularization for deep learning: a taxonomy. arXiv preprint arXiv:1710.10686 (2017)
26. Goodfellow, I., Bengio, Y., Courville, A.: Regularization for deep learning. Deep Learn. 216–261 (2016)
27. Wu, C., Gales, M.J.F., Ragni, A., Karanasou, P., Sim, K.C.: Improving interpretability and regularization in deep learning. IEEE/ACM Trans. Audio Speech Language Process. 26(2), 256–265 (2017)
28. Fan, F.-L., Xiong, J., Li, M., Wang, G.: On interpretability of artificial neural networks: a survey. IEEE Trans. Radiat. Plasma Med. Sci. 5, 741–760 (2021)
29. Ma, X., et al.: Sanity checks for lottery tickets: does your winning ticket really win the jackpot? In: Advances in Neural Information Processing Systems, vol. 34 (2021)
30. Bai, Y., Wang, H., Tao, Z., Li, K., Fu, Y.: Dual lottery ticket hypothesis. arXiv preprint arXiv:2203.04248 (2022)
31. da Cunha, A., Natale, E., Viennot, L.: Proving the strong lottery ticket hypothesis for convolutional neural networks. In: International Conference on Learning Representations (2022)
32. Chen, X., Cheng, Y., Wang, S., Gan, Z., Liu, J., Wang, Z.: The elastic lottery ticket hypothesis. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
Repeated Potentiality Augmentation
133
33. Malach, E., Yehudai, G., Shalev-Schwartz, S., Shamir, O.: Proving the lottery ticket hypothesis: pruning is all you need. In: International Conference on Machine Learning, pp. 6682– 6691. PMLR (2020) 34. Frankle, J., Dziugaite, G.K., Roy, D., Carbin, M.: Linear mode connectivity and the lottery ticket hypothesis. In: International Conference on Machine Learning, pp. 3259–3269. PMLR (2020) 35. Frankle, J., Dziugaite, G.K., Roy, D.M., Carbin, M.: Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611 (2019) 36. Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018) 37. Goodman, B., Flaxman, S.: European union regulations on algorithmic decision-making and a right to explanation. arXiv preprint arXiv:1606.08813 (2016) 38. Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020) 39. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019) 40. Rai, A.: Explainable AI: from black box to glass box. J. Acad. Mark. Sci. 48(1), 137–141 (2020) 41. Weidele, D.K.I., et al.: opening the blackbox of automated artificial intelligence with conditional parallel coordinates. In: Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 308–312 (2020) 42. Pintelas, E., Livieris, I.E., Pintelas, P.: A grey-box ensemble model exploiting black-box accuracy and white-box intrinsic interpretability. Algorithms 13(1), 17 (2020) 43. Nguyen, A., Yosinski, J., Clune, J.: Understanding neural networks via feature visualization: a survey. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., M¨uller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol. 11700, pp. 55–76. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6 4 44. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. University of Montreal, 1341 (2009) 45. Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J.: Plug & play generative networks: Conditional iterative generation of images in latent space. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477 (2017) 46. Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5188–5196 (2015) 47. Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in Neural Information Processing Systems, pp. 3387–3395 (2016) 48. van den Oord,A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 (2016) 49. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013) 50. Khan, J., et al.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7(6), 673–679 (2001) ˜ zller, K.-R.: How 51. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., MAˇ to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (2010) 52. 
Smilkov, D., Thorat, N., Kim, B., Vi´egas, F., Wattenberg, M.: Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017) 53. Sundararajan, M., Taly, A., Yan., Q.: Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365 (2017)
134
R. Kamimura
54. Lapuschkin, S., Binder, A., Montavon, G., Muller, K.-R., Samek, W.: Analyzing classifiers: fisher vectors and deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2912–2920 (2016) 55. Arbabzadah, F., Montavon, G., M¨uller, K.-R., Samek, W.: Identifying individual facial expressions by deconstructing a neural network. In: Rosenhahn, B., Andres, B. (eds.) GCPR 2016. LNCS, vol. 9796, pp. 344–354. Springer, Cham (2016). https://doi.org/10.1007/9783-319-45886-1 28 56. Sturm, I., Lapuschkin, S., Samek, W., M¨uller, K.-R.: Interpretable deep neural networks for single-trial EEG classification. J. Neurosci. Methods 274, 141–145 (2016) 57. Binder, A., Montavon, G., Lapuschkin, S., M¨uller, K.-R., Samek, W.: Layer-Wise Relevance Propagation for Neural Networks with Local Renormalization Layers. In: Villa, A.E.P., Masulli, P., Pons Rivero, A.J. (eds.) ICANN 2016. LNCS, vol. 9887, pp. 63–71. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44781-0 8 58. Shimizu, K.: Multivariate Analysis (2009). (in Japanese), Nikkan Kogyo Shinbun
SGAS-es: Avoiding Performance Collapse by Sequential Greedy Architecture Search with the Early Stopping Indicator

Shih-Ping Lin and Sheng-De Wang(B)

National Taiwan University, Taipei 106319, Taiwan
{r09921057,sdwang}@ntu.edu.tw

Abstract. Sequential Greedy Architecture Search (SGAS) reduces the discretization loss of Differentiable Architecture Search (DARTS). However, we observed that SGAS may lead to unstable searched results, as DARTS does. We refer to this problem as the cascade performance collapse issue. Therefore, we propose Sequential Greedy Architecture Search with the Early Stopping Indicator (SGAS-es). We adopt an early stopping mechanism in each phase of SGAS to stabilize searched results and further improve the searching ability. The early stopping mechanism is based on the relation among Flat Minima, the largest eigenvalue of the Hessian matrix of the loss function, and performance collapse. We devise a mathematical derivation to show the relation between Flat Minima and the largest eigenvalue. The moving-averaged largest eigenvalue is used as an early stopping indicator. Finally, we use NAS-Bench-201 and Fashion-MNIST to confirm the performance and stability of SGAS-es. Moreover, we use EMNIST-Balanced to verify the transferability of searched results. These experiments show that SGAS-es is a robust method that can derive architectures with good performance and transferability.

Keywords: Neural architecture search · Differentiable architecture search · Sequential greedy architecture search · Flat minima · Early stopping · Image classification · Deep learning
1 Introduction
Differentiable Architecture Search (DARTS) [1] vastly reduced the resource requirement of Neural Architecture Search (NAS): it uses less than 5 GPU days and achieves competitive performance on CIFAR-10, PTB, and ImageNet with only a single 1080Ti.

1.1 Problems of DARTS and SGAS
However, numerous studies have pointed out that DARTS does not work well. Yang et al. [2] compared several DARTS-based approaches with a Random Sampling baseline and found that Random Sampling often finds better architectures.
Zela et al. [3] delved into this issue and showed that DARTS often derives architectures full of skip connections. This leads to performance collapse and makes the results of DARTS unstable. Xie et al. [4] attributed these issues to the optimization gap between the encode/decode scheme of DARTS and the original NAS. To solve these problems, P-DARTS [5] and DARTS+ [6] constrain the number of skip connections manually, but this may suffer from human bias. SmoothDARTS [7] adds perturbation to stabilize DARTS, which may mislead the optimization direction. PC-DARTS [8] uses a sampling technique called partial channel connections to forward only part of the channels through the operation mixture, which regularizes the effect of skip connections at the early stage. However, if we increase the total training epochs from 50 to 200, 400, or even larger, the search still ends up with an architecture full of skip connections. Liang et al. [6] called this "the implicit early-stopping scheme".

Performance Collapse Issue. Fig. 1 shows a cell derived by DARTS on NAS-Bench-201 [9] with the CIFAR-10 dataset. It is full of skip connections. The skip connection is an identity mapping without any parameters. Therefore, the searched cell leads to a terrible performance of only about 60% accuracy on the test set, whereas the best cell can reach 94.37% accuracy. This phenomenon is called performance collapse. One possible explanation of performance collapse is overfitting [3,6]. Since DARTS encodes cells into over-parameterized supergraphs (Fig. 3), operations between every two nodes have different numbers of parameters. That is, operations with parameters, like convolution, are trained together with operations without parameters, like skip connection and Zero Operation. In the early epochs this is fine because every operation is underfitting and improves together. However, as training proceeds, the operations with parameters start to overfit, so the α values of operations without parameters become larger and larger. This is irreversible, and all edges end up being selected as skip connections. This issue cannot be observed from the validation error of the supernet, so another value is required as the early stopping indicator.

Fig. 1. Example of performance collapse

Fig. 2. Example of cascade performance collapse

Cascade Performance Collapse Issue. SGAS [10] is a DARTS-based method. The main purpose of SGAS is to reduce the discretization loss of DARTS. It splits the whole searching process of DARTS into multiple phases. However, we observed that SGAS still suffers from the instability issue. Figure 2 is an example: we used SGAS on NAS-Bench-201 with the CIFAR-10 dataset, and SGAS ended up selecting four edges with the skip connection. That is, performance collapse occurred at nearly every phase of SGAS. This leads to an architecture with a test accuracy of 88.51%. Although this is already better than DARTS, there is still room for improvement. We call this phenomenon cascade performance collapse.

1.2 Research Purposes and Main Contributions
Our research aims to resolve the instability issue of SGAS. Therefore, the proposed method, SGAS-es, adopts the early stopping indicator proposed by Zela et al. [3] for each phase of SGAS. After doing so, we can stabilize the performance of derived architectures and improve the search ability of SGAS. In this paper, our main contributions are as follows:

– Mathematical Derivation: with the sharpness definition proposed by Keskar et al. [11], we show the relation between Flat Minima of a loss landscape and the largest eigenvalue of the Hessian matrix of the loss function.
– Novel Algorithm: we point out that SGAS suffers from the cascade performance collapse issue and propose SGAS-es, Sequential Greedy Architecture Search with the Early Stopping Indicator, to stabilize architecture search results and derive architectures with better learning ability.
– Thorough Experiments: we use NAS-Bench-201 [9], a unified NAS benchmark with three different datasets (CIFAR-10, CIFAR-100, and ImageNet-16-120), to compare SGAS-es with other DARTS-based approaches. We achieve state-of-the-art results on all of them with robust performance. After that, we use the DARTS CNN Search Space with the Fashion-MNIST dataset and show that SGAS-es also works on a more complex search space. Finally, by retraining searched cells derived from the Fashion-MNIST dataset on the EMNIST-Balanced dataset, we also achieve state-of-the-art performance and show that architectures searched by SGAS-es have excellent transferability. These experiments confirm that SGAS-es is a generalized approach that can avoid stepping into performance collapse and obtain suitable architectures.
Fig. 3. Encoded Cell Structure. Each Node x(i) is a Latent Representation. Each Edge o(i,j) is an Operation Chosen from Candidate Operations
2 Prior Knowledge

2.1 Neural Architecture Search (NAS)
Generally, NAS can be decomposed into three parts [12]: Search Space, Search Strategy, and Performance Estimation Strategy. Search Space is a collection of possible architectures defined by human experts. Since Search Space often contains a great number of elements, Search Strategy is required to pick a potentially well-performing architecture from Search Space. After picking an architecture, Performance Estimation Strategy tells Search Strategy how the architecture performs. Then Search Strategy can start the next search round.

2.2 Differentiable Architecture Search (DARTS)
DARTS [1] Search Space is a cell-based search space. Cells are stacked in a fixed pattern to form the supernet architecture. DARTS only determines the architecture inside cells; the supernet architecture is predefined. There are two important steps in DARTS: Continuous Relaxation and Discretization.

Continuous Relaxation encodes the discrete search space via architecture parameters α. DARTS connects all candidate operations between every two nodes and forms a supergraph like Fig. 3, so the transformation between nodes x^(i) and x^(j) becomes the mixed operation ō^(i,j)(x):

$$\bar{o}^{(i,j)}(x) = \sum_{o \in O} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i,j)})}\, o(x) \tag{1}$$

O is the set of candidate operations, and α_o^(i,j) is the architecture parameter at edge (i, j) corresponding to operation o. In other words, ō^(i,j)(x) is a weighted sum of the o(x), weighted by the Softmax of α_o^(i,j) on edge (i, j).
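As a rough illustration only (not the authors' implementation), the continuous relaxation of one edge can be sketched in PyTorch as a softmax-weighted mixture of candidate operations; the candidate list and module names below are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted mixture of candidate operations on one edge (Eq. 1)."""

    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical candidate set; a real search space defines its own.
        self.ops = nn.ModuleList([
            nn.Identity(),                                 # skip_connect
            nn.MaxPool2d(3, stride=1, padding=1),          # max_pool_3x3
            nn.Conv2d(channels, channels, 3, padding=1),   # conv_3x3
        ])
        # One architecture parameter alpha per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)             # softmax over alpha
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Usage: one edge of the supergraph on a dummy feature map.
edge = MixedOp(channels=16)
out = edge(torch.randn(2, 16, 8, 8))
print(out.shape)  # torch.Size([2, 16, 8, 8])
```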
Discretization decodes the supergraph to obtain the final searched cells. Each node in the final searched cells can only have an in-degree equal to 2, so Discretization can also be viewed as a pruning step. The basic rule of pruning is that a larger α_o^(i,j) indicates that operation o is more important on edge (i, j); such an o therefore has a higher priority to be picked:

$$o^{(i,j)} = Disc(\alpha_o^{(i,j)}) = \arg\max_{o \in O} \alpha_o^{(i,j)} \tag{2}$$
The goal of DARTS is given in (3). It is a bi-level optimization problem. In the upper-level task, given w*(α), we want to find the best architecture parameters α* that minimize the loss computed on D_valid. In the lower-level task, given α, we want to find the best network parameters w*(α) that minimize the loss computed on D_train.

$$\alpha^* = \arg\min_{\alpha} L(D_{valid}, w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} L(D_{train}, w, \alpha) \tag{3}$$
The whole process of a DARTS-based architecture search method can be divided into two parts: the searching stage and the retraining stage. The goal of the searching stage is to derive the best normal and reduction cells. In the retraining stage, we stack the searched cells derived from the searching stage to form another supernet and train it from scratch.

2.3 Sequential Greedy Architecture Search (SGAS)
SGAS [10] is a DARTS-based method that reduces discretization loss. The significant difference between SGAS and DARTS lies in the Search Strategy. Rather than discretizing at the end of the searching stage, SGAS picks an edge and fixes the operation of that edge every fixed number of epochs. SGAS reduces the discrepancy between the rank of validation errors in the searching stage and the rank of test errors in the retraining stage; therefore, the Kendall rank correlation coefficient between them is closer to 1 than for DARTS. In this paper, each phase in SGAS means the search epochs between two decision epochs. For example, if eight edges need to be selected in a cell, there will be nine phases in the searching stage of SGAS.

2.4 The Relation Among Flat Minima, λ_α^max, and Performance Collapse
Figure 4 shows the relation among these three. This section explains the first and second relations following [3]. In the Approach part, we show the mathematical derivation of the third relation.
Fig. 4. The relation among flat minima, λ_α^max, and performance collapse

The Relation Between λ_α^max and Performance Collapse. Zela et al. [3] studied the correlation between λ_α^max and the test error of derived architectures. λ_α^max is the largest eigenvalue of ∇²_α L(D_valid, w*(α), α). It is calculated on the supernet in the searching stage with a randomly sampled mini-batch of D_valid. After discretization, the test error is calculated on the supernet once retraining is finished. Zela et al. [3] used four search spaces (S1-S4) with three datasets (12 benchmarks) to provide detailed experimental results. They used 24 different architectures and made a scatter plot on each benchmark to show a strong correlation between λ_α^max and test error.
The Relation Between Performance Collapse and Flat Minima. Figure 5 shows that the performance drop after discretization at Sharp Minima is larger than that at Flat Minima. This leads to performance collapse. Besides, this can also explain why the Kendall rank correlation coefficient is far from 1 in DARTS. To verify this explanation, Zela et al. [3] also plotted the correlation between λ_α^max and the validation error drop after discretization, L(D_valid, w*(α_Disc), α_Disc) − L(D_valid, w*(α*), α*), and showed that they are indeed strongly correlated.
Fig. 5. The role of flat and sharp minima of α in DARTS
3 Approach
3.1 SGAS-es Overview
The main purpose of SGAS is to reduce the discretization loss between the searching and retraining stages. Here we provide another view of SGAS: it performs early stopping several times. SGAS makes an edge decision per fixed, user-defined decision frequency and fixes the operation of the chosen edge. This is a kind of early stopping. After fixing an operation, the loss landscape changes, so each phase can be viewed as an independent subproblem that requires several epochs to reach another local minimum.

If the decision frequency is too large, each independent subproblem may lead to performance collapse, which we call the cascade performance collapse problem. If the decision frequency is too small, each edge decision is made when the search process is not stable enough. This makes the performance of architectures searched with different random seeds not robust enough. Besides, performance collapse may happen at different epochs depending on the dataset, the search space, the phase in SGAS, or even the random seed.

To solve these problems, we propose SGAS-es, Sequential Greedy Architecture Search with the Early Stopping Indicator. The primary purpose is to maximize the search epochs of each phase in SGAS without stepping into performance collapse. Since the search epochs are as large as possible, we expect SGAS-es to have a better ability to search for architectures with good performance and to derive more stable results for each independent run. Besides, we tend to do the discretization in relatively flat places. This can further reduce the discretization loss and make the results more robust. There are four major functions in SGAS-es. We introduce them in the following four subsections.

3.2 Bilevel Optimization
To solve (3) by Gradient Descent, one has to calculate the total derivative of L(D_valid, w*(α), α). Liu et al. [1] used the Chain Rule, a One-step Update [14], and Finite Difference Approximation and obtained (4), where ε is a small scalar:

$$\frac{dL(D_{valid}, w^*(\alpha), \alpha)}{d\alpha} = \nabla_{\alpha} L(D_{valid}, w^*(\alpha), \alpha) - \xi\, \frac{\nabla_{\alpha} L(D_{train}, w^{+}, \alpha) - \nabla_{\alpha} L(D_{train}, w^{-}, \alpha)}{2\epsilon} \tag{4}$$
ξ is a user-defined learning rate. When ξ > 0, this is called "Second-order Approximation"; "First-order Approximation" corresponds to ξ = 0. Although Liu et al. [1] showed empirically that searching with First-order Approximation gives slightly worse results than Second-order Approximation, First-order Approximation is about twice as fast. Besides, many follow-up DARTS-based approaches such as P-DARTS [5], PC-DARTS [8], SGAS [10], and Fair DARTS [13] used First-order Approximation and reached state-of-the-art results with less searching time. As a result, we decided to use First-order Approximation.
Algorithm 1 performs search and validation for each epoch. We first update the architecture parameters α and the network weights w with the validation data D_valid and the train data D_train using Mini-batch Gradient Descent. Then we calculate the validation performance of the architecture for this epoch.
Algorithm 1: Search_valid_func
Input: D_train: the train data, D_valid: the validation data, arch: the architecture parameterized by α and w that we want to search, η_α, η_w: learning rates of α and w
Output: valid_performance: the validation performance of arch this epoch, arch
  for (batch_train, batch_valid) in (D_train, D_valid) do
    α ← α − η_α ∇_α L(batch_valid, w, α)
    w ← w − η_w ∇_w L(batch_train, w, α)
  end for
  Initialize valid_performance
  for batch_valid in D_valid do
    Update valid_performance by batch_valid
  end for
  return valid_performance, arch
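A first-order sketch of Algorithm 1 in PyTorch might look as follows; the model, data loaders, loss criterion, and the two optimizers are assumed to be supplied by the caller, and this is an illustration rather than the authors' implementation.

```python
import torch

def search_one_epoch(model, criterion, train_loader, valid_loader,
                     opt_w, opt_alpha, device="cpu"):
    """One epoch of first-order DARTS/SGAS-style search (Algorithm 1 sketch)."""
    model.train()
    for (x_tr, y_tr), (x_va, y_va) in zip(train_loader, valid_loader):
        x_tr, y_tr = x_tr.to(device), y_tr.to(device)
        x_va, y_va = x_va.to(device), y_va.to(device)

        # Update architecture parameters alpha on the validation batch.
        opt_alpha.zero_grad()
        criterion(model(x_va), y_va).backward()
        opt_alpha.step()

        # Update network weights w on the training batch.
        opt_w.zero_grad()
        criterion(model(x_tr), y_tr).backward()
        opt_w.step()

    # Evaluate validation accuracy of this epoch's supernet.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x_va, y_va in valid_loader:
            pred = model(x_va.to(device)).argmax(dim=1)
            correct += (pred == y_va.to(device)).sum().item()
            total += y_va.size(0)
    return correct / max(total, 1)
```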
3.3 Early Stopping Indicator
In this section, we first prove the third relation in Fig. 4. We adopt the sharpness definition proposed by Keskar et al. [11]. Given architecture parameters α* ∈ R^n, a loss function L, and a small constant ε, the sharpness of α* on a loss landscape can be defined as follows:

$$\phi(\alpha^*, L, \epsilon) = \Big(\max_{z \in C_\epsilon} L(\alpha^* + z)\Big) - L(\alpha^*) \tag{5}$$

Equation (6) defines C_ε. C_ε is a box, i.e., a collection of z vectors, where each element z_i of z is a delta value along the i-th dimension:

$$C_\epsilon = \{ z \in \mathbb{R}^n : -\epsilon \le z_i \le \epsilon \ \ \forall i \in \{1, 2, 3, \dots, n\} \} \tag{6}$$

The perturbed loss L(α* + z) can be approximated by a Taylor series, where g = ∇L(α*) and H = ∇²L(α*):

$$L(\alpha^* + z) \approx L(\alpha^*) + z^T g + \tfrac{1}{2} z^T H z \tag{7}$$
Assume that α* is a critical point:

$$L(\alpha^* + z) \approx L(\alpha^*) + \tfrac{1}{2} z^T H z \tag{8}$$

Substituting (8) into (5), the sharpness of α* becomes:

$$\phi(\alpha^*, L, \epsilon) \approx \tfrac{1}{2} \Big(\max_{z \in C_\epsilon} z^T H z\Big) \tag{9}$$

Since H is an n × n symmetric matrix, H has n eigenvectors {v_1, v_2, ..., v_n} forming an orthonormal basis of R^n such that:

$$z = \sum_{i=1}^{n} a_i v_i \quad \text{where} \quad \|z\|^2 = \sum_{i=1}^{n} a_i^2 \tag{10}$$

Substituting (10) into (9) gives:

$$\phi(\alpha^*, L, \epsilon) \approx \tfrac{1}{2} \max_{z \in C_\epsilon} \Big(\sum_{i=1}^{n} a_i^2 \lambda_i\Big) \tag{11}$$

λ_i are the eigenvalues corresponding to v_i. Let λ_1 ≥ λ_2 ≥ λ_3 ≥ ... ≥ λ_n:

$$\sum_{i=1}^{n} a_i^2 \lambda_i \le \lambda_1 \|z\|^2 \tag{12}$$

Assume that ε is small enough such that the norms of all z are nearly the same:

$$\max_{z \in C_\epsilon} \Big(\sum_{i=1}^{n} a_i^2 \lambda_i\Big) = \lambda_1 \|z\|^2 \tag{13}$$

Finally, the sharpness of α* becomes:

$$\phi(\alpha^*, L, \epsilon) \approx \tfrac{1}{2} \lambda_1 \|z\|^2 \tag{14}$$
Therefore, the sharpness is related to the largest eigenvalue of ∇²L(α*). We denote the largest eigenvalue of ∇²_α L calculated via D_valid by λ_α^max. We have now shown all relations in Fig. 4: a larger λ_α^max leads to a larger test error in DARTS and indicates a sharper minimum, so λ_α^max can be viewed as a better indicator of performance collapse than the supernet validation error.

We adopt the early stopping indicator proposed by Zela et al. [3]. The goal is to prevent the λ_α^max value from exploding:

$$\frac{\bar{\lambda}_\alpha^{max}(i-k)}{\bar{\lambda}_\alpha^{max}(i)} < Threshold \tag{15}$$

λ̄_α^max(i) is the average λ_α^max from the current epoch i back to epoch i − w + 1, where w is the window size and k is a constant. To prevent the explosion of λ_α^max, if λ̄_α^max(i−k) divided by λ̄_α^max(i) is smaller than T, we stop the searching of this phase, return the search to epoch i − k, and make the edge decision.

Algorithm 2 is the early stopping indicator. We use a FIFO buffer window_ev with a size of w to store λ_α^max of this epoch and previous epochs. window_ev_avg is a dictionary playing the same role as λ̄_α^max(·) in (15). After storing the mean of the values in window_ev into window_ev_avg[epoch], we check whether the early stopping condition is met. If so, stop will be True.
S.-P. Lin and S.-D. Wang
Algorithm 2: Indicator Input : epoch: The current epoch, prev stop epoch: The epoch when the previous early stopping occurs, w, k, T , Dtrain , Dvalid , arch Output: stop: If doing the early stopping or not, stop epoch: The epoch to return 1 2 3 4 5 6 7 8 9 10 11 12
3.4
Calculate λmax using Dtrain , Dvalid , arch; α Push λmax into the back of window ev; α if the length of window ev > w then Pop a value from the front of window ev; end stop ← False; stop epoch ← epoch − k; window ev avg[epoch] ←Mean(window ev); ev avg[stop epoch] if stop epoch >= prev stop epoch and window < T then window ev avg[epoch] stop ←True; end return stop, stop epoch
Edge Decision Strategy
When searching meets the early stopping condition, the edge decision strategy is used to pick the edge to fix the operation of that edge. This kind of greedy decision can reduce the discretization loss compared to DARTS. Besides, we try to do the discretization at a flat minima each time, so we expect a lower discretization loss than the original SGAS. We followed the edge decision strategy proposed by Li et al. [10]. There are two major criteria: Non-Zero Operations Proportion and Edge Entropy. Non-Zero Operations Proportion. Zero Operation (or called None Operation) indicates the edge has no operation. For the edge with the less proportion (i,j) (i,j) of Zero Operation, i.e. has the less value of exp(αzero )/ o∈O exp(αo ) where (i,j) αzero is the architecture parameter of Zero Operation of edge (i, j), we assume that this edge is more important. Therefore, we used the proportion of Non-Zero Operations to evaluate the importance of the edge. The larger the EI (i,j) is, the more possible the edge (i, j) will be selected. EI (i,j) = 1 −
exp(α(i,j) zero ) o ∈O
(i,j)
exp(αo
(16)
)
(i,j)
Edge Entropy. Consider the probability mass function of αo is defined as follow: (i,j)
P (αo
)=
EI (i,j)
(i,j) exp(αo ) (i,j) , o exp(α ) o ∈O o
(i,j)
where P (αo
∈ O, o = zero
)
(17)
SGAS-es
145
Algorithm 3: Edge decision Input : arch Output: edge index: The index of the selected edge 1
5
Calculate EI of each edge (i, j) by (16); (i,j) (i,j) Calculate P (αo ) of every αo by (17); Calculate SC of each edge (i, j) by (18); Calculate Score of each edge (i, j) by (19); edge index ← arg max Score(i,j) ;
6
return edge index
2 3 4
(i,j)
Algorithm 4: Fix operation Input : batch size: The current batch size, batch increase: The value to increase the batch size each time, arch, edge index, Dtrain , Dvalid Output: new batch size: The new batch size after increasing, arch, Dtrain , Dvalid 1 2 3 4
Turn off the gradient calculation of α[edge index]; new batch size ← batch size + batch increase; Reload Dtrain and Dvalid with the new batch size; return new batch size, arch, Dtrain , Dvalid
The larger normalized Shannon entropy is, the more uncertainly the decision is made. Therefore, we used the complement of normalized Shannon entropy as the measure of the decision certainty of the given edge (i, j). SC (i,j) = 1 −
−
(i,j)
(i,j)
P (αo ) log(P (αo log(|O|−1)
o ∈O,o =zero
))
(18)
Finally, the total score of the edge (i, j) is defined as follows: Score(i,j) = normalize(EI (i,j) ) × normalize(SC (i,j) )
(19)
where normalize(·) means normalizing values of all edges in the single cell. Given the architecture, Algorithm 3 is for edge decision. It will calculate scores of all edges and pick the edge with the maximum score. 3.5
Fixing the Operation
After making the edge decision each time, we will fix the operation of the chosen edge. That is, (1) will be degenerated to: o¯(i,j) (x) = o(i,j) (x) where o(i,j) (x) is derived from (2).
(20)
146
S.-P. Lin and S.-D. Wang
Fig. 6. Illustration of SGAS-es
As shown in Algorithm 4, since the mixed operation becomes the fixed operation, we can turn off the gradient calculation of α of that edge. This can reduce memory usage. Therefore, we can increase the batch size each time after the edge decision. The larger batch size can stabilize and accelerate the search procedure. 3.6
Put-It-All-Together: SGAS-es
The whole algorithm of SGAS-es is shown in Algorithm 5. For each epoch, the Search valid func function will do the Forward and Back Propagation via First-order Approximation and return the validation performance. Second, the of this epoch and determine whether it is Indicator function will evaluate λmax α time to do Early Stopping. If it is time to do Early Stopping or accumulated epochs of this phase have already reached the default decision frequency, we have to replace the current architecture with the architecture corresponding to stop epoch, make the edge decision, and fix the operation of the edge chosen by Edge decision function. Finally, we will judge whether all edges have been chosen or not. If so, we will break the loop and end the searching procedure. If not, we will go to the next epoch and repeat the above-mentioned steps. Figure 6 illustrates the proposed search algorithm of SGAS-es with a sequence of an initial epoch followed by repeated decision epochs. Extra hyperparameters are required to control the early stopping indicator rather than SGAS. These hyperparameters are: window size w, constant k, and threshold T . One may question how to set these hyperparameters. We use the same hyperparameters throughout all experiments and show that with these settings, we can have robust and good results: df is 15, w is 3, k is 5, and T is 1.3.
SGAS-es
147
Algorithm 5: SGAS-es Input : m: The maximum number of epochs to search (= df × total edges), df : The default decision frequency, ηα , ηw , w, k, T , batch size, batch increase Output: arch 1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20
4 4.1
prev stop epoch ← 1; Initialize arch; Load Dtrain and Dvalid with the batch size; for epoch ← 1 to m do valid perf ormance, arch ← Search valid func(Dtrain , Dvalid , arch, ηα , ηw ); stop, stop epoch ← Indicator(epoch, prev stop epoch, w, k, T, Dtrain , Dvalid , arch); if stop or (epoch − prev stop epoch) is df then if stop then arch ← Load prev arch(stop epoch); epoch ← stop epoch; end prev stop epoch ← epoch; edge index ← Edge decision(arch); batch size, arch, Dtrain , Dvalid ← Fix operation(batch size, batch increase, arch, edge index); if all edges are selected then break; end end end return arch
Experiment NAS-Bench-201
NAS-Bench-201 [9] (or called NATS-Bench Topology Search Space [15]) fixed all factors of the searching and retraining pipe, such as hyperparameters, traintest split settings, data augmentation tricks, and the search space, to provide a fair comparison among all NAS algorithms. It includes three datasets: CIFAR10, CIFAR-100, and ImageNet-16-120 [16] which is the down-sampled version of ImageNet. They are all classic RGB image classification datasets with a different number of classes and samples. For more detailed settings, please refer to NASBench-201 paper.
148
Table 1. Results of SGAS-es on NAS-Bench-201 (Topology Search Space)

| Methods | CIFAR-10 Validation | CIFAR-10 Test | CIFAR-100 Validation | CIFAR-100 Test | ImageNet-16-120 Validation | ImageNet-16-120 Test |
|---|---|---|---|---|---|---|
| RSPS | 87.60 ± 0.61 | 91.05 ± 0.66 | 68.27 ± 0.72 | 68.26 ± 0.96 | 39.73 ± 0.34 | 40.69 ± 0.36 |
| DARTS (1st) | 49.27 ± 13.44 | 59.84 ± 7.84 | 61.08 ± 4.37 | 61.26 ± 4.43 | 38.07 ± 2.90 | 37.88 ± 2.91 |
| DARTS (2nd) | 58.78 ± 13.44 | 65.38 ± 7.84 | 59.48 ± 5.13 | 60.49 ± 4.95 | 37.56 ± 7.10 | 36.79 ± 7.59 |
| GDAS | 89.68 ± 0.72 | 93.23 ± 0.58 | 68.35 ± 2.71 | 68.17 ± 2.50 | 39.55 ± 0.00 | 39.40 ± 0.00 |
| SETN | 90.00 ± 0.97 | 92.72 ± 0.73 | 69.19 ± 1.42 | 69.36 ± 1.72 | 39.77 ± 0.33 | 39.51 ± 0.33 |
| ENAS | 90.20 ± 0.00 | 93.76 ± 0.00 | 70.21 ± 0.71 | 70.67 ± 0.62 | 40.78 ± 0.00 | 41.44 ± 0.00 |
| SGAS | 85.06 ± 0.00 | 88.33 ± 0.03 | 70.80 ± 0.65 | 70.76 ± 1.29 | 42.97 ± 20.53 | 43.04 ± 20.76 |
| SGAS-es | 91.40 ± 0.04 | 94.20 ± 0.02 | 73.18 ± 0.10 | 73.27 ± 0.03 | 45.61 ± 0.13 | 46.19 ± 0.04 |
| Optimal | 91.61 | 94.37 (94.37) | 73.49 | 73.51 (73.51) | 46.73 | 46.20 (47.31) |
Fig. 7. Searched cell of SGAS-es on NAS-Bench-201 with CIFAR-10
As shown in Table 1, SGAS-es reaches state-of-the-art performance on all datasets in NAS-Bench-201. Compared to the searched cells of DARTS (Fig. 1) and SGAS (Fig. 2), SGAS-es (Fig. 7) obtains searched cells without performance collapse or cascade performance collapse. Therefore, these searched cells are more learnable and lead to better accuracy. Besides, we ran the searching procedure three times with three different random seeds on each dataset and report the mean accuracy ± variance. The variances are close to 0, so we can obtain stable results with SGAS-es.

4.2 Fashion-MNIST Dataset
Fashion-MNIST [17] is also an image classification dataset with 70000 28 × 28 gray-scale images. These images are separated into ten classes such as dress, coat, sneaker, and T-shirt. There are 60000 images in the train set and 10000 in the test set. Compared to MNIST (the handwritten digits dataset [18]), FashionMNIST is more complex and, therefore, more discriminating. In this experiment, we use DARTS CNN Search Space [1]. However, we reduce the number of candidate operations from eight to five. According to StacNAS [19], these eight operations can be separated into four groups: skip connect; avg pool 3 × 3 and max pool 3 × 3; sep conv 3 × 3 and sep conv 5 × 5; dil conv 3 × 3 and dil conv 5 × 5.
Table 2. Results of SGAS-es on Fashion-MNIST

| Methods | Accuracy (%) | Params (MB) | Search method |
|---|---|---|---|
| WRN-28-10 + random erasing [21] | 95.92 | 37 | Manual |
| DeepCaps [22] | 94.46 | 7.2 | Manual |
| VGG8B [23] | 95.47 | 7.3 | Manual |
| DARTS (1st) + cutout + random erasing | 96.14 ± 0.04 | 2.25 | Gradient-based |
| DARTS (2nd) + cutout + random erasing | 96.14 ± 0.13 | 2.31 | Gradient-based |
| PC-DARTS + cutout + random erasing | 96.22 ± 0.06 | 2.81 | Gradient-based |
| SGAS + cutout + random erasing | 96.22 ± 0.24 | 3.47 | Gradient-based |
| SGAS-es + cutout + random erasing | 96.37 ± 0.07 | 3.34 | Gradient-based |
| (Best) SGAS-es + cutout + random erasing | 96.45 | 3.7 | Gradient-based |
Fig. 8. Normal Cells Searched by DARTS (1st ), DARTS (2nd ), SGAS, and SGAS-es on Fashion-MNIST
Operations in each group are correlated with each other. Therefore, the original DARTS CNN search space has the Multi-Collinearity problem, which may mislead the search result. Also, DLWAS [20] pointed out that in some cases, convolution with a 5 × 5 kernel size can be replaced by convolution with a 3 × 3 kernel size. Besides, the latter costs fewer parameters. As a result, we pruned candidate operations from eight to five to avoid the Multi-collinearity problem and reduce the sizes of derived models. This can also reduce memory usage while searching. Five candidate operations are: Zero Operation, skip connect, max pool 3 × 3, sep conv 3 × 3, and dil conv 3 × 3. For more detailed settings of this experiment, please refer to Appendix A. According to Table 2, DARTS-based methods can easily perform better than architectures designed by human experts. Besides, SGAS-es can reach the best accuracy with 96.45%. Like the NAS-Bench-201 experiment, we ran the searching procedure with three different random seeds and retrained these architectures to verify the stability of SGAS-es. Results are reported by mean accuracy ± standard deviation. As we can see, SGAS-es has excellent stability, which can robustly outperform other methods.
Table 3. Results of SGAS-es on EMNIST-Balanced

| Methods | Accuracy (%) | Params (MB) | Search method |
|---|---|---|---|
| WaveMix-128/7 [25] | 91.06 | 2.4 | Manual |
| VGG-5 (Spinal FC) [26] | 91.05 | 3.63 | Manual |
| TextCaps [27] | 90.46 | 5.87 | Manual |
| DARTS (1st) + random affine | 91.06 ± 0.11 | 2.25 | Gradient-based |
| DARTS (2nd) + random affine | 91.14 ± 0.2 | 2.31 | Gradient-based |
| SGAS + random affine | 91.12 ± 0.04 | 3.47 | Gradient-based |
| SGAS-es + random affine | 91.25 ± 0.11 | 3.34 | Gradient-based |
| (Best) SGAS-es + random affine | 91.36 | 3.03 | Gradient-based |
Finally, we drew the normal cells derived by DARTS (1st and 2nd order), SGAS, and SGAS-es (Fig. 8). Compared to DARTS, SGAS-es prevented performance collapse. Interestingly, although SGAS did not suffer cascade performance collapse in this experiment, it derived the cell with the worst performance compared to SGAS-es and even DARTS: only 95.95% test accuracy. We suppose this is due to the instability caused by making the edge decision of each phase while training is not yet stable. This is also reflected in the standard deviation of SGAS, 0.24, which is the largest among these methods.

4.3 EMNIST-Balanced Dataset
EMNIST [24] is the extension of MNIST with handwritten letters. Each sample is a 28 × 28 grayscale image. There are six kinds of split methods in EMNIST: Byclass, Bymerge, Letters, Digits, Balanced, and MNIST, so there are six benchmarks in EMNIST. Here, we use EMNIST-Balanced because our primary purpose is to compare SGAS-es with other DARTS-based algorithms, not to deal with the imbalanced dataset (EMNIST-Bymerge and EMNIST-Byclass). Besides, among balanced datasets (EMNIST-Letters, EMNIST-Digits, EMNIST-Balanced, and MNIST), EMNIST-Balanced is the most difficult: with only 131600 samples but 47 classes. The number of samples in the train set is 112800, and that in the test set is 18800. As reported in [25], it has the lowest state-of-the-art accuracy with 91.06%. Therefore, we use EMNIST-Balanced to evaluate the performance of DARTS-based algorithms further. In this experiment, we used cells derived from Fashion-MNIST and retrained them on EMNIST-Balanced from scratch. The primary purpose is to test the transferability of these cells. Liu et al. also did this experiment, deriving cells from an easier dataset and retraining them on a more difficult dataset, in DARTS paper [1]. Therefore, this experiment will only include the retraining stage. For more detailed settings of this experiment, please refer to Appendix B.
According to Table 3, we achieve state-of-the-art accuracy, 91.36%, on the EMNIST-Balanced dataset, which is higher than the previous state of the art, WaveMix-128/7, at 91.06%. Besides, the cells derived by SGAS-es have good transferability, with a mean accuracy of 91.25%, the best among these DARTS-based approaches. An interesting point is that the best cell of SGAS-es on Fashion-MNIST is not the best on EMNIST-Balanced. As reported in the DARTS paper [1], Liu et al. used the best cell searched on the smaller dataset and retrained it on the larger dataset; however, this may not give the best result. Making the ranking of cells on smaller datasets more consistent with their ranking on larger datasets is therefore an interesting research topic.
5 Conclusion
In summary, we proposed SGAS-es, Sequential Greedy Architecture Search with the Early Stopping Indicator, to solve the instability issue and improve the searching ability of SGAS. With SGAS-es, we achieved state-of-the-art results on NAS-Bench-201: mean test accuracy equal to 94.20%, 73.27%, and 46.19% on CIFAR-10, CIFAR-100, and ImageNet-16-120. These scores are better than all other DARTS-based methods reported on NAS-Bench-201 and close to the optimal results. Besides, these results are stable: variances equal to 0.02, 0.03, and 0.04. On more complex DARTS CNN Search Space, we showed that SGAS-es also works. On Fashion-MNIST, the best-searched architecture of SGAS-es reached superior test accuracy: 96.45%. The mean accuracy, 96.37%, is also better than other DARTS-based methods. Besides, the standard deviation, 0.07, indicates that SGAS-es is a stable method. To show the transferability of searched architectures of SGAS-es, we retrained them on EMNIST-Balanced. We achieved 91.36% test accuracy, which is a state-of-the-art result. With these experiments, we can confirm that SGAS-es is a robust method and can derive the architecture with good performance.
Appendix A: Fashion-MNIST Experiment Settings

In the searching stage, we use half of the images in the train set as training images and the other half as validation images. The data augmentation tricks are as follows: for training images, we first do a random crop (RandomCrop()) with a height and width of 32 and a padding of 4. Second, we do a random horizontal flip (RandomHorizontalFlip()). Third, we transform the inputs into tensors and normalize the image values between 0 and 1 (ToTensor()). Last, we further normalize (Normalize()) these values with the mean and standard deviation of the Fashion-MNIST training images. For validation images, we apply only the last two steps (ToTensor() and Normalize()).
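As a sketch of the two pipelines above (assuming the torchvision transforms implied by the names; the normalization statistics shown are placeholders for the Fashion-MNIST training mean and standard deviation), they could be composed as follows.

```python
from torchvision import transforms

# Placeholder statistics; the paper normalizes with the Fashion-MNIST
# training-set mean and standard deviation.
FMNIST_MEAN, FMNIST_STD = (0.2860,), (0.3530,)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(FMNIST_MEAN, FMNIST_STD),
])

valid_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(FMNIST_MEAN, FMNIST_STD),
])
```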
Hyperparameter settings of the searching stage are as follows: a batch size is 32 (128 for PC-DARTS since only 14 channels will do the mixed operation). A batch increase is 8. An SGD optimizer is used for network weights with an initial learning rate of 0.025, a minimum learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0003. A cosine annealing scheduler is used for a learning rate decay. An Adam optimizer is used for architecture parameters with a learning rate of 0.0003, a weight decay of 0.001, and beta values of 0.5 and 0.999. Besides, a gradient clipping is used with a max norm of 5. For supernet, the number of initial channels is 16, and the number of cells is 8. For SGAS-es, we set df , w, k, and T equal to 15, 3, 5, and 1.3. For other DARTS-based methods, a search epoch is set to 50. We use the whole train set to train the supernet for 600 epochs in the retraining stage. Data augmentation tricks used for the train set and the test set are the same as those for the train set and the validation set in the searching stage. However, we add two more tricks for the train set: a cutout with a length of 16 and a random erase. Hyperparameter settings of the retraining stage are as follows: a batch size is 72. An SGD optimizer has an initial learning rate of 0.025, a momentum of 0.9, and a weight decay of 0.0003. Not only the weight decay but also a drop path is used for regularization with a drop path probability of 0.2. The cosine annealing scheduler is also used for learning rate decay. The initial channel size is 36, and the number of cells is 20. An Auxiliary loss is used with an auxiliary weight of 0.4. Other DARTS-based methods (like SGAS and PC-DARTS) and original DARTS in Table 2 follow the same settings. They used similar settings in their papers too.
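A minimal sketch of the search-stage optimizer setup described above, assuming `weights` and `alphas` are the supernet's parameter groups, might look like this in PyTorch.

```python
import torch

def build_search_optimizers(weights, alphas, epochs=50):
    """Optimizer setup matching the reported search-stage hyperparameters
    (a sketch; `weights`/`alphas` are assumed parameter iterables)."""
    opt_w = torch.optim.SGD(weights, lr=0.025, momentum=0.9, weight_decay=3e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt_w, T_max=epochs, eta_min=0.001)
    opt_alpha = torch.optim.Adam(alphas, lr=3e-4, betas=(0.5, 0.999),
                                 weight_decay=1e-3)
    return opt_w, scheduler, opt_alpha
```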
Appendix B: EMNIST-Balanced Experiment Settings

Most experiment settings are the same as those of the retraining stage for Fashion-MNIST. The differences are as follows: we use the whole train set to train the supernet for 200 epochs, and the batch size for the SGD optimizer is 96. For each training image, we first resize (Resize()) it to 32 × 32. Second, we apply a random affine transform (RandomAffine()) with degrees of (-30, 30), a translate of (0.1, 0.1), a scale of (0.8, 1.2), and a shear of (-30, 30). Third, we transform the inputs into tensors and normalize the image values between 0 and 1 (ToTensor()). Last, we further normalize (Normalize()) the inputs with mean and standard deviation equal to 0.5. For each test image, we only do ToTensor() and Normalize().
References 1. Liu, H., Simonyan, K., Yang, Y.: ’Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018) 2. Yang, A., Esperan¸ca, P.M., Carlucci, F.M.: NAS evaluation is frustratingly hard. arXiv preprint arXiv:1912.12522 (2019) 3. Zela, A., Elsken, T., Saikia, T., Marrakchi, Y., Brox, T., Hutter, F.: Understanding and robustifying differentiable architecture search. In: International Conference on Learning Representations (2020). https://openreview.net/forum? id=H1gDNyrKDS 4. Xie, L., et al.: Weight-sharing neural architecture search: a battle to shrink the optimization gap. ACM Comput. Surv. (2022) 5. Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1294–1303 (2019) 6. Liang, H., et al.: Darts+: improved differentiable architecture search with early stopping, arXiv preprint arXiv:1909.06035 (2019) 7. Chen, X., Hsieh, C.-J.: Stabilizing differentiable architecture search via perturbation-based regularization. In: International Conference on Machine Learning, PMLR (2020) 8. Xu, Y., et al.: PC-DARTS: partial channel connections for memory-efficient architecture search. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=BJlS634tPr 9. Dong, X., Yang, Y.: Nas-bench-201: extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326 (2020) 10. Li, G., Qian, G., Delgadillo, I.C., M¨ uller, M., Thabet, A., Ghanem, B.: Sgas: sequential greedy architecture search. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 11. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On largebatch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016) 12. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(1), 1997–2017 (2019) 13. Chu, X., Zhou, T., Zhang, B., Li, J.: Fair DARTS: eliminating unfair advantages in differentiable architecture search. In: 16th Europoean Conference On Computer Vision (2020). https://arxiv.org/abs/1911.12126.pdf 14. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International conference on machine learning, PMLR (2017) 15. Dong, X., Liu, L., Musial, K., Gabrys, B.: NATS-bench: benchmarking NAS algorithms for architecture topology and size. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 44(7), 3634–3646 (2021) 16. Chrabaszcz, P., Loshchilov, I., Hutter, F.: A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819 (2017) 17. Xiao, H., Rasul, K., Vollgraf, R.: ’Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017) 18. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/ exdb/mnist/ (1998) 19. Guilin, L., Xing, Z., Zitong, W., Zhenguo, L., Tong, Z.: Stacnas: towards stable and consistent optimization for differentiable neural architecture search (2019)
20. Mao, Y., Zhong, G., Wang, Y., Deng, Z.: Differentiable light-weight architecture search. In: 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2021) 21. Zhong, Z., et al.: ’Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34. no. 07 (2020) 22. Rajasegaran, J., Jayasundara, V., Jayasekara, S., Jayasekara, H., Seneviratne, S., Rodrigo, R.: Deepcaps: going deeper with capsule networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10 725–10 733 (2019) 23. Nøkland, A., Eidnes, L.H.: Training neural networks with local error signals. In: International Conference on Machine Learning, pp. 4839–4850. PMLR (2019) 24. Cohen, G., Afshar, S., Tapson, J., Van Schaik, A.: Emnist: extending mnist to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE (2017) 25. Jeevan, P., Sethi, A.: WaveMix: resource-efficient token mixing for images. arXiv preprint arXiv:2203.03689 (2022) 26. Kabir, H., et al.: Spinalnet: deep neural network with gradual input. arXiv preprint arXiv:2007.03347 (2020) 27. Jayasundara, V., Jayasekara, S., Jayasekara, H., Rajasegaran, J., Seneviratne, S., Rodrigo, R.: Textcaps: handwritten character recognition with very small datasets. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 254–262. IEEE (2019)
Artificial Intelligence in Forensic Science

Nazneen Mansoor1 and Alexander Iliev1,2(B)

1 SRH University of Applied Sciences, Berlin, Germany
[email protected]
2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria
Abstract. Artificial intelligence is a rapidly evolving technology that is being used in a variety of industries. This report provides an overview of some artificial intelligence applications used in forensic science. It also describes recent research in various fields of forensics, and we implemented a model for a use case in digital forensics.

Keywords: Forensics · CNN · Transfer learning · Deep learning · ResNet50 · Deepfake detection · AI
1 Introduction Forensic science is the study of evidence or the use of scientific methods to investigate crimes. Forensic investigation involves extensive research that includes gathering evidence from various sources and combining it to reach logical conclusions. Data extraction from mysterious sources can be productive and compulsive, but dealing with massive amounts of data can be quite confusing. There has been a wide range of areas from which data has been produced, particularly in the case of forensics, from DNA and fingerprint analysis to anthropology. With the expanding growth of Artificial intelligence (AI) in all industries, scientists have done various pieces of research to understand how forensic science could benefit from the technologies in the field of AI. Artificial intelligence plays a significant role in forensics because it allows forensic investigators to automate their strategies and identify insights and information. The use of artificial intelligence technology improves the chances of detecting and investigating crimes. Artificial intelligence could assist forensic experts in effectively handling data and providing a systematic approach at various levels of the investigation. This could save forensic researchers a significant amount of time, giving them more time to work on other projects. AI can discover new things by integrating all of the unstructured data collected by investigators. Unlike traditional forensic identification, the use cases of Artificial intelligence are numerous in the field of forensic science. This report describes a list of Artificial intelligence applications which are used in forensic science, some of the recent studies conducted in this field, and the implementation of the model for an AI use case in digital forensics.
2 Related Works The drowning diagnosis in forensics is one of the most challenging tasks as the findings from the post-mortem image diagnosis is uncertain. Various studies have been focused on this area of research to address this issue. The researchers in [1, 2] have used a deep learning approach to classify the subjects into drowning or non-drowning cases by using post-mortem lung CT images. In [1], they used a computer-aided diagnosis (CAD) system which consists of a deep CNN (DCNN) model. DCNN is trained using a transfer learning technique based on AlexNet architecture. For the experiment, they used CT images of post-mortem lungs of around 280 cases in total which included 140 non-drowning cases (3863 images) and 140 drowning cases (3784 images) [1]. They could achieve better results in detecting drowning cases. In [2], the authors used the same dataset as in [1] and used a different deep learning approach based on the VGG16 transfer learning model. VGG16 model [2] showed better performance compared to AlexNet used in [1] for drowning detection. The automatic age estimation of human remains or living individuals is a vital research field in forensics. Age estimation using dental X-ray images [3] or MRI data [4] is commonly used in forensic identification, but the conventional methods do not yield good results. To enhance the performance, the researchers in [3] implemented a deep learning technique to evaluate the age from teeth X-ray images. They have used a dental dataset that consists of 27,957 labeled orthopantomogram subjects (16,383 for females and 11,574 for males). The accuracy of the age prediction of these subjects is verified using their ID card details. The authors have focused on different neural network elements that are useful for age estimation. This paper is relevant in forensic science for estimating age from panoramic radiograph images. In [4], the researchers proposed an automatic age estimation from multi-factorial MRI data of clavicles, hands, and teeth. The authors in [4] used a deep CNN model to predict age and the dataset consists of 322 subjects with the age range between 13 and 25 years. Gender estimation is another significant aspect of forensic identification, particularly in mass disaster situations. In [5], the authors proposed a model to estimate the accuracy of gender prediction from Cone Beam Computed Tomography (CBCT) scans. They used linear measurements of the maxillary sinus from CBCT scans and principal component analysis (PCA) was done to lower the dimensionality. With the development in the field of Artificial intelligence, face recognition is a significant research topic that can be beneficial in digital forensics. The researchers in [6] have implemented a deep learning system that can analyze image and video files retrieved from forensic evidence. The proposed model in [6] could detect faces or objects from the given images or videos. The model is trained using YOLOv5 algorithms which is a novel convolutional neural network (CNN) that can be used to detect faces or objects for real-time applications. From their study, it’s proven that algorithms trained using deep learning techniques are highly preferred and useful for forensic investigations. Most of the data extracted from digital forensics contain unstructured data like photos, text, and videos. Data extracted from text play a significant role in digital forensics. 
Natural language processing (NLP) is an interesting research topic in the field of AI which can be used to provide relevant information in forensics too. In [7], the researchers developed a pipeline to retrieve information from texts using NLP and built models like
named entity recognition (NER), and relation extraction (RE) in any language. From their experimental results, it’s proven that this solution enhances the performance of digital investigation applications. In [8] and [9], the researchers used CNN models to identify the Deepfake media in digital forensics.
3 Applications of AI in Forensic Science Artificial intelligence assists forensic scientists by providing proper judgment and techniques to have better results in various areas. It involves forensic anthropology by finding out the skeletal age or sex, dealing with a large amount of forensic data, associating specific components from the images, or discovering similarities in place, communication, and time. Some of the AI use cases in the field of forensic science are listed below. 3.1 Pattern Recognition Pattern recognition is one of the main applications of AI in forensic science which deals with identifying certain types of patterns within a massive amount of data [11]. It can include any pattern recognition such as images of a person, or place, forming sequences from a text like an email, or messages, and other audio patterns from sound files [8, 11]. This pattern matching is based on solid evidence, statistics, and probabilistic thinking. AI helps in providing better ideas for identifying the trends with complex data accurately and efficiently. It also helps the detectives to find the suspect by providing information about past criminal records. The methodology which was used for face image pattern recognition is shown in Fig. 1. 3.2 Data Analysis Digital forensics is a developing field that requires complex and large dataset computation and analysis [11]. In digital forensics, scientists manage to collect digital shreds of evidence from various networks or computers [13]. These pieces of evidence are useful in various areas of investigation. Artificial intelligence acts as an efficient tool in dealing with these large datasets [13]. With the help of AI, a meta-analysis of the data extracted from different sources can be conducted. This meta-data can be transformed into an understandable and simplified format within a short time [11]. 3.3 Knowledge Discovery Knowledge discovery and data mining are the other areas in which artificial intelligence is used. Knowledge discovery is the technique of deriving useful information and insights from data. Data mining includes AI, probabilistic methods, and statistical analysis, which can gather and analyze large data samples. Forensic scientists could use AI to investigate various crimes using this approach to discover patterns.
Fig. 1. Methodology for forensic face sketch recognition [16]
3.4 Statistical Evidence

In forensic science, strong statistical evidence is required to support arguments and narration [12]. AI helps build graphical models that can be used to support or disprove certain arguments, thereby enabling better decisions. There are various computational and mathematical tools in AI that help to build significant and statistically relevant evidence.

3.5 Providing Legal Solutions

The scientific methods provided by forensic statistics supply the legal system with the necessary evidence [12]. With more comprehensive and sophisticated information databases, artificial intelligence aids the legal community with better solutions.

3.6 Creating Repositories

With the increasing demand for storage capacity, forensic investigators find it difficult to store and analyze data related to forensic science [12]. These storage issues can be addressed by building online repositories using AI. Such repositories are useful for storing digital forensic data, properties, investigations, and results [13].

3.7 Enhance Communication Between Forensic Team Members

In a forensic investigation, it is necessary to maintain strong communication between forensic statisticians, criminal investigators, lawyers, and others [12]. Miscommunication between these teams could result in misinterpretation of data, which leads to wrong decisions or unfair justice. Artificial intelligence helps bridge the communication gap between the various teams in the forensic field.
4 Proposed Methodology

We have implemented a system to detect Deepfake images, which can be widely used in digital forensics. With the advancement of technology and the ease of creating fake content, media manipulation has become widespread in recent years [10]. These fake media contents are generated using advanced AI technologies such as deep learning algorithms and hence are known as "Deepfakes" [15]. As a result, it has become difficult to differentiate original media from fake media with the naked eye. Numerous Deepfake videos and images circulate across social media, and such manipulated content can be a threat to society and could lead to an increase in cybercrime. With the increasing amount of Deepfake content, there is a high demand for systems that can detect Deepfake videos or images, and artificial intelligence contributes efficiently to detecting such manipulated multimedia content [15]. The proposed system uses a neural network with a transfer learning technique based on the ResNet50 architecture to train our model.

4.1 Data Collection and Pre-processing

The dataset selected to train the model is the Celeb-DF dataset. It includes 590 celebrity videos and 300 additional videos downloaded from YouTube, as well as 5639 synthesized videos generated from the real celebrity videos. The videos are converted into frames using the cv2 Python library, and faces are cropped from the frames, yielding 19,457 images in total. For training the model we used 80% of the data, which is around 15,565 images, with 10% for validation and 10% for testing.
Fig. 2. Fake images [14]
Fig. 3. Real images [14]
Figures 2 and 3 show examples of the fake and real images that are part of our dataset. Pre-processing of the images is done using the ImageDataGenerator class from Keras.
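As a rough, hedged sketch of this pre-processing pipeline (the directory layout, frame sampling rate, and image size are assumptions, and the face-cropping step is omitted), the frames could be extracted with cv2 and streamed through Keras' ImageDataGenerator as follows:

```python
import os
import cv2
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def extract_frames(video_path, out_dir, every_nth=10):
    """Convert a video into individual frames, keeping every n-th frame."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_nth == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Rescale pixel values and stream the cropped face images from class folders
# (assumed layout: data/train/real, data/train/fake, and similarly for data/val).
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="binary")
val_gen = datagen.flow_from_directory(
    "data/val", target_size=(224, 224), batch_size=32, class_mode="binary")
```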
4.2 Training the Model

The model is developed using the ResNet50 architecture from the Keras applications module, with two additional dense layers and a dropout of 50%. The rectified linear unit (ReLU) is used as the activation function in the hidden layer and the sigmoid function in the output layer. The number of epochs for training is set to 60. For compiling the model, we opted for the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a learning rate of 0.001, and used binary cross-entropy as the loss function.
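A minimal sketch of this training setup in Keras is shown below; the width of the hidden dense layer and the input image size are not stated in the text and are therefore assumptions.

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models, optimizers

# ResNet50 backbone pre-trained on ImageNet, used here as a feature extractor.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(224, 224, 3))

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),   # assumed width of the hidden dense layer
    layers.Dropout(0.5),                    # 50% dropout as stated in the text
    layers.Dense(1, activation="sigmoid"),  # binary real/fake output
])

model.compile(
    optimizer=optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# model.fit(train_gen, validation_data=val_gen, epochs=60)
```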
5 Results

Accuracy is the performance metric used to evaluate the efficiency of our model. Two graphs were obtained that show the accuracy and loss during training: Fig. 4 illustrates the training and validation curves with the accuracy and loss of the trained model. The confusion matrix of the model, showing the actual and predicted numbers of fake and real images, is depicted in Fig. 5; from the confusion matrix, the test accuracy is evaluated as 92%. Figure 6 shows an example of a predicted class for an image using the model.
Fig. 4. Training and validation learning curves (accuracy and loss)
Fig. 5. Confusion matrix of the model
Fig. 6. Example of a predicted image
6 Conclusion

Artificial intelligence is rapidly becoming one of the most important applied sciences across fields. This paper summarizes several AI applications in forensics, from which it can be concluded that AI can help forensic experts and investigators reduce the time taken on various tasks and thereby improve their performance. Some of the AI applications prevalent in the field of forensic science are pattern recognition, handling large amounts of data, and providing legal solutions. Researchers have conducted various studies to understand how forensic science benefits from AI technologies in different use cases, such as drowning diagnosis from images and age and gender estimation. As part of our study,
we implemented a neural network model based on the ResNet50 architecture to detect Deepfake images, which could be useful in digital forensics. The proposed model can differentiate real and fake images with an accuracy of around 92%.
References
1. Homma, N., et al.: A deep learning aided drowning diagnosis for forensic investigations using post-mortem lung CT images. In: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 1262–1265 (2020). https://doi.org/10.1109/EMBC44109.2020.9175731
2. Qureshi, A.H., et al.: Deep CNN-based computer-aided diagnosis for drowning detection using post-mortem lungs CT images. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2309–2313 (2021). https://doi.org/10.1109/BIBM52615.2021.9669644
3. Hou, W., et al.: Exploring effective DNN models for forensic age estimation based on panoramic radiograph images. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2021). https://doi.org/10.1109/IJCNN52387.2021.9533672
4. Štern, D., Payer, C., Giuliani, N., Urschler, M.: Automatic age estimation and majority age classification from multi-factorial MRI data. IEEE J. Biomed. Health Inform. 23(4), 1392–1403 (2019). https://doi.org/10.1109/JBHI.2018.2869606
5. Al-Amodi, A., Kamel, I., Al-Rawi, N.H., Uthman, A., Shetty, S.: Accuracy of linear measurements of maxillary sinus dimensions in gender identification using machine learning. In: 2021 14th International Conference on Developments in eSystems Engineering (DeSE), pp. 407–412 (2021). https://doi.org/10.1109/DeSE54285.2021.9719421
6. Karakuş, S., Kaya, M., Tuncer, S.A., Bahşi, M.T., Açikoğlu, M.: A deep learning based fast face detection and recognition algorithm for forensic analysis. In: 2022 10th International Symposium on Digital Forensics and Security (ISDFS), pp. 1–6 (2022). https://doi.org/10.1109/ISDFS55398.2022.9800785
7. Rodrigues, F.B., Giozza, W.F., de Oliveira Albuquerque, R., García Villalba, L.J.: Natural language processing applied to forensics information extraction with transformers and graph visualization. IEEE Trans. Comput. Soc. Syst. https://doi.org/10.1109/TCSS.2022.3159677
8. Jadhav, E., Sankhla, M.S., Kumar, R.: Artificial intelligence: advancing automation in forensic science & criminal investigation. Seybold Rep. 15, 2064–2075 (2020)
9. Vamsi, V.V.V.N.S., et al.: Deepfake detection in digital media forensics. Glob. Trans. Proc. 3(1), 74–79 (2022). ISSN 2666-285X. https://doi.org/10.1016/j.gltp.2022.04.017
10. Jafar, M.T., Ababneh, M., Al-Zoube, M., Elhassan, A.: Forensics and analysis of deepfake videos. In: 2020 11th International Conference on Information and Communication Systems (ICICS), pp. 053–058 (2020). https://doi.org/10.1109/ICICS49469.2020.239493
11. News Medical Life Sciences: AI in Forensic Science. https://www.news-medical.net/life-sciences/AI-in-Forensic-Science.aspx. Accessed 30 June 2022
12. Mohsin, K.: Artificial intelligence in forensic science. SSRN Electron. J. (2021). https://doi.org/10.2139/ssrn.3910244
13. Gupta, S.: Artificial intelligence in forensic science. IRJET (2020). e-ISSN: 2395-0056
14. Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: a large-scale challenging dataset for DeepFake forensics. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020). Accessed 05 July 2022
15. Karandikar, A.: Deepfake video detection using convolutional neural network. Int. J. Adv. Trends Comput. Sci. Eng. 9, 1311–1315 (2020). https://doi.org/10.30534/ijatcse/2020/62922020
16. Srivastava, V.K.: Forensic face sketch recognition using computer vision (2013)
Deep Learning Based Approach for Human Intention Estimation in Lower-Back Exoskeleton Valeriya Zanina(B) , Gcinizwe Dlamini, and Vadim Palyonov Innopolis University, Innopolis, Russia {v.zanina,g.dlamini}@innopolis.university, [email protected]
Abstract. Reducing spinal loads using exoskeletons has become one of the preferred solutions for reducing compression of the lumbar spine. Medical research has shown that compression of the lumbar spine is a key risk factor for musculoskeletal injuries. In this paper we present a deep learning based approach aimed at increasing the universality of lower-back support for exoskeletons with an automatic control strategy. Our approach addresses the problem of recognizing human intentions in a lower-back exoskeleton using deep learning. To train and evaluate our deep learning model, we collected a dataset from wearable sensors, such as IMUs. Our deep learning model is a Long short-term memory neural network which forecasts the next values of 6 angles. The mean squared error and the coefficient of determination are used for evaluation of the model. On a dataset comprised of 700 samples, the model achieved a performance of 0.3 and 0.99 for MSE and R2, respectively.

Keywords: Deep learning · Robots · Exoskeleton · Human intention

1 Introduction
A person whose activity is related to body strain for long periods of time is subject to an increased risk of backache. In order to provide back support and help minimize the load on the lower back, wearable robotic devices for the spine have been developed, which lower the peak torque requirements around the lumbosacral (L5/S1) joint [11]. Manual work is still present in many industrial environments, and these devices can help to reduce the number of injuries in lifting tasks with heavy loads. Lower-back exoskeletons are created to support and increase the power of the human body or to reduce the load on the muscles when motion is limited. However, the requirements for exoskeletons are constantly increasing. Devices should not interfere [14, 15] with other tasks that are not related to lifting. Passive exoskeletons cannot address this challenge, but active ones can, because they are supplemented by actuators (electric motors, hydraulic actuators, etc.). With active exoskeletons it is important that the assistance
motion is generated based on the user's motion intention. Depending on the phase of body movements, the system has to understand when it is time to activate the motors. Thus, the control strategy of exoskeletons is currently a growing area of research. Solutions for predicting movement types during different activities could be based on models using an optimal control approach. In order to adapt an active exoskeleton to a person's back movements, a special methodology based on mechanisms with backbone-based kinematics can be adopted [19]. Human activity can be rigidly set by user output [23], or an algorithm for human activity recognition can be used. Several approaches to human intention estimation have been proposed over the years; however, research on how to predict movements and the load applied to the human body is limited. Our work focuses on predicting human intentions during lifting tasks through a deep learning approach. To assess human intentions, a suit based on inertial modules was created, data was collected and preprocessed, features were selected and extracted, and a recurrent neural network with LSTM units was constructed. The rest of the paper is structured as follows: Sect. 2 presents background and related work. Section 3 presents our proposed methodology and pipeline. Section 4 provides the details about our approach implementation and experiments. Section 5 presents the obtained results followed by discussion. Conclusion and future research directions are outlined in Sect. 6.
2 Related Work
Many adaptive control solutions still need to know the human intention and, as part of it, model human behavior. Thus, over the years there has been a growing research interest in the prediction of human behavior to help simplify the design process of exoskeletons and other wearable robotics. In addition to the human intention evaluation and recognition algorithms themselves, there is also the difficult task of collecting the evaluation data, because the model requires input data for training. Depending on the type of data generated, methods for recognizing human intentions can be divided into two main groups: sensor-based and vision-based ones. This section provides an overview of these two types and what other researchers have accomplished over the past years.

2.1 Vision-Based Methods
Artificial data generation can also be based on video. For example, Hyeokhyen et al. [8] suggested using videos from large-scale repositories to automatically generate data for virtual IMU sensors that can be used in real-world settings. This approach involves a number of techniques from computer vision, signal processing, and machine learning. The limitation of the research proposed by Hyeokhyen et al. [8] is that it works only with primitive activities in 2D space.
Data based on motion capture technology is actively used and studied in other works [12, 21] on the topic of predicting human intentions. The largest and most complete dataset is Human3.6M [5], but in our case this approach, as well as other methods based on data from videos and pictures, will not be effective. In conditions of hard work, usually in factories, it would be expensive to maintain a video surveillance system only for prediction. Moreover, cameras have blind spots, and any object can interfere with reliably determining the human intention. Despite this, computer vision-based systems for activity data collection in 3D space are also effective. In the research conducted by Yadav et al. [26], human body frames are acquired from a Kinect-v2 sensor, which is a depth sensor-based motion-sensing input device. The Kinect-v2 sensor offers a convenient way to record and capture the activity of human skeleton joints and tracks the 3D skeleton joint coordinates. In the same research [26], the 3D coordinates are used to construct a 3D bounding box over the tracked human and to extract suitable features for identifying different activities. The final dataset, which contains a total of 130,000 samples with 81 attribute values, is input to deep learning networks for activity recognition; a CNN-LSTM combination called a ConvLSTM network was used. Sumaira Ghazal in [4] used only 2D skeletal data from a camera. To recognize the activity, the main features were extracted from the positions of human skeletal joints in the video using the OpenPose system. This library extracts the locations of skeletal joints for all the persons in an image or a video frame, and a 3D array with information about the number of persons is the output of OpenPose. In light of the aforementioned studies, in this paper a lower-back exoskeleton is used together with a deep learning recurrent neural network to predict human intention and minimize the spine load when performing heavy tasks such as those in factories.

2.2 Data from Wearable Sensors
This group contains algorithms based on sensory data obtained from inertial sensors, such as accelerometers or gyroscopes, placed on certain body parts or from mobile phones [25]. The main motivation to use a smartphone as a wearable sensor is that such devices are portable, have substantial computing power, offer an open system for integrating applications that read data from the internal sensors, and are relatively cheap [22]. Wang et al. [24] presented a study focused on gait recognition for exoskeleton robots based on the DTW algorithm. The researchers [24] used plantar pressure sensors and joint angle sensors to collect gait data: the pressure sensors measured the pressure on the soles and heels of the feet, and the angle sensors measured the angles of the hip and knee joints. The obtained gait data have to be preprocessed to avoid noise, so the researchers [24] applied an S-G filter to smooth this information. Finally, the researchers [24] used the DTW algorithm for gait recognition, which is a simple and easy algorithm that requires limited
hardware. However, the problem with the standard approach is that the training method cannot effectively use statistical methods and the amount of computation is relatively large. That is why the authors improved it, which resulted in a certain enhancement in recognition rate and real-time performance and proved that humans and machines can be controlled in a coordinated manner. Roman Chereshnev [2] made a substantial contribution to data creation in the field of activity analysis and recognition as a result of his work on a human gait control system using machine learning methods. Chereshnev [2] collected data from a body sensor network consisting of six wearable inertial sensors (accelerometers and gyroscopes) located on the right and left thighs, shins, and feet. In addition, two electromyography sensors were placed on the quadriceps to measure muscle activity. Overall, 38 signals were collected from the inertial sensors and 2 from the EMG sensors. The main activities in Chereshnev's work [2] were walking and turning at various speeds on a flat surface, running at various paces, taking stairs up and down, standing, sitting, sitting in a car, standing up from a chair, and cycling. The resulting dataset, named HuGaDB, is publicly available to the research community. However, the dataset still contains no information for the lifting task. Other works have used HuGaDB for activity recognition [1]. For example, in [3] HuGaDB was used to evaluate the performance of a gait neural network, together with data collected by an inertial-based wearable motion capture device. The entire motion acquisition system consisted of seven inertial measurement units, but only the signals from the lower limbs were selected for human gait prediction [3]. For the basic model, two temporal convolutional networks were used: one temporal convolutional network (TCN) processes the original input; then the original input and the output of the first TCN are combined into the input of the second TCN; and finally, a recognition model and a prediction model are added to the network as two fully connected layers. The resulting model is used to predict accelerations and angular velocities. Sensor-based collection is a well-studied area, but all research in this field relates to gait or primitive activity recognition; there is practically no contribution to datasets for lifting tasks. The best method for this task is to use wearable sensors to collect the data. Since we are interested not only in recognition but also in assessing a person's intentions, a deep learning method was chosen to predict the subsequent intention in time based on the readings received from the sensors.
3 Methodology
This section describes an approach for estimating human intentions in the lower-back exoskeleton through wearable sensors and machine learning tools. The pipeline is presented in Fig. 1. No exoskeleton was involved in the data collection process because an exoskeleton-independent algorithm is planned. This section describes the methodology and development of a suit based on inertial modules. The costume is designed to collect data suitable
for training and testing different algorithms for the estimation of human intention. It also provides a description of the collected dataset itself, as well as the implementation of human intention recognition in lifting tasks to create an inference system appropriate for controlling torque in a lower-back exoskeleton using deep learning.
Fig. 1. Our proposed approach pipeline
3.1 Data Acquisition via Sensors
An active back exoskeleton should help a person lift a heavy load. To do this, it needs to know what the movement will be at the next moment and, depending on this, decide torques of which magnitude should be applied to the motors. Based on the literature review, we see that the task of recognizing activities is widely addressed with wearable sensors, but the proposed methods are relevant mainly for gait recognition. The most popular sensor is the electromyography (EMG) sensor, which can accurately indicate movement, but this approach is not applicable here since the sensors are attached directly to the human body: during the lifting task, a person may get tired and sweaty, which can make the sensors noisy and degrade the data quality. There are several motion capture suits on the market; for example, the XSens costume is used in [16]. However, such systems are usually redundant in sensors and closed in terms of access to the data for processing. Existing datasets are not applicable for lifting tasks, for example USC-HAD [27], HuGaDB [2], PAMAP2 [18], MAREA [6], and others. So, it was decided to design a costume for collecting experimental data in order to contribute to datasets for human activity recognition. IMU sensors were selected, which are easy to wear and allow the algorithm to be independent of the design of the lower-back exoskeleton.

3.2 Kinematic Data Analysis
Lifting activity is a state where the initial position is standing up and then turns into a squat-style tilt followed by straightening, during which it is necessary to engage the lower-back exoskeleton in order to compensate for the load on the spine. The arrangement of the sensors follows the anatomical behavior of the human body [11] in space during the lifting task. The angles used in the analysis to define the features are shown in Fig. 2.
Fig. 2. Angle definitions for the human body by Matthias B. Näf [11], where the blue dot is the lumbosacral joint, (a) knee angle, (b) hip angle, (c) lumbar angle, (d) trunk angle, (e) pelvis inclination. The position and orientation of (A) the lower leg, (B) the upper leg, (C) the pelvis, (D) the trunk
Based on the research conducted in [10], the human spine performs three main motions: flexion and extension in the sagittal plane, lateral bending, and axial rotation (see Fig. 3).
Fig. 3. The range of movements of the limb
In lifting tasks, flexion and extension in the sagittal plane is the most significant motion in terms of range of movement, and tracking this movement is a goal of this paper. Models of the human spine propose that the lumbar spine can be modeled as an additional joint contributing to flexion and extension, and from Fig. 2 we can see that the orientation of the pelvis also changes during the lifting task. This is the reason why the costume needs to be
designed with sensors that can track changes in trunk and pelvis orientation. It is impossible to keep sensors only on the back because each person has their own style of body movement: one can simply bend down in a stooping style and lift the load as in Fig. 4, but this motion can be traumatic for the body when picking up a heavy load.
Fig. 4. Stooping (a) and Squatting (b)
Another approach to lifting is when the human uses only the legs, through a squat, minimally tilting the lower back. Therefore, we consider both variants of the movement, so that the controller in the target device will be capable of handling either of these intentions, minimizing the load on the lumbosacral joint. Accordingly, the costume is designed as follows: 6 IMU sensors, 2 on the back and 2 on each of the left and right legs, symmetrically fixed with elastic bands. The locations of the sensors are presented in Fig. 5.

3.3 Experimental Protocol
For the experiments, the subjects had to stand, squat to lift the load, and then straighten up with the load, as demonstrated in Fig. 6. This sequence of data is recorded in the dataset. After the researcher starts the recording, the subject stays still for some time and is then instructed to perform the experimental exercise at their own tempo. We do not need to recognize the transition from one movement to another, such as standing up from the squat, because we want to know the intentions of a person and estimate the next few angles to control the motors of the exoskeleton. In this work, only the moment when the subject starts to move is considered. People do lifting tasks with different frequencies and different speeds; some bend more with the body, and some with their legs. During the experiments, it was noticed that it is impossible to lift a heavy load without tilting the back and legs at the same time, so the selected locations of the sensors and their quality can accurately describe the movement for a back-assistance exoskeleton.
Fig. 5. The scheme of the sensors' locations with marked axes on a human body, where a dot means that the axis comes out towards the observer and an x means that the axis goes away from the observer
3.4 Human Intention Estimation
The main task of an active exoskeleton is to repeat the movement of a person while strengthening it. However, the exoskeleton must understand what action a person intends to do in order to switch to control mode and apply exactly as much force as necessary to start performing the action. Therefore, the estimation of human intention is a necessary task that needs to be solved when designing the control logic of an active exoskeleton.

3.5 Motion Prediction for the Exoskeleton Control with Deep Learning Approach
To develop an exoskeleton control system, the estimation of intention can be divided into two machine learning tasks: classification and regression [17]. Classification assigns a certain type of activity to a certain category, for example standing, lifting a load, or walking, in order to recognize that a movement has begun and it is time to put the motors into active mode. An exoskeleton that provides help for the back should understand when it is necessary to start
Fig. 6. Experimental motion where (a) A person intends to lift the box through a squat, (b) A person is standing with a heavy load
assisting and when it should not interfere. The regression task allows predicting a person's future intentions in advance in order to control the magnitude of the torque of the exoskeleton's motors. The model will predict the subsequent angle of human movement based on the signals obtained from the IMU sensors.

3.6 Long Short-Term Memory
The Long short-term memory (LSTM) architecture has been acknowledged as successful in human activity recognition on smartphones [13]. In addition to the outer recurrence of the RNN, LSTM recurrent networks have LSTM cells with an internal recurrence. An LSTM cell has more parameters and a system of gating blocks that controls the flow of information, but in general it has the same inputs and outputs. The cells replace the hidden units of recurrent networks and are connected recurrently to each other. The key component of LSTM is the memory cell. The value of an input feature can be accumulated into the state if the sigmoidal input gate allows it, and the value itself is computed with a regular artificial neuron unit. The most important component is the state unit, which has a linear self-loop. Its weight is controlled by a forget gate unit f_i^{(t)}, where t is the time step and i is the cell index. The forget gate unit sets this weight to a value between 0 and 1 through a sigmoid unit:

f_i^{(t)} = \sigma\Big(b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)}\Big), \quad (1)

where x^{(t)} is the current input vector and h^{(t)} is the current hidden layer vector, containing the outputs of all the LSTM cells; b^f are the biases, U^f the input weights, and W^f the recurrent weights of the forget gates. The external input gate unit g_i^{(t)} is computed in the same way, but with its own parameters:

g_i^{(t)} = \sigma\Big(b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)}\Big). \quad (2)

The output gate q_i^{(t)} (Eq. 4) uses a sigmoid unit for gating, and through it the output h_i^{(t)} of the LSTM cell (Eq. 3) can be shut off:

h_i^{(t)} = \tanh\big(s_i^{(t)}\big)\, q_i^{(t)}, \quad (3)

q_i^{(t)} = \sigma\Big(b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)}\Big), \quad (4)

where s_i^{(t)} denotes the state unit of cell i.
To scale and normalize the input to the proposed LSTM model, we used a MinMax scaler: all features are transformed into the range between 0 and 1. After scaling, the data is passed to the deep learning model, which has one LSTM layer and one fully connected layer. The activation function is tanh, and the Adam optimizer [7] is used.
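A minimal sketch of this pipeline is given below, under the assumption that each training sample is a window of 15 time steps of the 6 rotation angles and the target is the next 5 steps (see Sect. 5); the LSTM width is an assumption, and the real implementation is available in the authors' repository.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import layers, models

WINDOW, HORIZON, N_ANGLES = 15, 5, 6  # 15 past steps -> next 5 steps of 6 angles

# X: (n_samples, WINDOW, N_ANGLES) rotation angles; a random placeholder stands in
# for the real dataset here. Each angle channel is scaled to [0, 1].
X = np.random.rand(700, WINDOW, N_ANGLES)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X.reshape(-1, N_ANGLES)).reshape(X.shape)

model = models.Sequential([
    layers.LSTM(64, activation="tanh", input_shape=(WINDOW, N_ANGLES)),  # width assumed
    layers.Dense(HORIZON * N_ANGLES),
    layers.Reshape((HORIZON, N_ANGLES)),
])
model.compile(optimizer="adam", loss="mse", metrics=["mse"])
# model.fit(X_scaled, y, validation_split=0.2, epochs=...)  # y: future angle windows
```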
4 Implementation
This section presents implementation details, starting from data extraction to deep learning model evaluation.

4.1 Hardware Implementation
The sensors are connected to each other through wires and communicate over the I2C protocol, with 2 sensors with different addresses on each bus, attached to an ESP32 microcontroller. The hardware implementation of the costume is presented in Fig. 7. Each IMU sensor includes a three-axis accelerometer, a gyroscope, and also a magnetometer. However, magnetometer data was not collected during the experiments: because of gravity, the accelerometer perceives the downward direction, so the orientation can be calculated without the magnetometer, and the magnetometer is not needed for calculating yaw from the Earth's magnetic field as long as the gravity reference does not change. In this task we need to find the orientation, which describes rotation relative to some coordinate system; thus the values we consider come only from two sensors, the accelerometer and the gyroscope. A typical accelerometer measures acceleration along the x, y, z axes in g or m/s2, while the gyroscope measures angular speeds around the x, y, z axes in degrees or radians per second. The ESP32 module was chosen because of its size and ease of programming, as well as its integrated Bluetooth and WiFi modules. During the experiments, data is transmitted through WiFi to a laptop in real time and processed locally.
Fig. 7. The hardware implementation of the costume for data collection
4.2 Data Format
The sensor data is sent once every 10 ms and recorded in a CSV file for future training of the model. There are approximately 14 samples in one second. The raw dataset is a sequence of values from 6 sensors. An example of the data from one IMU sensor on the right leg, located on the calf, is given in Table 1, where rb is the id of the sensor, a stands for a value from the accelerometer, g for the gyroscope, and x, y, z are the names of the axes.

Table 1. An example of a part of the dataset with raw data from the accelerometer and gyroscope located on the right calf
rbax      rbay      rbaz      rbgx       rbgy       rbgz
0.069431  -0.40701  9.267955  -0.015854  -0.006795  -0.010525
0.059855  -0.37589  9.229648  -0.041302  -0.001191  -0.01172
0.14604   -0.38307  9.090784  -0.041168  0.005729   -0.01705
0.15083   -0.36392  9.109938  -0.028378  0.01506    -0.013722
...       ...       ...       ...        ...        ...
0.12689   -0.39025  9.040505  -0.024781  0.01811    -0.011990
The main body of the dataset contains 36 columns: 3 values for each axis of the accelerometer and the gyroscope from all 6 sensors. Each column holds the value of one axis of one sensor, and each row corresponds to one sample in time. The bottom sensors on the legs are located on the calves and the top sensors on the hips, both on the outer side. In the dataset the values are arranged in the following order: back top and bottom, left leg top and bottom, right leg top and bottom. Each person performed the movement several times; one file contains one experiment from one person but requires further preprocessing, after which each movement from one experiment is placed in a separate file, as discussed in the next subsection.
4.3 Feature Extraction
It is necessary to process the raw data in order to turn it into features, which are represented by orientation. First, it is very important to calibrate the sensors before using them: there are always some errors due to mechanical characteristics. Calibration is performed by subtracting the offset value for each axis, since inertial sensors usually have bias drift. An example is shown in Fig. 8.
Fig. 8. An example of calibrating the data from the back top sensor's axes orientation
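A hedged sketch of this offset subtraction is given below; estimating the per-axis bias from an initial stationary segment is an assumption about how the offsets were obtained.

```python
import numpy as np

def calibrate(raw, still_samples=200):
    """Remove the per-axis bias estimated from an initial stationary period.

    raw: array of shape (n_samples, n_axes) from one accelerometer or gyroscope.
    """
    offset = raw[:still_samples].mean(axis=0)
    # For the accelerometer, gravity along the vertical axis would normally be kept;
    # here every axis is simply shifted by its estimated bias for illustration.
    return raw - offset
```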
When the data is ready, the orientation values from the sensors are used as input data for intention estimation, and as an output value we receive the next angle. To do this, the raw data needs to be translated into Euler angles: roll, pitch, and yaw. According to the datasheet of the selected modules, the axes on the sensors are assigned as in Fig. 9. Based on the design of the data collection costume, the zero position of the sensors is vertical, and the sensors on the back are placed perpendicular to the sensors on the legs. The axes along which the rotation takes place therefore change their location and look as in Fig. 5. Some of the sensors change their polarity of rotation, but because of the preprocessing the rotations have a reversed polarity: as a result, the back angles change positively, as does the data from the upper leg sensors, while the angle of the lower leg sensors is directed in the opposite direction.
Fig. 9. Orientation of axes of sensitivity and polarity of rotation for accelerometer and gyroscope from datasheet
Location and orientation together fully describe how an object is located in space. Euler's rotation theorem shows that in three dimensions any orientation can be achieved by a single rotation around a fixed axis, which gives one general way to represent orientation using an axis-angle representation. According to the trajectory of the selected lifting task and the locations of the sensors on the costume in Fig. 5, for the legs we can estimate the human motion intention from the angle of rotation around the z axis only, and for the back from the angle of rotation around the x axis. When estimating the orientation, we can notice that the yaw angle of the leg sensors and the roll angle of the back sensors change significantly, while the other angles change only within the margin of error from person to person. Therefore, for the evaluation we can use the correlation between the combination of these angles varying over time. In principle, the angles of rotation can be calculated using only one of the sensors, for example the accelerometer, but this method is not suitable for our task because it cannot recover the rotation around the z axis. It is possible to use only gyroscope data for a discrete system according to Eq. 5:

\theta = \theta + \omega \Delta t \quad (5)

where \theta represents the angle in degrees, \omega is the angular speed in deg/s, and \Delta t is the time difference between measurements. So roll, pitch, and yaw can be calculated as:

\begin{pmatrix} \mathrm{pitch}_{n+1} \\ \mathrm{roll}_{n+1} \\ \mathrm{yaw}_{n+1} \end{pmatrix} = \begin{pmatrix} \mathrm{pitch}_n^G + G^X \Delta t \\ \mathrm{roll}_n^G + G^Y \Delta t \\ \mathrm{yaw}_n^G + G^Z \Delta t \end{pmatrix} \quad (6)

However, the resulting angles are not entirely clear: the back top sensor is mounted near 90° of roll, so its zero position would be over 90° rather than zero, which is why sensor fusion is the better choice.
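A direct discretization of Eqs. (5) and (6) can be sketched as follows; the function below is only an illustration and assumes angular rates in deg/s and timestamps in seconds.

```python
import numpy as np

def integrate_gyro(gyro, timestamps, initial_angles=(0.0, 0.0, 0.0)):
    """Integrate angular rates (Eqs. 5 and 6) to obtain angles over time.

    gyro: (n, 3) float array of angular speeds around x, y, z in deg/s.
    timestamps: (n,) sample times in seconds.
    """
    angles = np.zeros_like(gyro, dtype=float)
    angles[0] = initial_angles
    for n in range(1, len(gyro)):
        dt = timestamps[n] - timestamps[n - 1]
        angles[n] = angles[n - 1] + gyro[n] * dt  # theta = theta + omega * dt
    return angles  # columns map to pitch, roll, yaw depending on sensor mounting
```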
Sensor fusion combines data derived from various sensors in order to obtain a resulting value with less uncertainty than from any single sensor [20]. For this task, algorithms such as the Kalman filter, the DCM algorithm with a rotation matrix and PI controller, the complementary filter, or the Madgwick algorithm can be used; in this paper the last one was selected. The Madgwick algorithm [9] is an orientation filter that can be applied to IMU or MARG sensor arrays. It computes the orientation as quaternions, which can then be converted to Euler angles. This particular filter was chosen for computing the Euler angles because we need the rotation angle changes only around certain axes. The filter takes into account the time delta between measurements and a beta value, which assigns the degree of confidence given to the accelerometer. To estimate the trajectory of lifting the box, we select the most significant axes and take them as features (Fig. 10) of the intention to lift a heavy load.
Fig. 10. The change in the main angles of rotation during the lifting over time using the Madgwick filter
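Once the Madgwick filter has produced an orientation quaternion, the Euler angles used as features can be recovered with the standard conversion sketched below; this is a generic conversion, not the authors' exact code.

```python
import numpy as np

def quaternion_to_euler(q):
    """Convert a unit quaternion (w, x, y, z) to roll, pitch, yaw in degrees."""
    w, x, y, z = q
    roll = np.arctan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))
    pitch = np.arcsin(np.clip(2 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    return np.degrees([roll, pitch, yaw])
```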
4.4 Performance Metric
The angle prediction for controlling the lower-back exoskeleton was modeled by the deep learning model, whose goal was to minimize the mean squared error between the true and predicted angles of the main rotations. Therefore, to measure the performance of our proposed approach we use two popular metrics, namely the mean squared error (MSE) and the coefficient of determination (R2), calculated as in Eqs. 7 and 8. The plot of the MSE loss is shown in Fig. 13.

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (7)

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \quad (8)

where SS_{res} is the residual sum of squares and SS_{tot} is the total sum of squares.
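Both metrics are available off the shelf, for example via scikit-learn; the arrays below are placeholders standing in for the true and predicted angle trajectories.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Placeholders for the flattened true and predicted angle trajectories.
y_true = np.random.rand(100, 6)
y_pred = y_true + 0.01 * np.random.randn(100, 6)

mse = mean_squared_error(y_true, y_pred)  # Eq. (7)
r2 = r2_score(y_true, y_pred)             # Eq. (8)
print(f"MSE={mse:.3f}, R2={r2:.3f}")
```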
5 Results and Discussion
The experiment involved 20 people (6 females and 14 males) aged from 20 to 38 years. Each participant performed the exercise several times, from 20 to 40 repetitions of 15 to 20 s each. In total, 700 samples were collected, providing around 3.3 h of data. Figure 11 shows some of the participants during the experimentation stage.
Fig. 11. Some of the participants of the experiments. All participants authorized the use of these photos
For the training data, a time series was recorded in which a person stood still and then intended to lift the load. Each person squatted in their own style, at different speeds and durations. The dataset with these varied recordings was collected and processed. Figure 12 presents some examples of samples from the collected data on which the neural network was trained to predict the following angles. From the data it can be noticed that each person does the exercise in their own style: for example, person 3 in the figure squats more than bending the back, whereas person 2 bends more, lifting the load by tilting the back. Therefore, the neural network works independently of the physiological characteristics of each individual and does not require re-training for a specific user of an exoskeleton (Table 2).
Fig. 12. Examples of training data from the collected dataset, showing the change of the main orientation angles during a lifting task

Table 2. Proposed model performance
Dataset partition  Total samples  MSE   R2
Train              455            0.66  0.999
Validation         140            0.7   0.998
Test               105            0.3   0.994
For training, the data was divided into segments of 15 time steps in order to predict the next 5 steps, taking a six-component vector corresponding to the rotations as input. The coefficient of determination of the model on the validation set is 0.994 (Fig. 14), so the model performs well. The data points were taken in a sliding window of size 5; the absolute gradient of a line of best fit was calculated for each window, and if the gradient is greater than 0.5 the person is considered to be moving. To mark the start and the end of a movement, the first gradient greater than 0.5 marks the start, and the last gradient greater than 0.5 marks the stop of the movement. Figure 15 presents the performance of our approach on test data: the error is no more than 4°. It can be noticed that our proposed model will not work well if the sensors are not located properly. The error in the predicted data can be reduced by conducting a larger number of experiments and retraining the model on a larger number of subjects. It is worth noting that the data was collected only inside a building, which limits the variety of the dataset, since activity in an elevator or airplane was not taken into account; in those scenarios additional forces act on the accelerometer sensors, and in some applications it may be important to take this into account. The source code is available1.
1 https://github.com/Demilaris/human_intention
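A reconstruction of this movement-boundary detection, using the stated window size of 5 and gradient threshold of 0.5, might look like the sketch below; it is an illustration, not the published code.

```python
import numpy as np

def detect_movement(angle, window=5, threshold=0.5):
    """Return (start, end) sample indices of the movement in one angle signal."""
    moving = []
    for i in range(len(angle) - window + 1):
        # Slope of the line of best fit inside the sliding window.
        slope = np.polyfit(np.arange(window), angle[i:i + window], 1)[0]
        moving.append(abs(slope) > threshold)
    moving = np.array(moving)
    if not moving.any():
        return None, None
    idx = np.where(moving)[0]
    # First and last windows whose absolute slope exceeds the threshold.
    return idx[0], idx[-1] + window - 1
```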
Fig. 13. The MSE loss for training and validation set
Fig. 14. The coefficient of determination (R2 ) for training and validation set
Fig. 15. The predicted and true trajectory of human motion during lifting tasks where black lines denote the beginning of the movement
6 Conclusion
In this paper we propose a method for human intention estimation with a wearable robotics costume based on IMU sensors, using a deep learning based approach. The purpose of this work is to increase the versatility of exoskeletons and to apply advanced automatic control to them. A hardware costume was made and experiments were conducted. The data was collected, segmented, and preprocessed, and the features were selected and extracted. The obtained dataset for evaluating human intentions is used for training a deep learning model. As a result we obtain an MSE of 0.3 and an R2 of 0.994. For prediction we used an LSTM model with rotation degrees as input. Our approach shows a good result with an average error of only 4°. We examined the error and found that it occurs at the peaks of the values. The real-time delay does not exceed 50 ms. Moreover, we published our dataset for human activity recognition and intention estimation for everyone to use in their own way. In the future, the performance of our deep recurrent model can be improved by increasing the number of experiments and adding regularization, which will reduce over-fitting.
References
1. Badawi, A.A., Al-Kabbany, A., Shaban, H.: Multimodal human activity recognition from wearable inertial sensors using machine learning. In: 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), pp. 402–407. IEEE (2018)
2. Chereshnev, R., Kertész-Farkas, A.: HuGaDB: human gait database for activity recognition from wearable inertial sensor networks. In: van der Aalst, W.M.P., et al. (eds.) AIST 2017. LNCS, vol. 10716, pp. 131–141. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73013-4_12
3. Fang, B., et al.: Gait neural network for human-exoskeleton interaction. Front. Neurorobot. 14, 58 (2020)
4. Ghazal, S., Khan, U.S., Mubasher Saleem, M., Rashid, N., Iqbal, J.: Human activity recognition using 2D skeleton data and supervised machine learning. Inst. Eng. Technol. 13 (2019)
5. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1325–1339 (2014)
6. Khandelwal, S., Wickström, N.: Evaluation of the performance of accelerometer-based gait event detection algorithms in different real-world scenarios using the MAREA gait database. Gait Posture 51, 84–90 (2017)
7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Kwon, H., et al.: IMUTube: automatic extraction of virtual on-body accelerometry from video for human activity recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4(3), 1–29 (2020)
9. Madgwick, S., et al.: An efficient orientation filter for inertial and inertial/magnetic sensor arrays. Report x-io Univ. Bristol (UK) 25, 113–118 (2010)
10. Mak, S.K.D., Accoto, D.: Review of current spinal robotic orthoses. In: Healthcare, no. 1, p. 70. MDPI (2021)
11. Manns, P., Sreenivasa, M., Millard, M., Mombaur, K.: Motion optimization and parameter identification for a human and lower back exoskeleton model. IEEE Robot. Autom. Lett. 2, 1564–1570 (2017)
12. Martinez, J., Black, M.J., Romero, J.: On human motion prediction using recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2891–2900 (2017)
13. Milenkoski, M., Trivodaliev, K., Kalajdziski, S., Jovanov, M., Stojkoska, B.R.: Real time human activity recognition on smartphones using LSTM networks. In: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1126–1131. IEEE (2018)
14. Näf, M.B., Koopman, A.S., Baltrusch, S., Rodriguez-Guerrero, C., Vanderborght, B., Lefeber, D.: Passive back support exoskeleton improves range of motion using flexible beams. Frontiers in Robotics and AI, p. 72 (2018)
15. Poliero, T., et al.: A case study on occupational back-support exoskeletons versatility in lifting and carrying. In: The 14th PErvasive Technologies Related to Assistive Environments Conference, pp. 210–217 (2021)
16. Poliero, T., Mancini, L., Caldwell, D.G., Ortiz, J.: Enhancing back-support exoskeleton versatility based on human activity recognition. In: 2019 Wearable Robotics Association Conference (WearRAcon), pp. 86–91. IEEE (2019)
17. Radivojac, P., White, M.: Machine Learning Handbook (2019)
18. Reiss, A., Stricker, D.: Creating and benchmarking a new dataset for physical activity monitoring. In: Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, pp. 1–8 (2012)
19. Roveda, L., Savani, L., Arlati, S., Dinon, T., Legnani, G., Tosatti, L.M.: Design methodology of an active back-support exoskeleton with adaptable backbone-based kinematics. Int. J. Ind. Ergon. 79, 102991 (2020)
20. Sasiadek, J.Z.: Sensor fusion. Annu. Rev. Control 26(2), 203–228 (2002)
21. Shotton, J., et al.: Real-time human pose recognition in parts from single depth images. In: CVPR 2011, pp. 1297–1304. IEEE (2011)
22. Sousa Lima, W., Souto, E., El-Khatib, K., Jalali, R., Gama, J.: Human activity recognition using inertial sensors in a smartphone: an overview. Sensors 19(14), 3213 (2019)
23. Toxiri, S., et al.: Back-support exoskeletons for occupational use: an overview of technological advances and trends. IISE Trans. Occup. Ergon. Hum. Factors 7(3-4), 237–249 (2019)
24. Wang, H., Zhang, R., Li, Z.: Research on gait recognition of exoskeleton robot based on DTW algorithm. In: Proceedings of the 5th International Conference on Control Engineering and Artificial Intelligence, pp. 147–152 (2021)
25. Wang, J., Chen, Y., Hao, S., Peng, X., Hu, L.: Deep learning for sensor-based activity recognition: a survey. Pattern Recognit. Lett. 119, 3–11 (2019)
26. Yadav, S.K., Tiwari, K., Pandey, H.M., Akbar, S.A.: Skeleton-based human activity recognition using ConvLSTM and guided feature learning. Soft Comput. 26(2), 877–890 (2022)
27. Zhang, M., Sawchuk, A.A.: USC-HAD: a daily activity dataset for ubiquitous activity recognition using wearable sensors. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pp. 1036–1043 (2012)
TSEM: Temporally-Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series Anh-Duy Pham(B), Anastassia Kuestenmacher, and Paul G. Ploeger Hochschule Bonn-Rhein-Sieg, Sankt Augustin, Germany [email protected], {anastassia.kuestenmacher,paul.ploeger}@h-brs.de

Abstract. Deep learning has become a one-size-fits-all solution for technical and business domains thanks to its flexibility and adaptability. It is implemented using opaque models, which unfortunately undermines the outcome's trustworthiness. In order to have a better understanding of the behavior of a system, particularly one driven by time series, a look inside a deep learning model through so-called post-hoc eXplainable Artificial Intelligence (XAI) approaches is important. There are two major types of XAI for time series data: model-agnostic and model-specific. The model-specific approach is considered in this work. While other approaches employ either Class Activation Mapping (CAM) or the Attention Mechanism, we merge the two strategies into a single system, simply called the Temporally Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series (TSEM). TSEM combines the capabilities of RNN and CNN models in such a way that RNN hidden units are employed as attention weights for the temporal axis of the CNN feature maps. The results show that TSEM outperforms XCM. It is similar to STAM in terms of accuracy, while also satisfying a number of interpretability criteria, including causality, fidelity, and spatiotemporality.

Keywords: Temporally-weighted · Explainability · Attention · CNN · RNN · Spatiotemporality · Multivariate time series classification

1 Introduction
Multivariate time series analysis is used in many sensor-based industrial applications. Several advanced machine learning algorithms have achieved state-of-the-art classification accuracy in this field, but they are opaque because they encode important properties in fragmented, impenetrable intermediate layers. In Fig. 1, the positions of the cat and the dog are clear, but the signal from the three sensors is perplexing, even with explanations. This is perilous because adversarial attacks can exploit this confusion by manipulating the inputs with intentional noise to trick the classification model into yielding a wrong decision. Multivariate time series (MTS) are challenging to classify due to the underlying multi-dimensional links among elemental properties. As Industry 4.0 develops, a sensor system is
incorporated into production and operational systems to monitor processes and automate repetitive tasks. Hence, given the high accuracy of multivariate time series classification algorithms, it is crucial to understand the precise rationale for their conclusions, particularly in precision manufacturing or operations such as autonomous driving. This is when interpretable techniques become useful.
Fig. 1. Explanations given by gradient-based class activation mapping method
Generally, among the diversity of interpretable AI methods, there are three main categories, namely interpretable-by-design, model-agnostic, and model-specific ones. Interpretable-by-design models are those built with the intention of giving an explanation. Although they include classical methods such as decision trees and general linear models, they also contribute to novel interpretable applications such as graph neural networks [19] and decision trees [17], or by comparing to archetypal instances [5], conditioning neurons to interpretable properties [11, 13], and statistically accounting for evidence from various image patches [3]. On the other hand, model-agnostic and model-specific explainable methods are applied to models that are not interpretable by design. Model-agnostic techniques can be applied to any predictive model, while model-specific methods are linked to a particular model class. In this work, only model-specific interpretable methods are analyzed. Regardless of whether a method is model-agnostic or model-specific, explanations can be extracted in two ways [10]: post-hoc analysis, which explains the decision of a model by analyzing its output, and ante-hoc analysis, which provides an explanation as part of the decision itself. Class Activation Mapping (CAM)-based approaches that give post-hoc explanations for the output of a Convolutional Neural Network (CNN), and attention-based Recurrent Neural Networks (RNNs) with their attention vectors as the ante-hoc explanation, are analyzed in this research.
This study proposes the Temporally weighted Spatiotemporal Explainable Network for Multivariate Time Series (TSEM), a novel method that overcomes several shortcomings of previous interpretations by leveraging the power of an RNN to extract global temporal features from the input as an importance weight vector for the CNN-generated spatiotemporal feature maps in a parallel branch of the network. We show that our method can fix the locality of the explanations along the temporal axis yielded by 1D convolutional layers, while also improving the overall performance of the model, as the temporal information can now be directly weighted into the feature maps, as opposed to merely serving as supplementary information, as is the case in XCM [7]. Then, the saliency maps are extracted from the weighted feature maps by CAM methods. Despite the fact that CAM only gives local fidelity with CNN saliency maps, attention neural models are unable to provide consistent explanations for classification RNN models when compared to CAM explanations [12]. Apart from being faithful, such an explanation should satisfy two additional assessment criteria, namely spatiotemporal explainability and causality [9]. The rest of this paper is arranged in the following manner. Section 2 summarizes recent research on transparency in the realm of MTS classification. Section 3 explains the novel TSEM architecture and how its outputs should be interpreted. Section 4 details the experiments and methods of assessment. Section 5 summarizes the method's accomplishments and offers more perspectives on how to improve it.
2 Related Work

2.1 Attention Neural Models
In terms of efficiency in learning long-term correlations between values in a time series, an RNN layer may be kept in a neural network to retain information over an extended period of time, enabling more precise classification. The interpretation is then provided by wrapping an attention neural model around the RNN layer to obtain additional information about the time series regions of interest, which may improve the learning operation of the RNN layer. In addition, the attention neural model may offer feedback to the coupled RNN layer, instructing it to highlight the most important data. Numerous MTS classification and regression models of this kind have been published. This line of architectures begins with the Reverse Time Attention Model (RETAIN) [6], which utilizes a two-layer neural attention model in reverse time order to emphasize the most recent time steps. The Dual-Stage Attention-Based Recurrent Neural Network (DA-RNN) [20] adopts the concept of RETAIN but without the temporal order reversal, while the Dual-Stage Two-Phase attention-based Recurrent Neural Network (DSTP-RNN) [15] advances DA-RNN by dividing the spatial attention stage into two substages, or two phases. The multi-level attention network for geo-sensory time series (GeoMAN) [14] similarly has two stages of attention, but in the spatial stage it incorporates two parallel states: the local one correlates intra-sensory interactions, while the global one correlates
inter-sensory ones. Spatiotemporal attention for multivariate time series prediction (STAM) [9] focused primarily on temporal feature extraction by assigning two attention vectors with independent RNN wrappers while decreasing the number of phases in the spatial attention stage to one. 2.2
2.2 Post-Hoc Model-Specific Convolutional Neural Network-Based Models
An LSTM layer is designed to learn the intercorrelation between values along the time dimension, which is always a one-dimensional sequence; hence, this layer can in principle be replaced by a one-dimensional convolutional layer. Using this idea, Assaf et al. [1] created the Multivariate Time Series Explainable Convolutional Neural Network (MTEX-CNN), a serial two-stage convolutional neural network consisting of a series of 2D convolutional (Conv2D) layers coupled to a series of 1D convolutional (Conv1D) layers. Fauvel et al. [7] instead connected the two convolutional modules in parallel, arguing that this provides an additional temporal interpretation of the neural network's predictions. The resulting eXplainable Convolutional Neural Network for Multivariate Time Series Classification (XCM) provides a faithful local explanation of the model's prediction as well as high overall accuracy. Furthermore, as shown by its authors, the CNN-based structure allows the model to converge faster than an RNN-based structure while also having less variation across epochs. In their paper [7], XCM also demonstrated state-of-the-art performance. By switching from a serial to a parallel framework, XCM considerably improved on MTEX-CNN: the input data is received directly by both the 1D and 2D modules, so the extracted features remain more closely tied to the input than if the data were passed through another module first. However, XCM combines the two types of information, known as spatial and temporal features, by concatenating the temporal map as an additional feature vector to the spatial feature map and then using another Conv1D layer to learn the association between the two maps, as shown in Fig. 2. This approach entails two limitations: (1) the intermediate feature map has the shape (feature length + 1, time length), and the explanation map inherits this size, which is out of sync with the input data of shape (feature length, time length); and (2) while the final Conv1D layer is assumed to associate the concatenated maps, a Conv1D layer can only extract local features, whereas temporal features require long-term dependencies to achieve a more accurate correlation between time steps. To address these issues, TSEM is proposed as an architecture that makes better use of the temporal features: instead of concatenating them, it multiplies them element-wise into the spatial (in effect spatiotemporal) features yielded by the Conv2D layers, and it replaces the Conv1D module with an LSTM layer, whose output can be weighted more effectively into the spatiotemporal feature maps.
2.3 Explanation Extraction by Class Activation Mapping
Class Activation Mapping (CAM) methods determine which input features are accountable for a certain classification result. This is accomplished by backpropagating from the output logits to the desired layer to extract the activation or saliency maps of the relevant features, and then interpolating these maps onto the input to highlight the accountable regions. CAM [25] is the original approach; it uses a global pooling layer to link the final convolutional layer with the logits layer so that the responsible feature maps can be read off directly. CAM has since become the general name for this family of strategies, which, excluding the original, can be divided into two groups: gradient-based CAM and score-based CAM. The gradient-based group consists of algorithms that backpropagate the gradient from the logits layer in order to weight the feature maps of the associated convolutional layer. They include Grad-CAM [22], Grad-CAM++ [4], Smooth Grad-CAM++ [18], XGrad-CAM [8], and Ablation-CAM [21], and are distinguished by how they combine the backpropagated gradients with the feature-map weights. In contrast, score-based approaches such as Score-CAM [24], Integrated Score-CAM [16], Activation-Smoothed Score-CAM [23], and Input-Smoothed Score-CAM [23] use the logits to weight the convolutional layer of interest directly. They likewise differ in how they combine the scores and the feature maps through multiplication.
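As a rough illustration of the gradient-based family, the snippet below sketches the core Grad-CAM computation: channel weights obtained from gradients of the class logit, followed by a weighted sum over the feature maps. It is a simplified sketch assuming a generic PyTorch CNN whose target convolutional activations and their gradients have already been captured (e.g. with hooks); it is not the exact implementation used by any of the cited papers.

```python
import torch
import torch.nn.functional as F

def grad_cam_map(activations, gradients):
    """activations, gradients: (batch, channels, H, W) captured at the target conv layer
    for the logit of the class of interest."""
    weights = gradients.mean(dim=(2, 3), keepdim=True)          # global-average-pooled gradients
    cam = (weights * activations).sum(dim=1)                     # weighted sum over channels
    cam = F.relu(cam)                                            # keep positive evidence only
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)      # normalize to [0, 1]
    return cam                                                   # upsample to the input size afterwards

# Dummy tensors standing in for hooked values:
acts, grads = torch.randn(1, 32, 8, 20), torch.randn(1, 32, 8, 20)
saliency = grad_cam_map(acts, grads)                             # (1, 8, 20)
```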
3 Methodology
XCM builds on a basic CNN developed to extract features from the input data's variables and timestamps, and its design ensures that the model choice can be interpreted faithfully using Grad-CAM [22]. On a variety of UEA datasets, XCM beats state-of-the-art techniques for MTS classification [2]. Since faithfulness evaluates the relationship between the explanation and what the model actually computes, it is critical when describing a model to its end-user. The purpose of this study is to develop a small yet scalable and explainable CNN model that is true to its prediction. The combination of a CNN architecture with Grad-CAM enables designs with few parameters while maintaining accuracy and transparency; MTEX-CNN demonstrated this by proposing a serial connection of Conv2D and Conv1D layers for extracting essential characteristics from MTS. To address the above-mentioned drawbacks of CNN post-hoc explanations, TSEM takes the backbone of the XCM architecture and improves it by replacing the Conv1D module, which comprises two Conv1D layers in the second parallel branch of the architecture, with a single recurrent layer, as previously stated. The time-window aspect of the model has also been retained, since it helps scale the model to a fraction of the input size when the data dimensions are very large. Figure 3 depicts the overall architecture.
Fig. 2. XCM architecture [7]. Abbreviations: BN—Batch Normalization, D—Number of Observed Variables, F—Number of Filters, T—Time Series Length and Window Size—Kernel Size, which corresponds to the time window size
Fig. 3. TSEM architecture. Abbreviations: BN—Batch Normalization, D—Number of Observed Variables, F—Number of Filters, T—Time Series Length and Window Size— Kernel Size, which corresponds to the time window size.
Formally, the input MTS is fed simultaneously into two branches, each considering one of its two dimensions, namely the spatial and the temporal one. The spatial branch is designed to extract the spatial information spanned across the constituent time series by first applying a convolutional filter with a customized kernel size, with one side fixed to the length of the temporal axis of the MTS; this reduces the number of model parameters and increases training as well as inference speed. It is followed by a 1×1 Conv2D layer that collapses the preceding filters into a single one, making the shape of the resulting feature map equal to that of the input. This idea was proposed by Fauvel et al. [7] in their XCM architecture and is kept as the backbone of our architecture, TSEM. In the temporal branch, since a separate temporal explanation is redundant when the first branch can already extract a spatiotemporal explanation, the Conv1D module is replaced with an LSTM module whose number of hidden units equals the window size hyperparameter. The substitution sacrifices the explainability
of the Conv1D module in exchange for improved temporal features, because the LSTM module treats the time series signal as a continuous sequence rather than the discrete local windows seen by a convolution. Its output is then upsampled to the original time length and element-wise multiplied with the feature maps from the first branch's two-dimensional convolutional module, instead of being concatenated as in the temporal branch of XCM (Fig. 2). This results in time-weighted spatiotemporal forward feature maps from which the explanation map can be retrieved by different CAM-based techniques. Additionally, the new feature map is expected to improve accuracy compared to XCM.
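A minimal PyTorch-style sketch of this weighting scheme is given below. It follows the description above (a Conv2D spatial branch, an LSTM temporal branch with `window_size` hidden units, upsampling to the sequence length, and element-wise multiplication), but the kernel size, layer counts and the classifier head are simplified assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSEMSketch(nn.Module):
    """Simplified TSEM-style forward pass: Conv2D spatial maps re-weighted by LSTM temporal output."""
    def __init__(self, n_features, seq_len, n_classes, filters=64, window_size=32):
        super().__init__()
        # Spatial branch: per-feature temporal convolutions, then a 1x1 collapse back to one map.
        # (The published model scales its kernel with a time-window hyperparameter; 9 is a stand-in.)
        self.conv2d = nn.Conv2d(1, filters, kernel_size=(1, 9), padding=(0, 4))
        self.collapse = nn.Conv2d(filters, 1, kernel_size=1)
        # Temporal branch: LSTM with window_size hidden units.
        self.lstm = nn.LSTM(n_features, window_size, batch_first=True)
        self.seq_len = seq_len
        self.head = nn.Linear(n_features * seq_len, n_classes)

    def forward(self, x):                                                          # x: (batch, features, time)
        spatial = self.collapse(F.relu(self.conv2d(x.unsqueeze(1)))).squeeze(1)    # (batch, features, time)
        temporal, _ = self.lstm(x.transpose(1, 2))                                 # (batch, time, window_size)
        weights = temporal.mean(dim=2, keepdim=True).transpose(1, 2)               # (batch, 1, time)
        weights = F.interpolate(weights, size=self.seq_len)                        # upsample to full length
        weighted = spatial * weights                                               # time-weighted feature map
        return self.head(weighted.flatten(1))

model = TSEMSketch(n_features=3, seq_len=315, n_classes=8)
logits = model(torch.randn(2, 3, 315))
```

CAM-based explanations would be extracted from the `weighted` feature map, which is why it is kept at the same (features, time) shape as the input.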
4 Experiments and Evaluation
The assessment entails conducting tests for critical metrics that an interpretation should adhere to, as recommended and experimented with in several works on interpretable or post-hoc interpretability approaches.

4.1 Baselines
TSEM is evaluated in comparison with all of the attention neural models as well as the post-hoc model-specific CNN-based models outlined in Sects. 2.1 and 2.2. These include five attention neural models, namely RETAIN, DA-RNN, DSTP-RNN, GeoMAN and STAM, as well as some of their possible variants, along with MTEX-CNN and XCM, the two model-specific interpretable models. It is important to note that in the interpretability tests, the post-hoc analysis of MTEX-CNN and XCM is supplied by each of the explanation extraction techniques stated in Sect. 2.3 and compared among them.

4.2 Accuracy
Before delving into why the model produced such exact output, it is necessary to establish an accurate prediction model. As a consequence, inherently or post-hoc interpretable models must be assessed on their capacity to attain high accuracy when given the same set of datasets in order to compare their performance objectively. As indicated before, the XCM architecture has shown its performance and that of the MTEX-CNN on classification tasks utilizing the UEA Archive of diverse MTS datasets [2]. The comparisons are done using model accuracy as the assessment measure, and a table of model accuracy reports is then generated for each of the experimental models over all datasets in the UEA archive. Additionally, a critical difference chart is constructed to illustrate the performance of each model more intuitively by aligning them along a line marked with the difference level from the reported accuracy table.
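For reference, the average ranks that underlie such a critical difference chart can be computed directly from the accuracy table. The snippet below is a small illustration using numpy and scipy; the accuracy matrix is made up purely for demonstration and does not reproduce the reported results.

```python
import numpy as np
from scipy.stats import rankdata

# rows = datasets, columns = models; dummy accuracies for illustration only
acc = np.array([[0.72, 0.81, 0.83],
                [0.47, 0.52, 0.76],
                [0.08, 0.58, 0.72]])

# rank models per dataset (rank 1 = best), then average over datasets
ranks = rankdata(-acc, axis=1)          # negate so that higher accuracy gets rank 1
avg_rank = ranks.mean(axis=0)
print(dict(zip(["MTEX-CNN", "XCM", "TSEM"], avg_rank)))
```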
Datasets. The UEA multivariate time series classification archive [2] has 30 datasets that span six categories, including Human Activity Recognition, Motion Classification, ECG Classification, EEG/MEG Classification, and Audio Spectra Classification. As with other sources of datasets, it is a collaborative effort between academics at the University of California, Riverside (UCR) and the University of East Anglia (UEA). All the time series within one dataset have the same length, and no missing or infinite values occur.

Accuracy Metrics. After training on the aforementioned datasets, each model is assessed using the following accuracy score:

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}, \quad (1)
where TP, FP, TN and FN are abbreviations for True Positive, False Positive, True Negative, and False Negative, respectively. The numerator (TP + TN) denotes the number of predictions that are equal to the actual class, while FP and FN indicate the number of predictions that are not equal to the true class. Following that, the average score is utilized to generate a Critical Difference Diagram that depicts any statistically significant difference between the architectures. It is created using Dunn's test, a nonparametric technique for determining which means differ significantly from the others. Dunn's test establishes a null hypothesis, in which no difference exists between groups, and an alternative hypothesis, in which a difference exists between groups.

4.3 Interpretability
Despite developing interpretable CNN-based architectures and using Grad-CAM for interpretation, MTEX-CNN and XCM evaluate their explainability only with a percentage Average Drop metric and a human comprehensibility test on trivial simulation data. Most substantial testing of CAM-based explanations is undertaken within the papers proposing the CAM-based techniques themselves. Score-CAM [24] assesses its approach on the most comprehensive collection of trials, merged from the other methods, which is appropriate given the method's recent publication. The assessments include metrics for evaluating faithfulness, such as Increase of Confidence, percent Average Drop, percent Average Increase, and Insertion/Deletion curves; metrics for evaluating localization, such as the energy-based pointing game; and finally, a sanity check that examines the change in the visualization map when a random set of feature maps is used. By contrast, attention-based RNN techniques are primarily concerned with studying the attention's spatiotemporal properties. As with the percent Average Drop, multiple approaches (DA-RNN, DSTP, and STAM) perform an ablation experiment by viewing the difference between the unattended and original multivariate time series data. Thus, a single set of interpretability evaluation trials should collect all the dispersed testing across the methodologies in order to serve as a baseline for comparing the effectiveness of each interpretation produced by each
approach in the area of MTS classification. This includes human visual inspection, fidelity of the explanations to the model parameters, spatiotemporality of the explanation, and causality of the explanation in relation to the model parameters. Because the assessment is based on the interpretation of a given model prediction, before extracting an explanation the model of choice must be trained to at least chance-level accuracy. Since the explanation of attention-based models is intrinsic to the model parameters, explanations can be retrieved exclusively for the output class, which is not the case with CAM-based approaches. Additionally, to keep the interpretations legible, the interpretation process should be conducted on a single dataset with a maximum of three component time series. Thus, UWaveGestureLibrary is chosen for the interpretability evaluation.

Faithfulness. The evaluations in this class attempt to justify whether the features that an explaining mechanism identifies are consistent with the outcomes of the model. They consist of two sub-classes, namely Average Drop/Average Increase and the Deletion/Insertion AUC score. Average Drop and Average Increase are reported together because they both test the same aspect of an explanation's faithfulness to the model parameters, whereas the Deletion/Insertion AUC score analyzes a different aspect. According to [4], given Y_i^c as the prediction score for class c on instance i and O_i^c as the prediction score for class c on instance i masked by the interpretation map, the Average Drop is defined as in Eq. 2 [24], whereas the Average Increase, also called the Increase of Confidence, is computed using Eq. 3.

AverageDrop(\%) = \frac{1}{N} \sum_{i=1}^{N} \frac{\max(0, Y_i^c - O_i^c)}{Y_i^c} \times 100 \quad (2)

AverageIncrease(\%) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Sign}(Y_i^c < O_i^c) \times 100 \quad (3)
where Sign is the function that converts boolean values to their binary counterparts of 0 and 1. As the names of these approaches imply, an interpretability method performs well when the Average Drop percentage lowers and the Average Increase percentage grows. The Deletion and Insertion AUC score measures are intended to be used in conjunction with the Average Drop and Average Increase measurements. The deletion metric reflects a decrease in the predicted class’s probability when more and more crucial pixels are removed from the generated saliency map. A steep decline in the graph between the deletion percentage and the prediction score, equivalent to a low-lying area under the curve (AUC), indicates a plausible explanation. On the other hand, the insertion measure captures the increase in likelihood associated with the addition of additional relevant pixels, with a larger AUC implying a more complete explanation [24].
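A compact numpy sketch of Eqs. 2 and 3 is shown below; `scores_full` and `scores_masked` stand for the class-c prediction scores obtained with the original inputs and with the inputs masked by the explanation maps, respectively, and are assumed to be supplied by the evaluated model.

```python
import numpy as np

def average_drop_increase(scores_full, scores_masked, eps=1e-8):
    """scores_full, scores_masked: arrays of Y_i^c and O_i^c over N instances."""
    y, o = np.asarray(scores_full), np.asarray(scores_masked)
    avg_drop = np.mean(np.maximum(0.0, y - o) / (y + eps)) * 100   # Eq. 2
    avg_increase = np.mean(o > y) * 100                            # Eq. 3
    return avg_drop, avg_increase

drop, inc = average_drop_increase([0.9, 0.8, 0.6], [0.7, 0.85, 0.55])
```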
Causality. When doing a causality test, it is common practice to randomize the causes and observe how the effects respond; randomization provides evidence for or against a causal link. This is accomplished by randomizing each feature vector one by one in a cascading manner until all of the feature vectors of the input data are completely randomized; the time dimension is likewise randomized up to the final time point, as previously stated. Each piece of randomized data is then fed into the interpretable models to extract its interpretation matrix, which is correlated with the original explanation to see how far it deviates. If the interpretation does not change from the original one, the causal link is in question, since the interpretation is invariant with respect to the input data. The correlation values between the randomized-input explanations and their root explanation without randomization serve as a quantitative evaluation of the tests. A divergence of the correlations produced by the same interpretable technique, shown by a drop in the correlation factors over the course of the cascading stages, can be used as evidence for the existence of causal relationships. The Chi-square goodness-of-fit hypothesis test is used to determine whether the divergence is large enough to constitute a difference between the interpretations yielded under randomization and the initial interpretation map, with the null hypothesis being that there is no difference between them. In other words, the null hypothesis is that the correlation between the original explanation and the observed one is 1, while the alternative hypothesis is that it is not. The Chi-square statistic is defined as

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}, \quad (4)
where O_i denotes the observed value, which in this case is the correlation of interpretations obtained by input randomization, and E_i is the expected value, which is 1, indicating a perfect match to the original interpretation map. All of the data in the UWaveGestureLibrary test set is evaluated in this manner, where the quantity of data minus one is the degree of freedom for the Chi-square test.

Spatiotemporality. The spatiotemporality of a multivariate time series explanation specifies how the relevance weights are distributed over each time step of each feature vector. The metric for determining the explanation map's spatiotemporality simply checks both temporality and spatiality: the interpretation map must vary in both time and space. For a multivariate time series with N feature vectors and T time steps, the map ensures spatiality when the summation of interpretation map values along the time axis for each feature does not equal 1/N. Similarly, it fulfills temporality if the sum of the map along the feature axis for each time step t does not equal 1/T. If one of these checks fails, the corresponding property fails as well, which results in the spatiotemporality as a whole failing. These criteria are expressed mathematically in Eqs. 5 and 6.
\sum_{j} X_{nj} \neq \frac{1}{N} \quad \forall n \in \{0, \ldots, N-1\} \quad (5)

\sum_{i} X_{it} \neq \frac{1}{T} \quad \forall t \in \{0, \ldots, T-1\} \quad (6)
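A direct numpy translation of these two criteria might look as follows; `expl` is assumed to be an explanation map of shape (N, T) that has been normalized so the whole map sums to one, as is done before the criteria are applied.

```python
import numpy as np

def is_spatiotemporal(expl, tol=1e-6):
    """expl: explanation map of shape (N, T), normalized so that expl.sum() == 1."""
    n, t = expl.shape
    spatial = not np.any(np.isclose(expl.sum(axis=1), 1.0 / n, atol=tol))   # Eq. 5: no row sum equals 1/N
    temporal = not np.any(np.isclose(expl.sum(axis=0), 1.0 / t, atol=tol))  # Eq. 6: no column sum equals 1/T
    return spatial and temporal

print(is_spatiotemporal(np.random.dirichlet(np.ones(3 * 315)).reshape(3, 315)))
```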
4.4 Experiment Settings
Because XCM and TSEM allow their parameters to be adjusted to a fraction of the data length via the time-window hyperparameter computed in multiple layers of the architecture, a fixed-parameter setting would either be unfair to the other architectures or make the models so large that the available compute could not handle the training and the models might overfit the training data. Consequently, the number of their architectural parameters varies between datasets, and the time window is set to the proportion that results in absolute values of no more than 500. This is not the case for MTEX-CNN, whose number of parameters is fixed. The other attention-based RNN architectures employ an encoder-decoder structure, which is set to 512 units for both the encoder and decoder modules. This assessment was implemented entirely on Google Colab Pro and the Paperspace Gradient platform. Google Colab Pro provides a version of Jupyter Lab with 24 GB of RAM and a P100 graphics processing unit (GPU) with 16 GB of VRAM for running machine learning models that need a CUDA environment. Similarly, Paperspace's Gradient platform lets users work in a Jupyter Notebook environment with a 30 GB RAM configuration and GPU options up to a V100; the evaluations are conducted using a P6000 card with 24 GB of VRAM.

4.5 Experiment Results
The experiment results are presented in two parts, accuracy and interpretability, the latter broken down further into four aspects, namely human visual evaluation, faithfulness, causality and spatiotemporality.

Accuracy. As indicated before, this part evaluates 10 interpretable models on 30 datasets. Table 1 summarizes the findings. According to the table, the TSEM, XCM and STAM models have noticeably better average rankings and win/tie counts than the others. In particular, TSEM has significantly higher accuracy on datasets with long sequences, such as Cricket and SelfRegulationSCP2, which have lengths of 1197 and 1152 time steps, respectively [2]. RETAIN also performs well in comparison to the other approaches in terms of average rank.
Table 1. Accuracy evaluation of the interpretable models for each dataset with TSEM (DSTP is shorthand for DSTP-RNN, DSTP-p is shorthand for DSTP-RNN-Parallel, GeoMAN-l and GeoMAN-g are shorthands for GeoMAN-Local and GeoMAN-Global respectively)

[Table 1 lists, for each of the 30 UEA datasets, the accuracy of MTEX-CNN, XCM, TSEM, DA-RNN, RETAIN, DSTP-p, DSTP, GeoMAN, GeoMAN-g, GeoMAN-l and STAM; its two summary rows are:]

Average Rank: MTEX-CNN 6.5 | XCM 3.7 | TSEM 3.5 | DA-RNN 5.1 | RETAIN 4.3 | DSTP-p 6.1 | DSTP 6.1 | GeoMAN 5.0 | GeoMAN-g 5.1 | GeoMAN-l 5.2 | STAM 3.2
Wins/Ties: MTEX-CNN 5 | XCM 8 | TSEM 9 | DA-RNN 2 | RETAIN 4 | DSTP-p 2 | DSTP 4 | GeoMAN 4 | GeoMAN-g 3 | GeoMAN-l 3 | STAM 9
While MTEX-CNN has the lowest average rank, it has the highest number of wins/ties among the methods apart from TSEM, XCM and STAM, indicating that this approach is unstable and not ideal for all types of datasets. In comparison, although DA-RNN, GeoMAN-local, and GeoMAN-global are among the techniques with the fewest wins/ties, they achieve a solid average rank. DSTP-RNN and DSTP-RNN-parallel produce the same average rank; however, DSTP-RNN's wins/ties outnumber those of DSTP-RNN-parallel, which is consistent with the corresponding research on DSTP-RNN, indicating that DSTP-RNN can remember longer sequences than DSTP-RNN-parallel. Otherwise, the performance of DSTP-RNN and DSTP-RNN-parallel is not superior to the others, as reported for the regression problem, but is instead the poorest of all the approaches.
Fig. 4. The critical difference plot of the MTS classifiers on the UEA datasets with alpha equal to 0.05.
While comparing average rankings and wins/ties scores can help determine a model's quality, there may be discrepancies between these two measurements, so the average rank is also visualized in a Critical Difference Diagram to provide further insight. The statistical test shown in Fig. 4 is the Bonferroni-Dunn test with an alpha of 0.05 over the 30 examined datasets, which corresponds to the total number of datasets in the UEA collection. Figure 4 suggests that STAM, TSEM and XCM are the top-3 methods in terms of accuracy, and they lie in the same group as RETAIN, DA-RNN and all the variants of GeoMAN, posing a significant difference to the remaining three methods.

Qualitative Evaluation. Regarding the CNN-based architectures analyzed in this research, ten different CAM-based methods are evaluated for the explainability of the models: CAM [25], Grad-CAM [22], Grad-CAM++ [4], Smooth Grad-CAM++ [18], XGrad-CAM [8], Ablation-CAM [21], Score-CAM [24], Integrated Score-CAM [16], Activation-Smoothed Score-CAM [23] and Input-Smoothed Score-CAM [23]. However, only CAM, Grad-CAM++, and XGrad-CAM are examined in this part due to the supremacy of these three methods. They are then visually compared with the attention representation vectors of the attention-based RNN architectures within a given context. Here, an example from the UWaveGestureLibrary dataset (shown in Fig. 5) illustrates a right turn downhill after a straight walk onwards. As previously mentioned, without a sanity check provided by human understanding, the explanations for a multivariate time series would be hard to judge; every element in the multivariate time series must be unambiguous about what it represents. As an example, the input data in Fig. 5 depicts a multivariate time series from the UWaveGestureLibrary dataset corresponding to the three axes of an accelerometer measuring a staged action, which is designated by label 1 in the category list. Knowing what each time series component represents, for example that the blue line represents the x-axis values of the acceleration sensor, makes it clearer why they oscillate in certain patterns, as shown here by the blue line's oscillation following the x-axis variation of the steep right turn as labeled. In particular, the action intervals for class number 1 are denoted with great precision. Indeed, the action would differ from person to person and from time to time, but in general it can be
Fig. 5. A UWaveGestureLibrary class 1 instance with semantic segmentation
divided into five states and four stages based on the change in acceleration in the x-axis values of the sensor, as reported by the uphill and downhill patterns of the blue line and the uphill and downhill patterns of the green line. The five states are denoted by the letters O, A, B, C and D, which correspond to the resting state, the beginning state, the switching state, the halting state, and the terminating state, in that order. The four stages are represented by the letters OA, AB, BC and CD, which represent the transient stage, the first direction running stage, the second direction running stage, and the concluding stage, respectively, in the graphical representation. It is necessary to distinguish between the transient and concluding stages since an action is neither initiated nor terminated immediately after a signal has been initiated or terminated by a device. All of the states and stages are highlighted in both the multivariate time series and its label in order to demonstrate the sensible interconnections between them. An interpretation map is therefore more interesting if it highlights the critical points located between states A and B; it would be meaningless if it stressed the transient stage OA or the ending stage CD because, logically, a model should not choose class 1 over other classes simply because of a longer transient stage, for example. In general, according to Fig. 6, post-hoc explainability approaches based on CAM have yielded more continuous interpretation maps for XCM than the explanations of attention-based RNN models. The difference appears to be due to the different nature of CNNs and RNNs for extracting the learned features, with a CNN being able to provide a local explanation specific to the input instance, whereas an RNN is believed to yield a global explanation independent of the specific input instance in question. In this sense, the explanation map produced by the attention in recurrent networks is tied to the category to which the input instance is assigned as a whole, this category being represented by a node for the instance. Unlike RNN,
Fig. 6. Explanations of some explainable models for a UWaveGestureLibrary class 1 instance. MTEX-CNN, XCM and TSEM are post-hoc explained by CAM, XGrad-CAM and Grad-CAM++. Explanations for attention-based RNN methods are their spatiotemporal attention vectors. The red lines show the highest activated regions of the input for class 1 as their predictions.
CNN does not have the capacity to memorize; the highlighted areas are simply those parts of the input instance that are activated when the CNN encounters a label. As a result, the CAM-based interpretation is strongly reliant on the input signal.

Faithfulness. As a contrastive pair, the two assessment metrics in each test are shown on a two-dimensional diagram, together with the accuracy of each model, which is represented by the size of the circle at the corresponding coordinates. Comparing high-accuracy against low-accuracy interpretable models in this way reveals how the explanations are affected by accuracy. The Average Drop and Average Increase metrics are displayed as percentages ranging from 0 to 100, and each point is positioned by two coordinates corresponding to its average drop and average increase. Because the Average Drop and Average Increase are inversely related, it is expected that all of the points would follow a trend parallel to the line y = x. Considering that x represents the Average Drop value and y represents the Average Increase
Fig. 7. The average drop - average increase diagram for the UWaveGestureLibrary dataset. Accuracy is illustrated as proportional to the size of the circles. (The lower the average drop, the more faithful the method is; the opposite holds for the average increase.)
value, the bottom right is the lowest-performance zone, while the top left is the highest-performance zone in this diagram. Taking into account the wide range of accuracies, it is then rather simple to determine which technique provides the most faithful explanation of the model's decision. Figure 7 depicts the distributions of the interpretation techniques in terms of their Average Drop and Average Increase percentages. All of the points are color-coded, with the red and yellow colors denoting the spectrum of CAM-based interpretations for the two convolutional architectures, XCM and TSEM, respectively. The blue hue represents MTEX-CNN, while the remaining colored spots represent the attention-based approaches. In general, the farther a point lies to the right, the poorer the implied performance. As expected, all of the worst approaches are clustered together at the bottom right, where the Average Increase is at its lowest value and the Average Drop is at its greatest. While both MTEX-CNN and STAM have high accuracy (as shown by the size of the circles), the collection of interpretations for MTEX-CNN and the attention vector for STAM have the lowest fidelity to their decisions.
Fig. 8. The deletion/insertion AUC diagram for the UWaveGestureLibrary dataset. Accuracy is illustrated as proportional to the size of the circles. The diagram is shown in log-scale to magnify the distance between circles for clearer demonstration. (The lower the deletion AUC score, the more faithful the method is; the opposite holds for the insertion AUC score.)
For the Insertion and Deletion curves, only the most essential data points are considered: they are gradually added to an empty sequence or gradually removed from the original input data, and the change in the model's prediction score is recorded, respectively. The area under each curve serves as the measure, summarizing how quickly the curve changes. The area under the curve (AUC) of a Deletion curve should be as small as feasible, showing a rapid suppression of the prediction score once the most relevant data points begin to be eliminated. Conversely, the AUC of an Insertion curve should be as large as feasible, indicating that the prediction score increases as soon as the first most essential data points are added. Figure 8 depicts the relationship between Deletion AUC and Insertion AUC values. Overall, there are no obvious patterns relating the AUC values to the accuracy of the approaches. The majority of the techniques are grouped around a nearly equal Deletion AUC of about 0.125, whereas the Grad-CAM family of approaches for MTEX-CNN stands out with remarkable Deletion AUC scores of around 0.07. Furthermore, not only
do they have low Deletion AUC scores, but they also have a high Insertion AUC score of around 0.3. STAM and the remaining CAM-based methods for MTEX-CNN sit nearby, with Deletion AUC scores almost double those of the Grad-CAM family for MTEX-CNN. Similarly, the RETAIN interpretation has the highest Insertion AUC score, approximately 0.5, about four times higher than the XCM interpretations and the interpretations of the rest of the attention-based techniques, with the exception of STAM. While the XCM CAM-based explanations have nearly the same Insertion AUC score as GeoMAN, GeoMAN-local, GeoMAN-global, DA-RNN, DSTP-RNN, and DSTP-RNN-parallel, they have lower Deletion AUC scores than the attention maps of those methods. Since TSEM is a modification and correction of the XCM design, the fidelity of the CAM-based explanations for TSEM is expected to be at least as good as that of the XCM architecture. Indeed, as seen in Figs. 7 and 8, the cluster of TSEM interpretations in both diagrams is distributed in a manner virtually identical to that of XCM. TSEM, however, performs somewhat better on two metrics: Average Increase and Insertion AUC. This implies that TSEM interpretations pay more attention to data points that are more meaningful in the multivariate time sequence. Most notably, the original CAM approach extracts an explanation map for TSEM with the smallest Average Drop of all the methods tested.

Causality. When attempting to reason about an effect in relation to a cause, the significance of causality cannot be overstated. Here the effect is the explanation map corresponding to its cause, the input data, and the two are connected by the model, which acts as a proxy between them: in contrast to the regression connection between a model's input and output, explanation maps are produced by a combination of inputs, model parameters, and outputs. While the model parameters and the output are kept constant in this assessment, randomization is applied to the two axes of the multivariate time series input. According to Fig. 9, the CAM-based explanations for TSEM over the 320 instances in the UWaveGestureLibrary test set exhibit patterns similar to the XCM explanations. Specifically, neither the Score-CAM variants nor the Ablation-CAM variants pass the causality test. The distinction of TSEM from MTEX-CNN and XCM is that it renders the original CAM approach causal, which is not the case for MTEX-CNN or XCM. On the other hand, XGrad-CAM's explanations for TSEM are non-causal. Although the Ablation-CAM explanation technique fails for both XCM and TSEM, the failure is more pronounced for TSEM, where the proportion of temporally non-causal data points surpasses the 10% threshold. In general, only three approaches satisfy the causality criteria for TSEM: Grad-CAM++, Smooth Grad-CAM++, and the original CAM. This is still considered a good performance because, according to Fig. 9, almost 70% of the methods do not retain causality.
Fig. 9. The bar chart of the non-causal proportion of the UWaveGestureLibrary test set inferred by TSEM CAM-based explanations vs the other interpretable methods. The lower the proportion, the better the causation level a method achieves; it must be below 10% (illustrated by the red line) to be considered to pass the causality test.
Spatiotemporality. This assertion is made clearly in Eqs. 5 and 6, which relate to the spatiality and temporality tests, respectively. If both of these equations apply to an interpretation map, it is deemed to possess the spatiotemporal quality. Because the numbers in an explanation map do not add up to 1, they must be normalized before applying the criterion equations. This is done by dividing each value by the total of the whole map. All CAM-based method interpretations in XCM, TSEM and MTEX-CNN, as well as the attention-based interpretation, pass this set of tests. Because no negative instances are provided, the findings for each approach are omitted.
5 Conclusion and Outlook
After a thorough analysis of the currently available interpretable methods for MTS classification, the Temporally weighted Spatiotemporal Explainable network for Multivariate Time Series Classification, or TSEM for short, is developed on the basis of the successful XCM in order to address some of XCM's shortcomings. Specifically, XCM does not permit the concurrent extraction of spatial and temporal explanations, because they are separated into two parallel branches. TSEM, in contrast, reweights the spatial information obtained in the first branch using the temporal information learnt from the recurrent layer in the second parallel branch. This is regarded as a more effective way than XCM of extracting real temporality from the data, rather than pseudo-temporality from the correlation of time-varying values per location, and it also supports an explanation that maps the relative significance of temporal and spatial features. As a result, it is expected to provide a more compact and exact interpretation map. Indeed, TSEM outperforms XCM in terms of accuracy across the 30 datasets of the UEA archive and in terms of explainability on UWaveGestureLibrary. This study focuses only on model-specific interpretable approaches and makes no comparison to model-agnostic methods; it would therefore be interesting to analyze TSEM against such methods using the same set of interpretability metrics. As future work, TSEM could use concept embeddings to encode tangible aspects into knowledge about a concept, and then incorporate neuro-symbolic techniques in order to provide a more solid explanation of its predictions. In addition, causal inference should be considered in order to remove any spurious connection between the logits and the feature maps; this would help strengthen the actual explanation and eliminate any confounding variables that may be present.
References 1. Assaf, R., Giurgiu, I., Bagehorn, F., Schumann, A.: MTEX-CNN: multivariate time series explanations for predictions with convolutional neural networks. In: 2019 IEEE International Conference on Data Mining (ICDM), pp. 952–957. IEEE (2019) 2. Bagnall, A., et al.: The UEA multivariate time series classification archive (2018). arXiv preprint arXiv:1811.00075 (2018) 3. Brendel, W., Bethge, M.: Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760 (2019) 4. Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: GradCAM++: generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. IEEE (2018) 5. Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. In: Advances in Neural Information Processing Systems 32 (2019)
6. Choi, E., Bahadori, M.T., Sun, J., Kulas, J., Schuetz, A., Stewart, W.: Retain: an interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in Neural Information Processing Systems 29 (2016) ´ Termier, A.: XCM: an explainable 7. Fauvel, K., Lin, T., Masson, V., Fromont, E., convolutional neural network for multivariate time series classification. Mathematics 9(23), 3137 (2021) 8. Fu, R., Hu, Q., Dong, X., Guo, Y., Gao, Y., Li, B.: Axiom-based GradCAM: towards accurate visualization and explanation of CNNs. arXiv preprint arXiv:2008.02312 (2020) 9. Gangopadhyay, T., Tan, S.Y., Jiang, Z., Meng, R., Sarkar, S.: Spatiotemporal attention for multivariate time series prediction and interpretation. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP 2021, pp. 3560–3564. IEEE (2021) 10. Holzinger, A., Goebel, R., Fong, R., Moon, T., M¨ uller, K.R., Samek, W.: xxAI beyond explainable artificial intelligence. In: Holzinger, A., Goebel, R., Fong, R., Moon, T., M¨ uller, K.R., Samek, W. (eds.) xxAI - Beyond Explainable AI. Lecture Notes in Computer Science, vol. 13200, pp. 3–10. Springer, Cham (2022). https:// doi.org/10.1007/978-3-031-04083-2 1 11. Hu, D.: An introductory survey on attention mechanisms in NLP problems. In: Bi, Y., Bhatia, R., Kapoor, S. (eds.) IntelliSys 2019. AISC, vol. 1038, pp. 432–448. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-29513-4 31 12. Jain, S., Wallace, B.C.: Attention is not explanation. arXiv preprint arXiv:1902.10186 (2019) 13. Koh, P.W., et al.: Concept bottleneck models. In: International Conference on Machine Learning, pp. 5338–5348. PMLR (2020) 14. Liang, Y., Ke, S., Zhang, J., Yi, X., Zheng, Y.: GeoMAN: multi-level attention networks for geo-sensory time series prediction. In: IJCAI 2018, pp. 3428–3434 (2018) 15. Liu, Y., Gong, C., Yang, L., Chen, Y.: DSTP-RNN: a dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Syst. Appl. 143, 113082 (2020) 16. Naidu, R., Ghosh, A., Maurya, Y., Kundu, S.S., et al.: IS-CAM: integrated scoreCAM for axiomatic-based explanations. arXiv preprint arXiv:2010.03023 (2020) 17. Nauta, M., van Bree, R., Seifert, C.: Neural prototype trees for interpretable finegrained image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14933–14943 (2021) 18. Omeiza, D., Speakman, S., Cintas, C., Weldermariam, K.: Smooth grad-CAM++: an enhanced inference level visualization technique for deep convolutional neural network models. arXiv preprint arXiv:1908.01224 (2019) 19. Pfeifer, B., Secic, A., Saranti, A., Holzinger, A.: GNN-subnet: disease subnetwork detection with explainable graph neural networks. bioRxiv (2022) 20. Qin, Y., Song, D., Chen, H., Cheng, W., Jiang, G., Cottrell, G.: A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017) 21. Ramaswamy, H.G., et al.: Ablation-CAM: visual explanations for deep convolutional network via gradient-free localization. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 983–991 (2020) 22. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: GradCAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618– 626 (2017)
23. Wang, H., Naidu, R., Michael, J., Kundu, S.S.: SS-CAM: smoothed score-CAM for sharper visual feature localization. arXiv preprint arXiv:2006.14255 (2020) 24. Wang, H., et al.: Score-CAM: score-weighted visual explanations for convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 24–25 (2020) 25. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016)
AI in Cryptocurrency Alexander I. Iliev1,2(B) and Malvika Panwar1 1 SRH University Berlin, Charlottenburg, Germany
[email protected], {3105481, malvika.panwar}@stud.srh-campus-berlin.de 2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, 8 Acad. Georgi Bonchev Street, 1113 Sofia, Bulgaria
Abstract. This study investigates the predictability of six significant cryptocurrencies for the upcoming two days using AI techniques, i.e., machine learning algorithms such as random forest and gradient boosting. The study shows that machine learning can serve as a medium to predict cryptocurrency prices. A machine learning system learns from past data, constructs prediction models, and predicts the output for new data whenever it receives it; the accuracy of the predicted output is influenced by the quantity of data, since the more data there is, the better the model can predict the output. Using the accuracy score as the performance metric, we calculated the accuracy of the algorithms and found that both random forest and gradient boosting, respectively, performed well for cryptocurrencies such as Solana (98.07%, 98.14%), Binance (96.56%, 96.85%), and Ethereum (96.61%, 96.60%), with the exception of Tether (0.38%, 12.35%) and USD coin (–0.59%, 1.48%). The results demonstrate that both algorithms work effectively with the majority of cryptocurrencies, and performance could be further increased by using deep learning algorithms like ANN, RNN or LSTM. Keywords: Cryptocurrency · Artificial intelligence · Machine learning · Deep learning · ANN · RNN · LSTM
1 Introduction

Most developed countries throughout the world adopt cryptocurrencies on a wide scale, and some significant businesses have begun to accept cryptocurrency as a form of payment worldwide; Microsoft, Starbucks, Tesla, and Amazon are just a few of them. The world today uses cryptocurrency, a form of virtual currency that is protected by cryptography and hence nearly impossible to reproduce, to power its economy. Cryptocurrencies are distributed over network-based blockchain technology. In its most basic form, the term "crypto" refers to the encryption algorithms and cryptographic techniques that safeguard the entries, such as elliptic curve encryption and public-private key pairs. Cryptocurrency exchanges are easy to use since they are not tied to any country, and they allow users to buy and sell cryptocurrencies using a variety of currencies. Cryptocurrencies are stored in a sophisticated wallet, which is conceptually
comparable to a virtual bank account. A blockchain stores the timestamp information and the records of many trades: each record is represented as a block, and each block links to the previous block of information. The information on the blockchain is encrypted, and only the wallet ID of the customer is made visible during exchanges; their name is not [14]. Cryptocurrencies can be obtained in a few ways, including mining them or buying them on exchanges. Many cryptocurrencies are not used for retail transactions; they serve as a means of exchange and are typically held as assets rather than used for regular purchases like groceries and utilities. As intriguing as this new class of investments may sound, cryptocurrencies come with a significant amount of risk, so thorough study is required. This paper's main goal is to inform readers about the benefits and drawbacks of investing in cryptocurrencies. New investments are always accompanied by a great deal of uncertainty, so it is important to analyse every aspect of cryptocurrencies in order to give users and investors better information and to encourage ethical usage of cryptocurrencies without overlooking the risks and potential failure of the investment. Cryptocurrencies are one of the riskiest yet most prominent investments available in the modern technological and economic environment; therefore, price prediction will play a growing role in helping investors decide whether or not to invest in a cryptocurrency. In this study, we select six cryptocurrencies and use two ensemble machine learning algorithms, random forest and gradient boosting, to predict their prices for the next two days.
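As a toy illustration of the block-linking idea mentioned above, the snippet below chains records by storing each block's hash in the next block. It is a conceptual sketch only, with made-up wallet names, and does not reflect how any production blockchain is implemented.

```python
import hashlib
import json
import time

def make_block(transactions, previous_hash):
    """Create a block whose identity depends on its contents and on the previous block."""
    block = {
        "timestamp": time.time(),
        "transactions": transactions,
        "previous_hash": previous_hash,
    }
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

genesis = make_block([{"from": "wallet_a", "to": "wallet_b", "amount": 1.5}], previous_hash="0" * 64)
second = make_block([{"from": "wallet_b", "to": "wallet_c", "amount": 0.7}], previous_hash=genesis["hash"])
```

Because every block's hash covers the previous hash, tampering with an earlier record invalidates all later blocks, which is the property the paragraph above alludes to.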
2 Related Work

Early studies on bitcoin argued over whether it was a currency or purely a speculative asset, with most authors favouring the latter conclusion in light of the cryptocurrency's high volatility, extraordinary short-term gains, and bubble-like price behaviour [3, 4, 7, 17]. The idea that cryptocurrencies are just speculative assets with no underlying value prompted research into potential correlations with macroeconomic and financial variables as well as other pricing factors related to investor behaviour [10]. Even for more conventional markets, these factors have been shown to be of utmost significance; for instance, [16] point out that Chinese companies with more attention from regular investors typically have a lower probability of stock price crashes. The market for cryptocurrencies has exploded during the past ten years [19]. For a brief period, Bitcoin's market value exceeded $3 trillion, and it became widely accepted as a kind of legal tender [1]. These events marked a major turning point in the general use of blockchain technology. Investors in cryptocurrencies currently have a wide range of choices, from Bitcoin and Ethereum to Dogecoin and Tether. Returns can be calculated in several different ways, and it can be challenging to predict returns for a volatile asset with wide price swings like a cryptocurrency [2]. There has been an increase in interest in cryptocurrency forecasting and profiteering using ML approaches over the past three years. Table 1 compiles many of those papers in chronological order, starting from the publication of [12], which, to the best of our knowledge, is one of the first studies to address this subject. The goal of this article is to contextualise and emphasise the key contributions of our study rather than
to present a comprehensive list of all papers that have been published in this area of the literature. See, for instance, [6] for a thorough analysis of cryptocurrency trading and numerous further references on ML trading. It has also been investigated whether bitcoin values, and those of other cryptocurrencies, are primarily influenced by public recognition, as [11] refer to it, measured by social media news, Google searches, Wikipedia views, Tweets, or comments on Facebook or specialist forums. To forecast changes in the daily values and transactions of bitcoin, Ethereum, and ripple, for instance, [9] investigated user comments and replies in online cryptocurrency communities; their findings were encouraging, especially for bitcoin [15]. Hidden Markov models based on online social media indicators are used by [13] to create profitable trading methods for a variety of cryptocurrencies. Bitcoin, ripple, and litecoin are found to be unconnected to several economic and financial variables in both the time and frequency domains by [5]. In summary, these papers show that ML models outperform rival models such as autoregressive integrated moving averages and exponential moving averages in terms of accuracy, and improve the predictability of cryptocurrency prices and returns. This holds regardless of the period under analysis, data frequency, investment horizon, input set, problem type (classification or regression), and method. Around half of the surveyed research also contrasts the performance of trading strategies developed using these ML models with the passive buy-and-hold (B&H) strategy (with and without trading costs). There is no clear winner in the race between the various machine learning models, although it is generally agreed that ML-based strategies outperform passive ones in terms of overall cumulative return, volatility, and Sharpe ratio.
3 Artificial Intelligence and Machine Learning

3.1 Artificial Intelligence

Artificial intelligence (AI) eliminates repetitive activities, freeing up a worker's time for higher-level, more valuable work. AI is a cutting-edge technology that can deal with data that is too complex for a person to manage. It can be used to automate tedious and repetitive marketing processes, enabling sales representatives to concentrate on relationship development, lead nurturing, and similar tasks. Self-awareness is the greatest and most advanced level of artificial intelligence. A sales executive can design a successful strategy using AI data and recommendation systems. In short, artificial intelligence looks set to shape the future of the globe: AI systems could, for example, monitor patients with little direct human supervision, and AI has applications in several other fields. AI is already being employed in almost every industry, offering any company that adopts it a competitive advantage. Deep learning is a kind of machine learning that runs inputs through a biologically inspired neural network architecture. Understanding the difference between artificial intelligence, machine learning, and deep learning can be confusing. With artificial intelligence, machines perform functions such as learning, planning, reasoning and problem-solving.
3.2 Machine Learning

A machine learning system builds prediction models based on historical data and predicts the results for fresh data whenever it receives it. The amount of data has an impact on the accuracy of the predicted output, since the more data the model has, the better it can predict the outcome. The need for machine learning in industry is constantly growing. Computers increasingly use machine learning, which enables them to draw lessons from their prior experience. Machine learning uses a variety of approaches to build mathematical models and make predictions. Currently, the technique is employed for a wide range of purposes, including image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.

3.2.1 Machine Learning Algorithms

a. Supervised Machine Learning. Data that comes with a predetermined label, such as spam/not-spam or a stock price at a specific time, is known as training data. The model must go through a training phase in which it is asked to make predictions and is corrected when those predictions are wrong; the training phase is repeated until the model achieves the desired level of accuracy on the training data set. Example algorithms include back-propagation neural networks and logistic regression (a minimal supervised example is sketched after this list).

b. Unsupervised Machine Learning. Unsupervised learning techniques are used when only the input variables exist and there are no corresponding output variables. They use unlabeled training data to build models of the underlying data structure. Data can be grouped using clustering techniques so that objects in one cluster are more similar to one another than to those in another cluster.

c. Reinforcement Machine Learning. Reinforcement learning is a type of machine learning that enables agents to choose the best course of action based on their current state by training them to take actions that maximize rewards. Reinforcement learning systems frequently discover the best behaviors through trial and error. Imagine a video game where you need to reach certain locations at certain times to earn points: a reinforcement algorithm playing that game would first move the character randomly, but over time, and with plenty of trial and error, it would figure out where and when to move the character to maximize its point total.
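As the minimal supervised example referenced in (a), the snippet below fits a scikit-learn model on labeled data and evaluates it on held-out data; the toy dataset and the choice of a random forest classifier are placeholders, not the experimental setup of this study.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                          # labeled training data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))            # corrected against unseen labels
```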
4 Methodology

In this section, we describe our approach to cryptocurrency price prediction using the proposed machine learning algorithms [18]. The workflow proposed for this study is shown in Fig. 1.
Fig. 1. Proposed flowchart
4.1 Data Collection

Secondary data is used for this study, collected from in.investing.com, which provides various historical price data free of charge. We collected data for cryptocurrencies drawn from the platform's top-10 list; the six used in this study, together with the date range and number of observations of the collected data, are shown in Table 1.

Table 1. Cryptocurrencies

S. No. | Cryptocurrency name | Data collected from date | Data collected till date | Number of observations
1 | Binance | 1/1/21 | 10/1/21 | 375
2 | Cardano | 1/1/21 | 10/1/21 | 375
3 | Ethereum | 1/1/21 | 10/1/21 | 375
4 | Solana | 1/1/21 | 10/1/21 | 372
5 | Tether | 1/1/21 | 10/1/21 | 375
6 | USD coin | 1/1/21 | 10/1/21 | 375
The collected dataset contains the following features: Date (date of the observation), Open (opening price on the given day), High (highest price on the given day), Low (lowest price on the given day), Price (closing price on the given day), and Volume (volume of transactions on the given day).
4.2 Preliminary Data Analysis

We examine the distribution of our data before building the final machine learning model. Since price is the target variable, we plot the price of each cryptocurrency, compute its 30-day and 2-day rolling average prices, and determine whether the data can accurately predict prices 30 or 2 days into the future. The visualization for each cryptocurrency is displayed below.

4.2.1 Binance Plot
Fig. 2. Binance normal price vs monthly avg price
Fig. 3. Binance normal price vs 2 days avg price
The Binance price is less random than the price of Bitcoin (Fig. 2 and Fig. 3), and it appears to increase over time, although its current price is much lower than Bitcoin's. For investors looking for a lower-priced cryptocurrency, Binance can therefore be a good choice. The monthly (30-day) average price of Binance does not fit the actual price well (Fig. 2), which means we cannot reliably predict its price 30 days ahead; however, a rolling 2-day average fits the actual price closely (Fig. 3). Therefore, we can predict the Binance price 2 days into the future.
4.2.2 Cardano Plot
Fig. 4. Cardano normal price vs monthly avg price
Fig. 5. Cardano normal price vs 2 days avg price
The Cardano price peaked in September 2021 (Fig. 4 and Fig. 5) but has been decreasing since, so investing in Cardano would be risky. Prediction for Cardano shows the same pattern observed for Bitcoin and Binance, so we predict the price of Cardano 2 days into the future as well.

4.2.3 Ethereum Plot

From Fig. 6 and Fig. 7, we likewise predict the price of Ethereum two days into the future.
Fig. 6. Ethereum normal price vs monthly avg price
The price of Ethereum shows an increasing trend, but after December 2021 it begins to decline. Ethereum's price level is attractive, and investors may consider it; before making that suggestion, however, we check the predicted price of Ethereum on 12 January to see whether it is rising or falling relative to the price on 10 January.
Fig. 7. Ethereum normal price vs 2 days avg price
4.2.4 Solana Plot

Here we will predict the price after two days for Solana – Fig. 8 and Fig. 9.
Fig. 8. Solana normal price vs monthly avg price
Fig. 9. Solana normal price vs 2 days avg price
4.2.5 Tether Plot

Here we will predict the price after two days for Tether – Fig. 10 and Fig. 11.

4.2.6 USD Coin Plot

Here we will predict the price after two days for USD coin – Fig. 12 and Fig. 13.

Thus, from the above plots, the price of each cryptocurrency changes randomly, i.e., there is no clear trend in its price. The monthly average price of the cryptocurrencies does not fit the actual price well, which means we cannot predict the 30-day future price; but if we roll a 2-day average against the price, we can predict the price of each cryptocurrency 2 days ahead using machine learning.
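As a rough illustration of the rolling-average check described above, the following sketch assumes a pandas DataFrame loaded from an investing.com CSV export with Date and Price columns (the file name is hypothetical) and computes the 30-day and 2-day rolling means that are compared against the actual price in Figs. 2–13.

```python
# Rolling-average comparison sketch (file name and column names are assumptions).
import pandas as pd

df = pd.read_csv("binance_2021.csv", parse_dates=["Date"])
df = df.sort_values("Date").set_index("Date")

df["avg_30d"] = df["Price"].rolling(window=30).mean()  # monthly average
df["avg_2d"] = df["Price"].rolling(window=2).mean()    # 2-day average

# Plot the actual price against both rolling averages (as in Figs. 2 and 3).
ax = df[["Price", "avg_30d", "avg_2d"]].plot(figsize=(10, 4))
ax.set_ylabel("Price")
```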
Fig. 10. Tether normal price vs monthly avg price
Fig. 11. Tether normal price vs 2 days avg price
Fig. 12. USD coin normal price vs monthly avg price
Fig. 13. USD coin normal price vs 2 days avg price
4.3 Data Preprocessing

The collected data is raw in nature and machine learning models cannot be applied to it directly, so the data first needs to be processed. Data preprocessing is the set of methods for transforming unstructured data into structured data, and it involves many techniques. First, we check whether
there are any null values present in the data that need to be fixed before applying any algorithm; then we normalize the data, which helps to increase the accuracy of our machine learning models.

4.4 Model Preparation

In this section, we fit our machine learning models, but first we extract our dependent and independent variables, because we are using supervised machine learning algorithms. Our dependent variable is "Price after two days", which is obtained by shifting the price column by 2, and the independent variables are Price, Open, High, Low, and the OHLC average:

ohlc average = (Open + High + Low + Price) / 4
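A small sketch of this variable setup; the column names follow the dataset description, the file name is hypothetical, and the negative shift assumes rows sorted by ascending date:

```python
# Building the OHLC average and the 2-day-ahead target (assumes ascending dates).
import pandas as pd

df = pd.read_csv("binance_2021.csv", parse_dates=["Date"]).sort_values("Date")

df["Ohlc"] = (df["Open"] + df["High"] + df["Low"] + df["Price"]) / 4
df["Price_after_2_days"] = df["Price"].shift(-2)   # target: closing price two days ahead
df = df.dropna(subset=["Price_after_2_days"])      # the last two rows have no target

X = df[["Price", "Open", "High", "Low", "Ohlc"]]
y = df["Price_after_2_days"]
```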
4.4.1 Algorithms

The algorithms considered for this study are random forest and gradient boosting, both of which are powerful ensemble machine learning techniques.

Random Forest. Random forest is a supervised learning technique typically used for classification and regression. Because it is composed of many decision tree algorithms and creates multiple decision trees from samples of the data, it is a form of ensemble learning; regression trees are combined by averaging the samples' outputs. Although these techniques can predict both continuous (regression) and discrete (classification) outcomes, the algorithm generally performs better on classification problems than on regression [8]. Understanding random forest requires understanding the ensemble approach: in essence, an ensemble combines two or more models, so predictions are based on a group of models rather than a single one. Ensembles use two main kinds of techniques:

1. Bagging ensemble learning: the dataset is split into several subsets, the same algorithm is applied to each subset, and the final output is decided by majority voting (for classification) or averaging (for regression). Random forest employs this strategy.
2. Boosting ensemble learning: weak and strong learners are paired in a sequential model so that the final model attains the highest level of accuracy; AdaBoost and XGBoost are examples.

Now that the ensemble approaches are familiar, the random forest method, which uses the first approach (bagging), proceeds through the following steps:
Step 1: The dataset is split into various subsets, with each element sampled randomly.
Step 2: A single algorithm is trained on each subset; in the case of random forest, a decision tree is used.
Step 3: Each individual decision tree produces an output.
Step 4: The final output is obtained by combining the results of the predictors: taking the majority class for classification, or averaging the individual outputs for regression.

Hyperparameter values considered: for the random forest algorithm, we selected 200 estimators with a random state of 42.

Gradient Boosting. Gradient boosting is a very powerful machine learning algorithm in which many weak learners are combined to form a strong learner. It can be tuned through its learning rate; we check model accuracy with several different learning rates and, among those, find the best model for our problem. Hyperparameter values considered: random state = 0, n_estimators = 100. A sketch of both models appears below.
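Assuming scikit-learn is the implementation behind these hyperparameters (the paper does not name the library), and continuing from the X and y built in the previous sketch, the two models might be set up roughly as follows, including the learning-rate sweep mentioned above; the candidate learning rates and the unshuffled split are assumptions:

```python
# Hedged sketch of the two ensemble regressors with the stated hyperparameters.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)

best_gb, best_score = None, -float("inf")
for lr in [0.01, 0.05, 0.1, 0.2]:            # candidate learning rates (values assumed)
    gb = GradientBoostingRegressor(n_estimators=100, random_state=0,
                                   learning_rate=lr).fit(X_train, y_train)
    score = gb.score(X_test, y_test)          # R^2 on the held-out data
    if score > best_score:
        best_gb, best_score = gb, score

print("RF R^2:", rf.score(X_test, y_test),
      "| best GB R^2:", best_score, "| learning rate:", best_gb.learning_rate)
```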
5 Results

We achieved positive results on the test data after training. Solana has the highest accuracy, at 98.14%, while USD coin and Tether have the lowest, as their prices fluctuate erratically (see Table 2).

Table 2. Accuracy score of algorithms on each cryptocurrency

S. No. | Cryptocurrency name | Random forest accuracy | Gradient boosting accuracy | Learning rate
1 | Binance | 96.56% | 96.85% | 0.05
2 | Cardano | 94.88% | 95.11% | 0.05
3 | Ethereum | 96.61% | 96.60% | 0.05
4 | Solana | 98.07% | 98.14% | 0.1
5 | Tether | 38.00% | 12.35% | 0.05
6 | USD coin | 59.00% | 14.80% | 0.05
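The paper reports an "accuracy score" for regression models; if scikit-learn's default scorer is used, this corresponds to the coefficient of determination R², which is an assumption on our part. On that assumption, the per-coin scores in Table 2 could be produced along these lines (prepare_data is a hypothetical helper that builds the train/test split as in the earlier sketches):

```python
# Per-cryptocurrency evaluation sketch (assumes R^2 via .score(); coins per Table 1).
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

for coin in ["binance", "cardano", "ethereum", "solana", "tether", "usd_coin"]:
    X_train, X_test, y_train, y_test = prepare_data(f"{coin}_2021.csv")  # hypothetical helper
    rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)
    gb = GradientBoostingRegressor(n_estimators=100, random_state=0,
                                   learning_rate=0.05).fit(X_train, y_train)
    print(coin, round(rf.score(X_test, y_test), 4), round(gb.score(X_test, y_test), 4))
```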
Since we last observed the price of each cryptocurrency on January 10th, we are predicting its price for January 12th and comparing it to the actual price of each coin (see Table 3).
Table 3. Predicted results

S. No. | Cryptocurrency name | Prediction of price on Jan. 12th using random forest | Prediction of price on Jan. 12th using gradient boosting | Actual price on Jan. 12th
1 | Binance | 435.50 | 429.70 | 487.01
2 | Cardano | 1.20 | 1.20 | 1.31
3 | Ethereum | 3142.00 | 3131.40 | 3370.89
4 | Solana | 141.70 | 141.20 | 151.43
5 | Tether | 1.00 | 1.00 | 1.00
6 | USD coin | 1.00 | 1.00 | 0.99
6 Conclusions and Future Work

In this study, we introduced the idea of cryptocurrencies and how they can help investors earn money while accounting for the risk of loss. We proposed a prediction model employing artificial intelligence and machine learning, with which we project the prices of six cryptocurrencies two days ahead in order to lower the risk of loss. To discover the optimal algorithm for predicting the performance of these six cryptocurrencies, we applied two ensemble machine learning algorithms. The results show that both algorithms perform well for most cryptocurrencies, except for Tether and USD coin, whose prices changed erratically. The performance metric used in this study was the accuracy score, and both algorithms achieved accuracy above 95% for Solana, Binance, and Ethereum. In the future, deep learning models such as ANN and LSTM can be trained to achieve better results, which can be further optimized using evolutionary techniques such as particle swarm optimization and genetic algorithms.
References 1. AF, B. The inefficiency of bitcoin revisited: a dynamic approach. Econ. Lett. 161, 1–4 (2017) 2. Anon. Top 10 Cryptocurrencies to Bet on for Good Growth in Feb 2022 (2022). https://www. analyticsinsight.net/top-10-cryptocurrencies-to-bet-on-for-good-growth-in-feb-2022/ 3. Cheah, E.T., Fry, J.: Speculative bubbles in bitcoin markets? an empirical investigation into the fundamental value of Bitcoin. Econ Lett 130, 32–36 (2015) 4. Cheung, A., Roca, E.: Crypto-currency bubbles: an application of the Phillips-Shi-Yu Yu. methodology on Mt.Gox Bitcoin prices. Appl. Econ. 47(23), 2348–2358 (2015) 5. Corbet, S., Meegan, A.: Exploring the dynamic relationships between cryptocurrencies and other financial assets. Econ. Lett. 165, 28–34 (2018) 6. Fang, F., Carmine, V.: Cryptocurrency trading: a comprehensive survey. Preprint arXiv:2003. 11352 (2020)
7. Dwyer, G.P.: The economics of Bitcoin and similar private digital currencies. J. Financ. Stab. 17, 81–91 (2015) 8. Borges, D.L., Kaestner, C.A.A. (eds.): SBIA 1996. LNCS, vol. 1159. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61859-7 9. Kim, Y.B., Kim, J.H.: Predicting fluctuations in cryptocurrency transactions based on user comments and replies. PLoS One 11(8), e0161197 (2016) 10. Böhme, R., Christin, N., Edelman, B., Moore, T.: Bitcoin: economics, technology, and governance. J. Econ. Perspect. (JEP) 29(2), 213–238 (2015) 11. Li, X., Wang, C.A.: The technology and economic determinants of cryptocurrency exchange rates: the case of Bitcoin. Decis. Support Syst. 95, 49–60 (2017) 12. Madan, I, Saluja, S.: Automated bitcoin trading via machine learning algorithms (2019). http://cs229.stanford.edu/proj2014/Isaac%20Madan,20 13. Phillips, R.C., Gorse, D.: Redicting cryptocurrency price bubbles using social media data and epidemic modelling. In: 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. https://doi.org/10.1109/SSCI.2017.8280809 14. Kaur, A., Nayyar, A.: Blockchain a Path to the Future Cryptocurrencies and Blockchain Technology Applications. John Wiley & Sons, Ltd (2020) 15. Tiwari, A.K., Jana, R.K.: Informational efficiency of Bitcoin—an extension. Econ. Lett. 163, 106–109 (2018) 16. Wen, F., Xu, L.: Retail investor attention and stock price crash risk: evidence from China. Int. Rev. Financ. Anal. 65, 101376 (2019) 17. Yermack, D.: Is Bitcoin a Real Currency? An Economic Appraisal. Springer, Berlin (2015) 18. Catania, L., Grassi, S., Ravazzolo, F.: Predicting the volatility of cryptocurrency time-series. In: Corazza, M., Durbán, M., Grané, A., Perna, C., Sibillo, M. (eds.) Mathematical and Statistical Methods for Actuarial Sciences and Finance, pp. 203–207. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89824-7_37 19. Tran, V.L., Leirvik, T.: Efficiency in the markets of crypto-currencies. Finance Res. Lett. 35, 101382 (2020)
Short Term Solar Power Forecasting Using Deep Neural Networks Sana Mohsin Babbar(B) and Lau Chee Yong School of Computer, Engineering and Technology, Asia Pacific University of Technology and Innovation (APU), 57000 Kuala Lumpur, Malaysia [email protected], [email protected]
Abstract. A persistent challenge in recent years has been the intermittent and unpredictable nature of solar power. Mitigating the sporadic behavior of solar energy resources is imperative for the future of PV system generation, and forecasting plays an important role in grid control optimization and the economic dispatch of energy. This paper presents a model that forecasts short-term solar power up to 6 h ahead using a recurrent neural network (RNN), one of the most capable classes of neural networks for sequential data. This study provides an extensive review of implementing recurrent neural networks for solar power generation prediction. Simulations and results show that the proposed methodology performs well: with seven months of data chosen for training, the RMSE falls from 47 to 30.1, the MAE from 56 to 39, and, most importantly, the MAPE from 18% to 11% overall. This research also reveals that good accuracy and efficacy are attained through the calibration of solar irradiance, which demonstrates the plausibility and effectiveness of the proposed model. Keywords: PV system generation · Recurrent neural network (RNN) · Short-term solar power forecasting
1 Introduction

Solar energy is one of the most widely used and popular renewable sources. The concept of 'green energy' has taken hold largely because of its extensive use, mostly in tropical areas where solar irradiance is highest [1]; it also helps to mitigate carbon emissions. Around the globe, the use of solar energy has grown in this era, and almost every large-scale company uses PV solar systems to reduce pollution and carbon emissions due to their environmentally friendly nature [2]. It is, however, difficult to manage the variation and intermittency of the solar energy produced, so predictive tools are used to provide controllability and better accuracy [3]. Forecasting solar power generation plays a vital role in the smart grid [4]. There are generous perks and benefits of forecasting solar power generation based on the time horizon. Firstly, forecasts serve the energy imbalance market (EIM). Secondly, it has been stated that the probability of an energy imbalance can be reduced by 19.65% using forecasting methods [5].
Furthermore, past research has observed that forecasts can also reduce operational costs, supporting the economic sustainability of any state. In [6], an electric vehicle (EV) charge–discharge management framework was proposed for better integration of PV output into the smart grid by exchanging information between home energy management systems (HEMS) and grid energy management systems (GEMS). The advantages of solar energy are stated in almost every study of renewable energies: PV arrays produce no pollution, as they are environmentally friendly systems, and PV power plants have minimal maintenance expenditure and low operating costs [7]. Constructive and effective planning for installing a PV solar plant leads to better power utility, and forecasting methods in particular should be planned according to the PV plant and its procedures [8]. Hence, to avoid the inconvenience and intermittency of solar energy, efficacious forecasting models are the state of the art worldwide. Forecasting plays an important role in finding solutions to renewable-energy-related problems and is a genuine contribution to the energy sector. From a management perspective, energy management systems (EMS) that use forecasting mechanisms ensure clean energy at high efficiency.
2 Literature Review

In the past few years, forecasting and prediction of solar power generation have gained research attention. Prediction methods, particularly for PV prediction, are mainly classified into three classes: physical methods, statistical methods, and artificial intelligence techniques. Physical models predict through models such as model output statistics (MOS), numerical weather prediction (NWP), and other geological models; they are meant to measure or obtain meteorological data with precision, but they need fine calibration and care to operate. Statistical models, on the other hand, are data-driven approaches: they mostly deal with historical data sets and focus on error minimization [9]. Meanwhile, models based on artificial intelligence techniques are adequate tools for PV and wind prediction; the best feature of AI techniques is that they can easily deal with non-linear behavior [10]. In recent and past research, the focus on AI and machine learning approaches for forecasting has been at its peak, and different linear and non-linear models have been applied to data sets produced by the physical and statistical models. In [11], a study showed that, for massive and complicated computation, neural networks with a hybrid approach can be the best predictive tool. A model was established in 2017 using an artificial neural network (ANN) and analog ensembles (AnEn) for short-term PV power forecasting; results and discussion showed that a combination of both techniques yields the best results. In 2015 in China, which has a large renewable energy reservoir, a study forecast day-ahead PV power output with four different variables using support vector machines (SVM); simulations showed that the proposed SVM model is effective and promising for PV power output [12]. An improved and accurate PV power forecasting model was proposed using deep LSTM-RNN (long short-term memory recurrent neural network); in that study, hourly data sets have been
used for one year; LSTM was chosen for its extensive recurrent architecture and memory unit, and compared with other techniques and models it minimized the errors best [13]. Extensive and detailed research conducted recently, in 2020, culminated in a review of past years of solar power forecasting using artificial neural networks; it also observed that forecasting plays a crucial role in the economic dispatch and optimization of solar energy into the smart grid, so forecasting and prediction algorithms must be implemented, and the review showed that neural networks are the hallmark of PV power forecasting [14]. A comparative analysis was made to observe the efficacy of RNNs: in 2018, research was conducted on multiple time horizons within the domain of short-term solar forecasting using RNNs, where only 4-h-ahead forecasting was done and the RMSE obtained was between 20 and 40% [15]. This paper, in contrast, comprises six-hour-ahead prediction with an RMSE around 30% and good accuracy under the MAPE criterion. This research aims to predict solar power generation using an RNN trained with the Levenberg–Marquardt model [16]. The key objective behind this paper is to compare the predicted PV power output with a persistence model to assess the predictive model's performance; in short-term power forecasting, the persistence model is chosen as the comparative baseline for checking precision [17]. The literature shows that an RNN with any chosen training model gives good accuracy and precision. The RNN has the distinctive feature of memorizing information about the data set: it has proved useful for prediction because it also retains previous inputs [18].
3 Deep Neural Network

Deep neural network (DNN) machine learning approaches have been widely adopted over the past few years. They can extract information from several kinds of data, such as audio, video, images, matrices and arrays. Much research has been conducted in the area of deep learning, and recurrent neural networks are one branch of it [19]. RNNs are mostly used in natural language processing (NLP), as they are specialized in processing sequences and large data sets. An appealing feature of the RNN is that variable-length sequences can be handled as both inputs and outputs. The RNN extends the standard feedforward neural network with recurrent connections and commonly deals with multiple layers and parameters; it is mainly designed and chosen for this multi-parameter feature, and its architecture is rich, as it contains a back-propagation (through time) training function. The literature [20] discusses different types of recurrent neural networks, i.e., the infinite impulse neural network, the Elman recurrent neural network, the diagonal recurrent neural network, and the local activation feedback multilayer network. Figure 1 shows the basic layout and architecture of a recurrent neural network.
Fig. 1. Basic architecture of RNN.
The basic formulas for the RNN in Eq. 1 and 2 below describe how the network works. Equation 1 states that the hidden state h(t) is a function f of the previous hidden state h(t − 1) and the current input x(t), where θ denotes the chosen parameters; the network thus treats h(t) as a summary of the past sequence of inputs up to time t. The output y(t) is then obtained by multiplying the hidden state by the output weights w_hy and adding the bias b_y:

h(t) = f(h(t − 1), x(t); θ)   (1)

y(t) = w_hy h(t) + b_y   (2)
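A minimal sketch of Eqs. (1) and (2), assuming a tanh activation for f and randomly initialized weights; the dimensions and names are illustrative only:

```python
# Elman-style RNN forward pass implementing Eqs. (1) and (2).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 30, 1          # e.g., 3 inputs (irradiance, temperature, power)

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

def rnn_step(h_prev, x_t):
    """Eq. (1): new hidden state from the previous state and the current input."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    # Eq. (2): output from the hidden state.
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):    # a dummy 5-step input sequence
    h, y = rnn_step(h, x_t)
print(y)
```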
For a deeper analysis of the hidden states, Fig. 2 demonstrates how the hidden layer is connected to the previous layer [21], while Fig. 3 shows a network with n inputs and n outputs.
Fig. 2. One fully connected layer.
On the other hand, long short-term memory (LSTM) networks have also been used in ample prediction and forecasting applications, and prior research has observed that LSTM often works more effectively than a plain RNN. In this research, however, LSTM did not deliver the efficacy and precision that the RNN did.
Fig. 3. Many to many type of RNN.
LSTM is a type of RNN with a few distinct features. It is also used in speech recognition, image processing, and classification, and most often in sentiment analysis, video analysis, and language modelling. In comparison to a plain RNN, the LSTM architecture contains a set of gates, as shown in Fig. 4; their mathematical expressions are given in Eq. 3, 4, and 5, respectively.
Fig. 4. Architecture of LSTM.
i_t = σ(w_i [h_{t−1}, x_t] + b_i)   (3)

f_t = σ(w_f [h_{t−1}, x_t] + b_f)   (4)

o_t = σ(w_o [h_{t−1}, x_t] + b_o)   (5)
where i_t represents the input gate, f_t the forget gate, and o_t the output gate; σ is the sigmoid function, h_{t−1} is the previous hidden state, w the weights, and b the biases of each gate. The major difference between the RNN and the LSTM is the memory feature: the LSTM learns long-term dependencies. As discussed above, LSTMs are mainly used for speech and video analysis, and here they did not perform well; due to the data type, the LSTM produced output in matrix form containing negative values, which are not useful for this prediction task and make the accuracy hard to analyze.
4 Proposed Methodology

Due to the high intermittency and variability of weather conditions, the RNN is chosen as the predictive tool for this study. The methodology of this research is divided into a few steps, as shown in Fig. 5. First, the input parameters are selected from the data set: solar irradiance, module temperature, and solar power are taken as inputs. The RNN processes the inputs in multiple steps, and the hidden-layer input at the current step also includes the hidden state of the previous step, as shown in Eq. 1 and 2. The next step is to pre-process the data with in-depth analysis: the data is filtered to the daytime period of high solar energy production, since the highest penetration of sun rays occurs between 8:30 AM and 5:30 PM. Furthermore, the most important factor is solar irradiance [22]; information about solar irradiance is essential for predicting future PV power generation, and in past research solar irradiance values above 300 W/m² are typically retained (a short filtering sketch is given after Fig. 5). After pre-processing the data, establishing the model is the key step. The RNN model is designed by trial and error: the number of hidden layers in a neural network is generally decided this way, so the data set was trained with 10 to 50 layers and the results analyzed to see how many give accurate results. In this study, 30 hidden layers are chosen between the input and the output layers.
Fig. 5. Flow chart of the proposed methodology.
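As a rough illustration of the filtering described above (the file name and column names are assumptions, since the paper does not specify the data format):

```python
# Pre-processing sketch: keep daytime records with irradiance above 300 W/m^2.
import pandas as pd

df = pd.read_csv("pv_plant_minutely.csv", parse_dates=["timestamp"])  # hypothetical file
df = df.set_index("timestamp")

daytime = df.between_time("08:30", "17:30")                  # 8:30 AM to 5:30 PM window
filtered = daytime[daytime["irradiance"] > 300]              # keep irradiance > 300 W/m^2

X = filtered[["irradiance", "module_temp", "solar_power"]]   # model inputs
y = filtered["solar_power"]                                  # training target
```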
For a brief analysis of the data, the trend between the input parameters before training was observed, as shown in Fig. 6. The trend shows that solar power and irradiance are almost directly proportional to each other, while module temperature remains nearly constant throughout the day. The readings are taken for the entire month of July 2019.
Fig. 6. Trend between input parameters.
The trends between the target and all input features have also been examined, since the training target is essential, as shown in Fig. 7. Supervised machine learning approaches need a target to kick-start the simulation: the target acts as a plant between the input and the output layers, and employing it is an essential part of any supervised model because it provides the data against which predictions are calculated. In short, it brings accuracy to the results. Figure 8 shows the trend between all the input parameters individually.
Fig. 7. Trend of input parameters with the target.
The data set is divided with a 70/30 split: 70% of the data is allocated to training, 15% to validation, and the remaining 15% to testing, as demonstrated in Fig. 8. The network topology of the RNN model is listed in Table 1.
Table 1. Network topologies used in RNN

Arguments | Values
Layer delays | 1:5
Hidden layers | 30
Train function | trainlm
No. of hidden neurons | 12283
Batch size | 12283 × 3
Fig. 8. Division of data set.
After the successful training of the RNN model, the accuracy is calculated with the help of quantifying measures. In this paper, RMSE (root mean square error), MAE (mean absolute error), and MAPE (mean absolute percentage error) are chosen to assess efficacy, as expressed below in Eq. 6, 7, and 8 [23]:

RMSE = √( Σ_{i=1}^{N} (y_i − ŷ_i)² / N )   (6)

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|   (7)

MAPE = (1/N) Σ_{i=1}^{N} |(y_i − ŷ_i) / y_i| × 100   (8)

where y is the actual data and ŷ is the predicted data. As commonly observed, solar energy is variable: there are no hard and fast conditions on the sun's intensity, and weather can change at any time. The variation in solar power generation was therefore also examined in this study to observe its behavior and trend with respect to time. Maximum solar energy was discerned roughly between 8:30 in the morning and 5:30 in the evening; the trend is shown in Fig. 10, which clearly shows how quickly solar power increases and decreases with time and weather conditions, with each day showing different fluctuations and variations according to the solar irradiance. For comparison purposes, LSTM was chosen as a comparative model alongside the RNN. From its output it was observed that LSTM is not a suitable model for forecasting solar power with this kind of array data set: the output contained negative values, which are
of no use for this forecasting purpose. The output obtained from the LSTM model is shown in Fig. 9.
Fig. 9. Solar power (MW) characteristic using LSTM.
The predicted solar power obtained from the LSTM shows distinctly different behavior from the RNN model: the predicted output is generated in the form of a 3 × 3 matrix that includes negative values. Predictions cannot be negative here; such output gives no concrete information and suggests that the model is struggling with outliers. Comparing the LSTM's output leads to the conclusion that the LSTM model is not suitable for the kind of data set used in this research. Because the data is real-time data with thousands of time steps at per-minute resolution, the output has been compressed into matrix form relative to the targeted output and shows no usable observations, while Fig. 10 shows the trend of the solar power characteristic over seven days. The output also reveals that solar power forecasting is challenging for the LSTM, as this model is best known for image processing, speech, and video recognition.
Fig. 10. Trend of solar power among seven days.
5 Results and Discussion

In this study, the results are quantified through RMSE (root mean square error) and MAE (mean absolute error). Solar power, solar irradiance, and module temperature are
taken as inputs, while the target contains solar power as the desired output. Due to the vast data set and training time constraints, the results are shown and discussed by month. After several rounds of trial and error, the output shows low errors. Figure 11 shows the output-versus-target trend for the RNN model; the trend is shown on the test indices, which are 15% of the whole data. The results show strong agreement between target and output, with the model minimizing the error. Table 2 summarizes the error minimization for every month: as is evident from the quantifying measures, the RMSE decreases from 57.50 to 31.38 during February, and the reduction in percentage MAPE almost fulfills the criterion of good accuracy [24, 25], which states that MAPE should be around 10%. In this study, the successive reduction of percentage MAPE is shown in comparison with the persistence model, and a significant reduction in MAE is also observed. In short-term solar power prediction, the persistence model is regarded as the benchmark for comparing predicted output. The persistence model assumes that conditions at the forecast time will not change: the comparison of predicted solar power is made at the zeroth hour, and it predominantly uses the previous time-step data as the forecast. The mathematical expression of the persistence model is given in Eq. 9:

ŷ(t + 1) = y(t − 1)   (9)

where y(t − 1) is the previous time step and ŷ(t + 1) is the expected outcome.
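A small sketch of the three error measures (Eqs. 6–8) and a simple persistence baseline, assuming equal-length arrays of actual and predicted values (the sample numbers are invented, and actual values of zero would need special handling in MAPE):

```python
# RMSE, MAE, MAPE (Eqs. 6-8) and a simple persistence baseline (Eq. 9).
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    return float(np.mean(np.abs((y - y_hat) / y)) * 100)

y = np.array([120.0, 135.0, 150.0, 160.0, 155.0])      # actual solar power (illustrative)
y_rnn = np.array([118.0, 130.0, 148.0, 158.0, 150.0])  # hypothetical RNN predictions

# Persistence: use the previous observation as the forecast for the next step.
y_persist = np.roll(y, 1)[1:]                          # shift actuals by one step
print(rmse(y[1:], y_persist), rmse(y[1:], y_rnn[1:]))
print(mape(y[1:], y_persist), mape(y[1:], y_rnn[1:]))
```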
Fig. 11. Solar power generation prediction characteristic.
Regression analysis is performed to see whether the data set fits the model well; it describes the relationship between the predictor and the response. In Fig. 12, the data points fit the model and reduce the error, showing minimal training error. Furthermore, Fig. 13 shows the performance of the training model at epoch 85, indicating that the model trains at its best by reaching the minimum mean squared error (MSE). Figure 14 and Table 2 show the results of the quantifying measures from August 2019 to March 2020. As stated above, solar energy is intermittent and variable, so the errors also vary from month to month; however, the error minimization between response and predictor is almost the same throughout. Finally, in Fig. 15 the final RMSE, MAE, and MAPE
Fig. 12. Regression analysis after successful training.
Fig. 13. Performance plot obtained after training.
are compared with the persistence model. The histogram clearly shows that the RNN model performed well and reduced the errors compared to the persistence model. The LSTM was also chosen for comparison with the RNN, but its results could not be evaluated with the quantifying measures: because of the form of its output matrix, RMSE, MAE, and MAPE are hard to calculate. From the results and discussion, the RNN appears to be a promising model for predicting solar power generation: it works well for time-series and continuous data and reduces the errors effectively. According to the percentage MAPE criterion, a forecast with MAPE greater than 10% and less than 20% falls under the category of good accuracy [26], and the RNN in this paper meets that criterion.
Fig. 14. Reduction of errors from Aug 2019 to Mar 2020.
Table 2. Quantifying measures

2019–2020 | Errors | Persistence | RNN
Aug | RMSE / MAE / MAPE | 46.50 / 55.75 / 17.52 | 30.19 / 37.40 / 11.29
Sep | RMSE / MAE / MAPE | 40.52 / 48.44 / 14.52 | 31.54 / 39.22 / 11.97
Oct | RMSE / MAE / MAPE | 47.04 / 54.97 / 16.88 | 31.47 / 39.18 / 11.83
Nov | RMSE / MAE / MAPE | 49.05 / 60.52 / 18.03 | 30.8 / 38.2 / 11.59
Dec | RMSE / MAE / MAPE | 47.4 / 56.70 / 17.50 | 31.39 / 39.5 / 11.88
Jan | RMSE / MAE / MAPE | 30.14 / 68.75 / 11.01 | 29.87 / 39.6 / 6.06
Feb | RMSE / MAE / MAPE | 57.50 / 73.00 / 22.82 | 31.38 / 38.6 / 11.90
Mar | RMSE / MAE / MAPE | 53.93 / 67.79 / 19.95 | 30.19 / 37.40 / 11.29
Fig. 15. Comparison of RNN with the persistence model.
6 Conclusion and Future Work

In this paper, an RNN trained with the Levenberg–Marquardt model is used to predict solar power generation six hours ahead. Unlike other traditional neural networks, the RNN deals well with the complexities of huge data sets containing errors and biases, minimizing error through its back-propagation property; the simulations and verifications are made on a real-time data set. The proposed forecasting method reduces %MAPE from almost 18% to 11%, which falls under good accuracy, while the RMSE drops from 47 to about 30 compared with the persistence model, indicating that the RNN model is more accurate than the persistence model. Consequently, the proposed model provides well-founded forecasting for actual PV power grids, and the proposed methodology can also enhance the future deployment of solar energy and be used over broader time horizons. In future work there is room for improvement, as there always is with machine learning approaches. First, different training algorithms such as gradient descent, resilient propagation, and Bayesian regularization can be applied to the RNN to observe their performance. Second, more precision may be obtained in the PV power output by combining different machine learning approaches, such as SVM and regression techniques [27]. Lastly, the RNN model can be applied to other forecasting fields such as wind speed, energy management, trading, and load forecasting.
References 1. Hadi, R.S., Abdulateef, O.F.: Modeling and prediction of photovoltaic power output using artificial neural networks considering ambient conditions. Assoc. Arab Univ. J. Eng. Sci. 25(5), 623–638 (2018) 2. Ogundiran, P.: Renewable energy as alternative source of power and funding of renewable energy in Nigeria. Asian Bull. Energ. Econ. Technol. 4(1), 1–9 (2018)
3. Antonanzas, J., Osorio, N., Escobar, R., Urraca, R., Mar-tinez-De-Pison, F., AntonanzasTorres, F.: Review of photovoltaic power forecasting. Sol. Energy 136(78–111), 4 (2016) 4. Abuella, M.: Solar power forecasting using artificial neural networks. In: North American Power Symposium, IEEE, pp. 1–5 (2015) 5. Kaur, A.N.: Benefits of solar forecasting for energy imbalance markets. Renew. Energ. 86, 819–830 (2015) 6. Kikusato, H., Mori, K., Yoshizawa, S., Fujimoto, Y., Asano, H., et al.: Electric vehicle charge– discharge management for utilization of photovoltaic by coordination between home and grid energy management systems. IEEE Trans. Smart Grid 10(3), 3186–3197 (2018) 7. Ni, K., Wang, J., Tang, G., Wei, D.: Research and application of a novel hybrid model based on a deep neural network for electricity load forecasting: a case study in Australia. Energies 12(13), 2467 (2019) 8. Kumari, J.: Mathematical modeling and simulation of photovoltaic cell using matlab-simulink environment. Int. J. Electr. Comput. Eng. 2(1), 26 (2012) 9. Li, G., Wang, H., Zhang, S., Xin, J., Liu, H.: Recurrent neural networks based photovoltaic power forecasting approach. Energies 12(13), 2538 (2019) 10. Torres, J.F., Troncoso, A., Koprinska, I., Wang, Z., Martínez-Álvarez, F.: Big data solar power forecasting based on deep learning and multiple data sources. Expert. Syst. 36(4), e12394 (2019) 11. Cervone, G., Clemente-Harding, L., Alessandrini, S., Delle Monache, L.: Short-term photovoltaic power forecasting using artificial neural networks and an analog ensemble. Renew. Energ. 108, 274–286 (2017) 12. Shi, J., Lee, W., Liu, Y., Yang, Y., Wang, P.: Forecasting power output of photovoltaic (2015) 13. Abdel-Nasser, M., Mahmoud, K.: Accurate photovoltaic power forecasting models using deep LSTM-RNN. Neural Comput. 31, 1–14 (2017) 14. Pazikadin, A., Rifai, D., Ali, K., Malik, M., Abdalla, A., Faraj, M.: Solar irradiance measurement instrumentation and power solar generation forecasting based on artificial neural networks (ANN): a review of five years research trend. Sci. Total Environ. 715, 136848 (2020) 15. Mishra, S. Palanisamy, P.: Multi-time-horizon solar forecasting using recurrent neural network. In: 2018 IEEE Energy Conversion Congress and Exposition (ECCE), pp. 18–24. IEEE (2018) 16. Ye, Z., Kim, M.K.: Predicting electricity consumption in a building using an optimized backpropagation and Levenberg–Marquardt back-propagation neural network: case study of a shopping mall in China. Sustain. Cities Soc. 42, 176–183 (2018) 17. Panamtash, H., Zhou, Q., Hong, T., Qu, Z., Davis, K.: A copula-based Bayesian method for probabilis-tic solar power forecasting. Sol. Energ. 196, 336–345 (2020) 18. Zhang, R., Meng, F., Zhou, Y., Liu, B.: Relation classification via recurrent neural network with attention and tensor layers. Big Data Min. Analytics 1(3), 234–244 (2018) 19. Yu, Y., Si, X., Hu, C., Zhang, J.: A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 31(7), 1235–1270 (2019) 20. Lee, D., Kim, K.: Recurrent neural network-based hourly prediction of photovoltaic power output using meteorological information. Energies 12(2), 215 (2019) 21. Salehinejad, H., Sankar, S., Barfett, J., Colak, E., Valaee, S.: Recent advances in recurrent neural networks.. arXiv preprint arXiv, 1801.01078 (2017) 22. Javed, A., Kasi, B.K., Khan, F.A.: Predicting solar irradiance using machine learning techniques. 
In: 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 1458–1462 (2019) 23. Chicco, D., Warrens, M.J.: The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 7, e623 (2021)
24. Wang, L., Lv, S.X., Zeng, Y.R.: Effective sparse adaboost method with ESN and FOA for industrial electricity consumption forecasting in China. Energy 155, 1013–1031 (2018) 25. Babbar, S.M., Lau, C.Y., Thang, K.F.: Long term solar power generation prediction using adaboost as a hybrid of linear and non-linear machine learning model. Int. J. Adv. Comput. Sci. Appl. 12(11) 2021 26. Ding, J., Tarokh, V., Yang, Y.: Model selection techniques: an overview. IEEE Sig. Process. Mag. 35(6), 16–34 (2018) 27. Babbar, S.M., Lau, C.Y.: Medium term wind speed forecasting using combination of linear and nonlinear models. Solid State Technol. 63(1s), 874–882 (2020)
Convolutional Neural Networks for Fault Diagnosis and Condition Monitoring of Induction Motors Fatemeh Davoudi Kakhki1,2(B) and Armin Moghadam1 1 Department of Technology, San Jose State University, San Jose, CA 95192, USA
{fatemeh.davoudi,armin.moghadam}@sjsu.edu 2 Machine Learning and Safety Analytics Lab, Department of Technology,
San Jose State University, San Jose, CA 95192, USA
Abstract. Intelligent fault diagnosis methods using vibration signal analysis are widely used for bearing fault detection in the condition monitoring of induction motors. This has several challenges. First, a combination of various data preprocessing methods is required to prepare vibration time-series data as input for training machine learning models. In addition, there is no specific number of features or single data transformation methodology that guarantees reliable fault diagnosis results. In this study, we use a benchmark dataset to train convolutional neural networks (CNN) on raw vibration signals and on feature-extracted data in two separate experiments. The empirical results show that the CNN model trained on raw data has superior performance, with an average accuracy of 98.64% and ROC and F1 scores of over 0.99. The results suggest that training deep learning models such as CNNs is a promising substitute for conventional signal processing and machine learning models for fault diagnosis and condition monitoring of induction motors. Keywords: Bearing fault diagnosis · Convolutional neural networks · Condition monitoring
1 Introduction

Bearing faults are the most common type of fault and the main source of machine failures in induction motors [1]. Unexpected machine failure may have disastrous consequences such as personnel casualties, financial loss, and breakdown of the motor [2]. Therefore, condition monitoring and fault diagnosis of machinery are crucial to safe and reliable production in industrial systems [3]. Condition monitoring includes the processes and methods for observing the health of the system at fixed time intervals [3]; for induction motors, it provides the opportunity to prevent or minimize unscheduled downtime and increase efficiency [3, 4]. There are many procedures for fault diagnosis of bearings in induction motors. One of the efficient approaches for detecting faults in bearings is the comparison between
normal and faulty vibration signals [5–7] obtained from the induction motor using vibration sensors such as accelerometers. These sensors can be located around the bearing in a designed testbed and reflect the condition of the bearing [8]. Since these signals contain a considerable amount of noise, they do not distinguish faulty bearing conditions if analyzed in raw format. In addition, feature extraction from vibration signals is challenging for fault diagnosis purposes due to the non-stationary and non-linear nature of the signals [6]. The rest of the paper is organized as follows: Sect. 2 presents a brief discussion of previous related work on using machine learning and deep learning models for condition monitoring of induction motors. The details of the data used in the study and the methodology are explained in Sects. 3 and 4, followed by presentation and discussion of results in Sect. 5. A brief overview of the work, plus implementation of the results for condition monitoring, in Sect. 6, completes this paper.
2 Related Work

To address the challenges arising from the nature of vibration signals, various signal processing methods have been used in the literature to prepare vibration data for fault detection analysis. Many studies have focused on fault diagnosis in bearings using signal processing approaches [9, 10] and, more recently, machine learning (ML) methods [6, 11–15]. ML models are popular in prognostics and health management studies for structural health and condition monitoring [16], and the combination of signal processing and ML models has shown promise in providing high accuracy in classifying and predicting faulty versus normal bearing conditions. A popular way to prepare data for modeling is to extract time-domain statistical features from the raw signals and feed them as input for training ML models [10, 17, 18]. The purpose of feature extraction is to reduce the high dimensionality of the data as well as to improve the accuracy and predictive power of the models [19]. The main challenge with this approach is that there is no single scientific rule for the exact number of features to draw from the signals to obtain the highest accuracy from trained ML models; therefore, experimental studies with different feature extraction approaches, using various numbers and types of features, have been conducted for bearing fault diagnosis [5]. An alternative is to use modeling techniques that do not necessarily require feature extraction and whose performance is not significantly affected by the quantity and type of features extracted from vibration signals. Deep neural networks, or deep learning (DL) models, can serve this purpose for fault diagnosis and detection, addressing the challenges of data preparation, which is a required and influential step in training ML models. DL algorithms are capable of automatically extracting the most important features from the original signals and simultaneously classifying the type of fault [20]. Furthermore, DL models perform well in extracting information from noisy data and avoid overfitting, as they are non-parametric statistical models [21]. The structure of a DL model typically includes an input layer, hidden layers, and an output layer, and DL models are capable of detecting complex nonlinear relationships between the input and output data [22]. Among various types of DL algorithms, convolutional
neural networks (CNN) seem promising for bearing fault detection when applied to raw signals or to transformed signal data [2, 23]. Compared to typical neural networks, large-scale network implementation with CNNs is more efficient and less challenging. In addition, CNN models have a weight-sharing feature that allows the network to be trained with a reduced number of parameters, enhancing generalizability and helping to avoid overfitting [25]. A CNN structure has a convolutional layer, a pooling layer, and a fully connected layer, and feature extraction and classification occur simultaneously [8]. The purpose of this study is to evaluate the performance of CNN models in accurately classifying bearing faults in induction motors. Using a benchmark dataset, we compare the performance of CNNs in classifying bearing fault types in two different experiments: one with the CNN trained on raw data and one with it trained on specifically extracted feature data. The results contribute to the performance assessment of CNN models in each experiment and provide insight into the most efficient and cost-effective method of fault diagnosis for condition monitoring of induction motors.
3 Data Description

The majority of studies on condition monitoring of induction motors through bearing fault diagnosis evaluate their proposed solutions on publicly available benchmark datasets [24]. In this study, we used the publicly available data set from Case Western Reserve University, known as the CWRU data, which has been used by researchers in the fault diagnosis area as a benchmark dataset. According to the available documentation of how the CWRU data was generated, the data was collected through two accelerometers that captured the vibration signals at both the drive end (DE) bearing and the fan end bearing. Each bearing consists of four main components, the inner race, outer race, rolling element, and cage, in any of which a bearing fault might occur. The CWRU dataset includes single-point faults that were generated, using electro-discharge machining, at 0.007, 0.014, and 0.021 inch diameters in three parts: the bearing inner raceway (IR), the bearing outer raceway (OR), and the bearing ball (B). The dataset also includes data on bearings in normal status. In this study, we used a subset of the whole data for the two modeling experiments: the vibration signals at the DE bearing, collected for a motor load of one horsepower and a motor speed of 1772 rpm at a sampling frequency of 48 kHz.
4 Methodology

To evaluate the effect of data preprocessing for fault diagnosis in this research, the performance of the same model must be evaluated on raw data and on preprocessed data. Therefore, we propose two experiments: in the first, data points from raw signals are used to develop the first CNN model; in the second, specific statistical features are extracted from each row of data and used as input variables for building the second CNN model.
4.1 Experiment One: Raw Signal Data

In this experiment, raw data is used for training the CNN model. The only preprocessing step is choosing sampling intervals that are representative of the whole data behavior. To reduce overlap between two data sample intervals, we used segmentation of the data for drawing training and testing samples. To produce a larger sample for the CNN model at the 48 kHz sampling frequency, we collected segments of length 1024; approximately 460 segments of length 1024 are therefore generated for each type of bearing condition, with a total of 10 output classes for normal and faulty bearings. The fault classes are B/IR/OR 0.007, B/IR/OR 0.014, and B/IR/OR 0.021 for faults of the various diameters in the ball, inner race, and outer race, respectively; for example, B0.007 represents data where the fault was located in the bearing ball with 0.007 inch diameter. All other classes are interpreted the same way, giving a total of nine output levels for faulty bearings plus one class for normal bearings in good condition. The final prepared dataset has dimension (460*10, 1024) for the 10 labels. In the next step, the data is partitioned into 80% for training and 20% for testing; the CNN model is built on the training data and its performance is evaluated on the test data. The details of the sequential CNN used in this experiment are given in Table 1, which lists the layer types and the number of parameters in each layer. The CNN model is trained on this segmented raw data for 50 epochs with batch size 128.

Table 1. CNN structure for experiment on raw data

Layer | Output shape | Number of parameters*
Conv2d | (none, 24, 24, 32) | 2624
Max_pooling2d | (none, 12, 12, 32) | 0
Conv2d_1 | (none, 4, 4, 32) | 82976
Max_pooling2d_1 | (none, 2, 2, 32) | 0
Flatten | (none, 128) | 0
Dense | (none, 64) | 8256
Dense_1 | (none, 96) | 6240
Dense_2 | (none, 10) | 970

* Total parameters: 101,066; trainable parameters: 101,066; non-trainable parameters: 0
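The following Keras sketch reproduces the layer structure in Table 1. The input shape (each 1024-point segment reshaped to a 32 × 32 single-channel image), the 9 × 9 kernel size, and the activation functions are not stated in the paper; they are inferred from the output shapes and parameter counts, so treat them as assumptions rather than the authors' exact configuration.

```python
# Sequential CNN matching Table 1 (input shape and kernel size inferred, not stated).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),                      # 1024-point segment as 32x32 "image"
    layers.Conv2D(32, kernel_size=9, activation="relu"),  # -> (24, 24, 32), 2624 params
    layers.MaxPooling2D(pool_size=2),                     # -> (12, 12, 32)
    layers.Conv2D(32, kernel_size=9, activation="relu"),  # -> (4, 4, 32), 82976 params
    layers.MaxPooling2D(pool_size=2),                     # -> (2, 2, 32)
    layers.Flatten(),                                     # -> 128
    layers.Dense(64, activation="relu"),                  # 8256 params
    layers.Dense(96, activation="relu"),                  # 6240 params
    layers.Dense(10, activation="softmax"),               # 10 bearing-condition classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()   # total parameters should come to 101,066
# model.fit(X_train, y_train, epochs=50, batch_size=128, validation_data=(X_test, y_test))
```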
4.2 Experiment Two: Statistical Features Data

In the second experiment, we extracted time-domain statistical features from the segmented data. To reduce the number of input variables in the CNN model, instead of using all of the data points we used a time-domain feature extraction methodology in which a number of statistical features are extracted from each row of normal and faulty signals
and labeled with the relevant class; these features are used as input to the CNN for classifying and predicting the fault class. The time-domain statistical features used are the minimum, maximum, mean, standard deviation, root mean square, skewness, kurtosis, crest factor, and form factor of the vibration values, denoted F1 to F9, respectively; they can be calculated using the Python NumPy and SciPy libraries. Adding the fault labels, the feature-extracted dataset has input size (460*10, 10), corresponding to the nine statistical features F1 to F9 computed from the DE time signal, plus the fault class. The statistical formulas for the extracted features are shown in Table 2. In the next step, 80% of the data is used for training the CNN model, and its performance is assessed on the remaining 20% as the test set. The details of the sequential CNN used in this experiment are given in Table 3, which lists the layer types and the number of parameters in each layer. The CNN model is trained on the feature-extracted data for 50 epochs with batch size 128.

Table 2. Statistical features used in the study

Statistical feature | Equation
Minimum | F1 = min(x_i)
Maximum | F2 = max(x_i)
Mean | F3 = (1/n) Σ_{i=1}^{n} x_i
Standard deviation | F4 = √( Σ_{i=1}^{n} (x_i − μ)² / (n − 1) )
Root mean square | F5 = √( Σ_{i=1}^{n} x_i² / n )
Skewness | F6 = (1/n) Σ_{i=1}^{n} (x_i − μ)³ / σ³
Kurtosis | F7 = (1/n) Σ_{i=1}^{n} (x_i − μ)⁴ / σ⁴
Crest factor | F8 = x_max / x_rms
Form factor | F9 = x_rms / μ
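The nine features can be computed with NumPy and SciPy, as the paper indicates; in the sketch below the crest-factor and form-factor denominators follow the standard definitions (peak over RMS, and RMS over mean), which is our reading of Table 2, and the input segment is random stand-in data.

```python
# Time-domain feature extraction (F1-F9) for one 1024-point vibration segment.
import numpy as np
from scipy.stats import skew, kurtosis

def extract_features(x):
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    mu = np.mean(x)
    return np.array([
        np.min(x),                       # F1 minimum
        np.max(x),                       # F2 maximum
        mu,                              # F3 mean
        np.std(x, ddof=1),               # F4 standard deviation (n-1 denominator)
        rms,                             # F5 root mean square
        skew(x),                         # F6 skewness
        kurtosis(x, fisher=False),       # F7 kurtosis (non-excess)
        np.max(x) / rms,                 # F8 crest factor (assumed peak/RMS)
        rms / mu,                        # F9 form factor (assumed RMS/mean)
    ])

segment = np.random.default_rng(0).normal(size=1024)   # stand-in for one DE segment
print(extract_features(segment))
```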
Table 3. CNN structure for experiment on statistical feature data

Layer | Output shape | Number of parameters*
Conv2d | (none, 28, 28, 6) | 156
Max_pooling2d | (none, 14, 14, 6) | 0
Conv2d_1 | (none, 10, 10, 16) | 2416
Max_pooling2d_1 | (none, 5, 5, 16) | 0
Flatten | (none, 400) | 0
Dense | (none, 120) | 48120
Dense_1 | (none, 84) | 10164
Dense_2 | (none, 10) | 850

* Total parameters: 61,706; trainable parameters: 61,706; non-trainable parameters: 0
5 Results The model performance metrics we used are average accuracy, receiver operating characteristic curve (ROC) values and F1 score. The values for ROC are between 0 and 1, and values closer to 1 show higher power and usefulness of the model in distinguishing among multi-level classes in a classification problem. Another measure of model performance, F1 score, represents the weighted average of recall and precision values of a classifier model and is a more reliable metric for performance assessment compared to average accuracy of the model [26]. The results from CNN model performance for both experiments are shown in Table 4. Table 4. CNN performance on raw data and statistical feature data Loop iteration
Loop iteration | Experiment 1 accuracy | Experiment 2 accuracy
1 | 0.9859 | 0.9728
2 | 0.9870 | 0.9522
3 | 0.9880 | 0.9554
4 | 0.9793 | 0.9543
5 | 0.9815 | 0.9717
6 | 0.9859 | 0.9696
7 | 0.9880 | 0.9674
8 | 0.9913 | 0.9804
9 | 0.9859 | 0.9598
10 | 0.9913 | 0.9609
Average accuracy | 0.9864 ± 0.0036 | 0.9645 ± 0.0088
ROC | 0.9951 | 0.9891
F1 Score | 0.9913 | 0.9804
In experiment one, we trained the CNN model on the raw segmented data. The average model accuracy is 0.9864 with a standard deviation of 0.0036, and the ROC and F1 score values are also high. The results show that the CNN model was capable of producing reasonably good classification results even though it was trained on raw data without any specific preprocessing. In experiment two, we trained the CNN model on the nine statistical features extracted from each row of signal data. This adds an extra data preprocessing step, which requires more time and computational resources. For this experiment, the average model accuracy is 0.9645 with a standard deviation of 0.0088, and the ROC and F1 scores are 0.9891 and 0.9804, respectively. While comparable with the results from experiment one, all performance values are slightly lower for the CNN developed on the selected set of statistical features.
This result is significant, since previous studies have shown that ML methods perform better when trained on feature data than on raw signal data. The results of this study, however, show that the CNN has superior performance when trained on raw data. This confirms the challenge mentioned previously: there is no single rule for the number or type of features that produces the highest-performing models for fault diagnosis.
6 Conclusion This paper presented an empirical study on the application and performance assessment of convolutional neural networks in bearing fault diagnosis using the 48 kHz data of the Case Western Reserve University benchmark dataset. We trained two multi-class convolutional neural networks, one on raw data and one on time-domain statistical features extracted from the vibration signals, covering nine types of faults and the normal bearing condition. The results confirm the strong performance of convolutional neural networks in multi-level bearing fault classification, owing to their capability to automatically extract the features that are most important for distinguishing among output classes. The results suggest that deep learning models such as convolutional neural networks can be used as a reliable fault detection and classification approach for condition monitoring, because they require less data preprocessing and preparation than methods in which data transformation is required for model training, such as conventional machine learning modeling. Future directions of this study include developing other deep learning algorithms to provide a comprehensive comparative study of such methods for fault diagnosis and condition monitoring of induction motors.
References 1. Cerrada, M., et al.: A review on data-driven fault severity assessment in rolling bearings. Mech. Syst. Sig. Process. 99, 169–196 (2018). https://doi.org/10.1016/j.ymssp.2017.06.012 2. Lu, C., Wang, Y., Ragulskis, M., Cheng, Y.: Fault diagnosis for rotating machinery: a method based on image processing. PLoS ONE 11(10), 1–22 (2016). https://doi.org/10.1371/journal. pone.0164111 3. Choudhary, A., Goyal, D., Shimi, S.L., Akula, A.: Condition monitoring and fault diagnosis of induction motors: a review. Arch. Comput. Methods Eng. 26(4), 1221–1238 (2018). https:// doi.org/10.1007/s11831-018-9286-z 4. Duan, Z., Wu, T., Guo, S., Shao, T., Malekian, R., Li, Z.: Development and trend of condition monitoring and fault diagnosis of multi-sensors information fusion for rolling bearings: a review. Int. J. Adv. Manuf. Technol. 96(1–4), 803–819 (2018). https://doi.org/10.1007/s00 170-017-1474-8 5. Sugumaran, V., Ramachandran, K.I.: Effect of number of features on classification of roller bearing faults using SVM and PSVM. Expert Syst. Appl. 38(4), 4088–4096 (2011). https:// doi.org/10.1016/j.eswa.2010.09.072 6. Moghadam, A., Kakhki, F.D.: Comparative study of decision tree models for bearing fault detection and classification. Intell. Hum. Syst. Integr. (IHSI 2022) Integr. People Intell. Syst. vol. 22, no. Ihsi 2022, (2022). https://doi.org/10.54941/ahfe100968
7. Russo, D., Ahram, T., Karwowski, W., Di Bucchianico, G., Taiar, R. (eds.): IHSI 2021. AISC, vol. 1322. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68017-6 8. Toma, R.N., et al.: A bearing fault classification framework based on image encoding techniques and a convolutional neural network under different operating conditions. Sensors 22, 4881 (2022). https://www.mdpi.com/1424-8220/22/13/4881 9. Li, C., Cabrera, D., De Oliveira, J.V., Sanchez, R.V., Cerrada, M., Zurita, G.: Extracting repetitive transients for rotating machinery diagnosis using multiscale clustered grey infogram. Mech. Syst. Sign. Process. 76–77, 157–173 (2016). https://doi.org/10.1016/j.ymssp. 2016.02.064 10. Li, C., Sanchez, V., Zurita, G., Lozada, M.C., Cabrera, D.: Rolling element bearing defect detection using the generalized synchrosqueezing transform guided by time-frequency ridge enhancement. ISA Trans. 60, 274–284 (2016). https://doi.org/10.1016/j.isatra.2015.10.014 11. Li, C., et al.: Observer-biased bearing condition monitoring: from fault detection to multifault classification. Eng. Appl. Artif. Intell. 50, 287–301 (2016). https://doi.org/10.1016/j.eng appai.2016.01.038 12. Yang, Y., Fu, P., He, Y.: Bearing fault automatic classification based on deep learning. IEEE Access 6, 71540–71554 (2018). https://doi.org/10.1109/ACCESS.2018.2880990 13. Islam, M.M.M., Kim, J.M.: Automated bearing fault diagnosis scheme using 2D representation of wavelet packet transform and deep convolutional neural network. Comput. Ind. 106, 142–153 (2019). https://doi.org/10.1016/j.compind.2019.01.008 14. Zhang, Y., Ren, Z., Zhou, S.: A new deep convolutional domain adaptation network for bearing fault diagnosis under different working conditions. Shock Vib. 2020 (2020). https://doi.org/ 10.1155/2020/8850976 15. Atmani, Y., Rechak, S., Mesloub, A., Hemmouche, L.: Enhancement in bearing fault classification parameters using gaussian mixture models and mel frequency cepstral coefficients features. Arch. Acoust. 45(2), 283–295 (2020). https://doi.org/10.24425/aoa.2020.133149 16. Badarinath, P.V., Chierichetti, M., Kakhki, F.D.: A machine learning approach as a surrogate for a finite element analysis: status of research and application to one dimensional systems. Sensors 21(5), 1–18 (2021). https://doi.org/10.3390/s21051654 17. Soualhi, A., Medjaher, K., Zerhouni, N.: Bearing health monitoring based on hilbert-huang transform, support vector machine, and regression. IEEE Trans. Instrum. Meas. 64(1), 52–62 (2015). https://doi.org/10.1109/TIM.2014.2330494 18. Prieto, M.D., Cirrincione, G., Espinosa, A.G., Ortega, J.A., Henao, H.: Bearing fault detection by a novel condition-monitoring scheme based on statistical-time features and neural networks. IEEE Trans. Ind. Electron. 60(8), 3398–3407 (2013). https://doi.org/10.1109/TIE. 2012.2219838 19. Toma, R.N., Prosvirin, A.E., Kim, J.M.: Bearing fault diagnosis of induction motors using a genetic algorithm and machine learning classifiers. Sensors (Switz) 20(7), 1884 (2020). https://doi.org/10.3390/s20071884 20. Hoang, D.T., Kang, H.J.: A survey on deep learning based bearing fault diagnosis. Neurocomputing 335, 327–335 (2019). https://doi.org/10.1016/j.neucom.2018.06.078 21. Kakhki, F.D., Freeman, S.A., Mosher, G.A.: Use of neural networks to identify safety prevention priorities in agro-manufacturing operations within commercial grain elevators. Appl. Sci. 9, 4690 (2019). https://doi.org/10.3390/app9214690 22. 
Yedla, A., Kakhki, F.D., Jannesari, A.: Predictive modeling for occupational safety outcomes and days away from work analysis in mining operations. Int. J. Environ. Res. Public Health 17(19), 1–17 (2020). https://doi.org/10.3390/ijerph17197054 23. Zhang, D., Zhou, T.: Deep convolutional neural network using transfer learning for fault diagnosis. IEEE Access 9, 43889–43897 (2021). https://doi.org/10.1109/ACCESS.2021.306 1530
24. Zhang, J., Zhou, Y., Wang, B., Wu, Z.: Bearing fault diagnosis base on multi-scale 2D-CNN model. In: Proceedings of 2021 3rd International Conference on Machine Learnimg Big Data Bus. Intell. MLBDBI 2021, no. June 2020, pp. 72–75 (2021). https://doi.org/10.1109/MLB DBI54094.2021.00021 25. Alzubaidi, L., et al.: Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8, 1–74 (2021) 26. Kakhki, F.D., Freeman, S.A., Mosher, G.A.: Evaluating machine learning performance in predicting injury severity in agribusiness industries. Saf. Sci. 117, 257–262 (2019). https:// doi.org/10.1016/j.ssci.2019.04.026
Huber Loss and Neural Networks Application in Property Price Prediction Alexander I. Iliev1,2(B) and Amruth Anand1,2 1 SRH Berlin University, Charlottenburg, Germany
[email protected] 2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria
Abstract. In this paper we explore the real estate market in Germany; in particular, we have taken a dataset of Berlin properties and applied various advanced neural network and optimisation techniques. It is always difficult to estimate the best price for a property, since many categorical and numerical features are involved, and the main challenge is to choose the model and loss function and to customize the neural network to best fit the marketplace data. We have developed a project that can be used to predict property prices in Berlin. First, we procured the current online market prices of properties by web scraping. We then performed intensive exploratory data analysis and prepared the data for the experiments. Next, we built four different models, identified the loss functions that best suit them, and tabulated the mean squared and mean absolute errors for each. We tested our models on properties currently on the market, and sample results are plotted. This methodology can be applied efficiently, and the results can be used by people interested in investing in real estate in Berlin, Germany. Keywords: Huber loss · Hyperparameter tuning · Exploratory data analysis · Recurrent neural nets · Convolution neural nets · Deep neural nets
1 Introduction Real estate is a great investment option, but investing in a property is always difficult in Germany due to the lack of visibility of market data. In this paper we aim to build an AI model that helps decide the most suitable price for a property in Berlin [1-5]. In this study we hypothesize that the Huber loss function suits real estate data best. We used different neural network models to test this hypothesis and compared them against a standard mean squared error loss function. During these experiments we also considered different neural networks to compare which suits the data best. In this paper, we cover collecting the data and performing exploratory data analysis in Sect. 2. In Sect. 3 we discuss the different neural network models used in the experiments and how they were selected and optimized. In Sect. 4 we present and discuss all our findings. In Sect. 5 we state our conclusion and suggest future scope for improvement.
2 About the Data We collected the data by scraping open online resources; for this experiment we checked the robots.txt of those websites before gathering the data. After scraping we had 15,177 properties to work with, and after exploratory data analysis we ended up with 12,743 properties. A heat map of the data is shown in Fig. 1.
Fig. 1. Heat map of dataset
2.1 Data Preparation Data preparation plays a very important role in every machine learning and artificial intelligence project, especially when building our own dataset. Since we gathered the information online by web scraping, there was a lot of unstructured data, and we cleaned it in Python to give it a proper structure. The property features we considered are purchase price, living space, number of rooms, floor of the apartment, number of floors in the building, construction year, renovation year, list of features in the property, postcode, category to which the property belongs, balcony, garden, terrace, lift, guest toilet, heating type, furnishing quality, garage space, and address of the property. Since many of the above-mentioned features were not provided online, we had to exclude them, and some features were removed because they were correlated, as seen in Fig. 2 [6]. After removing null data, it was important to investigate the outliers. We used boxplots for outlier removal: any point more than 1.5 times the interquartile range above the third quartile or below the first quartile was considered an outlier. For some important features, like purchase price, the data was right-skewed, so we applied a log1p transform to bring it closer to a normal distribution, as seen in Fig. 3.
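A minimal pandas sketch of the outlier-removal and log-transform steps described above, assuming the scraped listings are held in a DataFrame with a purchase_price column; the file name, column name, and helper name are illustrative, not taken from the paper.

```python
import numpy as np
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows lying more than 1.5 * IQR outside the first/third quartile."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

df = pd.read_csv("berlin_listings.csv")           # hypothetical scraped dataset
df = df.dropna(subset=["purchase_price"])         # remove null targets
df = remove_iqr_outliers(df, "purchase_price")

# Right-skewed prices are log-transformed; the same transform must be applied
# to every feature derived from the purchase price.
df["log_price"] = np.log1p(df["purchase_price"])
```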
Fig. 2. Heat map for features correlation
Fig. 3. Normalization
Similarly, we considered different techniques for different features. One important point worth mentioning is that the log transform must be applied to all features associated with the purchase price. During EDA, we gained many insights into the real estate market in Berlin.
3 Neural Network Models We built three different types of neural network models to examine how each behaves on our data: a simple deep neural network, a recurrent neural network [7], and a convolutional neural network. We adjusted these networks to work for a regression problem, as in our case.
3.1 Deep Neural Network A deep neural network is a popular choice for regression problems. We used a five-layer neural network model for our 12,743 records, shown in Fig. 4. We chose five layers following the rule of thumb that the number of hidden layers should be less than twice the size of the input layer; we have 13 features, hence five layers. Initially we experimented with three layers, but the model did not seem to learn properly; we also tried four layers, but five layers worked best. To select the number of hidden neurons, the rule of thumb is that it should lie between the input layer size and the output layer size, but finding the best configuration is always an iterative trial-and-error process. In our model we used 600, 450, 300, 100, and 50 neurons in the hidden layers, with one final output layer.
Fig. 4. Deep neural network model
We used the ReLU activation function; the activation function controls how well the neural network learns from the training dataset. As seen in the figure, we did not use an activation function for the output layer, because the model works on a regression problem. We chose the rectified linear activation function (ReLU) because we are using five layers and expect vanishing gradients with other activation functions; ReLU overcomes this problem and helps the model learn quickly and perform well. Optimization is the method used in a neural network to adjust the weights and learning rate in order to reduce the loss, which also lets us obtain results much faster [8]. We used the Adam optimizer [9]; it combines two gradient descent methodologies: • Gradient descent with momentum • Root mean square propagation (RMSprop) Adaptive moment estimation (Adam) is an optimization algorithm for gradient descent. It is efficient for large problems involving a lot of data or parameters, requires little memory, and is very effective. The cost function, popularly called the loss function, plays a very important role in the behaviour of the model, so it is always important to select the best loss function. For our regression problem we had a few options and experimented with two of them: mean squared error and the Huber loss.
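A minimal Keras sketch of the five-layer network described above, with ReLU hidden activations, a linear output, and the Adam optimizer; the layer widths (600/450/300/100/50) and the 13 input features follow the text, while everything not stated there (for example the default learning rate) is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_deep_ann(n_features: int = 13) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(600, activation="relu"),
        layers.Dense(450, activation="relu"),
        layers.Dense(300, activation="relu"),
        layers.Dense(100, activation="relu"),
        layers.Dense(50, activation="relu"),
        layers.Dense(1),                      # no activation: regression output
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```

For the Huber-loss variant discussed below, the same network can be recompiled with tf.keras.losses.Huber(delta=1.75) in place of the MSE loss.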
Mean squared error is the mean of the squared distances between our target variable and the predicted values:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_i^{p}\right)^2$$

where MSE is the mean squared error, n is the number of data points, $y_i$ is the observed value, and $y_i^{p}$ is the predicted value.
Fig. 5. Loss function MSE
As explained above, we used this loss in our code, shown in Fig. 5. To track the performance of the model we used the mean absolute error (MAE) as a metric, although other metrics could have been used as well. For this model we obtained a mean squared error of 0.17 and a mean absolute error of 0.27. Since this is a regression problem, we cannot rely on the metrics alone; we also need to see how well the model has learnt.
Fig. 6. Loss curve ANN MSE
We can see in the graph in Fig. 6 that the model is not learning as expected. Since the model was not learning properly, we decided to change the loss function and train the same model again. For this experiment we used the Huber loss function:

$$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}\,(y - f(x))^2 & \text{for } |y - f(x)| \le \delta,\\ \delta\,|y - f(x)| - \frac{1}{2}\delta^2 & \text{otherwise.} \end{cases}$$

The formula says that if the error is small, we take the first branch, which is the squared error divided by 2; otherwise the loss grows only linearly with the error.
Here we need to define the delta. To select the delta, we used hyperparameter tuning, as shown in Fig. 7, with the scikit-learn function GridSearchCV. GridSearchCV loops through the predefined hyperparameters and fits the model on our training set.
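A sketch of the delta search described above; the paper uses scikit-learn's GridSearchCV with a Keras wrapper (its exact code is shown in Fig. 7), whereas the plain loop below is a self-contained stand-in that searches the same hyperparameter. The candidate delta values, epoch count, and batch size are assumptions, and X_train/y_train denote the prepared training data.

```python
import tensorflow as tf

deltas = [0.5, 1.0, 1.25, 1.5, 1.75, 2.0]   # illustrative grid; the paper reports 1.75 as best
results = {}

for delta in deltas:
    model = build_deep_ann()                 # network from the sketch above
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.Huber(delta=delta),
                  metrics=["mae"])
    history = model.fit(X_train, y_train,
                        validation_split=0.2,
                        epochs=50, batch_size=32, verbose=0)
    results[delta] = min(history.history["val_mae"])

best_delta = min(results, key=results.get)
print("best delta:", best_delta, "validation MAE:", results[best_delta])
```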
Fig. 7. Hyperparameter tuning for delta in huber loss
With this experiment we found that the best delta value for our training data is 1.75. We then applied this loss function to our model and trained it again, as seen in Fig. 8:
Fig. 8. Loss function huber loss
Using this, we obtained a mean squared error of 0.21 and a mean absolute error of 0.31. Most interestingly, the model now learns well in every epoch and the loss curve shows no noise.
Fig. 9. Loss curve ANN huber loss
As we can see in the graph in Fig. 9, the model's learning has improved drastically compared to before, so using the Huber loss was the best option for our training data. The Huber loss is less sensitive to outliers, and in real estate some of the data points are indeed outliers.
With mean squared error the gradient decreases as the loss approaches the minimum, which makes it more precise, whereas with the Huber loss the gradient curves around the minimum as it decreases. 3.2 Recurrent Neural Network Our model for the recurrent neural network is displayed in Fig. 10. An RNN is also a reasonable choice for house price prediction, as this type of model is designed for sequential data. A standard network has two problems here: • inputs and outputs can have different shapes, and • it does not share features learnt across different positions. In an RNN the parameters used at each time step are shared. The parameters governing the connection from the input x to the hidden layer, written $W_{ax}$, are the same $W_{ax}$ at every time step, and the activations (the horizontal connections between time steps) are governed by a shared set of parameters $W_{aa}$. So, in this recurrent neural network, when making the prediction for $\hat{y}^{\langle 3\rangle}$, the network gets information not only from $x^{\langle 3\rangle}$ but also from $x^{\langle 1\rangle}$ and $x^{\langle 2\rangle}$, because information from the earlier inputs can pass along the hidden state to help the prediction of $\hat{y}^{\langle 3\rangle}$.
Fig. 10. RNN forward propagation
The activation in an RNN is usually tanh, and sometimes ReLU is also used, although tanh is the common choice; there are other ways of preventing the vanishing gradient problem. Depending on the output $\hat{y}$, a sigmoid is used for binary classification and a softmax for K-way classification. The forward step is

$$a^{\langle t\rangle} = g\left(W_{aa}\,a^{\langle t-1\rangle} + W_{ax}\,x^{\langle t\rangle} + b_a\right)$$

In the simplified notation the two weight matrices are stacked horizontally, $W_a = [W_{aa}, W_{ax}]$, and the inputs are stacked vertically, so that

$$W_a \begin{bmatrix} a^{\langle t-1\rangle} \\ x^{\langle t\rangle} \end{bmatrix} = W_{aa}\,a^{\langle t-1\rangle} + W_{ax}\,x^{\langle t\rangle}$$

For example, if $a$ is 100-dimensional and $x$ is 10000-dimensional, then $W_{aa}$ has shape (100, 100), $W_{ax}$ has shape (100, 10000), and $W_a$ has shape (100, 10100). More generally we end up with

$$a^{\langle t\rangle} = g\left(W_{ax}\,x^{\langle t\rangle} + W_{aa}\,a^{\langle t-1\rangle} + b_a\right) \text{ (tanh/ReLU)}, \qquad \hat{y}^{\langle t\rangle} = g\left(W_{ya}\,a^{\langle t\rangle} + b_y\right) \text{ (sigmoid)}.$$

To perform backpropagation we need a cost (cross-entropy) function. For one element of the sequence,

$$\mathcal{L}^{\langle t\rangle}\left(\hat{y}^{\langle t\rangle}, y^{\langle t\rangle}\right) = -\,y^{\langle t\rangle}\log \hat{y}^{\langle t\rangle} - \left(1 - y^{\langle t\rangle}\right)\log\left(1 - \hat{y}^{\langle t\rangle}\right)$$

and the overall loss used for backpropagation through time is

$$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}^{\langle t\rangle}\left(\hat{y}^{\langle t\rangle}, y^{\langle t\rangle}\right).$$

Finally, there are four types of RNN:
1. One-to-one
2. One-to-many
3. Many-to-one
4. Many-to-many
For our use case, price prediction, we used the many-to-one type. Our RNN network snippet looks like this (Fig. 11):
Fig. 11. RNN neural network
As we can see from the code snippet displayed in Fig. 11, our first layer is a Lambda layer. We used a Lambda layer to apply an arbitrary expression as a layer during our experimentation; such layers are generally used in sequential models, and in our case we had a simple expression to experiment with. With this model we achieved a mean squared error as low as 0.089 and a mean absolute error of 0.222.
Fig. 12. Loss curve simple RNN
The graph in Fig. 12 shows how the model learns in every epoch; we see that the model improved within very few epochs and stabilised later. We used hyperparameter tuning for the learning rate in our RNN model; this way we can control the speed with which the model learns, as evident from Fig. 13.
Fig. 13. RNN learning rate
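One common way to tune the Adam learning rate in Keras, consistent with the tuning described above, is to sweep the rate during training with a LearningRateScheduler callback and pick the value at which the loss is lowest. The sweep range, epoch count, loss choice, and the build_simple_rnn helper below are assumptions used only to illustrate the idea.

```python
import numpy as np
import tensorflow as tf

# Sweep the learning rate upward by a factor of 10 every 20 epochs.
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: 1e-5 * 10 ** (epoch / 20)
)

model = build_simple_rnn()          # hypothetical builder for the RNN shown in Fig. 11
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="mse", metrics=["mae"])
history = model.fit(X_train, y_train, epochs=100,
                    callbacks=[lr_schedule], verbose=0)

# Pick the learning rate at which the recorded training loss was lowest.
lrs = 1e-5 * 10 ** (np.arange(100) / 20)
best_lr = lrs[np.argmin(history.history["loss"])]
```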
Fig. 14. Loss curve RNN using hyperparameter tuning
After implementing this, we can see that the model is improving by looking at the way it learns in the graph in Fig. 14. When we train a model, we expect a smooth bell-shaped curve, which we achieved with this experiment. 3.3 Hybrid Neural Network In our last network, we built a hybrid neural network [10, 11] combining one Lambda layer, one Conv1D layer, two LSTM layers, two SimpleRNN layers, three dense layers, and one final output dense layer, as shown in Fig. 15. We wanted to experiment with a hybrid model to see how it performs on our data compared to the other models. In the hybrid model we also used hyperparameter tuning for the learning rate of the Adam optimizer.
Fig. 15. Hybrid neural network
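A hedged Keras sketch of a hybrid network with the layer mix listed above (one Lambda, one Conv1D, two LSTM, two SimpleRNN, three dense, and one output layer). The layer widths, kernel size, and ordering beyond what the text states are assumptions, since Fig. 15 shows the authors' exact configuration; the delta of 1.75 is reused from Sect. 3.1.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_hybrid_model(n_features: int = 13) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        # Lambda layer: add a trailing axis so the sequence layers can be applied.
        layers.Lambda(lambda x: tf.expand_dims(x, axis=-1)),
        layers.Conv1D(64, kernel_size=3, padding="causal", activation="relu"),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(64, return_sequences=True),
        layers.SimpleRNN(64, return_sequences=True),
        layers.SimpleRNN(64),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss=tf.keras.losses.Huber(delta=1.75),
                  metrics=["mae", "mse"])
    return model
```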
Most of the layers used here are explained in Sects. 3.1 and 3.2 above. The metrics used to evaluate all the above models were the mean squared and mean absolute errors, and the values obtained from this network were 0.12 and 0.26. The loss curve for the hybrid model was also impressive, as shown in Fig. 16; there is no noise in its trend.
Fig. 16. Loss curve hybrid neural network
4 Results The main objective of this paper was to find the best neural network for the real estate market in Berlin, Germany. We worked on multiple models and tried to improve them using different loss functions, finding the best delta value for the Huber loss function and using hyperparameter tuning for the Adam optimizer. In this section, we discuss and tabulate the results and the tests we carried out. We took properties currently on the market and compared how efficient the models are on real-world data. We applied various pre-processing steps to make the data suitable for the models to learn from, as explained briefly in Sect. 2. The different models and their respective errors are shown in Table 1:
Table 1. Metrics comparison for models

Model | Mean squared error | Mean absolute error
Deep Neural Nets with MSE loss function | 0.177 | 0.271
Deep Neural Nets with Huber loss function | 0.215 | 0.3163
Simple RNN | 0.102 | 0.235
Simple RNN with hyperparameter-tuned Adam optimizer | 0.106 | 0.241
Hybrid Neural Network | 0.124 | 0.268
We also discussed the loss curves of these models in Sect. 3 to understand how the models learn. A comparison of the test data for each model is shown in Table 2. Table 2. Test data comparison
Test data | Model 1 | Model 2 | Model 3 | Model 4 | Model 5
8.545 | 8.584 | 8.782 | 8.598 | 8.530 | 8.613
8.370 | 8.283 | 8.475 | 8.398 | 8.261 | 8.434
8.694 | 8.577 | 8.632 | 8.408 | 8.253 | 8.412
7.717 | 7.777 | 7.390 | 7.866 | 7.861 | 7.909
7.740 | 7.651 | 7.719 | 7.783 | 7.664 | 7.743
NOTE: These are the log values directly from the model which has normalized log values
Here, in this table:
• Model 1 = Deep Neural Nets with MSE loss function
• Model 2 = Deep Neural Nets with Huber loss function
• Model 3 = Simple RNN
• Model 4 = Simple RNN with hyperparameter tuning of the Adam optimizer
• Model 5 = Hybrid Neural Network
The numbers in the table are the property prices per square foot, given as log values. These results come straight from the models on the test data; we divided our data into 80% training and 20% validation. In the pre-processing we normalised the data to get better performance out of the models, and we have tabulated the log values directly from the models. In the next part we take real-time data from online listings and try to find which model works best on them. These are real properties which we used to test our models.
Fig. 17. Online real time property data
Note: In Fig. 17 we have taken log values for the living_sqm field because our model is trained on log values; to do that we simply use log1p from the NumPy library. Here we compare the two ANN models, the two RNN models, and the hybrid model separately, just to observe how the changes behave.
Fig. 18. Comparison of ANN models
In the graph in Fig. 18, we have plotted the online price, the deep ANN model price, and the price from the model with the Huber loss.
It is evident in the plot that the deep ANN model with the Huber loss performs a lot better than the deep neural network model with the MSE loss. As explained in Sect. 3.1, the Huber loss is more robust to outliers than the usual MSE loss function. We can see for properties 9 and 15 that the online prices are high, and the model handles these outliers well; here an outlier means a property that is priced higher than usual.
Fig. 19. Comparison of RNN and hybrid neural network
The graph in Fig. 19 is a comparison of the RNN and hybrid models. In this figure, we can clearly observe that the hybrid model with the Huber loss is more accurate than the conventional neural networks.
5 Conclusion The main objective of this paper was to experiment with neural networks, optimizers, and loss functions to see which suit a property price prediction regression problem best. We considered all the combinations and discussed the results in Sect. 4. From all the experiments carried out on this topic, we can conclude that:
• hyperparameter tuning the learning rate of the Adam optimizer,
• using the Huber loss function with a delta value of 1.75, and
• using the hybrid neural network
is the best combination for this use case.
References 1. Kauskale, L.: Integrated approach of real estate market analysis in sustainable development context for decision making. Procedia Eng. 172, 505–512 (2017) 2. Jha, S.B.: Machine learning approaches to real estate market prediction problem: a case study arXiv:2008.09922 (2020) 3. Tabales, J.M.N.: Artificial neural networks for predicting real estate prices. Revista De Methodos Cuantitativos para la economia y la empresa 15, 29–44 (2013) 4. Hamzaoi, Y.E., Hernandez, J.A.: Application of artificial neural networks to predict the selling price in the real estate valuation. In: 10th Mexican International Conference on Artificial Intelligence, pp. 175–181 (2011) 5. Kauko, T., Hooimaijer, P., Hakfoort, J.: Capturing housing market segmentation: an alternative approach based on neural network modelling. Hous. Stud. 17(6), 875–894 (2002) 6. Fukumizu, K.: Influence function and robust variant of kernel canonical correlation analysis. Neurocomputig, 304–307 (2017) 7. Rahimi, I., Bahmanesh, R.: Using combination of optimized recurrent neural network with design of experiments and regression for control chart forecasting. Int. J. Sci. Eng. Invest. 1(1), 24–28 (2012) 8. Cook, D.F., Ragsdale, C.T., Major, R.L.: Combining a neural network with a genetic algorithm for process parameter optimization. Eng. Appl. Artif. Intell. 13, 391–396 (2000) 9. Kingma, D.P.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015) 10. Lee, E.W.M.: A hybrid neural network model for noisy data regression. IEEE Trans. Cybern. 34, 951–960 (2004) 11. Williamson, J.R.: Gaussian artmap: a neural network for fast incremental learning of noisy multidimensional maps. Neural Netw. 9(5), 881–897 (1996)
Text Regression Analysis for Predictive Intervals Using Gradient Boosting Alexander I. Iliev1,2(B) and Ankitha Raksha1 1 SRH Berlin University, Charlottenburg, Germany
[email protected] 2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria
Abstract. In this paper, we explore text data analysis using different methods to vectorize the data and carry out regression. At present there are many text categorization techniques, but very few algorithms are dedicated to text regression, and most regression models predict only a single estimated value. It is often more useful to predict a numerical range than a single estimate, since we can be more confident that the true value lies within the range than that a single estimate equals the true value. We aim to combine both these techniques in this paper. The goal of this paper is to collect text data, clean the data, and create a text regression model to find the best-suited algorithm that could be used in any situation, through the use of quantile regression. Keywords: Text analysis · tf-idf · word2vec · GloVe · Text regression · Word vectorization · Quantile · Gradient boosting · Natural language processing
1 Introduction With the help of Natural Language Processing (NLP) and its components, we can organize huge amounts of data. NLP is a powerful AI method for communicating with an intelligent system using natural language. We can perform numerous automated tasks and solve a wide range of problems, such as automatic summarization, machine translation, entity recognition, speech recognition, and topic segmentation. Since machine learning algorithms are not able to understand text, we need a way to convert the text into numbers so that the algorithm can understand the data that is sent to the model. After the data is pre-processed and cleaned, we perform a process called vectorization. In machine learning, most algorithms work with numerical data. Therefore, we need to find a way to convert the text data into numbers so that it can be incorporated into the machine learning models. This is called word vectorization, and the numerical form of the words is called vectors. To gain the confidence of decision makers, it is often not necessary to present a single number as an estimate, but to provide a prediction range that reflects the uncertainty inherent in all models.
2 About the Data The data was collected from Plentific GmbH over the last two years and comprises around 170,000 records. It consists of textual fields describing home repair jobs posted on the marketplace, such as category, service, type, job description, and emergency tag type; these are the feature vectors used to predict the price of the repair job, which is our target value.
Fig. 1. Histogram of the repair job and their price value for the maximum value of 500
From the histogram in Fig. 1 of the repair jobs and their prices, we see that the distribution of repair job prices spreads from a very small amount of around 10 pounds to a very large amount of 500 pounds. 2.1 Data Preparation Data preparation involves cleaning the data, which is one of the simplest yet most important steps in the entire data modelling process. How well we clean and understand the raw input data and prepare the output data has a big impact on the quality of the results, even more than the choice of model, model tuning, or trying different models. All text data in its original form contains a lot of unnecessary data, which we call noisy or dirty data, and which must be cleaned. Failure to do so can result in skewed data that ultimately leads the organization to poor decisions. Cleaning text is particularly difficult compared to cleaning numbers, because we cannot perform the usual statistical analyses that we can with numerical data. The steps in text cleaning are:
Removal of null values in the data. Convert all the words into lower case. Remove numbers. Remove newline character
e) Remove any HTML tags. f) Remove punctuation, limited to certain word(s) where there is an actual message. g) Lemmatization: bringing a word into its root form (a code sketch of these cleaning steps is given below). To remove outliers from the dataset and to reduce the variance among different repair jobs, we limited the accepted price values to the range of 45 to 125 pounds, as in Fig. 2. Initially we had almost 170,000 data records, but after removing outliers we were left with 115,000 records in our dataset, which is almost 70% of the data.
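A minimal sketch of the cleaning steps a) to g) above, using pandas, regular expressions, and NLTK's WordNet lemmatizer; the file name, column names, and library choices are assumptions, and any comparable lemmatizer would do.

```python
import re
import pandas as pd
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                          # b) lower case
    text = re.sub(r"<[^>]+>", " ", text)         # e) strip HTML tags
    text = text.replace("\n", " ")               # d) remove newline characters
    text = re.sub(r"\d+", " ", text)             # c) remove numbers
    text = re.sub(r"[^\w\s]", " ", text)         # f) remove punctuation
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split()]  # g) lemmatize
    return " ".join(tokens)

df = pd.read_csv("repair_jobs.csv")                    # hypothetical export of the dataset
df = df.dropna(subset=["job_description", "price"])    # a) remove null values
df["clean_description"] = df["job_description"].apply(clean_text)
df = df[df["price"].between(45, 125)]                  # outlier limits used in the paper
```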
Fig. 2. Histogram of the repair job and their price value for the maximum value of 125 and minimum of 45
3 Literature Survey For our case, we had to draw insights from previous work on text regression. One such study considers different text corpora to predict the age of the author using a linear model and text features [1]. Another study used much more complex modelling, namely convolutional neural networks (CNNs), for text regression in financial analysis [2]. A similar study analysed financial documents concerning stock volatility to predict risk using support vector regression [3]. In another study, the authors proposed a text regression model based on a conditional generative adversarial network (GAN) [4]; the model works with unbalanced datasets with limited labelled data, which aligns with real-world scenarios [5]. Sending both text data and metadata into the same model is a distinctive concept that combines two different data structures, and it was explained well in work on predicting movie revenues [6]. Another paper shows a non-linear method based on a deep convolutional neural network [7], and yet another discusses the use of logistic regression and its advantages for text categorization [8]. TF-IDF has mathematical underpinnings that provide the notion of the probability-weighted amount of information (PWI); to understand how TF-IDF weights can support relevance decisions, the authors of [9] propose a model that combines local relevance with a document-wide relevance decision. Another method for word
representation is word2vec, originally developed by Google researchers to obtain efficient word representations in an N-dimensional vector space; they illustrated two different models, skip-gram and continuous bag of words [10]. Extending this work, other authors have discussed emerging trends around word2vec, including the different operations that can be carried out on word vectors to uncover relationships [11] and how it can be applied to large text corpora [12]. Similar to word2vec we have GloVe, short for Global Vectors, developed by Stanford researchers, who showed that it combines global matrix factorization and local context window methods [13]. Building a model on top of the GloVe baseline has been done to obtain better results [14], and there is also a study showing a way to combine the word2vec and GloVe models [15]. Decision trees are better suited to nonlinear, non-normal data [16], and combining many models through ensemble learning increases accuracy [17]. Gradient boosting is a greedy function approximation method that accommodates loss functions such as least squares, least absolute deviation, and Huber-M [18]. Another paper provides conditions for a quantile regression model for panel data that removes fixed effects by assuming they act as location shifters [19]. There have also been many attempts at predicting intervals, such as fuzzy intervals [20] and machine learning approaches for artificial and real hydrologic data sets [21].
4 Word Vectorization Word vectorization schemes are needed in machine learning for the simple reason that the encoding scheme we are most familiar with as humans, the alphabet, tells us very little about the nuances of the word it represents. The need for good embedding schemes also arises because we cannot simply feed a network raw binary ASCII data or a one-hot encoding over an extensive vocabulary; these quickly balloon into unacceptably large dimensions that no neural network can easily handle. Neural networks for natural language processing are well known for the fact that their performance depends heavily on how good their word embedding is. The goal of word embedding, then, is to solve these problems and provide the model with a learning method that allows it to place words in a vector space where their position tells it something about the word and its relationship to all the other words in the vocabulary. The goals of word embedding are: • to reduce dimensionality, • to use a word to predict the words around it, and • to capture inter-word semantics. There are many word vectorization techniques, but we focus on the following:
4.1 Term Frequency-Inverse Document Frequency (TF-IDF) Term frequency-inverse document frequency (TF-IDF) is a method whose main purpose is to capture the semantic importance of words and to weigh the meaning of words in a document. This is done through a statistical analysis of word frequencies, represented by the formula TF(w) * IDF(w), where TF stands for term frequency and IDF for inverse document frequency: TF(w) = (number of occurrences of the word in the document) / (total number of words in the document), and IDF(w) = log(number of documents / number of documents containing the word w). 4.2 Word2Vec Word2Vec is a method for generating word embeddings. It converts words into vectors, and with vectors we can perform several operations, such as adding, subtracting, and calculating distances, so that relationships between words are preserved. Unlike TF-IDF, word2vec and other encoding methods use a neural network model that provides an N-dimensional vector for each word. Training the word2vec model is also resource intensive, as it requires a large amount of RAM to store the vocabulary of the corpus. In simple terms, word2vec tries to give words with similar contexts similar embeddings. Word2vec, or any other vector embedding method using a neural network trained on different corpora, gains knowledge about words through the context of similarity with surrounding words. There are two variants: a) Continuous bag of words (CBOW): a word is predicted from the surrounding words; in the example shown in Fig. 3, we try to predict the word 'Jumps' from all the other words in the sentence. b) Skip-gram: one word is used to predict the surrounding words; in the example shown in Fig. 4, we try to predict all the other words in the sentence from the word 'Jumps'.
Fig. 3. Continuous bag of words example
Fig. 4. Skip-gram model example
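A brief sketch of the two vectorization methods from Sects. 4.1 and 4.2 using scikit-learn and gensim, assuming the cleaned job descriptions from the earlier cleaning sketch are available in a DataFrame column; the parameter values (feature count, vector size, window, n-gram range) are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

docs = df["clean_description"].tolist()          # cleaned job descriptions

# TF-IDF: one sparse vector per document.
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

# Word2Vec: train embeddings, then average word vectors per document.
tokenized = [doc.split() for doc in docs]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5,
               min_count=2, sg=1, workers=4)     # sg=1 -> skip-gram

def doc_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_w2v = np.vstack([doc_vector(t, w2v) for t in tokenized])
```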
4.3 GloVe GloVe is an abbreviation for Global Vectors. The difference between Word2Vec and GloVe is that Word2Vec considers only the local presence of a word in the dataset, whereas GloVe is a word representation scheme that aims to extract semantic relations between words in their embeddings. GloVe aims to capture the meaning of a word in the embedding by explicitly modelling co-occurrence probabilities, and this is empirically demonstrated in the original paper, as in Table 1: Table 1. GloVe co-occurrence matrix example

 | k = 'solid' | k = 'gas' | k = 'water' | k = 'fashion'
P(k|'ice') | 1.90E-04 | 6.60E-05 | 3.00E-03 | 1.70E-05
P(k|'steam') | 2.20E-05 | 7.80E-04 | 2.20E-03 | 1.80E-05
P(k|'ice') / P(k|'steam') | 8.90E+00 | 8.50E-02 | 1.30E+00 | 9.60E-01
The words 'ice' and 'steam' are compared with various probe words like 'gas', 'water', 'solid', and 'fashion'. We can see that 'ice' is more related to 'solid' than to 'gas', and the converse is true for 'steam', as shown by the ratios in the last row. Both terms have similarly large values with 'water', and very small values in the context of 'fashion' [13]. All of this argues that embeddings should be built not just on word probabilities but on co-occurrence probabilities within a context, and this is what enters the loss function. The essence of GloVe is to build a matrix from these probabilities and subsequently learn a vector representation of each word. This leads to the loss function in its entirety:

$$J = \sum_{i,j} f(X_{ij})\left(w_i^{T}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where $f(X_{ij})$ is the weighting function, $w_i^{T}\tilde{w}_j$ is the dot product of the input and output word vectors, $b_i + \tilde{b}_j$ are the bias terms, and $X_{ij}$ is the number of occurrences of word j in the context of word i.
5 The Intuition of Text Regression We have all heard of text classification, where a given text or sentence is assigned to different classes, so that a class name is predicted when a new text arrives. But in today's world there is very little structured data; the rest is unstructured, and this is where text data comes in. There are cases where we need to predict a number for a given text, such as predicting the price of a property from its description, predicting how many "likes" an article can get, or predicting the age of the author of an article (or story) based on the vocabulary used [1]. These are all real-world examples, and we may need to look ahead a bit to grasp the importance of text regression. 5.1 Gradient Boosting Regressor Gradient boosting regression is a boosting algorithm that builds decision trees sequentially. We trained two models, one for predicting a lower bound and the other for the upper bound, by providing values for the quantile, or "alpha", to create an interval. Predicting a numerical range in this way can be very useful in situations where we cannot rely one hundred percent on a single number. This is important in our case because textual data such as a repair order description contains many variations that can only be properly understood by the contractor based on their experience. For this reason, we chose to use quantile regression rather than just linear regression. Gradient boosting regressors have been shown to fit and perform well on complex data sets. Another reason for choosing this algorithm is that it provides intrinsic quantile prediction functionality through its loss function. 5.2 Quantile Regression Model Before we dive into a quantile regression model, we must first understand the meaning of quantiles. In general, quantiles are just cut points that divide the data into equal groups; percentiles are simply quantiles that divide the data into a hundred equal groups. For example, the 75th percentile generally means that 75% of the data lies below that point. The quantiles are known only from the distribution of the data provided to the model. Linear regression, or so-called ordinary least squares (OLS), assumes that the relationship between the input variables X and the output label Y can be expressed as a linear function:

$$Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \ldots + \theta_p X_p + \varepsilon$$

where Y is the dependent variable, $X_1, \ldots, X_p$ are the independent variables, $\theta_0, \ldots, \theta_p$ are the coefficients, and $\varepsilon$ is the error term. The objective loss function is the squared error:
$$L = (y - X\theta)^2$$

where y is the actual value and $X\theta$ is the predicted value. With quantile regression, we have an extra parameter τ, the τ-th quantile of our target variable Y that we are interested in, where τ ∈ (0, 1), and our loss function becomes

$$L = \begin{cases} \tau\,(y - y') & \text{if } y - y' \ge 0,\\ (\tau - 1)\,(y - y') & \text{if } y - y' < 0, \end{cases}$$

where τ is the quantile, L is the quantile loss function, y is the actual value, and y' is the predicted value. We want to penalize the loss when the percentile is low but the prediction is high, and when the percentile is high but the prediction is low. In other words, we use the quantile loss to predict a percentile within which we are confident that the true value lies. The first condition applies when y − y' ≥ 0, which means the value predicted by our model is low; this is good if we want to predict the lower percentiles, but we want to penalize the loss if the predicted value is much higher than the true value. The condition y − y' < 0 indicates that our prediction is high, which is good for higher percentiles, but we want to penalize the loss if the predicted value is much lower than the true value. The quantile loss therefore differs depending on the quantile evaluated, so negative errors are penalized more when we specify higher quantiles, and positive errors are penalized more for lower quantiles. Now that we understand quantile regression, we need to see how to apply it in our scenario. After some research, we saw that gradient boosting regression has an option to use quantile regression, which simplifies the process; this is easily accomplished by specifying loss = "quantile" with the desired quantile as the alpha parameter. 5.3 Benchmarking the Model In order to evaluate the model performance and how well the interval range fits the true value, we decided to consider the frequency with which the true value is captured in the range. In other words, the accuracy of the model is the number of times the actual price falls in the predicted interval, divided by the number of records, as sketched below.
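A minimal scikit-learn sketch of the two-model interval approach described above: one gradient boosting regressor is trained for the lower quantile and one for the upper quantile, and coverage is measured as in Sect. 5.3. The quantile values 0.2 and 0.8 follow the tables below, while the other hyperparameters, and the X/y variable names, are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# X: document vectors (e.g. TF-IDF features), y: repair-job prices.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

common = dict(n_estimators=100, max_depth=3, learning_rate=0.1)
lower = GradientBoostingRegressor(loss="quantile", alpha=0.2, **common).fit(X_train, y_train)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.8, **common).fit(X_train, y_train)

y_low = lower.predict(X_test)
y_high = upper.predict(X_test)

# Coverage: fraction of true prices falling inside the predicted interval.
coverage = np.mean((y_test >= y_low) & (y_test <= y_high))
print(f"interval coverage: {coverage:.2%}")
```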
5.4 Model Comparisons We have considered only three popular methods of text vectorization so far, i.e. TF-IDF, Word2Vec, and GloVe. Table 2 shows the statistics for these methods: Table 2. Method comparison

Methods | Test accuracy | Training time
TF-IDF | 75.10% | 3 min
Word2Vec | 75.50% | 10 min
GloVe | 74.60% | 6 min
The values given in the table were obtained with gradient boosting models using quantile values of 0.8 and 0.2, to understand how the different methods behave. We also see that the more time required to train the model, the higher the accuracy. However, there is always a trade-off between accuracy and computational time: if training in the production environment is computationally expensive, it becomes questionable, because it would also cost a lot of money on the production side. It is therefore more optimal to achieve good accuracy at minimal cost. Note that the training time may vary depending on the size of the dataset and the instances used to train the models; in our case we used a GPU instance g4dn.xlarge, but again it is all a matter of resources and optimization. To keep the training time for the different tests as low as possible, we decided to use the model with the TF-IDF vectorization method to test the models with different quantile values, as shown in Table 3: Table 3. Quantile comparisons
Lower quantile | Upper quantile | Testing accuracy | Interval difference [in pounds]
0.2 | 0.8 | 75% | 28
0.18 | 0.82 | 78% | 30
0.15 | 0.85 | 82% | 34
0.1 | 0.9 | 89% | 44
The above table shows the data for different quantile levels and the test accuracy, as well as the interval difference (in pounds) between the lower and upper limits. From these values we can see a clear trade-off between the accuracy achieved and the interval difference, since it is very easy to capture more of the actual values within the range simply by extending it further. However, this is not our goal, as a large range is not useful in the real world; what we need is a reasonably narrow range with good detection accuracy. 5.5 Hyperparameter Tuning To understand the model even better, and since we had a large enough data set, we tried increasing the test share of the train-test split from 20% to 40%. The result of this
experiment is quite interesting, as it yielded accuracy similar to the 80%/20% split. Before reaching a conclusion, we had to test different values for the parameters of the gradient boosting regressor. We chose n_estimators, the number of trees built in the model, as the parameter to vary and compared the results. Table 4 shows the results for different values of n_estimators: Table 4. n_estimators hyperparameter
n_estimators | Test accuracy
40 | 82.66%
60 | 82.78%
80 | 82.86%
100 | 83.20%
120 | 83.08%
140 | 82.93%
160 | 82.92%
We can conclude that the change in accuracy is minimal, increasing or decreasing only by fractions of a percentage point. The accuracy peaks around 100 trees and then decreases again. A particular property of gradient boosting trees is that they depend on the learning rate and on the average price value; since the target value is mostly similar even for different repair jobs, this also affects the overall result when the trees are built.
6 Result In Table 5, we see an example of the predicted lower and upper limits for repair orders, which are called the "lower" and "upper" values, while "price" is the actual value in the test data. Table 5. Results
id | Price | Lower | Upper
1 | 90 | 58.821936 | 90.234450
2 | 85 | 49.699136 | 97.002349
3 | 84 | 59.973129 | 94.544094
4 | 80 | 65.785190 | 93.253945
5 | 95 | 60.164752 | 90.126343
For our model, we need to show the prediction error on the test set. Measuring the error from a prediction interval is more difficult than predicting a single number. We
can easily determine the percentage of cases where the true value was captured in the interval, but that value can easily be increased by widening the interval boundaries. Therefore, we also need to show the absolute error calculations to account for this, as in Table 6: Table 6. Absolute error calculations
 | absolute_error_lower | absolute_error_upper | absolute_error_interval
Count | 45343.000 | 45343.000 | 45343.000
Mean | 18.749340 | 20.576137 | 19.662739
Std | 15.600487 | 13.186537 | 6.646751
Min | 0.000465 | 0.000292 | 2.152931
25% | 6.298666 | 9.967990 | 14.898903
50% | 13.917017 | 19.058457 | 18.511
75% | 27.664566 | 29.542923 | 22.241346
Max | 75.874350 | 67.695534 | 57.763181
We see that the lower prediction has a smaller absolute error (with respect to the median) than the upper prediction. It is interesting to note that, in terms of the mean and standard deviation, the absolute error for the lower bound is almost the same as that for the upper bound, which shows that the upper and lower bounds lie almost equally far from the true value. 6.1 Inference From the tests performed above, we can conclude that the accuracy of the model does not change drastically even when increasing or decreasing the test-train split or the n_estimators value, because the variance in the data set is very small. To verify this, we added the outliers back into the data set to see whether any additional variance was introduced. This did not change anything, and the model returned about 78 percent accuracy on the test data, confirming that the variance in the data set is very small and that the performance changes only slightly with each change.
7 Scope of Future Improvements Building a model never really ends, as there will always be new requirements, adjustments, and improvements that need to be reconciled with the product development process. The main concern is to keep re-training the model as more data is collected over time, because new data brings its own variance and deviation and could be predicted incorrectly by a model trained on older data.
We have listed possible future improvements for this model:
a) Try various other machine learning algorithms or neural networks, altering the loss function to produce an interval instead of a single estimated value.
b) Add location data in the form of GPS coordinates to see how the price changes with location.
8 Conclusion The main goal of this work was to experiment with different word vectorization methods that can be used with the gradient boosting engine, taking advantage of its inherent quantile regression functionality, and to find out how text data behaves in a regression setting. From all the experiments conducted on this topic, we can conclude that: • We have successfully shown that we can predict numerical intervals for arbitrary text data. When a machine learning model predicts a single number, it creates the illusion of a high degree of confidence in the entire modelling process; since any model is only a rough approximation, we need to convey the uncertainty in the estimates. • Even though tree models can be robust, performance always depends on the problem we want to solve and the input data for the model. Comparing the different methods for converting words to vectors (TF-IDF, Word2Vec, GloVe) helped obtain important text features from the input data.
References 1. Nguyen, D., Smith, N.A., Rose, C.P.: Author age prediction from text using linear regression. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (2011) 2. Dereli, N., Saraclar, M.: Convolutional neural networks for financial text regression. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (2019) 3. Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., Smith, N.A.: Predicting risk from financial reports with regression. In: Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL (2009) 4. Aggarwal, A., Mittal, M., Battineni, G.: Generative adversarial network: an overview of theory and applications. Elsevier (2020) 5. Li, T., Liu, X., Su, S.: Semi-supervised text regression with conditional generative adversarial networks (2018) 6. Joshi, M., Das, D., Gimpel, K., Smith, N.A.: Movie reviews and revenues: an experiment in text regression. Language Technologies Institute (n.d.)
7. Bitvai, Z., Cohn, T.: Non-linear text regression with a deep convolutional neural network. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pp. 180–185 (2015) 8. Genkin, A., Lewis, D.D., Madigan, D.: Sparse logistic regression for text categorization (n.d.) 9. Chung Wu, H., Fai Wong, K., Kui, L.K.: Interpreting TF-IDF term weights as making relevance decisions. ACM Trans. Inf. Syst. 26, 1–37 (2008) 10. Mikolov, T., Chen, K.C., Dean, J.: Efficient estimation of word representations in vector space (2013) 11. Church, K.W.: Emerging trends Word2Vec. Nat. Lang. Eng. 155–162 (2016) 12. Ma, L., Zhang, Y.: Using Word2Vec to process big text data (2016) 13. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation (n.d.) 14. Ibrahim, M., Gauch, S., Gerth, T., Cox, B.: WOVe: incorporating word order in GloVe word embeddings (n.d.) 15. Shi, T., Liu, Z.: Linking GloVe with word2vec (2014) 16. Chowdhurya, S., Lin, Y., Liaw, B., Kerby, L.: Evaluation of tree based regression over multiple linear regression for non-normally distributed data in battery performance (2021) 17. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000). https://doi.org/10.1007/3540-45014-9_1 18. Friedman, J.H.: Greedy function approximation: a gradient boosting machine (2001) 19. Canay, I.A.: A simple approach to quantile regression for panel data. Econ. J. 14, 368–386 (2011) 20. Sáez, D., Ávila, F., Olivares, D., Cañizares, C., Marín, L.: Fuzzy prediction interval models for forecasting renewable resources and loads in microgrids. IEEE Trans. Smart Grid 6, 548–556 (2015) 21. Shrestha, D.L., Solomatine, D.P.: Machine learning approaches for estimation of prediction interval for the model output, 225–235 (2006)
Chosen Methods of Improving Small Object Recognition with Weak Recognizable Features
Magdalena Stachoń 1 and Marcin Pietroń 2(B)
1 Institute of Computer Science, AGH University of Science and Technology, Cracow, Poland
2 Institute of Electronics, AGH University of Science and Technology, Cracow, Poland
[email protected]
Abstract. Many object detection models struggle with several problematic aspects of small object detection including the low number of samples, lack of diversity and low features representation. Taking into account that GANs belong to generative models class, their initial objective is to learn to mimic any data distribution. Using the proper GAN model would enable augmenting low precision data increasing their amount and diversity. This solution could potentially result in improved object detection results. Additionally, incorporating GAN-based architecture inside deep learning model can increase accuracy of small objects recognition. In this work the GAN-based method with augmentation is presented to improve small object detection on VOC Pascal dataset. The method is compared with different popular augmentation strategies like object rotations, shifts etc. The experiments are based on FasterRCNN model. Keywords: Deep learning · Object detection · Generative Adversarial Networks · CNN models and VOC Pascal dataset
1
Introduction
Computer vision relies heavily on object detection in domains such as self-driving cars, face recognition, optical character recognition or medical image analysis. Over the past years, great progress has been made with the appearance of deep convolutional neural networks. Two groups of detectors emerged: the first, based on region nomination methods such as the R-CNN model family [5]; the other, one-stage detectors that enable real-time object detection with methods such as YOLO [6] or SSD [7] architectures. For those models, very impressive results have been achieved for high-resolution, clear objects; however, this does not apply to very small objects. Deep learning models create low-level features, which are afterwards combined into higher-level features that the network aims to detect. Due to significant image resolution reduction, small object features,
extracted in the first layers, disappear in the subsequent layers and are not reached by the detection and classification stages. Their poor-quality appearance makes it impossible to distinguish them from the other categories. Accurate small object detection is crucial for many disciplines and determines their credibility and effective usage. Detection of small traffic signs and objects influences the safety of self-driving cars. In medical image diagnosis, detection of a tumor a few pixels in size or chromosome recognition enables early treatment [13]. To make full use of satellite image inspection, many small objects need precise annotation. Taking some of those examples into account, small objects are present in every aspect of computer vision, and they should be treated with special attention, as they constitute one of the weakest parts of current object detection mechanisms. In this work, a few approaches were tested and compared with respect to how they can help improve the detection of small objects. First, the most popular methods based on augmentation techniques were taken. In the next stage, the efficiency of the perceptual GAN was tested [3]. The perceptual GAN was trained on data from the unbalanced original dataset and on data generated by DCGAN [9]. These approaches were tested on the VOC Pascal dataset [4] with the Faster R-CNN model [1] using the VGG [11] architecture as a backbone. The VOC Pascal dataset consists mainly of big objects, which enlarges the small object accuracy disparity, as the model focuses mainly on medium and big objects. Moreover, there is a significant disproportion in class count depending on object size, which results in a lack of diversity and variety of locations of small objects. The experiments face three problematic aspects of small object detection: the low number of samples, their lack of diversity, and low feature representation. The first phase involves dataset preparation by augmenting classes for small objects from the VOC Pascal dataset with several oversampling strategies. The original objects used for the oversampling method are enhanced by ones generated by a Generative Adversarial Network [14,26,27], based on the original paper [9] and customized for the training purpose. The augmented dataset is introduced to the Faster R-CNN model and evaluated on original ground-truth images. Secondly, for a selected class from the augmented training dataset with low classification results, an FGSM attack is conducted on objects as a trial to increase the identification score. Finally, in order to cope with poor small object representation, the enhanced dataset is introduced to the Perceptual GAN, which generates super-resolved, large-object-like representations for small objects and enables higher recognizability.
2
Related Works
To address the small object detection problem numerous methods are introduced with different results. In order to cover the small object accuracy gap between one and two-stage detectors, Focal Loss [16] can successfully be applied. It involves a loss function modification that puts more emphasis on misclassified examples. The contribution of correctly learned, easy examples is diminished during the training with the focus on difficult ones. With high-resolution photos and relatively small objects, the detection accuracy can also be improved by splitting
the input image into tiles, with every tile fed separately into the original network. The so-called pyramidal feature hierarchy [17] addresses the problem of scale-invariant object detection. By replacing the standard feature extractor, it allows creating better-quality, high-level multi-scale feature maps. The mechanism involves two inverse pathways. The feature maps computed in the forward pass are upsampled to match the previous layer dimension and added element-wise. In this way, the abstract low-level layers are enhanced with the semantically stronger features the network calculates close to its head, which facilitates the detector's pick-up of small objects. The evaluations on the MS COCO dataset [15] allowed increasing the overall mAP from 47.3 up to 56.9. Another approach [18] addresses a problem subdomain, face detection, and tries to make use of object context. The detectors are trained for different scales on features extracted from multiple layers of the feature hierarchy. Using context information to improve small object accuracy is also applied in [19]. In this work, the authors first extract object context from surrounding pixels by using more abstract features from high-level layers. The features of the object and the context are concatenated, providing an enhanced object representation. The evaluation is performed on an SSD with an attention module (A-SSD) that allows the network to focus on important parts rather than the whole image. Compared with conventional SSD, the method achieved a significant enhancement for small objects, from 20.7% to 28.5%. Another modification of the Feature Pyramid Network (FPN) approach applied to Faster R-CNN [20] extracts features of the 3rd, 4th, and 5th convolution layers for objects and uses multiscale features to boost small object detection performance. The features from the higher levels are concatenated with the ones from the lower levels into a single vector, and a 1 × 1 convolution is run on the result. This allowed a 0.1 increase in mAP with respect to the original Faster R-CNN model. The issue of the small number of samples is faced in [21], where the authors use oversampling as a small object dataset augmentation technique and reuse the original object by copy-pasting it several times. In this way, the model is encouraged to focus on small objects and the number of matched anchors increases, which results in a higher contribution of small objects to the loss function. Evaluated on Mask R-CNN using the MS COCO dataset, the small object AP increased while preserving the same performance on other object groups. The best performance gain is achieved with an oversampling ratio equal to three. Some generative models [8] attempt to achieve super-resolution representations for small objects and in this way facilitate their detection. Those frameworks already have the capability of inferring photo-realistic natural images for 4x upscaling factors; however, they require heavy time consumption for training. The proposed solution uses a deep residual network (ResNet [12]) in order to recover downsampled images. The model loss includes an adversarial loss, which pushes the discriminator to make a distinction between super-resolution images and original ones, and a content loss to achieve perceptual similarity instead of pixel-space similarity. An SRGAN derivative, classification-oriented SRGAN [22], appends a classification branch and introduces a classification loss to the typical SRGAN; the generator of CSRGAN is trained to reconstruct realistic super-resolved images with
classification-oriented discriminative features from low-resolution images, while the discriminator is trained to predict true categories and distinguish generated SR images from original ones. Another approach [23] proposes a data augmentation based on a foreground-background segregation model. It adds an assisting GAN network [2] to the original SSD training process. The first training phase focuses on the foreground-background model and pre-training of object detection. The second stage includes data enhancement applied with a certain probability, such as color channel change, noise addition, and contrast boost. The proposed method increases the overall mAP to 78.7% (the SSD300 baseline equals 77.5%). Another super-resolution network, SOD-MTGAN [24], aims to create images in which it will be easier for the resulting detector, which is trained alongside the generator, to actually locate the small objects. The generator here is used to upsize blurred images to better quality and to create descriptive features for those small objects. The discriminator, apart from differentiating between real and generated images, describes them with a category score and a bounding box location. The Perceptual GAN presented in [3] has the same goal as the previous super-resolution network but a slightly different implementation. Its generator learns to transfigure poor representations of small objects into super-resolved ones that are commensurate with real large objects in order to deceive a competing discriminator. Meanwhile, its discriminator contends with the generator to identify the generated representation and enforces an additional perceptual loss: generated super-resolution representations of small objects must be useful for the detection task. The small objects problem has already been noticed and some enhancement methods have been proposed; however, there is still much room for improvement in this area, as the described methods are often domain-specific and apply to certain datasets. Generative Adversarial Networks are worth further exploration in the object detection area.
3
Dataset Analysis
Three size groups are extracted from the VOC Pascal dataset according to the annotation bounding boxes: small (size below 32 × 32), medium (size between 32 × 32 and 64 × 64), and big (size above 64 × 64). Corresponding XML annotations are saved per each category containing only objects with selected size. Tables 1 and 2 present object distribution in regards to the category and size, with significant variances. For the trainval dataset, small objects constitute less than 6% of the total objects count. Excluding difficult examples, this number reduces to 1.3%. Similar statistics apply to the test dataset. Those numbers confirm the problem described earlier. There is a significant disproportion in the object numerosity for different size groups. The categorized VOC Pascal dataset is introduced to the pre-trained PyTorch Faster R-CNN model described above with the accuracy metrics presented in Table 3. Overall the network’s performance on small objects (3.13%) is more than 20 times worse than on big objects (70.38%). Additionally, the number of samples per category differs substantially,
which at least partially contributes to the very low accuracy scores. Only 5.9% of annotated objects from the trainval dataset belong to small objects, whereas medium and big objects take 15.36% and 78.74% respectively. The low number of samples and the poor representation of smaller objects is one of the major obstacles in further work, as it prevents the network from learning the right representation for object detection. The great disparity between the big and small object counts biases the Faster R-CNN training to focus on bigger objects. Moreover, as shown in Table 3, there is a significant disproportion in class numerosity. Based on the publicly available DCGAN model, a customized, stable GAN implementation is introduced in order to increase the variety of small objects and provide a clearer representation. The selected solution augments individual objects instead of whole images.
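The size split described in this section can be illustrated with a short script. The 32 × 32 and 64 × 64 thresholds follow the text, while the use of box area (rather than both side lengths) and the annotation paths are assumptions made for illustration.

```python
# Sketch: assign each VOC ground-truth box to the small / medium / big group.
import xml.etree.ElementTree as ET
from pathlib import Path
from collections import defaultdict

def size_group(width, height):
    area = width * height
    if area < 32 * 32:
        return "small"
    if area < 64 * 64:
        return "medium"
    return "big"

def split_voc_annotations(annotation_dir):
    counts = defaultdict(lambda: defaultdict(int))  # counts[group][class] -> n
    for xml_file in Path(annotation_dir).glob("*.xml"):
        root = ET.parse(xml_file).getroot()
        for obj in root.iter("object"):
            name = obj.findtext("name")
            box = obj.find("bndbox")
            w = int(float(box.findtext("xmax"))) - int(float(box.findtext("xmin")))
            h = int(float(box.findtext("ymax"))) - int(float(box.findtext("ymin")))
            counts[size_group(w, h)][name] += 1
    return counts

# Example (path is an assumption):
# counts = split_voc_annotations("VOCdevkit/VOC2007/Annotations")
```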
4
Data Augmentation with DCGAN
One of the augmentation technique used in experiments was small object generation with DCGAN (deep convolutional GAN) [28]. The model’s discriminator is made up of convolution layers, batch norm layers, and leaky ReLU activations. The discriminator input is a 3 × 32 × 32 image and the network’s output is a scalar probability that the input is from the real data distribution. The generator is comprised of a series of convolutional-transpose layers, batch norm layers, and ReLU activations. The input is a 100-dimensional latent vector, z, extracted from a standard Gaussian distribution and the output is a 3 × 32 × 32 RGB image. The initial model weights are initialized randomly with a normal distribution with mean 0 and stdev 0.02. Both models use Adam optimizers with learning rate 0.0002 and beta = 0.5. The batch size is 64. Additionally, in order to improve the network’s stability and performance, some adjustments are introduced. The training is split into two parts for the generator and the discriminator, as different batches for real and fake objects are constructed. Secondly, to equalize the generator and discriminator training progress, soft and noisy data labels are introduced. Instead of labeling real and fake data as 1 and 0, a random number from range 0.8–1.0 and 0.0–0.2 is chosen. Moreover, the generator uses dropouts after each layer (25%). The generator’s progress is assessed manually, by generating a fixed batch of latent vectors that are drawn from a Gaussian distribution and periodically input to the generator. The evaluation includes both the quality and diversity of the images in relation to the target domain. The typical training lasts from 1000–2000 epochs depending on dataset numerosity. In Fig. 1 the bird class generation is presented.
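A compact PyTorch sketch of the 32 × 32 DCGAN setup described above is given below. The latent size, optimizer settings, weight initialization, soft/noisy labels and generator dropout follow the text; the layer widths are illustrative choices rather than the authors' exact configuration.

```python
# Sketch of a 32x32 DCGAN in PyTorch (latent size 100, Adam lr=2e-4, beta1=0.5).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, nz=100, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(nz, ngf * 4, 4, 1, 0, bias=False),      # 1x1 -> 4x4
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True), nn.Dropout2d(0.25),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False), # 4x4 -> 8x8
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True), nn.Dropout2d(0.25),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),     # 8x8 -> 16x16
            nn.BatchNorm2d(ngf), nn.ReLU(True), nn.Dropout2d(0.25),
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),           # 16x16 -> 32x32
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 4, 1, 4, 1, 0, bias=False), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1)

def weights_init(m):
    # Random normal initialization (mean 0, std 0.02) as described in the text.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, 0.0, 0.02)

netG, netD = Generator(), Discriminator()
netG.apply(weights_init)
netD.apply(weights_init)
optG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
optD = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Soft, noisy labels instead of hard 1/0 targets, as described in the text.
def real_labels(n): return torch.empty(n).uniform_(0.8, 1.0)
def fake_labels(n): return torch.empty(n).uniform_(0.0, 0.2)
```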
Table 1. Object number statistics for classes from VOC Pascal (airplane, bike, bird, boat, bottle, bus, car, cat, chair, cow) for trainval and test set with the division for small, medium, big categories

Type | Airplane | Bike | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow
Test small | 25 | 12 | 57 | 65 | 40 | 6 | 149 | 1 | 58 | 28
Test small (non diff) | 23 | 0 | 12 | 4 | 1 | 1 | 33 | 1 | 0 | 3
Trainval small | 19 | 15 | 36 | 25 | 62 | 12 | 173 | 0 | 34 | 21
Trainval small (non diff) | 15 | 1 | 8 | 1 | 10 | 1 | 35 | 0 | 2 | 1
Test medium | 35 | 31 | 116 | 87 | 186 | 18 | 339 | 6 | 250 | 96
Trainval medium | 39 | 36 | 119 | 90 | 168 | 18 | 398 | 9 | 258 | 56
Test big | 251 | 346 | 403 | 241 | 431 | 230 | 1053 | 363 | 1066 | 205
Trainval big | 273 | 367 | 444 | 283 | 404 | 242 | 1073 | 380 | 1140 | 279
Table 2. Object number statistics for classes from VOC Pascal (table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv monitor) for trainval and test set with the division for small, medium, big categories

Type | Table | Dog | Horse | Moto | Person | Plant | Sheep | Sofa | Train | Tv
Test small | 1 | 4 | 4 | 9 | 305 | 35 | 47 | 0 | 0 | 15
Test small (non diff) | 0 | 0 | 1 | 1 | 99 | 15 | 14 | 0 | 0 | 7
Trainval small | 0 | 0 | 3 | 11 | 370 | 41 | 87 | 0 | 1 | 14
Trainval small (non diff) | 0 | 0 | 1 | 1 | 89 | 20 | 22 | 0 | 0 | 2
Test medium | 8 | 18 | 34 | 36 | 811 | 123 | 57 | 0 | 11 | 64
Trainval medium | 11 | 18 | 15 | 30 | 819 | 159 | 79 | 6 | 13 | 65
Test big | 290 | 508 | 357 | 324 | 4111 | 434 | 207 | 396 | 291 | 282
Trainval big | 299 | 520 | 388 | 349 | 4258 | 425 | 187 | 419 | 314 | 288
Fig. 1. DCGAN generated samples of a bird (left), original training dataset for the bird category (right)
Table 3. Mean average precision metric in percent for VOC Pascal small, medium, big objects for the pre-trained Faster R-CNN model, together with the number of samples per size category for trainval and test images (both include objects marked as difficult). As one image may contain objects from multiple size groups, its ID may appear in several size categories; that is the reason why the numbers of images with small, medium and big objects do not sum to the total number of images.

Dataset | All objects | Small objects | Medium objects | Big objects
mAP | 69.98 | 3.13 | 12.62 | 70.38
Number of images (test) | 4 952 | 366 | 995 | 4 843
Number of objects (test) | 14 976 | 861 | 2 326 | 11 789
Number of images (trainval) | 5 011 | 378 | 486 | 4 624
Number of objects (trainval) | 15 662 | 924 | 2 406 | 12 332
GAN training requires a considerable amount of data of low dimensionality and clear representation, which determines the quality of the generated objects. The main object detection evaluation dataset in the presented work is VOC Pascal; however, it is a relatively small dataset, even when all size groups for a given object are included in training. To make the training of DCGAN more efficient, the dataset had to be enhanced with images from other sources. Samples from the following datasets were tried for the training: Stanford-cars [25], CINIC-10, FGVC-aircraft, MS COCO, CIFAR-100 [10], 102 Category Flower Dataset, ImageNet and Caltech-UCSD. Seven categories were taken as a case study: car, airplane, tv monitor, boat, potted plant, bird and horse. Table 4 presents the datasets that were successfully applied for object generation with the corresponding counts. To fulfill the input requirements, some data preprocessing procedures had to be conducted in order to obtain 32 × 32 RGB images. Despite its numerosity, training on the ImageNet dataset did not bring positive results: the representations of many images are not clear enough, multiple objects are present in a single category image, and downsampling high-resolution images outputs noise. Similar results were obtained for the MS COCO dataset: with the majority of objects being rectangular, after downsampling to a square 32 × 32 resolution the preprocessed images present noise. In conclusion, a successful dataset for a deep GAN should consist of at least several thousand low-resolution samples with a clear representation per each generated class.
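As an illustration of the preprocessing mentioned above, a possible torchvision pipeline for producing 32 × 32 RGB training samples is sketched below; the folder layout and the crop policy are assumptions, not the authors' exact procedure.

```python
# Sketch of a 32x32 RGB preprocessing pipeline for DCGAN training data.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

transform = T.Compose([
    T.Resize(32),                  # downsample the shorter side to 32 px
    T.CenterCrop(32),              # force a square 32x32 patch
    T.ToTensor(),
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # match the generator's tanh range
])

# Assumed layout: one subfolder per generated category, e.g. gan_training_data/bird/...
dataset = ImageFolder("gan_training_data", transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True)
```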
5 Augmentation Setup
5.1 Oversampling Strategies
The training dataset for learning FasterRCNN is augmented with several oversampling techniques. As a common rule, the objects are copied from the original location and pasted to a different position, which does not overlap with
Table 4. Datasets used for DCGAN training with the cumulative number of samples per each used category

Category | Datasets | Count
Car | Stanford-cars, CINIC-10, VOC Pascal | 22 928
Airplane | FGVC-aircraft, CINIC-10, MS COCO, VOC Pascal | 19 358
Tv monitor | MS COCO, VOC Pascal | 10 175
Boat | MS COCO, CINIC-10, VOC Pascal | 28 793
Potted plant | CIFAR-100, ImageNet, MS COCO, 102 Category Flower Dataset, VOC Pascal | 17 772
Bird | Caltech-UCSD, CINIC-10, VOC Pascal | 12 363
Horse | CINIC-10, MS COCO, VOC Pascal | 33 012
other objects and the image boundaries. The oversampling ratio equals three. This easy method allows enlarging the area covered by small objects and puts more emphasis on smaller object loss during the training stage. The experiment’s overview is summarized in Table 5. There are two sets of experiments conducted. In the first one, the VOC Pascal objects are used as oversampling strategy in three different scenarios. First, the original small object is picked per image and copied-pasted three times in random locations keeping the original object size (strategy 1). Second, instead of multiplying the original object, a random VOC Pascal object for the matching category is picked with returns for every copy-paste action. Each oversampled object is rescaled to a random width and height taken from a range of current width and height and 32 pixels (strategy 2). The oversampling increases the objects count, however taking into account, that the original number of small object samples is around 30 times less than the bigger ones, this oversampling strategy would result in the dataset size increased by 3, still leaving the considerable count disproportion. To address this problem, every image containing a small object is used 5 times with the described random oversampling strategy, which results in multiple original images oversampled with different objects for a given category (strategy 3). The third strategy additionally involves object class modification. The previous method is extended with the following assumption, for every picked small object a random VOC Pascal category is chosen, from the set of most numerous small test categories. This excludes the following classes: sofa, dog, table, and cat for which the number of both train and test samples is below five. The second set of experiments makes use of DCGAN generated objects. The generated objects are switched for the following categories: airplane, bird, boat, car, chair, horse, person, potted plant, and tv monitor. There are two test settings similar to the ones conducted for original VOC Pascal objects augmentation. In the first set, for every original small object, a random object of the same category is picked. For DCGAN available classes, generated objects are used, for the others (bike, bottle, bus, cow, motorbike, sheep) the samples come from VOC Pascal. Similarly to the first experiment set, every image containing a small object is used
for oversampling five times with the oversampling ratio equal to three. The second setting preserves the conditions from the first one and additionally switches the class for every object, meanwhile increasing the oversampling repetition count. For less numerous classes with a trainval count below 15 (airplane, train, bicycle, horse, motorbike, bus), the oversampling procedure is repeated 15 times; for other categories, 10 times. Detailed information about the augmented dataset numerosity for each oversampling strategy is presented in Table 6. Taking into account the datasets introduced, it is clear that some of them (strategies 1, 2 and 4) present a significant disparity in the number of samples for different categories, as they do not include the class change. In Strategy 2, the person class numerosity is more than 100 times the count for the train category. To cover this disparity, the random class modification is introduced in Strategies 3 and 5, which results in a more evenly class-distributed dataset. For every experiment, VOC Pascal image file annotations are created with the corresponding objects, while the original annotations are erased in order to avoid duplicates. The five resulting augmented datasets are introduced to the Faster R-CNN network separately; for each, the training is combined with the original trainval VOC Pascal dataset. The results are evaluated on the original ground-truth dataset, divided into three size categories (small, medium and big). A rough sketch of the basic copy-paste operation is shown after Table 5.

Table 5. Augmentation strategies overview
Strategy 1 | Oversample x3 the original VOC Pascal object with random size and random location
Strategy 2 | Oversample x3 a random VOC Pascal object of the same category with random size and location; the procedure is repeated 5 times for every small object
Strategy 3 | Oversample x3 a random VOC Pascal object of a randomly changed category with random size and location; the procedure is repeated 5 times for every small object
Strategy 4 | Oversample x3; for selected classes DCGAN generated objects are used, for others random VOC Pascal objects with random size and location; the object category is preserved; the procedure is repeated 5 times for every small object
Strategy 5 | Oversample x3; for selected classes DCGAN generated objects are used, for others random VOC Pascal objects with random size and location; the object category is randomly chosen; the procedure is repeated 15 times for less numerous categories (aeroplane, train, bicycle, horse, motorbike, bus), 10 times for the others
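Below is the rough sketch of the basic copy-paste operation underlying the strategies in Table 5. It pastes one object patch three times at random locations that do not overlap existing boxes or the image borders, and it omits rescaling and class switching; it is an illustration, not the authors' implementation.

```python
# Sketch: copy-paste a small object patch into non-overlapping random locations.
import random

def boxes_overlap(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return not (ax2 <= bx1 or bx2 <= ax1 or ay2 <= by1 or by2 <= ay1)

def paste_object(image, boxes, obj_patch, copies=3, max_tries=50):
    """image: H x W x 3 array, boxes: list of (x1, y1, x2, y2), obj_patch: h x w x 3 crop."""
    h, w = obj_patch.shape[:2]
    H, W = image.shape[:2]
    new_boxes = []
    for _ in range(copies):
        for _ in range(max_tries):
            x1 = random.randint(0, W - w)
            y1 = random.randint(0, H - h)
            cand = (x1, y1, x1 + w, y1 + h)
            if not any(boxes_overlap(cand, b) for b in boxes + new_boxes):
                image[y1:y1 + h, x1:x1 + w] = obj_patch  # paste the crop
                new_boxes.append(cand)                   # new annotation for the copy
                break
    return image, new_boxes
```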
5.2
Perceptual GAN
Having prepared the equally size-distributed dataset and ensured that the generated objects are correctly recognized by the network, the data are introduced to the Perceptual GAN, which addresses the next reason for poor small object detection - their low feature representation. The PCGAN aims to generate super-resolved
Table 6. Augmented dataset count summary, distributed per class and augmentation strategy. Strategies 1, 2 and 4, as they do not include the class change, present a significant disparity in the number of samples for different categories. In Strategy 2, the person class numerosity is more than 100 times the count for the train category. To cover this disparity, the random class modification is introduced in Strategies 3 and 5, which results in a more even class distribution.

Class | Strategy 1 | Strategy 2 | Strategy 3 | Strategy 4 | Strategy 5
All | 3 696 | 6 810 | 6 810 | 6 810 | 12 624
Airplane | 76 | 187 | 381 | 187 | 790
Bike | 60 | 48 | 296 | 48 | 703
Bird | 144 | 216 | 406 | 216 | 783
Boat | 100 | 235 | 413 | 235 | 762
Bottle | 248 | 422 | 414 | 422 | 829
Bus | 48 | 120 | 426 | 120 | 763
Car | 692 | 1 448 | 540 | 1 448 | 858
Chair | 136 | 319 | 416 | 319 | 745
Cow | 84 | 186 | 367 | 186 | 729
Horse | 12 | 51 | 334 | 51 | 742
Motorbike | 44 | 116 | 366 | 116 | 699
Person | 1 480 | 2 680 | 722 | 2 680 | 1 098
Plant | 164 | 251 | 412 | 251 | 728
Sheep | 348 | 342 | 457 | 342 | 882
Train | 4 | 25 | 381 | 25 | 712
Tv | 56 | 164 | 379 | 164 | 801
large-object-like representations for small objects. The approach is similar to the architecture described in [3]. The generator model is a modified Faster R-CNN network: it is based on Faster R-CNN with a residual branch which accepts the features from the lower-level convolutional layer (first conv layer) and passes them to 3 × 3 and 1 × 1 convolutions, followed by a max pool layer. As a next step there are two residual blocks, each consisting of two 3 × 3 convolutions and batch normalizations with ReLU activations, which aim to learn the residual representation of small objects. The super-resolved representation is acquired by the element-wise sum of the learned residual representation and the features pooled from the fifth conv layer in the main branch. The learning objective for vanilla GAN models [27] corresponds to a minimax two-player game, which is formulated as (Eq. 1):

\min_G \max_D L(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (1)
G represents a generator that learns to map data z (with the noise distribution p_z(z)) to the distribution p_{data}(x) over data x. D represents a discriminator that estimates the probability of a sample coming from the data distribution p_{data}(x) rather than p_z(z). The training procedure for G is to maximize the probability of D making a mistake.
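Before specializing this objective to feature representations, the generator structure described at the start of this section can be made more concrete with a hedged PyTorch sketch of the residual branch. The channel sizes assume a VGG-16 backbone (64 channels after the first conv layer, 512 after the fifth) and the pooling details are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of the residual branch that turns low-level conv1 ROI features into a
# residual added to the conv5 ROI features (the "super-resolved" representation).
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        return out + x

class ResidualBranch(nn.Module):
    def __init__(self, low_channels=64, high_channels=512):
        super().__init__()
        self.entry = nn.Sequential(
            nn.Conv2d(low_channels, high_channels, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(high_channels, high_channels, 1), nn.ReLU(True),
            nn.MaxPool2d(2),
        )
        self.blocks = nn.Sequential(ResidualBlock(high_channels),
                                    ResidualBlock(high_channels))

    def forward(self, conv1_roi_feats, conv5_roi_feats):
        residual = self.blocks(self.entry(conv1_roi_feats))
        # Pool the residual to the conv5 feature size before the element-wise sum.
        residual = F.adaptive_max_pool2d(residual, conv5_roi_feats.shape[-2:])
        return conv5_roi_feats + residual  # super-resolved representation
```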
The x and z are the representations of large objects and small objects, i.e., F_l and F_s respectively. The goal is to learn a generator function which transforms the representation of a small object F_s into a super-resolved one G(F_s) that is similar to the original representation of the large object F_l. Therefore, a new conditional generator model is introduced which is conditioned on extra auxiliary information, i.e., the low-level features of the small object f, from which the generator learns to generate the residual representation between the representations of large and small objects through residual learning (Eq. 2):

\min_G \max_D L(D, G) = \mathbb{E}_{F_l \sim p_{data}(F_l)}[\log D(F_l)] + \mathbb{E}_{F_s \sim p_{F_s}}[\log(1 - D(F_s + G(F_s \mid f)))]    (2)

In this case, the generator training can be substantially simplified compared to directly learning the super-resolved representations for small objects. For example, if the input representation is from a large object, the generator only needs to learn a zero-mapping. The original paper's discriminator consists of two branches [3]: an adversarial branch, to distinguish the generated super-resolved representation from the original one of the large object, and a perception branch, to validate the accuracy influence of the generated super-resolved features. In this solution, the perception branch is omitted, with the main emphasis put on the adversarial branch. The adversarial branch consists of three fully connected layers, followed by a sigmoid activation producing an adversarial loss. For the training purpose, two datasets are prepared, containing images with only small and only big objects respectively. The images are resized to 1000 × 600 pixels. To solve the adversarial min-max problem, the parameters of the generator and the discriminator networks are optimized. Denote G_{\Theta_g} as the generator network with parameters \Theta_g. The \Theta_g is obtained by optimizing the loss function L_{dis} (L_{dis} is the adversarial loss, Eq. 3):

\Theta_g = \arg\min_{\Theta_g} L_{dis}(G_{\Theta_g}(F_s))    (3)
Suppose D_{\Theta_a} is the adversarial branch of the discriminator network parameterized by \Theta_a. The \Theta_a is obtained by optimizing a specific loss function L_a (Eq. 4):

\Theta_a = \arg\min_{\Theta_a} L_a(G_{\Theta_g}(F_s), F_l)    (4)
The loss L_a is defined as:

L_a = -[\log D_{\Theta_a}(F_l) + \log(1 - D_{\Theta_a}(F_s + G_{\Theta_g}(F_s)))]    (5)
The L_a loss encourages the discriminator network to distinguish the difference between the currently generated super-resolved representation for the small object and the original one from the real large object. In the first phase, the large objects form the real batch that is forward-passed through the discriminator. Next, the generator is trained with the small object dataset, trying to maximize log(D(G(z))), where G(z) is the fake super-resolved small object representation. The generator's loss, apart from
the adversarial loss reflecting the probability of the input belonging to a large object, also includes the RPN and ROI losses. The whole network is trained with Stochastic Gradient Descent with momentum 0.9 and learning rate 0.0005. The perceptual GAN training is performed separately on two datasets: the oversampled set representing small objects and the original VOC Pascal trainval set for large objects. For the small object dataset, the oversampling strategy with VOC Pascal objects combined with the random class switch was used. First, the evaluation is conducted on the VOC small objects subset. Then an ensemble model is created from the original Faster R-CNN and the PCGAN, and a voting mechanism is added at the end. This solution allows detecting large objects at a similar level as before while increasing the detection accuracy of small objects.
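A minimal sketch of one alternating update implementing the adversarial part of Eqs. (3)-(5) is shown below. `generator` and `discriminator` are placeholder modules (the latter ending in a sigmoid, as described above), and the RPN/ROI detection losses of the full training objective are omitted; this is an illustration under those assumptions, not the authors' code.

```python
# Sketch of one adversarial training step for the perceptual-GAN-style setup.
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, opt_g, opt_d, small_feats, large_feats):
    # Discriminator step, Eq. (5): L_a = -[log D(F_l) + log(1 - D(F_s + G(F_s)))]
    with torch.no_grad():
        super_resolved = small_feats + generator(small_feats)
    d_real = discriminator(large_feats)
    d_fake = discriminator(super_resolved)
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step, Eq. (3): push D towards labelling the generated
    # representation as a large object (detection losses would be added here).
    super_resolved = small_feats + generator(small_feats)
    d_fake = discriminator(super_resolved)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```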
6
Results
Table 7 shows the mAP scores achieved by the Faster R-CNN model trained with the datasets obtained with the described augmentation strategies, evaluated on the original VOC Pascal dataset.

Table 7. Evaluation of the different augmentation strategies described in Sect. 5. The tests are conducted on the VOC Pascal dataset, split into three size categories (see Sect. 5.1). Mean average precision is given as a percentage value, with additional information about the augmented small object train dataset count.

Strategy | Small obj count | mAP - small | mAP - medium | mAP - big
Original | 861 | 3.10 | 12.62 | 70.38
Strategy 1 | 3 696 | 5.79 | 12.71 | 66.80
Strategy 2 | 6 810 | 5.84 | 12.95 | 67.75
Strategy 3 | 6 810 | 7.08 | 14.17 | 66.73
Strategy 4 | 6 810 | 5.47 | 13.56 | 67.84
Strategy 5 | 12 624 | 7.60 | 16.28 | 67.01
The strategies including random class modification for oversampling with VOC Pascal and generated objects (strategies 3 and 5) outperform the original results by 3.98% and 4.5% respectively. Generally, by increasing the number of samples during training, the mAP on small objects can be improved without any model modification. As shown, even the most naive solution - oversampling the original object without any changes - allowed achieving an almost two times better score. The most gain is observed with oversampling using DCGAN-generated objects. However, the accuracy differences between using VOC Pascal and generated objects are quite low (∼0.6%). In addition, augmenting small objects affected the performance on medium objects: in the case of the first two strategies the mAP remained unchanged, while for the other cases it achieved a better score than
the original. The best performance is registered for the last oversampling strategy, which assumed augmentation with generated objects together with a random class switch, with a score of 16.28%. It outperforms the original results and the results augmented with VOC objects by 3.66% and 2.11% respectively. Summing up, the Strategy 5 oversampling method produced the overall best results: firstly, due to the highest number of samples used, distributed evenly between categories, and secondly, by enhancing the representation of the objects. In order to demonstrate where the improvement comes from, the Faster R-CNN results over classes are presented. Tables 8 and 9 show the results of the oversampling scenarios evaluated on the test VOC Pascal dataset split by categories. Overall, the presented augmentation strategies bring improvement to most of the analyzed categories with oversampling performed. This applies to the following classes: airplane, bird, boat, bus, car, cow, horse, motorcycle, person, potted plant, sheep and tv monitor. As explained in Sect. 4, some categories are not subject to the augmentation procedure due to the very low number of original train and test samples (sofa, dog, table and cat). For the remaining classes (bike, bottle, chair), despite boosting the trainval set representation, no detection improvement is observed. The reason might be the quality and the features of the train and test subsets for those categories, where the majority of small objects is classified as difficult. This outcome is also confirmed by the chair class, where applying representative DCGAN-generated objects once again does not influence the achieved mAP score, as all 58 test objects are difficult ones. Additionally, it is worth mentioning that the context plays a significant role for the airplane category: the best results are obtained for the second strategy with no class change. Another interesting observation is the fact that bird and cow are the categories that benefit most from DCGAN-generated objects. The mAP score for bird is 5.34%, which is 4.45% and 1.6% better than the original and the best VOC oversampling strategy, respectively. The cow category has an at least two times better mAP score using DCGAN-based oversampling. On the other hand, airplane is the only category for which the generated objects used in Strategy 5 do not improve the detection accuracy.

Table 8. VOC Pascal per-category mAP scores (airplane, bike, bird, boat, bottle, bus, car, chair, cow) on the small object test group for the original model and the augmentation strategies. Mean average precision is given as a percentage value.

Type | Airplane | Bike | Bird | Boat | Bottle | Bus | Car | Chair | Cow
Original | 6.09 | 0.00 | 0.89 | 1.19 | 0.00 | 0.00 | 2.19 | 0.00 | 3.16
Strategy 1 | 7.04 | 0.00 | 0.69 | 1.09 | 0.00 | 0.00 | 4.31 | 0.00 | 10.98
Strategy 2 | 18.70 | 0.00 | 2.04 | 0.00 | 0.00 | 0.83 | 4.26 | 0.00 | 4.16
Strategy 3 | 10.72 | 0.00 | 3.74 | 0.85 | 0.00 | 2.38 | 4.38 | 0.00 | 7.08
Strategy 4 | 2.42 | 0.00 | 5.34 | 1.47 | 0.00 | 0.00 | 4.39 | 0.00 | 22.52
Strategy 5 | 5.37 | 0.00 | 1.66 | 1.36 | 0.00 | 0.00 | 3.99 | 0.00 | 41.52
Table 9. VOC Pascal per-category mAP scores (horse, motorbike, person, potted plant, sheep and tv monitor) on the small object test group for the original model and the augmentation strategies. Mean average precision is given as a percentage value.

Type | Horse | Moto | Person | Plant | Sheep | Tv
Original | 2.63 | 0.00 | 0.92 | 0.44 | 20.07 | 6.20
Strategy 1 | 2.56 | 3.57 | 1.24 | 3.93 | 29.09 | 16.34
Strategy 2 | 3.03 | 4.00 | 1.28 | 5.17 | 25.29 | 13.07
Strategy 3 | 7.14 | 5.00 | 1.43 | 4.75 | 34.16 | 17.47
Strategy 4 | 1.96 | 2.17 | 1.32 | 2.31 | 27.39 | 5.23
Strategy 5 | 3.03 | 5.55 | 1.17 | 1.03 | 25.66 | 16.43
Table 10 provides the summary of the PCGAN training results on augmented VOC Pascal. For this process, the third oversampling strategy dataset is used as the small object image dataset, instead of the original VOC Pascal, in order to obtain a similar number of small and big objects, as required for the training phase. Overall, PCGAN allowed increasing the mAP for small objects in the nine presented classes. It is apparent that the solution allows much better performance than the simple augmentation strategies. The most gain in the mAP score is observed for the motorcycle class: from the original 0%, through 5% (oversampling), up to 50% (perceptual GAN). The other significant improvement is represented by the bird category, with a score of 11.62%, which is 2.3 and 13.1 times better than the oversampling and original results, respectively. For the aeroplane, boat, bus, cow, sheep and tv monitor categories, the mAP performance fluctuates around 1.5 times better than the oversampling score. Worth mentioning is the cow class, for which Faster R-CNN achieved over 65% accuracy. The following observations may be extracted at this point. Firstly, the Perceptual GAN can be successfully extended from its initial application to a natural scene image dataset. Secondly, a dedicated solution such as PCGAN, despite a heavier training procedure (generator and discriminator), allows achieving significantly better detection accuracy for small objects than the augmentation methods. The described solution may be efficiently employed with the original Faster R-CNN as a parallel network, forming an ensemble model. The second possibility is to use the conditional generator as described in Sect. 5.2. In both solutions the original mAP score for big objects is preserved, the small object mAP is significantly improved, and the score for medium objects is at the same level or slightly better than in the original model. After applying these approaches, the mAP for the whole dataset is improved by up to 0.3% (from 69.98% up to ∼70.3%). The small increase is dictated by the small percentage of small and medium objects in the test dataset (Tables 1 and 2).
Table 10. Evaluation of the PCGAN training described in Sect. 5.2. The results are presented in comparison with the original Faster R-CNN and the oversampling strategy that received the best score for a given category. All three tests are conducted on the small object VOC Pascal test group. Mean average precision is given as a percentage value.

Strategy | Aeroplane | Bird | Boat | Bus | Cow | Horse | Moto | Sheep | Tv
Original | 6.09 | 0.89 | 1.20 | 0.00 | 3.16 | 2.63 | 0.00 | 20.07 | 6.20
Best oversampling | 18.70 | 5.34 | 1.47 | 2.38 | 41.52 | 7.14 | 5.55 | 34.16 | 17.47
PCGAN | 28.97 | 11.62 | 2.49 | 3.45 | 65.19 | 9.09 | 50.0 | 44.58 | 27.27
For all simulations presented in the paper, the ratios of width and height of the generated Faster R-CNN anchors are 0.5, 1 and 2. The areas of the anchors (anchor scales) are defined as 8, 16 and 32, with a feature stride equal to 16. The learning rate is 0.01.
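For readers who want to reproduce a similar configuration with torchvision's Faster R-CNN, the quoted anchor settings can be expressed roughly as follows. The translation of scale × stride into pixel anchor sizes and the use of `AnchorGenerator` are assumptions for illustration, not the authors' original code.

```python
# Hedged sketch: scales 8, 16, 32 on a stride-16 feature map correspond to
# anchor boxes of 128, 256 and 512 pixels, with aspect ratios 0.5, 1 and 2.
from torchvision.models.detection.rpn import AnchorGenerator

anchor_generator = AnchorGenerator(
    sizes=((8 * 16, 16 * 16, 32 * 16),),   # anchor sizes expressed in pixels
    aspect_ratios=((0.5, 1.0, 2.0),),
)
```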
7
Conclusions and Future Work
The work presents a comparison of a few strategies for improving small object detection. The presented results show that the solution with a GAN architecture outperforms other well-known augmentation approaches. The perceptual GAN is significantly better than the oversampling strategies based on DCGAN image generation; it achieves better results with a similar amount of training data. It is worth noting that all presented approaches required a 10-20 fold increase in the number of small objects. Future work will concentrate on further improvements using the perceptual GAN. The experiments will focus on perceptual GAN architecture exploration. Next, the solution will be tested on other object detection datasets.
References 1. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. https://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf 2. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014). http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf 3. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection (2017). https://arxiv.org/pdf/1706.05274.pdf 4. Everingham, M., van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL visual object classes homepage (2014). http://host.robots.ox.ac.uk/pascal/VOC/ 5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation (2014). https://arxiv.org/pdf/1311.2524.pdf
6. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection (2016). https://arxiv.org/pdf/1506.02640.pdf 7. Liu, W., et al.: SSD: single shot multibox detector (2016). https://arxiv.org/pdf/1512.02325.pdf 8. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network (2017). https://arxiv.org/pdf/1609.04802.pdf 9. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks (2016). https://arxiv.org/pdf/1511.06434.pdf 10. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10, CIFAR-100 dataset (2009). https://www.cs.toronto.edu/~kriz/cifar.html 11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015). https://arxiv.org/pdf/1409.1556.pdf 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). https://arxiv.org/pdf/1512.03385.pdf 13. Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., Greenspan, H.: GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification (2018). https://arxiv.org/pdf/1803.01229.pdf 14. Xiao, C., Li, B., Zhu, J.-Y., He, W., Liu, M., Song, D.: Generating adversarial examples with adversarial networks (2019). https://arxiv.org/pdf/1801.02610.pdf 15. Lin, T.-Y., et al.: Microsoft COCO: common objects in context (2015). https://arxiv.org/pdf/1405.0312.pdf 16. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection (2018). https://arxiv.org/pdf/1708.02002.pdf 17. Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection (2017). https://arxiv.org/pdf/1612.03144.pdf 18. Hu, P., Ramanan, D.: Finding tiny faces (2017). https://arxiv.org/pdf/1612.04402.pdf 19. Lim, J.-S., Astrid, M., Yoon, H.-J., Lee, S.-I.: Small object detection using context and attention (2019). https://arxiv.org/pdf/1912.06319.pdf 20. Hu, G., Yang, Z., Hu, L., Huang, L., Han, J.: Small object detection with multiscale features. Int. J. Digit. Multimedia Broadcast. 1–10 (2018) 21. Kisantal, M., Wojna, Z., Murawski, J., Naruniec, J., Cho, K.: Augmentation for small object detection (2019). https://arxiv.org/pdf/1902.07296.pdf 22. Chen, Y., Li, J., Niu, Y., He, J.: Small object detection networks based on classification-oriented super-resolution GAN for UAV aerial imagery. In: Chinese Control and Decision Conference (CCDC), pp. 4610–4615 (2019) 23. Jiang, W., Ying, N.: Improve object detection by data enhancement based on generative adversarial nets. https://arxiv.org/pdf/1903.01716.pdf 24. Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: SOD-MTGAN: small object detection via multi-task generative adversarial network (2018) 25. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia (2013) 26. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014) 27. Goodfellow, I., et al.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014) 28. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks (2016)
Factors Affecting the Adoption of Information Technology in the Context of Moroccan SMEs
Yassine Zouhair(B), Mustapha Belaissaoui, and Younous El Mrini
SIAD Laboratory, Hassan First University of Settat, Settat, Morocco
{y.zouhair,mustapha.belaissaoui,younous.elmrini}@uhp.ac.ma
Abstract. Today, information technology (IT) has become the foundation for streamlining processes, optimizing costs and organizing information, transforming traditional business models into dynamic enterprises that make their operations profitable through technology. The accelerated adoption of IT as a service solutions is one of the fastest growing trends among businesses of all sizes. Despite the great benefits they are expected to bring to businesses, the level of IT usage in Moroccan small and medium-sized enterprises (SME) remains very low. This article attempts to identify the factors that influence the adoption of IT in the context of Moroccan SMEs. We applied a qualitative research approach using the interview technique. The population of this study includes team leaders, users, internal IT managers, external consultants and individuals who are familiar with the Moroccan SME sector with different levels of education and experience. The research results indicate that several determinants affect the willingness of SMEs to adopt IT. These determinants are of two types: internal determinants, which refer to organizational and individual factors, also referred to as factors specific to the organization; and external determinants, which include factors specific to the technology, and factors related to the environment in which the organization operates. Keywords: IT · Adoption · Challenges · Integrated IS · Moroccan SMEs
1 Introduction
Today's company is faced with a constantly changing economic context, strong economic pressures and strategic decisions critical to its survival:
• Increasingly fierce competition.
• Rapid commercial reactivity.
• Need to innovate.
• Increasing overhead costs.
• Cost reduction requirements to remain competitive.
• Obligation to follow new technologies.
Today, IT has become essential to the activity of a company, and this in an economic context that is constantly changing [1]. The application of IT can be considered as the use of an information system (IS) to control information at all levels of the company [1]. The IS is closely linked to all functions and facets of the company and is becoming essential. Without it, it becomes difficult for a company to compete with its competitors, to manage administrative constraints and to be informed of what is happening in the economic markets [1]. The IS plays an essential role in supporting organizational processes; it is considered a "nervous system" [2, 3]. The IS represents the infrastructure in terms of IT, data, applications and personnel; it links the different elements to provide a complete solution [4]. As such, it includes a large number of functions such as data collection and transmission. Nowadays, Moroccan companies, including SMEs, are faced with ever-changing market requirements. In order to best serve the organization, it is important to achieve a coherent and agile IS that integrates the new needs of the company. However, while the IS plays a central role in the activity and life of the company and can contribute to rationalization and growth, it can also be the cause of chaotic operation. Guariglia et al. explain that SMEs reduce their level of investment in technology development because of limited financial resources and the difficulty of obtaining external funding [5]. Parasuraman measures technology use readiness with the Technology Readiness Index (TRI), which has four dimensions: optimism, innovation, discomfort, and insecurity [6]. Many studies have examined the readiness to use information technology, but none have focused on the factors that influence the readiness of SMEs to adopt technology. This paper attempts to identify the factors that influence the adoption of IT in the context of Moroccan SMEs.
2 Research Question
The diffusion of IT within developing countries can be an effective lever for economic and social development. IT is both a good and a service that allows a wide diffusion of knowledge and know-how; it is also an investment good that increases the microeconomic performance of companies by raising productivity, and it constitutes an industry that can significantly contribute to improving the macroeconomic performance of nations. IT plays today a preponderant role in the development of Moroccan SMEs by supporting one or several functions within the organization. The purpose of this research is to identify the factors that influence the adoption of IT in the context of Moroccan SMEs. The research addresses the following question: What are the factors that influence the adoption of IT in the context of Moroccan SMEs?
3 Literature Review
3.1 Definition of SME
The size criterion is most often taken into consideration in the process of defining an SME, notwithstanding the diversity of approaches that have attempted to define it. In fact, each country has a distinct way of defining the SME, which is usually based on the number of employees. According to the "White Paper on SMEs", produced by the Ministry of the Prime Minister in charge of General Government Affairs (1999), it is not easy to define the term SME. In Morocco, the SME has a wide variety of definitions; in fact, there are several definitions depending on the elements taken into consideration. To qualify as an SME, existing companies must have a workforce under 200 permanent employees, have an annual turnover excluding tax under 75 million Moroccan Dirham (MAD), and/or a balance sheet total under MAD 50 million. However, the definition of the SME elaborated by the ANPME takes into account only the criterion of turnover and disregards the number of workers of the company. According to this last definition, three types of companies are distinguished:
• The very small company: less than 3 million MAD
• The small company: between 3 and 10 million MAD
• The medium enterprise: between 10 and 175 million MAD
3.2 Adoption of Information Technology
A study was done by Parasuraman [6], who defines technological readiness as "the propensity of people to adopt and use new technologies to achieve goals in private life and in the workplace". The TRI is an index that was set by Parasuraman [6] to measure a person's ideas and beliefs about the application of technology. A person's thinking about the application of technology can be positive, i.e. speaking optimistically about the application of technology and having the will to use new technologies, or negative, i.e. feeling uncomfortable towards technology. There are four dimensions to technology readiness:
• Optimism: refers to a positive view of technology as well as the perceived benefits of using technology.
• Innovation: refers to the degree to which the person enjoys experimenting with the use of technology and being among the first to try the latest technology services or products.
• Discomfort: refers to the lack of mastery of technology as well as a lack of confidence in the use of technology.
• Insecurity: refers to distrust of technology-based transactions.
Optimism and innovation can help increase technological readiness; they are "contributors", while discomfort and insecurity are "inhibitors" that can suppress the level of technology readiness [7]. According to Parasuraman et al. [7], technology readiness is a tool for measuring thoughts or perceptions about technology use and not a measure of a person's ability to use technology. Depending on the level of technology readiness, users are classified into five segments. Explorers have the highest score in the 'contributor' dimensions (optimism and innovation) and a low score in the 'inhibitor' dimensions (discomfort and insecurity); they are easily attracted to new technologies and generally become the first group to try new technologies [8]. Laggards have the lowest score in the 'contributor' dimensions and the highest score in the 'inhibitor' dimensions; they are the last group to adopt a new technology [8]. The others (the pioneers, the skeptics and the paranoids) have a very complex perception of technology [8]. Pioneers have a high level of optimism and innovation like explorers, but they can easily stop using technology if they experience discomfort and insecurity [8]. Skeptics are not very motivated by the use of technology but have a low level of the 'inhibitor' dimensions; they need to be informed of and convinced by the benefits of the technology [8]. Technology is quite interesting for paranoids, but they always take the risk factor into account, which gives them a high level of the 'inhibitor' dimensions (discomfort and insecurity) [8]. Lucchetti et al. [9] studied several cases of Italian SMEs; they noted that the adoption and use of information and communication technologies (ICT) was differentiated, depending on the nature and internal funds of the company, on the one hand, and the technological skills used in the company, on the other. According to Harindranath et al. [10], the fear of limited use of technology and the need to update technology frequently is one of the concerns of UK SMEs. According to Nugroho et al. [11], customer pressure, need, capital, urgency, and ease of use are the factors that influence SMEs in Yogyakarta to adopt IT. Research has shown that the relative advantage of these technologies, competitive pressure, and management support are significant predictors of IT adoption [12]. According to Beatty et al. [13] and Dong [14], leaders and their commitment play a crucial role in the IT adoption decision. Thong et al. [15] state that the CEO's innovativeness, the CEO's IT knowledge, the CEO's attitude to adopting IT, company size, information intensity and the competitive environment are the factors that influence SMEs to adopt IT. According to Khalil et al. [16], organizational structure, technology strategy, human organization and the external environment affect the intention to adopt IT. Grandon et al. [17] and Tung et al. [18] found that social influence and external pressure are significantly related to IT adoption.
The results of Naushad et al. [19] indicate that SMEs adopt IT to gain an advantage over their competitors. Value creation, productivity, ease of use and affordability are the top reasons. Top management support and cost-effectiveness are other essential factors that influence IT adoption. Other pertinent factors include personal characteristics and technology self-efficacy. Shaikh et al. [20] affirm that high infrastructure costs, data security, cost of training, lower efficiency and technical skills, lower government support and lack of support from the organization are factors that influence the adoption of IT. Kossaï et al. [21] show that firm size, export and import intensity, and business human capital are the most significant factors in the adoption of IT. In our study, technological readiness can be defined as the preparation to improve the quality of the currently used technology, in other words to update the level of the technology.
4 Research Methodology
This paper attempts to identify the factors that influence the adoption of IT in the context of Moroccan SMEs. We applied a qualitative research approach using the interview technique. This research used the design of several case studies to describe the phenomenon [22]. According to Yin [22, 23], the case study method can be used to explain, identify, or study events or phenomena in their real-life context. In addition, "case studies are particularly recommended when dealing with new and complex areas, where theoretical developments are weak and the recovery of the context is crucial for the development of the understanding process" [22, 23]. The use of the case study method is relevant when the study has to answer research questions such as "what", "how" and "why" [22, 23]. This fits exactly with our problem. In this research, we used the snowball sampling technique while the results of the interviews are analyzed using a descriptive statistical approach. Prior to the interview session, informants were given sample questions to discuss during the interview sessions.
Participants
The population of this study includes team leaders, users, internal IT managers, external consultants, and people with knowledge of the Moroccan SME sector with different levels of education and experience. The idea is to collect data from organizations operating in the SME sector in order to identify factors that influence the adoption of IT in the Moroccan SME context. All case organizations operate in the private sector in Morocco. The organizations are labeled Company A, Company B, Company C and Company D. Table 1 provides an overview of the cases studied.
Table 1. Description of the case studies

| | Company A | Company B | Company C | Company D |
|---|---|---|---|---|
| Nature and industry | Consulting | Transport & Logistics | Audit | Service |
| Number of employees | 20 | 90 | 24 | 41 |
| Number of interviews | 6 | 9 | 5 | 7 |
| Interview participants | Managers, team leaders, users and internal IT managers | Managers, team leaders, users, external consultants and internal IT managers | Managers, team leaders, users and internal IT managers | Managers, team leaders, users and internal IT managers |
Procedure
We applied a qualitative research approach using the interview technique. The data were collected through personal interviews; a total of 27 interviews were conducted in the four organizations. Interviews were conducted with managers, team leaders, users, external consultants and internal IT managers in order to collect the different perspectives that influence the adoption of IT in the context of Moroccan SMEs. The duration of the interviews was between 30 and 60 min. E-mails and telephone calls were used to clarify some questions. In the first section of the interview, informants were asked to provide demographic and company information. In the second phase, we asked questions based on Parasuraman's instruments [6], for example:
• How does the IS contribute to the company's sustainability?
• Do you have unpleasant experiences in using the IS?
• How does the IS affect your work?
• Why do you want to apply IS in your company? Why don't you want to apply IS in your company?
• Describe your reasons to use or not use IS in your company.
• Tell me about your experience interacting with IS.
• When choosing an IS, how will you select the system that you believe is best for your company?
• How does the environment affect your decision to use IS?
• Is the company still supporting the advancement of the use of technology?
• Has the progress of the system already brought benefits to your company? For what kind of condition and purpose is it intended?
5 Findings and Results

The frequency of technology adoption varies from organization to organization as the result of a panoply of organizational, individual, technological, and environmental factors, which are directly or indirectly correlated with the decision to adopt or not to adopt. The determinants discussed in this research are the results of the informant interviews.

5.1 Organizational Factors
Organizational factors are primarily related to the structure and strategy of the organization. The first factor is organizational strategy: informants state that the level of communication between departments within an organization is positively correlated with the commitment of departments in the process of adopting innovations, and that the establishment of a project management team responsible for the implementation and integration of these technologies significantly increases the probability of successful adoption. The second factor is the size of the company, which is presented as a good predictor of technology adoption; informants say that as the size of the company increases, IT will be needed. The informants also pointed out the importance, in the decision-making process for the adoption of new technologies, of tangible resources of a human and material nature, as well as intangible resources, notably the stock of knowledge at the enterprise level. Material resources are critical for innovation, development and new technology acquisition projects. For their part, human resources and intangible resources, especially the stock of knowledge available to firms, affect their propensity to adopt new technologies through their impact on the capacity to absorb the knowledge incorporated into these new technologies. Employee competence is among the factors for IT adoption mentioned by informants: the more qualified the staff, the more likely they are to seek out new technologies. Staff skills refer to the competence of employees, their level of experience, and their versatility. According to the informants, technological compatibility positively influences the adoption decision. This technological compatibility is composed of two dimensions: a high degree of confidence that the new technology is compatible with the company's current operations and practices, and the company's belief in the availability of the necessary resources for the implementation and integration of the technology and, above all, its conviction that it has the competent human resources to succeed in the processes of adoption and integration of the new technology. Informants mentioned that the vision of the organization's senior management and their knowledge of the innovation or technology to be adopted would also be significant determinants of whether or not to adopt that innovation or technology. Finally, some informants identified the timing of a technology's adoption as a determinant of whether or not it is adopted.
5.2 Individual Factors
The second category of factors is individual factors. Informants mentioned that the perceived usefulness of a technology significantly and positively influences the decision to adopt it: when the potential user of the technology perceives that its use will increase production while maintaining quality, decrease the unit production cost, and make the company more competitive, the likelihood of its adoption will increase. Some informants argued that individuals in an organization, regardless of its size, are decisive in the decision to adopt a new technology, as adoption is directly dependent on their skills, knowledge, and ability to foster successful implementation of the technology. On the other hand, other informants identified user resistance to technology as a barrier to adoption.

5.3 Technological Factors
This category refers to factors external to the company. These are essentially non-controllable factors directly related to the technology to be adopted, such as the attributes of the technology, its maturity and its characteristics. These characteristics include the perceived compatibility of the technology to be adopted with existing technologies, its complexity, and the perceived net benefit of its adoption. Informants identified the acquisition and integration costs of a new technology as important and often decisive barriers to adoption by the company. The complexity of the technology to be adopted could also be a barrier to adoption; since the potential users of this technology will be the employees, we capture this complexity by the intensity of the barriers experienced by the informants. A final factor in this category is associated with the uncertainty related to the evolution of the technology within the company.

5.4 Environmental Factors
Informants mentioned that the extent of the SME's social network, including with suppliers, customers, and research institutions, can influence its decision whether to adopt a new technology. These networks help build trust and social capital between the firm and its key partners. They can thus reduce transaction costs and establish reliable and efficient communication with the members of the network. This creates a climate that is conducive to innovation and the successful integration of new technologies. All informants expressed the same idea that the use of IT can help in the marketing of products. Although it is seen as an aid to marketing, some informants believe that customer satisfaction can also be obtained through the adoption of IT because it allows for interaction between the potential buyer and the products. Table 2 summarizes the various factors that influence the adoption of IT in the context of Moroccan SMEs.
Table 2. IT adoption factors

| Category | Factors |
|---|---|
| Organizational | Organizational strategy; the size of the company; human and material resources; the stock of knowledge at the company level; the competence of the employees; technological compatibility; the vision of the organization's senior management; the timing of adoption |
| Individual | Perceived usefulness of the technology; individual skills, knowledge, and abilities; resistance to change |
| Technological | Technology characteristics: relative advantage, perceived compatibility and complexity; acquisition costs and integration costs; technology maturity; perceived net benefit of the technology; the level of uncertainty associated with the technology |
| Environmental | The company's social network and its role as a source of information (customers, competitors, suppliers, research institutions, etc.); knowledge sharing with suppliers and customers; reliable and effective communication with members of its network |
6 Conclusion and Future Works

The purpose of this study was to identify the factors that influence the adoption of IT in the context of Moroccan SMEs. We applied a qualitative research approach using the interview technique. This research used the design of several case studies to describe the phenomenon. The frequency of technology adoption varies from one organization to another and is the result of a range of organizational, individual, technological, and environmental factors that are directly or indirectly correlated with the decision to adopt or not to adopt. The determinants discussed in this research are the results of the interviews conducted with the informants. These determinants are of two types: internal determinants, which refer to organizational and individual factors, also referred to as organization-specific (controllable) factors; and external determinants, which include technology-specific factors (technological factors) and factors related to the environment in which the organization operates (environmental factors).

This study has several limitations, namely:
• The difficulty of extending the scope to other Moroccan SMEs
• The confidentiality of the companies
The study provides several future research themes: critical success factors for IT implementation projects in the context of Moroccan SMEs, extending this study to other Moroccan SMEs, and confirming the results of this qualitative research with quantitative research.
References
1. Carpentier, J.-F.: La gouvernance du Système d'Information dans les PME: Pratiques et évolutions. Editions ENI (2017)
2. St-Hilaire, F.: Les problèmes de communication en entreprise: information ou relation? Université Laval, Diss (2005)
3. Millet, P.-A.: Une étude de l'intégration organisationnelle et informationnelle. Application aux systèmes d'informations de type ERP. Diss. INSA de Lyon (2008)
4. Hammami, I., Trabelsi, L.: Les Green IT au service de l'urbanisation des systèmes d'information pour une démarche écologiquement responsable. No. hal-02103494 (2013)
5. Guariglia, A., Liu, X., Song, L.: Internal finance and growth: microeconometric evidence on Chinese firms. J. Dev. Econ. 96(1), 79–94 (2011)
6. Parasuraman, A.: Technology Readiness Index (TRI): a multiple-item scale to measure readiness to embrace new technologies. J. Serv. Res. 2(4), 307–320 (2000)
7. Parasuraman, A., Colby, C.L.: Techno-Ready Marketing: How and Why Your Customers Adopt Technology. Free Press, New York (2001)
8. Demirci, A.E., Ersoy, N.F.: Technology readiness for innovative high-tech products: how consumers perceive and adopt new technologies. Bus. Rev. 11(1), 302–308 (2008)
9. Lucchetti, R., Sterlacchini, A.: The adoption of ICT among SMEs: evidence from an Italian survey. Small Bus. Econ. 23(2), 151–168 (2004)
10. Harindranath, G., Dyerson, R., Barnes, D.: ICT adoption and use in UK SMEs: a failure of initiatives? Electron. J. Inf. Syst. Eval. 11(2), 91–96 (2008)
11. Nugroho, M.A., et al.: Exploratory study of SMEs technology adoption readiness factors. Procedia Comput. Sci. 124, 329–336 (2017)
12. Ifinedo, P.: An empirical analysis of factors influencing Internet/e-business technologies adoption by SMEs in Canada. Int. J. Inf. Technol. Decis. Mak. 10(04), 731–766 (2011)
13. Beatty, R.C., Shim, J.P., Jones, M.C.: Factors influencing corporate web site adoption: a time-based assessment. Inf. Manag. 38(6), 337–354 (2001)
14. Dong, L.: Modelling top management influence on ES implementation. Bus. Process Manag. 7, 243–250 (2001)
15. Thong, J.Y.L., Yap, C.S.: CEO characteristics, organizational characteristics and information technology adoption in small businesses. Omega 23(4), 429–442 (1995)
16. Khalil, T.M.: Management of Technology: The Key to Competitiveness and Wealth Creation. McGraw-Hill Science, Engineering & Mathematics (2000)
17. Grandon, E.E., Pearson, J.M.: Electronic commerce adoption: an empirical study of small and medium US businesses. Inf. Manag. 42(1), 197–216 (2004)
18. Tung, L.L., Rieck, O.: Adoption of electronic government services among business organizations in Singapore. J. Strat. Inf. Syst. 14(4), 417–440 (2005)
19. Naushad, M., Sulphey, M.M.: Prioritizing technology adoption dynamics among SMEs. TEM J. 9(3), 983 (2020)
20. Shaikh, A.A., et al.: A two-decade literature review on challenges faced by SMEs in technology adoption. Acad. Mark. Stud. J. 25(3) (2021)
21. Kossaï, M., de Souza, M.L.L., Zaied, Y.B., Nguyen, P.: Determinants of the adoption of information and communication technologies (ICTs): the case of the Tunisian electrical and electronics sector. J. Knowl. Econ. 11(3), 845–864 (2019). https://doi.org/10.1007/s13132-018-0573-6
22. Yin, R.K.: Design and methods. Case Study Res. 3(92), 1–9 (2003)
23. Yin, R.K.: Case Study Research: Design and Methods, vol. 5. Sage, Thousand Oaks (2009)
Aspects of the Central and Decentral Production Parameter Space, its Meta-Order and Industrial Application Simulation Example

Bernhard Heiden1,2(B), Ronja Krimm1, Bianca Tonino-Heiden2, and Volodymyr Alieksieiev3

1 Carinthia University of Applied Sciences, 9524 Villach, Austria
[email protected]
2 University of Graz, 8010 Graz, Austria
3 Faculty of Mechanical Engineering, Leibniz University Hannover, An der Universität 1, Garbsen, Germany
http://www.cuas.at
Abstract. In this paper, after giving an overview of this research field, we investigate the central-decentral parameter space using a theoretical approach, including a cybernetic meta-order with respect to system-theoretic concepts. For this, we introduce an axiom system with four axioms describing production systems in general: the modeling axiom and three axioms for system states, namely the attractor, bottleneck, and diversity theorems, all describing complex ordered systems. We then make a numerical investigation of central and decentral production in conjunction with a practical industrial application example and compare it to previous simulation results. As a result, we recommend simulating production with respect to these two possibilities, central and decentral, in the production control parameter space, together with accompanying additional production parameters, from the customer and the product quantity, in order to produce case-specifically optimally.

Keywords: Orgiton theory · Graph theory · Witness · Central · Decentral · Production · Manufacturing · Logistics · System theory · Cybernetics · Decentral control · Heterarchical systems · Additive manufacturing · AM
1 Introduction
In production, the question arises of what constitutes optimal production in general. Future trends like Industry 4.0 (see, e.g. [1]) tend to flexibilize production and include more and more sophisticated automation. This new trend of automation, and of the automation of automation, can be regarded as an approximation of what we understand today as Artificial Intelligence (AI). In this context, production control becomes increasingly important, as, due to flexibility arrangements and possibilities and the need for production efficiency, new strategies possibly have
to be implemented that were not technologically feasible up to now, as, e.g., the arising paradigm of Additive Manufacturing (AM) (see, e.g. [8]) changes production in principle. This is closely linked to the newly arising osmotic paradigm, which is inclined towards decentrality, as it increases ecological properties such as resilient and overall efficient production, which are generally known to emerge by means of diversity, autonomy and heterarchy (cf. e.g. [13,26]). In this context it can be seen that production technologies like AM and Computer Numerical Controlled (CNC) machines in general allow for emergent features like flexible, personalized and decentral production as a consequence of their basic process structure. The hypothesis is therefore that new production methods are more apt to fulfill current requirements like the diversity of products or the diversity of customer buying decisions, to mention only two. The essential point in personalization and flexibilization is whether it makes sense from the efficiency point of view in a modern production environment and how to achieve this with central or decentral strategies. Up to now, the central paradigm was beyond question, but as it gets to its limit, we have to ask where, when and how this makes sense under specific points of view.

The problem of central-decentral control is well known in general and is the only classification type for production and production control that we will focus on here. It relates to general properties that we can find in networks (e.g., [30]) or graph theory (see, e.g., [4]), where a lot of research has been done. For example, according to Watts and Strogatz, some networks have the now famous "small world" property, which means that already very sparse networks are effectively cross-linked, similar to a dense net. Graph-theoretical approaches, on the other side, allow for applying algorithms; still, depending on their computational implementation, they may not be so easy to argue with or to understand. Therefore, our strategy is to find general argumentation strategies that we then test in more detail for their validity by simulation models in this work. Other recent approaches for the arising decentral paradigm are the holonic manufacturing approach, heterarchical systems, distributed control or multi-agent systems (cf. also [27]), which can, for example, be implemented straightforwardly in the program Anylogic1. The emergent necessity for this kind of often neglected research is that it becomes necessary in an environment of more and more dense nets of material and informational processes. The very reason is that, in this particular case of a complex system, and generally in nonlinear systems, new and unpredicted system behavior is to be expected (cf. e.g. [14]) and has to be investigated to make a difference in the efficiency of such production processes.

Content. In this paper, after presenting the research goal, question, method and limitations, we give in Sect. 2 the meta-description of the central-decentral production parameter space. In Sect. 3 we introduce the used Witness model and discuss it there. In Sect. 4 we finally conclude with open and promising research questions in the field.
https://www.anylogic.de/.
Goal and Research Question. The goal of this paper is to further investigate the properties of production in the parameter space of central and decentral production. The research question is: Which parameters influence production, focusing on the bi-valued discrimination in the parameter space of central-decentral production, and which control method shall be preferred under which conditions?

Research Method. This research paper analyzes the research question and gives in Sect. 2 a natural-language axiomatization for the research focus and thesis. Section 3 explores the topic with a Witness simulation example of an industrial production problem and relates it to the given axiomatization. By this we use a modern cybernetic knowledge process approach (see e.g. [25,29]).

Limitations. The limitation of this work is that only a simulation model of limited complexity is computed, and that the problem is restricted, theoretically and practically, to only a few parameters in production. Hence it might be that other parameters could also be necessary. Another significant limitation is the bi-valued discrimination of central-decentral models, which could be overcome in future research by more sophisticated models and/or by a transition model for a seamless transition in the central-decentral parameter space. For this, in particular, a complex variable intermediating between the two extremes would have to be defined. An interesting and promising approach in this regard is found in [6], where a typology of transitions in the central-decentral parameter space is given in the form of specific networks, which we are planning to investigate in the future.
2 Metadescriptive Approach in Orgitonal Terms of the Central-Decentral Problem
When we look at the central-decentral parameter space, we can formulate the following meta-descriptive system properties, which we will test and confirm later in more detail. As the first Axiom we formulate the modeling theorem, which allows for an overall modeling of the production process by a digital twin:

Axiom 1. The production can be divided into an information and a material production line, which are (a) decoupled with respect to time/room and (b) structurally coupled to each other and to the environment.

According to orgiton and systems theory (see also [12,28]), the information process is of potentially higher order and structurally coupled with the material process, which we can also call the attractor theorem:

Axiom 2. Information and material processes have an attractor according to their order and can hence dominate or limit the process.
Therefore, the dominating process limits efficiency. For a further explanation of the attractors related to the growth process see also our recent investigation of growth processes [11]; this can be understood as the bottleneck theorem:

Axiom 3. Production can be regarded as a growth process under limitations. Optimal production is at the overall dominating limit.

How diversity creates approximate optimality can be expressed by the following diversity theorem:

Axiom 4. The central-decentral parameter space in production integrates production parameters into overall production efficiency.

This Axiom relates to, or constitutes, the hypothesis that in the lockstepping region of a growth process an order can emerge through an arising osmotic diversity. Hence, combinatorial arrangements allow for continuing or autopoietic evolution processes, due to, e.g., the property of self-similarity.
3 Model in Witness and Comparison to Previous Approach
The following use case serves as a concrete example that demonstrates, in a practical context, the problem covered in [18]. One factor that changes production systems more than others is the technology with which the products are manufactured. One technological advancement of great impact is Additive Manufacturing (AM), also known as 3D printing. It is not only a different approach to how materials are processed, compared to conventional manufacturing, but it also makes a different structuring of production systems possible. Therefore, the manufacturing of electric motor housings with AM and with traditional processes will be used as an example to simulate the different production and production control systems. In the next subsections, a short summary of both manufacturing processes will be given, and it will be explained how they relate to the production of the electric motor housing. The findings will then be used to create the simulation of the use case in the production simulation software Witness [19].
3.1 Traditional and Additive Manufacturing (AM)
In short, as its name implies, AM adds layer upon layer of a certain material in a desired geometric shape in order to "grow" a three-dimensional object that was defined beforehand by Computer Aided Design (CAD) software or 3D object scanners (see, e.g. [8] and [2]). There are various AM processes, including powder bed fusion, binder jetting, direct energy deposition and material extrusion, and not every technology can process every material. Materials that AM technologies can generally process are metals, thermoplastics, and ceramics [7]. Traditional
manufacturing processes, by contrast, are solidification processes like casting or molding, deformation processes like forging or pressing, subtractive processes like milling, drilling, or machining and last but not least, joining processes like welding, brazing, and soldering [22]. Some of these techniques are hundreds of years old but are still used in an adapted and optimized way.
3.2 Electric Motor Housing as Application Example for the Case Study
Kampker [16] describes the classical production of an electric motor housing in great detail, as he counts it as one of the essential parts of the motor because it protects its active machine parts. Three steps are needed to manufacture the housing. First, the form is cast, where different methods are available, depending primarily on the number of units that must be produced. The second process step is machining, like deburring, milling, or drilling, to ensure that all the form and surface requirements are met, and the last step is cleaning the workpiece to ensure a smooth assembly. One problem with motors, independent of their type, combustion engine or electric motor, is that they have to be cooled in order to work correctly. There are several solutions for heat removal for electric motors. Traditionally an electric motor has a liquid cooling system, meaning that a liquid is pumped through assorted hoses and its heat is rejected through a radiator [15]. This system is effective, but it adds difficulties to the design and production of the housing because cooling lines have to be added, resulting in a complex manufacturing process involving not just the housing but several parts combined into one module. The German company PARARE GmbH is specialized in selective metal laser melting and has developed, together with the Karlsruhe Institute for Technology, an electric motor housing with an integrated cooling channel as depicted in [24]. Due to the freedom of geometry, AM opens up possibilities to reduce the number of parts in a module and, by this, also assembly costs. However, a disadvantage of AM is that simple components or standard parts are already "cheap" to manufacture, and they do not become cheaper through AM. The highest potential of AM lies hence in the performance enhancement of parts and in lightweight construction, especially concerning metals, small-scale production and individualization. The reason is that production cost and time stay the same whether the same part is produced n times or n different parts are produced. The design effort increases, of course, if a product is optimized for the requirements of a customer [23]. The production of an electric motor housing with a liquid cooling system will serve as the use case in order to simulate a central and a decentral production control system, where the decentral production system works with AM and the central production system works with the traditional manufacturing of the said product.
3.3 Simulation of the Case Study in Witness
The first simulation in Fig. 1 (a) shows the central production control where the parts are pushed through the manufacturing process. As described above, the manufacturing of the housing starts with the first step, casting; the second step is machining, followed by cleaning. Finally, before the housing is put on stock, it is assembled with the cooling pipes. In order to keep the simulation as compact as possible, the manufacturing of the other module parts is not shown, but their process time is integrated at the process step "Assembly".
(Figure 1 diagram: (a) central production control with Central Instances 1–5 coordinating the material flow Casting → Machining → Cleaning → Assembly → Stock → Ship, including scrap and customer-order information flows; (b) decentral production control with the flow Preparing-printer → Printing → Post-processing → Inventory → Ship, driven directly by customer-order information.)
Fig. 1. Central (a) and decentral (b) production control system as Witness model [18]. The chain-dotted line indicates the system border of information and material processes. The up and down arrows ↑↓ indicate the meta-information exchange between material and informational processes. The left and right arrows ← → indicate the material and informational flows from and towards the environment with regard to their direction towards or from the border line, which delimits the cases (a) and (b). The model can be regarded as an implementation according to the overall process structure given in Axiom 1.
The second simulation in Fig. 1 (b) shows the decentral production control system where the parts are pulled through the manufacturing process, which consists of the following steps: first, the printer and the printing need to be prepared; the second step is the printing itself; and the last step is the post-processing, after which the part is put into the inventory and eventually shipped to the customer.
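For readers without access to Witness, the difference between the two control logics can be sketched in a few lines of Python using the SimPy discrete-event library (this is a hypothetical illustration of push vs. pull order release, not the Witness model itself; station names follow Fig. 1, and because stations may overlap in such an event-driven model, its outputs are not directly comparable to the strictly serial time estimates below):

```python
# Minimal push vs. pull sketch (assumption: SimPy is installed; times are illustrative).
import simpy

STEPS = [("casting", 1), ("machining", 15), ("cleaning", 1), ("assembly", 60)]  # minutes

def flow(env, machines):
    """Material process of one housing: visit every station in sequence."""
    for name, minutes in STEPS:
        with machines[name].request() as req:
            yield req
            yield env.timeout(minutes)

def run_push(n_parts):
    """Central control: the whole batch is released (pushed) into the line at time 0."""
    env = simpy.Environment()
    machines = {name: simpy.Resource(env, capacity=1) for name, _ in STEPS}
    for _ in range(n_parts):
        env.process(flow(env, machines))
    env.run()
    return env.now  # makespan of the batch

def run_pull(order_times):
    """Decentral control: a part is only started when a customer order arrives."""
    env = simpy.Environment()
    machines = {name: simpy.Resource(env, capacity=1) for name, _ in STEPS}
    def orders(env):
        previous = 0
        for t in order_times:            # sorted arrival times of customer orders
            yield env.timeout(t - previous)
            previous = t
            env.process(flow(env, machines))
    env.process(orders(env))
    env.run()
    return env.now

if __name__ == "__main__":
    print("push makespan:", run_push(10), "min")
    print("pull makespan:", run_pull([0, 120, 240, 360]), "min")
```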
In order to show the differences between the production methods and control systems, two scenarios are calculated: one where a customer orders ten housings of a known type, and another one where a customer orders two parts that are slightly differing prototypes.

Scenario 1. The die casting process is a very fast production method, as the metal, in this case probably an aluminum alloy, is injected into the mold within fractions of a second and, depending on the wall thickness of the casting, cools in up to seven seconds. The longest part of the casting time is the die retention time with up to 30 s [5]. Under the condition that the molds already exist for the desired product, this manufacturing method is highly productive. Furthermore, it is assumed that only one mold for this product is used. The process steps, in this case, are (1) Preparation (only once for all parts), (2) Casting, (3) Machining, (4) Cleaning and (5) Assembly:

Production time per part = (1) + (2) + (3) + (4) + (5)
    = 60 [min] / 10 + 1 [min] + 15 [min] + 1 [min] + 60 [min]
    = 83 [min]                                                        (1)
In order to calculate the total production time, the times of each process step are added, which results in 83 min for one part to be finished if the preparation time is evenly distributed over all parts. According to this model, it would take 830 min or nearly 14 h to produce 10 electric motor housings. AM, on the other hand, needs more time for production. The company Parare uses a printer which works with several lasers and can print four housings simultaneously. In order to print 10 housings, the printer needs to be prepared three times, and each part needs 10 min of post-processing. Printing four parts simultaneously takes 1440 min or 24 h. However, printing the remaining two parts after printing two times four housings should still take two thirds of the time, meaning 960 min. The process steps (1) Preparing the printer, (2a) Printing the first eight parts, (2b) Printing the last two parts simultaneously and (3) Post-processing then add up to:

Total production time = (1) + (2a) + (2b) + (3)
    = 3 · 5 [min] + 2 · 1440 [min] + 960 [min] + 10 · 10 [min]
    = 3955 [min] ≈ 66 [h]                                             (2)

The total production time adds up to 66 h, meaning that it would take up to three days to print all 10 electric motor housings.
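The arithmetic of Eqs. (1) and (2) can be reproduced with a short plain-Python sketch (only the step times stated above are used; this is not part of the Witness model):

```python
# Scenario 1: 10 housings of a known type; all times in minutes.
N_PARTS = 10

# Central route (die casting): the one-off mold preparation is amortized over the batch.
prep, casting, machining, cleaning, assembly = 60, 1, 15, 1, 60
central_per_part = prep / N_PARTS + casting + machining + cleaning + assembly  # Eq. (1): 83 min
central_total = N_PARTS * central_per_part                                      # 830 min ≈ 14 h

# Decentral route (AM): 4 housings per full print job, a shorter final job for the
# remaining 2 parts (assumed 2/3 of a full job, as stated above), 10 min post-processing each.
printer_prep, full_job, partial_job, post = 5, 1440, 960, 10
decentral_total = 3 * printer_prep + 2 * full_job + partial_job + N_PARTS * post  # Eq. (2): 3955 min

print(central_per_part, central_total, decentral_total)  # 83.0 830.0 3955
```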
Scenario 2. In this scenario, we look at how long it takes each production method to produce two slightly different and completely new designs of an electric motor housing. Each design is to be produced only once. Die casting is a procedure that is only used for producing high quantities, as a lot of effort and investment is needed to create the new dies. For a scenario like this, sand casting is a better option because it is cheaper and faster. Moreover, for sand casting, a new sand mold is created for each part [3]. With two separate designs that need to be cast, the total production time adds up according to the process steps (1) Preparation (of one sand mold), (2) Casting, (3) Machining, (4) Cleaning and (5) Assembly to:

Total production time = (1) + (2) + (3) + (4) + (5)
    = 2 · 2880 [min] + 2 · 5 [min] + 2 · 20 [min] + 2 · 1 [min] + 2 · 60 [min]
    = 5932 [min] ≈ 100 [h]                                            (3)

Using sand casting, it would take over four days to produce the prototypes. AM is known for its potential and proven performance in rapid prototyping at a high quality level. The following times can be assumed for the process steps (1) 3D modeling (of both parts), (2) Preparing the printer, (3) Printing two parts simultaneously, and (4) Post-processing (for each part), using AM in this case, and the total production time adds up, in the same order, to:

Total production time = (1) + (2) + (3) + (4)
    = 360 [min] + 5 [min] + 960 [min] + 2 · 10 [min]
    = 1345 [min] ≈ 22.5 [h]                                           (4)

This technology makes it possible to manufacture two different parts in the time span of a day. Furthermore, it has to be mentioned that there is a difference in product performance that is not accounted for in the given simulation, as it does not have an inherent relationship with production time or cost in this case: the printed housing cools the motor more efficiently, it saves room because the part is smaller, and it saves weight compared to the traditionally produced housing [24].
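Analogously to the previous sketch, Eqs. (3) and (4) amount to the following (again a plain arithmetic sketch using only the assumptions stated above):

```python
# Scenario 2: two new prototype designs, one part each; all times in minutes.
# Central route (sand casting): one new sand mold is prepared per design.
central_total = 2 * (2880 + 5 + 20 + 1 + 60)     # Eq. (3): 5932 min ≈ 4 days

# Decentral route (AM): one modeling effort covering both parts, one shared print job.
decentral_total = 360 + 5 + 960 + 2 * 10         # Eq. (4): 1345 min ≈ 22.5 h

print(central_total / 60, decentral_total / 60)  # ≈ 98.9 h vs. ≈ 22.4 h
```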
3.4 Comparison to Previous Work
In order to further the discussion about the topic of central versus decentral control systems, a comparison between the results given above and results found by Knabe (see [17] and also [10]) will be given. His simulation aimed at a more abstract representation of the different control systems without implementing
realistic production processes. Another difference is that this work has not looked at one fixed period of time, like Knabe's 1440 min (one day), but at the time it takes the system to satisfy the customer demand. Therefore, there are large differences between the results found in Knabe's work compared to the results of this paper, but also significant similarities. One of the most important and also expected similarities is the complexity of the central production control model. There are constant connections between the central instance and every single process step, needed to provide it with the information required to make decisions. He has also found that the length of the communication channels and the frequency of the information exchange influence the productivity of the systems [17]. The results of this paper do not reflect this, as the productivity of the machines was not measured, but it could be done in hindsight. Again, both decentral system models are similar, as the production control is not as complex, with regard to information flow, as in a centrally organized system. Parts are only produced if they are actually demanded, which in his case results in varying productivity of the machines. The 3D printing process, in this case, is always working to capacity until all desired parts are produced. Regarding productivity, no meaningful comparison can therefore be made at this point. He concludes his thesis by saying that the demand or, more specifically, the number [#] and timing of the product orders play a significant role in the choice of the production control system [17]. Looking at the results obtained in this paper, one can add that the possible variants of a product and the "art of manufacturing" also play an important role in the said choice of the production control system. So concerning Table 1, which is ordered with regard to central and decentral, our simulation shows that central and decentral can have good or bad results, respectively, for the specific scenario case.

Table 1. Summary of the scenarios of this work
| Scenarios | Parts produced | Variants | Total production time |
|---|---|---|---|
| Scenario 1 central | 10 | 1 | 830 [min] ≈ 14 [h] |
| Scenario 1 decentral | 10 | 1 | 3955 [min] ≈ 3 [d] |
| Scenario 2 central | 2 | 2 | 5932 [min] ≈ 4 [d] |
| Scenario 2 decentral | 2 | 2 | 1345 [min] ≈ 1 [d] |
Although this case study uses only a limited variation of factors, namely (1) parts number [#], (2) product variants, and (3) a rough structural configuration for the process control in the central-decentral parameter space, it hence seems preferable, in general, to allocate production dynamically as a function of the factors used here, in order to produce resource-optimally.
3.5 Discussion
ad Axiom 1. According to Axiom 1 the system is decoupled into information and material processes. There can be a dynamic switching between information-limited and material-limited processes, and this is where different production techniques come in. AM, as in the example, is limited with regard to the processing time; this is fixed by a central processing unit, and it is hence an information-centered and information-limited material process. On the other hand, there is the possibility to produce in one automated process and to quickly change the information set-up for personalization, which is possible due to the intrinsic cybernetic process: the complexity is hidden behind the fast computation, which allows for an efficient decentralization and hence also for variation in product demand. The latter is another kind of decentralization process, here concerning the consumer demand, with the production demand dilated by time dilatation; it is hence a time-decentral process, as opposed to the commonly meant, and also in this paper, room-decentral process. Axiom 1 can serve as the modeling structure. Information and material processes are separated, and their interrelatedness can be sketched as with the arrows in Fig. 1. With regard to the Witness implementation, the simulation strategy is to construct a continuous material flow, which we have used here in the same manner for constructing the information flow as an understanding.

ad Axiom 2. When we compare Table 1 and the simulation results of this work and of Knabe, it can be concluded that number variation is only one parameter in a multivariate optimization problem for production. According to Axiom 2 this can be interpreted as the limiting bottleneck that increases when the system's flexibility is increasing. The amount of flexibility depends on parameters that increase complexity, like personalization, distribution, decentral production and others.

ad Axiom 3. With regard to Axiom 3 the optimum production can be different, both central and decentral, and the production parameters are (a) nonlinearly coupled, and there are (b) different levels of higher-order combinations of central-decentral system interaction, or respectively a multiversity of a combinatorial multi-bi-variate parameter room of the categories central and decentral (see also [9,11]). The results in the previous sections support this and lead to the recommendation to always optimize production case-specifically, e.g. with a production simulation process.

ad Axiom 4. With regard to Axiom 4 it can be said that the necessary integration can be seen by means of the simulation, as here parameters can be varied and, by this, the optimal production scenario can be determined as a function of the actual production or producer state and consumer need. Especially with regard to optimal structures, it becomes apparent that we have static and dynamic variants for optimization, and each has different advantages. In any case, the dynamic reallocation of resources, which generally corresponds to the increasing need for flexibilization of production, seems to become more favorable concerning optimal results in more dense network configurations.
4 Conclusion and Outlook
In our study, we have first formulated four axioms that guide the central-decentral parameter space with respect to orgiton and general system theory. The first, Axiom 1, is the basis for how to model a production system with regard to material and information flows. Axioms 2–4 describe general properties of production system operation. Axiom 2, the attractor theorem, states that there will be, in a given configuration, a quasi-stationary state of operation. Axiom 3 then formulates a possible switching of attractors under certain circumstances. Axiom 4 finally states possible optimal operation in diverse attractor environments, or multivariate dependent systems. The reason may be similar to that of Portfolio theory [21], due to nonlinear statistical system properties. Where in Portfolio theory the "overall system" is the market, in our theory or axiom system here it is the multivariate production system. We have then approached a small step toward how an optimal production can take place under varying parameters of (1) parts number [#], (2) product variants and (3) the parameter space of dynamic process control at its discrete ends central-decentral, through an industrial production application example. In the following discussion, we have reasoned that the given axioms, defining the overall order structure of the presented research problem, are mirrored in the simulation results and the overall applicable production strategy. We have found that a real-time and case-specific simulation is a good way to screen optimal production possibilities case-specifically. One reason for this is that (3) is a factor of overall fabric organization and that this is deeply connected to the structural arrangement of the fabric, which means the types of production devices and the specific production process sequence organizations and their types of operations. Regarding systems theory, this affects autopoiesis, or the cybernetics of processes, the process of the process, and the self-organization or the organization of the organization of structures or machines. Both specifics are peculiarities of Industry 4.0, and hence important for the future fabric. Finally, factors (1) and (2) depend deeply on society, production and human environment and product demand, or the personal dimension of the market, or the market of one person. This other end of market diversification is termed with the important principle called personalization of production, which is also an essential part of Industry 4.0 and of the further developed cybernetic future production paradigms that follow. So in the future, we will focus on the following themes, which colleagues are also invited to investigate as open questions: (a) How can we shape or discover new patterns with further simulations of these types with regard to general properties of specific production schemes, which are gradually different in the central-decentral parameter space? Another open question is (b): What are, in any case, interesting influence parameters in the simulation process? And (c), to mention only a few important ones: How will we transform production systems into self-organizational and even autopoietic systems? Last but not least, (d) the system-theoretic debate on whether optimization can be neglected if autopoiesis is
achieved (see here also [20, p. 172]) and which applications this has in production systems, will be of striking interest.
References
1. Bauernhansl, T., ten Hompel, M., Vogel-Heuser, B. (eds.): Industrie 4.0 in Produktion, Automatisierung und Logistik. Springer, Wiesbaden (2014). https://doi.org/10.1007/978-3-658-04682-8
2. Burkhard, H.: Industrial production manufacturing processes, measuring and testing technology, original in German: Industrielle Fertigung Fertigungsverfahren, Mess- und Prüftechnik. Haan-Gruiten: Verlag Europa-Lehrmittel, 6 edn. (2013)
3. Cavallo, C.: Die casting vs. sand casting - what's the difference? Thomas-Company. https://www.thomasnet.com/articles/custom-manufacturing-fabricating/die-casting-vs-sand-casting/, 1 July 2022
4. Dasgupta, S., Papadimitriou, C., Vazirani, U.: Algorithms. The McGraw-Hill Companies (2008)
5. DCM: The time control of die casting process. Junying Metal Manufacturing Co., Limited. https://www.diecasting-mould.com/news/the-time-control-of-die-casting-process-diecasting-mould, 1 July 2022
6. Fadhlillah, M.M.: Pull system vs push system. http://famora.blogspot.com/2009/11/pull-system-vs-push-system.html, 1 July 2022
7. GE-Additive. http://famora.blogspot.com/2009/11/pull-system-vs-push-system.html, 1 July 2022
8. Gibson, I., Rosen, D., Stucker, B.: Additive Manufacturing Technologies. Springer, New York (2015). https://doi.org/10.1007/978-1-4939-2113-3
9. Heiden, B., Alieksieiev, V., Tonino-Heiden, B.: Selforganisational high efficient stable chaos patterns. In: Proceedings of the 6th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, pp. 245–252. INSTICC, SciTePress (2021). https://doi.org/10.5220/0010465502450252
10. Heiden, B., Knabe, T., Alieksieiev, V., Tonino-Heiden, B.: Production orgitonization - some principles of the central/decentral dichotomy and a witness application example. In: Arai, K. (ed.) FICC 2022, vol. 439, pp. 517–529. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98015-3_36
11. Heiden, B., Tonino-Heiden, B.: Lockstepping conditions of growth processes: some considerations towards their quantitative and qualitative nature from investigations of the logistic curve. In: Arai, K. (ed.) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol. 543. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-16078-3_48
12. Heiden, B., Tonino-Heiden, B.: Philosophical Studies - Special Orgiton Theory/Philosophische Untersuchungen - Spezielle Orgitontheorie (English and German Edition) (unpublished) (2022)
13. Heiden, B., Volk, M., Alieksieiev, V., Tonino-Heiden, B.: Framing artificial intelligence (AI) additive manufacturing (AM). Procedia Comput. Sci. 186, 387–394 (2021). https://doi.org/10.1016/j.procs.2021.04.161
14. Hilborn, R.C.: Chaos and Nonlinear Dynamics - An Introduction for Scientists and Engineers. Oxford University Press, New York (1994)
15. Huang, J., et al.: A hybrid electric vehicle motor cooling system - design, model, and control. IEEE Trans. Veh. Technol. 68(5), 4467–4478 (2019)
16. Kampker, A.: Elektromobilproduktion. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-42022-1
17. Knabe, T.: Centralized vs. decentralized control of production systems (original in German: Zentrale vs. dezentrale Steuerung von Produktionssystemen). Bachelor's thesis, Carinthia University of Applied Sciences, Austria (2021)
18. Krimm, R.: Comparison of central and decentral production control systems and simulation of an industrial use case. Bachelor's thesis, Carinthia University of Applied Sciences, Villach, Austria (2022)
19. Lanner: Technology Witness Horizon (2021)
20. Luhmann, N.: Einführung in die Systemtheorie, 3 edn. Carl-Auer-Systeme Verlag (2006)
21. Markowitz, H.M.: Portfolio selection. J. Financ. 7(1), 77–91 (1952)
22. Pan, Y., et al.: Taxonomies for reasoning about cyber-physical attacks in IoT-based manufacturing systems 4(3), 45–54 (2017). https://doi.org/10.9781/ijimai.2017.437
23. Quitter, D.: Additive Fertigung: Geeignete Bauteile für die additive Fertigung identifizieren
24. Quitter, D.: Metall-3D-Druck: Spiralförmiger Kühlkanal gibt E-Motorengehäuse zusätzliche Funktion (2019). https://www.konstruktionspraxis.vogel.de/spiralfoermiger-kuehlkanal-gibt-e-motorengehaeuse-zusaetzliche-funktion-a806744/, 1 July 2022
25. Ruttkamp, E.: Philosophy of science: interfaces between logic and knowledge representation. South Afr. J. Philos. 25(4), 275–289 (2006)
26. Tonino-Heiden, B., Heiden, B., Alieksieiev, V.: Artificial life: investigations about a universal osmotic paradigm (UOP). In: Arai, K. (ed.) Intelligent Computing. LNNS, vol. 285, pp. 595–605. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-80129-8_42
27. Trentesaux, D.: Distributed control of production systems. Eng. Appl. Artif. Intell. 22(7), 971–978 (2009). https://doi.org/10.1016/j.engappai.2009.05.001
28. von Bertalanffy, L.: General System Theory. George Braziller, revised edition (2009)
29. von Foerster, H.: Cybernetics of epistemology. In: Understanding Understanding, pp. 229–246. Springer, New York (2003). https://doi.org/10.1007/0-387-21722-3_9
30. Watts, D., Strogatz, S.: Collective dynamics of 'small-world' networks. Nature 393, 440–442 (1998)
Gender Equality in Information Technology Processes: A Systematic Mapping Study

J. David Patón-Romero1,2(B), Sunniva Block1, Claudia Ayala3, and Letizia Jaccheri1

1 Norwegian University of Science and Technology (NTNU), Sem Sælands Vei 7, 7034 Trondheim, Norway
[email protected], [email protected]
2 SimulaMet, Pilestredet 52, 0167 Oslo, Norway
3 Universitat Politècnica de Catalunya (UPC), Jordi Girona 29, 08034 Barcelona, Spain
[email protected]
Abstract. Information Technology (IT) plays a key role in the world we live in. As such, its relation to the 17 Sustainable Development Goals (SDGs) stated by the United Nations to improve lives and health of the people and the planet is inexorable. In particular, the SDG 5 aims to enforce gender equality and states 9 Targets that drive the actions to achieve such goals. The lack of women within IT has been a concern for several years. In this context, the objective of this study is to get an overview of the state of the art on gender equality in IT processes. To do so, we conducted a Systematic Mapping Study to investigate the addressed targets, challenges, and potential best practices that have been put forward so far. The results we have obtained demonstrate the novelty of this field, as well as a set of opportunities and challenges that currently exist in this regard, such as the lack of best practices to address gender equality in IT processes and the need to develop proposals that solve this problem. All of this can be used as a starting point to identify open issues that help to promote research on this field and promote and enhance best practices towards a more socially sustainable basis for gender equality in and by IT. Keywords: Gender equality · Information Technology · Processes · Sustainability · Systematic Mapping Study
1 Introduction

The United Nations (UN) proposed 17 Sustainable Development Goals (SDGs) for sustainable development, with the aim of making the world work together for peace and progress [1]. The Goals call out for environmental, social, and economic sustainability [2] to better the world; perspectives that can be seen in relation to Information Technology (IT). IT has revolutionized the world as we know it in the past decades [3] within areas such as education, social interactions, and defence, among others. However, this revolution has been accompanied by negative aspects for the three perspectives of sustainability (environment, society, and economy) [2], which must be considered to achieve true
sustainable development. An example of this is the marginal representation of women in IT research, practice, and education [4]. Numbers from 2019 show that just 16% of engineering roles and 27% of roles within computing are held by women [5]. While gender gaps have evened out in many fields and parts of society in recent decades, this seems to lag in IT [6]. There are several questions that need to be addressed, such as why not more women enter IT, why women often leave IT, and what they specifically bring to IT. Studies show that women leave IT at a higher rate than men, and that, of the already few women in tech, 50% will resign from their tech role before they turn 35 [7].

In the same way, for the past years, IT development has created different kinds of tools that help improve people's lives and make it easier to communicate, among other relevant functions and characteristics. So, how does the lack of female input in the development of IT solutions and, generally, in IT processes, affect the resulting applications? This is a difficult question to answer and one that has not yet been adequately addressed. Albusays et al. [4] stated the following: "Although it is well accepted that software development is exclusionary, there is a lack of agreement about the underlying causes, the critical barriers faced by potential future developers, and the interventions and practices that may help".

For these reasons, the objective of this study is to understand the state of the art and how it corresponds to the Goal of gender equality (SDG 5 [1]) in IT processes, a topic that, until now, had not been explored or analyzed in previous works. By going through the current research and identifying the challenges and current best practices (through a Systematic Mapping Study), the prospect is to find out how IT processes can be improved and adapted to enhance gender equality.

The rest of this study is organized as follows: Sect. 2 includes the background about gender equality and IT; Sect. 3 presents the methodology followed to conduct the analysis of the state of the art; Sect. 4 shows the results obtained; Sect. 5 discusses the main findings, as well as the limitations and implications; and Sect. 6 contains the conclusions reached. In the same way, Appendix A includes the list of references of the selected primary studies; and Appendix B shows a mapping of the answers to the research questions from each of these primary studies.
2 Background

2.1 Gender Equality

The SDG 5 that the UN put forward in 2015 [1] targets gender equality. The UN recognize that progress between the SDGs is integrated, and that technology has an important role to play in achieving them. In the past decades progress in gender equality has been made, and there are today more women in leadership and political positions [8]. However, numbers from the UN show that there is still a long way to go; in agriculture women own only 13% of the land, and representation in politics is still low at 23.7%, even though it has increased1. In developing countries genital mutilation and child marriage are some of the biggest threats affecting girls and women [9, 10]. The UN emphasizes
312
J. D. Patón-Romero et al.
that: “Ending all discrimination against women and girls is not only a basic human right, it is crucial for sustainable future; it is proven that empowering women and girls helps economic growth and development”2 . Gender equality is a complex goal with many dependencies that needs to be fulfilled. The main goal of gender equality is to make all discrimination against all women cease. This goal affects many different issues and types of discrimination and has a set of 9 Targets [1]. The Targets help articulate in more detail the different challenges from child mutilation to equal opportunities in political leaderships, among others, as can be seen in Table 1 (extracted from [1]). It is important to point out that in gender equality all genders should be included, and this no longer only contains men and women. However, this study will revolve around the gender equality dilemma that women face, as this is what is pointed out in the SDG’s and what is presented in the current literature analysis. Table 1. SDG 5 targets [1] Target
Description
Target 5.1
End all forms of discrimination against all women and girls everywhere
Target 5.2
Eliminate all forms of violence against all women and girls in the public and private spheres, including trafficking and sexual and other types of exploitation
Target 5.3
Eliminate all harmful practices, such as child, early and forced marriage and female genital mutilation
Target 5.4
Recognize and value unpaid care and domestic work through the provision of public services, infrastructure and social protection policies and the promotion of shared responsibility within the household and the family as nationally appropriate
Target 5.5
Ensure women’s full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life
Target 5.6
Ensure universal access to sexual and reproductive health and reproductive rights as agreed in accordance with the Programme of Action of the International Conference on Population and Development and the Beijing Platform for Action and the outcome documents of their review conferences
Target 5.a
Undertake reforms to give women equal rights to economic resources, as well as access to ownership and control over land and other forms of property, financial services, inheritance and natural resources, in accordance with national laws
Target 5.b
Enhance the use of enabling technology, in particular information and communications technology, to promote the empowerment of women
Target 5.c
Adopt and strengthen sound policies and enforceable legislation for the promotion of gender equality and the empowerment of all women and girls at all levels
2 https://www.undp.org/sustainable-development-goals#gender-equality.
2.2 Gender Equality in IT

Target 5.5 of the SDG 5 calls for promoting more women within IT positions at all levels. This can seem superficial in comparison to issues such as female genital mutilation, but having more women in these positions can help by putting more focus on the dangers girls face. As Diekman et al. state: "Lower numbers of women in STEM result in a narrower range of inquiry and progress in those fields; fields that have experienced increases in diversity also witness an increase in the range of topics pursued…" [11]. The lack of women within IT has made them highly in demand for many employers, and it is interesting to understand why women are sought-after in the IT market [12]. There are examples of how companies with more women create better styles of management, are more creative, have more innovative processes, and put more focus on better user experiences [13]. To increase the number of women in IT many resources are in play, and some of the main initiatives consist of university and mentoring programs [14]. These aim to create women's networks and encourage women to complete their degrees, but it is also necessary to look further into why women leave IT and how this can be addressed.

The lack of diversity within software development is well known, but the barriers that future developers will face, as well as the practices that can help, are not thoroughly discussed and agreed upon [4]. One barrier that we are already seeing is that the lack of female input during the development of IT can lead to non-inclusive solutions [4]. A part of gender equality that can be perceived as conflicting for many is the drive for equality while still focusing on differences, such as the missing female perspective in IT. It is important to note that even when gender equality is achieved there will always be different perspectives that can only be obtained through the inclusion of all. It is for this reason that it becomes essential to include women (in addition to other social groups discriminated against because of their race, culture, or other characteristics) and to achieve a balance throughout all processes that involve IT.

When we talk about processes, ISACA (Information Systems Audit and Control Association) defines a process as a series of practices that are affected by procedures and policies, taking inputs from several sources and using these to generate outputs [15]. It is further explained that processes also have a defined purpose, roles, and responsibilities, as well as a performance measure. In this study we understand IT processes as processes, frameworks, and/or best practices leading to the development of IT solutions.
3 Research Method

The study was conducted as a Systematic Mapping Study (SMS), following the guidelines established by Kitchenham [16] and adopting the lessons learned for data extraction and analysis identified by Brereton et al. [17].

3.1 Research Questions

Table 2 shows the research questions (RQs) established to address the objective of the SMS, as well as the motivation for each of them. As one of the prospects of the study is to see connections to the SDG 5, RQ2 addresses this through the Targets of the SDG 5.
Further, the statistics cited above show that many women leave IT, which motivated RQ3 and RQ4, addressing the challenges present in IT and the best practices for gender equality in IT processes.

Table 2. Research questions

RQ1. What kind of studies exist on IT processes and gender equality?
Motivation: Discover what studies exist on IT processes and gender equality and how they are distributed, to get an overview of the field.

RQ2. What gender equality Targets are addressed by IT processes?
Motivation: Based on the Targets from the SDG 5, identify which Targets are covered.

RQ3. What are the main challenges to achieve gender equality in IT processes?
Motivation: Identify the main challenges reported by existing studies in order to understand the obstacles that women in the IT sector face.

RQ4. What are the best practices established to address gender equality in IT processes?
Motivation: Uncover best practices that have been reported to promote gender equality in IT processes.
3.2 Search Strategy

To define the keywords used to implement the searches, 3 main topics related to the research were identified:

• The first topic refers to the field of technology. Concepts such as IT, information technology, or information systems could be used, but it was concluded that technology itself represents the field well and covers the expected scope.
• The second topic addresses processes and best practices, where both terms were implemented in the search string.
• The third topic represents gender equality, so terms in this regard were implemented in the search string.

To address all the Targets from the SDG 5, it was decided to conduct specific searches focused on each Target. As a result, 10 different searches were performed. Table 3 shows the search strings established for each one of these searches.
Table 3. Search strings

General: (Technology AND (Process* OR “Best practice*”) AND (Gender OR “Women rights” OR “Social sustainability” OR “SDG 5”))
Target 5.1: (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND Discrimination))
Target 5.2: (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND (Violence OR Exploitation OR Trafficking)))
Target 5.3: (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND (“Harmful practices” OR Mutilation)))
Target 5.4: (Technology AND (Process* OR “Best practice*”) AND (“Social protection policies” OR “Care work” OR “Domestic work”))
Target 5.5: (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND (“Equal opportunities” OR Participation OR Leadership)))
Target 5.6: (Technology AND (Process* OR “Best practice*”) AND ((Sexual OR Reproductive) AND (Health OR Rights)))
Target 5.a: (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND Rights AND Equal*))
Target 5.b: (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND Empower*))
Target 5.c: (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND (Equal* OR Empower*) AND (Policies OR Legislation*)))
3.3 Selection Criteria

In order to select the primary studies, a set of criteria was put forward to include all relevant studies and exclude those that would not aid the task. First, the criteria for including a study were established as follows:

• I1. English studies published between 2016 (the year after the publication of the SDGs [1]) and 2021 about gender equality in and by the IT sector.
• I2. Complete studies that are peer reviewed in journals or conferences.

In the same way, the exclusion criteria defined were the following:

• E1. Studies presenting opinions, as well as abstracts or presentations.
• E2. Studies that do not revolve around IT processes and gender equality.
• E3. Duplicated work; only the most recent version will be considered.

3.4 Data Sources and Study Selection

The selection of data sources and studies was performed through the following steps:

• Data Source Selection. The searches were all performed with the bibliographic database Scopus, through the advanced search functions.
• Initial Search. The initial search consisted of 10 search strings that resulted in a total of 4,206 studies. The study selection was performed by first reading through the titles and abstracts of all the studies and selecting according to the inclusion and exclusion criteria, which resulted in 50 potential studies.
• Limiting the Studies. The 50 potential studies were further narrowed down by applying the selection criteria to the whole study. This resulted in 15 primary studies, which were analyzed in detail and on which data extraction was performed.

3.5 Strategy for Data Extraction and Analysis

Table 4 shows the classification scheme related to the possible answers identified during the planning for each of the RQs. In addition to the general information of each study (title, authors, venue…), this classification helps to identify and extract specific data such as the type of study, scope, practices and challenges in this regard, etc.

Table 4. RQs classification scheme

RQ1. What kind of studies exist on IT processes and gender equality?
Answers: a. State of the art; b. Proposal; c. Validation; d. Others*

RQ2. What gender equality Targets are addressed by IT processes?
Answers: a. Target 5.1; b. Target 5.2; c. Target 5.3; d. Target 5.4; e. Target 5.5; f. Target 5.6; g. Target 5.a; h. Target 5.b; i. Target 5.c

RQ3. What are the main challenges to achieve gender equality in IT processes?
Answers: Keyword extraction to identify answers (due to the large scope of answers that this RQ can have)

RQ4. What are the best practices established to address gender equality in IT processes?
Answers: Keyword extraction to identify answers (due to the large scope of answers that this RQ can have)

* The answers to RQ1 have their origin in an adaptation from the example of Petersen et al. [18].
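As an illustration only (not part of the study's protocol), the keyword-based extraction used for RQ3 and RQ4 could be supported by a small script that counts, per study, which candidate terms appear; all names and inputs below are hypothetical.

```python
from collections import Counter
import re

# Hypothetical inputs: abstracts of the primary studies keyed by study ID, and a
# candidate list of challenge keywords drawn from the terms discussed in Sect. 4.3.
abstracts = {
    "S04": "Women in tech face gender bias, imposter syndrome and retention problems...",
    "S06": "Gender bias and the pay gap remain central challenges in STEM careers...",
}
keywords = ["gender bias", "imposter syndrome", "pay gap", "retention", "stereotype threat"]

counts = Counter()
for study_id, text in abstracts.items():
    lowered = text.lower()
    for kw in keywords:
        if re.search(re.escape(kw), lowered):
            counts[kw] += 1   # count studies mentioning the keyword, not raw occurrences

print(counts.most_common())   # e.g. [('gender bias', 2), ('imposter syndrome', 1), ...]
```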
4 Results

4.1 RQ1: What Kind of Studies Exist on IT Processes and Gender Equality?

RQ1 is set to discover what studies exist in the field of IT processes and gender equality. Following the extraction plan visualized in Fig. 1, we find that out of the 15 primary studies there are five state-of-the-art analyses ([S06], [S07], [S10], [S12], and [S15]), one proposal ([S05]), four validations ([S04], [S08], [S09], and [S13]), and five categorized as others ([S01], [S02], [S03], [S11], and [S14]). All results were limited to the last six years, a short publication period whose purpose is to give a quick overview and a first approach to the most recent and updated works. It is worth mentioning that seven of the studies were published in 2020–2021, three in 2019, and the remaining five in 2016–2018, indicating a growing interest in the area.
Fig. 1. Results from data extraction of RQ1 (Percentage of the Kind of Studies)
4.2 RQ2: What Gender Equality Targets Are Addressed by IT Processes?

All the primary studies have been assessed for which of the 9 Targets from the SDG 5 [1] they contribute to. The results obtained in this regard (represented in Fig. 2) show that all of the 15 studies foster Target 5.5, which is concerned with ensuring women's participation and opportunities at all levels in public life. Further, 10 out of the 15 primary studies ([S01], [S02], [S04], [S05], [S06], [S11], [S12], [S13], [S14], and [S15]) address Target 5.1, which is related to ending all forms of discrimination. A third of the studies ([S04], [S05], [S06], [S10], and [S13]) address Target 5.4, which applies to promoting equality and shared responsibility within household and domestic care. A third of the studies again ([S04], [S05], [S06], [S08], and [S10]) contribute to Target 5.c, concerning policies for promoting the empowerment of women and gender equality.
Fig. 2. Results from data extraction of RQ2 (Number of Studies Addressing each of the Targets within the SDG 5)
Finally, Target 5.b is also addressed by one study ([S08]), regarding the use of enabling technology to promote the empowerment of women. It is equally important to highlight the Targets that were not addressed by any study: Targets 5.2, 5.3, 5.6, and 5.a.

4.3 RQ3: What Are the Main Challenges to Achieve Gender Equality in IT Processes?

With the aim of discovering the current challenges on gender equality in IT processes, a keyword extraction was performed, identifying the concepts in this regard that each study deals with. Figure 3 shows an overview of all the challenges that have been identified in two or more studies. In the same way, it is important to remember that the mapping of the full data extraction results can be found in Appendix B.

First, the challenge that is most frequently mentioned is gender bias, appearing in 8 studies ([S02], [S03], [S04], [S05], [S06], [S07], [S08], and [S14]). To better understand what this challenge refers to, the APA Dictionary of Psychology describes gender bias as "any one of a variety of stereotypical beliefs about individuals on the basis of their sex, particularly as related to the differential treatment of females and males" [19]. Therefore, this challenge refers to preconceptions, held without evidence, about the involvement, performance, responsibilities, and possibilities, among others, of women in IT processes.

Second, imposter syndrome is the challenge that occurs next most frequently, in seven studies ([S02], [S04], [S08], [S09], [S11], [S12], and [S13]). Embedded in the term imposter syndrome is the fear of being revealed as a fraud or seen as incompetent for one's job. Di Tullio [S13] states that we often become what we think others expect of us, and in this case imposter syndrome can lead to increased insecurity and the belief that others also believe that one is a fraud.

And, third, with four occurrences ([S01], [S07], [S13], and [S15]), implicit bias refers to internalized bias that one is unaware of holding and is generally understood as people acting on stereotypes or prejudices without intention. Implicit bias is often gender bias, but because people are unaware of it, it leads to challenges other than just gender bias.

Other challenges mentioned are the stereotype threat, pay gap, motherhood penalty, gender preferences, and retention problems. Stereotype threat concerns the fear of failing and thereby confirming a negative stereotype, which can lead to a decrease in career interest [S08] [S10]. The pay gap between genders is not just a problem, but also a direct indication of the value that employees of different genders have in a company [S06] [S07] [S12]. The challenges around the motherhood penalty are many, and one example is the perception that parenthood builds men's commitment but reduces women's commitment [S04] [S11] [S12]. Some studies also present gender preferences, referring to the fact that women often choose occupations that are seen as "softer", which is also a factor within IT where women often choose "softer" roles [S06] [S10]. Likewise, some challenges are only mentioned once, such as code acceptance, disengagement, few women, poor management, symbolic violence, queen bee syndrome, gender-based discrimination, self-efficacy, stereotype bias, and negative environment.
Fig. 3. Results from data extraction of RQ3 (Number of Studies Dealing with the Challenges to Achieve Gender Equality in IT Processes)
4.4 RQ4: What Are the Best Practices Established to Address Gender Equality in IT Processes?

For RQ4, the data extraction model was based on keyword extraction identifying the best practices to address gender equality in IT processes. Of all the selected primary studies, five provide best practices or frameworks that tackle the challenges presented in the previous subsection, but none of them are specifically for IT processes. However, despite not being specific best practices for IT, their characteristics and points of view allow them to be easily adapted and made applicable to the IT context. The best practices found in these five primary studies are presented below.

First, study [S04] presents the importance of women having their own safe place to discuss challenges and support each other through women-only workshops or other arenas (online forums, offline networks…), where the main point is that women feel free to talk openly. This can keep more women in tech and combat retention problems.

Second, study [S05] presents "nudging" as a way to encourage gender equality by establishing its importance without setting hard demands. Nudge theory can be applied at different levels, where the main point is nudging behavior in a direction that removes negative biases in a predictable way without changing policies or mandates. This practice can address several challenges; for example, an organization could ask all contractors to provide a pay equity report, nudging them to diminish the pay gap, or ask for the percentage of women leaders, to establish a gender balance criterion in tenders.

Third, study [S06] presents the habit-breaking approach to reduce bias, which applies to both gender bias and implicit bias. The first step towards breaking a bias is being made aware of it and of the consequences it has. The second step consists of using strategies set to address the bias; this can be done through, for example, perspective-taking, individualization, or counter-stereotype exposure.

Fourth, study [S08] puts forward the goal congruity model as a way of understanding how people often follow gender roles. The model suggests that women often choose not
to go into IT because it goes against the communal goals society has set for women. However, this model implies that by changing the social expectations for women, they can feel more valued in their IT role or more motivated to pursue a career in this field.

Finally, study [S10] promotes anti-bias and gender-blind training to create more tolerance and awareness of diversity in the workplace, helping people to work more smoothly together. Although in this case it is applied to the field of gender equality, it should be highlighted that this is a method applicable to any type of discrimination.
5 Discussion

5.1 Principal Findings

Lack of Studies on IT Processes and Gender Equality. The primary studies selected cover a diverse range of fields: psychology, neuroscience, business, sociology, and technology. Since gender equality and IT is an intersection between several fields, this also generates a great variation in the studies analyzed. This diversity and interdisciplinarity are undoubtedly a very positive aspect and help to obtain better results in the developments and research performed. However, given the large number of studies found (4,206 in total), we expected to obtain more relevant studies than just the 15 primary studies. This shows that, although the concepts of IT and gender equality are very common and have already been analyzed before [20, 21], the direct intersection of IT processes and gender equality is an innovative and novel field that should be investigated in more detail.

Low Number of Proposals and Validations. There is only one study classified as a proposal, which presents new research ideas that have not yet been implemented. In addition, there are four studies that validate their approach using a gold standard measure and are thereby assessed as validation studies following the definition of Fox et al. [22]. These results not only demonstrate the novelty of the research area of gender equality and IT processes, but also the need to develop new and updated proposals to address gender equality in and by IT processes. Likewise, it is equally important to properly validate the proposals to really verify their effectiveness and efficiency in real contexts, creating high-quality research in these fields.

Right Approach towards the Targets of the SDG 5. We can observe that some of the Targets are not addressed by any study, such as Targets 5.2 and 5.3, concerning exploitation, harmful practices, and mutilation [1]. These Targets are very important, but they have little and only an indirect relationship with IT processes and are considered outside the scope in this regard. However, although addressing first the Targets that are most fundamental and directly related to IT processes is the correct approach, it is important to also address these secondary Targets. For example, within IT processes, a series of practices can be established aimed at the specific development of IT proposals that address problems such as exploitation, harmful practices, and mutilation.

Focusing on the Targets that the studies complied with, we can observe several findings. First, Target 5.5, about improving women's participation and opportunities in all levels of public life, is addressed by all the primary studies and seems to be very
coherent with the RQs' focus on achieving gender equality in IT processes. Many of the studies emphasize that women have the skills and qualities required for IT jobs. For example, study [S03] tests how a development team's risk-taking is affected by having more, fewer, or no women, and finds no significant differences. Another example is study [S08], which suggests that many women assess themselves as having lower abilities than men, even in situations where they are externally assessed as performing better than men. This can be seen in connection with the finding, stated by study [S07], that women often only apply for a job if they feel fully qualified.

Second, Target 5.1 is to end all discrimination against women, and this is also one of the most addressed Targets. The studies contribute by creating awareness about the challenges women face and by providing statistics that highlight the inequality. For example, study [S06] highlights the pay gap that women experience.

Third, Target 5.c is concerned with enforcing policies for the promotion of gender equality. It is very easy to see and understand the direct connection between this Target and IT processes, especially when it comes to implementing best practices that address gender equality. This is demonstrated by the fact that the 5 primary studies that deal with this Target ([S04], [S05], [S06], [S08], and [S10]) are the only ones that identify and establish a series of best practices in this regard.

Finally, Target 5.4 prompts the importance of valuing domestic work and promoting shared responsibility within the household. Study [S06] portrays how motherhood is seen as lessening a woman's commitment to work, which can be seen both as a stereotype and as a real outcome in households where women are expected to take on most of the childcare and the additional household duties that come with an expanded family. The view of mothers as less committed can result in their being passed over for promotions as well as salary increases in the workplace. Thus, it is necessary to understand that this is not the case, and measures must be taken to raise awareness and put an end to these preconceived and erroneous ideas.

Importance of Tackling all Challenges Together. The challenges found through the primary studies affect human relational behaviors, such as bias, as well as challenges related to women's self-efficacy, such as imposter syndrome and stereotype threat. However, most of the challenges apply more to organizations and society as a whole, such as the pay gap, retention problems, and challenges related to motherhood. The challenges that affect how women are treated are often the result of bias. Study [S01] states that most people agree that standards for excellence should be the same for all, but that this is difficult to achieve in practice due to gender bias. Some challenges mainly affect women's self-efficacy. An example is the stereotype threat, where the fear is of confirming the negative stereotypes of one's group, as identified by study [S07]. This same study further explains that stereotype threat can affect motivation and career interest, which may be one of the reasons why some women choose to leave the IT field. Likewise, an argument made by study [S14] is that women should allow more external attributions for their setbacks, indicating that this can make them feel that they are in the right place even in an opposing environment.
Several of the challenges presented are complex and need to be addressed in organizations and society. One of the biggest and most complex challenges is the motherhood
penalty, where, after having children, women are often seen as less committed, receive fewer opportunities, and are paid less [S06]. These examples, together with the rest of the challenges found, show that, even though they are different challenges, each one has a certain connection with the others and, therefore, it is vital to address them together to really meet their particular objectives. For example, it is not possible to try to end the pay gap if the bias behind the idea that the work of a woman is not up to the work of a man is not addressed; and, in the same way, the pay gap leads women to feel less valued and less capable of doing a job, materializing in other challenges such as imposter syndrome or stereotype threat.

Lack of a Common Framework of Best Practices. The lack of answers to RQ4 and, therefore, of sets of best practices on gender equality with emphasis on IT processes generates a series of findings and opportunities. Some of the studies address general best practices to improve gender equality in IT, but in an isolated manner, and none of them discuss this in relation to IT processes. Thus, the most prominent result is that there seems to have been no research on using IT processes as such to achieve gender equality (although there are best practices in this regard that can be used and put together), at least as far as the studies found through this SMS show. In further detail, the studies show no way to assess or ensure that the artifact from an IT process results in a product that fits the needs of all genders, or that the process itself has any focus on gender equality. For example, the best practices identified could serve as a foundation for the development of a framework or guidelines that help implement gender-friendly IT processes. Likewise, the studies that have an answer for RQ4 are also the only studies that correspond to Target 5.c, which is about enforcing policies for gender equality and empowerment [1]. This suggests that research or development promoting gender equality in IT processes has the potential to further promote Target 5.c. However, we must not forget the other Targets of the SDG 5, and these results also demonstrate the opportunity for innovation and the need to develop new research to address these important Targets in and by IT processes.
5.2 Limitations

During the planning and execution of an SMS there are always limitations that can affect the results and findings. To mitigate them, it was decided to conduct 10 searches (1 at a general level and 9 related to the individual Targets of the SDG 5 [1]). This has helped us to find studies with very specific terminology related to each of the Targets. However, it was also decided to limit the search to a short period of publication (the last 6 years). Although it is true that the purpose is to make a first approach to the most recent and updated works, and that the area of gender equality is relatively young, with the most relevant studies published in recent years, this period could be longer. Finally, certain studies may have been overlooked for different reasons, or certain evidence or advances on the studies found may not have been published at the time of the execution of this SMS. Likewise, the analysis of the results and findings performed in this SMS comes from the perspectives and experiences of the authors, and might not be interpreted in the same way by other stakeholders in this area. That is why, with
the aim of mitigating these risks, an attempt has been made to reduce the bias by having the authors analyze the data and results obtained independently and then reach a consensus.

5.3 Implications

This SMS is highly relevant both for researchers and for professionals in the fields in which it is framed. The results obtained demonstrate the lack of research and developments that address gender equality in and by IT processes, as well as the importance of conducting proposals in this regard. That is why this SMS, in addition to identifying the current state of the art, also highlights the gaps and possible future lines of research/work that can be performed. The findings obtained can be used by both researchers and professionals working in areas such as IT management, gender equality, and social sustainability, among others. Therefore, this SMS is a relevant starting point and a demonstration of the importance of the fields it affects, which can attract new researchers and professionals in the search for gender equality in and by IT processes.
6 Conclusions and Future Work

Technology has changed the world as we know it in practically all areas that surround us [3]. However, these changes, far from being perfect, are not always accompanied by positive aspects, as is the case with gender inequality in IT [4]. That is why this study has focused on analyzing the state of the art on gender equality and IT processes through an SMS. IT processes are the foundations on which all aspects related to IT in organizations are governed and managed [15]. For this reason, it is necessary for them to be sustainable and, in this case, to project exemplary gender equality, diversity, and inclusion towards the entire IT context.

Through the results obtained, the novelty of this study has been evidenced, since, of the 4,206 studies found, only 15 are related to the established scope. Likewise, the findings achieved identify a series of opportunities and challenges on which it is necessary and urgent to act due to the importance that these fields have together. Therefore, following these findings, as future work we are working on an empirically validated proposal through the development and implementation of an IT process framework that considers all the Targets of the SDG 5 [1] and addresses the challenges identified through a set of egalitarian and inclusive best practices. In this way, we intend to help organizations establish socially sustainable foundations, as well as promote research and practice in these fields. In addition, we also intend to conduct a more in-depth evaluation of the results obtained in this study through interviews or surveys with relevant professionals and researchers in the areas of gender equality and IT processes. It is our duty to ensure that the changes in our present positively affect our future and that this future is balanced, diverse, and inclusive for all.

Acknowledgments. This work is the result of a postdoc within the ERCIM "Alain Bensoussan" Fellowship Program conducted at the Norwegian University of Science and Technology (NTNU).
This research is also part of the COST Action - European Network for Gender Balance in Informatics project (CA19122), funded by the Horizon 2020 Framework Program of the European Union.
Appendix A. Selected Studies

S01. Nelson, L. K., Zippel, K.: From Theory to Practice and Back: How the Concept of Implicit Bias Was Implemented in Academe, and What This Means for Gender Theories of Organizational Change. Gender & Society 35(3), 330–357 (2021).
S02. Albusays, K., Bjorn, P., Dabbish, L., Ford, D., Murphy-Hill, E., Serebrenik, A., Storey, M. A.: The Diversity Crisis in Software Development. IEEE Software 38(2), 19–25 (2021).
S03. Biga-Diambeidou, M., Bruna, M. G., Dang, R., Houanti, L. H.: Does gender diversity among new venture team matter for R&D intensity in technology-based new ventures? Evidence from a field experiment. Small Business Economics 56(3), 1205–1220 (2021).
S04. Schmitt, F., Sundermeier, J., Bohn, N., Morassi Sasso, A.: Spotlight on Women in Tech: Fostering an Inclusive Workforce when Exploring and Exploiting Digital Innovation Potentials. In: 2020 International Conference on Information Systems (ICIS 2020), pp. 1–17. AIS, India (2020).
S05. Atal, N., Berenguer, G., Borwankar, S.: Gender diversity issues in the IT industry: How can your sourcing group help?. Business Horizons 62(5), 595–602 (2019).
S06. Charlesworth, T. E., Banaji, M. R.: Gender in Science, Technology, Engineering, and Mathematics: Issues, Causes, Solutions. Journal of Neuroscience 39(37), 7228–7243 (2019).
S07. González-González, C. S., García-Holgado, A., Martínez-Estévez, M. A., Gil, M., Martín-Fernandez, A., Marcos, A., Aranda, C., Gershon, T. S.: Gender and Engineering: Developing Actions to Encourage Women in Tech. In: 2018 IEEE Global Engineering Education Conference (EDUCON 2018), pp. 2082–2087. IEEE, Spain (2018).
S08. Diekman, A. B., Steinberg, M., Brown, E. R., Belanger, A. L., Clark, E. K.: A Goal Congruity Model of Role Entry, Engagement, and Exit: Understanding Communal Goal Processes in STEM Gender Gaps. Personality and Social Psychology Review 21(2), 142–175 (2017).
S09. Gorbacheva, E., Stein, A., Schmiedel, T., Müller, O.: The Role of Gender in Business Process Management Competence Supply. Business & Information Systems Engineering 58(3), 213–231 (2016).
S10. Stewart-Williams, S., Halsey, L. G.: Men, women and STEM: Why the differences and what should be done?. European Journal of Personality 35(1), 3–39 (2021).
S11. Harvey, V., Tremblay, D. G.: Women in the IT Sector: Queen Bee and Gender Judo Strategies. Employee Responsibilities and Rights Journal 32(4), 197–214 (2020).
S12. Segovia-Pérez, M., Castro Núñez, R. B., Santero Sánchez, R., Laguna Sánchez, P.: Being a woman in an ICT job: An analysis of the gender pay gap and discrimination in Spain. New Technology, Work and Employment 35(1), 20–39 (2020).
S13. Di Tullio, I.: Gender Equality in STEM: Exploring Self-Efficacy Through Gender Awareness. Italian Journal of Sociology of Education 11(3), 226–245 (2019).
S14. LaCosse, J., Sekaquaptewa, D., Bennett, J.: STEM Stereotypic Attribution Bias Among Women in an Unwelcoming Science Setting. Psychology of Women Quarterly 40(3), 378–397 (2016).
S15. Shishkova, E., Kwiecien, N. W., Hebert, A. S., Westphall, M. S., Prenni, J. E., Coon, J. J.: Gender Diversity in a STEM Subfield – Analysis of a Large Scientific Society and Its Annual Conferences. Journal of The American Society for Mass Spectrometry 28(12), 2523–2531 (2017).
Appendix B. Results Mapping

Table 5 includes a mapping of the answers of the different selected primary studies with respect to the defined research questions (RQs).

Table 5. Data extraction results from the primary studies

ID | Type (RQ1) | Targets (RQ2) | Challenges (RQ3) | Best practices (RQ4)
S01 | Others | 5.1, 5.5 | Implicit bias | –
S02 | Others | 5.1, 5.5 | Gender bias; Code acceptance; Disengagement; Imposter syndrome | –
S03 | Others | 5.5 | Gender bias | –
S04 | Validation | 5.1, 5.4, 5.5, 5.c | Few women; Gender bias; Retention problems; Imposter syndrome; Motherhood penalty | Women workshops
S05 | Proposal | 5.1, 5.4, 5.5, 5.c | Recruitment; Poor management; Retention problems; Gender bias | Nudging
S06 | State of the art | 5.1, 5.4, 5.5, 5.c | Gender bias; Gender preferences; Pay gap | Habit-breaking
S07 | State of the art | 5.5 | Gender bias; Pay gap; Retention problems; Implicit bias | –
S08 | Validation | 5.5, 5.b, 5.c | Gender bias; Imposter syndrome; Stereotype threat | Goal congruity model
S09 | Validation | 5.5 | Imposter syndrome | –
S10 | State of the art | 5.4, 5.5, 5.c | Gender preferences; Stereotype threat | Anti-bias training; Gender-blind training
S11 | Others | 5.1, 5.5 | Symbolic violence; Queen bee syndrome; Motherhood penalty; Imposter syndrome | –
S12 | State of the art | 5.1, 5.5 | Pay gap; Gender-based discrimination; Motherhood penalty; Imposter syndrome | –
S13 | Validation | 5.1, 5.4, 5.5 | Implicit bias; Self-efficacy; Imposter syndrome | –
S14 | Others | 5.1, 5.5 | Gender bias; Stereotype bias; Negative environment; Stereotype threat | –
S15 | State of the art | 5.1, 5.5 | Implicit bias | –
References

1. United Nations: Transforming Our World: The 2030 Agenda for Sustainable Development. In: Seventieth Session of the United Nations General Assembly, Resolution A/RES/70/1. United Nations (UN), USA (2015)
2. Purvis, B., Mao, Y., Robinson, D.: Three pillars of sustainability: in search of conceptual origins. Sustain. Sci. 14(3), 681–695 (2019)
3. Schwab, K.: The Fourth Industrial Revolution. The Crown Publishing Group, USA (2017)
4. Albusays, K., et al.: The diversity crisis in software development. IEEE Softw. 38(2), 19–25 (2021)
5. DuBow, W., Pruitt, A.S.: NCWIT scorecard: the status of women in technology. National Center for Women & Information Technology (NCWIT), USA (2018)
6. Stoet, G., Geary, D.C.: The gender-equality paradox in science, technology, engineering, and mathematics education. Psychol. Sci. 29(4), 581–593 (2018)
7. Glass, J.L., Sassler, S., Levitte, Y., Michelmore, K.M.: What's so special about STEM? A comparison of women's retention in STEM and professional occupations. Soc. Forces 92(2), 723–756 (2013)
8. Keohane, N.O.: Women, power & leadership. Daedalus 149(1), 236–250 (2020)
9. Ahinkorah, B.O., et al.: Association between female genital mutilation and girl-child marriage in sub-Saharan Africa. J. Biosoc. Sci. 55(1), 1–12 (2022)
10. Avalos, L., Farrell, N., Stellato, R., Werner, M.: Ending female genital mutilation & child marriage in Tanzania. Fordham Int. Law J. 38(3), 639–700 (2015)
11. Diekman, A.B., Steinberg, M., Brown, E.R., Belanger, A.L., Clark, E.K.: A goal congruity model of role entry, engagement, and exit: understanding communal goal processes in STEM gender gaps. Pers. Soc. Psychol. Rev. 21(2), 142–175 (2017)
12. González Ramos, A.M., Vergés Bosch, N., Martínez García, J.S.: Women in the technology labour market. Revista Española de Investigaciones Sociológicas (REIS) 159, 73–89 (2017)
13. González-González, C.S., et al.: Gender and engineering: developing actions to encourage women in tech. In: 2018 IEEE Global Engineering Education Conference (EDUCON 2018), pp. 2082–2087. IEEE, Spain (2018)
14. de Melo Bezerra, J., et al.: Fostering STEM education considering female participation gap. In: 15th International Conference on Cognition and Exploratory Learning in Digital Age (CELDA 2018), pp. 313–316. IADIS, Hungary (2018)
15. ISACA: COBIT 2019 Framework: Governance and Management Objectives. Information Systems Audit and Control Association (ISACA), USA (2018)
16. Kitchenham, B.: Guidelines for Performing Systematic Literature Reviews in Software Engineering (Version 2.3). Keele University, UK (2007)
17. Brereton, P., Kitchenham, B.A., Budgen, D., Turner, M., Khalil, M.: Lessons from applying the systematic literature review process within the software engineering domain. J. Syst. Softw. 80(4), 571–583 (2007)
18. Petersen, K., Feldt, R., Mujtaba, S., Mattsson, M.: Systematic mapping studies in software engineering. In: 12th International Conference on Evaluation and Assessment in Software Engineering (EASE 2008), pp. 68–77. ACM, Italy (2008)
19. American Psychological Association: APA Dictionary of Psychology (Second Edition). American Psychological Association, USA (2015)
20. Borokhovski, E., Pickup, D., El Saadi, L., Rabah, J., Tamim, R.M.: Gender and ICT: meta-analysis and systematic review. Commonwealth of Learning, Canada (2018)
21. Yeganehfar, M., Zarei, A., Isfandyari-Mogghadam, A.R., Famil-Rouhani, A.: Justice in technology policy: a systematic review of gender divide literature and the marginal contribution of women on ICT. J. Inf. Commun. Ethics Soc. 16(2), 123–137 (2018)
22. Fox, M.P., Lash, T.L., Bodnar, L.M.: Common misconceptions about validation studies. Int. J. Epidemiol. 49(4), 1392–1396 (2020)
The P vs. NP Problem and Attempts to Settle It via Perfect Graphs: State-of-the-Art Approach

Maher Heal1(B), Kia Dashtipour2, and Mandar Gogate2

1 Baghdad University, Jadirayh, Iraq
[email protected]
2 School of Computing, Edinburgh Napier University, Merchiston Campus, Edinburgh, UK
Abstract. The P vs. NP problem is a major problem in computer science. It is perhaps the most celebrated outstanding problem in that domain. Its solution would have a tremendous impact on different fields such as mathematics, cryptography, algorithm research, artificial intelligence, game theory, multimedia processing, philosophy, economics and many other fields. It has been open for almost 50 years, with attempts concentrated mainly in computational theory. However, as the problem is tightly coupled with the theory of the np-complete class of problems, we think the best technique to tackle the problem is to find a polynomial time algorithm to solve one of the many np-complete problems. To that end, this work presents attempts at solving the maximum independent set problem of any graph, which is a well known np-complete problem, in polynomial time. The basic idea is to transform any graph into a perfect graph such that a maximum independent set of the graph corresponds, at twice the size, to either the maximum independent set or the 2nd largest maximal independent set of the transformed bipartite perfect graph. There are polynomial time algorithms for finding the independence number or the maximum independent set of perfect graphs. However, the difficulty is in finding the 2nd largest maximal independent set of the bipartite perfect transformed graph. Moreover, we characterise the transformed bipartite perfect graph and suggest algorithms to find the maximum independent set for special graphs. We think finding the 2nd largest maximal independent set of bipartite perfect graphs is feasible in polynomial time.

Keywords: P vs. NP · Computational complexity · Np-Complete · Independence number · Perfect graphs

1 Introduction
The P vs NP problem is the most outstanding problem in computer science. Informally, it asks whether there are fast algorithms (P, from polynomial) to solve problems for which only slow algorithms are known (NP, from non-deterministic polynomial). To state it more explicitly, we need to define the two problem classes P and NP. The P class is the set of problems
for which there exist algorithms that run in polynomial time, as a function of the input resources (time and space), to solve them. On the other hand, the NP class is the set of problems whose solutions are verifiable in polynomial time but, as far as is known, can only be found in non-polynomial time as a function of the input resources. As one example, consider the factorisation of positive integers into primes: while any set of primes claimed as a factorisation of a certain integer is easily verified to be (or not to be) the prime factors of that integer, there is no known easy (polynomial) way to factor the integer into its prime factors, especially when the integer is very large. Another example is a maze with start and destination points. While it is easy to check whether a given path leads from the start point to the destination, finding such a path is not easy, and we may need to check an exponential number of paths to find a solution to the maze. Thus, the P vs. NP problem asks if there are P-algorithms to solve problems whose solutions are polynomially verifiable but otherwise cannot (or are thought not to) be found polynomially. A formal definition assuming a Turing machine as our computer can be found in the literature; see for example [1–3]. However, we do not need that formal definition, as our approach is graph theoretic and based on the concept of np-completeness, which we define below. The np-complete class is a set of problems such that any np problem can be reduced in polynomial time to an np-complete problem. Accordingly, if one succeeded in finding a P-algorithm for one of the np-complete problems, that would mean all np problems can be solved in polynomial time. This is the approach we will follow in trying to settle the P vs NP problem. This work explores solutions to one of the np-complete problems, namely the maximum independent set, by transforming any graph into a perfect graph such that the maximum independent set is encoded in the perfect graph's attributes. The attribute is either the maximum independent set of the perfect graph, for which polynomial time algorithms are known, or the second largest maximal independent set, for which we try to find a polynomial time algorithm.

This paper is organised as follows: Sect. 2 is a brief review of the literature on the P vs. NP problem and perfect graphs. Section 3 covers the basic transformation of any graph to a perfect graph that we propose. Section 4 explains the possible techniques to extract the maximum independent set of the graph from its transformed perfect graph and some properties of that perfect graph. Section 5 contains results for some graphs. Finally, Sect. 6 is the conclusion and future work.
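To make the notion of polynomial-time verifiability concrete, here is a small illustrative sketch (ours, not part of the original text) that verifies a claimed prime factorisation; checking a claimed solution is straightforward, whereas producing the factors in the first place is the hard direction.

```python
from math import prod

def is_prime(p):
    # Simple trial division; adequate for a toy example (polynomial-time
    # primality tests such as AKS exist for the general case).
    if p < 2:
        return False
    d = 2
    while d * d <= p:
        if p % d == 0:
            return False
        d += 1
    return True

def verify_factorisation(n, factors):
    # Verification: multiply the claimed factors and check each is prime.
    return prod(factors) == n and all(is_prime(p) for p in factors)

print(verify_factorisation(91, [7, 13]))   # True
print(verify_factorisation(91, [7, 12]))   # False
```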
2 Literature Review
Many attempts have been made to settle the P vs NP problem, mainly centred around computational complexity. One of the techniques proposed to prove P ≠ NP is relativisation. However, Baker, Gill and Solovay [5] showed that no relativizable proof can settle the P versus NP problem in either direction. Another attempt is based on circuit complexity: to show P ≠ NP it would suffice to show that some np-complete problem cannot be solved by relatively small circuits of AND, OR, and NOT gates. However, Furst, Saxe, and Sipser [8] and Razborov [6,7] showed this path
does not lead to a settlement of the P vs. NP problem. Some researchers have relied on proof complexity methods to show P ≠ NP, but as can be seen from the work of Haken [9], this path has not been fruitful either. For other procedures, mainly computational-complexity-based attempts to show P ≠ NP, one can read the interesting article in the Communications of the ACM [4]. Our approach is based on attacking one of the np-complete problems, namely the maximum independent set problem, and proving that there is (or is not) a polynomial time algorithm to solve it. For that purpose we convert the graph into a perfect graph. Perfect graphs are defined as those graphs in which the chromatic number of any induced subgraph is equal to the clique number of that induced subgraph. An interesting property of perfect graphs, conjectured by Berge [10] and proved by Chudnovsky et al. [11] (see Heal [12] for simpler polyhedral proofs), is characterised by the strong perfect graph theorem: a perfect graph is free of odd holes and odd anti-holes. There are many interesting classes of graphs that are perfect, such as bipartite and interval graphs. One useful property of perfect graphs is that there are polynomial time algorithms to find essential graph invariants such as the independence number, clique number and chromatic number [13]. We use this property to try to find solutions for the general problem of the independence number and/or maximum independent set of any graph.
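As a concrete instance of such a polynomial-time computation (a sketch under our own choice of tooling, not the general perfect-graph algorithm of [13]): for bipartite graphs, which are the perfect graphs used later in this paper, Kőnig's theorem gives the independence number as the number of vertices minus the size of a maximum matching, so it follows from a polynomial-time matching computation.

```python
import networkx as nx

def alpha_bipartite(G):
    # Konig's theorem: in a bipartite graph, the maximum matching size equals
    # the minimum vertex cover size, so alpha(G) = |V| - |maximum matching|.
    assert nx.is_bipartite(G)
    matching = nx.max_weight_matching(G, maxcardinality=True)
    return G.number_of_nodes() - len(matching)

print(alpha_bipartite(nx.cycle_graph(6)))                  # even hole C6 -> 3
print(alpha_bipartite(nx.complete_bipartite_graph(3, 3)))  # K3,3 -> 3
```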
3 Transforming Any Graph into a Perfect Graph
Assume we have an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges. An undirected graph is defined by a set of vertices V and a set of edges that are unordered pairs of vertices, Ei = (vj, vk), Ei ∈ E, vj ∈ V and vk ∈ V. A set of vertices that are mutually not connected by an edge is called an independent set or a stable set. If this set cannot be extended by adding more vertices to it, keeping the condition of mutual non-connectivity of its vertices, it is called a maximal independent set. The largest of all maximal independent sets is called a maximum independent set. Finding a maximum independent set of a general graph is a well-known np-complete problem (and enumerating all maximal independent sets is also hard in general). However, for perfect graphs at least the maximum independent set can be found in polynomial time [13]. As per the strong perfect graph theorem, a perfect graph is a graph that is free of odd holes and odd anti-holes [11,12]. A hole is a loop of at least 4 vertices with no chord inside. The anti-hole is the complement of the hole. Odd holes and odd anti-holes are holes and anti-holes whose number of vertices is odd, respectively. See Fig. 1 for an example of an odd hole and an odd anti-hole of 5 vertices.

We transform the graph G into a perfect graph T by replacing every vertex a ∈ V by two vertices a and a'. For any two vertices a, b ∈ G, if a is not connected to b, then their images in T, namely a, a', b, b', are mutually not connected. On the other hand, if a is connected by an edge to b, then a is connected to b' and a' is connected to b in the graph T. By this transformation every odd hole becomes an even hole and every odd anti-hole becomes an even anti-hole; see Fig. 2 and Fig. 3 for an illustration (a code sketch of this transformation is given after the figure captions below).
Fig. 1. (A) An odd hole of size 5. (B) The complement of the odd hole, an anti-hole of size 5; the dashed lines are for the odd hole.
Fig. 2. A graph G, an odd hole (left side), its transformed graph T (middle) and the bipartite graph representation of T (right side).
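A minimal sketch of the transformation just described (it coincides with what is sometimes called the bipartite double cover), using networkx; the pair (v, 0)/(v, 1) plays the role of v and v', and the function name is ours.

```python
import networkx as nx

def transform(G):
    """Build T from G: vertex a becomes (a, 0) and (a, 1); an edge (a, b) of G
    becomes the edges ((a, 0), (b, 1)) and ((a, 1), (b, 0)).  T is bipartite."""
    T = nx.Graph()
    T.add_nodes_from((v, s) for v in G.nodes for s in (0, 1))
    for a, b in G.edges:
        T.add_edge((a, 0), (b, 1))
        T.add_edge((a, 1), (b, 0))
    return T

# The odd hole C5 of Fig. 2 maps to a single even cycle with 10 vertices.
T = transform(nx.cycle_graph(5))
print(nx.is_bipartite(T), T.number_of_nodes(), T.number_of_edges())   # True 10 10
```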
4 Extracting the Independence Number from the Transformed Perfect Graph
If we assume the graph G has n vertices and independence number α(G), then it is easy to see that the independence number of the transformed bipartite graph T is max(n, 2·α(G)): the n non-primed (or the n primed) vertices always form an independent set in T, and the image {v, v' : v ∈ S} of a maximum independent set S of G is an independent set of T of size 2·α(G). So if 2·α(G) ≥ n, we have a polynomial time algorithm to find α(G), since α(T) = 2·α(G) and there is a polynomial time algorithm to find α(T) because T is perfect. Indeed, if we had a polynomial time algorithm to find the two largest maximal independent sets of any perfect graph, that would mean P = NP.
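A small illustrative check of this relation on two tiny examples (the odd hole C5 of Fig. 2 and a star K1,4 of our own choosing); the independence numbers are computed by brute force, which is only viable at this size, and transform is the sketch given in Sect. 3.

```python
from itertools import combinations
import networkx as nx

def transform(G):   # vertex-doubling sketch from Sect. 3, repeated for convenience
    T = nx.Graph()
    T.add_nodes_from((v, s) for v in G.nodes for s in (0, 1))
    T.add_edges_from(e for a, b in G.edges for e in (((a, 0), (b, 1)), ((a, 1), (b, 0))))
    return T

def alpha_bruteforce(G):
    # Exponential-time independence number; fine only for the tiny graphs below.
    nodes = list(G.nodes)
    for k in range(len(nodes), 0, -1):
        for S in combinations(nodes, k):
            if not any(G.has_edge(u, v) for u, v in combinations(S, 2)):
                return k
    return 0

for name, G in [("odd hole C5", nx.cycle_graph(5)), ("star K1,4", nx.star_graph(4))]:
    n, a_G = G.number_of_nodes(), alpha_bruteforce(G)
    a_T = alpha_bruteforce(transform(G))
    print(name, n, a_G, a_T, max(n, 2 * a_G))   # on these examples a_T equals max(n, 2*a_G)
```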
4.1 Characterisation of the Transformed Graph T
We state here some simple lemmas to characterise the maximal independent set of T that is the image of a maximum independent set in G.

Lemma 1. Starting from any vertex as a centre in graph T we can build a tree that is a representation of graph T such that Δab is constant, where Δab is the shortest distance between vertices a and b, {a, b} ⊆ T. Δab is independent of the tree centre and is equal to the smallest number of hops (vertices) between vertices a and b on a path between them, with a and b inclusive.
Fig. 3. A graph G, an odd anti-hole (left side) and its transformed graph T (right side).
Proof. This is a straightforward lemma since ω(T) = 2. Note that Δab is the smallest number of hops (vertices) between vertices a and b on a path between them, with a and b inclusive. As an example, please see Fig. 5. We will use concrete examples to illustrate the different concepts and lemmas in the subsequent text.

Lemma 2. For any vertex v ∈ T the number of hops between v and v', i.e. the distance Δvv', is even and at least 4, i.e. 4, 6, 8, ....

Proof. Since the vertices a, b, c, ... and a', b', c', ... are mutually not connected, respectively, in T from its definition, the possible paths from v to v' are of the form v, r', s, v', i.e. Δvv' is 4, or v, r', s, u', m, v', i.e. Δvv' is 6, and so on, where v, r', s, u', m, v' are all vertices in T.

Lemma 3. If G is connected, then T is connected if and only if G has an odd loop of 3 vertices or more (which is the case if ω(G) ≥ 3 or if we have an odd hole). If ω(G) = 2 and G contains only even loops (holes), then T = T1 ∪ T2 such that each of the graphs T1 and T2 is connected, but T1 is not connected to T2.

Proof. If G has only even loops, i.e. ω(G) = 2 and there are no odd holes in G, then the number of vertices of each loop is even, say 2n, n = 2, 3, 4, .... Such a loop is {1, 2, 3, ..., 2n, 1} and its images in T are the subgraphs {1, 2', 3, 4', ..., 2n−1, (2n)', 1} and {1', 2, 3', 4, ..., (2n−1)', 2n, 1'}, each of which is clearly connected, but the two are not connected to each other. However, if G contains a loop with an odd number of vertices (i.e. ω(G) ≥ 3, or ω(G) = 2 and G contains an odd hole), say the loop {1, 2, ..., 2n, 2n+1, 1}, n = 1, 2, 3, ..., then the image of that loop in T is the even loop {1, 2', 3, 4', ..., 2n+1, 1', 2, 3', ..., (2n+1)', 1}, which is clearly a connected even loop. Now assume G is connected and
there is an odd loop of 3 vertices or more in G; we will show that T for this graph is connected. Let x1 and x2 be two vertices in G and let x1, x1' and x2, x2' be their images in T, respectively; see Fig. 4. Let R be an odd loop of 3 vertices or more and let a, b be two vertices of R such that there is a path from x1 to a and a path from x2 to b; such paths exist since G is connected. It is easy to see that in T there is a path from x1 to a or a', from x1' to a or a', from x2 to b or b', and from x2' to b or b'. Since the image of R in T is connected, i.e. there is a path between any two of a, a', b and b', there is a path between any two vertices of x1, x1', x2 and x2' in T.
Fig. 4. (A) Graph G that contains at least one odd loop R (B) An example graph of graph G (C) The transformation of graph G in B, i.e. Graph T .
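An illustrative check of Lemma 3 on a few small graphs (our own examples): a connected G containing an odd loop yields a connected T, while a connected G with only even loops (or no loops at all) yields two components. The transform helper is the sketch from Sect. 3.

```python
import networkx as nx

def transform(G):   # vertex-doubling sketch from Sect. 3, repeated for convenience
    T = nx.Graph()
    T.add_nodes_from((v, s) for v in G.nodes for s in (0, 1))
    T.add_edges_from(e for a, b in G.edges for e in (((a, 0), (b, 1)), ((a, 1), (b, 0))))
    return T

examples = [
    ("triangle with a pendant vertex", nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])),  # odd loop
    ("even hole C6", nx.cycle_graph(6)),                                             # even loops only
    ("path P4 (no loops)", nx.path_graph(4)),
]
for name, G in examples:
    print(name, nx.number_connected_components(transform(G)))   # 1 for the first, 2 otherwise
```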
Lemma 4. The following are true, assuming we have a tree representation of T starting from an arbitrary vertex v as per Lemma 1.

1. Δab = Δa'b' and Δab' = Δa'b for any two vertices a and b in T, assuming there is a path between any two of those vertices, such as a, b or a, b', etc.
2. The vertices on any circle are mutually not connected to each other, where the vertices on each circle are those having a fixed equal distance from the centre v.
3. The vertices on any circle are either all primed, such as r', s', t', ..., or all not primed, such as r, s, t, ...
Fig. 5. An example of a tree representation of graph T starting at vertex v, Δvx = 4.
4. Vertices that belong to adjacent circles are such that all vertices on one of these circles are primed (not primed) and all vertices on the other circle are not primed (primed), respectively.

Proof. The proof of all the statements is straightforward as a conclusion from the definition of graph T and its symmetry.

Lemma 5. Assume T is represented by a tree with centre v and circles that are 4 hops apart from each other, as in the example in Fig. 5, where the distance between v, circle 1 and circle 2 is 4 hops, i.e. Δvc1 = 4, and also note Δc1c3 = 4. Let Υ be the set formed by the union of vertex v and the vertices that are on circles that are 4 hops apart from each other. Then Υ contains at least one of each pair of vertices (primed or not primed) of a maximal independent set, possibly a maximum independent set, in T which is the image of a maximal (possibly maximum) independent set in G, and that maximal (maximum) independent set in T is formed by extending Υ, taking the vertices in Υ together with their pairs, i.e. we take v, v', u, u', s, s', and so on. The case where there are vertices not on circles that are 4 hops apart will be discussed at the end of the proof of this lemma.
each vertex pair that forms a maximal (maximum) independent set in T, which is an image of a maximal (maximum) independent set in G. To prove this we need to prove the following: (1) the vertices of Υ are not connected to each other; (2) we can extend Υ by adding the vertex pair of each vertex in Υ and the extended set still contains mutually non-connected vertices; and finally (3) the set cannot be extended beyond that, as stated in step 2; we call this extended set 'Υ extension'. It is clear that (1) is true, since the vertices on any circle are either all primed or all not primed and since the distance between any vertices on different circles of Υ is more than 2. Regarding (2), see Fig. 6, where labels of vertices with numbers show possible locations of a vertex pair. For example, for A, which is on circle 2 in black, the possible positions of A′ are A1, A2, A3, A4, A5, ..., since Δuu′ = 4, 6, 8, ... for any vertex u. By removing the vertices on the red circles we are left with the set Υ of vertices, which are mutually not connected to each other. By Lemma 4 the vertices added to Υ are also mutually not connected to each other, since we have excluded all vertices at distance (Δ) 1 from them. However, when we have vertices not on the circles, i.e. not a multiple of 4 hops away from v, we need to consider them separately, as they could be added; for example, if we have vertices on circle 1′ in red and there are no vertices connected to them on circle 2 in black, i.e. when branches terminate at circle 1′ in red. It is easy to see that such a set 'Υ extension' cannot be extended further. To account for all maximal independent sets of pairs we need to change the centre v of the tree by considering all vertices inside circle 2 in black as centres of the tree and repeating the procedure for each one of them; otherwise 'Υ' could be extended beyond 'Υ extension'.
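To make the construction of Υ concrete, the following is a minimal Python sketch (our own illustration, not the authors' code), assuming T is supplied as an adjacency dictionary adj mapping each vertex to the set of its neighbours: it computes the hop distance of every vertex from the chosen centre v by breadth-first search and keeps exactly the vertices whose distance is a multiple of 4, i.e. the set Υ described above.

from collections import deque

def upsilon(adj, v):
    """Vertices of T whose BFS hop distance from the centre v is a multiple of 4 (v inclusive)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:              # first visit gives the shortest hop distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return {u for u, d in dist.items() if d % 4 == 0}

Changing the centre v, as the proof requires, simply means calling the function again with a different starting vertex.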
4.2 Algorithms to Find the Independence Number for Special Graphs
Algorithm I. Assuming the maximum independent set of graph G has a vertex that belongs only to it, an algorithm to find it is as follows. Noting that we have a tree representation of T, as in Fig. 7, start from some arbitrary vertex v (this is to be repeated for all vertices), assign v and v′ the value 1, and move from v and v′ to all their neighbours, which are assigned 0; keep moving (tracing the graph) and assign 1 to the neighbours of the neighbours N(v, v′) of v and v′ (for example v(1) → p′(0) → q(1) → ..., v′(1) → p(1) → q′(0) → ...). However, if any vertex assignment results in two adjacent vertices assigned 1, then we assign that vertex 0. The algorithm is depicted in Algorithm 1. Since the maximum independent set in G has a vertex that belongs only to it, we are sure to end at a maximum independent set of G, since we start from a different vertex in each run of the algorithm and we exhaust all the vertices.

Algorithm II. Another algorithm, keeping the same assumption as the previous section, is as follows. Starting from any vertex u we assign 1 to u and u′; all vertices
Fig. 6. An Example of Graph T and its Tree Representation Starting from Vertex v and the Different Combinations to Find Maximum Maximal Independent Set of Pairs.
one hop away from u and u′, i.e. the neighbours N(u, u′) of u and u′, are assigned 0. We then take another vertex r, different from u, u′ and N(u, u′), and assign 1 to it and to r′; all vertices connected to r or r′, i.e. N(r, r′), are assigned 0. If that is not possible, i.e. we have a conflict because two adjacent vertices would have to be assigned 1 while they were already assigned 0 (or at least one of them was), then r and r′ are assigned 0. We repeat this until we exhaust all vertices. Then we repeat the same procedure all over again, taking a vertex different from u as the starting vertex. The procedure is repeated with every vertex as a starting point. If we assume there are 2n vertices in T then we need to repeat the procedure n times over starting vertices, and for every starting vertex we need to check at most 2n · 2n pairs. Hence the worst case is O(n³) operations. Accordingly, we have a P-algorithm to find a maximum independent set. This can easily be confirmed by noting that the procedure converges to a maximal independent set, and since there is one vertex that belongs only to a maximum independent set and we repeat the procedure taking every vertex of T as a starting point, we must converge to a maximum maximal independent set in T, which is the map of a maximum independent set in G.
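The following is a minimal Python sketch of this procedure (our own rendering, not the authors' implementation), assuming T is given as an adjacency dictionary adj of vertex sets and a map pair sending each vertex to its primed/unprimed partner, both built from G as defined earlier in the paper:

def algorithm_2(adj, pair):
    """Greedy search over all starting vertices for a maximum maximal
    independent set of pairs {r, r'} in T (Algorithm II)."""
    best = set()
    for start in adj:                              # repeat with every vertex as the start
        assigned = {}                              # vertex -> 0 or 1
        order = [start] + [v for v in adj if v != start]
        for r in order:
            if r in assigned:                      # already set to 1, or forced to 0 earlier
                continue
            r2 = pair[r]
            # r and r' may take 1 only if neither the partner nor any neighbour blocks it
            if assigned.get(r2) == 0 or any(assigned.get(n) == 1
                                            for n in adj[r] | adj[r2]):
                assigned[r] = 0
                assigned.setdefault(r2, 0)
            else:
                assigned[r] = assigned[r2] = 1
                for n in (adj[r] | adj[r2]) - {r, r2}:
                    assigned.setdefault(n, 0)      # neighbours of the chosen pair get 0
        ones = {v for v, bit in assigned.items() if bit == 1}
        if len(ones) > len(best):
            best = ones
    return best

Each start vertex costs at most a quadratic number of pair checks, matching the O(n³) bound stated above when all starts are tried; the vertices assigned 1 project back to an independent set of G by dropping the primes.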
1: start at any vertex v ∈ T;
marker:
2: v ← 1, v′ ← 1;
3: if last step makes two adjacent vertices assigned 1 then
4:   v ← 0, v′ ← 0;
5: end if
6: all N(v, v′) ← v̄;
7: if all vertices were visited then
8:   MIS = vertices assigned 1;
9:   Quit;
10: end if
11: ∀r ∈ N(N(v, v′)): set v = r;
12: goto marker;

Algorithm 1: Search Algorithm to Find a Maximum Independent set by Tracing Graph T, where v̄ is the complement of the assigned bit to vertex v.
4.3 Example Graphs
In Fig. 8 we see a graph in (A), i.e. the T graph, and its tree representation in (B). Δuu′ is even and at least 4, for example Δ11′ = Δ22′ = 6 and Δ44′ = 4. Note that Δab = Δa′b′ and Δab′ = Δa′b, for example Δ13 = Δ1′3′ = 5 and Δ15′ = Δ1′5 = 4. If we take 1, 2′, 6′ then we must exclude 3, 4, 5 and likewise 3′, 4′, 5′; thus the maximum maximal independent set in T is 1, 1′, 2, 2′, 6, 6′ and the maximum independent set in G is 1, 2, 6. Now we apply algorithm 2. We start with a vertex, let it be 1, so we set vertices 1 and 1′ to 1. All their neighbours, 4 and 4′, are set to 0. We pick a vertex different from 1, 1′, 4 and 4′ and set it with its pair to 1; let that vertex be 2, for example, so we set 2, 2′ to 1. Now all neighbours of 2 and 2′ are set to 0, i.e. 5 and 5′, and we are left with 3, 3′, 6 and 6′. We can set any of the remaining vertices to 1 and the others to 0; so the maximum independent sets are {1, 2, 6} or {1, 2, 3}. Note that we may converge to a maximal independent set and not to a maximum independent set, but by starting from a vertex that belongs to only one maximum independent set we will surely converge to that maximum independent set. Now following algorithm 1, and starting with vertices 1, 1′, we set both to 1 and start tracing the tree, so 4, 4′ are set to 0 and then 5, 6, 5′, 6′ are all set to 1. However, we now have a conflict: the two adjacent vertices 5 and 6′, as well as 5′ and 6, are set to 1; but by taking only one of 5 or 6 and setting it with its pair to 1 we avoid that conflict. Let us say we set 5 and 5′ to 1; hence the next vertices on the path, 2′, 6′ and 2, 6, are set to 0, and moving from 6 to 3′ (or from 6′ to 3), we set 3 and 3′ to 1. Thus the maximum independent set is {1, 5, 3}.
Fig. 7. Tree Representation of Graph T used in Algorithm 1.
Fig. 8. A Concrete Example of Graph T and its Tree Representation Starting from Vertex 1.
5 Results
We applied algorithm 1 and algorithm 2, using a MacBook Pro (2019) with a 2.3 GHz 8-core Intel Core i9 CPU, 32 GB of memory and macOS Monterey, to some of the DIMACS benchmarks for finding the maximum clique [14]; see Table 1. We see good results for all the graphs except one. Algorithm 2 (ω2 is the clique number estimated by the algorithm) is better than algorithm 1 (ω1 is the estimated clique
Table 1. Some of the DIMACS benchmarks for maximum clique

Graph name or No | ω  | ω1 | time           | ω2 | time
johnson8-2-4     | 4  | 4  | 0.1617 × 10−3  | 4  | 0.0192
MANN-a9          | 16 | 9  | 0.7075 × 10−4  | 16 | 0.0263
hamming6-2       | 32 | 32 | 0.1438 × 10−3  | 32 | 0.0440
hamming6-4       | 4  | 4  | 0.9787 × 10−4  | 4  | 0.0623
johnson8-4-4     | 14 | 14 | 0.1023 × 10−3  | 14 | 0.0656
johnson16-2-4    | 8  | 8  | 0.3476 × 10−3  | 8  | 0.1095
C125.9           | 34 | 26 | 0.4048 × 10−3  | 31 | 0.1841
keller4          | 11 | 11 | 0.7667 × 10−3  | 11 | 0.7368
Table 2. Some Graphs from the House of Graphs for Maximum Independent Set

Graph name and/or No                             | α  | α1 | time           | α2 | time
Hoffman Graph (1167)                             | 8  | 5  | 0.625 × 10−5   | 8  | 0.0020
Hoffman Singleton, BiPartite Double Graph (1169) | 50 | 50 | 0.3691 × 10−3  | 50 | 0.1074
Hoffman Singleton Complement Graph (1171)        | 2  | 2  | 0.2185 × 10−3  | 2  | 0.0780
Hoffman Singleton Graph (1173)                   | 15 | 7  | 0.5590 × 10−3  | 7  | 0.0207
Hoffman Singleton Line Graph (1175)              | 25 | 20 | 0.7714 × 10−3  | 24 | 0.3375
Hoffman Singleton Minus Star Graph (1177)        | 14 | 6  | 0.2527 × 10−4  | 11 | 0.0167
35502                                            | 12 | 8  | 0.4930 × 10−4  | 9  | 0.0160
Hanoi Graph-Sierpinski Triangle Level 5 (35481)  | 81 | 81 | 0.0022         | 81 | 0.7213
number by the algorithm) due to the fact that when we select another vertex, not among the neighbours of the previous vertex that was added to the maximum independent set, there is a better chance of it being a vertex of the maximum independent set. The execution time is extremely fast, with algorithm 1 being the faster of the two. Table 2 shows the independence number of some graphs selected from the graph database House of Graphs [15]. We also see fair results with extremely high speed. α1 and α2 are the independence numbers estimated by algorithm 1 and algorithm 2 respectively. The reason for the failures on a few graphs is that the maximum independent set of these graphs is partitioned into sets of vertices such that each set is a subset of a maximal independent set, so the algorithm converges to a maximal independent set. The execution time is in minutes.
6 Conclusion and Future Work
We proposed a method to settle the P vs. NP problem by solving an NP-complete problem, namely the maximum independent set problem. Our technique transforms any graph into a perfect graph such that the maximum independent set of the source graph is either twice the size of the maximum independent set of the
transformed graph or twice the size of the second largest maximal independent set in the transformed perfect graph. We characterised some important properties of the perfect graph that may help in finding the maximum independent set of the source graph and proposed two algorithms that find the maximum independent set of the source graph in a special case. The results section shows that the algorithms are very fast. As future work we will extend the work of [13] to find the 2nd largest maximal independent set of the perfect graph, since the maximum independent set in the source graph corresponds to either the maximum or the 2nd largest maximal independent set in the transformed perfect graph.
References
1. Carlson, J.A., Jaffe, A., Wiles, A.: The Millennium Prize Problems. American Mathematical Society, Providence, RI; Clay Mathematics Institute, Cambridge, MA (2006)
2. Goldreich, O.: P, NP, and NP-Completeness: The Basics of Computational Complexity. Cambridge University Press (2010)
3. Garey, M.R., Johnson, D.S.: Computers and Intractability, vol. 174. Freeman, San Francisco (1979)
4. Fortnow, L.: The status of the P versus NP problem. Commun. ACM 52(9), 78–86 (2009)
5. Baker, T., Gill, J., Solovay, R.: Relativizations of the P=?NP question. SIAM J. Comput. 4(4), 431–442 (1975)
6. Razborov, A.A.: Lower bounds for the monotone complexity of some Boolean functions. Soviet Math. Dokl. 31 (1985)
7. Razborov, A.A.: On the method of approximations. In: Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing (1989)
8. Furst, M., Saxe, J.B., Sipser, M.: Parity, circuits, and the polynomial-time hierarchy. Math. Syst. Theory 17(1), 13–27 (1984)
9. Haken, A.: The intractability of resolution. Theoret. Comput. Sci. 39, 297–308 (1985)
10. Berge, C.: Färbung von Graphen, deren sämtliche bzw. deren ungerade Kreise starr sind. Wissenschaftliche Zeitschrift (1961)
11. Robertson, N., et al.: The strong perfect graph theorem. Ann. Math. 164(1), 51–229 (2006)
12. Heal, M.H.: Simple proofs of the strong perfect graph theorem using polyhedral approaches and proving P = NP as a conclusion. In: 2020 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE (2020)
13. Grötschel, M., Lovász, L., Schrijver, A.: Geometric Algorithms and Combinatorial Optimization, vol. 2. Springer Science & Business Media (2012)
14. Heal, M., Li, J.: Finding the maximal independent sets of a graph including the maximum using a multivariable continuous polynomial objective optimization formulation. In: Science and Information Conference. Springer, Cham (2020)
15. Brinkmann, G., Coolsaet, K., Goedgebeur, J., Mélot, H.: House of Graphs: a database of interesting graphs. Discrete Appl. Math. 161(1–2), 311–314 (2013). http://hog.grinvin.org
Hadoop Dataset for Job Estimation in the Cloud with Limited Bandwidth

Mohammed Bergui1(B), Nikola S. Nikolov2, and Said Najah1

1 Laboratory of Intelligent Systems and Applications, Department of Computer Science, Faculty of Sciences and Technologies, University of Sidi Mohammed Ben Abdellah, Fez, Morocco
[email protected]
2 Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland

Abstract. Hadoop MapReduce is a well-known open source framework for processing a large amount of data in a cluster of machines; it has been adopted by many organizations and deployed on-premise and on the cloud. MapReduce job execution time estimation and prediction are crucial for efficient scheduling, resource management, better energy consumption, and cost saving. In this paper, we present our new dataset of MapReduce job traces in a cloud environment with limited network bandwidth, and we describe the process of generating and collecting the dataset. We believe that this dataset will help researchers develop new scheduling approaches and improve Hadoop MapReduce job performance.

Keywords: Hadoop · MapReduce · Cloud computing · Bandwidth · Estimating the runtime

1 Introduction
Now, with the development and use of new systems, we are dealing with a large amount of data. Due to the volume, velocity, and variety of this big data, its management, maintenance, and processing require dedicated infrastructures. Apache Hadoop is one of the most well-known big data frameworks [1]; it splits the input data into blocks for distributed storage and parallel processing using the Hadoop distributed file system and MapReduce on a cluster of machines [15]. One of the characteristics of Hadoop MapReduce is the support for public cloud computing, which allows organizations to use cloud services on a pay-as-you-go basis. This is advantageous for small and medium-sized organizations that cannot implement a sophisticated, large-scale private cloud due to financial constraints. Therefore, running Hadoop MapReduce applications in a cloud environment for big data analytics has become a viable alternative for industrial practitioners and academic researchers. Since one of the most critical functions of Hadoop is job and resource management, more efficient management will be achieved if the estimation and prediction of the execution time of a job are done accurately. Also, critical resources
Table 1. The Different VM Types for Hadoop Cluster Deployments

Machine types   | vCPUs | Memory (GB) | Maximum egress bandwidth (Gbps)
e2-standard-2   | 2     | 8           | 4
e2-standard-4   | 4     | 16          | 8
e2-standard-8   | 8     | 32          | 16
e2-standard-16  | 16    | 64          | 16
e2-highmem-2    | 2     | 16          | 4
e2-highmem-4    | 4     | 32          | 8
e2-highmem-8    | 8     | 64          | 16
e2-highmem-16   | 16    | 128         | 16
Standard persistent storage: 500 GB
like CPU, memory, and network bandwidth are shared in a cloud environment and subject to contention. This is an important issue regarding efficient scheduling, better energy consumption, cost saving, congestion detection, and resource management [7,9,12]. Several Hadoop MapReduce performance models have been proposed either for on-premise or cloud deployment [8,10,11,13,14,16]. However, the data generated and collected is not well described. Also, most proposed solutions rely on benchmarks that only process a certain data structure. This article proposes a new dataset of Hadoop MapReduce version 2 job traces in a cloud environment with limited network bandwidth. For this purpose, first, we deploy multiple Hadoop cluster configurations on Google Cloud Platform [5], then we use a big data benchmark to generate synthetic data with different structures; this data is then processed using SQL-like queries. Lastly, we construct our dataset by extracting MapReduce job traces and cluster configuration parameters using our python toolkit that is based on REST APIs provided by the Hadoop Framework [2,3]. The remainder of this paper is organized as follows. Experimental Setup section provides the different types of cluster deployment used in our experiment. The Hadoop MapReduce Job Traces Database section describes the steps to generate synthetic data, process it then collect MapReduce job traces that construct the dataset. Finally, Conclusion and Future Work section concludes the paper.
2 Experimental Setup
In order to estimate the job execution time, we started by deploying eight different clusters, each with four nodes (one master node and three workers), on Google Cloud Platform; Dataproc is a managed Spark and Hadoop service on Google Cloud Platform that can easily create and manage clusters [5]. The version of Dataproc used is 1.5-centos8, which includes CentOS 8 as the operating system, Apache Spark 2.4.8, Apache Hadoop 2.10.1, Apache Hive 2.3.7, and Python 3.7 [4]. Each cluster has a different workers/slaves configuration, ranging from 2
Table 2. Benchmark dataset table names and sizes

Table                        | Size     | Table                          | Size
data/customer                | 128.5 MB | data/warehouse                 | 2.2 KB
data/customer address        | 51.9 MB  | data/web clickstreams          | 29.7 GB
data/customer demographics   | 74.2 MB  | data/web page                  | 98.4 KB
data/date dim                | 14.7 MB  | data/web returns               | 870.4 MB
data/household demographics  | 151.5 KB | data/web sales                 | 21.1 GB
data/income band             | 327 B    | data/web site                  | 8.6 KB
data/inventory               | 16.3 GB  | data refresh/customer          | 1.3 MB
data/item                    | 65.0 MB  | data refresh/customer address  | 537.2 KB
data/item marketprices       | 66.6 MB  | data refresh/inventory         | 168.7 MB
data/product reviews         | 633.2 MB | data refresh/item              | 673.2 KB
data/promotion               | 568.5 KB | data refresh/item marketprices | 26.2 MB
data/reason                  | 38.2 KB  | data refresh/product reviews   | 6.3 MB
data/ship mode               | 1.2 KB   | data refresh/store returns     | 7.3 MB
data/store                   | 33.0 KB  | data refresh/store sales       | 152.5 MB
data/store returns           | 708.6 MB | data refresh/web clickstreams  | 308.5 MB
data/store sales             | 14.7 GB  | data refresh/web returns       | 8.8 MB
data/time dim                | 4.9 MB   | data refresh/web sales         | 219.1 MB
vCPUs to 16 and from 8 GB of memory to 128, while the type and size of storage were kept the same; the master node configuration was the same throughout the experiment, with 4 vCPUs and 16 GB of memory. The types of VMs used in our experiments are shown in Table 1. After each deployment, we changed the replication factor in HDFS from the default value, which is 3, to 1. We also had to change the Hive execution engine from Tez to MapReduce for the experiment. We then limited the workers' maximum network bandwidth four times on each cluster deployment, to 4.6 Gbps, 2.3 Gbps, 1.1 Gbps, and 0.7 Gbps.
3 The Hadoop MapReduce Job Traces Database
This paper proposes a Hadoop MapReduce job traces dataset in a cloud environment with limited network bandwidth1. The dataset includes many Hadoop MapReduce jobs based on multiple processing methods (MapReduce, Pure QL, NLP, ...) and different data structures. It will help researchers develop new scheduling approaches and improve Hadoop MapReduce performance. The dataset has been constructed to predict the amount of intermediate data that needs to be transferred over a limited network and to predict the job execution time regardless of the type of the query statement.
The dataset is available upon request from the corresponding author.
Table 3. The Distribution of the Different Query Types and the Data Types

Query | Data type       | Method              | Query | Data type       | Method
1     | Structured      | UDF/UDTF            | 14    | Structured      | Pure QL
2     | Semi-Structured | Map Reduce          | 15    | Structured      | Pure QL
3     | Semi-Structured | Map Reduce          | 16    | Structured      | Pure QL
4     | Semi-Structured | Map Reduce          | 17    | Structured      | Pure QL
6     | Structured      | Pure QL             | 19    | Un-Structured   | UDF/UDTF/NLP
7     | Structured      | Pure QL             | 21    | Structured      | Pure QL
8     | Semi-Structured | Map Reduce          | 22    | Structured      | Pure QL
9     | Structured      | Pure QL             | 23    | Structured      | Pure QL
10    | Un-Structured   | UDF/UDTF/NLP        | 27    | Un-Structured   | UDF/UDTF/NLP
11    | Structured      | Pure QL             | 29    | Structured      | UDF/UDTF
12    | Semi-Structured | Pure QL             | 30    | Semi-Structured | UDF/UDTF/Map Reduce
13    | Structured      | Pure QL             |       |                 |

3.1 Data Generation
To evaluate the efficiency of Hadoop-based big data systems, we used the TPCx-BB Express Benchmark BB [6]. By executing 30 frequently used analytic queries in the context of retailers, it evaluates the performance of software and hardware components. For structured data, SQL queries can make use of Hive or Spark, while for semi-structured and unstructured data, machine learning methods make use of ML libraries, user-defined functions, and procedural programs. Data Generation in HDFS. In order to populate HDFS with structured, semi-structured, and unstructured data, TPCx-BB uses an extension of the Parallel Data Generation Framework (PDGF). PDGF is a parallel data generator that generates a large amount of data for an arbitrary schema. The existing PDGF can generate the structured part of the benchmark model; however, it cannot generate the unstructured text of the product reviews. First, PDGF is extended to produce a key-value data set for a fixed set of mandatory and optional keys. This is enough to generate the weblog part of the benchmark. To generate unstructured data, an algorithm based on the Markov chain technique is used to produce synthetic text based on sample input text; the initial sample input is a real product review from an online retail store. The benchmark defines a set of scaling factors based on the approximate size of the raw data generated by PDGF in gigabytes. In our experiment, we used a scale factor of 100, resulting in approximately 90 GB of data evenly spread over three data nodes. Table 2 shows the names and sizes of the tables. It should be noted that the sizes of the tables differ between executions of the data generation, i.e., the size of the tables is different on each cluster deployment. The generated data set is mainly unstructured, and structured data accounts for only 20%.
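As a rough illustration of that Markov chain technique (a sketch of our own, not the PDGF extension used by the benchmark), a word-level chain can be trained on a sample review and then sampled to emit synthetic review text:

import random
from collections import defaultdict

def build_chain(sample_text):
    """Word-level Markov chain: map each word to the words observed to follow it."""
    words = sample_text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, length=50, seed=None):
    """Sample a synthetic review by walking the chain word by word."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:                 # dead end: restart from a random word
            word = rng.choice(list(chain))
        else:
            word = rng.choice(followers)
        out.append(word)
    return " ".join(out)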
Specification of Hadoop MapReduce Jobs. We used Hive-based queries to run our experiments, taken from the 30 queries provided by the benchmark [6]; we were able to run 23 queries, and the remaining seven queries rely on frameworks that are not in the scope of our experiment. The queries are complex, containing various clauses such as SELECT, ORDER BY, GROUP BY, CLUSTER BY, and all kinds of JOIN clauses. The distribution of the different query types and the data types they access is illustrated in Table 3.

3.2 Data Collection
The queries generated 101 jobs and 1595 to 1604 tasks per cluster configuration, totaling 3232 jobs and 50732 tasks. In order to collect information about these jobs, we developed a Python toolkit for collecting information about the applications, jobs, job counters, and tasks. The toolkit makes use of the REST APIs provided by the Hadoop framework [2,3] and SSH to collect data about applications, jobs, job counters, tasks, cluster metrics, and framework configuration, by making HTTP requests, connecting to the master node through SSH and parsing JSON and XML files. The features collected and their descriptions for application, cluster metrics, and YARN configuration are shown in Table 4. Job and job counter collected features are shown in Table 5; finally, task collected features are presented in Table 6.

Table 4. Application, Cluster and YARN Collected Features

Object Type: Application, Scheduler, Cluster metrics and framework configuration
How data were acquired: ResourceManager REST APIs allow getting information about the cluster, scheduler, nodes, and applications; connecting through SSH and parsing yarn-site.xml
File: CSV file, applications.csv

Feature name | Type | Description
Application:
  id | string | The application id
  elapsedTime | long | The elapsed time since the application started (in ms)
  memorySeconds | long | The amount of memory the application has allocated
  vcoreSeconds | long | The number of CPU resources the application has allocated
Cluster metrics:
  totalMB | long | The amount of total memory in MB
  totalVirtualCores | long | The total number of virtual cores
  totalNodes | int | The total number of nodes
  networkBandwidth | long | The maximum available network bandwidth for each node
YARN configuration:
  yarn-nodemanager-resource-memory | long | Amount of physical memory, in MB, that can be allocated for containers
  yarn-nodemanager-resource-cpu-vcores | int | Number of vcores that can be allocated for containers
  yarn-scheduler-maximum-allocation | long | The maximum allocation for every container request at the RM in MBs
  yarn-scheduler-minimum-allocation | long | The minimum allocation for every container request at the RM in MBs

Table 5. Job and Job Counters Collected Features

Object Type: Job and MapReduce configuration
How data were acquired: MapReduce History Server REST APIs allow getting the status of finished jobs; connecting through SSH and parsing mapred-site.xml
File: CSV file, jobs.csv

Feature name | Type | Description
Job features:
  id | string | The job id
  startTime | long | The time the job started
  finishTime | long | The time the job finished
  mapsTotal | int | The total number of maps
  reducesTotal | int | The total number of reduces
  avgShuffleTime | long | The average time of the shuffle (in ms)
MapReduce configuration:
  mapreduce-job-maps | int | The number of map tasks per job
  mapreduce-map-cpu-vcores | int | The number of virtual cores to request from the scheduler for each map task
  mapreduce-reduce-memory | long | The amount of memory to request from the scheduler for each reduce task
  mapreduce-job-reduces-lowstartcompletedmaps | float | Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job
  mapreduce-task-io-sort | long | The total amount of buffer memory to use while sorting files, in megabytes

Object Type: Job Counters
How data were acquired: MapReduce History Server REST APIs allow getting the status of finished jobs
File: CSV file, counters.csv

Feature name | Type | Description
Counters features:
  id | string | The job id
  map-input-records | long | The number of records processed by all the maps
  map-output-records | long | The number of output records emitted by all the maps
  reduce-shuffle-bytes | long | Map output copied to reducers
  file-bytes-read-map | long | The number of bytes read by Map tasks from the local file system
  hdfs-bytes-read-map | long | The number of bytes read by Map and Reduce tasks from HDFS

Table 6. Task Collected Features

Object Type: Task
How data were acquired: MapReduce History Server REST APIs allow getting the status of finished tasks
File: CSV file, tasks.csv

Feature name | Type | Description
Task features:
  id | string | The task id
  Job-id | string | The job id
  type | string | The task type (MAP or REDUCE)
  elapsedTime | long | The elapsed time since the application started
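As an illustration of this collection step, the sketch below (our own, not the published toolkit; the host name and the output field subsets are assumptions, and the ports are the Hadoop defaults) pulls application records from the ResourceManager REST API and per-job records, including avgShuffleTime, from the MapReduce History Server REST API, writing them to the CSV files named in Tables 4 and 5:

import csv
import requests

RM = "http://master-node:8088"    # ResourceManager web address (assumed host, default port)
HS = "http://master-node:19888"   # MapReduce JobHistory Server address (assumed host, default port)

def collect_applications(out_csv="applications.csv"):
    """Application-level features from /ws/v1/cluster/apps."""
    fields = ["id", "elapsedTime", "memorySeconds", "vcoreSeconds"]
    apps = requests.get(f"{RM}/ws/v1/cluster/apps").json()["apps"]["app"]
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for app in apps:
            writer.writerow({k: app.get(k) for k in fields})

def collect_jobs(out_csv="jobs.csv"):
    """Job-level features from the history server job list and per-job detail resources."""
    fields = ["id", "startTime", "finishTime", "mapsTotal", "reducesTotal", "avgShuffleTime"]
    jobs = requests.get(f"{HS}/ws/v1/history/mapreduce/jobs").json()["jobs"]["job"]
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for job in jobs:
            detail = requests.get(f"{HS}/ws/v1/history/mapreduce/jobs/{job['id']}").json()["job"]
            writer.writerow({k: detail.get(k) for k in fields})

Counters and tasks can be fetched analogously from the /counters and /tasks sub-resources of each job, and the YARN and MapReduce configuration values come from parsing yarn-site.xml and mapred-site.xml over SSH, as described above.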
4 Conclusion and Future Work
Apache Hadoop is a well-known open-source platform for handling large amounts of data. The runtime of a job needs to be estimated accurately for better management. This paper proposes a Hadoop MapReduce job traces dataset in a cloud environment with limited network bandwidth. A big data benchmark and different cluster deployments are used to generate MapReduce job traces; the dataset contains information about cluster and framework configuration as well as applications, jobs, counters, and tasks. The purpose of this dataset is to help researchers develop new scheduling approaches and improve Hadoop MapReduce job performance. For future work, we plan to extend the proposed dataset to include network bandwidth fluctuations and heterogeneous machine configurations. Also, by extending the dataset, we plan to work on predicting job execution time in a geo-distributed Hadoop cluster.
References
1. Apache hadoop
2. Apache hadoop 2.10.1 – resourcemanager rest apis
3. Apache hadoop mapreduce historyserver – mapreduce history server rest apis
4. Dataproc image version list – dataproc documentation – google cloud
5. Dataproc – google cloud
6. Tpcx-bb express big data benchmark
7. Alapati, S.R.: Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS, 1st edn. Addison-Wesley Professional (2016)
8. Ceesay, S., Barker, A., Lin, Y.: Benchmarking and performance modelling of mapreduce communication pattern. In: 2019 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pp. 127–134 (2019)
9. Heidari, S., Alborzi, M., Radfar, R., Afsharkazemi, M., Ghatari, A.: Big data clustering with varied density based on mapreduce. J. Big Data 6, 08 (2019)
10. Kadirvel, S., Fortes, J.A.B.: Grey-box approach for performance prediction in mapreduce based platforms. In: 2012 21st International Conference on Computer Communications and Networks (ICCCN), pp. 1–9 (2012)
11. Khan, M., Jin, Y., Li, M., Xiang, Y., Jiang, C.: Hadoop performance modeling for job estimation and resource provisioning. IEEE Trans. Parallel Distrib. Syst. 27(2), 441–454 (2016)
12. Singh, R., Kaur, P.: Analyzing performance of apache tez and mapreduce with hadoop multinode cluster on amazon cloud. J. Big Data 3, 10 (2016)
13. Song, G., Meng, Z., Huet, F., Magoules, F., Yu, L., Lin, X.: A hadoop mapreduce performance prediction method. In: 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, pp. 820–825 (2013)
14. Tariq, H., Al-Sahaf, H., Welch, I.: Modelling and prediction of resource utilization of hadoop clusters: a machine learning approach. In: Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2019, pp. 93–100. Association for Computing Machinery, New York (2019)
15. White, T.: Hadoop: The Definitive Guide, 4th edn. O'Reilly Media Inc. (2015)
16. Zhang, Z., Cherkasova, L., Loo, B.T.: Benchmarking approach for designing a mapreduce performance model. In: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ICPE 2013, pp. 253–258. Association for Computing Machinery, New York (2013)
Survey of Schema Languages: On a Software Complexity Metric

Kehinde Sotonwa(B), Johnson Adeyiga, Michael Adenibuyan, and Moyinoluwa Dosunmu

Bells University of Technology, Ota, Nigeria
[email protected]
Abstract. Length in Schema (LIS) is a numerical measurement of the schema documents (DS) of extensible markup language (XML) that contain schemas from an xml schema language in manuscript form. LIS is likened to source lines of code (SLOC) in software complexity, which is used to calculate the amount of effort that will be required to develop a schema document. Different LIS were considered, such as Blank Length in Schema (BLIS), Total Length in Schema (TLIS), Commented Length in Schema (CLIS) and Effective Length in Schema (ELIS), for sixty (60) different schema documents acquired online through the Web Services Description Language (WSDL) and implemented in two schema languages, Relax-NG (rng) and W3C XML Schema (wxs), to estimate schema productivity and maintainability. It was discovered that the overall understandability and flexibility of schemas become much easier, with less maintenance effort, in rng than in wxs.

Keywords: Relax-NG (rng) · W3C XML schema (wxs) · Schema documents (DS)
1 Introduction
The increased complexity of modern software applications also increases the difficulty of making the code reliable and maintainable. Code metrics are a set of software measures that provide developers with better insight into the code they are developing. By taking advantage of code metrics, developers can understand which types and/or methods should be reworked or more thoroughly tested. Code complexity should be measured as early as possible in coding [1, 4] to locate complex code, in order to obtain high quality software with a low cost of testing and maintenance. It is also used to compare, evaluate and rank competitive programming applications [2–4]. Code based complexity measures comprise the lines of code/source lines of code metric, the Halstead complexity measure and the McCabe cyclomatic complexity measure, but this paper only considers the lines of code metric in relation to xml schema documents. Xml is a dedicated data-description language used to store data [5]. It is often used in web development to save data into xml files. XSLT APIs are used to generate content in required formats such as HTML, XHTML and XML, to allow developers to transfer data and to save configuration or business data for applications [6].
Xml is a markup language created by the World Wide Web Consortium (W3C) to define a syntax for encoding documents that both humans and machines can read. It does this through the use of tags that define the structure of the document, as well as how the document should be stored and transported. Data representation and transportation formats accepted in diverse fields are made by designing a schema, and this can be written in a number of xml schema languages. A schema is a formal definition of the syntax of an xml-based language that defines a family of xml documents. An xml schema language is a description of a type of xml document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by xml itself [4, 7, 8]. These constraints are generally expressed using some combination of grammatical rules governing the order of elements. A schema language is a formal language for expressing schemas [9]. There are a number of schema languages available: dtd [10], wxs [11, 12], rng [13, 14], schematron [15–17], etc. Length in schema, similar to lines of code, is generally considered as the count of lines in the schema of xml documents, which also considers the validated documents [4, 18]. LIS counts the lines of schema files implemented in rng and wxs and is independent of what the schema documents are used for. The LIS evaluates the complexity of the software via its physical length. The xml documents in this paper were acquired online and implemented in rng and wxs.
2 Review of Related Works
Harrison et al. [19] measured the lines of code (LOC) metric; code written in one programming language may be much more effective than code in another, so two programs that provide the same functionality written in two different languages may have different LOC values, and because of this the metric neglects all other factors that affect the complexity of software. Vu Nguyen et al. [20] presented a set of counting standards that defines what and how to count SLOC; it was suggested that this problem can be alleviated by the use of a reasonable and unambiguous counting standard guide with the support of a configurable counting tool. Bhatt et al. [21] evaluated some drawbacks in SLOC metrics that affect the quality of software, because the SLOC metric output is used as an input in other software estimation methods like the COCOMO model. Amit and Kumar [22] formulated a metric that counted the number of lines of code but neglected the intelligence content, layout and other factors that affect the complexity of the code. Sotonwa et al. [2, 3] proposed a metric that counts the number of lines of code, commented lines and non-commented lines for various object oriented programming languages. Sotonwa et al. [4] also applied SLOC metrics to xml schema documents in order to estimate schema productivity and maintainability. Those SLOC metrics were based on only one xml schema language (RNG). There is a need to take this work further by comparing different schema languages with SLOC metrics, as was done with different object oriented programming languages, to show the effectiveness of the flexibility and understandability of schemas with respect to less maintenance effort.
3 Materials and Methods
3.1 Experimental Setup of LIS Metric
The LIS metric investigates a code based complexity metric, like source lines of code (SLOC) in complexity metrics, using schema documents implemented in the xml schema languages rng and wxs, for the length of the schemas and to find the complexity variations of the different implementation languages. The following approaches were applied:
• Length in schema was implemented in the rng and wxs xml schema languages.
• Evaluation of the different types of LIS for different implementations of schema documents in rng and wxs.
• Comparison of the results from the two (2) schema languages, rng and wxs.
• Analysis of variance of the schema languages, developed and confirmed as an explanation for the observed data.
The metric is applied to sixty (60) different schema files acquired online through web services description languages and implemented in rng and wxs; rng and wxs codes differ from each other in their architecture [23–40]. The following types of LIS were considered for each schema document:
• Total length in schema (TLIS): as is obvious from its name, it counts the number of lines in the source code of the schema. It counts every line, including commented and blank lines.
• Blank length in schema (BLIS): this counts only the blank lines in the source code. These lines only make the code look spacious enough and easy to comprehend; with or without BLIS in the code, the code will still execute.
• Commented length in schema (CLIS): counts lines of code that contain comments only.
• Effective length in schema (ELIS): counts lines of code that are not commented, blank, standalone braces or parentheses. This metric presents the actual work performed within the code. It counts all executable lines of code. The equation is defined as:
ELIS_SD.rng = TLIS_SD.rng − (BLIS_SD.rng + CLIS_SD.rng)    (1)
ELIS_SD.wxs = TLIS_SD.wxs − (BLIS_SD.wxs + CLIS_SD.wxs)    (2)
where ELIS is the effective length in schema, SD is the schema document (rng or wxs), TLIS is the total length in schema, BLIS is the blank length in schema and CLIS is the commented length in schema.
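As an illustration, the following is a minimal Python sketch of how these counts can be obtained for one schema document (the function and the comment-only heuristic are our own, not the authors' tooling; standalone braces or parentheses are not treated specially here):

def lis_metrics(path):
    """Compute TLIS, BLIS, CLIS and ELIS for one schema document (.rng or .xsd)."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    tlis = len(lines)                                   # total lines
    blis = sum(1 for ln in lines if not ln.strip())     # blank lines
    clis = 0                                            # lines that are comments only
    in_comment = False
    for ln in lines:
        stripped = ln.strip()
        if in_comment or stripped.startswith("<!--"):
            clis += 1
        if "<!--" in ln:                                # track multi-line <!-- ... --> comments
            in_comment = "-->" not in ln.split("<!--", 1)[1]
        elif "-->" in ln and in_comment:
            in_comment = False
    elis = tlis - (blis + clis)                         # Eqs. (1) and (2)
    return {"TLIS": tlis, "BLIS": blis, "CLIS": clis, "ELIS": elis}

Applied to a schema such as the rentalProperties document below, the returned dictionary corresponds to the hand-computed values shown after each figure.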
A demonstration of the proposed metric for rentalProperties, contact-info and note, implemented in rng, is given in Fig. 1, Fig. 2 and Fig. 3, and their analyses are also given for the different variations of the LIS metric.
Fig. 1. Schema document for RentalProperties in rng.
TLIS_rentalProperties.rng = 46
BLIS_rentalProperties.rng = 0
CLIS_rentalProperties.rng = 2
ELIS_rentalProperties.rng = TLIS_rentalProperties.rng − (BLIS_rentalProperties.rng + CLIS_rentalProperties.rng) = 46 − (0 + 2) = 44
Fig. 2. Schema document for contact-info in rng
TLIS_contact-info.rng = 19
BLIS_contact-info.rng = 1
CLIS_contact-info.rng = 5
ELIS_contact-info.rng = TLIS_contact-info.rng − (BLIS_contact-info.rng + CLIS_contact-info.rng) = 19 − (1 + 5) = 13
Fig. 3. Schema document for note in rng
TLIS_note.rng = 23
BLIS_note.rng = 3
CLIS_note.rng = 3
ELIS_note.rng = TLIS_note.rng − (BLIS_note.rng + CLIS_note.rng) = 23 − (3 + 3) = 17

A demonstration of the proposed metric for rentalProperties, contact-info and note, implemented in wxs, is given in Fig. 4, Fig. 5 and Fig. 6, and their analyses are also given for the different variations of the LIS metrics.
Fig. 4. Schema document for RentalProperties in wxs
TLIS_rentalProperties.wxs = 46
BLIS_rentalProperties.wxs = 1
CLIS_rentalProperties.wxs = 23
ELIS_rentalProperties.wxs = TLIS_rentalProperties.wxs − (BLIS_rentalProperties.wxs + CLIS_rentalProperties.wxs) = 46 − (1 + 23) = 22
Fig. 5. Schema document for contact-info in wxs
TLIS_contact-info.wxs = 14
BLIS_contact-info.wxs = 1
CLIS_contact-info.wxs = 4
ELIS_contact-info.wxs = TLIS_contact-info.wxs − (BLIS_contact-info.wxs + CLIS_contact-info.wxs) = 14 − (1 + 4) = 9
Fig. 6. Schema document for note in wxs
TLIS_note.wxs = 14
BLIS_note.wxs = 0
CLIS_note.wxs = 1
ELIS_note.wxs = TLIS_note.wxs − (BLIS_note.wxs + CLIS_note.wxs) = 14 − (0 + 1) = 13
Table 1. Complexity measures for comparing rng and wxs schema documents

S/no | Schemas             | TLISrng | ELISrng | BLISrng | CLISrng | TLISwxs | ELISwxs | BLISwxs | CLISwxs
1    | RentalProperties    | 46  | 44 | 0 | 2 | 46 | 22 | 1 | 23
2    | LinearLayout        | 68  | 58 | 4 | 6 | 52 | 27 | 1 | 24
3    | Weather-observation | 97  | 92 | 1 | 4 | 75 | 43 | 1 | 31
4    | Cookingbook         | 37  | 32 | 1 | 4 | 32 | 24 | 2 | 6
5    | myShoeSize          | 12  | 8  | 1 | 3 | 13 | 11 | 0 | 2
6    | Documents           | 38  | 37 | 0 | 1 | 50 | 39 | 0 | 11
7    | Supplier            | 26  | 24 | 1 | 1 | 24 | 15 | 2 | 7
8    | Customer            | 19  | 17 | 1 | 1 | 23 | 14 | 1 | 8
9    | Contact             | 17  | 15 | 1 | 1 | 16 | 8  | 1 | 7
10   | Books               | 42  | 32 | 3 | 7 | 43 | 26 | 2 | 15
11   | Saludar             | 40  | 30 | 8 | 2 | 37 | 28 | 1 | 8
12   | Portfolio           | 24  | 21 | 1 | 2 | 21 | 11 | 1 | 9
13   | Breakfast_menu      | 26  | 24 | 1 | 1 | 25 | 14 | 1 | 10
14   | Investments         | 25  | 16 | 1 | 8 | 20 | 12 | 1 | 7
15   | Library             | 110 | 97 | 4 | 9 | 53 | 34 | 3 | 16
16   | Contact-info        | 19  | 13 | 1 | 5 | 14 | 9  | 1 | 4
17   | Students            | 17  | 16 | 0 | 1 | 22 | 14 | 2 | 6
18   | Configuration-file  | 24  | 20 | 0 | 4 | 24 | 18 | 0 | 6
19   | Shiporder           | 49  | 43 | 3 | 3 | 32 | 21 | 0 | 11
20   | Bookstore           | 30  | 26 | 1 | 3 | 30 | 17 | 1 | 12
21   | Dataroot            | 29  | 25 | 2 | 2 | 66 | 52 | 2 | 12
22   | Dictionary          | 29  | 25 | 1 | 3 | 34 | 21 | 1 | 12
23   | Catalog             | 33  | 31 | 0 | 2 | 30 | 16 | 0 | 14
24   | Soap                | 34  | 27 | 0 | 7 | 36 | 24 | 1 | 11
25   | PurchaseOrder       | 68  | 62 | 0 | 6 | 60 | 30 | 1 | 29
26   | Letter              | 67  | 60 | 0 | 7 | 62 | 40 | 0 | 22
27   | Clients             | 31  | 27 | 1 | 3 | 33 | 20 | 2 | 11
28   | ZCSImport           | 32  | 30 | 1 | 1 | 36 | 20 | 1 | 15
29   | Guestbook           | 17  | 16 | 0 | 1 | 20 | 14 | 0 | 6
30   | Note                | 24  | 18 | 3 | 3 | 14 | 13 | 0 | 1
4 Result and Discussion
In this section, we present the results from the series of experiments conducted to show the efficiency of the proposed metric. The applicability of LIS to schema documents, showing the effort required to understand their information content when implemented in rng and wxs, is given for all schema documents in Table 1. Column 1 of the table displays the serial numbers of all the schema documents, column 2 lists the thirty (30) schema documents, and columns 3 to 6 and columns 7 to 10 give the complexity values calculated for each LIS (TLIS, BLIS, CLIS and ELIS) in rng and wxs respectively.
The relative graph depicted in Fig. 7 exhibits all complexity values in rng and wxs for BLIS and CLIS. Comparing the different LIS values for the sample schema documents given above (rentalProperties, contact-info and note) in rng and wxs shows that BLIS_rentalProperties.rng = 0, BLIS_contact-info.rng = 1, BLIS_note.rng = 3 and BLIS_rentalProperties.wxs = 1, BLIS_contact-info.wxs = 1, BLIS_note.wxs = 0 all have close complexity values in both rng and wxs; this is because BLIS counts just blank lines, and with or without these lines the schemas will still validate, even though wxs does have empty elements and whitespace. An empty element in wxs does not mean that the line of the schema is blank; it is just that the element has no content at all. Likewise, whitespace is of two types: significant whitespace and insignificant whitespace. Significant whitespace occurs within elements which contain text and markup together, while insignificant whitespace is space where only element content is allowed. On the other hand, the schema documents have CLIS_rentalProperties.rng = 2, CLIS_contact-info.rng = 5, CLIS_note.rng = 3, which are also close complexity values in rng, but CLIS_rentalProperties.wxs = 23, CLIS_contact-info.wxs = 4, CLIS_note.wxs = 1 do not have close complexity values, because wxs is quite verbose and has weak structural support for unordered content; this made it produce more CLIS in wxs than in rng, thus making it more complex and difficult to understand compared with rng, which is easier, lightweight and has richer structure options. Figure 8 shows the comparison between TLIS and ELIS in rng and wxs; all the schema documents presented have larger complexity values for TLIS than for ELIS in both rng and wxs. For example, TLIS_rentalProperties.rng : ELIS_rentalProperties.rng = 46:44, TLIS_contact-info.rng : ELIS_contact-info.rng = 19:13, TLIS_note.rng : ELIS_note.rng = 24:18 and TLIS_rentalProperties.wxs : ELIS_rentalProperties.wxs = 46:22, TLIS_contact-info.wxs : ELIS_contact-info.wxs = 14:9, TLIS_note.wxs : ELIS_note.wxs = 14:13 respectively. Comparing the complexity values in rng and wxs for TLIS and ELIS, TLIS has larger values in both rng and wxs because it is the overall total of all the lines of the schema, including the blank, commented and effective lines, while ELIS covers just the logical schema, i.e. the actual lines in a document that make the schema validate; in this case commented and blank lines are not considered. Finally, a general comparison of the whole set of complexity values between rng and wxs reveals that rng had larger values for more than two thirds of the schema documents presented, due to the greater diversity of the elements in rng (i.e. the appearance of the elements in any order); hence rng gains more regularity and reusability traits from the high frequency of occurrence of similarly structured elements and, as a result, encourages leveraging existing schema documents instead of building new schemas from scratch.
Fig. 7. Relative graph for schema documents of LIS in rng and wxs
Fig. 8. Comparison of TLIS and ELIS in rng and wxs
5 Conclusion
Length in lines of any code or schema is widely used and universally accepted because it permits comparison of the size and productivity metrics of diverse development groups. It relates directly to the end product and is easily measured upon project completion. It measures software from the developers' point of view, that is, what a line of code (and likewise a line of schema) actually does, and in return aids the continuous improvement activities that exist for estimation techniques. For the comparison of the different LIS in rng and wxs, it was discovered that rng exhibits a better presentation of schema documents, with a high degree of flexibility, reusability and comprehensibility, which assists the developer in gaining more familiarity with the schema language structure because of the strong support for class elements to appear in any order in rng; wxs is also good, but has weak support for unordered content.
References
1. Elliot, T.A.: Assessing fundamental introductory computing concept knowledge in a language independent manner. Ph.D. Dissertation, Georgia Institute of Technology, USA (2010)
2. Sotonwa, K.A., Olabiyisi, S.O., Omidiora, E.O.: Comparative analysis of software complexity of searching algorithms using code base metric. Int. J. Sci. Eng. Res. 4(6), 2983–2992 (2014)
3. Sotonwa, K.A., Balogun, M.O., Isola, E.O., Olabiyisi, S.O., Omidiora, E.O., Oyeleye, C.A.: Object oriented programming languages for searching algorithms in software complexity metrics. Int. Res. J. Comput. Sci. 4(6), 2393–9842 (2019)
4. Sotonwa, K.A., Olabiyisi, S.O., Omidiora, E.O., Oyeleye, C.A.: SLOC metric in RNG schema documents. Int. J. Latest Technol. Eng. Manag. Appl. Sci. 8(2), 1–5 (2019)
5. Gavin, B.: What is an XML file, and how do I open one (2018)
6. RoseIndia.Net: Why XML is used for? (2018)
7. Makoto, M., Dongwon, L., Murali, M.: Taxonomy of XML schema languages using formal language theory. In: Extreme Markup Languages, pp. 153–166. Mulberry Technologies, Inc. (2001)
8. Sotonwa, K.A.: Comparative analysis of XML schema languages for improved entropy metric. FUOYE J. Eng. Technol. 5(1), 36–41 (2020)
9. Satish, B.: Introduction to XML part 1: XML Tutorial
10. Bray, T., Jean, P., Sperberg-McQueen, M.C. (eds.): Extensible markup language (XML) 1.0. W3C recommendation (2012). http://www.w3.org/TR/1998/REC-xml-19980210.html
11. Binstock, C., Peterson, D., Smith, M., Wooding, M., Dix, C., Galtenberg, C.: The XML Schema Complete Reference. Addison Wesley Professional Publishing Co. Inc., Boston (2002). ISBN: 0672323745
12. Thompson, H.S., Beech, D., Muzmo, M., Mendelsohn, N. (eds.): XML schema part 1: Structures. W3C recommendation (2004). http://www.w3.org/TR/xmlschema-1/
13. Makoto, M.: RELAX (regular language description for XML) ISO/IEC DTR 22250-1, Document Description and Processing Languages -- Regular Language Description for XML (RELAX) -- Part 1: RELAX Core (2000)
14. ISO: ISO/IEC TR 22250-1:2002 - Information Technology -- Document description and processing languages -- Regular Language Description for XML (RELAX) -- Part 1: RELAX Core, First Edition, Technical Committee 36 (2002)
15. Makoto, M., Dongwon, L., Murali, M., Kohsuke, K.: Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5(4), 660–704 (2005)
16. Gill, G.K., Kemerer, C.F.: Cyclomatic complexity density and software maintenance. IEEE Trans. Softw. Eng. 17, 1284–1288 (1991)
17. Sotonwa, K.A., Olabiyisi, S.O., Omidiora, E.O., Oyeleye, C.A.: Development of improved schema entropy and interface complexity metrics. Int. J. Res. Appl. Sci. Eng. Technol. 7(I), 611–621 (2019)
18. Balogun, M.O., Sotonwa, K.A.: A comparative analysis of complexity of C++ and Python programming languages using multi-paradigm complexity metric (MCM). Int. J. Sci. Res. 8(1), 1832–1837 (2019)
19. Harrison, W., Magel, K., Kluczny, R., DeKock, A.: Applying Software Complexity Metrics to Program Maintenance. IEEE Computer Society Press, Los Alamitos (1982)
20. Vu, N., Deeds-Rubin, S., Thomas, T., Boehm, B.: A SLOC counting standard. Center for Systems and Software Engineering, University of Southern California (2007)
21. Bhatt, K., Vinit, T., Patel, P.: Analysis of source lines of code (SLOC) metric. Int. J. Emerg. Technol. Adv. Eng. 2(2), 150–154 (2014)
22. Amit, K.J., Kumar, R.: A new cognitive approach to measure the complexity of software. Int. J. Softw. Eng. Appl. 8(7), 185–198 (2014)
23. http://docbook.sourceforge.net/release/dsssl/current/dtds/
24. http://java.sun.com/dtd/
25. http://www.ncbi.nlm.nih.gov/dtd/
26. http://www.cs.helsinki.fi/group/doremi/publications/XMLSCA2000.Html
27. http://www.w3.org/TR/REC-xml-names/
28. http://www.omegahat.org/XML/DTDs/
29. http://www.openmobilealliance.org/Technical/dtd.aspx
30. http://fisheye5.cenqua.com/browse/glassfish/update-center/dtds/
31. http://www.python.org/topics/xml/dtds/
32. http://www.okiproject.org/polyphony/docs/raw/dtds/
33. http://www.w3.org/XML/. Accessed 2008
34. http://ivs.cs.uni-magdeburg.de/sw-eng/us/metclas/index.shtml. Accessed 2008
35. http://www.xml.gr.jp/relax. Accessed 2008
36. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/. Accessed 2008
37. http://www.w3.org/TR/2001/PR-xmlschema-0-20010330/. Accessed 2008
38. http://www.w3.org/TR/1998/REC-xml-19980210. Accessed 2008
39. http://www.xfront.com/GlobalVersusLocal.html. Accessed 2008
40. http://www.oreillynet.com/xml/blog/2006/05/metrics_for_xml_projects_1_ele.html. Accessed 2008
Bohmian Quantum Field Theory and Quantum Computing

F. W. Roush(B)

Alabama State University, Montgomery, AL 36101-0271, USA
[email protected]

Abstract. Abrams and Lloyd proved that if quantum mechanics has a small nonlinear component then theoretical quantum computers would be able to solve NP-complete problems. We show that a semiclassical theory of electrodynamics in which the fermions are quantized but the electromagnetic field is not, and in which the particles and the field interact in a natural way, does have such a nonlinear component. We argue that in many situations this semiclassical theory will be a close approximation to quantum field theory. In a more speculative argument, we discuss the possibility that the apparent quantization of the electromagnetic field could be a result of (1) quantization of interactions of the electromagnetic field with matter and (2) wave packets, regions within the electromagnetic field that are approximate photons. At the least this gives a theory which, if crude, avoids the major divergences of standard quantum field theory. We suggest how this might be extended to a quantum theory of the other three forces by modifying the Standard Model and using a model of gravity equivalent to spin 2 gravitons. This also provides a quantum field theory that agrees with Bohmian ideas.

Keywords: Quantum computing · NP-complete · Semiclassical field theory · Bohmian mechanics · Unified field theory
1 Introduction
Standard quantum mechanics is a theory that can predict motions of particles that are also waves. It can do this by computing the wave equation, the Schrödinger equation, which is a standard linear partial differential equation, like the heat equation except for a factor i = √−1, to get solutions which are wave functions ψ. Then |ψ|² gives the probability density function for positions of particles such as a system of electrons. There are also alternative formulations of quantum mechanics due to Heisenberg, Dirac, and Feynman, which use matrices or path integrals to derive versions of ψ. The motion of electrons around nuclei is quantized in terms of operators occurring in the Schrödinger equation which represent energy. Angular momentum is also quantized. The allowed energy levels are the eigenvalues of these operators, and radiation occurs when a particle jumps from one state to another; its magnitude is the difference of the energy levels in the two states.
As will be mentioned later, Planck's law E = hν relating energy E, frequency ν, and Planck's constant h, as well as the law p = (h/2π)k relating momentum p and wave number k, are satisfied. Essentially all of chemistry is an application of quantum mechanics. Quantum mechanics deals with a system of a given number of particles of matter, called fermions, such as protons and electrons, and their interactions. Quantum field theory is a refinement of quantum mechanics which includes fields as well as matter particles. The fields are considered as quantized into particles called bosons, such as photons. There are many notorious mathematical difficulties with quantum field theory, such as divergent series and infinite masses and operators that don't lie in Hilbert spaces, but where calculations have been possible, it agrees with experiment to high accuracy. Here we focus on a theory which is intermediate between the two, and can be called semiclassical. The interaction of matter particles and fields is considered; only the matter particles are fundamentally quantized, though the fields inherit some quantum properties from the fermions. Almost none of the ideas presented here is new, but possibly the combination of ideas is new. A motivation for this method is the desire to see whether a more powerful quantum computer could be built using effects from general quantum field theory and not just quantum mechanics and a separate theory of photons. This is difficult using the standard perturbative methods because so many computations diverge (in fact even the perturbative series of quantum electrodynamics diverge in all nontrivial cases). There is no good theory of the quantum vacuum, and this is important because the zero-point energy of the vacuum, which seems theoretically to be infinite, can affect real experiments. Even if the model presented next is a gross oversimplification, it might still allow computations to be made which are approximately correct. The primary goal of this paper is to bring the goal of a quantum computer that could solve NP-complete problems a little closer by proposing a nonlinear quantum dynamics that is consistent with the theorem of Abrams and Lloyd that such computers would exist provided that quantum mechanics has a small nonlinear component. The particular mechanics involved is a model of quantum electrodynamics that is first quantized but not second quantized. In many situations this will closely approximate the results from a second quantized model, and it is in some respects a lot easier to work with. A second goal is to discuss the question whether this model might actually be a correct theory of quantum electrodynamics. This requires a discussion of what photons are. This and the final part is speculative. A final goal is to indicate how this might be extended to other forces for a unified field theory. This theory fits well with Bohmian ideas, but does not require them.
2 The Abrams-Lloyd Theorem
It was proved by Abrams and Lloyd [1] that if quantum mechanics has a small nonlinear component, then quantum computers could solve NP-complete problems in polynomial time. Abrams and Lloyd give two constructions, of which the first is more general and the second more fault-tolerant. By the Valiant-Vazirani theorem [8] it is enough to work with the case in which an NP problem has at most one solution. We might take a problem of the form f(v) = 1, where f is a Boolean function of n binary inputs. By the construction used for Forrelation, a standard quantum computer with n qubits can obtain a unit vector of size N = 2^n such that the entries of the vector are ±1/2^{n/2}, and the sign of the entry indexed by v (in binary notation) is (−1)^{f(v)}. This can then be transformed to a fairly arbitrary unit vector by quantum operations. Then Abrams and Lloyd use the result that almost any nonlinear dynamical system given by a transformation f on a state space V will have exponential sensitivity to initial conditions, that is, a positive Lyapunov exponent, so that the iterates f^n will magnify distances exponentially as n increases. The distance between a vector for one solution and the vector for no solutions is exponentially small, but these iterates will magnify the difference until a quantum computer can measure it. The second algorithm of Abrams and Lloyd involves applying a gate to pairs of coordinates, so that if both are positive the results are stable, but if one is negative then the results become measurably smaller. We first use pairs of indices differing by 2, then by 4, and so on; equivalently, this works with different qubits. This will produce a relatively significant difference in a significant fraction of all entries, and again, this can be measured. If there is good relative accuracy for the individual operations, then the result will be accurate.
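To make the magnification argument quantitative (an order-of-magnitude estimate of my own, not taken from [1]): if the state vectors for the one-solution and the no-solution cases start a distance d_0 apart, of order 2^{-n/2}, and the nonlinear map has Lyapunov exponent λ > 0, then after k iterations the separation is roughly
\[ d_k \approx d_0\, e^{\lambda k}, \qquad d_0 \sim 2^{-n/2}, \]
so a separation of order 1, large enough to measure, is reached after about k ≈ (n ln 2)/(2λ) iterations, which is linear in n and hence polynomial in the problem size.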
3 Lagrangians and Yang-Mills Fields
In classical field theory, the Lagrangian is the difference L = T − V of kinetic and potential energy. The least action principle, which is equivalent to Newton's laws for conservative forces, says that the action S, the time integral of the Lagrangian, is a minimum, or at least a critical point, for the actual trajectory of an object, as compared with other possible paths. This will also be true in the theory here. The Euler formula for maximizing or minimizing a definite integral
\[ \int_a^b F(t, f, f')\,dt, \qquad f' = \frac{df}{dt}, \]
over functions f(t) with given values at the endpoints is
\[ 0 = \frac{\partial F}{\partial f} - \frac{d}{dt}\,\frac{\partial F}{\partial f'}. \]
The Euler formula can be extended to more variables and higher derivatives in a natural way, and constraints can be incorporated with Lagrange multipliers. For example, in a flat space with no forces, V = 0, T = mv²/2, and the least action principle says particles will move in a straight line with constant velocity.
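As a one-line check of the free-particle case just mentioned (my own worked example): take F = L = m(f′)²/2 with V = 0; the Euler formula then gives
\[ 0 = \frac{\partial L}{\partial f} - \frac{d}{dt}\,\frac{\partial L}{\partial f'} = -\frac{d}{dt}\bigl(m f'\bigr), \]
so the momentum mf′ is constant and the trajectory is a straight line traversed at constant velocity.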
The theories of general fields, such as the color field which affects quarks, are often stated in terms of Lagrangians, but using other principles. Schwinger and Tomonaga used a modified variational principle. This is
\[ \delta\langle B|A\rangle = i\,\langle B|\,\delta S\,|A\rangle, \]
where |A⟩ and |B⟩ are early and late states of a quantum system, S is the action, and δ is the functional derivative (as in Euler's equation above). Feynman used the principle that the integral over all paths of the exponentiated action, e^{2πiS/h}, where S is the time integral of the Lagrangian, determines the transition matrix between the wave function at one point of space-time and the wave function at another. Path integrals involve some mathematical difficulties but can be approximated by breaking paths into polygonal segments and using a Gaussian probability distribution. This principle applies to fields as well as particles. In the process we need to sum over all possibilities for interactions of particles, such as absorption or emission of photons by electrons. Yang-Mills theory associates a Lagrangian formula with a continuous group G. It is the way in which the force laws for the strong and weak forces are most conveniently derived and related to electromagnetism in the Standard Model of forces other than gravity. The equations can be specified in terms of the Lie algebra of G, which specifies the infinitesimal multiplication, or, more theoretically, in terms of the curvature of a connection. The electromagnetic, weak, and strong forces correspond respectively to the groups U(1), SU(2), SU(3) of complex matrices of sizes 1×1, 2×2, 3×3 which preserve the lengths of vectors in terms of complex absolute values, and in the latter two cases have determinant 1. Interactions between forces can be specified in a similar way.
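For reference (the standard textbook form, not written out in the text), the Yang-Mills Lagrangian density associated with a group G with structure constants f^{abc} and coupling constant g is
\[ \mathcal{L}_{\mathrm{YM}} = -\frac{1}{4} F^{a}_{\mu\nu} F^{a\,\mu\nu}, \qquad F^{a}_{\mu\nu} = \partial_\mu A^{a}_\nu - \partial_\nu A^{a}_\mu + g f^{abc} A^{b}_\mu A^{c}_\nu. \]
For G = U(1) the structure constants vanish and this reduces to the Maxwell term −(1/4)F^{μν}F_{μν} that appears in the QED action in the next section.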
A Nonlinear Semiclassical Quantum Electrodynamics
We first note that typically, nonlinear physical dynamical systems can be viewed as unitary systems on some huge space: usually some measure will be preserved. Then there will be a space of all measurable, square integrable functions on this space, L², and the dynamics on this function space will be linear and unitary. This is a little like passing to a second quantization. But if the original nonlinear theory is large enough to do quantum mechanics with it, that is, there are Schrödinger or Dirac equations and particle wave functions tensor as we include more particles, then that system is large enough for nonlinear quantum computers. In the simplest sense, any model for electrodynamics that includes both fields and particles and some kind of interaction between the two will include some version of the equation for the force on a charged particle in a magnetic field, F = qv × B, where F is force, q is charge, v is the velocity of the particle, and B is the magnetic field.
This equation is nonlinear when the particle motion affects the magnetic field. However, we consider a more exact system in the case of one particle, which can be viewed as the standard system of electrodynamics after first quantization but before second quantization [9] (this is for one particle):
\[ S_{\mathrm{QED}} = \int d^4x \left[ -\frac{1}{4} F^{\mu\nu} F_{\mu\nu} + \bar{\psi}\,(i\gamma^\mu D_\mu - m)\,\psi \right]. \]
Here S is action, x is space-time, F is the electromagnetic field tensor, ψ is the Dirac wave function, ψ̄ is the conjugate transpose ψ†γ⁰, the γ are the Dirac matrices (given constants), the μ, ν are space-time indices in tensor notation, m is mass, and D is the covariant derivative
\[ D_\mu = \partial_\mu + ieA_\mu + ieB_\mu, \]
where A is the electromagnetic potential, B is an external potential, and e is the charge or coupling constant. When it is standard, the physics notation ∂_μ = ∂/∂x^μ will be used. This system is intrinsically nonlinear. Below we will consider only one electromagnetic field. The equations of motion in this system can be given by applying the Euler-Lagrange equations of the calculus of variations to obtain critical points of the action. To be more specific, in [9] the equation of motion of ψ is
\[ (i\gamma^\mu \partial_\mu - m)\,\psi = e\,\gamma^\mu A_\mu\,\psi, \]
that is, this gives the time rate of change of ψ in a frame of reference. The equation of motion for the field A can be reduced, by using the Lorenz gauge condition, to
\[ \Box A^\mu = e\,j^\mu \]
for a current j, and this is a natural form of Maxwell's equations. The standard form of Dirac's equation [10] for two fermions (with an additional scalar field S) is, for i = 1, 2 indexing the two particles,
\[ \bigl[(\gamma_i)^\mu (p_i - A_i)_\mu + m_i + S_i\bigr]\,\psi = 0, \]
where ψ is now a 16-component function of the positions of both fermions, and
\[ p_\mu = -i\,\frac{\partial}{\partial x^\mu}. \]
This agrees with the expression above, and Maxwell’s equations give an equation of motion for the field. These equations can be extended in a natural way to n particles. Here too there will be only one electromagnetic field A in our situation. This system will reflect interactions of fermions and photons, viewed as part of an electromagnetic field, fairly well, though it does ignore annihilation of fermions and their antiparticles, which is rarer. The approximation should be good enough that it should also be valid for a second-quantized electrodynamics to the extent calculations can be done in it, because it agrees with the individual interaction histories represented by Feynman diagrams. To expand on this, we next discuss photons.
4 Fields and Photons
At the beginning of quantum theory it was considered that possibly particles but not fields were quantized, and this seems to have been David Bohm's view. Before quantum mechanics was well-developed, methods did not exist that could realize this idea. If we use a (possibly less accurate and cruder) model in which the electromagnetic field is not quantized, then there must be a way to account for the effects which are accounted for by photons in the standard theory. One way to account for them is to say that the quantization of fermion systems means that interactions between fermions and force fields are also quantized. That is, photons are typically observed as the result of absorption by matter. This occurs when electrons in matter are raised to a higher energy state. For the natural frequencies involved, this will agree with Planck's law E = hν; that is a consequence of the Schrödinger equation in the form
\[ \frac{ih}{2\pi}\,\frac{\partial \psi}{\partial t} = H\psi \]
when we search for solutions of the form exp(iωt)ψ₁(x) using a separation of variables. So the interaction is quantized even if the field does not actually consist of photons; the effects of photons are still present. An actual existence of photons in itself leads to strange conclusions: if the frequency and momentum are fixed then photons have a trigonometric form which suggests that they extend through all time and space. This would also be true for free electrons, but it is more natural to consider electrons as bound but photons as free until they interact with matter. In our model we will assume that approximate photons do exist as wave packets within the electromagnetic field, that is, these exist mostly in bounded regions of space, and within those regions frequency and wavelength are approximately constant. These packets will spread as they travel, but perhaps no more than pulses from a laser. It would be somewhat natural to take the electromagnetic field for photons as the analogue of the probability wave field for electrons, in that they both can explain slit experiments and diffraction. The analogy must be made a little more complicated when it is to deal with more than one photon with different polarizations, which in the standard theory is a symmetrization of a tensor product. In order to fit this within a single electromagnetic field we will assume that the approximate photons are somewhat spatially separated portions of a single electromagnetic field. Otherwise we would need to allow some linear dependence among combinations of polarized photons.
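To spell out the separation-of-variables step (my own elaboration, using the convention ψ(x, t) = exp(−iωt)ψ₁(x); the exp(+iωt) form written above simply reverses the sign of ω):
\[ \frac{ih}{2\pi}\,\frac{\partial}{\partial t}\bigl(e^{-i\omega t}\psi_1(x)\bigr) = \frac{h\omega}{2\pi}\, e^{-i\omega t}\psi_1(x) \quad\Longrightarrow\quad H\psi_1 = \frac{h\omega}{2\pi}\,\psi_1 = h\nu\,\psi_1, \]
since ω = 2πν. The allowed interaction energies are therefore the eigenvalues E = hν, which is Planck's law.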
5 Quantum Logic Gates
We do not have a definite proposal for a nonlinear quantum logic gate to be added to a standard set of linear quantum gates. The construction of linear quantum gates varies widely with the type of quantum effect used for qubits.
Trapped ion qubits are the current favorites. A quadrupole trap is constructed using an oscillating electric field at radio frequencies. This can use the Cirac-Zoller controlled-NOT gate [11]. The interaction of two qubits is mediated by an entire chain of qubits. This involves a specific sequence of three pulses. The qubits must be coupled. In the Loss-DiVincenzo quantum dot computer [12] the spin-1/2 of electrons confined in quantum dots is used for qubits. Gates are implemented by swap operations and rotations, with local magnetic fields. A pulsed inter-dot gate voltage is used so that a constant in the Hamiltonian becomes time-dependent. A square root of a corresponding matrix gives the exchange. Kane's quantum computer [13] involves a combination of nuclear magnetic resonance and electron spin; it passes from nuclear spin to electron spin. An alternating magnetic field allows the qubits to be manipulated. We alter the voltage on the metal A gates, which are metal attachments on top of an insulating silicon layer. This alters a resonant frequency and allows phosphorus donors within silicon to be addressed individually. A potential on a J gate between two A gates draws donor electrons together and allows interactions of qubits. Electron-on-helium qubits [14] use a binding of electrons to the surface of liquid helium. The electron is outside the helium and has a series of energy levels like the Rydberg series. Qubit operations are done by microwave fields exciting the Rydberg transition. The Coulomb interaction facilitates qubit interactions. There might be an exchange interaction of adjacent qubits, but it is not clear if this is enough for powerful quantum interactions. Topological quantum computers, based on quantum surface effects in special materials like topological insulators, might have the gates described in [6], which are a little complicated. It might be said that a nonlinear gate will have to involve quantum field effects beyond those from quantum mechanics, and those effects cannot become too small as we work with a growing number of qubits. If particles are to have fairly high velocities and not leave the system, it is more convenient for their motion to be oscillatory or to follow a closed loop. It is also reasonable that it might use existing ideas from the above linear gates.
6 Bohmian Mechanics
At this point we go even further into the realm of speculation, and the reader who does not enjoy this might prefer to stop here. The remainder is not presently relevant to quantum computing. The interpretation of quantum mechanics is often considered a matter of personal taste, and there is no unanimity on such an interpretation. The Copenhagen interpretation was dominant at one time, and the Everett many-worlds or multiverse interpretation is rather popular now. The interpretation by David Bohm grew out of ideas of Louis de Broglie that a fermion is not either a wave or a particle but a pair, a wave and a particle, where the wave is the usual probability wave and the particle can travel
faster than light, but only in a way which is unobservable and cannot transmit information, and the particle is guided by the wave. Special assumptions are needed to reconcile this with special relativity in terms of what can be observed. The most natural way to do this seems to be to assume there is a particular but physically unobservable space-time frame, and compute in it, but then transform the results if different frames are used. Workers in Bohmian mechanics, Durr et al. [4, 5], Dewdney and Horton [3], and Nikolić [7], have come up with versions which are consistent with special relativity. This is done in essentially one of two ways: either we specify a timelike vector field, as Durr does, or we say that each particle in a multiparticle system has its individual time coordinate. It is more reasonable philosophically, however, if we make an assumption which is not generally allowed in modern physics, namely that there is a hidden special coordinate frame for space-time. This might be considered as a way of breaking Lorentz symmetry. Siddhant Das and Markus Nöth [2], building on previous work with Detlef Durr, have studied experiments that might distinguish Bohmian mechanics from other interpretations of quantum mechanics.
7 A System of Equations
The mathematical formulation of the theory here is that it consists of five equation systems: (1) a version of the Dirac equation for systems involving any finite number of fermions and a given set of fields, which is the natural generalization of the system considered above; (2) the main equation of Bohmian mechanics,
\[ \frac{dx_i}{dt} = \frac{h}{2\pi m_i}\,\nabla_i \bigl(\mathrm{Im}(\ln\psi)\bigr); \]
(3) Maxwell's equations of the electromagnetic field; (4) a way to account for creation and annihilation of fermion-anti-fermion pairs; (5) a way to obtain the electromagnetic field, which is, specifically, the same equations as for (1) in the semiclassical model. The 4th is a comparatively rare event. We propose that it is represented by a singularity in which the energy of the electromagnetic field increases by the amount lost when the fermion and anti-fermion are destroyed. This affects the Dirac equation by changing the number of particles and hence the dimensionality of the space on which ψ is defined. We must thus transform ψ when this event happens. This can be done by projecting ψ to the lower-dimensional space. If we consider an analogous case of a Schrödinger equation and given electric fields which for each particle depend on its position, we can imagine that a solution is a limit of sums or linear combinations of products of functions which solve the equation for each of the separate particles; then the projection on each term replaces the functions for the deleted particles by 1. It can be seen that a linear combination of functions solving the equation for separate particles will satisfy the total equation, by the multiplicative property of the time derivative, and because the Laplacians and potential multiples each affect only a single factor. For particle-antiparticle creation, we time-reverse the equations for
creation. This requires a theory in which singularities can be determined from the nonsingular points of a solution. Real-analytic functions, for instance, have this property. As previously mentioned, for (5) we can use a form of the free-space Maxwell equations. To this could be added a specification of the nature of the singularity of the field that would occur for a point particle such as an electron. This is similar to what can be done for Newtonian gravity by saying that the field is a free field with singularities which are first-order poles at the particles. Nikolić has observed that Bohmian theory might give an explanation of the success of otherwise mysterious string theory and its huge number of variables. Strings are approximations to Bohmian orbits. However, closed strings are usually thought of as smaller than closed orbits in Bohmian theory. In the Standard Model all the fields, such as the gluon field, have a classical version as well as a quantum version, because they are described in terms of Lagrangians. These can be added to the model at the beginning of this section, with these fields and with the particles of the Standard Model. Durr [5] has produced a Bohmian model of quantum gravity involving curved space. One can also consider quantum gravity in terms of gravitons. It is known that a hypothetical theory mediated by spin-2 gravitons must coincide essentially with general relativity as a non-quantum theory, regardless of the way the gravitons interact. The problem is that this theory is not renormalizable. This is not a problem for the theory of the previous section, which does not require renormalization. However, if we can produce a formal Lagrangian, then we might alternatively produce a classical gravity field in this way.
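As a consistency check on the guidance equation in (2) above (my own remark, using the standard polar form of the wave function): writing ψ = R exp(2πiS/h) with R and S real gives
\[ \mathrm{Im}(\ln\psi) = \frac{2\pi S}{h}, \qquad \frac{dx_i}{dt} = \frac{h}{2\pi m_i}\,\nabla_i\!\left(\frac{2\pi S}{h}\right) = \frac{\nabla_i S}{m_i}, \]
which is the familiar de Broglie-Bohm guidance law: the velocity of particle i is the gradient of the phase (action) S with respect to its coordinates, divided by its mass.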
8 Conclusion
A semiclassical field theory is nonlinear and appears to be a suitable setting in which the Abrams-Lloyd Theorem would provide a theoretical quantum computer that could solve NP-complete problems fairly directly. The question of specific nonlinear quantum logic gates is left for the future. It seems possible that this semiclassical theory is not just an approximation but is an accurate model of fields which avoids convergence problems. Moreover this semiclassical theory extends to a unified theory of all four forces, which is also compatible with Bohmian ideas.
References
1. Abrams, D.S., Lloyd, S.: Nonlinear quantum mechanics implies polynomial-time solution for NP-complete and #P problems. arXiv:quant-ph/9801041
2. Das, S., Nöth, M.: Times of arrival and gauge invariance. arXiv:2102.02661
3. Dewdney, C., Horton, G.: Relativistically invariant extension of the de Broglie-Bohm theory of quantum mechanics. arXiv:quant-ph/0202104
4. Durr, D., Goldstein, S., Norsen, T., Struyve, W., Zanghì, N.: Can Bohmian mechanics be made relativistic? arXiv:1307.1714
5. Durr, D., Struyve, W.: Quantum Einstein equations. arXiv:2003.03839
6. Bonderson, P., Das Sarma, S., Freedman, M., Nayak, C.: A blueprint for a topologically fault-tolerant quantum computer. arXiv:1003.2856
7. Nikolić, H.: Relativistic quantum mechanics and quantum field theory. arXiv:1203.1139
8. Valiant, L., Vazirani, V.: NP is as easy as detecting unique solutions. Theoret. Comput. Sci. 47, 85–93 (1986)
9. Wikipedia article, Quantum electrodynamics. https://en.wikipedia.org/wiki/Quantum_electrodynamics
10. Wikipedia article, Two-body Dirac equations. https://en.wikipedia.org/wiki/Two-body_Dirac_equations
11. Wikipedia article, Trapped ion quantum computer. https://en.wikipedia.org/wiki/Trapped_ion_quantum_computer
12. Wikipedia article, Spin qubit quantum computer. https://en.wikipedia.org/wiki/Spin_qubit_quantum_computer
13. Wikipedia article, Kane quantum computer. https://en.wikipedia.org/wiki/Kane_quantum_computer
14. Wikipedia article, Electron-on-helium qubit. https://en.wikipedia.org/wiki/Electron-on-helium_qubit
Service-Oriented Multidisciplinary Computing: From Code Providers to Transdisciplines Michael Sobolewski1,2(B) 1 Air Force Research Laboratory, WPAFB, Dayton, OH 45433, USA
[email protected] 2 Polish Japanese Academy of IT, 02-008 Warsaw, Poland
Abstract. True service-oriented architecture provides a set of guidelines and the semantically relevant language for expressing and realizing combined request services by a netcentric platform. The Transdisciplinary Modeling Language (TDML) is an executable language in the SORCER platform based on service abstraction (everything is a service) and three pillars of service-orientation: contextion (context awareness), multifidelity, and multityping of code providers in the network. TDML allows for defining complex polymorphic disciplines of disciplines (transdisciplines) as services that can express, reconfigure, and morph large distributed multidisciplinary processes at runtime. In this paper the approach applicable to complex multidisciplinary systems is presented with five types of nested service aggregations into distributed transdisciplines. Keywords: True service orientation · Contextion · Multifidelities · Multityping · Transdisciplines · Emergent systems · SORCER
1 Introduction

Service-oriented architecture (SOA) emerged as an approach to combat the complexity and challenges of large monolithic applications by offering cooperation of replaceable functionalities, realized by remote/local component services working with one another at runtime, as long as the semantics of the component service stays the same. However, despite many efforts, there is a lack of good consensus on the netcentric semantics of a service and on how to do true SOA well. A true SOA should provide a clear answer to the question: how can a service consumer consume and combine functionality from service providers when it does not know where those providers are or even how to communicate with them? In TDML service-oriented modeling, three types of services are distinguished: operation services, and two types of request services, elementary and combined. An operation service, in short an opservice, invokes an operation of its code provider. TDML opservices never communicate directly with service providers in the network. An elementary request service asks an opservice for output data given input data. A combined request service specifies cooperation of hierarchically organized multiple request services that in turn
execute operation services. Therefore, a service consumer utilizes output results of multiple executed request and operation services. The end user that creates request services and utilizes the created service partnership of code providers becomes the coproducer and the consumer. Software developers develop code providers provisioned in the network, but the end users relying on code providers develop combined services by reflecting their experiences, creativity, and innovation. Such coproduction is the source of innovation and competitive advantage. The Service-ORiented Computing EnviRonment (SORCER) [6–10] adheres to the true SO architecture based on formalized service abstractions and the three pillars of SO programming. Evolution of the presented approach started with the FIPER project [5], funded by NIST ($21.5 million) at the beginning of this millennium, then continued at the SORCER/TTU Laboratory [9], and matured for real-world aerospace applications at the Multidisciplinary Science and Technology Center, AFRL/WPAFB [1–3]. The mathematical approach for the SORCER platform is presented in [7, 8]. In this paper the focus is on a transdisciplinary programming environment for SORCER.
2 Disciplines and Intents in TDML

A discipline is considered as a system of rules governing computation in a particular field of study. Multidisciplinary science is the science of multidisciplinary process expression. A discipline considered as a system of multiple cooperating disciplines is called a transdiscipline. A transdisciplinary process is a systemic form of multidisciplinary processes. The Transdisciplinary Modeling Language (TDML) focuses on transdisciplinary process expression and the execution of transdisciplines as disciplines of disciplines. In mathematical terminology, if a discipline is a function, then a transdiscipline is a functional (higher-order function). An intent is a service context of a requestor, but a transdisciplinary intent is a context of contexts that specifies relationships between disciplinary contexts of the transdiscipline and related parts so that a change to one propagates intended data and control flow to others. It is a kind of ontology that represents the ideas, concepts, and criteria defined by the designer to be important for expressing and executing the multidisciplinary process. It should express both functional and technical needs of the multidisciplinary process. The transdisciplinary approach is service-oriented; this means that functionality is what a service provider does or is used for, while a service is an act of serving in which a service requestor takes the responsibility that something desirable happens on behalf of other services represented by service requestors or service providers. Thus, a discipline represents a service provider while a discipline intent represents a discipline requestor. Service-orientation implies that everything is a service to a large extent, thus both requestors and providers represent services. Disciplines of disciplines and their parts form hierarchically organized service cooperations at runtime. Service cooperation partitions workloads between collaborating request and provider services. At the bottom of the cooperative service hierarchy, code providers make local and/or remote calls to run executable codes of domain-specific applications, tools, and utilities (ATUs).
2.1 The Service-Oriented Conceptualization of TDML

A code provider corresponds to the actualization of an opservice. A single opservice represents an elementary request service, but a combined request service is actualized by a cooperation of code providers. Therefore, a combined request service may represent a process expression realized by cooperation of many request services and finally the corresponding execution of code providers in the network. In TDML, combined services express hierarchically organized disciplinary dependencies. Transdisciplines are combined services at various levels of granularity. Granularity refers to the extent to which a larger discipline is subdivided into smaller distinguishable disciplines. Cohesion of a discipline is a measure of the strength of relationship between the dependencies of the discipline. High cohesion of disciplines often correlates with loose coupling, and vice versa. Operational services (opservices) represent the lowest service granularity and refer to executable codes of functions, procedures, and methods of local/remote objects. The design principle of aggregating all input/output data into service contexts, and using contexts by all cooperating services working in unison, is called service context awareness. Service context awareness, also called data contextion, is a form of parametric polymorphism. A contextion is a mapping from an input service context to an output service context. Using service contexts, a contextion can be expressed generically so that it can handle values without depending on individual argument types. Request services as contextions are generic services that form the core of SO programming [8]. Service disciplines are aggregations of contextions as illustrated in Fig. 1. A service domain is either a routine (imperative domain) or a model (declarative domain), or an aggregation of both. Domain aggregations are called transdomains. They provide for declarative/imperative transitions of domains within a transdomain. Transroutine and Transmodel contextions are subtypes of the Transdomain type. The service-oriented TDML semantics (from code providers to transdisciplines) and the relationships between service types are illustrated in Fig. 1. Combined disciplinary request services at the bottom of their service hierarchy are actualized by executable codes of code providers. A workload of a code provider is expressed by a service signature, sig(op, tp), where an operation op of tp has to be executed by a code provider of type tp. A code provider may implement multiple service types, each with multiple operations. Therefore, a signature type can be generalized to a multitype that serves as a classifier of service providers in the network. A multitype signature sig(op, tp1, tp2, ..., tpn) is an association of a service operation op of tp1 and a multitype in the form of the list of service types tp1, tp2, ..., tpn to be implemented by a service provider. The operation op is specified by the first service type in the list of service types. If all service types of a signature are interface types, then such a signature is called remote. If the first type is a class type, then a signature is called local. Note that binding remote signatures to code providers in the network is dynamic, specified only by the service types (interfaces considered as service contracts) to be implemented by the code provider.
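One compact way to write the contextion idea (my own notation, not the formalism of [7, 8]): let C denote the set of service contexts. Then an elementary request service is a mapping
\[ c : \mathbb{C} \to \mathbb{C}, \]
and, in the simplest purely sequential case, a combined request service (a domain, discipline, or transdiscipline) is a composition
\[ d = c_n \circ c_{n-1} \circ \cdots \circ c_1, \]
which is again a mapping from contexts to contexts; in general the aggregation is hierarchical rather than a single chain.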
Service signatures used in request services are free variables to be bound at runtime to redundant instances of code providers available or provisionable in the network. To provision a code provider in the network, the signature needs to declare a deployment configuration as follows:
sig(op, tp, deploy(config("myCfg"), idle("1h")))
where myCfg is a configuration file and idle specifies that after one hour of idle time the code provider should be deprovisioned. Provisioning of code providers in SORCER is supported by the Rio technology [12]. In the case of instantiation with local signatures, the instances of a given code provider type are constructed at runtime. Therefore, service signatures provide the uniform representation for instantiation of local, remote, and on-demand provisioned code providers. Service signatures are fundamental enablers of true service orientation in TDML and SORCER [8]. Opservices (evaluators and signatures in TDML) are bound to code providers (ATUs) at runtime. They are used by elementary services (entries and tasks) to create the executable foundation for all combined request services. Service entries are elementary services that represent functionals (higher-order functions), procedures (first-order functions), and system calls via corresponding opservices. Service tasks use service signatures predominantly to represent net-centric local/remote object-oriented invocations. A signature binds to a code provider but doesn't know where the provider is or even how to communicate with it, which is the basic principle of net-centricity in TDML, implemented with the Jini technology [11].
Fig. 1. TDML Service-Oriented Conceptualization: From Code Providers (Actualization) to Consumers of Multidisciplinary Services Realized by Service Requestors
2.2 Three Pillars of Service Orientation

The presented semantics of service orientation can be summarized by the three SO pillars (indicated in red in Fig. 1) as follows:
1. Contextion allows for a service to be specified generically, so it can handle context data uniformly, with the required data types of context entries to be consistent with
ontologies of service providers. Contextion, as a form of parametric polymorphism, is a way to make an SO language more expressive, with one generic type for the inputs and outputs of all request services.
2. Morphing of a request service is affected by the initial fidelities selected by the user and by the morphers of morph-fidelities. Morphers associated with morph-fidelities use heuristics provided by the end user that depend on the input service contexts and on subsequent intermediate results obtained from service providers. Multifidelity management is a dispatch mechanism, a kind of ad hoc polymorphism, in which fidelities of request services are reconfigured or morphed with a fidelity projection at runtime.
3. Service multityping, as applied to service signatures to be bound at runtime to code providers, is a multiple form of subtype polymorphism whose goal is to find a remote instance of the code provider by the range of service types that the code provider implements and registers for lookup. It also allows a multifidelity opservice to call an operation of a primary service type implemented by the service provider as an alternate service fidelity.
3 Discipline Instantiation and Initialization

In multidisciplinary programming, an instance is a concrete occurrence of any discipline, existing usually during the runtime of a multidisciplinary program. An instance emphasizes the distinct identity of the discipline. The creation of an instance is called instantiation, and initialization is the assignment of initial values to data items used by a transdiscipline itself and its component disciplines. In Sect. 2.1 service signatures are described as the constructors of code providers used by elementary request services. Signatures dynamically prepare instances for use at runtime, often accepting multitypes to be implemented by a required instance of a code provider. Multitype signatures may create new local code providers, bind to existing ones in the network, or provision remote instances on demand. Below, instantiation of request services is presented by so-called builder signatures that use static operations of declared builder types. A discipline builder is a design pattern that provides a flexible solution to the creation of various types of transdisciplines (disciplines of disciplines). The purpose of the discipline builder is to separate the construction of a complex discipline from its representation and initial data. The discipline builder describes how to encapsulate creating and assembling the parts of a complex discipline along with its data initialization. In SORCER, builders are Java classes that may extend the Builder utility class. Therefore, a discipline delegates its creation to a builder instead of creating the discipline directly. This allows a discipline representation, called a builder signature, to be changed later independently of (without having to change) the domain itself. A builder signature is a representation of an entity that is closely and distinctively associated and identified with a service provider, requestor, or intent. 1. A builder signature declares the corresponding entity builder. In TDML, a builder signature is expressed as follows:
sig(op, bt) or sig(op, bt, init(att, val), ...) or sig(op, bt, initContext, init(att, val), ...)
where bt is the builder type; op is its static operation; initContext represents the initialization context (a collection of attribute-value pairs) used by the builder, usually by its initialize method; and init(att, val) declares the initialization of the attribute att with the value val; init attribute-value pairs can be repeated.

3.1 Explicit Discipline Instantiation with Builder Signatures

The TDML instantiation operator inst is specified as follows:

inst(sig(bt)) or
inst(sig(op, bt)) or
inst(sig(op, bt, init(att, val), ...)) or
inst(sig(op, bt, initContext)) or
inst(sig(op, bt, initContext, init(att, val), ...))
where bt is a builder class type and op its static builder operation. The first case corresponds to the default constructor of the class bt.

3.2 Implicit Discipline Instantiation by Intents

intent(
    dscSig(builderSignature) or
    dscSig(op, bt) or
    dscSig(op, bt, init(att, val) ...) or
    dscSig(op, bt, initContext) or
    dscSig(op, bt, initContext, init(att, val) ...)
    // other parts and builders of executable intent
)
where dscSig stands for the operator declaring a discipline builder signature.
4 Discipline Execution and Aggregations

With respect to the types of disciplines declared in disciplinary intents, an intent is executed by a corresponding TDML operator (executor), e.g., responses, search, analyze, explore, supervise, and hypervise. The executed intent contains both a result and its executed discipline. A created or executed discipline that is declared by myIntent is selected by discipline(myIntent). If a discipline intent declares an output filter, then myResult = explore(myIntent); otherwise the result can be selected from the executed intent with TDML operators (getters) and/or the disciplinary Java API.
Note that when a discipline is a discipline of disciplines, it becomes a service consumer with respect to its component disciplines, which in turn are service providers. Subsequently, any composed provider may be a requestor of services. Service-oriented computing adheres to a distributed architecture that partitions workloads between service peers. A peer can be a service requestor and/or a service provider. A service consumer utilizes results from multiple service requestors that rely on code providers (workers). Services are said to form a peer-to-peer (P2P) network of services. Disciplinary peers make a portion of their disciplines directly available to other local and/or remote network disciplines, without the need for central coordination, in contrast to the traditional client-server model in which the consumption and supply of resources is strictly divided. Transdisciplines and disciplines adopt many organizational architectures. Elementary services are functional service entries and procedural tasks. Aggregations of functional entries (functions of functions) form declarative domains called service models. Aggregations of procedural tasks form imperative domains called service routines: block-structured and composite-structured routines. Models and routines are elementary disciplines used to create various types of transdisciplines, e.g., transdomains (either transmodels or transroutines), collaborations, regions, and governances, with relevant adaptive multifidelity morphers and controllers (e.g., analyzers, optimizers, explorers, supervisors, hypervisors, initializers, and finalizers) that manage cooperations of disciplines. Five service-oriented types of service aggregations are distinguished as illustrated in Fig. 2. Elementary and combined services are called request services. Elementary services comprise opservices, but combined services comprise request services. Disciplines are combined services. Each disciplinary type represents a different granularity of hierarchically organized code providers in the network referred to by operation services (evaluators and signatures). Transdisciplines are comprised of disciplines, transdomains of domain services (models and routines), which in turn are comprised of elementary services (entries and tasks, correspondingly), which in turn rely on opservices (evaluators and signatures), which in turn bind at runtime to code providers in the network of domain-specific ATUs. The UML diagram in Fig. 2 illustrates the five described granularities of services, from the highest transdisciplinary granularity to the lowest opservice granularity. Opservices, as required by transdisciplines, use code providers that call associated executable codes.
Fig. 2. Five Types of Service Aggregations: From Code Providers to Transdisciplines
5 An Example of a Distributed Transdiscipline in TDML

To illustrate the basic TDML concepts, the Sellar multidisciplinary optimization problem [4] is used to implement a multidisciplinary analysis and optimization (MADO) transdiscipline. Distributed transdisciplines in TDML allow component disciplines to be two-way coupled and distributed in the network as well. We will specify in Sect. 5.1 a Sellar intent sellarIntent that declares its transdiscipline by a builder signature as follows: disciplineSig(SellarRemoteDisciplines.class, "createSellarModelWithRemoteDisciplines").
The transdiscipline is implemented by a method createSellarModelWithRemoteDisciplines of a class SellarRemoteDisciplines described in Sect. 5.2; then in Sect. 5.3 the Sellar intent sellarIntent is executed.

5.1 Specify the Sellar Intent with the MadoIntent Operator
Intent sellarIntent = madoIntent( initialDesign(predVal("y2$DiscS1", 10.0), val("z1", 2.0), val("x1", 5.0), val("x2", 5.0)), disciplineSig(SellarRemoteDisciplines.class, "createSellarModelWithRemoteDisciplines"), optimizerSig(ConminOptimizerJNA2.class), ent("optimizer/strategy", new ConminStrategy(…)), mdaFi("fidelities", mda("sigMda", sig(SellarMda.class)), mda("lambdaMda", (Requestor mdl, Context cxt) -> { Context ec; double y2a, y2b, d = 0.000001; do { update(cxt, outVi("y1$DiscS1"), outVi("y2$DiscS2")); y2a = (double) exec(mdl, "y2$DiscS1"); ec = eval(mdl, cxt); y2b = (double) value(ec, outVi("y2$DiscS2")) } while (Math.abs(y2a - y2b) > d); })));
Note that the Sellar transdiscipline is declared by disciplineSig with two fidelities for multidisciplinary analysis (MDA) and the Conmin optimizer. The first MDA fidelity is specified by a ctor signature, the second one by a lambda evaluator.

5.2 Define the Sellar Model with Two Distributed Disciplines

The Sellar transdiscipline has its own builder, with two separate builders for the component disciplines DiscS1 and DiscS2. The builders implement all disciplines as declared in TDML below, with the component disciplines to be deployed in the network. The Sellar model declares variables of dependent remote disciplines by proxies of remote models declared by the remote signatures sig(ResponseModeling.class, prvName("Sellar DiscS1")) and sig(ResponseModeling.class, prvName("Sellar DiscS2")), correspondingly. Note that the proxy variables y1 and y2 of the remote disciplines DiscS1 and DiscS2 are coupled in sellarDistributedModel, being remote disciplines in the network as specified by the remote interface ResponseModeling and the names used for the remote models: "Sellar DiscS1" and "Sellar DiscS2". The TDML svr operator declares a service variable (as a function of functions via args).
MadoModel sellarDistributedModel = madoModel("Sellar", objectiveVars(svr("fo", "f", SvrInfo.Target.min)), outputVars(svr("f", exprEval( "x1*x1 + x2 + y1 + Math.exp(-y2)", args("x1", "x2", "y1", "y2"))), svr("g1", exprEval("1 - y1/3.16", args("y1"))), svr("g2", exprEval("y2/24 - 1", args("y2")))), // w.r.t inputVars(svr("z1", 1.9776, bounds(-10.0, 10.0)), svr("x1", 5.0, bounds(0.0, 10.0)), svr("x2", 2.0, bounds(0.0, 10.0))), // s.t. constraintVars( svr("g1c", "g1", SvrInfo.Relation.lte, 0.0), svr("g2c", "g2", SvrInfo.Relation.lte, 0.0)), // disciplines with remote vars responseModel("DiscS1", outputVars( prxSvr("y1", sig(ResponseModeling.class, prvName("Sellar DiscS1")), args("z1","x1", "x2", "y2")))), responseModel("DiscS2", outputVars( prxSvr("y2", sig(ResponseModeling.class, prvName("Sellar DiscS2")), args("z1", "x2", "y1")))), //two-way couplings: svr-from-to cplg("y1", "DiscS1", "DiscS2"), cplg("y2", "DiscS2", "DiscS1")); configureSensitivities(sellarDistributedModel);
Remote response models, which are deployed by SORCER service provider containers, are declared in TDML as follows:
Model dmnS1 = responseModel("DiscS1", inputVars(svr("z1", 1.9776, bounds(-10.0, 10.0)), svr("x1", 5.0, bounds(0.0, 10.0)), svr("x2", 2.0, bounds(0.0, 10.0)), svr("y2")), outputVars( svr("y1", exprEval("z1*z1 + x1 + x2 - 0.2*y2", args("z1","x1", "x2", "y2"))))); Model dmsS2 = responseModel("DiscS2", inputVars(svr("z1", 1.9776, bounds(-10.0, 10.0)), svr("x1", 5.0, bounds(0.0, 10.0)), svr("x2", 2.0, bounds(0.0, 10.0)), svr("y1")), outputVars( svr("y2", exprEval("Math.sqrt(y1) + z1 + x2", args("z1", "x2", "y1")))));
5.3 Execute the Sellar Intent

Now we can perform exploration of the distributed Sellar model specified by the intent sellarIntent created in Sect. 5.1:

ExploreContext result = explore(sellarIntent);
and inspect the received results embedded in the returned result. The Sellar model case study presented above was developed as a design template for the complex aero-structural multidisciplinary models currently used at the Multidisciplinary Science and Technology Center/AFRL.
6 Conclusions

The mathematical view of process expression has limited computing science to the class of processes expressed by algorithms. From experience in the past decades it has become obvious that in computing science the common thread in all computing disciplines is process expression, which is not limited to algorithms or to the actualization of process expression by a single computer. In this paper, service-orientation is proposed as the approach, with five types of service aggregations (see Fig. 2). The "everything is a service" semantics of TDML (see Fig. 1) has been developed to deal with multidisciplinary complexity at various levels, to be actualized by dynamic cooperations of code providers in the network. The SORCER architectural approach represents five types of net-centric service cooperations expressed by request services. In general, disciplinary requestors (context intents) are created by the end users, but the executable codes of code providers are created by software developers. It elevates combinations of disciplines into first-class citizens of SO multidisciplinary process expression.
True service-orientation means that in the netcentric process both the service requestors and providers must be expressed and then realized, under the condition that service consumers should never communicate directly with service providers. Transdisciplines are asserted complex cooperations of code providers, represented in TDML directly by operation services. This way, everything is a service at various service granularities (see Fig. 2). Therefore, request services represent cooperations of opservices bound at runtime to code providers to execute computations. The essence of the approach is that by making specific choices in hierarchically grouping code providers for disciplines, we can obtain desirable dynamic properties from the SO systems we create with TDML. Thinking more explicitly about SO languages as domain-specific languages for humans, rather than software languages for computers, may be our best tool for dealing with real-world multidisciplinary complexity. Understanding the principles that run across process expressions in TDML, and appreciating which language features are best suited for which type of process, brings these process expressions (context intents in TDML) to useful life. No matter how complex and polished the individual process operations are, it is often the quality of the operating system (SORCER) and its programming environment (TDML) that determines the power of the computing system. The ability of the presented transdisciplines with the SO execution engine to leverage network resources as services is significant to real-world applications in two ways. First, it supports multi-machine executable codes via opservices that may be required by SO multidisciplinary applications; second, it enables cooperation of a variety of computing resources represented by multiple disciplines that comprise the network opservices actualized by the multi-machine network at runtime. Embedded service integration in the form of transdisciplines in TDML solves a problem for both system developers and end users. Embedded service integration is a transformative development that resolves the stand-off between system developers, who need to innovate service integrations, and end users, who, as coproducers, want their services to be productive in their multidisciplinary systems, not to hold them back. Multidisciplinary integration is key to this, but neither system developers nor end users want to be distracted by time-consuming integration projects. The SORCER multidisciplinary platform has been successfully deployed and tested for design space exploration, parametric, and aero-structural optimization in multiple projects at the Multidisciplinary Science and Technology Center AFRL/WPAFB. Most MADO applications and results are proprietary except those for public release. Acknowledgments. This effort was sponsored by the Air Force Research Laboratory's Multidisciplinary Science and Technology Center (MSTC), under the Collaborative Research and Development for Innovative Aerospace Leadership (CRDInAL) - Thrust 2 prime contract (FA8650-16C-2641) to the University of Dayton Research Institute (UDRI). This paper has been approved for public release: distribution unlimited. Case Number: AFRL-2022-1664. The effort is also partially supported by the Polish Japanese Academy of Information Technology.
References 1. Burton, S.A., Alyanak, E.J., Kolonay, R.M.: Efficient supersonic air vehicle analysis and optimization implementation using SORCER. In: 12th AIAA Aviation Technology, Integration, and Operations (ATIO) Conference and 14th AIAA/ISSM AIAA 2012–5520 (2012) 2. Kao, J.Y., White, T., Reich, G., Burton, S.: A multidisciplinary approach to the design of a lowcost attritable aircraft. In: 18th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, AIAA Aviation Forum 2017, Denver, Colorado (2017) 3. Kolonay, R.M., Sobolewski, M.: Service oriented computing environment (SORCER) for large scale, distributed, dynamic fidelity aeroelastic analysis & optimization. In: International Forum on Aeroelasticity and Structural Dynamics, IFASD2011, 26–30 June, Paris, France (2011) 4. Sellar, R.S., Batill, S.M., Renaud, J.E.: Response surface based, concurrent subspace optimization for multidisciplinary system design, Paper 96-0714. In: AIAA 34th Aerospace Sciences Meeting and Exhibit, Reno, Nevada January (1996) 5. Sobolewski, M.: Federated P2P services in CE environments. In: Advances in Concurrent Engineering, pp. 13–22. A.A. Balkema Publishers (2002) 6. Sobolewski, M.: Service oriented computing platform: an architectural case study. In: Ramanathan, R., Raja, K. (eds.) Handbook of Research on Architectural Trends in ServiceDriven Computing, pp 220–255. IGI Global, Hershey (2014) 7. Sobolewski, M.: Amorphous transdisciplinary service systems. Int. J. Agile Syst. Manag. 10(2), 93–114 (2017) 8. Sobolewski, M.: True service-oriented metamodeling architecture. In: Ferguson, D., Méndez Muñoz, V., Pahl, C., Helfert, M. (eds.) CLOSER 2019. CCIS, vol. 1218, pp. 101–132. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49432-2_6 9. SORCER/TTU Projects. http://sorcersoft.org/theses/index.html. Accessed 22 July 2022 10. SORCER Project. https://github.com/mwsobol/SORCER-multiFi. Accessed 22 July 2022 11. River (Jini) project. https://river.apache.org. Accessed 22 July 2022 12. Rio Project. https://github.com/dreedyman/Rio. Accessed 22 July 2022
Incompressible Fluid Simulation Parallelization with OpenMP, MPI and CUDA Xuan Jiang1(B) , Laurence Lu2 , and Linyue Song3 1
Civil and Environmental Engineering Department, UC Berkeley, Berkeley, USA [email protected] 2 Electrical Engineering and Computer Science Department, UC Berkeley, Berkeley, USA 3 Computer Science Department, UC Berkeley, Berkeley, USA
Abstract. We note that we base our initial serial implementation on the original code presented in Jos Stam's paper. In the initial implementation, it was easiest to implement OpenMP. Because of the grid-based nature of the solver implementation and the shared-memory nature of OpenMP, the serial implementation did not require the management of mutexes or any other data locks, and the pragmas could be inserted without inducing data races in the code. We also note that because the Gauss-Seidel method, in solving a linear system, uses intermediate values as it sweeps, it is possible to introduce errors that cascade through reliance on neighboring cells which have already been updated. However, this issue is avoidable by looping over every cell in two passes such that each pass constitutes a disjoint checkerboard pattern. To be specific, the set_bnd function for enforcing boundary conditions has two main parts, enforcing the edges and the corners, respectively. However, this imposes a strange implementation where we dedicate exactly a single block and a single thread to an additional kernel that resolves the corners, but it has almost no impact on performance; the most time-consuming parts of our implementation are cudaMalloc and cudaMemcpy. The only synchronization primitive that this code uses is __syncthreads(). We carefully avoided using atomic operations, which would be quite expensive, but we need __syncthreads() at the end of diffuse, project, and advect because we reset the boundaries of the fluid every time after diffusing and advecting. We also note that similar data races are introduced here without the two-pass method mentioned in the previous OpenMP section. Similar to the OpenMP implementation, the pure MPI implementation inherits many of the features of the serial implementation. However, our implementation also performs domain decomposition and the necessary communication. Synchronization is performed through these communication steps, although the local nature of the simulation means that there is no implicit global barrier and much computation can be done almost asynchronously.

Keywords: OpenMP · MPI · CUDA · Fluid Simulation · Parallel Computation
1 Introduction
In order to make the fluid simulator unconditionally stable, Jos Stam's [4] implementation makes an approximation while computing the diffusion effects. Specifically, instead of intuitively using the densities of the eight surrounding cells in the PREVIOUS step to update the density of the center cell in the current step, it uses the densities of the neighboring cells in the CURRENT step. Hence, it makes the implementation horizontally and vertically asymmetric and unparallelizable. However, we find that the intuitive implementation of the simulation would be unstable only if the diffusion rate is larger than 1, and Jos Stam defines it to be dt * gridSize^2 * const. We think the qualitative local behaviour should not depend on the size of the grid, so we replaced the gridSize with a Scale variable, a constant that represents the spatial step size. Now, as long as we set an appropriate value for the Scale variable, the intuitive implementation (using previous states to update current states) is always stable. Therefore, we implement our own serial version of the fluid simulator, which has a newly defined diffusion rate and uses the previous states to update the current states. We set this parallelizable implementation as our serial base code.
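A minimal sketch of the explicit, previous-step diffusion update described above (our reading of the approach, with illustrative names N, IX, and a in the spirit of Stam's solver rather than taken from the authors' code; a 4-neighbor stencil is shown, while the text mentions eight surrounding cells, so the exact stencil may differ; the rate a must stay small enough, at most 1 in the paper's terms, for stability):

/* Explicit diffusion step: each cell reads only PREVIOUS-step densities,
 * so every cell can be computed independently (fully parallelizable). */
#define IX(i, j) ((i) + (N + 2) * (j))

void diffuse_explicit(int N, float *x, const float *x0, float a)
{
    /* a = dt * diff * scale * scale; stable only for sufficiently small a */
    for (int j = 1; j <= N; j++) {
        for (int i = 1; i <= N; i++) {
            x[IX(i, j)] = x0[IX(i, j)]
                + a * (x0[IX(i - 1, j)] + x0[IX(i + 1, j)]
                     + x0[IX(i, j - 1)] + x0[IX(i, j + 1)]
                     - 4.0f * x0[IX(i, j)]);
        }
    }
}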
2 Implementation and Tuning Effort
2.1 OpenMP
After running Intel vTune profiler on our serial implementation, we have the following results (Fig. 1):
Fig. 1. Serial implementation runtime distribution
The project function is by far the most costly function. Indeed, there are a few multi-level loops in the project function. By using omp for collapse around the loop blocks, we successfully reduce the run time of the project function to a quarter of its original run time when using four threads. After running the Intel vTune profiler on our OpenMP implementation, we have the following results (Fig. 2):
Fig. 2. OpenMP implementation runtime distribution
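For concreteness, the following C sketch (our illustration, not the paper's exact code) shows how a Gauss-Seidel style relaxation loop can be parallelized with omp for collapse while using the two-pass checkerboard ordering mentioned in the abstract. The function name lin_solve_omp, the fixed count of 20 relaxation iterations, and the indexing macro are assumptions on our part.

#include <omp.h>
#define IX(i, j) ((i) + (N + 2) * (j))

/* Red-black Gauss-Seidel sweep: within one pass, cells of one color only
 * read cells of the other color, so each parallel pass is free of races. */
void lin_solve_omp(int N, float *x, const float *x0, float a, float c)
{
    for (int k = 0; k < 20; k++) {                /* relaxation iterations */
        for (int color = 0; color < 2; color++) { /* two disjoint checkerboard passes */
            #pragma omp parallel for collapse(2)
            for (int j = 1; j <= N; j++) {
                for (int i = 1; i <= N; i++) {
                    if ((i + j) % 2 != color) continue;
                    x[IX(i, j)] = (x0[IX(i, j)]
                        + a * (x[IX(i - 1, j)] + x[IX(i + 1, j)]
                             + x[IX(i, j - 1)] + x[IX(i, j + 1)])) / c;
                }
            }
        }
    }
}

The same collapse pragma applies to the multi-level loops inside project; the outer color loop serializes the two passes so that each parallel region touches a disjoint set of cells.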
2.2 MPI
Because of the heavily localized nature of our fluid solver, we make the following observation: the projection, advection, and diffusion steps all act on different quantities of the fluid (e.g. pressure and velocity). With MPI, one of the more difficult implementation problems was finding a general method for domain decomposition. With the dimensions of the grid rigidly fixed in an array, as opposed to an abstract square of fixed width, it is difficult to find a division of the array other than a simple row-by-row split (especially if the number of processes and the array width are co-prime). We first attempted domain decomposition by finding a pair (a, b) such that ab = p and both a and b divide N + 2 (the number of rows including borders). Given such a pair, we chunk the fluid array into blocks of size (N + 2)/a by (N + 2)/b. If no satisfactory decomposition can be found, we fall back to a nearly row-by-row decomposition. When there are more rows than processes, each process receives N/p rows, plus an additional row if its rank is below the remainder N % p, and we rely on tags and non-blocking send/receive calls to prevent messages from being mismatched. If there are more processes than rows, we force the processes with rank higher than N to terminate early. This method is still difficult to generalize fully, especially if N + 2 happens to be prime (or co-prime to the number of processors, which is likely!). Otherwise, within each processor's block, iteration proceeds very similarly to the regular serial solver. Directly adapting the serial implementation to the row-by-row scenario makes the extreme burden of communication costs apparent: in the worst case, row decomposition maximizes the side length of a block compared with other regular decompositions (such as square grid cells), and because the Gauss-Seidel method depends on intermediate computations, a very high amount of communication is performed between adjacent processes. Because of this, we experimented with the Jacobi method (which only relies on the values from the start of each timestep) so that communication happens when the computations finish, rather than on every intermediate step.
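A minimal C sketch of the row-by-row fallback described above (illustrative only; the helper names rows_for_rank and exchange_ghost_rows, the tag values, and the ghost-row layout are our assumptions): each rank owns N / p rows plus one extra if its rank is below N % p, and exchanges its boundary rows with neighbouring ranks through non-blocking sends and receives with distinct tags.

#include <mpi.h>

/* Rows assigned to a rank: N / p each, plus one extra for the first N % p ranks. */
int rows_for_rank(int N, int p, int rank)
{
    return N / p + (rank < N % p ? 1 : 0);
}

/* Exchange ghost rows with the ranks above and below.
 * field holds local_rows interior rows plus one ghost row at each end;
 * row_len is the number of floats per row (N + 2 including borders). */
void exchange_ghost_rows(float *field, int local_rows, int row_len,
                         int rank, int nprocs, MPI_Comm comm)
{
    MPI_Request reqs[4];
    int nreq = 0;

    if (rank - 1 >= 0) {     /* neighbour above */
        MPI_Isend(field + row_len, row_len, MPI_FLOAT, rank - 1, 0, comm, &reqs[nreq++]);
        MPI_Irecv(field, row_len, MPI_FLOAT, rank - 1, 1, comm, &reqs[nreq++]);
    }
    if (rank + 1 < nprocs) { /* neighbour below */
        MPI_Isend(field + local_rows * row_len, row_len, MPI_FLOAT, rank + 1, 1, comm, &reqs[nreq++]);
        MPI_Irecv(field + (local_rows + 1) * row_len, row_len, MPI_FLOAT, rank + 1, 0, comm, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}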
2.3 CUDA
Initially we copied data from the CPU to the GPU during each iteration of the simulation, but this was very time consuming and introduced several segmentation faults that took considerable effort to debug. Even after everything ran without crashing (though it still did not pass the correctness check), the CUDA version was slower than OpenMP. After some discussion, we decided to copy the data only once at the beginning of the simulation instead of once per iteration. Since our simulation runs 100 iterations, this saved a substantial amount of data transfer time, and the CUDA code became 7x faster than OpenMP. We used an array that stores the location of the fluid in the different grid cells, serving as a pointer to each location, since the fluid is sorted and stored contiguously by cell. Ensuring correctness was difficult: everything appeared correct in init_simulation, but after several iterations there were synchronization issues in which the grids were sometimes not sorted properly. Adding synchronization primitives such as cudaDeviceSynchronize() and __syncthreads() only slowed the code down without fixing correctness. We tried several things to track the problem down: (1) locating the exact places where the density array went wrong using printf inside CUDA kernels; (2) checking other variables to see whether they also went wrong; (3) comparing values before and after each copy to see whether the error originated on the GPU side; and (4) verifying whether we copied data or merely pointers. After all these checks, we found that some of the copies were incorrect (with so many of them, it is easy to mix up the sizes of different objects) and that we had not loaded all of the required modules, which also caused segmentation faults. With these issues fixed the code passed the correctness check, although we worried that data races might cause many GPU cache misses. Given how much register and memory bandwidth the GPU architecture provides, this overhead turned out to be negligible. The final implementation is much faster than both the OpenMP and serial code. Even ignoring correctness, copying data at every iteration performed very poorly because of the repeated cudaMemcpy calls transferring data from CPU to GPU. Our final implementation instead keeps all of its computation in GPU memory. Surprisingly, the resulting code is quite straightforward and cleaner than the original implementation.
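A hedged CUDA sketch of the copy-once strategy (illustrative only; the kernel step_kernel is a placeholder for the real diffuse/advect/project kernels, and names such as run_simulation are ours, not the paper's):

#include <cuda_runtime.h>

/* Placeholder kernel standing in for one simulation step. */
__global__ void step_kernel(float *dens, const float *dens_prev, int total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total) dens[i] += 0.0f * dens_prev[i];  /* real update omitted */
}

void run_simulation(float *h_dens, int N, int num_iter)
{
    int total = (N + 2) * (N + 2);
    size_t bytes = total * sizeof(float);
    float *d_dens, *d_prev;

    cudaMalloc(&d_dens, bytes);
    cudaMalloc(&d_prev, bytes);

    /* Copy host data to the device ONCE, before the iteration loop. */
    cudaMemcpy(d_dens, h_dens, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_prev, h_dens, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (total + threads - 1) / threads;
    for (int it = 0; it < num_iter; it++) {
        /* All intermediate state stays in GPU memory; no per-iteration copies. */
        step_kernel<<<blocks, threads>>>(d_dens, d_prev, total);
    }

    /* Copy the result back ONCE, after all iterations have finished. */
    cudaMemcpy(h_dens, d_dens, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_dens);
    cudaFree(d_prev);
}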
3 Experimental Data: Scaling and Performance Analysis and Interesting Inputs and Outputs
After finishing our OpenMP and CUDA implementations, we conducted our experiments on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) and on the Bridges-2 supercomputer at the Pittsburgh Supercomputing Center, respectively.
3.1 OpenMP
Slowdown Plot. Figure 3 shows the slowdown plot of our OpenMP implementation. The curve can clearly be divided into two sections. For the first three data points, the run time grows linearly as the problem size increases, which is what we expect since our implementation involves minimal communication among threads. On closer inspection, however, the slope of this part of the line is less than 1. We attribute this to the small problem sizes of the first three data points: when the problem is small, the program spends most of its run time in setup operations such as memory allocation and initialization rather than in actual computation. For the second half of the plot, the run time grows linearly with a slope close to 1, indicating that computation time starts to dominate. Because there is minimal communication overhead in our OpenMP implementation, the slope of the curve stays close to 1 (see the note after Fig. 3).
Fig. 3. OpenMP log-log slowdown plot. Num Thread = 68
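To make the slope interpretation above explicit (standard log-log reasoning, not a result taken from the paper), assume the run time scales as a power of the problem size:

T(N) \approx c\,N^{s} \quad\Longrightarrow\quad \log T(N) \approx \log c + s \log N,

so the slope of the log-log curve is the exponent s: a slope close to 1 means the run time grows linearly with the problem size, while a slope below 1 indicates that fixed setup costs still account for a noticeable share of the run time.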
3.2 Strong Scaling
Figure 4 shows a strong scaling plot for our OpenMP implementation with a grid size of 200 * 200. Up to 16 threads, the curve decreases linearly with a slope close to -1, which indicates good strong scaling efficiency. At 16 threads and beyond, however, the run time starts to flatten out. We believe this is caused by the setup overhead of OpenMP threads: with an excessive number of threads for a small problem, the marginal benefit of adding more threads diminishes.
Fig. 4. Strong scaling OpenMP for grid size = 200 * 200
Fig. 5. Strong scaling OpenMP for grid size = 400 * 400
To verify this hypothesis, we also conducted the strong scaling experiment with a problem size of 400 * 400 (Fig. 5). With the larger problem size, our OpenMP implementation scales well even as the number of threads increases to 64, staying very close to the ideal scaling line.
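For reference, the scaling efficiencies quoted in this and the following subsections are assumed to follow the standard definitions,

E_{\text{strong}}(p) = \frac{T(1)}{p\,T(p)}, \qquad E_{\text{weak}}(p) = \frac{T(1)}{T(p)},

where T(p) is the run time on p threads (or ranks); for strong scaling the total problem size is fixed, while for weak scaling the problem size grows in proportion to p.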
3.3 Weak Scaling
Figure 6 shows the weak scaling plot of our OpenMP implementation. For the first three data points the weak scaling efficiency is around 67%, and over the entire curve it is around 31%. The slope increases suddenly near the end of the curve. Since there is little communication overhead in our OpenMP implementation, we attribute this to thread setup/teardown costs and memory allocation costs: memory allocation is not done in parallel, and allocating a large chunk of memory can noticeably affect the program run time when the overall run time is relatively small.
Fig. 6. Weak scaling OpenMP
3.4 MPI
Strong Scaling. We tested on a single node with a fixed 3200 * 3200 block size. Figure 7 shows only a small jump in efficiency from 16 to 32 ranks; otherwise, performance remains good each time the number of processors is doubled.
Fig. 7. Strong scaling MPI (strong scaling efficiency vs. number of processors)
Weak Scaling. As Fig. 8 shows, our weak scaling efficiency is highest, at 77.56%, when using two tasks. As we increase the number of ranks and the number of blocks by the same factor, the weak scaling efficiency gradually drops to 55.01%, which is still fairly good.
Fig. 8. Weak scaling MPI (weak scaling efficiency vs. number of processors)
Slowdown Comparison. From Fig. 9 we can see that OpenMP is the slowest, MPI is in between, and CUDA is the fastest (OpenMP -> MPI -> CUDA).
Fig. 9. Slowdown comparison (CUDA, OpenMP, and MPI with 68 ranks)
3.5 CUDA
Slowdown Plot. Figure 10 shows the slowdown plot