Intelligent Computing: Proceedings of the 2021 Computing Conference, Volume 2 (Lecture Notes in Networks and Systems, 284) [1st ed. 2021] 303080125X, 9783030801250

This book is a comprehensive collection of chapters focusing on the core areas of computing and their further applications.

English Pages 1268 [1269] Year 2021


Table of contents:
Editor’s Preface
Contents
Balanced Weighted Label Propagation
1 Introduction
1.1 Contributions
2 Related Work
3 Proposed Approach
3.1 Resolution Limit
4 Experiment
4.1 Datasets and Setup
4.2 Results and Discussion
5 Conclusion
References
Analysis of the Vegetation Index Dynamics on the Base of Fuzzy Time Series Model
1 Introduction
2 Problem Definition
3 Fuzzy Modeling of NDVI Dynamics
4 Conclusion
References
Application of Data–Driven Fault Diagnosis Design Techniques to a Wind Turbine Test–Rig
1 Introduction
2 Wind Turbine System Description
3 Data–Driven Methods for Fault Diagnosis
3.1 Fault Estimators via Artificial Intelligence Tools
4 Simulation and Experimental Tests
4.1 Simulated Results
4.2 Hardware–in–the–Loop Experiments
5 Conclusion
References
Mathematical Model to Support Decision-Making to Ensure the Efficiency and Stability of Economic Development of the Republic of Kazakhstan
1 Introduction
2 Problem Statement and Its Mathematical Model
3 Method and Algorithm for Solving the Problem
4 Conclusion
References
Automated Data Processing of Bank Statements for Cash Balance Forecasting
1 Introduction
2 Terminology and Notation
3 Data Description
4 Account Balance Data Challenges
4.1 Reconstruction Challenges
4.2 Forecasting Challenges
5 Data Description
5.1 Conventional Approaches to Financial Forecasting
5.2 Machine Learning Approaches to Financial Forecasting
6 Experimental Comparison
6.1 Dataset Description
6.2 Experimental Results
7 Conclusions and Further Work
References
Top of the Pops: A Novel Whitelist Generation Scheme for Data Exfiltration Detection
1 Introduction
2 Related Work
3 Proposed Scheme
3.1 Feature Computation
3.2 Whitelist Candidate Computation
3.3 Whitelist Candidate Expansion
4 Evaluation
4.1 Threat Model
4.2 Data Set and Experiment Configuration
4.3 Feature Evaluation
4.4 Whitelist Evaluation
4.5 Detection Efficiency Evaluation
4.6 Performance Evaluation
5 Conclusion
References
Analyzing Co-occurrence Networks of Emojis on Twitter
1 Introduction
2 Related Work
3 Data Collection
3.1 Topics
3.2 Collection Methodology
3.3 Search Terms
4 Constructing Emoji Co-occurrence Network
5 Results
5.1 Eigenvectors
5.2 Dating
5.3 Music
5.4 US Politics
5.5 Olympics
5.6 Epidemics
5.7 Control Dataset
5.8 Data
6 Conclusion
7 Future Work
References
Development of an Oceanographic Databank Based on Ontological Interactive Documents
1 Introduction
2 State of the Art
3 The Method of Recursive Reduction
3.1 Reduction Operator
3.2 Applicability Function
3.3 Transformation Function
4 Ontological Interactive Documents
4.1 Ontological Presentation Templates
4.2 Ontological Descriptors for Software Integration
5 Oceanographic Data Processing
5.1 Scientific Reports
5.2 Measurement Result Files
6 Oceanographic Databank Software System
6.1 Ontology-Based User Interface
7 Conclusions
References
Analyzing Societal Bias of California Police Stops Through Lens of Data Science
1 Introduction
2 Related Work
3 Experimental Setup
4 Analysis
4.1 Analysis of Stops Through Visualizations
4.2 Assessing Racial Discrimination in Traffic Stop Decisions Using Statistical Analysis
5 Conclusion
References
Automated Metadata Harmonization Using Entity Resolution and Contextual Embedding
1 Introduction
2 Related Studies
3 System Design
3.1 Framework Overview
3.2 Implementation
4 Experimentation
5 Limitations
6 Conclusion
References
Data Segmentation via t-SNE, DBSCAN, and Random Forest
1 Introduction
2 Methods
2.1 The Algorithm
2.2 Cluster Profiles
2.3 Generalizing Segments
2.4 Software
3 Empirical Results
3.1 Iris Data Set
3.2 Instagram Data
3.3 MNIST Case Study
3.4 Generalization Performance
4 Conclusion and Future Work
A Matching Clusters for Generalization Analysis
References
GeoTree: A Data Structure for Constant Time Geospatial Search Enabling a Real-Time Property Index
1 Introduction
2 Background
2.1 Naive Geospatial Search
2.2 GeoHash
2.3 Efficiency Improvement Attempts
3 GeoTree
3.1 High-Level Description
3.2 GeoTree Data Nodes
3.3 Subtree Data Caching
3.4 Retrieval of the Subtree Data
3.5 Memory Requirement of the Data Structure
3.6 Technical Implementation
3.7 Time Complexity
3.8 Comparison with the Prefix Tree (Trie)
4 Real-World Performance
4.1 Application: House Price Index Algorithm
4.2 Performance Results
4.3 Correlation
4.4 Scalability Testing
4.5 Discussion of Results
5 Conclusion
References
Prediction Interval of Future Waiting Times of the Two-Parameter Exponential Distribution Under Multiply Type II Censoring
1 Introduction
2 The General Weighted Moments Estimation of the Scale Parameter of the One-Parameter Exponential Distribution
3 Intervals of Future Waiting Time
4 Example
5 Discussion
References
Predictive Models as Early Warning Systems: A Bayesian Classification Model to Identify At-Risk Students of Programming
1 Introduction
2 Related Work
2.1 Predictors of Student Performance
2.2 Predictive Data Mining Modelling for Student Performance in Computing Education
2.3 Predictive Modelling with Naive Bayes Classification
2.4 Early Warning Systems (EWS) in Education
3 Research Methodology
3.1 Description of the Course and Data Sources
3.2 Instruments and Assessments
3.3 Predictive Model
4 Data Analysis and Results
5 Discussion
6 Conclusions, Limitations and Future Work of the Study
References
Youden's J and the Bi Error Method
1 Introduction
2 Methods
3 Results
4 Discussion
4.1 On the Choice of Hypothesized
4.2 Normal Distribution
4.3 F Distribution
5 Conclusion
A Simulation Results
References
A New Proposal of Parametric Similarity Measures with Application in Decision Making
1 Introduction
2 Preliminaries and Generalization of Equality Index
3 Two Classes of Parametric Equality Indices
4 Parametric Similarity Measures for Fuzzy Vectors
4.1 Ordered Weighted Aggregation (OWA) Operators
4.2 A New Proposal
5 Application
6 Conclusions
References
The Challenges of Using Big Data in the Consumer Credit Sector
1 Introduction
2 Alternative Data in Credit Scoring
2.1 Psychometric Information
2.2 Social Networks
2.3 Peer-to-Peer Lending
3 Privacy and Ethics
3.1 Equifax Data Breach
3.2 Ethical Aspects of the Equifax Data Breach
4 Information Sharing
4.1 Regulations
4.2 Discrimination and Manipulation
5 Discussion
References
New Engine to Promote Big Data Industry Upgrade
1 Introduction
2 Big Data Industry Upgrade and Construction Goals and Key Breakthroughs
3 Realization of Big Data Industrialization
3.1 Industrialization Core Construction-“General Big Data Platform+” Model
3.2 Establish a Data Connection
3.3 The Realization Path of Big Data Industrialization
4 The Realization of the Big Data Industry Chain
4.1 Core Construction of the Industrial Chain
4.2 Framework and Map of the Industry Chain
4.3 The Transformation and Upgrading Path of the Industrial Chain
5 Conclusions and Prospects
References
Using Correlation and Network Analysis for Researching Intellectual Competence
1 Introduction
1.1 Research Questions
1.2 Purpose of the Study
2 Study Participants
3 Research Methods
3.1 Methodology for Diagnosing Conceptual Abilities. “Conceptual Synthesis” [14]
3.2 Methods for Identifying Metacognitive Abilities
3.3 The Author’s Method of Diagnostics of Intentional Abilities “Mood”
3.4 Author’s Methodology for Assessing Intellectual Competence “Interpretation”
3.5 Research Techniques
4 Results
4.1 Correlation Analysis
4.2 Network Analysis
5 Findings
6 Conclusion
7 Future Work
References
A Disease Similarity Technique Using Biological Process Functional Annotations
1 Introduction
2 Background and Related Work
3 Methods and Techniques
3.1 Gene Ontology and Biological Process Annotations
3.2 Biological Process Jaccard Score
4 Evaluation and Results
4.1 Experiment 1 - Phenotype Jaccard Score and BP-Weight Correlation
4.2 Experiment 2 - Correlation Between Shared Chemicals and BP-Weight
4.3 Experiment 3 - Comparing Diseases Without Genes in Common
4.4 Experiment 4 - Correlation Between Shared Chemicals and BP-Weight for a Given Disease
4.5 Experiment 5 - Comparing Diseases with Genes in Common
4.6 Experiment 6 - Comparing Diseases with High BP-Weights
5 Analysis of Results and Discussion
6 Conclusion
References
Impact of Types of Change on Software Defect Prediction
1 Introduction
2 Related Work
3 Data Collection
3.1 Churn Data Extraction
3.2 Churn Metrics Construction
4 Experiments and Results
4.1 Experiment I
4.2 Experiment II
5 Conclusion
References
Analyzing Music Genre Popularity
1 Introduction
2 Related Work
3 Data Description and Collection
3.1 Gathering Data
3.2 Cleaning and Formatting Data
4 Data Analysis
5 Results and Evaluation
6 Conclusion and Future Work
References
Result Prediction Using Data Mining
1 Introduction
2 Classification Model
2.1 Naïve Bayes
2.2 J-48
2.3 K*
2.4 Random Tree
3 Performance Measure Metrics
3.1 Kappa Statistics
3.2 Root Mean Square Error
3.3 Relative Absolute Error
3.4 Root Relative Squared Error
3.5 Info Gain
3.6 Relief Attribute
3.7 False Positive and False Negative
3.8 Confusion Matrix
4 Data Collection
4.1 Data Migration
4.2 Data Aggregation
4.3 Data Cleaning
4.4 Outlier Detection and Replacement
4.5 Attribute Selection
4.6 Export Data
4.7 Input Parameter
4.8 Output Parameter
5 Result and Analysis
5.1 0% Error
5.2 10% Error
6 Conclusion
References
A Systematic Literature Review on Big Data Extraction, Transformation and Loading (ETL)
1 Introduction
2 Methodology
3 Result and Answers to Review Questions
3.1 RQ1: What Approaches are Currently Used to Implement ETL Solutions?
3.2 RQ2: What are the Quality Attributes to be Considered When Designing ETL Solutions?
3.3 RQ3: What are the Prevailing Challenges in Developing ETL Solutions?
3.4 RQ4: What is the Current Depth of Coverage in ETL Research with Respect to the Application Domain, Geographical Locations and Frequency of Papers Published?
4 Discussion of Results
4.1 ETL Implementation Techniques
4.2 Quality Attributes
4.3 Prevailing Challenges
4.4 Current Trends
5 Conclusion and Future Work
References
Accelerating Road Sign Ground Truth Construction with Knowledge Graph and Machine Learning
1 Introduction
2 Related Work
3 Road Sign Ontology
3.1 Road Sign Features
3.2 Road Sign Conventions and Prototypes
3.3 Alignment with Domain Vocabularies
4 Road Sign Knowledge Graph
4.1 Road Sign Knowledge Graph Construction with Crowdsourcing
4.2 Alignment of Road Sign Knowledge Graph to Domain-Specific Knowledge Graphs
4.3 Notes on Architecture
5 Knowledge Graph and Machine Learning Assisted Road Sign Annotation Pipeline
6 Evaluation
6.1 Search Space Reduction with Road Sign Knowledge Graph
6.2 Variational Prototyping-Encoder Model Accuracy with Respect to Search Space Size
7 Conclusion
References
Adjusted Bare Bones Fireworks Algorithm to Guard Orthogonal Polygons
1 Introduction
2 Mathematical Problem Definition
3 Visibility Algorithm for IC of the OGP
4 Bare Bones Fireworks Algorithm
5 Optimal Camera Placement by ABBFWA
6 Simulation Experiment and Analysis of Results
7 Conclusion and Future Work
References
Unsupervised Machine Learning-Based Elephant and Mice Flow Identification
1 Introduction
2 Related Work
3 Proposed Methodology
4 Experimental Set-Up
4.1 Data Description
4.2 Data Pre-processing
4.3 Data Reduction
4.4 K-mean Clustering
4.5 Flow Labelling
5 Results
5.1 Principal Component Analysis
5.2 K-Means Clustering
5.3 Flow Identification
5.4 Discussion
6 Conclusions
References
Automated Generation of Zigzag Carbon Nanotube Models Containing Haeckelite Defects
1 Introduction
1.1 Purpose
1.2 Background
2 Method
3 Results
4 Conclusions
References
Artificial Intelligence Against Climate Change
1 Introduction
2 Background
2.1 Energy Consumption in the United States
2.2 AI and the IoT
2.3 Importance of Mathematical Modeling
2.4 Summary
3 Existing Technology
3.1 Buildings
3.2 Transportation
3.3 Industry
4 Blockades in Technological Progression
4.1 Lack of Cohesion
4.2 Lack of Policy
5 Developing Innovations
5.1 Buildings
5.2 Transportation
5.3 Industry
6 Ethical Implications of Emerging Technology
7 Future Research
8 Conclusion
References
Construction Site Layout Planning Using Multiple-Level Simulated Annealing
1 Introduction
2 Simulated Annealing Algorithm
3 Problem Description
3.1 Input and Output of the Algorithm
3.2 Design Objectives and Variables
3.3 Domain of Definition
4 Objective-Functions Design
5 Results
6 Conclusions
References
Epistocracy Algorithm: A Novel Hyper-heuristic Optimization Strategy for Solving Complex Optimization Problems
1 Introduction
2 Epistocracy Algorithm
2.1 Overview
2.2 Generating the Initial Population
2.3 Performance Evaluation
2.4 Population Separation
2.5 Governor Assignment
2.6 Leading the Population
2.7 Improving Governors
2.8 Governor’s Performance Adjustment
2.9 Genetic Operators: Recombination and Mutation
3 Experimental Results and Analysis
3.1 Evaluation of Epistocracy Algorithm Using Benchmark Functions
3.2 Evaluation of Epistocracy Algorithm Using the MNIST Dataset
4 Conclusion and Future Work
References
An Approach for Non-deterministic and Automatic Detection of Learning Styles with Deep Belief Net
1 Introduction
1.1 Context
1.2 Problem
1.3 Objectives
2 Literature and Theory Review
2.1 Understanding Learning Styles
2.2 Theory and Past Research
3 Theoretical Framework and Hypothesis
3.1 Automatic Learning Styles Estimation
3.2 Nondeterministic Learning Traces
3.3 Deep Belief Nets
3.4 Conceptual Framework
4 Research Methodology
4.1 Data Collection
4.2 Proposed Technique of Data Analysis: DBN-LIS Method
4.3 Model Design
5 Data Analysis Measurement Analysis
5.1 Measurement Analysis
5.2 DBN-LIS Analysis
6 Discussion
7 Conclusion
Annexure A
References
Using Machine Learning to Identify Methods Violating Immutability
1 Introduction
2 Related Work
3 Method
3.1 Categories
3.2 Architecture
3.3 Training
4 Evaluation
4.1 Dataset
4.2 Grammar
4.3 Results
5 Conclusion
References
Predicting the Product Life Cycle of Songs on the Radio
1 Introduction
2 Dataset
3 Clustering
4 Designs
5 Models
5.1 KNN
5.2 Random Forest
5.3 MLP
5.4 CNN
6 Findings
7 Discussion
8 Future Work
References
Hierarchical Roofline Performance Analysis for Deep Learning Applications
1 Introduction
2 Methodologies
2.1 ERT Extensions for Machine Characterization
2.2 Nsight Compute Metrics for Application Characterization
3 Experimental Setup
3.1 Hardware and Software Configuration
3.2 DeepCAM Benchmark
4 Results
4.1 The TensorFlow Version of DeepCAM
4.2 The PyTorch Version of DeepCAM
4.3 Automatic Mixed Precision
4.4 Zero-AI Kernels
4.5 Overall Performance
5 Conclusions
References
Data Augmentation for Short-Term Time Series Prediction with Deep Learning
1 Introduction
2 Related Work
3 Background
3.1 Data Augmentation
3.2 Deep Learning
4 The Proposal
4.1 Parameters
4.2 Generating Linear Synthetic Values
4.3 Generating Non-linear Synthetic Values
5 Experimentation
5.1 Times Series Selection
5.2 Test and Training Data
5.3 Data Augmentation
5.4 Prediction with Deep Learning
5.5 Evaluation of Results
6 Results
7 Conclusion
8 Future Work
Appendix
References
On the Use of a Sequential Deep Learning Scheme for Financial Fraud Detection
1 Introduction
2 Related Work
3 Problem Definition
4 The Proposed Methodology
4.1 The Training Process
4.2 Dataset Oversampling
4.3 The Proposed Autoencoder for Feature Selection
4.4 The Proposed CNN
5 Experimentation and Evaluation
5.1 Experimental Setup and Performance Metrics
5.2 Performance Assessment
5.3 Comparative Assessment
6 Conclusions
References
Evolutionary Computation Approach for Spatial Workload Balancing
1 Introduction
2 Background and Related Work
3 GIS Map Partitioning
3.1 Map Indexes/Indices Computation
3.2 Evolutionary Computation Implementation
4 Data and Experiments
4.1 Data and Materials
4.2 Non–merging Based Experiment
4.3 Merging Based Experiment
5 Results and Discussion
6 Conclusion and Further Research
References
Depth Self-optimized Learning Toward Data Science
1 Introduction
2 Related Work
3 Methods
3.1 F-Model
3.2 Reinforcement Learning Stage
4 Experiment
4.1 F-Model
4.2 RL-Stage
4.3 Validation
5 Conclusion and Perspective
References
Predicting Resource Usage in Edge Computing Infrastructures with CNN and a Hybrid Bayesian Particle Swarm Hyper-parameter Optimization Model
1 Introduction
2 Related Work
3 A Convolutional Neural Network Approach
3.1 Convolutional Neural Network
3.2 Long Short-Term Memory
3.3 Regularization
3.4 Learning Process
3.5 Particle Swarm Optimization
3.6 Bayesian Optimization
3.7 Hybrid Bayesian Particle Swarm Hyper-parameter Optimization Model
4 Experimental Results and Discussion
4.1 Experimental Setup
4.2 Results and Discussion
5 Conclusion
References
Approaching Deep Convolutional Neural Network for Biometric Recognition Based on Fingerprint Database
1 Introduction
2 Literature Review
3 Material and Method
3.1 Convolutional Neural Network
3.2 Proposed Convolutional Neural Network
4 Dataset and Result Analysis
5 Conclusion
References
Optimizing the Neural Architecture of Reinforcement Learning Agents
1 Introduction
2 Related Work
3 Reinforcement Learning Methods
4 Neural Architecture Search Methods
4.1 Adaptation to Reinforcement Learning
4.2 Efficient Neural Architecture Search (ENAS)
4.3 Single-Path One-Shot with Uniform Sampling (SPOS)
5 Experiments
5.1 Atari Environment
5.2 Search Spaces
5.3 Methodology
5.4 Random Search Baseline
5.5 Experiments with Larger Search Spaces
6 Discussion
7 Conclusion
A Search Spaces
B Nature CNN Architecture
C Implementation Details
D The Best Architectures
References
Automatic Ensemble of Deep Learning Using KNN and GA Approaches
1 Introduction
2 Datasets
3 Methodology and Algorithm
3.1 Offline Process
4 Results
4.1 Case Studies
4.2 Case Studies Summary
5 Conclusions
References
A Deep Learning Model for Data Synopses Management in Pervasive Computing Applications
1 Introduction
2 Related Work
3 Preliminaries and Problem Description
4 Uncertainty Driven Proactive Synopses Dissemination
4.1 Estimating Future Trends of Synopses
4.2 The Uncertainty Driven Model
4.3 Synopses Update and Delivery
5 Experimental Setup and Evaluation
5.1 Setup and Performance Metrics
5.2 Performance Assessment
6 Conclusions
References
DeepObfusCode: Source Code Obfuscation through Sequence-to-Sequence Networks
1 Introduction
2 Related Work
2.1 Code Obfuscation Methods
2.2 Applications of Neural Networks
3 DeepObfusCode
3.1 Ciphertext Generation
3.2 Key Generation
3.3 Source Code Execution
4 Results
4.1 Stealth
4.2 Execution Cost
5 Conclusion
References
Deep-Reinforcement-Learning-Based Scheduling with Contiguous Resource Allocation for Next-Generation Wireless Systems
1 Introduction
2 System Model and Problem Formulation
3 Proposed Scheduling Algorithm Based on DRL
3.1 State Design
3.2 Action Design
3.3 Reward Design
3.4 Complexity Analysis
3.5 Training of DQN
4 Numerical Results
5 Conclusion and Discussion
References
A Novel Model for Enhancing Fact-Checking
1 Introduction
2 Related Work
3 The Proposed Fact-Predictor Architecture
3.1 Key Idea
3.2 A High-Level Policy for Plausible Warrant Extraction
3.3 Best Warrant Picking with Low-Level Policy
4 Experiments
4.1 Dataset
4.2 Settings
4.3 Results and Discussions
5 Conclusion and Future Work
References
Investigating Learning in Deep Neural Networks Using Layer-Wise Weight Change
1 Introduction
2 Related Work
3 Methods
3.1 Relative Weight Change
3.2 Experimental Approach
4 Results
4.1 Residual Networks
4.2 VGG
4.3 AlexNet
5 Conclusion
References
Deep Reinforcement Learning for Task Planning of Virtual Characters
1 Introduction
2 Related Work
3 Overview
3.1 Background
4 Environment and Agent Modeling
4.1 Environment
4.2 Agent
4.3 Reward
5 Experiments
5.1 Training
5.2 Results
5.3 Discussion
6 Conclusion and Future Work
A Motion Synthesis Module
B Training Configuration
References
Accelerating Deep Convolutional Neural on GPGPU
1 Introduction
2 Related Work
3 Convolutional Neutral Networks
3.1 VGG-16 Model
3.2 CNN-non Static
3.3 1×1 Convolutions
4 Convolution Algorithms on GPGPU
4.1 General Matrix Multiply (GEMM)
4.2 Fast Fourier Transformation
4.3 Winograd
5 Building CSR Weight Format
6 Convolution Operation Using a Sparse Operation on GPGPU
7 Results
8 Conclusions
References
Enduring Questions, Innovative Technologies: Educational Theories Interface with AI
1 Introduction
2 Learning Theories
2.1 How Have Things Changed? How Have Things Remained the Same?
2.2 What Are the Social Dimensions of Teaching and Learning?
3 Teaching and Learning
3.1 Are Teachers to Be Primarily Interacting with the Software Data?
3.2 Should Schooling Lead to Employment or Human Creativity?
4 Ethics
4.1 How Can AI Hurt Humans?
4.2 Is Persuasive Language the Coin of the Realm in AI Student Learning?
4.3 Is It Possible to Align Machine Learning Values with Human Values?
5 Data Collection and Interpretation
6 Conclusion and Future Phases
References
DRAM-Based Processor for Deep Neural Networks Without SRAM Cache
1 Introduction
2 Architecture
3 SRAM Cache Elimination
4 WAFER-ON-WAFER Stacking
5 Results
6 Conclusion
References
Study of Residual Networks for Image Recognition
1 Introduction
2 Related Work
3 Dataset and Implementation
3.1 Tiny ImageNet Dataset
3.2 PyTorch Implementation
4 Network Design
4.1 Network Architectures
4.2 Data Augmentation
5 Results
6 Conclusion
References
A Systematic Review of Educational Data Mining
1 Introduction
2 Related Work
3 Research Directions
3.1 Traditional Teaching
3.2 Distance Education and Online Learning
4 Methodology
4.1 Machine Learning
4.2 Deep Learning
4.3 Other Methods
5 Discussion and Future Work
5.1 Discussion
5.2 Future Work
6 Conclusion
References
Image Classification with A-MnasNet and R-MnasNet on NXP Bluebox 2.0
1 Introduction
2 Literature Review
2.1 NXP Bluebox 2.0
3 Features of A-MnasNet and R-MnasNet
3.1 Convolutions
3.2 Activation Functions
3.3 Data Augmentation
3.4 Learning Rate Annealing or Scheduling
3.5 Optimizers
4 Implementation on NXP BLUEBOX 2.0
4.1 Implementation Setup
4.2 Implementation Methodology
5 Results
6 Conclusion
7 Future Scope
References
CreditX: A Decentralized and Secure Credit Platform for Higher Educational Institutes Based on Blockchain Technology
1 Introduction
1.1 Background
1.2 Literature Survey
1.3 Research Gap
2 Methodology
2.1 System Overview
2.2 Implementation
3 Results and Discussion
4 Conclusion
References
An Architecture for Blockchain-Based Cloud Banking
1 An Introduction
1.1 A Literature Review on Blockchain and Its Applications
1.2 The Paper's Structural Contents
2 Blockchain in Domestic Banking: Challenges and Inspiration
3 A Framework for Blockchain-Based Banking Cloud
3.1 Sketching a Network Architecture
3.2 Two-Tier Network and Workflow
3.3 Block-Data and Distribution
3.4 Core Protocols
4 Governance on the Network
4.1 The Conditions to Become a Node
4.2 Reputation Score System
5 Block Production and Consensus Mechanism
6 Differentiation and Advantage Offering
7 Applications, Assessment and Conclusion
References
A Robust and Efficient Micropayment Infrastructure Using Blockchain for e-Commerce
1 Introduction
2 Requirements for a Robust Micropayment Infrastructure
3 Micropayment Infrastructure Based on Blockchain
3.1 Description of the Actors
3.2 Interactions Between Actors
4 Buyer's Trust Models
4.1 Neutral Trust Model
4.2 Optimistic Trust Model
4.3 Pessimistic Trust Model
5 Auditing Entity's Making Decision
5.1 Computation of the Block Size
5.2 Risk Evaluation
6 Infrastructure Validation and Simulation
6.1 Infrastructure Validation
6.2 Simulation and Results
7 Conclusion
References
Committee Selection in DAG Distributed Ledgers and Applications
1 Introduction
1.1 DAG Based Cryptocurrencies
1.2 Contributions
1.3 Outline
2 Related Work
2.1 Reputation and Committees in DLT
2.2 Decentralized Random Numbers Generation
3 Protocol Overview and Setup
4 Committee Selection
4.1 Application Message
4.2 Checkpoint Selection
4.3 Maximal Reputation
5 Threat Model and Security
5.1 Zipf Law
5.2 Overtaking the Committee
6 Applications
6.1 Decentralised Random Number Generator
6.2 Smart Contract Oracles
6.3 Consensus and Sharding
7 Conclusion and Future Research
References
An Exploration of Blockchain in Social Networking Applications
1 Purpose
2 Background
2.1 A Brief About Blockchain
2.2 Social Networking Sites (SNS) and Shared Economy
2.3 Impact of Blockchain on Social Networking
2.4 Existent Blockchain-Based Social Networking Applications
3 The Challenge of Cash Currency Exchange
4 Method
4.1 The Application Functionality
4.2 Scenario
4.3 Architecture
5 Results
6 Conclusion
6.1 Future Work
References
Real and Virtual Token Economy Applied to Games: A Comparative Study Between Cryptocurrencies
1 Introduction
2 Monetization
3 Crypto Games
4 Cryptocurrencies in Games
5 Proposed Architecture
6 Conclusions and Future Work
References
Blockchain Smart Contracts Static Analysis for Software Assurance
1 Introduction
2 Review of Literature and Related Work
2.1 Smart Contract Use Cases
2.2 Smart Contract Correctness and Verification
2.3 Other Smart Contract Software Assurance Methodologies
2.4 Smart Contract Data Assurance
2.5 Developing Smart Contracts
2.6 Smart Contract Attacks and Protection
3 Static Analysis to Detect Errors
4 Findings
5 Solidity Source Code Analysis
6 Future Work
7 Conclusions
References
Promize - Blockchain and Self Sovereign Identity Empowered Mobile ATM Platform
1 Introduction
2 Promize Platform Architecture
2.1 Overview
2.2 Core Bank Layer
2.3 Distributed Ledger
2.4 Money Transfer Layer
2.5 Peer to Peer Communication Layer
3 Promize Platform Functionality
3.1 Use Case
3.2 Data Privacy and Security
3.3 Low Cost Transactions
4 Promize Platform Production Deployment
4.1 Overview
4.2 Service Architecture
5 Performance Evaluation
5.1 Transaction Throughput
5.2 Transaction Execution and Validation Time
5.3 Transaction Scalability
5.4 Transaction Execution Rate
5.5 Search Performance
5.6 Block Generate Time
6 Related Work
7 Conclusions and Future Work
References
Investigating the Robustness and Generalizability of Deep Reinforcement Learning Based Optimal Trade Execution Systems
1 Introduction
2 DRL Based Optimal Trade Execution
2.1 Preliminaries
2.2 Problem Formulation
2.3 Architecture
2.4 Assumptions
3 Experimental Results
3.1 Data Sources
3.2 Experimental Methodology and Settings
3.3 Algorithms
3.4 Main Evaluation and Backtesting
4 Conclusion and Future Work
References
On Fairness in Voting Consensus Protocols
1 Introduction
1.1 Preliminaries
1.2 Weights and Fairness
1.3 IOTA
1.4 Outline
2 Related Work
2.1 Weighted Voting Consensus Protocols
2.2 Fairness
3 Voting Consensus Protocols
4 Fairness
5 Distribution of Weight
6 Scalability and Message Complexity
7 Simulations
7.1 Threat Model
7.2 Failures
7.3 Results
8 Discussion
References
Dynamic Urban Planning: An Agent-Based Model Coupling Mobility Mode and Housing Choice. Use Case Kendall Square
1 Introduction
2 Background
2.1 Related Work
2.2 Agent-Based Simulations
3 Model Description
3.1 Entities, State Variables, and Scales
3.2 Process Overview
3.3 Initialization and Input Data
4 Model Calibration and Validation: Use Case Kendall Square
5 Discussion
6 Conclusion
References
Reasoning in the Presence of Silence in Testimonies: A Logical Approach
1 Introduction
2 Background
2.1 Grice's Sealed Lips
2.2 Answer Set Programming
3 Case Study
3.1 Method
3.2 Riddle the Criminal
4 Interpreting Omission
4.1 Defensive Silence
4.2 Acquiescent Silence
4.3 False Statements or Silence
4.4 Discussion
5 Conclusions
A Criminal.lp for Clingo 4.5.4
B Silence.py Prototype for Python 3.7
References
A Genetic Algorithm Based Approach for Satellite Autonomy
1 Introduction
1.1 History of Autonomy and Distributed Space Systems
1.2 Related Work
1.3 Advantages of Learned Approaches
1.4 Evolutionary Algorithm Preliminaries
2 Genetic Algorithm Implementation
2.1 Goal Task Description
2.2 Experimental Setup
2.3 Basic Algorithm
2.4 Results
3 Conclusions and Future Work
References
Communicating Digital Evolutionary Machines
1 Introduction
2 An Artificial World
3 Digital Evolutionary Machines
4 Results
5 Summary
References
Analysis of the MFC Singuliarities of Speech Signals Using Big Data Methods
1 Introduction
2 Literature Review
2.1 Review of Classifications Methods
2.2 Maintaining the Integrity of the Specifications
3 Statistical Data Corpus
3.1 Emotional Speech Databases
3.2 Feature Extraction [39–41]
3.3 Decision Tree Classifiers
4 Problem Statement
5 Experimental Confirmation of Results
6 Conclusion
References
A Smart City Hub Based on 5G to Revitalize Assets of the Electrical Infrastructure
1 Introduction
2 State of the Art
3 Design of the Solution
3.1 Hub 5G Architecture
3.2 Data Integration Model
4 Configuration of the System
4.1 Mobile Network Initialization
4.2 Application Initialization
4.3 Deployment Scenario
5 Implementation of the Hub
5.1 Hardware and Communication
5.2 Integration Between Back End and Mobile Application
5.3 Integrated Ecosystem
6 Conclusions and Future Work
References
LoRa RSSI Based Outdoor Localization in an Urban Area Using Random Neural Networks
1 Introduction
2 LoRa and LoRaWAN
3 Related Work
3.1 Overview of RSSI Fingerprint Localization Methods
3.2 Random Neural Networks
3.3 Gradient Descent Algorithm
4 Methodology
4.1 Real Test Measurements
4.2 RNN-LoRa RSSI Based Localization Model
5 Results and Performance Analysis
6 Conclusion and Future Work
References
A Deep Convolutional Neural Network Approach for Plant Leaf Segmentation and Disease Classification in Smart Agriculture
1 Introduction
2 Material and Methods
2.1 Mask R-CNN
2.2 Convolutional Neural Networks
2.3 Data Augmentation
2.4 Leaf Segmentation
2.5 Disease Classification
3 Results and Discussion
3.1 Leaf Detection
3.2 Disease Classification
4 Conclusion
References
Medium Resolution Satellite Image Classification System for Land Cover Mapping in Nigeria: A Multi-phase Deep Learning Approach
1 Introduction
2 Review of Related Works
2.1 CNN Model
2.2 Transfer Learning
2.3 Data Augmentation
3 Methodology
3.1 Dataset
3.2 The Model Framework
3.3 Training the CNN Model
4 Results and Analysis
4.1 Discussions
5 Conclusion
References
Analysis of Prediction and Clustering of Electricity Consumption in the Province of Imbabura-Ecuador for the Planning of Energy Resources
1 Introduction
2 Related Works
3 Materials and Methods
3.1 Database Description
3.2 Data Analysis Proposed
3.3 Transformation
3.4 Regression Models
3.5 Clustering
4 Results
4.1 Regression Models
4.2 Clustering
4.3 Interface
5 Conclusions and Future Works
References
Street Owl. A Mobile App to Reduce Car Accidents
1 Introduction
2 Literature Review
2.1 Background and Overview of Related Work
2.2 Related Mobile Applications
2.3 Analysis of Related Work
3 Research Problem
4 Methodology
5 Analysis
5.1 Data Collection
5.2 Observation Description
5.3 Observation Analysis
5.4 Discussion
6 Interface Design
7 Usability Testing
8 Conclusion
References
Low-Cost Digital Twin Framework for 3D Modeling of Homogenous Urban Zones
1 Introduction
2 Research Trends
3 Framework Design Methodology
3.1 Spatial Data Acquisition
3.2 Building a 3D Environment
3.3 Data Processing to Facilitate Administration
3.4 Environment Behaviors Simulation in a Virtual Environment
3.5 Recommendation and Actions to the Physical World
3.6 Establishing Real-Time Connections Between the Physical and Virtual Environments
3.7 Multi-source Data Collection of Vital Information
4 Implementation
5 Discussion
6 Conclusion and Future Work
References
VR in Heritage Documentation: Using Microsimulation Modelling
1 Introduction
2 Research Scope
3 Research Objectives
4 Literature Review
5 Details of the Case Study
6 Advantages of the VR Function
7 VR in Architectural Heritage Documentation
7.1 VR Modelling Process
7.2 Details of the New VR Function
7.3 Microsimulation Modelling
8 Conclusion
References
MQTT Based Power Consumption Monitoring with Usage Pattern Visualization Using Uniform Manifold Approximation and Projection for Smart Buildings
1 Introduction
2 Theoretical Background
2.1 Total Aggregated Power Consumption
2.2 MQTT
2.3 Time Series Clustering and Dimensionality Reduction
3 Methodology
3.1 System Architecture
3.2 Hardware System Setup
3.3 Software System Flow
4 Results and Discussion
4.1 Data Collection Results
4.2 Preprocessing Results
4.3 Appliance Usage Patterns
5 Conclusion
References
Deepened Development of Industrial Structure Optimization and Industrial Integration of China’s Digital Music Under 5G Network Technology
1 Development Background and Status Quo of China’s Digital Music Industry
2 Optimization of China Digital Music Industry Structure Under 5G Network
2.1 5G-Based Artificial Intelligence Composition Technology
2.2 Virtual Concert and AR Augmented Reality Experience
2.3 5G-Based Music Blockchain
3 In-Depth Development of Digital Music Industry Convergence Format under 5G Network
4 Summary
References
Optimal Solution of Transportation Problem with Effective Approach Mount Order Method: An Operational Research Tool
1 Introduction
2 Problem Description
2.1 Basic Feasible Solution (BFS) of TP
2.2 Optimal Solution to a TP
3 Solution Approach
4 Numerical Example
5 Related Problem
6 Conclusion
References
Analysis of Improved User Experience When Using AR in Public Spaces
1 Introduction
2 Literature Review
2.1 The User Centered Design
2.2 Cultural Application
2.3 Cultural Application
3 Research Methodology
3.1 Sample
3.2 Research Steps
4 Results
4.1 Structural Questionnaire
4.2 Technical Questionnaire 1
4.3 Technical Questionnaire 2
4.4 Technical Questionnaire 3
5 Recommendations
6 Conclusion and Future Research
References
Autonomous Vehicle Decision Making and Urban Infrastructure Optimization
1 Introduction
2 Background and Literature Review
2.1 Agent Based Modeling of Complex Systems
2.2 Behavioral Influencers
2.3 Urban Planning
3 Conclusion
References
Prediction of Road Congestion Through Application of Neural Networks and Correlative Algorithm to V2V Communication (NN-CA-V2V)
1 Introduction
2 Methodology
3 Results and Discussion
4 Conclusions
References
Urban Planning to Prevent Pandemics: Urban Design Implications of BiocyberSecurity (BCS)
1 Introduction
2 Background
3 Methods
4 Results
4.1 Potential Attack Vectors
4.2 Designing from First Principles
4.3 Hypothetical Responses to Disease
5 Further Discussion
6 Limitations
7 Conclusions
8 Future Work
References
Smart Helmet: An Experimental Helmet Security Add-On
1 Introduction
2 Background
3 Developed System
3.1 Mode of Operation
3.2 Crash Detection
3.3 Sleepiness and Fatigue Evaluation
4 Selected Experiments and Results
5 Conclusion
5.1 Future Work
References
Author Index


Lecture Notes in Networks and Systems 284

Kohei Arai   Editor

Intelligent Computing Proceedings of the 2021 Computing Conference, Volume 2

Lecture Notes in Networks and Systems Volume 284

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/15179

Kohei Arai Editor

Intelligent Computing Proceedings of the 2021 Computing Conference, Volume 2


Editor
Kohei Arai
Faculty of Science and Engineering, Saga University, Saga, Japan

ISSN 2367-3370  ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-3-030-80125-0  ISBN 978-3-030-80126-7 (eBook)
https://doi.org/10.1007/978-3-030-80126-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Editor’s Preface

It is a great privilege for us to present the proceedings of the Computing Conference 2021, held virtually on July 15 and 16, 2021. The conference is held every year as an ideal platform for researchers to share views, experiences, and information with their peers working all around the world. It offers plenty of networking opportunities to meet and interact with world-leading scientists, engineers, and researchers, as well as industrial partners, in all aspects of computer science and its applications.

The main conference offered a strong program of papers, posters, and videos, all in single-track sessions, together with invited talks intended to stimulate contemplation and discussion. These talks, streamed live during the conference, were also anticipated to pique the interest of the entire computing audience with their thought-provoking claims. All authors presented their research papers very professionally to a large international audience online.

The proceedings for this edition consist of 235 chapters selected out of a total of 638 submissions from 50+ countries. All submissions underwent a double-blind peer-review process. The published proceedings are divided into three volumes covering a wide range of conference topics, such as technology trends, computing, intelligent systems, machine vision, security, communication, electronics, and e-learning, to name a few.

Deep appreciation goes to the keynote speakers for sharing their knowledge and expertise with us, and to all the authors who have invested time and effort to contribute significantly to this conference. We are also indebted to the organizing committee for their great efforts in ensuring the successful running of the conference. In particular, we would like to thank the technical committee for their constructive and enlightening reviews of the manuscripts within the limited timescale. We hope that all participants and interested readers benefit scientifically from this book and find it stimulating.


We hope to see you in 2022 at our next Computing Conference, with the same enthusiasm, focus, and determination.

Kohei Arai

Contents

Balanced Weighted Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Matin Pirouz

Analysis of the Vegetation Index Dynamics on the Base of Fuzzy Time Series Model . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Elchin Aliyev, Ramin Rzayev, and Fuad Salmanov

Application of Data–Driven Fault Diagnosis Design Techniques to a Wind Turbine Test–Rig . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Silvio Simani, Saverio Farsoni, and Paolo Castaldi

Mathematical Model to Support Decision-Making to Ensure the Efficiency and Stability of Economic Development of the Republic of Kazakhstan . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Askar Boranbayev, Seilkhan Boranbayev, Malik Baimukhamedov, and Askar Nurbekov

Automated Data Processing of Bank Statements for Cash Balance Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Vlad-Marius Griguta, Luciano Gerber, Helen Slater-Petty, Keeley Crocket, and John Fry

Top of the Pops: A Novel Whitelist Generation Scheme for Data Exfiltration Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Michael Cheng Yi Cho, Yuan-Hsiang Su, Hsiu-Chuan Huang, and Yu-Lung Tsai

Analyzing Co-occurrence Networks of Emojis on Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Hasan Alsaif, Phil Roesch, and Salem Othman

Development of an Oceanographic Databank Based on Ontological Interactive Documents . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Oleksandr Stryzhak, Vitalii Prykhodniuk, Maryna Popova, Maksym Nadutenko, Svitlana Haiko, and Roman Chepkov


Analyzing Societal Bias of California Police Stops Through Lens of Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Sukanya Manna and Sara Bunyard Automated Metadata Harmonization Using Entity Resolution and Contextual Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 Kunal Sawarkar and Meenakshi Kodati Data Segmentation via t-SNE, DBSCAN, and Random Forest . . . . . . . . 139 Timothy DeLise GeoTree: A Data Structure for Constant Time Geospatial Search Enabling a Real-Time Property Index . . . . . . . . . . . . . . . . . . . . . . . . . . 152 Robert Miller and Phil Maguire Prediction Interval of Future Waiting Times of the Two-Parameter Exponential Distribution Under Multiply Type II Censoring . . . . . . . . . 166 Shu-Fei Wu Predictive Models as Early Warning Systems: A Bayesian Classification Model to Identify At-Risk Students of Programming . . . . 174 Ashok Kumar Veerasamy, Mikko-Jussi Laakso, Daryl D’Souza, and Tapio Salakoski Youden’s J and the Bi Error Method . . . . . . . . . . . . . . . . . . . . . . . . . . 196 MaryLena Bleile A New Proposal of Parametric Similarity Measures with Application in Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 Luca Anzilli and Silvio Giove The Challenges of Using Big Data in the Consumer Credit Sector . . . . . 221 Kirill Romanyuk New Engine to Promote Big Data Industry Upgrade . . . . . . . . . . . . . . . 232 Jing He, Chuyi Wang, and Haonan Chen Using Correlation and Network Analysis for Researching Intellectual Competence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249 Sipovskaya Yana Ivanovna A Disease Similarity Technique Using Biological Process Functional Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 Luis David Licea Torres and Hisham Al-Mubaid Impact of Types of Change on Software Defect Prediction . . . . . . . . . . 273 Atakan Erdem Analyzing Music Genre Popularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284 Jose Fossi, Adam Dzwonkowski, and Salem Othman


Result Prediction Using Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 Hasan Sarwar, Dipannoy Das Gupta, Sanzida Mojib Luna, Nusrat Jahan Suhi, and Marzouka Tasnim A Systematic Literature Review on Big Data Extraction, Transformation and Loading (ETL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 Joshua C. Nwokeji and Richard Matovu Accelerating Road Sign Ground Truth Construction with Knowledge Graph and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 325 Ji Eun Kim, Cory Henson, Kevin Huang, Tuan A. Tran, and Wan-Yi Lin Adjusted Bare Bones Fireworks Algorithm to Guard Orthogonal Polygons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341 Adis Alihodzic, Damir Hasanspahic, Fikret Cunjalo, and Haris Smajlovic Unsupervised Machine Learning-Based Elephant and Mice Flow Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357 Muna Al-Saadi, Asiya Khan, Vasilios Kelefouras, David J. Walker, and Bushra Al-Saadi Automated Generation of Zigzag Carbon Nanotube Models Containing Haeckelite Defects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371 M. Leonor Contreras, Ignacio Villarroel, and Roberto Rozas Artificial Intelligence Against Climate Change . . . . . . . . . . . . . . . . . . . . 378 Leila Scola Construction Site Layout Planning Using Multiple-Level Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 Hui Jiang and YaBo Miao Epistocracy Algorithm: A Novel Hyper-heuristic Optimization Strategy for Solving Complex Optimization Problems . . . . . . . . . . . . . . 408 Seyed Ziae Mousavi Mojab, Seyedmohammad Shams, Hamid Soltanian-Zadeh, and Farshad Fotouhi An Approach for Non-deterministic and Automatic Detection of Learning Styles with Deep Belief Net . . . . . . . . . . . . . . . . . . . . . . . . . 427 Maxwell Ndognkon Manga and Marcel Fouda Ndjodo Using Machine Learning to Identify Methods Violating Immutability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453 Tamás Borbély, Árpád János Wild, Balázs Pintér, and Tibor Gregorics Predicting the Product Life Cycle of Songs on the Radio . . . . . . . . . . . . 463 O. F. Grooss, C. N. Holm, and R. A. Alphinas


Hierarchical Roofline Performance Analysis for Deep Learning Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473 Charlene Yang, Yunsong Wang, Thorsten Kurth, Steven Farrell, and Samuel Williams Data Augmentation for Short-Term Time Series Prediction with Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492 Anibal Flores, Hugo Tito-Chura, and Honorio Apaza-Alanoca On the Use of a Sequential Deep Learning Scheme for Financial Fraud Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507 Georgios Zioviris, Kostas Kolomvatsos, and George Stamoulis Evolutionary Computation Approach for Spatial Workload Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524 Ahmed Abubahia, Mohamed Bader-El-Den, and Ella Haig Depth Self-optimized Learning Toward Data Science . . . . . . . . . . . . . . 543 Ziqi Zhang Predicting Resource Usage in Edge Computing Infrastructures with CNN and a Hybrid Bayesian Particle Swarm Hyper-parameter Optimization Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562 John Violos, Tita Pagoulatou, Stylianos Tsanakas, Konstantinos Tserpes, and Theodora Varvarigou Approaching Deep Convolutional Neural Network for Biometric Recognition Based on Fingerprint Database . . . . . . . . . . . . . . . . . . . . . . 581 Md. Saiful Islam, Tanhim Islam, and Mahady Hasan Optimizing the Neural Architecture of Reinforcement Learning Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591 N. Mazyavkina, S. Moustafa, I. Trofimov, and E. Burnaev Automatic Ensemble of Deep Learning Using KNN and GA Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607 Ben Zagagy, Maya Herman, and Ofer Levi A Deep Learning Model for Data Synopses Management in Pervasive Computing Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 619 Panagiotis Fountas, Kostas Kolomvatsos, and Christos Anagnostopoulos DeepObfusCode: Source Code Obfuscation through Sequence-to-Sequence Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637 Siddhartha Datta Deep-Reinforcement-Learning-Based Scheduling with Contiguous Resource Allocation for Next-Generation Wireless Systems . . . . . . . . . . 648 Shu Sun and Xiaofeng Li


A Novel Model for Enhancing Fact-Checking . . . . . . . . . . . . . . . . . . . . 661 Fatima T. AlKhawaldeh, Tommy Yuan, and Dimitar Kazakov Investigating Learning in Deep Neural Networks Using Layer-Wise Weight Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678 Ayush Manish Agrawal, Atharva Tendle, Harshvardhan Sikka, Sahib Singh, and Amr Kayid Deep Reinforcement Learning for Task Planning of Virtual Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694 Caio Souza and Luiz Velhor Accelerating Deep Convolutional Neural on GPGPU . . . . . . . . . . . . . . . 712 Dominik Żurek, Marcin Pietroń, and Kazimierz Wiatr Enduring Questions, Innovative Technologies: Educational Theories Interface with AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725 Rosemary Papa and Karen Moran Jackson DRAM-Based Processor for Deep Neural Networks Without SRAM Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 743 Eugene Tam, Shenfei Jiang, Paul Duan, Shawn Meng, Yue Pan, Cayden Huang, Yi Han, Jacke Xie, Yuanjun Cui, Jinsong Yu, and Minggui Lu Study of Residual Networks for Image Recognition . . . . . . . . . . . . . . . . 754 Mohammad Sadegh Ebrahimi and Hossein Karkeh Abadi A Systematic Review of Educational Data Mining . . . . . . . . . . . . . . . . . 764 FangYao Xu, ZhiQiang Li, JiaQi Yue, and ShaoJie Qu Image Classification with A-MnasNet and R-MnasNet on NXP Bluebox 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 781 Prasham Shah and Mohamed El-Sharkawy CreditX: A Decentralized and Secure Credit Platform for Higher Educational Institutes Based on Blockchain Technology . . . . . . . . . . . . 793 Romesh Liyanage, D. P. P. Jayasinghe, K. T. Uvindu Sanjana, H. B. D. R. Pearson, Disni Sriyaratna, and Kavinga Abeywardena An Architecture for Blockchain-Based Cloud Banking . . . . . . . . . . . . . 805 Thuat Do A Robust and Efficient Micropayment Infrastructure Using Blockchain for e-Commerce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825 Soumaya Bel Hadj Youssef and Noureddine Boudriga Committee Selection in DAG Distributed Ledgers and Applications . . . 840 Bartosz Kuśmierz, Sebastian Müller, and Angelo Capossele


An Exploration of Blockchain in Social Networking Applications . . . . . 858 Rituparna Bhattacharya, Martin White, and Natalia Beloff Real and Virtual Token Economy Applied to Games: A Comparative Study Between Cryptocurrencies . . . . . . . . . . . . . . . . . . 869 Isabela Ruiz Roque da Silva and Nizam Omar Blockchain Smart Contracts Static Analysis for Software Assurance . . . 881 Suzanna Schmeelk, Bryan Rosado, and Paul E. Black Promize - Blockchain and Self Sovereign Identity Empowered Mobile ATM Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891 Eranga Bandara, Xueping Liang, Peter Foytik, Sachin Shetty, Nalin Ranasinghe, Kasun De Zoysa, and Wee Keong Ng Investigating the Robustness and Generalizability of Deep Reinforcement Learning Based Optimal Trade Execution Systems . . . . 912 Siyu Lin and Peter A. Beling On Fairness in Voting Consensus Protocols . . . . . . . . . . . . . . . . . . . . . . 927 Sebastian Müller, Andreas Penzkofer, Darcy Camargo, and Olivia Saa Dynamic Urban Planning: An Agent-Based Model Coupling Mobility Mode and Housing Choice. Use Case Kendall Square . . . . . . . 940 Mireia Yurrita, Arnaud Grignard, Luis Alonso, Yan Zhang, Cristian Ignacio Jara-Figueroa, Markus Elkatsha, and Kent Larson Reasoning in the Presence of Silence in Testimonies: A Logical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 952 Alfonso Garcés-Báez and Aurelio López-López A Genetic Algorithm Based Approach for Satellite Autonomy . . . . . . . . 967 Sidhdharth Sikka and Harshvardhan Sikka Communicating Digital Evolutionary Machines . . . . . . . . . . . . . . . . . . . 976 Istvan Elek, Zoltan Blazsik, Tamas Heger, Daniel Lenger, and Daniel Sindely Analysis of the MFC Singuliarities of Speech Signals Using Big Data Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987 Ruslan V. Skuratovskii and Volodymyr Osadchyy A Smart City Hub Based on 5G to Revitalize Assets of the Electrical Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1010 Santiago Gil, Germán D. Zapata-Madrigal, and Rodolfo García Sierra LoRa RSSI Based Outdoor Localization in an Urban Area Using Random Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1032 Winfred Ingabire, Hadi Larijani, and Ryan M. Gibson


A Deep Convolutional Neural Network Approach for Plant Leaf Segmentation and Disease Classification in Smart Agriculture . . . . . . . . 1044 Ilias Masmoudi and Rachid Lghoul Medium Resolution Satellite Image Classification System for Land Cover Mapping in Nigeria: A Multi-phase Deep Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056 Nzurumike L. Obianuju, Nwojo Agwu, and Onyenwe Ikechukwu Analysis of Prediction and Clustering of Electricity Consumption in the Province of Imbabura-Ecuador for the Planning of Energy Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1073 Jhonatan F. Rosero-Garcia, Edilberto A. Llanes-Cedeño, Ricardo P. Arciniega-Rocha, and Jesús López-Villada Street Owl. A Mobile App to Reduce Car Accidents . . . . . . . . . . . . . . . 1085 Mohammad Jabrah, Ahmed Bankher, and Bahjat Fakieh Low-Cost Digital Twin Framework for 3D Modeling of Homogenous Urban Zones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1106 Emad Felemban, Abdur Rahman Muhammad Abdul Majid, Faizan Ur Rehman, and Ahmed Lbath VR in Heritage Documentation: Using Microsimulation Modelling . . . . 1115 Wael A. Abdelhameed MQTT Based Power Consumption Monitoring with Usage Pattern Visualization Using Uniform Manifold Approximation and Projection for Smart Buildings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1124 Ray Mart M. Montesclaros, John Emmanuel B. Cruz, Raymark C. Parocha, and Erees Queen B. Macabebe Deepened Development of Industrial Structure Optimization and Industrial Integration of China’s Digital Music Under 5G Network Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1141 Li Eryong and Li Yukun Optimal Solution of Transportation Problem with Effective Approach Mount Order Method: An Operational Research Tool . . . . . 1151 Mohammad Rashid Hussain, Ayman Qahmash, Salem Alelyani, and Mohammed Saleh Alsaqer Analysis of Improved User Experience When Using AR in Public Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1169 Vladimir Barros, Eduardo Oliveira, and Luiz Araújo Autonomous Vehicle Decision Making and Urban Infrastructure Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1190 George Mudrak and Sudhanshu Kumar Semwal


Prediction of Road Congestion Through Application of Neural Networks and Correlative Algorithm to V2V Communication (NN-CA-V2V) . . . 1203
Mahmoud Zaki Iskandarani

Urban Planning to Prevent Pandemics: Urban Design Implications of BiocyberSecurity (BCS) . . . 1222
Lucas Potter, Ernestine Powell, Orlando Ayala, and Xavier-Lewis Palmer

Smart Helmet: An Experimental Helmet Security Add-On . . . 1236
David Sales, Paula Prata, and Paulo Fazendeiro

Author Index . . . 1251

Balanced Weighted Label Propagation Matin Pirouz(B) California State University, Fresno, CA 93740, USA [email protected]

Abstract. One key methodology to understand the structure of complex networks is through community detection and analysis. Such information is used to find relationships and hidden structures in social communities. Existing community detection algorithms are either computationally expensive in large-scale real-world networks or require specific information such as the number and size of communities. Another problem with the existing benchmark algorithms is the resolution problem, i.e. as the data grows larger or more complex (with a higher chance of having outliers), the identified community structure loses quality. In this paper, we introduce a multi-stage novel edge-influenced label propagation algorithm that uses the network structure to create values for the edges. Initially, edges are used to create the flow of the community structure. Next, every node is initialized with a unique label and at every step, each node adopts the label of the neighbor with the lowest edge weight. Finally, all nodes converge and construct the community structures. Through an iterative process, densely connected groups of nodes form a consensus on a unique label to form communities. The proposed method, Balanced Weighted Label Propagation, is tested on both synthetic and real-world benchmark datasets with known community structures. Keywords: Social network analysis · Normalized Mutual Information · Adjusted Rand Index · Resolution limit

1

Introduction

Networks are groups of nodes interconnected by edges. For example, in social media networks, nodes represent people and edges represent the relationships among them [10]. Some subgroups have more internal than external edges and are commonly identified as communities. One of the main tasks in studying complex networks is to find these structures. Finding communities reveals meaningful information and is particularly important in social networks [15] and recommender systems [3, 19]. Several community detection methods exist. Modularity, a quality function introduced by Newman, has become one of the major approaches, and many optimization methods have been built on it. On the other hand, the resolution limit problem [12] identified by Fortunato and Barthélemy in [1] is a key limitation


of modularity. A metric called the resolution value has been used to address the problem: the higher the value, the more communities are found [7]. Another widely known approach is Label Propagation, developed by Raghavan et al. [12]. Label Propagation is a near-linear approach, also known as the coloring method, as it is based on the flow of uniquely assigned labels in the network. The problem with this method is that the core procedure relies on random tie-breaking; therefore, many different partitions can result. Several enhancements have been introduced over the years to solve this problem and are further studied in the related work section. In this paper, a new node-influence algorithm based on Label Propagation is proposed to solve the randomness and resolution limit problems in community detection. The Balanced Weighted Label Propagation (BWLP) algorithm has a time complexity linear in the number of edges. Our approach uses the influence of nodes on one another to find an edge value for any pair of nodes in any complex graph. The neighboring relationship between two nodes is used as a metric to define the closeness of the pair. We develop a convergence function which defines when to stop exchanging labels. Finally, when the distribution of labels concludes, all the nodes that share the same label find themselves in the same community.

1.1 Contributions

By computing the edge weight based on the dissimilarity between nodes, BWLP achieves the following:

– Intuitive Community Detection: instead of relying solely on the count of neighboring labels, which is not the most effective criterion, BWLP uses the weighted flow to find the community structure.
– Hidden Community Detection: in large-scale networks, hidden and small communities can be neglected due to various factors (as explained in Sect. 3.1). By relying on local, topology-driven weighted edges, BWLP enables the discovery of small communities.
– Scalability: BWLP scales nearly linearly with the number of nodes, which makes it suitable for large networks. BWLP only iterates over the edges around every node and never revisits the same node twice. As a result, BWLP achieves a time complexity of O(|n|). Given this property, BWLP lends itself to handling large real-world networks in a relatively short time.
– Addressing the resolution limit ring problem: using edge weights and the structure flow, we address the resolution problem.

2

Related Work

Community detection algorithms vary based on their approach and strengths/limitations. CPM [8] searches the graph for all cliques of maximum


size as the first phase. After the maximum cliques have been found, each clique is enumerated and a clique adjacency matrix is constructed. A parameter K is set, and any clique with size of at least K is considered to be part of a larger module. The Girvan and Newman method [2] creates communities by removing the edge with the highest betweenness and re-calculating the values of the affected edges. At the end, it provides a dendrogram with the nodes as leaves. In all the optimization methods driven by a fitness function, the resolution of the detected communities depends on the resolution value chosen for the given function. Lancichinetti et al. [4] choose a random node and expand a community from there. As with modularity, the resolution value defines the size and quality of the detected communities. In the EAGLE [14] and GCE [6] algorithms, maximal cliques in the network replace the random node used in [4]. Zhang et al.'s spectral method [22] is used to lower the dimensions of the graph; the fuzzy c-means algorithm is then used for clustering. Psorakis et al. [11] introduced a novel method utilizing Bayesian non-negative matrix factorization (NMF). The limitations of this method include the need to know the number of communities and the use of matrices, which results in an inefficient time complexity and increases the space complexity. Zhang et al. [21] used a stochastic block model to find community structures with unequal group sizes, whereas standard stochastic block models assume that the numbers of nodes inside the communities are equal. Label propagation methods [18] propagate the labels of the most popular nodes throughout the graph to find the structure of communities. Particle Swarm Optimization (PSO) [23] introduced a new label propagation method mixed with modularity density. The detected communities have good resolution, but the algorithm has a higher time complexity than other label propagation algorithms. SLPA introduced a dynamic algorithm to find overlapping community structures. Li et al. [17] introduced an optimized method called "Stepping LPA-S", where similarity is used for propagating labels. In addition, a stepping framework is used to divide networks; then, an evaluation function is used to select the final unique partition.

3

Proposed Approach

This section develops the main idea behind the proposed multi-stage edge weighing algorithm. In the initialization layer, all the links get their own labels. Assume a pair of nodes with unique sets of common and exclusive neighbors. Suppose X and Y are neighbors, then C is the set of their common neighbors and E is the set of exclusive neighbors between the pair. In this study, we use sets C and E to find the dissimilarity between nodes X and Y . Function 2 finds a dissimilarity value between any pair of nodes in the graph based on their structural formats. The calculated dissimilarity value is assigned to the edge between two nodes as their edge weight. Algorithm 1 shows the next step which is sorting all nodes based on their dissimilarity ratio [9]. First, every node in the graph gets their node number as their label. This way, every node is assigned to its own community. Later, swapping labels


begins as shown in Algorithm 2. In every iteration, vertices change membership to the community of the most similar vertex around them. To choose which community the node X belongs to, X looks around among its neighbors NX = {n1, n2, n3, ..., ni} and joins the community of the neighbor which has the lowest dissimilarity to itself among all its neighbors, using the set WNX = {w1, w2, w3, ..., wi}. This way, X picks the label of its most influential neighbor. In case of a tie of edge weights among the neighbors of X, the algorithm chooses the label with the least number of exclusive nodes. In parallel with the label exchange, the Convergence Algorithm 3 takes place. Convergence is the process of every node checking whether it is in the correct community or not. Every node holds a binary value which decides whether the node should continue searching for the community it belongs to or already has the right label. This value is initially set to false for every node. If the majority of the set NX = {n1, n2, n3, ..., ni} for X share the same label as X, the value changes to true and X stops searching. On the other hand, if the majority of neighbors do not agree with X's label, X looks at the neighbors of the neighbor that has the lowest edge weight to it. This continues until the neighbors agree on X's label. By the end of each iteration, all the nodes holding the same label count as one community.

Label propagation is an iterative clustering procedure, in which at each iteration the label (cluster) of a node is updated based on a function of the graph topology. In this algorithm, the edge weighting function is based on the shared neighbors of the nodes in question. More formally, we let G(V, E) be an unweighted graph with vertex set V and edge set E. For a node x ∈ V, let N(x) := {y | (x, y) ∈ E}, let D(x) be the degree of x, and define the function F : V × V → R as

F(x, y) = \frac{-2\,|N(x) \cap N(y)| + |N(x)| + |N(y)|}{D(x)\,D(y)}.   (1)

If G(V, E) is a simple graph, then D(x) = |N(x)| and the function can be written as

F(x, y) = \frac{-2\,|N(x) \cap N(y)|}{|N(x)|\,|N(y)|} + \frac{1}{|N(x)|} + \frac{1}{|N(y)|}.   (2)

The label propagation algorithm has two main procedures, an initialization procedure and an iterative procedure. The initialization process begins by assigning each edge (x, y) ∈ E a weight given by F(x, y) and each node x ∈ V a unique label denoted by label.x. At each step in the iterative procedure, each node x is considered and assigned a new label label.y, where y ∈ N(x) satisfies

F(x, y) = \min_{z \in N(x)} F(x, z).   (3)
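For concreteness, a minimal Python sketch of this edge-weighting step is given below. It is an editorial illustration rather than the authors' implementation; it assumes NetworkX is available, and the function names are placeholders.

import networkx as nx

def dissimilarity(graph, x, y):
    # Eq. (1): fewer shared neighbours between x and y gives a larger weight.
    n_x, n_y = set(graph[x]), set(graph[y])
    shared = len(n_x & n_y)
    # For a simple graph D(x) = |N(x)|, so this also matches Eq. (2).
    return (-2.0 * shared + len(n_x) + len(n_y)) / (len(n_x) * len(n_y))

def weight_edges(graph):
    # Assign the dissimilarity value to every edge before label propagation.
    for x, y in graph.edges():
        graph[x][y]["weight"] = dissimilarity(graph, x, y)
    return graph

if __name__ == "__main__":
    g = weight_edges(nx.karate_club_graph())
    print(sorted(g.edges(data="weight"))[:5])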

It is known that modularity maximization suffers from the resolution problem. Namely, it has a tendency to cluster the k-cliques together in the ring of cliques network. In the following lemma, we show that our algorithm will never cluster the k-cliques together if k >2.


Algorithm 1: Weighted Label Propagation

Input: G network graph, w_min
Output: network communities
for n ∈ G do
    n.label = n
    for i ∈ {neighbors of n} do
        N_w[i] = (−2|N(n) ∩ N(i)| + |N(n)| + |N(i)|) / (D(n) · D(i))
    end
    for j ∈ {SNL} do
        if min{N_w} < SNL[j] then
            insert n into SNL[j]
        end
    end
end

Algorithm 2: Map Traversing

Input: G network graph, w_min, SNL
Output: network communities
while N_c ≠ 0 do
    for n ∈ SNL do
        for i ∈ {neighbors of n} do
            if n_i,w < Chosen then
                Chosen = label of n_i,w
            end
        end
        for i ∈ {neighbors of n} do
            HAL = find the highest-occurrence label
        end
    end
    if HAL == Chosen then
        flag n as converged
    else
        call Convergence(n, G)
    end
end
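A compact Python rendering of the propagation loop of Algorithms 1 and 2 is sketched below; it is an editorial simplification (the convergence handling of Algorithm 3 is reduced to a fixed-point check) and assumes the edge weights have already been set to the dissimilarity values.

def bwlp_sketch(graph, max_iters=20):
    # Every node starts in its own community (its own label).
    labels = {v: v for v in graph}
    for _ in range(max_iters):
        changed = False
        for v in graph:
            if not graph[v]:            # skip isolated nodes
                continue
            # Adopt the label of the neighbour with the lowest edge weight.
            best = min(graph[v], key=lambda u: graph[v][u].get("weight", 1.0))
            if labels[v] != labels[best]:
                labels[v] = labels[best]
                changed = True
        if not changed:                  # no label changed: labels have converged
            break
    return labels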

Lemma 1. If the given algorithm converges on the ring of cliques network where each clique has the size k>2, then no node will be clustered with nodes outside of its clique. Proof. This lemma is proven if we can show that no node has the incentive to cluster with a node outside of its clique. To this end, let x ∈ V and define C(x) to be the set of nodes that share a clique with x. Let y be a node in the set


Algorithm 3: Convergence

Input: G network graph, n, Chosen
for i ∈ {neighbors of n} do
    Chosen = label of the neighbor with the most similarity
    m = neighbor with the lowest edge value
    for j ∈ {neighbors of m} do
        HAL = find the highest-occurrence label
    end
    if HAL == Chosen then
        flag n as converged
    else
        call Convergence(m, G)
    end
end

N(x) − C(x) and notice that by the construction of the ring of cliques we have

|N(x) \cap N(y)| = 0.   (4)

Depending on how we construct the ring of cliques and on the choice of x and y, the value of F(x, y) will vary; however,

\min F(x, y) = \frac{2}{k+1}.   (5)

Let z be any node in C(x). Since z belongs to the same clique as x, we have

|N(x) \cap N(z)| = k - 2.   (6)

Again, depending on how we construct the ring of cliques and on the choice of x and z, the value of F(x, z) will vary; however,

\max F(x, z) = \frac{-2(k-2)}{(k+2)\,k} + \frac{1}{k+2} + \frac{1}{k}.   (7)

Finally, we note that max F(x, z) < min F(x, y) as long as k > \frac{1}{2}(1 + \sqrt{13}); since we assumed that k is an integer greater than 2, x will never cluster with a node outside of its clique.

Table 1. Real world unweighted datasets [20]

Datasets | Num. nodes | Num. edges | Connected component | CC | Avg. degree | Network diameter
polbooks | 105 | 441 | 1 | 0.488 | 8.4 | 7
Football | 115 | 613 | 1 | 0.403 | 10.661 | 4
Dolphin social network | 62 | 159 | 1 | 0.303 | 5.129 | 8
Karate | 34 | 78 | 1 | 0.285 | 4.588 | 3


Table 2. Synthetic unweighted datasets created with LFR benchmark graphs

Datasets | Num. nodes | Num. edges | Connected component | CC | Avg. degree | Network diameter
SM1 | 1000 | 4288 | 3 | 0.477 | 4.28 | 18
SM2 | 1000 | 4334 | 4 | 0.526 | 4.34 | 20
SM3 | 1000 | 4360 | 4 | 0.495 | 4.36 | 20
SM4 | 1000 | 4316 | 3 | 0.466 | 4.31 | 22
SM5 | 1000 | 4520 | 7 | 0.492 | 4.52 | 20
BM1 | 1000 | 24564 | 1 | 0.592 | 24.56 | 5
BM2 | 1000 | 24636 | 1 | 0.427 | 24.36 | 4
BM3 | 1000 | 23476 | 1 | 0.274 | 23.47 | 4
BM4 | 1000 | 25066 | 1 | 0.246 | 25.06 | 4
BM5 | 1000 | 25106 | 1 | 0.148 | 25.10 | 4

3.1 Resolution Limit

We address the ring problem of the resolution limit, a well-known problem in community detection. Optimization of quality functions is a key approach for community detection. Modularity computes the number of intra-cluster edges inside a cluster and compares it with the expected number of such edges in a similar-sized random network. Modularity also takes the node degrees into account. In this way, the null model considers the possibility of any inter-node connection, which is impractical for large networks. Furthermore, the larger the network, the smaller the expected number of edges between two groups of nodes. So, if a network is large enough, the expected number of edges between two groups of nodes in modularity's null model may be smaller than one. Hence, modularity interprets a single edge as a sign of a strong correlation between the two clusters, and merging them would optimize modularity. In other words, in large networks, weakly connected groups would be merged into a denser cluster [1]. The resolution problem is addressed by integrating nodes' influence on one another together with the flow propagation approach.
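For reference, this effect can be stated compactly in the standard form given by Fortunato and Barthélemy [1] (the restatement is an editorial addition, not a formula from this paper): merging two modules A and B increases modularity whenever the observed number of inter-module edges exceeds its null-model expectation,

\Delta Q_{A \cup B} > 0 \quad\Longleftrightarrow\quad \ell_{AB} > \frac{k_A\, k_B}{2m},

where \ell_{AB} is the number of edges between A and B, k_A and k_B are their total degrees, and m is the total number of edges in the network; for large m the right-hand side drops below one, so a single edge is enough to force a merge.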

4 Experiment

4.1 Datasets and Setup

Two real-world benchmark datasets are used for this experiment: (a) Polbooks: a network of books about recent US politics sold by the online bookseller Amazon.com, where edges between books represent frequent co-purchasing of books by the same buyers; (b) Karate Club: the social network of friendships between 34 members of a karate club at a US university in the 1970s [20].


We also used the LFR benchmark datasets presented by Lancichinetti et al. [5]. LFR benchmark networks have features of real-world networks, namely the distributions of degree and of the number of triangles in the network. Moreover, the clustering coefficient and the community sizes are adjustable with this benchmark network creator. The experimental environment includes Python 3.7 and NetworkX. The processor used to run the estimation and the proposed algorithm was an Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70 GHz with 64.00 GB of RAM on a Windows 10 Pro 64-bit operating system.
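As a hedged illustration of this setup (not the authors' actual scripts), LFR graphs with planted communities and the NMI/ARI scores can be produced with NetworkX and scikit-learn roughly as follows; the generator parameters below follow the NetworkX documentation example and are placeholders, not the exact settings of Table 2.

import networkx as nx
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Generate an LFR benchmark graph with planted (ground-truth) communities.
g = nx.LFR_benchmark_graph(n=250, tau1=3, tau2=1.5, mu=0.1,
                           average_degree=5, min_community=20, seed=10)

# The generator stores each node's community as a frozenset on the node.
truth = {v: min(g.nodes[v]["community"]) for v in g}

# 'detected' would be the labelling returned by the algorithm under test;
# it is set to the ground truth here only to show how the scores are computed.
detected = dict(truth)

nodes = sorted(g)
y_true = [truth[v] for v in nodes]
y_pred = [detected[v] for v in nodes]
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))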

4.2 Results and Discussion

The goal of our computational experiment is to evaluate and validate the correctness and reliability of the BWLP algorithm. All the experiments are performed on both synthetic and real-world networks to ensure analyses of various distributions of nodes and edges. Detailed information about the datasets used is given in Sect. 4.1 as well as Tables 1 and 2. Run time and accuracy are presented in Fig. 1. Figure 2 (a) and (b) shows the results for state-of-art-algorithms versus the proposed BWLP algorithm. Vertical line show the Normalized Mutual Information (NMI) [16] as an accuracy metric for benchmark graphs. The horizontal axes shows the mixing parameter µ. The two charts show community size ranges where (a) is from 10 to 50 vertices and (b) ranges from 20 to 100 vertices. Figure 2 (c) and (d) represent the results for state-of-art-algorithms versus the proposed BWLP algorithm. The vertical axes show the Adjusted Rand Index (ARI) [13]. The horizontal axes shows the mixing parameter µ. Figure 3.a and 3.b show the computational time of Louvain and Infomap against BWLP. Figure 3.a depicts BWLP achieving an average of 550 times faster run time than the benchmark algorithms. The ratio grows to 3500 times faster presented in Fig. 3.b. It should be noted that all the algorithms have been written in Python 3.7 and all experiments were run on the same machine. As shown in Fig. 3, the computational times of Infomap and Louvain grow as the network gets larger. In these types of optimization methods, there are two problems: first is the hidden communities and resolution limit as explained in Sect. 3.1 which result in lower accuracy (shown in Fig. 1.b). The second is that the computational time grows polynomially as the number of edges in the dataset grows shown in Fig. 1.a). Benchmark algorithms stated in Table 3 are not efficient as they calculate a quality score for every randomly chosen group of communities to measure their quality. Then, nodes will move between communities and with every move, the quality-function is re-calculated. If the quality score is better, the changes are kept; otherwise, changes are discarded. Infomap has a complexity of O(n) in the best case, and O(n2 logn) in the worst case. In addition to the complexity, Infomap has quality-function calculations with every adjustment, which makes the algorithm even slower. MCL has a complexity of O(n3 ) in all the cases which makes this algorithm slower than Infomap when it comes to big networks. In contrast, BWLP calculates a one-time weight value for all edges of every node


Table 3. Experiment results for given datasets

Datasets | Karate NMI | Karate ARI | polbooks NMI | polbooks ARI
BWLP | 1 | 1 | 0.568199 | 0.677826
Modularity | 0.6924 | 0.6803 | 0.5308 | 0.6379
Infomap | 0.699488 | 0.702155 | 0.493454 | 0.53606
Attractor | 0.924092 | 0.939252 | 0.5308 | 0.6379
MCL | 0.836498 | 0.882302 | 0.52086 | 0.59365


Fig. 1. (a) Run time (seconds) to detect the communities in LFR benchmark datasets based on varying node degrees. (b) Accuracy of the detected communities using NMI

[Fig. 2 panels: (a) NMI for small communities; (b) NMI for large communities; (c) ARI for small communities; (d) ARI for large communities. Compared methods: Attractor, Infomap, BWLP, Modularity, Louvain.]

Fig. 2. NMI and ARI performances on the LFR benchmark.


[Fig. 3 panels: (a) small communities; (b) large communities. Run time in seconds for Infomap, BWLP, and Louvain.]

Fig. 3. Community detection run time (seconds) in LFR benchmark datasets with varying mixing parameter µ (horizontal axis)

and from there every node chooses its label (community group) based on its edge weight and neighboring labels. With this method, we avoid recalculating and regrouping of the nodes. In terms of accuracy, Fig. 2.a, 2.b presents BWLP results over two networks of 1000 vertices with a mixing parameter µ in range of (1–5) on horizontal axis. Vertical axes shows the correctness measure of ARI [13]. BWLP outperforms all other algorithms except for the µ = 1 in which Infomap is slightly better in Fig. 2.a µ = 1. In the Fig. 2.b Louvain is slightly better only at µ = 1. In addition, Fig. 2.a and 2.b depict the correctness measure of Normalized Mutual Information (NMI) [16]) on vertical axes and µ in range of (1–5) on the horizontal axis over two networks of 1000 vertices. The results from Fig. 2 confirm that the communities are detected with a higher accuracy by BWLP as compared to the other algorithms.

5

Conclusion

In this paper, a new multi-stage community detection algorithm is proposed. This algorithm automatically identifies the structures in networks based on dissimilarity between nodes and the structure of the network. There is no preprocessing or prior knowledge needed of the dataset/network for the proposed algorithm. We noticed that on average, nodes agree on the same label in less than three iterations. The major phases of the proposed algorithm are threefold: The first is to identify the dissimilarity value and structural form of networks and to sort data nodes based on their dissimilarity values. The second layer is to find the flow of the network from each vertex, then re-label it to vertices which belong to that particular flow. Finally, the last layer converges all the nodes and construct the final community structures. Extensive experiments were conducted to demonstrate the correctness and quality and expenses of the method introduced versus benchmark algorithms. Results shown in Sect. 4.2 demonstrate that BWLP requires lower resources as compared to benchmark algorithms. In addition, better accuracy is achieved particularly over the datasets with higher edge density and complexity. BWLP


successfully finds the hidden communities, a known problem for other benchmark algorithms. BWLP is a better choice for big data graphs considering that it outperforms the state-of-the-art algorithms, being 550 to 3500 times faster for the given datasets, as shown in Fig. 3(a) and Fig. 3(b). We presented the advantages of BWLP against several benchmark methods. Future research plans are threefold: first, to extend the algorithm to weighted graphs; second, to find overlapping community structures with identical time and space complexity; and third, to improve the accuracy of the existing algorithms by introducing more conditions for special graph structures.

References 1. Fortunato, S., Barth´elemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. 104(1), 36–41 (2007). Resolution Limit 2. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002). Betweenness 3. Ioannidis, V.N., Zamzam, A.S., Giannakis, G.B., Sidiropoulos, N.D.: Coupled graph and tensor factorization for recommender systems and community detection. IEEE Trans. Knowl. Data Eng. (2019) 4. Lancichinetti, A., Fortunato, S., Kert´esz, J.: Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys. 11(3), 033015 (2009). LFM 5. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Phys. Rev. E 78, 046110 (2008) 6. Lee, C., Reid, F., McDaid, A., Hurley, N.: Detecting highly overlapping community structure by greedy clique expansion. arXiv preprint arXiv:1002.1827 (2010). GCE 7. Mukunda, A., Pirouz, M.: Influence-based community detection and ranking. In: 2019 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1341–1346. IEEE (2019) 8. Palla, G., Der´enyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. arXiv preprint physics/0506133 (2005). CPM 9. Pirouz, M., Zhan, J.: Optimized label propagation community detection on big data networks. In: Proceedings of the 2018 International Conference on Big Data and Education, pp. 57–62 (2018) 10. Pirouz, M., Zhan, J., Tayeb, S.: An optimized approach for community detection and ranking. J. Big Data 3(1), 1–12 (2016). https://doi.org/10.1186/s40537-0160058-z 11. Psorakis, I., Roberts, S., Ebden, M., Sheldon, B.: Overlapping community detection using Bayesian non-negative matrix factorization. Phys. Rev. E 83(6), 066114 (2011). NMF 12. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76(3), 036106 (2007) 13. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971) 14. Shen, H., Cheng, X., Cai, K., Hu, M.-B.: Detect overlapping and hierarchical community structure in networks. Physica A: Stat. Mech. Appl. 388(8), 1706–1712 (2009). EAGLE


15. Singh, S.S., Kumar, A., Singh, K., Biswas, B.: C2IM: community based contextaware influence maximization in social networks. Physica A: Stat. Mech. Appl. 514, 796–818 (2019) 16. Strehl, A., Ghosh, J.: Cluster ensembles–a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3(Dec), 583–617 (2002). NMI 17. Li, W., Huang, C., Wang, M., Chen, X.: Stepping community detection algorithm based on label propagation and similarity. Physica A: Stat. Mech. Appl. 472, 145– 155 (2017) 18. Xie, J., Szymanski, B.K.: Community detection using a neighborhood strength driven label propagation algorithm. In: 2011 IEEE Network Science Workshop (NSW), pp. 188–195. IEEE (2011) 19. Xue, C., Wu, S., Zhang, Q., Shao, F.: An incremental group-specific framework based on community detection for cold start recommendation. IEEE Access 7, 112363–112374 (2019) 20. Zachary, W.W.: An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33(4), 452–473 (1977). http://www-personal.umich. edu/∼mejn/netdata/ 21. Zhang, P., Moore, C., Newman, M.E.J.: Community detection in networks with unequal groups. Phys. Rev. E 93(1), 012303 (2016) 22. Zhang, S., Wang, R.-S., Zhang, X.-S.: Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A: Stat. Mech. Appl. 374(1), 483–490 (2007). Spectral, Euclidean Space, Fuzzy C-mean Algorithm 23. Zhou, D., Wang, X.: A neighborhood-impact based community detection algorithm via discrete PSO. Math. Prob. Eng. 2016 (2016)

Analysis of the Vegetation Index Dynamics on the Base of Fuzzy Time Series Model Elchin Aliyev, Ramin Rzayev(B) , and Fuad Salmanov Institute of Control Systems of ANAS, B. Vahabzadeh str. 9, 1141 Baku, Azerbaijan

Abstract. Based on the fuzzy analysis of satellite monitoring data, the annual dynamics of the vegetation index for the selected crop area is investigated by means of MODIS images (LPDAAC – the Land Processes Distributed Active Archive Center). To reconstruct and predict the weakly structured vegetation index time series, the fuzzy models are proposed, compiled taking into account the analysis of internal connections of the first and second orders, which are presented in the form of fuzzy relations. The proposed models were investigated for adequacy and suitability from the point of view of the analysis of the peculiarities of the intraannual mean annual dynamics of the index, typical for the cultivated area. On the basis of the proposed approach, the results of the study of long-term dynamics of vegetation indices can be used for a complex analysis of the dynamics of vegetation cover, including modeling and forecasting the efficiency and productivity of agricultural crops. Keywords: Vegetation index · Fuzzy time series · Fuzzy relation

1 Introduction

Modern technologies for satellite monitoring of the Earth's surface provide agricultural producers with useful information about the health of crops. The remote sensor's ability to detect minor differences in vegetation makes it a useful tool for quantifying variability within a given field, assessing crop growth, and managing land based on current conditions. Remote sensing data, collected on a regular basis, allows growers and agronomists to make a current map of the state and strength of crops, analyze the dynamics of changes in the health of crops, and predict it. For the interpretation of remote sensing data, the most effective means are all kinds of vegetation indices, in particular, Normalized Difference Vegetation Index (NDVI), which are calculated empirically, i.e. by operations with different spectral ranges of satellite monitoring data. Most agricultural crops are characterized by changes in the phases of development, which is reflected in the dynamics of the spectral-reflective properties of plants. The study of seasonal and long-term changes in the spectral and brightness characteristics of crops is possible through the analysis and modeling of the dynamic series of vegetation indices, which makes it possible to quantitatively evaluate the features of the vegetation cover and the patterns of its temporal dynamics. At the same time, standard algorithms for solving problems of predicting the dynamics of the spectral-reflective properties


of plants work, as a rule, with “crisp” or structured data of satellite earth sensing, i.e. with data presented as averaged numbers. Therefore, the averaging of the results of measurements of vegetation indices is one of the most common empirical operations in systems for collecting data from satellite monitoring and crop management. In particular, the achievement of the required accuracy in the NDVI averaging process is achieved by multiple measurements, where the results of individual measurements are partially compensated by positive and negative deviations from the exact value. The accuracy of their mutual compensation improves with an increase in the number of measurements, since the absolute value of the mean of negative deviations approaches to the mean of positive deviations. More generally, satellite monitoring data, for example, NDVI values should be considered as weakly structured, i.e. those about which only their belonging to a certain type is known [1]. In particular, the interval [NDVImin , NDVImax ], which includes all its measurements, can serve as an adequate reflection of the poorly structured NDVI values. Another more adequate reflection of the weakly structured value of NDVI can be a statement of the form “high”, which, in fact, is one of the terms (values) of the linguistic variable “value of the vegetation index”, which can be formally described by the appropriate fuzzy set [2]. Therefore, based on this premise, it becomes obvious the importance and relevance of studying methods for studying seasonal and long-term changes in the spectral-brightness characteristics of crops by means of time series of satellite monitoring indicators relative to NDVI, which we will consider as weakly structured values.

2 Problem Definition

On the basis of reliable information obtained by remote sensing of the Earth, it is necessary to assess the regularities in the change in the average long-term values of NDVI on the specifically selected area of cropland. More generally, the study of the NDVI dynamics should include three stages: 1) calculation and analysis of the annual dynamics of the NDVI values averaged for each analyzed date, typical for the selected region; 2) analysis of the time series of long-term NDVI values; 3) estimation of the dynamics of the average annual NDVI values for cultivated areas for a certain period. The purpose of this study is to provide a fuzzy analysis of seasonal and long-term characteristics in the NDVI dynamics for the cultivated areas created by the MODIS platform (LPDAAC) [3].

3 Fuzzy Modeling of NDVI Dynamics

The existing methods of fuzzy modeling of weakly structured time series imply the sequential implementation of the following steps: 1) building the coverage of the entire data set in the form of a universal set (universe); 2) fuzzification of weakly structured time series data; 3) definition of internal connections in the form of fuzzy relations and their division into groups; 4) determination of the fuzzy outputs of the applied model and their defuzzification. To define a universe, the following step-by-step procedure is applied [4].


Step 1. Sorting the time series data {x_t} (t = 1 ÷ n) into an ascending sequence {x_p(i)}, where p is a permutation that sorts the data values in ascending order, i.e. x_p(t) ≤ x_p(t+1).

Step 2. Calculation of the mean value over the set of all pairwise distances d_i = |x_p(i) − x_p(i+1)| between any two consecutive values x_p(i) and x_p(i+1) by the formula

AD(d_1, d_2, ..., d_n) = \frac{1}{n-1} \sum_{i=1}^{n-1} \left| x_{p(i)} - x_{p(i+1)} \right|,   (1)

and the standard deviation by the formula

\sigma_{AD} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n-1} (d_i - AD)^2 }.   (2)

Step 3. Detection and elimination of anomalies, i.e. sharply deviating values that are to be rejected. This step uses both the mean distance AD and the standard deviation σ_AD from the previous step. In this case, the pairwise distances that do not satisfy the following condition are rejected:

AD - \sigma_{AD} \le d_i \le AD + \sigma_{AD}.   (3)

Step 4. Recalculate the mean distance between any two consecutive values from the set of values remaining after sorting for outliers.

Step 5. Establishing the universe U in the form of U = [D_min − AD, D_max + AD] = [D_1, D_2], where D_min and D_max are the minimum and maximum values, respectively, on the entire data set {x_t} (t = 1 ÷ n).

There are various ways to identify membership functions that restore the fuzzy subsets of the given universe. In particular, one such method uses symmetric trapezoidal membership functions of the following form (see Fig. 1):

\mu_{A_k}(x) = \begin{cases} 0, & x < a_{k1}, \\ \frac{x - a_{k1}}{a_{k2} - a_{k1}}, & a_{k1} \le x \le a_{k2}, \\ 1, & a_{k2} \le x \le a_{k3}, \\ \frac{a_{k4} - x}{a_{k4} - a_{k3}}, & a_{k3} \le x \le a_{k4}, \\ 0, & x > a_{k4}, \end{cases}   (4)

with parameters satisfying the conditions a_{k2} − a_{k1} = a_{k3} − a_{k2} = a_{k4} − a_{k3}, where k = 1 ÷ m and m is the total number of fuzzy sets A_k describing the time series data. According to [4], this number is calculated by the formula

m = [D_2 - D_1 - AD] / [2 \cdot AD].   (5)
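A minimal Python sketch of Steps 1–5 and of Eq. (5) is given below. It is an editorial illustration rather than the authors' code: the outlier filter is applied only once, so the values it prints are indicative and need not coincide exactly with the figures reported later in this section.

import statistics

def build_universe(series):
    # Step 1: sort the data; Step 2: mean pairwise distance of consecutive values.
    ordered = sorted(series)
    dists = [abs(a - b) for a, b in zip(ordered, ordered[1:])]
    ad = sum(dists) / len(dists)
    sigma = statistics.pstdev(dists)           # Eq. (2) over the n-1 distances
    # Step 3: reject distances outside [AD - sigma, AD + sigma].
    kept = [d for d in dists if ad - sigma <= d <= ad + sigma] or dists
    ad = sum(kept) / len(kept)                 # Step 4: recompute the mean distance
    # Step 5: universe U = [Dmin - AD, Dmax + AD].
    return min(series) - ad, max(series) + ad, ad

def number_of_sets(d1, d2, ad):
    # Eq. (5): m = [D2 - D1 - AD] / [2 * AD].
    return round((d2 - d1 - ad) / (2.0 * ad))

print(number_of_sets(0.2183, 0.7431, 0.0296))  # about 8, as in the NDVI example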

As an example, it was chosen the time series reflecting the annual dynamics of the NDVI index (see Table 1) based on MODIS (LPDAAC) images (Fig. 2, [5]) of the cultivated area in Jonesboro (USA) with geographic coordinates (−90.1614583252562, 35.8135416634583). For the entire set of time series data, after recalculating the average distance and standard deviation according to formulas (1) and (2) at the 5th step their final values are


Fig. 1. Symmetric trapezoidal membership function.

Table 1. NDVI time series.

№ | Date | NDVI | № | Date | NDVI
1 | 18.02.2000 | 0.3599 | 9 | 25.06.2000 | 0.7101
2 | 05.03.2000 | 0.4099 | 10 | 11.07.2000 | 0.7135
3 | 21.03.2000 | 0.368 | 11 | 27.07.2000 | 0.2479
4 | 06.04.2000 | 0.3296 | 12 | 12.08.2000 | 0.6587
5 | 22.04.2000 | 0.2535 | 13 | 28.08.2000 | 0.5473
6 | 08.05.2000 | 0.2966 | 14 | 13.09.2000 | 0.4815
7 | 24.05.2000 | 0.3211 | 15 | 29.09.2000 | 0.3719
8 | 09.06.2000 | 0.5104 | 16 | 15.10.2000 | 0.3217

Fig. 2. MODIS (LPDAAC) images.

established under fulfilling condition (3): AD = 0.0296 and σ AD = 0.0007. In this case, the desired universe is defined as the following segment U = [0.2479 – 0.0296, 0.7135 +


0.0296] = [0.2183, 0.7431], where 0.2479 and 0.7135 are the minimum and maximum values of the NDVI, respectively. To describe the qualitative criteria for evaluating the NDVI the sufficient number of fuzzy subsets of the U is established from equality (4) as follows: m = [0.7431 – 0.2183 – 0.0296]/[2 · 0.0296] = 8.3649 ≈ 8. Then the corresponding 8 trapezoidal membership functions are identified (see Fig. 3), the parameters of which are summarized in Table 2.


Fig. 3. Trapezoidal membership functions describing time series data.

Table 2. Parameters of trapezoidal membership functions.

Fuzzy set | ak1 | ak2 | ak3 | ak4
A1 | 0.2183 | 0.2479 | 0.2775 | 0.3071
A2 | 0.2775 | 0.3071 | 0.3367 | 0.3663
A3 | 0.3367 | 0.3663 | 0.3959 | 0.4255
A4 | 0.3959 | 0.4255 | 0.4551 | 0.4847
A5 | 0.4551 | 0.4847 | 0.5143 | 0.5439
A6 | 0.5143 | 0.5439 | 0.5735 | 0.6031
A7 | 0.5735 | 0.6031 | 0.6327 | 0.6623
A8 | 0.6327 | 0.6695 | 0.7063 | 0.7431

According to [4], fuzzification of the time series data by the identified trapezoidal membership functions is carried out by the following rule: an NDVI value is described by the fuzzy set to which it belongs with the greatest degree. When the NDVI value belongs to the interval [a_k2, a_k3], it is relatively easy to find its fuzzy analogue. In other cases, clarification is needed. In particular, according to (4), for NDVI = 0.3599 we have μ_A3(0.3599) = 0.7838 and μ_A2(0.3599) = 0.2162 (see Fig. 4). Therefore, the fuzzy set A3 should be chosen as the analogue, because the value of its membership function at the point 0.3599 is greater. The obtained fuzzy analogues for all NDVI values are summarized in Table 3.
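This fuzzification rule can be sketched in Python as follows (an editorial illustration; the trapezoid parameters are those of Table 2): a crisp NDVI value is mapped to the fuzzy set whose membership degree at that point is the largest.

def trapezoid(x, a1, a2, a3, a4):
    # Symmetric trapezoidal membership function of Eq. (4).
    if x < a1 or x > a4:
        return 0.0
    if x < a2:
        return (x - a1) / (a2 - a1)
    if x > a3:
        return (a4 - x) / (a4 - a3)
    return 1.0

PARAMS = {  # (ak1, ak2, ak3, ak4) from Table 2
    "A1": (0.2183, 0.2479, 0.2775, 0.3071), "A2": (0.2775, 0.3071, 0.3367, 0.3663),
    "A3": (0.3367, 0.3663, 0.3959, 0.4255), "A4": (0.3959, 0.4255, 0.4551, 0.4847),
    "A5": (0.4551, 0.4847, 0.5143, 0.5439), "A6": (0.5143, 0.5439, 0.5735, 0.6031),
    "A7": (0.5735, 0.6031, 0.6327, 0.6623), "A8": (0.6327, 0.6695, 0.7063, 0.7431),
}

def fuzzify(value):
    # Pick the label whose membership function is highest at this value.
    return max(PARAMS, key=lambda k: trapezoid(value, *PARAMS[k]))

print(fuzzify(0.3599))  # 'A3', as in the worked example above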


Fig. 4. Neighboring membership functions.

Table 3. Fuzzy time series.

№ | Date | NDVI | Fuzzy set | № | Date | NDVI | Fuzzy set
1 | 18.02.2000 | 0.3599 | A3 | 9 | 25.06.2000 | 0.7101 | A8
2 | 05.03.2000 | 0.4099 | A4 | 10 | 11.07.2000 | 0.7135 | A8
3 | 21.03.2000 | 0.3680 | A3 | 11 | 27.07.2000 | 0.2479 | A1
4 | 06.04.2000 | 0.3296 | A2 | 12 | 12.08.2000 | 0.6587 | A8
5 | 22.04.2000 | 0.2535 | A1 | 13 | 28.08.2000 | 0.5473 | A6
6 | 08.05.2000 | 0.2966 | A2 | 14 | 13.09.2000 | 0.4815 | A5
7 | 24.05.2000 | 0.3211 | A2 | 15 | 29.09.2000 | 0.3719 | A3
8 | 09.06.2000 | 0.5104 | A5 | 16 | 15.10.2000 | 0.3217 | A2

Fuzzy time series modeling is based on the analysis of internal relationships (cause-effect relations) of different orders between the NDVI values. Within the framework of the NDVI fuzzy time series, the internal relationships of the 1st and 2nd orders were chosen. These relationships are represented as fuzzy relations in the implicative form "If A, then B" [6]. For example, internal relationships of the 1st order are grouped according to the following principle: if the fuzzy set A3 is related to the sets A4 and A2, then the group of the 1st order localized relative to it is A3 ⇒ A2, A4 (see Table 4, group G3). The groups of the 1st and 2nd orders are summarized in Table 4 and Table 5, respectively.

Table 4. The groups of the 1st order relationships.

Group | Relation
G1 | A1 ⇒ A2, A8
G2 | A2 ⇒ A1, A2, A5
G3 | A3 ⇒ A2, A4
G4 | A4 ⇒ A3
G5 | A5 ⇒ A3, A8
G6 | A6 ⇒ A5
G7 | A8 ⇒ A1, A6, A8

Table 5. The groups of the 2nd order relationships.

Group | Relation
G1 | A3, A4 ⇒ A3
G2 | A4, A3 ⇒ A2
G3 | A3, A2 ⇒ A1
G4 | A2, A1 ⇒ A2
G5 | A1, A2 ⇒ A2
G6 | A2, A2 ⇒ A5
G7 | A2, A5 ⇒ A8
G8 | A5, A8 ⇒ A8
G9 | A8, A8 ⇒ A1
G10 | A8, A1 ⇒ A8
G11 | A1, A8 ⇒ A6
G12 | A8, A6 ⇒ A5
G13 | A6, A5 ⇒ A3
G14 | A5, A3 ⇒ A2

If the NDVI value on the i-th day is denoted by x_i, and the NDVI value on the next (i+1)-th day is denoted by x_{i+1}, then a 1st order internal relationship such as A4 ⇒ A3 can be interpreted as a fuzzy implicative rule: "If x_i = A4, then x_{i+1} = A3". Or, for example, the 1st order internal relationship of the form A8 ⇒ A1, A6, A8 can be interpreted as a fuzzy implicative rule: "If x_i = A8, then x_{i+1} = A1 or x_{i+1} = A6 or x_{i+1} = A8". Accordingly, a 2nd order internal relationship such as A3, A4 ⇒ A3 can be interpreted as: "If x_{i-1} = A3 and x_i = A4, then x_{i+1} = A3".

Various models are used to produce fuzzy forecasts and to defuzzify them. The model chosen here works as follows [7]. If the NDVI value x_i is described by the fuzzy set A_j, which within the totality of all data forms only one internal relationship, for example the relation A_j ⇒ A_k, then the forecast for the next (i+1)-th day is the fuzzy set A_k. In the case when there is a group of internal relationships, for example A_j ⇒ A_{k1}, A_{k2}, …, A_{kp}, then the union A_{k1} ∪ A_{k2} ∪ … ∪ A_{kp} is the fuzzy forecast for the (i+1)-th day. To defuzzify the outputs of this model, the following two principles are applied [1, 6, 7].

Principle 1 [1, 6]. In the case of a fuzzy relation of the form A_i ⇒ A_j, where A_i is the fuzzy analogue of the NDVI on the i-th day, the crisp forecast for the next (i+1)-th day, i.e. the defuzzified value of the fuzzy forecast A_j, is the abscissa of the bisecting point of the upper base of the corresponding trapezoid. Indeed, according to

F(A) = \frac{1}{\alpha_{max}} \int_{0}^{\alpha_{max}} M(A_\alpha)\, d\alpha,   (6)

where A_α = {u | μ_A(u) ≥ α, u ∈ U} are the α-level sets (α ∈ [0, 1]) and M(A_α) = \sum_{k=1}^{m} u_k / m (u_k ∈ A_α) are the cardinalities of the corresponding α-level sets, for the fuzzy set A3 = {0/0.3367, 1/0.3663, 1/0.3959, 0/0.4255} (see Table 2), which is the forecast in the fuzzy relation A4 ⇒ A3, we have: for 0 < α ≤ 1, A3_α = {0.3663, 0.3959} and M(A3_α) = (0.3663 + 0.3959)/2 ≈ 0.3811. Then, according to (6), the crisp (defuzzified) output of the model is calculated as

F(A3) = \frac{1}{1} \int_{0}^{1} M(A3_\alpha)\, d\alpha \approx M(A3_\alpha) \cdot \alpha = 0.3811.

Principle 2 [7]. In the case of a fuzzy relation of the form A_i ⇒ A_j, A_t, A_p, where A_i is the fuzzy analogue of the NDVI on the i-th day, the crisp forecast for the next (i+1)-th day
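Both defuzzification principles can be sketched as below (an editorial illustration using the parameters of Table 2; for the symmetric trapezoids used here, Principle 1 reduces to the midpoint of the upper base [ak2, ak3], and Principle 2 averages these midpoints over the predicted group).

PARAMS = {  # (ak1, ak2, ak3, ak4) from Table 2
    "A1": (0.2183, 0.2479, 0.2775, 0.3071), "A2": (0.2775, 0.3071, 0.3367, 0.3663),
    "A3": (0.3367, 0.3663, 0.3959, 0.4255), "A4": (0.3959, 0.4255, 0.4551, 0.4847),
    "A5": (0.4551, 0.4847, 0.5143, 0.5439), "A6": (0.5143, 0.5439, 0.5735, 0.6031),
    "A7": (0.5735, 0.6031, 0.6327, 0.6623), "A8": (0.6327, 0.6695, 0.7063, 0.7431),
}

def crisp(label):
    # Principle 1: abscissa of the midpoint of the trapezoid's upper base.
    _, a2, a3, _ = PARAMS[label]
    return (a2 + a3) / 2.0

def forecast(labels):
    # Principle 2: arithmetic mean over the group of predicted fuzzy sets.
    return sum(crisp(l) for l in labels) / len(labels)

print(round(forecast(["A3"]), 4))              # 0.3811, as for A4 => A3
print(round(forecast(["A1", "A6", "A8"]), 4))  # 0.5031, as for A8 => A1, A6, A8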


is calculated as the arithmetic mean of the abscissas of the midpoints of the upper bases of the trapezoids corresponding to the fuzzy sets A_j, A_t and A_p. In particular, according to the group of internal relationships A8 ⇒ A1, A6, A8, the forecast F for the dates 11.07.2000, 27.07.2000 and 28.08.2000 is calculated as follows: F = [(0.2479 + 0.2775)/2 + (0.5439 + 0.5735)/2 + (0.6695 + 0.7063)/2]/3 = 0.5031. The forecasts obtained on the basis of the 1st order predictive model are summarized in Table 6, and the geometric interpretation of this model is shown in Fig. 5.

Table 6. The 1st order predictive model.

Date | NDVI | Output | Predict | Date | NDVI | Output | Predict
18.02.2000 | 0.3599 | – | – | 18.02.2000 | 0.3599 | A3, A8 | 0.5345
05.03.2000 | 0.4099 | A2, A4 | 0.3811 | 05.03.2000 | 0.4099 | A1, A6, A8 | 0.5031
21.03.2000 | 0.3680 | A3 | 0.3811 | 21.03.2000 | 0.3680 | A1, A6, A8 | 0.5031
06.04.2000 | 0.3296 | A2, A4 | 0.3811 | 06.04.2000 | 0.3296 | A2, A8 | 0.5049
22.04.2000 | 0.2535 | A1, A2, A5 | 0.3614 | 22.04.2000 | 0.2535 | A1, A6, A8 | 0.5031
08.05.2000 | 0.2966 | A2, A8 | 0.5049 | 08.05.2000 | 0.2966 | A5 | 0.4995
24.05.2000 | 0.3211 | A1, A2, A5 | 0.3614 | 24.05.2000 | 0.3211 | A3, A8 | 0.5345
09.06.2000 | 0.5104 | A1, A2, A5 | 0.3614 | 09.06.2000 | 0.5104 | A2, A4 | 0.3811

The set of internal connections of the 2nd and higher orders coincides with the set of NDVI fuzzy analogues (see Table 3). Nevertheless, we considered it necessary to show the defuzzified outputs of the 2nd order fuzzy model (see Table 7) and its geometric interpretation (see Fig. 5).

Table 7. The 2nd order predictive model.

Date | NDVI | Output | Predict | Date | NDVI | Output | Predict
18.02.2000 | 0.3599 | – | – | 18.02.2000 | 0.3599 | A8 | 0.6879
05.03.2000 | 0.4099 | – | – | 05.03.2000 | 0.4099 | A8 | 0.6879
21.03.2000 | 0.3680 | A3 | 0.3811 | 21.03.2000 | 0.3680 | A1 | 0.2627
06.04.2000 | 0.3296 | A2 | 0.3219 | 06.04.2000 | 0.3296 | A8 | 0.6879
22.04.2000 | 0.2535 | A1 | 0.2627 | 22.04.2000 | 0.2535 | A6 | 0.5587
08.05.2000 | 0.2966 | A2 | 0.3219 | 08.05.2000 | 0.2966 | A5 | 0.4995
24.05.2000 | 0.3211 | A2 | 0.3219 | 24.05.2000 | 0.3211 | A3 | 0.3811
09.06.2000 | 0.5104 | A5 | 0.4995 | 09.06.2000 | 0.5104 | A2 | 0.3219


[Fig. 5 series: the NDVI time series, the 1st order fuzzy model, and the 2nd order fuzzy model, plotted over February–October 2000.]

Fig. 5. Fuzzy time series models.

To assess the adequacy of the fuzzy time series models, the following criteria are used:

MAPE = \frac{1}{m} \sum_{j=1}^{m} \frac{|F_j - A_j|}{A_j} \times 100, \qquad MSE = \frac{1}{m} \sum_{j=1}^{m} (F_j - A_j)^2,

where A_j and F_j denote the actual value and the forecast at the date t_j, respectively. The results of forecasting and the assessments of their reliability for both models (see Fig. 5) are summarized in Table 8.

Table 8. Assessments of the model reliabilities.

Date | NDVI | 1st order | 2nd order | Date | NDVI | 1st order | 2nd order
18.02.2000 | 0.3599 | – | – | 18.02.2000 | 0.3599 | 0.5345 | 0.6879
05.03.2000 | 0.4099 | 0.3811 | – | 05.03.2000 | 0.4099 | 0.5031 | 0.6879
21.03.2000 | 0.3680 | 0.3811 | 0.3811 | 21.03.2000 | 0.3680 | 0.5031 | 0.2627
06.04.2000 | 0.3296 | 0.3811 | 0.3219 | 06.04.2000 | 0.3296 | 0.5049 | 0.6879
22.04.2000 | 0.2535 | 0.3614 | 0.2627 | 22.04.2000 | 0.2535 | 0.5031 | 0.5587
08.05.2000 | 0.2966 | 0.5049 | 0.3219 | 08.05.2000 | 0.2966 | 0.4995 | 0.4995
24.05.2000 | 0.3211 | 0.3614 | 0.3219 | 24.05.2000 | 0.3211 | 0.5345 | 0.3811
09.06.2000 | 0.5104 | 0.3614 | 0.4995 | 09.06.2000 | 0.5104 | 0.3811 | 0.3219
MSE: 1st order fuzzy model 0.0186, 2nd order fuzzy model 0.0198
MAPE: 1st order fuzzy model 30.5881, 2nd order fuzzy model 3.2796
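For completeness, the two adequacy measures can be computed with the short sketch below (an editorial illustration with placeholder lists; dates without a forecast are simply skipped).

def mape(actual, forecasts):
    pairs = [(a, f) for a, f in zip(actual, forecasts) if f is not None]
    return 100.0 * sum(abs(f - a) / a for a, f in pairs) / len(pairs)

def mse(actual, forecasts):
    pairs = [(a, f) for a, f in zip(actual, forecasts) if f is not None]
    return sum((f - a) ** 2 for a, f in pairs) / len(pairs)

# Example call with placeholder values only:
print(mape([0.4099, 0.3680], [0.3811, 0.3811]), mse([0.4099, 0.3680], [0.3811, 0.3811]))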

4 Conclusion

The results of the study of the NDVI long-term dynamics can be used for modeling and forecasting the efficiency and productivity of agricultural crops, and for the complex analysis


of the vegetation dynamics. Certainly, the obtained fuzzy models cannot be a reliable source of information for conclusions about the prospects for the state of vegetation on a given plot of agricultural land, because, firstly, they rely on a small amount of annual NDVI data, which is extremely insufficient for analyzing internal cause-effect relations of higher orders, and, secondly, they describe one annual cycle. Nevertheless, this approach can be extrapolated to all long-term NDVI data obtained using MODIS technology. As a result, it is possible to model and accordingly predict, for example, the dynamics of long-term seasonal NDVI values averaged for each analyzed date, as well as the dynamics of the average annual, annual minimum and annual maximum NDVI values.

References 1. Rzayev, R.R.: Analytical Support for Decision-Making in Organizational Systems. Palmerium Academic Publishing, Saarbruchen (2016). (in Russian) 2. Zadeh, L.A.: The concept of a linguistic variable and its application to approximate reasoning. Inf. Sci. 8(3), 199–249 (1965) 3. Vegetation Indices 16-Day L3 Global 250 m MOD13Q1. LPDAAC. https://lpdaac.usgs.gov/ dataset_discovery/modis/modis_products_table/mod13q1. Accessed 12 Oct 2020 4. Ortiz-Arroyo, D., Poulsen, J.R.: A weighted fuzzy time series forecasting model. Indian J. Sci. Technol. 11(27), 1–11 (2018) 5. Vegetation Indices 16-Day L3 Global 250 m MOD13Q1. LPDAAC: https://goo.gl/maps/YAd domuoXsD4QQN36. Accessed 12 Oct 2020 6. Rzayev, R.R., et al.: Time series modeling based on fuzzy analysis of position-binary components of historical data. Nechetkie Sistemy i Myagkie Vychisleniya [Fuzzy Systems and Soft Computing] 10(1), 35–73 (2015). (in Russian) 7. Chen, S.M.: Forecasting enrollments based on high-order fuzzy time series. Cybern. Syst. Int. J. 33, 1–16 (2002)

Application of Data–Driven Fault Diagnosis Design Techniques to a Wind Turbine Test–Rig Silvio Simani1(B) , Saverio Farsoni1 , and Paolo Castaldi2 1 2

Department of Engineering, University of Ferrara, Ferrara, Italy [email protected] Department of Electrical, Electronic, and Information Engineering, University of Bologna, Bologna, Italy http://www.silviosimani.it

Abstract. The fault diagnosis of safety critical systems such as wind turbine installations includes extremely challenging aspects that motivate the research issues considered in this paper. In particular, this work studies fault diagnosis solutions that can be implemented in a viable way and used as advanced techniques for condition monitoring of dynamic processes. To this end, the work proposes the design of fault diagnosis strategies that exploit the estimation of the fault by means of data–driven approaches. This solution leads to the development of effective methods allowing the management of partially unknown information on the system dynamics, while coping with measurement errors, the model–reality mismatch and other disturbance effects. In more detail, the proposed data–driven methodologies exploit fuzzy systems and neural networks in order to estimate the nonlinear dynamic relations between the input and output measurements of the considered process and the faults. To this end, the fuzzy and neural network structures are integrated with auto–regressive with exogenous input descriptions, thus making them able to approximate unknown nonlinear dynamic functions with an arbitrary degree of accuracy. Once these models are estimated from the input and output measurements acquired from the considered dynamic process, their fault diagnosis capabilities are validated by using a high–fidelity benchmark that simulates the healthy and the faulty behaviour of a wind turbine system. Moreover, at this stage the benchmark is also useful to analyse the robustness and the reliability characteristics of the developed tools in the presence of model–reality mismatch and modelling error effects featured by the wind turbine simulator. On the other hand, a hardware–in–the–loop tool is finally implemented for testing the performance of the developed fault diagnosis strategies in a more realistic environment.

Keywords: Wind turbine · Data–driven approach · Fuzzy Systems · Neural Networks · Hardware–in–the–loop



1

Introduction

As the power required worldwide is increasing, and at the same time in order to meet low carbon requirements, one of the possible solutions consists of increasing the exploitation of wind–generated energy. However, this need implies to improve the levels of availability and reliability, which represent the fundamental ‘sustainability’ feature, extremely good for these renewable energy conversion systems. In fact, wind turbine processes should produce the required amount of electrical energy continuously and in an effective way, trying to maximise the wind power capture, on the basis of the grid’s demand and despite malfunctions. To this end, possible faults affecting this energy conversion system must be properly diagnosed and managed, in their earlier occurrence, before they degrade into failures and become critical issues. Another important aspect regards wind turbines with very large rotors, which allow to reach megawatt size, possibly maintaining light load carrying structures. On one hand, they become very expensive and complex plants, which require advanced control techniques to reduce the effects of induced torques and mechanical stress on the structures. On the other hand, they need an extremely high level of availability and reliability, in order to maximise the generated energy, to minimise the production cost, and to reduce the requirements of Operation and Maintenance (O & M) services. In fact, the final cost of the produced energy depends first on the installation expenses of the plant (fixed cost). Second, unplanned O&M costs may increase it up to about 30%, especially if offshore wind turbine installations are considered, as they work in harsh environments. These considerations motivate the implementation of condition monitoring and fault diagnosis techniques that can be integrated into fault tolerant control strategies, i.e. the ‘sustainable’ solutions. Unfortunately, several wind turbine manufacturers do not implement ‘active’ approaches against faults, but rather adopt conservative solutions. For example, this means to resort to the shutdown of the plant and wait for O&M service. Therefore, more effective tools for managing faults in an active way must be considered, in order to improve the wind turbine working conditions, and not only during faulty behaviour. Note also that, in principle, this strategy will be able to prevent critical faults that might affect other components of the wind turbine system, and may thus avoid to require unplanned replacement of its functional parts. Moreover, it will lead to decrease possible O&M costs, while improving the energy production and reducing the energy cost. On the other hand, the implementation of advanced control systems, with the synergism with big data tools and artificial intelligence techniques will lead to the development of real–time condition monitoring, fault diagnosis and fault tolerant control solutions for these safe–critical systems that can be enabled also only with on–demand features. In the last decades several papers have investigated the problem of fault diagnosis for wind turbine systems, as addressed e.g. in [6]. Some of them have investigated the diagnosis of particular faults, i.e. those affecting the drive– train part of the wind turbine nacelle. In fact, sometimes the detection of these faults can be enhanced if the wind turbine subsystems are compared to other


modules of the whole plant [9]. Moreover, more challenging topics regarding the fault tolerant control of wind turbines have been considered e.g. in [11], also by proposing international co–operations on common problems, as analysed e.g. in [7]. Therefore, the key point is represented by the fault diagnosis task when exploited to achieve the sustainability feature for safety–critical systems, such as wind turbines. In fact, it has been shown to represent a challenging topic [20], thus justifying the research issues investigated in this paper. With reference to this paper, the topic of the fault diagnosis of a wind turbine system is analysed. In particular, the design of practical and reliable solutions to Fault Detection and Isolation (FDI) is considered. However, differently from other works by the same authors, the Fault Tolerant Control (FTC) topic is not investigated here, even if it can rely on the same tools exploited in this paper. In fact, the fault diagnosis module provides the reconstruction of the fault signal affecting the process, which could be actively compensated by means of a controller accommodation mechanism. Moreover, the fault diagnosis design is enhanced by the derived fault reconstructors that are estimated via data–driven approaches, as they also allow to accomplish the fault isolation task. The first data–driven strategy proposed in this work exploits Takagi–Sugeno (TS) fuzzy prototypes [1, 4], which are estimated via a clustering algorithm and exploiting the data–driven algorithm developed in [15]. For comparison purpose, a further approach is designed, which exploits Neural Networks (NNs) to derive the nonlinear dynamic relations between the input and output measurements acquired for the process under diagnosis and the faults affecting the plant. The selected structures belong to the feed–forward Multi–Layer Perceptron (MLP) neural network class that include also Auto–Regressive with eXogenous (ARX) inputs in order to model nonlinear dynamic links among the data. In this way, the training of these Nonlinear ARX (NARX) prototypes for fault estimation can exploit standard back–propagation training algorithm, as recalled e.g. in [5]. The designed fault diagnosis schemes are tested via a high–fidelity simulator of a wind turbine process, which describes its behaviour in healthy and faulty conditions. This simulator, which represents a benchmark [8], includes the presence of uncertainty and disturbance effects, thus allowing to verify the reliability and robustness characteristics of the proposed fault diagnosis methodologies. Moreover, this work proposes to validate the efficacy of the designed fault diagnosis techniques by exploiting a more realistic scenario, which consists of a Hardware–In–the–Loop (HIL) tool. It is worth noting the main contributions of this paper with respect to previous works by the authors. For example, this study analyses the solutions addressed e.g. in [19] but taking into account a more realistic and real–time system illustrated in Sect. 4. On the other hand, the fault diagnosis scheme developed in this paper was designed for a wind turbine system also in [17], but without considering the HIL environment. The fuzzy methodology was also proposed in by the authors in [14], which considered the development of recursive algorithms for the implementation of adaptive laws relying on Linear Parameter Varying (LPV) systems. The app-


The approach proposed in this paper estimates the fault diagnosis models by means of off–line procedures. Moreover, this paper further develops the achievements obtained e.g. in [18], which concerned the fault diagnosis of a wind farm. The paper [16] proposed the design of a fault tolerant controller using the input–output data acquired from a single wind turbine, by exploiting the results achieved in [17]. On the other hand, this work considers the verification and the validation of the proposed fault diagnosis methodologies by exploiting an original HIL tool, proposed in a preliminary paper by the same authors [13]. The paper is organised as follows. Section 2 briefly summarises the wind turbine simulator, as it represents a well-established benchmark available in the literature [10]. Section 3 describes the fault diagnosis strategies based on Fuzzy Systems (FSs) and NN structures. Section 4 summarises the results obtained via the simulations and the HIL tool describing the behaviour of the wind turbine process. Finally, Sect. 5 concludes the work by reporting the main points of the paper and suggesting some interesting issues for further research and future investigations.

2 Wind Turbine System Description

The Wind Turbine (WT) benchmark considered in this work for validation purposes was earlier presented in [8, 10] and motivated by an international competition. Despite its quite simple structure, it describes quite accurately the actual behaviour of a three–blade horizontal–axis wind turbine that works at variable speed and is controlled by means of the pitch angle of its blades. The plant includes several interconnected subsystems, namely the wind process, the wind turbine aerodynamics, the drive–train, the electric generator/converter, the sensor and actuator systems, and the baseline controller. The overall system is sketched in Fig. 1, which represents the fault diagnosis target considered in this work. Further details of the WT benchmark are not provided here, as they were described in detail in [9] and the references therein.

Fig. 1. The WT benchmark and its functional subsystems (rotor with blades and pitch actuators, drive–train with low/high speed shafts and gear box, generator/converter, pitch/speed/power sensors, and controllers).


This wind turbine benchmark is able to generate different typical fault cases affecting the sensors, the actuators and the process components. This scenario, comprising 9 fault situations, is illustrated in Table 1, which reports the input and output measurements acquired from the WT process that are mainly affected by these faults.

Table 1. Fault scenario of the WT benchmark.

Fault case   Fault type   Most affected input–output measurements
1            Sensor       β1,m1, β1,m2, ωg,m2
2            Sensor       β1,m2, β2,m2, ωg,m2
3            Sensor       β1,m2, β3,m1, ωg,m2
4            Sensor       β1,m2, ωg,m2, ωr,m1
5            Sensor       β1,m2, ωg,m2, ωr,m2
6            Actuator     β1,m2, β2,m1, ωg,m2
7            Actuator     β1,m2, β3,m2, ωg,m2
8            Actuator     β1,m2, τg,m, ωg,m2
9            System       β1,m2, ωg,m1, ωg,m2

In this way, Table 1 reports the measurements uj(k) and yl(k) acquired from the WT system that are most sensitive to the fault conditions implemented in the WT benchmark. In practice, the fault signals of Table 1 were injected into the WT simulator, assuming that only a single fault may occur at a time. Then, by checking the Relative Mean Square Errors (RMSEs) between all the fault–free and faulty measurements from the WT plant, the most sensitive signals uj(k) and yl(k) were selected and reported in Table 1. For fault diagnosis purposes, the complete model of the WT benchmark can be described as a nonlinear continuous–time dynamic model represented by the function fwt of Eq. (1), including the overall behaviour of the WT process reported in Fig. 1, with state vector xwt fed by the driving input vector u:

\begin{cases} \dot{x}_{wt}(t) = f_{wt}\left(x_{wt}(t), u(t)\right) \\ y(t) = x_{wt}(t) \end{cases} \qquad (1)

Equation (1) highlights that the simulator allows all the state vector signals to be measured, i.e. the rotor speed, the generator speed and the generated power of the WT process:

x_{wt}(t) = y(t) = \left[\omega_{g,m1}, \omega_{g,m2}, \omega_{r,m1}, \omega_{r,m2}, P_{g,m}\right]

The driving input vector is represented by the following signals:

u(t) = \left[\beta_{1,m1}, \beta_{1,m2}, \beta_{2,m1}, \beta_{2,m2}, \beta_{3,m1}, \beta_{3,m2}, \tau_{g,m}\right]

which represent the acquired measurements of the pitch angles of the three WT blades and the measured generator/converter torque.


These signals are acquired with sample time T in order to obtain N data samples, indicated as u(k) and y(k) with index k = 1, . . . , N, which are exploited to design the fault diagnosis strategies addressed in this work.
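As a rough illustration of the signal selection described above, the following Python sketch compares fault–free and faulty records of each measurement channel through a relative mean square error and ranks the channels by sensitivity. The exact error index used by the authors is not specified in detail, so this normalisation, the channel names and the synthetic data are illustrative assumptions, not the benchmark's actual measurements.

```python
import numpy as np

def relative_mse(fault_free: np.ndarray, faulty: np.ndarray) -> float:
    """Relative mean square error between a fault-free and a faulty record (assumed form)."""
    return np.mean((faulty - fault_free) ** 2) / (np.mean(fault_free ** 2) + 1e-12)

def rank_sensitive_signals(records_free: dict, records_faulty: dict, top: int = 3):
    """Rank measurement channels by their sensitivity to a given injected fault."""
    scores = {name: relative_mse(records_free[name], records_faulty[name])
              for name in records_free}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Synthetic example: two channels, only one of which reacts to the injected fault.
rng = np.random.default_rng(0)
t = np.arange(0, 10, 0.01)                      # placeholder sample time T = 0.01 s
free = {"beta1_m2": np.sin(t) + 0.01 * rng.standard_normal(t.size),
        "omega_g_m2": np.cos(t) + 0.01 * rng.standard_normal(t.size)}
faulty = {"beta1_m2": free["beta1_m2"] + 0.5 * (t > 5),   # step-like sensor fault
          "omega_g_m2": free["omega_g_m2"]}
print(rank_sensitive_signals(free, faulty, top=1))         # -> ['beta1_m2']
```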

3 Data–Driven Methods for Fault Diagnosis

This section recalls the fault diagnosis strategy proposed in this paper, which relies on FS and NN tools, as summarised in Sect. 3.1. These architectures are able to represent NARX models exploited for estimating the nonlinear dynamic relations between the input and output measurements of the WT process and the fault signals. In this sense, these NARX prototypes are employed as fault estimators for solving the fault diagnosis problem of the WT system. Under these assumptions, the fault estimators derived by means of a data–driven approach represent the residual generators r(k), which provide the on–line reconstruction \hat{f}(k) of the fault signals summarised in Table 1, as represented by Eq. (2):

r(k) = \hat{f}(k) \qquad (2)

where the general term \hat{f}(k) represents the fault vector of Table 1, i.e. \hat{f}(k) = \left[\hat{f}_1(k), \ldots, \hat{f}_9(k)\right]. The fault diagnosis scheme exploiting the proposed fault estimators as residual generators is sketched in Fig. 2. Note that, as already highlighted, this scheme is also able to solve the fault isolation task [2].

Fig. 2. Bank of fault reconstructors for fault diagnosis: each fault estimator i is fed by a selection of the input and output measurements uj(k), yl(k) of the dynamic process and provides the residual ri(k) = \hat{f}_i(k).


Figure 2 shows that the general residual generator exploits the input and output measurements acquired from the process under diagnosis, u(k) and y(k), properly selected according to the analysis shown in Table 1. The fault detection task can be easily accomplished by means of a simple threshold logic applied to the residuals themselves, as described in [2]; this issue is not considered further in this paper. Once the fault detection phase is solved, the fault isolation stage is directly obtained via the bank of estimators of Fig. 2. In this case, the number of estimators of Fig. 2 is equal to the number of faults to be detected, i.e. 9, which is lower than the number of input and output measurements, r + m, acquired from the WT process. This condition provides several degrees of freedom, as the i–th reconstructor of the fault \hat{f}_i(k) = r_i(k) is a function of the input and output signals u(k) and y(k). These signals are thus selected in order to be sensitive to the specific fault f_i(k), as highlighted in Table 1. This procedure also simplifies the design of the fault reconstructors, as it reduces the number of possible input and output measurements, uj(k) and yl(k), which have to be considered for the identification procedure reported in Sect. 3.1. The sensitivity analysis already represented in Table 1 has to be performed before the estimation of the fault estimators. Therefore, once the input–output signals are selected according to Table 1, the FSs and the NNs used as fault reconstructors can be developed, as summarised in Sect. 3.1.
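A minimal sketch of the threshold logic mentioned above is given below. The residuals are assumed to be already available as arrays, and the threshold values and the persistence test are illustrative assumptions rather than the settings tuned for the benchmark.

```python
import numpy as np

def detect_and_isolate(residuals, thresholds, min_samples=10):
    """
    residuals: (n_faults, N) array, one residual (fault estimate) per row,
               as produced by a bank of estimators such as the one of Fig. 2.
    thresholds: length-n_faults vector of fixed detection thresholds.
    A fault is declared (and isolated, since each residual targets one fault)
    when its residual exceeds its threshold for at least `min_samples`
    consecutive samples, a simple persistence test limiting false alarms
    caused by noise and model-reality mismatch.
    Returns a list of (fault_index, detection_sample) pairs.
    """
    detections = []
    for i, (r, th) in enumerate(zip(np.abs(residuals), thresholds)):
        run = 0
        for k, above in enumerate(r > th):
            run = run + 1 if above else 0
            if run == min_samples:
                detections.append((i, k - min_samples + 1))
                break
    return detections

# Illustrative use with two synthetic residuals.
rng = np.random.default_rng(1)
N = 2000
r = 0.05 * rng.standard_normal((2, N))
r[0, 800:] += 1.0                       # fault 1 appears at sample 800
print(detect_and_isolate(r, thresholds=np.array([0.3, 0.3])))   # -> [(0, 800)]
```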

3.1 Fault Estimators via Artificial Intelligence Tools

This section recalls the procedure for developing the fault estimators modelled as Takagi–Sugeno (TS) FSs. In this way, the unknown dynamic relations between the selected input and output measurements of the WT plant and the faults are represented by means of FSs, which rely on a number of rules with antecedent and consequent functions. These rules represent the inference system connecting the measured signals from the system under diagnosis to its faults, in the form of IF =⇒ THEN relations, implemented via the so–called Fuzzy Inference System (FIS) [1]. According to this modelling strategy, the general TS fuzzy prototype has the form of Eq. (3):

\hat{f}(k) = \frac{\sum_{i=1}^{n_C} \lambda_i\left(x(k)\right)\left( a_i^{T} x(k) + b_i \right)}{\sum_{i=1}^{n_C} \lambda_i\left(x(k)\right)} \qquad (3)

Using this approach, the fault signal \hat{f}(k) is reconstructed by using suitable data taken from the WT process under diagnosis. In this case, the fault function \hat{f}(k) is represented as a weighted average of affine parametric relations a_i^{T} x(k) + b_i (the consequents), depending on the input and output measurements collected in x(k). The weights are the fuzzy membership degrees \lambda_i(x) of the system inputs. The parametric relations of the consequents depend on the unknown variables a_i and b_i, which are estimated by means of an identification approach.


The number of rules is assumed equal to the number of clusters n_C exploited to partition the data via a clustering algorithm, i.e. the regions where the parametric relations (consequents) hold [1]. Note that the system under diagnosis corresponds to a WT plant, which is described by a dynamic model. Therefore, the vector x(k) in Eq. (3) contains both the current and the delayed samples of the system input and output measurements, so that the consequents include discrete–time linear Auto–Regressive with eXogenous (ARX) input structures of order o. This regressor vector is described in the form of Eq. (4):

x(k) = \left[ \ldots, y_l(k-1), \ldots, y_l(k-o), \ldots, u_j(k), \ldots, u_j(k-o), \ldots \right]^{T} \qquad (4)

where u_j(\cdot) and y_l(\cdot) represent the j–th and l–th components of the actual WT input and output vectors u(k) and y(k), respectively. These components are selected according to the results reported in Table 1. The consequent affine parameters of the i–th model of Eq. (3) are usually represented with a vector:

a_i = \left[ \alpha_1^{(i)}, \ldots, \alpha_o^{(i)}, \delta_1^{(i)}, \ldots, \delta_o^{(i)} \right]^{T} \qquad (5)

where the coefficients \alpha_j^{(i)} are associated with the delayed output samples, whilst the \delta_j^{(i)} correspond to the input ones.

The approach proposed in this paper for the derivation of the generic i–th fault approximator (FIS) starts with the fuzzy clustering of the data u(k) and y(k) from the WT process. This paper exploits the well-established Gustafson–Kessel (GK) algorithm [1]. Moreover, the estimation of the FIS parameters is addressed as a system identification problem from the noisy data of the WT process. Once the data are clustered, the identification strategy proposed in this work exploits the methodology developed by the authors in [3]. Another key point, not addressed in this work, concerns the selection of the optimal number of clusters n_C. This issue was investigated and developed by the authors, and it leads to the estimation of the membership degrees \lambda_i(x(k)) required in Eq. (3), solved as a curve fitting problem [1]. A sketch of the resulting estimator structure is given below.
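To make the structure of Eq. (3) concrete, the following sketch evaluates a TS fuzzy fault estimator as a membership-weighted average of affine consequents. Gaussian membership functions and random parameters are used purely for illustration; in the paper the memberships and consequent parameters come from GK clustering and the identification method of [3], which are not reproduced here.

```python
import numpy as np

class TSFuzzyEstimator:
    """Takagi-Sugeno estimator: f_hat = sum_i lam_i (a_i^T x + b_i) / sum_i lam_i."""

    def __init__(self, centres, widths, a, b):
        self.centres = np.asarray(centres)   # (n_C, n_x) cluster centres
        self.widths = np.asarray(widths)     # (n_C,) membership widths (assumption)
        self.a = np.asarray(a)               # (n_C, n_x) consequent parameters a_i
        self.b = np.asarray(b)               # (n_C,) consequent offsets b_i

    def memberships(self, x):
        d2 = np.sum((self.centres - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.widths ** 2))   # Gaussian memberships (assumption)

    def estimate(self, x):
        lam = self.memberships(x)
        consequents = self.a @ x + self.b                # affine ARX consequents
        return float(np.dot(lam, consequents) / (np.sum(lam) + 1e-12))

# Example with n_C = 4 rules and a regressor x(k) built as in Eq. (4).
rng = np.random.default_rng(2)
n_C, n_x = 4, 6
model = TSFuzzyEstimator(centres=rng.standard_normal((n_C, n_x)),
                         widths=np.ones(n_C),
                         a=0.1 * rng.standard_normal((n_C, n_x)),
                         b=rng.standard_normal(n_C))
x_k = rng.standard_normal(n_x)                           # placeholder regressor at time k
print(model.estimate(x_k))
```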


This paper also considers an alternative data–driven approach, which exploits neural networks used as fault approximators in the scheme of Fig. 2. Therefore, in the same way as in the fuzzy scheme, a bank of NNs is exploited to reconstruct the faults affecting the WT system under diagnosis using a proper selection of the input and output measurements. The exploited NN structure consists of a feed–forward Multi–Layer Perceptron (MLP) architecture with 3 layers of neurons [5]. However, as MLP networks represent static relations, the paper suggests implementing the MLP structure with a tapped delay line. Therefore, this quasi–static NN represents a powerful way of estimating nonlinear dynamic regressions between the input and output measurements from the WT process and its fault functions. This solution allows another Nonlinear ARX (NARX) description among the data to be obtained. Moreover, when properly trained, these NARX NNs are able to reconstruct the fault function \hat{f}(k) using a suitable selection of the past measurements of the WT system inputs and outputs u_j(k) and y_l(k), respectively. An example of the general solution is sketched in Fig. 3.

Fig. 3. NARX NN for fault reconstruction: the WT input and output measurements u(k), y(k) are passed through tapped delay lines and fed to a feed-forward multilayer perceptron neural network that provides the residual r(k) = \hat{f}(k).

Similarly to the fuzzy scheme, with reference to the i–th fault reconstructor, a bank of NARX NNs is exploited, where the generic NARX system models the relation of Eq. (6):

\hat{f}(k) = F\left( \ldots, u_j(k), \ldots, u_j(k - d_u), \ldots, y_l(k-1), \ldots, y_l(k - d_y), \ldots \right) \qquad (6)

where \hat{f}(k) represents the estimate of the general i–th fault in Table 1, whilst u_j(\cdot) and y_l(\cdot) indicate the components of the measured inputs and outputs of the WT process. These signals are selected again by means of the solution of the fault sensitivity problem reported in Table 1. The accuracy of the fault reconstruction depends on the number of neurons per layer, their weights and their activation functions.
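The sketch below shows one possible realisation of the quasi-static NARX estimator of Eq. (6): a tapped delay line builds the regressor, and a 3-layer MLP with a sigmoidal hidden layer and a linear output maps it to a fault estimate. The hidden layer of 10 neurons follows the sizes reported later in Sect. 4.1, but the weights are random placeholders and the back-propagation training itself is omitted.

```python
import numpy as np

def build_regressor(u, y, k, d_u=4, d_y=4):
    """Tapped delay line: stack current/past inputs and past outputs, as in Eq. (6)."""
    past_u = [u[k - d] for d in range(0, d_u)]
    past_y = [y[k - d] for d in range(1, d_y + 1)]
    return np.array(past_u + past_y)

class NarxMlp:
    """Feed-forward MLP (input-hidden-output) used as a NARX fault estimator."""

    def __init__(self, n_in, n_hidden=10, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W1 = 0.5 * rng.standard_normal((n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.5 * rng.standard_normal(n_hidden)
        self.b2 = 0.0

    def predict(self, x):
        h = np.tanh(self.W1 @ x + self.b1)     # sigmoidal hidden layer
        return float(self.W2 @ h + self.b2)    # linear output layer

# Illustrative single-step fault estimate from synthetic signals (untrained network).
rng = np.random.default_rng(3)
u, y = rng.standard_normal(100), rng.standard_normal(100)
x_k = build_regressor(u, y, k=50)
estimator = NarxMlp(n_in=x_k.size)
print(estimator.predict(x_k))
```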

4 Simulation and Experimental Tests

This section reports the results obtained when the fault diagnosis schemes described in Sect. 3 are applied to the WT process of Sect. 2. As already remarked, the WT benchmark implements realistic uncertainty and disturbance effects, which are considered for analysing the robustness characteristics of the designed fault diagnosis strategies. The same techniques are also validated using a more realistic HIL tool, as described in Sect. 4.2.

4.1 Simulated Results

With reference to the WT benchmark of Sect. 2, the simulations are driven by different, randomly generated wind sequences. They reproduce real measurements of wind speed representing typical WT operating conditions, with values ranging from 5 m/s to 20 m/s. This scenario was modified by the authors with respect to the earlier benchmark proposed in [8]. The simulations last 4400 s, with single fault occurrences and a number of samples N = 440000 for a sampling frequency of 100 Hz. Almost all fault signals are modelled as step functions lasting 100 s with different commencing times. Further details can be found in [8, 10]. The first part of this section reports the results achieved by means of the fuzzy prototypes used as fault reconstructors according to Sect. 3.1. In particular, the fuzzy c–means and the GK clustering algorithms were exploited. A number of clusters n_C = 4 and a number of delays o = 3 were estimated. The membership functions of the TS FSs and the parameters of the consequents \alpha_j^{(i)} and \delta_j^{(i)} were estimated for each cluster by following the procedure developed by the same authors in [12]. The TS FSs of Eq. (3) were thus determined and 9 fault reconstructors were organised according to the scheme of Fig. 2. The performance of the 9 TS FSs used as fault estimators was evaluated according to the RMSE % index, computed from the difference between the reconstructed \hat{f}(k) and the actual f(k) signals for each of the fuzzy estimators. These values are reported in Table 2.

Table 2. FS fault estimator capabilities.

Fault case   1        2        3        4        5        6        7        8        9
RMSE %       1.61%    2.22%    1.95%    1.87%    1.92%    2.15%    1.76%    2.13%    1.98%
Std. Dev.    ±0.02%   ±0.03%   ±0.01%   ±0.01%   ±0.01%   ±0.02%   ±0.01%   ±0.02%   ±0.01%
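As a rough companion to the clustering step described above, the sketch below implements a plain fuzzy c-means partition of a regressor cloud into n_C = 4 clusters. The paper actually relies on the Gustafson–Kessel algorithm, which additionally adapts a metric per cluster, so this plain c-means loop with synthetic data is only a simplified stand-in.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=4, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means (not Gustafson-Kessel): returns centres and memberships."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((n_clusters, N))
    U /= U.sum(axis=0)                                  # initial fuzzy partition matrix
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centres[:, None, :], axis=2) + 1e-12
        # standard FCM membership update: U_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)
    return centres, U

# Example: cluster a synthetic 2-D regressor cloud into four fuzzy regions.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc, 0.2, size=(50, 2))
               for loc in ((0, 0), (2, 0), (0, 2), (2, 2))])
centres, U = fuzzy_c_means(X, n_clusters=4)
print(np.round(centres, 2))
```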

Indeed, the RMSE % values reported in Table 2 represent an average of the results obtained from a campaign of 1000 simulations, as the benchmark exploited in this work changes the parameters of the WT model at each run. Moreover, the model–reality mismatch, the measurement errors, and the uncertainty and disturbance effects are described as Gaussian processes with suitable distributions, as remarked in Sect. 2. Therefore, Table 2 also reports the values of the standard deviation of the estimation errors achieved by the FS fault estimators. Note that these reconstructed signals \hat{f}(k) can be directly used as diagnostic residuals in order to detect and isolate the faults affecting the WT. Moreover, each TS FS of Eq. (3) is fed by 3 inputs (according to Table 1), with a number of delayed inputs and outputs o = 3 and n_C = 4 clusters. As an example, Fig. 4 shows the results regarding the fault cases 1, 2, 3, and 4 of the WT plant recalled in Sect. 2. In particular, Fig. 4 reports the estimated faults \hat{f}(k) = r_i(k) provided by the FSs in faulty conditions (black continuous line).

Fig. 4. Reconstructed faults \hat{f}(k) for fault cases 1, 2, 3, and 4.

They are compared with the corresponding fault–free residuals (grey line). The fixed thresholds, reported with dotted lines, are used for fault detection. Note that the reconstructed fault functions \hat{f}(k) = r_i(k) are different from zero also in fault–free conditions, due to the measurement errors and the model–reality mismatch. This aspect serves to highlight the accuracy of the reconstructed signals provided by the estimated fuzzy models. As for the FSs, the 9 NARX NNs summarised in Sect. 3.1 were derived to provide the reconstruction of the 9 faults affecting the WT plant. In particular, the NARX estimators were implemented as MLP NNs with 3 layers: the input layer consisted of 3 neurons, the hidden layer used 10 neurons, and one neuron was used for the output layer. 4 delays were used in the relation of Eq. (6). Moreover, sigmoidal activation functions were used in both the input and the hidden layers, and a linear function in the output layer. With reference to Table 1, the NARX NNs were fed by 9 signals, representing the delayed inputs and outputs from the WT process. As for the FSs, the prediction accuracy of the NARX NNs was analysed by means of the RMSE % index, whose average values are summarised in Table 3.

Table 3. NN fault estimator capabilities.

Fault case   1        2        3        4        5        6        7        8        9
RMSE %       0.91%    0.92%    0.94%    1.21%    1.17%    1.61%    0.98%    0.95%    1.41%
Std. Dev.    ±0.01%   ±0.01%   ±0.01%   ±0.02%   ±0.01%   ±0.01%   ±0.01%   ±0.01%   ±0.02%

As for the FS case, Table 3 also reports the values of the standard deviation of the estimation errors achieved by the NARX NN fault estimators.
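The aggregation of the RMSE % index over a Monte Carlo campaign, as used for Tables 2 and 3, can be sketched as follows. The normalisation of the index and the synthetic per-run data are assumptions made only to show how the mean and standard deviation entries are obtained.

```python
import numpy as np

def rmse_percent(f_true, f_hat):
    """Percentage RMSE between the actual and the reconstructed fault signal (assumed form)."""
    return 100.0 * np.sqrt(np.mean((f_hat - f_true) ** 2)) / (np.sqrt(np.mean(f_true ** 2)) + 1e-12)

def campaign_statistics(runs):
    """Average RMSE % and its standard deviation over a simulation campaign."""
    values = np.array([rmse_percent(f_true, f_hat) for f_true, f_hat in runs])
    return values.mean(), values.std()

# Synthetic campaign: 1000 runs of a step-like fault plus noisy reconstructions.
rng = np.random.default_rng(5)
t = np.arange(0, 100, 0.01)
f_true = (t > 40).astype(float)
runs = [(f_true, f_true + 0.01 * rng.standard_normal(t.size)) for _ in range(1000)]
print(campaign_statistics(runs))    # roughly (1.3, small spread); values are illustrative only
```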


Also in this case, Fig. 5 depicts some of the residual signals \hat{f}(k) = r_i(k) provided by the NARX NNs for the fault conditions 6, 7, 8, and 9, compared with the fixed detection thresholds (dotted lines).

Fig. 5. Estimated faults for fault cases 6, 7, 8, and 9.

Also in this case, the results obtained by the NARX NNs serve to highlight the efficacy of the developed solution, taking into account also the disturbance and uncertainty affecting the WT system.

4.2 Hardware–in–the–Loop Experiments

In order to validate the developed fault diagnosis solutions in more realistic, real–time working situations, the WT process and the designed algorithms have been implemented and executed by means of a HIL tool. This test–bed allows experimental tests oriented to the verification of the results achieved in simulation to be reproduced. The test–bed is sketched in Fig. 6, which highlights its 3 main modules. The WT simulator, originally developed in the Matlab and Simulink environments to describe the WT system dynamics, its actuators, its measurement sensors, and the WT controller, has been implemented in the LabVIEW environment. Realistic effects such as uncertainty, measurement errors, disturbance and the model–reality mismatch were also included, as recalled in Sect. 2. The overall system is converted into C++ code running on a standard PC, which also allows the signals generated by the proposed fault diagnosis strategies to be tested and monitored. The fault diagnosis schemes summarised in Sect. 3.1 have also been compiled as executable code and implemented on an AWC 500 industrial system that features typical wind turbine requirements.

Fig. 6. HIL tool for real–time validation: the real–time WT simulator (including the wind, disturbance and uncertainty models and signal monitoring) runs on a PC, interface circuits exchange the measured signals and control commands, and the on–board electronics execute the fault estimation solutions.

This industrial module receives the signals acquired from the PC simulating the realistic WT plant, which represent the monitored signals reported in Table 1. The on–board electronics then elaborate these signals according to the fault diagnosis algorithms and produce the monitoring signals transmitted back to the WT simulator running on the PC. An intermediate module provides the interface circuits enabling the communication between the PC running the WT simulator and the on–board electronics running the fault diagnosis algorithms. In this way, it manages the signals and exchanges the data between the WT simulator and the AWC 500 system. The results achieved via this HIL tool are reported in Table 4, which summarises the capabilities of the fault diagnosis algorithms by means of the RMSE % performance index.

Table 4. RMSE % index for the HIL tool.

Fault case   1        2        3        4        5        6        7        8        9
TS FSs       1.69%    2.29%    2.01%    1.94%    1.99%    2.22%    1.81%    2.21%    2.03%
NARX NNs     0.99%    0.98%    0.99%    1.28%    1.21%    1.69%    1.02%    1.01%    1.51%

Note that the tests summarised in Table 4 are consistent with the results reported in Tables 2 and 3. Although the accuracy obtained in simulation seems better than the performance achieved via the HIL tool, some remarks have to be made. First, the AWC 500 system performs calculations that are more restrictive than those of the PC simulator. Moreover, A/D and D/A devices are also exploited, which can introduce further deviations.


On the other hand, the testing of real scenarios does not involve the data transfer from a PC to the on–board electronics, thus reducing possible errors. Therefore, it can finally be remarked that the achieved results are quite accurate and motivate the application of the developed fault diagnosis strategies to real WT installations.

5 Conclusion

The paper addressed fault diagnosis strategies developed by means of artificial intelligence tools. In this way, the challenging problem of the condition monitoring of a wind turbine process was solved via data–driven approaches. The designed fault diagnosis strategy was based on the reconstruction of the faults affecting the considered process, implemented via dynamic system identification and artificial intelligence tools. The paper proposed these strategies as they represent viable methods that are able to manage partially known information on the process dynamics, and to cope with measurement errors, model–reality mismatch and disturbance effects. In particular, these fault diagnosis schemes were implemented by means of fuzzy and neural network structures exploited to determine the nonlinear dynamic relations between the input and output measurements acquired from the process under diagnosis and the fault signals affecting the process. The developed prototypes included auto–regressive with exogenous input architectures in order to approximate the nonlinear dynamic relations with an arbitrary degree of accuracy. The designed fault diagnosis solutions were validated via a high–fidelity benchmark that simulates the behaviour of a realistic wind turbine plant. This wind turbine simulator was also employed to test the reliability and robustness characteristics of the fault diagnosis schemes by considering the presence of the uncertainty and disturbance effects included in the wind turbine benchmark. Further works will verify the features of the same fault diagnosis schemes when applied to real plants and by means of real data.

References

1. Babuška, R.: Fuzzy Modeling for Control. Kluwer Academic Publishers, Boston (1998)
2. Chen, J., Patton, R.J.: Robust Model-Based Fault Diagnosis for Dynamic Systems. Kluwer Academic Publishers, Boston (1999)
3. Fantuzzi, C., Simani, S., Beghelli, S., Rovatti, R.: Identification of piecewise affine models in noisy environment. Int. J. Control 75(18), 1472–1485 (2002). https://doi.org/10.1109/87.865858
4. Harrabi, N., Kharrat, M., Aitouche, A., Souissi, M.: Control strategies for the grid side converter in a wind generation system based on a fuzzy approach. Int. J. Appl. Math. Comput. Sci. 28(2), 323–333 (2018)
5. Korbicz, J., Koscielny, J.M., Kowalczuk, Z., Cholewa, W. (eds.): Fault Diagnosis: Models, Artificial Intelligence, Applications, 1st edn. Springer, London (2004). ISBN 3540407677
6. Lan, J., Patton, R.J., Zhu, X.: Fault-tolerant wind turbine pitch control using adaptive sliding mode estimation. Renew. Energy 116(Part B), 219–231 (2018). https://doi.org/10.1016/j.renene.2016.12.005
7. Odgaard, P.F., Shafiei, S.E.: Evaluation of wind farm controller based fault detection and isolation. In: Proceedings of the IFAC SAFEPROCESS Symposium 2015, Paris, France, September 2015, vol. 48, pp. 1084–1089. IFAC, Elsevier (2015). https://doi.org/10.1016/j.ifacol.2015.09.671
8. Odgaard, P.F., Stoustrup, J., Kinnaert, M.: Fault-tolerant control of wind turbines: a benchmark model. IEEE Trans. Control Syst. Technol. 21(4), 1168–1182 (2013). https://doi.org/10.1109/TCST.2013.2259235. ISSN 1063-6536
9. Odgaard, P.F., Stoustrup, J.: A benchmark evaluation of fault tolerant wind turbine control concepts. IEEE Trans. Control Syst. Technol. 23(3), 1221–1228 (2015)
10. Odgaard, P.F., Stoustrup, J., Kinnaert, M.: Fault tolerant control of wind turbines - a benchmark model. In: Proceedings of the 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Barcelona, Spain, 30 June–3 July 2009, vol. 1, pp. 155–160 (2009). https://doi.org/10.3182/20090630-4-ES-2003.0090
11. Parker, M.A., Ng, C., Ran, L.: Fault-tolerant control for a modular generator-converter scheme for direct-drive wind turbines. IEEE Trans. Ind. Electron. 58(1), 305–315 (2011)
12. Rovatti, R., Fantuzzi, C., Simani, S.: High-speed DSP-based implementation of piecewise-affine and piecewise-quadratic fuzzy systems. Signal Process. J. 80(6), 951–963 (2000). Special Issue on Fuzzy Logic applied to Signal Processing. https://doi.org/10.1016/S0165-1684(00)00013-X
13. Simani, S.: Application of a data-driven fuzzy control design to a wind turbine benchmark model. Advances in Fuzzy Systems 2012, pp. 1–12 (2012). Invited paper for the special issue: Fuzzy Logic Applications in Control Theory and Systems Biology (FLACE). ISSN 1687-7101, e-ISSN 1687-711X. https://doi.org/10.1155/2012/504368. http://www.hindawi.com/journals/afs/2012/504368/
14. Simani, S., Castaldi, P.: Data-driven design of fuzzy logic fault tolerant control for a wind turbine benchmark. In: Astorga-Zaragoza, C.M., Molina, A. (eds.) 8th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes SAFEPROCESS 2012, Mexico City, Mexico, 29–31 August 2012, vol. 8, pp. 108–113. Instituto de Ingeniería, Ciudad Universitaria, México D.F., IFAC. Invited session paper (2012). ISBN 978-3-902823-09-0. ISSN 1474-6670. https://doi.org/10.3182/20120829-3-MX-2028.00036
15. Simani, S., Fantuzzi, C., Rovatti, R., Beghelli, S.: Parameter identification for piecewise linear fuzzy models in noisy environment. Int. J. Approx. Reason. 1(22), 149–167 (1999)
16. Simani, S., Farsoni, S., Castaldi, P.: Active fault tolerant control of wind turbines using identified nonlinear filters. In: Proceedings of the 2nd International Conference on Control and Fault-Tolerant Systems - SysTol 2013, Nice, France, 9–11 October 2013, pp. 383–388. Centre de Recherche en Automatique de Nancy - CRAN, IEEE Control Systems Society (2013). Special session invited paper. ISBN 978-1-4799-2854-5. https://doi.org/10.1109/SysTol.2013.6693827
17. Simani, S., Farsoni, S., Castaldi, P.: Fault tolerant control of an offshore wind turbine model via identified fuzzy prototypes. In: Whidborne, J.F. (ed.) Proceedings of the 2014 UKACC International Conference on Control (CONTROL), Loughborough University, Loughborough, UK, 8–11 July 2014, pp. 494–499. UKACC (United Kingdom Automatic Control Council), IEEE (2014). ISBN 9781467306874. Special session invited paper. https://doi.org/10.1109/CONTROL.2014.6915188
18. Simani, S., Farsoni, S., Castaldi, P.: Residual generator fuzzy identification for wind farm fault diagnosis. In: Proceedings of the 19th World Congress of the International Federation of Automatic Control - IFAC 2014, Cape Town, South Africa, 24–29 August 2014, vol. 19, pp. 4310–4315. IFAC & South Africa Council for Automation and Control, IFAC (2014). Invited paper for the special session "FDI and FTC of Wind Turbines in Wind Farms" organised by P. F. Odgaard and S. Simani. https://doi.org/10.3182/20140824-6-ZA-1003.00052
19. Simani, S., Farsoni, S., Castaldi, P.: Data-driven techniques for the fault diagnosis of a wind turbine benchmark. Int. J. Appl. Math. Comput. Sci. - AMCS 28(2), 247–268 (2018). https://doi.org/10.2478/amcs-2018-0018
20. Xu, F., Puig, V., Ocampo-Martinez, C., Olaru, S., Niculescu, S.I.: Robust MPC for actuator-fault tolerance using set-based passive fault detection and active fault isolation. Int. J. Appl. Math. Comput. Sci. 27(1), 43–61 (2017). https://doi.org/10.1515/amcs-2017-0004

Mathematical Model to Support Decision-Making to Ensure the Efficiency and Stability of Economic Development of the Republic of Kazakhstan

Askar Boranbayev1(B), Seilkhan Boranbayev2, Malik Baimukhamedov3, and Askar Nurbekov2

1 Nazarbayev University, Nur-Sultan, Kazakhstan
[email protected]
2 L.N. Gumilyov Eurasian National University, Nur-Sultan, Kazakhstan
3 Kostanay Socio-Technical University named after Academician Z. Aldamzhar, Kostanay, Kazakhstan

Abstract. When planning the state program and the regional programs of economic development, it is necessary to determine the total amount of resources required to carry out these programs, and to calculate their execution under a given plan of resource deliveries. Such calculations can be carried out with models of forward planning of economic development and with mathematical models of the process of executing economic programs. This article presents a mathematical model and a method for calculating the process of executing economic programs. The mathematical model is needed for choosing between different variants of the programs. The suggested model makes it possible to build a schedule for executing a complex of programs within directive deadlines for a given plan of resource receipts, or to establish that the given complex of programs cannot be executed within the directive deadlines. The analyzed program variants are assumed to be given: a program is considered given if the set of operations to be executed and the resources necessary to perform each operation are specified. A network is used for the mathematical description of a complex of programs.

Keywords: Operations · Resources · Network · Schedule · Management

1 Introduction

The Republic of Kazakhstan is administratively divided into 14 regions. Each of the regions has its own planning and management bodies, which are an integral part of the state planning and management body. Regional economic development programs should be planned taking into account the programs of other regions and the general state program. No region on its own produces all types of resources necessary for the implementation of the state development program.


Therefore, ensuring the efficiency and stability of the economic development of the Republic of Kazakhstan depends on the optimal distribution of resources among the regions. The goal of state bodies is to develop and maintain the stability of the state. This goal is the source of the formation of the state development program, which, in turn, may consist of several programs. The following main ways of maintaining the stability of the state can be distinguished: meeting the needs of the regions, and implementing national programs, such as defense programs, environmental protection programs and others. Thus, the state and the regions have common goals. These goals are the source of the formation of parts of national programs that take into account the needs of the regions. Planning the development of a region therefore cannot be viewed as an isolated task; it should be considered in conjunction with planning the development of the entire state system. Regional development planning procedures are part of the state development planning procedures. Many works are devoted to solving problems of optimizing economic development, in particular [1–10]. We will assume that most of the resources are subordinate to government bodies or are under double subordination. In particular, the natural resources located on the territory of a region are at the disposal of the state. At the same time, the population living in the territory of the region is the source of the formation of labor resources. Therefore, the region has an important type of resource - labor resources. The labor resources of the region are used in all types of property and in all funds (state, regional, mixed, foreign and others). They are not directly at the disposal of state and regional authorities. The structure and quality of labor resources, as well as the stability of the region and of the state system as a whole, depend on the degree of satisfaction of the needs of the regions. It can be considered that the regional planning and management bodies should determine the needs of the population of the region and the degree of their satisfaction. Thus, the quantity and quality of labor resources depend on the degree to which the needs of the region's population are met. The procedures for the development of state and regional development programs should include methods for assessing the total amount of resources that can be allocated for the implementation of the programs and for calculating program implementation for a given resource supply plan. Calculations of this kind can be carried out on models of long-term planning of the economy and models of the process of implementation of economic programs. This paper provides a mathematical model and a method for calculating the process of implementing economic programs. The model is needed to select between different program options. It makes it possible to build a schedule for the execution of programs within target dates for a given resource inflow, or to establish that a given program cannot be completed within the target dates.

2 Problem Statement and Its Mathematical Model

Let us consider a directed network K; vertex i of the network is denoted K_i. The vertices of the network represent operations that must be executed, while the arcs of the network represent the technological connections between these operations. The arcs starting at vertex K_i point to the vertices whose states are influenced by the state of K_i.


The set of all operations of the network is denoted I, I = {1, 2, . . . , n}. Each operation is assigned a conditional number, an integer i, i ∈ I. We denote by I_i^+ the set of operations directly preceding operation i, that is, operation i cannot start before all operations from the set I_i^+, I_i^+ ⊂ I, have ended. Thus, I_i^+ contains the operations that directly influence the state of operation i. We denote by I_i^- the set of operations directly following operation i, that is, if operation i is not finished, the operations from the set I_i^- cannot start. Thus, operation i directly influences the state of the operations from the set I_i^-. We assume that i ∉ I_i^+ and i ∉ I_i^-, that is, the network does not contain loops. Each operation i ∈ I demands resources for its execution. The set of resource kinds is denoted J, J = {1, 2, 3, ..., m}, and I_j is the set of operations using a resource of kind j. We denote by x_i(t) the state of operation i, and we also call x_i(t) the progress level of operation i. When x_i(t) reaches the value of one, the operation is considered executed. At x_i(t) = 0 the execution of operation i has not yet begun, and at 0 < x_i(t) < 1 the operation is partly executed. The magnitude x_i(t) is defined for each operation i in its own way, depending on the concrete content of the operation; for example, it can be the execution time of the operation or the volume of the operation in some units. Thus, the state of the whole complex of operations of the network can be described by the vector x = (x_1, x_2, . . . , x_n). The value of x varies over time. The moment of time at which the vector x coincides with the unit vector for the first time is the moment of completion of the whole complex of operations. The intensity of execution of an operation at a given moment of time is the rate of increase of its progress level at that moment. An operation uses resources during its execution and returns them after its completion. Resources that are not used at some moment of time do not increase the amount of resources available at subsequent moments; they do not accumulate. Let u_i denote the control applied to operation i, i ∈ I. In the problem of computing the execution of the network's complex of operations, the controls are the intensities of the operations assigned for execution at the present time. We denote by u the intensity vector of the complex of operations at the present time, u = (u_1, u_2, . . . , u_n), where u_i is the execution intensity of operation i at the present time. The state of operation i depends on the state of the operations v, v ∈ I_i^+ ⊂ I, on the intensity u_i, and on the values of the parameters. The process of executing the complex of operations of the network is controlled: there is a certain freedom in selecting the operations to be carried out at the present time and the intensities of their execution. The problem consists in naming, for each moment of time and under the existing resource restrictions, the intensity vector u with which the operations of the network should be carried out. The best u are considered to be those that minimize some functional. All restrictions that the intensities u must satisfy fall into two types. Restrictions of the first type describe the interdependence of the execution of operations (the precedence relations).
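To fix ideas, a possible in-memory representation of the complex of operations, its precedence sets, and its resource requirements is sketched below. All operation indices, resource kinds and resource figures are hypothetical and serve only to illustrate the notation I, I_i^+, J and d_ji introduced above.

```python
from dataclasses import dataclass, field

@dataclass
class Operation:
    """One vertex of the network K: an operation with precedence and resource data."""
    index: int
    predecessors: set[int] = field(default_factory=set)              # the set I_i^+
    resource_rates: dict[int, float] = field(default_factory=dict)   # d_ji at unit intensity
    state: float = 0.0                                                # x_i(t), done when it reaches 1

# A hypothetical 4-operation network using two resource kinds J = {0, 1}.
network = {
    1: Operation(1, predecessors=set(),  resource_rates={0: 2.0}),
    2: Operation(2, predecessors={1},    resource_rates={0: 1.0, 1: 3.0}),
    3: Operation(3, predecessors={1},    resource_rates={1: 1.5}),
    4: Operation(4, predecessors={2, 3}, resource_rates={0: 0.5}),
}

def can_start(op: Operation, network: dict[int, Operation]) -> bool:
    """Operation i may be executed only after every operation in I_i^+ is finished."""
    return all(network[p].state >= 1.0 for p in op.predecessors)

print([i for i, op in network.items() if can_start(op, network)])    # -> [1]
```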
In general, restrictions of the first type are described by a function that assigns to each state x of the complex of operations of the network a set U(x) of admissible intensity values u with which the operations of the complex can be carried out in the given state. Thus, in general, restrictions of the first type take the form u ∈ U(x).


Restrictions of the second type include restrictions on resources. In general, the vector of resources spent on the execution of the complex of operations at time t depends on the state x(t) of the complex and on the intensity u(t) at time t. We denote by V(t) the vector of total resources available in the system at time t, and by D(x(t), u(t)) the total amount of resources spent on the execution of the operations of the complex at time t. Then, in general, restrictions of the second type take the form D(x(t), u(t)) ≤ V(t). If the vector V(t) is given at every moment of time, the calculation of the execution of the complex of operations reduces to the following: for each moment t, the intensity vector u(t) with which the operations of the complex should be carried out must be named. The best u(t) is considered to be the one that minimizes some functional. As the functional, one can take, for example, the degree of completion of the complex of operations by some fixed directive moment of time T. We take the functional in the following form:

F = \frac{1}{1+\alpha} \sum_{i=1}^{n} \lambda_i \left( 1 - x_i(T) \right)^{1+\alpha}

where α ≥ 0 and λ_i ≥ 0. It is possible to replace the restrictions of the first type by a penalty for their violation. As is known, computational methods for finding the optimal solution can be treated as the use of penalties. In some cases the penalty is imposed without taking the features of the problem into account; in other cases there is a connection between the deviation from the optimal solution and the size of the penalty. The penalty for violating the restrictions of the first type is taken into account in the functional. Then, instead of the functional F, the following penalized functional is optimized:

\Phi = F + \int_{0}^{T} \varphi(x, u, t, \varepsilon)\, dt \qquad (1)

where the parameter ε > 0. The second term on the right-hand side of (1) increases the value of the functional if the restrictions of the first type are violated. If these restrictions are satisfied, the second term on the right-hand side of (1) is equal to zero and the values of the functionals Φ and F coincide; hence, their minimal values are realized by the same control function. For definiteness, the function φ(x, u, t, ε) is taken in the following form:

\varphi(x, u, t, \varepsilon) = \frac{1}{\varepsilon} \sum_{i \in I} u_i(t) \sum_{k \in I_i^{+}} \theta\left(1 - x_k(t)\right)


where

\theta(\tau) = \begin{cases} 0, & \tau < 0 \\ 1, & \tau > 0 \\ 0, & \tau = 0 \end{cases}

Then the problem takes the form:

\Phi = \frac{1}{1+\alpha} \sum_{i=1}^{n} \lambda_i \left(1 - x_i(T)\right)^{1+\alpha} + \frac{1}{\varepsilon} \int_{0}^{T} \sum_{i \in I} u_i(t) \sum_{k \in I_i^{+}} \theta\left(1 - x_k(t)\right) dt \;\rightarrow\; \min

\frac{dx_i(t)}{dt} = u_i(t), \quad i \in I

x_i(0) = 0, \quad i \in I

x_i(T) = 1, \quad i \in I

\sum_{i \in I_j} d_{ji}(t)\, u_i(t) \le V_j(t), \quad j \in J, \quad V_j(t) > 0

0 \le u_i(t) \le H_i(t), \quad i \in I

\forall i \; \exists j : d_{ji}(t) > 0, \quad j \in J

where d_{ji}(t) is the intensity of consumption of resource j when operation i is performed at unit intensity, and V_j(t) is the intensity of receipt of resources of the j-th kind, a step function. The whole complex of operations must be finished by the directive time T. If the restrictions of the first type are not satisfied,

u_i(t) \sum_{k \in I_i^{+}} \theta\left(1 - x_k(t)\right) > 0, \quad i \in I

If the restrictions of the first type are satisfied,

u_i(t) \sum_{k \in I_i^{+}} \theta\left(1 - x_k(t)\right) = 0, \quad i \in I

and the functionals Φ and F coincide.
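The sketch below evaluates the penalized criterion Φ of the problem above for a given discretized intensity schedule on a toy network. The time grid, the network data and the parameters α, ε, λ_i are placeholder assumptions, and no optimization is performed; the code only illustrates how the terminal term and the precedence penalty are computed.

```python
import numpy as np

def penalized_criterion(u, predecessors, dt, alpha=1.0, eps=0.1, lam=None):
    """
    u: array (n, K) of piecewise-constant intensities u_i(t) on K steps of size dt.
    predecessors: list of sets I_i^+ (0-based indices).
    Returns Phi = (1/(1+alpha)) * sum_i lam_i (1 - x_i(T))^(1+alpha)
                + (1/eps) * integral of sum_i u_i(t) * sum_{k in I_i^+} theta(1 - x_k(t)) dt
    """
    n, K = u.shape
    lam = np.ones(n) if lam is None else lam
    x = np.zeros(n)
    penalty = 0.0
    for step in range(K):
        for i in range(n):
            # theta(1 - x_k) = 1 exactly when predecessor k is still unfinished (x_k < 1)
            unfinished = sum(1.0 for k in predecessors[i] if x[k] < 1.0)
            penalty += u[i, step] * unfinished * dt
        x = np.minimum(x + u[:, step] * dt, 1.0)          # dx_i/dt = u_i
    terminal = np.sum(lam * (1.0 - x) ** (1.0 + alpha)) / (1.0 + alpha)
    return terminal + penalty / eps

# Toy example: two operations, the second must wait for the first (I_2^+ = {1}).
dt, K = 0.1, 20
u_bad = np.full((2, K), 0.5)                              # starts operation 2 too early
u_good = np.zeros((2, K)); u_good[0, :10] = 1.0; u_good[1, 10:] = 1.0
preds = [set(), {0}]
# the precedence-violating schedule is heavily penalized, the sequential one is not
print(penalized_criterion(u_bad, preds, dt), penalized_criterion(u_good, preds, dt))
```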


3 Method and Algorithm for Solving the Problem

The problem under consideration is solved by the method proposed in [1]. This method decomposes the initial problem into a sequence of linear programming problems of moderate dimension. Thus, the problem of planning and distributing the resources of a network with directive deadlines for the execution of its complex of operations is reduced to a sequence of problems of constructing a solution for sets of technologically independent operations on the basis of local optimization methods. Note that, when networks containing a large number of operations are calculated, certain difficulties connected with the high dimension of the problem can arise. In this case it is expedient to replace the high-dimensional problem with several problems of small dimension. We consider methods that allow this to be done.

1) The interval [0, T] is divided into several intervals [0, T_1], [T_1, T_2], …, [T_{k−1}, T_k], …. On the interval [0, T_1] resources are distributed not for all operations I of the network, but only for the operations of some subnetwork I_1, I_1 ⊂ I. After the solution that is optimal on [0, T_1] has been determined by the method stated in [1], we pass to the following time interval [T_1, T_2]. On the interval [T_1, T_2] the intensities u_i(t) are calculated for the operations of the set I_2, I_2 ⊂ I. The set I_2 consists of the operations included in I_1 whose execution has not finished by the moment T_1, and of some operations from I not belonging to I_1. After the solution on the k-th interval is obtained, we pass to the (k + 1)-th interval. Thus:

x_i^{k+1}(T_k) = x_i^{k}(T_k), \quad i \in I_k

x_i^{(k+1)}(T_k) = 0, \quad i \in I \setminus \bigcup_{s=1}^{k} I_s

that is, the initial state of an operation on interval k + 1 coincides with the value of its state at the corresponding moment of time on interval k. After the solutions on each such interval have been found, the basic iteration of the algorithm ends. If the complex of operations of the network is completed within the directive deadlines, the required solution has been found. If the complex of operations of the network is not completed within the directive deadlines, the next iteration is carried out for a new splitting of the network into subnetworks, taking into account the solution constructed at the previous iteration. Note that calculation by this method yields solutions with the earliest starting moments of the operations and the maximal surplus of resources at the moment of time T.

2) If the iterative process described above is carried out starting from the last interval, it is possible to obtain a schedule similar to that of dynamic programming [2]. For this purpose we introduce the variable τ = T − t.


The interval [0, T] is divided into subintervals [0, τ_1], [τ_1, τ_2], . . . , [τ_k, τ_{k+1}], . . .. The method presented in [1] is applied to each of the subintervals and the corresponding solution is determined. On the first interval [0, τ_1] the solution is found for the subnetwork S_1, S_1 ⊂ R, where R is the network inverse to the initial network K. Then we find the solution on the following interval [τ_1, τ_2] for the subnetwork S_2, S_2 ⊂ R. The subnetwork S_2 consists of the operations of the subnetwork S_1 that have not reached the zero level and of some operations from R \ S_1. After the solution on interval k is obtained, we pass to interval (k + 1). Thus,

x_i^{k+1}(\tau_k) = x_i^{k}(\tau_k), \quad i \in S_k

x_i^{k+1}(\tau_k) = 1, \quad i \in R \setminus \bigcup_{r=1}^{k} S_r

Note that this method yields solutions with the latest finishing moments of the operations and the minimal surplus of resources at the moment of time T. A software system has been developed that alternates calculations between these two methods and allows the best solution to be obtained. The data for the calculations are taken from the respective databases. For uninterrupted and trouble-free operation of the control system, it is necessary to ensure the reliability and safety of the automated information systems [11–41]. When developing the software system for modeling and supporting decision-making to ensure the efficiency and stability of economic development, technologies and methods were used to increase the level of their reliability and safety. The following systems were used in the development of this program: Delphi, ADO technology, and the MS SQL Server database management system.
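A highly simplified sketch of the forward interval decomposition (method 1) is given below. Instead of the linear-programming subproblems of [1], each interval is filled with a greedy resource allocation, so the code only illustrates the bookkeeping of carrying the operation states x_i across the intervals [T_{k-1}, T_k]; the network, the supply figures and the intensity bound are hypothetical.

```python
import numpy as np

def forward_interval_schedule(network, resource_supply, breakpoints, dt=0.1, h_max=1.0):
    """
    network: dict i -> (predecessors, resource_rates) with resource_rates as {j: d_ji}.
    resource_supply: dict j -> V_j (constant supply per resource kind, an assumption).
    breakpoints: [T_1, T_2, ...]; operation states are carried over between intervals.
    Greedy stand-in for the per-interval subproblems solved in [1].
    """
    x = {i: 0.0 for i in network}
    t = 0.0
    for T_k in breakpoints:
        while t < T_k:
            available = dict(resource_supply)
            for i, (preds, rates) in network.items():
                if x[i] >= 1.0 or any(x[p] < 1.0 for p in preds):
                    continue                                  # finished, or precedence not satisfied
                u_i = min([h_max] + [available[j] / d for j, d in rates.items() if d > 0])
                for j, d in rates.items():
                    available[j] -= d * u_i                   # consume resources for this step
                x[i] = min(x[i] + u_i * dt, 1.0)
            t += dt
    return x

# Hypothetical 3-operation chain with one resource kind of supply V = 2.
net = {1: (set(), {0: 1.0}), 2: ({1}, {0: 2.0}), 3: ({1, 2}, {0: 1.0})}
print(forward_interval_schedule(net, {0: 2.0}, breakpoints=[1.5, 3.0, 4.5]))
```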

4 Conclusion

The development goals of the regions are determined by their needs, the degree of their satisfaction, and the system of legal relations. The mismatch between needs and the degree of their satisfaction is the source of the formation of regional programs of economic development. We consider a state with centralized control, in which the resources of the state exceed the resources of the regions. National interests in decision-making procedures are the starting point for the planning of economic development. The article developed a mathematical model and a calculation method for selecting between various options for economic development programs. The developed method makes it possible to build a schedule for the execution of programs within the target time frame for a given plan of resource inflow, or to establish that the given program cannot be executed within the target time frame. To form a program of economic development, it is necessary to specify the set of operations to be performed and the resources required to complete each of the operations.


References 1. Boranbayev, S.N.: The algorithm of decision task of planning and distribution of net resources by methods of sequence approaches. In: Materials of International Science Practical Conference “Ualihanov’s Reading”, Kokshetau, pp. 191–194 (2003) 2. Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton (1957) 3. Hartman, J.: Engineering Economy and the Decision-Making Process, Upper Saddle River. Pearson Prentice Hall, Hoboken (2007) 4. Yatsenko, Y., Hritonenko, N.: Economic life replacement under improving technology. Int. J. Prod. Econ. 133, 596–602 (2011) 5. Yatsenko, Y., Hritonenko, N.: Machine replacement under evolving deterministic and stochastic costs. Int. J. Prod. Econ. 193, 491–501 (2017) 6. Yatsenko, Y., Hritonenko, N.: Asset replacement under improving operating and capital costs: a practical approach. Int. J. Prod. Res. 54, 2922–2933 (2016) 7. Hritonenko, N., Yatsenko, Y., Boranbayev, S.: Non-equal-life asset replacement under evolving technology: a multi-cycle approach. Eng. Econ. 25 (2020) 8. Hritonenko, N., Yatsenko, Y., Boranbayev, S.: Environmentally sustainable industrial modernization and resource consumption: is the Hotelling’s rule too steep? Appl. Math. Model. 39(15), 4365–4377 (2015) 9. Boranbayev, S.N., Nurbekov, A.B.: Construction of an optimal mathematical model of functioning of the manufacturing industry of the Republic of Kazakhstan. J. Theor. Appl. Inf. Technol. 80(1), 61–74 (2015) 10. Hritonenko, N., Yatsenko, Y., Boranbayev, A.: Generalized functions in the qualitative study of heterogeneous populations. Math. Popul. Stud. 26(3), 146–162 (2019) 11. Boranbayev, A., Boranbayev, S., Nurusheva, A., Yersakhanov, K.: Development of a software system to ensure the reliability and fault tolerance in information systems. J. Eng. Appl. Sci. 13(23), 10080–10085 (2018) 12. Boranbayev, S., Goranin, N., Nurusheva, A.: The methods and technologies of reliability and security of information systems and information and communication infrastructures. J. Theor. Appl. Inf. Technol. 96(18), 6172–6188 (2018) 13. Boranbayev, A., Boranbayev, S., Nurusheva, A.: Development of a software system to ensure the reliability and fault tolerance in information systems based on expert estimates. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications, vol. 869, pp. 924–935. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01057-7_68 14. Boranbayev, A., Boranbayev, S., Yersakhanov, K., Nurusheva, A., Taberkhan, R.: Methods of ensuring the reliability and fault tolerance of information systems. In: Latifi, S. (ed.) Information Technology – New Generations. AISC, vol. 738, pp. 729–730. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77028-4_93 15. Boranbayev, S., Altayev, S., Boranbayev, A.: Applying the method of diverse redundancy in cloud based systems for increasing reliability. In: The 12th International Conference on Information Technology: New Generations (ITNG 2015), Las Vegas, Nevada, USA, 13–15 April 2015, pp. 796–799 (2015) 16. Turskis, Z., Goranin, N., Nurusheva, A., Boranbayev, S.: A fuzzy WASPAS-based approach to determine critical information infrastructures of EU sustainable development. Sustainability 11(2), 424 (2019). https://doi.org/10.3390/su11020424 17. Turskis, Z., Goranin, N., Nurusheva, A., Boranbayev, S.: Information security risk assessment in critical infrastructure: a hybrid MCDM approach. Informatica 30(1), 187–211 (2019). https://doi.org/10.15388/Informatica.2018.203 18. 
Boranbayev, A.S., Boranbayev, S.N., Nurusheva, A.M., Yersakhanov, K.B., Seitkulov, Y.N.: Development of web application for detection and mitigation of risks of information and automated systems. Eurasian J. Math. Comput. Appl. 7(1), 4–22 (2019)

Mathematical Model to Support Decision-Making

47

19. Boranbayev, A., Boranbayev, S., Nurusheva, A., Seitkulov, Y., Sissenov, N.: A method to determine the level of the information system fault-tolerance. Eurasian J. Math. Comput. Appl. 7(3), 13–32 (2019). https://doi.org/10.32523/2306-6172-2019-7-3-13-32 20. Boranbayev, A., Boranbayev, S., Nurbekov, A., Taberkhan, R.: The development of a software system for solving the problem of data classification and data processing. In: Latifi, S. (ed.) 16th International Conference on Information Technology-New Generations (ITNG 2019). AISC, vol. 800, pp. 621–623. Springer, Cham (2019). https://doi.org/10.1007/978-3-03014070-0_90 21. Askar, B., Seilkhan, B., Assel, N., Kuanysh, Y., Yerzhan, S.: A software system for risk management of information systems. In: Proceedings of the 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT 2018), Almaty, Kazakhstan, 17–19 October 2018, pp. 284–289 (2018) 22. Seilkhan, B., Askar, B., Sanzhar, A., Askar, N.: Mathematical model for optimal designing of reliable information systems. In: Proceedings of the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT 2014), Astana, Kazakhstan, 15–17 October 2014, pp. 123–127 (2014) 23. Seilkhan, B., Sanzhar, A., Askar, B., Yerzhan, S.: Application of diversity method for reliability of cloud computing. In: Proceedings of the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT 2014), Astana, Kazakhstan, 15–17 October 2014, pp.244–248 (2014) 24. Boranbayev, A., Boranbayev, S., Nurusheva, A.: Analyzing methods of recognition, classification and development of a software system. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications, vol. 869, pp. 690–702. Springer, Cham (2019). https:// doi.org/10.1007/978-3-030-01057-7_53 25. Boranbayev, A.S., Boranbayev, S.N.: Development and optimization of information systems for health insurance billing. In: 7th International Conference on Information Technology: New Generations, ITNG 2010, pp. 1282–1284 (2010) 26. Akhmetova, Z., Zhuzbayev, S., Boranbayev, S., Sarsenov, B.: Development of the system with component for the numerical calculation and visualization of non-stationary waves propagation in solids. Front. Artif. Intell. Appl. 293, 353–359 (2016) 27. Boranbayev, S.N., Nurbekov, A.B.: Development of the methods and technologies for the information system designing and implementation. J. Theor. Appl. Inf. Technol. 82(2), 212– 220 (2015) 28. Boranbayev, A., Shuitenov, G., Boranbayev, S.: The method of data analysis from social networks using apache hadoop. In: Latifi, S. (ed.) Information Technology – New Generations. AISC, vol. 558, pp. 281–288. Springer, Cham (2018). https://doi.org/10.1007/978-3-31954978-1_39 29. Boranbayev, S., Nurkas, A., Tulebayev, Y., Tashtai, B.: Method of processing big data. In: Latifi, S. (ed.) Information Technology – New Generations. AISC, vol. 738, pp. 757–758. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77028-4_99 30. Boranbayev, A., Boranbayev, S., Nurbekov, A.: Estimation of the degree of reliability and safety of software systems. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) FICC 2020. AISC, vol. 1129, pp. 743–755. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39445-5_54 31. Boranbayev, A., Boranbayev, S., Nurbekov, A.: Development of the technique for the identification, assessment and neutralization of risks in information systems. 
In: Arai, K., Kapoor, S., Bhatia, R. (eds.) FICC 2020. AISC, vol. 1129, pp. 733–742. Springer, Cham (2020). https:// doi.org/10.1007/978-3-030-39445-5_53 32. Boranbayev, A., Boranbayev, S., Nurusheva, A., Seitkulov, Y., Nurbekov, A.: Multi criteria method for determining the failure resistance of information system components. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) FTC 2019. AISC, vol. 1070, pp. 324–337. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-32523-7_22



Automated Data Processing of Bank Statements for Cash Balance Forecasting Vlad-Marius Griguta1(B) , Luciano Gerber2 , Helen Slater-Petty1 , Keeley Crockett2 , and John Fry3 1 AccessPay, Manchester M1 4BT, UK [email protected] 2 Manchester Metropolitan University, Manchester M15 6BH, UK 3 Bradford University, Bradford BD7 1DP, UK

Abstract. The forecasting of cash inflows and outflows across multiple business operations plays an important role in the financial health of medium and large enterprises. Historically, this function was assigned to specialized treasury departments who projected future cash flows within different business units by processing available information on the expected performance of each business unit (e.g. sales, expenditures). We present an alternative forecasting approach which uses historical cash balance data collected from standard bank statements to systematically predict the future cash positions across different bank accounts. Our main contribution is on addressing challenges in data extraction, curation, and pre-processing, from sources such as digital bank statements. In addition, we report on the initial experiments in using both conventional and machine learning approaches to forecast cash balances. We report forecasting results on both univariate and multivariate, equally-spaced cash balances pertaining to a small, representative subset of bank accounts. Keywords: Time series forecasting · Cash flow forecasting · Data wrangling

1 Introduction Cash flow forecasting is a critical task for corporations of all sizes and across the whole spectrum of business activities. The more diverse these activities are, the more demanding it is for the company stakeholders to make informed financial decisions. Teams of experienced treasurers are involved in estimating the future available cash of corporations and making investment decisions based upon these estimations. Every hour spent investigating financial information results in a non-negligible opportunity cost to the business stakeholders, especially in volatile times when the distribution of resources needs to be closely monitored. The benefits of using analytical methods to issue cash flow forecasts based on historical data are, therefore, twofold. Firstly, there is the potential to improve the forecasting accuracy, which optimizes the resource allocation across the business and, secondly, the potential to improve the performance of the treasury departments, shifting the focus from low-yield data collection tasks to lucrative investment decision making.


This paper addresses the problem of forecasting daily cash balances of corporate bank accounts by modeling the data as equally-spaced time series. In line with the typical cash operations of medium and large enterprises, the forecasting time horizon of our predictive models has been set to one full month (22 business days). This requirement for predictions over larger time spans seems to be a distinguishing feature and additional challenge addressed in this work, when compared to other common financial time series modelling tasks (e.g., stock prices movement). The data in this research is extracted from bank statements, which are collected through the SWIFT network (see Sect. 4.1). The information contained in bank statements is aimed at providing live cash visibility and transparency across bank accounts. These are a standardized form of communicating financial information, which means that a forecasting system that consumes bank statements can be used by a large number of companies. By focusing on reporting the overall liquidity within the account, bank statements often neglect reporting information at the transactional level, which impedes the flexibility of analytical forecasting methods. In our approach, we have identified and addressed some significant challenges (Sect. 4) in data collection and pre-processing, as well as with forecasting itself. Some of these challenges, such as inconsistency of transactional data reconciliation, irregularity of the statement issue time and missing data, are faced while reconstructing the cash balance data from the bank statements of different accounts. Other challenges revealed in the modeling process are related to the reduced historical data, operational outliers and pattern changes in cash balances. The main contribution of this paper is the assessment of these practical challenges and the proposal of solutions to alleviate them, with a view of scoping the prediction of cash balances based on bank statement data as a pure time series forecasting problem. By addressing the challenges, this paper exemplifies the use of both conventional (SARIMA, TES) and machine learning (ANN) models to issue cash balance forecasts in an automated and scalable manner. The scalability of the approach presented in this paper is compared to the historical (and still conventional) method used in cash flow forecasting. This method requires manual data aggregations and domain experts to estimate the expected performance of different business units within an organisation. This paper is organized as follows: Relevant terminology is first introduced in Sect. 2, followed by a description of the data in Sect. 3. Section 4 describes data challenges associated with account balance forecasting. Section 5 presents related work on both conventional and machine learning methods applied to the problem of time series forecasting. Section 6 describes the experimental methodology used to compare the performance of several conventional and machine learning methods across cash balance accounts selected to illustrate the identified challenges. Section 7 presents the conclusions and future directions.

2 Terminology and Notation A time series dataset is a series of values of one variable (univariate) or multiple variables (multivariate) that are organized in an ordered structure provided by the time component of the series. The time component of a time series not only enriches the series with information, but also sets constraints on the dependencies between the values of the

variables in the series. In that respect, all entries of a time series are interdependent, which constrains the sampling methods that can be applied to the data. A forecast of a time series is defined as an ordered prediction of the future values of the series. We define a time series yk, where yk takes values from the ordered group y1, y2, . . . , yn. A forecast Fn+1, Fn+2, . . . , Fn+m of the series is defined as a prediction of the values of the series yk over the period n + 1, . . . , n + m, where m is the forecasting horizon. The accuracy of a forecast is inferred from the deviation of the forecast from the actual values of the series over the forecasting horizon. There are multiple functions that can be used to compute the deviation. In this paper, we used the normalized root mean squared error (NRMSE) and the symmetric mean absolute percentage error (SMAPE) defined below:

NRMSE = \sqrt{\frac{\sum_{i=n+1}^{n+m} (F_i - y_i)^2}{m \cdot \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}}    (1)

SMAPE = \frac{100\%}{m} \sum_{i=n+1}^{n+m} \frac{|F_i - y_i|}{(|F_i| + |y_i|)/2}    (2)
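As a concrete illustration, a minimal Python sketch of the two metrics follows; the normalisation of the squared errors by the training-series variance reflects our reading of Eq. (1) and should be treated as an assumption.

```python
import numpy as np

def nrmse(train, actual, forecast):
    """Normalized RMSE over the forecast horizon, in the spirit of Eq. (1):
    squared forecast errors divided by m times the variance of the training
    series (the normalisation term is our reading of the equation)."""
    train, actual, forecast = map(np.asarray, (train, actual, forecast))
    m = len(actual)
    denom = m * np.mean((train - train.mean()) ** 2)
    return np.sqrt(np.sum((forecast - actual) ** 2) / denom)

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, Eq. (2), in percent."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return 100.0 / len(actual) * np.sum(
        np.abs(forecast - actual) / ((np.abs(forecast) + np.abs(actual)) / 2))
```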

A transfer function f is used to map a subset of the past values of the series to the forecasted values. Depending on the forecasting methods used, the subset of past values can vary in size. A good way to choose the subset size is by analyzing the autocorrelation function of the series. The autocorrelation function is the correlation of the signal with a lagged copy of itself as a function of the lagging steps. A strong correlation for a certain number of lagged steps l indicates that the value of the k th entry in the series has a strong influence on the (k + l)th entry, and therefore can be used to predict it. Similarly, a rapid drop in the autocorrelation function at lag l’ indicates that the entries past the l’th do not influence the predictive power of the transfer function, suggesting that the function be fed a subset of l’ series entries. So far, only the univariate time series forecasting problem has been discussed. A multivariate time series problem can contain, alongside the target series yk, both endogenous and exogenous variables. An input variable is exogenous if it influences the target variable without being influenced by it; and endogenous if it can be influenced by the target variable. For example, the day of the month might influence the corporate cash balance due to the seasonality of cash operations, however, the cash balance does not influence what day it is. In contrast, the largest daily transaction within a bank account influences and is influenced by the daily cash balance of the account.
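The lag-selection heuristic described above can be sketched as follows; the 0.2 cut-off and the 30-lag search window are illustrative choices, not values taken from the paper.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a 1-D series for lags 1..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

def choose_window(series, max_lag=30, threshold=0.2):
    """Return l': the number of lags before the ACF first drops below the cut-off."""
    acf = autocorrelation(series, max_lag)
    below = np.where(acf < threshold)[0]
    return max(1, int(below[0])) if below.size else max_lag
```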

3 Data Description The format of the data collected for this work is standardized by the Society for Worldwide Interbank Financial Telecommunication (SWIFT), the provider of a global network used for financial transactions. Depending on the scope of the information transferred, SWIFT uses different messaging types (MT). The types referring to cash management are under the format MT9xx. The two messaging types used in this work are MT940 and MT942. MT940 is the format used for end-of-day bank account statements whereas


MT942 is the format used for intraday reporting. An MT940 statement includes the list of transactions that cleared during a business day and an MT942 statement includes a subset of the transactions cleared from the previous statement onwards. Although the standard of the messaging types for cash management services is ensured by SWIFT, different banking entities have different data submission conventions maintained within their various legacy systems. The bank statements issued for corporate users are similar in format to retail bank statements. They contain a header detailing the account name, identification code, datetime of issue and the opening and closing balances. The body of the statement contains the list of transactions that sum up to the difference between the closing and the opening balance. Each transaction has an entry date, value date, amount, and several optional references: funds code, transaction type, identification code, reference to account owner and information for account owner. A subset of relevant features synthesized from a statement is: “‘datetime’: ‘2020-01-30 21:30’, ‘open’: £400000, ‘close’: £387000, ‘transactions’: {‘value date’: ‘2020-01-31 00:00’, ‘value’: +£4000, ‘identification code’: ‘CHK’, ‘ref’: ‘12354538 Cheque Company A’, etc.}”. This section outlined the richness of the information contained in a standard bank statement. Section 4 will discuss the specific challenges associated with the extraction of consistent and ordered cash balances from various bank statements.
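The following sketch illustrates how parsed statements of the shape shown above could be collapsed into one closing balance per account and day; the field names and the already-parsed input structure are assumptions, since real MT940 parsing and bank-specific conventions are considerably more involved.

```python
import pandas as pd

# Hypothetical, already-parsed statements mirroring the example above;
# real MT940 field names and parsing differ between banks.
statements = [
    {"account": "GB-001", "datetime": "2020-01-30 21:30",
     "open": 400_000.0, "close": 387_000.0,
     "transactions": [{"value_date": "2020-01-31", "value": 4_000.0,
                       "identification_code": "CHK"}]},
]

def end_of_day_balances(stmts):
    """Collapse parsed statements into one closing balance per account per day."""
    rows = [{"account": s["account"],
             "date": pd.Timestamp(s["datetime"]).normalize(),
             "close": s["close"]} for s in stmts]
    df = pd.DataFrame(rows)
    # if several statements cover the same day, keep the last one in arrival order
    return df.groupby(["account", "date"], as_index=False)["close"].last()

balances = end_of_day_balances(statements)
```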

4 Account Balance Data Challenges The challenges presented by time series forecasting are well known. Fine-tuned models developed through conventional or machine learning methods suffer from time instability due to the lack of stationarity, degrees of uncertainty in historical time periods and the restrictions on train, test and validation splits due to time sequences, see e.g. [1]. Time series data corresponding to financial transactions, however, pose additional difficulties, which do not seem to have been addressed in the academic literature yet. 4.1 Reconstruction Challenges Inconsistency of Transactional Data Reconciliation. Some challenges of merging financial time series data into a consistent format were discussed in [2]. Due to the global nature of many corporations, international payments and markets have an effect on statement data. Statements received from the bank are given relative timestamps based on time zones, meaning that collating transactions from multiple regions can lead to discrepancies. One such discrepancy often occurs when aggregating intraday statements (MT942) and reconciling against end of day statements (MT940), due to transactions having value dates past the issue date of the end of day statements. Irregularity in Statement Time of Issue. In addition to reconciling transactional information within bank statements at the global market scale, a forecasting system needs to consider the temporal component of the cash balance time series obtained from bank statements. The temporal component of a statement of a bank account can be


inferred as the time of issue of the statement by the banking institution managing the account. Because different banking institutions have different legacy systems for collating transactional information into statements, the statement reporting offering varies across regions and jurisdictions but also across banks within the same country. Most affected by this inconsistency are the intraday statements, which need to be assigned an accurate date and time to allow for the reconstruction of the corresponding time series. With the statements arriving at different times during the day, only an intermittent time series can be reconstructed. Additionally, the average number of statements collected per day is between 2 and 5, limiting the potential of resampling methods as a solution to the series intermittence. Given the circumstances, we chose to de-scope the use of intraday statements from the current study. Missing Data. Notwithstanding the irregularity in time of issue, assuming that there are a small number of transactions at the time when the end of day statement is collated (usually midnight), the reconstructed time series of the end of day balances should be continuous and equally spaced. However, there are other factors that influence the data collection process within a live environment. For example, there can be temporary downtime of a service, yielding interruptions or duplications of data that need to be investigated manually. A specific example is the closing available balance, which might be reported only sparsely or not reported at all for some bank accounts. It is important to note that in other contexts data seemingly missing at random can carry a serious hidden bias (see e.g. [3]). Here, much of the data is missing purely due to different financial reporting conventions associated with the different accounts. Rather than being a specific problem with missing data per se, it is possible that there may be some hidden structure in the data associated with an intra-monthly effect identified by practitioners. This is something we wish to explore in a continuous-time model in future research [4].

4.2 Forecasting Challenges Reduced Historical Data - Many of the state-of-the-art linear forecasting algorithms for univariate time series are based upon the extraction of seasonal patterns. These patterns are either extracted directly, in the case of stationary seasonality, or are indirectly trained upon by feeding the algorithms additional exogenous features of the time component (e.g. day of month, month of year, etc.). However, when the time series has less than one full period of a season, little information can be inferred on the seasonal pattern of the series. In the context of this study, many of the corporate accounts considered were recorded for between 6 and 9 months, which posed a challenge in modelling any potential pattern manifested for longer than a quarter. Additionally, depending on the commercial agreement between the client and their banking partners, the cash balances of the client’s bank accounts are reported with different granularities. Some accounts are reported on every calendar day, others are only reported on working days, and others are reported on different days during the month. For example, there are accounts which are set to report only on the first Wednesday, Thursday and Friday of each month (Fig. 1). The resulting time series of cash balances is sparse, with the balance information


being populated at irregular time intervals that depend on the day of week rather than the calendar day. In both these situations, we chose to put the accounts with insufficient data points out of scope of this paper. Consequently, we excluded the bank accounts for which the reconstructed cash balance contained less than 15 data points per month (missing data) or less than 6 months’ worth of data.
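A hedged sketch of this exclusion rule, assuming a long-format table with 'account' and 'date' columns, might look as follows.

```python
import pandas as pd

def sufficiently_reported(balances: pd.DataFrame) -> pd.DataFrame:
    """Keep accounts with at least 15 balance points in every observed month
    and at least 6 months of history (the exclusion rule described above).
    Assumes a datetime64 'date' column and an 'account' column."""
    monthly = (balances
               .assign(month=balances["date"].dt.to_period("M"))
               .groupby(["account", "month"])
               .size())
    counts = monthly.groupby(level="account").agg(["min", "size"])
    keep = counts[(counts["min"] >= 15) & (counts["size"] >= 6)].index
    return balances[balances["account"].isin(keep)]
```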

Fig. 1. Sparsely reported account. The y axis is a scaled true cash flow of the account.

Operational Outliers - Barring the stochastic component of the cash inflows and outflows, which are guided by external factors such as market volatility or client performance, the data contained within the statements issued for a business bank account is a genuine representation of the cyclical business activities managed through the account. However, within the treasury management sector, it is often the case that seemingly random interventions by the domain specialists blur away the valuable insights that the data has to offer (e.g. Fig. 2). Operations such as inter-subsidiary lending, long term investments, mergers or acquisitions are traced in the cash balance as irregular spikes or dips which are challenging to pick up through univariate modelling of the time series. The immediate solution to the challenges posed by the operational outliers is to flag and eliminate the transactions corresponding to these outliers from the reconstructed cash balance and only model those transactions that are inherent to the business operations performed through the account. However, due to the subjective nature of treasury management decisions, flagging these manual interventions has proven to be challenging even for domain experts that are external to the company department making the treasury decisions. Applying tests based on statistical metrics such as the standard deviation did not yield an improvement in forecasting performance. In the future we plan to use more sophisticated metrics for identifying the operational outliers, including smoothing functions and designated anomaly detection models. Pattern Changes in Cash Balances - Large enterprise clients managing multiple production lines, brands and businesses present an additional layer of complexity in their cash balance sheets. Due to the volatility of the different markets their products address, decisions on cash allocation are being made at an increased pace, resulting in bank accounts being opened, closed or put out of use at various times during the year. One example of a cash balance with a pattern change is shown in Fig. 3. In this instance, the average cash balance suddenly drops during the period April to October


Fig. 2. Operational outliers revealed in an account sample. The y axis is a scaled true cash balance of the account.

2019 with no respective behavior for the same period of 2018. Similar anomalies in the cash balances were considered separately by applying transformations to the time series and optimizing the model hyper-parameters in a manual manner.

Fig. 3. Account sample demonstrating a pattern change in the period April - October 2019. The y-axis represents the scaled true cash balance of the account.

5 Related Work This section provides an overview of related work using both conventional and machine learning methods for financial forecasting. Traditional methods include Naïve Forecasting, which uses the actual cash flow data from a previous time period as the forecast for the upcoming period. Simple Moving Average Forecasting (SMAF) adds the recent closing prices, then divides the total by the number of time periods to calculate the average. Exponential Smoothing uses exponential functions to assign exponentially decreasing weights over time periods and is useful when the more recent past is likely to


have more impact on predicting a forecast than more historical past. For large historical cash balance datasets, autoregressive moving average (ARMA) and autoregressive integrated moving averages models (ARIMA) can be used to identify autocorrelations and are more suitable for long term forecasting. However, due to the challenges identified in Sect. 4, there is no one size fits all solution. 5.1 Conventional Approaches to Financial Forecasting It is important to recognize that finance presents idiosyncratic forecasting challenges that are significant in their own right. The inherently stochastic nature of the subject is further exacerbated by additional sources of short-term randomness that are typically modelled using an unobserved stochastic volatility component. Conventional forecasting approaches such as ARIMA are also typically at odds with notions of market efficiency [5], theoretical options-pricing models constructed via foundational arguments such as absence of arbitrage [6] and the stylized empirical facts of financial time series [7]. Financial time series modelling is also an inherently specialist area. Commonly used ARCH/GARCH models for (financial) time series are equivalent to ARMA models for an unobserved volatility component. However, a range of different model variations are possible [8]. This sheer range of available models underscores the specialist nature of the subject. Further extensions of these classical models have been used in applications that variously account for further autocorrelations in the observed series [9] and for additional regime-switching effects [10]. However, within this, the need to account for unobserved volatility fluctuations remains paramount. Other non-time-series approaches to forecasting, e.g. those based around the technical analysis methods popularized by practitioners (see e.g. [11]), are possible. The academic literature on technical analysis is voluminous, see e.g. [12] or [13] for a review. However, such approaches have yet to permeate mainstream finance. An exception is [14] who use localized regression approaches to gauge the plausibility of technical analysis strategies. Thus, this serves to motivate the study of machine-learning techniques within financial forecasting. As an illustration, [15] reviewed corporate cash flow forecasting using account receivable data collected through a specialized accounting software, which provides a richer view of the individual transactions. 5.2 Machine Learning Approaches to Financial Forecasting The use of deep learning for time series prediction, in specific domains is not new, but remains challenging due to the need for extremely large datasets of high quality data and the lack of transparency in how decisions were made. One of the most targeted areas is stock market forecasting predicting stock prices in different time slice windows. The author in [16] created a deep learning framework which combined wavelet transforms, stacked auto-encoders and long-short term memory (LSTM) networks to predict six stock indices, one-step-ahead of the closing price. The author in [17] utilized LSTM networks for predicting out-of-sample directional movements for a number of financial stocks and outperform other methods such as random forests. The author in [18] proposed day-ahead multi-step load forecasting using both recurrent neural networks (RNN) and convolutional neural networks (CNN). In their work the use of the CNN model improved

Automated Data Processing of Bank Statements for Cash Balance Forecasting

57

the forecasting accuracy by 22.6% compared to the application of seasonal ARIMAX. However, the dataset used was concerned with producing accurate building-level energy load forecasts, which raises the question of how similarities within the data space can be identified in financial forecasting. Whilst Deep Learning has been successfully applied in many domains, it is not always successful. Small datasets do not tend to perform well, with research indicating that to be successful, millions of data points are required. Data quality is always an issue when applying machine learning; the generalization error of artificial neural networks can be improved by the addition of noise in the training phase. Consequently, this provides a barrier to be overcome with respect to forecasting financial time series. This may help to explain some of the data challenges described in Sect. 4. Despite some recent progress, explaining and interpreting models remains challenging. Ensuring the financial interpretability of the deep learning models constructed is thus far from being a foregone conclusion. Ensemble machine learning models have also been used for financial forecasting in e.g. [19] and [20]. The author in [19] combined two traditional ensemble machine learning algorithms: random subspace and MultiBoosting to create a method known as RSMultiBoosting to try and improve the accuracy of forecasting the credit risk of small-to-medium companies. RSMultiBoosting outperformed traditional machine learning algorithms on small datasets and the ability to rank features according to the decision tree relative importance score improved accuracy. The author in [21] conducted a study investigating several models including deep and recurrent neural networks and the CART regression forest to examine non-linear relationships between input and output features on abnormal stock returns generated from earnings announcements based on financial statement data. The results indicated that non-linear methods could predict the direction of the “absolute magnitude of the market reaction to earnings announcements correctly in 53% to 59% of the cases on average,” with random forest approaches providing the best results. Whilst this is a reasonable result, it highlights the issues of data quality (as discussed in Sect. 2) and its impact on whether an account is forecastable.

6 Experimental Comparison This section provides a comparative analysis of the performance of different time series forecasting techniques on a subset of anonymized time series that replicate real bank account cash balances. The sampling and pre-processing procedures are explained in Sect. 6.1. We report a novel approach of enriching the univariate time series pertaining to cash balance data by aggregating the transactions pertaining to bank statements by various statistical metrics (e.g. standard deviation). Section 6.2 describes the forecasting methodologies, beginning with the univariate approaches (SARIMA and TES), then enriching SARIMA with multivariate exogenous constraints, and ultimately leveraging the multivariate input via a neural network architecture. 6.1 Dataset Description To share the learnings gained from forecasting cash balances in various bank accounts, a sample time series dataset representative of cash balance data was created. The dataset


consists of collections of time series corresponding to daily cash balances and some additional exogenous and endogenous variables. As described throughout the section on the data challenges, there are multiple granularities in which the account balance statements are recorded. The examples selected for this work are those for which the reconstructed time series contains at least one data point per business day. An additional level of complexity is given by the possibility that some accounts consist of a group of individual bank accounts. A sparse time series is created by collating the closing balance amounts with corresponding value dates. To provide a consistent view over the balances of multiple accounts of a client, the closing balances are converted to a currency of choice. The missing values in the sparse time series are then forward filled to the granularity set for the account (per business day or daily). The reason for filling the values with previous valid entries (forward filling) is that it is presumed that on the dates when the balance is not reported, there were no cash movements. In the case of groups of accounts, the date range is initially established as the minimum and maximum dates reported by any of the bank accounts in the group. The missing values in each individual bank account are then filled, initially forward and then backward as well, to cover the period between the earliest statements of the group and each account's earliest statement. Subsequently, the individual continuous time series are summed up to the grouped time series. The exogenous variables obtained from the time component of each time series are then computed. Endogenous variables obtained through applying various statistical aggregations to the transactional statement data are also used to predict the cash balance. However, due to potential clashes with the IP of AccessPay, the aggregation methods will not be discussed explicitly. The sample dataset used throughout the paper represents a selection of 6 bank account aggregates with end of day cash balances reported on each business day during the period 18 June 2019 – 27 January 2020. Figure 4 below shows each time series collected. A train-test split was applied, where the size of the test set is equal to the forecasting horizon of one month.
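A minimal pandas sketch of the reconstruction steps just described (business-day reindexing, forward-then-backward filling, and summation over a group of accounts) is shown below; the column names and the single-currency input are assumptions.

```python
import pandas as pd

def reconstruct_group(balances: pd.DataFrame) -> pd.Series:
    """Rebuild a continuous business-day balance series for a group of accounts.
    Expects columns 'account', 'date' and 'close', with 'close' already
    converted to a single reporting currency (assumed schema)."""
    days = pd.bdate_range(balances["date"].min(), balances["date"].max())
    per_account = (balances
                   .groupby(["date", "account"])["close"].last()  # last statement per day
                   .unstack("account")
                   .reindex(days)
                   .ffill()   # no statement -> no cash movement, carry the last balance
                   .bfill())  # cover days before an account's first statement
    return per_account.sum(axis=1)  # grouped daily cash balance series
```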

Fig. 4. Cash balance series of the selected accounts. The train set is shown in blue and the test set is shown in black. The y axis represents the scaled true cash balance of each account.


6.2 Experimental Results The results of the forecasting experiments are separated into three parts. The first part discusses the problem of univariate time series forecasting via conventional methods, ARIMA and TES. In the second part we propose an extension of the ARIMA to account for multivariate input. The third part discusses the results of a machine learning approach based upon ANNs. Conventional Univariate Time Series Forecasting. The conventional time series forecasting algorithms referred to throughout this paper are the Seasonal Autoregressive Integrated Moving Average (SARIMA) and the Triple Exponential Smoothing (TES). The SARIMA model has seven hyper-parameters, one for the seasonal differentiation term, three for modelling the seasonal component of the series and three for modelling the remainder of the series. In the experiments undertaken for this paper, these parameters were determined through experimentation against the AIC score, leading to the SARIMA (1,0,1)(1,1,1)22. The value of 22 for the seasonal differentiation was chosen as the typical number of business days in a month. The Triple Exponential Smoothing features four hyper-parameters. These are the trend type, the damping of the trend, the seasonal type and the seasonal differentiation term. Based upon heuristics, it was found that the best set of values for these parameters are: trend: additive, damped: False, seasonal: additive and seasonal differentiation: 22. The first three rows in Table 1 detail the performance of the two methods over the account samples, as compared to the Naive Average benchmark. While there are some accounts for which TES outperforms both the benchmark and the SARIMA model, overall, only SARIMA beats the benchmark in a consistent manner. Table 1. Aggregate accuracy metrics on the account samples

Account         A1      A2      A3      A4      A5      A6      Mean
Root mean squared error
Naïve Mean      1.27    4.47    1.12    0.47    0.66    0.85    1.47
SARIMA          0.67    0.73    0.93    4.25    0.70    0.42    1.28
TES             5.51    0.3     8.66    2.22    0.17    1.05    2.99
MULTI SARIMA    0.67    0.56    0.62    4.67    1.55    0.53    1.43
ANN             1.42    3.12    0.44    1.38    1.33    0.43    1.35
Symmetric mean absolute percentage error
Naïve Mean      53.17   60.38   26.86   33.98   22.84   50.87   41.35
SARIMA          64.44   26.68   24.37   61.79   29.59   51.91   43.13
TES             129.2   24.37   132.0   73.78   10.82   81.01   75.24
MULTI SARIMA    63.64   27.73   22.68   60.51   47.26   55.19   46.17
ANN             59.02   52.71   16.65   74.60   31.22   40.93   45.87
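For illustration, a statsmodels sketch using the orders reported above is given below; the AIC-driven hyper-parameter search and the exact production configuration are omitted, so this is a sketch rather than the authors' implementation.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def fit_and_forecast(train: pd.Series, horizon: int = 22):
    """Fit the SARIMA(1,0,1)(1,1,1)22 and additive TES models described above
    and return their forecasts over the one-month (22 business day) horizon."""
    sarima = SARIMAX(train, order=(1, 0, 1),
                     seasonal_order=(1, 1, 1, 22)).fit(disp=False)
    tes = ExponentialSmoothing(train, trend="add", damped_trend=False,
                               seasonal="add", seasonal_periods=22).fit()
    return {"SARIMA": sarima.forecast(steps=horizon),
            "TES": tes.forecast(horizon)}
```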


To understand the gains and failures of the conventional univariate time series forecasting methods, we looked at the extreme cases for which the methods either outperformed or underperformed the benchmark by a considerable margin. From Table 1, these are account 2 (Fig. 5) and account 3 (Fig. 6). On the second account sample, both ARIMA and TES yielded accuracies exceeding the benchmark as measured by both the NRMSE and the SMAPE metrics. Noticeably, the NRMSE score of TES was the global minimum across all methods and accounts tested. A different outcome was observed for Account 3. In this example, the vague monthly seasonality is only captured by the ARIMA method while TES seems to be tricked by the outlier around December into predicting a decreasing trend across January. To conclude, the univariate models are unstable and fail to generalize across the multitude of cash balance series. While some level of progress is achieved for a subset of accounts through these methods, they would ultimately be overwhelmed by the unresolved challenges discussed in Sect. 4, in particular the Operational Outliers. As these outliers appear in the cash balances as a result of the institutional decisions that are made based upon transactional data, it is hoped that through making use of the information contained within the transactional data more insights could be drawn to support the univariate forecasting.

Fig. 5. Account 2 – Conventional forecasting outperformed the benchmark. The y Axis represents the scaled true cash balance of the account.

Fig. 6. Account 3 – Only SARIMA outperformed the benchmark. The y axis represents the scaled true cash balance of the account.


Conventional Multivariate Time Series Forecasting. As detailed in the dataset description (Sect. 6.1), a multivariate dataset was obtained through extracting information from transactional data contained within the bank statements. A series of aggregation techniques were applied on the transactional data, each based upon the different types of transactions and correlations between them. The aggregates obtained in the form of sparse time series are forward-filled to ensure there is no effect of future values on the past entries. As the sum of the transactions within a day reconciles the end of day cash balance, the aggregate transactional time series represent a set of endogenous variables with respect to the target variable. Therefore, these series cannot be directly used as exogenous constraints to the SARIMA model. The solution implemented in this paper was to shift the time component of the transactional aggregates forward by the size of the test set. In other words, for example, the aggregates computed for the July cash balances were used as exogenous constraints for predicting the cash balance during August (Fig. 7).

Fig. 7. Performance of SARIMA with transactional exogenous variables on accounts 2 and 3. The y axis represents the scaled true cash balance of each account.
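The shifting trick described above can be sketched as follows; the feature matrix, its column contents and the SARIMA orders are assumptions for illustration only, since the actual aggregation methods are not disclosed.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def sarimax_shifted_exog(balance: pd.Series, aggregates: pd.DataFrame,
                         horizon: int = 22):
    """Shift the (endogenous) transactional aggregates forward by the forecast
    horizon so that only past values enter the exogenous matrix, then fit
    SARIMAX with them as exogenous regressors and forecast the test window."""
    exog = aggregates.shift(horizon).ffill().bfill()  # no future information leaks in
    y_train, x_train = balance.iloc[:-horizon], exog.iloc[:-horizon]
    x_test = exog.iloc[-horizon:]
    model = SARIMAX(y_train, exog=x_train, order=(1, 0, 1),
                    seasonal_order=(1, 1, 1, 22)).fit(disp=False)
    return model.forecast(steps=horizon, exog=x_test)
```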

Machine Learning Approaches. The improvement in accuracy through the use of exogenous constraints based upon past transactional data indicates that the transactional data could be leveraged to train endogenous regressors on top of the univariate signal from the cash balance. A family of models that has been shown to be able to map multivariate inputs to time-ordered outputs is neural networks (e.g. [22]). Under the assumption that the transactional aggregates retain information about future cash balances, a stacked ensemble of a dimensionality reduction algorithm and an artificial neural network architecture was built. Due to the commercial nature of the experiment reported in this paper, the exact architecture of the neural network (ANN) cannot be revealed. We report that the ANN architecture outperforms SARIMA with transactional exogenous constraints as judged by both metrics used in this experiment. Compared against the univariate SARIMA, the machine learning method still underperforms, albeit by a small margin of 0.07 in the normalized root mean squared error metric. By comparing the individual performance over each account sample we can get a clearer picture of the difference between conventional and machine learning models. Figure 8 presents a comparison of the accuracy of the two best performing models, univariate ARIMA and multivariate ANN, over the same two account samples discussed


Fig. 8. Performance comparison of the univariate ARIMA and the multivariate NN methods on Account 2 (0.73 vs 3.11) and Account 3 (0.93 vs 0.44). The y axis represents the scaled true cash balance of each account.

Fig. 9. Performance comparison of univariate ARIMA and multivariate NN over Account 1 (0.67 vs 1.42). The y axis represents the scaled true cash balance of the account.

in previous sections. The 3-month historical actuals of the cash balances are included to give a view of the most recent trend in the series. Noticeably, while the ANN method outperforms ARIMA by a large margin on Account 3, the performance over the second account is poor. We attributed the underperformance to the changing trend of the second account, which the ANN could not capture. In Fig. 9 we compare the two models over a different bank account which contains a salient operational outlier. Similar to the case of the second account, the neural networks could not learn the rapidly changing pattern of Account 1. However, when eliminating the very first datapoint in the series, the root mean squared error drops from 1.42 to 0.45, whereas the SARIMA model yields the same error of 0.66 (Table 1). These observations suggest that, given the rapidly changing trends within the series, the ANN architecture can learn the smoothly varying features better than the conventional methods.

7 Conclusions and Further Work In this paper we presented an overview of the challenges of understanding, collating and exploring account balance data pertaining to private enterprises with a view of forecasting


their future cash balances. We showed several techniques for addressing these challenges, which allowed us to exemplify the use of both conventional and machine learning techniques for cash balance forecasting. Additionally, we performed a comparative analysis of conventional and machine learning based time series forecasting models on a representative subset of bank accounts. The intermediate results portrayed a fair competition between the Seasonal Autoregressive Integrated Moving Average and neural network architectures, indicative of the stochastic nature of enterprise account cash balances. Permanent collaboration with field experts, either internal or external (banks and customers), and with the architects of the legacy code used for parsing the SWIFT statements is the ultimate solution to alleviating a number of challenges discussed in this paper. In the future, we plan on exploring the transactional features in more depth to get an understanding of the way they influence the future values of the end of day cash balance. Additionally, through an improved collaboration with the domain specialists, we aim to limit the influence of the challenges emphasized in this paper. Acknowledgments. The authors would like to express their gratitude to the two anonymous referees for the helpful and supportive comments received.

References
1. Giles, C.L., Lawrence, S., Tsoi, A.C.: Noisy time series prediction using recurrent neural networks and grammatical inference. Mach. Learn. 44, 161–183 (2001)
2. Katselas, D., Sidhu, B., Yu, C.: Merging time-series Australian data across databases: challenges and solution. Account. Finan. 56, 1071–1095 (2016)
3. Shang, Y.: Subgraph robustness of complex networks under attacks. IEEE Trans. Syst. Man Cybern. Syst. 49, 821–832 (2019)
4. Fry, J., Griguta, V.-M., Gerber, L., Slater-Petty, H., Crockett, K.: Stochastic modelling of corporate accounts. Preprint (2021)
5. Fama, E.: Efficient capital markets: a review of theory and empirical work. J. Finan. 25, 383–417 (1970)
6. Merton, R.C.: The theory of rational options pricing. Bell J. Econ. Manag. Sci. 4, 141–183 (1973)
7. Cont, R.: Empirical properties of asset returns: stylized facts and statistical issues. Quant. Finan. 1, 223–236 (2001)
8. Hentschel, L.: All in the family: nesting symmetric and asymmetric GARCH models. J. Finan. Econ. 39, 71–104 (1995)
9. Katsiampa, P.: Volatility estimation for Bitcoin: a comparison of GARCH models. Econ. Lett. 158, 3–6 (2017)
10. Walid, C., Chaker, A., Masood, O., Fry, J.: Stock market volatility and exchange rates in emerging countries: a Markov-state switching approach. Emerg. Mark. Rev. 12, 272–292 (2011)
11. Meyers, T.A.: The Technical Analysis Course, 4th edn. McGraw-Hill, New York (2011)
12. Park, C.-H., Irwin, S.H.: What do we know about the profitability of technical analysis? J. Econ. Surv. 21(4), 786–826 (2007). https://doi.org/10.1111/j.1467-6419.2007.00519.x
13. Nazário, R.T.F., e Lima, J.L., Sobreiro, V.A., Kimura, H.: A literature review of technical analysis on stock markets. Q. Rev. Econ. Finan. 66, 115–126 (2017). https://doi.org/10.1016/j.qref.2017.01.014


14. Lo, A.W., Mamaysky, H., Wang, J.: Foundations of technical analysis: computational algorithms, statistical inference and empirical investigation. J. Finan. 55, 1705–1765 (2000)
15. Weytjens, H., Lohmann, E., Kleinsteuber, M.: Cash flow prediction: MLP and LSTM compared to ARIMA and Prophet. Electron. Commer. Res. (2019)
16. Bao, W., Yue, J., Rao, Y.: A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLOS ONE 12(7), e0180944 (2017). https://doi.org/10.1371/journal.pone.0180944
17. Fischer, T., Krauss, C.: Deep learning with long short-term memory networks for financial market predictions. Eur. J. Oper. Res. 270(2), 654–669 (2018). https://doi.org/10.1016/j.ejor.2017.11.054
18. Cai, M., Pipattanasomporn, M., Rahman, S.: Day-ahead building-level load forecasts using deep learning vs. traditional time-series techniques. Appl. Energy 236, 1078–1088 (2019)
19. Zhu, Y., Zhou, L., Xie, C., Wang, G.J., Nguyen, T.V.: Forecasting SMEs’ credit risk in supply chain finance with an enhanced hybrid ensemble machine learning approach. Int. J. Prod. Econ. 211, 22–33 (2019)
20. Salas-Molina, F.: Fitting random cash management models to data. Comput. Oper. Res. 106, 298–306 (2019)
21. Amel-Zadeh, A., Calliess, J.-P., Kaiser, D., Roberts, S.: Machine Learning-Based Financial Statement Analysis, 15 January 2020
22. Zaytar, M.A., El Amrani, C.: Sequence to sequence weather forecasting with long short-term memory recurrent neural networks. Int. J. Comput. Appl. 143(11), 7–11 (2016)

Top of the Pops: A Novel Whitelist Generation Scheme for Data Exfiltration Detection Michael Cheng Yi Cho(B) , Yuan-Hsiang Su, Hsiu-Chuan Huang, and Yu-Lung Tsai Information and Communication Security Technical Laboratories, Telecommunication Laboratories, Chunghwa Telecom, Taoyuan, Taiwan {michcho,yhsu,pattyh,tyl}@cht.com.tw

Abstract. As intellectual property and private data are stored in digital format, organizations and companies must protect data from possible digital data exfiltration. Prevention and detection are common approaches to combat data exfiltration. Regardless of the prevention measures in place, a detection approach must be implemented to strengthen data protection. Analyzing network data to find abnormal behavior, namely network traffic anomaly detection, is an effective detection approach. One of the challenges in network traffic anomaly detection is benign data or data noise filtering. A good data filter will improve computation performance and detection accuracy. In this paper, we propose a whitelist generation scheme to generate a data filter for data exfiltration detection. Our scheme leverages a kernel density estimator with numerous computed network feature indexes to generate a popularity-based data filter. We use real corporate network data to evaluate data filter efficiency. After evaluation, our scheme generates a whitelist that filters out twice as much data as the conventional whitelist generation method. Furthermore, detection accuracy is also improved when evaluated against two data exfiltration detection algorithms.

Keywords: Whitelist · Data filter · Data exfiltration · Data leakage

1 Introduction

Data exfiltration incidents are much more organized nowadays given their economic value. Stealing intellectual property, personal identities and data, etc. is often the motivation of organized cyber crime. Even top financial companies like Equifax and Deloitte have fallen victim to such incidents [1]. Given that most information has gone digital, companies must not overlook the importance of digital data protection. There are prevention and detection solutions to mitigate the data exfiltration problem. Solutions like data access control and data encryption are popular prevention schemes. These approaches only control access privileges to the protected


data. They cannot detect data misuse [2]. Detection mechanisms, on the other hand, are adopted to strengthen data protection against data exfiltration. Anomaly detection is an active research domain due to the recent rise of machine learning research. Given that the network is one of the main media for data exfiltration, network-based anomaly detection is an active research domain in its own right. Network-based anomaly detection [3–5] leverages network features and machine learning to detect possible malicious activities which may cause data exfiltration. Data filtering is a common data preprocessing approach in network-based anomaly detection. It is used to filter out benign data before the detection model training process and/or during the detection process. A good data filter will enhance computational performance and improve detection accuracy. A whitelist is a common method to construct a data filter. Two common approaches are used to produce whitelists, namely domain knowledge heuristics and popularity heuristics. Domain knowledge heuristics [6, 7] require expert knowledge of malicious behavior and of the training data. Therefore, it is hard to replicate the whitelist generation procedure from research to research. Popularity heuristics are another feasible solution since anomaly detection is based on rare event detection. This method generates a whitelist by summarizing common (popular) events [4, 5, 8]. It gathers resources from web traffic statistics websites (e.g. Alexa [9, 10]) or obtains the top-N most visited websites from training data statistics to construct a whitelist. Resources from the web suffer from a locality problem given that the whitelist is not customized to the training data. The second approach often neglects to evaluate the effectiveness of the top-N configuration (i.e. the approach to find a good N value), since whitelist generation is not the main objective of the research. The popularity heuristics approach is a good approach, but it is not an easy problem to summarize a middle-to-large network heuristically. Part of our corporate network consists of over 10 thousand clients that generate 40 million connections on a daily basis. It is difficult to develop a solution based on a heuristics approach. If we could derive a systematic solution that overcomes the locality and top-N configuration problems, anomaly detection research could benefit from such a solution. For that, we propose a scheme that is based on the popularity concept and uses a statistical method to select popular domains to serve as a whitelist. We name the scheme TOP (top of the pops) since it adopts the concept of selecting the most popular domains from the given network data. In contrast to popularity heuristics, TOP systematically takes raw network data as input, computes feature indexes, uses statistics to select a whitelist candidate base, and expands from the whitelist candidate base to output a reliable whitelist for data exfiltration detection algorithms. According to our evaluation, TOP filtered out more data (twice as much) in comparison with the conventional popularity heuristics. As a result, TOP helps data exfiltration detection algorithms perform better in detection accuracy and computation speed. The remainder of this paper is organized as follows: Sect. 2 reviews related research regarding whitelist generation and usage. Section 3 explains the scheme structure and the algorithms used in the proposed scheme. Section 4 evaluates


the proposed scheme against conventional popularity heuristics using two different data exfiltration detection algorithms on a real network traffic dataset. Lastly, Sect. 5 concludes this research work.

2 Related Work

As for related work, we have surveyed whitelist generation approaches in different malicious network behavior detection domains and categorized the approaches based on the adopted strategies. The strategy bases are domain knowledge-based, popularity-based, and statistical-based. For the domain knowledge-based approach, Hwang et al. [6] proposed a 3-tier IDS (Intrusion Detection System) using data mining. One of the tiers uses whitelist filtering to filter out benign network traffic. A filter is built based on feature profiling. If observed network traffic does not fall into the profile, it is considered abnormal traffic. This approach may trigger false negatives if adversaries leverage popular web services to carry out malicious activity. Strayer et al. [7] introduced a botnet detection scheme that uses a data filtering technique to filter out data noise. The technique is based on domain knowledge of malicious characteristics. It is a heuristic data filtering approach tailored to the detection application. Takemori et al. [11] proposed a whitelisting approach for bot detection. A whitelist is constructed per workstation/server unit. An observation period is required to construct the whitelist. This whitelist construction technique is difficult to deploy in a large-scale network. Furthermore, it is rather difficult to make sure no malware infection occurs during the observation period. As for the popularity-based approach, research leverages well-known popular web service lists gathered from experience and Internet resources. Yan et al. [4] constructed a customized whitelist based on popularity statistics of network traffic, referer headers, and raw IP addresses. They also applied domain-folding to expand the coverage of the whitelist. Gu et al. [8] proposed a bot C&C (command and control) channel detection system based on network statistics correlation. The system also uses two types of whitelist, namely, a hard and a soft whitelist, to filter out data noise. A hard whitelist is based on popular web services (e.g. Google, Yahoo, etc.). A soft whitelist is obtained after data is analyzed as benign network traffic. Since anomaly detection detects rare behavior, it makes sense to filter out common behavior (visits to popular web services). However, a discussion of how popularity is defined is out of the scope of these studies. Lastly, the statistical-based approach summarizes network features to distinguish normal and abnormal behavior. Byakko [12] is a whitelist generation approach based on data distribution. Byakko leverages deviation from an expected distribution to determine data noise. A host is considered noise if the feature distributions of the training data and testing data are similar. Byakko only works on a set of features of a host that have a stable distribution. Therefore, Byakko deployment will need experts with domain knowledge to preprocess the study data. Although the whitelist is not the main objective in most of the introduced research, it still serves an important role in data preprocessing. A good whitelist


will increase detection accuracy and reduce the computation time. However, most of the introduced research relies on experts’ domain knowledge or popularity heuristics to generate a whitelist. Our goal is to propose a systematic whitelist generation scheme with minimal domain knowledge of the deployed security application. The resulting whitelist is close to the study target, which will yield better results in data filtering.

3 Proposed Scheme

Before the proposed scheme is presented, we begin with the web sites of interest for our problem domain. In the malicious network traffic analysis domain, whitelist candidates are often referred to as popular web sites. However, some popular web sites may not be good whitelist candidates depending on the security application. In our case, an online storage web service is not a good whitelist candidate for a data exfiltration detection application. Therefore, we need to define whitelist candidate targets, so that we can evaluate the generated whitelist candidates. The following items will be revisited in the evaluation subsection:
– Inclusive Domains
  • Popular domains with integrated services such as default page, search engine, news, portals, etc.
  • Domains that are work related, such as corporate sites.
  • Domains used for software update purposes.
  • Advertisement and client behavior tracking background network traffic when visiting a domain.
– Exclusive Domains
  • Well-known online storage domains.
  • Any domains that have extensive upload traffic.
Figure 1 depicts the workflow of TOP. A raw network proxy log is passed into three major components, namely, feature computation, whitelist candidate computation, and whitelist candidate expansion, to generate a list of whitelist candidates. Feature computation is responsible for feature extraction and feature index computation. As a result, two sets of feature indexes are passed into the whitelist candidate computation component to generate two sets of whitelist candidates. Finally, the whitelist candidate expansion component is in charge of expanding the whitelist candidates to broaden the data filtering coverage. Each of the major components is explained in greater detail in the following subsections.

3.1 Feature Computation

The feature computation component consists of two functionalities, namely feature extraction and feature index computation, and outputs two sets of feature indexes that serve as input for whitelist candidate computation (see Fig. 1). The feature extraction component extracts the client identification, visited domain, uploaded bytes, and downloaded bytes.


Fig. 1. Proposed scheme workflow.

Feature index computation then uses these features to compute two sets of feature indexes that serve as input to whitelist candidate computation. The first set of feature indexes consists of the number of accessing clients, the number of network traffic flows, the total downloaded bytes, and the total uploaded bytes per domain on a daily basis. To prevent rare events from entering the training data, we only select domains that are visited in all of the training data (the training data are separated by day) for feature index computation. These feature indexes are used as indicators of a popular domain in the training data. The second set of feature indexes uses the access term frequency and the upload-download ratio to rate a domain on a daily basis. The feature index is computed as follows:

TF_ij = 1 / count(J_i),   ∀i ∈ I, ∀j ∈ J

(1)

Let I be the set of clients and J the set of visited domains in the sampling data. Equation 1 computes the visited-domain TF (term frequency) for each client and domain pair. Whenever a distinct domain is visited by a client, the corresponding TF_ij is a fraction of the total number of distinct domains visited by that client, so TF_ij receives a higher value if client i has only a few distinct network connections. A small set of distinct domain connections means that the client is less likely to be operated by a human, and the generated network traffic is more likely to be background noise such as software updates. Therefore, such domains receive a higher score.

UDR_j = ( Σ_{i=0..I} DL_ij ) / ( Σ_{i=0..I} UL_ij ),   ∀i ∈ I, ∀j ∈ J

(2)

TFUDR_j = Σ_{i=0..I} TF_ij · log(UDR_j),   ∀i ∈ I, ∀j ∈ J

(3)

Equation 2 is designed to distinguish upload-driven from download-driven domains, while Eq. 3 calculates the value used by whitelist candidate computation to select whitelist candidates. Equation 2 computes the UDR (upload-download ratio) for all visited domains in the given data set, where DL is the total downloaded bytes and UL the total uploaded bytes for a given client and visited domain. We anticipate that an upload-driven domain will have a UDR value of less than 1 and that a download-driven domain will have a value greater than 1. Equation 3 multiplies the results of Eqs. 1 and 2; the logarithm of UDR is designed to separate upload-driven TF values from download-driven TF values. For the data exfiltration detection application, we only take TFUDR values greater than zero (which indicate download-driven domains) as input to the whitelist candidate computation component. A popular and heavily download-driven domain is expected to have a high TFUDR value, so whitelist candidate computation is most likely to select domains with high TFUDR values as whitelist candidates.
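As an illustration, a minimal sketch of how these feature indexes could be computed from a proxy-log table is given below; the pandas-based implementation, the column names, and the guard against division by zero are our assumptions rather than the authors' code.

```python
import numpy as np
import pandas as pd

def tfudr(log: pd.DataFrame) -> pd.Series:
    """Compute TFUDR_j (Eq. 3) per domain from per-connection proxy-log rows."""
    # Expected (assumed) columns: client, domain, up_bytes, down_bytes
    distinct = log.drop_duplicates(["client", "domain"]).copy()
    # Eq. 1: TF_ij = 1 / (number of distinct domains visited by client i)
    distinct["tf"] = 1.0 / distinct.groupby("client")["domain"].transform("nunique")

    totals = log.groupby("domain")[["down_bytes", "up_bytes"]].sum()
    # Eq. 2: UDR_j = total downloaded bytes / total uploaded bytes of domain j
    udr = totals["down_bytes"] / totals["up_bytes"].clip(lower=1)  # clip avoids division by zero

    # Eq. 3: TFUDR_j = sum_i TF_ij * log(UDR_j); keep only TFUDR > 0 (download-driven domains)
    tf_sum = distinct.groupby("domain")["tf"].sum()
    tfudr_j = tf_sum * np.log(udr.reindex(tf_sum.index))
    return tfudr_j[tfudr_j > 0]
```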

3.2 Whitelist Candidate Computation

KDE (kernel density estimation) is the foundation of whitelist candidate selection. KDE estimates the probability density of sample data using a kernel function and a bandwidth variable that adjusts the distribution curve [18, 19]. We manipulate the bandwidth variable to select whitelist candidates from the various network feature distributions. More specifically, KDE is used as a threshold selector, and any domain whose feature value surpasses the threshold is selected as a whitelist candidate. Previous literature shows that Internet usage (domain visits) tends to follow a Zipf-like distribution [20–22], which means that a few domains are much more popular than the rest. Since whitelist candidate selection is based on popularity, we apply KDE to the Zipf-distributed samples to calculate a threshold value and use it as the basis for selecting domains that qualify as whitelist candidates.

Fig. 2. Sample Zipf distribution.

Figure 2 shows distribution plots of sample Zipf-distributed data. The figure on the right is the plot produced by the KDE function with a manipulated bandwidth. Notice that the distribution contains several bell-shaped peaks. Each bell-shaped curve contains a cluster of sample data, and the largest one contains the long-tail samples of the Zipf distribution, since the probabilities of the long-tail samples are very close to each other. If we place a mark between the bell-shaped curves (see the x mark in the figure), we obtain a threshold that divides the sample data into body and long tail. This threshold value is retrieved by calculating the local minima of the distribution. The figure on the left of Fig. 2 locates the threshold value found by the KDE function (see the black bar in the figure). The threshold clearly divides the body and the long tail of the sample Zipf distribution, and we use the sample data in the body as whitelist candidates, given their popularity. As for bandwidth selection for the KDE function, we noticed that there is no common bandwidth for different features. Therefore, we iterate through several bandwidth configurations dynamically to find a bandwidth that divides body and long tail in no more than a 20–80 ratio; that is, the body population does not exceed 20% of the entire population. After applying KDE to the two sets of feature indexes (i.e., 4 indexes in feature index set 1 and 1 index in feature index set 2), 2 sets of whitelist candidates are generated. Whitelist candidate set 1 requires further treatment to produce the final whitelist. Recall that the 4 feature indexes of feature index set 1 are the number of accessing clients, the number of network traffic flows, the total downloaded bytes, and the total uploaded bytes. Candidates generated by the total uploaded bytes index are used to remove duplicate candidates generated by the other 3 feature indexes, given that we are addressing the data exfiltration problem: popular domains that exhibit excessive upload behavior should be removed from the whitelist candidates.
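A minimal sketch of this threshold selection is shown below, assuming SciPy's Gaussian KDE; the bandwidth grid, the evaluation grid resolution, and the function name are our assumptions.

```python
import numpy as np
from scipy.signal import argrelmin
from scipy.stats import gaussian_kde

def kde_threshold(values, bandwidths=np.linspace(0.05, 1.0, 20), max_body_ratio=0.2):
    """Return a feature-value threshold separating the popular 'body' from the long tail."""
    values = np.asarray(values, dtype=float)
    grid = np.linspace(values.min(), values.max(), 1000)
    for bw in bandwidths:                               # iterate bandwidth configurations
        density = gaussian_kde(values, bw_method=bw)(grid)
        for idx in argrelmin(density)[0]:               # local minima between the bell-shaped clusters
            threshold = grid[idx]
            if np.mean(values > threshold) <= max_body_ratio:   # body must not exceed 20% of population
                return threshold
    return None                                         # no acceptable body/long-tail split found

# Domains whose feature value exceeds the returned threshold become whitelist candidates.
```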

3.3 Whitelist Candidate Expansion

This component leverages the results of the whitelist candidate computation component. There are two inputs, given that two sets of feature indexes are used for candidate computation. While the two inputs are based on domain popularity over different feature indexes, popular domains often offer different types of services through numerous sub-domains. To capture these sub-domains, we extract the 2nd-level domains from the candidates and use them to match sub-domains (the domain unfolding approach [4]), as sketched below. As mentioned above, upload-driven domains must be excluded for data exfiltration detection. Therefore, we calculate the average uploaded and downloaded bytes and use them as an indication of upload-driven domains; all domains labeled as upload-driven are excluded from the whitelist. In addition, we also set up a list of well-known online storage domains (e.g., Google Drive, Dropbox, OneDrive, etc.) to act as a whitelist filter, since potential upload-driven domains should be excluded when addressing the data exfiltration problem.
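The following sketch illustrates the expansion and exclusion steps under simplifying assumptions; the helper names, the naive two-label suffix rule, and the example storage domains are ours, and a public suffix list would be more robust in practice.

```python
# Hypothetical example list of online storage domains; the paper does not give the exact filter list.
KNOWN_STORAGE = {"drive.google.com", "dropbox.com", "onedrive.live.com"}

def second_level(fqdn: str) -> str:
    # Naive 2nd-level domain extraction (assumption: last two labels are enough).
    parts = fqdn.lower().rstrip(".").split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else fqdn

def expand_whitelist(candidates, observed_fqdns, avg_up, avg_down):
    # avg_up / avg_down: dicts of average uploaded / downloaded bytes per FQDN (assumed inputs).
    seconds = {second_level(c) for c in candidates}
    expanded = {f for f in observed_fqdns if second_level(f) in seconds}     # domain unfolding
    upload_driven = {f for f in expanded if avg_up.get(f, 0) > avg_down.get(f, 0)}
    storage_seconds = {second_level(s) for s in KNOWN_STORAGE}
    storage = {f for f in expanded if second_level(f) in storage_seconds}
    return expanded - upload_driven - storage
```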

4 Evaluation

Evaluation is divided into subsections. First, we will introduce threat model that is used to carry out the experiment. It describes the approach that is used to generate anomaly network data that represents data exfiltration. Secondly, data


source and experiment setup are introduced, followed by a summary of the feature index distributions. Whitelist results are evaluated in the fourth subsection, whitelist efficiency is evaluated in the fifth subsection, and, lastly, detection performance is evaluated in the final subsection.

4.1 Threat Model

Data exfiltration can be an insider or an outsider attack that results in leaking sensitive data to adversaries. Regardless of the role of the adversaries, the threat model is divided into two scenarios based on how the sensitive data is exported. In the first scenario, adversaries host a web service to store the stolen data [13]. Since the web service is used for a special purpose, we can assume it will not receive many visits in comparison with popular web services such as search engines. This scenario can be detected by summarizing the URL (uniform resource locator) access statistics of a given network. The second scenario is the opposite: adversaries use a popular online storage website to export the stolen data [14–17]. In such cases URL statistics analysis does not work well, and analysis of individual host network behavior can be a more effective approach. We use these two scenarios as evaluation input to summarize whitelist efficiency.

4.2 Data Set and Experiment Configuration

We use proxy logs from a corporate network as the source of evaluation data. The data comprises over 10 thousand clients that generate over 40 million network connection records during standard working hours (from 8 am to 6 pm) per day. We take 2 weeks (10 working days) of data to train a whitelist using TOP, and 1 day of data to evaluate whitelist efficiency. For the efficiency evaluation, we use two data exfiltration detection schemes, namely a naive approach and the advanced approach of Marchetti et al. [3], to measure detection effectiveness with and without the whitelist data filter. The naive data exfiltration detection approach builds network usage profiles for both clients and domains. A profile contains connection count and uploaded bytes statistics. An intermediate probability value is calculated to identify clients with possible data exfiltration behavior, which is flagged based on a threshold value. After possible data exfiltration behavior is filtered out, two strategies are used to rank the suspected client and domain pairs: strategy 1 ranks suspects by total uploaded bytes, while strategy 2 ranks suspects by their degree of deviation from the profile. We inject three different volumes of data exfiltration network traffic into the test data. Initially, 9 GB of data exfiltration network traffic (referring to the incident described in [3]) was injected. However, both detection approaches were very accurate in detecting this behavior regardless of whitelist data filtering, because the top uploading client in the test data only accounts for about 1 GB of upload traffic. For that reason, we additionally introduced 20 MB and 700 MB of data exfiltration network traffic for the detection efficiency evaluation.


The candidates for data exfiltration traffic injection are selected based on statistics. We select 2 domains to represent the threat model described earlier. The first domain represents a domain hosted by adversaries and is randomly selected from the test data among domains with a small amount of network traffic. The second domain represents a popular online storage service that has a large amount of network traffic in the test data. As for client selection, we pick 6 different clients based on the connection count and upload traffic quantiles (i.e., 25%, 50%, and 75%) so that whitelist efficiency can be investigated across different network usage characteristics.

4.3 Feature Evaluation

We first analyse the data features to ensure that the selected features follow a Zipf-like distribution, so that they fit the assumption of TOP. Given that we have over approximately 140 thousand unique domains per day to analyse, the corresponding distribution plot is difficult to inspect visually; therefore, we take the first 75%-quantile of the distribution to summarize the data distributions. Figure 3 illustrates the four selected features (number of accessing clients, number of network traffic flows, total downloaded bytes, and total uploaded bytes) for feature set one, and all four show a Zipf-like distribution. Figure 4 illustrates the TFUDR distribution for feature set two. We only selected the download-driven domains (i.e., TFUDR values greater than zero), since we only want download-driven domains as whitelist candidates for data exfiltration detection. The figure on the left is the first 75%-quantile of the distribution. Although this distribution differs from Fig. 3, it still resembles a Zipf-like distribution when the entire population is plotted (see the right figure in Fig. 4). The cross mark on the figure is the threshold value computed by the whitelist candidate computation component, where the domain count is lower than in the first few clusters before the domain count surges (see the figure on the left). Therefore, KDE is still reliable for selecting whitelist candidates from the TFUDR distribution.

4.4 Whitelist Evaluation

Tables 1 and 2 summarize the whitelist results produced by TOP. TOP generated just over 151 thousand FQDNs (fully qualified domain names) as the whitelist for data filtering. After summarizing whitelists 1 and 2, the candidates fit the definition of the whitelist in Sect. 3. Sites like yahoo.com and google.com fit the category of default pages, portals, and search engines that are used widely among clients. The corporate website is also in the whitelist, while trendmicro.com and windowsupdate.com are used for software updates. Lastly, yimg.com and ads.yahoo.com fit the category of commercial activities. The whitelist candidate expansion component (domain unfolding strategy) worked as intended: sub-domains of gstatic.com take up a large portion of whitelist 3, as the site is mainly responsible for delivering static content for Google services.


Fig. 3. Distribution of feature set one.

Fig. 4. Distribution of feature set two.

Notice that a portion of the candidates in whitelist 1 is not covered by whitelist 3 (see Table 2). This indicates that the excluded FQDNs were labeled as upload-driven domains and would normally not be part of the whitelist candidates. After close examination, the excluded FQDNs are characterized by large connection counts and large numbers of individual client visits. They are mainly used for tracking user behavior (cookie upload traffic) to provide commercial services such as advertisement and


recommendations (e.g., google-analytic.com, advertising.com, etc.). As a result, these excluded FQDNs are kept as part of the whitelist for the later data exfiltration detection evaluation.

Table 1. Whitelist candidate counts.

        Whitelist 1  Whitelist 2  Whitelist 3
Count   104          57           151965

Table 2. Whitelist intersection counts.

             Whitelist 1  Whitelist 2  Whitelist 3
Whitelist 1  0            30           93
Whitelist 2  30           0            57
Whitelist 3  93           57           0

Table 3 is a statistical comparison between the conventional whitelist generation method based on popularity heuristics and TOP. We obtained lists of popular sites from Alexa [9, 10] and Wikipedia [23] and used them to compare filtering capability against the result produced by TOP. After applying the domain unfolding methodology, the conventional method only generates about 19 thousand whitelist candidates, which is only 13% of the candidates generated by TOP. Furthermore, only about 6 thousand whitelist candidates are shared between the two approaches. Table 4 evaluates the filtering results on the testing data. The filter ratio is calculated based on the number of suspected records. The sample testing data contains over 39 million network connection records, of which over 27 million are suspected records (connections with more uploaded bytes than downloaded bytes). Overall, the whitelist produced by TOP filters out about twice as much as the conventional method.

Table 3. Conventional whitelist generation method statistics.

                     Alexa global  Alexa Taiwan  Wikipedia  Domain unfold
Internet resource    50            50            50         19026
Intersected results  6             6             6          6143


Table 4. Whitelist filter efficiency statistics.

              Whitelist 1  Whitelist 2  Whitelist 3  Whitelist All  Conventional
Filter ratio  36.54%       16.61%       41.00%       50.25%         21.45%

4.5 Detection Efficiency Evaluation

Recall that the data exfiltration detection approaches produce a list of suspicious candidates ordered by score; a suspicious ranking is produced by sorting the scores in descending order. For this evaluation, we inspect the improvement in suspicious ranking after applying whitelist filtering in the experiments. After evaluating all experiment combinations (i.e., two domains and three injected traffic sizes), the experiment injecting 20 MB of traffic toward a popular web service shows the best improvement. All experiments with 9 GB of injected traffic or with a self-hosted web service show only minor ranking improvements. This is because 9 GB of injected traffic is already very suspicious, given that the top upload traffic is originally only about 1 GB, and the self-hosted web service is also very suspicious, given that its historic profile has low network traffic, so a surge in traffic produces a high suspicion score. For that reason, we only present the ranking improvement of the experiment that injects 20 MB of data exfiltration traffic toward a popular web service. Table 5 and Table 6 contain the suspicious ranking improvements after applying the whitelist produced by TOP and by the conventional approach, respectively. Rows are labeled by detection approach: M1 is the naive detection approach, where M1a ranks by total uploaded bytes and M1b ranks by degree of deviation from the profile, and M2 is the advanced approach. Columns are labeled by the clients selected by statistics quantile, where C is the connection count and U is the uploaded bytes. From Table 5, whitelist filtering shows positive results regardless of the experiment configuration; among them, the ranking improves significantly when whitelist filtering is applied to the advanced data exfiltration detection approach. Table 6 shows the ranking improvement using the whitelist produced by the conventional method. Again, the suspicious ranks improve; however, the whitelist produced by TOP shows more promising results.

4.6 Performance Evaluation

Lastly, we evaluate the computation performance improvement obtained by applying whitelist filtering to the detection approaches. This evaluation is carried out on a dual Intel Xeon 2.4 GHz 10-core 2-thread (40 threads in total) machine with 160 GB of memory. We run the two data exfiltration detection approaches with 9 GB of injected data multiple times and take the average computation time before and after whitelist filtering. The naive detection approach takes longer to compute, given that scores are calculated per network connection. Therefore, whitelist filtering is more effective there and it


Table 5. Ranking improvement using the whitelist produced by TOP.

      C25%  C50%  C75%  U25%  U50%  U75%
M1a   2     2     2     2     2     2
M1b   5     5     5     5     5     5
M2    34    185   361   30    243   242

Table 6. Ranking improvement using the whitelist produced by the conventional method.

      C25%  C50%  C75%  U25%  U50%  U75%
M1a   0     0     0     0     0     0
M1b   1     1     1     1     1     1
M2    9     91    175   11    118   109

yields a 25% computation time improvement. On the other hand, the advanced detection approach calculates scores per client, which is faster to process; whitelist filtering there improves computation time by only about 1%, given that the entire computation takes only 0.1 s.

5 Conclusion

In this paper we presented TOP, a scheme that produces a whitelist to serve as a data filter, that is, a filter that removes data noise to increase the efficiency of a data exfiltration detection application. Compared to the conventional whitelist construction method (popularity heuristics), our scheme customizes the whitelist according to the study target, which yields better data filtering results. According to the evaluation, the whitelist produced by TOP filters out twice as much data noise as the conventional whitelist generation method. Furthermore, the evaluations also showed better detection results and faster computation. Although data exfiltration detection is used as the application for TOP, we believe TOP also fits other anomaly detection applications that use Zipf-like distributed feature indexes; with minimal modification of the feature index computation component, TOP can easily be applied to different anomaly detection applications.

References 1. Ullah, F., Edwards, M., Ramdhany, R., Chitchyan, R., Babar, M.A., Rashid, A.: Data exfiltration: a review of external attack vectors and countermeasures. J. Netw. Comput. Appl. 101, 18–54 (2018) 2. Alneyadi, S., Sithirasenan, E., Muthukkumarasamy, V.: A survey on data leakage prevention systems. J. Netw. Comput. Appl. 62, 137–152 (2016)


3. Marchetti, M., Pierazzi, F., Colajanni, M., Guido, A.: Analysis of high volumes of network traffic for advanced persistent threat detection. Comput. Netw. 109–2, 127–141 (2016) 4. Yen, T.F., et al.: Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks. In: Proceedings of the 29th Annual Computer Security Applications Conference (ACSAC 2013), pp. 199–208. ACM, December 2013 5. Oprea, A., Li, Z., Norris, R., Bowers, K.: Made: security analytics for enterprise threat detection. In: Proceedings of the 34th Annual Computer Security Applications Conference (ACSAC 2018), pp. 124–136. ACM, December 2018 6. Hwang, T.S., Lee, T.J., Lee, Y.J.: A three-tier IDS via data mining approach. In: Proceedings of the 3rd Annual ACM Workshop on Mining network data, pp. 1–6. ACM, June 2007 7. Strayer, W.T., Walsh, R., Livadas, C., Lapsley, D.: Detecting botnets with tight command and control. In: 2006 Proceedings of 31st IEEE Conference on Local Computer Networks, pp. 195–202. IEEE, November 2006 8. Gu, G., Zhang, J., Lee, W.: BotSniffer: detecting botnet command and control channels in network traffic. In: Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS). The Internet Society, February 2008 9. The Top 500 Sites on the Web, Alexa Internet Inc., September 2020. https://www. alexa.com/topsites 10. Top Sites in Taiwan, Alexa Internet Inc., September 2020. https://www.alexa. com/topsites/countries/TW 11. Takemori, K., Sakai, T., Nishigaki, M., Miyake, Y.: Detection of bot infected PC using destination-based IP address and domain name whitelists. J. Inf. Process. Inf. Media Technol. 6(2), 649–659 (2011) 12. Kanaya, N., Tsuda, Y., Takano, Y., Inoue, D.: Byakko: automatic whitelist generation based on occurrence distribution of features of network traffic. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 3190–3199. IEEE, December 2019 13. Exfiltration Over Alternative Protocol, MITRE ATT&CK, September 2020. https://attack.mitre.org/techniques/T1048 14. Liu, L., De Vel, O., Han, Q.L., Zhang, J., Xiang, Y.: Detecting and preventing cyber insider threats: a survey. IEEE Commun. Surv. Tutor. 20(2), 1397–1417 (2018). Second quarter 15. Homoliak, I., Toffalini, F., Guarnizo, J., Elovici, Y., Ochoa, M.: Insight into insiders and it: a survey of insider threat taxonomies, analysis, modeling, and countermeasures. ACM Comput. Surv. (CSUR) 52(2), 1–40 (2019) 16. Exfiltration Over Web Service, MITRE ATT&CK, September 2020. https:// attack.mitre.org/techniques/T1567 17. Alshamrani, A., Myneni, S., Chowdhary, A., Huang, D.: A survey on advanced persistent threats: techniques, solutions, challenges, and research opportunities. IEEE Commun. Surv. Tutor. 21(2), 1851–1877 (2019) 18. Su, Y.H., Cho, M.C.Y., Huang, H.C.: False alert buster: an adaptive approach for NIDS false alert filtering. In: Proceedings of the 2nd International Conference on Computing and Big Data, pp. 58–62. ACM, October 2019 19. Nicolau, M., McDermott, J.: One-class classification for anomaly detection with kernel density estimation and genetic programming. In: European Conference on Genetic Programming, pp. 3–18. Springer (2016) 20. Breslau, L., Cao, P., Fan, L., Phillips, G., Shenker, S.: Web caching and Zipf-like distributions: evidence and implications. In: Proceedings of IEEE INFOCOM 1999 Conference on Computer Communications, pp. 126–134. IEEE, March 1999


21. Adamic, L.A., Huberman, B.A.: Zipf’s law and the Internet. Glottometrics 3(1), 143–150 (2002) 22. Guo, L., Tan, E., Chen, S., Xiao, Z., Zhang, X.: The stretched exponential distribution of internet media access patterns. In: Proceedings of the 27th ACM Symposium on Principles of Distributed Computing, pp. 283–294. ACM, August 2008 23. List of Most Popular Websites, Wikipedia, September 2020. https://en.wikipedia. org/wiki/List of most popular websites

Analyzing Co-occurrence Networks of Emojis on Twitter Hasan Alsaif, Phil Roesch, and Salem Othman(B) Wentworth Institute of Technology, Boston, MA 02115, USA {alsaifh1,roeschp,othmans1}@wit.edu

Abstract. Emojis have become an integral part of digital communication in recent years. They promote the expression of emotions and ideas through visual icons of people, animals, and symbols. Twitter provides a platform for discussion and commentary in real time with a diverse and global audience, making it a good candidate for studying the usage of emoji across various topics. In this paper, 296260 tweets containing emojis have been used to create undirected, weighted co-occurrence networks of emojis with respect to dating, music, US politics, the Olympic Games, and epidemics, in addition to a control group. For each co-occurrence network, we find related tweets by searching Twitter using associated terms. Node weight and eigenvector centrality measures are used to compare the co-occurrence networks to one another. We find that two emojis are the most influential in the epidemics co-occurrence network, occupying first and second place, respectively, in terms of eigenvector centrality despite not being in the top 10 emojis in terms of node weight. Keywords: Emoji · Co-occurrence network · Graph · Social network · Twitter · Tweets · Centrality measures · Dating · Music · US politics · Olympics · Epidemics · Graph analysis

1 Introduction Digital online communication has seen a rapid increase in the last 10 years since the introduction of smartphones, with platforms like Facebook, Instagram, and Twitter being a driving force for that change [2]. Twitter has become a source for bite sized information (280 characters) in the form of news, discussion, brand awareness, and more. The constrained length of tweets in addition to emoji enabled keyboards on smartphones compelled people to use visual expression to emphasize and convey emotions not easily expressed through text alone. Social media has fundamentally changed the way people communicate in the 21st century, making it a daily practice for the typical person and allowing participation in a global community of shared interests such as music, sports, and US politics. Merriam Webster defines social media as: forms of electronic communication (such as websites for social networking and microblogging) through which users create online communities to share information, ideas, personal messages, and other content (such as videos).



Twitter in particular has increased the level of engagement with current events, allowing for instantaneous commentary and enriching the news cycle. This increase in engagement has made the platform attractive to advertisers, and funneled money to reach a wider audience, and propel business into the new mainstream of online presence. The new form of advertising strategy proved to be effective in making a product or services more compelling by reinforcing the brand experience in the mind of the customer. Brands also evolved their strategy to match the casual attitude of social media by replying to customers, posting memes, and using emojis just like the average person. These factors paved the way for the prevalent use of emojis on Twitter, with an estimated 300 million active users [5], making it an integral part of culture today. Emojis are ideograms used in text-based communication and are treated by digital devices as encoded characters. They are different from emoticons, since they are pictures as opposed to typographic approximations. Emojis are part of the Unicode Consortium which allows them to be universally viewed on most digital devices. With Unicode 13.0 released in March 2020, there are 3304 represented emojis across many categories including facial expressions, everyday objects, animals, and symbols [4]. The Oxford Dictionary named the Word of the Year in 2015, and recognized the impact of emojis on popular culture. Pop culture artifacts that highlight the significance of emojis include the musical Emojiland premiered in 2016, and the Emoji Movie released in 2017. According to the study “A Global Analysis of Emoji Usage”, 19.6% of tweets from a dataset of 12,451,835 tweets contained emoji [3]. Co-occurrence networks provide a collective interconnection of terms based on their paired presence within a specified unit of text [6]. They are used to understand the relationship between people, organizations, concepts, biological organisms, and other entities represented within written material. Being able to compare the emoji co-occurrence networks to one another allows us to understand the usage of emoji across various topics. The evaluation of emoji is done using the centrality measures of node weight as well as eigenvector. Each centrality measure uncovers a property of the interconnectedness of the emoji in the network. We further discuss the details of these measures in Sect. 4. Centrality measures have been traditionally used to identify the most influential people in a social network, as well as super spreaders of disease, and in our case the most important emojis in a set of tweets. Each network centrality measures a different property of the network, which makes it difficult to settle on one universal definition of importance. Therefore, we consider the overall contribution to cohesiveness that each centrality contributes. In the remainder of this paper, we talk about related works in Sect. 2 and data collection methodology as well as search terms in Sect. 3. We elaborate on constructing the emoji co-occurrence networks in Sect. 4, and discuss the results of each network in Sect. 5. Finally, we conclude our paper in Sect. 6.

2 Related Work The work by Illendula and Yedulla [1] created embeddings from a co-occurrence network of emojis in order to understand sentiment. Early research in sentiment analysis was focused on creating word embeddings from a large corpus of text such as Wikipedia, which is limiting for understanding the sentiment of emoji. To solve this problem, Anurag


Illendula and Manish Reddy Yedulla constructed an emoji co-occurrence obtained from tweets, and used its features to learn emoji embeddings in a low dimensional vector space to understand sentiment and emotion of emoji in the context of posts on social media. To evaluate their embeddings, they used the gold-standard dataset for sentiment analysis. Our co-occurrence network construction method is different in that it allows nodes to have self-connections, such that when an emoji appears next to itself, its graph representation is denoted by the emoji node with an edge connected to itself. Lastly, our work is not concerned with sentiment analysis, but instead analyzes and compares cooccurrence networks about different topics with each other using node weights and eigenvector centralities.

3 Data Collection 3.1 Topics The topics used for this research include dating, music, US politics, Olympics, epidemics, in addition to a control set. Dating is a popular topic of interest amongst young people, who are likely to use emojis with a higher frequency, thus contributing to the richness of emojis in the datasets. Music is of interest because of the size and diversity of the audience interested in that topic, as well as Twitter being a dominant platform for marketing music and allowing artists to connect with their base. US politics is a topic of interest since this research was conducted amidst the 2020 US presidential elections. It is a topic of interest since the tweets associated with it are timely, and the competitive nature of this topic allows for a constant stream of developments. In addition, the election is happening post the president’s impeachment turmoil, and during the early breakout of the coronavirus pandemic, thus likely increasing the emotional sentiments of related tweets. The topic of the 2020 Tokyo Summer Olympics is also viable since Twitter is used a lot for commentating on sporting events. Epidemics are another topic of interest since this research is being conducted amidst the coronavirus outbreak. Therefore, we expect this topic to be highly influenced by coronavirus and localized to places that are being most affected by it. Lastly, a control dataset of the most common words as search term is used as a reference point for comparison with the other topics. 3.2 Collection Methodology To collect relevant data for each of the topics discussed in Sect. 3.1, search terms are defined for each data set collection. The search terms for a given topic are intentionally neutral to minimize bias and sufficiently capture the emojis representing it during the time of collection. For example, in our Olympics dataset, we avoid using search terms of athletes and instead only pick the names of the official Summer Olympics Games. The search terms for each dataset are then queried to the Twitter API using the Python library Tweepy which returns respective tweet datasets for each topic. Although this approach does not fully eliminate bias of sentiment regarding some terms, it results in a viable starting point for studying the co-occurrence network of emojis in tweets.
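A minimal sketch of such a collection step is shown below, assuming the Tweepy 3.x streaming API; the credentials, file path, and example search terms are placeholders, and the 3-hour collection window and per-topic term lists would be handled by the surrounding script.

```python
import tweepy

class TopicListener(tweepy.StreamListener):
    """Append the text of each matching tweet to a file, one tweet per line."""
    def __init__(self, out_path):
        super().__init__()
        self.out = open(out_path, "a", encoding="utf-8")

    def on_status(self, status):
        # Prefer the full text when the tweet is truncated (extended tweets).
        text = status.extended_tweet["full_text"] if hasattr(status, "extended_tweet") else status.text
        self.out.write(text.replace("\n", " ") + "\n")

    def on_error(self, status_code):
        return status_code != 420   # disconnect when rate limited, otherwise keep streaming

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")   # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
listener = TopicListener("epidemics_tweets.txt")                # placeholder output file
stream = tweepy.Stream(auth=auth, listener=listener)
stream.filter(track=["coronavirus", "epidemic", "quarantine"], languages=["en"])  # example terms
```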


3.3 Search Terms For each of the topics, we use the search terms below and collect corresponding tweets for 3 h using the Python library Tweepy. Dating: Arranged marriage, Autoerotic, Baby daddy, Baby momma, Bae, Break up, Bumble, Coffee meets bagel, Condom, Cuck, Cuckold, Cuffing Season, Date, Dated, Dates, Dating, deep like, Dildo, Dildos, Divorce, Dry spell, DSL, DTF, DTR, FBO, First date, Flaking, Ghosted, Ghosting, Gonorrhea, Grinder, Hinge, Hit it off, Hit off, Hitting it off, Honeymoon, Honeymoon phase, Hooked up, Hooking up, Hook up, Kiss, Kissed, Kissing, Left swipe, Left swiped, Left swipes, Love, Mail order bride, Marriage, Match, Matched, Matches, Netflix and chill, One night stand, Orgasm, Protected sex, Relationship, Right swipe, Right swiped, Right swipes, Saturdays are for the boys, Sex, Situationship, Slow Fade, Stashing, Std, Stds, Std’s, Sti, Stis, Sti’s, Sugar daddy, Sugar momma, Super like, Super liked, Swiped left, Swiped right, Textlationship, The clap, The love of my life, The pill, Thirst trap, Thirsty, Tinder, Valentine, Wedding. Music: Album, Apple music, Concert, Dropping a record, Ep, Hip hop, hip-hop, Lp, Mix tape, Mumble rap, Music, Pandora, Playlist, Pop, Producer, Rap, Record, single, Song, Sound cloud, Spotify, Tidal, Tour, Track list, Trap, Vinyl. US Politics: Democrat, Democrats, Democratic, Republican, republicans, GOP, Liberal, Liberals, Conservative, conservatives, Bernie, Bernie’s, Sanders, Sanders’, Sanders’s, Bernie bros, Feel, the bern, Buttigieg, Biden, Biden’s, Warren, Warren’s, Donald, Donald’s, Trump, Trump’s, Yang, Yang’s, Yang gang, Legislation, Absentee Ballot, Ballot, ballots, Ballot Initiative, Campaign finance disclosure, Caucus, Constituent, constituents, Delegate, Delegates, Super Delegate, Super Delegates, Pledged Delegate, Pledged Delegates, Unpledged delegate, Unpledged delegates, Convention, conventions, District, districts, Election, Elector, electors, Electoral college, Electoral vote, Electoral votes, General election, Impeachment, Inauguration, Incumbent, Midterm election, nominee, Platform, Political action committee, PAC, Political party, Polling place, Polling station, Popular vote, Precinct, Precincts, Election district, Election districts, Voting district, Voting districts, Primary elections, Provisional ballot, Recall election, Recount, Referendum, Registered voter, Registered voters, Sample ballot, Special election, Super tuesday, Term, Term limit, Ticket, Town hall meeting, Town hall debate, Voter fraud, Election fraud, Voter intimidation, Voter suppression, Voting guide, Voter guide. Olympics: Climbing, Baseball, Basketball, Gymnastics, Equestrianism, Soccer, Tennis, Track and field, Boxing, Fencing, Volleyball, Wrestling, Surfing, Golf, Artistic swimming, Ice Hockey, Modern Pentathlon, Rowing, Shooting, Swimming, Olympics weightlifting, Skateboarding, Tug of war, Water polo, Badminton, Karate, Archery, Sport Climbing, Handball, Taekwondo, Judo, Table tennis, Curling, Triathlon, Field hockey, Beach volleyball, Rhythmic gymnastics, race walking, Track cycling, Tokyo 2020, Summer Olympics. Epidemics: Batsoup, Contagious disease, Coronavirus, Coronavirus, Disease, Ebola, Epidemic, Epidemics, Fatality rate, Illness, Influenza, Malaria, Outbreak, pandemic, Plague, Plagues, Quarantine, Swine flu, Wou flu, Wu flu, Wuhan.


4 Constructing Emoji Co-occurrence Network The emoji co-occurrence networks are constructed from 296260 tweets containing emoji with respect to the topics of dating, music, US politics, Olympic Games, epidemics, in addition to a control group. For each topic, the corresponding tweets are collected over a period of three hours by searching Twitter using related terms discussed in Sect. 3.3. The tweets were collected on February 29th, 2020. Each distinct emoji in a tweet generates a node of m edges where m is the number of distinct emojis in the tweet. The weight of a node signifies the number of occurrences of that node, while the weight of an edge signifies the number of co-occurrences of nodes incident to the edge. If a node appears more than once in a tweet, a self-looping edge is added signifying the co-occurrence of the node to itself. Figure 1 shows the construction of a co-occurrence network from a sample tweet. The co-occurrence network gains more edge connections as well as nodes with additional tweets. Figure 2 shows the growth of the co-occurrence network from a sample of one tweet in Fig. 1 to a sample of two tweets. In the case of the co-occurrence network in Fig. 2, because the and emojis already exist, their node weights as well as edges are incremented according to the new additional tweet. However, the node is not previously present in the network, and is thus added with incident edges to and .
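The construction described above could be implemented, for example, as in the following sketch using networkx; the emoji extraction step, the attribute names, and the decision to count every occurrence toward the node weight are our assumptions.

```python
from itertools import combinations
import networkx as nx

def add_tweet(graph: nx.Graph, emojis_in_tweet: list) -> nx.Graph:
    """Update the co-occurrence network with the emojis found in one tweet."""
    for e in emojis_in_tweet:                       # node weight: occurrences across tweets (assumed per occurrence)
        if graph.has_node(e):
            graph.nodes[e]["weight"] += 1
        else:
            graph.add_node(e, weight=1)
    distinct = sorted(set(emojis_in_tweet))
    for a, b in combinations(distinct, 2):          # one co-occurrence edge increment per tweet and pair
        w = graph.get_edge_data(a, b, {"weight": 0})["weight"]
        graph.add_edge(a, b, weight=w + 1)
    for e in distinct:                              # self-loop when an emoji appears more than once in a tweet
        if emojis_in_tweet.count(e) > 1:
            w = graph.get_edge_data(e, e, {"weight": 0})["weight"]
            graph.add_edge(e, e, weight=w + 1)
    return graph
```

The node weights and edge weights accumulated this way are the quantities analyzed in Sect. 5.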

Fig. 1. Co-occurrence Network Construction Example.


Fig. 2. The Co-occurrence Network Created from the Control Tweets Dataset. The Graph is Filtered to Only Include Nodes with a Weight Greater than 100, And Edges Weights Greater than 10.

5 Results 5.1 Eigenvectors Eigenvector centrality plays a significant role in understanding the structure of our graph since it allows us to find the most influential node [7] or emoji. Figure 3 below is the mean of the eigenvectors from the five topic datasets sorted by descending order.


Fig. 3. Emojis from the Dating, Music, US Politics, Olympics, and Epidemics Graphs Sorted by their Average Eigenvector Value.

We see that ❤ has the highest eigenvector value. However, taking the intersection of the top eigenvector emojis from each dataset gives slightly different results. The example below shows the intersection of the top six emojis for three datasets, resulting in the face with tears of joy emoji. One of the first questions we ask is: how do we compare the co-occurrence networks to each other? We try to answer this question by finding the intersection of the top emojis in terms of eigenvector, as illustrated by the Venn diagram (Fig. 4). However, finding the first intersecting emoji for all five datasets requires that we expand to the top 15 emojis from each dataset. We further increase the number of top emojis in our comparison to find more emojis in the intersection of our datasets, as displayed in Fig. 5. The numbers on the y-axis are calculated by averaging the eigenvector values of each emoji across all datasets, and the x-axis denotes the number of top emojis required to find the intersecting emoji.
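A minimal sketch of this comparison is given below, assuming the co-occurrence graphs are networkx graphs with edge weights as in Sect. 4; the function names and the cut-off are ours.

```python
import networkx as nx

def ranked_emojis(graph: nx.Graph):
    """Return emojis sorted by weighted eigenvector centrality, highest first."""
    centrality = nx.eigenvector_centrality(graph, weight="weight", max_iter=1000)
    return [e for e, _ in sorted(centrality.items(), key=lambda kv: -kv[1])]

def first_common_top_k(graphs: dict, max_k: int = 50):
    """Smallest k such that the top-k emojis of every topic graph share at least one emoji."""
    rankings = {topic: ranked_emojis(g) for topic, g in graphs.items()}
    for k in range(1, max_k + 1):
        common = set.intersection(*(set(r[:k]) for r in rankings.values()))
        if common:
            return k, common    # e.g. k = 15 for the five topic graphs in this study
    return None, set()
```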


Fig. 4. Venn Diagram Displaying the Top 6 Emojis from the US Politics, Olympics, and Epidemics Co-Occurrence Networks.

Fig. 5. The Intersection of the Top Emojis from the Dating, Music, US Politics, Olympics, and Epidemics Graphs. The y-Value of the Emojis is given by their Average Eigenvector Value Across Graphs.


5.2 Dating The weight of ❤ in this dataset is the most frequent (Fig. 7), and is quite fitting for tweets related to dating. For the eigenvector centrality however, which is an indicator of nodes connecting other nodes in the graph to each other, ✨occupies third place after the heart emoji (Fig. 6). Although ✨is the ninth most frequent emoji in the dataset, about four times less frequent than ❤ , it plays a significant role in connecting other emojis to each other in the network.

Fig. 6. The Top 10 Emojis by Eigenvector in the Dating Graph.

Fig. 7. The Top 10 Emojis by Node Weight in the Dating Graph.


5.3 Music For the music dataset, the top three emojis in terms of frequency are the , , and respectively (Fig. 9). In terms of eigenvector centrality, the , ❤ and ✨ emojis take the top three places respectively. However, the fourth place in terms of eigenvectors is (Fig. 8), which did not appear in the top ten nodes in terms of frequency.

Fig. 8. The Top 10 Emojis by Eigenvector in the Music Graph.

Fig. 9. The Top 10 Emojis by Node Weight in the Music Graph.

5.4 US Politics For the US politics dataset, the most popular nodes include , , and (Fig. 11). as well as emojis are also quite popular. The fire emoji occupies sixth place, while


is ninth in ❤ occupies ninth place in terms of frequency. Surprisingly, although occupies terms of frequency, it occupies sixth place for eigenvector centrality and first (Fig. 10). comes is in sixth place, while ✨ comes in fifth in terms of eigenvector centrality.

Fig. 10. The Top 10 Emojis by Eigenvector in the US Politics Graph.

Fig. 11. The Top 10 Emojis by Node Weight in the US Politics Graph.


5.5 Olympics For the Olympics dataset, the most popular emojis include , ✅, , , and surprisingly (Fig. 13). In terms of eigenvalue, ❤ occupies first place, ✨ second place, third occupies third place in eigenvector centrality, while place (Fig. 12). However, occupies fifth place.

Fig. 12. The Top 10 Emojis by Eigenvector in the Olympics Graph.

Fig. 13. The Top 10 Emojis by Node Weight in the Olympics Graph.


5.6 Epidemics For the epidemics dataset, two emojis stand out, occupying fourth and fifth place, respectively, in terms of node frequency (Fig. 15). They are followed by five emojis signifying countries impacted by the coronavirus. Surprisingly, the top two emojis in terms of eigenvector centrality (Fig. 14) did not appear in the top ten nodes in terms of frequency. Emojis such as ❤ and ✨, which usually appear high in eigenvector centrality, have relatively low scores in the epidemics dataset. An interesting finding concerns the country emoji that sits at number thirty in terms of node frequency with a weight of 1229, compared to the top node with a weight of 6118: it nevertheless appears at number four in terms of eigenvector centrality and is the only country to make it into the top 10.

Fig. 14. The Top 10 Emojis by Eigenvector in the Epidemics Graph.

5.7 Control Dataset Once we analyze the centralities of the control dataset, we can determine a handful of emojis are most responsible for connecting the nodes of the network to each other. is the most frequent with about twice as many as at second place and three times as many as at third place. When we compare the other datasets to the control, we can observe that some emojis appear frequently in the network such as as well as , indicating that users who use these emojis use them frequently in any given situation. Another observation is that some emojis may not appear as frequently, but have an overall more significant role such as , ❤, and ✨ emojis with high eigenvector centralities.


Fig. 15. The Top 10 Emojis by Node Weight in the Epidemic Graph.

5.8 Data The tables below present data regarding the topics discussed in Sect. 3.1: the number of tweets used to construct each co-occurrence network (Table 1), the top emojis by node weight (Table 2), and the top emojis by eigenvector for each of the given topics (Table 3).

Table 1. Number of tweets with emojis in the Control, Dating, Music, US Politics, Olympics, and Epidemics datasets.

Control  Dating   Music  US Politics  Olympics  Epidemics  Total
68293    125426   59278  15193        7002      21068      296260


Table 2. Top 10 emojis by node weight in the Control, Dating, Music, US Politics, Olympics, and Epidemics graphs (one column of emoji and weight pairs per dataset).





Table 3. Top 10 emojis by eigenvector in the Control, Dating, Music, US Politics, Olympics, and Epidemics graphs (one column of emoji and eigenvector-value pairs per dataset).




6 Conclusion From the results of our study, we conclude that some emojis, such as ✨ and ❤, have a more important role in influencing the network, as measured by their eigenvectors, and this role appears to be universal across the five datasets as well as the control. However, in some cases less common emojis become outliers, as in the epidemics graph. In the epidemics dataset, the rapid global spread of coronavirus and the recommendation of health experts to wear masks were contributing factors for the corresponding emojis to become the most important in terms of their eigenvector measurements. These nodes have high eigenvectors and appear to gain influence even with a relatively small frequency compared to the nodes with the highest frequency in the graph. In other cases, some nodes have the converse property, with a high frequency but very little influence in terms of eigenvectors, as seen in the music graph. This suggests a non-deterministic relationship between the frequency of emojis in the graph and their eigenvectors. The eigenvector of an emoji is, of course, dependent on the news and the search terms of the dataset at the time of collection. This lack of correlation also appears among universally popular emojis: one emoji, for example, is widely popular and holds the second highest average eigenvector, yet it does not make the top ten eigenvectors in the music graph even though it places second highest there in terms of frequency.

7 Future Work We can directly expand this study by collecting larger data at a different time to see how the frequency of emojis change, as well as collecting data from sources other than Twitter. This will perhaps show more insight as to how the medium affects people’s usage of emojis, and also show which emojis are commonly used across different mediums for the same topic and over time. Our co-occurrence network can also be used for a user predictive model by analyzing emojis that are clustered together. In a similar manner, Google has released Emoji Kitchen, a Gboard feature that allows users to combine multiple emojis into a new singular emoji in February 2020 [8]. The network can be expanded by including the given text for prediction which creates a co-occurrence of emojis and words.

References 1. Illendula, A., Yedulla, M.R.: Learning emoji embeddings using emoji co-occurrence network graph. In: 1st International Workshop on Emoji Understanding and Applications, ICWSM (2018) 2. Edosomwan, S., Prakasan, S.K., Kouame, D., Watson, J., Seymour, T.: The history of social media and its impact on business. J. Appl. Manag. Entrepreneurship 16(3), 79–91 (2011) 3. Ljubeši´c, N., Fišer, D.: A global analysis of emoji usage. In: Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pp. 82–89 (2016) 4. Emoji Counts, v13.0. Unicode.org. https://www.unicode.org/emoji/charts-13.0/emoji-counts. html. Accessed 8 Sept 2020


5. Number of monthly active Twitter users worldwide from 1st quarter 2010 to 1st quarter 2019. https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/. Accessed 8 Sept 2020 6. Ozdikis, O., Senkul, P., Oguztuzun, H.: Semantic expansion of tweet contents for enhanced event detection in Twitter. In: ACM International Conference on Advances in Social Network Analysis and Mining 2012, Istanbul, pp. 20–24. IEEE (2012) 7. Bonachich, P.: Some unique properties of eigenvector centrality. Soc. Netw. 29(4), 555–564 (2007) 8. Feeling All the Feels, There’s an Emoji for That. https://blog.google/products/android/feelingall-the-feels-theres-an-emoji-sticker-for-that/. Accessed 11 Jan 2021

Development of an Oceanographic Databank Based on Ontological Interactive Documents Oleksandr Stryzhak1 , Vitalii Prykhodniuk1 , Maryna Popova1(B) , Maksym Nadutenko2 , Svitlana Haiko3 , and Roman Chepkov4 1 National Center “Junior Academy of Sciences of Ukraine”, 38-44, Dehtyarivska Street,

Kyiv 04119, Ukraine 2 Ukrainian Lingua-Information Fund of NAS of Ukraine, 3, Holosiivskyi Avenue,

Kyiv 03039, Ukraine 3 Institute of Telecommunications and Global Information Space of NAS of Ukraine, 13,

Chokolivs’kyi Blvd., Kyiv 03186, Ukraine 4 Scientific and Research Institute of Geodesy and Cartography, 69, Velyka Vasyl’kivs’ka

Street, Kyiv 03150, Ukraine

Abstract. The article proposes an approach to development of an oceanographic databank based on the formation of an ontological representation of information arrays and accessing them with an ontology-defined interface. For processing large amounts of oceanographic data located in disparate archives and databases (including those represented as natural language documents) recursive reduction method is proposed. For providing interactive access to information and services ontological interactive document is suggested. Its core element, called ontological presentation template, ensures the most effective work of the experts with the information and allows them to change system’s structure, composition and functions with minimal time. Control ontologies as a mechanism for defining the configuration of the natural system can be used. For increasing the efficiency of the natural system by optimizing the information transferring process ontological integration descriptors are suggested. Approach to oceanographic data processing that includes recursive reduction method for textual scientific reports and measurement result files processing is presented. Oceanographic databank software system with ontology-based user interface and GIS application viewing mode is presented. Keywords: Oceanology · Oceanography · Data processing · Natural language processing · Word processing · Data storage systems · Databank · Ontology · Interactive document

1 Introduction Information support serves as the basis for decision-making in the field of maritime activities at all levels. Domestic and foreign experience in the field of retrieving and using environmental information shows that the greatest results can be obtained if all


types of activities for the collection, permanent storage, processing and exchange of oceanographic data are carried out within the framework of a single information system. Such oceanographical systems have already been created or are being created in many countries of the world. For Ukraine, until recently (2014) processes for collecting, storing and exchanging oceanographical data were carried by organizations, located in the Crimea. After their loss, such processes were mostly carried independently by different organizations, and direct information exchange between them mostly ceased. As the result, large volumes of vital data, currently present in Ukraine, are usually stored in disparate archives and databases and inaccessible to a wide range of consumers and interested professionals. Those data include hydrological, hydrophysical, hydrochemical, hydrometeorological, geological-geophysical measurements of the marine environment, data on living and non-living resources, storm disturbance, seismic characteristics of the Azov-Black Sea and the Black Sea basin. Nowadays the issue is being raised about reorganization of these processes with creation of the specialized databank for Academy of Sciences of Ukraine to serve as central data storage and exchange hub. Creation of such databank requires powerful IT-infrastructure with ability to process large volumes of weakly structured data automatically. Such infrastructure should also provide tools to extract measurements from natural-language texts, as some of the data was lost and has to be recovered from secondary sources (including scientific reports, journal articles etc.). An approach for creating such databank is proposed, which is based on method of recursive reduction [1–3] for structuring weakly structured and unstructured data (including natural-language texts) and on creation of ontological interactive documents for further processing, displaying and exchanging resulting structured data. The structure of the paper is the following. Section 2 describes current state of oceanographical data exchange procedures and challenges met by Ukraine in this field. Section 3 describes recursive reduction method used to structure weakly structured arrays of data (including natural-language texts). In Sect. 4 the description of the ontological interactive document is given, which is a kind of information system that uses ontologies for storing information to be processed and as a configuration for defining available functions. In Sect. 5, the input array of oceanographic data and approach for its processing are described. Section 6 presents oceanographic databank software system.

2 State of the Art Collecting and exchanging oceanographical data is a very important task, for which large networks of various systems and institutions are deployed worldwide. Many of them are deployed under various international programs, like The European Marine Observation and Data Network (EMODnet) and Copernicus Marine Environment Monitoring Service (CMEMS). This process is governed by Intergovernmental Oceanographic Commission, which states, that all oceanographical data should be generally routed through National Oceanographic Centers through radio communications or Internet [4], which requires corresponding infrastructure to be deployed in specific country. Ukraine needs deployment of such infrastructure, which requires substantial time and funding. Currently most of oceanographical data is collected during expeditions, and


the subsequent processing (such as verification and error correction) is performed either manually or with basic semi-automatic tools. The collected data are stored as arrays of weakly structured information or as parts of scientific reports (which are mostly natural-language texts) [5]. Such data cannot be used effectively, and exchanging them through standard procedures requires large amounts of manual labor to structure them properly. The presented approach to creating an oceanographic databank eliminates most of this manual labor and streamlines the data collection and exchange processes. Once deployed, such a databank can become the basis of an automatic data collection infrastructure.

3 The Method of Recursive Reduction

A large amount of the oceanographic data available in Ukraine is located in disparate archives and databases, often in formats not optimized for automatic processing. Accessing a large part of these data requires natural language processing. For this task a method of recursive reduction [1, 2, 6, 7] is proposed. The method of recursive reduction is used to structure weakly structured and unstructured documents. The results of applying this method are represented as ontologies [8–10], which can be viewed as ordered triples (1):

O = ⟨X, R, F⟩   (1)

where X is the set of objects of the subject area, R is the set of relations between objects, and F is the set of interpretation functions of X and R. The structuring of a certain natural-language text can be represented as a kind of transformation (2):

F_{str}: T_T → O   (2)

Any natural-language text is represented by a set of lexemes L, on which a precedence relation [1, 2, 11–13] is defined. This relation turns L into a linearly ordered set. In addition, the text can be represented as a sequence of sentences (3), on which the same precedence relation is defined:

T_T = {S_1 ≺ S_2 ≺ … ≺ S_{ns}}   (3)

where ns is the total number of sentences in the text. Each sentence S_i, in turn, is represented by some subset of lexemes (4):

L_{S_i} = {l_{i1} ≺ l_{i2} ≺ … ≺ l_{in_i}}   (4)

where n_i is the number of lexemes in the i-th sentence. Each lexeme, in turn, has the structure (5):

l_{ij} = ⟨l_{ij}^T, P_{ij}⟩   (5)

where l_{ij}^T is the textual representation of the lexeme l_{ij} and P_{ij} are the characteristics of l_{ij}.


A lexeme can be connected with other lexemes by relations (6):

r_{sn} = ⟨l^1, l^2, k⟩   (6)

where l^1, l^2 are the lexemes between which there is a relation and k is the type of the relation. Thus, a directed graph representing the basic structure of a natural-language text has the form (7):

T_{sn} = ⟨L, R_{sn}⟩   (7)
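As an illustration, the sentence-level structures (3)–(7) can be sketched in code. The following Python fragment is only a minimal illustration under simplifying assumptions (a whitespace tokenizer and a single "precedes" relation type); the class and function names are hypothetical and are not part of the method's actual implementation.

from dataclasses import dataclass

@dataclass(frozen=True)
class Lexeme:
    """A lexeme l = <l^T, P>: textual representation plus a set of characteristics."""
    text: str
    characteristics: frozenset = frozenset()

def sentence_to_graph(sentence):
    """Build T_sn = <L, R_sn> for one sentence.

    Only the linear precedence relation is produced here; a real lexical
    analyser would add further grammatical relations between lexemes."""
    lexemes = [Lexeme(token) for token in sentence.split()]
    relations = {(lexemes[i], lexemes[i + 1], "precedes")
                 for i in range(len(lexemes) - 1)}
    return lexemes, relations

L, R_sn = sentence_to_graph("Salinity 18.2 psu was measured at station 12")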

3.1 Reduction Operator

The first step of the transformation (2) is forming a structure of the form (7) based on the input text. This process is carried out using lexical analysis. Then the reduction operator (8) is applied recursively to this structure. The reduction operator is described in terms of λ-expressions [1, 2, 14, 15] and, in turn, is a combination of three other operators applied sequentially:

F_{rd} = F_x ∘ F_{smr} ∘ F_{ct}   (8)

where F_x is an object identification operator, F_{smr} is a relation identification operator, and F_{ct} is a context identification operator. The operator (8) is applied recursively, for which the fixed-point operator [14] is used (9):

Y = λf.(λx.f(xx))(λx.f(xx))   (9)

The recursion exit condition is verified by means of an auxiliary function (10):

F′ = λf x. { f(F_{rd} x),  F_{rd} x ≠ x ;  x,  F_{rd} x = x }   (10)

Each of the elements of the reduction operator (8), in turn, is formed by rules of the form (11):

g = ⟨f_{ap}^g, f_{tr}^g⟩   (11)

where f_{ap}^g is the applicability function, which determines whether a rule can be applied to a specific set of input information, and f_{tr}^g is the conversion function, which defines the conversion of the input information. The transformation specified by rule g has the form (12):

F_g(x) = { f_{tr}^g(x),  f_{ap}^g(x) ;  x,  ¬f_{ap}^g(x) }   (12)
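To make the recursion of (9)–(10) and the rule application of (11)–(12) concrete, the following Python sketch applies a set of (applicability, transformation) rules repeatedly until a fixed point is reached. It is a simplified illustration rather than the implementation described in [1, 2]; the function names are hypothetical, and the iterative loop merely plays the role of the fixed-point operator.

def apply_rule(rule, state):
    """F_g(x) from (12): transform only when the applicability test holds."""
    applicable, transform = rule
    return transform(state) if applicable(state) else state

def recursive_reduction(rules, state):
    """Apply the combined reduction step until nothing changes any more,
    which corresponds to the exit condition of the auxiliary function (10)."""
    while True:
        new_state = state
        for rule in rules:          # each rule g = (f_ap, f_tr) as in (11)
            new_state = apply_rule(rule, new_state)
        if new_state == state:      # fixed point reached: F_rd(x) = x
            return state
        state = new_state

# toy example: repeatedly strip a trailing marker from a string
rules = [(lambda s: s.endswith("."), lambda s: s[:-1])]
assert recursive_reduction(rules, "Station 12...") == "Station 12"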


3.2 Applicability Function

Each applicability function is a lambda term of the form (13):

f_{ap} = (λl_1, …, l_n . t_l(l̄) & t_r(l̄)) x_1, …, x_n   (13)

where l_i is a variable taking values on the set of lexemes, x_i is a function argument that defines the value of l_i, t_l is the lexeme identification function, and t_r is the relation identification function. The applicability function imposes a condition [1, 6, 16] on the oriented graph formed by a subset of lexemes belonging to a single sentence (4) and the relations between those lexemes. The condition is imposed sequentially on each lexeme and on each of the relations between them. The imposed conditions are defined as sets of predicates. The lexeme identification function is defined by a set of lexeme identification predicates and has the form (14). Predicates are designed to test various aspects of a particular lexeme from the input sequence. The main type of such predicates is the keyword-checking predicate (15). Other standard types are the characteristics-checking predicate (16) and the zero predicate (17). A zero predicate is used in situations where irregular lexemes are expected. Such lexemes may be absent from keyword dictionaries, and there is a probability of incorrect identification of their characteristics during lexical analysis. It is also possible to use specialized types of predicates within specific tasks, for example regular expression predicates.

t_l(l_1 … l_n) = c_1(l_1) & c_2(l_2) & … & c_n(l_n)   (14)

where c_i are lexeme identification predicates.

c(l) = { 1,  l^T ∈ L_c ;  0,  l^T ∉ L_c }   (15)

where l^T is the textual representation of the lexeme l and L_c is the keyword dictionary checked by the predicate c.

c(l) = { 1,  p_c ∈ P_l ;  0,  p_c ∉ P_l }   (16)

where P_l is the set of characteristics of the lexeme l and p_c is the characteristic checked by the predicate c.

c(l) = 1   (17)

The relation identification function has the form (18) and is determined by the set of relation identification predicates. The predicates have the form (19) and are defined by the type of connection that must exist between certain lexemes in the input sequence. In general, not all lexemes are connected by relations, so most predicates are zero predicates (20).

t_r(l_1 … l_n) = { 1,  ∀i, j ∈ [1..n], i ≠ j: r_{ij}(l_i, l_j) ;  0,  ∃i, j ∈ [1..n], i ≠ j: ¬r_{ij}(l_i, l_j) }   (18)


where r_{ij}(l_i, l_j) are relation identification predicates.

r(l_1, l_2) = { 1,  ⟨l_1, l_2, k_r⟩ ∈ R_{sn} ;  0,  ⟨l_1, l_2, k_r⟩ ∉ R_{sn} }   (19)

where k_r is the type of relation that is checked by the predicate.

r(l_1, l_2) = 1   (20)
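The predicates (14)–(20) can be sketched as follows. This is a minimal, hypothetical illustration in Python that follows the notation above; it is not the authors' implementation, and the lexeme structure and example dictionaries are assumptions made for the sketch only.

from collections import namedtuple

# Minimal stand-in for a lexeme l = <l^T, P>
Lexeme = namedtuple("Lexeme", ["text", "characteristics"])

def keyword_predicate(dictionary):
    """(15): true when the lexeme's text belongs to the keyword dictionary L_c."""
    return lambda lex: lex.text.lower() in dictionary

def characteristic_predicate(p_c):
    """(16): true when the lexeme carries the characteristic p_c."""
    return lambda lex: p_c in lex.characteristics

def zero_predicate(lex):
    """(17): accepts any lexeme; used where irregular lexemes are expected."""
    return True

def applicability(lexeme_preds, required_relations):
    """(13): conjunction of the lexeme test t_l (14) and the relation test t_r (18).

    required_relations lists (i, j, kind) triples; all other pairs are treated
    as zero relation predicates (20)."""
    def f_ap(lexemes, relations):
        if len(lexemes) != len(lexeme_preds):
            return False
        if not all(pred(lex) for pred, lex in zip(lexeme_preds, lexemes)):
            return False
        return all((lexemes[i], lexemes[j], kind) in relations
                   for i, j, kind in required_relations)
    return f_ap

# rule condition for a "<parameter> <value>" pair, e.g. "salinity 18.2"
measurement_test = applicability(
    [keyword_predicate({"salinity", "temperature"}), zero_predicate],
    [(0, 1, "precedes")],
)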

3.3 Transformation Function

The transformation function f_{tr}^g is intended for the formation of objects and relations in the resulting ontology (1), as well as of contexts, which are used to define interpretation functions. For each of the components of the reduction operator (8), the transformation functions contained in the corresponding rules have a specific structure. However, all of them are based on the name creation function (21), which, in turn, applies formatting functions to the textual representations of individual lexemes. Formatting functions can be very different depending on the task being performed.

N(l_1, …, l_n) = ⊕_{i=1..n} f_i^N(l_i^T)   (21)

(here ⊕ denotes the concatenation of the formatted textual representations)

where f_i^N is a text formatting function and l_i^T is the textual representation of the lexeme l_i. The main operations that can be performed during formatting are:

• Zero operation, i.e. use the textual representation without changes;
• Skipping a lexeme, i.e. not including its textual representation in the name;
• Normalization of the lexeme, i.e. using the corresponding word in the nominative singular (for lexemes representing words);
• Normalization of the number format (for lexemes representing numbers);
• Normalization of geospatial information – formatting various types of coordinates, if necessary with conversion between coordinate systems;
• Normalization of dates – parsing textual representations of dates and bringing them to a single format.

The transformation functions for the various stages of recursive reduction have the following structure. The object creation function is used in rules that define the object identification operator F_x and has the form (22):

f_{tr}^g(l_1, …, l_n) = X(N(l_1, …, l_n))   (22)

where X is the operation of creating an object with a given name. The relation creation function is more complex (23). These functions are used in rules that define the relation identification operator F_{smr}. The job of this function is to determine two objects and create a relation between them:

f_{tr}^g(l_1, …, l_n) = R(X(N(l_1, …, l_m)), X(N(l_{m+1}, …, l_n)))   (23)


where X is the operation of creating (or selecting) an object with a given name, R is the operation of creating a relation between two given objects, and m is an index that indicates the boundary between the object names in the input lexeme sequence. The attribute creation function has the form (24). This kind of function often requires specialized procedures for determining the object to which the attribute belongs. In the general case, this requires preserving the text processing state in a certain way (for example, remembering the last identified object):

f_{tr}^g(l_1, …, l_n) = A(N(l_1, …, l_m), N(l_m, …, l_n))   (24)

where A is the operation of creating an attribute from its name and value, and m is the index indicating the boundary between the name and the value.
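The transformation functions (21)–(24) can be illustrated with a small sketch that builds fragments of an ontology from a recognized lexeme sequence. The class name, the trivial formatters and the simplified attribute handling are assumptions made for the example only; they are not the system's actual data structures.

class MiniOntology:
    """Tiny stand-in for O = <X, R, F>: objects, relations and attributes."""
    def __init__(self):
        self.objects, self.relations, self.attributes = set(), set(), {}

    def X(self, name):                      # (22): create or select an object
        self.objects.add(name)
        return name

    def R(self, a, b, kind="related-to"):   # (23): create a relation between two objects
        self.relations.add((a, b, kind))

    def A(self, obj, name, value):          # (24): attach an attribute to an object
        self.attributes.setdefault(obj, {})[name] = value

def make_name(lexemes, formatters):
    """(21): concatenate the formatted textual representations of the lexemes.

    A formatter of None means 'skip the lexeme'."""
    parts = [f(lex.text) for f, lex in zip(formatters, lexemes) if f is not None]
    return " ".join(parts)

identity = lambda t: t                               # zero operation
normalize_number = lambda t: t.replace(",", ".")     # crude number-format normalization

onto = MiniOntology()
station = onto.X("Station 12")
onto.A(station, "salinity, psu", normalize_number("18,2"))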

4 Ontological Interactive Documents

An ontological interactive document [1, 2, 17, 18] is a kind of information system that uses ontologies as a source of information both for displaying to the end user and for defining the available services. The simplest interactive ontological document has the form (25). The main element of an interactive document is a natural system S_N [19], which provides interactive access to the information contained in the ontology O:

⟨O, S_N⟩   (25)

A natural system S_N can be represented as a function (26) [19]:

y = f(x^1, …, x^n)   (26)

where x^i are actions performed by a user when interacting with the information, y is the result of the work of the system in the form of textual and graphical information, and f is the target function of the system, which can be represented as (27):

f(x^1, …, x^n) = D(Q_{n−1}(… Q_1(X, x^1) …, x^{n−1}), x^n)   (27)

where X is the set of objects of the initial ontology O, Q_i are information processing functions, and D is the information display function. Examples of information processing functions include the functions of hierarchical (28) and attribute (29) filtering [1, 2]:

Q_h(X, x*) = {x̄ ∈ X | x̄ R̄ x*}   (28)

where X is the set of ontology objects (or a certain subset of them), x* is the object being filtered against, and R̄ is the specific relation between objects.

Q_a(X, A) = {x̄ ∈ X | A ∩ A_{x̄} = A}   (29)

where A is the set of attributes being filtered on and A_x is the set of attributes of an object x.
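A minimal sketch of the two processing functions, under the assumption that objects, relations and attributes are kept in plain Python collections (the data and names below are illustrative only):

def hierarchical_filter(objects, relations, x_star, kind="part-of"):
    """Q_h (28): keep the objects linked to x_star by the given relation."""
    return {x for x in objects if (x, x_star, kind) in relations}

def attribute_filter(objects, attributes, wanted):
    """Q_a (29): keep the objects whose attributes contain all wanted attributes."""
    return {x for x in objects if wanted <= attributes.get(x, set())}

# toy data: two measurement stations inside one survey
objects = {"Station 1", "Station 2", "Azov Sea survey"}
relations = {("Station 1", "Azov Sea survey", "part-of"),
             ("Station 2", "Azov Sea survey", "part-of")}
attributes = {"Station 1": {"salinity"}, "Station 2": {"salinity", "temperature"}}

in_survey = hierarchical_filter(objects, relations, "Azov Sea survey")
with_temperature = attribute_filter(in_survey, attributes, {"temperature"})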


These functions allow solving a wide range of problems in displaying a variety of information, including information obtained from spatially distributed sources that can differ in format and structure. However, when solving specific problems, such as building an oceanographic databank, these functions may be insufficient. In particular, most of the data intended for placement in the databank has a geospatial component, for which a GIS application is the natural means of representation [1, 3, 13, 20–23]. To ensure maximal efficiency of processing the information represented as an interactive document, its natural system should take into account the characteristics of the problem being solved. Moreover, in more complex subject areas (which include oceanography) various types of data can exist that radically differ in structure and require special approaches to processing and displaying. This requires effective means of configuring natural systems, which allow changing their composition and structure with minimal effort. Such changes may be intended for:

• Adding specialized information display functions (in particular, for oceanographic data, which should be displayed in a GIS application);
• Adding specialized information processing functions (selection of measurements in a specific date range, selection of measurements in a specific geographical region, etc.);
• Adding specialized data preprocessing functions (clustering objects on a map, interpolating measurements, etc.).

Specialized ontologies (control ontologies) can be used as a mechanism for defining the configuration of the natural system.

4.1 Ontological Presentation Templates

An ontological presentation template is a kind of control ontology designed to define additional modules of a natural system. These modules can be included in the structure of the system, changing its behaviour in accordance with the requirements of the task. Using an ontological template within an interactive document turns it from a pair (25) into a triple (30):

⟨O, O_D, S_N⟩   (30)

where O_D is the ontological presentation template. The generalized information model of a natural system that supports ontological presentation templates should have the form (31):

Π_{S_N} = ⋃_{i=0}^{n} Π_{S_N}^i ∪ G_T(O_D)   (31)

where Π_{S_N}^i are standard modules that provide basic data processing and display functions and G_T is the transformation of interpretation of an ontological presentation template. The basic standard module of such a natural system is the system controller Π_0, which controls interactions between the other modules as they perform the target function


of the system (27). The system controller provides the transformation of integration of the functions of the individual modules (32):

G_C: ⋃_{i=0}^{n} S_i ∪ S_T(O_D) → f   (32)

where G_C is the transformation of integration, S_i are the functions of the standard modules Π_{S_N}^i, and S_T(O_D) are the functions obtained from interpreting an ontological presentation template. Interpreting is divided into two parallel processes – the creation of data processing functions and of data display functions (33):

S_T(O_D) = ⋃_{x∈X_D} G_Q(x) ∪ ⋃_{x∈X_D} G_D(x)   (33)

where X_D is the set of objects belonging to the ontology O_D, and G_Q and G_D are interpretation transformations intended for the creation of data processing and presentation functions, respectively. The use of transformations of the form (33) means that the natural system will contain many information presentation functions, which in the general case cannot be used simultaneously. Therefore, the structure of the target function (27) enables the user to select the display function that will work at the moment. For this, an auxiliary switch function (34) should be provided by the system controller Π_0:

D_0(X′, x) = { D_1(X′),  x = x^1 ;  D_2(X′),  x = x^2 ;  … ;  D_n(X′),  x = x^n }   (34)

where X′ is a certain set of objects (usually the result of the information processing functions), x is the user command that determines the choice of display function, D_i are the display functions provided by system modules, and x^i are markers of the display functions. Usage of ontological presentation templates requires an extension of the system for generating interactive documents. It must include subsystems that provide G_Q and G_D. Their use may require additional computational resources, especially during the development period, when constant changes to ontological presentation templates are expected. Their use also significantly increases the requirements for protection against unauthorized access to the ontologies O_D. Therefore, the use of ontological presentation templates after the end of the development period may be impractical – it will be more efficient to create an interactive document of the form (25) so that its natural system already includes all the ontologically described modules as standard modules Π_i.

4.2 Ontological Descriptors for Software Integration

During the creation of the oceanographic databank, the main type of data being handled is various measurements. The number of such measurements can be very significant,


but at the same time they may be divided into subsets that are characterized by a relative uniformity of data structure inside them. It is not practical to store such information in the form of an ontology, due to the insufficient efficiency of the standard ontology storage (the file system) when working with structured data. However, when working with such a complex subject area as oceanography, the flexible mechanisms for describing complex relationships that an ontology provides are necessary. Combined information storage allows solving this problem. It consists of a certain specialized information system (usually a thematic database) and an ontology containing:

• The list and relationships of the data types used within the framework of the interactive document;
• Metadata for accessing the information systems used to store the data displayed as part of the interactive document – addresses of access points, data formats, authorization data, etc.;
• Information on the composition and structure of the data displayed as part of the interactive document – a list of fields intended for display, their types and valid values;
• Information on the composition and structure of the data on the basis of which processing will be carried out – a list of fields on which filters will be formed, their possible values (for building drop-down lists), and marks on geospatial information (for working with specialized functions of a GIS application);
• Information on the preliminary processing of data received through the information systems – formatting and translations of field names and values;
• Other data important for the correct operation of the interactive document (for example, the location of static resources, such as source files).

The overall structure of an interactive document with combined data storage is similar to (25), with the ontological descriptor of software integration O_P used as the ontology O. However, the natural system of such a document differs in that most of the information processing is done by the underlying data storage. Therefore, the objective function of the natural system (27) is transformed into (35):

f(x^1, …, x^n) = D(Q_P(x, x^1, …, x^{n−1}), x^n)   (35)

where x is an object belonging to the ontology O_P, which describes the integration with a particular information system, and Q_P is a software integration function that performs the following processes in succession:

• Forming a request to the information system based on the user commands x^i;
• Sending the request and receiving information from the underlying storage;
• Validating the information received, identifying errors and transmitting them to the system controller for display to the user;
• Pre-processing the received information – formatting, translation, etc.;
• Receiving and transmitting dynamic metadata to the controller, such as the total number of results matching the query.

An interactive document with support for combined data storage should have a specialized controller that dynamically recognizes the type of the current ontology and,


depending on it, forms the target function of a natural system of the form (27) or (35). In this case, all aspects related to the display function D remain unchanged, since the result of the software integration function Q_P should not differ in structure from the result of the standard information processing functions Q and should be a set of objects of a dynamically formed ontology O_P^*. In the general case, such a dynamically formed ontology will not contain ontological relations (36), which makes some display functions unsuitable for working with it:

R_P^* = ∅   (36)

In general, using ontological descriptors of software integration O_P also requires using ontological presentation templates O_D, and therefore an interactive document using combined information storage will have the form (37):

⟨O_P, O_D, S_N⟩   (37)

Using ontological integration descriptors can significantly increase the efficiency of the natural system by optimizing the information transfer process: only the information that the expert needs at a given time is transmitted, and the pre-processing mechanism further increases the efficiency of the transfer. This also provides the administrator with a standardized way of controlling the system, since the ontological descriptor can be edited with standard tools like a normal ontology.
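A rough sketch of a software integration function Q_P is shown below, assuming the descriptor is available as a plain mapping and the actual transport is injected as a fetch callable. The field names are hypothetical and the sketch is not tied to any particular database API.

def software_integration(descriptor, fetch, user_commands):
    """Sketch of Q_P (35): build a request from the descriptor and the user
    commands, fetch from the external storage, validate and pre-process."""
    query = {field: user_commands[field]
             for field in descriptor["filter_fields"] if field in user_commands}
    raw = fetch(descriptor["endpoint"], query)          # request/response step
    required = set(descriptor["required_fields"])
    rows = [r for r in raw if required <= r.keys()]     # validation step
    invalid = len(raw) - len(rows)
    labels = descriptor["labels"]
    pretty = [{labels.get(k, k): v for k, v in r.items()} for r in rows]
    metadata = {"total": len(rows), "invalid": invalid}  # dynamic metadata
    return pretty, metadata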

5 Oceanographic Data Processing

The main problem in creating the oceanographic databank is the fragmentation and incompleteness of the original data. The available data differ in structure (files of different types, including weakly structured and unstructured documents) and physical location (data may belong to different organizations, including international ones). In addition, the data are often incomplete – they may not specify the platform from which the measurements were taken, the measurement tool, etc. However, as practice has shown, most data have the basic information, namely the type of measurement and the geographical coordinates of the measuring point, which allows the data to be normalized to some extent and brought to a common structure. As part of the first iteration of the databank creation, the following types of data were processed:

• Scientific reports concerning various measurements during expeditions (natural-language texts);
• Results of salinity and seawater temperature measurements, automatically collected from various sources (mainly ships and buoys) in various specialized formats;
• Results of hydroacoustic studies of major rivers of Ukraine and some coastal areas in specialized formats;
• Flow measurements from buoys in specialized formats.


The available data were conditionally divided into the following types:

• Point measurements – characterized by a relatively small number of measurements at a single point from a stationary platform. These include most measurements made from ships (except hydroacoustics);
• Continuous measurements – measurements taken continuously over a certain (relatively short) period of time (mainly hydroacoustic measurements);
• Long-term measurements – measurements taken over a considerable period (several months), usually from a fixed platform (mostly buoys).

On the basis of this analysis and classification, the following structure of an individual element of the databank was proposed (38):

d = ⟨g, t, p, c, g̃, I, M⟩   (38)

where g is the geographical coordinates of the measuring point, t is the measurement time information, p is the identifier of the platform from which the measurements were taken, c is the identifier of a series of measurements (for example, a ship's cruise), g̃ is geographical data on the further route of the platform (for mobile platforms, including ships), I is a set of related static resources (photos, source files, graphs, etc.), and M is the measurement data in the form of a matrix. For the main categories of input files, different approaches were taken and different processing procedures were created.

5.1 Scientific Reports

The main type of documents used as input is scientific reports on oceanographic expeditions carried out by various agencies of the National Academy of Sciences of Ukraine. These reports are written in Ukrainian or Russian and contain basic information about a particular study, which in the general case is easy enough to process and bring to the structure (38). To process such documents, the recursive reduction method is used without additional modifications. Processing requires dictionaries of attribute names, on the basis of which information processing rules are created. The dictionaries include the different variants of attribute names that can be found in various reports (e.g. variants of a name in different languages). Sometimes large amounts of data in scientific reports are presented in the form of tables, which require some preprocessing to be used efficiently with the method of recursive reduction. Preprocessing includes restructuring such tables to form the sequences of related data required for applying recursive reduction rules.

5.2 Measurement Result Files

A feature of scientific reports is that the information they contain is optimized for human perception. As a person's ability to perceive textual information is rather limited, such optimization may require some simplification – omitting minor indicators, reducing


the number of displayed measurements, etc. This can cause problems, especially if the measurement results are collected automatically and their volumes do not allow them to be fully included in the reports. In most cases, it is more efficient to process the original files with the measurement results, ignoring the already processed data from the reports. Many of the measurement result files are simple tables containing lists of measurement points. They are processed with the recursive reduction method to retrieve the metadata contained in their natural-language parts (commentaries) and to normalize measurement names according to the given dictionaries. Other measurement files are given in specialized formats, which require various preprocessing steps before they can be used with the method of recursive reduction. Such preprocessing may include using external libraries to access data from specialized binary files, transforming non-standard value types (like “days from a certain timestamp”) to a standard form, etc.
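For illustration, the databank element (38) and one of the preprocessing steps mentioned above (converting "days from a certain timestamp" values) might be sketched as follows; the field names and the reference date are assumptions made for the example, not the system's actual schema.

from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class DatabankElement:
    """One element d = <g, t, p, c, g~, I, M> of the databank, cf. (38)."""
    g: tuple                 # (latitude, longitude) of the measuring point
    t: date                  # measurement time information
    p: str                   # platform identifier
    c: str = ""              # series identifier (e.g. a ship cruise)
    track: list = field(default_factory=list)      # g~: further route of the platform
    resources: list = field(default_factory=list)  # I: related static resources
    M: list = field(default_factory=list)          # measurement matrix

def days_since(reference, days):
    """Normalize a 'days from a certain timestamp' value to a calendar date."""
    return reference + timedelta(days=days)

element = DatabankElement(g=(45.3, 32.5),
                          t=days_since(date(1950, 1, 1), 25567),
                          p="buoy-07",
                          M=[[0.0, 17.8], [5.0, 17.1]])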

6 Oceanographic Databank Software System

The oceanographic databank software system is designed to work in a web-based network environment. The software system can work both in a local area network and on the Internet, in particular using cloud computing [24, 25]. The system is presented as an interactive document designed to display complex sets of data, with additional ontology-defined modules for working with data that require special means of display. The interactive document uses the combined storage scheme: all measurements are stored in the internal database according to the scheme (38), and metadata are stored in the ontological descriptor of software integration. The ontological descriptor contains static data that describe the existing dataset. Such data include:

• Database access credentials – used by the interactive document to obtain the necessary information;
• Filter configuration – specifies a list of fields of the structure (38) that are used as filters and are displayed in the corresponding area of the interactive document. This configuration also specifies the type of each field (date, coordinates, full-text search, etc.), and, for fields configured as drop-down lists, their possible values;
• Dynamic language variables, i.e. field names and field values that are specific to the current dataset.

The main purpose of ontological descriptors is to provide a flexible mechanism for configuring the use of external data sources within the system. An additional task of the descriptors is the optimization of the data transfer process between the interactive document and an external data source (database server):

• Storing access credentials in the ontology allows a quick change of data sources (for example, when the network address of the database server changes), and also enables the use of various external data sources simultaneously (adding them to a single ontological descriptor) or separately (creating different ontological descriptors for different data sources);


• The filter configuration allows a quick change of the fields and field values that are available for filtering. This makes it possible to specify fields that are excluded from filtering (used for display only). For example, in the general case there is no point in filtering by the path of the platform g̃, so displaying the corresponding field in the filter is impractical;
• The filter configuration serves as a system cache that stores the possible field values. Determining the possible values of the fields can be done during the creation of the interactive document, but it is a rather resource-intensive process that should not be repeated constantly. An alternative to keeping the filter configuration within an ontology is to create a server-side caching system; however, this would require transferring this information from the server, which may slow the operation of the interactive document and cause additional load on the server;
• The configuration of language variables allows the use of shortened attribute names, which are more efficient for data transfer. Such names are then translated at run time, which also allows the creation of multi-language interfaces.

6.1 Ontology-Based User Interface

The ontology-based user interface [21–23] provides the means to access the available oceanographic data. A general view of this interface is shown in Fig. 1. For displaying measurement data, a GIS application is used. In addition, an auxiliary view mode is provided, which allows displaying the measurement data as a table.

Fig. 1. The interface of the oceanographical databank
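The static content of the ontological descriptor discussed above might be serialized roughly as follows; every field name and value here is a made-up illustration, not the actual descriptor format.

# Hypothetical content of an ontological descriptor of software integration.
DESCRIPTOR = {
    "endpoint": "https://databank.example.org/api/measurements",   # hypothetical address
    "credentials": {"user": "reader", "token": "example-token"},   # access metadata
    "filters": [
        {"field": "g", "type": "map-area"},                        # coordinates
        {"field": "t", "type": "date-range"},                      # date of measurement
        {"field": "p", "type": "dropdown", "values": ["ship", "buoy"]},  # cached values
    ],
    "display_only": ["track"],          # e.g. the platform path g~, shown but not filterable
    "labels": {"p": "Platform", "t": "Date", "M": "Measurements"},  # language variables
}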

The GIS application view mode displays the individual points at which measurements were made as markers on the map. For convenience, large groups of markers are automatically clustered on the map. This view mode is suitable for displaying measurements from stationary platforms, such as buoys or anchored ships. For continuous measurements this is not enough, as the trajectory of the platform movement is itself valuable data. If this data is available, it is stored in the component g̃ of the structure (38). These data are displayed on the map in the form of thick lines originating from the point representing the measurement. When a large number of measurements is available, the trajectories tend to intersect very often, so filtering should be used in such cases. Some measurement types are difficult to display with standard means, so they have to be preprocessed. One such case is hydroacoustic measurements, which are processed and displayed as a set of raster images (Fig. 2).

Fig. 2. Hydroacoustics measurement result

The auxiliary table view mode (Fig. 3) displays significantly less data at a time than the GIS application, but it provides a more detailed representation of the information. The table view mode can be used to view large amounts of metadata of the measurements available in the database and to access the measurement results.

Fig. 3. Table view mode


The following fields, corresponding to the elements of the structure (38), are available in the table mode:

• Platform type p;
• Geographical coordinates g;
• Cruise and ship identifiers (if available), which identify the measurement series c;
• Date of measurement t;
• The source file containing the measurements and the related images, displayed as part of the collection of related documents I;
• The measurement matrix M.

The measurement matrix M often contains a large number of measurements (tens or hundreds, and in some cases thousands), so displaying it in the table alongside the metadata is impractical. Therefore, measurement results are displayed on demand in a specialized interface; the main table contains only a mark indicating the number of measurements in the matrix. Filtering is performed using a specialized control block on the left side of the interactive document interface. The filter is fully configurable using the ontological descriptor of software integration and contains the following fields:

• Coordinates – a specialized field that allows selecting a working area on the map. Only measurements located in the selected area are then displayed, both on the map and in the table view. Changing this field automatically activates the GIS application view mode;
• Platform and parameter lists – work as sets of switches that allow the selection of measurements according to the selected properties;
• Date – allows selecting the date range in which the displayed measurements should have been made.

7 Conclusions

Ukraine has currently lost the oceanographic database it accumulated during its existence as a state. Oceanographic information flows and the structure of their circulation and management have been significantly disrupted. Ukraine's participation in international cooperation on information exchange and in international projects in the field of oceanography has largely ceased. The measurement data of oceanographic parameters of the Azov-Black Sea basin and of the World Ocean as a whole that are now available in Ukraine are again scattered in disparate archives and databanks, as at the beginning of the formation of the state, and some have been lost completely. The analysis of the state of affairs in this area shows that there is currently no program to create a national information system for collecting, storing, analyzing and exchanging oceanographic measurement data and oceanographic knowledge. Methods, software and technological tools for the analysis, special preparation and use of oceanographic information in decision-making systems in the interests of the state, science, defense and national economy of Ukraine have yet to be created.


The developed information system can be used as the basis for the creation of a databank that will ensure the preservation and protection of data at the state level and allow the country to resume its participation in international cooperation in this field. At the same time, together with the creation of the technical and technological parts of the databank, it is necessary to implement the legal component of such a system at the domestic and international levels, which will allow the databank to fulfill its tasks with maximum efficiency.

References 1. Prykhodniuk, V.: Technological means of transdisciplinary representation of geospatial information. Institute of Telecomunications and Global Information Space, Kyiv (2017) 2. Stryzhak, O., Prychodniuk, V., Podlipaiev, V.: Model of transdisciplinary representation of GEOspatial information. In: Ilchenko, M., Uryvsky, L., Globa, L. (eds.) UKRMICO 2018. LNEE, vol. 560, pp. 34–75. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-167 70-7_3 3. Prihodniuk, V., Stryzhak, O.: Ontological GIS, as a means of organizing geospatial information. Sci. Technol. Air Force Ukraine 2(27), 167–174 (2017) 4. Guide to Operational Procedures for the collection and exchange of JCOMM Oceanographic Data. IOC Manuals and Guides No. 3 (3rd Rev. Ed.). UNESCO (1999) 5. Golodov, M., et al.: Hydrophysical research of marine and riverine environments. Geofizicheskiy Zhurnal 6(41), 111–127 (2019) 6. Prykhodniuk, V.: Taxonomy of natural-language texts. Inf. Models Anal. 5(3), 270–284 (2016) 7. Velychko, V., Popova, M., Prykhodniuk, V., Stryzhak, O.: TODOS - IT-platform for the formation of transdisciplinary informational environments. Syst. Arms Mil. Equipment 1(49), 10–19 (2017) 8. Palagin, A., Kryvyi, S., Petrenko, N.: Knowledge-oriented information systems with the processing of natural language objects: the basis of ethodology, architectural and structural organization. Control Syst. Comput. 3, 42–55 (2009) 9. Palagin, A., Kryvyi, S., Petrenko, N.: Ontological Methods and Means of Processing Subject Knowledge: Monograph. Volodymyr Dahl East Ukrainian National University, Luhansk (2012) 10. Stryzhak, O.: Transdisciplinary integration of information resources. Institute of Telecomunications and Global Information Space, Kyiv (2014) 11. Gladun, V., Velychko, V.: Contemplation of natural-linguistic texts. In: XIth International Conference “Knowledge–Dialogue–Solution” Proceedings, Sofia (Bulgaria), pp. 344–347. ITHEA (2005) 12. Velychko, V., Voloshin, P., Svitla, C.: Automated creation of the thesaurus of the terms of the subject domain for local search engines. In: XVth International Conference “Knowledge– Dialogue–Solution” Proceedings, Sofia (Bulgaria), pp. 24–31. ITHEA (2009) 13. Velychko, V., Prykhodinuk, V., Stryzhak, A., Markov, K., Ivanova, K., Karastanev, S.: Construction of taxonomy of documents for formation of hierarchical layers in geo-information systems. Inf. Content Process. 2(2), 181–199 (2015) 14. Barendregt, X.: Lambda-Calculus. Its Syntax and Semantics. World, Moscow (1985) 15. Prykhodinuk, V., Stryzhak, O.: Multiple characteristics of ontological systems. Math. Model. Econ. 1–2(8), 47–61 (2017) 16. Velychko, V., Prykhodniuk, V.: Method of automated allocation of relations between terms from natural language texts of technical subjects. In: XVth International Conference “Knowledge–Dialogue–Solution” Proceedings, Sofia (Bulgaria), pp. 27–28. ITHEA (2014)


17. Prykhodniuk, V.V., Stryzhak, O.Y., Haiko, S.I., Chepkov, R.I.: Information-analytical complex of support of transdisciplinary researches processes. Environ. Saf. Nat. Resour. 4(28), 103–119 (2018) 18. Stryzhak, O., Potapov, H., Prykhodniuk, V., Chepkov, R.: Evolution of management – from situational to transdisciplinary. Environ. Saf. Nat. Resour. 2(30), 91–112 (2019) 19. Malyshevsky, A.: Qualitative Models in the Theory of Complex Systems. Science, Fizmatlit, Moscow (1998) 20. Tsvetkov, V.: Geoinformation Systems and Technologies. Financial Statistics, Moscow (1998) 21. Popova, M.: Ontology of interaction in the environment of the geographic information system. Institute of Telecomunications and Global Information Space, Kyiv (2014) 22. Popova, M., Stryzhak, O.: Ontological interface as a means of representing information resources in GIS-environments. Scientific notes of Taurida National V.I. Vernadsky University 26(65), 127–135 (2013) 23. Popova, M.: A model of the ontological interface of aggregation of information resources and means of GIS. Inf. Technol. Knowl. 7(4), 362–370 (2013) 24. Globa, L., Sulima, S., Skulysh, M., Dovgyi, S., Stryzhak, O.: Architecture and operation algorithms of mobile core network with virtualization. In: Mobile Computing, pp. 427–434. IntechOpen, Rijeka (2019) 25. Globa, L., Kovalskyi, M., Stryzhak, O.: Increasing web services discovery relevancy in the multi-ontological environment. In: Wili´nski, A., El Fray, I., Peja´s, J. (eds.) Soft Computing in Computer and Information Science. AISC, vol. 342, pp. 335–344. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-15147-2_28

Analyzing Societal Bias of California Police Stops Through Lens of Data Science Sukanya Manna(B) and Sara Bunyard Santa Clara University, Santa Clara, CA 95053, USA [email protected], [email protected]

Abstract. The police department in any city uses a wide range of enforcement tools to ensure public safety, and traffic stops are one such tool. During this process, racial disparities in policing in the United States are not uncommon. This has created both social and ethical concerns among people. As a result, researchers in different domains have taken an interest in connecting different pieces of information and finding possible solutions to these issues. In this paper, we address societal bias through the lens of data science, focusing on how police stops are made and which groups of people are targeted most often. The analysis is carried out on major cities of California, focusing on the different factors that might create discrimination in police stops. We also used two well-known statistical analyses, the benchmark test and the veil of darkness test, to look into racial profiling. Based on our analysis, it is clear that Black drivers were stopped between 2.5 and 3.7 times more often than white drivers, and Hispanic drivers were stopped between 0.968 and 1.39 times as often as white drivers (in 2014 and 2017, respectively).

Keywords: Bias · Racial discrimination · California police stops · Benchmark test · Veil of darkness test

1 Introduction

Background and Motivation: Police departments use a wide range of enforcement tools to ensure public safety. Traffic stops are one such tool. These interactions typically involve an officer pulling over a motorist, issuing a warning or citation, and more rarely conducting a search for contraband or making a custodial arrest [4]. The prevalence and nature of traffic stops vary widely across American cities, but they are generally the most common way police departments initiate contact with the public [5]. Several studies have looked into racial profiling and used different statistical methods to quantify discrimination. The key problem in testing for racial profiling in traffic stops is estimating the risk set, or “benchmark”, against which to compare the race distribution of stopped drivers [8]. To date, the two most common approaches have been to use residential population data


or to conduct traffic surveys in which observers tally the race distribution of drivers at a certain location. It is widely recognized that residential population data provide poor estimates of the population at risk of a traffic stop; at the same time, traffic surveys have limitations and are more costly to carry out, which is why the veil of darkness test was applied instead by Grogger and Ridgeway [8]. There are other well-known statistical tests for measuring discrimination; among them are the benchmark test, the outcome test, the threshold test and their different variants. The authors of [13] discussed possible racial disparities, as black and Hispanic drivers were often stopped and searched on the basis of less evidence than whites. To reach this conclusion, search rates were examined. In nearly every jurisdiction, it was found that stopped black and Hispanic drivers were searched more often than whites, about twice as often on average. But such statistics alone are not clear evidence of bias. If minorities also happen to carry contraband at higher rates (a hypothetical possibility, not a fact), these elevated search rates may simply reflect routine police work. This raises concerns among local community groups that traffic stop practices disproportionately impact certain races over others. There have been initiatives to analyze the underlying causes of this. For example [12], in May 2014 the City of Oakland and the Oakland Police Department (OPD) partnered with a team of Stanford social psychologists to collect and analyze data on OPD officers' self-initiated stops. The task of the research group was to analyze the reports that OPD officers completed after every stop they initiated between April 1, 2013 and April 30, 2014. The analyses revealed that OPD officers disproportionately stopped, searched, handcuffed, and arrested African Americans, relative to other racial groups. Our main motivation is to find out whether a similar scenario prevails to the same extent in California, as the state itself is considered to be very diverse and multicultural. Some of the earlier research on racial profiling focused on either the USA as a whole or on individual states such as Texas [11], New York [13] and North Carolina [2]. Contributions: In this paper, we present a three-fold analysis. First, we aim to quantify racial disparities in California's current traffic stop practices using two main statistical analyses, the benchmark test and the veil of darkness test. Second, we visualize both pedestrian- and vehicular-level stops in major California cities. Third, we focus on stops in 2014 and 2017 to compare and contrast whether there was any change in the pattern of policing. Organization: We explain these goals in the remainder of this paper. Section 2 presents the related work. Section 3 presents the experimental setup. Section 4 illustrates our findings, and finally Sect. 5 concludes the paper.

2 Related Work

There have been several studies on the discrimination seen in police stop behavior towards drivers or pedestrians based on different factors such as gender, race or ethnicity. Several instances show that police officers speak


significantly less respectfully to black than to white community members in everyday traffic stops, even after controlling for officer race, infraction severity, stop location, and stop outcome [14]. Researchers from various domains have gathered and analyzed data about these stops [2, 7, 9]. There are well-known statistical tests for measuring discrimination, among them the benchmark test, the outcome test, the threshold test and their different variants. We focus only on the ones closest to the analysis done in this paper. The first test, termed benchmarking, compares the rate at which whites and minorities are treated favorably. However, one limitation of benchmarking, referred to in the literature as the qualified pool or denominator problem [1], is a specific instance of omitted variable bias. For example, when looking at police stops per capita, the driving population might differ from the census (or residential) population. Addressing this shortcoming of benchmarking, Becker [3] proposed the outcome test, which is based not on the rate at which decisions are made, but on the success rate of those decisions. These ideas were originally formulated for the granting of credit loans. Though proposed in the context of lending decisions, outcome tests have gained popularity in a variety of domains, particularly policing [1, 6, 7, 10]. For example, when assessing bias in traffic stops, one can compare the rates at which searches of white and minority drivers turn up contraband. If searches of minorities yield contraband less often than searches of whites, it suggests that the bar for searching minorities is lower, indicative of discrimination. Besides benchmark and outcome tests, threshold tests [13] have recently been proposed as a useful method for detecting bias in lending, hiring, and policing decisions. For example, in the case of credit extensions, these tests aim to estimate the bar for granting loans to white and minority applicants, with a higher inferred threshold for minorities indicative of discrimination. This technique, however, requires fitting a complex Bayesian latent variable model for which inference is often computationally challenging, so the fast threshold test was developed. It is a method for fitting threshold tests that is two orders of magnitude faster than the existing approach, reducing computation from hours to minutes. To illustrate their efficacy, these algorithms were tested against 2.7 million police stops of pedestrians in New York City.

3 Experimental Setup

Dataset: Our data comes from the Stanford Open Policing Project¹. The Stanford Open Policing Project is a partnership between the Stanford Computational Journalism Lab and the Stanford Computational Policy Lab that collected and standardized over 200 million records of traffic stop and search data from across the country.

https://openpolicing.stanford.edu/data/.


We have used the data of 8 of the 11 California cities provided in the dataset (Bakersfield, Long Beach, Los Angeles, Oakland, San Diego, San Francisco, San Jose, Stockton), as not every city recorded the “race” of the people stopped by the police. Since “race” is one of the main attributes needed to detect discrimination, we discarded Anaheim and San Bernardino. We also discarded Santa Ana, because there were not enough months of data overlapping with the other cities. In many cases, the data we used were insufficient to assess racial disparities (for example, the race of the stopped driver was not regularly recorded, or only a non-representative subset of stops was provided). For consistency in our analysis, we further restricted ourselves to stops occurring in 2014 and 2017, as many jurisdictions did not provide consistent data for other years. Our primary dataset for our analysis thus consists of approximately 1,044,030 stops.

Data Processing: We processed the data in a few steps:

1. Merging multiple files: The different cities provided differently formatted data, and while the data from the Open Policing Project was already cleaned and formatted, some cities provided different variables than others. We therefore merged the files with the necessary modifications.
2. Data cleaning: When cleaning the data, we filled in the time and driver gender variables with ‘N/A’ because some of our cities did not have this data recorded. Once this was complete, we eliminated observations with missing values. We then restricted our data to the year 2014, because that was the year with the most complete data. Then, we eliminated rows from Santa Ana because it did not have complete data for 2014 and 2017. Lastly, we created hour and weekday variables that extracted the hour from the stop time and the weekday from the stop date when that information was available.
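The merging and cleaning steps above might look roughly like the following pandas sketch. The file paths and column names (date, time, subject_race, subject_sex) are assumptions modeled on the Open Policing data layout, not the authors' actual code.

import glob
import pandas as pd

# 1. Merging multiple files
frames = [pd.read_csv(path, low_memory=False) for path in glob.glob("ca_*.csv")]
stops = pd.concat(frames, ignore_index=True, sort=False)

# 2. Data cleaning: fill sparse columns, drop incomplete rows, restrict years
stops["time"] = stops["time"].fillna("N/A")
stops["subject_sex"] = stops["subject_sex"].fillna("N/A")
stops = stops.dropna(subset=["date", "subject_race"])
stops["date"] = pd.to_datetime(stops["date"])
stops = stops[stops["date"].dt.year.isin([2014, 2017])]

# Derived hour and weekday variables
stops["hour"] = pd.to_datetime(stops["time"], format="%H:%M:%S", errors="coerce").dt.hour
stops["weekday"] = stops["date"].dt.day_name()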

4 Analysis

We divided our analysis into two phases. First, we present the different variables which might create a “biased” decision in stops. Second, we apply the statistical analyses “benchmark test” and “veil of darkness” to analyze racial disparities.

4.1 Analysis of Stops Through Visualizations

One of the criteria for this analysis was that the data needed to be present consistently in all the cities during a certain time span. We recorded that the years 2014 & 2017 (relatively most recent data) had maximum overlap in terms of the time, so we retained them for comparisons. This analysis included both vehicular and pedestrian stops. We looked at the outcome of the stops based on gender and race primarily. We also compared the frequency of police stops at different times of the day and different days of a week just to have a better understanding of policing.


The number of data points for the year 2014 differs from 2017 in our visualizations because fewer cities had data available for 2017 (Table 4). For the 2014 visualizations, the total number of stops is 1,044,030. The distribution of stops per city is summarized in Tables 1 and 2. We also examined some cities in 2017 to look at how the data had changed. We chose the year 2017 because it was the most recent year for which the most cities had complete data available. The cities that we were able to use were Los Angeles, Oakland, San Jose, Bakersfield and Long Beach. Our dataset for 2017 contained approximately 693,980 stops.

Table 1. Distribution of stops per city in 2014

City          | Stops
Bakersfield   | 22874
Long Beach    | 23355
Los Angeles   | 704705
Oakland       | 23062
San Diego     | 138917
San Francisco | 91961
San Jose      | 32260
Stockton      | 6896

Table 2. Distribution of stops per city in 2017

City        | Stops
Bakersfield | 19129
Long Beach  | 18475
Los Angeles | 598275
Oakland     | 30545
San Jose    | 27556

Per Capita Stops by City: Figures 3 and 4 illustrate per capita stops by city for the years 2014 and 2017. For this, we took our stop data and grouped it by driver race. Then, we joined it with census data from the 2010 census. To get our census data, we took the total population and multiplied it by the different demographic percentages to get the number of people of each demographic in each city. Once these two datasets were joined together, for each city we divided the number of stops of each race by the number of people of each race in that city, to get stops per capita by race. To get our confidence intervals for all our data combined, we took the unweighted average of the stop rates to get our combined per capita stops for each race. Then we used the formula:


CI = (p̂₁ − p̂₂) ± z √( p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂ )   (1)

where p̂₁ is the proportion of stops of race₁, p̂₂ is the proportion of stops of race₂, and n₁ and n₂ are the population sums of race₁ and race₂, respectively; z is the z-score at a certain confidence level (in our case a 99% confidence interval). Based on Eq. (1), we computed confidence intervals comparing White and Hispanic drivers in 2014 (99% CI [0.002, 0.003], p < 0.01) and 2017 (99% CI [−0.02, −0.02], p < 0.01). For 2014 and 2017 the results are close to zero, so we do not consider them significant. Similarly, we computed confidence intervals for White and Black drivers in 2014 (99% CI [−0.114, −0.111], p < 0.01) and 2017 (99% CI [−0.134, −0.132], p < 0.01), respectively. Analyzing per capita stops by city, in 2014 Black drivers were stopped 2.52 times more often than white drivers, and in 2017, 3.73 times more often than white drivers (see Tables 3 and 4). Similarly, in 2014 Hispanic drivers were stopped 0.968 times as often as white drivers, and in 2017 they were stopped 1.39 times more often than white drivers.

Table 3. Per capita stops for each city of drivers from different races in 2014

City          | Black drivers | Hispanic drivers | White drivers
Bakersfield   | 0.114         | 0.035            | 0.108
Long Beach    | 0.0976        | 0.0443           | 0.0429
Los Angeles   | 0.471         | 0.164            | 0.152
Oakland       | 0.129         | 0.0442           | 0.0331
San Diego     | 0.182         | 0.107            | 0.107
San Francisco | 0.334         | 0.0992           | 0.103
San Jose      | 0.114         | 0.0597           | 0.0244
Stockton      | 0.0527        | 0.0219           | 0.0239

Table 4. Per capita stops for each city of drivers from different races in 2017

City        | Black drivers | Hispanic drivers | White drivers
Bakersfield | 0.074         | 0.0472           | 0.0595
Long Beach  | 0.0794        | 0.038            | 0.0302
Los Angeles | 0.469         | 0.144            | 0.108
Oakland     | 0.2           | 0.0632           | 0.0251
San Jose    | 0.0852        | 0.0469           | 0.0204
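A small sketch of the per capita computation and of the confidence interval (1) is shown below; the population figures are placeholders, not census values.

import math

def two_proportion_ci(p1, n1, p2, n2, z=2.576):
    """99% confidence interval for p1 - p2, as in Eq. (1)."""
    half_width = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - half_width, (p1 - p2) + half_width

# hypothetical single-city figures: stops divided by census population per group
black_rate = 23062 / 100000          # stops per capita, Black drivers
white_rate = 33100 / 350000          # stops per capita, white drivers
disparity = black_rate / white_rate  # "stopped X times more often than white drivers"
ci = two_proportion_ci(black_rate, 100000, white_rate, 350000)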

Stops by Gender: Figures 1a and 1b show the stops by gender in these two years. In both cases it is quite obvious that females are stopped


less often than men, but this alone does not establish bias: the female population may be smaller than the male population, and the number of females in the driving population on the road may be smaller as well.

(a) Proportion of Stops by Gender in 2014

(b) Proportion of Stops by Gender in 2017

Fig. 1. Analysis of stops by gender

Stops by Race: Figures 2a and 2b are visualizations of stops by race in these two years. For the stops-by-race visualization, the total number of stops is 964,455. It has fewer data points because we discarded stops where the race was listed as other or unknown. The distribution is given in Table 5. The figures generally reflect the fact that in cities with a higher concentration of certain races, those races tend to have higher proportions of stops than others. For example, the city of San Jose has a relatively high Hispanic population, and the stops of Hispanic drivers are also relatively higher in San Jose than in other cities. This in itself does not imply anything specific about racial discrimination.


Table 5. Distribution of stops by races per city in 2014

City          | Stops
Bakersfield   | 22014
Long Beach    | 20758
Los Angeles   | 655132
Oakland       | 22149
San Diego     | 130220
San Francisco | 77217
San Jose      | 30353
Stockton      | 6612

Stops by Time: Figures 3a, 3b, 4a and 4b analyze the stops by hour of the day and day of the week in these two years. For each hour of the day, we calculated the percentage of stops in each city that occurred in that hour. The figure for 2014 is composed of 1,013,210 stops (Long Beach and Stockton were excluded because the time of the stops was not available). The figure for 2017 is composed of 675,421 stops. From the figures we can see that stops are generally lower in the early morning around 5 am, and higher in the afternoon, evening, and late at night. For the day of the week, we calculated the percentage of stops in each city that occurred on that day. All of the cities had stop dates available; the figure for 2014 is composed of 1,044,030 stops and the figure for 2017 of 693,980 stops. From the figures we can see that generally more police stops are made in the middle of the week, while fewer are made on weekends and Mondays.

4.2 Assessing Racial Discrimination in Traffic Stop Decisions Using Statistical Analysis

We present two separate statistical tests to check whether there is any racial discrimination in traffic stop decisions: the benchmark test and the veil of darkness test. The following paragraphs describe our findings in detail.

Benchmark Test: First we had to understand the racial demographics of the California population data. We could not use the raw population numbers; to make them useful, we computed the proportion of California residents in each demographic group per city. An eyeball comparison suggested that black drivers were being stopped disproportionately, relative to each city's population. We then divided the black stops per capita by the white stops per capita to make a quantitative statement about how much more often black drivers are stopped compared to white drivers, relative to their share of the city's population. Black drivers are stopped at a rate 2.52 times (in 2014) and 3.73 times (in 2017) higher than white drivers, and Hispanic drivers are stopped 0.968 times as often as white drivers

Analyzing California Police Stops

123

(a) Proportion of Stops by Race in 2014

(b) Proportion of Stops by Race in 2017

Fig. 2. Analysis of stops by race

in 2014 and 1.39 times higher in 2017 than white drivers. Since there was not enough data for other races, we decided not to compute this test for them. Caveats about the Benchmark Test: While these baseline stats give us a sense that there are racial disparities in policing practices in California, they are not evidence of discrimination. The argument against the benchmark test is that we have not identified the correct baseline to compare to. For the stop rate benchmark, what we really want to know is what the true distribution is for individuals breaking traffic laws or exhibiting other criminal behavior in their vehicles. If black and Hispanic drivers are disproportionately stopped relative to their rates of offending, that would be stronger evidence. Some people then proposed to use benchmarks that approximate those offending rates, like arrests, for example. However, we know arrests to themselves be racially skewed (especially for low-level drug offenses, for example), so it wouldn’t give us the true offending population’s racial distribution.
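The per-capita comparison above reduces to a few lines of arithmetic. Below is a minimal sketch of it in Python with pandas; the data frame, column names, and numbers are hypothetical placeholders for illustration, not values from the study.

import pandas as pd

# Hypothetical aggregated data: stop counts and resident counts per city and race.
df = pd.DataFrame({
    "city":       ["San Jose", "San Jose", "Oakland", "Oakland"],
    "race":       ["black", "white", "black", "white"],
    "stops":      [4000, 9000, 9500, 5200],
    "population": [30000, 300000, 100000, 110000],
})

df["stops_per_capita"] = df["stops"] / df["population"]

# Ratio of black to white stops per capita, per city.
rates = df.pivot(index="city", columns="race", values="stops_per_capita")
rates["black_white_ratio"] = rates["black"] / rates["white"]
print(rates["black_white_ratio"])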


Fig. 3. Analysis of stops per hour: (a) stops per hour in 2014; (b) stops per hour in 2017

An even simpler critique of the population benchmark for stops per capita is that it does not account for possible race-specific differences in driving behavior, including the amount of time spent on the road (and adherence to traffic laws, as mentioned above). If black drivers, hypothetically, spend more time on the road than white drivers, that in and of itself could explain the higher per capita stops we see for black drivers, even in the absence of discrimination. The veil of darkness test [8] is primarily used to assess racial discrimination at the stop decision. This approach relies on the hypothesis that officers who are engaged in racial profiling are less likely to be able to identify a driver's race after dark than during daylight. Under this hypothesis, if stops made after dark involve a smaller proportion of black drivers than stops made during daylight, this would be evidence of racial profiling. The naive approach is simply to compare whether daytime stops or nighttime stops have a higher proportion of black drivers. This, however, is problematic.


Fig. 4. Analysis of stops by weekday: (a) stops by weekday in 2014; (b) stops by weekday in 2017

More black drivers being stopped at night could be the case for a multitude of reasons: different deployment and enforcement patterns, different driving patterns, and so on. All of these things correlate with clock time and would confound our results, so the clock time needs to be adjusted accordingly. To adjust for clock time, we want to compare times that were light at some point during the year (i.e., occurred before dusk) but were dark at other points during the year (i.e., occurred after dusk). This is called the "inter-twilight period": the range from the earliest time dusk occurs in the year to the latest time dusk occurs in the year. We calculated the sunset and dusk times for California using the suncalc library (https://cran.r-project.org/web/packages/suncalc/suncalc.pdf), which computed sunset and dusk times on the day of each stop based on the stop date and the center latitude and longitude of the stop city. Our approach is similar to [13]. For this analysis we also discarded Long Beach and Stockton because the times of the stops were not available, so we did not have enough information to compute the test.

Adjustments with the Data for this Analysis: To carry out the veil of darkness analysis we chose data from the inter-twilight period (when it is dark during part of the year and light during other parts of the year). We also filtered out stops that occurred between sunset and dusk, which could have been ambiguous.

Findings: Table 6 illustrates our results for the veil of darkness analysis. To estimate the effect, we used logistic regression with a natural spline with 6 degrees of freedom, as in [13], except that we only had city-level data. Equation (2) gives the logistic regression model we fitted:

$$\Pr(\text{black} \mid t, g, p, d, s, c) = \operatorname{logit}^{-1}\big(\alpha_c \, c \, d + \beta^{T} \mathrm{ns}_6(t) + \gamma[g] + \delta[p]\big) \qquad (2)$$

where Pr(black | t, g, p, d, s, c) is the probability that a stopped driver is black at time t, location g, and period p (with two periods per year, corresponding to the start and end of DST); the darkness status d ∈ {0, 1} indicates whether a stop occurred after dusk (d = 1) or before sunset (d = 0); and c ∈ {0, 1} indicates whether the enforcement agency is a city police department. In this model, ns_6(t) is a natural spline over time with six degrees of freedom, γ[g] is a fixed effect for location g, and δ[p] is a fixed effect for period p. The location g[i] for stop i corresponds to the city, with γ[g[i]] the corresponding coefficient. Finally, p[i] captures whether stop i occurred in the spring (within a month of the start of DST) or the fall (within a month of the end of DST) of each year, and δ[p[i]] is the corresponding coefficient for this period. For computational efficiency, we rounded time to the nearest 5-min interval when fitting the model.

Table 6. Veil of darkness analysis using combined city data

            Estimate   Std. Err
Year 2014   −0.118     0.020
Year 2017   −0.061     0.027
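A minimal sketch of fitting a model of this form in Python is shown below, using statsmodels with a patsy formula. The column names (is_black, is_dark, minutes_since_midnight, city, period) are hypothetical, the natural cubic spline cr(..., df=6) stands in for the ns_6(t) term, and the darkness-agency interaction from Eq. (2) is omitted for brevity; this is an approximation under stated assumptions, not the authors' code.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame of inter-twilight stops; column names are illustrative.
stops = pd.read_csv("intertwilight_stops.csv")

# Logistic regression: darkness indicator plus a natural cubic spline over clock
# time (6 df) and fixed effects for city and DST period, mirroring Eq. (2).
model = smf.logit(
    "is_black ~ is_dark + cr(minutes_since_midnight, df=6) + C(city) + C(period)",
    data=stops,
)
result = model.fit()

# The coefficient on is_dark is the quantity of interest: a negative value means
# darkness lowers the probability that a stopped driver is black, after
# adjusting for clock time.
print(result.params["is_dark"], result.bse["is_dark"])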

The basic idea is that we are estimating what our data imply about how darkness and clock time influence whether a stopped driver will be black. We then extract the coefficient on darkness. The fact that the coefficient is negative (Table 6) means that darkness lessens the likelihood that a stopped driver will be black, after adjusting for clock time. This matches the intuitive result at 6:30 p.m. from [13]. The fact that the standard error is small means that this is a statistically significant finding.

Caveats on the Veil of Darkness Test: We concur with [13] that the veil-of-darkness test, like all statistical methods, comes with caveats. Darkness, after adjusting for time of day, is a function of the date. As such, to the extent that driver behavior changes throughout the year, and these changes are correlated with race, the test can suggest discrimination where there is none. One way to account for seasonal effects is to consider the brief period around daylight saving time: driving behavior likely does not change much at, say, 6:00 p.m. from the day before the change to the day after, but the sky is light on one day and dark on the next. Another thing to note is that artificial lighting (e.g., from street lamps) can weaken the relationship between sunlight and visibility, so the method may underestimate the extent to which stops are predicated on perceived race. Other attributes, like vehicle make, year, and model, often correlate with race and are still visible at night, which could also lead the test to underestimate the extent of racial profiling. Similarly, the test does not control for stop reason, which is often correlated with both race and time of day. Finally, the test only speaks to the presence of racial profiling in the inter-twilight period and says nothing about other hours. Nevertheless, despite these shortcomings, the test provides a useful if imperfect measure of bias in stop decisions.

5 Conclusion

We analyzed 8 of the 11 prime California cities for this research. Our takeaways from this research are summarized as follows:
1. We looked at different visualizations for different California cities focusing on multiple attributes such as race, gender, and time. Because the dataset did not provide enough information, the initial visualizations could only illustrate the scenario of police stops in a broad sense.
2. The per capita analysis pointed to racial discrimination by police during these stops, supporting the claims of differences in police behavior toward different groups of people.
3. The benchmark test and the veil of darkness test added further weight to our findings on racial discrimination during police stops. In spite of California being multicultural and diverse, we still see similar outcomes in policing.

Future Work: We intend to extend this work in the following ways:
1. Add more robustness tests for the model by tuning different variables.
2. Look for additional data that would allow us to compute other parameters, such as search rate and arrest rate, so that other popular statistical tests, such as the outcome and threshold tests, could be applied.
3. Conduct further research on demographics to compute more accurate per capita stops.


References
1. Ayres, I.: Outcome tests of racial disparities in police practices. Just. Res. Policy 4(1–2), 131–142 (2002)
2. Baumgartner, F.R., Epp, D.A., Shoub, K., Love, B.: Targeting young men of color for search and arrest during traffic stops: evidence from North Carolina, 2002–2013. Polit. Groups Ident. 5(1), 107–131 (2017)
3. Becker, G.S.: The Economics of Discrimination. University of Chicago Press (2010)
4. Chohlas-Wood, A., Goel, S., Shoemaker, A., Shroff, R.: An analysis of the Metropolitan Nashville Police Department's traffic stop practices. Technical report, Stanford Computational Policy Lab (2018)
5. Davis, E., Whyde, A., Langton, L.: Contacts between police and the public (2015)
6. Goel, S., Perelman, M., Shroff, R., Sklansky, D.A.: Combatting police discrimination in the age of big data. New Crim. Law Rev. 20(2), 181–232 (2017)
7. Goel, S., Rao, J.M., Shroff, R., et al.: Precinct or prejudice? Understanding racial disparities in New York City's stop-and-frisk policy. Ann. Appl. Stat. 10(1), 365–394 (2016)
8. Grogger, J., Ridgeway, G.: Testing for racial profiling in traffic stops from behind a veil of darkness. J. Am. Stat. Assoc. 101(475), 878–887 (2006)
9. Horrace, W.C., Rohlin, S.M.: How dark is dark? Bright lights, big city, racial profiling. Rev. Econ. Stat. 98(2), 226–232 (2016)
10. Knowles, J., Persico, N., Todd, P.: Racial bias in motor vehicle searches: theory and evidence. J. Polit. Econ. 109(1), 203–229 (2001)
11. Liederbach, J., Trulson, C.R., Fritsch, E.J., Caeti, T.J., Taylor, R.W.: Racial profiling and the political demand for data: a pilot study designed to improve methodologies in Texas. Crim. Just. Rev. 32(2), 101–120 (2007)
12. Maitreyi, A.: Data for change: a statistical analysis of police stops, searches, handcuffings, and arrests in Oakland, Calif., 2013–2014. Stanford University, SPARQ: Social Psychological Answers to Real-World Questions (2016)
13. Pierson, E., et al.: A large-scale analysis of racial disparities in police stops across the United States. Nat. Hum. Behav. 1–10 (2020)
14. Voigt, R., et al.: Language from police body camera footage shows racial disparities in officer respect. PNAS 114(25), 6521–6526 (2017)

Automated Metadata Harmonization Using Entity Resolution and Contextual Embedding

Kunal Sawarkar and Meenakshi Kodati(B)
IBM, Armonk, USA
{kunal,meenakshi.kodati}@ibm.com

Abstract. The data curation process for analytics and data science typically involves collecting data from a large number of heterogeneous and federated source systems with varied schema structures. To make these datasets interoperable, their metadata needs to be standardized. This process, also known as metadata harmonization, is predominantly a manual effort involving several hours of concentrated work, which reduces the efficiency of the ML-Ops lifecycle. This paper aims to demonstrate the automation of metadata harmonization using machine learning. It focuses on using entity resolution and contextual embedding methods to capture hidden relationships among data columns that help identify similarities in metadata and, thereby, help in the automated mapping of columns to a standard schema. This study also addresses the automated derivation of the correct ontological structure for the target data model using ML. While prior competing approaches address the manual metadata harmonization problem by proposing the usage of semantic middleware, data dictionaries, and matching rules, this approach recommends a novel usage of machine learning which improves the efficacy of the overall lifecycle.

Keywords: Metadata harmonization · Metadata crosswalking · Data curation · Metadata contextual embedding

1 Introduction

Large multinational organizations that gather data from disparate data sources are often faced with the challenge of standardizing the data formats to make the datasets interoperable. This involves finding similarities among data collection methodologies and creating mapping tables that meaningfully unite information from them. This method of mapping identical or equivalent metadata across varied datasets is known as metadata crosswalking, master metadata synchronization, or metadata harmonization. It includes the tasks of mapping the metadata, cleansing it to a given standard, and linking it to a standard structure (see Fig. 1). For a large dataset with hundreds of columns, mapping metadata to a catalog can be a big challenge.


One of the limitations of the data curation and cataloging tools available today is that they are not equipped with metadata harmonization capabilities that automate the process of crosswalking. As a result, organizations rely on Data Stewards to perform this task manually, often consuming several days or weeks of concentrated effort: Data Stewards manually compare each column and map it to a set standard schema (see Fig. 2). Being the first step in curation, this also becomes the bottleneck for the efficiency of data science project execution. There is also the possibility that new data proposes additions or changes to the current ontological structure that the standard schema has to follow. The aforesaid limitations underscore the need for a data curation process that is embedded with the ability to automatically crosswalk the metadata to a standard format at the time of ingestion, and preferably to enhance the metadata by deriving an ontological metadata structure that can be scaled to a wide variety of data sources automatically. The current study addresses the above-mentioned needs by (1) proposing a metadata harmonization method that can automatically crosswalk metadata of different data sources to a standard schema, (2) autonomously deriving the standard ontological structure of metadata from a multitude of source systems using machine-learning-based entity resolution methods that leverage a Levenshtein distance-based mechanism to find the relevant entities by employing cost-based approaches, and (3) using embedding methods like db2vec that involve contextual vectorization and textification to resolve metadata entities to the nearest standardized schemas. The system worked well for inferring metadata column names in a schema, as well as an ontological hierarchical structure for that schema, in experimentation performed on test data.

Fig. 1. Crosswalking metadata for multiple schemas

In the literature there has been limited use of ML to automate and improve the ML-Ops process of cataloguing. In previous attempts, most approaches rely on a prebuilt and static data dictionary that serves as a lookup function for new metadata; however, such methods need a priori knowledge of the system and are not scalable to previously unknown schemas or ontological structures.

Fig. 2. Mapping ontological structure to standard metadata

2 Related Studies

The automation of metadata harmonization, as researched by Wurzer [1], can be achieved through the usage of a semantic middleware for data integration and content analysis that analyzes the semantic closeness of incoming data against historically developed concrete data models to find similarities in the metadata and data content. While it presents a solution to replace manual master data management, it relies on the availability of concrete data models from historically stored datasets, and on content objects that provide a generic description of the new data objects. Furthermore, it focuses on semantic closeness by employing concepts like regular expression matching and data closeness analysis. Therefore, it is less likely to work effectively on completely new, previously unseen datasets. In cases where crowd-sourced datasets are collected, where the likelihood of consistent data formats and rules is minimal, adopting concepts like regular expression matching and data closeness analysis may not result in accurate classification. The current paper therefore emphasizes the use of ML concepts like entity resolution and contextual embeddings for metadata harmonization. It relies on a base ontological structure to learn to classify and map metadata, and incrementally learns with the help of advanced AI algorithms to improve the accuracy of classification. The db2vec approach [4], which formed the basis for the current study, uses contextual embedding and can be leveraged to derive column relationships even in the absence of a base ontology by deriving and learning from canonical industry data models. Furthermore, this paper also researches the automatic derivation of an ontological structure with every new column match that happens. The strength of this approach, therefore, lies in the application of machine learning for continuous learning and improvement of the base ontological structure. The automated derivation and refinement of the ontological structure, cited as one of the merits of the current study, addresses a key pain point for Master Data Management. The construction of an ontology conventionally starts with


interviewing the business stakeholders to determine high-level concepts, followed by manual inspection of the source systems to extract commonalities, which form the foundation for a base ontology that is then enriched with new entity types and properties. The resulting ontology is then customized for different applications and is mapped back to the source systems to facilitate subsequent extraction of relevant data from different sources. As the number of source systems increases, the process becomes increasingly laborious, which calls for some degree of automation. The work by Cobbett et al. [5] proposes automatic ontology generation through the use of a semantic network of known concepts pertaining to a target domain of knowledge, followed by data discovery techniques that help augment the semantic network with classifications of data attribute properties. That study, too, relies on semantic closeness and the availability of historic data with clearly defined technical metadata. The present research utilizes machine learning to enable classification of even those (relevant) data attributes that are completely different from other attributes in the source systems by learning the context of metadata, and it can even work on previously unseen metadata. Other relevant competing approaches that aim to tackle the metadata harmonization problem include a study by Schon [2], which proposes storing specifications for different sources, using the specifications to generate rules for each source, and then matching data elements of different sources to derive a quality metric that characterizes a given match. Another study by Cope [3] aims to combine business rules and a data dictionary into an interlanguage document type definition to crosswalk metadata. These approaches require prior knowledge and are less likely to be scalable to new schemas.

3 System Design

This section discusses the framework design aspects of the proposed embodiments and describes their implementation in detail.

3.1 Framework Overview

The framework can be summarized as follows (see Fig. 3).
1. The metadata for the crosswalk from various heterogeneous sources is collected and arranged in an ontological structure. It also includes additional meta-metadata, such as a lookup for column attributes for a given standardization format and a data model schema structure, with its own naming and descriptive structure, for which the crosswalk is expected to be performed. All of the above metadata is stored in a data frame, which can include source column names and the expected standardized schema tiers or data model tiers for the crosswalk output.
2. Based on this meta-metadata, such as the column names and descriptions, the metadata entities are resolved using machine learning methods to produce the crosswalk predictions. Various embodiments can be used for entity resolution to perform the schema crosswalk. Two such methods are discussed below:


Fig. 3. Framework for the metadata harmonization using entity resolution mechanism

(a) Levenshtein Distance Based Method: This method, explained further in Sect. 3.2, uses a cost-based textual distance to find entity matches between each column name and the standardized schema of the crosswalk model. For tie resolution in the case of identical Levenshtein scores, a blocked indexing method is used, including a hybrid method.
(b) Using Contextual Embeddings: db2vec [4] captures the semantic context of the entities and finds similar entities for the crosswalk. The db2vec method [4] performs textification and then vectorization for all the entities, and matches the entities to the expected schemas by capturing column relationships.
3. In case there exists a ground truth for the entity matching from previous manual efforts by data stewards, an additional text classification model can learn from previous metadata crosswalks and improve the performance. The system can also work in the absence of any training data by learning the context of column relationships explicitly. The solution methodology can also derive and learn metadata from canonical industry data models specific to a domain (e.g., healthcare, banking).

3.2 Implementation

The first step in the process is to obtain the ontology that can be used for the entity resolution model. Each metadata entity is described by meta-metadata, which can include any of the following descriptive attributes about columns: a) verbose column name, b) data classification business terms, c) any textual description of the column, d) business glossary, e) data dictionary. Any one of the above descriptive characteristics, or a combination thereof, can be used. In cases where a large number of datasets are available to begin with, the datasets chosen to create the ontology need to come from diverse data sources and should encompass a wide array of possible column names in order to be used effectively for crosswalking new datasets. The standard schema can further be refined by standardizing column name formats, removing erroneous data, and eliminating duplicate values.

Approach 1: Using Levenshtein Distance for the Entity Resolution Model. The first step in this approach is the calculation of the Levenshtein distance, a metric for measuring the distance between two metadata strings. The score ranges between 0 and 100, with 100 indicating an exact match between the column strings. In addition to deciding on the method for deriving the matching score, the minimum matching score that indicates a "qualified match" needs to be chosen (see Fig. 4): in other words, the threshold score above which a pair of strings is considered "closely matching" needs to be set. This naive approach is likely to work well for column metadata matching but does not always capture the context of inter-column relationships and ontology in complex cases. A minimal sketch of this scoring appears after Fig. 4.

Fig. 4. Using Levenshtein distance
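As an illustration of the idea (not the authors' implementation), the sketch below computes a Levenshtein-style similarity score in the 0-100 range with plain Python and keeps only matches above a chosen threshold; the standard-schema column names and the threshold value are hypothetical.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a 0-100 score (100 = exact match)."""
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - levenshtein(a.lower(), b.lower()) / longest)

# Hypothetical standard schema and an incoming column name.
standard_columns = ["litter_type", "collection_date", "beach_name", "item_count"]
incoming = "Litter Type Code"

scores = {col: similarity(incoming, col.replace("_", " ")) for col in standard_columns}
best, best_score = max(scores.items(), key=lambda kv: kv[1])
THRESHOLD = 70  # minimum score for a "qualified match"; tuning is data-dependent
print(best if best_score >= THRESHOLD else None, best_score)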

Approach 2: Using the db2vec Method for Embeddings. The Db2Vec embedding from cognitive database systems, as researched by Bordawekar et al. [4], captures the inter-column and intra-column relationships between tables by treating them as an unstructured set. The first step in this process is the textification of the standard schema ontology. When the standard schema is provided as an input in the form of a dataframe, the algorithm first textifies the data (using a preprocessing script to convert every row of the table into sentences of the required format). The resulting textified standard schema is then used to train the (db2vec) model (see Fig. 5). The model takes the textified data as input and outputs a vector representation of the unique tokens in the textified data. The training involves the use of a highly specialized 3-layer neural network model to generate the vectors. Once trained, this model can take an unknown string as input and provide a list of records from the standard schema that closely match the input string. In this use case, the input string is a column name from a newly ingested dataset. The model is adept at finding similar records even when there are errors or spelling mistakes in the input column. Moreover, the algorithm can be tuned to output a given number of similar records; when this number is set to 1, the model outputs the record that matches the input column name most closely. The training set can now be replaced with the complete standard schema, and the metadata of a new, unseen dataset can take the place of the test set. The model compares each column's metadata from the new dataset against the standard schema to find the closest match. The ontological structure assigned to the best-matching column name is then picked from the standard schema and assigned to the new column, thereby giving the model the capability to perform metadata crosswalking.

Fig. 5. Using the db2vec method for embeddings
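Db2Vec itself is part of IBM's cognitive database work [4]; a rough stand-in for the textify-then-vectorize idea can be built with an off-the-shelf word2vec implementation, as sketched below with gensim. The table contents, token naming convention, and query are all assumptions made for illustration, and a truly unseen input string would need additional handling beyond this sketch.

from gensim.models import Word2Vec

# Hypothetical "textified" standard schema: each row of the schema table becomes
# a sentence of column:value tokens, so tokens that co-occur in a row end up
# close together in the embedding space.
rows = [
    {"column_name": "straws_and_stirrers", "tier1": "Plastic"},
    {"column_name": "metal_bottle_caps",   "tier1": "Metal"},
    {"column_name": "glass_fragments",     "tier1": "Glass"},
]
sentences = [[f"{k}:{v}" for k, v in row.items()] for row in rows]

# Small shallow neural embedding model trained over the textified rows.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=200, seed=0)

# For a known column token, retrieve its nearest neighbours in the embedding;
# tier tokens among the neighbours suggest the ontological placement.
query = "column_name:metal_bottle_caps"
for token, score in model.wv.most_similar(query, topn=3):
    print(token, round(score, 3))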

4 Experimentation

The aforesaid approaches have been tested on the open data systems for marine litter managed by the United Nations Environment Programme (UNEP) and Earth Challenge 2020. All of these datasets [6-8] have different data collection methodologies and different schemas, and they are not interoperable (see Fig. 6). A team of Data Stewards had spent several days of manual effort to harmonize the metadata and create an interoperable dataset file with uniform metadata [9]. The methods outlined in this study have been tested on these source sets, and the results were compared to the manually curated metadata (see Fig. 7).


Fig. 6. Example of source schema sets

Example: For marine litter, the classification of plastic litter alone can be done in many ways by various schemas, requiring manual analysis to map them to each other. In the schema indicated in Fig. 6, T1, T2, and T3 represent different ontological levels for the same object (see Fig. 8).

Fig. 7. Predicting column metadata

The test result indicated in Fig. 7 shows the output of the db2vec model [4]. For the purpose of the experiment, only the column names and tier 1 (T1) were provided as the input schema in the form of a dataframe. The algorithm first textified the input using a preprocessing script, converting every row of the dataframe into sentences of the required format. The resulting textified standard schema was then used to train the db2vec model. The model provided a vector representation of the unique tokens in the textified data using a robust 3-layer neural network model. To test the model, a list of strings (columns) from a test dataset was provided as input. The model parameters were set to output a list of 5 closely matching 'Tier1' values. An example is presented in Fig. 7: when 'Used Plates' was provided as the test string, the Tier1 (T1) value was correctly classified as 'Metal'. The model was also tested on completely new schemas. An accuracy of 82% was obtained on a dataset consisting of a few hundred columns in each of the child schemas. Figure 8 shows how the current study can be used to derive an ontological structure for new schemas. In this case, the complete schema structure indicated in Fig. 6 was provided as input, and the model predicted the ontological hierarchy for a previously unseen string 'straws and bits' correctly.


Fig. 8. Predicting ontology

5 Limitations

Since the proposed approaches are ML-based in nature, the absence of a proper training set can affect the accuracy of the crosswalking. Although these approaches lead to a fair degree of automation of the metadata harmonization and ontology derivation processes, they require some manual intervention, at least at the beginning, to enable the models to work with better accuracy. For instance, the mappings automatically performed by the model need to be manually reviewed and corrected initially so that the models can learn from Data Steward feedback and historical training data to become better. A complete lack of training data will make this approach harder to apply.

6 Conclusion

This paper presents a novel approach to harmonize metadata automatically by creating a machine-generated crosswalk. It further demonstrates that the approach can work in the absence of any training data using contextual embeddings of entities, and that the system can derive not just column names but also the ontological structure of a schema automatically. The current research can thus aid in speeding up the pipeline, resulting in improved efficiency of the data curation process.

Statement on Impact. This research is primarily aimed at improving the efficiency of the ML-Ops process in large enterprises. As popular wisdom goes, 80% of Machine Learning project effort is spent just on data engineering, and within that, the majority of effort is on cataloguing datasets for data curation, that is, on metadata harmonization across various schemas. The main intent of the authors is the application of ML for improving the ML lifecycle for the benefit of the data stewards who are mainly responsible for the manual work in data curation processes. This will also benefit data engineers and data scientists by reducing the time needed to gain access to the right data. The authors currently do not foresee disadvantages of this method to any particular group, nor do they see implications for bias. The authors do understand that this research may change the nature of the job for the community of Data Stewards, but this should be viewed as advantageous to them by improving the productivity of their tasks, rather than as a disadvantage or a potential risk to their jobs. The authors welcome any constructive feedback on the impact of this research which they may not have anticipated.

References
1. Wurzer, J.: Automated harmonization of data (2013)
2. Schon, A.: Matching metadata sources using rules for characterizing matches (2011)
3. Cope, B.: Method and apparatus for the creation, location and formatting of digital content (2003)
4. Bordawekar, R., Bandyopadhyay, B., Shmueli, O.: Cognitive database: a step towards endowing relational databases with artificial intelligence capabilities (2017)
5. Cobbett, M., Limburn, J., Oberhofer, M., Schumacher, S., Woolf, O.: Automatic ontology generation (2020)
6. [Data] Trash Information and Data for Education and Solutions (TIDES): Plastic Pollution. https://cscloud-ec2020.opendata.arcgis.com/datasets/data-marinelitter-watch-mlw-plastic-pollution-/explore
7. [Data] Marine Litter Watch (MLW): Plastic Pollution. https://cscloud-ec2020.opendata.arcgis.com/datasets/data-marine-litter-watch-mlw-plastic-pollution-/explore
8. [Data] Marine Debris Monitoring and Assessment Project (MDMAP) Accumulation Report: Plastic Pollution. https://cscloud-ec2020.opendata.arcgis.com/datasets/data-marine-debris-monitoring-and-assessment-project-mdmap-accumulationreport-plastic-pollution
9. [Data] Earth Challenge Integrated Data: Plastic Pollution (MLW, MDMAP, TIDES). https://cscloud-ec2020.opendata.arcgis.com/datasets/data-earth-challenge-integrated-data-plastic-pollution-mlw-mdmap-tides-

Data Segmentation via t-SNE, DBSCAN, and Random Forest

Timothy DeLise(B)
Université de Montréal, Montreal, Canada
[email protected]

Abstract. This research proposes a data segmentation algorithm which combines t-SNE, DBSCAN, and a Random Forest classifier to form an end-to-end pipeline that separates data into natural clusters and produces a characteristic profile of each cluster based on the most important features. Out-of-sample cluster labels can be inferred, and the technique generalizes well on real data sets. We describe the algorithm and provide case studies using the Iris and MNIST data sets, as well as real social media site data from Instagram. This is a proof of concept and sets the stage for further in-depth theoretical analysis.

Keywords: Clustering · t-SNE · Random forest · DBSCAN · Segmentation · Data visualization · Unsupervised learning

1 Introduction

Data segmentation refers to the process of dividing data into clusters and interpreting the characteristics of these clusters; this information can be used for decision-making purposes. It is clustering, but with the additional requirement of understanding the reason behind the clustering and stratification of the data. Data segmentation is widely used in a broad range of fields, from social media site marketing [16] to the analysis of single-cell RNA sequencing [7,9]. There are many possible choices of clustering technique as well as possible methods of interpreting the characteristics of each cluster. This research proposes an intuitive, general purpose data segmentation technique which delivers interpretable clusters and tends to generalize well. The algorithm, pictured in Fig. 1, is comprised of three main steps: t-SNE, DBSCAN, and a Random Forest classifier. In the text, this process is referred to simply as the algorithm.

T-distributed Stochastic Neighbor Embedding (t-SNE) is the basis of the clustering method. It has been chosen because of its vast popularity in the natural sciences [7,8,14], and it is widely regarded as the state of the art in dimension reduction for visualization [12]. T-SNE creates a low-dimensional embedding of high-dimensional data with the ability to retain both local and global structure in a single map. It has proven successful for visualizing high-dimensional data [17]. There is strong evidence to support that t-SNE embeddings recover well-separated clusters from the input data [11]. In practice, high-dimensional data tends to produce distinctly isolated clusters upon visual inspection of the low-dimensional output embedding [7]. The motivation is to harness the intuitive appeal of the t-SNE embedding. However, t-SNE by itself does not label clusters nor provide information about how and why the clusters appear. Moreover, an aspect of t-SNE that detracts from its ability for inference is that there is no direct map from the input space to the output embedding.

The algorithm harnesses t-SNE as an intuitive first step to simply visualize the data. Its great appeal is that we can visually inspect a low-dimensional embedding (an image, for example) and manually pick out clusters. In order to automate this process, we use DBSCAN [4] to extract clusters directly from this low-dimensional embedding. The reason for choosing DBSCAN, as opposed to other density-based clustering algorithms, is that it has a small number of important parameters and is foundational in the field of density-based clustering. The task of extracting visually identifiable clusters from data in the plane is something that the DBSCAN algorithm can confidently accomplish. There is one important parameter of the DBSCAN algorithm, defined in the reference as the Eps-neighborhood of a point p, N_Eps(p), which we will simply call ε. For specific data, it is possible to select a value for ε that separates dense regions into clusters. Through cross validation, optimal ε values can be discovered. In fact, tuning ε can help recover clusters at different levels of resolution.

Fig. 1. This flowchart shows the design of the segmentation analysis algorithm. The data we assume is organized in the usual manner with rows representing individual data points and the columns representing the features of the data. The t-SNE algorithm is applied to the data resulting in a 2-dimensional embedding. Then the DBSCAN algorithm is applied to this embedding, resulting in labeled clusters of the data. Finally these labeled clusters are used as target labels for the random forest classifier, using the original (high dimensional) data as input. The random forest then has the ability to map data points to cluster labels, and also gives access to feature importance scores.

The third and final step of the algorithm uses the cluster labels obtained from the DBSCAN algorithm to train a Random Forest classifier [2]. The utility of the Random Forest is two-fold: to infer cluster labels directly from the input data, and to gain access to feature importance scores. Random Forest was chosen, as well, because of its strong ability to classify data, especially if we have reason to believe the target classes belong to well-separated data points. Moreover, it gives transparency to the question of how the data is separated in the input space via its feature importance scores. This fulfills the interpretability requirement of data segmentation.

An important question to investigate is how well the algorithm generalizes. While data segmentation is inherently an unsupervised learning technique, not concerned with ground-truth data labels, there is a way to understand its ability to generalize. Simply put, if we separate our data set into training and test sets, then the algorithm applied to the training set should create the same clusters as the model applied to the entire set. We then compare the cluster labels given to the test set, using the Random Forest from the algorithm trained on the training data, to the cluster labels of the test set given by the algorithm trained on the entire data set. We find that the algorithm tends to generalize well on experimental data sets, which means that cluster labels of out-of-sample data points can be reliably inferred without retraining the model. It also lends evidence that we can trust the feature importance scores of the Random Forest, which describe how and why the clusters are formed.

The structure of this paper is outlined as follows. Section 2 provides the details of the algorithm as well as the technique to assess its generalizability. Section 3 describes empirical examples of the algorithm applied to three experimental data sets: the Iris data set, a data set of anonymized Instagram data obtained from previous research [6], and the MNIST data set of 70,000 handwritten digits. Section 4 offers interpretation of the results and motivation for future research.

2 Methods

2.1 The Algorithm

The algorithm is composed of three main sub-algorithms: t-SNE, DBSCAN, and a Random Forest classifier. Figure 1 gives an overview. We assume that the data is in the usual format, with rows representing individual data points and columns representing the features. T-SNE creates a 2-dimensional embedding of the data. For the next step, the DBSCAN algorithm is applied to the low-dimensional embedding to produce cluster labels for each data point. Finally, these cluster labels are used to train a Random Forest classifier via supervised learning. The Random Forest model can thus infer cluster labels directly from the raw input data. Certain values of ε reveal the clusters which are visually apparent in the t-SNE embedding. Most values of ε generalize well, although values of ε can be found that generalize extremely well, almost perfectly. In practice we optimize a constant, which is then multiplied by the mean pairwise distance of the t-SNE-embedded data points. For more information about ε tuning via cross validation, please refer to Sect. 2.3. A minimal sketch of the full pipeline is shown below.
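The following is a minimal sketch of the three-step pipeline using scikit-learn, shown on the Iris data set; the constant multiplying the mean pairwise distance to obtain ε is an arbitrary illustrative value, not a tuned one, and dropping DBSCAN noise points is a choice made for this sketch.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier
from scipy.spatial.distance import pdist

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Step 1: t-SNE maps the data to a 2-dimensional embedding.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)

# Step 2: DBSCAN labels clusters in the embedding; eps is a constant times the
# mean pairwise distance of the embedded points (constant chosen arbitrarily here).
eps = 0.08 * pdist(embedding).mean()
cluster_labels = DBSCAN(eps=eps, min_samples=5).fit_predict(embedding)

# Step 3: a Random Forest learns to map the original features to cluster labels,
# which allows out-of-sample inference and exposes feature importance scores.
keep = cluster_labels != -1  # drop DBSCAN noise points
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_scaled[keep], cluster_labels[keep])

print("clusters found:", sorted(set(cluster_labels[keep])))
print("feature importances:", np.round(forest.feature_importances_, 3))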


The Random Forest admits feature importance scores. These scores allow us to understand which features are most influential in separating the data into clusters. Combining these scores with cluster profiles completes the process of segmenting the data, and hence the algorithm.

2.2 Cluster Profiles

We define the cluster profile to be the distribution of the data points of each feature over that cluster, as in Fig. 3. A simple statistic is the mean value. If our input data has n features, then the cluster profile can be represented as an n-dimensional vector of the mean values of each feature over the cluster, as shown in Table 2 and Fig. 7. The cluster profile is thus used to characterize the cluster. The feature importance scores of the Random Forest algorithm allow us to focus on the features which matter the most. For example, we quickly understand that petal length is much more important than sepal width for the purposes of dividing the Iris data into clusters (Table 1). A sketch of this profile computation is given below.
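Computing mean-value cluster profiles amounts to a group-by over the cluster labels; the sketch below uses pandas and assumes the cluster_labels array produced by the DBSCAN step of the pipeline sketch above.

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
profiles = pd.DataFrame(iris.data, columns=iris.feature_names)
profiles["cluster"] = cluster_labels  # labels from the DBSCAN step of the pipeline sketch

# Mean of every feature within each cluster: one row (profile) per cluster,
# ignoring DBSCAN noise points (label -1).
cluster_profiles = (profiles[profiles["cluster"] != -1]
                    .groupby("cluster")
                    .mean())
print(cluster_profiles)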

Fig. 2. The plot on the left is the 2-dimensional embedding that resulted from the t-SNE part of the algorithm. The plot on the right shows the same data points labeled by the cluster labels that were learned using the DBSCAN algorithm applied to the embedding.

Table 1. Iris data set feature importance scores calculated by the random forest classifier.

Score      Feature name
0.555780   Petal length (cm)
0.314322   Petal width (cm)
0.122197   Sepal length (cm)
0.007701   Sepal width (cm)


Fig. 3. The Iris cluster profiles are shown in a Violin Plot, which displays the empirical distribution of the data over each feature, separated by cluster.

2.3 Generalizing Segments

Cluster profiles are developed, and we wish for these clusters and their characteristics to generalize to out-of-sample data points. The algorithm gives a way to infer cluster labels of out-of-sample data points using the Random Forest classifier. Here, we describe a technique for assessing the generalizability of the algorithm. In unsupervised learning scenarios, the data does not contain ground-truth labels, so we take the ground-truth of some particular data point to be the cluster label that is assigned to that data point when the entire data set is run through the algorithm. We then randomly split the whole data set into training (in-sample) and test (out-of-sample) sets in the usual way. In our case we use the 5-fold cross validation technique described in [3]. For each fold of data, the algorithm is run on the training set, which returns the cluster labels of the training data and a Random Forest classifier that maps input data to cluster labels. Finally, we infer the cluster labels of the test set by applying the Random Forest classifier. Classification metrics are computed using the labels obtained from the test set compared to the ground-truth labels that were computed from the entire data set. In Table 4 the weighted averages of the classification metrics over all 5 folds of the data are displayed.

Cross validation is used, as well, on each training set to choose the optimal value of the DBSCAN parameter ε. This makes the generalization procedure quite computation-intensive, since the training set for each fold of the 5-fold cross validation is used to optimize ε by way of 5-fold cross validation. This additional computation time is merited for the purpose of a thorough analysis of the algorithm. In practice the ε parameter can be chosen using less expensive means and validated using cross validation before applying it to the entire data set. We forego the cross validation of the ε parameter in the MNIST experiment for the sake of time savings and additional resolution of the clusters.

The idea of cluster resolution can be illustrated by considering the following thought experiment. A very large value for ε will always produce only one cluster, and this technique will obviously always generalize perfectly. Depending on the data, we may wish to set a lower limit on the number of clusters obtained, thus sacrificing performance for the sake of segmenting the data into more, smaller clusters. This purpose is inherently attained by selecting smaller values for ε. By lowering ε we derive more clusters from the data, but this also creates more singleton (and extremely small) clusters which detract from the generalizing performance. For the Iris data set we set the lower limit on the number of clusters to 2, and for the Instagram data we set the limit to 5. In the MNIST data experiment we intentionally set ε small enough to reveal the 10 main clusters of the data, thereby creating many small and singleton clusters.

The cluster labels obtained from the analysis of the training data set need not match the labels obtained from the whole data set. The reason is that the cluster label name is chosen somewhat arbitrarily, in that we always label the largest cluster as cluster 0, the next largest cluster as cluster 1, and so on. In fact, it is common for the training set to produce a different number of clusters than the whole data set altogether. We have developed a technique to address this by matching the clusters obtained from the training set with those from the entire data set. It is an iterative procedure that matches the clusters with the largest intersection first. Details about this procedure are supplied in Appendix A. An alternative technique to compute the out-of-sample classification metrics is to map out-of-sample data points to their embedded location, something that is not possible in the original t-SNE algorithm but has been implemented in the openTSNE software package [15]. We chose not to use this technique in order to focus on the utility of the Random Forest step of the algorithm. However, one should be able to show similar results using the inference mapping of openTSNE.

2.4 Software

The software used for the experiments will be made freely available on GitHub. It is a conglomerate of customized code and algorithms with existing software packages. Scikit-learn [13] was used for the Random Forest and DBSCAN implementations as well as data scaling and classification metrics. FIt-SNE [10] was used for the t-SNE computations, as it is fast and has shown success visualizing the MNIST data set.

Table 2. Instagram data set cluster profiles. For each of the five clusters we display the mean values of the top five important features over each cluster.

Cluster   Follows   Average shortest path   Diameter   Clique count   Node count
0         38.07     186.98                  126.41     1.86           0.63
1         345.61    585.47                  587.62     1.90           0.68
2         0.13      0.00                    0.00       0.00           0.00
3         113.09    264.17                  180.31     0.98           0.61
4         122.24    328.97                  260.71     1.30           0.63


Fig. 4. The plot on the left shows the 2-dimensional embedding of the Instagram data set that resulted from t-SNE. On the right side is the same embedding with the cluster labels given by the DBSCAN step.

Table 3. Instagram data set feature importance table, showing the top ten most important features, ordered by score.

Score      Feature name
0.110567   Follows
0.100257   Average shortest path
0.091884   Diameter
0.080912   Clique count
0.057833   Node count
0.054657   Followed by
0.049908   Follow ratio
0.046561   Edge connectivity
0.045851   Edge count
0.044900   Node connectivity
0.044383   Average connectivity

3 Empirical Results

3.1 Iris Data Set

The Iris flower data set [1,5] is a famous, elegant, and freely available data set that displays intrinsic clusters. Figures 2 and 3 and Table 1 display the output of the algorithm applied to the Iris flower data set. There are clearly 2 clusters in the data. Table 4 shows the classification metrics for each generalization experiment. Referring to Fig. 3, the goal is to understand why the constituents of each cluster have been grouped together. The cluster labels are assigned by the number of data points in each cluster. We see that cluster 0 is characterized by longer petal length, petal width, and sepal length than cluster 1, while having shorter sepal width. This simple visualization tool, while by no means exhaustive, already offers substantial insight into the descriptive attributes of each cluster. There is very little overlap between the clusters in the distributions of petal length and width. We understand that petal length and width are more important for inferring these clusters than sepal length and width. This matches precisely with the feature importance scores of Table 1. Those familiar with the data set will know that these are measurements from three types of flowers: Iris Setosa, Iris Versicolor, and Iris Virginica. The measurements from Versicolor and Virginica tend to mix, while those of Setosa are quite separate. Correspondingly, the segmentation analysis identified two clusters rather than three.

3.2 Instagram Data

The analysis in this section follows the same steps as the previous section, the only difference being that we substitute the input data. The data was obtained from a previous study [6] and is completely anonymized. The features contain simple metrics about Instagram users, such as the number of followers, likes, tags, etc. We also calculated several social network attributes based on the raw data. This data set contains 3,229 data points and 27 features. For more information about the data, please refer to [6]. Figure 4 shows the clustering results from the segmentation analysis. The feature importance scores of Table 3, combined with the cluster profiles of Table 2, give us the defining characteristics of the clusters. Cluster 0 is the largest cluster and cluster 1 is the next largest, corresponding to the blue and orange clusters of Fig. 4, respectively. Cluster 0 is described by data points with lower values of follows, average shortest path, and diameter, while cluster 1 has higher values for these important features. We see that cluster 2, which is the third largest cluster, has a mean that is zero or almost zero across the important features; these are seemingly empty accounts. The segmentation analysis makes it easy to understand how the data is stratified.

Fig. 5. MNIST database t-SNE embedding and top ten derived clusters. Notice the many small sporadic clusters that are produced around the edges of the main clusters.


Fig. 6. A heat map of the feature importance scores learned by the random forest step of the algorithm applied to the MNIST data set. In this experiment, features correspond to pixels, so we display the feature scores that correspond to each pixel. We notice that the important features are located toward the center of the image, which is the area of the image where the digits appear.

Fig. 7. The cluster profiles for the top 12 clusters (by size) derived from the MNIST data set. The cluster profile is the mean of each feature over the cluster. Just like Fig. 6, the features correspond to pixels, so the cluster profiles are displayed as images, where each pixel is the mean of all the respective pixels from each cluster. We notice that the first 10 clusters are all substantially larger than the remaining clusters; this is also apparent from the embedding image of Fig. 5. Each of the ten largest clusters is representative of one of the ten digits.

3.3 MNIST Case Study

There are, inevitably, settings for which the default parameters of t-SNE do not quite produce the best embedding, and tuning t-SNE can sometimes produce better visual clusters. The purpose of this section is to illustrate that the algorithm is robust with regard to parameter tuning. Here we address the MNIST data set of 70,000 hand-written images. The embedding produced using the default parameters for t-SNE does not clearly separate ten clusters. However, by using late exaggeration [9], the authors of [10] show that clusters clearly appear in the produced embedding. Although this data set contains 10 distinct data labels corresponding to each of the first ten digits, it has traditionally been difficult for clustering algorithms to clearly identify clusters corresponding to these ten digits. Even though the embedding on the left side of Fig. 5 seems to show ten distinct clusters, a few of the clusters are slightly touching in certain regions. ε has been adjusted in order to capture the ten main segments of the data. In doing so, a bit of performance was sacrificed in that many very small clusters, often singleton clusters, were identified, which are very difficult to generalize. Nevertheless, we found that the clusters identified in this way still generalize well by the weighted average measure. The feature importance scores highlight the important features that contribute to the separation of the clusters. Since each feature corresponds to a pixel, this conveniently gives an intuitive interpretation, where we can visualize the important pixels spatially on a two-dimensional image in Fig. 6. The result agrees with our intuition that the middle section of the image should be most important for separating the data into clusters. Finally, we visualize the cluster profiles in Fig. 7 in image form as well. The ten largest clusters, in fact, correspond to representations of the ten digits. By focusing on the largest clusters and the most important features, we can understand the vast majority of the data.

3.4 Generalization Performance

Table 4. Classification metrics for each of the experiments, computed using the weighted average method. These numbers represent the mean of each weighted score across all five folds of the data.

Data set    Accuracy   Precision   Recall   F1-score
Iris        1          1           1        1
Instagram   0.943      0.996       0.943    0.967
MNIST       0.916      0.918       0.916    0.911

Data Segmentation via t-SNE, DBSCAN, and Random Forest

149

classified into the correct cluster by the Random Forest classifier. The good performance emphasizes our belief that the feature importance scores produced by the Random Forest are useful. Moreover, we are confident that the information gained from the algorithm extends to a broader population.

4

Conclusion and Future Work

It deserves to be written that this research is superficial in nature and relies on statistical evidence as a proof-of-concept. The author intends to further develop a theoretical understanding. The value of this paper is for the engineer or data science practitioner who needs to get answers from data for which there is little understanding. It relies on the success of t-SNE for visualizing data and adds a layer of interpretability in a practical sense. The additional step is subtle but important for mission-critical applications. There are some theoretical connections to be made between the t-SNE and DBSCAN step. One of the main results of [11] is a theoretical guarantee that all clusters of the input data will be mapped to balls in the embedding which can be made arbitrarily small. It is plausible that DBSCAN can successfully identify such balls in a low-dimensional space based on the ideas of connectivity and reachability in density-based clustering. The remaining piece is to create a formal argument that if the DBSCAN algorithm has identified a cluster in the embedding space, then this must correspond to a cluster in the input space. This will be a topic of future research. Random Forest has been a robust supervised learning tool for a long time. If we guarantee that we have labeled actual clusters in the input data, which should follow from the previous paragraph, then we should expect the Random Forest classifier to be able to successfully classify these data points. There should be a way to statistically guarantee that Random Forests can classify disjoint clusters of data. This is another direction of future research. The outline given in these two paragraphs should deliver a more substantial theoretic argument for why this algorithm can dependability be used for data segmentation.

Appendix A

Matching Clusters for Generalization Analysis

This section describes the algorithm used to match clusters between the entire data set and the training data sets that are split during each fold of the generalization analysis. As mentioned in the text, we perform the equivalent of 5-fold cross validation to calculate the average weighted f1-score across all 5-folds of the data. During each fold, we needed a technique to pair the clusters derived from training data with the clusters from the entire data set. This comes down to matching cluster labels, since the labels assigned to each cluster do not necessarily match between runs of the algorithm.


The effect of this matching is really quite subtle. Let us do a simple thought experiment by considering the Iris data set, where we saw two main clusters. The clusters derived from the entire data set might be labeled cluster 0 and cluster 1. The clustering results from the training data during one of the folds could have derived two main clusters as well; however, the algorithm could have labeled cluster 0 as cluster 1, and cluster 1 as cluster 0. The matching outlined in this section simply gives us a quick technique to match those labels. The technique is also robust to the situation where the training data yields a different number of clusters than the entire data set. Algorithm 1 will match as many clusters as it can, in a largest-first fashion. The optimal algorithm would consider all permutations of the clusters of the training data compared to the entire data set, aiming to maximize the intersection of all the clusters; however, this can require too many computations on large data sets. We sacrifice a bit of performance in terms of F1-score in exchange for considerable time savings.

Algorithm 1: Algorithm for Matching Cluster Labels Between the Entire Data Set and the Training Data Set
Result: BestPerm is a list that maps the cluster labels from AllClusters to TrainClusters. The index position in BestPerm corresponds to the cluster label number of TrainClusters, and the value in that position corresponds to the cluster label of AllClusters.
TrainClusters is a list of clusters from the training data;
AllClusters is a list of clusters from the entire data set;
for Cluster1 in TrainClusters do
    BestSum = 0;
    BestCluster = None;
    for Idx, Cluster0 in AllClusters do
        ThisSum = size of the intersection of Cluster1 and Cluster0;
        if ThisSum is greater than BestSum and Idx not in BestPerm then
            BestSum = ThisSum;
            BestCluster = Idx;
        end
    end
    BestPerm.append(BestCluster);
end
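A compact Python rendering of Algorithm 1 may make the matching easier to follow. The sketch below is illustrative rather than the author's code; clusters are represented as sets of sample indices, and TrainClusters is assumed to be ordered largest-first, as described above.

# A minimal Python sketch of Algorithm 1 (greedy, largest-intersection-first
# label matching); cluster arguments are lists of sets of sample indices.
def match_cluster_labels(train_clusters, all_clusters):
    """Return best_perm, where best_perm[i] is the index of the cluster in
    all_clusters matched to train_clusters[i] (or None if nothing matches)."""
    best_perm = []
    for cluster1 in train_clusters:              # assumed ordered largest-first
        best_sum, best_cluster = 0, None
        for idx, cluster0 in enumerate(all_clusters):
            this_sum = len(cluster1 & cluster0)  # size of the intersection
            if this_sum > best_sum and idx not in best_perm:
                best_sum, best_cluster = this_sum, idx
        best_perm.append(best_cluster)
    return best_perm

# Example: two training clusters whose labels are swapped relative to the
# clusters found on the full data set.
train = [{0, 1, 2}, {3, 4, 5}]
full = [{3, 4, 5, 6}, {0, 1, 2, 7}]
print(match_cluster_labels(train, full))         # -> [1, 0]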

References 1. Anderson, E.: The species problem in Iris. Ann. Mo. Bot. Gard. 23(3), 457–509 (1936) 2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/ 10.1023/A:1010933404324


3. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees, 1st edn. Wadsworth Inc., Belmont (1984) 4. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise, pp. 226–231. AAAI Press (1996) 5. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(7), 179–188 (1936) 6. Hassanpour, S., Tomita, N., DeLise, T., Crosier, B., Marsch, L.A.: Identifying substance use risk based on deep neural networks and Instagram social media data. Neuropsychopharmacology 44(3), 487–494 (2019) 7. Kobak, D., Berens, P.: The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10(1), 5416 (2019) 8. Li, W., Cerise, J.E., Yang, Y., Han, H.: Application of t-SNE to human genetic data. J. Bioinform. Comput. Biol. 15, 06 (2017) 9. Linderman, G.C., Rachh, M., Hoskins, J.G., Steinerberger, S., Kluger, Y.: Efficient algorithms for t-distributed stochastic neighborhood embedding. CoRR, abs/1712.09005 (2017) 10. Linderman, G.C., Rachh, M., Hoskins, J.G., Steinerberger, S., Kluger, Y.: Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16(3), 243–245 (2019) 11. Linderman, G.C., Steinerberger, S.: Clustering with t-SNE, provably. CoRR, abs/1706.02582 (2017) 12. McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2020) 13. Pedregosa, F.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 14. Platzer, A.: Visualization of SNPs with t-SNE. PLoS ONE 8(2), e56883 (2013) 15. Poličar, P.G., Stražar, M., Zupan, B.: openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. bioRxiv (2019) 16. Rajagopal, S.: Customer data clustering using data mining technique. Int. J. Database Manag. Syst. 3, 12 (2011) 17. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

GeoTree: A Data Structure for Constant Time Geospatial Search Enabling a Real-Time Property Index Robert Miller(B) and Phil Maguire National University of Ireland, Maynooth, Kildare, Ireland {robert.miller,phil.maguire}@mu.ie

Abstract. A common problem appearing across the field of data science is k -NN (k-nearest neighbours), particularly within the context of Geographic Information Systems. In this article, we present a novel data structure, the GeoTree, which holds a collection of geohashes (string encodings of GPS co-ordinates). This enables a constant O (1) time search algorithm that returns a set of geohashes surrounding a given geohash in the GeoTree, representing the approximate k-nearest neighbours of that geohash. Furthermore, the GeoTree data structure retains an O (n) memory requirement. We apply the data structure to a property price index algorithm focused on price comparison with historical neighbouring sales, demonstrating an enhanced performance. The results show that this data structure allows for the development of a real-time property price index, and can be scaled to larger datasets with ease.

Keywords: GeoTree · Geospatial · k-NN · Data structure · Price index

1 Introduction

Large scale datasets are a hot topic in computer science. Each one tends to present its own problems and intricacies [9]. The Nearest Neighbour (NN) problem is a well known and vital facet of many data mining research topics. This involves finding the nearest data point to a given point under some metric which measures the distance between data points. In the context of geospatial data, the NN problem often emerges in the form of geographical proximity search [24]. Real world geographic data is usually represented by a pair of GPS coordinates, which pinpoint any location on Earth with unlimited precision. As a result of their structure, computing the distance between pairs of points in order to find the nearest neighbour can be extremely slow on large datasets. The problem often requires expansion to finding the k nearest neighbours (k-NN), which further increases the complexity by requiring a sorting of the distance matrix in order to extract a ranking of points by proximity. It is extremely computationally expensive to compute and rank these distances on large datasets [25]. A computationally cheap method of solving this problem would vastly improve the scalability of proximity based algorithms [24]. We propose a data structure which enables such cheap computation, the GeoTree, and explore its potential when applied to a real-world geospatial task.

2 Background

2.1 Naive Geospatial Search

The distance between two pieces of geospatial data defined using the GPS coordinate system is computed using the haversine formula [23]. If we wish to find the closest point in a dataset to any given point in a naive fashion, we must loop over the dataset and compute the haversine distance between each point and the given, fixed point. This is an O (n) computation. If the distances are to be stored for later use, this also requires O (n) memory consumption. Thus, if the closest point to every point in the dataset must be found, this requires an additional nested loop over the dataset, resulting in O (n²) memory and time complexity overall (assuming the distance matrix is stored). If such a computation is applied to a large dataset, such as the 147,635 transactions used in the house price index developed by [14], an O (n²) algorithm can run extremely slowly even on powerful modern machines. As GPS co-ordinates are multi-dimensional objects, it is difficult to prune and cut data from the search space without performing the haversine computation. With a considerable portion of big data being geospatial in nature, geospatial algorithms and data structures are coming under increased research attention, with the amount of personal location data available growing by approximately 20% year-on-year according to the McKinsey Global Institute [13]. As such, exploring alternative methods of representing GPS co-ordinates is necessary to make algorithmic improvements.
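For concreteness, the naive baseline can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the paper, and the sample coordinates are arbitrary.

# A sketch of the naive baseline: an O(n) haversine scan per query point
# (and hence O(n^2) to find the nearest neighbour of every point).
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) pairs p and q."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def nearest(points, query):
    """Linear scan over all points; this is the cost the GeoTree avoids."""
    return min(points, key=lambda p: haversine_km(p, query))

sales = [(53.3498, -6.2603), (53.2707, -9.0568), (51.8985, -8.4756)]
print(nearest(sales, (53.34, -6.26)))   # -> (53.3498, -6.2603)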

2.2 GeoHash

A geohash is a string encoding for GPS co-ordinates, allowing co-ordinate pairs to be represented by a single string of characters. The publicly-released encoding method was invented by Niemeyer in 2008 [20]. The algorithm works by assigning a geohash string to a square area on the earth, usually referred to as a bucket. Every GPS co-ordinate which falls inside that bucket will be assigned that geohash. The number of characters in a geohash is user-specified and determines the size of the bucket. The more characters in the geohash, the smaller the bucket becomes, and the greater precision the geohash can resolve to. While geohashes thus do not represent points on the globe, as there is no limit to the number of characters in a geohash, they can represent an arbitrarily small square on the globe and thus can be reduced to an exact point for practical purposes. Figure 1 demonstrates parts of the geohash grid on a section of map.


Fig. 1. GeoHash algorithm applied to a map

Geohashes are constructed in such a way that their string similarity signifies something about their proximity on the globe. Take the longest sequential substring of identical characters possible from two geohashes (starting at the first character of each geohash) and call this string x. Then x itself is a geohash (i.e. a bucket) with a certain area. The longer the length of x, the smaller the area of this bucket. Thus x gives an upper bound on the distance between the points. We will refer to this substring as the smallest common bucket (SCB) of a pair of geohashes. We define the length of the SCB as the length of the substring defining it. This definition can additionally be generalised to a set of geohashes of any size. Furthermore, we define the SCB of a single geohash g to be the set of all geohashes in the dataset which have g as a prefix. We can immediately assert an upper bound of 123,264 m for the distance between the geohashes in Fig. 2, as per the table of upper bounds in the pygeohash package [15].

geohash 1: c1 c2 c3 x4 … xn
geohash 2: c1 c2 c3 y4 … yn
where xi ≠ yi for i ∈ {4, …, n}, so the SCB of the two geohashes is c1 c2 c3

Fig. 2. Geohash precision example
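The SCB of a pair of geohashes is simply their longest common prefix, which can be computed directly from the strings. The following sketch is illustrative; the example geohashes are those used in the worked example of Sect. 3.1.

# Sketch of the smallest common bucket (SCB) of a pair of geohashes:
# the longest shared prefix, which is itself a geohash bucket.
def scb(g1: str, g2: str) -> str:
    prefix = []
    for a, b in zip(g1, g2):
        if a != b:
            break
        prefix.append(a)
    return "".join(prefix)

# The longer the SCB, the smaller the shared bucket and hence the tighter
# the upper bound on the distance between the two points.
print(scb("gc7j98", "gc9aaj"))       # -> "gc", an SCB of length 2
print(len(scb("gc7j98", "gc7j9d")))  # -> 5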

2.3 Efficiency Improvement Attempts

Geohashing algorithms have, over time, improved in efficiency and have been put to use in a wide variety of applications and research contexts [17, 18]. As stated by [24], the efficient execution of nearest neighbour computations requires the use of niche spatial data structures which are constructed with the proximity of the data points being a key consideration. The method proposed by Roussopoulos et al. [24] makes use of R-trees, a data structure very similar in nature to the geohash [8]. They propose an efficient algorithm for the precise NN computation of a spatial point, and extend this to identify the exact k-nearest neighbours using a subtree traversal algorithm which demonstrates improved efficiency over the naive search algorithm. Arya et al. [2] further this research by introducing an approximate k -NN algorithm with time complexity of O (kd log n) for any given value of k. A comparison of some data structures for spatial searching and indexing was carried out by [12], with a specific focus on comparison between the aforementioned R-trees and Quadtrees, including application to large real-world GIS datasets. The results indicate that the Quadtree is superior to the R-tree in terms of build time due to expensive R-tree clustering. As a trade-off, the Rtree has faster query time. Both of these trees are designed to query for a very precise, user-defined area of geospatial data. As a result they are still quite slow when making a very large number of queries to the tree. Beygelzimer et al. [4] introduce another new data structure, the cover tree. Here, each level of the tree acts as a “cover” for the level directly beneath it, which allows narrowing of the nearest neighbour search space to logarithmic time in n. Research has also been carried out in reducing the searching overhead when the exact k -NN results are not required, and only a spatial region around each of the nearest neighbours is desired. It is often the case that ranged neighbour queries are performed as traditional k -NN queries repeated multiple times, which results in a large execution time overhead [3]. This is an inefficient method, as the lack of precision required in a ranged query can be exploited in order to optimise the search process and increase performance and efficiency, a key feature of the GeoTree. Muja et al. provide a detailed overview of more recently proposed data structures such as partitioning trees, hashing based NN structures and graph based NN structures designed to enable efficient k -NN search algorithms [19]. The suffix-tree, a data structure which is designed to rapidly identify substrings in a string, has also had many incarnations and variations in the literature [1]. The GeoTree follows a somewhat similar conceptual idea and applies it to geohashes, allowing very rapid identification of groups of geohashes with shared prefixes. The common theme within this existing body of work is the sentiment that methods of speeding up k -NN search, particularly upon data of a geospatial nature, require specialised data structures designed specifically for the purpose of proximity searching [24].

3 GeoTree

The goal of our data structure is to allow efficient approximate ranged proximity search over a set of geohashes. For example, given a database of house data, we wish to retrieve a collection of houses in a small radius around each house without having to iterate over the entire database. In more general terms, we wish to pool all other strings in a dataset which have a maximal length SCB with respect to any given string.

3.1 High-Level Description

A GeoTree is a general tree (a tree which has an arbitrary number of children at each node) with an immutable fixed height h set by the user upon creation. Each level of the tree represents a character in the geohash, with the exception of level zero - the root node. For example, at level one, the tree contains a node for every character that occurs among the first characters of each geohash in the database. For each node in the first level, that node will contain children corresponding to each possible character present in the second position of every geohash string in the dataset sharing the same first character as represented by the parent node. The same principle applies from level three to level h of the GeoTree, using the third to hth characters of the geohash, respectively. At any node, we refer to the path to that node in the tree as the substring of that node, and represent it by the string where the ith character corresponds to the letter associated with the node in the path at depth i. The general structure of a GeoTree is demonstrated in Fig. 3. As can be seen, the first level of the tree has a node for each possible letter in the alphabet. Only characters which are actually present in the first letters of the geohashes in our dataset will receive nodes in the constructed tree. We, however, include all characters in this diagram for clarity. In the second level, the a node also has a child for each possible letter. This same principle applies to the other nodes in the tree. Formally, at the ith level, each node has a child for each of the characters present among the (i + 1)th position of the geohash strings which are in the SCB of the current substring of that node. A worked example of a constructed GeoTree follows in Fig. 4. Consider the following set of geohashes which has been created for the purpose of demonstration: {gc7j98, gc7j98, gd7j98, ac7j98, gc9aaj, gc7j9d, ac7j98, gd7jya, gc9aaj}. The GeoTree generated by the insertion of the geohashes above with a fixed height of six would appear as seen in Fig. 4. 3.2

GeoTree Data Nodes

The data attributes associated with a particular geohash are added as a child of the leaf node of the substring corresponding to that geohash in the tree, as shown in Fig. 5. In the case where one geohash is associated with multiple data entries, each data entry will have its own node as a child of the geohash substring, as demonstrated in the diagram.


Fig. 3. GeoTree general structure

Fig. 4. Sample GeoTree structure

It is now possible to collect all data entries in the SCB of a particular geohash substring without iterating over the entire dataset. Given a particular geohash in the tree, we can move any number of levels up the tree from that geohash's leaf nodes and explore all nearby data entries by traversing the subtree given by taking that node as the root. Thus, to compute the set of geohashes with an SCB of length m or greater with respect to the particular geohash in question, we need only explore the subtree at level m along the path corresponding to that particular geohash. Despite this improvement, we wish to remove the process of traversing the subtree altogether.

3.3 Subtree Data Caching

In order to eliminate traversal of the subtree we must cache all data entries in the subtree at each level. To cache the subtree traversal, each non-leaf node receives an additional child node which we will refer to as the list (ls) node. The list node holds references to every data entry that has a leaf node within the same subtree as the list node itself. As a result, the list node offers an instant enumeration of every leaf node within the subtree structure in which it sits, removing the need to traverse the subtree and collect the data at the leaf nodes. The structure of the tree with list nodes added is demonstrated in Fig. 6 (some nodes and list nodes are omitted for the sake of brevity and clarity).

Fig. 5. GeoTree structure with data nodes

Fig. 6. GeoTree structure with list nodes

3.4 Retrieval of the Subtree Data

Given any geohash, we can query the tree for a set of nearby neighbouring geohashes by traversing down the GeoTree along some substring of that geohash. A longer substring will correspond to a smaller radius in which neighbours will be returned. When the desired level is reached, the cached list node at that level can be queried for instant retrieval of the set of approximate k-NN of the geohash in question. As a result of this structure's design, the GeoTree does not produce a distance measure for the items in the GeoTree. Rather, it clusters groups of nearby data points. While this does not allow for fine tuning of the search radius, it allows a set of data points which are geospatially close to the specified geohash to be retrieved in constant time.

3.5 Memory Requirement of the Data Structure

As each geohash is associated with only one character at each level of the GeoTree, only one node on each level will hold that geohash's data entry in its list node. Thus, each data entry is inserted into one single list node at every level of the tree. Given a tree of height h, this means that the data will be stored in h different list nodes in addition to the one leaf node which the data receives. If the dataset is of size n, then there will be (h + 1) ∗ n data entries stored in the tree. However, as the height of the tree is fixed and specified prior to the building of the tree, the overall memory requirement of the GeoTree is O (n). This can be further improved to only n data entries stored by collecting a set of the data once in memory and filling the list nodes with a list of pointers to the data entries, if necessary.

3.6 Technical Implementation

To touch briefly on the implementation of GeoTree [16], a nested hash map structure is used in order to store the tree. The root node is the root hash map of the nest, with the hash keys at this level corresponding to the letters of the level one nodes. Each of these keys points to a value which is another hash map containing keys corresponding to the level two letters of geohashes which have matching first letters with the parent key. The nesting process continues down to the leaf nodes (or terminal hash values in this case) in the same fashion described in Subsect. 3.1. The final hash key (representing the last character of the geohash) points to the list of data entries associated with that geohash.
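A minimal sketch of this nested hash map structure, including the list-node caching of Subsect. 3.3, is given below. It is illustrative only and is not the released implementation [16]; the record contents and the chosen height are assumptions made for the example.

# An illustrative nested-dict GeoTree with per-level list-node caching.
class GeoTree:
    def __init__(self, height: int):
        self.height = height
        self.root = {"ls": []}                 # "ls" plays the role of the list node

    def insert(self, geohash: str, record) -> None:
        node = self.root
        node["ls"].append(record)
        for ch in geohash[: self.height]:      # one tree level per character
            node = node.setdefault(ch, {"ls": []})
            node["ls"].append(record)          # cache the record at every level

    def query(self, geohash: str, level: int):
        """Return every record whose geohash shares the first `level`
        characters with `geohash` (an approximate ranged k-NN query)."""
        node = self.root
        for ch in geohash[: min(level, self.height)]:
            node = node.get(ch)
            if node is None:
                return []
        return node["ls"]

tree = GeoTree(height=6)
tree.insert("gc7j98", {"price": 250_000})
tree.insert("gc7j9d", {"price": 240_000})
print(tree.query("gc7j98", level=5))           # both records: shared prefix "gc7j9"

Both insert and query touch at most height-many dictionary levels, so for a fixed height they are O (1) operations, as argued in the next subsection.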

3.7 Time Complexity

Building (Insertion). As hash maps offer O (1) insertion, insertion of data at each level of the GeoTree is O (1). Furthermore, due to the height of the tree, h, being constant and fixed, insertion of entries to the GeoTree is an O (1) operation overall.

SCB Lookup. The O (1) lookup of hash maps also means that the tree can be traversed in steps of O (1) time. As the list nodes hold the SCB of every geohash substring possible from those in the dataset, and a maximum of h SCBs will need to be queried, it follows that any SCB lookup is also O (1).

3.8 Comparison with the Prefix Tree (Trie)

The GeoTree data structure shares a number of similarities with the prefix tree or trie data structure [5]. A trie is a search tree which utilises its ordering and structure to increase searching efficiency across its inserted strings. Each branch represents a character and thus as you traverse down the trie, you build the prefix of a word, working toward an entire word at each leaf node. This is very similar to the GeoTree, as the geohash encodings of properties take the place of words in this use case and traversing the GeoTree builds prefixes of geohash strings. Both data structures make use of structure to make search more efficient, however, in the case of the GeoTree, the ordering has geographical significance rather than the semantic meaning in the prefix tree.


One key difference between tries and GeoTrees lies in the subtree data caching step. As the GeoTree relies on being able to query every entry in the subtree of a particular node, the caching is necessary to quickly return a large number of property records. In the case of a prefix tree, it would be necessary to enumerate every path in the subtree to retrieve all of the words. In the use case which the GeoTree is being applied to, this would result in a significant increase in execution time over a very large dataset. The GeoTree data structure could be thought of as a variant or augmentation of the trie, one which is specifically designed to give a fast, approximate solution to k -NN on geospatial datasets.

4 Real-World Performance

4.1 Application: House Price Index Algorithm

In order to test the performance of GeoTree in practice, we applied it to the computation of an Irish house price index. House price indexes and forecasting models have come under increased attention from a data mining context, with a view to improve the current methods of calculating and forecasting property price changes. Such algorithms could help identify price bubbles, facilitating preemptive measures to avoid another market collapse [6, 10, 11]. Many of these algorithms are based around the mix-adjusted median or central price tendency model, which requires a geospatial k -NN search [7, 14]. This approach is based on the principle that large amounts of aggregated data will cancel noise and result in a stable, smooth signal. It also offers the benefit of being less complex than the highly-theoretical hedonic regression model. It also requires less data than the repeat-sales model, in the sense of both quantity and time period spread [7, 14,22]. Maguire et al. [14] introduced an enhanced central-price tendency model which outperformed the robustness of the hedonic regression method used by the Irish Central Statistics Office [21]. The primary limitation of this method is the algorithmic complexity and brute-force nature of the geospatial search, which impinges on its scalability to larger datasets, and restricts the introduction of further parameters. Our aim was to apply the GeoTree data structure to improve the execution time, scalability and robustness of this method. We re-implemented the algorithm used by [14] (described below), running the algorithm on the same data set (Irish Property Price Register) used in the original article as a control test for performance before introducing the GeoTree. For the purposes of algorithmic complexity calculation, we let n be the average number of house sales present in one month of the dataset, and let t be the number of months of data in the dataset. Stage two (voting) of the original algorithm is executed as follows: ⇒ Iterate over each month, m, of the dataset (t operations)


⇒ Iterate over each house, h, sold during m (n operations)
⇒ Iterate over houses sold in m to find the nearest to h (n operations*)

Stage four (stratification) of the original algorithm is executed as follows:

⇒ Iterate over each month, m, of the dataset (t operations)
⇒ Iterate over each house, h, sold during m (n operations)
⇒ Iterate over each month prior to m, mp ((t−1)/2 operations)
⇒ Iterate over houses sold in mp to find the nearest to h (n operations*)

By introducing the GeoTree to the algorithm, the steps which formerly required an O (n) iteration over all houses in the dataset to identify the nearest house (marked by an asterisk) now become an O (1) GeoTree ranged proximity search operation. There is, however, a mild trade-off. Rather than returning the closest property to the house in question, the GeoTree structure instead returns everything in a small area around the house (formally, it returns the maximal length non-empty SCB for that house's geohash). The bucket can then be iterated over to find the true closest property, or an alternative strategy can be employed, such as taking the median price of all houses within the small area.
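The modified step can be sketched as follows. For simplicity, the sketch below uses a flat dictionary of geohash-prefix buckets rather than the full GeoTree; the bucket length, field layout and sample records are assumptions made for illustration.

# Sketch of the modified neighbour lookup: instead of a linear scan for the
# single nearest sale, look up the geohash bucket around the house and take
# the median price of everything in it.
from collections import defaultdict
from statistics import median

BUCKET_LEN = 5                                  # geohash prefix length ~ search radius

def build_buckets(sales):
    """sales: iterable of (geohash, price) pairs -> prefix -> list of prices."""
    buckets = defaultdict(list)
    for geohash, price in sales:
        buckets[geohash[:BUCKET_LEN]].append(price)
    return buckets

def neighbour_price(buckets, geohash):
    prices = buckets.get(geohash[:BUCKET_LEN], [])
    return median(prices) if prices else None   # None -> fall back to a shorter prefix

prior_sales = [("gc7j98", 250_000), ("gc7j9d", 240_000), ("gc9aaj", 310_000)]
buckets = build_buckets(prior_sales)
print(neighbour_price(buckets, "gc7j9x"))       # -> 245000.0 (median of the gc7j9* bucket)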

4.2 Performance Results

Table 1 compares the performance of the algorithms described previously with and without GeoTrees (on a database of 279,474 property sale records), including both single-threaded execution time and multi-threaded execution time (running eight threads across eight CPU cores) on our test machine. The results using the GeoTree are marked with a + symbol.

4.3 Correlation

Despite the algorithmic alteration of taking the median price of a group of geohashed nearest neighbours, as opposed to the nearest neighbour per se, the house price indexes produced by the original algorithm and the GeoTree-enhanced version are very similar. Figure 7 shows both versions of the Residential Property Price Index (RPPI) superimposed. The two different versions yielded highly correlated outputs (Pearson’s r = 0.999, Spearman’s ρ = 0.997, Kendall’s τ = 0.966), revealing that GeoTree succeeded in delivering an almost identical index to the original one, though with major performance gains in execution time.

Table 1. Complexity and performance of the algorithms

Algorithm  Complexity        μ (1 core)a           σb       μ (8 cores)a          σb
Voting     O (n²t)           233.54 sc             2.37%    46.73 sc              1.69%
Voting+    O (nt)            12.78 sc              1.68%    3.02 sc               0.69%
Stratify   O (n²t(t−1)/2)    29.03 h               2.41%    4.19 h                1.89%
Stratify+  O (nt(t−1)/2)     ∼0.05 h (163.89 s)    1.71%    ∼0.01 h (39.63 s)     0.85%
Overall    O (n²t(t+1)/2)    29.11 h               2.43%    4.21 h                1.90%
Overall+   O (nt(t+1)/2)     ∼0.05 h (177.73 s)    1.67%    ∼0.01 h (43.71 s)     0.79%

a Execution times reported are the mean (μ) of ten trials.
b Standard deviation (σ) reported as a percentage of the mean (μ).
c Includes build time for the dataset array/GeoTree on the dataset, as applicable.
d All algorithms computed using an AMD Ryzen 2700X CPU.
e All algorithms executed on the Irish Residential Property Price Register database of 279,474 property sale records as of time of execution.

Fig. 7. Irish RPPI (GeoTree vs original), from 02-2011 to 09-2018

4.4 Scalability Testing

In order to test the scalability of the GeoTree, we obtained a dataset comprising 2,857,669 property sale records for California, and evaluated both the build and query time of the data structure. Table 2 shows mean build time and mean


query time on both 10% (∼285,000 records) and 100% (∼2.85 million records) of the dataset. In this context, query time refers to the total time to perform 100 sequential queries, as a single query was too fast to accurately measure. The results demonstrate that the height of the tree has a modest effect on the build time, while dataset size has a linear effect on build time, thus supporting the claimed O (n) build time with O (1) insertion. Furthermore, query time is shown to remain constant regardless of both tree height and dataset size, with negligible differences in all instances.

Table 2. Scalability performance of GeoTree

Height h             4                  5                  6                  7                  8
Build Time (10%)a    17.63 s (0.08 s)   18.10 s (0.10 s)   18.46 s (0.22 s)   18.84 s (0.08 s)   19.39 s (0.09 s)
Build Time (100%)b   179.67 s (0.58 s)  183.80 s (0.57 s)  183.99 s (0.52 s)  192.06 s (0.60 s)  194.31 s (0.94 s)
Query Time (10%)c    5.1 ms (0.3 ms)    5.2 ms (0.4 ms)    5.3 ms (0.9 ms)    5.3 ms (0.4 ms)    5.3 ms (0.5 ms)
Query Time (100%)c   5.4 ms (1.0 ms)    5.3 ms (0.9 ms)    5.5 ms (1.0 ms)    5.7 ms (1.3 ms)    5.6 ms (1.2 ms)

a Build Time (10%) is the total time to insert 10% of the dataset (∼285,000 records)
b Build Time (100%) is the total time to insert 100% of the dataset (∼2.85 m records)
c Query Time is the total time to execute 100 sequential neighbour queries on 10% and 100% of the dataset respectively
d Times reported are in the format μ (σ), calculated over ten trials

4.5 Discussion of Results

The results show that the GeoTree data structure offers the necessary scalability and speed of execution to expand to much larger geospatial datasets, including larger property price datasets. The biggest limitation of the GeoTree lies in the geospatial search distance ranges being linked to the length of the geohash string encoding, thus not being alterable to any desired distance. As a result, the algorithm loses a small amount of accuracy in comparison with the original, as discussed in Subsect. 4.3. Despite this, the substantial gains in execution time shown in Table 1 combined with the scalability offered by an O (1) search algorithm demonstrated in Table 2 make a compelling case for a worthwhile trade-off in certain applications, where execution time would become too long with exact methods, such as in the property price index application shown. Further improvements to the algorithm which could be explored in future research include querying just the surrounding squares of a geohash grid through a GeoTree search, rather than moving up an entire level. For example, in Fig. 1, a search for neighbours in wx4g0c which fails could explore neighbour wx4g0f


and the other adjacent neighbouring squares before falling back to searching through the entirety of wx4g0. This would likely restore some of the lost accuracy previously mentioned without introducing a large execution time overhead, should a mapping of the lettering patterns be computed beforehand and used in neighbour exploration.

5 Conclusion

We have shown that the GeoTree data structure introduced in this article offers an efficient O (1) method for geospatial approximate k -NN search over a collection of geohashes. The application to a real-world property price index algorithm revealed significant reductions in execution time, and potentially opens the door for a real-time property price index. The data structure also performed well when applied to a much larger dataset, demonstrating its scalability. In conclusion, any data science problem which requires geospatial sampling around a particular area can employ the GeoTree for O (1) retrieval of approximate neighbours, potentially enabling, for example, fast retrieval of locations of interest to map users, or geo-targeted advertisement and social networking updates.

References 1. Apostolico, A., Crochemore, M., Farach-Colton, M., Galil, Z., Muthukrishnan, S.: 40 years of suffix trees. Commun. ACM 59(4), 66–73 (2016) 2. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Angela, Y.W.: An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45(6), 891–923 (1998) 3. Bao, J., Chow, C., Mokbel, M.F., Ku, W.: Efficient evaluation of k-range nearest neighbor queries in road networks. In: 2010 Eleventh International Conference on Mobile Data Management, pp. 115–124, May 2010 4. Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 97–104. ACM, New York (2006) 5. De La Briandais, R.: File searching using variable length keys. In: Papers Presented at the the 3–5 March 1959, Western Joint Computer Conference, IRE-AIEE-ACM 1959 (Western), pp. 295–298. Association for Computing Machinery, New York (1959) 6. Diewert, W.E., de Haan, J., Hendriks, R.: Hedonic regressions and the decomposition of a house price index into land and structure components. Econometr. Rev. 34(1–2), 106–126 (2015) 7. Goh, Y.M., Costello, G., Schwann, G.: Accuracy and robustness of house price index methods. Hous. Stud. 27(5), 643–666 (2012) 8. Guttman, A.: R-trees: a dynamic index structure for spatial searching. SIGMOD Rec. 14(2), 47–57 (1984) 9. Hand, D.J.: Data mining based in part on the article “data mining” by David Hand, which appeared in the encyclopedia of environmetrics. American Cancer Society (2013)


10. Jadevicius, A., Huston, S.: Arima modelling of Lithuanian house price index. Int. J. Hous. Mark. Anal. 8(1), 135–147 (2015) 11. Klotz, P., Lin, T.C., Hsu, S.-H.: Modeling property bubble dynamics in Greece, Ireland, Portugal and Spain. J. Eur. Real Estate Res. 9(1), 52–75 (2016) 12. Kothuri, R.K.V., Ravada, S., Abugov, D.: Quadtree and r-tree indexes in oracle spatial: a comparison using GIS data. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 546–557. ACM (2002) 13. Lee, J.-G., Kang, M.: Geospatial big data: Challenges and opportunities. Big Data Res. 2(2), 74–81 (2015). Visions on Big Data 14. Maguire, P., Miller, R., Moser, P., Maguire, R.: A robust house price index using sparse and frugal data. J. Prop. Res. 33(4), 293–308 (2016) 15. McGinnis, W.: Pygeohash (2017). [Python] 16. Miller, R.: Geotree data structure code implementation (2020). https://github. com/robertmiller72/GeoTree/blob/master/GeoTree.py. [Python] 17. Moussalli, R., Srivatsa, M., Asaad, S.: Fast and flexible conversion of geohash codes to and from latitude/longitude coordinates. In: 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 179–186, May 2015 18. Moussalli, R., Asaad, S.W., Srivatsa, M.: Enhanced conversion between geohash codes and corresponding longitude/latitude coordinates (2015) 19. Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2227–2240 (2014) 20. Niemeyer, G.: geohash.org is public! (2008). Accessed 02 May 2019 21. O’Hanlon, N.: Constructing a national house price index for Ireland. J. Stat. Soc. Inq. Soc. Ireland 40, 167–196 (2011) 22. Prasad, N., Richards, A.: Improving median housing price indexes through stratification. J. Real Estate Res. 30(1), 45–72 (2008) 23. Robusto, C.C.: The cosine-haversine formula. Am. Math. Mon. 64(1), 38–40 (1957) 24. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. SIGMOD Rec. 24(2), 71–79 (1995) 25. Safar, M.: K nearest neighbor search in navigation systems. Mob. Inf. Syst. 1(3), 207–224 (2005)

Prediction Interval of Future Waiting Times of the Two-Parameter Exponential Distribution Under Multiply Type II Censoring Shu-Fei Wu(B) Department of Statistics, Tamkang University, New Taipei, Taiwan, ROC [email protected]

Abstract. We use the general weighted moments estimator (GWME) of the scale parameter of the two-parameter exponential distribution based on a multiply type II censored sample to construct the pivotal quantities for the use of the prediction intervals of future waiting times. This estimator has been shown to have better performance than the other fourteen estimators in terms of mean square error. At last, one real life example is given to illustrate the prediction intervals based on GWMEs. Keywords: Type II multiply censored sample · Exponential distribution · General weighted moments estimator · Prediction interval

1 Introduction

In much of the reliability literature, the exponential distribution is widely used as a model for lifetime data, and it has many applications in reliability analysis and life test experiments; see, for example, Johnson et al. (1994) [4]. The failure time Y follows a two-parameter exponential distribution if the probability density function (p.d.f.) of Y is given by f(y) = (1/θ) exp(−(y − μ)/θ), y ≥ μ > 0, θ > 0, where μ is the location parameter and θ is the scale parameter. The location parameter of the two-parameter exponential distribution is the so-called threshold value or "guaranteed time" parameter in reliability and engineering. In dose-response experiments, this distribution is generally used to model the effective duration of a drug, where the location parameter μ is regarded as the guaranteed effective duration and the scale parameter θ is referred to as the mean effective duration in addition to μ. In life testing experiments, the experimenters may not be able to obtain the lifetimes of all items put on test, due to mistakes or for implementing some purposes of experimental design. Suppose that n items are put on the life test and the first r, the middle l and the last s lifetimes are unobserved or missing; this type of censoring is called type II multiple censoring. Wu et al. (2011) proposed simultaneous confidence intervals for all distances from the extreme populations for two-parameter exponential populations based on multiply type II censored samples. Wu (2016) proposed the prediction interval for the future waiting times for the one-parameter exponential distribution based on a type II multiply censored sample. For the two-parameter exponential distribution, Wu (2015) [8] proposed some general WMEs (GWMEs) by assigning a single weight to each observation instead of only considering two weights as in Wu and Yang (2002) [6] for the exponential distribution under multiply type II censoring. The simulation comparison results show that the GWMEs outperform the 12 weighted moments estimators proposed by Wu and Yang (2002), the approximate maximum likelihood estimator (AMLE) by Balakrishnan (1990) [2] and the best linear unbiased estimator (BLUE) by Balasubramanian and Balakrishnan (1992) [3] in terms of the exact mean squared errors (MSEs) in most cases for the exponential distribution. Since the GWMEs perform better than the other 14 methods, Wu (2019) [7] made use of GWMEs to build prediction intervals for future observations. Some users may be interested in the prediction of future waiting times between two failure times. For this kind of problem, we utilize GWMEs to construct a pivotal quantity and build the prediction interval of future waiting times based on the corresponding pivotal quantity. The structure of this research is organized as follows: In Sect. 2, the general WME for the two-parameter exponential distribution is introduced. In Sect. 3, the prediction intervals of future waiting times are proposed. The percentiles of the pivotal quantities are also tabulated for practical use in this section. One real life example to illustrate the proposed intervals is given in Sect. 4. At last, some discussions are given in Sect. 5.

2 The General Weighted Moments Estimation of the Scale Parameter of the Two-Parameter Exponential Distribution

Suppose that the lifetimes Y follow a two-parameter exponential distribution with p.d.f. given by f(y) = (1/θ) exp(−(y − μ)/θ), y ≥ μ > 0, θ > 0, where μ is the location parameter and θ is the scale parameter. Let Y(r+1) < … < Y(r+k) < Y(r+k+l+1) < … < Y(n−s) be the available type II multiply censored sample from the above distribution. Let Y*(i) = Y(i) − Y(r+1), i = r+2, …, r+k, r+k+l+1, …, n−s. The general WME for the scale parameter θ is defined as θ̃* = W*r+2 Y*(r+2) + … + W*r+k Y*(r+k) + W*r+k+l+1 Y*(r+k+l+1) + … + W*n−s Y*(n−s), where W*r+2, …, W*r+k, W*r+k+l+1, …, W*n−s are the n−r−l−s−1 weights assigned to Y*(r+2), …, Y*(r+k), Y*(r+k+l+1), …, Y*(n−s). Let Y* = (Y*(r+2), …, Y*(r+k), Y*(r+k+l+1), …, Y*(n−s))ᵗ and W* = (W*r+2, …, W*r+k, W*r+k+l+1, …, W*n−s)ᵗ, where Xᵗ denotes the transpose of a vector X. Then the GWME for the scale parameter θ can be rewritten as θ̃* = W*ᵗ Y*. The weight vector W* is determined to minimize the mean square error (MSE) of the proposed GWME. From Wu (2015), the optimal weight vector is W* = A*⁻¹ a*, where A* is the matrix with entries b*i,j + a*i a*j (that is, A* = B* + a* a*ᵗ), a* = (a*r+2, …, a*r+k, …, a*n−s) is the mean vector of the random vector Y*, and B* = (b*i,j), i, j = r+2, …, r+k, r+k+l+1, …, n−s, is the covariance matrix of the random vector Y*. The GWME with minimum MSE is obtained as

θ̃* = W*ᵗ Y* = (A*⁻¹ a*)ᵗ Y*   (1)

and the minimum MSE of θ̃* is

MSE(θ̃*) = [W*ᵗ B* W* + (W*ᵗ a* − 1)²] θ².   (2)

3 Intervals of Future Waiting Time

In order to predict the waiting times, the pivotal quantity is taken as V = (Y(j) − Y(j−1)) / θ̃*, n − s < j ≤ n, based on the GWME θ̃* defined in (1). Since Y(1)/θ, …, Y(n)/θ are the n order statistics from a standard exponential distribution and θ̃*/θ = W*ᵗ Y*/θ is a linear combination of n order statistics from a standard exponential distribution, the distribution of the pivotal quantity V = (Y(j) − Y(j−1)) / θ̃* is independent of θ, n − s < j ≤ n. Let V(δ; n, j, r, k, l, s) be the δ percentile of the distribution of V, satisfying P(V ≤ V(δ; n, j, r, k, l, s)) = δ. Making use of the pivotal quantity V, the prediction interval of the waiting time Y(j) − Y(j−1), n − s < j ≤ n, is proposed in the following theorem.

Theorem 1: For the type II multiply censored sample Y(r+1) < … < Y(r+k) < Y(r+k+l+1) < … < Y(n−s), the prediction interval of the future waiting time Y(j) − Y(j−1), n − s < j ≤ n, is

( V(α/2; n, j, r, k, l, s) θ̃*, V(1 − α/2; n, j, r, k, l, s) θ̃* ).

Proof: Observe that

1 − α = P( V(α/2; n, j, r, k, l, s) ≤ (Y(j) − Y(j−1)) / θ̃* ≤ V(1 − α/2; n, j, r, k, l, s) )
      = P( V(α/2; n, j, r, k, l, s) θ̃* ≤ Y(j) − Y(j−1) ≤ V(1 − α/2; n, j, r, k, l, s) θ̃* ).

Since the exact distribution of V is too hard to derive algebraically, the δ percentile of the distribution of V is obtained by Monte Carlo simulation. All the simulations were run with the aid of AbSoft Fortran inclusive of IMSL (1999) [1]. In the simulation, 600,000 replicates are used to compute the percentiles of V for each combination of n, r, k, l, s, j, where j = n−s+1, …, n. Due to the limitation on the number of pages, only part of the percentiles of V are given in Table 1, for δ = 0.005, 0.010, 0.025, 0.050, 0.100, 0.900, 0.950, 0.975, 0.990, 0.995 under n = 12, 24, 36 (see Table 1). Any specific percentile V(δ; n, j, r, k, l, s), for any censoring scheme (n, r, k, l, s) and for the jth waiting time between the jth future observation and the previous one, j = n−s+1, …, n, can be obtained from the software program provided by the author.

r

2

2

3

3

3

2

2

2

2

1

1

1

1

1

4

1

2

n

12

12

12

12

12

12

12

12

12

12

12

12

12

12

12

12

24

5

5

5

5

5

5

5

5

5

5

5

5

5

5

5

5

5

k

2

4

1

2

2

2

3

3

1

1

1

3

1

1

2

2

2

l

2

1

1

3

3

3

2

2

3

3

3

1

2

2

1

2

2

s

23

12

12

12

11

10

12

11

12

11

10

12

12

11

12

12

11

j

0.0026

0.0056

0.0059

0.0058

0.0029

0.0019

0.0057

0.0028

0.0058

0.0029

0.0020

0.0058

0.0058

0.0029

0.0057

0.0057

0.0029

0.005

δ

0.0053

0.0112

0.0118

0.0115

0.0058

0.0038

0.0114

0.0057

0.0119

0.0058

0.0039

0.0115

0.0117

0.0058

0.0115

0.0115

0.0058

0.010

0.0134

0.0283

0.0296

0.0289

0.0146

0.0096

0.0286

0.0144

0.0298

0.0148

0.0099

0.0289

0.0298

0.0147

0.0289

0.0292

0.0145

0.025

0.0270

0.0577

0.0601

0.0587

0.0296

0.0196

0.0582

0.0292

0.0606

0.0300

0.0201

0.0584

0.0605

0.0300

0.0590

0.0593

0.0295

0.050

0.0557

0.1192

0.1241

0.1209

0.0608

0.0406

0.1200

0.0598

0.1248

0.0617

0.0413

0.1204

0.1247

0.0620

0.1221

0.1223

0.0610

0.100

1.2897

2.9602

3.2901

3.1237

1.5678

1.0409

3.0277

1.5122

3.2821

1.6446

1.0935

3.0439

3.2833

1.6406

3.1466

3.1302

1.5659

0.900

1.7100

4.0191

4.5516

4.2827

2.1462

1.4314

4.1241

2.0578

4.5493

2.2701

1.5169

4.1545

4.5408

2.2699

4.3155

4.2905

2.1491

0.950

2.1431

5.1518

5.9824

5.5682

2.7957

1.8602

5.3219

2.6606

5.9623

2.9771

1.9903

5.3660

5.9647

2.9785

5.6153

5.5651

2.7855

0.975

2.7470

6.8360

8.1700

7.4750

3.7460

2.5050

7.0760

3.5490

8.1300

4.0680

2.7050

7.1270

8.1440

4.0540

7.5330

7.4760

3.7410

0.990

 Table 1. The δ percentile of the pivotal quantity V = Y(j) − Y(n−s) θ˜ ∗ and P(V ≤ V (δ; n, r, k, l, s)) = δ

(continued)

3.2260

8.2240

10.0450

9.0610

4.5530

3.0490

8.5480

4.2780

9.9930

4.9940

3.3240

8.6290

9.9840

4.9780

9.1530

9.1210

4.5550

0.995


2

3

3

3

2

2

2

2

1

1

1

1

1

4

1

2

2

3

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

36

36

36

5

5

5

5

5

5

5

5

5

5

5

5

5

5

5

5

5

5

2

2

2

4

1

2

2

2

3

3

1

1

1

3

1

1

2

2

1

2

2

1

1

3

3

3

2

2

3

3

3

1

2

2

1

2

36

36

35

24

24

24

23

22

24

23

24

23

22

24

24

23

24

24

0.0051

0.0052

0.0025

0.0055

0.0053

0.0053

0.0027

0.0018

0.0051

0.0026

0.0053

0.0027

0.0018

0.0052

0.0054

0.0026

0.0053

0.0054

δ

0.0104

0.0104

0.0051

0.0107

0.0108

0.0106

0.0053

0.0035

0.0104

0.0052

0.0107

0.0053

0.0036

0.0105

0.0107

0.0053

0.0105

0.0107

0.0262

0.0262

0.0130

0.0265

0.0266

0.0266

0.0133

0.0089

0.0265

0.0133

0.0270

0.0133

0.0089

0.0265

0.0268

0.0134

0.0269

0.0268

0.0530

0.0532

0.0264

0.0536

0.0542

0.0539

0.0270

0.0180

0.0541

0.0270

0.0545

0.0270

0.0181

0.0537

0.0541

0.0271

0.0543

0.0542

0.1092

0.1095

0.0543

0.1105

0.1118

0.1113

0.0555

0.0370

0.1108

0.0554

0.1117

0.0555

0.0372

0.1108

0.1115

0.0556

0.1113

0.1109

Table 1. (continued)

2.4768

2.4630

1.2380

2.5513

2.5996

2.5809

1.2853

0.8602

2.5608

1.2791

2.5971

1.2997

0.8633

2.5683

2.5945

1.2948

2.5807

2.5777

3.2551

3.2430

1.6277

3.3780

3.4482

3.4211

1.7038

1.1420

3.3924

1.6925

3.4364

1.7244

1.1435

3.3970

3.4411

1.7212

3.4173

3.4134

4.0546

4.0444

2.0220

4.2261

4.3324

4.2956

2.1389

1.4348

4.2408

2.1298

4.3216

2.1710

1.4372

4.2557

4.3191

2.1632

4.2908

4.2800

5.1290

5.1280

2.5640

5.3800

5.5620

5.4800

2.7440

1.8320

5.4300

2.7180

5.5440

2.7740

1.8410

5.4460

5.5380

2.7830

5.4810

5.4770

(continued)

5.9660

5.9740

2.9710

6.2980

6.5150

6.4340

3.2140

2.1440

6.3830

3.1730

6.5200

3.2560

2.1580

6.4010

6.4740

3.2660

6.4340

6.4320


3

3

2

2

2

2

1

1

1

1

1

4

1

36

36

36

36

36

36

36

36

36

36

36

36

36

5

5

5

5

5

5

5

5

5

5

5

5

5

4

1

2

2

2

3

3

1

1

1

3

1

1

1

1

3

3

3

2

2

3

3

3

1

2

2

36

36

36

35

34

36

35

36

35

34

36

36

35

0.0053

0.0052

0.0052

0.0026

0.0017

0.0052

0.0025

0.0052

0.0026

0.0017

0.0052

0.0051

0.0026

δ

0.0104

0.0105

0.0104

0.0052

0.0034

0.0104

0.0051

0.0105

0.0052

0.0034

0.0104

0.0103

0.0053

0.0260

0.0262

0.0259

0.0132

0.0087

0.0260

0.0130

0.0261

0.0131

0.0087

0.0261

0.0262

0.0133

0.0527

0.0533

0.0529

0.0265

0.0178

0.0530

0.0265

0.0530

0.0263

0.0177

0.0527

0.0532

0.0267

0.1084

0.1090

0.1086

0.0545

0.0365

0.1088

0.0545

0.1090

0.0545

0.0365

0.1084

0.1095

0.0547

Table 1. (continued)

2.4596

2.4649

2.4693

1.2304

0.8221

2.4669

1.2294

2.4714

1.2380

0.8253

2.4589

2.4784

1.2362

3.2401

3.2477

3.2483

1.6189

1.0828

3.2416

1.6200

3.2550

1.6296

1.0837

3.2351

3.2602

1.6259

4.0276

4.0428

4.0510

2.0228

1.3486

4.0415

2.0199

4.0583

2.0286

1.3461

4.0310

4.0636

2.0289

5.0970

5.1480

5.1270

2.5620

1.7110

5.1200

2.5540

5.1360

2.5700

1.7070

5.1250

5.1630

2.5750

5.9140

5.9970

5.9670

2.9780

1.9890

5.9580

2.9660

5.9740

2.9870

1.9890

5.9880

5.9980

2.9930



4 Example

In this section, we use the example of times to breakdown of an insulating fluid between electrodes recorded at five different voltages (Nelson (1982, p. 252)) [5] to demonstrate the prediction interval of future waiting times. Such a time to breakdown is usually assumed to be exponentially distributed in engineering theory. We choose 35 kV, and the multiply type II censored data with n = 12, r = 2, k = 3, l = 1 and s = 5 is: –, –, 41, 87, 93, –, 116, –, –, –, –, –. The weights are 0.23568, 0.12544, 0.19776, 0.8058 and the estimated scale parameter is

θ̃* = W*r+2 Y*(r+2) + … + W*r+k Y*(r+k) + W*r+k+l+1 Y*(r+k+l+1) + … + W*n−s Y*(n−s)
    = W*4 (Y(4) − Y(3)) + W*5 (Y(5) − Y(3)) + W*7 (Y(7) − Y(3))
    = 0.20047 × 46 + 0.31604 × 52 + 1.28774 × 75 = 122.2362.

Using Theorem 1, the 90% and 95% prediction intervals for Y(8) − Y(7), Y(9) − Y(8), Y(10) − Y(9), Y(11) − Y(10), Y(12) − Y(11) are obtained in Table 2.

Table 2. 90% and 95% prediction intervals for the future waiting times Y(j) − Y(j−1), j = 8, …, 12.

90%
Waiting time     U (0.05; 12, j, 2, 3, 1, 5), U (0.95; 12, j, 2, 3, 1, 5)     Prediction interval
Y(8) − Y(7)      0.0129, 1.1147                                               (1.576847, 136.256692)
Y(9) − Y(8)      0.0161, 1.3932                                               (1.968003, 170.299474)
Y(10) − Y(9)     0.0216, 1.8603                                               (2.640302, 227.396003)
Y(11) − Y(10)    0.0323, 2.7938                                               (3.948229, 341.503496)
Y(12) − Y(11)    0.0645, 5.5884                                               (7.884235, 683.104780)

95%
Waiting time     U (0.025; 12, j, 2, 3, 1, 5), U (0.975; 12, j, 2, 3, 1, 5)   Prediction interval
Y(8) − Y(7)      0.0064, 1.5165                                               (0.7823117, 185.3711973)
Y(9) − Y(8)      0.0079, 1.8969                                               (0.965666, 231.869848)
Y(10) − Y(9)     0.0106, 2.5295                                               (1.295704, 309.196468)
Y(11) − Y(10)    0.0157, 3.7951                                               (1.919108, 463.898603)
Y(12) − Y(11)    0.0317, 7.6069                                               (3.874888, 929.838550)
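As a quick check of the arithmetic in this example, the first row of Table 2 can be reproduced from the estimated scale parameter and the tabulated percentiles. The short Python sketch below is purely illustrative.

# Reproducing the first row of Table 2 from the worked example above.
weights = [0.20047, 0.31604, 1.28774]          # W*4, W*5, W*7 from the example
diffs = [87 - 41, 93 - 41, 116 - 41]           # Y(4)-Y(3), Y(5)-Y(3), Y(7)-Y(3)

theta_hat = sum(w * d for w, d in zip(weights, diffs))
print(round(theta_hat, 4))                     # -> 122.2362

v_lo, v_hi = 0.0129, 1.1147                    # tabulated V(0.05), V(0.95) for j = 8
print(round(v_lo * theta_hat, 6), round(v_hi * theta_hat, 6))
# -> 1.576847 136.256692, the 90% interval for Y(8) - Y(7) in Table 2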


5 Discussion

In this paper, the GWMEs are used to construct a pivotal quantity to build a prediction interval for future waiting times for the two-parameter exponential distribution. The percentiles of the proposed pivotal quantity are obtained by the Monte Carlo method and tabulated in Table 1. Theorem 1 is proposed to build the prediction intervals based on the GWME for a type II multiply censored sample. Finally, one real life example is given to illustrate the proposed prediction intervals of future waiting times for a specific censoring scheme. In the future, we may consider other estimators to improve the performance of the prediction intervals. We can also investigate prediction intervals for other lifetime distributions such as the Pareto distribution.

Acknowledgment. The author's research was supported by the Ministry of Science and Technology, MOST 108-2118-M-032-001- and MOST 109-2118-M-032-001-MY2, Taiwan, ROC.

References 1. AbSoft Fortran: (Inclusive of IMSL) 4.6, Copyright (c), Absoft Crop. (1999) 2. Balakrishnan, N.: On the maximum likelihood estimation of the location and scale parameters of exponential distribution based on multiply Type II censored samples. J. Appl. Stat. 17, 55–61 (1990) 3. Balasubramanian, K., Balakrishnan, N.: Estimation for one- and two-parameter exponential distributions under multiple Type-II censoring. Stat. Papers 33(1), 203–216 (1992). https://doi. org/10.1007/BF02925325 4. Johnson, N.L., Kotz, S., Balakrishnan, N.: Continuous Univariate Distributions, vol. 1. Wiley, Hoboken (1994) 5. Nelson, W.: Applied Life Data Analysis. Wiley, New York (1982) 6. Wu, J.W., Yang, C.C.: Weighted moments estimation of the scale parameter for the exponential distribution based on a multiply Type II censored sample. Qual. Reliab. Eng. Int. 18, 149–154 (2002) 7. Wu, S.F.: Prediction interval of the future observations of the two-parameter exponential distribution under multiply Type II censoring. ICIC Exp. Lett. 13(11), 1073–1077 (2019) 8. Wu, S.F.: The general weighted moment estimator of the scale parameter of the two-parameter exponential distribution under multiply type II censoring. ICIC Exp. Lett. (ICIC-EL) 9(11), 3081–3085 (2015)

Predictive Models as Early Warning Systems: A Bayesian Classification Model to Identify At-Risk Students of Programming Ashok Kumar Veerasamy1(B) , Mikko-Jussi Laakso1 , Daryl D’Souza2 , and Tapio Salakoski1 1 University of Turku, Turku, Finland

[email protected] 2 RMIT University, Melbourne, Australia

Abstract. The pursuit of a deeper understanding of factors that influence student performance outcomes has long been of interest to the computing education community. Among these include the development of effective predictive models to predict student academic performance. Predictive models may serve as early warning systems to identify students at risk of failing or quitting early. This paper presents a class of machine learning predictive models based on Naive Bayes classification, to predict student performance in introductory programming. The models use formative assessment tasks and self-reported cognitive features such as prior programming knowledge and problem-solving skills. Our analysis revealed that the use of just three variables was a good fit for the models employed. The models that used in-class assessment and cognitive features as predictors returned best at-risk prediction accuracies, compared with models that used take-home assessment and cognitive features as predictors. The prediction accuracy in identifying at-risk students on unknown data for the course was 71% (overall prediction accuracy) in compliance with the area under the curve (ROC) score (0.66). Based on these results we present a generic predictive model and its potential application as an early warning system for early identification of students at risk. Keywords: Early warning systems · Formative assessment tasks · Predictive data mining models · Problem Solving Skills

1 Introduction

Programming is fundamental to computer science (CS) and cognate disciplines, and is typically also offered to non-CS majors. However, the difficulty of learning to program steers non-CS students away from programming courses and towards alternative courses [1]. Many students fail or perform poorly in programming even as CS education has seen improvements in methods of teaching programming [2], with failure rates continuing to be in the range 28–32% [3, 4]. Accordingly, the pursuit of a better understanding of factors that influence student performance outcomes has long been of interest, and includes the development of early-prediction models to predict academic performance, in turn to
identify potentially at-risk students [5–7]. However, the predictor variables and associated machine learning algorithms are typically influenced by contextual variations such as class size and academic settings [8, 9]. In addition, it is widely accepted that parsimony is important in model building [10, 11]. This paper proposes a genre of simple, parsimonious predictive model(s), which account for variations in academic setting, such as academic environment and student demography. The models also assume that concepts taught early in the semester typically impact on understanding of later concepts; hence, analysing results of early formative assessments may provide opportunities to assess student-learning outcomes and to identify poorly motivated learners early in the semester. This study adopted Naive Bayes classification for our proposed predictive model. K-fold cross-validation was used to evaluate the model. An additional objective was to explore the relationship among the selected predictor variables and propose a genre of parsimonious predictive model(s) for incorporation in an early warning system (EWS), to identify at-risk students early and to facilitate appropriate interventions. Towards these objectives, the paper addresses the following research questions (RQs). RQ1. Which measures provide the most value for predicting student performance: perceived problem-solving skills, prior programming knowledge, or selected formative assessment results? RQ2. How suitable is the Naive Bayes classification based model for incorporation in an early warning system, to identify students in need of early assistance? The remainder of the paper is organized as follows. Related work (Sect. 2) surveys literature relevant to previous work. Research methodology (Sect. 3) describes the methods used to address our research questions. Data analysis and results (Sect. 4) presents our findings, which we discuss in depth in the discussion (Sect. 5). Finally, the conclusion (Sect. 6) summarizes our findings and presents limitations in terms of how well the foregoing research questions are answered; we also identify some related future work directions.

2 Related Work This section highlights important past contributions of relevance to the work reported in this paper, including: predictors of academic performance; modelling of predictors; Naive Bayes classification and early warning systems. 2.1 Predictors of Student Performance We limited our study to three variables for predicting student academic performance: problem-solving skills, prior programming knowledge and formative assessment performance. Problem solving is a metacognitive activity, which reveals the way a person learns and experiences different aspects of the problem-solving process [12]. It is considered as a basic skill for students in study and work [13, 14]. Student problem-solving skills can predict student study habits and academic performance [15]. Studies have
also revealed that there is a relationship between problem-solving proficiency and academic achievement [16, 17]. Marion et al. [18] noted that programming should not be considered as just a coding skill but as a way of thinking, decomposing and solving problems, implying that problem solving may impact student performance in programming courses. Research in the discipline of computer science has also highlighted that many students lack problem-solving skills [19, 20]. Another important variable is prior knowledge, which is defined as an individual’s prior personal stock of information, skills, experiences, beliefs and memories. Prior knowledge is one of the most important factors that influence learning and student performance [21, 22]. Studies have been conducted on the impact of prior knowledge in programming courses, sometimes with mixed results [23–25]. Reviews of computing educational studies have found that both prior knowledge and problem-solving skills are required skills for CS students [24, 26]. Formative assessment is an effective instructional strategy to measure studentlearning outcomes [27, 28]. Formative assessment tasks take place during the course of study and instructors often examine selected formative assessment task results to observe and assess student improvement. Students are aware that completing formative assessment tasks may lead to improved final grades [29]. For example, homework is a type of formative assessment task to test comprehension [30]. Educators use homework to identify where students are struggling, in order to assist them and to address their problems [31]. There have been several studies conducted on the impact of homework on student performance [31, 32]. Veerasamy et al. [31] reported that marks achieved in homework and class demonstrations have a significant positive impact on final examination results. Similarly, computer-based or online-tutorials have positive effects on student academic performance in economics, mathematics and science courses [33, 34]. Moreover, several models have included formative assessment scores to predict student performance [6, 7]. 2.2 Predictive Data Mining Modelling for Student Performance in Computing Education Predictive modelling employs classifiers or regressors to formulate a statistical model to forecast or predict future outcomes. Predictive models have been employed in several studies to automatically identify students in need of assistance in programming courses, based on early course work [7, 35–38]. For example, Porter et al. [36] used correlation coefficient and visualization techniques to predict student success at the end of term, examining clicker question performance data collected in peer instruction classrooms. Liao et al., [38, 39] conducted studies by using student clicker data as input, collected in a peer instruction pedagogy setting to identify at risk CS1 students. In a later study [7], they explored the value of different data sources as inputs to predict student performance in multiple courses; these inputs included the collected clicker data, take-home assignment and online quiz grades, and final grades obtained from prerequisite courses. They employed Linear regression, Support vector machine and Logistic regression machine learning algorithms in these studies respectively for prediction of student performance in computing education. However, the earlier study [39] did not provide sufficient detail why Linear regression was selected over other machine learning techniques. The use of
clickers and the peer instruction pedagogy raised concerns about whether this approach could be applied to courses that did not employ instrumented IDEs, or the peer instruction pedagogy as per later studies [7, 38]. In addition, they may not have had access to grades attained in the prerequisite courses for analysis [7]. The methodologies used in the aforementioned studies cannot be applied to online distance courses. Substantial student collaboration effort is required from students located at different geographical locations, and it may be challenging for instructors to obtain or access relevant data for predictive analysis. In addition, it is not yet clear which machine learning algorithms are preferable in this context. 2.3 Predictive Modelling with Naive Bayes Classification Naive Bayes classification (NBC) is a supervised machine-learning algorithm for binary and multi-class classification problems. It is a simple statistical classifier based on Bayes’ probability theorem. NBC models are considered an effective choice for predicting student performance [40], and their use has demonstrated improved performance over other classification methods [41]. For example, Agrawal et al. [42] analysed various machine learning classification algorithms and inferred that NBC is effective for student final exam performance prediction. However, Bergin et al. found that although Naive Bayes had the highest prediction accuracy for predicting novice programming success, there were no significant statistical differences between the prediction accuracies of Naive Bayes and Logistic regression, Support vector machine, Artificial neural network and Decision tree classifiers, in predicting introductory programming student performance [43]. In addition, in the study reported in this paper, we explored various other machine-learning techniques, such as Random forest, C5.0 and Support vector machine, for predicting student performance. It was identified that NBC provided better overall prediction and at-risk balanced accuracy across our datasets compared to other machine learning models. So, we deployed Naive Bayes classification for this study. Most of the studies around predictive model development and validation used K-fold cross-validation to evaluate model performance [7, 40, 43]. Borra et al. [44] measured the prediction error of the model by employing estimators such as leave-one-out, parametric and non-parametric bootstrap, as well as cross-validation methods. They reported that the repeated K-fold cross-validation estimator and the parametric bootstrap estimator outperformed the leave-one-out and hold-out estimators. As such, in this study, we developed the Naive Bayes based classification model and used K-fold cross-validation for model evaluation. 2.4 Early Warning Systems (EWS) in Education The term “early warning system” (EWS) is not new and has attracted attention to support instructors and students [45, 46]. The EWS acts as a student progress indicator, enabling educators to support students who perform progressively poorly, before they drop out; for example, Krumm et al. [46] designed a project called Student Explorer as part of a learning management system (LMS) for academic advising in undergraduate engineering courses. They examined the accumulated LMS data to identify students who
needed academic support and identified factors that influenced academic advisor decisions. Higher educational institutions are placing greater emphasis on improving student retention, performance and support, and hence they have begun to urge educators to use EWSs, such as Course Signals and Student Explorer, to implement effective pedagogical practices to improve student performance [46–48]. However, these EWSs have some shortcomings. First, academic advisors surmise that using EWSs in academic settings is time-consuming. Second, many instructors have difficulties in integrating EWSs into their regular work practices as existing EWSs do not appear to be user-friendly. Third, studies also confirmed that identifying a reliable set of predictors to develop early warning systems/early prediction tools is a challenging task [49, 50]. Fourth, EWSs that have been designed for online courses, or which rely heavily on learning management system (LMS) data rather than on student cognitive and performance data, may not be suitable, as LMS data alone may be an inadequate data source [46, 47]. Moreover, in programming, several learning activities take place outside of the LMS [51]. Hence, EWS tools and strategies developed on the basis of student cognitive and performance data best serve the early identification of at-risk students, in order to support their timely learning [45]. In summary, this paper contributes the following: (i) Development of a simple model(s) with explanatory predictor variables selected on the basis of prior work; (ii) Adoption of the NBC algorithm with K-fold cross-validation, to develop predictive models and to explore the predictive capabilities of selected variables; (iii) Identification of models with reliable sets of predictors, which may be suitable in (academic) early warning systems, for courses that utilise assessment tasks and a final exam.

3 Research Methodology The aim of this study was to establish a predictive model to predict student final programming grades in introductory programming courses. This study used Naive Bayes classification based predictive modelling with the following inputs: student perceived problem-solving skills; prior knowledge in programming; and formative assessment in the form of homework and demo/tutorial exercise scores, for the first four weeks of the semester. Student final exam grades represented the output of the model. 3.1 Description of the Course and Data Sources The initial data collected for model development was derived from assessment activities of university students enrolled in two classroom-based courses: Introduction to Programming and Algorithms and Programming, both offered during the autumn semester of 2016. The dataset collected in the autumn semester of 2017 was then employed as the unknown data set to test the performance (for generalization) of our predictive models. Introduction to Programming is offered in English and Algorithms and Programming in Finnish, and both courses are designed for students with no prior knowledge in programming. The duration of the courses is 11 weeks and 8 weeks, respectively, for Introduction to Programming and Algorithms and Programming. Both courses use ViLLE as the learning management system (LMS) to support technology-enhanced classes. ViLLE is mainly used for programming students, to deliver and manage course content,
such as lecture notes, formative and summative assessment tasks for programming students. Student academic data was collected via ViLLE, and SPSS (version 25) and R (version 3.5.1) were used for statistical analysis.

Table 1. Dataset details

Course                      | *Training set/enrolled (2016) | *Test set/enrolled (2017)
Introduction to Programming | 64/93                         | 68/94
Algorithms and Programming  | 170/248                       | 172/258

*Only students who participated in the problem-solving skills and course entry surveys, and completed the formative assessment tasks and the final exam, were selected for training and testing

Table 1 provides dataset details including course names and numbers of students enrolled in each year, as well as the numbers of students selected for training and test/unknown data sets. The data collected via LMS for the year 2016 (n1 = 64 + 170) was used for feature selection, training and testing the models, for evaluation. The dataset collected in the year 2017 (n2 = 68 + 172) was then employed for final testing (generalization performance) in order to propose models developed for the purpose of early warning systems. 3.2 Instruments and Assessments Our study included two surveys, for self-assessment of problem-solving skills and prior programming knowledge, denoted the problem-solving skills and prior programming knowledge instruments, respectively. These surveys were conducted via the LMS at the start of the semester. The problem-solving inventory (PSI) questionnaire contained 32 Likert-type questions to measure individual self-perception of problem-solving ability. The total score ranged from 32 to 192, with higher scores indicating poorer self-reported problem-solving skills. The PSI was developed by Heppner and Peterson [52]. The actual PSI questions were in English, devised by Heppner [52], and were translated into Finnish for Algorithms and Programming students, whose native language was Finnish. Moreover, this instrument was used in our prior study [53] to identify the relationship between PSI and academic performance of novice learners in an introductory programming course. The prior programming knowledge (PPK) survey instrument was devised as part of a study by Veerasamy et al. [54] and comprised a 5-point (0 to 5) Likert scale of closed response questions for students. Each point was presented to students with a clear description to accurately self-assess their prior knowledge. In addition, students were asked to mention the names of programming languages they had learned and in which they had previously written at least 200 lines of code. Student responses to the PPK were crosschecked to measure the validity and reliability of the PPK survey. Later, these five points were collapsed into three groups, identified as “0: no knowledge”, “1–2: basic knowledge” and “3–5: good knowledge”. Moreover, our prior study [54] confirmed the feasibility of using the PPK instrument to predict student academic performance.
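The grouping of the PPK ratings can be expressed compactly. The snippet below is a minimal illustrative sketch in R (not the authors' code; the vector of responses is a hypothetical stand-in) showing how the 0–5 self-ratings collapse into the three knowledge groups described above.

```r
## Illustrative sketch (assumed, not the authors' code): collapsing the 0-5
## PPK self-ratings into the three groups used in the paper.
ppk_raw   <- c(0, 1, 3, 5, 2, 0, 4)   # hypothetical survey responses
ppk_group <- cut(ppk_raw,
                 breaks = c(-Inf, 0, 2, 5),
                 labels = c("no knowledge", "basic knowledge", "good knowledge"))
table(ppk_group)
```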


A set of homework exercises (HE) was provided for both courses weekly, for 8 weeks. Each set of homework exercises averaged 5–10 questions, comprising objective type, code tracing, visualization and coding exercises. All exercises were delivered to students via the LMS, wherein they could submit their answers online, which were mostly automatically graded by the LMS. The possible total HE scores for Introduction to Programming and Algorithms and Programming were 890 and 317, respectively. Demo exercises (DE) for Introduction to Programming were dispatched to students weekly via the LMS, for 10 weeks throughout the semester. Each set of exercises had 4–7 coding questions. In a DE session (in the classroom), the solutions of students who had completed the questions and were ready to submit were discussed by the lecturer; students subsequently selected at random via the LMS demonstrated their answers in class. No marks were awarded for class demonstrations. However, students who completed the DEs were instructed to enter their responses directly into the lecturer’s computer, to record the number of DEs completed by them [31]. The possible total DE score for Introduction to Programming was 750. Tutorial exercises (TT) for Algorithms and Programming were provided to students weekly for 8 weeks throughout the semester. In a tutorial session students were given coding exercises via the LMS to work on online in the classroom. Students were allowed to submit their answers online, individually or as group submissions, which were automatically graded by ViLLE. However, a few coding exercises were manually graded by the lecturer, with scores entered into the LMS in order to assess students’ programming ability. The possible total TT score for Algorithms and Programming was 650. Introduction to Programming did not offer tutorials for students. Both HE and DE were hurdles for Introduction to Programming, with students having to attain at least 50% overall for HE and 40% overall for DE, in order to pass these components and the course. Similarly, all HE and TT were hurdles for Algorithms and Programming; students were required to secure at least 50% overall in each component and to have completed the end-of-semester online assignment in the LMS, to qualify to sit for the final exam. Both DE and TT sessions were conducted in the classroom, partially supervised and assisted by the lecturer. The final exam (FE) is an online summative assessment conducted at the end of the course of study. This final exam is conducted electronically via the LMS. The final exam is a hurdle for Introduction to Programming and students are required to secure at least 50% (*) to pass this hurdle, in order to attain a grade for the course. However, the final exam is not a hurdle for Algorithms and Programming. Students who attain 80% or more in the selected assessment components get the maximum of two credit points and a course grade of 2. To obtain grades of 3 to 5, students must secure at least 50% or more in the selected assessment components and should get at least 62% or more in the final exam (Table 2). The final exam grade (FEG) for the course was calculated based only on final exam scores. Table 2 shows the details of the grade calculation used for this study, to predict final exam grades for both courses. Figure 1 and Fig. 2 present the grade-wise student distribution for Introduction to Programming and Algorithms and Programming.
In this study, we defined students as at-risk if they secured grade 0 or 1 in the final exam, and denoted their grade as “ZERO”.


Table 2. Grading criteria table: Introduction to Programming & Algorithms and Programming

Introduction to Programming          | Algorithms and Programming
Final exam marks | Grade             | Final exam marks | Grade
0 to 49          | 0*                | 0 to 44          | 0*
50 to 59         | 1*                | 45 to 55         | 1*
60 to 69         | 2**               | 56 to 66         | 2**
70 to 79         | 3**               | 67 to 77         | 3**
80 to 92         | 4**               | 78 to 88         | 4**
93+              | 5**               | 89+              | 5**

*The actual grades 0 and 1 are considered as “at-risk” for this study and denoted as grade “ZERO”; **grades 2 to 5 are considered as “not-at-risk”
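For readers who want to reproduce the grade banding, the following is a minimal illustrative sketch in R (an assumption for illustration, not the authors' code) that maps Introduction to Programming final exam marks to the grades of Table 2; grades 0–1 correspond to the at-risk label “ZERO”.

```r
## Illustrative sketch (assumed): banding Introduction to Programming final
## exam marks into the Table 2 grades; integer marks are assumed.
grade_intro <- function(marks) {
  cut(marks,
      breaks = c(-Inf, 49, 59, 69, 79, 92, Inf),
      labels = c("0", "1", "2", "3", "4", "5"))
}
marks <- c(45, 55, 62, 75, 85, 95)      # hypothetical final exam marks
data.frame(marks,
           grade   = grade_intro(marks),
           at_risk = grade_intro(marks) %in% c("0", "1"))
```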

Fig. 1. Grade-wise distribution chart – Introduction to Programming


Fig. 2. Grade-wise distribution chart – Algorithms and Programming

This is because students who secured only a marginal passing grade were not likely to succeed in subsequent courses. Hence, this study tags students who received a fail grade or a marginal pass as at-risk students, in order to check the at-risk student prediction accuracy of the model (Fig. 1 and 2). Thirty-two (25 + 7) students secured grade “ZERO” in the year 2016 and 16 (11 + 5) in the year 2017 for Introduction to Programming (Fig. 1). Forty-four (30 + 14) students secured grade “ZERO” in the year 2016 and 32 (19 + 13) in the year 2017 for Algorithms and Programming (Fig. 2). Similarly, we defined students who secured grades 2–5 as not-at-risk. In total, 32 students secured grades 2–5 in the year 2016 and 52 students in the year 2017, for Introduction to Programming; 126 students secured grades 2–5 in the year 2016 and 140 students in the year 2017, for Algorithms and Programming. The distribution of data is not unimodal. 3.3 Predictive Model The goal of this study was to build a predictive model that most accurately predicts the desired output value for new input (course) data. Two courses × 15 predictive models were developed to measure the influence of selected predictor variables on prediction accuracy. In order to evaluate the prediction accuracy of the models, we used 5-fold cross-validation to ensure that the training and testing sets (year 2016) contained sufficient variation to arrive at unbiased results. In turn, this would avoid overfitting and allow us to establish how well the model generalized to unknown data (year 2017). We used the wrapper method (forward selection) to determine whether adding a specific feature would statistically improve the predictive performance of the model. In addition, the process was continued until all available variables were successively added to a model,
to identify the best set of variables for model development. The prediction accuracy of each of the 30 predictive models was examined by calculating the overall model prediction accuracy, the at-risk student prediction accuracy sensitivity and specificity, and area under the curve score (ROC curve), for each model. The following prediction accuracy measures were applied via R coding, to evaluate the performance of all models (in training and testing) to answer our research questions. Model prediction accuracy (MPA): MPA was calculated as the number of correct predictions made by NBC, divided by the total number of actual values (and multiplied by 100) to get the prediction accuracy. At-risk student prediction accuracy sensitivity (ATSE): The ATSE represents the percentage of at-risk students who are correctly identified by the model. The ATSE was calculated as:

ATSE = TAR / (TAR + FNR) × 100

where TAR (True At-risk) is the number of predictions for grade “ZERO” that were correctly identified; and FNR (False Not At-risk) is the number of at-risk students who are incorrectly identified as not-at-risk students by the model. The ATSE value represents the percentage of at-risk students who are correctly identified by the model. At-risk prediction accuracy specificity (ATSP): ATSP represents the percentage of not-at-risk students who are correctly identified by the model. The not-at-risk prediction accuracy was calculated as:

ATSP = TNR / (TNR + FAR) × 100

where TNR (True Not-At-risk) is the number of correctly identified predictions for grades “2–5”; and FAR (False At-risk) is the number of not-at-risk students who are incorrectly identified as at-risk students by the model for all trials. The ATSP value represents the percentage of not-at-risk students who are correctly identified by the model. Area under the ROC (receiver operating characteristics) curve (AUC): The AUC is a performance measure for binary or multiclass classifiers. The AUC value lies between 0.5 and 1, inclusive, where 0.5 denotes a bad classifier and 1 denotes an excellent classifier. The higher the AUC, the better the model is at distinguishing between at-risk and not-at-risk students. The AUC was used to evaluate the diagnostic ability of the NBC model. We defined the prediction accuracy of identifying at-risk and not-at-risk values as follows: below 50% is considered poor; 50%–69% moderate; 70%–79% good; and 80% and above very good. As such, models with the lowest predictive performances were dropped for tests on unknown data. The models that returned the highest prediction accuracies (top three models × 2 courses) were then employed for testing on unknown data (year 2017) for generalization.
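To make the evaluation procedure concrete, the following is a minimal illustrative sketch in R under stated assumptions (the e1071 package for Naive Bayes, simulated data, and hypothetical column names; it is not the authors' code). It runs 5-fold cross-validation of an NBC model and computes MPA, ATSE and ATSP for each fold as defined above; the same loop structure extends to the other feature combinations listed in Tables 3 and 4.

```r
## Illustrative sketch (assumed packages and data, not the authors' code):
## 5-fold cross-validated Naive Bayes classification with the paper's
## accuracy measures. AUC could additionally be obtained from predicted
## class probabilities, e.g. via an ROC package.
library(e1071)

set.seed(1)
n  <- 200
df <- data.frame(
  PSI  = rnorm(n, mean = 100, sd = 20),                 # problem-solving score
  DE   = runif(n, min = 0, max = 100),                  # demo exercise percentage
  risk = factor(sample(c("ZERO", "NOT_AT_RISK"), n, replace = TRUE))
)

folds <- sample(rep(1:5, length.out = n))               # random fold assignment
res <- t(sapply(1:5, function(k) {
  fit    <- naiveBayes(risk ~ PSI + DE, data = df[folds != k, ])
  pred   <- predict(fit, df[folds == k, ])
  actual <- df$risk[folds == k]
  TAR <- sum(pred == "ZERO" & actual == "ZERO")         # true at-risk
  FNR <- sum(pred != "ZERO" & actual == "ZERO")         # missed at-risk
  TNR <- sum(pred != "ZERO" & actual != "ZERO")         # true not-at-risk
  FAR <- sum(pred == "ZERO" & actual != "ZERO")         # false at-risk
  c(MPA  = (TAR + TNR) / length(actual) * 100,
    ATSE = TAR / (TAR + FNR) * 100,
    ATSP = TNR / (TNR + FAR) * 100)
}))
colMeans(res)                                           # average accuracies over folds
```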


4 Data Analysis and Results We used SPSS for data pre-processing and RStudio to perform the NBC analysis (IBM, 2013). The data pre-processing was conducted as follows. First, all numerical data was scaled for standardization. For example, we converted the actual homework, demo and tutorial exercise scores (for the first four weeks of semester) to percentage scores. Next, the scaled data was stored as csv files to implement our NBC-based algorithms on these pre-processed datasets.

Table 3. The Models Developed for Feature Selection: Naive Bayes Classification - Introduction to Programming

Model# | Feature          | Type                                                | Model equation
#1     | PSI              | Cognitive variables                                 | PSI → FEG
#2     | PPK              | Cognitive variables                                 | PPK → FEG
#3     | PSI, PPK         | Cognitive variables                                 | PSI, PPK → FEG
#4     | HE               | Formative assessment tasks                          | HE → FEG
#5     | DE               | Formative assessment tasks                          | DE → FEG
#6     | HE, DE           | Formative assessment tasks                          | HE, DE → FEG
#7     | PSI, HE          | Cognitive variables and formative assessment tasks  | PSI, HE → FEG
#8     | PSI, DE          | Cognitive variables and formative assessment tasks  | PSI, DE → FEG
#9     | PSI, HE, DE      | Cognitive variables and formative assessment tasks  | PSI, HE, DE → FEG
#10    | PSI, PPK, HE     | Cognitive variables and formative assessment tasks  | PSI, PPK, HE → FEG
#11    | PSI, PPK, DE     | Cognitive variables and formative assessment tasks  | PSI, PPK, DE → FEG
#12    | PPK, HE          | Cognitive variables and formative assessment tasks  | PPK, HE → FEG
#13    | PPK, DE          | Cognitive variables and formative assessment tasks  | PPK, DE → FEG
#14    | PPK, HE, DE      | Cognitive variables and formative assessment tasks  | PPK, HE, DE → FEG
#15    | PSI, PPK, HE, DE | Cognitive variables and formative assessment tasks  | PSI, PPK, HE, DE → FEG

PSI: Problem solving skills; PPK: Prior programming knowledge; HE: Homework exercise; DE: Demo exercise; FE: Final exam; FEG: Final exam grade

We developed 15 predictive models for each of the two courses, with the following combinations of predictor variables to measure the differences between predictive capabilities of formative assessments and other variables to answer RQ1. Models #1 – #15 were developed to predict final exam grades for Introduction to Programming and Models #16 – #30 were developed to predict final exam grades for Algorithms and Programming. Models #1 – #3 and #16 – #18 were developed using cognitive factors as input variables to predict final exam grades for both courses. Models #4 – #6 and #19 – #21 were developed using formative assessment tasks as input variables to predict final exam grades for both courses. Models #7 – #15 and #22 – #30 were developed using both formative assessment tasks and cognitive factors as predictor variables to
predict student final exam grades for both courses. The models (#1 – #2, #4 – #5) and (#16 – #17, #19 – #20) were developed with single features for both courses to examine the models’ overall prediction accuracies, at-risk prediction accuracies, not-at-risk prediction accuracies and AUC results of those models, in order to determine the most valuable combination of predictors for model development. Tables 3 and 4 show NBC models developed for feature selection.

Table 4. The models developed for feature selection: Naive Bayes classification - Algorithms and Programming

Model# | Feature          | Type                                                | Model equation
#16    | PSI              | Cognitive variables                                 | PSI → FEG
#17    | PPK              | Cognitive variables                                 | PPK → FEG
#18    | PSI, PPK         | Cognitive variables                                 | PSI, PPK → FEG
#19    | HE               | Formative assessment tasks                          | HE → FEG
#20    | TT               | Formative assessment tasks                          | TT → FEG
#21    | HE, TT           | Formative assessment tasks                          | HE, TT → FEG
#22    | PSI, HE          | Cognitive variables and formative assessment tasks  | PSI, HE → FEG
#23    | PSI, TT          | Cognitive variables and formative assessment tasks  | PSI, TT → FEG
#24    | PSI, HE, TT      | Cognitive variables and formative assessment tasks  | PSI, HE, TT → FEG
#25    | PSI, PPK, HE     | Cognitive variables and formative assessment tasks  | PSI, PPK, HE → FEG
#26    | PSI, PPK, TT     | Cognitive variables and formative assessment tasks  | PSI, PPK, TT → FEG
#27    | PPK, HE          | Cognitive variables and formative assessment tasks  | PPK, HE → FEG
#28    | PPK, TT          | Cognitive variables and formative assessment tasks  | PPK, TT → FEG
#29    | PPK, HE, TT      | Cognitive variables and formative assessment tasks  | PPK, HE, TT → FEG
#30    | PSI, PPK, HE, TT | Cognitive variables and formative assessment tasks  | PSI, PPK, HE, TT → FEG

PSI: Problem solving skills; PPK: Prior programming knowledge; HE: Homework exercises; TT: Tutorial exercise; FE: Final exam; FEG: Final exam grade
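As a side note, the 15 predictor combinations per course in Tables 3 and 4 are simply all non-empty subsets of the four candidate features. The sketch below (an illustrative assumption in R, not the authors' code) enumerates them and builds the corresponding model formulas for the Introduction to Programming feature set; the resulting order differs from the paper's model numbering.

```r
## Illustrative sketch (assumed): enumerating the 15 predictor subsets of
## Table 3 and turning each into a formula that could be passed to naiveBayes().
features <- c("PSI", "PPK", "HE", "DE")
combos <- unlist(lapply(seq_along(features), function(k)
  combn(features, k, FUN = paste, collapse = " + ")))
length(combos)                                   # 15 feature combinations
formulas <- lapply(combos, function(rhs) as.formula(paste("FEG ~", rhs)))
formulas[[7]]                                    # FEG ~ PSI + DE
```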

We were interested in identifying the most valuable features in predicting student performance for both courses. As such, we examined the resultant AUC for each feature when used for model development and final testing. Figure 3 and Fig. 4 present the resultant AUC for training and unknown (testing) datasets, when using each feature individually to predict student final exam grades for both courses.


Fig. 3. Performance of models with single feature: Introduction to Programming

Fig. 4. Performance of models with single feature: Algorithms and Programming

The DE feature returned the best AUC of all features, followed by PSI. This implies that DE and PSI are powerful predictors of student performance in Introduction to Programming (Fig. 3). On the other hand, the cognitive features (PPK and PSI) returned the best AUC of all features for Algorithms and Programming, although AUC results on the unknown dataset were varied (Fig. 4). As noted, one of the objectives of this study was to determine whether our model(s) could be used as early warning system(s) for instructors to identify students in need of academic support (RQ2). Therefore, we selected the top three models (for both courses) that returned the highest prediction accuracies in the year 2016 (Table 5). Then these selected
models were employed on the dataset collected in the year 2017 (unknown data) to determine how well these models would work on unknown data (Table 6) and to propose the models with good predictive power as early warning systems.

Table 5. The overall and at-risk student prediction accuracies on training set: 2016

Model# | Overall prediction accuracy (MPA) | At-risk prediction accuracy (ATSE) | Not-At-Risk prediction accuracy (ATSP) | Area under the curve (AUC) | Course
#8     | 79.69 | 81.25 | 78.12 | 0.79 | Introduction to Programming
#11    | 75.00 | 81.25 | 68.75 | 0.78 | Introduction to Programming
#5     | 76.56 | 78.12 | 75.00 | 0.74 | Introduction to Programming
#29    | 65.21 | 61.36 | 69.05 | 0.65 | Algorithms and Programming
#30    | 65.26 | 59.09 | 71.43 | 0.64 | Algorithms and Programming
#28    | 61.24 | 61.36 | 61.11 | 0.59 | Algorithms and Programming

Table 5 shows the top three selected models for each of the two courses for the year 2016, which had the highest prediction accuracies (AUC scores). The K-fold cross-validation results for the training set revealed that model #8, with PSI and DE only as predictors, returned the best prediction accuracy (MPA: 80%, ATSE: 81%, ATSP: 78%, AUC: 0.79) for Introduction to Programming. Similarly, model #29, developed with continuous assessment tasks (HE and TT) and the cognitive variable (PPK) only as predictors, returned the best prediction accuracy for Algorithms and Programming (MPA: 65%, ATSE: 61%, ATSP: 69%, AUC: 0.65).

Table 6. Top three models’ performance on test dataset for the year 2017

Model# | Overall prediction accuracy (MPA) | At-risk prediction accuracy (ATSE) | Not-At-Risk prediction accuracy (ATSP) | Area under the curve (AUC) | Course
#8     | 70.91 | 93.75 | 48.08 | 0.66 | Introduction to Programming
#11    | 67.79 | 87.50 | 48.08 | 0.63 | Introduction to Programming
#5     | 68.99 | 93.75 | 44.23 | 0.65 | Introduction to Programming
#29    | 49.65 | 0     | 99.29 | 0.41 | Algorithms and Programming
#30    | 49.65 | 0     | 99.29 | 0.41 | Algorithms and Programming
#28    | 49.65 | 0     | 99.29 | 0.41 | Algorithms and Programming

Table 6 shows the unknown data test results (year 2017) of the top three models for each of the two courses. The results were mixed. On average, the at-risk prediction
accuracy for identifying students who needed support for Introduction to Programming was 91.67% (Models #8, #11, and #5) in compliance with AUC scores (0.66). However, the not-at-risk prediction accuracy, for identifying students who did not need support, was statistically poor (46.80%). On the other hand, on average, the at-risk prediction accuracy for identifying students who needed support for Algorithms and Programming was 0% (Models #29, #30, and #28), which is statistically insignificant and addressed below.

5 Discussion The main objective of this study was to construct a predictive model with as few predictor variables as possible to predict the final programming exam performance of students, in order to identify at-risk students early. The validation and unknown data test results for models (Fig. 3) with a single feature as predictors for Introduction to Programming revealed that demo exercises were the most influential factor (AUC: 0.65) in determining student final exam performance. On the other hand, prior programming knowledge returned the best AUC (0.63) and may serve as a predictor of student success in Algorithms and Programming (Fig. 4). These results imply that models developed and tested with a combination of demo exercise scores and problem-solving skills may yield better prediction accuracies than models developed with other features for Introduction to Programming. Similarly, the unknown dataset results for models with single features as predictors for Algorithms and Programming (Fig. 4) revealed that models with cognitive features (prior programming knowledge and problem-solving skills) may yield better prediction accuracies than other models. Our statistical results of prediction accuracies computed across all K-fold cross-validation (training) and unknown data tests for Introduction to Programming yielded good results (Tables 5 and 6). That is, it is possible to identify at-risk students in the first four weeks, based on student prior programming knowledge, problem-solving skills, and formative assessment results. Hence, these results answered our RQ1. The overall model prediction accuracy and at-risk prediction accuracy of the top three models (year 2016) selected for Introduction to Programming were very good and congruent with single feature based model results (Fig. 3 and Table 5). In addition, the unknown data test results (year 2017) reveal that the models selected based on high prediction accuracies can be proposed as early warning systems for Introduction to Programming (Table 6). On the other hand, the unknown dataset test results on identifying at-risk students for Algorithms and Programming produced insignificant results (Table 6) and were not congruent with results of models tested with single features (Fig. 4). These results raised the following points. First, unlike for Introduction to Programming, the grade computation for Algorithms and Programming is slightly different. The final exam need not be sat to pass Algorithms and Programming, which would have influenced the predictive performance of the models (#28–#30). Second, registration to attend the final exam is allowed as late as the last lecture week, which would have affected the student performance in the final exam. Moreover, if a student did not secure 50% or more in the final exam they were still able to attain at least a grade 1 score, by securing 80%–94% in the formative assessment tasks. Third, as per standard rules, students could opt to sit the final exam to improve their final course grades despite the scores* they attained
in selected formative assessments. However, preliminary post hoc anecdotal results showed that students who secured adequate scores in formative assessments (*50% or more but below 95%), who were therefore eligible to sit for the final exam, achieved higher grades than students who achieved grade 1 or 2 via formative assessment scores. Therefore, the differences between student final exam grades and grades achieved through selected formative assessment scores warrant further analysis, in order to improve the model’s predictive performance for Algorithms and Programming. Moreover, these results persuaded us to redefine our research question, RQ2: (i) Might our proposed model with these predictor variables be deployed in an early warning system to support instructors and students? (ii) Might our proposed model be transformed into a generic predictive model for other courses with continuous formative assessments and a final exam, to predict student performance early in the semester? As noted, several studies attempted to establish early warning systems to identify at-risk students and allow for more timely pedagogical interventions [46–48]. In the same vein, we also attempted to develop a model with the aim of using it for all introductory programming courses. The at-risk prediction accuracy results for the training and unknown data tests (81% and 94%, model #8) of this study support the contention that our proposed model may be deployed as an early warning system to predict students who need early assistance. In addition, these results (Tables 5 and 6) also revealed that it is possible to predict students who need support early, based on problem-solving skills and formative assessment (demo exercise) scores secured in the first four weeks of the semester (Fig. 3 and Table 6). Hence, these results imply that our model may be adapted as an early warning system in programming courses assessed with continuous assessment and final exam components, to predict student academic performance and to identify students who need support. The model may be used by instructors to categorize students as, for example, “at-risk”, “marginal”, “average”, “good”, “very good”, and “excellent”, based on predicted final exam grades, and accordingly to reshape their pedagogical practices. In addition, instructors may deploy this model as part of their student academic monitoring system to get actionable data for analysis and to support students in personalised ways. For example, after identifying less-motivated learners via this model, instructors may counsel them about strategies to improve their performance. Students too may use the model(s) in the following ways. First, the results of these models may be delivered as real-time feedback to learners, to persuade or encourage them to devote attention to specific learning activities, in order to improve their performance before they reach a critical juncture. Second, students may perceive these results as ongoing performance indicators to shore up their own learning goals to improve learning and academic performance. Third, the detection of early warning signs could persuade students towards improving their learning. Therefore, we conclude that the results of these models may help students to alter their learning behaviors, and to better understand their performances, shortcomings and successes.
As noted earlier, most previous studies to develop predictive models reported that their models have generalizability problems, due to the following factors: (a) unsuitability of research methods due to technical, cultural, or ethical reasons; (b) course specific predictor variables for the model developed; and (c) cognitive factors. Keeping these
problems in mind, we developed our model with transformable predictor variables to test for other courses and on unknown data. Take first the student perceived general problem-solving skills. Many studies have reported the importance of measuring student general problem-solving abilities [16, 55]. Our research findings (Fig. 3) also suggested that being aware of the problem-solving skills of incoming freshmen would help instructors to review their problem-solving based instructional methods, to develop and to enhance students’ metacognitive, computational thinking skills [53]. Based on these results, we recommend that student perceived problem-solving skills should most likely be included as one of the predictive variables in mathematical models for predicting student performance. Consider now the second variable, prior programming knowledge. Although student perceived problem-solving ability plays an important role in student learning, prior subject knowledge remains the central variable in student learning and performance [56]. In addition, the results of this study and our past work [54] suggest that prior knowledge can be considered as appropriate for predicting at-risk students (Tables 5 and 6). However, our unknown data test results for models (#28–#30) produced insignificant results, which need further analysis. Despite these mixed results, this variable may be adapted to predict student academic performance. As noted, predictive models developed for predicting student performance have at least one assessment task as a predictor variable. The overall prediction accuracy of the models (Tables 5 and 6) confirms that a formative assessment task can have a predictive influence on student performance, early in the semester. Therefore, we conclude that formative assessment tasks may be considered as a significant predictor for measuring student progress early. However, the selection of predictor variables associated with assessment tasks varies from course to course. For example, in this study we chose homework and demo/tutorial exercises for our predictive model, based on earlier research findings [31], to predict student performance. However, our feature selection and unknown data test results suggest that the models developed with in-classroom assessments (DE/TT) have higher prediction accuracies (MPA, ATSE and ATSP) than models developed with outside-classroom assessments, a result that needs further analysis. Therefore, it is important to select suitable or discipline-specific formative assessment tasks to measure student-learning outcomes, to track ongoing performance of students. However, it should be noted that courses that have only a final exam and no continuous assessments may use cognitive factors alone as predictor variables, as a post hoc analysis. Based on the past research findings and results of our study (Fig. 3 and Tables 5 and 6 for Introduction to Programming) we have argued that our proposed generic predictive model may be deployed for other programming and non-programming courses, if the goal of the instructor is to predict student performance and to identify low-motivated learners early in the semester.


Fig. 5. Generic predictive model for student performance prediction

6 Conclusions, Limitations and Future Work of the Study Our research study showed that student perceived problem-solving skills, prior programming knowledge, and formative assessment tasks were significant in predicting student final exam grades for Introduction to Programming. However, the unknown dataset test results for Algorithms and Programming were statistically insignificant. The results showed that student perceived problem-solving skills, prior programming knowledge and formative assessments, captured in a predictive model, was a good fit of the data. Furthermore, the models developed with a combination of in-class assessment variables and cognitive features yielded better at-risk prediction accuracies than other models, developed with a combination of take-home assessment variables. The overall success of the models was good, which persuaded us to update our research questions (Tables 5 and 6). The model may be used as an early warning system for instructors to identify students needing early assistance. Therefore, we presented the generic form of predictive model (Fig. 5), and how this model could be applied and developed in future research, for early identification of at-risk students in other courses that use continuous assessment tasks and a final exam. The results of the study will be used in future work, to build an accurate predictive model for student academic performance in other programming courses. Despite our promising results, this study has several limitations that influence the overall generalizability and interpretation of the findings. First, the data used in this study was collected within one institution. From a statistical point of view, it is true that sample size influences research outcomes. However, a number of studies on predictive models suggest that sample size for analysis is often determined by the research question or the analysis intended, although bigger samples are better to define the reliability and validity of the research [57]. As such, in future we plan to include data from multiple courses collected over multiple semesters. Second, although this study used multiclass classification, we mainly focused on examining at-risk class accuracies and did not analyse other classes of the dataset prediction accuracies individually. However, as noted (Fig. 1 and 2), all other classes (grades 2 to 5) were considered as not-at-risk for this study to proceed further. Third, we measured student prior programming knowledge on a broad level. Therefore, it is unknown whether the participants responded honestly to the survey. Fourth, although the average prediction accuracy on test data was good, the predictive performance of the model might be influenced by the course assessment nature, which means that the resulting model may not work well to predict unknown
data. Fifth, we used the first four weeks of assessment for analysis. However, learning is dynamic and so a learner might do well in the beginning weeks of the semester and may not perform well in the second half of the semester. Hence, there is a need to monitor and track student progress throughout the course period, in order to be better informed about providing continuous academic support. Therefore, we plan to extend our study to predict student performance based on assessment results collected in different weeks of the course, to determine if these results better inform instructors to monitor and evaluate student-learning outcomes. For example, if the duration of the course period is 12 weeks, then the prediction will be deployed in three cycles based on the first, middle, and final four weeks of the course. In spite of these limitations, we contend that our models can be applied to other courses with continuous formative assessments and a final exam to predict student performance. Moreover, in this work, we proposed a prediction methodology using Naive Bayes classification for early identification of students who need support. The generic model introduced in this research paper provides a guide for future work. For example, to identify at-risk students in accounting, the predictor variables used in this study may be adapted as generic predictors to fit course-specific data mining predictive models for student performance. That is, prior programming knowledge can be transformed into prior knowledge in accounting to measure student prior accounting knowledge, with selected formative assessment tasks to predict student performance. The possible research, notionally, is predicting accounting student performance using prior accounting knowledge and selected ongoing formative assessment task results. Similarly, problem-solving skills may be used to measure student perceived problem-solving abilities of accounting students. Moreover, this model may be deployed to seek answers to the following research questions: (i) Does prior knowledge influence student learning? (ii) How does student prior knowledge in the topic/subject impact classroom teaching and learning? (iii) What kind of support activities might be provided to at-risk students identified by this model? (iv) Does the prediction accuracy of the model vary significantly based on the predictive algorithms used? Apart from the above, the proposed model of this study might be used for further research including some more significant factors that are more widely used in predictive models for predicting student performance. For example, psychological variables such as self-belief, self-regulated learning and self-efficacy, as well as educational variables such as prior GPA and other academic variables obtainable from the LMS, may be included to enhance the prediction accuracy of the model. Acknowledgments. The authors wish to thank all members of the ViLLE research team and the Department of Future Technologies, University of Turku, for their comments and support that greatly improved the manuscript. This research was supported fully by the University of Turku, Turku, Finland.

References 1. Ali, A., Smith, D.: Teaching an introductory programming language. J. Inf. Technol. Educ.: Innov. Pract. 13, 57–67 (2014)


2. Holvikivi, J.: Conditions for successful learning of programming skills. In: Reynolds, N., Turcsányi-Szabó, M. (eds.) KCKS 2010. IAICT, vol. 324, pp. 155–164. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15378-5_15 3. Watson, C., Li, F.W.B. Failure Rates in introductory programming revisited. In Proceedings of the 2014 Conference on Innovation & Technology in Computer Science Education (Uppsala 2014), pp. 39–44. Association of Computing Machinery (2014) 4. Bennedsen, J., Caspersen, M.: Failure rates in introductory programming: 12 years later. ACM Inroads 10(2), 30–36 (2019) 5. Castro-Wunsch, K., Ahadi, A., Petersen, A.: Evaluating neural networks as a method for identifying students in need of assistance. In: Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education (Seattle 2017), pp. 111–116. ACM (2017) 6. Conijn, R., Snijders, C., Kleingeld, A., Matzat, U.: Predicting student performance from LMS data: a comparison of 17 blended courses using moodle LMS. IEEE Trans. Learn. Technol. 10(1), 17–29 (2017) 7. Liao, S.N., Zingaro, D., Alvarado, C., Griswold, W.G., Porter, L.: Exploring the value of different data sources for predicting student performance in multiple CS courses. In: Proceedings of the 50th ACM Technical Symposium on Computer Science Education, Minneapolis, MN, USA, pp. 112–118. ACM (2019) 8. Pawlowska, D.K., Westerman, J.W., Bergman, S.M., Huelsman, T.J.: Student personality, classroom environment, and student outcomes: a person–environment fit analysis. Learn. Individ. Differ. 36, 180–193 (2014) 9. Costa, E.B., Fonseca, B., Santana, M.A., Araújo, F.F., Rego, J.: Evaluating the effectiveness of educational data mining techniques for early prediction of students’ academic failure in introductory programming courses. Comput. Hum. Behav. 73, 247–256 (2017) 10. Roberts, S.A.: Parsimonious modelling and forecasting of seasonal time series. Eur. J. Oper. Res. 16, 365–377 (1984) 11. Vandekerckhove, J., Matzke, D., Wagenmakers, E.-J.: Model comparison and the principle of parsimony. In: Busemeyer, J.R., et al. (eds.) The Oxford Handbook of Computational and Mathematical Psychology. Oxford University Press, New York (2015) 12. D’zurilla, T.J., Nezu, A.M., Maydeu-Olivares, A.: Social problem solving: theory and assessment. In: Chang, E.C., et al. (eds.) Social problem Solving: Theory, Research, and Training. American Psychological Association, Washington, DC (2004) 13. White, H.B., Benore, M.A., Sumter, T.F., Caldwell, B.D., Bell, E.: What skills should students of undergraduate biochemistry and molecular biology programs have upon graduation? Biochem. Mol. Biol. Educ. 41(5), 297–301 (2013) 14. Kappelman, L., Jones, M.C., Johnson, V., McLean, E.R., Boonme, K.: Skills for success at different stages of an IT professional’s career. Commun. ACM 59(8), 64–70 (2016) 15. Heppner, P.P., Krauskopf, C.J.: An information-processing approach to personal problem solving. Counsell. Psychol. 15(3), 371–447 (1987) 16. Adachi, P.J.C., Willoughby, T.: More than just fun and games: the longitudinal relationships between strategic video games, self-reported problem solving skills, and academic grades. J. Youth Adolesc. 42(7), 1041–1052 (2013) 17. Bester, L.: Investigating the problem-solving proficiency of second-year quantitative techniques students: the case of Walter Sisulu University. University of South Africa, Pretoria (2014) 18. Marion, B., Impagliazzo, J., St. 
Clair, C., Soroka, B., Whitfield, D.: Assessing computer science programs: what have we learned. In: SIGCSE 2007 Proceedings of the 38th SIGCSE Technical Symposium on Computer Science Education, Covington, Kentucky, USA, pp. 131– 132. ACM (2007) 19. Ring, B.A., Giordan, J., Ransbottom, J.S.: Problem solving through programming: motivating the non-programmer. J. Comput. Sci. Coll. 23(3), 61–67 (2008)


20. Uysal, M.P.: Improving first computer programming experiences: the case of adapting a websupported and well- structured problem-solving method to a traditional course. Contemp. Educ. Technol. 5(3), 198–217 (2014) 21. Svinicki, M.: What they don’t know can hurt them: the role of prior knowledge in learning. POD network, Nederland, Colorado (1993) 22. Hailikari, T.: Assessing university students’ prior knowledge implications for theory and practice. Helsinki (2009) 23. Watson, C., Li, F.W.B., Godwin, JL.: No tests required: comparing traditional and dynamic predictors of programming success. In: Proceedings of the 45th ACM Technical Symposium on Computer Science Education, pp. 469–474. ACM (2014) 24. Longi, K.: Exploring factors that affect performance on introductory programming courses. University of Helsinki, Helsinki (2016) 25. Hsu, W.C., Plunkett, S.W.: Attendance and grades in learning programming classes. In: Proceedings of the Australasian Computer Science Week Multi Conference, Canberra (2016) 26. Sabin, M., Alrumaih, H., Impagliazzo, J., Lunt, B., Zhang, M.: Information technology curricula 2017: curriculum guidelines for baccalaureate degree programs in information technology. ACM and IEEE, New York (2017) 27. Bloom Benjamin, S., Hastings, J.T., Madaus, G.F.: Handbook on Formative and Summative Evaluation of Student Learning. McGraw-Hill Book Company, New York (1971) 28. Lau, A.M.S.: ‘Formative good, summative bad?’ – a review of the dichotomy in assessment literature. J. Furth. High. Educ. 40(16), 509–525 (2016) 29. VanDeGrift, T.: Supporting creativity and user interaction in CS 1 homework assignments. In: 46th ACM Technical Symposium on Computer Science Education, Kansas City, pp. 54–59. ACM (2015) 30. Rajoo, M., Veloo, A.: The relationship between mathematics homework engagement and mathematics achievement. Aust. J. Basic Appl. Sci. 9(28), 136–144 (2015) 31. Veerasamy, A.K., D’Souza, D., Lindén, R., Kaila, E., Laakso, M.-J., Salakoski, T.: The impact of lecture attendance on exams for novice programming students. Int. J. Mod. Educ. Comput. Sci. (IJMECS) 8(5), 1–11 (2016) 32. Fan, H., Xu, J., Cai, Z., He, J., Fan, X.: Homework and students’ achievement in math and science: A 30-year meta-analysis, 1986–2015. Educ. Res. Rev. 20, 35–54 (2017) 33. Fujinuma, R., Wendling, L.: Repeating knowledge application practice to improve student performance in a large, introductory science course. Int. J. Sci. Educ. 37(17), 2906–2922 (2015) 34. Thong, L.W., Ng, P.K., Ong, P.T., Sun, C.C.: Performance analysis of students learning through computer-assisted tutorials and item analysis feedback learning (CATIAF) in foundation mathematics. Herald NAMSCA, vol. 1, p. 1 (2018) 35. Ahadi, A., Lister, R., Haapala, H., Vihavainen, A.: Exploring machine learning methods to automatically identify students in need of assistance. In: Proceedings of the Eleventh Annual International Conference on International Computing Education Research, Omaha, Nebraska, USA, pp. 121–130. ACM (2015) 36. Porter, L., Zingaro, D., Lister, R.: Predicting student success using fine grain clicker data. In: Proceedings of the Tenth Annual Conference on International Computing Education Research, Glasgow, Scotland, United Kingdom, pp. 51–58. ACM (2014) 37. Quille, K., Bergin, S.: Programming: predicting student success early in CS1. A re-validation and replication study. In: Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education, Larnaca, Cyprus, pp. 15–20. ACM (2018) 38. 
Liao, S., Zingaro, D., Thai, K., Alvarado, C., Griswold, W., Porter, L.: A robust machine learning technique to predict low-performing students. ACM Trans. Comput. Educ. 19(3), 18:1–18:19 (2019)


39. Liao, S.N., Zingaro, D., Laurenzano, M.A., Griswold, W.G., Porter, L.: Lightweight, early identification of at-risk CS1 students. In: Proceedings of the 2016 ACM Conference on International Computing Education Research, Melbourne, VIC, Australia, pp. 123–131. ACM (2016) 40. Hamoud, A.K., Humadi, A.M., Awadh, W.A., Hashim, A.S.: Students’ success prediction based on Bayes algorithms. Int. J. Comput. Appl. 178(7), 6–12 (2017) 41. Devasia, T., Vinushree, T.P., Hegde, V.: Prediction of students performance using educational data mining. In: 2016 International Conference on Data Mining and Advanced Computing (SAPIENCE), Ernakulam, pp. 91–95. IEEE (2016) 42. Agrawal, H., Mavani, H.: Student performance prediction using machine learning. Int. J. Eng. Res. Technol. (IJERT) 4(3), 111–113 (2015) 43. Bergin, S., Mooney, A., Ghent, J., Quille, K.: Using machine learning techniques to predict introductory programming performance. Int. J. Comput. Sci. Softw. Eng. 4(12), 323–328 (2015) 44. Borra, S., Di Ciaccio, A.: Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Comput. Stat. Data Anal. 54, 2976–2989 (2010) 45. Macfadyen, L.P., Dawson, S.: Mining LMS data to develop an “early warning system” for educators: a proof of concept. Comput. Educ. 54, 588–599 (2009) 46. Krumm, A., Joseph Waddington, R., Teasley, S., Lonn, S.: A learning management systembased early warning system for academic advising in undergraduate engineering. In: Larusson, J.A., White, B. (eds.) Learning Analytics, pp. 103–119. Springer, New York (2014). https:// doi.org/10.1007/978-1-4614-3305-7_6 47. Arnold, K.E., Pistilli, M.D.: Course signals at Purdue: using learning analytics to increase student success. In: Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, Vancouver, British Columbia, Canada, pp. 267–270. ACM (2012) 48. Pistilli, M., Willis, J., Campbell, J.: Analytics through an institutional lens: definition, theory, design, and impact. In: Larusson, J.A., White, B. (eds.) Learning Analytics, pp. 79–102. Springer, New York (2014). https://doi.org/10.1007/978-1-4614-3305-7_5 49. Ya-Han, H., Lo, C-.L., Shih, S-.P.: Developing early warning systems to predict students’ online learning performance. Comput. Hum. Behav. 36, 469–478 (2014) 50. Pedraza, D.A.: The relationship between course assignments and academic performance: an analysis of predictive characteristics of student performance. Texas Tech University (2018) 51. Marbouti, F., Diefes-Dux, H.A., Madhavan, K.: Models for early prediction of at-risk students in a course using standards-based grading. Comput. Educ. 103, 1–15 (2016) 52. Heppner, P.P., Petersen, C.H.: The development and implications of a personal problemsolving inventory. J. Counsell. Psychol. 29(1), 66–75 (1982) 53. Veerasamy, A.K., D’Souza, D., Lindén, R., Laakso, M.-J.: Relationship between perceived problem-solving skills and academic performance of novice learners in introductory programming courses. J. Comput. Assist. Learn. 35(2), 246–255 (2019) 54. Veerasamy, A.K., D’Souza, D., Linden, R., Laakso, M.-J.: The impact of prior programming knowledge on lecture attendance and final exam. J. Educ. Comput. Res. 56(2), 226–253 (2018) 55. Özyurt, Ö.: Examining the critical thinking dispositions and the problem solving skills of computer engineering students. Eurasia J. Math. 11, 2 (2015) 56. Chakrabarty, S., Martin, F.: Role of prior experience on student performance in the introductory undergraduate CS course. 
In: SIGCSE 2018 Proceedings of the 49th ACM Technical Symposium on Computer Science Education, Baltimore, Maryland, USA, pp. 1075–1075. ACM (2018) 57. Austin, P., Steyerberg, E.: The number of subjects per variable required in linear regression analyses. J. Clin. Epidemiol. 68(6), 627–636 (2015)

Youden's J and the Bi Error Method

MaryLena Bleile
Statistical Science, Southern Methodist University, Dallas, USA
[email protected]

Abstract. Incorrect usage of p-values, particularly within the context of significance testing using the arbitrary .05 threshold, has become a major problem in modern statistical practice. The prevalence of this problem can be traced back to the context-free 5-step method commonly taught to undergraduates: we teach it because it is what is done, and we do it because it is what we are taught. This holds particularly true for practitioners of statistics who are not formal statisticians. Thus, in order to improve scientific practice and overcome statistical dichotomania, an accessible replacement for the 5-step method is warranted. We propose a method founded on the use of the Youden Index as a decision threshold, an approach which has been shown in the literature to be effective in conjunction with neutral zones. Unlike the traditional 5-step method, our 5-step method (the Bi Error method) allows for neutral results, does not require p-values, and does not provide any default threshold values. Instead, our method explicitly requires contextual error analysis as well as quantification of statistical power. Furthermore, and in part because it does not use p-values, the method offers improved accessibility. This accessibility is supported by a generalized analytical derivation of the Youden Index.

Keywords: Youden Index · p-Value · Type II error

1 Introduction

The problematic nature of blindly executing hypothesis tests against the p < .05 threshold has been clearly illustrated: the problem received official recognition by the ASA in 2016, culminating in the recent release of a special edition of The American Statistician, which featured 43 papers on the topic [6, 16]. Subsequently, the National Institute of Statistical Science hosted a webinar on alternatives to the p-value, which was followed up in November with a more in-depth discussion featuring three experts in the field: Jim Berger, Sander Greenland, and Robert Matthews [7, 8]. Cobb points out the cyclic nature of the problem: "We teach it because it's what we do, we do it because it's what we teach" [16]. A recent paper by Cassidy et al. even went so far as to show that 89% of psychology textbooks that attempt to explain statistical significance do so incorrectly [2]. Furthermore, in some cases, misinterpretation of statistical significance (or non-significance) can be fatal to those who would otherwise benefit from the results of the research: earlier this year, a study which investigated the effectiveness of Remdesivir as a treatment for COVID-19 incorrectly interpreted its lack of statistically significant results to mean that "Remdesivir had no effect", when in reality the study was merely underpowered; the number of subjects was not sufficient to reduce the background noise in the data enough for the signal to be perceivable [12]. A later study with greater statistical power showed evidence that Remdesivir is, in fact, an effective treatment for the coronavirus [1]. Yet in spite of the extensive literature on the adverse effects of procedural, context-blind statistical hypothesis testing with p < .05, we continue to teach (and sometimes practice) the 5-step Null-Hypothesis Significance Testing (NHST) method outlined in Algorithm 1.

Algorithm 1. NHST Method
1: Specify the null hypothesis
2: Specify the alternative hypothesis
3: Set the significance level (usually α = .05)
4: Calculate the test statistic and the corresponding p-value
5: If the p-value is less than the significance level, then reject the null hypothesis and conclude statistical significance. Otherwise, fail to reject the null hypothesis and conclude statistical non-significance.

Of course, the field of statistical methodology is rich and extensive beyond this. Specifically, hypothesis testing with neutral zones has been shown to be effective. The idea behind this strategy is to allow for a third option aside from rejecting or failing to reject the null hypothesis H0: inconclusive results. However, many of the methods from the statistical literature lack the simplicity and algorithmic structure of the existing 5-step method, which is perhaps what contributes to its attractiveness for teaching and for use by non-statisticians. To quote Wasserstein, Schirm, and Lazar, "Don't is not enough": it is not enough to simply ban scientists and statisticians from using or teaching the existing 5-step method, leaving a gap in the curriculum. In order to fill this gap, we propose a replacement 5-step method for use in place of the above [17]. Statistical hypothesis tests can be considered as binary classifiers between the null and alternative hypotheses. The Receiver Operating Characteristic (ROC) curve, which plots sensitivity vs. 1 − specificity, is of special interest when discussing classifiers. (Here, sensitivity is the propensity of the test to reject the null given that the null is actually false, whereas specificity is the propensity of the test to fail to reject the null given that the null is actually true.) Naturally, adding these two quantities together results in a value analogous to the chance of making a correct decision. Note that both sensitivity and specificity depend only on the decision threshold one uses to classify. The Youden Index is defined as J = argmax_c (Sensitivity(c) + Specificity(c) − 1), where c is the decision threshold [19]. That is, J is the decision threshold which maximizes the aggregated sensitivity and specificity of the classifier (see Fig. 1).
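As a concrete illustration, the following minimal R sketch (not from the paper; the two normal sampling distributions and their parameters are assumed purely for the example) locates the Youden threshold numerically by scanning candidate cutoffs and picking the one that minimizes α(c) + β(c); it agrees with the closed form (μ0 + μA)/2 derived later in the paper.

# Illustrative sketch: find the Youden threshold for hypothetical sampling
# distributions N(0, 1) under H0 and N(1, 1) under HA by grid search.
mu0 <- 0; muA <- 1; sigma <- 1
grid <- seq(mu0 - 3, muA + 3, by = 0.001)

alpha <- 1 - pnorm(grid, mean = mu0, sd = sigma)  # P(reject | H0) at each cutoff
beta  <- pnorm(grid, mean = muA, sd = sigma)      # P(fail to reject | HA) at each cutoff
J_hat <- grid[which.min(alpha + beta)]            # cutoff minimizing alpha + beta

c(numeric = J_hat, closed_form = (mu0 + muA) / 2) # both approximately 0.5 here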


Fig. 1. An ROC curve illustrating the Youden Index.

In other words, J is the decision threshold which minimizes the sum of the Type I and Type II errors (where the Type I error, denoted α, is the probability of falsely rejecting the null, and the Type II error, denoted β, is the probability of failing to reject when the null is false). That is, J = argmin_c (α + β) (note that α and β are both functions of the decision threshold c). Youden's Index has been widely used in the context of hypothesis testing with neutral zones [9]. Algorithm 2 outlines the proposed method. We call this method the "Bi Error Method" for the two following reasons:
1. Youden's J minimizes the sum of two error terms
2. Subjective analysis of two error rates is required

Algorithm 2. The Bi Error Method
1: Identify the null and alternative sampling distributions
2: Set reasonable upper thresholds for the Type I and Type II errors (α and β, respectively)
3: Find (or approximate) the Youden Index (J)
4: If the values of α, β (and, implicitly, ζ(J)) are reasonable, proceed to 5. Else, if any one of the above is too high, the results are inconclusive.
5: If α, β are reasonable and the observed test statistic is more extreme than x0, then reject H0. Else, fail to reject H0.

We provide five of the most compelling reasons why the proposed method is attractive as a replacement, at least for pedagogical purposes.
Accessibility. Unlike some of the more sophisticated methodology used in the statistical literature, our method has the same 5-step, algorithmic structure as the existing method.
No p-Values. The pervasiveness and negative effects of misinterpretations of p-values in the academic literature are well-documented [2, 4, 5, 14, 16]. Our method eliminates the need for their use, which serves to potentially increase the accuracy of statistical practice in scientific research.


Contextual Error Analysis. One of the ASA's recommendations for hypothesis testing was that context should be considered in the analysis [16]. Yet we still run into issues where hypothesis testing is blindly performed with no attention to context, which has led to potentially disastrous results [4, 12, 14]. Our method promotes "thoughtful research" as discussed in the literature, by including an explicit step involving contextual analysis, with no default thresholds for any of the values analyzed [17].
Inconclusive Results Option. The method promotes the acceptance of uncertainty in another way. In congruence with methods that have been shown to be effective in the literature, it explicitly allows for a third option besides rejecting or failing to reject the null hypothesis: inconclusiveness [9, 11].
Quantitative Incorporation of Type II Error. One of the major issues with the statistical hypothesis testing method currently in place is that it does not take into account the probability of falsely failing to reject, i.e., the Type II error [4, 12]. In contrast, through the utilization of the Youden index, this method explicitly incorporates this quantity into the analysis.
The remainder of the paper is structured as follows: in Sect. 2 we present the methods utilized in order to critically evaluate the performance of the Bi Error Method through simulation, as compared to the NHST method. Section 3 includes a brief description of the results of this evaluation, as well as the analytic results. In Sect. 4 we discuss potential criticisms of the method, as well as an analytic comparison of the expected accuracy of the Bi Error Method as compared to the NHST method. Finally, we investigate applications in cases where the analytical result does not apply.

2 Methods

Since our method involves contextual, subjective error analysis, it is impossible to simulate its true performance. We can, however, get a general idea of a lower bound on its performance by simulating the hypothesis testing scenario and dichotomously classifying results based on whether they are larger than the Youden index. It is notable that, since we are utilizing the Type II error in the test procedure, hypothesis tests must be performed with a point alternative. This is desirable since it forces the researcher to think about effect size: here the alternative parameter does not necessarily represent what the population parameter must be if the test rejects the null hypothesis; rather, it represents an effect that would be cared about within the context of the study. So, for the hypothesis test of
H0 : θ = θ0 vs. HA : θ = θA
there are actually three quantities of interest: θ, θ0, and θA, where θ is the actual, unobserved value of the parameter of interest, and θ0, θA are the researcher-determined point null and alternative values, where θA is determined


Algorithm 3. Monte Carlo Simulation
1: Generate a sample of size n from a normal distribution with mean μ, variance 1
2: Perform a one-sample t-test of the hypothesis H0 : μ = 0
3: Check whether the sample mean is greater than μ̂A/2, where μ̂A is the "hypothesized" alternative mean. If so, reject H0.
4: Repeat for all combinations of values of n, μ, and μ̂A

by the effect the researcher is expecting to see. The simulation was structured as follows: we tested n = 10, 20, 30, 50, μ = 0, 0.3, 0.5, 0.7, 1, and μ̂A = 0.3, 0.5, 0.7, 1, 1.5, 2, 2.5, 3, 3.5. The entire process was repeated a total of M = 10000 times. Simulation was conducted using the statistical package R [13].
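A minimal R sketch of one cell of this simulation is given below (not the author's code; the particular n, μ, and μ̂A shown are illustrative, and the NHST arm is assumed to use a one-sided t-test at α = .05).

# One illustrative cell of the Monte Carlo study: rejection rates for the
# Bi Error cutoff muA_hat/2 versus a one-sided t-test at alpha = .05.
set.seed(1)
M <- 10000; n <- 10; mu <- 0.5; muA_hat <- 1

bi_reject <- logical(M); nhst_reject <- logical(M)
for (m in 1:M) {
  x <- rnorm(n, mean = mu, sd = 1)
  bi_reject[m]   <- mean(x) > muA_hat / 2   # reject when the mean exceeds (mu0 + muA_hat)/2, with mu0 = 0
  nhst_reject[m] <- t.test(x, mu = 0, alternative = "greater")$p.value < 0.05
}
c(bi_error = mean(bi_reject), nhst = mean(nhst_reject))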

3 Results

As expected, simulation results showed that the Bi Error method generally outperformed the traditional NHST method in terms of accuracy, particularly when the hypothesized alternative mean was greater than the actual mean. Tabular results (see Tables 1, 2, 3, 4, 5, 6, 7 and 8) are available in Appendix A. In order to simplify step 3 for researchers and students, we also provide a generalized derivation of the Youden Index. Suppose we are testing
H0 : θ ≤ θ0 vs. HA : θ > θ0,
where θ is an unknown parameter and θ0 is a value. We restrict our attention to this case for simplicity, since the extension of our results to the case where θ < θ0 and to the two-tailed case is trivial.
Theorem 1. Suppose f0, fA are the sampling distributions of a statistic X under the null and alternative hypotheses, respectively, such that f0, fA are symmetric and satisfy |μ − x1| > |μ − x2| =⇒ f(x1) < f(x2) (that is, they decrease in the tails). Then J = (μ0 + μA)/2.
This result has been previously shown for the Normal distribution, which is a special case [3, 10, 15]. However, Theorem 1 is more general in that it applies to a wide range of distributions, including the case where the mean and/or variance do not exist. Before presenting the proof, it is convenient to introduce the following definition:
Definition 1. The Bi Error is defined as ζ(c) = α + β.
We now present the proof of Theorem 1.
Proof. Note that it suffices to minimize ζ, written here as ζ2(x) := 1 − F0(x) + FA(x). Assume, without loss of generality, that μA ≥ μ0. In this case, a critical value less than μ0


Fig. 2. The stationary point is found at f0(x) = fA(x).

Fig. 3. Heuristic justification for δ = 2(x − μ0).
is not useful, so we will restrict our attention to x > μ0. So, ∂ζ2/∂x = −f0(x) + fA(x). Setting this equal to zero yields f0(x) = fA(x) (see Fig. 2). Since f0, fA belong to the same location family, this is equivalent to saying f0(x) = f0(x − δ), where δ = μA − μ0. But f0 is unimodal, so the mode must be in the interval (x − δ, x). Also, since f0 is symmetric, δ = 2(x − μ0).

=⇒ μA − μ0 = 2(x − μ0)
=⇒ x = (μA + μ0)/2.
A heuristic explanation for the first step is available in Fig. 3. Since ζ2 is a function of a single variable, we know that this stationary point must be a maximum or a minimum. In order to show that it is a minimum, we will show that the derivative is negative in a range immediately before the point x0 = (μA + μ0)/2, and positive in a range immediately after it. Now we will show that
∀ε > 0, ζ2′((μ0 + μA)/2 + ε) > 0.  (1)
Consider ζ2′(x0 + ε) = −f0((μ0 + μA)/2 + ε) + fA((μ0 + μA)/2 + ε) = −fA((3μA − μ0)/2 + ε) + fA((μ0 + μA)/2 + ε). So, in order for (1) to hold, we need |fA((3μA − μ0)/2 + ε)| < |fA((μ0 + μA)/2 + ε)|. But since fA is symmetric and decreases in the tails, this holds if (3μA − μ0)/2 + ε ≥ (μ0 + μA)/2 + ε + 2(μA − ((μ0 + μA)/2 + ε)). But the term on the right of that inequality simplifies to (3μA − μ0)/2 − ε, as required. An identical argument shows ζ2′(x0 − ε) < 0.
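The following R sketch (an illustration added here, not part of the original proof) numerically confirms the claim of Theorem 1 for two symmetric location families, including the Cauchy case where the mean does not exist; the location values are assumed for the example.

# Numerical check: the Bi Error zeta(c) = 1 - F0(c) + FA(c) is minimized near
# (mu0 + muA)/2 for symmetric location families (normal and Cauchy shown).
mu0 <- 0; muA <- 2

zeta_norm   <- function(c) 1 - pnorm(c, mean = mu0) + pnorm(c, mean = muA)
zeta_cauchy <- function(c) 1 - pcauchy(c, location = mu0) + pcauchy(c, location = muA)

c(normal      = optimize(zeta_norm,   interval = c(mu0, muA))$minimum,
  cauchy      = optimize(zeta_cauchy, interval = c(mu0, muA))$minimum,
  closed_form = (mu0 + muA) / 2)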

4 Discussion

4.1 On the Choice of Hypothesized δ

One glaringly obvious possible criticism of this method lies in the necessary choice of the "alternative" distribution (μ̂A, or equivalently δ = μA − μ0). Furthermore, it does not make sense to estimate this parameter directly from the data (using the data twice: once to estimate the effect size, and then again to perform the test), since in the case of a symmetric distribution, if the null hypothesis is actually true, this will cause us to use a critical value of approximately (μ0 + μ̂A)/2 ≈ 2μ0/2 = μ0, which is obviously not useful, since it causes a Type I error of α = .5. In light of this, μ̂A should instead be chosen by the researcher in light of the context of the study (i.e., δ should be the effect, if one exists, that they are expecting). Furthermore, the estimate of μA should be identified before the analyst explores the data. Note that if μ̂A is too close to μ0, this threatens the experiment with inconclusive results due to an unreasonably high Type I error. Similarly, if μ̂A is too far from μ0, this threatens the experiment with inconclusive results due to an unreasonably high Type II error. The simulated results indicate that the method performs particularly well when the researcher marginally overestimates the expected effect, i.e., when μ̂A > μ. We believe this lends additional credence to the validity of our method due to the fact that it appears to take advantage of a cognitive bias, but we defer a thorough discussion of this to a subsequent paper.

4.2 Normal Distribution

We have shown that the Bi Error is minimized when we choose (μ0 + μA)/2 as a critical value. A natural question is, what difference does this make, as compared with conventional methods? That is, what does the Bi Error look like when we choose a critical value corresponding to α = 0.05? As suggested in the introduction, consider the case where f0, fA are normal distributions with different location parameters. Clearly the normal distribution satisfies the conditions necessary for Theorem 1 to hold. We are interested in the change in the Bi Error due to choosing α, β such that ζ is minimized, as opposed to choosing α = 0.05. More formally, we want to look at ξ = ζ(z0.05) − ζ((μ0 + μA)/2), where z0.05 is the 95th percentile of N(μ0, σ2). Plotting with fixed σ2 revealed that ξ has a positive relationship with δ = μA − μ0, as shown in Fig. 4. Plotting with fixed μA, μ0 and letting σ2 vary revealed an inverse relationship between ξ and σ2 (Fig. 5). So, by definition, ξ increases with effect size.
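A small R sketch of this comparison is given below (illustrative only; the fixed σ2 and the δ grid are assumptions chosen to mirror Fig. 4).

# Extra Bi Error from using the alpha = 0.05 cutoff instead of (mu0 + muA)/2,
# plotted against delta = muA - mu0 for normal sampling distributions.
mu0 <- 0; sigma <- sqrt(2)
zeta <- function(cutoff, muA) (1 - pnorm(cutoff, mu0, sigma)) + pnorm(cutoff, muA, sigma)

delta <- seq(0.1, 3, by = 0.1)
z05   <- qnorm(0.95, mean = mu0, sd = sigma)   # alpha = 0.05 critical value
xi    <- sapply(delta, function(d) zeta(z05, mu0 + d) - zeta(mu0 + d / 2, mu0 + d))

plot(delta, xi, type = "l", xlab = "muA - mu0", ylab = "xi")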


Fig. 4. Difference in ζ by μA − μ0, with fixed σ2 = 2

Fig. 5. Difference in ζ by σ2, with fixed μA − μ0 = 3

Since ζ is a sum of probabilities, it is bounded. Also, the raw difference may have different meanings in different scenarios, depending on the actual values of ζ. So, it makes more sense to look at the odds ratio ω1/ω2, where ω1 = ζ(z0.05)/(1 − ζ(z0.05)) and ω2 = ζ(0.5(μ0 + μA))/(1 − ζ(0.5(μ0 + μA))). Plotting φζ = ω1/ω2 against effect size (Cohen's d) yields Fig. 6, wherein it is notable that, even for small effect sizes, the odds of making an incorrect conclusion on a hypothesis test are 4–6 times larger if we set α = 0.05 rather than minimizing the Bi Error function. This phenomenon is explained in greater depth when we consider the actual algebraic representation of ξ. Define ζα to be the Bi Error for a standard z test


Fig. 6. Cohen's d vs. φ, with σ2 = 5, μ0 = 0, and μA ranging from 0 to 2

using the usual critical value xα, chosen to achieve some fixed Type I error α. Similarly, define ζ∗ to be the Bi Error for the same scenario, but using Theorem 1, which states that the minimum is found at x∗ = (μ0 + μA)/2. Then ξ is given by:

ξ = ζα − ζ∗
  = α + FA(xα) − [1 − F0(x∗) + FA(x∗)]
  = α + FA(xα) + F0(x∗) − FA(x∗)
  = α + [1 + 0.5 erf((xα − x0)/(σ√2))] + 0.5 erf((x∗ − μ0)/(√2σ)) − 0.5 erf((x∗ − μA)/(√2σ))
  = α + 1 + 0.5 erf((xα − μ0)/(2√2σ)) + 0.5 erf((μA − μ0)/(2√2σ)) − 0.5 erf((μ0 − μA)/(2√2σ)),

which can be rewritten in terms of Cohen's d, d = (μA − μ0)/σ, as:

ξ = α + 1 + (1/2) erf((xα − μ0)/(2√2σ)) + (1/2) erf(d/(2√2)) − (1/2) erf(−d/(2√2)).

Thus, since the error function erf is known to be monotonically increasing, the last two terms of ξ must be increasing in d. Of course, this naturally leaves the question of the term (1/2) erf((xα − μ0)/(2√2σ)), which appears to be decreasing in d, since xα is a monotonically increasing function of μ0. However, notice that since we have restricted ourselves to the single-tailed case, xα > μ0. So for any value of δ, (1/2) erf((xα − μ0)/(2√2σ)) < (1/2) erf((μ0 − μA)/(2√2σ)), and the last term essentially "absorbs" the one that is decreasing in d. Hence, for any small increase in d, ξ will also increase. Note that in the lower-tailed scenario, this would be increasing in −d.

4.3 F Distribution

A natural additional inquiry concerns the performance of our critical value when the test statistic, under the null and alternative hypotheses, follows some kind of F distribution. This is obviously critical for many results in regression analysis, ANOVA, and any test wherein we use the extra sum of squares principle. Clearly, since the F distribution is not symmetric, it does not satisfy the conditions of Theorem 1, so we cannot use the closed-form solution for the critical value derived therein. Rather, for the purposes of this study, we instead provide specific examples for a few different choices of numerator and denominator degrees of freedom and of the non-centrality parameter of the alternative sampling distribution. All of the minimizations were done numerically using the R function optim() [13]. First, we investigate the null sampling distribution f0 ∼ F10,10,0 and the alternative sampling distribution fA ∼ F10,10,10. Using the typical method of rejection, that is, α = .05, yields a staggering ζ = .78. That is, if the true non-centrality parameter is 10 and we have 10 numerator and denominator degrees of freedom, we have a 78% chance of coming to an incorrect conclusion! Dichotomizing using a critical value that minimizes ζ reduces this chance to ζ = .570, which is still objectively outrageous, but a dramatic improvement from the above. As can be expected, the probability of a Type I error here is substantially larger than .05, at α = 0.3. However, the probability of a Type II error is greatly reduced, from βα=.05 = .732 to βζ = .263, further illustrating the necessity of a redefined standard for a critical value in situations where Type I and Type II errors are equally egregious. Changing the numerator and denominator degrees of freedom to 2 and 30, as is more common in practice, makes things better for both methods. The standard method (α = .05) yields a Bi Error of 0.278 and a probability of Type II error β0.05 = 0.228. Using a critical value that minimizes the Bi Error reduces the Bi Error to ζ = 0.236, with a Type I error probability of αζ = 0.112 and a Type II error probability of βζ = 0.123. Notice in this case as well that, even without the subjective error analysis part of the proposed method, the critical value argminc ζ(c) is much more appropriate than the classic critical value corresponding to α = .05 for situations in which Type I and Type II error are equally egregious.
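A minimal R sketch of this kind of numerical search is shown below (an illustration under the stated degrees of freedom and non-centrality; it uses optimize() on a bounded interval, whereas the paper reports using optim()).

# Minimize the Bi Error zeta(c) = P(reject | H0) + P(fail to reject | HA)
# for a central-vs-noncentral F pair, and compare with the alpha = .05 cutoff.
df1 <- 10; df2 <- 10; ncp <- 10
zeta_F <- function(cutoff) (1 - pf(cutoff, df1, df2)) + pf(cutoff, df1, df2, ncp = ncp)

c_alpha <- qf(0.95, df1, df2)                          # classic alpha = .05 critical value
c_star  <- optimize(zeta_F, interval = c(0, 20))$minimum

c(zeta_at_alpha_cutoff = zeta_F(c_alpha), zeta_minimized = zeta_F(c_star))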

5 Conclusion

Misinterpretation of p-values and statistical significance testing remains a major problem in modern statistical practice. In order to eradicate the problem at its root, we have defined an accessible replacement for the existing, commonly-known 5-step hypothesis testing procedure. Our method possesses many desirable qualities, including but not limited to simplicity, no need for p-values, the potential for an inconclusive-results outcome, and quantitative incorporation of Type II error. Theoretical and empirical results indicate that this method can be useful in many cases.


Notes and Comments. This method is meant as a replacement for the 5-step method for null hypothesis significance testing. It is meant primarily for students and researchers unfamiliar with statistics; as such its scope is limited with regards to real-world applications. One aspect of this is evident in the fact that Theorem 1 does not apply to F-distributions or the case where the variances of the null and alternative distributions are unequal. Further research is warranted in these areas. Acknowledgment. We wish to thank Dr. Lynne Stokes and the two reviewers for their feedback, which substantially improved the paper. We additionally thank Dr. Dan Jeske for introducing us to the literature on hypothesis testing with neutral zones, as well as Michael Watts and Austin Marstaller for proofreading the manuscript.

Appendix A

Simulation Results

In order to assess the performance of each method, we calculated the proportion of the time H0 was rejected, for each value of n, μ, and μ̂A. Note that for μ = 0 this represents the Type I error, and for μ ≠ 0 this proportion represents statistical power. Re-running with a different seed yielded deviations in the third decimal place, so the Monte Carlo error on the results is ±.01. Consequently, results are reported to the second decimal place.

Table 1. Rejection rates using the Bi Error method, n = 10, given true location parameter (μ) and hypothesized alternative location parameter (μ̂A)

μ̂A ↓   μ = 0   0.3    0.5    0.7    1
0.3     0.44    0.79   0.92   0.98   1.00
0.5     0.40    0.76   0.91   0.98   1.00
0.7     0.36    0.73   0.89   0.97   1.00
1       0.32    0.67   0.86   0.96   1.00
1.5     0.23    0.58   0.80   0.92   0.99
2       0.17    0.49   0.73   0.89   0.98
2.5     0.12    0.39   0.64   0.84   0.97
3       0.08    0.32   0.55   0.76   0.95
3.5     0.06    0.24   0.45   0.68   0.90

In the Bi Error method tables, the leftmost (vertical) column lists the hypothesized alternative mean μ̂A, and the remaining columns correspond to the true location parameter μ. All tables were generated using the R package xtable [18].


Table 2. Rejection rates using the NHST method with α = .05, n = 10, given true alternative parameter

μ (Actual)   0      0.3    0.5    0.7    1
             0.05   0.22   0.43   0.65   0.90

Table 3. Rejection rates using the Bi Error method, n = 20, given true location parameter (μ) and hypothesized alternative location parameter (μ̂A)

μ̂A ↓   μ = 0   0.3    0.5    0.7    1
0.3     0.44    0.89   0.98   1.00   1.00
0.5     0.40    0.86   0.98   1.00   1.00
0.7     0.36    0.84   0.97   1.00   1.00
1       0.31    0.81   0.96   1.00   1.00
1.5     0.22    0.72   0.93   0.99   1.00
2       0.17    0.63   0.89   0.98   1.00
2.5     0.11    0.54   0.84   0.97   1.00
3       0.07    0.45   0.77   0.94   1.00
3.5     0.05    0.35   0.69   0.91   1.00

Table 4. Rejection rates using the NHST method with α = .05, n = 20, given true alternative parameter

μ (Actual)   0      0.3    0.5    0.7    1
             0.05   0.36   0.70   0.92   1.00

Table 5. Rejection rates using the Bi Error method, n = 30, given true location parameter (μ) and hypothesized alternative location parameter (μ̂A)

μ̂A ↓   μ = 0   0.3    0.5    0.7    1
0.3     0.45    0.93   1.00   1.00   1.00
0.5     0.40    0.92   0.99   1.00   1.00
0.7     0.37    0.90   0.99   1.00   1.00
1       0.31    0.87   0.99   1.00   1.00
1.5     0.23    0.82   0.98   1.00   1.00
2       0.16    0.74   0.96   1.00   1.00
2.5     0.11    0.66   0.93   0.99   1.00
3       0.07    0.56   0.89   0.99   1.00
3.5     0.04    0.46   0.84   0.98   1.00

Table 6. Rejection rates using the NHST method with α = .05, n = 30, given true alternative parameter

μ (Actual)   0      0.3    0.5    0.7    1
             0.05   0.48   0.85   0.98   1.00

Table 7. Rejection rates using Bleile's critical value, n = 50, given true location parameter (μ) and hypothesized alternative location parameter (μ̂A)

μ̂A ↓   μ = 0   0.3    0.5    0.7    1
0.3     0.45    0.98   1.00   1.00   1.00
0.5     0.41    0.97   1.00   1.00   1.00
0.7     0.37    0.96   1.00   1.00   1.00
1       0.32    0.95   1.00   1.00   1.00
1.5     0.23    0.92   1.00   1.00   1.00
2       0.17    0.87   1.00   1.00   1.00
2.5     0.11    0.81   0.99   1.00   1.00
3       0.07    0.74   0.98   1.00   1.00
3.5     0.04    0.65   0.96   1.00   1.00

Table 8. Rejection rates using the NHST method with α = .05, n = 50, given true alternative parameter

μ (Actual)   0      0.3    0.5    0.7    1
             0.05   0.67   0.97   1.00   1.00

References
1. Beigel, J., et al.: Remdesivir for the treatment of COVID-19. New England J. Med. 383, 1813–1826 (2020)
2. Cassidy, S.A., Dimova, R., Giguère, B., Spence, J.R., Stanley, D.J.: Failing grade: 89% of introduction-to-psychology textbooks that define or explain statistical significance do so incorrectly. Adv. Methods Pract. Psychol. Sci. 2(3), 233–239 (2019)
3. Fluss, R., Faraggi, D., Reiser, B.: Estimation of the Youden index and its associated cutoff point. Biom. J. 47(4), 129–133 (2005)
4. Harrell, F.: Statistical errors in the medical literature. https://www.fharrell.com/post/errmed/
5. Ioannidis, J.: What have we (not) learnt from millions of scientific papers with p-values? Am. Stat. 73(Sup.1), 20–25 (2019)
6. Jeske, D.: Special collection on p-values. Am. Stat. 73(Sup.1), 1–401 (2019)


7. Jeske, D.: Alternatives to the traditional p-value. National Institute of Statistical Science, Webinar (2019)
8. Jeske, D.: Digging deeper into p-values: webinar follow-up with three authors. National Institute of Statistical Science, Webinar (2019)
9. Jeske, D., Smith, S.: Maximizing the usefulness of statistical classifiers for two populations with illustrative applications. Stat. Methods Med. Res. 27, 1–15 (2016)
10. Li, C.: Partial Youden index and cut point selection. J. Biopharm. Stat. 20(5), 1520–5711 (2018)
11. Liu, X.: Classification accuracy and cut point selection. Stat. Med. 31, 2676–2686 (2011)
12. Norrie, J.: Remdesivir for COVID-19: challenges of underpowered studies. Lancet 395, 1569 (2020)
13. R Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing (2013)
14. Senn, S.: Dichotomania: an obsessive-compulsive disorder that is badly affecting the quality of analysis in pharmaceutical trials. In: Proceedings of the International Statistical Institute, ISI Sydney, pp. 1–15 (2005)
15. Schisterman, E., Perkins, N.: Partial Youden index and cut point selection. Commun. Stat. 36(3), 549–563 (2007)
16. Wasserstein, R., Lazar, N.: The ASA statement on p-values: context, process, and purpose. Am. Stat. 70(2), 129–133 (2016)
17. Wasserstein, R., Schirm, A., Lazar, N.: Moving to a world beyond "p < 0.05". Am. Stat. 73(Sup.1), 1–19 (2019)

… 0, i = 1, ..., 5. Then the protocol does 5 computations and 5 updates per block:
Y = x_1 + x_2 + x_3
Z = x_4 + x_5
Deduct (x_1/Y)*Z from B_1
Deduct (x_2/Y)*Z from B_2
Deduct (x_3/Y)*Z from B_3
Settle x_4 for B_4
Settle x_5 for B_5.
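For illustration, here is a small R sketch of the per-block settlement computation quoted above; the amounts x_i and starting balances B_i are hypothetical, and treating "settle x_i for B_i" as simply recording the amount to be paid out is an assumption.

# Hypothetical per-block settlement: net x_4 + x_5 against the payers B_1..B_3
# in proportion to their amounts x_1..x_3, then settle x_4 and x_5.
x <- c(10, 20, 30, 15, 25)        # x_1..x_5 (invented amounts)
B <- c(100, 100, 100, 100, 100)   # balances B_1..B_5 (invented)

Y <- x[1] + x[2] + x[3]
Z <- x[4] + x[5]

B[1:3] <- B[1:3] - (x[1:3] / Y) * Z   # deduct (x_i / Y) * Z from B_1..B_3
settle  <- c(B4 = x[4], B5 = x[5])    # amounts settled for B_4 and B_5
list(updated_balances = B, settled = settle)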

4 Governance on the Network

Our proposed Blockchain is neither permissionless nor permissioned. In fact, no central authority exists on the network. Instead, a nomination mechanism (see Fig. 4) is introduced to ensure that only qualified candidates have the opportunity to join the group of network operators, while preventing both the corruption that a single centralized authority could cause and potential malicious actors.

4.1 The Conditions to Become a Node

The Necessary Condition. To become an operating node, a candidate must stake a required minimum amount of native coins to activate (or register) its public-private key pair with the network. A higher minimum stake is required for the master node role.
The Sufficient Condition. An entity wishing to become a network operator (i.e., a master or normal node) must first complete registration, and then nomination. Explicitly, a candidate for a normal node requires at least two nominations from normal nodes or one nomination from a master node. A candidate for a master node requires at least two master-node nominators. One may add the condition that the two nominators must together possess a minimum percentage (e.g., 10% or 2/N, whichever is smaller) of the total reputation scores of all nodes, where N is the number of nodes. See reputation score in Sect. 2. Of course, at the genesis block no nomination happens, and the foundation nodes are all legitimately promoted by the developer. Note that if the developer is not a legally licensed bank, it may not be a master node, because the other foundation banks do not allow it to build up such a network.
Node Deactivation. An active node can decide to leave the network itself, simply by informing the other nodes of its departure. Alternatively, operating nodes can vote to remove a node if it is dishonest, with proven evidence of damage to the network, or if it is no longer qualified under the assessment of the majority. Votes are weighted by reputation scores (see Sect. 2), and at least 68% of the weighted vote of the network is needed to deactivate a node. This mechanism helps remove proven malicious nodes from the network and promotes high-quality nomination, honest commitment and frequent activity.
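The nomination rule above can be summarized in a short R sketch (hypothetical data structures; the optional reputation condition uses min(10%, 2/N) as described).

# Check whether a candidate satisfies the nomination (sufficient) condition.
is_nominated <- function(candidate_role, nominator_roles, nominator_rep, total_rep, N) {
  n_master <- sum(nominator_roles == "master")
  n_normal <- sum(nominator_roles == "normal")
  if (candidate_role == "master") {
    role_ok <- n_master >= 2
  } else {
    role_ok <- n_normal >= 2 || n_master >= 1
  }
  rep_ok <- sum(nominator_rep) >= min(0.10, 2 / N) * total_rep  # optional extra condition
  role_ok && rep_ok
}

is_nominated("normal", c("master"), nominator_rep = 0.12, total_rep = 1, N = 12)  # TRUE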

4.2 Reputation Score System

Reputation is very important in social reality. It is based on two major factors: wealth and performance (or achievement). Thuat Do et al. investigated reputation in the Blockchain space and introduced a novel framework for consensus, namely Delegated Proof of Reputation; see [18]. Therein, ranking (inspired by Google's PageRank) is an essential component contributing to reputation, applied not only to nodes but to all accounts on the network. It exploits the idea that a node's ranking is built up over time based on its cooperation (connections) and work achievement. Basically, more valuable transactions and more connections yield a higher ranking. This paper does not study the ranking algorithms. The author only takes the idea that ranking is helpful to reduce the number of faulty nodes and malicious actions on the network while promoting integrity and quality contributions, and hence improves overall security and reliability on the network. Removing the resource power suggested by EOS [5] and [18], the reputation engine in this paper computes staking and ranking factors to return normalized


reputation scores. It is assumed that all qualified nodes have abundant resources to solve the tasks on the Blockchain. The reputation scores are used for voting and for choosing block generators, and are formulated as

Rep = μS + (1 − μ)R,  (2)

where Rep is the reputation score, S is the normalized staking index and R is the normalized ranking score. Readers may refer to [18] for more detail on the reputation and ranking framework, its computation and an analysis of its advantages. The parameter μ in Eq. (2) is the control multiplier balancing the staking and ranking components. In the early stage of the Blockchain, the numbers of connections and transactions are small (i.e., insignificant), thus μ should be large and then gradually decrease. One can set μ = max{2^(−h/K), θ}, where 0 < θ ≤ 0.5 is constant, K is a positive integer and h is the block height. That means μ = 1 at the genesis block, monotonically decreasing to 0.5 at the K-th block, and μ = θ whenever 2^(−h/K) ≤ θ.
Reputation Penalty. If a node is kicked off the network (see Sect. 4.1), then its nominators' reputation scores will be considered zero for the following 30 days, even though they have positive values. Other punishments on reputation scores can be voted on and decided by the majority, then applied where appropriate.
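A minimal R sketch of Eq. (2) with the decaying multiplier is given below; the values of K, θ and the example inputs are assumptions for illustration only.

# Reputation score Rep = mu*S + (1 - mu)*R with mu = max(2^(-h/K), theta).
rep_score <- function(S, R, h, K = 100000, theta = 0.2) {
  mu <- max(2^(-h / K), theta)   # mu = 1 at the genesis block, 0.5 at block K, floor at theta
  mu * S + (1 - mu) * R
}

rep_score(S = 0.6, R = 0.9, h = 0)      # early chain: staking dominates -> 0.6
rep_score(S = 0.6, R = 0.9, h = 1e6)    # mature chain: mu = theta       -> 0.84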

Fig. 4. Governance and consensus.

5 Block Production and Consensus Mechanism

A number of consensus protocols have been proposed and implemented in various Blockchain networks. The most popular ones (in order) are Proof of Work (PoW), Proof of Stake (PoS) and Delegated Proof of Stake (DPoS). In this part, the author exploits Delegated Proof of Reputation (DPoR), introduced in [18], with a modification (i.e., removing resource power from the overall reputation). Accounts on the network are allowed to grant their reputation scores to operating nodes (this is similar to the voting procedure in Tron [4] and EOS [5]). A node's reputation score (including granted quantities) is converted into its probability of being selected as a transaction validator, block generator, reputation


score provider, and random source generator. A higher reputation score implies a greater probability, and hence more rewards and fees earned. Assume that the groups of master nodes and normal nodes are {M1, ..., Mp} and {N1, ..., Nq}, associated with reputation scores {Rep^M_1, ..., Rep^M_p} and {Rep^N_1, ..., Rep^N_q}, respectively, where p, q are positive integers. Then the normalized probability outputs are

Prob^M_i = Rep^M_i / Σ_{j=1}^{p} Rep^M_j,    Prob^N_i = Rep^N_i / Σ_{j=1}^{q} Rep^N_j.  (3)
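The following R sketch illustrates Eq. (3) and a reputation-weighted selection of block generators; the reputation values and the use of a fixed seed in place of the on-chain random source are assumptions for the example.

# Normalized selection probabilities (Eq. 3) and weighted ordering of master nodes.
rep_master  <- c(M1 = 5, M2 = 3, M3 = 2)        # Rep^M_i (invented values)
prob_master <- rep_master / sum(rep_master)     # Prob^M_i

set.seed(42)                                    # stand-in for the epoch's random source
generators <- sample(names(rep_master), size = length(rep_master),
                     replace = FALSE, prob = prob_master)
prob_master   # 0.5, 0.3, 0.2
generators    # an ordered group of block generators for the epoch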

The block production process is divided into epochs and gaps. Each epoch contains a fixed number of blocks. Between two consecutive epochs, there is a short gap for concluding the reputation score update and re-generating the random sources. Based on the computed probabilities, the random sources determine:
– a reputation score result provider for the next epoch (or the next K epochs);
– a random source generator for the next epoch;
– an ordered group of transaction validators among normal nodes;
– an ordered group of block generators among master nodes;
– an ordered group of gap-block generators (among all nodes) for the next gap.

To finish a certain gap, a special block (gap-block) is generated to record this information. A compensation (paid in native coins by the other nodes or from the coinbase reward) is given to the providers and the gap-block generator. The gap block (containing only the new reputation scores and random sources) is valid and confirmed if at least 51% of the total reputation scores (including the granted quantities) of all nodes signed "agree". In addition, the gap block refers to the previous one, making it a check-point for reference or recovery if necessary. Basically, a block generator, in its turn, will choose transactions validated by legitimate validators to package into a block. After that, it broadcasts the block randomly to other master nodes for finalization via a practical Byzantine Fault Tolerance (BFT) algorithm. There are many practical BFT algorithms available; see [10, 11]. An open-source BFT implementation is Tendermint Core [14], which has been deployed on Binance Chain [16]. Normally, BFT requires at least f + 1 replicas, 2f + 1 nodes and O(n^2 f) communication complexity to ensure an error-free process and a fault-tolerant system, given f faulty nodes. Thanks to the nomination mechanism, it is expected that f is reduced remarkably, which speeds up the communication and finalization process. BFT allows secure and instant finality on blocks and transactions, but its broadcasting process is complex and takes a long time. If real-time finality is not a strict requirement, the author suggests replacing the BFT broadcast with the longest-chain rule (as applied by Bitcoin, Ethereum, and Tron). In fact, a transaction on the Tron Network can be considered final (or immutable) after 20 block confirmations (equivalently 60 s), which is not a long wait. The Algorand [15] Blockchain uses a Byzantine Agreement protocol and a verifiable random function in its so-called pure PoS consensus system. Such a random function is a useful, practical mechanism to deploy on other PoS-based Blockchains, as random sources play a critical role in selecting block generators and transaction validators fairly and honestly.

6 Differentiation and Advantage Offering

Our proposed Blockchain framework differs from all the existing public, private and consortium Blockchains in several beneficial ways. The creative and flexible two-tier architecture offers high adaptivity and compatibility with various banking and IT systems, while allowing nodes to attach their private chains easily. Some public Blockchains (e.g., EOS, Tron) divide network participants into super nodes (block generators) and standby nodes (doing nothing). On our proposed Blockchain, every node has its own role and tasks. The novel block data and heterogeneous distribution make the framework friendlier to privacy and confidentiality while fully ensuring the necessary tracing and tracking. This improves compliance and adoption among regulators and banking institutions. The novel nomination mechanism guarantees stable network expansion with a qualifying entrance, and hence enhances reliability among nodes. The reputation system offers better decentralization (via the granting of reputation scores) than the pure staking and voting mechanisms of Tron and EOS, while promoting honest and active work by nodes and application developers to gain reputation (see more rationale in [18]). Note that PoW (resp. PoS) Blockchains have been facing centralization in giant mining pools (resp. staking concentration among capital whales). The advantages of the Blockchain cloud banking can be easily pointed out, both on the technology side and from the application perspective.
Better Security, Scalability and Decentralization. The nomination entrance reduces potential faulty nodes and hence improves overall security. The two-tier architecture, with the capability of connecting private chains, allows the network to scale greatly. The reputation system promotes decentralization of the network and builds trust on the network.
Fairness for Developers. Application developers are important value contributors to any Blockchain network. Unfortunately, they do not have any rights in the governance of existing Blockchains, although their billion-dollar businesses run on top of them. Our Blockchain framework gives developers a chance to join the network operators based on their reputation.
Regulation and Adoption Friendliness. Banks are more willing to join the network because the master node design precisely reflects their special role, rights and responsibilities in the banking and financial sector. The KYC problem, privacy and confidentiality are all resolved by the identification protocol (see Sect. 3.4) and the block data distribution (see Fig. 3). As a consequence, the framework has a high capability of compliance with regulatory environments in different nations, and can satisfy real business practices.


Creative Banking Infrastructure Design. This paper introduces the first framework for a Blockchain cloud banking which is universal and has many advantages compared to conventional models. Moreover, by offering open APIs, the cloud will attract many fintech firms involved in the innovation of the financial industry. The Blockchain not only provides a secure and scalable cloud infrastructure for digital payment and banking services but also offers a feasible solution for expanding financial inclusion and coverage, especially to unbanked people, via eKYC and a digital banking model (see Fig. 5).

Fig. 5. Centralized vs decentralized payment applications.

7 Applications, Assessment and Conclusion

Worldwide experts have recognized that Blockchain has many potential applications in all areas of the banking and financial sector. Regarding the applications of the introduced Blockchain banking cloud, it is applicable in a wide range of financial disciplines: accounting & audit, asset tokenization, auto-invoicing, auto-governance, clearing & settlement, credit info sharing, electronic payment, microfinance, micropayment, peer-to-peer money transfer, global money remittance, letters of credit, smart-contract-based transactions, and syndicated lending. These perspectives are clearly indicated in many studies [2, 3, 6, 8, 12]. On November 11, 2020, all major media channels in the crypto-currency space reported the Ethereum split issue that happened at block 11234873. A different chain was observed, held by a minority of the network miners whose old Geth version (an Ethereum Blockchain client software) contained a dormant bug that had been detected and reported two years earlier. The issue affected deposit and withdrawal activities on Binance and many other crypto-currency exchanges. Although the bug was fixed and the whole network updated to the main chain held by the majority, the event raised a big concern about giant centralized client servers. Moreover, the issue revealed that large-network Blockchains can possibly face the same failure due to latency and obsolescence among various groups of block keepers. Together with mining power concentration, people question the decentralization and safety of the


Table 1. Challenges & research opportunities in the design of blockchain protocols [13]

Persistance
  PREStO framework: Weak/strong persistence; Majority attacks; Recovery mechanisms; Governance & sustainability
  Design of blockchain: 51% attack defender, large network; Blockchain-Trilemma; Design of sustainability; Decision of governance schemes
Robustness
  PREStO framework: Fault Tolerance; Out-of-Protocol Incentives; Resilience to attacks
  Design of blockchain: Protection against collusion; Well elaboration with adversaries
Efficiency
  PREStO framework: Positive scale effects; Throughput rate; Economy of resources; Benchmarking to centralized models
  Design of blockchain: Scalable design; Energy saving; Compare to conventional solutions
Stability
  PREStO framework: Incentive compatibility (participation, operations, applications); Decentralization (entry barriers, distribution of resources); Fairness (reward allocation, voting-decision making)
  Design of blockchain: Incentive mechanisms; Protection against adversaries; Decentralization motivation, fair distribution of resources
Optimality
  PREStO framework: Liveness; Safety; Scope; Privacy features (public/private, permissioned/-less)
  Design of blockchain: Architecture & Sybil protection; Safety vs liveness trade-off; Smartcontract execution

Ethereum blockchain (see the definition of safety in [13]). Despite limited decentralization, Tron and our Blockchain framework offer many advantages in efficiency, stability and optimality while being robust and persistent against various attack and collusion schemes. In particular, they provide an easy recovery possibility, which is very hard on Bitcoin and Ethereum (see the completed split and rollback of the Ethereum Blockchain after the DAO attack). Let us abbreviate our proposed Blockchain Banking Cloud as BBC. According to the systematic framework (PREStO) introduced by S. Leonardos et al. [13],


we give a brief summary (see Table 1) and an assessment and comparison among Ethereum, Tron and BBC in Table 2. To give more rationale for the assessment in Table 2: our Blockchain model utilizes a reputation system, hence offering higher decentralization of network resource distribution and fairness to application developers (who are also value contributors on the Blockchain). In addition, through the nomination and voting mechanism, our Blockchain has sustainable governance and network expansion. In conclusion, the paper has introduced a novel framework for a Blockchain banking cloud as a fundamental infrastructure for an open API platform in the banking sector, thereby promoting innovation in payment and financial applications. The proposed Blockchain is implementable, with clearly described architecture, governance, consensus, and the necessary techniques and protocols. We present an assessment and potential applications with respect to published evaluation frameworks. Although our tentative design leads to a domestic peer-to-peer banking system, the Blockchain model can be used for international bank settlement and cross-border escrow (money remittance) networks, wherein a secure, shared, reliable, synchronous ledger and transaction processing system among participants is necessary and useful. In further work, the author shall provide more quantitative analysis of the Blockchain banking cloud, especially of the reputation system, with experimental results from existing public Blockchain transaction data.

Table 2. Assessment among Ethereum, Tron and BBC based on PREStO.

PREStO framework

Ethereum

Tron

BBC

Persistance - Weak/strong - Recovery mechanisms - Governance & sustainability

High Strong Hard Majority follow

High Medium Easy Majority voting

Robustness

High

High

Efficiency - Scalability - Throughput - Energy - Benchmarking

Low Extremely bad Very low Consuming Extremely low

High High Good Good High High Saving Medium

High

Stability - Incentive - Decentralization - Fairness

Medium Fairly Highest Medium

Medium Fairly Medium Medium

Optimality - Liveness - Safety - Contract execution - Privacy

Medium High Medium Slow, complex Public & permissionless

High High High High High High Fast & less complex Public & semi-permissionless

High Good High High


References
1. Central Banks and Distributed Ledger Technology: How are central banks exploring blockchain today? Insight report, World Economic Forum, March 2019. bit.ly/3aGt23U
2. Central Bank Digital Currencies: foundational principles and core features. Bank for International Settlements, October 2020. https://www.bis.org/publ/othp33.htm
3. Central Bank Digital Currency: Policy-maker toolkit. Insight report, World Economic Forum, January 2020. bit.ly/3mHDg69
4. DPoS consensus mechanism. bit.ly/3aGtb7s
5. EOS Consensus Protocol. bit.ly/37M3tws
6. Forecast: Blockchain Business Value, Worldwide, 2017–2030, Gartner's Research. gtnr.it/3pyccZd
7. Hedera Hashgraph. https://docs.hedera.com/guides/
8. How Blockchain Could Disrupt Banking (2018). bit.ly/3aFq13y
9. ICO fundraising. bit.ly/2KVcWrX
10. Castro, M., Liskov, B.: Practical Byzantine fault tolerance. In: OSDI 1999: Proceedings of the Third Symposium on Operating Systems Design and Implementation, pp. 173–186 (1999)
11. Castro, M., Liskov, B.: Practical Byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst. (2002). bit.ly/3aFijqk
12. Allen, S., Capkun, S., Eyal, I., et al.: Design choices for central bank digital currency: policy and technical considerations. Global Economy & Development at BROOKINGS, July 2020. is.gd/DO9Oq7
13. Leonardos, S., Reijsbergen, D., Piliouras, G.: PREStO: a systematic framework for blockchain consensus protocols. IEEE Trans. Eng. Manag. 67(4), 1028–1044 (2020). https://doi.org/10.1109/TEM.2020.2981286
14. Tendermint Core. https://github.com/tendermint/tendermint
15. The Algorand Consensus Protocol. bit.ly/3nZGuDP
16. The Binance Chain Blockchain. https://docs.binance.org/blockchain.html
17. The Tangle. t.ly/sT6T
18. Do, T., Nguyen, T., Pham, H.: Delegated proof of reputation: a novel blockchain consensus. In: IECC 2019: Proceedings of the 2019 International Electronics Communication Conference, pp. 90–98, Japan (2019). bit.ly/2KQ72IM
19. Boring, P., Kaufman, M.: Blockchain: the breakthrough technology of the decade and how China is leading the way – an industry white paper. Chamber of Digital Commerce and Marc Kaufman, Partner, Rimon Law, February 2020. bit.ly/3rsgPFJ

A Robust and Efficient Micropayment Infrastructure Using Blockchain for e-Commerce

Soumaya Bel Hadj Youssef and Noureddine Boudriga
School of Communication Engineering, University of Carthage, Tunis, Tunisia

Abstract. The rapid evolution of Internet and e-commerce technology in recent years has led to the rapid growth of online shopping, which has become a central part of our daily life. Micropayment systems are emerging rapidly on e-commerce sites despite their lack of security and robustness. In this paper, we provide a micropayment infrastructure that uses the blockchain technology for securing and verifying the payment transactions performed between buyers and sellers. Moreover, our proposed infrastructure provides ways for managing the trust level of the buyer based on his historical purchases, controlling the size of the block of tokens that will be uploaded to the blockchain network for later validation, and estimating the risk of loss to be handled. Furthermore, we propose three trust models that compute the buyer's trust value. The behavior of the buyer has an impact on the block size and the risk of loss. Our micropayment infrastructure ensures payment transaction authentication, prevention of double-spending and double-selling, prevention of token forging, tracing of payment transactions, protection and prevention against cyber attacks, and reduction of the transaction verification delay and the waiting time of the buyer. Finally, we conduct a simulation to evaluate the performance of our three trust models and show how the behavior of the buyer impacts the decision made by the auditing entity.

Keywords: Micropayment · Blockchain technology · Payment security · Trust · Risk management · e-Commerce

1 Introduction

In recent years, micropayments have become increasingly important with the increasing need for inexpensive and simple payment methods and the rapid development of e-commerce, in which buyers and sellers trade with each other on the Internet. A micropayment [11] can be defined as a payment of a small amount of money which is made electronically. It can be considered as an online financial transaction that involves a very small amount of money. The most common use case for micropayments is paying for online content such as online papers, archives, music, videos, games, software downloads, tickets, stamps, mobile applications,


subscription services, e-commerce stores, and others. Micropayment provides many advantages for customers (such as speed, versatility, convenience, and flexibility) and for merchants (e.g., speed and acceptable transaction fees). In the literature, many micropayment systems have been proposed. However, these micropayment systems still present several challenges, such as security and scalability. In this context, the Blockchain technology, which is considered one of the greatest technological advances, can be a promising solution for micropayment systems due to its characteristics, such as distributed consensus, cryptographic transactions, anonymity, and full transparency. A blockchain can be defined as a chain of online transactions saved as a shared ledger across numerous computers on a peer-to-peer network. Recently, the blockchain has emerged as a fundamental technology and has been applied in many areas [2, 15], including data verification, integrity verification, data management, privacy and security, healthcare record storing, and education. Moreover, the blockchain has emerged in trade finance, insurance, supply chain management, product tracing, e-commerce [6, 14], and marketing [7]. In that sense, adopting the blockchain technology for micropayment systems will bring many benefits in e-commerce thanks to the low cost of transactions and its preventive mechanisms against certain kinds of fraud and online malicious activity [13]. However, this technology does not eliminate all attacks and suffers from longer transaction confirmation times. For instance, a Bitcoin transaction requires six confirmations from miners before it is processed, and the average time needed for mining a block is ten minutes. In the literature, many research works have been proposed on combining the blockchain technology and micropayments. For instance, in [8], the authors introduced a cost-saving approach which significantly reduces transaction time and storage for micropayments. The authors in [4] conducted a comparison between Stellar, Lightning Network, Raiden, and PayPal in order to evaluate the feasibility of Stellar. Furthermore, they analyzed a subset of transaction records in Stellar to enable fraud prevention in online monetary transactions. The undertaken feasibility study of Stellar indicates that the blockchain platform is viable to function as a micropayment system. In [5], the authors introduced RandpayUTXO, which requires the payee's signature to be published in the blockchain. Moreover, the authors implemented Blum's 'coin flipping by telephone' protocol to design a 'lottery ticket' that does not require any third party to facilitate the lottery. The authors in [9] carried out a benchmark performance analysis between the Lightning Network and other similar solutions. They integrated the LN off-chain technology within an existing IoT ecosystem. Moreover, they designed a novel algorithm for payment channel fee reduction. However, the above proposed micropayment systems still lack robustness and trustworthiness. Indeed, these works did not build a mechanism for the trust of the buyers, which can help to detect misbehaving buyers. In addition, these works did not consider the evaluation of the risk of loss.


In this context, it is necessary to integrate an auditing entity which computes the customer's trust level, a key factor in e-commerce. A lot of research work has been done on trust management in e-commerce [10], wireless sensor networks [12], and ad-hoc networks [1, 3], among others. In this paper, we propose a robust and efficient micropayment infrastructure using the blockchain technology and an auditing entity. Our infrastructure is able to reduce the delay of payment transaction verification, manage the size of the block of tokens, and reduce the risk of loss that affects the auditing entity. The contributions of our work are as follows. Firstly, we describe the architecture of our micropayment infrastructure by presenting the functions of its actors. Our infrastructure relies on the blockchain network and on an auditing entity which is responsible for building the blocks of tokens with a determined size. Secondly, we provide three trust models for the computation of the buyer's trust value. This will help the auditing entity to determine the future value of the buyer's trust and then estimate the risk of loss associated with the behavior of the buyer. Finally, the validation of our micropayment infrastructure is provided. The remainder of the paper is structured as follows. Section 2 presents the requirements that must be fulfilled to achieve a robust micropayment infrastructure. In Sect. 3, we present the actors of our micropayment infrastructure by detailing their functions, and we present the different interactions between them. In Sect. 4, we present our proposed trust models for the buyer. In Sect. 5, we present the decision making of the auditing entity. In Sect. 6, the validation of our proposed micropayment infrastructure is provided by proving its robustness against attacks, and the results of the simulation are presented. Finally, we conclude the paper.

2 Requirements for a Robust Micropayment Infrastructure

To ensure the robustness of our micropayment infrastructure, several requirements should be satisfied. Among the important requirements, we mention the following.
Token Aggregation: Aggregating tokens (i.e., small payments) into blocks before transmitting them to the blockchain network is crucial. Verifying a block of tokens by the nodes of the blockchain is more efficient than verifying each token alone, so the verification delay is reduced. Moreover, the size of the block should be well chosen in order to avoid loss.
Prevention of Double-Spending: It is necessary that our micropayment infrastructure be able to detect double-spending attacks. A token cannot be used twice by the buyer (i.e., it cannot be spent by the buyer more than once to pay the seller). The buyer should use each token only once to pay the seller. This can be prevented by including the token identity and the buyer identity in the token.


Prevention from Double-Selling: Our micropayment infrastructure must also be protected against the double-selling attack, which can be performed by the seller. Prevention from this attack can be achieved by including in each token the identities of the buyer and the seller and information about the product.

Prevention from Tokens Forging Attack: Protecting our micropayment infrastructure from the token forging attack is crucial. This attack can be carried out by generating false tokens during the transfer of tokens to the seller, during the construction of blocks, or during the transmission of blocks to the blockchain network. Hence, it must be ensured that tokens have not been altered or manipulated by unauthorized parties. For this, it is crucial to verify and validate each token before proceeding with the payment.

Authentication of Payment Transaction: It is very important to ensure the authenticity of the transaction, including the authenticity of the actors and of the tokens included in the transaction. Each receiver of a token or transaction should authenticate the origin of this token or transaction. Accordingly, including the identities of the actors and signing the tokens and transactions circulated between actors is crucial.

Payment Transaction Tracing: It is necessary to trace the transactions and store the payment histories. Using the blockchain network can be an efficient solution. Moreover, the inclusion of timestamps on every interaction between the actors is very important. Accordingly, transaction tracing can be done by adding the identities of the actors, information about the product, and timestamps.

Buyers' Trust Management: In order to reduce the risk of loss, it is very important to control the trust level of the buyer. This can be done by including a trusted entity in our micropayment infrastructure for managing the trust level of the buyers. In fact, the decision made by this entity depends on the behavior of the buyer.

3 Micropayment Infrastructure Based on Blockchain

In this section, we present the architecture of our robust micropayment infrastructure. We first present the actors and their roles. Then, we describe the interactions between these actors, as depicted in Fig. 1.

3.1 Description of the Actors

The main actors of our infrastructure are as follows:

The Buyers: A buyer is a person who purchases goods and services from the sellers using micropayment tokens. The buyer has an account at the financial entity, and the buyer's global amount is divided into a large number of micropayments. The buyer receives tokens from the financial entity and sends them to the sellers.


The Sellers: A seller is a person who sells goods and services to the buyers. The seller is in charge of constructing a payment transaction by including the received tokens related to a given product or service, its identity, the identity of the buyer, and information about the purchase. It is also responsible for sending the transaction to the auditing entity.

The Auditing Entity: It is a trusted third party whose main role consists in controlling the tokens included in the transactions received from the sellers. The tokens are grouped by buyer: the tokens corresponding to the same buyer are grouped together in blocks. The auditing entity is also responsible for managing the buyer's trust and hence for determining the sizes of the blocks of tokens. After forming blocks, it submits them to the blockchain network.

The Financial Entity: It has three main functions: a) generating and sending the tokens to the buyers; b) paying the auditing entity after validation of the tokens; and c) providing the blockchain network with information about the generated tokens.

The Blockchain Network: It is a peer-to-peer network of computers (i.e., miners). Their main functions consist in registering, verifying, and managing the transactions received from the auditing entity.

3.2 Interactions Between Actors

This section presents the different interactions between the actors (as depicted in Fig. 1) during token and transaction generation, block formation, block size determination, and token payment.

– Interaction between the financial entity and the buyer. Assuming that the buyer has an account at the financial entity, the global cost is subdivided into a set of tokens having a fixed value v. When the buyer wants to buy a good or a service, it sends a request to the financial entity in order to receive micropayment tokens. Upon receiving the request, the financial entity generates tokens containing the following information: a unique identifier of the token id_t, the token value v, the certificate of the financial entity cert_{FE}, the timestamp of the financial entity time_{FE}, and the signature of the financial entity. Hence, the generated token t_{FE} (message 1 in Fig. 1) is defined as:

t_{FE} = {cert_{FE}, id_t, v, time_{FE}, sig_{FE}(id_t, v, time_{FE})}  (1)

This token is then transmitted to the buyer. Upon receiving the token from the financial entity, the buyer performs a verification process: it verifies the authenticity of the token, the financial entity's certificate, and the timestamp included in the token. It is worth noting that this verification is performed by every receiver of a token. The authenticity of the token is ensured by verifying the digital signature. The sender's certificate is verified by sending a request to the certificate authority. The timestamp included in the token is verified by comparing it with the local clock of the token's receiver.

Fig. 1. Micropayment infrastructure

– Interaction between the buyer and the seller. When the buyer wants to buy a service from a given seller, it starts a transaction with it. After verification by the seller, the built transaction Tr contains the following information: the certificate of the buyer Cert_{buyer}, the certificate of the seller Cert_{seller}, a set of tokens Ω needed to cover the purchase, information about the purchase inf_{purch}, the creation time of the transaction time_{Tr}, and the re-signature of the seller over the signature of the buyer:

Tr = {Cert_{buyer}, Cert_{seller}, Ω, inf_{purch}, time_{Tr}, sig_{seller}(sig_{buyer}(·))}  (2)

This transaction (message 2) is sent to the auditing entity.

– Interaction between the seller and the auditing entity. Upon receiving the transaction from the seller, the auditing entity first verifies the authenticity, the certificate, and the timestamp. Second, it extracts the tokens included in Ω and inserts them into the block of the corresponding buyer. It then checks whether the block has reached its target size; indeed, the number of tokens assigned to the buyer changes over time according to the buyer's profile (i.e., trust value). Two cases are possible: a) when the block reaches its size, the auditing entity sends the block to the blockchain network (message 5) for verification and waits for the result from the blockchain network; b) if the block size is not reached, it sends a signed message to the seller ordering it to deliver the purchase (message 3). The seller then delivers the purchase (message 4) to the buyer. Moreover, the auditing entity proceeds to redeem all the tokens included in the transaction (message 7).

– Interaction between the auditing entity and the blockchain network. After receiving each transaction from the seller and after the formation of a block, the auditing entity uploads the built block to the blockchain network (message 5). In fact, the auditing entity inserts the required number of tokens into the block B_{AE} according to its size S, adds its certificate cert_{AE} and its timestamp Time_{AE}, and then signs the block. The formed block Block_{AE} can be expressed as follows:

Block_{AE} = {cert_{AE}, B_{AE}, Time_{AE}, sig_{AE}(B_{AE}, Time_{AE})}  (3)

where B_{AE} is the block containing S tokens, which corresponds to the block size assigned to the buyer. After receiving the blocks from the auditing entity, the blockchain network applies the verification process to determine the validity of each token. First, it verifies the authenticity, the timestamp, and the certificates of the actors. Second, it checks whether each token is valid and determines the type of attack, if any. Finally, it sends the response to the auditing entity (message 6), which consists of a block composed of the states of the tokens. This result Block_{BC} can be expressed as follows:

Block_{BC} = State(B_{AE}) = {St_j}, j = 1, ..., S  (4)

where St_j = (s_j, b_j, ta_j) represents the state of a token included in the block of size S: s_j is the identity of the token, b_j is a binary value (1 for a valid token, 0 for an invalid token), and ta_j is the type of attack in case of a false token. After receiving the result from the blockchain network, the auditing entity determines the number of invalid tokens, computes the buyer's trust value, and hence the future size of the block of tokens.

– Interaction between the auditing entity and the financial entity. After receiving the response from the blockchain network, the auditing entity first extracts the valid tokens from the received block and then sends an invoice to the financial entity to receive payment for these valid tokens. Upon receiving the invoice, the financial entity verifies it before proceeding to payment by sending a request to the blockchain network to check the validity of the identities of the tokens included in the invoice. If these tokens are validated by the miners, the financial entity proceeds to pay the auditing entity (message 8). A simplified sketch of these message structures is given below.

4 Buyer's Trust Models

In this section, we present the trust models of the buyer. The auditing entity is responsible for computing the trust values of the buyers. Its main roles consist in: a) aggregating the tokens and forming the blocks that will be transmitted to the blockchain network; and b) determining the risk level and the future value of the token block size according to the computed trust value. It is worth noting that trust is dynamic and related to risk. For a given buyer, let S be the size of the current block of aggregated tokens to be transmitted to the blockchain network and b be the number of invalid tokens included in the previously transmitted block. In the sequel, we present our three trust models.

4.1 Neutral Trust Model

The neutral model can be represented by a linear trust function as follows:

T_e^0(b) = ((S − b) + 1) / (S + 2)  (5)

It is worth noting that T_e^0(0) = (S + 1)/(S + 2) and T_e^0(S) = 1/(S + 2).

4.2 Optimistic Trust Model

The optimistic model can be represented by the following trust function:

T_e^−(b) = 1 − γ_2 · exp(−δ_2 · (S − b))  (6)

where γ_2 = (S + 1)/(S + 2) and δ_2 = log(1 + S)/S. The function T_e^− maps [0, S] to [1/(S + 2), (S + 1)/(S + 2)], with T_e^−(0) = (S + 1)/(S + 2) and T_e^−(S) = 1/(S + 2). In addition, it has two main properties: its first-order derivative is negative, so the function is decreasing, and its second-order derivative is negative for b ∈ [0, S], so T_e^−(·) is concave.

4.3 Pessimistic Trust Model

The pessimistic model can be represented by the following trust function:

T_e^+(b) = γ_1 · exp(δ_1 · (S − b))  (7)

where γ_1 = 1/(S + 2) and δ_1 = log(1 + S)/S. We have T_e^+(0) = (S + 1)/(S + 2) and T_e^+(S) = 1/(S + 2). Moreover, this function has two main properties: its first-order derivative is negative, so T_e^+(·) is decreasing, and its second-order derivative is positive on [0, S], so T_e^+(·) is convex. It is worth noting that the three functions have the same start and end points.


Finally, one can easily show that, for a given S, we have the following inequalities:

T_e^+(b) ≤ T_e^0(b) ≤ T_e^−(b)  (8)
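For illustration, the three trust functions can be evaluated directly from Eqs. (5)–(7). The sketch below is a minimal Python rendering under the reconstruction γ_1 = 1/(S + 2), γ_2 = (S + 1)/(S + 2), and δ_1 = δ_2 = log(1 + S)/S used above; it is not part of the original protocol specification.

```python
import math

def neutral_trust(b: int, S: int) -> float:
    """Neutral model, Eq. (5): linear in the number of invalid tokens b."""
    return ((S - b) + 1) / (S + 2)

def optimistic_trust(b: int, S: int) -> float:
    """Optimistic model, Eq. (6), with gamma2 = (S+1)/(S+2), delta2 = log(1+S)/S."""
    gamma2 = (S + 1) / (S + 2)
    delta2 = math.log(1 + S) / S
    return 1 - gamma2 * math.exp(-delta2 * (S - b))

def pessimistic_trust(b: int, S: int) -> float:
    """Pessimistic model, Eq. (7), with gamma1 = 1/(S+2), delta1 = log(1+S)/S."""
    gamma1 = 1 / (S + 2)
    delta1 = math.log(1 + S) / S
    return gamma1 * math.exp(delta1 * (S - b))

# All three functions decrease from (S+1)/(S+2) at b = 0 to 1/(S+2) at b = S,
# and satisfy T_e^+(b) <= T_e^0(b) <= T_e^-(b), cf. Eq. (8).
S = 100
for b in (0, 25, 50, 75, 100):
    print(b, pessimistic_trust(b, S), neutral_trust(b, S), optimistic_trust(b, S))
```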

5 Auditing Entity's Decision Making

In this section, we first detail how the auditing entity handles the size of the blocks of tokens it sends to the blockchain network, in order to later compute the future trust value. Then, we present an approximation of the risk to which the auditing entity is exposed.

5.1 Computation of the Block Size

In this subsection, we describe how our payment system is trust aware. We first detail how the auditing entity adjusts the size of the blocks of tokens it sends to the blockchain network using the most recent buyer's trust value. The auditing entity makes its decision by updating the trust value and modifying the token block size S accordingly. The decision made by the auditing entity depends on the result returned by the blockchain network. A new trust value is computed only when the block is invalid; in that case, the auditing entity selects a trust model and then computes the new size of the block. More precisely, at the construction of the nth block, the auditing entity first determines the number of invalid tokens b_n, then computes the accumulated number of invalid tokens Φ_n = Σ_{j=1}^{n} b_j, and hence the trust value T_e(Φ_n). Next, the auditing entity computes the relative error of the trust δ_n = (T_e(Φ_n) − T_e(Φ_{n−1})) / T_e(Φ_{n−1}). Finally, since the relative error of the trust is equal to the relative error of the block size, the auditing entity can write the following expression:

(T_e(Φ_n) − T_e(Φ_{n−1})) / T_e(Φ_{n−1}) = (y_n − S_{n−1,i}) / S_{n−1,i}  (9)

where S_{n−1,i} is the size of the block at the (n − 1)th block formation and i denotes the buyer. Therefore, we can deduce the value of the solution y_n = S_{n−1,i} × (1 + δ_n). The size of the nth block, when Φ_n is strictly positive, is then S_{n,i} = ⌊y_n⌋, where ⌊·⌋ is the floor function.
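A minimal sketch of this block-size update follows; the trust model is passed in as a function of the accumulated number of invalid tokens, and all names are illustrative rather than taken from the paper.

```python
import math

def update_block_size(trust_fn, phi_prev: int, b_n: int, s_prev: int):
    """Update the token block size after the result of the n-th block, cf. Eq. (9).

    trust_fn : selected trust model, called as trust_fn(accumulated_invalid_tokens)
    phi_prev : accumulated invalid tokens Phi_{n-1}
    b_n      : invalid tokens reported for the n-th block
    s_prev   : previous block size S_{n-1,i}
    Returns (S_{n,i}, Phi_n).
    """
    phi_n = phi_prev + b_n                                            # Phi_n = sum_{j<=n} b_j
    delta_n = (trust_fn(phi_n) - trust_fn(phi_prev)) / trust_fn(phi_prev)
    y_n = s_prev * (1 + delta_n)                                      # relative trust error carried to the size
    return math.floor(y_n), phi_n                                     # S_{n,i} = floor(y_n)

# Example with the neutral model of Eq. (5) and an initial block size of 100.
S0 = 100
neutral = lambda phi: ((S0 - phi) + 1) / (S0 + 2)
size, phi = update_block_size(neutral, phi_prev=0, b_n=5, s_prev=S0)
print(size, phi)   # a smaller block size after 5 invalid tokens
```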

5.2 Risk Evaluation

In this subsection, we evaluate the risk of monetary loss that can affect the auditing entity due to the increase in the number of invalid tokens reported after the validation process performed by the miners of the blockchain network.


For instance, when the number of invalid tokens is high, the risk level will be very high, so the auditing entity will be strict with these dishonest buyers. However, if the risk is negligible, the auditing entity will be more tolerant. Therefore, the risk function Risk_{n,i}, where n is the block number and i is the buyer number, can be expressed as the difference between the amount of payment given to the seller and the amount of payment received from the financial entity after the reception of the result of the nth formed block. Hence, the risk is related to the number of invalid tokens (b_n), the percentage of compensation (λ) of the auditing entity, the initial value of the block size, and the token value (v). It is worth noting that the formation of the nth block may be performed after m successive transactions. After this, the block is transmitted to the blockchain network for verification. As discussed earlier, two possible results can be obtained:

– All the tokens included in the block are valid. In this case, the auditing entity pays the seller an amount equal to S_{n−1,i} · v · (1 − λ) and receives from the financial entity an amount equal to S_{n−1,i} · v.
– The nth block contains invalid tokens. In this case, the auditing entity rejects the last transaction and gives the seller an amount equal to the sum of all the tokens included in the m − 1 transactions multiplied by v(1 − λ). The auditing entity receives from the financial entity an amount equal to the sum of the valid tokens included in the m − 1 transactions multiplied by the value of the token.

These two cases are summarized in the sketch below.
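The following sketch expresses these two cases as a simple loss computation for one formed block. It is an illustrative reading of the description above (risk taken as the amount paid to the seller minus the amount received from the financial entity), not a formula given in the paper.

```python
def block_risk(tokens_per_tx, valid_per_tx, v: float, lam: float) -> float:
    """Risk (loss) of the auditing entity for one formed block.

    tokens_per_tx : number of tokens in each of the m transactions forming the block
    valid_per_tx  : number of valid tokens in each of those transactions
    v             : token value
    lam           : compensation percentage of the auditing entity (lambda)
    """
    m = len(tokens_per_tx)
    if all(t == val for t, val in zip(tokens_per_tx, valid_per_tx)):
        # All tokens valid: pay the seller S*v*(1-lam), receive S*v.
        paid = sum(tokens_per_tx) * v * (1 - lam)
        received = sum(tokens_per_tx) * v
    else:
        # Invalid tokens detected: the last transaction is rejected; the seller is paid
        # for the first m-1 transactions, while the financial entity reimburses only
        # the valid tokens of those m-1 transactions.
        paid = sum(tokens_per_tx[: m - 1]) * v * (1 - lam)
        received = sum(valid_per_tx[: m - 1]) * v
    return paid - received   # positive value = loss for the auditing entity

print(block_risk([40, 40, 20], [40, 40, 20], v=5, lam=0.2))   # all valid: negative (no loss)
print(block_risk([40, 40, 20], [30, 40, 20], v=5, lam=0.2))   # invalid tokens in the block
```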

6 Infrastructure Validation and Simulation

In this section, we first discuss the validation and then describe the simulation of our proposed micropayment infrastructure.

6.1 Infrastructure Validation

In this subsection, we focus on proving that our infrastructure complies with the requirements discussed in Sect. 2.

Prevention from Double-Spending: Double spending can be performed by the buyers. Our payment infrastructure is protected against this attack since each token generated by the financial entity is identified by a unique identifier. In addition, the certificates of all the actors (i.e., financial entity, buyer, seller, and auditing entity) of our infrastructure are added to each token. Moreover, our infrastructure relies on the blockchain network for verifying and validating the tokens. Hence, if a token is double spent, the nodes will detect this attack and will mark the state of this token as invalid.

Prevention from Double-Selling: The sellers are responsible for this attack by selling the product twice to the buyers. Our infrastructure is protected against this attack. In fact, the certificates of all the actors (i.e., financial entity, buyer, seller, and auditing entity) are included in each token. Moreover, each transaction contains information about the purchase (e.g., number of product ...).

Prevention from Tokens Forging: The sellers are responsible for the forgery attack. This attack is prevented by our infrastructure by including the certificates of the actors and by providing the signature mechanism. In fact, each actor, after receiving a message (i.e., a token or a transaction), adds its certificate and then signs its newly built message after adding the necessary information. The verification of blocks, which is achieved by the nodes of the blockchain network, also contributes to preventing this attack.

Payment Tracing: Traceability is guaranteed in our infrastructure through the use of the blockchain network for verifying the tokens and the auditing entity for managing the trust of the buyers. Moreover, the verification of the authenticity of the token and of its sender, and of the timestamp included in the token, is performed by each receiver of a token. This helps to trace each token. In addition, the inclusion of information about the purchase helps to trace the bought products.

Buyer's Trust Management: Trustworthiness is provided by our infrastructure through the use of an auditing entity which computes the trust value of the buyer according to the result of the verification of tokens received from the blockchain network. In our work, three models for the computation of the trust are provided. The selection of one of these models depends on the decision made by the auditing entity, which in turn depends on the number of invalid tokens reported by the blockchain network.

6.2 Simulation and Results

In this subsection, we evaluate the performance of our proposed trust models. For this, we assume that n buyers want to make purchases from a seller through micropayment tokens. Each buyer owns a large number of tokens received from the financial entity. In every time slot, a buyer can buy one product with a fixed frequency f. In addition, we assume that the sold products have the same cost. Moreover, we assume that a buyer can double spend tokens with a double spending rate r. Furthermore, we fix the initial value of the size of the block of tokens S0. Finally, the three trust models are applied to show their impact on the size of the block of tokens. Table 1 summarizes the used parameters. To analyze the performance of our infrastructure, we assess the mean size of the block of tokens S per buyer over time by varying: a) the double spending rate; and b) the starting value of the size of the block S0. In the following, we present the results of our simulation. As shown in Fig. 2, the mean size of the block of tokens S is assessed with respect to the double spending rate r when one buyer and one seller are used. This is applied for the three trust models. The starting value of the size of the block of tokens S0 is set to 100, the frequency of buying products f is equal to 0.4, and the price of the product is equal to 5.

Table 1. Values of parameters

Parameter                                   Value
Starting value of S (S0)                    100
Product price                               5
Frequency of buying (f)                     0.4
Number of buyers                            1
Rate of double spending (r)                 0.2
Auditing entity's compensating value (λ)    0.2

We observe that the mean size of the block of tokens S decreases as the rate of double spending increases. This holds for the three trust models. In addition, as the rate of double spending increases, the optimistic model decreases slowly, the pessimistic model decreases rapidly, and the neutral model shows a moderate decrease. We also note that the values of S under the pessimistic trust model are the lowest among the three models. To summarize, the pessimistic model is chosen when the auditing entity wants to be strict with the buyers, whereas the optimistic model is applied when the buyers are honest.

Fig. 2. Mean size of the block of tokens w.r.t. double spending rate (curves: optimistic, neutral, and pessimistic models)

As shown in Fig. 3, we assess the mean size of the block of tokens S for the neutral trust model with respect to the cost of the product and the double spending rate r. S0 is equal to 100 and f is set to 0.4. We notice that the mean size of the block S decreases when the rate of double spending is increased. In addition, S decreases when we increase the price of the product. Indeed, the higher the rate of double spending and the product price, the smaller the mean size of the block S. The highest value of S is achieved for the lowest values of the product price and the double spending rate, whereas the lowest value of S is reached when both the product cost and the double spending rate are the highest. To summarize, it is advisable to reduce the product price for buyers exhibiting bad behavior.

Fig. 3. Mean size of the block of tokens w.r.t. double spending rate and product cost

As depicted in Fig. 4, we assess the mean size of the block of tokens S with respect to the starting value S0. In this simulation, we are also interested in showing the impact of the three trust models. The product price is set to 5 and the double spending rate is set to 0.2. It is worth noting that the larger the value of S0, the larger the value of S. Furthermore, we observe that the optimistic model shows higher values of S than the neutral model, which in turn shows higher values of S than the pessimistic model. To summarize, the selection of the trust model is related to the behavior of the buyer. As depicted in Fig. 5, we evaluate the mean size of the block of tokens with respect to S0 and the product price. In this simulation, we select the pessimistic model for evaluation. The rate of double spending is set to 0.2. We notice that the mean size of the block S rises with the increase of the starting value S0 and drops with the increase of the product price. Accordingly, the higher the product price and the lower the starting value S0, the smaller S is. Furthermore, the highest value of S is reached for the lowest value of the product price and the highest value of S0. To conclude, the selection of the starting value has an impact on the decision made by the auditing entity.


Fig. 4. Mean size of the block of tokens w.r.t. the initial value of S (S0) (curves: optimistic, neutral, and pessimistic models)

Fig. 5. Mean size of the block w.r.t. the starting value of S (S0) and product cost

7 Conclusion

In this work, we presented our robust micropayment infrastructure using blockchain technology for e-commerce. Moreover, we presented three trust models for computing the trust values of the buyer. Furthermore, we presented the decision made by the auditing entity, which is based on the determination of the future value of the size of the block and of the risk of loss caused by misbehaving buyers. Finally, we validated our micropayment infrastructure and analyzed the performance of our proposed trust models by assessing the mean size of the block per buyer over time. In the future, we plan to show the scalability of our infrastructure by considering many buyers and sellers. Moreover, we plan to use two or more auditing entities and show their impact on the performance of our micropayment infrastructure.


References

1. Alnumay, W., Ghosh, U., Chatterjee, P.: A trust-based predictive model for mobile ad hoc network in internet of things. Sensors 19(6), 1467 (2019)
2. Casino, F., Dasaklis, T.K., Patsakis, C.: A systematic literature review of blockchain-based applications: current status, classification and open issues. Telemat. Inform. 36, 55–81 (2019)
3. Khan, N., Ahmad, T., State, R.: Blockchain-based micropayment systems: economic impact. In: Proceedings of the 23rd International Database Applications & Engineering Symposium (IDEAS 19), Athens, Greece (2019)
4. Khan, N., Ahmad, T., State, R.: Feasibility of Stellar as a blockchain-based micropayment system. In: Smart Blockchain - 2nd International Conference, SmartBlock 2019, pp. 53–65. Springer, Cham (2019)
5. Konashevych, O., Khovayko, O.: Randpay: the technology for blockchain micropayments and transactions which require recipient's consent. Comput. Secur. 96, 101892 (2020)
6. Lahkani, M.J., Wang, S., Urbański, M., Egorova, M.: Sustainable B2B e-commerce and blockchain-based supply chain finance. Sustainability 12, 3968 (2020)
7. Rejeb, A., Keogh, J.G., Treiblmaier, H.: How blockchain technology can benefit marketing: six pending research areas. Front. Blockchain 3(3), 1–12 (2020)
8. Rezaeibagha, F., Mu, Y.: Efficient micropayment of cryptocurrency from blockchains. Comput. J. 62(4), 507–517 (2018)
9. Robert, J., Kubler, S., Ghatpande, S.: Enhanced Lightning Network (off-chain)-based micropayment in IoT ecosystems. Futur. Gener. Comput. Syst. 112, 283–296 (2020)
10. Rofiq, A., Mula, J.M.: The effect of customers' trust on e-commerce: a survey of Indonesian customer B to C transactions. In: Proceedings of the International Conference on Arts, Social Sciences & Technology, Penang, Malaysia (2010)
11. Verme, P.L.H.: Virtual currencies, micropayments and the payments systems: a challenge to fiat money and monetary policy? J. Virtual Worlds Res. 7(3), 325–343 (2014)
12. Xiaoling, W., Huang, J., Ling, J., Shu, L.: BLTM: beta and LQI based trust model for wireless sensor networks. IEEE Access 7, 43679–43690 (2016)
13. Xu, J.J.: Are blockchains immune to all malicious attacks? Financ. Innov. 2(1), 1–9 (2016). https://doi.org/10.1186/s40854-016-0046-5
14. Yang, C.-N., Chen, Y.-C., Chen, S.-Y., Wu, S.-Y.: A reliable e-commerce business model using blockchain based product grading system. In: Proceedings of the 2019 4th IEEE International Conference on Big Data Analytics, Suzhou, China, pp. 341–344 (2019)
15. Zile, K., Strazdina, R.: Blockchain use cases and their feasibility. Appl. Comput. Syst. 23(1), 12–20 (2018)

Committee Selection in DAG Distributed Ledgers and Applications

Bartosz Kuśmierz1,2(B), Sebastian Müller1,3, and Angelo Capossele1

1 IOTA Foundation, 10405 Berlin, Germany
[email protected]
2 Department of Theoretical Physics, Wroclaw University of Science and Technology, Wroclaw, Poland
3 Aix Marseille Université, CNRS, Centrale Marseille, I2M - UMR 7373, 13453 Marseille, France

Abstract. In this paper, we propose several solutions to the committee selection problem among participants of a DAG distributed ledger. Our methods are based on a ledger-intrinsic reputation model that serves as a selection criterion. The main difficulty arises from the fact that the DAG ledger is a priori not totally ordered and that the participants need to reach a consensus on participants' reputations. Furthermore, we outline applications of the proposed protocols, including: (i) a self-contained decentralized random number beacon; (ii) selection of oracles in smart contracts; (iii) applications in consensus protocols and sharding solutions. We conclude with a discussion on the security and liveness of the proposed protocols by modeling reputation with a Zipf law.

Keywords: Decentralized systems · Distributed ledgers · DAG · Reputation model · Security

1 Introduction

In distributed ledger technologies (DLTs), committees play essential roles in various applications, e.g., distributed random number generators, smart contract oracles, consensus mechanisms, or scaling solutions. The most famous example of a permissionless DLT is the Bitcoin blockchain introduced in the whitepaper [23] by Satoshi Nakamoto. The blockchain enables network participants to reach consensus in a trustless peer-to-peer network using the so-called proof of work (PoW) consensus protocol. In PoW-based blockchains, participants need to solve a cryptographic puzzle to issue the next block. The higher the computing power of a participant, the higher the chances to produce the next block. In recent years, another consensus mechanism, called Proof-of-Stake (PoS), became popular. In contrast to PoW, in PoS-based cryptocurrencies, the next block's creator is chosen depending on its wealth or stake [30] and not on its computing power.


DLTs have already found multiple applications in the financial sector, including online value transfer, digital asset management, and data marketplaces [13, 41]. Subsequent projects, inspired by Bitcoin, started an entirely new field of smart contracts, which are used for settling online agreements without an intermediary and open the possibility of decentralized autonomous organizations [17, 25]. Nevertheless, blockchain-based DLTs have problems, which become apparent when the network participants start using the full block capacity. As the number of transactions issued by the users becomes significantly larger than what can fit into a block, fees increase considerably. Increasing the throughput is one of the main motivations behind work on the DLT scaling problem, which leads to the (in-)famous blockchain trilemma [5]. The blockchain trilemma states that a DLT can have at most two out of three desired properties: scalability, security, and decentralization. The most problematic part of the trilemma for blockchains is scalability. Proposed solutions that aim at increasing DLTs' scalability and flexibility include increasing block size and issuance frequency, sidechains [7], "layer-two" solutions like the Lightning Network [28], different consensus mechanisms [2], and sharding [21]. A different approach is to change the underlying data structure from a chain to a more general directed acyclic graph (DAG) [29, 35]. DAGs have been adopted by a variety of DLT projects including IOTA [29], Obyte [6], SPECTRE [37], Nano [20], and Aleph Zero [12]. While those projects utilize the DAG structure differently and adopt different consensus mechanisms, in most of them transactions are graph vertexes that reference multiple previously issued transactions. This property assures that the graph is acyclic and that the data structure grows with time.

1.1 DAG-Based Cryptocurrencies

An example of a project that utilizes DAGs is IOTA, as described in the original whitepaper [29], which requires every transaction in the DAG, also called the Tangle, to reference exactly two other transactions directly. Transactions also indirectly reference other transactions: we say that y indirectly approves x if there is a directed path of references from y to x. As the ledger grows, each accepted transaction gains indirect references, which are assured by the default tip selection mechanism [18, 24, 29, 35]. The set of indirect references of a given transaction is interpreted as the set of confirming transactions and plays the same role as the confirming blocks in Bitcoin, i.e., the more indirect references a transaction gains, the more likely it is to remain a part of the ledger. The currently deployed implementation of IOTA is a form of technology prototype [15], hereafter referred to as Coordinator-based IOTA, and differs from the original whitepaper version. The most crucial difference is that the consensus is based on so-called milestones: special transactions issued periodically by a privileged node called the Coordinator. Every transaction referenced (directly or indirectly) by a milestone is considered confirmed. While such a system is centralized, the authors in [34] propose a decentralized solution dubbed Coordicide, referred to as post-Coordicide IOTA. One key element in Coordicide is to replace the Coordinator with the decentralized consensus protocol called Fast Probabilistic Consensus (FPC). FPC is a scalable Byzantine-resistant voting scheme [3, 33] where nodes vote in rounds. In each round, participants query a random subset of online nodes in the network. Opinions in the next round depend on the received queries and a random threshold. When most of the nodes in the network use the same random threshold, the system reaches unanimity. The common random thresholds enable the protocol to break metastable situations [32].

1.2 Contributions

Let us define a committee as a group of usually trustworthy nodes selected to execute a special task. We assume that participants take part in a DLT that supports a reputation system. Every node can issue transactions, which from now on we call messages; we use the name message to indicate that those objects can include generic data and not only token transfers. The reputation system is needed as a criterion to select a subgroup of all participants and to mitigate Sybil attacks, a common threat in permissionless systems. We concentrate on the more difficult case of DAG-based DLTs, which promise better scalability and decentralization than blockchains. Unlike blockchains, which are totally ordered by nature, DAGs lack natural "reference points" that could be used to determine the reputation of the participants. The confined structure of blockchains allows for using a reputation calculated for a specific block defined on the protocol level; the more complex structure of DAGs does not allow for the adoption of an analogous rule. This paper proposes a series of committee selection mechanisms for DAG-based DLTs with a reputation system. The committee selection process runs periodically and depends on the reputation of nodes, which is stake or delegated stake. More specifically, the contributions of this paper are the following:

1. we propose several protocols to select a committee in permissionless decentralized systems;
2. we analyze the token distribution of a series of cryptocurrency projects;
3. we model the reputation distribution and analyze the security of the proposed protocols; and
4. we discuss several applications of committee selection protocols, including a decentralized random number beacon, smart contracts, consensus mechanisms, and sharding solutions.

1.3 Outline

The article is organized as follows. In the next section, we discuss previous work related to this paper, mainly on different kinds of decentralized random number beacons and reputation systems. In Sect. 3, we specify the assumptions on the DLT that are required for our protocol to work. Section 4 is devoted to the committee selection, where we give three different methods of achieving consensus on the reputation values. In Sect. 5, we discuss the security of our proposal by modeling the reputation distribution with a Zipf law. Applications of our protocols to dRNG, smart contracts, consensus mechanisms, and scaling are discussed in Sect. 6. Finally, Sect. 7 outlines further research directions and concludes the paper.

2 Related Work

2.1 Reputation and Committees in DLT

A reputation system in a DLT is any mapping that assigns real numbers to the network participants. Reputation can be objective, when all of the participants agree on the exact values of the reputation, or subjective, when different nodes have different perceptions of the reputation. However, for a subjective reputation to maintain its utility and play the same function as in social systems, network users should have at least an approximate consensus on its values. All PoS consensus mechanisms induce a reputation system where the user's reputation equals the staked tokens [2]. In the same way, delegated PoS (DPoS) protocols, where staked tokens are delegated to other nodes [11, 19], define a natural reputation system. The highest-reputation DPoS nodes form a committee of block validators and produce the next blocks. Consensus among a fixed-size committee is easier to achieve than in open systems. An interesting variation on DPoS systems is mana, introduced in the post-Coordicide IOTA network [34]. This reputation system takes advantage of each issued message, which temporarily grants mana to a certain node. Multiple implementations of sharding in blockchains also require a random assignment of the block validators, e.g., [8]. If validators cannot predict which shard they will be assigned to, then any collusion is significantly hampered. When the network also uses a reputation system, the sharding process can be improved by assigning approximately the same reputation to each shard. An example of such a protocol is RepChain [14], which assigns reputation to nodes based on their behavior in the previous rounds.

2.2 Decentralized Random Number Generation

Both randomness and reputation systems can improve the security, scalability, and liveness of DLTs. A random number beacon, as introduced by Rabin [36], is a service that broadcasts a random number at regular intervals. Randomness produced by an ideal beacon cannot be predicted before being published; however, this property is hard to achieve in centralized systems. The development of decentralized random number generators (dRNG) tries to address those problems. In general, a dRNG should provide unpredictable and unbiased randomness, which cannot be controlled nor easily biased by a single malicious actor. There are different proposals for dRNGs in the literature. The authors in [1] discuss the extraction of randomness from public blockchains using hashes of blocks; however, certain concerns regarding those solutions have been raised in [26, 31]. Other proposals like RandHound and RandHerd utilize publicly verifiable secret sharing schemes [39] to generate a collective key shared between committee nodes. If more than a certain threshold of partial secrets is published, the network can recover the random number, i.e., a (t, n)-threshold security model. Other proposals include smart contracts, e.g., RANDAO used in Ethereum [10]; security in such solutions is achieved with the risk of fund confiscation. An interesting research direction that can improve security and prevent manipulation of the generated randomness is verifiable delay functions (VDFs) [1]. VDFs take a fixed amount of time to compute, cannot be parallelized, but can be verified quickly [27, 40]. However, VDF calculations are costly. Moreover, this approach requires further research to ensure honest users have access to the fastest application-specific integrated circuits (ASICs) specialized in a given VDF.

3 Protocol Overview and Setup

We propose a protocol for committee selection in DAG-based DLTs with a subjective reputation system. Every node has a reputation, but, a priori, there is no perfect consensus on the values of the reputation. We assume that each vertex of the DAG determines a view of the reputation, and subjectivity comes from the fact that there is no unique way of choosing the "reference vertex". For example, if nodes adopt the simple rule that reputation should be calculated based on the most recently received message, then, due to network delay, two different nodes may disagree on which message is the most recent. The committee selection is the process of appointing nodes with a sufficiently high reputation. To perform the three phases above, we require the underlying DAG to satisfy the following properties.

P1 The DAG grows in time, and incoming messages are new vertexes of the DAG.
P2 The DAG allows for the exchange of application messages (optional).
P3 The DAG is immutable and provides a strict criterion for the approximate time of message creation.
P4 The subjective reputation of the nodes can be read from the DAG and is determined by the vertex (different nodes can read the reputation from different vertexes).
P5 Messages in the DAG are signed by the nodes and cannot be counterfeited (i.e., a malicious node cannot fake the origin of a message).

Immutability, P3, guarantees that no message can be removed from the ledger and that no new messages can be added in a part of the ledger that suggests they were issued in the past (by attaching them deep into the DAG). Property P3 can be achieved using any of the following methods:


(i) Messages are equipped with enforceable timestamps; messages with wrong timestamps are rejected by the network.
(ii) Special partial order generating messages are periodically issued into the DAG.

Enforceable timestamps, mentioned in (i), can be achieved when honest nodes automatically reject messages with timestamps too far in the past or the future. A certain level of desynchronization must be allowed to account for the network delay and differences in local clocks. However, we require that the network has a specific bound above which no message with a lower timestamp will be accepted into the ledger (even accounting for network delays). In the edge cases, when part of the network considers a timestamp valid and another part does not, it is necessary to run a certain kind of consensus mechanism. Partial order generating (POG) messages, (ii), can be the result of consensus among the nodes, e.g., nodes vote on them. Another option is a proof-of-authority type of consensus, where privileged "validator" nodes issue those POG messages. An example of this consensus type is Coordinator-based IOTA [15], where a particular entity called the Coordinator issues milestones. Similarly, Obyte uses main chain transactions (MCT), which are indicated by trusted witnesses [6]. Note that when a node issues a message after the ith milestone/MCT, it can be approved only by an (i + 1)th order generating message. This procedure generates a certain kind of logical timestamp provided by the milestones/MCT. There are two natural ways of determining the reputation value: (i) the reputation is summed over all of the messages approved by a given message; (ii) if timestamps are available, the reputation can be summed over all of the messages with timestamps smaller than the DAG vertex's timestamp. An example of reputation calculated using method (i) is presented in Fig. 1.
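The following sketch illustrates method (i): each message grants reputation to its issuer, and the reputation seen from a given vertex is summed over that vertex's past cone (cf. Fig. 1). The dictionary-based DAG encoding is purely illustrative.

```python
from collections import defaultdict

def past_cone(dag, tip):
    """Return all messages approved (directly or indirectly) by `tip`, including itself."""
    seen, stack = set(), [tip]
    while stack:
        msg = stack.pop()
        if msg in seen:
            continue
        seen.add(msg)
        stack.extend(dag[msg]["approves"])
    return seen

def reputation_from(dag, tip):
    """Subjective reputation read from the vertex `tip` (method (i))."""
    rep = defaultdict(int)
    for msg in past_cone(dag, tip):
        rep[dag[msg]["issuer"]] += dag[msg]["grant"]
    return dict(rep)

# Toy DAG: each message approves earlier messages and grants reputation to its issuer.
dag = {
    "genesis": {"issuer": "A", "grant": 0, "approves": []},
    "m1": {"issuer": "A", "grant": 10, "approves": ["genesis"]},
    "m2": {"issuer": "B", "grant": 5, "approves": ["genesis"]},
    "m3": {"issuer": "C", "grant": 10, "approves": ["m1", "m2"]},
}
print(reputation_from(dag, "m3"))   # {'A': 10, 'B': 5, 'C': 10}
```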

4 Committee Selection

When all users have the same view on the reputation, a natural choice for the committee is to take the top n reputation nodes. An alternative option is to perform a lottery with a probability dependent on the reputation of the nodes. The list of lottery participants consists of the k > n nodes with the highest reputation. Using the last random number X produced by the previous committee, nodes calculate the coefficients

q_i = f(X, id_i) · r_i / (Σ_{j=1}^{k} r_j),

where f is a cryptographic hash function with values in [0, 1], i is the index of a node, r_i is its reputation, and id_i is its identifier. Then, the committee members would be the n nodes with the highest q_i.
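A sketch of this reputation-weighted lottery is given below; the hash-to-[0, 1] mapping is one possible instantiation of f and is not mandated by the protocol.

```python
import hashlib

def f(x: str, node_id: str) -> float:
    """Map (X, id_i) to a pseudo-random value in [0, 1] via a cryptographic hash."""
    digest = hashlib.sha256(f"{x}:{node_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def select_committee(candidates, x: str, n: int):
    """candidates: list of (node_id, reputation) pairs for the top k > n nodes."""
    total = sum(rep for _, rep in candidates)
    scores = {nid: f(x, nid) * rep / total for nid, rep in candidates}
    # Committee = the n nodes with the highest coefficient q_i.
    return sorted(scores, key=scores.get, reverse=True)[:n]

committee = select_committee([("a", 50), ("b", 30), ("c", 15), ("d", 5)], x="X-42", n=2)
print(committee)
```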


Fig. 1. Each message grants a particular reputation to one node. The reputations are summed over all of the messages approved by a given message. The reputation calculated for the dark blue message takes into account all blue messages: Node A: 15, Node B: 5, Node C: 20. The reputation for the dark green message is obtained by summing over all green messages: Node A: 5, Node B: 25, Node C: 15.

Cryptocurrency projects are known for their high concentration of hashing power or (staked) tokens, and opening the possibility of low-reputation nodes becoming members of the committee may lead to a decrease in security. Thus, we recommend selecting the top n reputation nodes; we discuss this issue in more detail in Sect. 5. In the following, we present three different methods to find consensus on the nodes' reputations and, therefore, on the members of the committee. We describe the methods for DAGs with timestamps; the corresponding versions for DAGs with POG messages are straightforward adaptations. We assume that we want to select the committee at time t_C, and D is the bound for accepting dRNG messages in the DAG, i.e., no honest node will accept dRNG messages with timestamps that differ by more than D from its local time.

4.1 Application Message

The first method of committee selection requires all nodes interested in committee participation to prepare a special application message. This message determines the value of the reputation of a given node by its timestamp. Application messages can be submitted within an appropriate time window: if we denote the application time window length by Δ_A, then the time window is [t_C − D − Δ_A, t_C − D].


Application messages are used to extract only the reputation of the issuing node, i.e., the reputations of two potential committee members are deduced from different messages. If a node sends more than one application message, the other participants do not consider this node for the selection process. Since messages require the node's signature, this type of malicious behavior cannot be imitated by an attacker. This method of committee selection has the advantage that only online nodes can apply. Moreover, if a particular node does not want to participate in the committee, e.g., due to planned maintenance or insufficient resources, it can decide not to apply. In general, applications should not be mandatory, as this is hard to enforce. The committee selection procedure is open, and any node can issue an application message. However, the committee is formed from the top reputation nodes, and low reputation nodes are unlikely to get a seat. Applications issued by low reputation nodes are likely to be not only redundant but possibly problematic when the network is close to congestion. Thus, we propose the following improvement to the application process. A node is said to be M_ℓ if, according to its view of the reputations, it is among the top ℓ reputation nodes. Then nodes produce application messages according to the following rules: if a node x is M_{2n}, it issues an application at the time t_C − D − Δ_A. For k > 2, a node x which is M_{n·k} but not M_{n·(k−1)} submits a committee application only if, at the time t_C − D − Δ_A/(k − 1) (according to x's local time perception), there are fewer than n valid application messages with a stated reputation greater than the reputation of x. The timestamp of such an application message should be t_C − D − Δ_A/(k − 1). An example of pseudocode for scheduling the sending of an application message is presented in Algorithm 1.

4.2 Checkpoint Selection

We introduce checkpoints, similar to the POG messages mentioned in (ii). The checkpoint is unpredictably obtained from ordinary messages, i.e., it is impossible to say at the time of issuance that any particular message will become a checkpoint. To achieve that, we require a single random number X. The reputation of the nodes is calculated for this checkpoint message. Moreover, checkpoints can suggest which nodes are online, for example using the following rule: if the last message issued by a node x in the past cone of the checkpoint is older than a certain threshold, we consider this node offline. The random number X can be the last random number produced by the previous committee, e.g., using the dRNG described in Subsect. 6.1, or can come from a different source. If the random number X is revealed at time t_C, the time the committee selection should start, the checkpoint is the first message with a timestamp smaller than t_C − D − X. Possible ties are broken by the lower hash.


Algorithm 1: Application message send scheduler.

require: nodes_reputation(time) → descending ordered list of nodes' reputations at the input time;
         get_position(ID, ordered list of nodes) → ID's position on the given list;
         get_reputation(ID, time) → ID's reputation at the given time;
         application_msg_with_rep_higher_than(reputation) → number of application messages in the DAG with reputation higher than the input
input:   t_C: time of committee selection; D: bound for accepting dRNG messages in the DAG; Δ_A: application time window length; max_k: maximum index value for sending an application message; n: size of the committee; my_ID: ID of the node

if (time = t_C − D − Δ_A) then
    k ← 2
    while k ≤ max_k do
        wait_until(t_C − D − Δ_A/(k − 1))
        rep_list ← nodes_reputation(t_C − D − Δ_A/(k − 1))
        my_ell ← get_position(my_ID, rep_list)
        my_rep ← get_reputation(my_ID, t_C − D − Δ_A/(k − 1))
        if (application_msg_with_rep_higher_than(my_rep) ≥ n) then
            return
        if (my_ell ≤ k · n) then
            send_application_message()
            return
        k ← k + 1
return

The role of the random number X is to improve security. If no participant, including a potential attacker, can predict the timestamp of the future checkpoint, successful attacks are hindered or made impossible. In the following, we describe one attack scenario in the absence of the random factor X. The reputation changes over time, and it is possible that an attacker had a high reputation at one point in time but lost it in the meantime. Then, the attacker tries to place the checkpoint in a part of the DAG that is favorable to him. Without the random factor X, the timestamp of the checkpoint would be predictable, namely t_C − D. An attacker could mine a message with this timestamp and a minimal hash. By placing it strategically in the DAG, the attacker could manipulate the reputation obtained by summing over all of the messages approved by the checkpoint (points P3 and (i)). This type of manipulation is presented in Fig. 2. Note that when the DAG is equipped with POG messages, one of the POG messages can be used as the checkpoint.


Fig. 2. An example of checkpoint manipulation in the absence of the random factor X. A malicious transaction (red) with a timestamp t_C − D is issued "deep" in the DAG, even though honest transactions with such timestamps are placed near the blue dashed line. This placement of the checkpoint could promote the attacker's reputation and diminish other users' reputations, as only contributions from the pink "cone" would be used.

Algorithm 2: Committee Selection Scheduler for Checkpoint Selection Method.

require: nodes_reputation(time) → descending ordered list of nodes' reputations at the input time;
         get_random_number() → global random number;
         top_nodes(n, rep_list) → top n nodes in terms of reputation
input:   t_C: time of committee selection; D: network delay; n: size of the committee
output:  committee: list of nodes selected as committee members

committee ← {}
if time = t_C then
    x ← get_random_number()
    rep_list ← nodes_reputation(t_C − D − x)
    committee ← top_nodes(n, rep_list)
return committee

Algorithm 2 describes the pseudocode for determining the committee with the checkpoint selection method.
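The checkpoint itself can be located with a few lines of code. The sketch below interprets "first message" as the most recent message below the cutoff t_C − D − X, which is one possible reading of the rule above; the message encoding is illustrative.

```python
def find_checkpoint(messages, t_c: float, d: float, x: float):
    """Return the checkpoint: the first message with timestamp < t_C - D - X.

    "First" is interpreted here as the most recent such message; ties on the
    timestamp are broken by the lower hash. messages: iterable of dicts with
    'timestamp' (float) and 'hash' (hex string of fixed length) fields.
    """
    cutoff = t_c - d - x
    eligible = [m for m in messages if m["timestamp"] < cutoff]
    if not eligible:
        return None
    # Sort by timestamp descending, then by hash ascending (lower hash wins ties).
    eligible.sort(key=lambda m: (-m["timestamp"], m["hash"]))
    return eligible[0]
```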

4.3 Maximal Reputation

The maximal reputation from an interval approach is a combination of the application message and checkpoint selection methods. The reputation value of a node x is the maximal reputation calculated from all of the messages in the interval [t_C − D − Δ_A, t_C − D]. This approach does not require the node x to issue any special message. Note that for Δ_A = 0, it reduces to checkpoint selection without the random factor. However, the method has a higher computational complexity, since it requires the computation of the reputation for all messages issued in [t_C − D − Δ_A, t_C − D].

5 Threat Model and Security

The fact that the committee is only a subset of all the nodes may decrease the robustness against malicious actors. The security of the protocol depends on the size of the committee, but also on the way the reputation is distributed among the nodes.

5.1 Zipf Law

Different protocols might define reputation and the methods of gaining it differently. This affects the concentration of reputation and makes it impossible to perform an analysis in full generality. For this reason, we propose to model the distribution of reputation using Zipf laws. Zipf laws satisfy a universality phenomenon; they appear in numerous different fields of application and have, in particular, also been utilized to model wealth in economic models [16]. In this work, we use a Zipf law to model the proportional reputation of N nodes: the nth largest value y(n) satisfies

y(n) = C(s, N)^{-1} n^{-s},  (1)

where C(s, N) = Σ_{n=1}^{N} n^{-s}, N is the number of nodes, and s is the Zipf parameter. A convenient way to observe a Zipf law is by plotting the data on a log-log graph, with the axes being log(rank order) and log(value). The data conforms to a Zipf law to the extent that the plot is linear, and the value of s can be found using linear regression. In Fig. 3 we present the distribution of the richest accounts for a series of cryptocurrency projects, restricted to the top holders, which might be considered for the committee. We observe that most of them resemble Zipf laws. Table 1 contains estimations of the corresponding Zipf coefficients. Note that for PoS-based protocols, the reputation of a node can, to some extent, be approximated by the distribution of the tokens. Post-Coordicide IOTA uses a reputation system called mana that shares some similarities with a delegated PoS, see [34], and it is reasonable to assume that the future distribution of mana would be Zipf-like.
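As an illustration, the sketch below generates a Zipf-distributed reputation vector from Eq. (1) and recovers the parameter s by linear regression on a log-log scale, mirroring the fitting procedure used for Table 1.

```python
import numpy as np

def zipf_reputation(N: int, s: float) -> np.ndarray:
    """Proportional reputation y(n) = n^{-s} / C(s, N), n = 1..N, cf. Eq. (1)."""
    ranks = np.arange(1, N + 1)
    weights = ranks ** (-s)
    return weights / weights.sum()

def fit_zipf_exponent(values: np.ndarray) -> float:
    """Estimate s from a ranked sample via least squares on log-log axes."""
    ranks = np.arange(1, len(values) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(values), 1)
    return -slope

y = zipf_reputation(N=1000, s=0.9)
print(fit_zipf_exponent(y))   # approximately 0.9
```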

5.2 Overtaking the Committee

We assume that an adversary possesses a fraction q of the total reputation and that it can freely distribute this reputation among arbitrarily many different nodes. Honest nodes share the remaining fraction (1 − q) of the reputation. We assume that the reputation among the honest nodes is distributed according to a Zipf law. Overtaking of the committee occurs when the attacker gets t committee seats; the exact value of the threshold t might depend on the particular application of the committee.

Fig. 3. The token distribution of selected cryptocurrency projects on a log-log scale (June 2020). (Projects shown: Bitcoin, Ethereum, IOTA, Obyte, Tether (ETH chain), EOS, Tron; axes: rank vs. token distribution.)

Table 1. Coefficients of the Zipf distribution with the best fit to the token distribution of a given cryptocurrency project. Method: linear regression on a log-log scale (June 2020).

Name                 Zipf coefficient
Bitcoin              0.7628
Ethereum             0.756786
IOTA                 0.934275
Obyte                1.14361
Tether (ETH chain)   0.815054
EOS                  0.536744
Tron                 1.02043

The cheapest way for an attacker to obtain t seats in the committee is to create t nodes with reputation y(n − t + 1)(1 − q) each. The critical value q_c that allows the attacker to get t seats in the committee is therefore given by

q_c = t · y(n − t + 1)(1 − q_c).  (2)

Equivalently,

q_c = (t · y(n − t + 1)) / (1 + t · y(n − t + 1)).  (3)

In Fig. 4 we present the critical value qc for reaching the majority in the committee.
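A sketch evaluating Eq. (3) under the Zipf model of Eq. (1) is given below; it reproduces the setting of Fig. 4 (threshold t = n/2 + 1, N = 1000 nodes), and the function name is illustrative.

```python
import numpy as np

def critical_q(n: int, s: float, N: int = 1000) -> float:
    """Minimal fraction of reputation needed to take t = n//2 + 1 committee seats, cf. Eq. (3)."""
    t = n // 2 + 1
    ranks = np.arange(1, N + 1)
    y = ranks ** (-s)
    y = y / y.sum()                 # Zipf law of Eq. (1)
    yn = y[n - t]                   # y(n - t + 1): 1-based rank -> 0-based index
    return t * yn / (1 + t * yn)

print(critical_q(n=50, s=1.0))      # one point of the surface shown in Fig. 4
```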

Fig. 4. Minimal amount of reputation required to overtake the committee for different committee sizes and token distribution modeled with a Zipf distribution, threshold t = n/2 + 1, and number of total nodes N = 1000. (Axes: Zipf coefficient s vs. committee size n.)

6 Applications

In this section, we describe a few applications of the proposed committee selection protocols.

6.1 Decentralised Random Number Generator

Committee selection protocols allow for the construction of a fully decentralized random number beacon embedded in the DAG ledger structure. An advantage of such an approach is that it relies only on the distributed ledger and does not require any interaction outside the ledger. Publication of the random numbers would naturally take place in the ledger, where possible interruptions are publicly visible. Moreover, propagation of the random numbers would use the same infrastructure as the ledger. Gossiping among nodes in the network would decrease the communication overhead of the committee nodes, as they would not need to send the randomness to each interested user. Similarly, if the proposed protocol requires a distributed key generation (DKG) phase (or any other setup phase), messages should be exchanged publicly in the ledger. Such an approach allows for the detection of malicious or malfunctioning committee members who did not participate in the DKG phase. After time D, a lack of corresponding messages can be verifiably proven using the ledger. This procedure increases robustness in the case of a DKG phase failure. If such a failure is detected, the committee selection is repeated until the DKG phase is successful. Subsequent iterations of the selection process may exclude the nodes that did not participate in the DKG phase previously. Moreover, if all communication is verifiable on the ledger, such nodes can be punished, e.g., by a loss of their reputation.


Note that the dRNG protocol may use a (t, n)-threshold scheme, i.e., the n committee nodes publish their parts of the secret in the form of beacon messages, and once t or more beacon messages are published the next random number can be revealed. The procedure of obtaining the random number from the beacon messages requires specific calculations, e.g., Lagrange interpolation. To save every node from performing those calculations, special nodes in the network can gather beacon messages and publish collective beacon messages which already contain the random number. The public can then verify these collective beacon messages against the collective public key.
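To make the recombination step concrete, the toy sketch below illustrates Lagrange interpolation at x = 0 over a prime field, which is how a (t, n)-threshold secret can be reassembled from any t shares. It is only an illustration under assumed parameters (a toy field modulus, plain integer shares); production dRNG schemes typically operate on elliptic-curve groups (e.g., threshold BLS signatures) rather than raw field elements.

# Toy example: recover the collective value f(0) from t beacon shares
# (x_i, f(x_i)) of a degree t-1 polynomial via Lagrange interpolation.
PRIME = 2 ** 127 - 1  # illustrative field modulus only

def lagrange_at_zero(shares, p=PRIME):
    secret = 0
    for x_i, y_i in shares:
        num, den = 1, 1
        for x_j, _ in shares:
            if x_j != x_i:
                num = (num * -x_j) % p          # product of (0 - x_j)
                den = (den * (x_i - x_j)) % p   # product of (x_i - x_j)
        secret = (secret + y_i * num * pow(den, -1, p)) % p
    return secret

# f(x) = 42 + 7x + 3x^2, so any t = 3 shares recover f(0) = 42.
f = lambda x: (42 + 7 * x + 3 * x * x) % PRIME
print(lagrange_at_zero([(1, f(1)), (3, f(3)), (5, f(5))]))  # prints 42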

6.2 Smart Contract Oracles

One main limitation of smart contracts is that they cannot access data outside of the ledger. So-called oracles try to address this problem by providing external data to smart contracts. When multiple parties engage in a smart contract, they have to determine an oracle. An obvious solution is for the contracting parties to agree upon the oracle. However, this poses certain problems, as oracles may be centralized points of failure and because the different parties have to find consensus on the choice of the oracle for each contract. Moreover, contractees must know in advance that the selected oracles are going to provide reliable services upon contract expiry. The committee selections proposed in Sect. 3 can be used to improve the security and liveness of smart contracts. For instance, if oracles gain reputations that are recorded in the ledger, then oracles can be determined using the proposed methods. If the committee selection process occurs only upon contract termination, then the majority, if not all, of the selected oracles will still be providing services. A similar approach is adopted in the blockchain-based Chainlink [9]; our paper allows for using analogous methods in DAG-based DLTs. This protocol would also be user friendly, as contractees are not required to know classifications of oracles. Note that specialization is very likely to occur, and different oracles will probably deliver different types of data depending on the industry and requirements. For example, sports bets are settled with sports reputation, whereas events in a stock market may use financial reputation. Furthermore, the smart contracts themselves can be modified to adjust the weight of each oracle's vote, or even to make outcomes depend on the distribution of the oracles' opinions.

6.3 Consensus and Sharding

The problem of consensus is much simpler in closed systems, where the number of participants is known and does not change. Consensus algorithms that work in such an environment include [4, 22, 38]. However, the closed nature of such protocols makes them less relevant for decentralization. PoW is open for any user who is willing to solve the cryptographic puzzle; the network does not require any prior knowledge of the user. An intermediary step between entirely closed and entirely open, permissionless networks is a system where each user is allowed to set up a node and collect reputation, but only the most reliable nodes contribute to the consensus protocol.
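As a toy illustration of this intermediary setting (our own example, with an assumed node-to-reputation mapping rather than any particular ledger's data model), forming the fixed-size committee then amounts to taking the n identities with the highest recorded reputation:

# Toy committee selection: pick the n nodes with the highest reputation.
def select_committee(reputation, n=21):
    """reputation: dict mapping node id -> reputation value."""
    return sorted(reputation, key=reputation.get, reverse=True)[:n]

print(select_committee({"a": 5.0, "b": 12.5, "c": 9.1, "d": 1.2}, n=2))  # ['b', 'c']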


An example of such a protocol is EOS, in which delegated stake plays the role of reputation. The EOS blockchain database expands as a committee of the 21 validators with the highest delegated stake (DPoS) produce blocks. The validators use a type of asynchronous Byzantine Fault Tolerance to reach consensus among themselves and propagate the new blocks to the rest of the network [19]. An illustration of the procedure of establishing consensus based on a fixed-size closed committee in open and permissionless systems is given in Fig. 5.

Fig. 5. Process of finding a fixed-size closed committee in open permissionless systems with reputation: an open, decentralized protocol, through reputation gaining, becomes an open protocol with a measure of stake, honesty, importance, or reliability, which, through committee selection, yields a closed system with a fixed number of participants.

Most of the DLTs with committee-based consensus use either pre-selected committees or a blockchain as the underlying database structure (EOS [19], TRON [11]). Our committee selection protocols allow for using similar methods in a more general DAG structure (as long as the DAG satisfies conditions P1–P5). On the same note, the proposed methods can also improve scaling through sharding solutions. It is straightforward to assign validators to each shard based on their reputation; a simple balancing strategy is sketched below. Similarly, as in the case of RepChain [14], our solutions can assure that each shard has approximately the same total validator reputation.
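The following sketch shows one illustrative way (not the paper's algorithm, nor RepChain's) to approximate equal total reputation per shard: a greedy longest-processing-time style assignment that always places the next-heaviest validator into the currently lightest shard.

# Greedy balancing: assign validators, heaviest first, to the lightest shard.
import heapq

def assign_to_shards(reputations, num_shards):
    """reputations: dict validator_id -> reputation; returns a list of shards."""
    shards = [[] for _ in range(num_shards)]
    heap = [(0.0, i) for i in range(num_shards)]   # (total reputation, shard index)
    heapq.heapify(heap)
    for vid, rep in sorted(reputations.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        shards[idx].append(vid)
        heapq.heappush(heap, (total + rep, idx))
    return shards

# Example: Zipf-like reputations split into 4 shards of similar total weight.
reps = {f"node{n}": 1.0 / n for n in range(1, 41)}
print([round(sum(reps[v] for v in shard), 3) for shard in assign_to_shards(reps, 4)])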


7 Conclusion and Future Research

In this article, we proposed a committee selection protocol embedded in a DAG distributed ledger structure. We require the DAG to be equipped with an identity and reputation system. We further assumed that the ledger is immutable after some time, i.e., no transaction can be removed from the ledger, and no transaction claiming to have been issued a long time ago can be added to it. These assumptions can be achieved by the enforcement of approximately correct timestamps or POG transactions. We further discussed methods of reading the reputation from the DAG. These methods include: (1) reputation summed over all of the messages approved by a given vertex in the DAG; (2) reputation summed over all of the messages with timestamps smaller than the timestamp of a given message. Based on that, we proposed and discussed the following methods of committee selection: application message, checkpoint selection, and the maximal reputation method. Furthermore, we analyzed the token distribution of a series of cryptocurrency projects, which turned out to follow a Zipf distribution with a parameter s ≈ 1. We used this fact to model reputation and analyzed the security of our proposal. Then, we focused on the applications of our committee selection protocols. We proposed an application to produce a fully decentralized random number beacon, which does not require any interaction outside of the ledger. Then, we showed how it could improve oracles for smart contracts, and finally, we discussed possible advancements in consensus mechanisms and sharding solutions.

Interesting research directions to further improve the security and liveness of the proposed protocols might involve the use of backup committees. These solutions could be used when the primary committee is not fulfilling its duties. For instance, an obvious, although more centralized, option is a pre-selected committee controlled by the community or a consortium of businesses interested in a reliable protocol. A different option is to select another reputation-based committee from the nodes with reputation index {n + 1, ..., 2n}. However, the security of this solution is debatable, as it might be easier for an attacker to overtake it: Figure 4 shows that for n = 20 and t = 11 an attacker can overtake the committee with as little as 10% of the reputation, and the amount required to overtake a secondary committee would be even lower. Other improvements to the security of the committee include giving multiple seats to the top reputation nodes, e.g., nodes from {1, ..., n/2} would get double or triple identities in the committee. These adaptations increase the reputation required to overtake the committee. Other improvements of the liveness can include recovery mechanisms for when the committee fails to deliver. We hope that all of the mentioned improvements to security and liveness will stimulate further research on this topic.


References

1. Bonneau, J., Clark, J., Goldfeder, S.: On bitcoin as a public randomness source. Cryptology ePrint Archive, Report 2015/1015 (2015). https://eprint.iacr.org/2015/1015
2. Buterin, V., Griffith, V.: Casper the Friendly Finality Gadget. ArXiv e-prints, arXiv:1710.09437, October 2017
3. Capossele, A., Mueller, S., Penzkofer, A.: Robustness and efficiency of voting consensus protocols within byzantine infrastructures. Blockchain: Res. Appl. 2(1), 100007 (2021). https://doi.org/10.1016/j.bcra.2021.100007. https://www.sciencedirect.com/science/article/pii/S2096720921000026. ISSN 2096-7209
4. Castro, M., Liskov, B.: Practical byzantine fault tolerance. In: Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI 1999, pp. 173–186. USENIX Association, USA (1999)
5. Chu, S., Wang, S.: The curses of blockchain decentralization (2018)
6. Churyumov, A.: Byteball: A decentralized system for storage and transfer of value (2016)
7. Croman, K., et al.: On scaling decentralized blockchains. In: Clark, J., Meiklejohn, S., Ryan, P., Wallach, D., Brenner, M., Rohloff, K. (eds.) FC 2016, pp. 106–125. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-53357-4_8
8. Dang, H., Dinh, T.T.A., Loghin, D., Chang, E.-C., Lin, Q., Ooi, B.C.: Towards scaling blockchain systems via sharding. In: Proceedings of the 2019 International Conference on Management of Data, SIGMOD 2019, pp. 123–140. Association for Computing Machinery, New York (2019)
9. Ellis, S., Juels, A., Nazarov, S.: Chainlink: a decentralized oracle network (2017)
10. Ethereum Foundation: RANDAO: A DAO working as RNG of Ethereum
11. TRON Foundation: Tron advanced decentralized blockchain platform (2017). https://tron.network/static/doc/white_paper_v_2_0.pdf
12. Gągol, A., Leśniak, D., Straszak, D., Świętek, M.: Aleph: Efficient Atomic Broadcast in Asynchronous Networks with Byzantine Nodes. arXiv e-prints, arXiv:1908.05156, August 2019
13. Giudici, G., Milne, A., Vinogradov, D.: Cryptocurrencies: market analysis and perspectives. J. Ind. Bus. Econ. 47(18), 1972–4977 (2020)
14. Huang, C., et al.: RepChain: A reputation based secure, fast and high incentive blockchain system via sharding. CoRR, abs/1901.05741 (2019)
15. IOTA Foundation: IOTA Reference Implementation. GitHub
16. Jones, C.I.: Pareto and Piketty: the macroeconomics of top income and wealth inequality. J. Econ. Perspect. 29(1), 29–46 (2015)
17. Kim, H., Laskowski, M., Nan, N.: A first step in the co-evolution of blockchain and ontologies: towards engineering an ontology of governance at the blockchain protocol level. SSRN Electron. J. (2018)
18. Kusmierz, B., Sanders, W.R., Penzkofer, A., Capossele, A.T., Gal, A.: Properties of the tangle for uniform random and random walk tip selection. In: 2019 IEEE International Conference on Blockchain (Blockchain), pp. 228–236 (2019)
19. Larimer, D.: EOS.IO White Paper (2017). https://github.com/EOSIO/Documentation/blob/master/TechnicalWhitePaper.md
20. LeMahieu, C.: Nano: A feeless distributed cryptocurrency network (2017)
21. Luu, L., Narayanan, V., Zheng, C., Baweja, K., Gilbert, S., Saxena, P.: A secure sharding protocol for open blockchains. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 17–30. ACM (2016)


22. Miller, A., Xia, Y., Croman, K., Shi, E., Song, D.: The honey badger of BFT protocols. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS 2016, pp. 31–42. Association for Computing Machinery, New York (2016)
23. Nakamoto, S.: Bitcoin: A peer-to-peer electronic cash system (2008)
24. Penzkofer, A., Kusmierz, B., Capossele, A., Sanders, W., Saa, O.: Parasite chain detection in the IOTA protocol. In: Anceaume, E., Bisière, C., Bouvard, M., Bramas, Q., Casamatta, C. (eds.) 2nd International Conference on Blockchain Economics, Security and Protocols (Tokenomics 2020), vol. 82, pp. 8:1–8:18 (2021). https://doi.org/10.4230/OASIcs.Tokenomics.2020.8. https://drops.dagstuhl.de/opus/volltexte/2021/13530. ISSN 2190-6807
25. Peters, G.W., Panayi, E.: Understanding modern banking ledgers through blockchain technologies: future of transaction processing and smart contracts on the internet of money. In: Tasca, P., Aste, T., Pelizzon, L., Perony, N. (eds.) Banking Beyond Banks and Money. New Economic Windows, pp. 239–278. Springer, Cham (2016)
26. Pierrot, C., Wesolowski, B.: Malleability of the blockchain's entropy. Cryptogr. Commun. 10, 211–233 (2017)
27. Pietrzak, K.: Simple verifiable delay functions (2018). https://eprint.iacr.org/2018/627
28. Poon, J., Dryja, T.: The bitcoin lightning network: Scalable off-chain instant payments (2016)
29. Popov, S.: The tangle (2015)
30. Popov, S.: A probabilistic analysis of the Nxt forging algorithm. Ledger 1, 69–83 (2016)
31. Popov, S.: On a decentralized trustless pseudo-random number generation algorithm. J. Math. Cryptol. 37–43 (2017)
32. Popov, S.: Coins, walks and FPC. YouTube (2020)
33. Popov, S., Buchanan, W.J.: FPC-BI: Fast Probabilistic Consensus within Byzantine Infrastructures (2019). https://arxiv.org/abs/1905.10895
34. Popov, S., et al.: The Coordicide (2020). https://files.iota.org/papers/20200120_Coordicide_WP.pdf
35. Popov, S., Saa, O., Finardi, P.: Equilibria in the Tangle. ArXiv e-prints, arXiv:1712.05385, December 2017
36. Rabin, M.O.: Transaction protection by beacons. J. Comput. Syst. Sci. 27(2), 256–267 (1983)
37. Sompolinsky, Y., Lewenberg, Y., Zohar, A.: Spectre: a fast and scalable cryptocurrency protocol. Cryptology ePrint Archive, Report 2016/1159 (2016)
38. Stathakopoulou, C., David, T., Vukolic, M.: Mir-BFT: High-Throughput BFT for Blockchains, June 2019
39. Syta, E., et al.: Scalable bias-resistant distributed randomness. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 444–460 (2017)
40. Wesolowski, B.: Efficient verifiable delay functions (2018). https://eprint.iacr.org/2018/623
41. Zheng, Z., Xie, S., Dai, H.-N., Wang, H.: Blockchain challenges and opportunities: a survey. Int. J. Web Grid Serv. 1, 1–25 (2016)

An Exploration of Blockchain in Social Networking Applications

Rituparna Bhattacharya, Martin White(B), and Natalia Beloff

University of Sussex, Falmer, Brighton, UK
{rb308,m.white,n.beloff}@sussex.ac.uk

Abstract. Human beings have an innate tendency to socialize, and the Internet gives us the opportunity to overcome geographical barriers and communicate with people located even on the opposite side of the world. Computers, which started the journey of digital communication with simple emails and bulletin board systems in the eighties of the previous century, have manifested their potential for connecting people across the globe with the evolution of social networking. Social networking has emerged as an effective tool for bringing people of similar disposition or needs, whether professional or personal, onto the same platform, enabling interaction and sharing between peers. However, these services operate in a centralized environment. Blockchain, on the other hand, is revolutionizing the process of sharing things digitally between peers in a decentralized and secure fashion. In this paper, we investigate how blockchain can be applied to social networking services and review some of the currently existing social networking applications that utilize blockchain. We then propose our social networking application, e-Mudra, built on the principles of blockchain technology, which facilitates peer-to-peer exchange of leftover foreign currency from international travel.

Keywords: Blockchain · Leftover foreign currency exchange · Peer to peer communication · Social networking

1 Purpose

Social networking can be envisaged as having reshaped the way people interact with each other and exchange content and information through the Internet with peers dispersed across the globe. Facebook, Twitter and Instagram are names popular amongst users situated in even the remotest corners. Technologists have already started exploring the possibilities of utilizing blockchain in the avenues of social networking; Obsidian Messenger, Nexus, Indorse, Synereo, Steemit and Akasha are some instances of the endeavors made in this context. In this paper, we study the impact of blockchain on social networking applications in general, and, in particular, how the notions of blockchain and social networking can be applied together to solve the problem of exchanging leftover foreign currencies from an international trip profitably and conveniently, by presenting our e-Mudra platform.



This paper is organized as follows. In Sect. 2, we give an overview of blockchain technology; we then discuss social networking applications and their challenges; next we describe how blockchain can benefit social networking applications and review existing blockchain-based social networking applications. In Sect. 3, we introduce the challenge of exchanging leftover cash foreign currency from international trips, which can be alleviated using blockchain and social networking; in Sect. 4, we outline our proposed e-Mudra application; in Sect. 5, we include the results of the study; and finally we conclude in Sect. 6.

2 Background

In this section, we briefly discuss blockchain technology and then focus on social networking sites and the shared economy. We then explicate the impact of blockchain on social networking, the blockchain-based social networking applications currently available, and the problem associated with exchanging leftover cash foreign currency after an international tour.

2.1 A Brief About Blockchain

One of the cynosures of current disruptive technologies in the computing world that has attracted much attention from both academia and industry is blockchain. Devised by Satoshi Nakamoto more than a decade ago, blockchain is the fundamental pillar on which Bitcoin, the forerunner of an era of cryptocurrencies, has been built [1]. Prior to the invention of blockchain, peer to peer transactions of digital currencies following a decentralized pattern suffered from the issues of double spending and the Byzantine Generals' problem. Unlike cash payments between peers, digital transactions can be forged easily, and trust has to be enforced by an intermediary, that is, a centralized system such as a bank, but this requires service fees charged by those institutions. In a decentralized digital network with thousands of members dispersed worldwide who are unknown to each other, it is difficult to reach an agreement ensuring the validity of the transactions they share. Blockchain resolves both these issues. It maintains a transparent, immutable, shared distributed ledger, which is a sequentially threaded chain of blocks containing validated transactions, across the decentralized network. Blockchain nodes regularly reach a consensus about the authenticity of new transactions collected in blocks by solving computationally expensive cryptographic hashes, in other words "Proof of Work", and the fastest node to solve the puzzle is rewarded with an incentive. Over recent years, blockchain has been leveraged in various areas, most commonly concerning cryptocurrencies and financial services; of course, many other application areas exist, such as asset management, records management, identification, attestation, charity, government, education and health, to name only a few.

2.2 Social Networking Sites (SNS) and Shared Economy

The history of the Web dates back to 1989 with the proposal of the World Wide Web, followed by the first successful communication between a web browser and server a year later, thus marking the beginning of the age of the Web.


By 1999, there were around 3 million websites, but these were mostly static, read-only sites [2]. This generation of the web, or Web 1.0, saw unidirectional information flow from producer to consumer via search engines and the most basic form of online shopping carts [3]. The need to engage and contribute actively led to the emergence of Web 2.0, the read-write-publish Internet that enabled users to share, collaborate and create content through social media and blogging. This age witnessed the birth of SNS such as Facebook and Wikipedia. Web 3.0, consisting of semantic markup and web services, facilitated machine-machine interaction, while Web 4.0 is envisaged as the Internet of Things (IoT). We have reached what "is also known as the symbiotic web. The goal of the symbiotic web is interaction between humans and machines in symbiosis. The line between human and device will blur. Web 4.0 will interact with users in the same way that humans communicate with each other. A Web 4.0 environ will be an "always on," connected world. Users will be able to meet and interact inter-spatially on the web through the use of avatars" [2].

This gradual overall evolution of the Web caused much progression of the social web; the enormous development of technologies, particularly mobile devices, paved the way for SNS to become pervasive. Tim Berners-Lee's vision of the web as a collective intelligence is turning out to be true when considered in the light of the advancements of SNS [4]. Some instances of the most popular SNS other than Facebook and Twitter are Wikipedia, which "harnesses collective intelligence to co-create content"; Flickr and Instagram, which enable free photo sharing; YouTube, which enables free video sharing; Delicious, which facilitates discovering, storing and sharing web bookmarks; Digg, which helps in sharing opinions on digital content and visualizing the rank of that content; LinkedIn, which connects people to form professional networks; and blogs for reading, writing, sharing and discussing users' thoughts and opinions. In SNS, therefore, the content is not produced or consumed by the platform owners, but by the users of such platforms.

Thus emerged the concept of the shared economy, where the platform owners do not themselves own the resources shared through the SNS; rather, they only manage or govern the functioning of such platforms, which bring buyers and sellers under one roof. Uber and AirBnB are some of the notable examples of such applications that empowered people to monetize their assets by sharing them with peers through a synergetic platform. The platforms earn revenue from the interaction between the collaborating peers. However, various factors such as legalities, taxes, security and insurance need to be considered thoroughly before the shared economy can become more mainstream.

SNSs are not free from security risks and privacy threats. The massive amount of personal data shared on SNSs exposes users to Internet threats like "identity theft, spamming, phishing, online predators, Internet fraud, and other cybercriminal attacks" [5]. Of the various security and privacy issues discussed in [5], Rathore et al. discuss an interesting form of hazard, the Sybil attack, where attackers generate a large number of fake identities that allow them to gain advantages in a distributed, peer to peer system. SNSs may suffer from Sybil attacks as they incorporate a plethora of users who join as peers in a peer to peer network. In such environments, a single online entity can have multiple fake identities, with which attackers can outvote legitimate users, "reduce reputation values, corrupt information", or vote an SNS account as the "best", increasing that account's reputation and popularity.

2.3 Impact of Blockchain on Social Networking

As blockchain, built upon provenance, immutability and distributed consensus, is reshaping the decentralized peer to peer communication process, some technologists are of the opinion that it can resolve the current challenges of SNSs [6]. In the absence of any centralized body governing users' activities, users can benefit from greater security with blockchain. Raviv [8] highlights some of the advantages of using blockchain for social networking.

The User is not the "Product". Most SNSs utilize user information, including text, images and videos, available on their platform for advertising and marketing purposes. In a blockchain-based social network, in the absence of any platform owner, the nodes participating in the validation process can receive rewards from the cryptocurrency bolstering such platforms, so user-generated content may not be employed for analytics or marketing.

Enhanced Authority of Users over their Content. In the absence of centralized administration, there is no single entity controlling the content produced by users.

Better Security. Though the major SNSs have end-to-end encryption, the metadata exchanged with the messages can be eavesdropped on by third parties. Blockchain's decentralized and distributed ledger technology, in addition to securing user data through encryption, enhances users' privacy and anonymity [7].

Censorship Resistant Applications. Blockchain-based SNSs ensure secure authentication as well as user anonymity, which is of great importance in censorship-critical applications. "Platforms like Obsidian provide a way for users to circumvent the censors and to avoid surveillance" [8]. The decentralized library of Alexandria is yet another example of dissemination of art and information without censorship or ads [13].

Support for Payment Mechanism. While traditional SNSs like Facebook support financial transactions, blockchain has the potential to revolutionize the shared economy in a peer to peer ecosystem. Blockchain was invented to support the generation of cryptocurrency, and hence the exchange of tokens or coins in a peer to peer social network is inherently accommodated by blockchain. Smart contracts can assist users in performing deals through contracts signed and executed cryptographically.

Crowdfunding Possibilities. Blockchain-based SNSs can help users in raising money through crowdfunding. Such networks can also expedite crowd-sales.


2.4 Existent Blockchain-Based Social Networking Applications

Irrespective of the issues persisting with blockchain technology, such as scalability, blockchain-based social networking and social media applications are multiplying. Researchers such as Qin et al. are already investigating the possibilities of utilizing blockchain in academic social networks, supporting "irrevocable peer review records, traceability of reputation building and appropriate incentives for content contribution" [11]. Yet another example of blockchain-based social media is Ushare, which allows users to control, trace and claim ownership of the content they share over a decentralized, peer to peer, secure, anonymous and traceable content distribution network [12]. Here, we discuss some of the blockchain-based SNSs.

Sapien. This is a decentralized, reputation-based social news platform that takes into consideration four fundamental values: Democracy, Privacy, Freedom of Speech and Customizability. It has its own token, SPN, and rewards digital content creators without any central authority. The platform provides users with the flexibility to participate anonymously or use their real identity [10]. Sapien supports public or private communities called branches, pertaining to specific topics or themes. Users can participate in one or more branches and have a reputation for each of those branches signifying their importance. It uses a Proof of Value protocol. In essence, SPN supports the collaborative distinguishing of high-quality content across the Sapien network, and this adds value to SPN. The protocol will bring forward quality content and reward users accordingly. Users' content is evaluated across the network, generating a score that designates the user's reputation, thus reflecting her domain-specific skill within her communities. Thus, "within Sapien, reputation will mitigate trolling and reduce the spread of fake news" [10]. Sapien also enhances interaction among users by providing facilities such as friend addition, group creation, post sharing, comment writing, and text and voice chat. Online privacy is protected through encrypted chat conversations. Sapien follows a hybrid centralized-decentralized web application model during the initial stages, with the social platform being centralized while all SPN transactions are decentralized and handled by the Ethereum blockchain.

Steemit. Based on a decentralized network, Steemit aims to build communities and promote social interaction, incentivizing users with cryptocurrency rewards for their posts depending on their upvote count. Some of the services made available to the users of the Steem community are:

• Access to organized news and commentary
• Access to suitable answers to customized questions
• "A stable cryptocurrency pegged to the US dollar"
• Payments without charges [9]

It has an internal cryptocurrency token called STEEM that supports one-STEEM, one-vote. So, the greatest contributing users, determined by their account balance, will have the strongest say about how contributions are scored. Users have a "financial incentive to vote in a way that maximizes the long-term value of their STEEM" [9]. It supports micropayments for users' services. Users can upvote or downvote content, and rewards for contributors are decided based on users' votes. The system promotes the use of blockchain for censorship-resistant applications: no individual can censor content which the STEEM holders have valued.

Indorse. Based on the Ethereum blockchain, this is similar to LinkedIn. It has its own economy and currency. Users can participate in content creation while reserving their rights, make profiles, connect with other users and receive payments for their participation. Indorse supports a "skills economy", leveraging Indorse Rewards, the Indorse Score reputation system and the payment of tokens to users for their contributions to constructing the platform. User motivation is further enhanced through improved security and privacy for users' data and information. It also leverages smart contracts "and pairs them with payment channels" [7].

Obsidian Messenger. This enables data storage in a decentralized system utilizing a Proof of Stake consensus protocol and does not use the data for analytics or advertising. End-to-end encryption following a 256-bit encryption algorithm is applied to messages, files, pictures, videos and payments. The nodes receive cryptocurrency to enforce the reliability, resiliency and availability of the network. User accounts are not required and "the address will never have information that links to phone numbers, email or other accounts" [7].

Nexus. This supports charity, crowdfunding, transactions in a marketplace, purchasing ad space and other operations through cryptographically-signed and executed contracts. As a blockchain-based system, this application has an internal currency, thus bypassing the need for external payment mechanisms. It takes into consideration the requirement for secure authentication and anonymity in restrictive domains [7].

Others. There are many more SNS applications that utilize blockchain, such as Sola, onG.Social, PROPS Project, Synereo, Golos, Akasha, Alexandria and SocialX.

3 The Challenge of Cash Currency Exchange

Now that we have seen how blockchain can contribute to building SNS and shared economy applications efficiently, we study a particular case: exchanging cash currency left over from foreign trips. We explore how blockchain and social networking can be utilized together to solve the issue of cash-based leftover foreign currency exchange at the end of an international trip.

Almost every traveler possesses some amount of cash in foreign currencies on her return from an international trip, but there is no mechanism to dispose of these currencies conveniently and profitably. A huge amount of money, which can add up to billions globally, remains unused in most travelers' wallets and cupboard drawers at home. Exchange through banks or other financial institutions requires travelers to personally visit the institution to deposit the cash when the exchange rate is suitable, and hence is not convenient. Besides, the exchange rate is decided by the institution, which also charges additional fees, making it unattractive for exchanging small amounts, and not all institutions accept coins. Some institutions have started exchanging cash currencies via post or kiosks; however, they too suffer from centrally decided exchange rates, the traveler's onus of submitting the cash currencies, exchange rates based on the day the institution received the currencies, and limitations in terms of the list of currencies accepted for exchange. Recently, a few systems have emerged that allow peer to peer exchange of currencies, but they do not support the exchange of cash currencies. Therefore, to address these pertinent challenges, we devised e-Mudra, a blockchain-based application facilitating peer to peer currency exchange.

4 Method

We now describe e-Mudra, a blockchain-based social networking application connecting currency exchangers.

4.1 The Application Functionality

Our e-Mudra blockchain application maintains an online community of travelers who can exchange or transfer currencies among themselves. A traveler deposits cash foreign currency at a designated kiosk at the end of a foreign trip, and the amount is added to her currency balance in her multicurrency wallet. She can then publish an advertisement for foreign currency exchange with her preferred exchange rate based on her currency balances, and if another traveler finds her published advertisement suitable, that traveler can initiate an exchange. Alternatively, she can choose to exchange currency with a fellow traveler if she finds the latter's advertisement for currency exchange profitable and matching her requirement. Travelers can exchange currencies, transfer currencies to other peers, donate currencies to charitable institutions, buy gifts with the currency balances in their multicurrency accounts and send them to other peers at any time once they have submitted the cash at the kiosk. Every user has a username in addition to the public key that she can share with other peers in the network. Peers can transfer funds or send gifts to other peers using their usernames instead of the complex public keys.

4.2 Scenario

The user deposits her cash currency, say X, in a smart kiosk placed at a convenient location, such as an airport, before travelling back to her own country. The kiosk accepts the currency X and sends a receipt to her smartphone. The user later checks her multicurrency account in the e-Mudra application, which shows her the updated balance for currency X. She can later post an advertisement for a target currency, say Y, in return for part or all of the balance of currency X, at a chosen exchange rate. If any peer accepts the exchange at the user's published rate, the exchange takes place and the user is notified. Alternatively, instead of publishing her own rate, the user may select any of the rates advertised by other peers in the network requesting an amount of currency X in return for currency Y. Once exchanged to currency Y, the money can be sent to the user's external account for currency Y or collected from a kiosk that operates for currency Y.

4.3 Architecture

Essentially, the users of this system are travelers and they cannot be expected to perform block validation, unlike the Bitcoin blockchain network where any node can participate in the consensus mechanism. Therefore, architecturally, e-Mudra leverages a permissioned consortium blockchain, such that a selected set of nodes with known identities, owned by stakeholder organizations, performs block addition and validation. The stakeholder organization is likely to be the one that owns the currency deposit and withdrawal kiosks. The network comprises the following types of nodes, see Fig. 1.

• The user nodes or clients that perform transactions – these are the computer applications, i.e. the multicurrency wallets, which the travelers use to exchange funds, send money to peers, donate to charity or buy and send gifts.
• The gateway nodes that conform to the consensus protocol – these nodes, owned by the stakeholder organizations, perform the addition of blocks and the validation of transactions.

Fig. 1. The e-Mudra network

The gateway nodes from different organizations compete to find the next block and win rewards following the Proof of Work consensus protocol. The gateway nodes are connected to each other.

• The miners that assist the gateway nodes in block mining – each gateway node is connected to a set of miner nodes, which help the gateway node in the speedy discovery of the next block by providing a faster solution of the cryptographic puzzle.
• The kiosk nodes that support cash deposit or withdrawal transactions – these are the kiosks where users can deposit or withdraw foreign currencies.

The gateway nodes, miners and kiosk nodes will be managed by cross-country organizations regulating the network. A kiosk is connected to all the gateway nodes. When a traveler deposits or withdraws money at a kiosk, the corresponding transaction is broadcast to all the connected gateway nodes. Similarly, when a traveler exchanges or transfers money through her multicurrency wallet, the corresponding transaction is broadcast to all the connected gateway nodes. The gateway nodes collect the transactions and gather them in a transaction pool. When the number of transactions in the pool exceeds a threshold value, the gateway nodes collect the transactions from the pool and conduct mining. When a gateway node solves the cryptographic puzzle and finds a new block, it broadcasts the block to the other gateway nodes as well as to its helper miners. When a gateway node receives a new transaction, it sends the transaction to all its miners. If any of the miners finds a new block before the connected gateway node, it sends the block to its connected gateway node, which broadcasts it to its other helper miners as well as to the other gateway nodes in the network; the gateway nodes and their miners then check whether the block is valid and, if so, add it to their local blockchain. If the gateway node finds a block before any of its helper miners, it forwards the block to its helper miners as well as to the other gateway nodes in the network, which check the validity of the new block and, if it is valid, add it to their local blockchain. When a gateway node or a miner finds a new block, it receives 100 mudras, the internal cryptocurrency of the system. All transactions are recorded on a shared distributed ledger such that any gateway node can read them.

A username is generated at the time of user registration and mapped against the user's public key. A traveler must publish an advertisement or select from a list of advertisements to exchange currencies. When a traveler publishes an advertisement, it is broadcast from the traveler's multicurrency wallet to all the gateway nodes; its default status is Active. When two travelers exchange currency, the corresponding advertisement status is updated to Expired. The advertisements are stored by the gateway nodes, which manage the list of advertisements published by the travelers in the network.
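The following sketch is a highly simplified illustration of this gateway-node behaviour, not the e-Mudra code base: the pool threshold is an assumed value, the Proof of Work puzzle itself is omitted, and the broadcast steps are indicated only by comments.

# Simplified gateway node: pool transactions, mine a block once the pool
# exceeds a threshold, and credit the block reward in mudras.
import hashlib, json, time

POOL_THRESHOLD = 5   # assumed value; the paper does not fix the threshold
BLOCK_REWARD = 100   # mudras credited to the node that finds the block

class GatewayNode:
    def __init__(self):
        self.pool = []
        self.chain = []

    def receive_transaction(self, tx):
        self.pool.append(tx)              # also forwarded to the helper miners
        if len(self.pool) >= POOL_THRESHOLD:
            return self.mine_block()
        return None

    def mine_block(self):
        block = {
            "index": len(self.chain),
            "timestamp": time.time(),
            "transactions": self.pool,
            "prev_hash": self.chain[-1]["hash"] if self.chain else "0" * 64,
            "reward": BLOCK_REWARD,
        }
        block["hash"] = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
        self.chain.append(block)          # would be broadcast to gateways/miners
        self.pool = []
        return block

node = GatewayNode()
for i in range(5):
    node.receive_transaction({"from": "alice", "to": "bob", "amount": i})
print(len(node.chain))  # one block mined after the threshold is reached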

5 Results

The differences between this application and existing systems are:

• The traveler's leftover foreign currency does not need to be exchanged on the same day the currency is deposited or received by the governing institution(s).


• The user can exchange currency on any day after submission of the cash currency, whenever she agrees on a profitable rate.
• The exchange rate is not centrally decided but only regulated to prevent money laundering; the user selects the preferred rate.
• Any currency of any denomination is acceptable as long as the user deposits it in the kiosk before leaving the respective country.
• Users can make good use of even low-value cash foreign currencies not suitable for utilization through other mediums except charity.
• There are no exchange or transfer fees, as the application aims to make the exchange or transfer of small amounts of money profitable; however, there will be a subscription fee to use the application.

It is beyond the scope of this paper to detail the code base; however, the proof-of-concept code base is available on GitHub on request.

6 Conclusion

Blockchain is a revolutionary, disruptive technology that has much to offer in the domain of social networking. This paper has presented a brief overview of how social networking applications can benefit from blockchain and cited some relevant examples to corroborate this view. It further demonstrated a particular problem, cash leftover foreign currency exchange, being addressed by a blockchain-based social networking platform, e-Mudra, that we proposed. While a system such as e-Mudra is complex and involves other innovative concepts like smart kiosks and the Internet of Things, blockchain plays a fundamental role in enabling cash leftover foreign currency exchange profitably and conveniently. The current proof-of-concept Java code base is available for collaboration on request to the principal author.

6.1 Future Work

Currently, the application prototype supports the transfer and exchange of currencies; donation can also be performed through the transfer functionality. However, buying and sending gifts is not supported by the prototype at the moment, and kiosk functionality is simulated. To transfer money, a user needs to know the username of the recipient, which can be communicated directly by the recipient to the user. To add flavors of SNS, provision to add known users to a peer's contact list can be supported. The user can then add users whose usernames have been shared with her directly to her contact list, and later update or delete these entries. She can send or exchange money with the users in her contact list and also buy gifts for them, in which case the application will automatically fetch the recipient's details, such as the delivery address, from the system, so the sender will not need to provide them.

References

1. Nakamoto, S.: Bitcoin: A Peer-to-Peer Electronic Cash System. www.bitcoin.org: https://bitcoin.org/bitcoin.pdf. Accessed 7 August 2018


2. Letts, S.: What is Web 4.0? stephenletts.wordpress.com: https://stephenletts.wordpress.com/web-4-0/. Accessed 7 September 2018
3. Web 1.0 vs Web 2.0 vs Web 3.0 vs Web 4.0 vs Web 5.0 – A bird's eye on the evolution and definition. flatworldbusiness.wordpress.com: https://flatworldbusiness.wordpress.com/flat-education/previously/web-1-0-vs-web-2-0-vs-web-3-0-a-bird-eye-on-the-definition/. Accessed 7 September 2018
4. Rana, J., Kristiansson, J., Hallberg, J., Synnes, K.: Challenges for mobile social networking applications. In: Mehmood, R., Cerqueira, E., Piesiewicz, R., Chlamtac, I. (eds.) Communications Infrastructure. Systems and Applications in Europe. EuropeComm 2009. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 16. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-11284-3_28
5. Rathore, S., Sharma, P.K.: Social network security: issues, challenges, threats, and solutions. Inf. Sci. 421, 43–69 (2017)
6. Social networking over blockchain, 29 March 2018. https://www.scalablockchain.com: https://www.scalablockchain.com/blog/social-networking-over-blockchain/. Accessed 7 October 2018
7. Kariuki, D.: Five Top Blockchain-Based Social Networks, 12 September 2017. https://www.cryptomorrow.com: https://www.cryptomorrow.com/2017/12/09/blockchain-based-socialmedia/. Accessed 7 October 2018
8. Raviv, P.: 6 Ways the Blockchain is Revitalizing Social Networking, 1 May 2018. https://cryptopotato.com: https://cryptopotato.com/6-ways-blockchain-revitalizingsocial-networking/. Accessed 7 October 2018
9. Steemit (June 2018). https://steemit.com/: https://steem.io/SteemWhitePaper.pdf. Accessed 21 July 2018
10. Ankit Bhatia, R.G.: Sapien: Decentralized Social News Platform (March 2018). https://www.sapien.network: https://www.sapien.network/static/pdf/SPNv1_3.pdf. Accessed 12 July 2018
11. Qin, D., Wang, C., Jiang, Y.: RPchain: a blockchain-based academic social networking service for credible reputation building. In: Chen, S., Wang, H., Zhang, L.J. (eds.) Blockchain – ICBC 2018. ICBC 2018. Lecture Notes in Computer Science, vol. 10974. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-94478-4_13
12. Antorweep Chakravorty, C.R.: Ushare: user controlled social media based on blockchain. In: ACM IMCOM, Beppu, Japan (2017)
13. Alexandria. https://www.alexandria.io: https://www.alexandria.io/learn/#learn1. Accessed 20 July 2018

Real and Virtual Token Economy Applied to Games: A Comparative Study Between Cryptocurrencies

Isabela Ruiz Roque da Silva(B) and Nizam Omar

Mackenzie Presbyterian University, São Paulo, Brazil

Abstract. Monetization is a way to earn money from products, software or services. Many video game companies use monetization to sell items, bonuses or even in-game money to customize the user experience or complete quests more easily in their games. Even though cryptocurrencies are becoming more and more popular, few games use cryptocurrencies as a real form of monetization, which means that there is potential for their application. The purpose of this paper is to compare the most used cryptocurrencies for games based on some important characteristics, discuss which is the most suitable cryptocurrency for gaming platforms, and propose an architecture that gaming companies can use inside their games with cryptocurrencies.

Keywords: Cryptocurrency · Blockchain in games · Game monetization

1 Introduction

According to Mishkin (2015), monetization is the name given to the process of exchanging anything: a product, service or software, for a specific value that can be used to legally buy and sell goods. In the universe of video games, monetization is used to sell games and additional content to enrich the customized user experience [5]. Every gamer has a different experience, making it necessary to adapt the game to the most suitable monetization according to the user profile. More and more games are using custom monetization to generate revenue and shape the way the player interacts with the game [3].

Since the beginning of Massively Multiplayer Online games (MMOs), the idea of virtual coins has been discussed, for example in the game Everquest from Verant Interactive. Inside Everquest, the users populated a world named Norrath, where they could create an avatar and live their virtual lives using the game's currency; Castronova (2001) studied in detail the social game mechanics and the macroeconomy of buying and selling items inside and outside the game (the latter being illegal from the company's perspective). He concluded that inside the game there was a law of supply and demand on rare items: when a rare item was in high demand, its price rose in the avatar-to-avatar market [4].


Based on this context, cryptocurrencies can co-exist with electronic games, because using them makes transactions faster and trackable. The idea of this paper is to present in detail how games monetize their activities nowadays and to propose an architecture explaining how cryptocurrencies could be implemented in well-known games to increase security, tracking and the monetization of anything inside the game. In addition, we discuss some cryptocurrencies that are being used in games, based on their technology and market capitalization, comparing their similarities and differences. In Sect. 2, concepts and related studies about monetization in games are presented; in Sect. 3, some concepts about crypto games and applications are raised; in Sect. 4, the authors present a comparative study of the cryptocurrencies gathered from the surveyed crypto games; and in the last section, the architecture for cryptocurrencies and games is proposed.

2 Monetization

At the beginning of gaming history, games were created by laboratories or computer scientists at universities. Then arcade games came to life, where players could exchange money for coins to play on machines. Games were later sold on floppy disks at small, hardware-specific stores; after some years of evolution, video game consoles were created, for which games were sold as physical copies [7], still the most used method to monetize games today.

Monetization in games is one of the most important phases of designing a video game. It is becoming more and more important for publishers to earn revenue from their games; most of them have been using a "free-to-play" approach where the users can download the game without paying anything, except if they want to customize their gaming experience by buying custom items, energy, weapons, etc. [6]: Plants vs Zombies 2 allows the users to buy plant types to improve the gameplay and hence the score; in Kingdom Rush Frontiers, to level up easily, the user can buy power-ups to increase the character's health and explode enemies [6]. Usually, the player does not buy these goods using cash or some kind of currency directly; the game has a proxy that connects to a third-party financial company where the user can buy them using credit cards or another form of exchange [6], and that is where cryptocurrencies could help the publishers. Despite mostly being electronic, some games sell physical goods to customize characters and improve gameplay mechanics, like the game Disney Infinity, which sells physical toys representing new characters and environments [6].

Besides the "free-to-play" approach and the retail method, there are many other ways of classifying monetization in this industry. David Perry, former CEO of a cloud gaming platform and Chief Creative Officer of Acclaim Games, mentioned 29 business models for game monetization in 2008 at the Social Gaming Summit, based on the book "E-commerce: Business, Technology and Society" by Kenneth Laudon and Carol Traver [1, 2]; Tim Fields and Brandon Cotton, in their book "Social Game Design", explain many ways to generate revenue from games using three basic models: the Classic Download Model, the Signature Model and the Freemium Model [7].


Scott Rogers presented eight models of monetization: Trial; Freemium; Free-to-Play (F2P); Downloadable Content (DLC); Season Pass DLC; Membership; Premium and Subscription [6]. Basically, all the forms presented by different authors aim to earn money from different types of games and can be summarized into major types: Retail Purchase, In-game Microtransactions, Digital Download, Subscription Model and Indirect Monetization [8]. In this paper, five types of models are presented: Trial Model, Retail Purchase, Digital Download, Subscription Model and Freemium Model.

In the Trial Model, the publisher launches a version of the game that is not complete; it is a sneak peek of how the entire game will be, including how the mechanics will work, the visuals of the characters and the game's world. Players download it and can try the game without paying anything, and the publisher hopes they will pay for the full version [6]. This is a common way to attract players on consoles like Playstation [11] and Xbox [12]. For example, the games Final Fantasy VII Remake and Far Cry 4 allow players to play the game for a limited time (demo play).

Retail Purchase Monetization is the most conventional way of monetizing games: the users pay for the physical copy of the game they want, like a CD, and are ready to play the game [8]. It is a model that attracts fewer people each year because the digital versions of games are sometimes more affordable. That change of mind can be seen in the release of the new Playstation 5, where the player can choose between buying a more expensive version of the console with a CD input or a lower-priced version only for digital downloads.

In the Digital Download Model or Classic Download Model, the publisher runs an advertising campaign about the game that is being created, captivating and engaging the players to download the game and play it. Some examples are the mobile stores for phones and tablets [10, 14]; the Steam website [15], containing plenty of games with different characteristics that diversify the target audience; and the digital stores of consoles like Sony Playstation [11], Microsoft Xbox [12] and Nintendo Switch [13].

In the Subscription Model, there is a concept of game time. Instead of buying the entire game and additional packages as soon as the publisher launches some new content, the users pay a monthly, quarterly or annual subscription, or pay only for the time they will play [7]. The majority of games that use this kind of model are MMORPGs (Massively Multiplayer Online Role-Playing Games), like World of Warcraft from Blizzard, launched in November 2004, which can use the Freemium Model (to download only the base game) and the Subscription Model for buying a specific period of time. Besides, the players can also buy mounts and mascots to customize the gaming experience and help them inside the game; Figs. 1 and 2 show these two cases. Despite being an interesting way of monetizing a game, the real challenge of using it is to keep the player interested in the content of the game as time goes by.


Fig. 1. Monetization using game time in the World of Warcraft game.

Fig. 2. Monetization using the subscription model in the World of Warcraft game.


In the Freemium Model or Free-to-Play, the users do not pay for downloading the game [6, 7]. One of the greatest examples of this kind of model is the game FarmVille from the Zynga company. The players could play for free, but could speed up some boring tasks that needed friends to complete or time to wait. The publisher encourages the players to pay for time, virtual goods, locked content and DLCs (Downloadable Content). The model is becoming more and more popular, especially in mobile games. Despite the models being completely distinct, some games can apply more than one monetization model to specific parts [6, 7]. The World of Warcraft game is a fine example: beyond the subscription monetization, the player can also buy mounts and pets.

In 2009, Facebook created a system called Facebook Credits so that all the games inside its platform could use a common currency, with the idea of earning a percentage of the revenue of the games and also increasing the security of users' transactions. A few developers were against the idea due to Facebook's fee for doing so: almost 30% of the transaction [7]; on the other hand, the community as a whole accepted the idea well. One of the benefits of Facebook's system was the ability to give friends a gift of credits to spend in any game they want [7]. After the creation of blockchain and the cryptocurrencies, some games using blockchain technology appeared and attracted attention to how this type of technology can impact the games industry. The next section will discuss this further.

3 Crypto Games

From the very beginning of the history of cryptocurrencies, there have been ideas about how to use them in games. Some examples are Dragon's Tale, MinecraftCC and Gambit. In 2010, the most famous indie game to use Bitcoin was Dragon's Tale: a mix of MMO and casino, with several activities based on four categories: Luck, Skill, PvP and Tournament [17]. For example, the Luck category has an activity called Palace Garden, in which players dig eight holes in the Chinese imperial garden; each hole may or may not contain Bitcoins, and finding two or three holes with Bitcoins increases the payout. In 2012 the first Minecraft server to use Bitcoin, the so-called MinecraftCC, was created, in which players earned fractions of Bitcoins in exchange for building blocks and killing monsters. Despite community support and advertising to maintain the project, it was discontinued in 2016. Later, in 2013, Gambit appeared, in which players could bet against each other in classic board or card games using Bitcoin; it functioned as a form of virtual casino. As can be seen, in the early history of cryptocurrencies almost all gaming applications were some kind of virtual casino or gambling application. Following this trend, a cryptocurrency with this goal appeared in 2014: DigiByte, which runs on top of blockchain technology [16]. The idea was to establish a connection between digital games and the tokens generated by the cryptocurrency. Even with all the effort to create this environment, projects using DigiByte were paused in 2017.


A year after the creation of the DigiByte cryptocurrency, Blizzard created an internal World of Warcraft game token that lets players exchange real money for virtual gold they can spend or sell at the auction house. Security was also improved: a player who bought tokens at the auction house could only use them after thirty days, which helped prevent fraudulent accounts [18]. Looking at the potential of game monetization and the applications above, some companies have conducted research combining cryptocurrencies/blockchain with games, as can be seen in Table 1. The Blockchain Game Alliance was founded in 2018 with the main aim of spreading crypto games and exploring the creation of games with distributed ledger technologies.

Table 1. Some cryptocurrency patents in the gaming industry. Taken from Google Patents (https://patents.google.com/).

Patent number    | Patent name                                                         | Publication year
TW201922325A     | Blockchain gaming system                                            | 2018
US20190303960A1  | System and method for cryptocurrency generation and distribution   | 2019
US20180114405A1  | Digital currency in a gaming system                                 | 2019
US20190122495A1  | Online gaming platform integrated with multiple virtual currencies  | 2019
US20180197172A1  | Decentralized competitive arbitration using digital ledgering       | 2018
US9997023B2      | System and method of managing user accounts to track outcomes of real world wagers revealed to users | 2019

Out of this history of gaming applications using cryptocurrency, Crypto Games were born. Crypto Games is the name given to every game that uses a distributed ledger to operate the game and a cryptocurrency for exchanging items or characters for money. There are many benefits of using blockchain in games, which researchers have already identified [19, 20]: the rules of blockchain games are transparent, so everyone can see what the game is about and what the player can or cannot do; it guarantees ownership of items, characters or any other element the player owns inside the game [19]; and, with this guarantee, the owner can reuse these elements in other games on the same blockchain, such as CryptoKitties and KittyRace [19]. KittyRace reuses some elements of the CryptoKitties game, so you can play both games with the same account [19].


Although using blockchain to design games is a good strategy, [19] raised some restrictions on the visual design of such games due to technical limitations. It is a promising application, but large companies still have an advantage in this respect.

4 Cryptocurrencies in Games

Nowadays there is a large variety of crypto games on the internet for everybody who wants to play, ranging from gambling games to RPGs (Role-Playing Games) [20]. Exploring the cryptocurrencies behind the most popular crypto games to date, the authors present Table 2 with some of these games and the cryptocurrencies they use as a base for transactions.

Table 2. Popular crypto games and their respective cryptocurrency.

Name of the game | Cryptocurrency
CryptoKitties    | Ethereum
KittyRace        | Ethereum
Satoshi dice     | Bitcoin
TronBet          | TRON
EOS knights      | EOS
0xUniverse       | Ethereum

As seen in the table above, there are four main cryptocurrencies used in games: Ethereum, Bitcoin, TRON and EOS. Bitcoin was the first cryptocurrency created and is still used in various gambling games [17]. Although Bitcoin has the largest market capitalization, around $96 billion according to the Coinbase platform, Ethereum has a lot of potential, since the majority of games operate on top of it. One of the reasons why Ethereum (the second-largest cryptocurrency by market capitalization) is the most used in games is its smart contract functionality and the DApp infrastructure that makes it easy to build a game or an application on top of it. DApps are becoming popular in the blockchain world since anyone can build an app, with a user-facing interface written in any programming language, that talks to the server on which the blockchain is running [9]. Another advantage of Ethereum over Bitcoin is transaction time: Ethereum can handle more transactions per unit of time than Bitcoin, it supports arbitrary computation, and the creator of a DApp can define his own rules [24, 25]. Other cryptocurrencies such as TRON and EOS also use the DApp concept. An annual report by Dapp.com states that TRON is the second most used DApp platform, behind only Ethereum, which hosts the majority of gaming platforms [26]. Table 3 gives a brief summary of the DApp market in 2019 by blockchain, emphasizing the number of DApps on Ethereum.


Table 3. Dapp market summary in 2019 with Ethereum, EOS, Steem and TRON [26].

                      | All       | ETH       | EOS     | Steem   | TRON
Total number of dapps | 2,989     | 1,822     | 493     | 92      | 520
Active dapps          | 2,217     | 1,129     | 479     | 80      | 482
Active users          | 3,117,086 | 1,427,093 | 518,884 | 120,560 | 967,775
Transactions          | 3.26B     | 24.52M    | 2.81B   | 85.72M  | 290.28M

Analysing this data and the proposals in each cryptocurrency's whitepaper [27–30], the authors come up with four important characteristics for a gaming cryptocurrency: Transaction Fees (whether the cryptocurrency charges fees to transact); Smart Contracts (whether it is possible to implement smart contracts to make the game smarter); Scalability (the ability of the cryptocurrency to scale in terms of the number of transactions); and Performance (whether the decentralized network bears many transactions per unit of time). The results are shown in Table 4.

Table 4. Comparison between the main characteristics raised by the authors and each cryptocurrency (an X indicates that the characteristic applies).

Cryptocurrency | Transaction Fees | Smart Contract | Scalable | Performance
Bitcoin        | X                |                |          |
Ethereum       | X                | X              |          |
TRON           | X                | X              | X        | X
EOS            |                  | X              | X        | X

From the scenario raised above, the cryptocurrency most suitable for games is EOS. It has no fees to play, so the player does not need to transfer extra funds to an account to make small transactions, and it scales and performs better than the others, since it uses a Proof-of-Stake-based consensus algorithm, in contrast to the Proof-of-Work used by Bitcoin [27] and Ethereum [29] and the Delegated Proof-of-Stake used by TRON [28]. Besides EOS, TRON is another candidate for gaming, as its number of DApps grew in 2019. The number of active DApps on EOS is almost the same as on TRON; this comes from the fact that these two cryptocurrencies are mostly used for casino platforms that involve gambling. Table 5 illustrates the daily market dominance by category. Even though Ethereum is not fully scalable, it holds the majority of the dominance in the gaming category (28%), followed by EOS (15%), TRON (12%) and Steem (12%).


Table 5. Daily market dominance by category [26].

          | All (%) | ETH (%) | EOS (%) | Steem (%) | TRON (%)
Game      | 22      | 28      | 15      | 12        | 12
Gambling  | 27      | 20      | 56      | 7         | 37
High-risk | 21      | 24      | 4       | 0         | 38
Exchange  | 4       | 5       | 8       | 3         | 4
Finance   | 4       | 5       | 2       | 3         | 1
Social    | 6       | 3       | 3       | 52        | 2
Art       | 2       | 3       | 1       | 1         | 0
Tools     | 7       | 5       | 4       | 18        | 2
Others    | 8       | 7       | 10      | 4         | 4

Since Ethereum came before the others, this may be why developers prefer using it, out of security concerns.

5 Proposed Architecture

This section presents some problems that cryptocurrencies could solve in games, along with an explanation of the proposed architecture and the strengths of using it. The video game industry is one of the biggest industries in the world. According to Newzoo, a research company focused on game market insights, the global games market generated about $159.3 billion in 2020 alone, a growth of 9.3% over the previous year [21]. Decentralized applications, blockchain technology and its cryptocurrencies represent a world where third-party companies have much less influence on markets. Instead of using services from such companies, like credit card processors, video game companies could benefit from blockchain and cryptocurrency services to ensure the security of players' money and personal data [19, 20], reducing dependence on third parties. There are many problems in the current financial structure of games. First, players need to entrust their credit card data or bank account to a third-party company, which could lead to vulnerabilities and information leaks if a hacker attacks the company, as has happened recently with some games [22, 23]. In 2020, hackers modified a game's database to generate items or virtual cash for their own accounts so they could sell them on a market [22]. During the coronavirus pandemic these attacks have become more frequent, since players are often online and do not suspect any attack. The attacks range from phishing to theft of sensitive information such as credit card data [23].


The architecture proposed in this paper could prevent the theft of credit card information, since the player does not need to enter any credit card details, only a cryptocurrency address. Figure 3 presents our proposed solution.

Fig. 3. Proposed architecture.

In the proposed solution, the publisher of the game only needs the blockchain to guarantee the ownership of items between players and to secure the players' game cash. The player can buy game cash using his cryptocurrency wallet address: he transfers funds from his wallet to the game's wallet. As pointed out in the previous section, cryptocurrencies such as EOS and TRON could be used in the game, due to their beneficial features and network performance. When the transaction is approved by the blockchain, the player has two options: use the new balance to buy or sell items, armor, etc. with other players, or simply play the game. He can also convert his game cash back into cryptocurrency sent to his wallet. This type of architecture can ensure security by using the blockchain to validate transactions: hackers will not be able to attack the monetary system or the trading system because of the blockchain's immutability. They will also not be able to steal the player's credit card information or his wallet balance, since cryptocurrencies use asymmetric cryptography, hashing and consensus algorithms to secure the wallet, the network and users' personal data.
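To make the flow in Fig. 3 concrete, the sketch below shows in Python how a game backend could credit game cash only after the corresponding cryptocurrency transfer to the game's wallet has been confirmed on chain. The ChainClient class, its method names and the conversion rate are illustrative assumptions for this sketch, not part of the proposal or of any real SDK:

from dataclasses import dataclass, field

@dataclass
class ChainClient:
    """Stand-in for a blockchain node client (e.g. an EOS/TRON SDK)."""
    confirmed_transfers: dict = field(default_factory=dict)  # tx_id -> (sender, receiver, amount)

    def get_transfer(self, tx_id: str):
        # A real client would query the node and check confirmations here.
        return self.confirmed_transfers.get(tx_id)

@dataclass
class GameBackend:
    chain: ChainClient
    game_wallet: str                                  # the game's on-chain address
    rate: float = 100.0                               # assumed game cash units per crypto unit
    balances: dict = field(default_factory=dict)      # player wallet -> game cash balance

    def credit_game_cash(self, player_wallet: str, tx_id: str) -> float:
        transfer = self.chain.get_transfer(tx_id)
        if transfer is None:
            raise ValueError("transaction not found or not yet confirmed")
        sender, receiver, amount = transfer
        if sender != player_wallet or receiver != self.game_wallet:
            raise ValueError("transfer does not match the player and game wallets")
        credited = amount * self.rate
        self.balances[player_wallet] = self.balances.get(player_wallet, 0.0) + credited
        return credited

# Example with a fake confirmed transfer of 2.5 crypto units.
chain = ChainClient(confirmed_transfers={"tx-123": ("player-addr", "game-addr", 2.5)})
backend = GameBackend(chain=chain, game_wallet="game-addr")
print(backend.credit_game_cash("player-addr", "tx-123"))  # 250.0 units of game cash

Converting game cash back to cryptocurrency would follow the reverse path: the backend debits the in-game balance and submits a transfer from the game's wallet to the player's wallet.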

6 Conclusions and Future Work

Since the beginning of the games industry, monetization has been the way of earning money from players. Each game uses a different type of monetization, or combines several types, to achieve a better game experience. In this paper the authors discussed the importance of crypto games and their benefits, as well as the different types of cryptocurrencies that could


be used in games, based on some important features. The proposed architecture demonstrates what could be done in the financial structure of a game to prevent hacker attacks and information theft; it guarantees the integrity of the whole system and could be used in any game, even ones that already exist. For future work, the most suitable types of games for blockchain will be analysed, other cryptocurrencies not covered in this article will be compared, and further research on DApp platforms for games will be conducted.

Acknowledgments. The authors thank Mackenzie Presbyterian Institute and MackPesquisa for the financial support to develop this research.

References 1. Perry, D.: 29 Business Models for Games (2008). https://lsvp.wordpress.com/2008/ 07/02/29-business-models-for-games/. Accessed 25 Sept 2020 2. Laudon, K., Traver, C.: E-Commerce: Business, Technology and Society. Pearson (2008) 3. King, D., Delfabbro, P.: Video game monetization (e.g., ‘loot boxes’): a blueprint for practical social responsibility measures. Int. J. Mental Health Addiction (2018) 4. Castronova, E.: Virtual Worlds: A First-hand Account of Market and Society on the Cyberian Frontier. Gruter Institute Working Papers on Law, Economics, and Evolutionary Biology, vol. 2, December 2001 5. Mishkin, F.S.: The Economics of Money, Banking and Financial Markets. Prentice Hall (2015) 6. Rogers, S.: Level UP! The Guide to Great Video Game Design, 2nd edn. Wiley (2014) 7. Fields, T., Cotton, B.: Social Game Design. Elsevier (2012) 8. Fields, T.: Mobile & Social Game Design: Monetization Methods and Mechanics, 2nd edn. CRC Press (2014) 9. Cai, W., Wang, Z., Ernst, J.B., Hong, Z., Feng, C., Leung, V.C.M.: Decentralized applications: the blockchain-empowered software system. IEEE Access 6, 53.019– 53.033 (2018) 10. Google: Google Play Store. 2020. Accessed 01 Feb 2020. https://play.google.com/ store/apps 11. Sony Playstation Store: Playstation Store (2020). Accessed 24 Sept 2020. https:// store.playstation.com/pt-br/home/games 12. Microsoft Xbox Store (2020): Xbox Store. https://www.xbox.com/pt-BR/games/ all-games?xr=shellnav. Accessed 24 Sept 2020 13. Nintendo Switch Store: Switch Store (2020). https://www.nintendo.com/games/ switch/. Accessed 24 Sept 2020 14. Store, Apple: Apple App Store. 2020. https://www.apple.com/ios/app-store/. Accessed 24 Sept 2020 15. Steam: Steam Store (2020). https://store.steampowered.com. Accessed 01 Feb 2020 16. DigiByte: Digibyte Global Blockchain (2014). https://www.digibyte.co/digibyteglobal-blockchain. Accessed 07 Oct 2019 17. eGENESIS. Gambit website (2010). http://www.dragons.tl/. Accessed 07 Oct 2019 18. Blizzard. World of warcraft token (2015). https://us.shop.battle.net/en-us/ product/world-of-warcraft-token. Accessed 07 Oct 2019


19. Min, T., Wang, H., Guo, Y., Cai, W.: Blockchain games: a survey, June 2019 20. Scholten, O., Hughes, N., Deterding, S., Drachen, A., Walker, J., Zendle, D.: Ethereum crypto-games: mechanics, prevalence and gambling similarities, October 2019 21. Zoo, N.: New Zoo - Key Numbers. 2020. https://newzoo.com/key-numbers/. Accessed 27 Sept 2020 22. Magazine, PC: Feds charge 5 chinese hackers for targeting video game companies. https://www.pcmag.com/news/feds-charge-5-chinese-hackers-for-targetingvideo-game-companies. Accessed 27 Sept 2020 23. Beat, Venture: Akamai: Cyberattacks against gamers spiked in the pandemic (2020). Accessed 27 Sept 2020. https://venturebeat.com/2020/09/23/akamaigame-industry-faced-more-than-10-billion-cyberattacks-in-past-two-years/ 24. Vujicic, D., et al: Blockchain technology, bitcoin, and Ethereum: a brief overview. In: 2018 17th International Symposium INFOTEH-JAHORINA (INFOTEH), pp. 1–6 (2018) 25. Rudlang, M.: Comparative analysis of bitcoin and ethereum. Master’s thesis, NTNU (2017) 26. Dapp, R.: 2019 Annual Dapp Market Report. https://www.dapp.com/article/ dapp-com-2019-annual-dapp-market-report. Accessed 01 Feb 2020 27. Nakamoto, S.: Bitcoin: A peer-to-peer electronic cash system. Cryptography Mailing list (2009). https://metzdowd.com 28. TRON.: Tron: Advanced decentralized blockchain platform. TRON Foundation, Technical report, December 2018 29. Buterin, V.: A next-generation smart contract and decentralized application platform. Technical report, May 2018 30. Grigg, I.: Eos - an introduction. Technical report, March 2018

Blockchain Smart Contracts Static Analysis for Software Assurance

Suzanna Schmeelk1(B), Bryan Rosado1, and Paul E. Black2

1 Computer Science, Mathematics and Science, St. John's University, New York, NY, USA

[email protected], [email protected]

2 National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA

[email protected]

Abstract. This paper examines blockchain smart contract software assurance through the lens of static analysis. Smart contracts are immutable. Once they are deployed, it is impossible to patch or redevelop the smart contracts on active chains. This paper explores specific blockchain smart contract bugs to further understand categories of vulnerabilities for bug detection prior to smart contract deployment. Specifically, this work focuses on smart contract concerns in Solidity v0.6.2 which are unchecked by static analysis tools. Solidity, influenced by C++, Python and JavaScript, is designed to target the Ethereum Virtual Machine (EVM). Many, if not all, of the warnings we categorize are currently neither integrated into Solidity static analysis tools nor earlier versions of the Solidity compiler itself. Thus, the prospective bug detection lies entirely on smart contract developers and the Solidity compiler to determine if contracts potentially qualify for bugs, concerns, issues, and vulnerabilities. We aggregate and categorize these known concerns into categories and build a model for integrating the checking of these categories into a static analysis tool engine. The static analysis engine could be employed prior to deployment to improve smart contract software assurance. Finally, we connect our fault categories with other tools to show that our introduced categories are not yet considered during static analysis. Keywords: Blockchain · Smart contracts · Solidity · Ethereum Virtual Machine (EVM) · Software Assurance · Static analysis

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. K. Arai (Ed.): Intelligent Computing, LNNS 284, pp. 881–890, 2021. https://doi.org/10.1007/978-3-030-80126-7_62

1 Introduction

Smart contracts are immutable. Once deployed, they cannot be changed. Modifying a smart contract requires the entire blockchain to be tossed aside, which is an infeasible solution for most production blockchains. This work extends the work of Dingman et al. [1] to closely examine different types of known bugs that can be introduced during smart contract development. The better the understanding of secure software development for smart contracts on blockchains, the more robust the overall blockchain. In fact, development tools such as static analysis and testing can be further developed to identify smart contract software concerns. Arguably, all software faults can have some level of security ramification, perhaps minor. This research examines the Solidity v0.6.2


developer notes to classify categories of faults noted by developers. Some of these categories are not well understood and can serve as resources for further improvement of Solidity static analysis tools to detect smart contract development concerns.

2 Review of Literature and Related Work

The assurance of smart contracts pre-deployment is extremely important, as they cannot be patched once executed on a chain. Currently there are four main research pillars with respect to smart contracts: use cases, correctness and verification, other software assurance methodologies, and data assurance. Let us now examine the current literature in each subdomain.

2.1 Smart Contract Use Cases

Smart contracts are being applied in many industries. Juho et al. [30], Hao et al. [10], and Pee et al. [18] discuss blockchain usage in the public sector. Afanasev, Fedosov, Krylova and Shorokhov [14] have reported on the integration of smart contracts with cyber-physical systems to improve the reliability, fault tolerance and security of machine-to-machine communications. For similar reasons, Christidis and Devetsikiotis [20] and Naidu, Mudliar, Naik and Bhavathankar [11] made a case for integrating the Internet of Things (IoT) architecture [20] and the supply chain [11] with blockchain. Medical image research integrating smart contracts has been published by Tang, Tong and Ouyang [12]. Bringing smart contracts into the educational arena to verify diplomas, certifications, degrees, etc. has been reported by Cheng, Lee, Chi and Chen [13]. Governatori et al. [8] discuss legal concerns around smart contracts. With the advent of the European Union's General Data Protection Regulation (GDPR), smart contracts have been suggested by Neisse, Steri and Nai-Fovino [17] for accountability and provenance in data tracking.

2.2 Smart Contract Correctness and Verification

Research has also explored the correctness and formal verification of smart contracts. Bartoletti and Zunino [15] introduced BitML (short for the Bitcoin Modelling Language), a calculus for bitcoin smart contracts to exchange currency according to contract terms. Lee, Park and Park [9] introduced a formal specification technique for verification in smart contracts. And Schrans, Eisenbach and Drossopoulou [16] propose a new type-safe, capabilities-secure, contract-oriented programming language, Flint, for future development.

2.3 Other Smart Contract Software Assurance Methodologies

Dingman et al. [1] examined blockchain smart contract bugs through the lens of the National Institute of Standards and Technology (NIST) Bugs Framework [19]. Their work identifies nearly 50 bugs spanning multiple categories (e.g. Operational, Security, Functional, Developmental, etc.).


Tikhomirov, Voskresenskaya, Ivanitskiy, Takhaviev, Marchenko and Alexandrov [23] introduce a pattern-based static analysis tool, SmartCheck, developed in Java. SmartCheck translates Solidity programming language contracts into an XML-based intermediate representation. SmartCheck then checks the XML against XPath patterns. They report that they classify Solidity issues based on the work by Kevlin Henney [24]. Potential vulnerabilities identified by SmartCheck are given in a table, where rows highlighted in gray are susceptible to being false positives. In addition, their research identifies four domains of issues which can be identified with static analysis: security, functional, operational and development. Luu, Chu, Olickel, Saxena, and Hobor [21] introduce Oyente, which is a symbolic execution tool to identify weaknesses. The authors also identified vulnerabilities within four main categories: transaction-ordering dependencies, timestamp dependencies, mishandled exceptions and re-entry issues. Juels, Kosba, and Shi [5] report on the potential for malicious smart contracts, called criminal smart contracts, to be introduced into a blockchain environment. Such criminal smart contracts may attempt to steal private keys or attempt to facilitate crimes within the distributed smart contract systems. A few papers have referenced known smart contract vulnerabilities. Wohrer and Zdun [22] report on known security vulnerabilities in Ethereum/Solidity smart contract systems. They map the vulnerabilities to security patterns such as balance limit, mutex, rate limit, speed bump, emergency stop and checks-effects-interaction. Mense and Flatscher [3] reported on three known classes of vulnerabilities (based on the work of Atzei, Bartoletti and Cimoli [25] and Dika [26]) in Solidity, the Ethereum Virtual Machine (EVM) and the underlying blockchain framework. Xu, Pautasso, Zhu, Lu and Weber [2] identified a set of patterns for blockchain applications which can be broken into components for further vulnerability scrutiny. Specifically, they reported on potential issues in external world interactions, data management, security and contract structural patterns.

2.4 Smart Contract Data Assurance

Another domain of smart contract research involves developing and protecting a blockchain architecture and methodology for smart contracts to retrieve and authenticate data from external sources. Bjorn van der Laan, Oğuzhan Ersoy and Zekeriya Erkin [7] propose a model concept around a blockchain oracle, entitled MUlti-Source oraCLE (MUSCLE), to reliably retrieve data from outside systems for use within smart contracts.

2.5 Developing Smart Contracts

Terry [4, 6] discusses the software development process and basic concepts for smart contracts. Terry covers topics including the fundamental idea of a smart contract, the blockchain's relationship with smart contracts, and fundamental cryptocurrency exchange.


2.6 Smart Contract Attacks and Protection

Sayeed et al. [31] discuss classic smart contract attacks and mitigations. Their research shows the relationship between the classic Decentralized Autonomous Organization (DAO) attack and the Parity Wallet attack. Such attacks have cost organizations and people many millions of dollars. Specifically, they classify blockchain exploitation techniques into categories based on the attack rationale: attacking consensus protocols, bugs in the smart contract, malware running in the operating system, and fraudulent users.

3 Static Analysis to Detect Errors

Blockchain smart contracts are being integrated in many different domains across the world. One of the deployment difficulties is that smart contracts are immutable. Detecting smart contract errors and potential issues that are prone to security concerns prior to deployment is therefore essential. For this research we focused on v0.6.2 of the Solidity API [27]. Currently there are many warnings being posted about the development of smart contracts which are unchecked by the current Solidity compiler. Thus, the burden of detecting these vulnerabilities and concerns is entirely on the Solidity developer. Our research contribution is to categorize many of these warnings for further integration into a static analysis engine that detects common programming concerns.

4 Findings

Specifically, this work focuses on smart contract concerns in Solidity v0.6.2 which are unchecked at either compile or deployment time. Solidity, influenced by C++, Python and JavaScript, is designed to target the EVM. Many, if not all, of the warnings we categorize are currently not integrated into any Solidity static analysis tool or the Solidity compiler itself. We focus the research on concerns discussed in the developer forums. Table 1 presents new categories of concerns which can be incorporated into a static analysis tool to improve smart contracts prior to deployment.

Table 1. New categories of concerns for contract static analysis

1. Legal – This can further be broken into different specific legal regulations. For example, the GDPR requires that data storage adheres to the right to be forgotten. As such, using methods like selfdestruct is not the same as deleting data from a hard disk. Depending on what data is stored on the public blockchain, it may or may not conform to legal requirements.
2. Normalization prior to Input Validation – Tools such as Dingman et al. [1], Feist et al. [28] and Slither [29] do consider certain input validation concerns such as integer overflow/underflow. Our research notes that input Unicode text can be encoded as a different byte array; inputs may therefore be missing proper normalization prior to filtering or hold entirely unexpected values. Furthermore, any interaction with another contract needs to incorporate proper input validation.
3. Overflow/Underflow – Tools such as Dingman et al. [1] do consider overflow/underflow. Our research highlights that the different math operations (e.g. shifts, exponentiation, addition, subtraction, multiplication and division) each need to be considered during a static analysis. In addition, a safe math library could be employed.
4. Fixed point numbers – Fixed point numbers are not yet fully supported by Solidity. They can be declared, but cannot be assigned. As such, smart contracts which integrate fixed point numbers need additional logic checks for contract correctness.
5. Unsafe Method Call – Avoid using .call() whenever possible when executing another contract function. The .call() function bypasses type checking, function existence checks, and argument packing.
6. Explicit Hardcoded Gas Values – Although possible, best practice advises against specifying gas values explicitly (i.e. hardcoded), since the gas costs of opcodes can change at any future point.
7. Leaking Gas – In certain circumstances, gas could be lost using feed.info{value: x, gas: y}.
8. Judge Concerns – This applies mainly when a contract acts as a judge for off-chain interactions (i.e. during disputes). A contract can be re-created at the same address after having been destroyed. Check the peculiarities (e.g. salt) so that the contract behaves as expected.
9. Unexpected Results from Multiple Assignments – Assigning multiple variables when reference types are involved can lead to unexpected copying behavior.
10. Variable Sanitization – Accessing variables of a type that spans less than 256 bits (for example uint64, address, bytes16 or byte) should not make any assumptions about bits that are not part of the encoding of the type (i.e. do not assume they are all zero).
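As an illustration of how a few of these categories could be wired into a pattern-based static analysis engine, the following Python sketch scans Solidity source text for simplified patterns corresponding to Categories 1, 5 and 6. It is a toy example in the spirit of pattern-based tools such as SmartCheck, not the engine proposed in this paper, and its regular expressions will produce both false positives and false negatives:

import re

CHECKS = [
    ("Category 1 (Legal)", re.compile(r"\bselfdestruct\s*\("),
     "selfdestruct does not erase previously published data; check legal requirements"),
    ("Category 5 (Unsafe Method Call)", re.compile(r"\.call\s*[({]"),
     ".call bypasses type checking, function existence checks and argument packing"),
    ("Category 6 (Hardcoded Gas)", re.compile(r"\{[^}]*\bgas\s*:\s*\d+"),
     "hardcoded gas values may break when opcode gas costs change"),
]

def scan_solidity(source: str):
    """Return (line number, category, message) findings for one Solidity source file."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for category, pattern, message in CHECKS:
            if pattern.search(line):
                findings.append((lineno, category, message))
    return findings

example = """
contract Example {
    function close(address payable owner) public { selfdestruct(owner); }
    function ping(address target) public { target.call{gas: 34500, value: 1}(""); }
}
"""
for lineno, category, message in scan_solidity(example):
    print(f"line {lineno}: {category}: {message}")

A production engine would instead work on the compiler's abstract syntax tree or an intermediate representation, as SmartCheck [23] and Slither [28, 29] do, so that the checks are not fooled by comments, strings or formatting.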

5 Solidity Source Code Analysis

In this section, we connect the example static analyses suggested in Section 4 with Solidity v0.6.2 source code. Figure 1 shows example contract code for Table 1 Category 1 calling the selfdestruct method. Depending on what information may be stored on the blockchain, certain legal concerns may not be satisfied. In addition, if any Ether is sent to a removed contract, that Ether is simply lost forever.

Fig. 1. Static analysis category #1

Figure 2 shows example contract code for Category 3 in Table 1: even if a number is negative, the developer cannot assume that its negation will be positive. In two's complement arithmetic, the most negative value of a signed integer type has no positive counterpart of the same width, so negating it overflows.

Fig. 2. Static analysis category #3


Figure 3 shows example contract code for Category 4 given in Table 1. Fixed point numbers can be declared in Solidity, but they are not completely supported. A static analysis could uncover usage or potential logic concerns, which need to be addressed.

Fig. 3. Static analysis category #4

Figure 4 shows example contract code for the concern listed in Table 1 Category 7. Any Ether (whose smallest denomination is Wei) sent to the contract is added to the total balance of that contract. Hardcoding gas values explicitly is not good practice, since the gas costs of opcodes can change in the future.

Fig. 4. Static analysis category #7

6 Future Work

There is much future work in the secure software development of smart contracts. First, other blockchain languages could be considered for some of the categories of concerns we described in this paper, such as legal. Second, some of the categories, such as legal, can further be broken into many different subtypes depending on the laws. For example, blockchains in healthcare settings in the United States of America need to conform to HIPAA regulations. In addition, financial blockchains which interact with credit card data will need to conform to PCI compliance. Since smart contracts are immutable, issues become very expensive to correct once contracts are deployed. Third, we can work to connect the issues we categorize in this paper with the Bugs Framework as in Dingman et al. [1]. Fourth, we specifically examined Solidity v0.6.2; we could further examine concerns with other versions of Solidity. Last, we could either develop a new static analysis tool or build onto an existing one to capture the findings we categorized in this paper.


7 Conclusions

Our research shows that there is still much work that remains to be done in the development of static analysis for secure software development of blockchain smart contracts. Solidity is a smart contract language designed to target the EVM. We identified new categories of development issues specific to Solidity v0.6.2. As smart contracts are immutable after deployment, they are not designed for patching software issues and software security concerns. Detecting smart contract coding concerns early in the secure software development lifecycle (sSDLC) is essential.

References 1. Dingman, W., et al.: Classification of smart contract bugs using the NIST bugs framework. In: 17th IEEE/ACIS International Conference on Software Engineering, Management and Applications (SERA 2019) 2. Xu, X., Pautasso, C., Zhu, L., Lu, Q., Weber, I.: A pattern collection for blockchain-based applications. In: Proceedings of the 23rd European Conference on Pattern Languages of Programs (EuroPLoP 2018). ACM, New York, NY, USA, Article 3, pp. 1–20 (2018). https:// doi.org/10.1145/3282308.3282312 3. Mense, A., Flatscher, M.: Security vulnerabilities in Ethereum smart contracts. In: Proceedings of the 20th International Conference on Information Integration and Web-based Applications & Services (iiWAS2018), Indrawan-Santiago, M., Pardede, E., Salvadori, I.L., Steinbauer, M., Khalil, I., Anderst-Kotsis, G. (eds.). ACM, New York, NY, USA, pp. 375–380 (2018). https://doi.org/10.1145/3282373.3282419 4. Parker, T.: Smart Contracts: The Ultimate Guide to Blockchain Smart Contracts - Learn how to Use Smart Contracts for Cryptocurrency Exchange! CreateSpace Independent Publishing Platform, USA (2016) 5. Juels, A., Kosba, A., Shi, E.: The ring of gyges: investigating the future of criminal smart contracts. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS 2016). ACM, New York, NY, USA, pp. 283–295 (2016). https:// doi.org/10.1145/2976749.2978362 6. Parker, T.: Smart Contracts: The Complete Step-By-Step Guide to Smart Contracts for Cryptocurrency Exchange. CreateSpace Independent Publishing Platform, USA (2017) 7. van der Laan, B., Ersoy, O., Erkin, Z.: MUSCLE: authenticated external data retrieval from multiple sources for smart contracts. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC 2019). ACM, New York, NY, USA, pp. 382–391 (2019). https:// doi.org/10.1145/3297280.3297320 8. Governatori, G., Idelberger, F., Milosevic, Z., Riveret, R., Sartor, G., Xu, X.: On legal contracts, imperative and declarative smart contracts, and blockchain systems. Artif. Intell. Law 26(4), 377–409 (2018). https://doi.org/10.1007/s10506-018-9223-3 9. Lee, S., Park, S., Park, Y.B.: Formal specification technique in smart contract verification. In: 2019 International Conference on Platform Technology and Service (PlatCon), Jeju, Korea (South), 2019, pp. 1–4. https://doi.org/10.1109/PlatCon.2019.8669419 10. Hao, X., Xiao-Hong, S., Dian, Y.: Multi-agent system for e-commerce security transaction with block chain technology. In: 2018 International Symposium in Sensing and Instrumentation in IoT Era (ISSI), Shanghai, 2018, pp. 1–6 (2018).https://doi.org/10.1109/ISSI.2018. 8538253


11. Naidu, V., Mudliar, K., Naik, A., Bhavathankar, P.P.: A fully observable supply chain management system using block chain and IOT. In: 2018 3rd International Conference for Convergence in Technology (I2CT), Pune, 2018, pp. 1–4 (2018). https://doi.org/10.1109/I2CT. 2018.8529725 12. Tang, H., Tong, N., Ouyang, J.: Medical images sharing system based on Blockchain and smart contract of credit scores. In: 2018 1st IEEE International Conference on Hot InformationCentric Networking (HotICN), Shenzhen, 2018, pp. 240–241 (2018) 13. Cheng, J., Lee, N., Chi, C., Chen, Y.: Blockchain and smart contract for digital certificate. In: 2018 IEEE International Conference on Applied System Invention (ICASI), Chiba, 2018, pp. 1046–1051 (2018). https://doi.org/10.1109/ICASI.2018.8394455 14. Afanasev, M.Y., Fedosov, Y.V., Krylova, A.A., Shorokhov, S.A.: An application of blockchain and smart contracts for machine-to-machine communications in cyber-physical production systems. In: 2018 IEEE Industrial Cyber-Physical Systems (ICPS), St. Petersburg, 2018, pp. 13–19 (2018).https://doi.org/10.1109/ICPHYS.2018.8387630 15. Bartoletti, M., Zunino, R.: BitML: a calculus for bitcoin smart contracts. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS 2018). ACM, New York, NY, USA, pp. 83–100 (2018). https://doi.org/10.1145/3243734.324 3795 16. Schrans, F., Eisenbach, S., Drossopoulou, S.: Writing safe smart contracts in flint. In: Conference Companion of the 2nd International Conference on Art, Science, and Engineering of Programming (Programming 2018 Companion). ACM, New York, NY, USA, pp. 218–219 (2018). https://doi.org/10.1145/3191697.3213790 17. Neisse, R., Steri, G., Nai-Fovino, I.: A blockchain-based approach for data accountability and provenance tracking. In: Proceedings of the 12th International Conference on Availability, Reliability and Security (ARES 2017). ACM, New York, NY, USA, Article 14, p. 10 (2017). https://doi.org/10.1145/3098954.3098958 18. Kang, E.S., Pee, S.J., Song, J.G., Jang, J.W.: Blockchain based smart energy trading platform using smart contract. In: 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Okinawa, Japan, 2019, pp. 322–325 (2019). https://doi. org/10.1109/ICAIIC.2019.8668978 19. Bojanova, I., Black, P.E., Yesha, Y., Wu, Y.: The Bugs Framework (BF): a structured approach to express bugs. In: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), Vienna, 2016, pp. 175–182 (2016). https://doi.org/10.1109/QRS.2016.29 20. Christidis, K., Devetsikiotis, M.: Blockchains and smart contracts for the internet of things. IEEE Access 4, 2292–2303 (2016). https://doi.org/10.1109/ACCESS.2016.2566339 21. Luu, L., Chu, D.H., Olickel, H., Saxena, P., Hobor, A.: Making smart contracts smarter. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS 2016). ACM, New York, NY, USA, pp. 254–269 (2016). https://doi.org/10. 1145/2976749.2978309 22. Wohrer, M., Zdun, U.: Smart contracts: security patterns in the ethereum ecosystem and solidity. In: 2018 International Workshop on Blockchain Oriented Software Engineering (IWBOSE), Campobasso, 2018, pp. 2–8 (2018).https://doi.org/10.1109/IWBOSE.2018.832 7565 23. Tikhomirov, S., Voskresenskaya, E., Ivanitskiy, I., Takhaviev, R., Marchenko, E., Alexandrov, Y.: SmartCheck: static analysis of ethereum smart contracts. 
In: Proceedings of the 1st International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB 2018). ACM, New York, NY, USA, pp. 9–16 (2018). https://doi.org/10.1145/3194113.319 4115 24. Henney, K.: Inside requirements (2017). https://www.slideshare.net/Kevlin/inside-requir ements


25. Atzei, N., Bartoletti, M., Cimoli, T.: A survey of attacks on Ethereum smart contracts (SoK). In: Proceedings of the 6th International Conference on Principles of Security and Trust, vol. 10204, pp. 164–186. Springer-Verlag, New York, NY, USA (2017). ISBN 978-3-662-54454-9. https://doi.org/10.1007/978-3-662-54455-6_8 26. Dika, A.: Ethereum smart contracts: security vulnerabilities and security tools. Master Thesis, Norwegian University of Science and Technology (2017). https://brage.bibsys.no/xmlui/bitstream/handle/11250/2479191/18400_FULLTEXT. Accessed 27 February 2018 27. Solidity: Introduction to Smart Contracts (2020). https://solidity.readthedocs.io/en/v0.6.2/introduction-to-smart-contracts.html 28. Feist, J., Grieco, G., Groce, A.: Slither: a static analysis framework for smart contracts. In: 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), Montreal, QC, Canada, 2019, pp. 8–15 (2019) 29. Slither: GitHub: Static Analyzer for Solidity (2020). https://github.com/crytic/slither 30. Lindman, J., et al.: The uncertain promise of blockchain for government. OECD Working Papers on Public Governance, No. 43, OECD Publishing, Paris (2020). https://doi.org/10.1787/d031cd67-en 31. Sayeed, S., Marco-Gisbert, H., Caira, T.: Smart contract: attacks and protections. IEEE Access 8, 24416–24427 (2020). https://doi.org/10.1109/ACCESS.2020.2970495

Promize - Blockchain and Self Sovereign Identity Empowered Mobile ATM Platform

Eranga Bandara1(B), Xueping Liang2, Peter Foytik1, Sachin Shetty1, Nalin Ranasinghe3, Kasun De Zoysa3, and Wee Keong Ng4

1 Old Dominion University, Norfolk, VA, USA {cmedawer,pfoytik,sshetty}@odu.edu
2 University of North Carolina at Greensboro, Greensboro, NC, USA x [email protected]
3 University of Colombo School of Computing, Colombo, Sri Lanka {dnr,kasun}@ucsc.cmb.ac.lk
4 School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore [email protected]

Abstract. Banks provide interactive money withdrawal/payment facilities, such as ATM, debit and credit card systems. With these systems, customers could withdraw money and make payments without visiting a bank. However, traditional ATM, debit and credit card systems inherit several weaknesses such as limited ATM facilities in rural areas, the high initial cost of ATM deployment, potential security issues in ATM systems, high inter-bank transaction fees etc. Through this research, we propose a blockchain-based peer-to-peer money transfer system “Promize” to address these limitations. The Promize platform provides a blockchain-based, low cost, peer-to-peer money transfer system as an alternative for traditional ATM system and debit/credit card system. Promize provides a self-sovereign identity empowered mobile wallet for its end users. With this, users can withdraw money from registered banking authorities (e.g. shops, outlets etc.) or their friends without going to an ATM. Any user in the Promize platform can act as an ATM, which is introduced as a mobile ATM. The Promize platform provides blockchain-based low-cost inter-bank transaction processing, thereby reducing the high inter-bank transaction fee. The Promize platform guarantees data privacy, confidentiality, non-repudiation, integrity, authenticity and availability when conducting electronic transactions using the blockchain. Keywords: Blockchain · Self-sovereign identity Electronic payments · Cloud computing

· Smart contract ·

c The Author(s), under exclusive license to Springer Nature Switzerland AG 2021  K. Arai (Ed.): Intelligent Computing, LNNS 284, pp. 891–911, 2021. https://doi.org/10.1007/978-3-030-80126-7_63


1 Introduction

Automatic Teller Machine (ATM) and debit/credit card systems are popular electronic transaction methods provided by banks. With these systems, customers can withdraw cash, make payments, access their bank account information, check balances, etc. without visiting a bank teller. It is a pay-now payment system which is primarily used for small and macro transactions [40]. Today ATMs are a common feature of most banks, showing that the ATM is a successful technology that humans have adapted to. Even though these systems are popular electronic transaction systems, they inherit weaknesses and loopholes. Below is a list of some of them.

1. The initial cost of setting up an ATM system is very high, requiring separate internet connections, security systems, surveillance camera systems, and buildings to host.
2. ATMs are targets for various frauds and attacks [30]. To cope with these scenarios banks need to invest more money, time and well-trained staff members.
3. ATMs are not evenly scattered across the country. Most ATMs are located in urban areas, so customers have to travel long distances to use ATM facilities.
4. High inter-bank ATM transaction fees exist. All inter-bank transactions go through a third-party centralized clearing agent, so for each inter-bank ATM transaction customers need to pay a high transaction fee. For instance, in Sri Lanka it costs LKR 20 per transaction.
5. High ATM transaction fees. For each transaction, customers need to pay a relatively high transaction fee. For example, Sri Lankan banks charge LKR 5 per single ATM transaction.
6. Anyone can steal the card number or the CVV number of a debit or credit card and make electronic payments. Physical cards are not required to make electronic payments.
7. Debit and credit cards are easily cloneable. Attackers can clone a user's card and use it.
8. In many developing countries, national ATM switches are not implemented. Therefore, inter-bank ATM facilities are not available.

With this research, we propose a blockchain- and self-sovereign identity [36]-based low-cost peer-to-peer money transfer system, "Promize", to address the above-mentioned challenges in traditional ATM systems and debit/credit card systems. Promize can be used as an alternative to traditional ATM systems (inter-bank ATM systems) and debit/credit card systems. With Promize, users are given the ability to withdraw funds through registered authorities or even through their friends, without having to go to an ATM. Any user in the Promize platform can act as an ATM, which is introduced as a mobile ATM. All Promize transactions are done through the self-sovereign identity empowered Promize mobile wallet application using QR codes. Users need to link their bank account to the Promize wallet application. The Promize platform automatically links the bank accounts using the core-banking APIs. Once the wallet


is linked with a bank account, the Promize wallet can be used as an alternative to traditional debit/credit cards. Users can purchase goods from shops and pay via the Promize wallet instead of using a debit or credit card. Promize does not provide a traditional crypto-currency application with its blockchain. Instead, it provides a low-cost inter-bank money transfer and payment system with a blockchain. The blockchain in the Promize platform is used to link multiple banks in the network. The Promize platform's low-cost peer-to-peer money transfer system guarantees non-repudiation, confidentiality, integrity, authenticity and availability of the transactions [26] with the help of blockchain integration. We have deployed a live version of the Promize platform at the Merchant Bank of Sri Lanka (MBSL) [33]. MBSL's ATMs are located only in urban areas, with no coverage in the rural areas of the country. Due to this, the bank provides ATM facilities to its customers in rural areas through the Promize platform. The main contributions of Promize are as follows.

1. A blockchain-based low-cost peer-to-peer money transfer system, "Promize", is introduced to address the challenges in traditional ATM systems and debit/credit card systems.
2. A self-sovereign identity empowered mobile wallet (for Android/iOS) is introduced to facilitate peer-to-peer money transfer and payment transactions. The transactions can be done with QR code scanning.
3. A Promize platform based ATM system is introduced to MBSL bank customers in Sri Lanka to facilitate their day-to-day money withdrawal requirements.
4. A blockchain-based low-cost inter-bank transaction system is introduced to Sri Lankan banks.

The rest of the paper is organized as follows. Section 2 discusses the architecture of the Promize platform and Section 3 its functionality. Section 4 describes the production deployment of the Promize platform at the MBSL bank, and Section 5 discusses the evaluation of the introduced platform. Section 6 surveys related work. Section 7 concludes the paper with suggestions for future work.

2 Promize Platform Architecture

2.1 Overview

Promize is a blockchain-based peer-to-peer money transfer platform. It can be used as an alternative to traditional ATM systems as well as traditional electronic payments. The architecture of the Promize platform is described in Fig. 1. It contains four main components.

1. Core bank layer - Each bank in the Promize platform has its own core banking system. The core bank maintains user bank accounts. It provides APIs for account management, fund transfers, etc.
2. Distributed ledger - All users' decentralized identities (DID) and transaction records are stored here.


3. Money transfer layer - Electronic transactions and verification happen here.
4. Communication layer - Data exchanges between Promize wallets happen in this peer-to-peer communication layer.

Fig. 1. Promize platform architecture. The Core Bank API maintains user bank accounts and supports fund transfers, etc. A distributed ledger is used to store user identities and transactions. Electronic money transfer transactions occur via the money transfer layer. The peer-to-peer communication layer is used to exchange transaction information.

2.2 Core Bank Layer

Each bank in the Promize platform has its own core banking system. The core bank maintains user bank account information, account balances, etc. It provides an API to search for user accounts and make fund transfers. The blockchain nodes deployed in the banks interact with their respective core banking APIs to search for user accounts, validate user accounts and perform fund transfers between accounts (for example, the blockchain node deployed in Bank A interacts with Bank A’s core banking API). There are some common core banking systems which most banks use (e.g. Finacle core banking system [31]). These core banking systems expose JSON-based REST APIs or SOAP web service-based APIs to facilitate banking functions such as account creation, account validations, fund transfers, etc. The blockchain smart contracts in the Promize platform interact with these core banking APIs to perform user account validations and transfer funds between user accounts.
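The snippet below illustrates, under assumed endpoint paths and JSON fields, how the contract logic at a bank's blockchain node might call such a core-banking REST API for account validation and fund transfers. The base URL, paths and field names are hypothetical placeholders; real core banking products such as Finacle define their own interfaces:

import requests

CORE_BANK_API = "https://corebank.bank-a.example/api"   # assumed endpoint

def validate_account(account_no: str, national_id: str) -> bool:
    resp = requests.post(f"{CORE_BANK_API}/accounts/validate",
                         json={"accountNo": account_no, "nationalId": national_id},
                         timeout=10)
    resp.raise_for_status()
    return resp.json().get("valid", False)

def transfer_funds(from_account: str, to_account: str, amount: float, currency: str = "LKR") -> str:
    resp = requests.post(f"{CORE_BANK_API}/transfers",
                         json={"from": from_account, "to": to_account,
                               "amount": amount, "currency": currency},
                         timeout=10)
    resp.raise_for_status()
    return resp.json()["transferId"]                     # assumed response field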


2.3 Distributed Ledger

The distributed ledger is the blockchain-based peer-to-peer storage system of the Promize platform. The blockchain can be deployed among multiple banks, and each bank in Promize can run its own blockchain node. These blockchain nodes are connected as a ring cluster (Fig. 2). Customers of each bank connect to their respective blockchain peer in the network. The blockchain stores all Promize platform users' digital identity information (identified as a DID, or decentralized identity proof [6, 36]) as self-sovereign identities. It also stores all Promize transaction information. The user identity and transaction information stored in one bank's blockchain node is synced with all other banks' blockchain nodes using the underlying blockchain consensus algorithm. The smart contracts running on the blockchain interact with the core bank API to facilitate customer bank account validations and fund transfers.

Fig. 2. The Promize platform's distributed ledger stores user identities and Promize transactions. The Promize mobile wallet interacts with the blockchain to create user identities and perform Promize transactions.

We have used the Rahasak blockchain [7], a highly scalable blockchain targeted at big data, as the distributed ledger of the Promize platform. Rahasak comes with a real-time-transaction-enabled "Validate-Execute-Group" blockchain architecture [5, 8] that uses Paxos [29]- and ZAB [25]-based consensus. It introduces the "Aplos" smart contract platform, based on concurrent-transaction-enabled functional programming [22] and actors [20], to facilitate blockchain functions [8]. All blockchain functions of the Promize platform are implemented with Aplos smart contracts. "Identity smart contracts" are used to handle user identities and "Promize smart contracts" are used to handle the Promize money transfer


transactions. With the Rahasak blockchain we were able to support real-time transactions [16, 34], high transaction throughput, high scalability and backpressure operation handling on the Promize platform.

2.4 Money Transfer Layer

All peer-to-peer money transfers take place in the money transfer layer. Users of the Promize platform are given the Promize mobile wallet application, and peer-to-peer money transfers occur via this mobile wallet. Users first need to register on the Promize platform and link their bank account to the Promize mobile wallet; to use Promize, users need to have a bank account at any bank which participates in the Promize platform. When registering, the app generates a public and private key pair corresponding to the user/mobile wallet. The private key is saved in the Keystore of the mobile application. The public key and the base58 [17] hash of the public key are uploaded to the blockchain along with other user attributes (e.g. account number, bank name, etc.). The base58 hash of the public key is used as the digital identity (DID) [36] of the user on the Promize platform. Figure 2 shows the format of the DID generated by the mobile wallet. This DID is embedded in a QR code in the mobile app, which the user can show to any relevant party (e.g. a cashier, friend or bank branch officer) when performing peer-to-peer money transfers via the Promize platform.
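A minimal sketch of this key and DID generation step is shown below. The exact key type, hash function and DID layout used by Promize are not specified in this section, so the concrete choices here (RSA-2048, SHA-256 and a Bitcoin-style base58 alphabet) should be read as assumptions:

import hashlib
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes) -> str:
    num = int.from_bytes(data, "big")
    encoded = ""
    while num > 0:
        num, rem = divmod(num, 58)
        encoded = B58_ALPHABET[rem] + encoded
    # preserve leading zero bytes as '1' characters, as in Bitcoin-style base58
    return "1" * (len(data) - len(data.lstrip(b"\x00"))) + encoded

# Generated on the device; the private key stays in the mobile Keystore.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_der = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)
did = base58_encode(hashlib.sha256(public_der).digest())
print("did:", did)   # uploaded to the blockchain together with the other user attributes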

2.5 Peer to Peer Communication Layer

When making payments or performing transactions via the Promize platform, users need to exchange transaction details with the blockchain ledger and the mobile wallet applications. The peer-to-peer communication layer is used to exchange this transaction information between the Promize mobile wallet applications and the blockchain. It can be implemented with a TCP/WebSocket-based communication service or a push notification service similar to Firebase [27] (which works on top of 3G/4G cellular networks). In the Promize platform, we have used the Firebase push notification service to implement the peer-to-peer communication between mobile wallets.
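For illustration, the sketch below pushes transaction details (for example, the transaction salt used later in the money transfer flow) to a wallet device using the shape of Firebase Cloud Messaging's legacy HTTP API; newer Firebase deployments use the HTTP v1 API with OAuth tokens instead of a server key. The key and token values are placeholders:

import requests

FCM_ENDPOINT = "https://fcm.googleapis.com/fcm/send"     # legacy FCM endpoint
SERVER_KEY = "<firebase-server-key>"                     # placeholder

def push_promize_notification(device_token: str, payload: dict) -> None:
    resp = requests.post(
        FCM_ENDPOINT,
        headers={"Authorization": f"key={SERVER_KEY}",
                 "Content-Type": "application/json"},
        json={"to": device_token, "data": payload},
        timeout=10,
    )
    resp.raise_for_status()

# e.g. push_promize_notification(token, {"txId": "tx-123", "salt": "ab12...", "amount": "1000"})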

3 Promize Platform Functionality

3.1 Use Case

In traditional scenarios, users need to go to a bank ATM or bank branch to withdraw money. With Promize, instead of going to a bank ATM, users can withdraw money from registered authorities of the Promize platform (e.g. shops, outlets, etc.) or a friend who has the Promize app installed. First, users register with the platform via the Promize mobile application. When a user registers, the mobile application captures user credentials, bank account information (Fig. 4) and sends a registration request to the “Identity smart contract” on the


Fig. 3. The user captures credentials and account information via the Promize mobile wallet and registers on the Promize platform. The blockchain stores a proof of the user's credentials.

blockchain. The registration request contains the user id (DID), user name, public key, user bank name and bank account number. Upon receiving the registration request, the smart contract validates the user credentials and bank account information and creates an identity for the user in the blockchain (Fig. 3). The format of the user identity in the blockchain ledger can be seen in Fig. 2. When validating, it connects to the corresponding core banking system at the bank and verifies the user's bank account details (e.g. it verifies the bank account number against the identity number). For example, if a user belongs to Bank A, the blockchain node at Bank A connects with its core banking system and validates the user's account information. Assume two users (User A and User B) are registered on the Promize platform and have linked their bank accounts to the Promize mobile wallet. User A wants to withdraw LKR 1,000. User A goes to User B, who has already installed the Promize mobile app and registered on the Promize platform, and asks for LKR 1,000. If User B has LKR 1,000 at hand, he navigates to the "Send Promize" section of the mobile application and chooses the amount to transfer. The app then generates a QR code embedding User B's DID, account number and transfer amount (Fig. 4(c)). User A now navigates to the "Received Promize" section of his mobile application and scans the QR code from User B's phone (Fig. 5(a)). Once the QR code is scanned, the application navigates to the Promize receive confirmation screen (Fig. 5(b)), where User A can confirm it and submit a Promize transaction request to the blockchain (Fig. 5). When submitting the transaction, it sends a Promize create request to the "Promize Smart Contract" on the blockchain, which checks the validity of the transaction (double-spend check) and the user accounts. Promize create requests contain the transaction id, transaction type (create), User A's DID, User B's DID and the transaction amount. If the transaction is valid, the blockchain will generate a corresponding random number (transaction salt)


Fig. 4. Promize mobile wallet application: (a) Promize Registration, (b) Promize Amount, (c) Promize QR Code. The sending user chooses the Promize amount and initiates the Promize transaction. The app generates a QR code embedding the DID, account number and transfer amount.

Fig. 5. (a) Scan Promize, (b) Submit Promize, (c) Confirm Promize. The Promize receiver scans the QR code and submits the transaction. A confirmation is sent to the Promize sender via push notification. Once the Promize sender confirms it, the transaction is approved.


The format of the Promize transaction in the blockchain ledger is shown in Fig. 2. The blockchain then sends this transaction salt and the other transaction details to User B’s Promize mobile wallet as a notification message (e.g. a Firebase push notification), Fig. 6. This notification message contains the transaction id, User A’s DID, User B’s DID, the transaction amount and the transaction salt. Once the notification message is received by User B’s Promize application, it navigates to the Promize send confirmation screen, Fig. 5. User B then confirms it, thereby approving the Promize transaction. At confirmation, a Promize approve request is sent with the received random number (transaction salt) to the “Promize Smart Contract” on the blockchain. The Promize approve request contains the transaction id, transaction type (approve), User A’s DID, User B’s DID, the transaction amount and the transaction salt. The smart contract then checks the validity of the transaction parameters (transaction salt and accounts). When validating, it compares the received transaction salt and account numbers with those of the Promize transaction saved in the blockchain ledger. If the transaction parameters are valid, it transfers the requested amount (LKR 1,000) from User A’s bank account to User B’s bank account by communicating with the core banking service. If this transfer is successful, it updates the Promize transaction status to approved. Finally, the transaction status is sent to both users’ Promize mobile applications. User B then physically gives LKR 1,000 to User A, Fig. 6. The idea here is that User A physically receives money from User B, while the bank electronically transfers the same amount from User A’s account to User B’s account via the Promize transaction. The same function can be used to purchase goods from shops: at the end of the purchase, users can send money to the shop’s bank account via a Promize transaction, and once the money is sent, the user receives the goods.
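To make the two-phase create/approve flow concrete, the sketch below mirrors the protocol described above in Python. The actual Promize smart contracts are written in Scala on the Aplos platform, so all names, signatures and the in-memory ledger here are illustrative assumptions, not the platform's real implementation.

```python
import secrets

# In-memory stand-in for the blockchain ledger state (illustrative only).
ledger = {}

def create_promize(txn_id, user_a_did, user_b_did, amount):
    """Handle a Promize create request: validate, store as pending, return the salt."""
    if txn_id in ledger:                      # double-spend / duplicate check
        raise ValueError("duplicate transaction id")
    salt = secrets.token_hex(16)              # random transaction salt
    ledger[txn_id] = {
        "user_a": user_a_did, "user_b": user_b_did,
        "amount": amount, "salt": salt, "status": "pending",
    }
    # The salt would then be pushed to User B's wallet as a notification.
    return salt

def approve_promize(txn_id, user_a_did, user_b_did, amount, salt):
    """Handle a Promize approve request: verify salt and parties, then settle."""
    txn = ledger.get(txn_id)
    if txn is None or txn["status"] != "pending":
        raise ValueError("unknown or already settled transaction")
    if (txn["salt"], txn["user_a"], txn["user_b"], txn["amount"]) != (
        salt, user_a_did, user_b_did, amount
    ):
        raise ValueError("transaction parameters do not match the ledger entry")
    # Here the real platform would call the core banking API to move the funds.
    txn["status"] = "approved"
    return txn
```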

3.2 Data Privacy and Security

The Promize platform guarantees the data privacy, confidentiality, integrity, non-repudiation, authenticity and availability security features. To guarantee privacy, the Promize platform stores users’ digital identity in the blockchain as a self-sovereign identity proof. Actual identity data such as ID numbers and signatures are stored in the secure storage of the user’s physical mobile phone. When this information is needed for verification [19, 37], it is fetched directly from the user’s mobile application through push notifications. By using an SSI-based approach, the Promize platform addresses the common issues of centralized cloud-based storage platforms (e.g. lack of data immutability, lack of traceability). To guarantee confidentiality and integrity, we have used an RSA-based digital signature mechanism [24]. All data in the Promize platform are digitally signed by the corresponding party. The random number exchanged with the QR code when performing a Promize transaction ensures that users meet in person to carry out the transaction, which guarantees non-repudiation. We have used a JWT-based auth service to handle the authentication/authorization of the Promize platform.


Fig. 6. Promize transaction flow between User A and User B. A QR code is used to exchange information between the mobile wallets.

The users’ authentication information (username/password fields) of the Promize platform is stored in the JWT auth service [23]. Upon login, the user sends an authentication request to the auth service, which verifies the credentials and returns a JWT auth token to the user. This auth token contains the user permissions, token validity time, a digital signature, etc.


All subsequent requests need to include the JWT token in the HTTP request header to perform authentication and authorization. The token verification is handled by the gateway service in the Promize platform, Fig. 7.

Fig. 7. JSON Web Token (JWT)-based auth service in the Promize platform. Auth tokens are issued by the auth service; token validation happens at the gateway service.
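As a rough illustration of this auth flow, the snippet below sketches token issuance and gateway-side validation with the PyJWT library. The Promize auth and gateway services are actually written in Go, so the shared secret, claim names and header handling here are assumptions for illustration only.

```python
import time
import jwt  # PyJWT

SECRET = "example-shared-secret"  # illustrative; a real deployment would manage keys securely

def issue_token(username: str, permissions: list[str]) -> str:
    """Auth service: issue a JWT after the user's credentials have been verified."""
    claims = {
        "sub": username,
        "perms": permissions,
        "exp": int(time.time()) + 3600,  # token validity time
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def validate_token(http_headers: dict) -> dict:
    """Gateway service: validate the token carried in the HTTP Authorization header."""
    token = http_headers.get("Authorization", "").removeprefix("Bearer ")
    # Raises jwt.InvalidTokenError (e.g. expired signature) if validation fails.
    return jwt.decode(token, SECRET, algorithms=["HS256"])
```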

3.3 Low-Cost Transactions

With Promize, the cost of electronic transactions can be reduced considerably. Since multiple banks can share transaction and account information through the blockchain, it facilitates low inter-bank transaction costs. Additionally, an inter-bank communication switch or an ATM switch is not required. With Promize, any user can act as an ATM: if they have money, they can give it to their friends. If required, banks can appoint official Promize agents to act as ATMs. For example, a bank could appoint a trusted shop in a village as a Promize agent. Users in that village could go to this shop to withdraw money. They could also buy goods and pay via Promize transactions. This reduces the banks’ costs of deploying and maintaining physical ATMs and the operating costs of credit and debit cards, and it can securely provide services to a wide range of customers in rural areas. With conventional ATMs, customers take cash from the bank, which means the bank’s money goes out. In Promize, users exchange money between accounts. For example, in the above-mentioned scenario User B gives his physical money to User A, and Promize transfers money from User A’s bank account to User B’s bank account. In this case, the actual money at the bank remains the same. The money does not leave the bank as it does with ATMs, which is another main advantage for the banks.


4 Promize Platform Production Deployment

4.1 Overview

We have deployed a live/production version of the Promize platform at MBSL. Their branch network and ATM systems are mainly located in urban areas, and this particular bank experiences a lack of ATM systems in rural areas. They use the Promize platform to facilitate banking and financial services (e.g. money withdrawals, electronic payments, etc.) for people in rural villages in Sri Lanka. Promize mobile wallets are used as ATMs. They plan to appoint official bank agents in rural areas with a Promize mobile wallet. For example, supermarkets/shops in a village could be bank agents. The agent’s Promize mobile wallet is installed on low-cost Sunmi Android tablet devices, which have a built-in Bluetooth receipt printer. Customers can go to a shop and withdraw money based on the Promize protocol discussed in Sect. 3. At the end of the transaction, the cashier of the shop prints a receipt and gives it to the customer using the receipt printer embedded in the Sunmi tablet. This provides greater transparency and validity to Promize transactions.

Fig. 8. Promize platform production system architecture deployed at MBSL.

4.2 Service Architecture

We have built the Promize platform to support high scalability and a high transaction load in the banking environment. To cope with the high transaction load and back-pressure [14] operations, we have adopted a reactive-streams-based approach using Akka Streams [2, 13]. The Promize platform has been built using a micro-services architecture [42]. All the services in the Promize platform are implemented as small services (micro-services) following the single-responsibility principle. These services are dockerized [4, 35] and deployed using the Kubernetes [9] container orchestration system.


Figure 8 shows the architecture of the Promize platform at MBSL. It contains the following services/components:

1. Nginx – load balancer and reverse proxy service
2. Gateway service – micro-services API gateway
3. Auth service – JWT-based auth service
4. Apache Kafka – micro-services message broker
5. Rahasak Blockchain – blockchain ledger in the Promize platform

The Nginx load balancer [38] is used as the reverse proxy of the Promize platform. The SSL certificates are set up in Nginx, which exposes port 443 to clients. All mobile wallet applications connect to port 443, which is publicly exposed via a public IP. Requests arriving at Nginx are redirected to the gateway service, which is the micro-services API gateway of the Promize platform. The gateway service is implemented in Go [3, 39]. This service validates the auth tokens in client requests (HTTP header) and redirects them to the blockchain to handle the Promize functionalities. The auth service is the JWT-based [23] auth service in the Promize platform. All client credentials (usernames, passwords) are stored in the auth service’s storage. The auth service validates client credentials and issues auth tokens to clients. Each bank in the network has its own gateway, auth service and Nginx proxy. The bank clients (with Promize mobile applications) connect to the blockchain network via their respective bank’s Nginx proxy. Apache Kafka is used as the micro-services message broker of the Promize platform. Inter-service communication takes place via the Kafka message broker using JSON-serialized messages. Every service in the Promize platform has its own Kafka topics to receive messages. We have used the highly scalable Rahasak blockchain to implement the Promize platform at MBSL. All blockchain functions are written in Scala as Akka actor-based Aplos smart contracts on the Rahasak blockchain. MBSL runs an AS400-based core banking system [15], which maintains customer accounts, account balances, etc. A core banking API is exposed to perform account inquiries, fund transfers, etc. The blockchain smart contracts of the Promize platform interact with this core banking API to perform user account validation and fund transfers between accounts.
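As an illustration of the Kafka-based inter-service messaging with JSON-serialized payloads described above, the minimal sketch below uses the kafka-python client. The topic name, broker address and message fields are assumptions; the real services are implemented in Go and Scala.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"           # illustrative broker address
TOPIC = "promize-gateway-requests"  # illustrative per-service topic name

# Producer side: a service publishes a JSON-serialized message to another service's topic.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)
producer.send(TOPIC, {"txn_id": "t-001", "type": "create", "amount": 1000})
producer.flush()

# Consumer side: each service reads JSON messages from its own topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print("received:", message.value)
    break  # stop after one message in this sketch
```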

5 Performance Evaluation

The performance evaluation of the Promize platform is discussed in this section. To obtain these results, we deployed the Promize platform with a multi-peer Rahasak blockchain cluster on AWS 2xlarge instances (16 GB RAM and 8 CPUs). The Rahasak blockchain runs with 4 Kafka nodes, 3 Zookeeper nodes and Apache Cassandra [28] as the state database. The smart contracts on the Rahasak blockchain are implemented with Scala functional programming and the Akka actor-based [1, 18] Aplos [8] smart contract platform. The evaluation results are obtained for the following criteria, with a varying number of blockchain peers (1 to 5 peers) used in the different evaluations.


Fig. 9. Transaction throughput in the Promize blockchain.

Fig. 10. Time to execute transactions and validate transactions in the Promize platform.

Fig. 11. Transaction scalability in the Promize blockchain.

Fig. 12. Transaction latency in the Promize blockchain.

Fig. 13. Transaction execution rate with the number of blockchain peers in the Promize blockchain.

Fig. 14. Transaction execution rate and transaction submission rate in a single blockchain peer.

Fig. 15. Search performance of the Promize platform.

Fig. 16. Block creation time against the number of transactions in the block.

1. Transaction throughput of the blockchain
2. Transaction execution and validation time in the Promize platform
3. Transaction scalability of the Promize platform
4. Transaction execution rate in the Promize platform
5. Search performance
6. Block creation time

5.1 Transaction Throughput

To complete this evaluation, we recorded the number of Promize create transactions and Promize query transactions that can be executed on each peer in the Promize platform. When creating a Promize, an invoke transaction is executed in the underlying blockchain. The invoke transaction creates a record in the ledger and updates the status of the assets in the blockchain. A query searches the status of the underlying blockchain ledger and can neither create transactions in the ledger nor update the asset status. We flooded concurrent transactions to each peer and recorded the number of completed results. As shown in Fig. 9, we obtained consistent throughput on each peer of the Promize platform. Since queries do not update the ledger status, they achieve roughly twice the throughput of invoke transactions.

5.2 Transaction Execution and Validation Time

We evaluated the transaction execution and transaction validation time and recorded the time to execute and validate different sets of transactions (100, 500, 1000, 2000, 3000, 5000, 7000, 8000 and 10000 transactions). The transaction validation time includes the double-spend checking time. The transaction execution time includes the double-spend checking time, ledger update time and data replication time. Figure 10 shows how the transaction execution time and validation time vary over the different transaction sets.

5.3 Transaction Scalability

The number of transactions that can be executed per second over a number of peers in the network is recorded for this evaluation. We flooded concurrent transactions on each peer and recorded the number of executed transactions. Figure 11 shows the transaction scalability results. Each node added to the cluster increases the transaction throughput in a nearly linear fashion. This means that the transaction latency decreases when blockchain peers are added to the cluster, Fig. 12. As peers are added, there are diminishing returns: the performance benefit degrades if too many peers are added.

5.4 Transaction Execution Rate

Next, we evaluate the transaction execution rate in the Promize platform. We tested the number of submitted transactions and executed transactions on different blockchain peers, recording the time. Figure 13 shows how the transaction execution rate varies with the number of blockchain peers in the Promize platform. When the number of peers increases, the rate of executed transactions also increases correspondingly. Figure 14 shows the number of executed transactions and submitted transactions in a single blockchain peer. There is a back-pressure operation [14] between the rates of submitted transactions and executed transactions. We have used a reactive-streaming-based approach with Apache Kafka to handle these back-pressure operations in the Promize platform.

5.5 Search Performance

The Promize platform allows searching its transaction and account information via the underlying Rahasak blockchain’s Lucene-index-based search API. We evaluated this criterion by issuing concurrent search queries to Promize and computing the search time. As shown in Fig. 15, searching 2 million records took 4 ms.

5.6 Block Generation Time

Finally, we evaluated the time taken to create blocks in the underlying blockchain storage of the Promize platform. The statistics are recorded against the number of transactions in a block. The block generation time depends on (a) the data replication time, (b) the Merkle proof/block hash generation time and (c) the transaction validation time. As the transaction count in a block increases, these factors grow, and the block generation time increases correspondingly. As shown in Fig. 16, creating a block containing 10k transactions takes the platform 8 s.


Table 1. Peer-to-peer electronic payment platform comparison

Platform        | Architecture  | Running blockchain | Scalability level | Payment type    | SSI support | Privacy level
Promize         | Decentralized | Rahasak            | High              | Bank            | Yes         | High
Mobile-ATM [26] | Centralized   | N/A                | Mid               | Bank            | No          | Low
RSCoin [12]     | Decentralized | RSCoin             | Mid               | Crypto          | No          | Mid
DTPS [21]       | Decentralized | Ethereum           | Low               | Visa/MasterCard | No          | Mid
BPCSS [11]      | Decentralized | Bitcoin            | Low               | Crypto          | No          | Mid
BCPay [43]      | Decentralized | Ethereum           | Low               | Crypto          | No          | Mid
TTPayment [32]  | Decentralized | Bitcoin            | Low               | Crypto          | No          | Mid
Syscoin [41]    | Decentralized | Bitcoin            | Low               | Crypto          | No          | Mid

6 Related Work

Much research has been conducted to improve the privacy/security features of electronic transactions and to provide low-cost alternatives to traditional banking and inter-banking transactions [26]. In this section, we outline the main features and architecture of these research projects.

Mobile-ATM [26]. Mobile-ATM is a simple M-Commerce application that provides ATM services. The traditional ATM network is replaced with the proposed M-ATM system, which consists of a bank, a customer and the M-ATM agent. Both the M-ATM agent and the customer are required to have mobile phones set up to perform the functions of M-ATM. The bank hosts an M-ATM server as a front-end interface and uses a back-end transaction management system. This tool proposes a new money withdrawal/deposit (ATM) system, enabling users to perform ATM transactions with their mobile phones with additional security features. The goal of this tool is to reduce the barriers to using ATM systems while simultaneously improving the security of ATM transactions.

RSCoin [12], a centrally controlled cryptocurrency framework, uses a sharding-based blockchain protocol that allows for better scalability and efficiency of a centrally banked crypto-currency system. RSCoin utilizes a centralized monetary supply and a distributed transaction ledger. A set of authorities, called mintettes, perform the transaction validation (double-spend checking). With a centralized monetary authority, RSCoin is able to scale better than other decentralized crypto-currencies. The currency supply is created and controlled by a central bank, making a cryptocurrency based on RSCoin significantly more palatable to governments. While monetary policy is centralized, RSCoin still provides strong transparency and auditability guarantees. Double spending is checked with a simple and fast mechanism, and a two-phase commit maintains the integrity of the transaction ledger.

The Delay-Tolerant Payment Scheme (DTPS) [21] proposes a cash-less payment system intended for rural villages with limited Internet connectivity. To work in an intermittent network environment, it uses blockchain mining nodes located locally in the villages that handle transaction processing and verification.


Base stations are set up and run in these remote villages using technology such as Nokia Kuha, connected to the public Internet via unreliable satellite links. This in turn offers 4G coverage to the village while providing intermittent connectivity to the wider Internet. Banks control and initiate the deployment of the system utilizing the intermittent connectivity and periodically monitor the system operations of the villages. Ethereum [10] blockchain-based smart contracts are used for payment service management, including user account initiation, interactions with the credit operator, and management of rewards for blockchain miners. Visa and MasterCard handle all the local transaction verification through the Ethereum blockchain, with which they have implemented payment gateways.

BPCSS [11] proposes a Blockchain-based Payment Collection Supervision System (BPCSS). Customers and merchant stores can use this system to help manage transactions when they want to use the pervasive Bitcoin digital wallet. All transaction details are efficiently saved on a cloud database immediately after customers use their NFC-enabled Android smartphone app to purchase goods in the store using RFID tags. Both customers and merchants are able to review transaction details in a sovereign way without the need for a traditional finance system. The results presented for the tool are preliminary, using the Bitcoin Testnet, but demonstrate the cost-effectiveness of the proposed tool to easily and effectively manage transactions.

BCPay [43] is introduced to provide secure and fair payment for outsourcing services without relying on a third party (trusted or untrusted). It is a blockchain-based fair payment framework for outsourcing services in cloud and fog computing. BCPay can provide sound and robust payments without the need for an organization to facilitate them. This is achieved by an all-or-nothing checking-proof protocol that ensures the outsourcing service provider performs its service completely or pays a fee. Performance evaluations show that BCPay is able to process a high number of transactions in a short amount of time without a large computational cost.

Thing-to-Thing Payment (TTPayment) [32] is a Bitcoin-based automated micro-payment platform for the Internet of Things. It allows devices to pay each other for services without requiring a person to interact. A proof-of-concept implementation exists of a smart cable that connects to a smart socket and pays for the electricity based on the draw, without human interaction. Bitcoin is shown to be a feasible payment solution for thing-to-thing payments, but high transaction fees arise for micro transactions. This is mitigated using the developed single-fee micro-payment protocol, which aggregates multiple small transactions, essentially batching them into one large transaction instead of many small ones.

Syscoin [41] is a permissionless blockchain-based cryptocurrency providing blockchain-based e-commerce solutions for businesses of all sizes. Turing-complete smart contracts built on the Bitcoin scripting system are used to control and interact with the system. Coin transactions are controlled by a hardened layer of distributed consensus logic.


Each smart contract (Syscoin service) uses this hardened layer while still retaining backwards compatibility with the Bitcoin protocol. Commercial developers want to use the most powerful network available (currently Bitcoin); Syscoin allows the utilization of this network while providing security and efficiency. A comparison of these platforms and the Promize platform is presented in Table 1. It compares the architecture (centralized/decentralized), running blockchain, scalability level, supported payment types (bank payments, Visa/MasterCard payments, cryptocurrency payments), SSI support and privacy level.

7 Conclusions and Future Work

With this research, we have proposed “Promize”, a blockchain-based low-cost alternative to traditional ATM systems and credit/debit card systems. With Promize, users can withdraw money from registered authorities or their friends without going to an ATM: any user in the Promize platform can act as an ATM. All Promize transactions are done through the QR code-based Promize mobile wallet application. The Promize wallet application can be used as an alternative to traditional debit and credit cards; users can purchase goods from shops and pay via the Promize wallet instead of using debit/credit cards. The Promize platform mitigates the above-mentioned issues in ATM systems as well as debit/credit card systems. It provides a secure, low-cost way of performing ATM and debit/credit card transactions while guaranteeing the data privacy, confidentiality, integrity, non-repudiation, authenticity and availability security features. We have demonstrated the scalability and transaction throughput of the Promize platform with empirical evaluations. Most recently, we deployed version 1.0 of the Promize platform at the Merchant Bank of Sri Lanka.

Acknowledgments. This work was funded by the Department of Energy (DOE) Office of Fossil Energy (FE) (Federal Grant #DE-FE0031744).

References

1. Akka documentation
2. Akka streams documentation
3. The Go programming language
4. Docker documentation, August 2018
5. Androulaki, E., et al.: Hyperledger fabric: a distributed operating system for permissioned blockchains. In: Proceedings of the Thirteenth EuroSys Conference, p. 30. ACM (2018)
6. Baars, D.S.: Towards self-sovereign identity using blockchain technology. Master’s thesis, University of Twente (2016)
7. Bandara, E., et al.: Mystiko–blockchain meets big data. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 3024–3032. IEEE (2018)


8. Bandara, E., Ng, W.K., Ranasinghe, N.: Aplos: smart contracts made smart. In: BlockSys 2019 (2019) 9. Burns, B., Grant, B., Oppenheimer, D., Brewer, E., Wilkes, J.: Borg, omega, and kubernetes. Queue 14(1), 70–93 (2016) 10. Buterin, V., et al.: A next-generation smart contract and decentralized application platform. white paper (2014) 11. Chen, P.-W., Jiang, B.-S., Wang, C.-H.: Blockchain-based payment collection supervision system using pervasive bitcoin digital wallet. In: 2017 IEEE 13th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), pp. 139–146. IEEE (2017) 12. Danezis, G., Meiklejohn, S.: Centrally banked cryptocurrencies. arXiv preprint arXiv:1505.06895 (2015) 13. Davis, A.L.: Akka streams. In: Reactive Streams in Java, pp. 57–70. Springer (2019) 14. Destounis, A., Paschos, G.S., Koutsopoulos, I.: Streaming big data meets backpressure in distributed network computation. In: IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, pp. 1–9. IEEE (2016) 15. Ebadi, Z.: Advance banking system features with emphasis on core banking. In: The 9th International Conference on Advanced Communication Technology, vol. 1, pp. 573–576. IEEE (2007) 16. Eykholt, E., Meredith, L.G. and Denman, J.: Rchain architecture documentation (2017) 17. Fisher, J., Sanchez, M.H.: Authentication and verification of digital data utilizing blockchain technology, September 29 2016. US Patent App. 15/083,238 18. Gupta, M.: Akka essentials. Packt Publishing Ltd. (2012) 19. Hammudoglu, J.S., et al.: Portable trust: biometric-based authentication and blockchain storage for self-sovereign identity systems. arXiv preprint arXiv:1706.03744 (2017) 20. Hewitt, C.: Actor model of computation: scalable robust information systems. arXiv preprint arXiv:1008.1459 (2010) 21. Yining, H., et al.: A delay-tolerant payment scheme based on the ethereum blockchain. IEEE Access 7, 33159–33172 (2019) 22. Hughes, J.: Why functional programming matters. Comput. J. 32(2), 98–107 (1989) 23. Jones, M.B.: The emerging json-based identity protocol suite. In: W3C Workshop on Identity in the Browser, pp. 1–3 (2011) 24. Jonsson, J., Kaliski, B.: Public-key cryptography standards (pkcs)# 1: Rsa cryptography specifications version 2.1. Technical report, RFC 3447, February 2003 25. Junqueira, F.P., Reed, B.C., Serafini, M.: Zab: high-performance broadcast for primary-backup systems. In: 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN), pp. 245–256. IEEE (2011) 26. Karunanayake, A., De Zoysa, K., Muftic, S.: Mobile ATM for developing countries, January 2008 27. Khawas, C., Shah, P.: Application of firebase in android app development-a study. Int. J. Comput. Appl. 179(46), 49–53 (2018) 28. Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44(2), 35–40 (2010) 29. Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. (TOCS) 16(2), 133–169 (1998)


30. Li, Z., Sun, Q., Lian, Y., Giusto, D.D.: An association-based graphical password design resistant to shoulder-surfing attack. In: 2005 IEEE International Conference on Multimedia and Expo, pp. 245–248. IEEE (2005)
31. Luka, M.K., Frank, I.A.: The impacts of ICTs on banks. Editorial Preface 3(9) (2012)
32. Lundqvist, T., de Blanche, A., Andersson, H.R.H.: Thing-to-thing electricity micro payments using blockchain technology. In: 2017 Global Internet of Things Summit (GIoTS), pp. 1–6. IEEE (2017)
33. MBSL: MBSL bank
34. McConaghy, T., et al.: Bigchaindb: a scalable blockchain database. White paper, BigChainDB (2016)
35. Merkel, D.: Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014(239), 2 (2014)
36. Mühle, A., Grüner, A., Gayvoronskaya, T., Meinel, C.: A survey on essential components of a self-sovereign identity. Comput. Sci. Rev. 30, 80–86 (2018)
37. Othman, A., Callahan, J.: The horcrux protocol: a method for decentralized biometric-based self-sovereign identity. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2018)
38. Reese, W.: Nginx: the high-performance web server and reverse proxy. Linux J. 2008(173), 2 (2008)
39. Schmager, F., Cameron, N., Noble, J.: Evaluating the go programming language with design patterns. In: Evaluation and Usability of Programming Languages and Tools, p. 10. ACM (2010)
40. Schwiderski-Grosche, S., Knospe, H.: Secure mobile commerce. Electron. Commun. Eng. J. 14(5), 228–238 (2002)
41. Sidhu, J.: Syscoin: a peer-to-peer electronic cash system with blockchain-based services for e-business. In: 2017 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–6. IEEE (2017)
42. Thönes, J.: Microservices. IEEE Softw. 32(1), 116–116 (2015)
43. Zhang, Y., Deng, R.H., Liu, X., Zheng, D.: Blockchain based efficient and robust fair payment for outsourcing services in cloud computing. Inf. Sci. 462, 262–277 (2018)

Investigating the Robustness and Generalizability of Deep Reinforcement Learning Based Optimal Trade Execution Systems

Siyu Lin and Peter A. Beling

University of Virginia, Charlottesville, VA, USA
{sl5tb,pb3a}@virginia.edu

Abstract. Recently, researchers from both academia and the financial industry have attempted to leverage the cutting-edge techniques of Deep Reinforcement Learning (DRL) to develop autonomous trade execution systems. They desire a system that can learn from trade data and develop trade execution strategies to optimize execution costs. While several researchers have reported success in developing such autonomous trade execution systems, none of them have investigated the robustness of these systems. Despite the power of DRL, overfitting and generalization remain challenges. For real-world applications, especially in financial trading, the DRL system’s robustness is critical, as the cost of wrong decisions could be devastating. In our experiment, we investigate the robustness of DRL-based autonomous trade execution systems by applying the policies learned from one stock to other stocks directly, without any refining or training. As different stocks have different dynamics, they might serve as suitable environments to investigate the systems’ capability to generalize. The results suggest that the DRL systems generalize well to other stocks without further refining or training on the specific stock’s historical trade data. The result is significant from two perspectives: 1) it suggests that DRL-based trade execution systems can generalize well, which gives us and potential users more confidence in their performance; 2) it also presents an opportunity to develop a more sample-efficient system.

Keywords: Deep reinforcement learning · Optimal trade execution · Robustness · Generalizability

1 Introduction

In the modern financial market, electronic trading has gradually replaced traditional floor trading and constitutes a majority of the overall trading volume. Nowadays, brokerage firms compete with each other intensively to provide better execution quality to retail and institutional investors. Optimal trade execution, which concerns how to minimize the trade execution costs of trading a


certain amount of shares within a specified period, is a critical factor of execution quality. From the regulatory perspective, brokers are legally required to execute orders on behalf of their clients to ensure the best execution possible. In the US, such practices are monitored by the Securities and Exchange Commission (SEC) and Financial Industry Regulatory Authority (FINRA). In Europe, MiFID II, a legislative framework instituted by the European Union, regulates financial markets and improves investor protection. Encouraged by recent successful applications of Deep Reinforcement Learning in many challenging areas, researchers from academia and the financial industry have attempted to leverage the DRL techniques to develop autonomous trade execution systems, which could learn from the historical trade data and make optimized trade execution decisions autonomously. The successful development of such systems would significantly improve the efficiency of the existing manually designed, rule-based system or human execution trader and would dramatically change the brokerage industry. Nevmyvaka, Feng, and Kearns have published the first large-scale empirical application of RL to optimal trade execution problems [10]. Hendricks and Wilcox propose to combine the Almgren and Chriss model (AC) and RL algorithm and to create a hybrid framework mapping the states to the proportion of the AC-suggested trading volumes [3]. To address the high dimensions and the complexity of the underlying dynamics of the financial market, Ning et al. [11] adapt and modify the Deep Q-Network (DQN) [9] for optimal trade execution, which combines the deep neural network and the Q-learning, and can address the curse of dimensionality challenge faced by Q-learning. Lin and Beling [6] analyze and demonstrate the flaws when applying a generic Q-learning algorithm and propose a modified DQN algorithm to address the zero-ending inventory constraint. Although the researchers mentioned above have successfully developed autonomous trade execution systems, none of them have provided evidence of the systems’ robustness, which might lead to concerns from potential users. In this article, we investigate the robustness of the DRL based autonomous trade execution systems by applying the learned policies from one stock to others without further training or refining. The result suggests that the performances of autonomous trade execution systems are robust, and it could learn essential underlying dynamics and can leverage that knowledge to different stocks.

2 DRL Based Optimal Trade Execution

In this section, we briefly review some DRL-based optimal trade execution systems, including the states, reward function and assumptions, and discuss how to formulate optimal trade execution as a DRL problem. Deep Reinforcement Learning techniques can be divided into two main categories: 1) value-based Q-learning algorithms such as the Deep Q-Network (DQN) proposed by Google DeepMind [9]; and 2) policy gradient methods such as Proximal Policy Optimization (PPO) proposed by OpenAI [12].


In the following sections, we describe two DRL-based trade execution systems based on DQN and PPO, respectively.

2.1 Preliminaries

Deep Q-Network. Similar to Q-learning, the goal of the DQN agent is to maximize cumulative discounted rewards by making sequential decisions while interacting with the environment. The main difference is that DQN uses a deep neural network to approximate the Q function. At each time step, we would like to obtain the optimal Q function, which obeys the Bellman equation. The rationale behind the Bellman equation is straightforward: if the optimal action-value function $Q^*(s', a')$ were completely known at the next step, the optimal strategy at the current step would be to maximize $\mathbb{E}[r + \gamma Q^*(s', a')]$, where $\gamma$ is the discount factor [8]:

$$Q^*(s, a) = \mathbb{E}_{s' \sim \varepsilon}\big[\, r + \gamma \max_{a'} Q^*(s', a') \,\big|\, s, a \,\big] \qquad (1)$$

The Q-learning algorithm estimates the action-value function by iteratively updating $Q_{i+1}(s, a) = \mathbb{E}_{s' \sim \varepsilon}[\, r + \gamma \max_{a'} Q_i(s', a') \mid s, a \,]$. It has already been demonstrated that the action-value $Q_i$ eventually converges to the optimal action-value $Q^*$ as $i \to \infty$ [13]. Q-learning iteratively updates the Q table to obtain an optimal Q table. However, it suffers from the curse of dimensionality and is not scalable to large-scale problems. The DQN algorithm addresses this challenge faced by Q-learning. It trains a neural network model to approximate the optimal Q function by iteratively minimizing a sequence of loss functions $L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}[(y_i - Q(s, a; \theta_i))^2]$, where $y_i = \mathbb{E}_{s' \sim \varepsilon}[\, r + \gamma \max_{a'} Q^*(s', a'; \theta_{i-1}) \mid s, a \,]$ is the target function and $\rho(s, a)$ refers to the probability distribution of states $s$ and actions $a$. In the DQN algorithm, the target function is usually frozen for a while when optimizing the loss function, to prevent the instability caused by frequent shifts in the target function. The gradient is obtained by differentiating the loss function: $\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \varepsilon}\big[\big(r + \gamma \max_{a'} Q^*(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)\, \nabla_{\theta_i} Q(s, a; \theta_i)\big]$. The model weights can be estimated by optimizing the loss function through stochastic gradient descent algorithms. In addition to its capability of handling high-dimensional problems, the DQN algorithm is model-free and has no assumptions about the dynamics of the environment; it learns the optimal policy by exploring the state-action space. Q-learning is likewise a model-free technique with no assumptions on the environment, but the curse of dimensionality has limited its application to high-dimensional problems. The Deep Q-Network therefore seems a natural choice, because it can handle high-dimensional problems while inheriting Q-learning’s ability to learn from its experiences. Given the abundant amount of information available in the limit order book, we believe that the Deep Q-Network is well suited to utilize this high-dimensional market microstructure information. In this section, we provide the DQN formulation for the optimal trade execution problem and describe the state, action, reward and the algorithm used in the experiment.
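To make the target and loss above concrete, the PyTorch sketch below computes the frozen-target estimate $y_i$ and the squared loss $L_i(\theta_i)$ for one sampled batch. The tensor names, batch layout and the terminal-state masking are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net: nn.Module, target_net: nn.Module, batch, gamma: float = 0.99):
    """Compute L_i(theta_i) = E[(y_i - Q(s, a; theta_i))^2] on a replay-buffer batch."""
    states, actions, rewards, next_states, dones = batch  # tensors from the replay buffer

    # Q(s, a; theta_i) for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), with the target network frozen.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q  # terminal states contribute r only

    return nn.functional.mse_loss(q_sa, y)
```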


Proximal Policy Optimization. The PPO algorithm is a policy gradient algorithm proposed by OpenAI [12]. It has become one of the most popular RL methods due to its state-of-the-art performance as well as its sample efficiency and easy implementation. To optimize policies, it alternates between sampling data and optimizing a “surrogate” objective function. PPO is an on-policy algorithm and applies to both discrete and continuous action spaces. PPO-clip updates policies via

$$\theta_{k+1} = \arg\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_k}}\big[ L(s, a, \theta_k, \theta) \big] \qquad (2)$$

where $\pi$ is the policy, $\theta$ is the policy parameter, $k$ is the $k$th step, and $a$ and $s$ are the action and state, respectively. It typically takes multiple steps of SGD to optimize the objective $L$:

$$L(s, a, \theta_k, \theta) = \min\!\Big( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}\, A^{\pi_{\theta_k}}(s, a),\; \operatorname{clip}\!\Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\, 1-\epsilon,\, 1+\epsilon\Big) A^{\pi_{\theta_k}}(s, a) \Big) \qquad (3)$$

where $A^{\pi_{\theta_k}}(s, a)$ is the advantage estimator. The clip term in Eq. 3 clips the probability ratio and prevents the new policy from going far away from the old policy¹ [12].
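As a minimal sketch of the clipped surrogate objective in Eqs. 2–3, the PyTorch fragment below computes $L(s, a, \theta_k, \theta)$ from log-probabilities under the old and new policies. The tensor names and the advantage estimates are assumed inputs.

```python
import torch

def ppo_clip_objective(new_logp: torch.Tensor, old_logp: torch.Tensor,
                       advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective averaged over a batch (to be maximized)."""
    # Probability ratio pi_theta(a|s) / pi_theta_k(a|s).
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Element-wise minimum of the unclipped and clipped terms, as in Eq. 3.
    return torch.min(unclipped, clipped).mean()
```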

2.2 Problem Formulation

In this article, we investigate two DRL-based systems: 1) the modified DQN algorithm proposed by Lin and Beling [6]; 2) the PPO with Long Short-Term Memory (LSTM) algorithm proposed by Lin and Beling [7]. In the following sections, we describe the problem formulation for the two algorithms, respectively.

States. The state is a vector that describes the current status of the environment. In the setting of optimal trade execution, the states consist of 1) public state: market microstructure variables including the top 5 bid/ask prices and associated quantities and the bid/ask spread; 2) private state: remaining inventory and elapsed time; 3) derived state: we also derive features based on historical LOB states to account for the temporal component in the environment². The derived features can be grouped into three categories: 1) volatility in the VWAP price; 2) percentage of positive changes in the VWAP price; 3) trends in the VWAP price. More specifically, we derive the features based on the past 6, 12, and 24 steps (each step is a 5-second interval), respectively. Additionally, we use several trading volumes, 0.5*TWAP, 1*TWAP, and 1.5*TWAP³, to compute the VWAP price.

¹ https://spinningup.openai.com/en/latest/algorithms/ppo.html
² For example, the differences of immediate rewards between the current time and arrival time at various volume levels, and manually crafted indicators to flag specific market scenarios (i.e., regime shift, a significant trend in price changes, and so on).
³ TWAP represents the trading volume of the TWAP strategy in one step.


It is straightforward to derive the features in 1). For 2), we record the steps at which the current VWAP price increases compared with the previous step and compute the percentage of positive changes. For 3), we calculate the difference between the current VWAP price and the average VWAP prices over the past 6, 12, and 24 steps. Meanwhile, researchers have attempted to leverage LSTM networks to process raw level-2 microstructure data to account for time dependencies in the data, which can reduce the time and effort spent on manually deriving attributes.

Actions. In this article, we choose different numbers of shares to trade based on the liquidity of each stock’s market. As the purpose of the research is to evaluate the capability of the proposed framework to balance the liquidity risk and timing risk, we choose the total number of shares to ensure that the TWAP orders⁴ can consume at least the 2nd best bid price on average. The total shares to trade for each stock are listed in Table 1. In the optimal trade execution framework, we set the range of actions from 0 to 2·TWAP and set the minimal trading volume for each stock (see the sketch after Table 1).

Table 1. Number of shares to trade for each stock in this article.

Ticker | Trading shares
FB     | 6000
GS     | 300
GOOG   | 300
CRM    | 1200
NVDA   | 600
BA     | 300
MSCI   | 300
MCD    | 600
TSLA   | 300
PEP    | 1800
PYPL   | 2400
TWLO   | 600
QCOM   | 7200
WMT    | 3600
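To illustrate how a discrete action index can be mapped onto the 0 to 2·TWAP trading range described above, the helper below spreads 21 action levels evenly over that range; the number of levels matches the 21 output nodes mentioned in Sect. 2.3, but the rounding rule is an assumption.

```python
def action_to_volume(action_index: int, twap_volume: int, num_actions: int = 21) -> int:
    """Map a discrete action index in [0, num_actions) to a trade volume in [0, 2*TWAP]."""
    if not 0 <= action_index < num_actions:
        raise ValueError("action index out of range")
    fraction = action_index / (num_actions - 1)   # 0.0, 0.05, ..., 1.0
    return round(2 * twap_volume * fraction)

# Example: with a TWAP volume of 500 shares per step, action 10 trades exactly 500 shares.
assert action_to_volume(10, twap_volume=500) == 500
```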

Rewards. The reward structure is the key to the DRL algorithm and should be carefully designed to reflect the DRL agent’s goal; otherwise, the DRL agent cannot learn the optimal policy as expected. In previous work [3, 10, 11], researchers use the IS⁵, a commonly used metric to measure execution gain/loss, as the immediate reward received after execution at each non-terminal step. There is a common misconception that the IS is a direct optimization goal. The IS compares the model performance to an idealized policy that assumes infinite liquidity at the arrival price.

⁴ TWAP order = (total number of shares to trade) / (total number of periods).
⁵ Implementation Shortfall = arrival price × traded volume − executed price × traded volume.


In the real world, brokers often use TWAP and VWAP as benchmarks. IS is usually used as the optimization goal because the TWAP and VWAP prices can only be computed at the end of the trading horizon and cannot be used for real-time optimization. Essentially, IS is just a surrogate reward function, not a direct optimization goal. Hence, a reward structure is good as long as it improves model performance.

Shaped Rewards. In contrast to previous work [3, 10, 11], the DQN algorithm proposed by Lin and Beling [6] uses a shaped reward structure. Lin and Beling [6] point out that the IS reward is noisy and nonstationary, which makes the learning process difficult. They propose the shaped reward structure to remove the impact of trends and standardize the reward signals.

Sparse Rewards. Although the shaped reward structure seems to generalize reasonably well on the remaining equities, as demonstrated in their article, designing a complicated reward structure is time-consuming and risks overfitting in real-world applications. In contrast to previous approaches, Lin and Beling propose to use a sparse reward, which only gives the agent a reward signal based on its relative performance compared against the TWAP baseline model [7].

Zero-Ending Inventory Constraint. In real-world business, brokers receive contracts or directives from their clients to execute a certain amount of shares within a specific time. For the brokers, it is mandatory to liquidate all the shares by the end of the trading period. Lin and Beling [6] modify the Q-function update to combine the last two steps for Q-function estimation, incorporating the zero-ending inventory constraint. We leverage their method by combining the last two steps.
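A rough sketch of the implementation-shortfall bookkeeping and a TWAP-relative sparse reward is given below. The IS function follows the definition in the footnote above, while the ±1/0 scheme is only an illustrative stand-in for the sparse reward proposed in [7], not the authors' exact formulation.

```python
def implementation_shortfall(arrival_price: float, fills: list[tuple[float, int]]) -> float:
    """IS = arrival price * traded volume - sum(executed price * traded volume)."""
    traded = sum(volume for _, volume in fills)
    executed_value = sum(price * volume for price, volume in fills)
    return arrival_price * traded - executed_value

def sparse_reward(model_is: float, twap_is: float, tol: float = 0.0) -> float:
    """End-of-episode reward based only on performance relative to the TWAP baseline."""
    delta = model_is - twap_is          # Delta IS > 0 means the agent beat TWAP
    if delta > tol:
        return 1.0
    if delta < -tol:
        return -1.0
    return 0.0
```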

2.3 Architecture

DQN Architecture and Extensions. The vanilla DQN algorithm addresses the instability issues of using a nonlinear function approximator through techniques such as experience replay and the target network. In this article, we also incorporate several other techniques to further improve its stability and performance, as illustrated in Fig. 1. The architectures are chosen with Ray’s Tune platform by comparing the model performance of all combinations of these extensions on Facebook data only, which is equivalent to an ablation study. A brief description of the DQN architecture and the techniques we use follows. In the DQN algorithm, we use the exact same architecture as in Lin and Beling’s article: a fully connected feedforward neural network with two hidden layers, 128 hidden nodes in each hidden layer, and a ReLU activation function in each hidden node. The input layer has 51 nodes, including private attributes such as remaining inventory and time elapsed as well as derived attributes based on public market information. The output layer has 21 nodes with a linear activation function. The Adam optimizer is chosen for weight optimization [6]. It is well known that the maximization step tends to overestimate the Q function. van Hasselt [14] addresses this issue by decoupling the action selection from its evaluation.


In 2016, van Hasselt et al. [15] successfully integrated this technique with DQN and demonstrated that the double DQN could reduce the harmful overestimation and improve the performance of DQN. The only change is in the loss function below:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \varepsilon}\Big[\Big(r + \gamma\, Q\big(s', \arg\max_{a'} Q^*(s', a'; \theta_{i-1})\big) - Q(s, a; \theta_i)\Big)\, \nabla_{\theta_i} Q(s, a; \theta_i)\Big] \qquad (4)$$

The dueling network architecture consists of two streams of computation, with one stream representing state values and the other representing action advantages. The two streams are combined by an aggregator to output an estimate of the Q function. Wang et al. have demonstrated the dueling architecture’s capability to learn the state-value function more efficiently [16]. Noisy nets are proposed to address the limitations of the ε-greedy exploration policy in vanilla DQN. A linear layer with a deterministic and a noisy stream, as shown in Eq. 5, replaces the standard linear layer $y = b + Wx$. The noisy net enables the network to learn to ignore the noisy stream and enhances the effectiveness of exploration [2]:

$$y = (b + Wx) + \big(b_{\mathrm{noisy}} \odot \epsilon^{b} + (W_{\mathrm{noisy}} \odot \epsilon^{w})\,x\big) \qquad (5)$$
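For reference, a bare-bones version of the fully connected Q-network described above (51 inputs, two hidden layers of 128 ReLU units, 21 linear outputs) is sketched in PyTorch. The dueling streams and noisy layers discussed in this section are omitted for brevity, so this is a simplified assumption rather than the full architecture used in the experiments.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Feedforward Q-network: 51 state features -> 21 action values."""
    def __init__(self, state_dim: int = 51, num_actions: int = 21, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),   # linear output layer over action values
        )

    def forward(self, state):
        return self.net(state)
```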

PPO Architecture. Unlike the DQN algorithm, the PPO algorithm optimizes over the policy $\pi_t$ directly and finds the optimal state-value function $V_t^*$. We leverage Ray’s Tune platform to select the network architecture and hyperparameters by comparing model performance on Facebook data only, which is equivalent to an ablation study. The selected network architectures are illustrated in Fig. 1. In the PPO algorithm, we have implemented the same network architecture as in Lin and Beling’s article [7]: an FCN with two hidden layers, 128 hidden nodes in each hidden layer, and a ReLU activation function in each hidden node. The input layer has 22 nodes, including private attributes such as remaining inventory and time elapsed as well as the LOB attributes such as the 5-level bid/ask prices and volumes. After that, we concatenate the model output with the previous reward and action and feed them to an LSTM network with a cell size of 128. The LSTM outputs the policy $\pi_t$ and the state-value function $V_t^*$. Unlike feedforward neural networks, the LSTM has feedback connections, which allow it to store information and identify temporal patterns. LSTM networks are composed of a number of cells, and an input gate, an output gate and a forget gate within each cell regulate the flow of information into and out of the cell. It was proposed by Sepp Hochreiter and Jürgen Schmidhuber to solve the exploding and vanishing gradient problems encountered by traditional recurrent neural networks (RNNs) [4]. Instead of manually designing attributes to represent the temporal patterns, we leverage the LSTM to process the sequential LOB data and automatically identify temporal patterns within the data.
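The FCN + LSTM arrangement described above (22 LOB/private features, two 128-unit hidden layers, concatenation with the previous action and reward, a 128-cell LSTM, and separate policy/value heads) can be sketched as follows. This is an assumption-laden simplification in PyTorch, not the authors' code.

```python
import torch
import torch.nn as nn

class PPOLstmNet(nn.Module):
    """FCN encoder followed by an LSTM that outputs a policy and a state value."""
    def __init__(self, obs_dim: int = 22, num_actions: int = 21, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Encoder output is concatenated with the previous action (one-hot) and reward.
        self.lstm = nn.LSTM(hidden + num_actions + 1, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)  # logits for the policy
        self.value_head = nn.Linear(hidden, 1)              # state-value estimate

    def forward(self, obs, prev_action_onehot, prev_reward, hidden_state=None):
        # obs: (batch, time, obs_dim); prev_reward: (batch, time, 1)
        x = self.encoder(obs)
        x = torch.cat([x, prev_action_onehot, prev_reward], dim=-1)
        out, hidden_state = self.lstm(x, hidden_state)
        return self.policy_head(out), self.value_head(out), hidden_state
```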

Fig. 1. DQN and PPO architectures: (1) DQN architecture; (2) FCN + LSTM.

2.4 Assumptions

The most critical assumption in our experiment is that the actions of the DRL agents have only a temporary market impact, and that the market is resilient and bounces back to the equilibrium level at the next time step. The market resilience assumption is the core assumption of this article and of all previous research applying RL to optimal trade execution problems [3, 10, 11]. The reason is that we are training and testing on historical data and cannot account for the permanent market impact. However, the equities we choose in this article are liquid, and the actions are relatively small compared with the market volumes; therefore, the assumption should be reasonable. Secondly, we ignore commissions and exchange fees, as our research is primarily aimed at institutional investors, for whom those fees are relatively small fractions and are negligible. Thirdly, we apply a quadratic penalty if the trading volume exceeds the available quantities at the top 5 bids. Finally, the remaining unexecuted shares are liquidated at the last time step to ensure the execution of all the shares.

3 Experimental Results

In this section, we compare the performance of the DRL models trained on each stock’s data with the DRL models trained on FB only (applied to the other stocks without retraining). From the perspective of learning stability, both the DQN and PPO LSTM based models converge fast and significantly outperform the baseline models on most stocks during backtesting. In our experiment, we apply DeepMind’s framework to assess the stability in the training phase and the performance evaluation in the backtesting [8].


3.1 Data Sources

We use the NYSE daily millisecond TAQ data from January 1st, 2018 to December 31st, 2018, downloaded from WRDS. The TAQ data is used to reconstruct the LOB. Only the top 5 price levels from both the seller and buyer sides are kept, aggregated at 5-second intervals, which is the minimum trading interval.

3.2 Experimental Methodology and Settings

In our experiments, we apply both the DQN and PPO LSTM algorithms to 14 stocks: Facebook (FB), Google (GOOG), Nvidia (NVDA), MSCI (MSCI), Tesla (TSLA), PayPal (PYPL), Qualcomm (QCOM), Goldman Sachs (GS), Salesforce.com (CRM), Boeing (BA), McDonald’s Corp (MCD), PepsiCo (PEP), Twilio (TWLO), and Walmart (WMT), which cover the technology, financial and retail industries. We tune the hyperparameters on FB only and apply the same neural network architecture to the remaining stocks due to limited computing resources. The experiment follows the steps below.

1. We obtain one year of millisecond Trade and Quote (TAQ) data for the 14 stocks above from WRDS and reconstruct it into the LOB. Then, we split the data into training (January–September) and test sets (October–December). We set the execution horizon to 1 min and the minimum trading interval to 5 s.
2. The hyperparameters⁶ are tuned on FB only due to the limited computing resources. After fine-tuning, we apply the PPO architecture and hyperparameters to the other stocks.
3. Upon the completion of training, we check the progression of the average episode rewards. We then apply the learned policies to the testing data and compare the average episode rewards against the TWAP, VWAP, and AC models. We also compare the performance of the DRL models trained on every stock and the DRL models trained on FB only.

3.3 Algorithms

TWAP: The shares are equally divided across time.
VWAP: The shares are traded at a price which closely tracks the VWAP [5].
AC: AC is defined in [1]. For a fair comparison with AC, we set its permanent price impact parameter to 0.
DQN (Lin2019): The modified DQN algorithm proposed by Lin and Beling [6].
PPO LSTM (Lin2020): The PPO algorithm that uses an LSTM to extract temporal patterns [7].

⁶ Please refer to the appendix for the chosen hyperparameters.

3.4 Main Evaluation and Backtesting

To evaluate the performance of the trained DRL algorithms, we apply them to the test samples from October 2018 to December 2018 for all 14 stocks and compare their performance with the TWAP, VWAP, and AC models. We report the mean of $\Delta IS = IS_{\mathrm{Model}} - IS_{\mathrm{TWAP}}$ (in US dollars) and the standard deviation of $IS_{\mathrm{Model}}$. Additionally, we also report the gain-loss ratio (GLR):

$$\mathrm{GLR} = \frac{\mathbb{E}[\Delta IS \mid \Delta IS > 0]}{\mathbb{E}[-\Delta IS \mid \Delta IS < 0]} \qquad (6)$$
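For completeness, the GLR in Eq. 6 can be computed from a vector of ΔIS values as in the short NumPy sketch below (illustrative only).

```python
import numpy as np

def gain_loss_ratio(delta_is: np.ndarray) -> float:
    """GLR = E[dIS | dIS > 0] / E[-dIS | dIS < 0]."""
    gains = delta_is[delta_is > 0]
    losses = -delta_is[delta_is < 0]
    if gains.size == 0 or losses.size == 0:
        return float("nan")   # undefined when there are no gains or no losses
    return gains.mean() / losses.mean()
```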

The statistical results for all the stocks are summarized in Table 2. We observe that the DQN and PPO LSTM algorithms outperform the other models most of the time, while maintaining relatively smaller standard deviations. In the experiments, we also find that both the DQN and PPO LSTM models trained only on FB perform reasonably well. Although their performance is usually not as good as that of the models trained on each stock, they reduce the risk of overfitting and perform significantly better on some stocks, such as MSCI (DQN) and GOOG, GS, and MCD (PPO LSTM). It is worth mentioning that both DQN and PPO LSTM’s learning curves on the above stocks (MSCI, GOOG, GS, and MCD) had converged, which suggests that learning curves are not always helpful for predicting model performance on new data.

4 Conclusion and Future Work

In this article, we investigate the robustness of two DRL-based trade execution models. In our experiments, we have demonstrated that both DRL models outperform the TWAP, VWAP, and AC models. We have also shown that both the DQN and PPO LSTM models are robust and can perform well on other stocks while trained on FB data only. The experimental results are significant and indicate that the DRL models can learn essential dynamics that govern the process of trade execution and generalize well to new stocks. This also presents an opportunity for us to develop sample-efficient autonomous trade execution systems in the future, with the potential of significantly reducing the costs spent on data acquisition and computing.

Appendix: Hyperparameters

We fine-tuned the hyperparameters on FB only; we did not perform an exhaustive grid search over the hyperparameter space, but rather drew random samples from the hyperparameter space due to limited computing resources. Finally, we chose the hyperparameters listed in Tables 3 and 4.


Table 2. Model performance comparison. Mean is based on ΔIS (in US$) and STD is based on IS (in US$). Mean1, STD1, and GLR1 are the performance metrics for models trained on every ticker, while Mean2, STD2, and GLR2 are for models trained on FB only and applied to the other tickers.

Mean (ΔIS):
Ticker | TWAP | VWAP     | AC      | DQN Mean1 | DQN Mean2 | PPO LSTM Mean1 | PPO LSTM Mean2
BA     | 0.00 | 49.66    | −0.78   | 16.21     | 6.21      | 72.53          | 55.61
CRM    | 0.00 | −120.89  | −4.08   | 9.05      | 17.17     | 101.87         | 80.44
FB     | 0.00 | −4040.78 | −40.26  | 54.08     | 54.08     | 128.23         | 128.23
GOOG   | 0.00 | −65.41   | −0.92   | 28.43     | 28.18     | 18.06          | 117.73
GS     | 0.00 | −12.58   | −0.16   | 2.40      | 1.16      | −1.81          | 9.10
MCD    | 0.00 | −7.89    | −0.13   | 11.86     | 3.63      | −178.71        | 43.68
MSCI   | 0.00 | 34.94    | −0.34   | −770.24   | 6.78      | 35.50          | 25.57
NVDA   | 0.00 | −31.46   | −1.32   | 19.50     | 6.17      | 71.32          | 34.96
PEP    | 0.00 | −77.54   | −0.34   | 38.45     | 35.38     | 76.78          | 106.84
PYPL   | 0.00 | −469.61  | −7.84   | 47.35     | 21.82     | 114.38         | 99.54
QCOM   | 0.00 | −6737.10 | −127.87 | 152.02    | 59.59     | 321.43         | 272.52
TSLA   | 0.00 | −63.55   | −0.88   | 3.30      | 5.35      | 46.96          | 41.26
TWLO   | 0.00 | −41.27   | −2.96   | 4.63      | 4.67      | 21.94          | 21.55
WMT    | 0.00 | −1376.78 | −2.24   | 99.56     | 49.29     | 153.58         | 217.73

Standard deviation (IS):
Ticker | TWAP    | VWAP     | AC      | DQN STD1 | DQN STD2 | PPO LSTM STD1 | PPO LSTM STD2
BA     | 335.05  | 503.64   | 335.80  | 326.62   | 331.10   | 294.09        | 285.60
CRM    | 836.69  | 1344.44  | 849.32  | 830.79   | 835.70   | 859.73        | 822.37
FB     | 4154.47 | 14936.29 | 4517.32 | 4,117.55 | 4,117.55 | 4,462.66      | 4,462.66
GOOG   | 707.03  | 1394.18  | 707.40  | 696.35   | 692.24   | 690.63        | 636.15
GS     | 92.74   | 190.89   | 93.19   | 91.07    | 92.02    | 113.71        | 85.94
MCD    | 221.72  | 420.70   | 222.10  | 214.50   | 219.36   | 576.29        | 181.03
MSCI   | 159.14  | 220.25   | 159.59  | 642.64   | 156.64   | 145.73        | 144.77
NVDA   | 478.18  | 726.23   | 479.65  | 454.73   | 468.30   | 431.11        | 452.77
PEP    | 1058.16 | 2389.84  | 1058.48 | 1,033.27 | 1,048.05 | 997.11        | 991.28
PYPL   | 672.69  | 756.79   | 584.96  | 2580.74  | 631.29   | 654.51        | 624.24
QCOM   | 2913.72 | 27131.03 | 4724.56 | 2,821.33 | 2,875.79 | 2,654.01      | 2,664.37
TSLA   | 224.02  | 488.35   | 227.79  | 226.80   | 223.35   | 183.42        | 190.96
TWLO   | 196.88  | 341.32   | 215.42  | 196.34   | 194.60   | 186.11        | 185.21
WMT    | 1442.92 | 6699.03  | 1456.07 | 1,386.26 | 1,416.75 | 1,355.22      | 1,321.25

GLR:
Ticker | TWAP | VWAP | AC   | DQN GLR1 | DQN GLR2 | PPO LSTM GLR1 | PPO LSTM GLR2
BA     | –    | 0.59 | 0.04 | 1.47     | 1.44     | 1.54          | 1.94
CRM    | –    | 0.50 | 0.02 | 1.18     | 1.25     | 1.43          | 1.41
FB     | –    | 0.16 | 0.00 | 1.04     | 1.04     | 0.63          | 0.63
GOOG   | –    | 0.46 | 0.10 | 1.28     | 1.87     | 1.34          | 2.11
GS     | –    | 0.46 | 0.16 | 1.40     | 1.16     | 0.58          | 1.05
MCD    | –    | 0.49 | 0.11 | 2.00     | 1.26     | 0.23          | 3.24
MSCI   | –    | 0.72 | 0.15 | 0.09     | 2.08     | 1.55          | 2.35
NVDA   | –    | 0.51 | 0.06 | 1.54     | 1.25     | 1.21          | 0.91
PEP    | –    | 0.40 | 0.18 | 1.49     | 1.49     | 1.30          | 1.64
PYPL   | –    | 0.22 | 0.01 | 2.13     | 1.43     | 1.64          | 2.52
QCOM   | –    | 0.08 | 0.01 | 1.53     | 1.06     | 1.63          | 1.86
TSLA   | –    | 0.41 | 0.11 | 1.27     | 1.26     | 1.35          | 2.38
TWLO   | –    | 0.41 | 0.05 | 1.11     | 1.37     | 1.49          | 1.67
WMT    | –    | 0.17 | 0.02 | 1.94     | 1.35     | 1.32          | 2.23


Table 3. Hyperparameters for DQN (Lin2019).
Hyperparameter | DQN (Lin2019) | Description
Minibatch size | 16 | Size of a batch sampled from the replay buffer for training
Replay memory size | 10000 | Size of the replay buffer
Target network update frequency | 2000 | Update the target network every 'target network update freq' steps
Sample batch size | 4 | Update the replay buffer with this many samples at once
Discount factor | 0.99 | Discount factor gamma used in the Q-learning update
Timesteps per iteration | 100 | Number of env steps to optimize for before returning
Learning rate | 0.0001 | The learning rate used by the Adam optimizer
Initial exploration | 0.15 | Fraction of the entire training period over which the exploration rate is annealed
Final exploration | 0.05 | Final value of the random action probability
Final exploration frame | 125000 | Final step of exploration
Replay start size | 5000 | The step at which learning starts; a uniform random policy is run before this step
Number of hidden layers | 2 for both main and noisy networks | All of the main, dueling, and noisy networks have 2 hidden layers
Number of hidden nodes | [128,128] for main network and [32,32] for noisy network | The main and dueling networks have 128 hidden nodes per hidden layer, while the noisy network has 32 hidden nodes per hidden layer
Number of input/output nodes | Input: 51; output: 21 | The input layer has 51 nodes and the output layer has 21 nodes
Activation functions | ReLU for hidden layers and linear for output layer | ReLU activation for the hidden layers and a linear function for the output layer
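To make the exploration settings in Table 3 concrete, the following is a minimal sketch of a linear ε-annealing schedule consistent with those values (initial exploration 0.15, final exploration 0.05, final exploration frame 125000); the function name and its usage are illustrative assumptions rather than the authors' implementation.

```python
def epsilon_schedule(step, initial_eps=0.15, final_eps=0.05, final_frame=125_000):
    """Linearly anneal the exploration rate from initial_eps to final_eps.

    After final_frame steps the rate stays at final_eps.
    """
    if step >= final_frame:
        return final_eps
    frac = step / final_frame
    return initial_eps + frac * (final_eps - initial_eps)

# Example: exploration rate at a few training steps
for s in (0, 50_000, 125_000, 200_000):
    print(s, round(epsilon_schedule(s), 4))
```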

Table 4. Hyperparameters for PPO LSTM (Lin2020).
Hyperparameter | PPO LSTM (Lin2020)
Minibatch size | 32
Sample batch size | 5
Train batch size | 240
Discount factor | 1
Learning rate | Linearly annealed between 5e-5 and 1e-5
KL coeff | 0.2
VF loss coeff | 1
Entropy coeff | 0.01
Clip param | 0.2
Hidden layers | 2 hidden layers with 128 hidden nodes each
Activation functions | ReLU for hidden layers and linear for output layer
# input/output nodes | Input: 22; output: 51
Maximum sequence length | 12
LSTM cell size | 128

Training and Stability

Assessing stability and model performance during the training phase is straightforward in supervised learning, where one simply evaluates the training and testing samples. It is more challenging to evaluate and track an RL agent's progress during training, since we usually use the average episode reward gained by the agent over multiple episodes as the evaluation metric. The average episode reward is usually noisy because updates to the policy parameters can substantially change the distribution of states that the DRL agent visits. In Fig. 2, we observe that the DRL agents converge quickly, in fewer than 200,000 steps (odd-numbered panels show the learning curves for the IS, while even-numbered panels show the learning curves for the total number of steps to complete the trades). The PPO LSTM models converge to higher average IS values on most stocks while having shorter trading lengths, which indicates that they not only reduce execution costs but also trade more quickly to avoid timing risk.
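Because the raw average episode reward is noisy, training curves like those in Fig. 2 are typically smoothed before being inspected; a simple trailing moving-average smoother such as the sketch below (an assumption about post-processing, not part of the authors' pipeline) is sufficient for that purpose.

```python
def moving_average(values, window=20):
    """Smooth a noisy training curve with a simple trailing moving average."""
    smoothed, running_sum = [], 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

# Example: smooth per-episode implementation shortfalls logged during training
episode_is = [120.0, -35.0, 60.0, 15.0, -80.0, 45.0, 30.0]
print(moving_average(episode_is, window=3))
```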

[Fig. 2 consists of 28 panels, two per stock: (1)-(2) Boeing, (3)-(4) Salesforce, (5)-(6) Facebook, (7)-(8) Google, (9)-(10) Goldman Sachs, (11)-(12) McDonald's, (13)-(14) MSCI, (15)-(16) NVIDIA, (17)-(18) PepsiCo, (19)-(20) PayPal, (21)-(22) QUALCOMM, (23)-(24) Tesla, (25)-(26) Twilio, (27)-(28) Walmart; odd-numbered panels show IS and even-numbered panels show trading steps, for both PPO LSTM (Lin2020) and DQN (Lin2019).]

Fig. 2. Training curves tracking the DRL agent's average implementation shortfall in US$ (y-axis for odd-numbered panels) and average trading steps per episode (y-axis for even-numbered panels) against training steps (x-axis): (a) each point is the average IS per episode; (b) average trading steps per episode.


References
1. Almgren, R., Chriss, N.: Optimal execution of portfolio transactions. J. Risk 3, 5-40 (2000)
2. Fortunato, M., Azar, M.G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C.: Noisy networks for exploration. In: ICLR (2018)
3. Hendricks, D., Wilcox, D.: A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution. In: Proceedings of the IEEE Conference on Computational Intelligence for Financial Economics and Engineering, pp. 57-464, London, UK. IEEE, March 2014
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735-1780 (1997)
5. Kakade, S.M., Kearns, M., Mansour, Y., Ortiz, L.E.: Competitive algorithms for VWAP and limit order trading. In: Proceedings of the ACM Conference on Electronic Commerce, New York, NY (2004)
6. Lin, S., Beling, P.A.: Optimal liquidation with deep reinforcement learning. In: 33rd Conference on Neural Information Processing Systems (NeurIPS): Deep Reinforcement Learning Workshop, Vancouver, Canada (2019)
7. Lin, S., Beling, P.A.: An end-to-end optimal trade execution framework based on proximal policy optimization. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 4548-4554. International Joint Conferences on Artificial Intelligence Organization, July 2020
8. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
9. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518, 529-533 (2015)
10. Nevmyvaka, Y., Feng, Y., Kearns, M.: Reinforcement learning for optimal trade execution. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 673-68, Pittsburgh, PA, June 2006. Association for Computing Machinery (2006)
11. Ning, B., Lin, F.H.T., Jaimungal, S.: Double deep Q-learning for optimal execution. arXiv:1812.06600 (2018)
12. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1811.08540v2 (2017)
13. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
14. van Hasselt, H.: Double Q-learning. In: Advances in Neural Information Processing Systems, pp. 2613-2621, Vancouver, British Columbia, Canada, December 2010
15. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI Conference on Artificial Intelligence (2016)
16. Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., Freitas, N.: Dueling network architectures for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 1995-2003, New York, NY. JMLR, June 2016

On Fairness in Voting Consensus Protocols

Sebastian Müller1, Andreas Penzkofer2, Darcy Camargo2, and Olivia Saa2(B)

1 Aix Marseille Université, CNRS, Centrale Marseille, I2M - UMR 7373, 13453 Marseille, France
[email protected]
2 IOTA Foundation, 10405 Berlin, Germany
{andreas.penzkofer,darcy.camargo,olivia.saa}@iota.org

Abstract. Voting algorithms have been widely used as consensus protocols in the realization of fault-tolerant systems. These algorithms are best suited for distributed systems of nodes with low computational power or for heterogeneous networks, where different nodes may have different levels of reputation or weight. Our main contribution is the construction of a fair voting protocol, in the sense that the influence of a given participant on the eventual outcome is linear in its weight. Specifically, the fairness property guarantees that any node can actively participate in the consensus finding even with low resources or weight. We investigate effects that may arise from weighted voting, such as centralization, loss of anonymity, and scalability issues, and discuss their relevance to protocol design and implementation.

Keywords: Fairness · Voting consensus protocols · Heterogeneous network · Sybil attack

1 Introduction

1.1 Preliminaries

Weighted voting in distributed systems increases efficiency and network reliability, but it also raises additional risks. It deviates from the one node-one vote principle to allow a defense against Sybil attacks, provided the weight corresponds to a scarce resource or to recurring costs and fees, e.g. [1]. However, weighted voting may induce a loss of anonymity for the nodes and may incentivize centralization. In most systems that are based on weighted voting, all participants are allowed to vote, and a centralized entity counts the votes and takes a decision. In the decentralized setting, participants or nodes have to find consensus on the outcome of the vote. In the case where the number of nodes is high, and not all members can communicate with all other nodes, protocols where nodes only sample a certain number of other nodes can be used to aggregate information, [2], or to find consensus, [3-9].


This article focuses on a certain class of voting consensus protocols, namely binary majority consensus. Some basic algorithms in this protocol class are simple majority consensus [2], Gács-Kurdyumov-Levin [3], and random neighbors majority [7]. The basic idea of these majority voting protocols is that nodes query other nodes about their current opinion and adjust their own opinion over several rounds based on the proportion of the opinions they have observed. Voting protocols have been successfully applied in a wide range of engineering and economic applications [10-12], and have led to the emerging science of sociophysics [13]. Recently, [8] introduced the fast probabilistic consensus (FPC) protocol, an amelioration of the classical consensus voting protocols that is efficient and robust in a Byzantine infrastructure. This protocol is studied in more detail in [9] and serves as the reference voting consensus protocol for this note.

1.2 Weights and Fairness

The main contribution of this work is to define an adaptation of the majority consensus protocol to a setting that allows nodes to have different weights. We define the voting power, Sect. 4, as a power index that describes the influence of each node on the outcome of the consensus protocol. In comparison with standard voting models, where the whole population is sampled, there are two places where the weights may enter: first, in the sampling of the nodes to query, and second, in the weighting of the votes when the majority rule is applied. An essential consequence of the fairness of the protocol is that it allows a defense against Sybil attacks. Protection against Sybil attacks is especially needed in permissionless settings, where a malicious actor may gain a disproportionately large influence on the voting by creating a large number of identities. Moreover, fairness allows even nodes with very few resources or little weight to participate in the consensus finding. This property is particularly important in networks where the combined weight of nodes with small weight is considerable. Besides technical and economic considerations, we want to mention the possible social impacts of an unfair system. For instance, unfair situations can make participants of the network unhappy, and this should be a real consideration in judging the efficiency and adoption of a protocol, e.g. [14]. In particular, this is of importance in community-driven projects such as IOTA.

1.3 IOTA

IOTA is an open-source distributed ledger and cryptocurrency designed for the Internet of Things. For the next generation of the IOTA protocol, [15] introduces mana in various places in order to obtain fairness. For instance, the protocol uses mana to obtain fairness in rate control, [16,17], where an adaptive PoW difficulty guarantees that any node, even with low hashing power, can achieve similar throughput for a given amount of mana. This note discusses how mana is used as a weight in FPC to construct a fair consensus protocol.

1.4 Outline

The rest of the paper is organized as follows. After giving an overview of previous work in Sect. 2, we give a brief introduction to voting consensus protocols and FPC in Sect. 3. The main contribution is the formulation of a proper mathematical framework in Sect. 4 and the construction of a weighted voting consensus protocol that is fair in the sense that the voting power is proportional to the weights of the nodes. Section 5 proposes a Zipf law for modeling the weight distribution. Under the validity of the Zipf law, we discuss in Sect. 6 the impacts of weighted voting on the scalability and implementation of the protocol. Section 7 presents simulation results that show the behavior of the protocol under a Byzantine infrastructure for different degrees of centralization of the weights. We conclude in Sect. 8 with a discussion.

2 Related Work

2.1 Weighted Voting Consensus Protocols

Voting consensus protocols (without weights) are widely studied in theory and applications, and they play an important role in social learning. Weighted voting systems also have a long history in election procedures, where one is often interested in measuring the influence or power of the given participants, [18]. Despite these facts, we were not able to find related results on voting consensus protocols with weights. Recently, [19] described a model that is related to FPC and that considers biased or stubborn agents. For the sake of brevity, we refer to [8,9] for more details on related work and references therein.

2.2 Fairness

Fairness plays a prominent role in many areas of science and applications. It is, therefore, not astonishing that it plays its part also in DLT. For instance, PoW in Nakamoto consensus ensures that the probability of creating a new block is proportional to the computational power of a node; see [20] for an axiomatic approach to block rewards and further references. In PoS-based blockchains, the probability of creating a new block is usually precisely proportional to the node’s balance. However, this does not always have to be the optimal choice, [21, 22].

3 Voting Consensus Protocols

We give a brief definition of binary voting protocols in this section. More details can be found, for instance, in [3-7]. We refer to [8,9] for more information on the fast probabilistic consensus (FPC). To define the protocol accurately, we introduce some notation. We consider a network with N nodes that are indexed by 1, 2, . . . , N. We assume that every node can query any other node. This assumption is made for the sake of a clearer presentation and is not necessary. In fact, simulation studies [7,9] show that it is sufficient if every node is able to query about half of the other nodes. It also seems to be a reasonable assumption because nodes with high weights will likely be known to every participant in the network. Every node i has an opinion or state. The opinion of a node i at time t is denoted by s_i(t) and takes values in {0, 1}. Every node i starts with an initial opinion s_i(0). At each (discrete) time step a node i chooses k random nodes C_i = C_i(t) and queries their opinions. Denote by k_i(t) ≤ k the number of replies received by node i at time t. If the reply from j is not received in due time, we set s_j(t) = 0. The updated mean opinion is then
\[ \eta_i(t+1) = \frac{1}{k_i(t)} \sum_{j \in C_i} s_j(t). \]

Note that repetitions are possible since the sample C_i of a node i is chosen using sampling with replacement. We consider a basic version of FPC introduced in [9]. Specifically, we remove the cooling phase of FPC and the randomness of the initial threshold τ. We consider i.i.d. random variables U_t, t = 1, 2, . . . , with law Unif([β, 1 − β]) for some β ∈ [0, 1/2]. The update rule for the opinion of a node i is then given by
\[ s_i(1) = \begin{cases} 1, & \text{if } \eta_i(1) \geq \tau, \\ 0, & \text{otherwise,} \end{cases} \]
and, for subsequent rounds, i.e. t ≥ 1,
\[ s_i(t+1) = \begin{cases} 1, & \text{if } \eta_i(t+1) > U_t, \\ 0, & \text{if } \eta_i(t+1) < U_t, \\ s_i(t), & \text{otherwise.} \end{cases} \]
Note that if τ = β = 0.5, the protocol reduces to a standard majority consensus. An asymmetric choice of τ, i.e. τ ≠ 0.5, allows the protocol to distinguish between the two kinds of integrity failure, [8,9]. It is important that the sequence of random variables U_t is the same for all nodes. A more detailed discussion on the use of decentralized random number generators is given in [9]. In contrast to many theoretical papers on majority dynamics, a local termination rule is needed for practical applications. Every node keeps a counter cnt that is incremented by 1 if there is no change in its opinion and reset to 0 if there is a change of opinion. The node considers the current state as final if the counter reaches a certain threshold l, i.e., cnt ≥ l. In the absence of autonomous termination, the protocol stops after maxIt iterations.
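The update rule above can be turned into a small simulation; the following is a minimal sketch of unweighted FPC rounds under the stated assumptions (full network visibility, all replies received), with parameter values chosen only for illustration.

```python
import random

def fpc_round(opinions, k, threshold, rng):
    """One FPC voting round: each node queries k random nodes (with replacement)
    and adopts opinion 1 (resp. 0) if the observed mean opinion is above (resp. below)
    the round threshold; on a tie it keeps its previous opinion."""
    n = len(opinions)
    new_opinions = []
    for i in range(n):
        sample = [opinions[rng.randrange(n)] for _ in range(k)]
        eta = sum(sample) / k
        if eta > threshold:
            new_opinions.append(1)
        elif eta < threshold:
            new_opinions.append(0)
        else:
            new_opinions.append(opinions[i])
    return new_opinions

rng = random.Random(42)
beta, tau, k = 0.3, 0.66, 20                             # illustrative values
opinions = [1 if rng.random() < tau else 0 for _ in range(1000)]
opinions = fpc_round(opinions, k, tau, rng)              # first round uses tau
for t in range(10):                                      # subsequent rounds use U_t
    u_t = rng.uniform(beta, 1 - beta)                    # common random threshold
    opinions = fpc_round(opinions, k, u_t, rng)
print(sum(opinions) / len(opinions))                     # final average opinion
```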

4 Fairness

In this section, we propose a proper mathematical framework of fairness. The weights of the N nodes are given by {m_1, . . . , m_N} with \(\sum_{i=1}^{N} m_i = 1\). In the sampling of the queries a node j is chosen with probability
\[ p_j = \frac{f(m_j)}{\sum_{i=1}^{N} f(m_i)}. \tag{1} \]


Each opinion s_j(t) is weighted by g_j = g(m_j):
\[ \eta_i(t+1) = \frac{1}{\sum_{j \in C_i} g_j} \sum_{j \in C_i} g_j\, s_j(t). \tag{2} \]

The other parts of the protocol remain unchanged. Since we sample with replacement, we count the number of times a node i is chosen and denote this number by y_i. Using the fact that the sampling can be expressed by a multinomial distribution, we can calculate the expected result of a query as
\[ \mathbb{E}\,\eta(t+1) = \sum_{i=1}^{N} s_i(t)\, v_i, \tag{3} \]
where
\[ v_i = \sum_{y \in \mathbb{N}^N :\ \sum_i y_i = k} \frac{k!}{y_1! \cdots y_N!}\, \frac{y_i g_i}{\sum_{n=1}^{N} y_n g_n} \prod_{j=1}^{N} p_j^{y_j}. \tag{4} \]

We call v_i the voting power of node i. This quantity measures the influence of node i. We would like the voting power to be proportional to the weight since this would induce a robustness to splitting and merging, [23].

Definition 1 (Robust to Splitting and Sybil Attacks). A voting scheme is robust to Sybil attacks if a node i splits into nodes i1 and i2 with a weight splitting ratio x ∈ (0, 1), then
\[ v_i(m_i) \geq v_{i_1}(x m_i) + v_{i_2}((1-x) m_i). \tag{5} \]

Definition 2 (Robust to Merging). A voting scheme is robust to merging if a node i splits into nodes i1 and i2 with a weight splitting ratio x ∈ (0, 1), then
\[ v_i(m_i) \leq v_{i_1}(x m_i) + v_{i_2}((1-x) m_i). \tag{6} \]

Definition 3 (Fairness). A voting scheme (f, g) is fair if it is robust to Sybil attacks and robust to merging. In other words, if a node i splits into nodes i1 and i2 with a weight splitting ratio x ∈ (0, 1), then
\[ v_i(m_i) = v_{i_1}(x m_i) + v_{i_2}((1-x) m_i). \tag{7} \]
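The voting power in Eq. (4) and the fairness condition (7) can also be checked numerically by Monte Carlo simulation; the sketch below assumes the fair choice f = id and g ≡ 1 and uses illustrative weights, so it is an independent numerical check rather than part of the paper.

```python
import random

def voting_power(weights, k, trials=50_000, seed=1):
    """Monte Carlo estimate of the voting power v_i for f = id and g = 1:
    nodes are sampled with replacement proportionally to their weight, and
    v_i is the expected fraction of the k query slots occupied by node i."""
    rng = random.Random(seed)
    n = len(weights)
    counts = [0] * n
    for _ in range(trials):
        for _ in range(k):
            counts[rng.choices(range(n), weights=weights)[0]] += 1
    return [c / (trials * k) for c in counts]

# Splitting node 0 (weight 0.5) into two nodes of weight 0.25 should leave
# the combined voting power (approximately) unchanged, illustrating Eq. (7).
print(voting_power([0.5, 0.3, 0.2], k=10))
print(voting_power([0.25, 0.25, 0.3, 0.2], k=10))
```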

In the case where g ≡ 1 we construct a voting scheme that is fair for all possible choices of k and all weight distributions.

Theorem 1. Let us assume that g ≡ 1. Then, the voting scheme (f, g) is fair if and only if f is the identity function f = id.

Proof. We consider g ≡ 1. In this case we can simplify (7) to
\[ v_i = \frac{1}{k} \sum_{y \in \mathbb{N}^N :\ \sum_i y_i = k} y_i\, \mathbb{P}[y], \]


where y follows a multinomial distribution using the probability vector {p_j} and making k selections. Hence, we obtain
\[ v_i(m_i) = p_i = \frac{f(m_i)}{\sum_{j=1}^{N} f(m_j)}. \tag{8} \]
Now, let \(S = \sum_{j=1}^{N} f(m_j)\) and

\[ \Delta_x = f(x m_i) + f((1-x) m_i) - f(m_i). \tag{9} \]

The fairness condition (7) becomes
\[ \frac{f(m_i)}{S} = \frac{f(x m_i)}{S + \Delta_x} + \frac{f((1-x) m_i)}{S + \Delta_x}, \quad \forall x \in (0,1), \tag{10} \]

which is equivalent to
\[ \Delta_x f(m_i) = S \Delta_x, \quad \forall x \in (0,1). \tag{11} \]

This means that either f(m_i) = S, meaning that there is only one node, or Δ_x ≡ 0, meaning that for all x ∈ (0, 1) we have
\[ f(m) = f(x m) + f((1-x) m), \tag{12} \]

and hence that f is a linear function.

On the other hand, if the nodes are queried at random without weighting the selection probability, i.e., f ≡ 1, then there exists no voting scheme that is fair for all k.

Theorem 2. For f ≡ 1 there exists no voting scheme (f, g) with g_i > 0 for all i that is fair for all k.

Proof. In the case of f ≡ 1 the voting power simplifies to
\[ v(m_i) = \frac{1}{N^k} \sum_{y \in \mathbb{N}^N :\ \sum_i y_i = k} \frac{k!}{y_1! \cdots y_N!}\, \frac{y_i g_i}{\sum_{n=1}^{N} y_n g_n}. \tag{13} \]

We will consider a special situation that does not satisfy the fairness condition. We consider the situation with N = 2 nodes and m_1 = 2/3 and m_2 = 1/3. The voting power of the first node is then
\[ v\!\left(\tfrac{2}{3}\right) = \mathbb{E}\!\left[ \frac{X g(\tfrac{2}{3})}{X g(\tfrac{2}{3}) + (k - X) g(\tfrac{1}{3})} \right], \tag{14} \]
where X follows a binomial distribution B(k, 1/2). Now, it suffices to show that the right hand side of Eq. (14) is not constant in k. Due to the law of large numbers and the dominated convergence theorem we have that
\[ \mathbb{E}\!\left[ \frac{X g(\tfrac{2}{3})}{X g(\tfrac{2}{3}) + (k - X) g(\tfrac{1}{3})} \right] \xrightarrow{k \to \infty} \frac{g(\tfrac{2}{3})}{g(\tfrac{2}{3}) + g(\tfrac{1}{3})}. \tag{15} \]

We conclude by observing that g( 2 ) Xg( 23 ) < 2 3 1 , E 2 1 Xg( 3 ) + (k − X)g( 3 ) g( 3 ) + g( 3 )

933

(16)

where the strict inequality follows from Jensen’s Inequality using that X is not a constant almost surely and g( 23 ) > g( 13 ). For these reasons we fix from now on g ≡ 1 and f = id.

5

Distribution of Weight

The existence of a fair voting scheme is independent of the actual distribution of weight. However, to make a more detailed prediction of the qualities of the voting consensus protocol, it may be appropriate to make assumptions on the weights. Probably the most appropriate modelings of the weight distributions rely on universality phenomena. The most famous example of this universality phenomenon is the central limit theorem. While the central limit is suited to describe statistics where values are of the same order of magnitude, it is not appropriate to model more heterogeneous situations where the values might differ in several orders of magnitude. For instance, values in most (crypto-)currency systems are not distributed equally; [24]. Zipf’s law describes mathematically various models of heterogeneous realworld systems. For instance, many economic models, [25], use these laws to describe the wealth and influence of participants. This makes the Zipf law a natural candidate for modeling the weight distribution. It can be defined as follows. The nth largest value y(n) follows approximately a power law, i.e. it should behave roughly as (17) y(n) = Cn−s for the first few n = 1, 2, 3, . . . and some parameters s > 0 and normalising constant C. We refer to [26] for an excellent introduction to this topic. A first and convenient way to verify a Zipf law is to plot the data on a log-log graph, i.e. the axes being log(rank order) and log(value). A linear plot is then a strong indication for a Zipf law. An estimation of the parameter s can be given by a linear regression. As an example, we show the distribution of the top 10.000 richest IOTA addresses together with a fitted Zipf law in Fig. 1. Due to the universality phenomenon mentioned above and the observation made in Fig. 1, we assume, in the following sections, that the weight distribution follows a Zipf law.

6

Scalability and Message Complexity

An essential property of voting consensus protocols is their scalability. In fact, at every round, every node is queried on average k times indifferent to the

934

S. M¨ uller et al.

Fig. 1. The top 10.000 richest IOTA addresses together with a fitted Zipf law with s = 1.1; as of July 2020.

network size. In our proposed fair voting schemes where nodes are sampled proportional to their weight, this is no longer true, and nodes with higher weight are queried more often. This affects the scalability of the protocol and might generate feedback on the weight distribution. Lemma 1. We assume the weights to follow a Zipf law with parameters (s, N ). Then, the average number of queries a node of rank h(N ) receives per round is of order (as N → ∞) 1. Θ(N s h(N )−s ), if s < 1; 2. Θ( logNN h(N )−1 ), if s = 1; 3. Θ(N h(N )−s ), if s > 1. Proof. At every round a node of rank h(N ) is queried on average h(N )−s N · N −s n=1 n

(18)

times. In the case s < 1 this expression becomes asymptotically Θ(N s h(N )−s ). In the case s = 1 we obtain Θ( logNN h(N )−1 ), and if s > 1 it becomes Θ(N h(N )−s ). Let us note that, if s > 0 the node with the highest weight is queried Θ(N s ), Θ( logNN ), or Θ(N ) times. Nodes with considerable lower weight, i.e. of rank Θ(N ), are queried on average only Θ(1) times. This is in contrast to the

On Fairness in Voting Consensus Protocols

935

case s = 0 where every node has the same weight, and every node is queried on average a constant number of times. Therefore, nodes with high weights have an incentive to communicate their opinions through different communication channels, e.g. to gossip their opinions on an underlying peering network and not to answer each query separately. On the other hand, the situation where all nodes gossip their opinions is not optimal neither. In this case, every node would have to send Ω(N ) messages. In the following we propose a criterion for the gossiping of opinions. We assume that nodes with high weights have higher throughput than nodes with lower weights and can treat more messages than nodes with lower weights. We consider the situation where only the Θ(log(N )) highest weights nodes gossip their opinions. This leads to Θ(log N ) messages for each node in the gossip layer. The highest weight node, that does not gossip its opinions, receives on average Θ(( logNN )s ) messages if s < 1, Θ( (logNN )2 ) messages if s = 1, and Θ( (logNN )s ) messages if s > 1. In this scenario, middle rank nodes, i.e. those of rank between Θ(log N ) and Θ(N ), have the highest message load. A possible consequence might be that these middle rank nodes might gossip their opinion leading to possible congestion of the network or might stop to answer queries leading to less security of the consensus protocol. A less selfish response of such a node could be to either split up the node into several nodes or to pool with other nodes to gossip their opinions according to the above threshold rule. This, however, might influence the distribution of weight. The above situation may apply well to heterogeneous networks, i.e. the computational power and bandwidths of the node might differ in several orders of magnitudes, [16,17]. In homogeneous networks the above issues may be solved by a fair attribution of message complexity. Lemma 2. We assume the weights to follow a Zipf law with parameters s and N . Then there exists a fair threshold for gossiping opinions such that every node has√to process the same order of messages. The message load for each node is O( N ) for all choices of s. Proof. We have to find a threshold such that the maximal number of queries a node receives equals the number of gossiped messages. For s < 1 this leads to the following equation: (19) N s h(N )−s = h(N ). s

s

Hence, a threshold of order N^{s/(s+1)} leads to Θ(N^{s/(s+1)}) messages for every node to send. For s > 1 we similarly obtain a threshold of N^{1/(1+s)}, leading to Θ(N^{1/(1+s)}) messages. For s = 1 we obtain a message complexity for each node of O(\sqrt{N}).

7 Simulations

In this section, we present a short simulation study that shows how the performance of the FPC depends on the distribution of the weights.


7.1 Threat Model

We assume that an adversary holds a proportion q of the total weight. The adversary splits its weight equally between qN nodes, so that each adversarial node holds 1/N of the total weight. We assume that the adversary is at every moment aware of all opinions and transmits at time t + 1 the opinion of the weighted minority of the honest nodes at step t.

7.2 Failures

Standard failures of consensus protocols are integration failure, agreement failure, and termination failure. There are different possibilities for generalizing these failures to the case of heterogeneous weight distributions. Since in many applications the agreement failure is the most severe, we consider only agreement failure in this note. In the non-weighted setting, an agreement failure occurs if not all nodes decide on the same opinion. In the weighted setting, however, it makes sense to take the weights into account. For this reason, we consider the 1%-agreement failure, in the sense that a failure occurs if nodes holding at least 1% of the weight differ in their final decision.

7.3 Results

We choose the initial average opinion p0 equal to the value of the first threshold τ. This can be considered the critical case, since for values of p0 well above or well below τ the agreement failure rate is so small that numerical simulation is no longer feasible. We assign the initial opinions as follows: the highest-weight nodes holding together more than p0 of the weight are assigned opinion 1, and the remaining nodes opinion 0. Other default parameters are: N = 1000 nodes, p0 = τ = 0.66, β = 0.3, l = 10, maxIt = 50. In Fig. 2 we investigate the protocol with a small sample size, k = 20, and study the agreement failure rate as a function of q. We see that the performance depends on the parameter s: the higher the centralization, the lower the agreement failure rate for high q, but at the price that the protocol performs worse if q is smaller. In Fig. 3 we observe an exponential decay of the agreement failure rate in the sample size k. The code of the simulations is open source and available online.1

1

https://github.com/IOTAledger/fpc-sim.
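For reference, the default experimental setting described above can be collected into a configuration object as in the sketch below; the field names are illustrative assumptions, and the actual simulator is the linked fpc-sim repository.

```python
from dataclasses import dataclass

@dataclass
class FPCExperimentConfig:
    """Default parameters of the agreement-failure experiments (Sect. 7.3)."""
    n_nodes: int = 1000      # N
    p0: float = 0.66         # initial average opinion
    tau: float = 0.66        # first-round threshold
    beta: float = 0.3        # U_t ~ Unif([beta, 1 - beta])
    l: int = 10              # consecutive unchanged rounds before local termination
    max_it: int = 50         # hard cap on the number of iterations
    k: int = 20              # sample size per query round
    q: float = 0.25          # adversarial share of the total weight
    s: float = 1.1           # Zipf exponent of the weight distribution

config = FPCExperimentConfig()
print(config)
```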


Fig. 2. Agreement Failure Rates as a Function of q for Three Different Weight Distributions and k = 20.

Fig. 3. Agreement failure rates with k; q = 0.25.

8 Discussion

We proposed a mathematical framework for studying fairness in consensus voting protocols and constructed a fair voting scheme, Sect. 4. Even though this voting scheme is robust to splitting and merging, there are second-order effects that may incentivize nodes to optimize their weights. One example we studied concerns the message complexity and its consequences, Sect. 6. Other secondary effects that may influence the weights are, for instance, basic resource costs such as


maintaining nodes, network redundancy, and service availability. Moreover, the fact that the security of the protocol has an impact on its performance may lead to changes in the weight distribution. Nodes with high weight are likely to no longer be anonymous. In the case where these nodes gossip their opinions, and other users widely accept their reputation, some nodes may decide to no longer take part in the consensus finding and only follow the nodes with the highest reputations. This would give disproportionate weight to nodes with high weight, leading to an unfair situation. A complete mathematical treatment of the above is not realistic; all the more so because the evolution of such a permissionless and decentralized consensus protocol also depends on other components of the network, as well as on economic and psychological elements.

References
1. Neil, B., Shields, L.C., Margolin, N.B.: A survey of solutions to the Sybil attack (2005)
2. Mossel, E., Neeman, J., Tamuz, O.: Majority dynamics and aggregation of information in social networks. Auton. Agent. Multi-Agent Syst. 28, 408-429 (2014)
3. Gács, P., Kurdyumov, G.L., Levin, L.A.: One-dimensional uniform arrays that wash out finite islands. Problemy Peredachi Informatsii (1978)
4. Moreira, A.A., Mathur, A., Diermeier, D., Amaral, L.: Efficient system-wide coordination in noisy environments. Proc. Natl. Acad. Sci. U.S.A. 101, 12085-12090 (2004)
5. Kar, S., Moura, J.M.F.: Distributed average consensus in sensor networks with random link failures. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP 2007, vol. 2, pp. II-1013-II-1016, April 2007
6. Cruise, J., Ganesh, A.: Probabilistic consensus via polling and majority rules. Queueing Syst. 78(2), 99-120 (2014)
7. Gogolev, A., Marchenko, N., Marcenaro, L., Bettstetter, C.: Distributed binary consensus in networks with disturbances. ACM Trans. Auton. Adapt. Syst. 10, 19:1-19:17 (2015)
8. Popov, S., Buchanan, W.J.: FPC-BI: fast probabilistic consensus within Byzantine infrastructures. J. Parallel Distrib. Comput. 147, 77-86 (2021)
9. Capossele, A., Mueller, S., Penzkofer, A.: Robustness and efficiency of leaderless probabilistic consensus protocols within Byzantine infrastructures (2019)
10. Banisch, S., Araújo, T., Louçã, J.: Opinion dynamics and communication networks. Adv. Complex Syst. 95-111 (2010)
11. Niu, H.-L., Wang, J.: Entropy and recurrence measures of a financial dynamic system by an interacting voter system. Entropy 2590-2605 (2015)
12. Przybyla, P., Sznajd-Weron, K., Tabiszewski, M.: Exit probability in a one-dimensional nonlinear q-voter model. Phys. Rev. E (2011)
13. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Rev. Modern Phys. 591 (2009)
14. Rabin, M.: Incorporating fairness into game theory. Department of Economics, UC Berkeley (1991)
15. Popov, S., et al.: The coordicide (2020)


16. Vigneri, L., Welz, W., Gal, A., Dimitrov, V.: Achieving fairness in the Tangle through an adaptive rate control algorithm. In: 2019 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), pp. 146-148, May 2019
17. Vigneri, L., Welz, W.: On the fairness of distributed ledger technologies for the Internet of Things. In: 2020 IEEE International Conference on Blockchain and Cryptocurrency (ICBC) (2020)
18. Gambarelli, G.: Power indices for political and financial decision making: a review. Ann. Oper. Res. 51(4), 163-173 (1994)
19. Mukhopadhyay, A., Mazumdar, R.R.: Voter and majority dynamics with biased and stubborn agents. https://arxiv.org/abs/2003.02885 (2020)
20. Chen, X., Papadimitriou, C., Roughgarden, T.: An axiomatic approach to block rewards. In: Proceedings of the 1st ACM Conference on Advances in Financial Technologies, New York, NY, USA, pp. 124-131. Association for Computing Machinery (2019)
21. Popov, S.: A probabilistic analysis of the Nxt forging algorithm. Ledger 1, 69-83 (2016)
22. Leonardos, S., Reijsbergen, D., Piliouras, G.: Weighted voting on the blockchain: improving consensus in proof of stake protocols (2020)
23. Leshno, J., Strack, P.: Bitcoin: an impossibility theorem for proof-of-work based protocols. Cowles Foundation Discussion Paper, no. 2204R (2019)
24. Kondor, D., Pósfai, M., Csabai, I., Vattay, G.: Do the rich get richer? An empirical analysis of the Bitcoin transaction network. PLoS One 9, e86197 (2014)
25. Jones, C.I.: Pareto and Piketty: the macroeconomics of top income and wealth inequality. J. Econ. Perspect. 29, 29-46 (2015)
26. Tao, T.: Benford's law, Zipf's law, and the Pareto distribution. https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/ (2009)

Dynamic Urban Planning: An Agent-Based Model Coupling Mobility Mode and Housing Choice. Use Case Kendall Square

Mireia Yurrita(B), Arnaud Grignard, Luis Alonso, Yan Zhang, Cristian Ignacio Jara-Figueroa, Markus Elkatsha, and Kent Larson

Massachusetts Institute of Technology, Cambridge, USA
[email protected]

Abstract. As cities become increasingly populated, urban planning plays a key role in ensuring the equitable and inclusive development of metropolitan areas. MIT City Science group created a data-driven tangible platform, CityScope, to help different stakeholders, such as government representatives, urban planners, developers, and citizens, collaboratively shape the urban scenario through the real-time impact analysis of different urban interventions. This paper presents an agent-based model that characterizes citizens’ behavioural patterns with respect to housing and mobility choice that will constitute the first step in the development of a dynamic incentive system for an open interactive governance process. The realistic identification and representation of the criteria that affect this decision-making process will help understand and evaluate the impacts of potential housing incentives that aim to promote urban characteristics such as equality, diversity, walkability, and efficiency. The calibration and validation of the model have been performed in a wellknown geographic area for the Group: Kendall Square in Cambridge, MA. Keywords: Agent-based modelling · Housing choice · Residential mobility · Dynamic urban planning · Pro-social city development

1 Introduction

Even if humans started to cluster around cities 5,500 years ago, it was not until about 150 years ago that cities became increasingly populated [1]. Nowadays more than 50% of the world's population lives in urban areas, which makes urban planning one of the key elements for citizens' well-being [2]. As in the following decades cities may account for 60% of the total global population, 75% of the global CO2 emissions, and up to 80% of all world energy use [3], traditional systems of centralised planning might be obsolete [2].


Fig. 1. CityScope Volpe: a data-driven interactive simulation tool for urban design [2]

In view of the need for more participatory decision-making processes, MIT City Science (CS) Group developed a tangible data-driven platform, CityScope. This platform brings different stakeholders together in an effort to enable a more collaborative urban design process [2, 4] – Fig. 1. As part of this collective decision-making approach, the CS Group is designing a dynamic incentive system, that aims to promote pro-social urban features through dynamic incentive policies. By using data-driven agent-based simulations, the impacts on equality, diversity, walkability, and efficiency of such incentives are measured. Thanks to this information, different stakeholders, such as government representatives, urban planners, developers, and citizens, can collaboratively decide which intervention shapes the most favourable urban scenario. Some of the incentives explored in this line of research are focused on the promotion of affordable housing within a 20-min walking distance from one’s workplace. This vision is part of a broader perspective where cities are encouraged to be comprised of autonomous communities, where citizens’ daily needs are met within a walking distance from their doorstep. MIT City Science intends to create the technological tools to facilitate consensus between community members, local authorities, and design professionals [2], so that cities can fulfil their ambition of having walkable districts. This paper arises from the need to forecast citizens’ criteria when choosing their homes, and the consequent mobility patterns. By doing so the effects of numerous pro-social incentives that aim to address the deficiency of affordable housing in city centres can be tested. Although some previous studies focusing on citizens’ mobility choices had already been held in the group, the need to


combine mobility trends with housing-related aspects was major. In contrast to the consulted bibliography, where most approaches make use of statistical tools to respond to this joint choice research question, Agent-Based Modeling (ABM) has been deployed in this study. People’s behavioral patterns have been represented based on their income profiles. This document is organized as follows. Section 2 describes previous work on the representation of housing and residential mobility choices and highlights the benefits of using ABM when embracing the complexity of urban dynamics. Section 3 gives a detailed summary of the agents that constitute the model, the behaviours of each of them and the dynamics of the housing and mobility mode choice. Section 4 calibrates the model for the particular use case of Kendall Square using census and transportation data and Sect. 6 gives some final thoughts on the effectiveness of the method.

2 Background

2.1 Related Work

The present research is an agent-based joint representation of housing and mobility choices that intends to forecast citizens’ behavioural patterns when choosing their residencies as well as to highlight impacts on mobility. This study has been built on top of previous work where different approaches to address this question have been suggested. Some of them merely focused on the housing choice search, others concentrated on the mobility mode issue and some merged both points of view. Among the former methodologies, a two-stage behavioural housing search model [5] had been created. This model was comprised of a hazard-based “choice set formation step”, and a “final residential location selection step” using a multinomial logit formulation. A theoretical housing preference and choice framework had also been given using the theory of mean-end chain [6]. The motivations that make people want to move had been studied [7] using a nested logit model, which paid special attention to how residential decisions are impacted by transportation. A multi-agent approach had also been used to model and analyze the residential segregation phenomenon [8]. And an ABM had been designed to represent the dynamics of the housing market and for the study of the effects that certain parameters, like school accessibility, would have on the outcome [9]. As far as mobility choices are concerned, some previous studies had been carried out within the MIT City Science Group using both statistical tools as well as agent-based models. The HCl platform deployed for the real-time prediction of mobility choices through discrete choice models [10] is one of the examples of how the CS Group had previously addressed the mobility choice issue. Agentbased models had also been developed in our Group to evaluate new mobility mode choices and to assess their impact on cities [11]. Among previous joint choice models, it is important to highlight a model based on the disaggregate theory where the combinations of location, housing,


automobile ownership and the commuting mode had been studied [12]. Additionally, a nested multinomial logit model to estimate the joint choice of residential mobility, and housing choice [13] had also been used. This paper presents a novel joint choice model based on agents. Following the Schelling segregation model [14], the factors related to housing, and those related to transportation will be deployed to determine if a certain agent is willing to move or not. 2.2

Agent-Based Simulations

Most of the related work mentioned prior in this paper make use of statistical tools to study the housing and residential mobility choice phenomenon. Our research, on the contrary, will artificially reproduce the dynamics of the urban scenario using agent-based simulations. Although several platforms dedicated to agent-based modelling have been created lately, this model has been developed using the GAMA platform. GAMA allows modellers to create complex multilevel spatially explicit simulations where GIS data can be easily integrated. It also provides a high-level agent-oriented language, GAML, as well as an integrated development environment [15, 16]. The behaviour of each ‘people agent’ has been developed using this tool and some agent-specific functions have been designed in order to recreate their decision-making reasoning while representing the complexity of the surrounding dynamics.

3 Model Description

This model aims to represent the criteria adopted by citizens when choosing their residential location and mobility mode, as well as the importance given to each of these parameters. The realistic characterization of these behaviours enables the study of the consequences that certain housing incentives and urban disruptions might entail. 3.1

Entities, State Variables, and Scales

The environmental Variables used to shape the urban scenario and represent its dynamics include: – Census Block Group: polygon representing the combination of census blocks, smallest geographic areas for which the US Bureau of Census collects data [17]. Each census block group will be formed by the following attributes: GEOID: unique geographical ID for each census block group, vacant spaces: number of vacant housing options available within the boundaries of this polygon (previously a web scraping process has been held and the availability of accommodation and its price identified [18] within the area of interest), city: broader unit the block group belongs to [17], rent vacancy: mean rent for each housing option based on online rent information [18], population: map


indicating the number of citizens of each income profile that actually live in the area [19], has T : boolean attribute pointing out whether the block group has a T station or not [20] and has bus: boolean attribute that indicates if bus services are available in the area [20]. Census block groups have been the environmental variables chosen to represent the broadest granularity in the model, since applying the same unit as the US Census Bureau enables the calibration and validation of the model using census data. – Building: polygon representing a finer granularity for the area where the urban interventions are considered to take place. The aforementioned census block groups are the alternative to living in a building within the area of interest. The attributes of ‘building’ agents include: associated block group: block group in which the building is located [17], vacant spaces: number of vacant dwelling units within the building [18] and rent vacancy: mean rent for each housing option within the building [18]. – Road: network of roads that agents can use to move around. mobility allowed is the only attribute of the road network, which is taken into account when considering different commuting mobility options and when the resulting time for each of them is calculated [17].

Fig. 2. Model overview, where people, building, road and census block group – areas with different transparency – agents have been illustrated for Kendall Square in Cambridge, MA.

As far as agents are concerned, ‘people agents’ are defined the following way: – People: citizens who work in the area of interest and whose objective is to find the most appropriate housing and mobility option based on their criteria.


Their attributes include: type: type of ‘people agent’ based on their annual income profile [19], living place: building or census block group where they decide to live in each iteration of the searching process, activity place: building in the area of interest where each agent works, possible mobility modes: list of mobility modes that are available for each agent considering if they own some kind of private mobility mode [19] or if in their living place there is public transportation available [20], mobility mode commuting mobility mode chosen in each iteration of the housing search, time main activity: commuting time in minutes, distance main activity: commuting distance depending on the housing option being considered and commuting cost: cost based on the distance and the mobility mode chosen. 3.2

Process Overview

The dynamics of this model are ruled by the behaviour of ‘people agents’. Each of them is assigned a nonresidential building in the finer grained area as their workplace. An iterative process will be performed in order to find the housing option and mobility mode that best meets their needs. Iteration 0: The initialisation process takes place in three different steps: (i) a random residential unit is assigned to each agent, (ii) based on the area where this residential unit is located and the probability of owning private means of transportation, possible mobility modes are defined (iii) travelling options will be weighed based on each profile’s criteria and a final score for each mode will be calculated as a linear combination of the importance given to each of these factors. The option that maximizes the score will be the chosen mobility mode. Criteria related to transportation preferences include both quantitative and qualitative factors such as price, resulting commuting time, the difficulty of usage and the social pattern [11]. Table 1 displays some synthetic values of this criteria for three medium income profiles [11] (modified from working status profiles into income profiles), whereas Table 2 shows the features of some of the available transportation modes [11, 20]. Table 1. Synthetic weighing values given to transportation criteria [11] according to different income profiles. These values are normalized between –1 and 0 for price, time and difficulty and between 0 and 1 for pattern importance. They will determine each ‘people agent’s ’ preference towards a particular mobility choice. Price [–] Time [–] Difficulty [–] Pattern [–] $45,000–$59,999

–0.8

$60,000–$99,999

–0.7

$100,000–$124,999 –0.6

–0.8

–0.7

0.7

–0.85

–0.75

0.8

–0.9

–0.8

0.95

946

M. Yurrita et al.

Note that in order to be able to make use of public transportation, both the surroundings of the residential unit and the workplace have to offer that particular commuting option. So as to calculate the time needed to commute for each mode, both the waiting time (when non-private modes are considered) and the commute itself – following the topology of the transportation network – have been taken into account. Table 2. Price, speed and social pattern features of some of the available mobility modes [11]. Price/km and mean speed values are based on the massachusetts bay transportation authority data [20], whereas the pattern weight is a synthetic normalized weighing value between 0 and 1. A linear combination between these values and the mobility choice criteria for each income profile is performed so as to select the most suitable transportation option. Price/km [$] Mean speed [km/h] Pattern [–] Bike 0.01

5

0.5

Bus 0.1

20

0.4

Car

30

1

0.32

Subsequent Iterations: Once the initialisation process has been held, the procedure will change into: (i) an alternative housing option is randomly assigned to each agent (ii) this new housing option is compared to the current unit and the suitability of moving to the new alternative is evaluated based on certain factors according to each income profile (iii) the ‘time’ parameter is calculated as the resulting commuting time required using the most suitable mobility option for each housing unit as described for iteration 0. ‘People agents’ will decide to move in iteration ‘i’ if the score of the alternative housing unit is greater than the current one. Factors considered when assessing housing include quantitative data such as price or commuting time using the most suitable available means of transportation and qualitative factors like the zone preference, importance given to the unit being in this zone and diversity acceptance [9]. Diversity is calculated using the Shannon-Weaver formula following the diversity calculation methodology deployed on the CityScope platform [2]. This measurement quantitatively measures the amount of species (different income profiles in our research) in an ecosystem (the spatial unity being considered in each case) [2]. This diversity metrics will be applied either to a building, if the housing unit being considered is within the finer grained area, or to a census block group should the dwelling be located on the outskirts. This iterative process will be applied until the number of people moving in each iteration asymptotically approaches zero. Once this situation is reached and every ‘people agent’ is assigned the most convenient mobility and housing option, further mobility studies can be performed departing from this t = 0 situation.

Dynamic Urban Planning

947

Table 3. Price, diversity acceptance and zone criteria according to different income profiles (synthetic data based example). Values are normalized between –1 and 0 for price, between –1 and 1 for diversity acceptance and between 0 and 1 for zone weight. These criteria, along with the transportation criteria will point out the housing and mobility choice that maximizes the score obtained through a linear combination. Price [–] Diversity acceptance [–] Zone weight [–]

3.3

$45,000–$59,999

–0.8

0

0.4

$60,000–$99,999

–0.6

–0.3

0.5

$100,000–$124,999 –0.5

–0.5

0.6

Initialization and Input Data

The model initialization relies on two main types of files: the ones containing geographical information and the ones containing information regarding agents and mobility modes. The first group of files includes (1) a shapefile that consists of the block groups of the area in question, (2) a shepefile where the availability of apartments according to each block group is displayed, (3) a shapefile where the buildings of the finer grained area are incorporated, (4) a shapefile that is comprised of the public transportation system and (5) a shapefile with the road network of the region. The second type of input data consists of .csv files where (1) income profiles are defined and their characteristics such as the proportion within the population, the probability of owning a car and a bike are listed [19], (2) ‘people’ profiles define the weight given to each of the criterion when choosing a house – Table 3–, (3) these same profiles describe the criteria regarding mobility mode preferences – Table 1– and (4) different mobility modes are listed and their characteristics detailed – Table 2 –, including the price and time per distance unit, the waiting time, the difficulty, and the social pattern.

4 Model Calibration and Validation: Use Case Kendall Square

The model described above in this paper presents a generic framework where, as long as the input data is modified, the methodology can be applied to any city. In order to develop a first version of the model, Kendall Square has been analysed. Kendall Square is a neighbourhood in Cambridge, Massachusetts, – Fig. 2 – that has been transformed from an industrial area into a leading biotech and IT company hub in the last decades. As the number of companies clustering around this site has soared, the cost of housing has exploded, making it increasingly difficult for low and medium income members to live within a walking distance from their workplace [21]. The commuting patterns derived from this situation, which include high private vehicle density and saturation of the public transportation infrastructure, lead to serious mobility challenges. This section is focused on the description of the calibration process of ‘people agents’ ’ behavioural criteria, so


Input files (2) and (3) gather the aforementioned criteria and will therefore be inferred from this validation process, whereas file (1) is deduced from census data [19] and file (4) is an adaptation of the Massachusetts Bay Transportation Authority information [20]. The calibration process is based on the definition of two types of errors: one related to housing choices and the other related to mobility mode patterns. Batch experiments are then deployed so as to find the combination of criteria that leads to the minimum value of the sum of these two errors. There are eight different factors that affect the decision-making process of ‘people agents’ – housing price, diversity acceptance, preferred zone and weight given to zone [9]; transportation price, time importance, commuting difficulty and social pattern [11] – which, applied to eight different income profiles, leads to a total of sixty-four variables handled by the batch experiments. The housing error is defined as the difference in percentage of the distribution of each income profile in each census block group where citizens working in Kendall have a certain presence. The real percentage value inferred from transportation data – the census block group where different Kendall workers live can be identified and the income profile of these approximated to the mean annual income based on census data [19] – is then compared to the resulting simulated value and the root-mean-square deviation calculated – Eq. 1:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}    (1)

where Y_i is the real percentage of people of type i living in the neighbourhood in question, and \hat{Y}_i is the percentage of ‘people agents’ of type i living in that same neighbourhood as a result of the housing selection methodology. As for the mobility mode error, the resulting simulated percentages of pedestrians and of car, bike, bus and T usage are compared to the Parking and Transportation Demand Management Data in the City of Cambridge [22]. This error is equally calculated as the root-mean-square deviation, where Y_i represents the percentage of people that choose a certain mobility mode i, whereas \hat{Y}_i stands for the percentage of ‘people agents’ that choose that same mobility mode once the iterative process has been performed. As far as the exploration method is concerned, the hill climbing algorithm [23] has been applied and batch experiments have been performed, yielding a reasonably good solution. The most suitable criteria combination has led to a mean housing error of 3.87% and a mean mobility error of 2.30% – Fig. 3.
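The error terms can be written directly from Eq. 1; the snippet below is a minimal transcription assuming numpy, with purely illustrative numbers.

import numpy as np

def rmse(real, simulated):
    """Root-mean-square deviation between real and simulated percentage distributions (Eq. 1)."""
    real = np.asarray(real, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return np.sqrt(np.mean((real - simulated) ** 2))

# Housing error: income-profile shares per census block group (illustrative numbers).
housing_error = rmse([12.0, 30.5, 21.0], [14.1, 28.2, 22.3])
# Mobility error: shares of walking, car, bike, bus and T usage (illustrative numbers).
mobility_error = rmse([28.0, 35.0, 9.0, 10.0, 18.0], [26.5, 37.2, 8.1, 11.0, 17.2])
total_error = housing_error + mobility_error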


Fig. 3. Evolution of housing and mobility errors [%] over one hundred batch experiments of the calibration process.

5 Discussion

In the described model, citizens’ housing and mobility preferences are scored taking into account various parameters, which include price (for both housing and commuting), diversity acceptance, zone preference, time, difficulty of usage and social pattern. These parameters have been defined based on the literature, considering that they should constitute a reduced set of variables that heavily affect the decision-making process. It should be noted that the zone preference criterion itself encompasses numerous possible attributes: the effect of greenery, high quality education centers or fresh food availability, to mention a few, could also be included in a more detailed study. For a first version of the model, though, it has been assumed that the suggested approach was already complex enough. This same complexity, along with the metropolitan area scope, makes it challenging for the impacts resulting from changes in urban configurations to be calculated in real time, which represents a constraint in terms of computational cost. Finally, it should be noted that, in order to calibrate the criteria in question, the exploration of the design space has been performed using the hill climbing algorithm. The minimization process has resulted in acceptable housing and mobility errors. However, this algorithm heavily relies on the selected starting point and, thus, alternative algorithms could be deployed if this limitation is to be avoided in future calibration processes.
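One common way to mitigate the starting-point dependence is to pair hill climbing with random restarts. The sketch below is only a schematic of this kind of design-space exploration, not the authors' implementation: the score function is a dummy stand-in for the simulation-derived housing-plus-mobility error, and all parameter choices are illustrative.

import random

def hill_climb(score, initial, neighbors, iterations=1000):
    """Generic hill climbing: keep a candidate and move to a better neighbor when one is found."""
    best = initial
    best_score = score(best)
    for _ in range(iterations):
        candidate = neighbors(best)
        candidate_score = score(candidate)
        if candidate_score < best_score:  # lower combined error is better
            best, best_score = candidate, candidate_score
    return best, best_score

def random_restart(score, random_start, neighbors, restarts=10):
    """Reduce starting-point sensitivity by restarting from several random points."""
    results = [hill_climb(score, random_start(), neighbors) for _ in range(restarts)]
    return min(results, key=lambda r: r[1])

# Illustrative 64-dimensional criteria vector in [-1, 1].
random_start = lambda: [random.uniform(-1, 1) for _ in range(64)]
neighbors = lambda x: [xi + random.gauss(0, 0.05) for xi in x]
# In the real setup, score(x) would run the simulation and return housing + mobility RMSE.
score = lambda x: sum(xi ** 2 for xi in x)
best_criteria, best_error = random_restart(score, random_start, neighbors)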

6 Conclusion

This paper presents a generic methodology with which citizens’ criteria towards housing and mobility choices can be tested, and the resulting behavioural patterns monitored. Criteria regarding mobility and residential preferences have been calibrated and validated for the particular use case of Kendall Square, where major mobility issues have arisen as a consequence of the technological development of the surroundings and the explosion of housing prices. However, variations in input data and research into combinations of criteria particularised for a certain city are possible following the guidelines given earlier in the document. As part of the CityScope platform, this model enables the prediction of citizens’ response to housing-related urban disruptions that aim to promote the pro-social development of cities. The usage of Agent-Based Modelling opens the door to the monitoring of people’s reactions to dynamically reconfigurable zoning policies. However, the deployment of such a data-driven platform requires real-time feedback, so that different stakeholders can benefit from the instant impact analysis of the suggested actions. Further research regarding the conversion of this model into a real-time analysis tool is underway. This new approach will need to embrace the complexity of urban dynamics, while constituting an agile and dynamic decision-making tool.

References

1. Sjoberg, G.: The origin and evolution of cities. Scientific American 213(3). Scientific American, a division of Nature America, Inc (1965)
2. Alonso, L., et al.: CityScope: a data-driven interactive simulation tool for urban design. Use case Volpe. In: Morales, A., Gershenson, C., Braha, D., Minai, A., Bar-Yam, Y. (eds.) Unifying Themes in Complex Systems IX. ICCS 2018. Springer Proceedings in Complexity. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-96661-8_27
3. United Nations. Sustainable Development Goals (2020). https://www.un.org/sustainabledevelopment/cities/
4. Grignard, A., Macià, N., Alonso Pastor, L., Noyman, A., Zhang, Y., Larson, K.: CityScope Andorra: a multi-level interactive and tangible agent-based visualization. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1939–1940 (2018)
5. Rashidi, T., Auld, J., Mohammadian, A.: A behavioral housing search model: two-stage hazard-based and multinomial logit approach to choice-set formation and location selection. Transportation Research Part A 46, 1097–1107. Elsevier Ltd (2012)
6. Zinas, B., Mohd Jusan, M.: Housing choice and preference: theory and measurement. In: 1st National Conference on Environment-Behaviour Studies, Faculty of Architecture, Planning & Surveying, Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia, 14–15 November 2009, pp. 282–292 (2012)
7. Kim, J., Pagliara, F., Preston, J.: The intention to move and residential location choice behaviour. Urban Stud. 42(9), 1621–1636 (2005)
8. Aguilera, A., Ugalde, E.: A spatially extended model for residential segregation. Discrete Dynamics in Nature and Society, vol. 2007, Article ID 48589. Hindawi Publishing Corporation (2007)
9. Jordan, R., Birkin, M., Evans, A.: Agent-based modelling of residential mobility, housing choice and regeneration. In: Heppenstall, A., Crooks, A., See, L., Batty, M. (eds.) Agent-Based Models of Geographical Systems, pp. 511–524. Springer, Dordrecht (2012). https://doi.org/10.1007/978-90-481-8927-4_25
10. Doorley, R., Noyman, A., Sakai, Y., Larson, K.: What's your MoCho? Real-time mode choice prediction using discrete choice models and a HCL platform. In: UrbComp 2019, 5 August 2019, Anchorage, AK (2019)
11. Grignard, A., et al.: The impact of new mobility modes on a city: a generic approach using ABM. In: Morales, A.J., Gershenson, C., Braha, D., Minai, A.A., Bar-Yam, Y. (eds.) Unifying Themes in Complex Systems IX. ICCS 2018. SPC, pp. 272–280. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-96661-8_29
12. Lerman, S.: Location, housing, automobile ownership, and mode to work: a joint choice model. In: Transportation Research Record, pp. 6–11. Transportation Research Board (1977)
13. Clark, W., Onaka, J.: An empirical test of a joint model of residential mobility and housing choice. Environ. Plan. A Econ. Space 17(7), 915–930 (1985)
14. Schelling, T.: Models of segregation. The American Economic Review 59(2), 488–493. Papers and Proceedings of the Eighty-first Annual Meeting of the American Economic Association (1969)
15. Grignard, A., Taillandier, P., Gaudou, B., Vo, D.A., Huynh, N.Q., Drogoul, A.: GAMA 1.6: advancing the art of complex agent-based modeling and simulation. In: Boella, G., Elkind, E., Savarimuthu, B.T.R., Dignum, F., Purvis, M.K. (eds.) PRIMA 2013. LNCS (LNAI), vol. 8291, pp. 117–131. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-44927-7_9
16. Taillandier, P., et al.: Building, composing and experimenting complex spatial models with the GAMA platform. GeoInformatica 23(2), 299–322 (2019)
17. United States Government's Open Data (2020). https://www.data.gov/
18. PadMapper. Apartments for Rent from the Trusted Apartment Finder (2020). https://www.padmapper.com/
19. United States Census Bureau. Census Profiles (2020). https://data.census.gov/cedsci/
20. MBTA. Massachusetts Bay Transportation Authority (2020). https://www.mbta.com/
21. Sisson, P.: As top innovation hub expands, can straining local infrastructure keep pace? (2018). https://www.curbed.com/2018/11/6/18067326/boston-real-estate-cambridge-mit-biotech-kendall-square
22. Parking and Transportation Demand Management Data in the City of Cambridge (2014). https://www.cambridgema.gov/CDD/Transportation/fordevelopers/ptdm
23. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice-Hall, USA (1995)

Reasoning in the Presence of Silence in Testimonies: A Logical Approach

Alfonso Garcés-Báez and Aurelio López-López

Computational Sciences Department, Instituto Nacional de Astrofísica, Óptica y Electrónica, Sta. Ma. Tonantzintla, Puebla, Mexico
{agarces,allopez}@ccc.inaoep.mx

Abstract. An implicature, as opposed to a logical or formal implication, allows one to make linguistic inferences during a conversation. The omission of an answer (silence) also allows one to make linguistic inferences in some contexts, according to Swanson. After reviewing previous studies of silence and Grice's conversational implicature, we propose definitions and semantics of three kinds of silence, when knowledge is represented in logic. As a case study for the semantics, we chose a puzzle formulated by Wylie and then apply answer set programming to generate models that reveal new information and different possibilities, including nonmonotonicity in knowledge bases, that lie behind the omission in the context of testimonies. Also, a connection between false statements and silence emerged. Keywords: Silence · Omission · Testimonies · Knowledge representation · Reasoning · Logic

1 Introduction

Both silence and omission have been studied from different points of view. For instance, from the point of view of semiotics, Umberto Eco [4] stated that silence is a sign and proposed a table of possibilities between the speaker and the listener, where the transmission of silence can be voluntary or involuntary. According to [1], the variants for the interpretation of silence are anonymous and not anonymous, where the second can occur as: with identification or face-to-face. In the same work, silence is also examined in the communication process, in terms of the prisoner's dilemma. Dyne et al. [3] propose the defensive, acquiescent, and pro-social silence occurring in organizations. Kurzon defines silence by language in [12], focusing mainly on three types of silence: psychological, interactive, and socio-cultural. In data communication, there are three kinds of acoustic silence in conversation: pauses, gaps, and lapses. Pauses refer to silences within turns, gaps refer to short silences between turns, and lapses are longer or extended silences between turns [2].


Context is a key element for the interpretation of silence. For instance, in court, during an interrogation, silence is interpreted to the detriment of whoever decides to remain silent; it is immediately understood as the person hiding something. Another example is the interpretation of silence by police officers when suspects emit the phrase no comment instead [15]. Intentional silence can also signal loyalty to a group. According to [12], if we want to interpret intentional silence, first we have to discard the modal can that expresses unintentionality, as in I can not speak. Afterwards, four manners remain to interpret. First, I may not tell you; second, I must not tell you; third, I shall not tell you; and finally, I will not tell you. The first two manners are considered external intentional silences by order, while the last two are internal by will. From several of the above mentioned studies on silence [3,4,12,15], we previously proposed a classification of the uses of silence, which can be found in [6]. An omissive implicature, as stated by Swanson [17], asserts that in certain contexts, not saying p generates a conversational implicature: that the speaker did not have sufficient reason, all things considered, to say p. A formalization of intentional silence in terms of logic has not been done, as far as we know, although [11] attempts an informal logic. In [5], a first contribution in this direction was made. This paper proposes a logical conceptualization for the study of the omission of answers in a testimonial context, focusing on a puzzle formally expressed to analyze the implications of three interpretations of omission represented through the predicate Says (previously employed in [8] and [10]). We solve two versions of a puzzle as a case study, generating models with Clingo in one of them, to delineate the relationship that could exist between possible false statements and the interpretation of omission as intentional silence in this testimonial context. The organization of the paper is as follows: Sect. 2 contains the concept of conversational implicature and some formal definitions; in Sect. 3 we describe the strategy we use and the puzzle taken as case study. Section 4 includes three interpretations, some combinations of them, and their consequences for the case study. Section 5 concludes and details work in progress.

2 Background

We discuss two main concepts for our logical approach: conversational implicature and answer set programming.

2.1 Grice's Sealed Lips

Grice [10] explains conversational implicature as a potential inference that is not a logical implication, and that is closely connected with the meaning of the word says.


The Cooperative Principle (CP) is the basis to explain implicature, and consists of participants making their contribution in a conversation, as required in the scenario in which it occurs, for the accepted purpose or direction of their speech exchange. The CP has four categories that include the maxims of Grice:

1. Quantity. Make your informative contribution as required for current exchange purposes. Do not make more informative contributions than required.
2. Quality. Do not say what you think is false. Do not say that for which you lack adequate evidence.
3. Relation. Be relevant.
4. Manner. Category 4 is related to the way it is said and considers a supermaxim ‘Be perspicuous’, along with several maxims such as:
   – Avoid obscurity of expression.
   – Avoid ambiguity.
   – Be brief (avoid unnecessary prolixity).
   – Be orderly.

As pointed out by Grice, a talk exchange participant may fail to fulfill a maxim and the CP; i.e. he may say, indicate, or allow it to become plain that he is unwilling to cooperate according to the maxim. He may say, for instance, I cannot say more; my lips are sealed, thereby breaking the communication. In contrast, we consider that in some contexts silence could be interpreted without breaking communication. The representation of saying and not saying is important in our semantics, which is the reason for introducing the following formal definition of this predicate.

Definition 1. Says(A, P, T/F) denotes that the agent A says that P is true (T) or false (F).

2.2 Answer Set Programming

While the antecedents of Prolog are in automatic theorem provers, those of Answer Set Programming (ASP) are in deductive databases and satisfiability testing. In ASP, starting from a representation of the problem, a solution is given by a model of the representation. The ASP paradigm is utilised in our research to explore the implications of silence in our case study, since it is closely related to intuitionist logic, i.e. both are based on the concept of proof rather than truth (as previously shown in [9] for intuitionist logic). This is a way of doing logical programming by computing stable models for the problems at hand [8]. A stable model is a belief system that holds for a rational agent. Under this approach:

– Solutions go beyond answering queries.


– The solution of computational problems is done by reducing them to finding answer sets of programs.
– In principle, any problem in the NP-complete class can be solved with ASP without disjunction.
– More complex problems can be solved when disjunction is included.
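As a concrete, assumed usage example of this workflow (and not part of the case-study encoding), the following sketch uses clingo's Python API to enumerate the stable models of a toy program that has two answer sets.

import clingo

program = """
a :- not b.
b :- not a.
"""

ctl = clingo.Control(["0"])        # "0" asks for all answer sets
ctl.add("base", [], program)
ctl.ground([("base", [])])

models = []
ctl.solve(on_model=lambda m: models.append(str(m)))
print(models)                      # two stable models: one containing a, one containing b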

3 Case Study

Our aim is to show some of the possibilities that open up when intentional silence or omission occurs during testimonies, as a computational resource that helps decision-making. We start from a puzzle with three suspects under the assumption that one tells the truth, another asserts a half truth, and the last one lies. With the purpose of uncovering the consequences of silence, we model this puzzle under a new assumption: everyone tells the truth except the criminal, and we apply our semantic rules to reveal some relationships between silence and falsehood.

3.1 Method

In mathematics, there is a wide variety of proof techniques to prove that A → B, such as Direct, Indirect, Contradiction, Contrapositive, Construction, Induction, and so on [16]. As part of that list, there is the Choose Technique, which works forward from A and the fact that the object has a certain property (see Fig. 1); it also works backward from the ‘something that happens’.

Fig. 1. A model proof for the Choose technique.

We employ logic puzzles since the solutions to some of these problems could have practical implications, and in some way they synthesize actual situations. Figure 2 shows the preliminary considerations (preconditions), which consist of selecting a puzzle with testimonies and implementing (programming) it to reach a contextualized solution, employing common sense rules. In the process, the types of omission are chosen to generate models, which are then analyzed to try to answer some key questions.


Fig. 2. Strategy.

An implementation of ASP is Clingo [7], which allows one to find, if it exists, the answer set (stable model) of a logical program. For our case study, Clingo is employed to generate answer sets for the problem and, in particular, to explore the implications of the interpretations formulated. Additionally, Python serves to update the logical program [18].

3.2 Riddle the Criminal

There is a puzzle that includes testimonies of several people (taken from [19]), and allows us to model and examine the interpretations of silence. The riddle is phrased as: Three men were once arrested for a crime which beyond a shadow of a doubt had been committed by one of them. Preliminary questioning disclosed the curious fact that one of the suspects was a highly respected judge, one just an average citizen, and one a notorious crook. In what follows they will be referred to as Brown, Jones, and Smith, though not necessarily respectively. Each man made two statements to the police, which were in effect:

Brown:
  b1: I didn't do it.
  b2: Jones didn't do it.
Jones:
  j1: Brown didn't do it.
  j2: Smith did it.
Smith:
  s1: I didn't do it.
  s2: Brown did it.

The original completion is: Further investigation showed, as might have been expected, that both statements made by the judge were true, both statements made by the criminal were false, and of the two statements made by the average man one was true and one was false. Which of the three men was the judge, the average citizen, the crook? However, the puzzle can be formulated more naturally as follows:


Every person involved tells the truth except, possibly, the criminal. Who committed the crime?

To analyze the solutions of these two versions, the following logic-based representation of the testimonies is used:

b1: says(brown, innocent(brown), T).
b2: says(brown, innocent(jones), T).
j1: says(jones, innocent(brown), T).
j2: says(jones, innocent(smith), F).
s1: says(smith, innocent(smith), T).
s2: says(smith, innocent(brown), F).

To solve the puzzle, we have to keep in mind the following conflicting statements:

1. {b1, s2} ⊥
2. {j1, s2} ⊥
3. {j2, s1} ⊥

A straightforward and traditional way to solve this type of puzzle is through a matrix representation. For instance, a SaysMatrix, as specified in Definition 2, illustrates the speech acts involved in a context, in terms of the predicate Says(X, Y, T/F).

Definition 2. A SaysMatrix has rows that correspond to the agent that denies (F) or asserts (T) a predicate of another agent found in a column. If there are several predicates, the intersection of columns and rows is the pair or empty; if the matrix is associated with only one predicate, the intersection can have only T/F.

Figure 3-i presents the testimony of the suspects represented with the predicate Says(x, innocent(y), T/F) previously defined. Figure 3-ii depicts the solution to the original puzzle. This result is found based on the preconditions of the original completion, by trial and error. The only solution turns out to be that Brown is the culprit. Our interest is to know the implications of the omission of a testimony, or silence, in the context of the case study; that is, whether by considering silence instead of false statements we can also reach a solution to the puzzle, when analyzing the alternative (natural) completion of the case study puzzle. Under this assumption, the solution to the puzzle varies: the criminal turns out to be Smith, as illustrated in Fig. 3-iii. The program Criminal.lp (for Clingo, see Appendix A) implements the testimony of these three people, along with common sense knowledge. Thus, the program generates the solution: criminal(smith). Next, we present and exemplify some definitions of silence to explore their relationship with the solution of the original puzzle that serves as our case study.
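Before turning to those definitions, note that the natural completion can also be checked by brute force directly from the Says facts above. The following is only an illustrative sanity check written in Python, not the authors' implementation.

suspects = ["brown", "jones", "smith"]

# (speaker, subject, claimed_innocent)
statements = [
    ("brown", "brown", True),   # b1
    ("brown", "jones", True),   # b2
    ("jones", "brown", True),   # j1
    ("jones", "smith", False),  # j2
    ("smith", "smith", True),   # s1
    ("smith", "brown", False),  # s2
]

def consistent(criminal):
    """Everyone except possibly the criminal tells the truth."""
    for speaker, subject, claimed_innocent in statements:
        if speaker == criminal:
            continue  # the criminal's statements are unconstrained
        actually_innocent = (subject != criminal)
        if claimed_innocent != actually_innocent:
            return False
    return True

print([c for c in suspects if consistent(c)])  # ['smith']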


Fig. 3. SaysMatrix and solutions of the puzzle ‘the criminal’.

4 Interpreting Omission

Starting from the representation of the puzzle, we then describe and explore two variants of defensive silence and one of acquiescent silence related to the context of interest. The first silence to interpret is Defensive, which occurs when an agent intentionally decides to simply remain quiet, mainly out of fear. The second interpretation is Acquiescent Silence, i.e. asserting with silence what others have said, commonly caused by resignation. For both cases, we examine their consequences.

4.1 Defensive Silence

This kind of silence is intentional and proactive, intended to protect the self from external threats, and is defined next.

Definition 3. Defensive Silence. Withholding relevant ideas, information, or opinions as a form of self-protection, based on fear [3].

To reason in the presence of this kind of silence, the following rule is stated to interpret it.

TDS semantic rule: $P_{A_i}$ denotes the Total Defensive Silence of $A_i$, understood as:

$P_{A_i} = P - X_{A_i}$

with $(1 \leq i \leq n)$, where $n$ is the number of interacting agents, $P$ is a logic program or knowledge base, and $X_{A_i} = Says(A_i, *, *)$ is all that the agent $A_i$ Says.


When someone faces this kind of silence from one or more of those involved in a case under investigation, that particular information cannot be counted on. Hence, to take such behaviour into account in practical terms, we have to remove the declarations of those people, which involves deleting says(agent,*,*). So, expressing this kind of silence for the case study (puzzle): what would happen if silence with common sense is presented as a possibility? Can the interrogator or judge reach any conclusion when some of the suspects decide to intentionally remain quiet? Considering silence, i.e. (total) defensive silence (TDS), for each person giving their testimony, through its implementation as metaprogramming in Python (see Appendix B), would imply executing:

t_def_silence('Criminal.lp','brown')
t_def_silence('Criminal.lp','jones')
t_def_silence('Criminal.lp','smith')

and running:

clingo 0 (kb-tds-agent).pl

We obtain those presumably culpable; i.e., as a result of the silence of one or more persons, we can examine who turns into a candidate to blame. Table 1 includes the possible outcomes (guilty) if suspects decide intentionally to omit their testimonies. One can observe that the perpetrator can be anyone, depending on who remains silent. The following comments are pertinent for the different possibilities:

1. The first case coincides with the complete puzzle, i.e. every suspect provides his statement and Smith emerges as guilty.
2. If Brown remains silent, either of the other suspects can be the perpetrator.
3. When Jones remains quiet, either of the other two suspects may be guilty.
4. Smith is guilty even if he tries to protect himself with silence.

Table 1. Total defensive silence model for silent agent

Silent agent   Presumably culprit
{}             {Smith}
Brown          {Smith, Jones}
Jones          {Smith, Brown}
Smith          {Smith}
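The sweep behind Table 1 could be reproduced with a small driver along the following lines, assuming clingo is available on the PATH and the Appendix B prototype is saved as silence.py; the module name and the simple output filtering are assumptions.

import subprocess
from silence import t_def_silence  # Appendix B prototype (module name assumed)

kb = "Criminal.lp"
for agent in ["brown", "jones", "smith"]:
    t_def_silence(kb, agent)                      # writes Criminal-tds-<agent>.lp
    reduced = kb[:-3] + "-tds-" + agent + ".lp"
    result = subprocess.run(["clingo", "0", reduced],
                            capture_output=True, text=True)
    print(agent, "->", [line for line in result.stdout.splitlines()
                        if line.startswith("criminal(")])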

In TDS, we dispense with all the statements of a particular witness; however, it could happen that a witness decides to reveal only part of his version. This is the case of a partial defensive silence (PDS). In this scenario, we can now wonder: What part of the testimony could be convenient to silence in the case of suspects? Who would be the culprit in the event that some person decides to remain partially silent? What possibilities would each of the suspects have if, before giving his allegation, he had access to the testimony of the others?


Now, the rule to interpret this kind of silence turns into the following.

PDS semantic rule: $P_{A_i, p_j}$ denotes the Partial Defensive Silence of $A_i$, interpreted as:

$P_{A_i, p_j} = P - \{Says(A_i, p_j)\}$

with $(1 \leq i \leq n)$ and $(1 \leq j \leq m)$, where $n$ is the number of interacting agents, $A_i$ is an agent, $m$ is the number of utterances/assertions expressed by $A_i$, and $p_j$ is a particular assertion of $A_i$.

To explore this other scenario, we would have to first execute:

p_def_silence(kb,agent,predicate)

and then run:

clingo 0 (kb-pds-agent-predicate).pl

Table 2. Partial defensive silence model for predicate

Silent predicate   Presumably culprit
b1                 {Smith}
b2                 {Smith}
j1                 {Smith}
j2                 {Smith}
s1                 {Smith}
s2                 {Smith}

Detailing the PDS analysis, as shown in Table 2, we can observe that no predicate alone has the decisive capability to generate another possible culprit, since there is only one model in all cases.

4.2 Acquiescent Silence

The old saying “silence is consent” is behind the third interpretation of silence, and expresses a passive, disengaged attitude. This is defined as:

Definition 4. Acquiescent Silence. Withholding relevant ideas, information, or opinions, based on resignation [3].


Now this kind of silence is interpreted according to the rule stated next.

AS semantic rule: the Acquiescent Silence of $A_i$ is understood as the program

$P_{A_i} \cup (\{Says(A_j, *)\} \circ \lambda)$

with $i \neq j$ and $(1 \leq i, j \leq n)$, where $n$ is the number of involved agents, $P_{A_i}$ is the TDS program for $A_i$, $\lambda = \{A_j / A_i\}$, and the operator $\circ$ with the substitution $\lambda$ expresses the exchange of $A_j$ by $A_i$ on the Says subset of agent $A_i$.

The way to put this interpretation into practice is by omitting the whole person's testimony and adding new rules related to what is implicitly assumed with silence. For example, in the case of Smith, the following code has to be executed:

acq_silence('Criminal.pl','smith')

which causes the following:

1. A TDS corresponding to the Smith agent is applied, that is, his Says(x, y, z) predicates are eliminated.
2. The testimonies of the other witnesses are added as if they were his, that is, what others say is put in his mouth; i.e., Says(other-agents, *, *) instances are inserted.

The acquiescent silence (AS) for Smith, where he again turns out to be guilty, is obtained by the execution of the program:

clingo 0 Criminal-as-smith.pl

The answer was similar to the model of the problem in its natural version.

Table 3. Acquiescent silence model for silent agent

Silent agent   Presumably culprit
{}             {Smith}
Brown          Unsatisfiable
Jones          Unsatisfiable
Smith          {Smith}

The solutions found for the puzzle when a person is silenced under the AS interpretation are listed in Table 3. Notice that for the silence of Brown or Jones no model can be obtained, but again Smith is incriminated even if he resorts to AS.

4.3 False Statements or Silence

Is there a relationship between silence and false statements? As we can notice in Table 4, in the context of the case study, further information is hidden behind silence and more possibilities are opened.


Table 4. Combined silence

#    TDS     PDS   Presumably culprit
1    Brown   j1    {Smith, Jones}
2    Brown   j2    {Smith, Jones}
3    Brown   s1    {Smith, Jones}
4    Brown   s2    {Smith, Jones}
5    Jones   s1    {Smith, Brown}
6    Jones   s2    {Smith, Brown}
7    Jones   b1    {Smith, Brown}
8    Jones   b2    {Smith, Brown}
9    Smith   b1    {Smith}
10   Smith   b2    {Smith, Jones}
11   Smith   j1    {Smith, Brown}
12   Smith   j2    {Smith}

In the solution of the puzzle in its original version (as depicted in Fig. 3-ii), Jones is the one who declared two false statements (j1 and j2) and Brown only the first one (b1); these correspond to Jones' total defensive silence and a partial defensive silence of Brown, as can be noticed in row 7 of Table 4, where, of the two models obtained, one is the solution to the puzzle in its original version (Brown) and the other is the solution to the puzzle in the new (natural) version (Smith).

4.4 Discussion

As the case study showed, when someone resorts to defensive silence, possibilities are opened up, while with acquiescent silence they are reduced due to the increase in conditions that have to be met, even to the degree of finding no solution at all (as Table 3 showed). The characteristics of the puzzle allow us to perceive, intuitively, that behind silence there is important information, including the solution to some variant of it, as shown in the previous section. An important fact that we want to bring out is when two programs are equivalent with respect to the answer set semantics. For these models, a definition of equivalence is: two programs are equivalent if they have the same answer set [13]. The program whose known solution is Smith and the other programs obtained for the interpretations of silence are equivalent from the point of view of the achieved result. As part of the results, we automatically reformulate a knowledge base (KB) of a logical program P (speech acts, rules or actions) based on the information inherent in the omission, obtaining an equivalent program P' that has fewer rules than the original program. That is:


Let P and P' be two logical programs with predicates says(x, y, z) ∈ X and X ⊂ P. If P' = (P − X) such that X is “silenced” in P, then P' is a reduction of P.

Moreover, this formulation of interaction as speech acts (represented as predicates Says(x, y, z)) under the assumption of silence of one or more of the interacting agents shows non-monotonicity, because it allows us to draw tentative conclusions, in particular:

– With TDS and the silence of Smith, the first found answer led to Smith as the solution (see Table 1). This requires two fewer rules in the knowledge base (KB1).
– With AS and the silence of Smith, the perpetrator also comes out as Smith (see Table 3). Here, two additional rules are required, i.e. −2 + 4 (KB2).

Considering the cardinality of the knowledge bases, we observe that the following holds: |KB1| ≤ |KB| ≤ |KB2|.
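This cardinality relation can be checked empirically by counting the non-comment rules of the generated programs. The sketch below assumes the files produced by the Appendix B functions for the Smith scenarios; the file names follow that naming convention and are otherwise assumptions.

def rule_count(path):
    """Count non-empty, non-comment lines of a logic program as a proxy for |KB|."""
    with open(path) as f:
        return sum(1 for line in f
                   if line.strip() and not line.strip().startswith("%"))

kb = rule_count("Criminal.lp")
kb1 = rule_count("Criminal-tds-smith.lp")   # TDS for Smith: two says/3 facts removed
kb2 = rule_count("Criminal-as-smith.lp")    # AS for Smith: two removed, four added
assert kb1 <= kb <= kb2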

5 Conclusions

Given the close relation between Answer Set Programming (ASP) and intuitionist logic, each model generated in ASP can have a proof in the logic [14]. In consequence, each of the applications of the interpretations of silence and their possible combinations, intuitively, involves a proof within the belief system constructed from the rational agents' statements. As shown in the case study, behind silence there is valuable information that can be taken into account for decision making. So, implicatures can be achieved if intentional silence is interpreted within the appropriate context. In our case study, we can observe that there exists a relationship between the falsehood of statements and silence (omission). The study and modeling of silence can be useful for reaching intelligent human-computer interaction. Also, this research on silence can be used to analyse scenarios in legal cases involving testimonies. A strategy has been proposed. Scenarios where three kinds of silence are involved, e.g. one participant resorting to defensive silence (total or partial) and another to acquiescent silence, were explored in this study. Another possible interpretation to consider is that of pro-social silence, i.e. retaining work-related information or opinions for the benefit of other people or the organization. We would also like to bring the interpretations of silence to a more general framework for agent interaction, beyond answer set programming. For instance, the different semantics can be brought into dialogue systems where virtual agents appropriately handle the omission of answers during interaction, as in dialogue games.

Acknowledgments. The first author thanks the support provided by Consejo Nacional de Ciencia y Tecnología and Consejo de Ciencia y Tecnología del Estado de Puebla. The second author was partially supported by SNI.

A Criminal.lp for Clingo 4.5.4

%% Puzzle 51 of Wylie [19]
%% for the natural solution.
%%
suspect(brown;jones;smith).
%
%% Brown says:
says(brown,innocent(brown),1).
says(brown,innocent(jones),1).
%
%% Jones says:
says(jones,innocent(brown),1).
says(jones,innocent(smith),0).
%
%% Smith says:
says(smith,innocent(smith),1).
says(smith,innocent(brown),0).
%%%%%%%%%%%%%%%%%%%%%%
%
%% Everyone, except possibly for the criminal, is telling the truth:
holds(S) :- says(P,S,1), -holds(criminal(P)).
-holds(S) :- says(P,S,0), -holds(criminal(P)).
%
%% Normally, people aren't criminals:
-holds(criminal(P)) :- suspect(P), not holds(criminal(P)).
%
%% Criminals are not innocent:
:- holds(innocent(P)), holds(criminal(P)).
%
%% For display:
criminal(P) :- holds(criminal(P)).
%
%% The criminal is either Brown, Jones or Smith (exclusively):
holds(criminal(brown)) | holds(criminal(jones)) | holds(criminal(smith)).
#show criminal/1.

B Silence.py Prototype for Python 3.7

# Definition of Total Defensive Silence
# Input: knowledge base or logic program, and agent to silence
# Output: new knowledge base or logic program named 'kb'-'tds'-'agent'.lp
def t_def_silence(kb, agent):
    f = open(kb, 'r')
    g = open(kb[0:len(kb)-3] + '-' + 'tds-' + agent + '.lp', 'w')
    for line in f:
        if 'says(' + agent == line[0:5+len(agent)]:
            line = '%' + line
        g.write(str(line))
    f.close()
    g.close()

# Definition of Partial Defensive Silence
# Input: knowledge base or logic program, agent and predicate to silence
# Output: new knowledge base or logic program named 'kb'-'pds'-'agent'-'predicate'.lp
def p_def_silence(kb, agent, predicate):
    f = open(kb, 'r')
    g = open(kb[0:len(kb)-3] + '-' + 'pds-' + agent + '-' + predicate + '.lp', 'w')
    for line in f:
        if 'says(' + agent + ',' + predicate + '(' == line[0:5+len(agent)+len(predicate)+2]:
            line = '%' + line
        g.write(str(line))
    f.close()
    g.close()

# Definition of Acquiescent Silence
# Input: knowledge base or logic program, and agent to silence
# Output: new knowledge base or logic program named 'kb'-'as'-'agent'.lp
def acq_silence(kb, agent):
    f = open(kb, 'r')
    g = open(kb[0:len(kb)-3] + '-' + 'as-' + agent + '.lp', 'w')
    for line in f:
        if 'says(' + agent == line[0:5+len(agent)]:
            line = '%' + line
        elif 'says(' == line[0:5]:
            i = line.index(',')
            line_new = 'says(' + agent + line[i:len(line)]
            g.write(str(line_new))
        g.write(str(line))
    f.close()
    g.close()

References

1. Bohnet, I., Frey, B.S.: The sound of silence in prisoner's dilemma and dictator games. J. Econ. Behav. Organ. 38(1), 43–57 (1999)
2. Chatwin, J., Bee, P., Macfarlane, G.J., Lovell, K., et al.: Observations on silence in telephone delivered cognitive behavioural therapy (T-CBT). J. Linguist. Prof. Pract. 11(1), 1–22 (2014)
3. Van Dyne, L., Ang, S., Botero, I.C.: Conceptualizing employee silence and employee voice as multidimensional constructs. J. Manag. Stud. 40(6), 1359–1392 (2003)
4. Eco, U.: Signo. Labor, Barcelona (1994)
5. Garcés-Báez, A., López-López, A.: First approach to semantics of silence in testimonies. In: International Conference of the Italian Association for Artificial Intelligence, pp. 73–86. Springer (2019)
6. Garcés-Báez, A., López-López, A.: Towards a semantic of intentional silence in omissive implicature. Digitale Welt 4(1), 67–73 (2020)
7. Gebser, M., Kaminski, R., Kaufmann, B., Ostrowski, M., Schaub, T., Wanko, P.: Theory solving made easy with clingo 5. In: OASIcs-OpenAccess Series in Informatics, vol. 52. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2016)
8. Gelfond, M., Kahl, Y.: Knowledge Representation, Reasoning, and the Design of Intelligent Agents: The Answer-Set Programming Approach. Cambridge University Press, Cambridge (2014)
9. Gödel, K.: Zum intuitionistischen Aussagenkalkül. Anzeiger der Akademie der Wissenschaften in Wien, math.-naturwissensch. Klasse 69, 65–66 (1932)
10. Grice, H.P.: Logic and conversation. In: Cole, P., et al. (eds.) Syntax and Semantics: Speech Acts, vol. 3, pp. 41–58 (1975)
11. Khatchadourian, H.: How to Do Things with Silence, vol. 63. Walter de Gruyter GmbH & Co KG (2015)
12. Kurzon, D.: The right of silence: a socio-pragmatic model of interpretation. J. Pragmat. 23(1), 55–69 (1995)
13. Lifschitz, V., Pearce, D., Valverde, A.: Strongly equivalent logic programs. ACM Trans. Comput. Log. (TOCL) 2(4), 526–541 (2001)
14. Pearce, D.: Stable inference as intuitionistic validity. J. Log. Program. 38(1), 79–91 (1999)
15. Schröter, M., Taylor, C.: Exploring Silence and Absence in Discourse: Empirical Approaches. Springer, Heidelberg (2017)
16. Solow, D.: How to Read and Do Proofs: An Introduction to Mathematical Thought Processes. Wiley, Hoboken (2005)
17. Swanson, E.: Omissive implicature. Philos. Top. 45(2), 117–138 (2017)
18. VanderPlas, J.: Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media Inc., Sebastopol (2016)
19. Wylie, C.R.: 101 Puzzles in Thought and Logic, vol. 367. Courier Corporation (1957)

A Genetic Algorithm Based Approach for Satellite Autonomy

Sidhdharth Sikka and Harshvardhan Sikka

Manifold Computing, Atlanta, Georgia
Georgia Institute of Technology, OpenMined, Atlanta, Georgia
[email protected], [email protected]

Abstract. Autonomous spacecraft maneuver planning using an evolutionary computing approach is investigated. Simulated satellites were placed into four different initial orbits. Each was allowed a string of thirty delta-v impulse maneuvers in six Cartesian directions, the positive and negative x, y and z directions. The goal of the spacecraft maneuver string was, starting from some non-polar starting orbit, to place the spacecraft into a polar, low eccentricity orbit. A genetic algorithm was implemented, using a mating, fitness, mutation and crossover scheme for impulse strings. The genetic algorithm was successfully able to produce this result for all the starting orbits. Performance and future work are also discussed. Keywords: Evolutionary computation · Satellite · Autonomous systems

1 Introduction

Applications of orbital technology grow as the cost of producing satellites and launching them decreases. For the first time in history, diminished launch and development costs are making it possible to create distributed satellite systems. These systems have numerous operational advantages over monolithic satellite design, including but not limited to robustness to the consequences of failure in individual satellites, decreased cost of implementation and replacement due to the size of the units, and being in multiple locations in space. Increases in computing power and miniaturization of the associated platforms have also increased the computational resources available on-board small satellites. This allows for the use of distributed satellite systems in previously intractable problem domains, including planet-wide monitoring for tasks like gathering climate data and tracking wildlife migratory patterns [2]. Additionally, these advances have also opened up possibilities for coordination and autonomy in distributed satellite systems. If a group of satellites were to coordinate, they could accomplish objectives like flying in formation, assembling into more complex structures, retrieval and repair, etc. A short history of the prior research and application of satellite autonomy and distributed coordination (Satellite ADC) is presented below.


1.1 History of Autonomy and Distributed Space Systems

Autonomy in single satellites and spacecraft was implemented in earlier missions than autonomy among distributed systems of spacecraft. Implementations of autonomous spacecraft date back to 1997, when the Mars Pathfinder mission, in particular the rover Sojourner, employed simple autonomy features including simple landmark-based navigation. A year later, Deep Space 1 was performing autonomous task allocation [3] and image-based navigation [4], and three years later in 2000, NASA's Earth Observing 1 was autonomously imaging areas of interest on the Earth [5]. The first major mission successfully demonstrating a form of distributed autonomy was the NASA NODES mission. This mission deployed from the ISS in 2016 and was a demonstration of task allocation among multiple satellites using a coordination scheme called negotiation. Negotiation is when satellites establish a connection with each other, then allocate tasks among each other based on orbital parameters, telemetry and resource information [6]. The mission was successful in accomplishing its objectives. Currently, NASA has introduced a distributed autonomous mission paradigm called Autonomous Nano-Technology Swarm (ANTS) and has several missions in the planning stage following this paradigm, including the Saturn Autonomous Ring Array (SARA), a swarm of small satellites that will closely observe the rings of Saturn [2].

1.2 Related Work

Though implementations of distributed, autonomous space systems have been few and far between, exploration of the problem has been a popular topic of academic study around the world. This research has broken down the general problem of satellite autonomy and coordination into several sub-problems, including but not limited to retrieval and repair scenarios, satellite tasking, earth observation, space observation, formation flight and science scenarios [2]. The use of indirect coordination methods, like Ant Colony Optimization algorithms or stigmergy methods, and direct coordination, like the negotiation scheme used in the NODES mission, has been explored for spacecraft coordination [6]. A common thread, regardless of the coordination plan, is to use learning or optimization to improve coordination over time. Among the research into spacecraft autonomy, which is sometimes separate from spacecraft coordination, several schemas have been proposed to separate the various necessary on-board processing into parts. Most of these schemas separate on-board processing into a reactive layer, which takes sensor input and reacts in low level ways, and a reflective layer, which takes information from the reactive layer and reflects on performance. This reflective layer adjusts the reactive layer to achieve better performance towards high level goals. Across the literature, this seems to be the simplest autonomy schema, of which others are more complicated versions.

1.3 Advantages of Learned Approaches

The work presented in this paper is part of a larger effort to approach the problem of Satellite ADC with simulation and optimization, rather than through the development of robust (meaning error resistant in this context) deterministic algorithms. Advances in machine learning and simulation over the last 20 years mean that the space of possible pathways to satellite autonomy is more expansive than it was during the development of Deep Space 1. It is now possible for a satellite to carry out numerous simulations onboard to test potential actions it may take. The development of learning methods based on achieving a goal, including reward-based reinforcement learning or fitness-based evolutionary algorithms, allows for a learned approach based on simulated or real outcomes, as compared to traditional supervised learning approaches that require a training dataset representative of the problem domain [1]. This is ideal for autonomy, which is defined by the independent accomplishment of goals. So at this time, the alignment between Satellite ADC and learning methods is certainly present. The advantages of this approach, as opposed to developing robust algorithms to handle tasks autonomously, are manifold. Robust algorithms are developed by humans and don't flexibly adapt to the problem domain. Learning from simulation enables automatic accommodation of circumstances which are not foreseen by a human developer but do occur within simulation. As simulation becomes more advanced, and the problem domain more complex, it becomes more important that control algorithms learn from simulation. Robust algorithms also do not generalize; they are task specific, whereas learned approaches do generalize and can be adjusted to any task provided the reward or fitness functions are updated. For the reasons enumerated above, general application of learning systems, through the use of various simulation and optimization methods, promises significant impact on the problem of Satellite ADC.

1.4 Evolutionary Algorithm Preliminaries

An evolutionary algorithm is a type of optimization algorithm that is inspired by biological evolution. Rather than “learn” a good solution to a given problem through a system of well defined rewards and adjustments, as is the case in reinforcement learning approaches, an evolutionary algorithm creates a population of candidate solutions. These solutions “mate” with each other in some way to produce more solutions. This mating method must preserve some important aspects of the parent solutions, while also allowing the offspring solutions to differ from their parents to adequately search the solution space. Solutions have some fitness defined by the goal of the algorithm. In order for better solutions to be born, solutions with greater fitness must mate more often than candidates with lower fitness. This can be accomplished with either strong selection, where the less fit don't get to mate at all, or weak selection, where the less fit simply have a lower probability or proportion of mating. This is the basic form of an evolutionary algorithm; there has been significant work investigating the use of additional algorithmic steps and processes to prevent population stagnation and improve convergence [1]. We outline some of these extensions to basic evolutionary algorithms in the experimental descriptions below.


In the following sections, we elaborate on the goal and setup of the experiment, as well as the details of the algorithm developed for it.

2 Genetic Algorithm Implementation

2.1 Goal Task Description

The goal was for a simulated spacecraft, placed in some starting orbit, to perform a series of thirty delta-v impulse maneuvers, or instantaneous changes in velocity, that places it into a polar, circular orbit. The number 30 for the size of the series was chosen as a baseline; the number of delta-v maneuvers allowed would realistically be determined based on spacecraft parameters. These parameters may include factors like onboard fuel, orbital position, and target orbit. The goal of the program was to implement a genetic algorithm which, not provided with any special starting series of maneuvers, could produce a series of maneuvers resulting in a highly polar and highly circular final orbit. Out of the six Keplerian orbit elements, the only ones considered in this experiment were eccentricity and inclination.

2.2 Experimental Setup

The experiment was carried out using the Poliastro Python library, an open source orbital dynamics library. Four simulated spacecraft were placed into different non-polar and non-circular orbits, as can be seen in Fig. 1. The size of the delta-v maneuvers they could perform was constrained, and the directions were constrained to the six Cartesian directions, positive and negative x, y and z. They would perform one delta-v maneuver every 20 min; between maneuvers their orbit was propagated for 20 min. At every 20 min interval, they could also remain idle instead of performing a maneuver.

2.3 Basic Algorithm

The basic form of the implementation of this genetic algorithm included four parts: devising a fitness score, generating a population of solutions, mating these solutions, and populating a new generation with their children. The fitness score for this experiment was composed of two factors, the eccentricity and the inclination. The inclination of a polar orbit is 90°, so the fitness score was the distance of the inclination of the final orbit from 90° summed with the eccentricity of the final orbit. The final orbit in this case refers to the resultant orbit of a particular maneuver sequence. The lower the fitness score, the more fit the maneuver sequence. The population generation was performed by using a random number generator to generate random numbers between 0 and 6 and creating a 30-digit-long string of these numbers. A 0 meant remain idle, and each other number signified an impulse in the positive x, y, z or negative x, y and z directions respectively. The mating scheme for two impulse strings was devised as follows. Element for element, if the mother and father strings agree on an element, the offspring has that element as well. If they differ, a random maneuver is put in the place of that element. This scheme was selected because a scheme was needed which would allow traits of the parent solutions to be preserved in the children while still allowing for some variance for population diversity. Finally, the next generation was populated with children solutions. For evolution towards fitter solutions to occur, the fittest solutions needed to mate the most, and the least fit needed to mate the least. Evolution was carried out over 200 generations, and the fittest solutions were output as the algorithm's chosen solution. To combat stagnation in fitness score, mutation and crossover were introduced. A chance of mutation was introduced such that when two sequences mated, even if they agreed on an element, there was a small chance of the element being randomly generated. To introduce crossover, a second population evolving alongside the first was created. Every five generations, these two populations would interbreed, and their traits would cross over. When the fitness scores of both populations stagnated, a randomly generated but seeded “immigrant” population would be brought in to mate with one of the populations. Seeded random generation means that populations were generated around an input sequence and would have a chance of preserving the elements of the input sequence. So an “immigrant” population would be similar to the input sequence, in this case the fittest of one of the populations.

Fig. 1. Initial orbits of simulated spacecraft. Varying orbital radii and starting points were selected. The plot is three dimensional, so an angle that showed the orbits best was selected.

Fig. 2. Final orbits of simulated spacecraft. The only orbital parameters being selected for were inclination and eccentricity, so the orbits are polar and circular but not on the same plane or the same size. The plot is three dimensional, so an angle that showed the orbits best was selected.
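The scheme just described can be condensed into a short sketch. This is a hedged illustration rather than the authors' code: the orbit propagation is replaced by a toy simulate() stand-in (the actual experiment propagates orbits with Poliastro), strong selection of the fitter half is used for brevity, and the two-population crossover and immigrant mechanisms are omitted.

import math
import random

GENES = list(range(7))     # 0 = idle, 1..6 = impulses along +x, +y, +z, -x, -y, -z
LENGTH = 30                # thirty maneuvers per string
TARGET_INCLINATION = 90.0  # degrees, i.e. a polar orbit

def simulate(sequence):
    """Toy stand-in for propagating the orbit and applying one impulse every 20 min
    (done with Poliastro in the paper); returns (inclination_deg, eccentricity)."""
    inclination = 30.0 + 2.0 * sequence.count(1) - 2.0 * sequence.count(4)
    eccentricity = abs(sequence.count(2) - sequence.count(5)) / LENGTH
    return inclination, eccentricity

def fitness(sequence):
    """Distance (in radians) of the final inclination from polar, plus eccentricity."""
    inclination, eccentricity = simulate(sequence)
    return abs(math.radians(inclination - TARGET_INCLINATION)) + eccentricity

def random_individual():
    return [random.choice(GENES) for _ in range(LENGTH)]

def mate(mother, father, mutation_rate=0.02):
    """Keep agreed elements, randomize disagreements, with a small mutation chance."""
    return [m if m == f and random.random() > mutation_rate else random.choice(GENES)
            for m, f in zip(mother, father)]

def evolve(population_size=50, generations=200):
    population = [random_individual() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)        # lower score = fitter
        parents = ranked[: population_size // 2]        # fitter half mates
        population = [mate(random.choice(parents), random.choice(parents))
                      for _ in range(population_size)]
    return min(population, key=fitness)

best = evolve()
print(fitness(best), best)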

2.4 Results

The four spacecraft that were started in different orbits all achieved final orbits with fitness scores of less than 0.1, meaning the distance in radians from a polar inclination summed with the eccentricity was less than 0.1. The results can be seen in Fig. 2, so in conclusion the genetic algorithm developed was successful at planning a series of thirty impulse maneuvers that would place the spacecraft into polar, circular orbits. Modifications to the fitness function, and nothing else, enabled the simulated spacecraft to achieve orbits close to selected values for other pairs of orbital parameters. The final orbits when the fitness was how close the eccentricity was to zero and the semi-major axis was to 15,000 km are shown in Fig. 3, and the final orbits when the fitness was how close the eccentricity was to zero and the longitude of ascending node was to zero are shown in Fig. 4. The rates of convergence of the solutions were largely the same for all the orbital parameters selected; the same number of total generations was used for each, and the final solutions had similar fitness scores. Initial convergence is faster than the rate of convergence late in the process, and this is to be expected. Initial populations have high population diversity, and simple changes can drastically affect fitness in a positive way. Later in the process, simple changes can still drastically affect fitness, but usually not for the better.

Fig. 3. Final orbits of simulated spacecraft. The orbital parameters being selected for were semi-major axis and eccentricity, so the orbits are of the same size and all circular, but not co-planar.

Fig. 4. Final orbits of simulated spacecraft. The orbital parameters being selected for were longitude of ascending node and eccentricity, so the orbits are circular and the right ascension of the point where the orbital plane crosses the equator is zero radians, but they are not the same size.

3 Conclusions and Future Work

In this work, we outline a constrained experimental goal of selecting a series of thirty impulse maneuvers which will result in a satellite achieving an orbit with two desired orbital parameters. We also present the genetic algorithm developed to accomplish this goal, as well as the successful results it achieves. This work was a constrained version of the more general problem of autonomous maneuver planning for spacecraft. The success of this optimization approach demonstrates a first step in the broader direction of leveraging learning systems in the coordination and autonomous direction of individual and distributed satellite systems. Future work in this line of inquiry includes achieving environmental interaction, more complex maneuvering, cooperation and coordination, multiple sequential goals, and time dependent goals of agents in an increasingly sophisticated and realistic simulated space environment using learned methods. We intend on exploring these areas in upcoming work.



Communicating Digital Evolutionary Machines

Istvan Elek, Zoltan Blazsik, Tamas Heger, Daniel Lenger, and Daniel Sindely

Faculty of Informatics, 3in Research Group, ELTE Eötvös Loránd University, Budapest and Martonvásár, Hungary
[email protected]
http://mapw.elte.hu/elek

Abstract. In this article, we summarize our research results on the topic of the spontaneous emergence of intelligence. Many agents are sent into an artificial world, which is arbitrarily parametrizable. The agents initially know nothing about the world. Their only ability is remembrance, that is, the use of experience which comes from the events that happened to them in the course of their operation. With this individual knowledge base they attempt to survive in this world and to build better and better knowledge for further challenges: a completely random wandering in the world; wandering in possession of growing personal knowledge; and wandering in which evolutionary entities exchange experiences when they accidentally meet in a particular part of the world. The results are not surprising, but very convincing: without learning, the chances of survival are the worst in a world with a given parametrization. When experiences are organized into a knowledge base through individual learning, the chances of survival are obviously better than in the case of a completely random walk. Finally, when creatures have the opportunity to exchange experiences when they meet in a certain field, they have the greatest chance of surviving.

Keywords: Artificial intelligence · Simulation of emergence of intelligence · Learning agents

1 Introduction

Intelligence is studied in a wide range of disciplines, such as psychology, cognitive science, philosophy, mathematics, artificial intelligence, and neural networks, but it is also linked in many threads to evolutionary biology and even paleontology. The issue is more relevant today than ever. In the plenary presentations of neural network and artificial intelligence conferences (Grossberg [1,2], Adami [3], Ofria [4]), the spontaneous development of intelligence is cited as one of the most important unanswered questions, or more precisely, as a question whose very nature remains unresolved. How did the high level of intelligence that can be found not only in humans but also in mammals, birds, and even molluscs come into being (Gause [5,6], Maynard [7], Ostrowski [8])? In this context, intelligence does not mean rational thinking, but the ability that these living beings can mobilize for survival and prosperity. Our research results are still far from these high levels of intelligence, but let us not forget that the long development that led to human thinking abilities initially started from very simple thinking and actions by very simple entities. We set out to simulate this simple thinking. We described the theoretical background of the work in many articles and books (Elek [9–15]). The spontaneous emergence of intelligence is a fundamental issue for evolution. Evolutionary biologists and paleontologists already know a lot about the operation of early evolution. The first classical experiment was carried out in 1952, when Stanley Miller was able to produce amino acids from inorganic materials with an experimental apparatus. Since then, many such experiments have taken place, and a number of theories have been outlined to model physical and mental evolution.

2 An Artificial World

We created an artificial world that is a variation of the classic wumpus world. This is a chessboard world with dangerous energy eaters (wumpus, trap) and energy sources (gold) (Fig. 1). However, this world is different from the classical wumpus world in that it is not a real chessboard, but a torus (Fig. 2), which thus has no edges. Agents never run into boundaries, even though the world is finite. This world is the scene of the operation of artificial beings (digital evolutionary machines, workers for short). We sent them here in large quantities and watched them prosper.
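As an illustration of the torus geometry, here is a minimal sketch of a wrap-around grid world; the class name, the cell encoding and the four-direction step are illustrative assumptions, not code from the paper:

```python
class TorusWorld:
    """A chessboard-like grid whose edges wrap around, so agents never hit
    a boundary even though the world is finite."""

    def __init__(self, width, height, cells=None):
        self.width = width
        self.height = height
        self.cells = cells or {}   # (x, y) -> 'gold' | 'trap' | 'wumpus'

    def step(self, x, y, direction):
        # Move one field in one of the four directions; coordinates wrap
        # modulo the grid size, which is what makes the world a torus.
        dx, dy = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}[direction]
        return (x + dx) % self.width, (y + dy) % self.height
```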

3 Digital Evolutionary Machines

We have called our artificial beings (agents, workers) digital evolutionary machines; they remember everything that happened to them during their lives. Their operation requires energy to be gained from the gold bars of the artificial world, meaning they need to find as much gold as possible to survive. To do this, however, they must be in motion. There are also energy eaters in the world. One is the trap, which, when workers encounter it, significantly reduces their energy. The other is the wumpus, which deprives them of a large amount of energy; if they encounter it early in life, it will surely result in their destruction. The movement needed to find the gold, of course, also requires energy. The movements of the creatures are basically random. They can go from a given field in four directions (since there are no boundaries on the torus, this is always true). They use their knowledge base to check whether the next field is a wumpus or not. If it is, they ask for another random direction and do so until they get a harmless field. Of course, if they do not have information about the next field, they can also step into an energy-eating field. If they do not perish in this, they will store this dangerous place in the knowledge base. Those who work so effectively that their energy level exceeds a preset value can create two (or more) offspring that are born with the same initial energy as the parent, but inherit the knowledge that the parent has accumulated so far. The knowledge of beings who have perished is also lost, unless their offspring were created before their demise and inherited the knowledge gathered so far.

Fig. 1. The chessboard world and its players: red fields (traps), gray fields (wumpuses) and gold fields (gold bars). The traps and the wumpuses are energy eaters: when workers encounter them, the energy of the beings is greatly reduced. The gold bar, on the other hand, is a source of energy whose acquisition increases the energy content of the creatures and helps them to survive. The numbers in the fields indicate after how many steps the specified agent arrived at that field.

Fig. 2. The torus world is created by folding the two edges of the chessboard world (purple line on the torus) to form a cylinder. The cylinder is then folded to form a torus so that the two base circles meet (red line).

Workers work in three modes (a sketch of the three modes is given after the list):

1. They do not gather knowledge; the outcome of their actions is shaped by blind chance. In the case of friendly worlds, this can also result in prosperous functioning, the creation of offspring, and population growth.
2. They gather knowledge, that is, if they have survived an encounter with a wumpus or trap, they store its place in their knowledge base. As a result, they are increasingly likely to avoid energy-eating fields. In the case of large worlds, this is a protection only if they step again into a known territory during their accidental wanderings.
3. They gather not only their own experiences, but also pass them on to each other if they step into the same field at the same time. This is the coincidence.

During the work, we stored all the elements of the operation of the creatures in a database so that we could evaluate the results of the different runs.
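A minimal sketch of the three operating modes, building on the toy grid sketched earlier; the agent attributes (x, y, known_dangers) and the function names are hypothetical, not taken from the paper:

```python
import random

DIRECTIONS = ["N", "S", "E", "W"]

def choose_direction(agent, world, mode):
    # Mode 1: blind random walk. Modes 2 and 3: re-roll the direction while
    # the target field is already known to be dangerous (defensive strategy).
    options = DIRECTIONS[:]
    random.shuffle(options)
    for d in options:
        if mode == 1 or world.step(agent.x, agent.y, d) not in agent.known_dangers:
            return d
    return options[0]  # every neighbour is a known danger: move anyway

def exchange_knowledge(a, b):
    # Mode 3 only: two workers standing on the same field merge their maps of
    # dangerous places ("the coincidence").
    merged = a.known_dangers | b.known_dangers
    a.known_dangers, b.known_dangers = set(merged), set(merged)
```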

4 Results

The success of the different simulations was measured by the change in the initial population. The increase in the size of the population and the amount of energy indicated the success of each run. The result depended on many things. On the one hand, the friendliness or hostility of the world influenced the result, and on the other hand, so did the working mechanism of the workers. In a friendly world, non-learning populations could also be effective, especially if their replication capacity was set to greater than 2. In an unfriendly world, however, the population could no longer be effective without knowledge. The chances of populations evolving increased significantly when they not only gathered knowledge but also exchanged it with each other when they met in one of the fields. In a hostile world there is no chance of survival for anyone. The following figures illustrate the effectiveness of the different runs. First, we present the result in a not-so-friendly world. Figure 3 shows a case where we sent 100 workers into the world, but these did not gather knowledge. Due to accidental wandering, their energies ran out and they perished within a short time. For the sake of comparability, the simulation was stopped in each case at the 3000th iteration step. Figure 4 shows a case where we also sent 100 workers into the world, but these workers already stored the dangerous places if they had survived such encounters. In the beginning, the devastation was great here too, because they did not yet have the knowledge to avoid dangerous fields. However, this knowledge was individual, supporting only its owner.

Fig. 3. A case without learning: wandering is completely random, no knowledge, no experience. Their chances of survival are not good in this hostile world, where there are a lot of wumpuses and traps but not enough gold bars. Since the number of gold-containing fields is very small, the wandering consumes their energy.


Fig. 4. Learning case: here they already gather experience, which is used in every subsequent step to decide whether a randomly selected field is dangerous or not (defensive strategy). In this way, they are less likely to step into an energy-eating field, because fields they already know are no longer stepped on. They thrive in a less friendly world. However, in a very energy-poor world, knowledge does not guarantee survival either.

Figure 5 shows a case where we also sent 100 workers into the world, but these workers not only stored the dangerous places they encountered and survived, but also exchanged them with workers who entered the same field at the same time. Clearly, increasing each other's knowledge increased the population's chances of survival, as more and more individuals knew more and more about their environment. Now let us look at a case where we bring 100 workers into a friendlier world. The result is shown in Fig. 6.


Fig. 5. A case of learning and information exchange: in this case, they gather their own experiences of dangerous fields, and when workers meet, they exchange their knowledge. This gives them a better chance of learning about dangerous fields, as a result of which they have a good chance of avoiding known energy-eating fields.

In the case of learning entities, the friendly world also provides better conditions for population development. The result can be seen in Fig. 7. If the learning workers exchange their knowledge when they coincide, their population grows remarkably high in a friendlier world. The result can be seen in Fig. 8.


Fig. 6. The evolution of the population in the incident-free (non-learning) case in a friendly world: wandering is completely random, with no knowledge and no experience gained, but the population still survives. Accidental wandering and the amount of energy in the world still provided them with enough energy.

Fig. 7. Evolution of learning entities of the population in a friendlier world: they gather experience that is used in each subsequent step to decide whether a randomly selected field is dangerous or not (defensive strategy)


Fig. 8. Population development in the learning and information-exchanging case in a friendly world: workers gather their own experiences of dangerous fields and exchange them with each other when they meet in one of the fields. This strategy is remarkably more effective in a friendlier world.

Another curiosity is how the knowledge of individuals and that of the population as a whole relate to each other. Compare the path traversed by one worker with the paths traversed by all individuals (Fig. 9). It is clear that the knowledge of the individual remains significantly lower than that of the whole population; the more often the individuals meet and exchange their knowledge, the closer they come to the knowledge of the whole population.

5 Summary

As can be seen from the above figures, the effectiveness of the population, i.e. the development of the population size, is strongly influenced by knowledge management. Clearly, teaching each other increases the chances of survival.


Fig. 9. Knowledge of the individual worker with ID '16' (left) and knowledge of the entire population (right). Dark fields are unknown, green fields are known.

However, there is another important finding that is not shown in the figures. Examining the data of the workers, it can be seen that the effectiveness of the struggle is influenced not only by luck and the friendliness of the world, but also by the fact that the workers exist in multitude. Examining their individual data, it can be seen that during a longer simulation almost no initial entity survives in a large world, but their experience, when incorporated into the knowledge of their descendants, contributes significantly to the "intelligence" of later generations.


Acknowledgment. The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

References

1. Carpenter, G.A., Grossberg, S.: Pattern Recognition by Self-Organizing Neural Networks. MIT Press (Bradford Books) (1991). ISBN 0-262-03176-0
2. Grossberg, S.: Studies of mind and brain: neural principles of learning, perception, development, cognition and motor control. Boston Stud. Philos. Sci. 70, 622 (2013)
3. Adami, C.: Ab Initio of Ecosystems with Artificial Life. arXiv:physics/0209081v1, 22 September 2002
4. Adami, C., Ofria, C., Collier, T.C.: Evolution of Biological Complexity. arXiv:physics/0005074v1, 26 May 2000
5. Gause, G.F.: Experimental studies on the struggle for existence. J. Exp. Biol. 9, 389–402 (1932)
6. Gause, G.F.: The Struggle for Existence. Williams and Wilkins, Baltimore (1934)
7. Smith, J.M.: The Problems of Biology. Oxford University Press, Oxford (1986)
8. Ostrowski, E.A., Ofria, C., Lenski, R.E.: Ecological specialization and adaptive decay in digital organisms. Am. Nat. 169(1), E1–E20 (2007)
9. Elek, I.: Digital evolution machines in an artificial world: a computerized model. In: International Conference on Artificial Intelligence and Pattern Recognition (AIPR-10), Orlando, USA, 12–14 July 2010, p. 169 (2010)
10. Elek, I., Roden, J., Nguyen, T.B.: Spontaneous emergence of the intelligence in an artificial world. In: Huang, T., Zeng, Z., Li, C., Leung, C.S. (eds.) ICONIP 2012. LNCS, vol. 7667, pp. 703–712. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34500-5_83
11. Elek, I.: A computerized approach of the knowledge representation of digital evolution machines in an artificial world. In: Tan, Y., Shi, Y., Tan, K.C. (eds.) Advances in Swarm Intelligence. ICSI 2010. LNCS, vol. 6145, pp. 533–540. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13495-1_65
12. Elek, I.: Evolutional aspects of the construction of adaptive knowledge base. In: Yu, W., He, H., Zhang, N. (eds.) Advances in Neural Networks – ISNN 2009. LNCS, vol. 5551, pp. 1053–1061. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01507-6_118
13. Elek, I.: Evolutional aspects of the construction of adaptive knowledge base. In: Yu, W., He, H., Zhang, N. (eds.) ISNN 2009. LNCS, vol. 5551, pp. 1053–1061. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01507-6_118
14. Elek, I.: Principles of digital evolution machines. In: International Conference on Artificial Intelligence and Pattern Recognition, Orlando, Florida, July 2008. ISBN 978-1-60651-000-1
15. Elek, I.: Emergence of Intelligence. Nova Science Publishers, New York (2018)

Analysis of the MFC Singuliarities of Speech Signals Using Big Data Methods

Ruslan V. Skuratovskii1 and Volodymyr Osadchyy2

1 Interregional Academy of Personnel Management, Kyiv 03039, Ukraine
[email protected]
2 Department of Computer Science, UCF, Orlando, FL, USA
[email protected]

Abstract. The role of human speech is intensified by the emotion it conveys. The parameterization of the vector obtained from a sentence divided into an emotional-informational part and a purely informational part is effectively applied. There are several characteristics and singularities of speech that differentiate it among utterances, i.e. various prosodic features like pitch, timbre, loudness and vocal tone, which categorize speech into several emotions. We supplemented them with a new classification feature of speech, which consists in dividing a sentence into an emotionally charged part and a part that carries only an informational load. Therefore, the sample speech is changed when it is subjected to various emotional environments. As the identification of the speaker's emotional states can be done on the basis of the Mel scale, MFCC is one such variant with which to study the emotional aspects of a speaker's utterances. In this work, we implement a model to identify several emotional states from MFCC for two datasets, classify emotions for them on the basis of MFCC features and give a comparison of both. Overall, this work implements a classification model based on dataset minimization, done by taking the mean of the features, to improve the classification accuracy rate of different machine learning algorithms.

Keywords: Big data classification · Speech recognition · Emotion recognition · MFCC · Supervised learning · Decision trees · Mel scale features · KNN

1 Introduction

To obtain emotion score differences we use the Miller function and scale. The vocal acoustics are full of emotional cues for analyzing the speaker's emotional state. Since the expression of emotions occurs most often either at the beginning or at the end of a sentence, we divided each sentence into two parts. To the first part we refer the beginning and the end of a sentence, which express the emotional content of the author; to the second part we refer the middle of a sentence, which contains only the informational and narrative part. This serves the vector parameterization in KNN for speech emotions. Each emotion is associated with the tone of the speaker. Emotions can be recognized both from text and from sound. Each of them has different approaches to identify the emotional state of the speaker. Emotions also greatly define interpersonal relations by affecting intelligent and rational decision-making. The emotions are a communication bridge between the speaker and the listener. Interaction between individuals is much clearer and more effective when emotions are used in utterances. They play a pivotal role in engaging a human being in a group discussion and can tell a lot about one's mental state [1]. The information hidden in emotions ignited the evolution of the speech recognition field, commonly referred to as automatic speech recognition. Several models of retrieval and interpretation of emotions from images of the speaker's face and from recordings of his expressions, voice and tone during a conversation have been proposed by researchers. The utilization of physiological signals in the same manner has also been discussed [2]. The significance of emotions in communication can hardly be overestimated, since they express the speaker's intentions to his listeners. There are several spoken language interfaces available today that support automatic speech recognition. By collecting samples, such systems provide a base for the speech recognition field [3]. The currently available speech systems are capable of processing naturally spoken utterances with high accuracy, but the lack of an emotional component makes ASR systems less realistic and meaningful. There are several real-world fields that may benefit from the recognition of the emotional context of an utterance, such as entertainment, emotion-based audio file indexation, HCI-based systems, etc. [4]. Some of the selected features can be trained to classify, recognize and predict emotions. There are several emotions that can be extracted from utterances; a few of the universally listed among them are happiness, fear, sadness, anger, neutral and surprise. These emotions can be recognized by any intelligent system, constrained by computational resources. The implementation of the emotional sector of speech makes human-computer interactions more real and efficient. The analysis of voice and speech for the sake of enhancing the quality of human conversations is reasonable and within the bounds of possibility. The results of emotion detection can be broadly applied in e-learning platforms, car on-board systems, the medical field, etc. The remaining sections have the following content: Sect. 2 contains a literature overview of the topic, Sect. 3 describes the data corpus, Sect. 4 states the problem, Sect. 5 presents the details of the method implementation and the results achieved, and Sect. 6 is the conclusion. In this paper we prove the efficiency of the concept of using only 12 of the 39 MFCC features and identify which 12 MFCC to use for speech and emotion analysis. The dataset used for this experimentation is EMODB. Variants of supervised learning approaches have been implemented to classify emotions from the two databases EMODB and SAVEE.

2 Literature Review

2.1 Review of Classification Methods

As is well known, KNN makes predictions at query time by quickly calculating the similarity between an input sample and each training exemplar. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point. Spectral analysis is a promising method of emotion detection from speech samples. Prosodic features of speech signals can also be used for analyzing emotions, since they contain emotional information. Researchers explored the role and context of emotion by using a set of 88 features called eGeMAPS [5]. Speech patterns can be obtained from a combination of various speech features acquired from the speaker's utterances. Feature selection plays a pivotal role in differentiating the emotions of the same speaker from his speech [6], and it relies on selecting the best features from the signal. Different human languages have different accents, sentence structures and speaking styles [7], thus making the identification of emotions from utterances challenging. Various aspects of spoken languages alter the extracted features of the sound signal. It is possible that a sample speech has more than one emotion, which means that each emotion corresponds to a different part of the same speech signal; this complicates the setting of boundaries between emotions. An attempt has been made in the literature to study models of multilingual emotion classification [8].

Fig. 1. Models of the multilingual emotion classification

The substantial advancement in technology has boosted the development of the emotion recognition fields [9, 10], for example call centers and remote education [11]. Existing speech recognition systems can be improved by implementing spectral analysis [12]. Authors have identified classes of features extracted from speech electrography and speech signals. There is an important aspect of SER that includes the characterization of the emotional content of speech [13]. Several speech features are obtained from speech acoustic analysis and can be used to detect and predict emotions [14]. The aim of selecting speech features is to determine properties that can improve the rate of classification of emotions from a feature set [15]. Machine learning models are flexible enough to adapt themselves to any model that studies emotions and show good performance in prediction tasks based on selected features [16].

2.2 Maintaining the Integrity of the Specifications

Emotions are intertwined with mood, temperament, personality, sentiment and motivation [17]. Emotions can be understood as a complex feeling of the mind that results in physiological and psychological changes. Human thoughts and behavior are influenced by emotions, and there are instances of changes in the body when it encounters different emotional states [18]. The literature [19, 20] proves that there is a considerable influence of emotional syndromes on human actions and reactions. Several applications created by researchers rely on emotion detection as an integral component for the identification of behavioral patterns [21, 22]. Speech features can be extracted from various sources to accomplish predictive analytics (Fig. 1). The list of sources includes the vocal tract, the excitation source and prosodic extraction.

Table 1. Speech preprocessing techniques in an SER (Speech Emotion Recognition) system [23]: components and tasks

Component          | Task
Feature Extractor  | Takes the signal as input and generates emotion features
Emotion Classifier | Maps the speech to one or more emotions

Fig. 2. Framework for supervised emotion classification [23].


Emotions can be broadly studied using both discrete and continuous approaches. Different classes represent different kinds of emotions, and the continuous approach to studying emotions is a derivation based on a combination of several psychological measurements on different axes [24]. Speech emotion recognition mainly identifies emotions on the basis of the categorical approach, as shown in Table 1, which depends on the usage of common words. Researchers derive emotions from facial expressions, speech and various physiological signals. The analysis of facial expressions is a great way to find emotions [25–29], since the human face displays emotions very aptly even without a single word uttered. Voice recordings are potentially important for expressing the speakers' mental state and their intentions. Speech features (Fig. 2) can be studied as vectors for the detection of emotion from a data set [30–32]. The autonomic nervous system allows an emotion to be assessed, and thus physiological signals like ECG, RSP and BP can be utilized to recognize people's emotions and possibly help cure mental illnesses [17]. We took into consideration the line of temperament. We also integrate the state in which a representative of this type of temperament may be. We apply the 8 axes of Miller; it is also optimal to take 8 axes. By consciously controlling the volume level, for example to emphasize a secret, even an angry person can make the voice calmer and quieter in order to show that it was a secret, a question or the essence of the issue. On the contrary, in order to highlight the characterization of some hero of his story, the speaker can consciously increase the amplitude of the voice. We also account for the physiological features of voice and hearing: the average amplitude (frequency) and other characteristics of the person's voice, obtained on the basis of data from the database, are taken into account. Given the above assumptions, if we wish to approach the study of temperament, we first need an understanding of emotion. Such understanding is not easy to come by; a glance at a typical textbook of psychology will show that "emotion" is used to refer to a ragbag of apparently disconnected facts and is never itself clearly defined at all. Yet, within one branch of psychology, namely animal learning theory, there has long been a reasonably clear consensus that emotions consist of states elicited by stimuli or events which have the capacity to serve as reinforcers for instrumental behavior. This, for example, is the framework within which Miller analyzed the concept of "fear" and its role in avoidance learning [73]. The term temperament has considerable overlap with dimensions of personality and with emotional, cognitive, and behavioral functions.

3 Statistical Data Corpus

3.1 Emotional Speech Databases

Suitable databases are needed to train emotion recognition systems. Researchers suggest several existing databases aligned with the task of detecting and classifying emotions. These databases can be categorized into three broad domains covering acted emotions, natural emotions and elicited emotions [33]. Out of the three mentioned domains, enacted emotions are frequently supported by research since they are strong and reliable. EMODB is one such highly used database for emotional classification. SAVEE is yet another enacted emotion database used for studying emotions. EMODB is a Berlin database for emotions, while SAVEE is an English database specifying various emotions. Three databases have been utilized for training and testing:

1. BERLIN DATABASE
2. SENTIMENT DATABASE (English)
3. SAVEE DATASET

The Berlin Database was created in 1999 and consists of utterances spoken by various actors. EMODB has a different number of spoken utterances for each of seven emotions [34]. The emotions included in the database are anger, boredom, disgust, fear, happiness, indifference and sadness. The dataset contains more than 500 utterances spoken by 51 male and 60 female actors aged 21 to 35 years. The emotions labelled in EMODB are listed in Table 2. Surrey Audio-Visual Expressed Emotion (SAVEE) [35, 36]: the increasing demand for research in speech analysis led to the development of the SAVEE database recordings to help study automatic emotion recognition systems. The database contains recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The TIMIT corpus was used for sentence selection and contains phonetically balanced emotions. The data were recorded in a visual media lab with efficient audio-visual equipment. The recordings were then processed and labelled. Ten subjects evaluated the quality of performance of the recordings under audio, visual and audio-visual conditions. The actors of the database utterances were four male speakers annotated as DC, JK, JE and KL. The speakers who contributed the recordings were postgraduate students and researchers at the University of Surrey, aged between 27 and 31 years. Seven discrete categories of emotions described as anger, disgust, fear, happiness, sadness and surprise were recorded [37]. The focused research was carried out on recognizing the discrete emotions [38]. Table 3 compares the features of both datasets used in the experimental analysis.

Table 2. EMODB labels

Letter | Emotion (German) | Emotion (English)
W      | Ärger (Wut)      | Anger
L      | Langeweile       | Boredom
E      | Ekel             | Disgust
A      | Angst            | Fear/Anxiety
F      | Freude           | Happiness
T      | Trauer           | Sadness
N      | Neutral          | Neutral


Table 3. Comparison of EMODB and SAVEE

Attributes        | EMODB                                                     | SAVEE
No. of speakers   | 111                                                       | 4
Age of speakers   | 21 to 35 years                                            | 27 to 31 years
No. of utterances | 500+                                                      | 480
Language          | German                                                    | British English
Emotions          | Angry, happy, anxious, fearful, bored, disgusted, neutral | Anger, disgust, fear, sadness, happiness, surprise, neutral

3.2 Feature Extraction [39–41]

The core step for recognizing speech or emotions from speech is extracting the features of speech. The process of feature extraction refers to identifying the components of the vocals from the audio signal. The audio signal is a good source of linguistic information if the noise in the signal is discarded. There is another interpretation of feature extraction which says that it is the characterization and recognition of information specifically related to the actor's (speaker's) mood, age and gender. The general process of feature extraction involves the transformation of the raw signal into feature vectors, which suppress the redundancies and emphasize the speaker-specific properties, such as pitch, amplitude and frequency. Speaker dependencies such as health, voice tone, speech rate and acoustical noise variations may vary the speech signal between the testing and training sessions [31, 42, 43]. The shape of the vocal tract filters the sounds generated by human beings, and if it is determined efficiently it can be used to derive a phoneme representation of the speech sample with high accuracy. The features to be extracted from speech can be studied under three categories: high-level features, which may include phones, lexicon, accent and pronunciation; prosodic and spectro-temporal features, such as pitch, energy, duration, rhythm and temporal features; and short-term spectral and prosodic features pertaining to the spectrum and glottal pulse [44–46]. Short-term spectral features aid in better prediction with higher accuracies for various applications. Spectrogram analysis can be used for information extraction from the short-term spectral features. Linear Predictive Cepstral Coefficients (LPCC), Mel-Frequency Discrete Wavelet Coefficients (MFDWC) and Mel-Frequency Cepstral Coefficients (MFCC) are the most commonly used short-term spectral features for speech analysis [31, 42, 47].

3.2.1 MFCC

MFCC are considered the most commonly used acoustic features for the task of identifying the speaker and the properties of the speech. MFCC take into account human perception sensitivity with respect to frequencies. The combination of both is best for speech identification and differentiation. The importance of MFCC is inspired by the fact that the shape of the vocal tract, which includes the tongue, teeth, throat, etc., filters the sound generated by human speakers. Accuracy in determining the shape enables easy analysis of the sound that comes out through the vocal tract. Accurate determination of the shape of the sound can help in finding the phonetic information. The task of MFCC is to accurately represent the envelope of the short-time power spectrum of the sound as it traverses the vocal tract [41, 47, 48]. Mel-Frequency Cepstral Coefficients (MFCCs) were identified as a feature that is widely applicable to automatic speech recognition and speaker identification. The correlation between the actual and heard signal frequencies can be derived efficiently by incorporating the Mel scale. Davis and Mermelstein pioneered MFCC as a sound feature in the 1980s. Ever since its discovery, MFCC has been considered an important feature for the analysis of speech signals. There are a few other features along with MFCC, like Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs), that were coined before MFCC and remained the main features for automatic speech recognition (ASR), especially with classification algorithms such as HMM [49]. In practice, 8 to 12 or 13 MFCC are considered sufficient for representing the shape of the spectrum and hence are used for speech analysis [50]. MFCC are highly preferred choices in automatic speech recognition systems [51]. Authors found that MFCC are effective for end-to-end acoustic modelling using CNN [52, 53]. MFCC is a widely used feature when considering speech modelling [54, 55]. An MFCC-based comparative study of speech recognition techniques was conducted by authors who found that MFCC with HMM gave a recognition accuracy of 85 percent, and with deep neural networks the score was 82.2 percent [56]. Computation of MFCCs includes a conversion of the Fourier coefficients to the Mel scale [57]. The Mel scale is the most popular variant used today, even though there is no theoretical reason that the Mel scale is superior to other scales [58].

3.3 Decision Tree Classifiers

There are several machine learning algorithms that can be applied for recognizing speech emotions. The algorithms can be used independently or in hybrid mode for classifying emotions. Decision trees are one family of machine learning algorithms that can be used for classification tasks [59, 60]. The decision tree uses the supervised learning approach, which works on labelled data. The data is split into train and test subsets for carrying out the classification task. The current work uses the Random Forest, KNN and XGBoost algorithms for classifying emotions. All the mentioned algorithms are variations of decision tree classifiers, and a brief description of each classifier is given below [14].

3.3.1 Random Forest

One of the most flexible and easily implementable learning algorithms in machine learning is Random Forest. The algorithm provides better solutions than basic decision trees. The random forest depends on a few parameters which, if tuned, can provide good results. The algorithm is widely used due to its simple and flexible implementation. Random Forest supports both regression and classification tasks when modelling a solution. It is a supervised learning technique that creates ensembles of decision trees, referred to as a forest, and mostly uses bagging for training [51, 52]. The importance of bagging lies in the fact that it improves the overall results. Multiple decision trees are built and combined to increase efficiency in the random forest algorithm.


Random forest generation uses the same hyperparameters that are used for a decision tree or a bagging classifier, and there is no need to combine the decision trees with a separate bagging classification algorithm. The algorithm proceeds by searching for the best feature from the available feature subset. The selected feature will then be used for splitting the node. The node split diversifies and enhances the results. The relative importance of each feature is measured during prediction. The scikit-learn tool can be used to measure feature importance, based on how much each feature reduces impurity at the tree nodes that use it, across all trees in the forest. A score for each feature is automatically computed after training. Features and observations are randomly selected by the random forest and averaged when building the several decision trees. Decision trees use rules and facts for decision making and can suffer from overfitting; random forest prevents this by creating subsets and combining them into sub-trees. The only limitation of random forest is slow computation, which is affected by the number of trees built by the random forest [61].

3.3.2 XGBoost

XGBoost [61] uses the gradient boosting technique to ensemble decision trees. XGBoost stands for "eXtreme Gradient Boosting". XGBoost is used for classification on small and medium structured and tabular data. XGBoost can be viewed as an improvement upon the base GBM framework: optimization and algorithmic techniques are used to improve the base framework of GBM. Regularization is used to enhance the performance of the algorithm by preventing data overfitting. The algorithm automatically learns the best handling of missing values depending on the training loss and handles a variety of sparsity patterns more efficiently. It also has a built-in cross-validation method at each iteration. XGBoost is a sequential tree-building algorithm implemented with parallelization. The interchangeable nature of the loops forms the basis of the building algorithm: the external loop is responsible for maintaining the tree count, and the features are calculated by the internal loop. The loops are interchangeable and thus enhance run-time performance. All the instances are globally scanned and initialized, and sorting is done using parallel threads. This switch of loops increases the algorithmic performance, and the parallelization overheads in computation are offset. The tree splitting within the GBM framework is greedy in nature with respect to stopping the split: splitting a tree at a node depends on the negative loss criterion at the point of the split. XGBoost instead grows trees up to the specified 'max_depth' parameter and prunes the trees backward. Computational performance is improved significantly by using this 'depth-first' approach [61].

3.3.3 Sklearn KNeighbors Classifier and KNN

An early description of KNN dates back to the 1950s. KNN is a labour-intensive approach for large datasets and was initially used for pattern recognition. The learning of KNN is based on comparing test data with training data such that both have similarities. A set of n attributes describes each data tuple. An n-dimensional space is used to store all the training tuples, where each of them corresponds to a point in the space. The k-nearest neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. The closest found points are referred to as the nearest neighbours, and the Euclidean distance defines the nearness of the neighbours [62].
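As a sketch (not code from the paper), the three classifier families described in this section could be instantiated as follows; the parameter values and the placeholder training data are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Illustrative settings, not the ones used in the experiments.
classifiers = {
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "XGB": XGBClassifier(n_estimators=100, max_depth=6),
}

def train_all(X_train, y_train):
    # X_train: MFCC feature vectors (one row per utterance),
    # y_train: encoded emotion labels.
    for clf in classifiers.values():
        clf.fit(X_train, y_train)
    return classifiers
```

After fitting, the impurity-based feature importances mentioned above are available on the random forest as its feature_importances_ attribute.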


The Euclidean distance between two tuples represented by $A_1 := \{a_{11}, a_{12}, \ldots, a_{1n}\}$ and $A_2 := \{a_{21}, a_{22}, \ldots, a_{2n}\}$ is obtained using the following calculation:

$$d(A_1, A_2) = \sqrt{\sum_{i=1}^{n} (a_{1i} - a_{2i})^2}. \qquad (1)$$

In the case of parametrized KNN we use, in a particular case, the following generalization of formula (1):

$$d(A_1, A_2(y)) = \sqrt{\sum_{i=1}^{n} w_i (a_{1i} - a_{2i})^2}.$$

Afterwards we choose a class y maximizing this distance. The weights depend on the neighbour's number, $w(x_i) = w(i)$; for brevity, we denote these quantities by $w_i$. In the general case we utilize realizations of classification by the formula

$$a(x) = \arg\max_{y \in Y} \sum_i [x(i) = y]\, w_i, \qquad (2)$$

where the $x(i)$ are points near the testing point, $[\cdot]$ denotes the indicator bracket, and $Y$ is the assumed set of classes of the object $y$. Thus the difference of the attribute values in $A_1$ and $A_2$ is obtained. The difference is then squared to accumulate the total distance count. Attributes with large ranges can outweigh attributes with small ranges (e.g. binary attributes). To normalize the data, we will use Z-scaling based on the mean value and standard deviation: dividing the difference between the variable and the mean by the standard deviation. In practice, the minimax scaler and Z-scaling have similar applicability and are often interchangeable. However, when calculating the distances between points or vectors, Z-scaling is used in most cases, while the minimax scaler is useful for visualization, for example to transfer the features encoding the color of a pixel into the range [0…255] [2]. Recall that Z-scaling based on the mean and standard deviation divides the difference between the variable and the mean by the standard deviation:

$$z = \frac{x - \mu}{\sigma}, \qquad (3)$$

where $\mu$ is the expected value and $\sigma$ is the standard deviation of the value. Because the K-Nearest Neighbours algorithm (hereinafter the KNN algorithm) is about the distance from a point to a class, Z-scaling is usually used for its application, as it is known that when calculating the distances between points or vectors the result of Z-scaling is in most cases much more accurate. The main advantage of Z-scaling is that it preserves the normal distribution of a random value: Z-normalization keeps a distribution normal if it was normal, and converts a non-normal distribution to a non-normal distribution. As a result of such a transformation, we obtain a value with zero mean and unit dispersion.


Normalization is applied to each attribute value to resolve this issue. The most common class among its k nearest neighbours is assigned to the unknown tuple. If the value of k equals 1, the unknown tuple is assigned the class of the training tuple that is closest to it in the pattern space. For real-valued prediction, the KNN classifier returns the average of the real-valued labels associated with the k nearest neighbours of the unknown tuple [62–64]. The min-max scaler keeps outliers, so one would have to use a robust scaler based on statistics that are robust to outliers. As an alternative normalization we propose to use Z-normalization (the Z-scaler). This normalization preserves a normal distribution.
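A minimal sketch of the Z-scaling of formula (3) and of the weighted distance used by the parametrized KNN; the function names are illustrative:

```python
import numpy as np

def z_scale(X):
    # Formula (3): subtract the column mean and divide by the column standard
    # deviation, yielding zero mean and unit dispersion per attribute.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def weighted_distance(a1, a2, w):
    # Weighted Euclidean distance from the parametrized variant of formula (1);
    # w holds one non-negative weight per attribute.
    a1, a2, w = (np.asarray(v, dtype=float) for v in (a1, a2, w))
    return np.sqrt(np.sum(w * (a1 - a2) ** 2))
```

In scikit-learn, distance-weighted voting corresponds to KNeighborsClassifier(weights='distance'); per-attribute weights such as the w_i above would instead be applied by rescaling the columns before fitting.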

4 Problem Statement

Authoritative literary sources mention MFCC as an important feature for analyzing and classifying various aspects of speech. Some of them state that only 13 MFCC features are sufficient for experimentation. There is currently no experimental validation of this statement. Moreover, there is not sufficient research on the identification of these 13 MFCC from the extracted 20 base features of the Mel scale. MFCC also has derivatives of the base features, named delta and double delta. The aim of this work is to establish experimental proof for considering only 8 to 13 MFCC from the extracted 39 MFCC features [23, 50]. The current work conducts an experimental analysis of the MFCC obtained from EMODB, a Berlin database that consists of more than 500 utterances recorded from 111 male and female speakers from various age groups.
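The extraction step itself is not specified in the paper; as an illustration only (the use of librosa and the per-file averaging are assumptions), the first 13 of 20 Mel coefficients could be obtained as follows:

```python
import librosa
import numpy as np

def first_13_mfcc(wav_path):
    # Load one utterance, compute 20 MFCCs per frame, keep M0..M12 and
    # average them over time, mirroring the dataset minimization that takes
    # the mean of the features.
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape: (20, n_frames)
    return np.mean(mfcc[:13], axis=1)                   # 13-dimensional vector
```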

5 Experimental Confirmation of Results

To reach more effectiveness we utilize the parametric KNN method. To present a number row, one has to look at the dynamics of the signal change. In long sentences, emotions are placed either at the end of the sentence or at the beginning. Therefore, when parameterizing the distance vector, it is important to set the weighting factors in such a way as to distinguish the significance of the distance at the start of the sentence and the distance at the end of the vector coordinates. When applying discriminant recognition techniques to the object, the feature vector is mapped into a five-dimensional space. The training matrix will be defined as the data matrix

$$W = \begin{pmatrix}
w_{11} & w_{12} & w_{13} & w_{14} & w_{15} & w_{16} \\
w_{21} & w_{22} & w_{23} & w_{24} & w_{25} & w_{26} \\
w_{31} & w_{32} & w_{33} & w_{34} & w_{35} & w_{36} \\
w_{41} & w_{42} & w_{43} & w_{44} & w_{45} & w_{46} \\
w_{51} & w_{52} & w_{53} & w_{54} & w_{55} & w_{56}
\end{pmatrix}$$

We highlight five characteristics: in addition to the three mentioned in the beginning, we highlight the beginning of the sentence and its end, as they are emotionally loaded. They carry not only an information load, but also an emotional one. When analyzing the beginning and the end of a sentence, we take into account how much each feature has changed relative to the average characteristics of the speaker. Therefore, forming a data vector with 5 coordinates, we substitute in the i-th coordinate the ratio $v_i = \frac{a_i}{M(a)}$ of the amplitude of the current phrase $a_i$ to the average amplitude value $M(a)$ of the person's voice. The columns of the matrix W contain the weighting factors [66, 67] that most characterize a given class of emotion, and the rows contain the features extracted from the phrase. For example, a high amplitude is most characteristic of the emotion of aggression, so the corresponding weight coefficient in the aggression column will be larger. Knowing the average values of the features of speech, we construct the matrix based on the changes in the parameters. The following simulations and experiments were performed. The flow graph in Fig. 3 shows the steps carried out for the experiment.

Fig. 3. Flowgraph of the emotion classification using decision tree classifiers.


The experiment was done on the MFCC extracted from the EMODB and SAVEE datasets (Table 4). Four subsets of the MFCC features comprising 20 cepstral constants were analyzed for feature importance. This was done to identify which 13 MFCC should be used for speech emotion analysis. The results for each subset using the different classifiers are shown in Table 5. Each subset was used to classify emotions using supervised learning algorithms (variants of decision trees). It was observed that the results obtained using 20 MFCC and using the set of the first 13 were very close to each other. There was no effective and substantial difference in the accuracy scores for classification when using the 13 and 20 MFCC subsets. An increased number of features often increases the complexity of the system, so if 13 MFCC are used instead of all 20 MFCC the results will not suffer much loss. The validation of the experiment was also done by extracting important features using PCA analysis. It was seen that most of the important features corresponded to the initial 13 MFCC extracted from the dataset. The accuracy scores of classification on EMODB were 52%, 50%, 47% and 41% using the subsets M0-M19, M0-M12, M15-M17 and M16-M19 respectively with the Random Forest classifier. KNN shows accuracy scores of 56%, 44%, 45% and 42% using the subsets M0-M19, M0-M12, M15-M17 and M16-M19 respectively. XGB showed poor performance on the original extracted dataset for the classification task. The SAVEE results, as shown in Table 6, give accuracy scores for RF of 57%, 55%, 50% and 50% on the subsets M0-M19, M0-M12, M15-M17 and M16-M19 respectively. For KNN the performance on the four subsets is 67%, 54%, 55% and 50%. XGB showed poor performance on all the subsets. The results in Table 5 and Table 6 clearly indicate that selecting M0-M12 would be a better choice of features from the MFCC dataset. It can be seen from the results that the first 13 Mel coefficients can successfully be used for working with speech instead of using all 20 features. This selection will not only optimize the results but also reduce the complexity of the model [72], thereby reducing the computation time.

Table 4. Label encoded emotions for EMODB and SAVEE

Emotion EMODB | Emotion SAVEE | Encoded label
Fear/Anxiety  | Anger         | 0
Disgust       | Disgust       | 1
Happiness     | Fear          | 2
Boredom       | Happiness     | 3
Neutral       | Sadness       | 4
Sadness       | Surprise      | 5


Thereafter the dataset was minimized and the experiments on feature importance and classification were repeated. The accuracy of the classification results increased effectively, but the set of important features still contained the features M0 to M12, that is, the initial 13 features. The reason behind this is that as the sound signal passes through the vocal tract and comes out as the utterance, there is a subsequent addition of noise to the originally generated signal. The addition of noise disturbs the energy whose logarithm is computed as the base of the MFCC extraction. The induced noise corrupts the signal at the later levels, so the original signal remains intact for use in the analysis [65, 68–71]. The features present here can be successfully used for better results compared to those extracted towards the end of the sample of each speech signal.

Table 5. Results of EMODB with subsets of MFCC

MFCC | M0-M19 | M0-M12 | M15-M17 | M16-M19
RF   | 52%    | 50%    | 47%     | 41%
KNN  | 56%    | 44%    | 45%     | 42%
XGB  | 40%    | 38%    | 28%     | 27%

Table 6. Results of SAVEE with subsets of MFCC

MFCC | M0-M19 | M0-M12 | M15-M17 | M16-M19
RF   | 57%    | 55%    | 50%     | 50%
KNN  | 67%    | 64%    | 55%     | 50%
XGB  | 30%    | 38%    | 30%     | 30%

The results in Table 5 and Table 6 show that M0-M19 and M0-M12 have nearly similar accuracy results across the classifiers. The datasets in Table 5 and Table 6 used the non-manipulated MFCC extracted from the speech utterances in EMODB and SAVEE.
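A minimal sketch of how such a subset comparison could be organized, reusing the hypothetical classifiers dictionary sketched in Sect. 3.3; the split ratio and subset indices are assumptions, and X is assumed to be a NumPy array with one row per utterance and 20 MFCC columns M0..M19:

```python
from sklearn.model_selection import train_test_split

def subset_scores(X, y, classifiers):
    # Accuracy of each classifier on each MFCC column subset, in the spirit of
    # Tables 5 and 6 (the paper's actual evaluation protocol may differ).
    subsets = {
        "M0-M19":  list(range(20)),
        "M0-M12":  list(range(13)),
        "M15-M17": [15, 16, 17],
        "M16-M19": [16, 17, 18, 19],
    }
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    scores = {}
    for label, cols in subsets.items():
        for name, clf in classifiers.items():
            clf.fit(X_tr[:, cols], y_tr)
            scores[(name, label)] = clf.score(X_te[:, cols], y_te)
    return scores
```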


For further analysis the datasets were minimized and preprocessed using min-max scaling. The important features identified for the minimized data using principal component analysis are shown in Fig. 4 and Fig. 5. The x-axis of each plot shows the various classes of emotion and the y-axis shows the MFCC features.

Fig. 4. Principal Component vs Class Distribution for EMODB.

Fig. 5. Principal Component vs Class Distribution for SAVEE.

The results of using 13 MFCC from the minimized EMODB dataset are shown in Fig. 6(a), (b), (c); the SAVEE results can be seen in Fig. 8(a), (b), (c). The plots in the mentioned figures display the precision, recall and F1-scores for the emotions in both datasets using the variants of decision trees.


Fig. 6. (a) Emotion Vs Precision Score for First 13 MFCC using KNN, Random Forest and XGB Classifiers on EMODB (b) Emotion Vs F1-Score for First 13 MFCC using KNN, Random Forest and XGB Classifiers on EMODB. (c) Emotion Vs Recall Score for First 13 MFCC KNN, Random Forest and XGB Classifiers on EMODB.


The results showed that for EMODB all three classifiers produced fairly variable results. Boredom (91%), anger (74%) and sadness (100%) had the highest precision for random forest, XGB and KNN respectively on EMODB. For SAVEE, a higher precision rate for disgust (84%) was identified using random forest. KNN and XGBoost identified sadness more precisely than the remaining six emotions, where the scores for sadness were 84%, 70% and 74% with KNN, random forest and XGBoost respectively (Fig. 7). A common conclusion obtained from the results of both datasets is that sadness was identified with the highest precision using KNN (Fig. 9). So KNN can be effective for studying the emotion of sadness in emotion analysis.

Fig. 7. Accuracy score for emotion classification using KNN, random forest and XGB classifiers on EMODB.


Fig. 8. (a) Emotion Vs Precision Score for First 13 MFCC using KNN, Random Forest and XGB on SAVEE. (b) Emotion Vs Recall Score for First 13 MFCC using KNN, Random Forest and XGB on SAVEE. (c) Emotion Vs F1 Score for First 13 MFCC KNN, Random Forest and XGB on SAVEE. (d) Emotion Vs Recall Score for 13 MFCC using Three Decision Tree Classifiers on EMODB.


Fig. 9. Accuracy score obtained with First 13 MFCC using KNN, random forest and XGB.

6 Conclusion

Speech features have always remained one of the most extensively studied topics in research. Speech and utterances contain vital information regarding the intention, emotion and psychology of the speaker. This paper studied the use of one such speech feature, MFCC, and utilized it to classify emotions using two datasets. The work also tried to establish the importance of using the first 13 MFCC when a set of 20 Mel constants can be extracted for speech based on the vocal physiology of the human mouth. Accuracy scores for emotion classification using variants of the decision tree approach have been obtained for the two datasets EMODB and SAVEE. KNN was identified as the best-performing classification algorithm common to both datasets. The scores for sadness obtained with KNN were the highest for both datasets (Fig. 7). The results of the experiments can be utilized for predicting the emotions and personality of the speaker [68]. The results can be integrated with various applications pertaining to human psychology and medical treatments.

References

1. Koolagudi, S.G., Rao, K.S.: Emotion recognition from speech: a review. Int. J. Speech Technol. 15(2), 99–117 (2012)
2. Marechal, C., et al.: Survey on AI-based multimodal methods for emotion detection. In: Kołodziej, J., González-Vélez, H. (eds.) High-Performance Modelling and Simulation for Big Data Applications. LNCS, vol. 11400, pp. 307–324. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-16272-6_11
3. Sreenivasa Rao, K., Koolagudi, S.G., Vempada, R.R.: Emotion recognition from speech using global and local prosodic features. Int. J. Speech Technol. 16(2), 143–160 (2013). https://doi.org/10.1007/s10772-012-9172-2
4. Koolagudi, S.G., Devliyal, S., Barthwal, A., Rao, K.S.: Real life emotion classification from speech using Gaussian mixture models. In: Parashar, M., Kaushik, D., Rana, O.F., Samtaney, R., Yang, Y., Zomaya, A. (eds.) Contemporary Computing. IC3 2012. Communications in Computer and Information Science, vol. 306. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32129-0_28


5. Latif, S., Rana, R., Younis, S., Qadir, J., Epps, J.: Transfer learning for improving speech emotion classification accuracy. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2018-September, pp. 257–261 (2018)
6. Lee, C.M., Narayanan, S.S.: Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 13(2), 293–303 (2005)
7. Banse, R., Scherer, K.R.: Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol. 70(3), 614–636 (1996)
8. Hozjan, V., Kačič, Z.: Context-independent multilingual emotion recognition from speech signals. Int. J. Speech Technol. 6(3), 311–320 (2003)
9. Ramakrishnan, S.: Recognition of emotion from speech: a review. In: Speech Enhancement, Modeling and Recognition – Algorithms and Applications (2012)
10. Sebe, N., Cohen, I., Huang, T.S.: Multimodal emotion recognition. In: Handbook of Pattern Recognition and Computer Vision, 3rd edn. (2005)
11. Zhang, Q., Wang, Y., Wang, L., Wang, G.: Research on speech emotion recognition in E-learning by using neural networks method. In: 2007 IEEE International Conference on Control and Automation, ICCA (2007)
12. Jing, S., Mao, X., Chen, L.: Prominence features: effective emotional features for speech emotion recognition. Digit. Sig. Proc. Rev. J. 72, 216–231 (2018)
13. Albornoz, E.M., Milone, D.H., Rufiner, H.L.: Spoken emotion recognition using hierarchical classifiers. Comput. Speech Lang. 25(3), 556–570 (2011)
14. Özseven, A., Düğenci, T., Durmuşoğlu, M.: A content analysis of the research approaches in speech emotion. Int. J. Eng. Sci. Res. Technol. 7, 1–27 (2018)
15. Kishore, K.K., Satish, P.K.: Emotion recognition in speech using MFCC and wavelet features. In: Proceedings of the 2013 3rd IEEE International Advance Computing Conference, IACC 2013 (2013)
16. Yousefpour, A., Ibrahim, R., Hamed, H.N.A.: Ordinal-based and frequency-based integration of feature selection methods for sentiment analysis. Expert Syst. Appl. 75, 80–93 (2017). https://doi.org/10.1016/j.eswa.2017.01.009
17. Shu, L., et al.: A review of emotion recognition using physiological signals. Sensors (Switz.) 18(7), 2074 (2018)
18. Oosterwijk, S., Lindquist, K.A., Anderson, E., Dautoff, R., Moriguchi, Y., Barrett, L.F.: States of mind: emotions, body feelings, and thoughts share distributed neural networks. Neuroimage 62(3), 2110–2128 (2012). https://doi.org/10.1016/j.neuroimage.2012.05.079
19. Pessoa, L.: Emotion and cognition and the amygdala: from “what is it?” to “what’s to be done?” Neuropsychologia 48(12), 3416–3429 (2010). https://doi.org/10.1016/j.neuropsychologia.2010.06.038
20. Koolagudi, S.G., Sreenivasa Rao, K.: Emotion recognition from speech: a review. Int. J. Speech Technol. 15(2), 99–117 (2012). https://doi.org/10.1007/s10772-011-9125-1
21. Winkielman, P., Niedenthal, P., Wielgosz, J., Eelen, J., Kavanagh, L.C.: Embodiment of cognition and emotion. In: Mikulincer, M., Shaver, P.R., Borgida, E., Bargh, J.A. (eds.) APA Handbook of Personality and Social Psychology, Volume 1: Attitudes and Social Cognition, pp. 151–175. American Psychological Association, Washington (2015). https://doi.org/10.1037/14341-004
22. Fernández-Caballero, A., et al.: Smart environment architecture for emotion detection and regulation. J. Biomed. Inform. 64, 55–73 (2016). https://doi.org/10.1016/j.jbi.2016.09.015
23. Guan, H., Liu, Z., Wang, L., Dang, J., Yu, R.: Speech emotion recognition considering local dynamic features. In: Fang, Q., Dang, J., Perrier, P., Wei, J., Wang, L., Yan, N. (eds.) ISSP 2017. LNCS (LNAI), vol. 10733, pp. 14–23. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00126-1_2


24. Cen, L., Wu, F., Yu, Z.L., Hu, F.: A real-time speech emotion recognition system and its application in online learning. In: Emotions, Technology, Design, and Learning. Elsevier, Amsterdam (2016)
25. Shuman, V., Scherer, K.R.: Emotions, psychological structure of. In: International Encyclopedia of the Social & Behavioral Sciences, 2nd edn. Elsevier, Amsterdam (2015)
26. Ekman, P.: Basic emotions. In: Handbook of Cognition and Emotion. Wiley, Hoboken (2005)
27. Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D.H.J., Hawk, S.T., van Knippenberg, A.: Presentation and validation of the Radboud Faces Database. Cognition and Emotion (2010)
28. Ekman, P.: Facial expression and emotion. American Psychologist (1993)
29. Bourke, C., Douglas, K., Porter, R.: Processing of facial emotion expression in major depression: a review. Aust. N. Z. J. Psychiatry 44(8), 681–696 (2010)
30. Van den Stock, J., Righart, R., De Gelder, B.: Body expressions influence recognition of emotions in the face and voice. Emotion 7(3), 487–494 (2007). https://doi.org/10.1037/1528-3542.7.3.487
31. Banse, R., Scherer, K.R.: Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol. 70(3), 614–636 (1996). https://doi.org/10.1037/0022-3514.70.3.614
32. Gulzar, T., Singh, A., Sharma, S.: Comparative analysis of LPCC, MFCC and BFCC for the recognition of Hindi words using artificial neural networks. Int. J. Comput. Appl. 101(12), 22–27 (2014)
33. Shrawankar, U., Thakare, V.M.: Techniques for feature extraction in speech recognition system: a comparative study (2013)
34. Haamer, R.E., Rusadze, E., Lsi, I., Ahmed, T., Escalera, S., Anbarjafari, G.: Review on emotion recognition databases. In: Human-Robot Interaction – Theory and Application (2018)
35. Lalitha, S., Geyasruti, D., Narayanan, R., Shravani, M.: Emotion detection using MFCC and cepstrum features. Procedia Comput. Sci. 70, 29–35 (2015)
36. Jackson, P., Haq, S.: Surrey Audio-Visual Expressed Emotion (SAVEE) database. University of Surrey, Guildford, UK (2014)
37. Liu, Z.T., Xie, Q.M., Wu, W.H., Cao, Y., Mei, Y., Mao, J.W.: Speech emotion recognition based on an improved brain emotion learning model. Neurocomputing 309, 145–156 (2018)
38. Ekman, P., et al.: Universals and cultural differences in the judgments of facial expressions of emotion. J. Pers. Soc. Psychol. 53(4), 712–717 (1987). https://doi.org/10.1037/0022-3514.53.4.712
39. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 39–58 (2009)
40. Koduru, A., Valiveti, H.B., Budati, A.K.: Feature extraction algorithms to improve the speech emotion recognition rate. Int. J. Speech Technol. 23(1), 45–55 (2020). https://doi.org/10.1007/s10772-020-09672-4
41. Kumar, K., Kim, C., Stern, R.M.: Delta-spectral cepstral coefficients for robust speech recognition. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings (2011)
42. Tiwari, V.: MFCC and its applications in speaker recognition. Int. J. Emerg. Technol. 1, 19–22 (2010)
43. Dave, N.: Feature extraction methods LPC, PLP and MFCC in speech recognition. Int. J. Adv. Res. Eng. Technol. 1, 1–6 (2013)
44. Yankayi, M.: Feature extraction Mel frequency cepstral coefficients (MFCC), pp. 1–6 (2016)
45. Ananthakrishnan, S., Narayanan, S.S.: Automatic prosodic event detection using acoustic, lexical, and syntactic evidence. IEEE Trans. Audio Speech Lang. Process. 16, 216–228 (2008)
46. Kinnunen, T., Li, H.: An overview of text-independent speaker recognition: from features to supervectors. Speech Commun. 52(1), 12–40 (2010). https://doi.org/10.1016/j.specom.2009.08.009


47. Wang, W.Y., Biadsy, F., Rosenberg, A., Hirschberg, J.: Automatic detection of speaker state: lexical, prosodic, and phonetic approaches to level-of-interest and intoxication classification. Comput. Speech Lang. 27, 168–189 (2013)
48. Lyons, J.: Mel frequency cepstral coefficient. Practical Cryptography (2014)
49. Palo, H.K., Chandra, M., Mohanty, M.N.: Recognition of human speech emotion using variants of Mel-frequency cepstral coefficients. In: Konkani, A., Bera, R., Paul, S. (eds.) Advances in Systems, Control and Automation. LNEE, vol. 442, pp. 491–498. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-4762-6_47
50. Yazici, M., Basurra, S., Gaber, M.: Edge machine learning: enabling smart internet of things applications. Big Data Cogn. Comput. 2(3), 26 (2018). https://doi.org/10.3390/bdcc2030026
51. Wang, X., Dong, Y., Hakkinen, J., Viikki, O.: Noise robust Chinese speech recognition using feature vector normalization and higher-order cepstral coefficients (2002)
52. Davis, S.B., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In: Readings in Speech Recognition (1990)
53. Palaz, D., Magimai-Doss, M., Collobert, R.: End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition. Speech Commun. 108, 15–32 (2019). https://doi.org/10.1016/j.specom.2019.01.004
54. Passricha, V., Aggarwal, R.K.: A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR. J. Ambient Intell. Humanized Comput. 11, 675–691 (2020)
55. Vimala, C., Radha, V.: Suitable feature extraction and speech recognition technique for isolated Tamil spoken words. Int. J. Comput. Sci. Inf. Technol. 11, 675–691 (2014)
56. Dalmiya, C.P., Dharun, V.S., Rajesh, K.P.: An efficient method for Tamil speech recognition using MFCC and DTW for mobile applications. In: 2013 IEEE Conference on Information and Communication Technologies, ICT 2013 (2013)
57. NithyaKalyani, A., Jothilakshmi, S.: Speech summarization for Tamil language. In: Intelligent Speech Signal Processing (2019)
58. Stevens, S.S., Volkmann, J., Newman, E.B.: A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am. 8, 208 (1937)
59. Mitrović, D., Zeppelzauer, M., Breiteneder, C.: Features for content-based audio retrieval (2010)
60. Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: ACM International Conference Proceeding Series (2006)
61. Kotsiantis, S.B.: Supervised machine learning: a review of classification techniques. Informatica (Ljubljana) (2007)
62. Luckner, M., Topolski, B., Mazurek, M.: Application of XGBoost algorithm in fingerprinting localisation task. In: Saeed, K., Homenda, W., Chaki, R. (eds.) Computer Information Systems and Industrial Management. CISIM 2017. Lecture Notes in Computer Science, vol. 10244. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59105-6_57
63. Sutton, O.: Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction (2012)
64. Deng, Z., Zhu, X., Cheng, D., Zong, M., Zhang, S.: Efficient kNN classification algorithm for big data. Neurocomputing 195, 143–148 (2016). https://doi.org/10.1016/j.neucom.2015.08.112
65. Okfalisa, I., Gazalba, I., Reza, N.G.I.: Comparative analysis of k-nearest neighbor and modified k-nearest neighbor algorithm for data classification. In: Proceedings – 2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering, ICITISEE 2017 (2018)
66. Skuratovskii, R.V.: The timer compression of data and information. In: Proceedings of the 2020 IEEE 3rd International Conference on Data Stream Mining and Processing, DSMP 2020, pp. 455–459 (2020)


67. Skuratovskii, R.V.: Employment of minimal generating sets and structure of Sylow 2-subgroups alternating groups in block ciphers. In: Bhatia, S., Tiwari, S., Mishra, K., Trivedi, M. (eds.) Advances in Computer Communication and Computational Sciences. Advances in Intelligent Systems and Computing, vol. 759. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-0341-8_32
68. Romanenko, Y.O.: Place and role of communication in public policy. Actual Probl. Econ. 176(2), 25–26 (2016)
69. Skuratovskii, R.V., Williams, A.: Irreducible bases and subgroups of a wreath product in applying to diffeomorphism groups acting on the Möbius band. Rend. Circ. Mat. Palermo Ser. 2, 1–19 (2020). https://doi.org/10.1007/s12215-020-00514-5
70. Skuratovskii, R.V.: A method for fast timer coding of texts. Cybern. Syst. Anal. 49(1), 133–138 (2013)
71. Skuratovskii, R., Osadchyy, V., Osadchyy, Y.: The timer incremental compression of data and information. WSEAS Trans. Math. 19, 398–406 (2020)
72. Drozd, Y., Skuratovskii, R.V.: Generators and relations for wreath products. Ukr. Math. J. 60(7), 1168–1171 (2008)
73. Zgurovsky, M.Z., Pankratova, N.D.: System Analysis: Theory and Applications, p. 446. Springer Verlag, Berlin (2007)

A Smart City Hub Based on 5G to Revitalize Assets of the Electrical Infrastructure

Santiago Gil1(B), Germán D. Zapata-Madrigal1, and Rodolfo García Sierra2

1 Grupo T&T, Facultad de Minas, Universidad Nacional de Colombia, Medellín, Colombia
{sagilar,gdzapata}@unal.edu.co
2 Head of Innovation Observatory, Enel-Codensa S.A. E.S.P. (Enel Group), Bogotá, Colombia
[email protected]

Abstract. The electrical infrastructure has certain assets that have been underused, commonly referred to as dead assets, such as light, distribution, and transformer poles. This infrastructure can be exploited with short-range 5G base stations, a technology expected to arrive soon, as a strategy to implement new 5G IoT networks and new smart city services. This project proposes the development of a smart city Hub that uses the dead electrical infrastructure to offer technological services based on 5G. The Hub 5G is intended to integrate applications from different domains and industries and, at the same time, use open data sources to expose integrated services to the community where the Hub is located. The Hub provides a smart digital ecosystem for connected people, data, things, and services, a requirement for today's solutions. This initiative enables the application of technologies for society and the reinforcement of collaborative strategies between different industries and between the public and private sectors.

Keywords: IoT · 5G · Integrated services · Smart city

1 Introduction

The 5G panorama generates high expectations of new applications, scenarios, capabilities, and business models, since it is an interoperable technology across all domains and will further drive the digital transformation of all industrial and economic sectors. Additionally, like the IP protocol, cellular technologies and 5G networks are convergent technologies on which many other technologies and applications, such as IoT, data analysis, and Big Data, will be based [1]. The connectivity ecosystem between things, devices, machines, humans, and other entities will generate many business opportunities and strategies, and these will be highly dependent on private and secure technology [2]. Those opportunities are oriented to both businesses and consumers; the number of devices connected to the network will grow enormously, the market associated with 5G could grow by more than 2.2 trillion dollars in the period 2020–2034, and companies are expected to see higher incomes (about 40%) five years after the start of 5G. The keys to 5G will be 1) transmission


speed, 2) the number of connected devices, which 4G cannot supply completely, and 3) non-public networks over 5G; this will make it possible to improve manufacturing processes in factories, for example by increasing the number of automated robots or vehicles. Other business models based on data may be exploited, and data may be monetized as an asset. On the other hand, consumers will have faster access, better connectivity, and a better disposition of connected resources to be informed and to monitor and control what is happening around them and their assets. Some of the most benefited sectors are automotive and mobility, media and content (the gaming industry), the public sector and smart cities, health, manufacturing, and utilities [3].

The digital transformation, which is extremely necessary for smart-grid, smart-city, and smart-industry applications, requires significant development of wireless communication networks and software resources. The concept of the industrial revolution on which research institutions and companies have focused is framed under the name of Industry 4.0, an industry based on digital assets and the integration of physical components with virtual components and software [4]. It highlights the importance of the virtualization of data and the implementation of virtual environments that link physical processes and data with an interoperability strategy, which are at the core of this project. It is also necessary to consider the coexistence of other technologies with 5G and its support of IoT applications for hybrid and interoperable solutions, since the complete ecosystem and the integration of technologies are what consolidate the convergence of the IoT and its exploitation across all sectors and industries. Moreover, IoT technologies act as enablers of one another [5]. It is also important to analyze aspects such as requirements, challenges, and research road maps in this area in order to consolidate maturity and exploit the full potential of 5G technology and its supporting technologies, which may then become the basis for new inventions and developments in the future.

This project proposes a 5G-based smart Hub for smart city services that works along with the available electrical infrastructure to extend the services and portfolio of energy service providers through the revitalization of unused assets such as light and distribution poles. Several applications in smart cities and utilities can be implemented from this solution, as presented here: energy price data, mobility, video surveillance, energy consumption, and other potential applications such as vehicle-to-grid integration and recommendations to improve efficiency in utility consumption. The approach can act as an enabler for the integration of people, industries, vehicles, and commerce with the smart grid, creating an intelligent and interconnected ecosystem for multi-domain applications.

The remainder of this paper is as follows: Section 2 reviews the state of the art. Section 3 presents the design of the solution. Section 4 shows the configuration of the system and the deployment scenario. Section 5 presents the implementation of the system. Finally, Sect. 6 concludes the results and presents some research directions.

2 State of the Art

5G is regarded as an opportunity in the technology sector to drive the digital transformation of all industries, with significant social and economic impacts and a global perspective of benefits for all regions and people [6]. 5G became a reality from


a concept, although it is only in its early stages [7]. For this, multiple requirements must be addressed, such as wireless communications, energy efficiency of communications, robustness and capacity of networks, business models, and data governance, among others. That is why 5G is being developed as a standard by organizations and alliances, not just through individual inventions of single companies.

The review of Li et al. [1] highlights current research, key technologies, and research trends and challenges for 5G-IoT. This review covers the heterogeneous technologies required for the development of the IoT, such as LoRa, Wi-Fi, ZigBee, and Bluetooth, among others. The closest implementations to 5G-based IoT were NB-IoT and LTE-M, at the launch of the 3GPP Low Power Wide Area technologies [8]. Now 5G is committed to massifying IoT connections and consolidating concepts such as the Industrial IoT or smart cities with real-time applications. As presented in [9], telecommunication technologies have advanced and have been tailored for use in electrical systems. That paper presents various applications and projects that have been carried out for smart metering, control and monitoring of networks, energy management in homes and buildings, management of electrical grids, integration and monitoring of renewable energy sources, monitoring and management of low- and medium-voltage networks, and other projects involving information and communication technologies (ICT), among which the "EPIC-HUB" project is mentioned, a GSM-based system for monitoring the energy consumed and the energy produced.

The approach of Karpenko et al. in [10] also proposes the use of interoperable IoT ecosystems for the integration of electrical systems. To do this, they have focused on standardized and interoperable communication protocols and their integration with an IoT ecosystem where things, cloud systems, and IoT users interact. The application case they propose is a parking and charging system for intelligent electric vehicles, considering the integration of the whole ecosystem; that is, including the associated platforms, the mobile application, and the semantic integration of the data with the platform and users. Some of the challenges and research directions presented for the development of smart cities are, first of all, the consolidation of emerging technologies for social welfare, of which Big Data, Machine Learning, the IoT, and data analysis are key. This is followed by the extension of applications to different domains to improve energy efficiency, traffic management, parking, mobility as a service, and security [11]. In order to integrate applications from different domains, multi-domain and interoperable services are required. The IoT has begun to be widely used for these types of needs, through interoperable services based on IoT technologies and cloud computing, which allow the promotion of new offerings for smart cities [12]. In the article presented in [13], different smart city services are grouped by technological clusters: content management systems, network technologies, data storage and distributed systems, business analytics and intelligence, emerging technologies, and innovation for the smart city. Users' perceptions of the different services proposed by each cluster were then gathered to identify the most and least favorable (adequate) services.
Another proposal to integrate information technology services, in this case for the transport infrastructure, was carried out in [14], where a City-Hub was used to improve the services, fluidity, comfort, and energy efficiency of the transport network infrastructure of smart cities.


The review by Minoli and Occhiogrosso [15] discusses practical aspects of 5G implementation for IoT applications in smart cities. In their analysis, the authors place special emphasis on the coexistence of 5G with other IoT technologies such as NB-IoT, 4G, or LoRa. They also highlight the need for small radio cells (micro/pico antennas) in 5G IoT applications and the importance of signal penetration, and expose the feasibility of implementing microcells on urban poles (light poles) for 5G signal coverage in smart cities. Among the resources that benefit from 5G IoT are smart buildings, transport systems, smart grids, traffic control, pollution monitoring, street lighting management, surveillance, crowdsensing, logistics, and smart services, among others.

In the work of Santos et al. [16], a fog computing framework for the autonomous orchestration of functionalities and services in 5G smart cities is proposed. The framework aims to improve computing and network performance and provide functionality for data monitoring and analysis. This work refers to other architectures or frameworks for M2M (Machine-to-Machine) communication orchestration and other network technologies, such as ETSI OneM2M, the OpenFog Consortium, ETSI MEC (Mobile Edge Computing), ETSI NFV (Network Function Virtualization) NANO, OMA (Open Mobile Alliance) Lightweight M2M, TOSCA (Topology Orchestration Specification for Cloud Applications), and Cloudlet. Additionally, open source tools are mentioned for NFV (OpenStack, OpenBaton, OpenMano), for SDN (Software-Defined Networking) (OpenDayLight, ONOS, Ryu), and for M2M (OMA LwM2M, Leshan, 5G-EmPOWER). The Fog computing-based management and orchestration framework consists of three layers: a sensor layer, a Fog layer, and a Cloud layer. The sensor layer is composed of sensors and actuators and connects to the Fog layer via LPWAN (Low Power Wide Area Network) gateways with Fog nodes. Each Fog node is an autonomous system that manages computational resources, and these nodes communicate with Cloud nodes, which are responsible for global management and control operations in the network. The Fog and Cloud nodes can communicate with one another through a Fog P2P protocol, which allows the exchange of information and improves the decision-making process related to the provisioning of 5G smart city resources. Just as the previous article [16] refers to standards such as ETSI for the definition of its architectures in communication networks for 5G, it is necessary to consider other proposals, such as the Industrial Internet Reference Architecture (IIRA) of the IIoT (Industrial IoT) Consortium [17] and the Reference Architectural Model Industrie 4.0 (RAMI) of the Plattform Industrie 4.0 [18].

The work of Marques et al. [19] proposes an IoT architecture for waste management in smart cities as a strategy to address the challenge of growing urban populations in the coming years. The proposed architecture considers both indoor and outdoor implementation in order to automatically separate waste correctly. The MQTT, CoAP, and HTTPS protocols were implemented as communication/application protocols on the platform. Thus, the platform incorporates Cyber-Physical Systems (CPS), advanced data communication systems, and embedded intelligence to deal with the challenges of smart cities. Additionally, it is composed of four layers: the physical objects layer, the communication layer, the Cloud platform layer, and the services layer. Cheng et al. [20] proposed an architecture for the Industrial IoT based on 5G for smart manufacturing applications. An analysis is also made of the eMBB (enhanced Mobile


Broadband), mMTC (Massive Machine Type Communication) or MIoT (Massive IoT), and URLLC (Ultra Reliable Low Latency Communication) scenarios, and of the technologies and challenges for 5G-based IIoT. The architecture was designed to include various application scenarios in smart manufacturing, including real-time data acquisition from heterogeneous sources in manufacturing plants, collaborative network manufacturing, human-machine interaction, Automated Guided Vehicle (AGV) collaboration, convergence of digital twins between the physical and cyber layers, and product design, manufacturing, and maintenance based on virtual and augmented reality. The architecture also includes advanced digital technologies such as big data, cloud computing, edge computing, digital twins, virtual/augmented reality, and service-oriented manufacturing. The considerations in the architecture to comply with service-oriented manufacturing are highlighted as: 1) large-scale intelligent interconnection of heterogeneous equipment, 2) high reliability, high transmission rate, and low latency for monitoring and control, 3) edge computing-based IIoT using SDN technology, 4) 3D and multimedia assisted interaction, and 5) low cost and low power consumption in network transmission. Rahimi et al. [21] also proposed an IoT architecture based on 5G and other advanced digital technologies for the integration of applications, services, and generated data. The technologies incorporated in this architecture are nanochips, millimeter wave (mmWave), heterogeneous networks (HetNet), device-to-device (D2D) communication, NFV, SDN, Advanced Spectrum Sharing and Interference Management (Advanced SSIM), mobile edge/cloud computing, data analytics, and Big Data. The architecture consists of eight layers: the physical device layer, the communication layer, the Edge (Fog) computing layer, the data storage layer, the service management layer, the application layer, the collaboration and processes layer, and the security layer.

The IoTEP platform proposed in [22] is another example of open IoT-based systems for energy data management and analysis. This application places greater emphasis on the energy data analysis layer and its application in buildings, although it has a structural design suitable for integration with other information systems to support external applications. A similar proposal is presented in [23], where emphasis is placed on communications, in this case LPWAN technologies, for the implementation of IoT platforms for energy systems. Although this last proposal does not consider open data, it proposes an architecture based on communication technologies and interoperable application protocols to guarantee the integration of data with external applications, and it highlights the importance of communication technologies in this type of project. In the previous two platforms and in [24], the IoT platforms have several associated applications such as smart metering, distributed energy resources (DER), and demand response management.

Other common initiatives in the framework of smart cities refer to data projects propagated through Hubs, which are platform-type devices provided with sufficient communication protocols for the integration of data and services in the IoT ecosystem: devices, machines, people, applications, among others. The proposal presented in [25] introduces the concept of the City Hub, an IoT Hub that acts as a portal for infrastructure and for other Hubs.
On such a device, data from multiple verticals can be integrated: open data, transport, traffic, air quality, energy, etc. The City Hub is based on a platform-as-a-service (PaaS) framework, which can provide services to citizens, to local companies that offer


other added services, and even to large or external companies that want to offer new services to citizens. For this, the Hub consists of an architecture that associates the physical devices and the available data with the cloud, with a high level of interoperability and accessibility, to ensure the connectivity of multiple sources of information and to allow the different actors involved in the ecosystem to interact with it. The IoT Hub proposed in [26] also considers integration with consolidated IoT platforms available in the cloud, which facilitate the process of integrating data from IoT systems and, furthermore, guarantee the highly scalable conditions that are important in IoT solutions. The data associated with this Hub was fed through multiple sources of information, such as energy and gas meters, current meters, a weather station, a solar generation plant, an air quality station, and others available at the ABB Krakow Research Center, Poland. In the case of [27], a semantic data hub was proposed to improve data integration in machine-to-machine (M2M) communications in domain-independent applications. Other semantic tools for data interoperability and integration can be implemented on data hubs, as proposed in [14]. In the review carried out in [28], 23 technological projects for smart city services are referenced, grouped as IoT and cloud computing projects. The groups are IoT, Cloud Computing & Big Data; Cloud Computing & Big Data; Cloud Computing; and Cloud Computing & Cyber-Physical Systems. Some of the projects referenced in this review are SmartSantander, Padova Smart City, the European Platform for Intelligent Cities (EPIC) project, Clout, OpenMTC, OpenIoT, Concinnity, Sentilo, Scallop4SC (SCALable LOgging Platform for Smart City), CiDAP, WindyGrid, SMARTY, U-City, and Civitas. These projects are open source in nature and addressed to the benefit of citizens.

3 Design of the Solution

3.1 Hub 5G Architecture

Taking previous works as a reference, it is very important to consider modular communication architectures structured by layers, which include the necessary components of the solution: service integration, IoT, people, applications, local (Edge) computing, and cloud computing, among other considerations. The communication architecture proposes the coexistence of wireless IoT technologies, the integration of multiple communication protocols, and the provision of services to devices of different natures: touch-screen interfaces for people on site, mobile apps, web applications, and interfaces between machines, devices, and sensors, among others. Given the approach towards 5G intended in the solution, and the Hub's application to the revitalization of assets and infrastructure of the electrical system, such as light, distribution, and transformer poles, it is possible to consider a network infrastructure composed of 5G microcells with pico or micro antennas, as considered in [15]. This strategy focuses on generating value from existing infrastructure and revitalizing assets to provide more services, beyond the functionalities they were originally oriented to. Additionally, the core of the application, which will be supported by 5G technology, must provide enough interoperability, both horizontally between wireless communication devices and technologies, and vertically to satisfy the interaction between different


domains regarding smart cities and their integration with other sectors such as smart grids and transportation systems.

Fig. 1. Communication architecture of the hub 5G.

The architecture is presented in Fig. 1. The proposed architecture is made up of six layers. In the application layer, users can access all the information and services that the Hub provides through the available devices or tablets on the poles where the Hub 5G is located. In the IoT layer, the connection between the IoT network and the Hub 5G is established; the idea is to connect cameras for artificial vision; air quality and meteorology sensors; energy, gas, and water meters; and other IoT-enabled devices that allow data to be acquired and services to be offered to the community. The Edge layer is where the 5G data Hub is located. This layer is in charge of integrating all the layers in the system and then offering the services as a portal of services and benefits for people and things. The 5G data Hub is in charge of processing lightweight data, thus lightening the data transmission to the cloud and decreasing response times by taking advantage of edge computing. The Cloud layer offers integration with other platforms that allow storing, processing, and managing system resources and data at scale. This layer is essential because of the nature of offering smart city services, where it is extremely important to have applications that third parties can provide; the system is also prone to working with big data, so opting for heavy processing in the cloud is much more efficient and faster, while light processing can be done in the Edge layer. The services layer is responsible for orchestrating all the functionalities that people and things will be able to access through the Hub 5G and its integrations, including citizen security services through video surveillance, information on public transport and mobility, emergencies and catastrophes, suggestions based on geographic


location, such as restaurants and shopping centers, weather information, health, and utility consumption. Finally, the communication layer is transparent to all the layers of the system and provides communication capabilities through IoT technologies and protocols to all components of the network. A greater emphasis is placed on Low Power Wide Area Networks (LPWANs), especially cellular LPWAN technologies, to be used within the system, since they are intended to coexist with 5G. Along the same lines, it is planned to consolidate a robust core based on IoT technologies and protocols that will be ready for the arrival of 5G in smart city applications.

3.2 Data Integration Model

Controlling the scenarios of data integration in the Hub 5G is extremely important for both things and people; that is, the interaction with the environment, actors, actions, and perceptions. Therefore, it is necessary to consider a model that includes all the possible interactions that the Hub can have. Moreover, it is required to establish the model of the application through which the services are going to be offered to users and things.

Fig. 2. Data integration model of the hub 5G.

Figure 2 shows the data integration architectural model, which considers how the links between the different actors of the system will be implemented. This model presents the protocols and technologies to be used in each of the communication stages and how the link is implemented for data storage at the Edge (little, specific information) and in the Cloud (storage of all events and data generated in the system). The communication between devices is intended to use Wireless Local Area Networks with Wi-Fi (IEEE 802.11) and private 5G networks (non-public networks) according to Release 16 of 3GPP, which will allow such implementations. Several protocols are provided so that different types of devices can be integrated into the Hub 5G ecosystem. The communication between the Hub 5G (Edge) and the Cloud will be done by means of 5G through the


integration of MQTT and HTTP services. The external services are handled using HTTP Application Programming Interfaces (APIs). Finally, the integration of the mobile application and end users is via WLAN (Wi-Fi), Short Message Service (SMS), or email over the 5G interfaces. HTTP is used as the main protocol for the integration of the services of the Hub 5G, the mobile application, and the external services.
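A minimal Python sketch of this integration pattern, publishing an edge-processed reading to the cloud over MQTT and consuming an external service over HTTP; the broker, topic, and external URL are placeholders, not endpoints defined by the project:

import json
import time
import requests
import paho.mqtt.publish as publish

BROKER = "cloud.example.org"                      # hypothetical cloud MQTT broker
TOPIC = "hub5g/pole-001/air-quality"              # hypothetical topic naming
EXTERNAL_API = "https://api.example.org/weather"  # hypothetical external HTTP service

def push_reading_to_cloud(reading):
    # Lightweight, edge-processed data goes to the cloud over MQTT
    publish.single(TOPIC, json.dumps(reading), hostname=BROKER, port=1883)

def fetch_external_service():
    # External services are integrated through plain HTTP APIs
    response = requests.get(EXTERNAL_API, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    push_reading_to_cloud({"pm25": 12.4, "timestamp": time.time()})
    print(fetch_external_service())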

4 Configuration of the System

The configuration of the Hub 5G environment is carried out using a Raspberry Pi 3 or a Coral Edge TPU; both are similar in terms of performance and interfacing options, the former with more options to be adapted to several peripherals, and the latter with better performance for processing streaming video while recognizing. The better option is to work with both boards to achieve better performance. In the case that image recognition is carried out outside a TPU, the idea is to deal only with the Raspberry Pi 3, because its capacities in terms of RAM and CPU are better than those of the Coral Edge. The strategy is to let the Raspberry Pi 3 orchestrate all the services and peripherals and let the Coral Edge TPU only infer on streaming video, making a powerful Hub 5G ecosystem. The integration between both boards is seamless because both are based on Debian distributions; the Raspberry Pi runs Raspbian OS 10 and the Coral Edge TPU runs Mendel OS 4.0 (Day). Regarding the physical configuration, the Hub 5G system depends on four components: the core (Raspberry Pi and/or Coral Edge TPU); the camera, which is intended to run inside an aerial vehicle (drone) but currently runs using a Pi or USB camera; the cellular modem, to enable the 4.5G and later the 5G interface when available; and finally, a smart tablet attached to the Hub 5G to allow users to access smart city services at light poles.

4.1 Mobile Network Initialization

The Hub 5G communication interface is achieved via the cellular NB-IoT, LTE-M, and 4G modem U-Blox SARA-R410M. The communication is achieved with two options: native cellular AT commands with native use of communication protocols, and a dial-up manager to establish an internet connection with the modem. Both options meet the communication requirements of the Hub 5G, but the dial-up option is more flexible, as it allows a transparent connection to the internet while using as many protocols over IP as needed. Again, the best option is to use a combination of both alternatives, sometimes using dial-up and sometimes using AT commands. The SARA-R410 modem offers a variety of options to send/receive messages using IoT communication alternatives. For example, the most common, SMS, is carried out through the AT+CMGS command; the HTTP REST protocol is achieved through a combination of the AT+UHTTP, AT+UHTTPC, and AT+URDFILE commands, available for GET, POST, PUT, and DELETE requests; and the MQTT publish/subscribe protocol is achieved through a combination of the AT+UMQTT and AT+UMQTTC commands.
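A sketch of how the back end can drive the modem with native AT commands over a serial port using pyserial; the serial device, recipient number, and message are illustrative, and the AT+CMGF/AT+CMGS sequence follows the standard 3GPP SMS text-mode flow rather than project-specific code:

import time
import serial

def send_at(port, command, wait=1.0):
    # Write an AT command and return whatever the modem answered
    port.write((command + "\r").encode())
    time.sleep(wait)
    return port.read(port.in_waiting or 1).decode(errors="ignore")

with serial.Serial("/dev/ttyUSB2", 115200, timeout=1) as modem:  # illustrative device
    print(send_at(modem, "AT"))          # liveness check
    print(send_at(modem, "AT+CMGF=1"))   # SMS text mode
    modem.write(b'AT+CMGS="+573001234567"\r')                    # hypothetical recipient
    time.sleep(1)
    modem.write(b"Hub 5G: service notification" + bytes([26]))   # Ctrl-Z ends the SMS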


The dial-up configuration is done by means of the "wvdial" Debian service. It acts as a network manager for the cellular module and provides a network interface and the IP address. The required parameters to establish the dial-up connection are as follows:

[Dialer Defaults]
Modem = /dev/ttyUSB1
Modem Type = Analog Modem
Baud = 115200
ISDN = 0
Stupid mode = 1
Idle Seconds = 0
Init1 = ATZ
Init2 = ATQ0 V1 E1 S0=0
Init3 = AT+CGDCONT=1,"IP","MOBILE_OPERATOR_APN"
Phone = *99***1#
Username = "USER"
Password = "PASSWORD"
New PPPD = yes
Dial Command = ATD

After the wvdial service is initialized, a ppp interface is assigned to the Hub 5G with an IP address provided by the network operator.

4.2 Application Initialization

Once all the components connected to the Hub 5G are ready, the Hub application servers are launched. There are two server applications: 1) the application for face recognition with browser streaming output, and 2) the back-end application where data is processed and services are orchestrated for the Hub 5G and the smart city services. Additionally, there is a client application that runs on a mobile device, a tablet located along with the Hub 5G infrastructure on a light pole.

Face Recognition Application
The face recognition application is done by means of a pre-trained face detection model with an own database of faces selected for testing. It is encouraged to include faces of the community in the database in order to know whether an unknown person is in the surroundings and to keep track of the surveillance of people who are near the Hub 5G. The face recognition model is based on the FaceNet model and the Histogram of Oriented Gradients (HOG) method to make it lighter and able to run on a Raspberry Pi. In contrast, Convolutional Neural Network (CNN) methods are too heavy to run on the Raspberry Pi because of its processing capabilities.

Hub 5G Back-End Application
The back-end application is developed using the Python Django framework. It manages the resources of the Hub to provide data to the mobile application, Internet access, and native AT commands, and it also manages the available data in the system to perform data analysis. The native AT commands are managed from this application through serial communication between the Hub 5G and the U-Blox cellular modem. This application is also in charge of accessing the open government datasets to map the smart city services of the Hub with updated open data.
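As a rough stand-in for the enroll-and-compare flow of the face recognition application described above, the widely used face_recognition library can be sketched as follows; it relies on dlib's HOG detector and a ResNet encoder rather than FaceNet, and the file names are illustrative:

import face_recognition

# Enroll a known community member (illustrative file name)
known_image = face_recognition.load_image_file("faces_db/resident_01.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]

# A frame captured from the Pi/USB camera or the drone stream
frame = face_recognition.load_image_file("frame.jpg")
locations = face_recognition.face_locations(frame, model="hog")  # HOG keeps it light for the Raspberry Pi
encodings = face_recognition.face_encodings(frame, locations)

for encoding in encodings:
    match = face_recognition.compare_faces([known_encoding], encoding, tolerance=0.6)[0]
    print("known resident" if match else "unknown person in the surroundings")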


This initiative takes advantage of the following open data portals (all of them in Spanish, the official language of the country):

– MEData, of the Medellín City Government (http://medata.gov.co/).
– Colombia ESRI Open Data (http://datosabiertos.esri.co/).
– Open Data of the Metropolitan Area of Medellín (https://datosabiertos.metropol.gov.co/).
– Colombia Open Data (https://www.datos.gov.co/).
– The SIATA portal, weather management of the Metropolitan Area of Medellín (https://siata.gov.co/siata_nuevo/).

Additionally, the application requests an open API that returns a 5-minute energy price feed, which serves as a base for scheduling energy demand based on an hourly tariff. The API is available at https://hourlypricing.comed.com/api?type=5minutefeed. All these features have been implemented with the purpose of providing smart city services for transport, utilities, health, safety, weather, and leisure. An additional feature covered in this application, regarding the Coronavirus pandemic, was the incorporation of daily updated data on the virus in the country. It offers an additional service which, at this precise moment, becomes essential as a way of providing information on the current situation, helping the community to be aware of self-care actions.

Mobile Application
The mobile application is developed natively for Android with Android Studio. It is intended to offer the visual component of the Hub 5G services. The mobile application is oriented as the graphical user interface of the Hub 5G; from it, the resources, data, and requests of the application are accessed in order to provide smart city services in a simple and technological way to the community. The mobile app is in the application layer of the Hub 5G architecture and orchestrates the services of the services layer. The mobile application has at least one activity per offered service, i.e., one or more activities for video surveillance, transportation, emergencies, weather, utility data, and geographically based services. This application takes advantage of the available open data platforms, as mentioned above, to implement some of the stated services. To accomplish such features, some activities implement WebView and Google Maps activities to satisfy the requirements of some services, for example, the mobility service.

4.3 Deployment Scenario

When the massification of 5G is available for seamless deployment in cities, the desired implementation scenario is expected to cover, with several 5G Hubs, the locations where the energy service providers have dead electrical infrastructure such as transformer, light, or distribution poles. Each Hub will be provided with a pico-antenna to cover at least 100 m2 around it. This is a strategy not only to offer smart city services to the community, but also to extend the service portfolio of the energy service providers from


the already deployed electrical infrastructure. This way, the energy service providers can 1) offer the electricity service, 2) offer value-added information to interested third parties, 3) provide welfare to their customers, and 4) establish strategic alliances with technology enterprises, such as network service providers, water service providers, public and private transport, etc. The proposed scenario is currently not very viable, because of how the 2G, 3G, and 4G antennas are designed, but with the arrival of the different types of 5G cells it will become more feasible; for example, using Femto (max. 32 users) or Pico (max. 128 users) cells distributed along the distribution towers in the city, a powerful integrated smart city scenario can be achieved. For now, 5G is just in its first phases, so it is necessary to work with the available technology. That is why this project centers the communication interface on a cellular modem with NB-IoT and LTE-M, which are technologies that are expected to work along with 5G and are scalable to future 5G releases in terms of IoT connectivity. The Hub 5G currently operates as an end user in a cellular network, using the U-Blox modems, but at a further stage it is expected to change these end nodes (modems) to Femto or Pico antennas when they become available in the local market. This exposes a strategy to be completely prepared for the arrival of 5G and the provision of technological services based on 5G for smart cities and energy service providers.

Fig. 3. Hub 5G system expected scenario.

The Hubs are intended to be interconnected with one another, each one with a drone port and several drones streaming video to perform the video surveillance of the location automatically. In this way, a collaborative and interconnected system of Hubs will be achieved to provide smart city services over several locations, as shown in Fig. 3. For now, this project carries out the prototyping phase of this technological development.


5 Implementation of the Hub

The Hub 5G prototype is intended to be assembled onto a light pole at Universidad Nacional de Colombia, Campus Robledo (Facultad de Minas). This prototype implementation allows the analysis for a subsequent stage of mass deployment in smart cities. The procedure of deployment and verification of the Hub is as follows: 1) the hardware and communication components are integrated and then initialized; 2) the software which manages the drivers, services, and communication interfaces is initialized; 3) the back-end application for accessing the services is launched; 4) the mobile application is initialized; and 5) all the peripherals of the system are checked.

5.1 Hardware and Communication

The Hub is initialized by powering up the Raspberry Pi and peripherals, after which the communication protocols begin to work. The background services are handled using the crontab service. The main module dealing with communication via cellular technology is the U-Blox SARA-R410M module; after its initialization and that of the corresponding service, it should return a ppp interface similar to the following:

ppp0: flags=4305  mtu 1500
      inet 10.140.164.4  netmask 255.255.255.255  destination 10.64.64.64
      ppp  txqueuelen 3  (Point-to-Point Protocol)
      RX packets 5  bytes 62 (62.0 B)
      RX errors 0  dropped 0  overruns 0  frame 0
      TX packets 6  bytes 101 (101.0 B)
      TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Once the cellular interface is ready, a bridge between the wireless Wi-Fi interface and the cellular module is configured to set up the WAP in the Hub. Additional protocols or interfaces, such as LoRaWAN, could be handled with several approaches. If it is decided to include the Hub 5G as a LoRaWAN end node, this is achieved by implementing a serial interface between the Hub 5G (Raspberry Pi) and a Microchip RN2903 LoRa module embedded in a LoRa 2 Click MikroE module. It can be done through the Raspberry Pi native pins or using an FTDI USB-to-serial converter. Another approach is to go further up, to the application layer of the OSI model, and integrate the LoRaWAN protocol via MQTT. This enables the cooperation between protocols and, at the same time, the tracking of LoRaWAN data from the Hub. The integration via MQTT using the SARA module can be done using native AT commands, or using Mosquitto or any MQTT library directly from the application with the ppp interface. The native AT commands for integrating MQTT are as follows:


AT+UMQTT=0,"111" # identifier AT+UMQTT=1,1883 ### port AT+UMQTT=2,"serverDNS" # server #AT+UMQTT=3,"IP",1883 # IP with port (alternative) AT+UMQTT=4,"user","pass" # user / password AT+UMQTT=10,36000 # timeout #AT+UMQTT=11,1,2 # secure option #AT+UMQTT=12,1 # clean session AT+UMQTTC=1 # login AT+UMQTTC=2,0,0,"topic/subtopic","message" # publish to a topic AT+UMQTTC=4,0,"application/+/device/+/rx" # subscribe to a topic AT+UMQTTC=6 # read message # Example response +UMQTTC: 6,1 OK +UUMQTTCM: 6,1 Topic:application/1/device/d7d9aaebd65709ce/rx Msg: LoRaWAN message codified as JSON object.

Similarly, when using the Python alternative, this is done through the paho-mqtt library and its methods. The implementation of other communication protocols such as OPC UA and CoAP can be seamlessly integrated into the Hub via Python modules when necessary. Good options to implement these protocols are, for example, Python-OPCUA (FreeOPCUA) and CoAPthon.

5.2 Integration Between Back End and Mobile Application

Once the configuration of the Hub 5G is done, it is time to go on with the applications. As said before, the software components of the ecosystem consist of the main back-end application, which remains on the Hub controller itself, the integration of external applications and services on the cloud, and the mobile application, which acts as the graphical user interface. At this time, no cloud applications have been developed on our own, so they cannot be administrated for specific purposes, but this is expected to be done soon. The back-end application is designed using a Model-View-Controller pattern, in which most of the views are oriented to the integration with the mobile application; this means that the back-end application has views without any HTML layout, because they only transfer data or specific images to the mobile application. The back-end application is also in charge of mapping the external applications or services to the mobile application and to the services provided by the Hub. As stated before, this initiative intends to reuse, as much as possible, open web-based platforms which are employed for the benefit of the region and the community. The resources of the application are organized in a strategic way according to the services it is intended to offer. Other resources are passed directly from their public source to the mobile application, such as the weather service, which is available at https://siata.gov.co/siata_nuevo/. The updated mobility in the area is acquired using the setTrafficEnabled method on the Google Maps activity. Table 1 provides an overview of the resources provided by the back-end application. Once the resources are available from the back-end application in the local network, they are accessed in the mobile application using HTTP requests with Volley or integrated directly with WebViews.
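A minimal sketch of how one such HTML-less view could look in Django, returning plain JSON that the mobile application consumes with Volley; the module layout, tariff values, and route name are hypothetical and only illustrate the pattern, with Table 1 below listing the actual resources:

# views.py (hypothetical sketch)
from django.http import JsonResponse

WATER_TARIFF = {          # illustrative values per socioeconomic stratum
    "stratum_1": 1450.0,
    "stratum_2": 1800.0,
    "stratum_3": 2300.0,
}

def water_tariff(request):
    # No HTML layout: the view only transfers data to the mobile application
    return JsonResponse({"unit": "COP/m3", "tariff": WATER_TARIFF})

# urls.py (hypothetical sketch)
from django.urls import path
urlpatterns = [
    path("utilities/watertariff", water_tariff, name="watertariff"),
]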


Table 1. Resources provided by the hub 5G back-end application.

Resource URI

Resource type

Description

Health/coronavirus-department

API

An API which maps the data of the departments (states) of Colombia with updated new confirmed and total cases

Health/coronavirus-department-view

Map View

A resource which maps the departments (states) coronavirus data into a Folium map to be displayed on the web and on the mobile application via Android Studio WebView

Health/coronavirus-city-view

Map View

A resource which maps the cities coronavirus data into a Folium map to be displayed on the web and on the mobile application via Android Studio WebView

Health/coronavirus-dailycases

API

An API which maps the daily data of coronavirus in Colombia with current and new cases, deaths, and recovered people

Health/coronavirus-dailycases-view

Graph View

A resource which maps the daily cases API to charts to be displayed on the web and on the mobile app using Plotly

Transport/publictransportroutes

API

An API which maps open data of public bus and bus-metro integration routes in Medellín. This data is then mapped into a Google Map View on the mobile application

Transport/bikeroutes

API

An API which maps open data of available bike routes in Medellín. This data is then mapped into a Google Map View on the mobile application

Utilities/energypriceAPI

API

An API which maps the Comed open API for 5-min feed (https://hourlypri cing.comed.com/api?type=5minut efeed)

Utilities/watertariff

API

An API which maps open water tariff data of the local water service provider (continued)

A Smart City Hub Based on 5G

1025

Table 1. (continued) Resource URI

Resource type

Description

Utilities/watertariffview

Graph View

A resource which maps the water tariff API to charts to be displayed on the web and on the mobile application

Utilities/energytariff

API

An API which maps open energy tariff data of the local energy service provider

Utilities/energytariffview

Graph View

A resource which maps the energy tariff API to charts to be displayed on the web and on the mobile application

Utilities/gastariff

API

An API which maps open gas tariff data of the local gas service provider

Utilities/gastariffview

Graph View

A resource which maps the gas tariff API to charts to be displayed on the web and on the mobile application

Emergencies/emergencyAPI

API

An API to receive requests from the mobile application for different types of emergencies
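For the map-view resources in Table 1, the back end renders a Folium map and hands the resulting HTML to the app's WebView. The sketch below illustrates one way such a view could be built; the department names, case counts, and coordinates are placeholders, not the actual data source used by the Hub.

```python
# Sketch of a Folium-based map view; data and coordinates are placeholders.
import folium

cases_by_department = {"Antioquia": 1200, "Cundinamarca": 950}     # hypothetical counts
coordinates = {"Antioquia": (6.25, -75.56), "Cundinamarca": (4.71, -74.07)}

def coronavirus_department_view():
    m = folium.Map(location=[4.57, -74.30], zoom_start=6)   # centred on Colombia
    for dept, count in cases_by_department.items():
        lat, lon = coordinates[dept]
        folium.CircleMarker(
            location=[lat, lon],
            radius=max(5, count / 100),
            popup=f"{dept}: {count} confirmed cases",
        ).add_to(m)
    # Return the rendered HTML so the Android WebView can display it directly
    return m.get_root().render()
```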

5.3 Integrated Ecosystem

The resources are then available to be accessed from the Android application on a smartphone or tablet, which will be located on a light pole together with the Hub for people to use. The app has a main view with buttons for testing, for services, and for configuration. The main functionality is in the services menu, which comprises several services for mobility, utilities, health, and so on. Some of the services have submenus to access different data or functionalities. When entering each final view of the application, there is an informative visualization providing useful data to the people who navigate the app. This is a strategy to bring people closer to the processes involved in energy and electrical systems and to make them more confident with digitization. All these components make the community a pillar in the transformation of smart cities and smart grids, as stated by Lu in his framework of interoperability of Industry 4.0 [4]. Some of the services provided by the application are as follows:

– Services which show information about the tariffs of the different utilities for different hours or months. These services are oriented to inform people how to use the utilities consciously in order to reduce utility billing. Figure 4 shows an overview of these services in the mobile application. The energy price API view uses a 5-min feed for 24 h; the water and gas tariff views use the tariff data of the local water/gas service provider in Medellín, broken down by socioeconomic stratum; and the energy tariff view, which is not shown in the figure, uses the tariff data of the local energy service provider, broken down by hourly tariffs.


Fig. 4. Utility services provided by the Hub 5G. On the left: the energy price API view. On the right, above: the water tariff view; below: the gas tariff view.

– Services which show information about the updated traffic nearby, the available bike routes in the city, and the available bus and bus-to-metro public transport routes. These services are oriented to inform people about local mobility in order to promote sustainable mobility and the use of bikes and public transport. Figure 5 shows an overview of these services in the mobile application. On the left, the updated mobility view uses Google services to plot, on a green-to-red scale, how much traffic there is on a route. In the middle, the bike routes view plots the available bike routes in the city to contribute to safe biking. On the right, the public transport routes view plots the available bus transport routes in the city to promote the use of public transport instead of private vehicles.

– Services which show information about the updated coronavirus situation in Colombia. These services are oriented to inform people about this critical situation, broken down by department (state) and city, in order to promote more care regarding the virus. Figure 6 shows an overview of these services in the mobile application. The coronavirus department map view plots the current state of every department in Colombia on a yellow-to-dark-red scale according to the criticality of each region. It also shows the specific location of the Hub with a blue marker. The coronavirus daily update view shows the new confirmed, recovered, and deceased cases every day up to the last entry, which is updated by the government daily.


Fig. 5. Mobility services provided by the Hub 5G mobile app. From left to right: updated mobility, bike routes, and public transport routes.

– Services which provide support in case of risky situations nearby. These services are oriented to allow people to report emergencies caused by any abnormal situation nearby, such as fire, earthquake, landslide, or river overflow. They provide a video surveillance feature to track the people who visit the place and determine whether they belong to the community or are unknown; with the proper database, it can identify criminals walking around the zone. Additionally, the emergency view offers a form to fill in and submit in case of any emergency such as fire, earthquake, or any other emergency nearby. Additional views include the weather view, which points to the responsive site of SIATA, a public agency which administrates weather and meteorological measurements in Medellín and its metropolitan area. In addition, the geographical-based service uses the Nearby Search query of the Google Places API to map and mark leisure and entertainment places close to the Hub, such as restaurants, malls, and cinemas.


Fig. 6. Coronavirus service provided by the Hub 5G mobile app. Above: confirmed cases broken down by department (state). Below: daily update with new cases.

6 Conclusions and Future Work

This work addressed an initiative to revitalize passive electrical infrastructure, such as light and transformer poles, by implementing a technological approach that converges smart city services, taking the electrical industry as a main pillar of the digital transformation of cities and society. With the deployment of data-based electrical systems and their integration with other domains and industries, this approach becomes very convenient for energy service providers to offer new technological services and to expand their service portfolio into areas beyond merely providing energy, using already deployed infrastructure. It can also lead to cooperation between companies from different domains to improve local market income through collaborative performance and strategic alliances in the 5G IoT ecosystem. A good example would be cooperation between the energy service provider of a city, the network service operator, and the early alert (emergencies) department: the energy provider supplies the infrastructure to deploy a 5G micro-cell network composed of several cells, the network operator provides the cellular base stations, and the early alert department remains alert to any emergency or catastrophe within the cellular network and can respond to it faster. Along the same lines, the energy provider would be able to link the smart meters covered by the network, the network operator would offer its regular cellular service over this network, and the early alert department could increase the amount of emergency/catastrophe data available to implement predictive models.

This contribution presents a first adoption of 5G technology for short-range base stations while considering its massification to deploy a long-range 5G IoT network. It is important to consider the arrival of 5G as a reality in the years to come; being prepared to work with this technology therefore becomes essential to implement new technological services for the new concept of cellular communication. It is also relevant to highlight how novel technologies, applied and oriented correctly to the population, can lead to a better understanding of processes, news, and the current situation. The technological initiatives of this era of Industry 4.0, smart cities, smart grids, cyber-physical systems, and so on should consider the incorporation of the social factor as a main point when stating their purposes; in this way, a convergent digital ecosystem for the benefit of society, the environment, and the economy can be achieved.

In the next steps of this project, it is expected to cover some components that were not included this time. To mention some of the future steps: first, combining the video surveillance with a drone video streaming network to automate face recognition and alerting in the place. Another is the integration of the Hub 5G approach with smart homes, which would enable active interaction between homes' utility measurements, emergencies, billing, and even strategic marketing when appropriate. A similar approach would be the integration of the network of Hubs with vehicles, especially electric vehicles, as a way to implement vehicle-to-grid and grid-to-vehicle communication and then implement strategies for managing energy demand and propose energy-efficient plans for electric vehicles. Finally, the safety-related services, i.e., video surveillance, emergencies, and even health, could be integrated with a public agency to improve the response time and the data for handling such critical situations.

References
1. Li, S., Xu, L.D., Zhao, S.: 5G internet of things: a survey. J. Ind. Inf. Integr. 10, 1–9 (2018). https://doi.org/10.1016/j.jii.2018.01.005
2. Palattella, M.R., et al.: Internet of things in the 5G era: enabling technologies and business models. IEEE J. Sel. Areas Commun. 34, 510–527 (2016)
3. GSMA: Internet of Things in the 5G Era - Opportunities and Benefits for Enterprises and Consumers, p. 26 (2019)
4. Lu, Y.: Industry 4.0: a survey on technologies, applications and open research issues. J. Ind. Inf. Integr. 6, 1–10 (2017). https://doi.org/10.1016/j.jii.2017.04.005
5. Barakabitze, A.A., Ahmad, A., Mijumbi, R., Hines, A.: 5G network slicing using SDN and NFV: a survey of taxonomy, architectures and future challenges. Comput. Netw. 167, 106984 (2020). https://doi.org/10.1016/j.comnet.2019.106984
6. Díaz de Terán, L.M.: 5G, la oportunidad que Europa no puede dejar escapar. https://www.abc.es/tecnologia/informatica/soluciones/abci-luis-manuel-diaz-teran-oportunidad-europa-no-puede-dejar-escapar-202003050129_noticia.html. Accessed 06 March 2020
7. Pye, A.: 5G from concept to reality. https://www.environmentalengineering.org.uk/news/5g-from-concept-to-reality-2590/. Accessed 06 March 2020
8. GSMA: 3GPP Low Power Wide Area Technologies (LPWA). GSMA White Paper, pp. 19–29 (2016)


9. Andreadou, N., Guardiola, M.O., Fulli, G.: Telecommunication technologies for smart grid projects with focus on smart metering applications. Energies 9(5), 375 (2016). https://doi.org/10.3390/en9050375
10. Karpenko, A., et al.: Data exchange interoperability in IoT ecosystem for smart parking and EV charging. Sensors (Switzerland) 18 (2018). https://doi.org/10.3390/s18124404
11. Nambiar, R., Shroff, R., Handy, S.: Smart cities: challenges and opportunities. In: 2018 10th International Conference on Communication Systems & Networks (COMSNETS), pp. 243–250 (2018). https://doi.org/10.1109/COMSNETS.2018.8328204
12. Taherkordi, A., Zahid, F., Verginadis, Y., Horn, G.: Future cloud systems design: challenges and research directions. IEEE Access 6, 74120–74150 (2018). https://doi.org/10.1109/ACCESS.2018.2883149
13. Lytras, M.D., Visvizi, A., Sarirete, A.: Clustering smart city services: perceptions, expectations, responses. Sustain. 11, 1–19 (2019). https://doi.org/10.3390/su11061669
14. Heddebaut, O., Di Ciommo, F.: City-hubs for smarter cities. The case of Lille "EuraFlandres" interchange. Eur. Transp. Res. Rev. 10(1), 1–14 (2017). https://doi.org/10.1007/s12544-017-0283-3
15. Minoli, D., Occhiogrosso, B.: Practical aspects for the integration of 5G networks and IoT applications in smart cities environments. Wirel. Commun. Mob. Comput. 2019, 30 (2019). https://doi.org/10.1155/2019/5710834
16. Santos, J., Wauters, T., Volckaert, B., de Turck, F.: Fog computing: enabling the management and orchestration of smart city applications in 5G networks. Entropy 20, 4 (2018). https://doi.org/10.3390/e20010004
17. Lin, S.-W., et al.: The industrial internet of things volume G1: reference architecture. Industrial Internet Consortium 1.80, pp. 1–7 (2017)
18. Plattform Industrie 4.0: Reference Architectural Model Industrie 4.0 (RAMI 4.0) - An Introduction (2016)
19. Marques, P., et al.: An IoT-based smart cities infrastructure architecture applied to a waste management scenario. Ad Hoc Netw. 87, 200–208 (2019). https://doi.org/10.1016/j.adhoc.2018.12.009
20. Cheng, J., Chen, W., Tao, F., Lin, C.L.: Industrial IoT in 5G environment towards smart manufacturing. J. Ind. Inf. Integr. 10, 10–19 (2018). https://doi.org/10.1016/j.jii.2018.04.001
21. Rahimi, H., Zibaeenejad, A., Safavi, A.A.: A novel IoT architecture based on 5G-IoT and next generation technologies. In: 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 81–88 (2019). https://doi.org/10.1109/IEMCON.2018.8614777
22. Terroso-Saenz, F., González-Vidal, A., Ramallo-González, A.P., Skarmeta, A.F.: An open IoT platform for the management and analysis of energy data. Futur. Gener. Comput. Syst. 92, 1066–1079 (2019). https://doi.org/10.1016/j.future.2017.08.046
23. Song, Y., Lin, J., Tang, M., Dong, S.: An internet of energy things based on wireless LPWAN. Engineering 3, 460–466 (2017). https://doi.org/10.1016/J.ENG.2017.04.011
24. Kuzlu, M., Rahman, M.M., Pipattanasomporn, M., Rahman, S.: Internet-based communication platform for residential DR programmes. IET Netw. 6, 25–31 (2017). https://doi.org/10.1049/iet-net.2016.0040
25. Lea, R., Blackstock, M.: City hub: a cloud-based IoT platform for smart cities. In: Proceedings of the International Conference on Cloud Computing Technology and Science (CloudCom), pp. 799–804 (2015). https://doi.org/10.1109/CloudCom.2014.65
26. Bajer, M.: Building an IoT data hub with Elasticsearch, Logstash and Kibana. In: Proceedings of the 2017 5th International Conference on Future Internet of Things and Cloud Workshops (W-FiCloud 2017), pp. 63–68 (2017). https://doi.org/10.1109/FiCloudW.2017.101


27. Gyrard, A., Bonnet, C., Boudaoud, K.: Enrich machine-to-machine data with semantic web technologies for cross-domain applications. In: 2014 IEEE World Forum on Internet of Things (WF-IoT 2014), pp. 559–564 (2014). https://doi.org/10.1109/WF-IoT.2014.6803229
28. Santana, E.F.Z., Chaves, A.P., Gerosa, M.A., Kon, F., Milojicic, D.S.: Software platforms for smart cities: concepts, requirements, challenges, and a unified reference architecture. ACM Comput. Surv. (CSUR) 50(6), 1–37 (2016)

LoRa RSSI Based Outdoor Localization in an Urban Area Using Random Neural Networks

Winfred Ingabire1,2(B), Hadi Larijani1, and Ryan M. Gibson1

1 School of Engineering and Built Environment, Glasgow Caledonian University, Glasgow, UK
{Winfred.Ingabire,H.Larijani,Ryan.Gibson}@gcu.ac.uk
2 Department of Electrical and Electronics Engineering, University of Rwanda-College of Science and Technology, Kigali, Rwanda

Abstract. The concept of the Internet of Things (IoT) has led to the interconnection of a significant number of devices and has impacted several applications in smart cities' development. Localization is widely done using the Global Positioning System (GPS). However, in large-scale wireless sensor networks, GPS is limited by its high power consumption and the additional hardware cost required. An energy-efficient localization system for wireless sensor nodes, especially in outdoor urban environments, is a research challenge with limited investigation. In this paper, an energy-efficient end device localization model based on the LoRa Received Signal Strength Indicator (RSSI) is developed using Random Neural Networks (RNN). Various RNN architectures are used to evaluate the proposed model's performance by applying different learning rates on real RSSI LoRa measurements collected in the urban area of Glasgow City. The proposed model is used to predict the 2D Cartesian position coordinates with a minimum mean localization error of 0.39 m.

Keywords: IoT · LoRaWAN · RSSI · Localization · RNN

1 Introduction

IoT connected devices are projected to increase exponentially [1], and this is expected to trigger different smart applications in various domains such as healthcare, agriculture, and transportation, among others. Low Power Wide Area Networks (LPWAN) such as Long-Range Wide-Area Networks (LoRaWAN) are potential reliable low-power connectivity solutions for large-scale IoT networks [2]. Moreover, location-aware applications such as Remote Health Monitoring (RHM) are among the promising smart applications of LoRaWAN [3]. Additionally, an accurate localization system for sensor nodes is crucial, particularly in applications where locating end devices is critical. A typical outdoor localization method in LoRa networks is the Time Difference of Arrival (TDoA) technique. However, this method performs well in open environments and poorly in harsh urban areas. In comparison, WiFi and Bluetooth RSSI fingerprint-based techniques have been successfully applied for indoor localization [4], while end device fingerprint localization based on LoRaWAN RSSI characteristics is a research topic yet to be thoroughly investigated, both in indoor and outdoor environments. Furthermore, RNN has recently been used to create several resilient models with significant results [23]. Currently, no one has reported RNN algorithms applied in end device localization models for IoT wireless sensor nodes.

In this paper, we use real test LoRaWAN RSSI data collected in the urban area of Glasgow City to develop a localization model based on LoRaWAN RSSI characteristics using RNN. End device localization is vital in location-aware IoT applications whereby a sensor node needs to be accurately located for emergency or maintenance services. RSSI fingerprint localization methods involve mapping RSSI values to corresponding X, Y Cartesian 2D position coordinates [31]. Moreover, RSSI fingerprint localization approaches have been proposed for accurate localization models in the literature using GNSS, Bluetooth, and WiFi wireless technologies and successfully implemented, but all with significant drawbacks and limitations. In addition, the following challenges are among the issues limiting the performance of existing RSSI fingerprint techniques for effective end device localization:

1. Fingerprint data maps are affected by changing environments.
2. Strong fingerprint databases are needed.
3. Manpower is needed for creating fingerprint databases.
4. Non-linearity between end nodes and target gateways.
5. High power consumption and cost.
6. Complicated infrastructure layout.

Studies using LPWAN for sensor node localization are present in the literature; however, they are frequently limited to specific settings such as a small area or fixed end nodes. Furthermore, to the best of our knowledge, no other work has used RNN for LoRaWAN-based node localization models. Therefore, this work aims at using RNN algorithms to develop a resilient novel system to improve location prediction accuracy compared to existing RSSI fingerprint approaches. The main contributions of this paper are summarized as:

– Development of novel RNN based models for accurate LoRa end device localization.
– Real test data sets collected in Glasgow's urban city are used to train and test the developed localization model with a different number of hidden neurons.
– The performance of the developed RNN based localization model is evaluated, and improved accuracy is achieved by training and testing various RNN network architectures using different learning rates.

This paper is organized as follows: Sect. 2 introduces LoRa and LoRaWAN. Section 3 discusses the related work. Section 4 gives the details of the methodology and the procedures used. Section 5 presents the obtained results with the performance analysis of the developed model. Finally, Sect. 6 gives conclusions and future work directions.

2 LoRa and LoRaWAN

The invention of low-power wide-area network technologies like LoRaWAN, along with other potential network enablers for dense IoT networks, came as a fitting solution for large-scale IoT smart applications. LoRa is a physical layer using the Chirp Spread Spectrum (CSS) modulation technique, operated by Semtech within the license-free spectrum, whereas LoRaWAN is the connection protocol stack of the wide-area network of the LoRa architecture on the MAC layer [5]. A LoRaWAN network is made up of LoRa end devices connected in a star topology that send information to one or multiple LoRaWAN gateways. The gateways forward the received message to a network server along with the unique metadata recorded for each message, as shown in Fig. 1. This metadata enables localization in LoRaWAN networks, whereby gateways serve as anchor points to determine position coordinates. TDoA algorithms use the timestamps at which different gateways receive the same message, whereas the most critical metric for fingerprinting is RSSI. Also, the more gateways that can receive the same message, the more accurate a localization algorithm is in either of the methods [7]. Furthermore, the Spreading Factor (SF) is a critical LoRa parameter that affects localization accuracy, whereby a lower SF limits the transmission range of LoRa devices; consequently, fewer gateways receive the same message [6].

3 Related Work

This section presents an overview of existing RSSI fingerprint localization techniques, Random Neural Networks, and the Gradient Descent Algorithm (GDA).

3.1 Overview of RSSI Fingerprint Localization Methods

Different algorithms using RSSI fingerprinting localization exist in the literature using various wireless technologies. However, existing studies have mostly investigated indoor environments, owing to the large amount of data required for the training phase. Furthermore, limited studies are available for localization analysis in outdoor environments, mainly due to the effort involved in collecting data. WiFi has been used widely in fingerprint localization [8–10] and [11], whereby a smartphone can record its RSSI and estimate its location using the web. The authors make a comparative performance analysis of different wireless technologies based on RSSI localization in [12]. Research studies using LoRaWAN in fingerprint localization are also available in the literature [13–16]. Moreover, Choi, W. et al. in [17] used LoRa for positioning using three fingerprint algorithms and confirmed LoRa to be effective, with an average accuracy of 28.8 m across the three algorithms. Aernouts, M. et al. in [18] presented fingerprint localization data sets for LoRaWAN and Sigfox in large urban and rural areas with a mean estimation error of 398.40 m. Similarly, the authors in [19–21] also investigated LoRa RSSI based fingerprint localization algorithms with significant estimations.

Fig. 1. LoRaWAN network architecture [6]

3.2 Random Neural Networks

RNN has recently been used to create resilient models with significant results: applications in Heating, Ventilation, and Air Conditioning (HVAC) systems by Javed, A. et al. in [22–24]; energy prediction of non-occupied buildings by Ahmad, J. et al. in [25]; communication systems, image classification, processing, and pattern recognition by Simonyan, K. and Zisserman in [34]; whereas intrusion detection systems were developed and analyzed by Ul-Haq Qureshi, A. et al. in [26–28]. Nonetheless, no one has reported RNN algorithms applied in end device localization models (to the best of our knowledge).

RNN is a unique class of ANN developed by Gelenbe [29], consisting of N directed multiple layers of connected neurons that exchange information signals as impulses, using a positive potential (+1) for excitation and a negative one (-1) for inhibition signals to the receiving neuron. The potential of each neuron i at time t is represented by a non-negative integer K_i(t). The neuron i is in an excited state if K_i(t) > 0 and it is in an idle state if K_i(t) = 0. If neuron i is excited, it forwards an impulse signal to another neuron j at the Poisson rate r_i. The forwarded signal can reach neuron j as an impulse signal in excitation or inhibition with probabilities p^{+}(i, j) or p^{-}(i, j), respectively. Furthermore, the transmitted signal can leave the network with a probability, mathematically defined in [25] by the following formulas:

c(i) + \sum_{j=1}^{N} \left[ p^{+}(i,j) + p^{-}(i,j) \right] = 1, \quad \forall i,   (1)

w^{+}(i,j) = r_i \, p^{+}(i,j) \geq 0,   (2)

and equally

w^{-}(i,j) = r_i \, p^{-}(i,j) \geq 0.   (3)

Combining Eqs. 1, 2 and 3:

r(i) = (1 - c(i))^{-1} \sum_{j=1}^{N} \left[ w^{+}(i,j) + w^{-}(i,j) \right]   (4)

The transmission rate between neurons in Eq. 4 is r(i), defined through the sum \sum_{j=1}^{N} [w^{+}(i,j) + w^{-}(i,j)]. While w determines the matrices of weight updates from neurons, it is always positive as it is a product of transmission rates and probabilities. In RNN based models, if a signal arrives at neuron i with a positive potential, it is denoted by the Poisson rate \Lambda(i), whereas a signal with a negative potential arrives at the Poisson rate \lambda(i). Therefore, for every node i the output activation function of that neuron is given by:

q(i) = \frac{\lambda^{+}(i)}{r(i) + \lambda^{-}(i)},   (5)

whereby

\lambda^{+}(i) = \sum_{j=1}^{n} q(j) \, r(j) \, p^{+}(j,i) + \Lambda(i),   (6)

with

\lambda^{-}(i) = \sum_{j=1}^{n} q(j) \, r(j) \, p^{-}(j,i) + \lambda(i).   (7)

More details about RNN are in [25].
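To make the steady-state Eqs. (5)-(7) concrete, the short NumPy sketch below iterates them to a fixed point for a small random network. It is only an illustration with placeholder rates and weights; the localization model described later in this paper was implemented in MATLAB.

```python
# Illustrative NumPy sketch of the RNN steady-state equations (5)-(7):
# iterate q(i) = lambda_plus(i) / (r(i) + lambda_minus(i)) to a fixed point.
# All rates and weights below are small random placeholders.
import numpy as np

rng = np.random.default_rng(0)
N = 5                                    # number of neurons (placeholder)
W_plus = rng.uniform(0, 0.2, (N, N))     # excitatory weights w+(j, i)
W_minus = rng.uniform(0, 0.2, (N, N))    # inhibitory weights w-(j, i)
Lambda = rng.uniform(0, 0.5, N)          # external excitatory arrival rates
lam = rng.uniform(0, 0.5, N)             # external inhibitory arrival rates
r = W_plus.sum(axis=1) + W_minus.sum(axis=1)   # firing rates, Eq. (4) with c(i) = 0

q = np.zeros(N)
for _ in range(200):                     # fixed-point iteration
    lam_plus = q @ W_plus + Lambda       # Eq. (6): sum_j q(j) r(j) p+(j, i) + Lambda(i)
    lam_minus = q @ W_minus + lam        # Eq. (7)
    q = np.clip(lam_plus / (r + lam_minus), 0, 1)   # Eq. (5), bounded for the sketch

print(np.round(q, 3))
```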

3.3 Gradient Descent Algorithm

Gradient Descent (GD) is a first-order iterative optimization algorithm widely used for training by different researchers. It is used to minimize the cost function, whereby the error cost function is given by:

E_p = \frac{1}{2} \sum_{i=1}^{n} \gamma_i \left( q_i^p - y_i^p \right)^2, \quad \gamma_i \geq 0   (8)

whereby \gamma_i \in (0, 1) represents the state of output neuron i, q_i^p is a real differentiable function giving the predicted output value, and y_i^p is the desired output value. As per Eq. 8, to find the local minima and reduce the value of the error cost function, the relation between neurons y and z is considered, whereby the weights w^{+}(y, z) and w^{-}(y, z) are updated by:

w_{y,z}^{+t} = w_{y,z}^{+(t-1)} - \eta \sum_{i=1}^{n} \gamma_i \left( q_i^p - y_i^p \right) \left[ \partial q_i / \partial w_{y,z}^{+} \right]^{t-1},   (9)

and also:

w_{y,z}^{-t} = w_{y,z}^{-(t-1)} - \eta \sum_{i=1}^{n} \gamma_i \left( q_i^p - y_i^p \right) \left[ \partial q_i / \partial w_{y,z}^{-} \right]^{t-1}.   (10)

The proposed RNN-LoRa RSSI based localization model is trained using GD, and the calculated weights and biases are updated to the neurons as the algorithm computes the error. More details about GD are found in [25].

4 Methodology

This section gives details about all the procedures used in data collection and developing our RNN-LoRa RSSI based localization model.

4.1 Real Test Measurements

Real test LoRa RSSI datasets were collected in Glasgow City for the training and testing of our developed RNN based end device localization models. Three LoRa SX1301-enabled Kerlink gateways were used to receive the same messages from a MultiTech Systems LoRaWAN mDot end device controlled by a Raspberry Pi. The three gateways were mounted at 30 m on the George More building at Glasgow Caledonian University, at 27 m on the James Weir building at the University of Strathclyde, and at 25 m on top of the Skypark building, respectively. The LoRa mote was used to collect data and transmit it to the three gateways simultaneously while walking away from the gateways from different locations. Figure 2 shows a Bing map with the three LoRaWAN gateways, the LoRa end device locations, and the path taken for measurements. More details about the procedure used in our measurements are given in [30].

Data Normalization. Our dataset's absolute RSSI values are large, which results in an unstable network due to large weights. Therefore, we scaled our dataset to the range of 0 to 1 using the Min-Max Normalization data pre-processing technique, using the following formula:

x_i = \frac{RSSI_i - \min(RSSI)}{\max(RSSI) - \min(RSSI)},   (11)

where RSSI = (RSSI_1, \ldots, RSSI_n) are the raw RSSI input values and x_i is the resultant normalized value.
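A minimal NumPy sketch of this min-max scaling is shown below; the RSSI values are placeholders, and the authors' own implementation was done in MATLAB.

```python
# Min-max normalisation of raw RSSI values to [0, 1], following Eq. (11).
# The array below is a placeholder, not the real dataset.
import numpy as np

rssi = np.array([-120.0, -95.0, -87.0, -110.0, -200.0])   # raw RSSI samples
x = (rssi - rssi.min()) / (rssi.max() - rssi.min())
print(np.round(x, 3))   # values scaled between 0 and 1
```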


Fig. 2. Glasgow city Bing Map showing three LoRaWAN gateways, LoRa end device positions and the path taken for data collection

4.2 RNN-LoRa RSSI Based Localization Model

End device localization using the RNN approach consists of directed multiple layers of connected nodes in a network. This network is used to develop a model that accurately maps the input to the output using historical data, and the model can then be used to predict any desired unknown output. We used the RNN network to develop our proposed model, which accurately maps the input RSSI values to the output Cartesian X, Y coordinates using the real data measurements described above, as presented in Fig. 3. The model is then used to output any desired unknown position with minimum localization error. The RNN learning GD algorithm, which provides supervised training, is considered for this research study. The RNN is trained to locate each end device in the network service area or grid, and the trained network is then extended further to predict the location of any other sensor nodes on the same network grid based on end-device LoRaWAN RSSI fingerprints. Using RNN, we developed the RNN-LoRa RSSI based localization model in MATLAB using a fingerprint data set created from real measurements taken in Glasgow City with three LoRaWAN gateways. For every message transmitted by the moving end device, the dataset contains the RSSI value at each of the three gateways that received the message, together with the ground-truth position coordinates of the end device at every transmission. Furthermore, we recorded -200 as the RSSI value for every gateway that did not receive a particular message.
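The sketch below illustrates how such a fingerprint table could be assembled, with three RSSI inputs per transmission, missing gateways filled with -200, and the ground-truth coordinates as targets. The records and column names are hypothetical; they only mirror the structure described above.

```python
# Sketch of the fingerprint dataset layout: one row per transmission,
# RSSI at each of the three gateways (filled with -200 when a gateway
# missed the message) and ground-truth X, Y coordinates as targets.
# The records below are invented placeholders.
import pandas as pd

records = [
    {"gw1": -97, "gw2": -105, "gw3": None, "x": 120.4, "y": 88.1},
    {"gw1": -88, "gw2": None, "gw3": -112, "x": 96.2, "y": 140.7},
]
df = pd.DataFrame(records)
df[["gw1", "gw2", "gw3"]] = df[["gw1", "gw2", "gw3"]].fillna(-200)

X_in = df[["gw1", "gw2", "gw3"]].to_numpy()   # three input neurons
Y_out = df[["x", "y"]].to_numpy()             # two output neurons (position)
```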

5 Results and Performance Analysis

In this section, various RNN architectures with 6, 12, and 18 hidden neurons were modelled and analysed with learning rates of 0.01, 0.1, and 0.4, and the corresponding training and test mean localization errors are recorded. A minimum mean localization error of 0.39 m is obtained on a sample of 1931 data points, whereby 80% of the dataset was used for training and 20% for testing the model, with a minimum training mean square value of 0.005 m. The performance of the developed RNN based localization model is analysed using the average localization error (AE) defined as follows:

Fig. 3. RSSI fingerprint localization approach

AE = \sum_{i=1}^{n} \left( (X_{real} - X_{pred})^2 + (Y_{real} - Y_{pred})^2 \right)^{0.5}   (12)

whereby (X_real, Y_real) is the true pre-recorded position obtained using GPS and (X_pred, Y_pred) is the predicted position of the unknown location estimated by the LoRa RSSI based localization system developed using RNN. The total number of samples in our localization dataset is given by n. Figure 4 presents the mean localization error values for all three RNN-LoRa RSSI based localization system architectures using 0.01, 0.1, and 0.4 learning rates. The results show that for the first system network architecture (RNN-1), increasing the learning rate from 0.01 to 0.1 improved localization performance by minimizing the mean localization error. However, increasing the learning rate further to 0.4


Fig. 4. Performance analysis between RNN based localization network architectures with different learning rates

led to an increase in the mean localization error. Hence, the 0.1 learning rate gave the best performance, whereby only six hidden neurons were used. For the second system architecture (RNN-2), increasing the learning rate from 0.01 to 0.1 increased the system's mean localization error, whereas increasing the learning rate further to 0.4 decreased the error. Hence, the 0.01 learning rate gave the best performance, where twelve hidden neurons were used. Likewise, for the third system architecture (RNN-3), eighteen hidden neurons were used, and the mean localization error decreased as the learning rate increased. Furthermore, three input neurons and two output neurons are considered for all three RNN localization model architectures in our analysis. In general, our results show that the developed RNN LoRa RSSI based localization model performed best with a learning rate of 0.01 and twelve hidden neurons. Also, increasing the number of hidden neurons does not significantly improve the performance of the system except when the highest learning rate of 0.4 was used. A comparative performance analysis of the localization accuracy of all analysed RNN based localization architectures with different learning rates is summarised in Table 1. According to our results, the developed RNN based localization models are reliable, with significant minimum mean localization errors, and outperform the conventional LoRa localization approaches reported by different researchers in the literature [7, 18, 31, 32], and [33], with high accuracy for outdoor localization services.
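As an illustration of how the reported errors relate to Eq. (12), the following NumPy sketch computes per-sample Euclidean errors and their mean from placeholder coordinate arrays; the paper's model itself was built in MATLAB.

```python
# Localization error per Eq. (12) from predicted vs. true 2D coordinates.
# The arrays below are placeholders, not measured results.
import numpy as np

true_xy = np.array([[120.4, 88.1], [96.2, 140.7], [101.5, 77.3]])
pred_xy = np.array([[120.1, 88.6], [95.8, 141.1], [101.9, 76.9]])

errors = np.sqrt(((true_xy - pred_xy) ** 2).sum(axis=1))   # per-sample Euclidean error
print("mean localization error (m):", errors.mean())
```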


Table 1. Performance of different RNN based localization architectures

Learning rate | Mean localization error (m)
              | RNN-1 | RNN-2 | RNN-3
0.01          | 0.76  | 0.39  | 0.48
0.1           | 0.48  | 0.48  | 0.45
0.4           | 0.54  | 0.41  | 0.40

6 Conclusion and Future Work

In this study, we presented a LoRa RSSI based outdoor localization system using Random Neural Networks. Three RNN architectures were used to evaluate our localization model's performance using data collected in the urban area of Glasgow City, considering 0.01, 0.1, and 0.4 as learning rates. The proposed model performed best with a minimum mean localization error of 0.39 m. Moreover, the analyzed RNN based localization system architectures showed that increasing the number of hidden neurons does not improve the localization system's performance unless a high learning rate is used. Our results are significant, with high-level accuracy for outdoor positioning in urban areas, and give important insights into using LoRaWAN and RNN in localization systems. We plan to carry out a comparative performance analysis of our developed localization model against existing localization models found in the literature.

Acknowledgment. This work was funded by the Commonwealth Scholarships in the UK in partnership with the Government of Rwanda.

References
1. Statista Says a Five-Fold Increase in Ten Years in Internet-Connected Devices by 2025 Will Significantly Increase the Internet's Promise of Making the World Connected. https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/. Accessed 27 October 2020
2. Ahmad, A.I., Ray, B., Chowdhury, M.: Performance evaluation of LoRaWAN for mission-critical IoT networks. Commun. Comput. Inf. Sci. (CCIS) 1113, 37–51 (2019)
3. Kharel, J., Reda, H.T., Shin, S.Y.: Fog computing-based smart health monitoring system deploying LoRa wireless communication. IETE Tech. Rev. (Inst. Electron. Telecommun. Eng. India) 36(1), 69–82 (2019)
4. Poulose, A., Han, D.S.: UWB indoor localization using deep learning LSTM networks. Appl. Sci. 10(18), 6290 (2020)
5. Semtech: LoRaWAN Specification v1.1. https://www.lora-alliance.org/technology. Accessed 27 October 2020
6. Ghoslya, S.: All about LoRa and LoRaWAN. https://www.sghoslya.com
7. Bissett, D.: Analysing TDoA localisation in LoRa networks. Delft University of Technology (2018)


8. Zghair, N.A.K., Croock, M.S., Taresh, A.A.R.: Indoor localization system using Wi-Fi technology. Iraqi J. Comput. Commun. Control Syst. Eng. 19(2), 69–77 (2019)
9. Hernández, N., Ocaña, M., Alonso, J.M., Kim, E.: Continuous space estimation: increasing WiFi-based indoor localization resolution without increasing the site-survey effort. Sensors (Switzerland) 17, 147 (2017)
10. Janssen, T., Weyn, M., Berkvens, R.: Localization in low power wide area networks using Wi-Fi fingerprints. Appl. Sci. 7(9), 936 (2017)
11. Zafari, F., Gkelias, A., Leung, K.K.: A survey of indoor localization systems and technologies. IEEE Commun. Surv. Tutor. 21(3), 2568–2599 (2019)
12. Sadowski, S., Spachos, P.: RSSI-based indoor localization with the internet of things. IEEE Access 6, 30149–30161 (2018)
13. Kwasme, H., Ekin, S.: RSSI-based localization using LoRaWAN technology. IEEE Access 7, 99856–99866 (2019)
14. Anjum, M., Khan, M.A., Hassan, S.A., Mahmood, A., Gidlund, M.: Analysis of RSSI fingerprinting in LoRa networks. In: 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 1178–1183 (2019)
15. Lin, Y.C., Sun, C.C., Huang, K.T.: RSSI measurement with channel model estimating for IoT wide range localization using LoRa communication. In: Proceedings of the 2019 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 5–6 (2019)
16. Goldoni, E., Prando, L., Vizziello, A., Savazzi, P., Gamba, P.: Experimental data set analysis of RSSI-based indoor and outdoor localization in LoRa networks. Internet Technol. Lett. 2(1), e75 (2019)
17. Choi, W., Chang, Y.S., Jung, Y., Song, J.: Low-power LoRa signal-based outdoor positioning using fingerprint algorithm. ISPRS Int. J. Geo Inf. 7(11), 1–15 (2018)
18. Aernouts, M., Berkvens, R., Van Vlaenderen, K., Weyn, M.: Sigfox and LoRaWAN datasets for fingerprint localization in large urban and rural areas. Data 3(2), 1–15 (2018)
19. Lam, K.H., Cheung, C.C., Lee, W.C.: LoRa-based localization systems for noisy outdoor environment. In: International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), pp. 278–284 (2017)
20. ElSabaa, A.A., Ward, M., Wu, W.: Hybrid localization techniques in LoRa-based WSN. In: 2019 25th IEEE International Conference on Automation and Computing (ICAC), pp. 1–5 (2019)
21. Lam, K.H., Cheung, C.C., Lee, W.C.: RSSI-based LoRa localization systems for large-scale indoor and outdoor environments. IEEE Trans. Veh. Technol. 68(12), 11778–11791 (2019)
22. Javed, A., Larijani, H., Ahmadinia, A., Emmanuel, R., Mannion, M., Gibson, D.: Design and implementation of a cloud enabled random neural network-based decentralized smart controller with intelligent sensor nodes for HVAC. IEEE Internet Things J. 4(2), 393–403 (2017)
23. Javed, A., Larijani, H., Wixted, A., Emmanuel, R.: Random neural networks based cognitive controller for HVAC in non-domestic building using LoRa. In: Proceedings of the 2017 IEEE 16th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pp. 220–226 (2017)
24. Javed, A., Larijani, H., Ahmadinia, A., Gibson, D.: Smart random neural network controller for HVAC using cloud computing technology. IEEE Trans. Ind. Inform. 13(1), 351–360 (2017)


25. Ahmad, J., Larijani, H., Emmanuel, R., Mannion, M., Javed, A., Phillipson, M.: Energy demand prediction through novel random neural network predictor for large non-domestic buildings. In: Proceedings of the 11th Annual IEEE International Systems Conference (SysCon), pp. 1–6 (2017)
26. Qureshi, A.U.H., Larijani, H., Javed, A., Mtetwa, N., Ahmad, J.: Intrusion detection using swarm intelligence. In: 2019 UK/China Emerging Technologies (UCET), pp. 1–5 (2019)
27. Qureshi, A.U.H., Larijani, H., Ahmad, J., Mtetwa, N.: A novel random neural network based approach for intrusion detection systems. In: Proceedings of the 2018 10th Computer Science and Electronic Engineering Conference (CEEC), pp. 50–55 (2019)
28. Qureshi, A.U.H., Larijani, H., Mtetwa, N., Javed, A., Ahmad, J.: RNN-ABC: a new swarm optimization based technique for anomaly detection. Computers 8(3), 59 (2019)
29. Gelenbe, E.: Random neural networks with negative and positive signals and product form solution. Neural Comput. 1(4), 502–510 (1989)
30. Wixted, A.J., Kinnaird, P., Larijani, H., Tait, A., Ahmadinia, A., Strachan, N.: Evaluation of LoRa and LoRaWAN for wireless sensor networks. In: Proceedings of the IEEE Sensors, pp. 5–7 (2017)
31. Anagnostopoulos, G.G., Kalousis, A.: A reproducible comparison of RSSI fingerprinting localization methods using LoRaWAN (2019)
32. Purohit, J.N., Wang, X.: LoRa based localization using deep learning techniques. California State University (2019)
33. Nguyen, T.A.: LoRa localisation in cities with neural networks, p. 71. Delft University of Technology (2019)
34. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR) (2015)

A Deep Convolutional Neural Network Approach for Plant Leaf Segmentation and Disease Classification in Smart Agriculture

Ilias Masmoudi(B) and Rachid Lghoul

Al Akhawayn University in Ifrane, Hassan II Avenue, 53000 Ifrane, Morocco
[email protected]

Abstract. This paper concerns an approach for developing a set of tools based on computer vision and deep learning to detect and classify plant diseases in smart agriculture. The main reason that motivated this work is that early detection of plant diseases can help farmers effectively monitor the health of their culture, as well as make the best decision to avoid the spread of the pathogens. In this work, a novel way for training and building a fast and extensible solution to detect plant diseases with images and a convolutional neural network is described. The development of this methodology is achieved in two main steps. The first one introduces Mask R-CNN to give bounding boxes and masks over the area of plant leaves in images. The model is trained on the PlantDoc dataset made of labeled images of leaves with their corresponding bounding boxes. And the second one presents a convolutional neural network that returns the class of the plant. This CNN is trained on the PlantVillage dataset to recognize 38 classes across 14 plant species. Experimental results of the proposed approach show an average accuracy of 76% for leaf segmentation and 83% for disease classification. Keywords: Computer vision · Convolutional neural network · Segmentation · Classification

1 Introduction

An early detection of plant disease is crucial to keep a crop in a healthy condition, but its identification can be a tedious task in many parts of the world due to the unavailability of the required equipment [1]. Moreover, it has been reported that diseases and pests can reduce yields by more than 50%, and small farmers are the most vulnerable to this issue [2]. For efficient crop management, the correct identification of the disease should be fast and cheap. With the huge leap in technological advancement in the past few years, high-resolution cameras, smartphones, and computers have become widespread and accessible to a large portion of the population [1]. All these factors can make an automated solution feasible, especially as deep learning techniques have shown their ability to perform exceptionally well on complex tasks.


It has been stated that between 60% and 70% of plant diseases are detected from leaves [3]. Early detection of plant diseases can help farmers effectively monitor the health of their culture, as well as make the best decision to avoid the spread of the pathogens. Many methods are currently used to detect plant diseases. Lab methods can include Polymerase Chain Reaction (PCR), immunofluorescence (IF), or flow cytometry (FCM) [4]. However, most of these techniques are very costly, time-consuming, and require field expertise. Consequently, researchers have tried to find cost-efficient and reliable techniques to help agriculture advancement. In [5], the authors used manual feature extraction combined with Canny edge detection and performed histogram matching to test the health of the plant. Another use of computer vision in plant disease detection is shown in [6], where the authors used the CIELAB, YCbCr, and HSI color spaces to segment disease spots. With the rapid advancement of AI, many machine learning techniques were proposed as well. The authors of [7] proposed an approach based on K-means as a clustering step and a neural network for disease classification. More recently, deep learning approaches that take advantage of large volumes of data and faster machines made their appearance. In [8], the authors applied a deep convolutional neural network to detect 13 different diseases from plant leaves, and in [9], the authors built a dataset called PlantDoc to demonstrate the feasibility of building a scalable and cost-effective solution using computer vision; the authors also believed that segmenting leaves from images can enhance the utility of their dataset. Mask R-CNN can be used for both leaf detection and segmentation. However, training the model on multiple classes requires more training time, the training data needs to have bounding boxes, and the approach is not very scalable. For example, if the farmer decides to add a new plant, the upper layers of the R-CNN model would need to be retrained to include the new leaves. On the other hand, having a model that generalizes to detect every leaf would result in a much quicker process, since only the second model would need to be retrained. For these reasons, a deep convolutional neural network approach is presented that uses Mask R-CNN for plant leaf segmentation and a second CNN model for disease classification. The primary step is to detect and segment leaves from the images, and Mask R-CNN is used because of its impressive results in similar tasks [10]. The model is trained on a dataset called "PlantDoc", which is made up of 2,598 images. A second convolutional neural network is trained using the "PlantVillage" dataset and then added on top of the leaf segmentation model. The following section explains the methodology in detail, and Sect. 3 shows the results of training the two models.

2 Material and Methods

2.1 Mask R-CNN

The state-of-the-art segmentation model is Mask R-CNN [10]. The architecture is modeled in two parts. The first part is composed of a backbone model like ResNet, VGG, or Inception; its output is fed to a region proposal convolutional neural network that, in turn, returns the zones containing objects in the feature maps [10]. In the second part, the outputs are fed to an ROI Align layer because the fully connected layers are of fixed size. The first fully connected layer is used to predict classes with a SoftMax activation function and the second is a regressor that returns bounding boxes around the objects. The ROI-aligned feature maps are also fed to a convolutional neural network that returns the masks over every ROI [11].

2.2 Convolutional Neural Networks

In recent years, deep learning techniques demonstrated their ability to make the best use of large volumes of data. Given the complexity of leaf images, convolutional neural networks have the upper hand since their architecture allows learning complex patterns without the need to build handcrafted features [12]. Convolution is a special operation that allows for feature extraction, in which a kernel is applied to the input with an element-wise product at every location and the sum is output at each given position. The output tensor is called a feature map. This operation is repeated for every kernel, and the set of feature maps represents the characteristics detected from the input matrix. The main hyperparameters of convolution are the size and number of kernels [12], and it is common practice to use 32 or 64 kernels of size 3 × 3 for simple tasks.

2.3 Data Augmentation

For certain tasks, the available dataset is not large enough, and this could lead to underfitting of the model. Luckily, many techniques have been developed and used in the literature to increase the size of the dataset, learn the most important features, and achieve better generalization [13, 16]. Pawara et al. presented in [13] some useful augmentation parameters. During training, we can use one or a combination of parameters by specifying a range or a value for each one of them; the images are then randomly generated within those specified limits.

2.4 Leaf Segmentation

The initial step is to detect and segment the leaves from the images. The main reason for this is that a plant can have multiple diseases at the same time, and we would like to remove the background to reduce noise during classification. For the PlantDoc dataset, instead of training the Mask R-CNN model for every class, we considered all images as a single class for two main reasons: training on multiple classes for localization and disease classification simultaneously required more training time, and using two different models, one for leaf detection and the other for disease classification, gives more flexibility in terms of maintenance. For instance, if we need to add a new class, we would only have to retrain the disease classification model. Figure 1 shows a sample image from the PlantDoc dataset with its corresponding bounding boxes.


Fig. 1. Leaves bounding boxes

2.5 Disease Classification

The original PlantVillage dataset was selected instead of other third-party datasets to allow for more freedom in the customization and fine-tuning in the data augmentation step. Additionally, the dataset has leaf-segmented images, which is one of the main reasons behind the use of Mask R-CNN for leaf segmentation. Figure 2 below shows one image of each class present in the dataset:

Fig. 2. Leaf classes in the PlantVillage dataset [14]

The distribution of images between classes of the PlantVillage dataset used is unbalanced [15]. To tackle this issue, we will be using a function in Keras called class weights.


After calculating a weight for each class, the function will be able to backpropagate the loss according to the inverse of the class distribution. Before training, the dataset was split into training, validation, and testing sets containing 80%, 10%, and 10% of the images, respectively. Image augmentation is still applied to increase the generalization of our model, since lighting conditions can be very different between the labeled images and the ones captured in a field. Some parameters, such as shear, zoom, flip, and brightness, have been chosen adequately. A sample of these augmentations is illustrated in Fig. 3:

Fig. 3. Sample of augmented leaves for disease classification
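A minimal augmentation pipeline of the kind described above could be configured with Keras' ImageDataGenerator, as sketched below; the parameter ranges and dataset path are illustrative assumptions, not the exact values used by the authors.

```python
# Illustrative Keras augmentation pipeline; ranges and paths are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    brightness_range=(0.7, 1.3),
    validation_split=0.1,
)

train_gen = train_datagen.flow_from_directory(
    "plantvillage/train",          # hypothetical path to the class folders
    target_size=(128, 128),
    batch_size=32,
    class_mode="categorical",
    subset="training",
)
```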

The development of this work is performed according to the different stages shown in Fig. 4.


Fig. 4. Processing steps of the presented method

3 Results and Discussion

3.1 Leaf Detection

Mask R-CNN is trained with an input of size 256 × 256, a learning rate of 0.001, a momentum of 0.9, a minimum detection confidence of 0.8, and a mini mask of shape 56 × 56. The model is trained for 10 epochs, which lasted 2 h using the Google Colaboratory GPU, and finished with a testing accuracy of 76%. Figure 5 shows both the validation and training loss and the bounding box loss of the Mask R-CNN model.
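The listed training settings map naturally onto a configuration object if the open-source Matterport Mask R-CNN implementation is assumed (the paper does not name the library used); a sketch follows.

```python
# Sketch of the training configuration, assuming the Matterport Mask R-CNN
# implementation; the paper does not state which library was used.
from mrcnn.config import Config

class LeafConfig(Config):
    NAME = "leaf"                      # single "leaf" class plus background
    NUM_CLASSES = 1 + 1
    IMAGE_MIN_DIM = 256
    IMAGE_MAX_DIM = 256
    LEARNING_RATE = 0.001
    LEARNING_MOMENTUM = 0.9
    DETECTION_MIN_CONFIDENCE = 0.8
    USE_MINI_MASK = True
    MINI_MASK_SHAPE = (56, 56)

config = LeafConfig()   # passed to the model when training for 10 epochs
```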


Fig. 5. Mask R-CNN losses

The images on the left of Fig. 6 represent the true bounding boxes, while the images on the right show the predicted bounding boxes and leaves. We can see that the trained model was even able to detect leaves in the images that were not present in the true bounding boxes, which is a good sign that the model did not overfit. For only 2 h of training and as a proof of concept, the results are better than anticipated. However, in most cases the mask does not precisely cover the leaf; this may be solved by fine-tuning other parameters such as the mask shape and disabling the mini mask (it was used to reduce training time).

3.2 Disease Classification

To speed up training, a simple CNN was used with the following architecture: five convolutional layers with a kernel size of 3 × 3 and ReLU activations, shown in (1), combined with max-pooling every 2 layers, and a SoftMax output layer, which is defined by Eq. (2) [17]:

f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}   (1)

where x is the input of a neuron, and

\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K \text{ and } Z = (z_1, \ldots, z_K) \in \mathbb{R}^K   (2)


Fig. 6. True and predicted bounding boxes

where zi is an element of the vector Z. The loss is calculated using categorical cross-entropy with the Adam optimizer. Figure 7 shows both validation and training accuracies. The validation accuracy significantly decreased over the last few steps, confirming the overfitting of the model. To deal with this issue, early stopping was used, allowing us to continuously save the weights of the best performing model and return to that model at the end of training. The confusion matrix shown in Fig. 8 gives details on the performance of the model in differentiating between the classes. The y axis represents the true labels and the x axis represents the predicted labels. The weighted average testing accuracy across all the classes is 83%. From the confusion matrix, we note that most of the predictions of the model are correct apart from a few exceptions, such as confusion between healthy soybean, healthy potato (91%), healthy raspberry (31%), and healthy cherry (34%). The misclassifications are mainly


Fig. 7. Loss and accuracy for both validation and training steps

due to the close texture and shape between the leaves. The model would probably perform better if other architectures and fine-tuning methods were explored. In another experiment, a model with much lower complexity was trained. The images used were of size 50 × 50 and the architecture consisted of only 2 convolutional layers connected directly to the softmax layer. Surprisingly, the results were very comparable to the previous model. With a weighted average accuracy of 81% and a training time of only 8 min, it is clear that the disease classification can be achieved with simple and fast models. To assess the applicability of this method, a few random plant images gathered from the internet (present in neither dataset) were tested and gave fairly good results, but the wrong predictions were found to be caused mainly by poor segmentation.
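A minimal Keras sketch of the classifier described in this section is given below: five 3 × 3 convolutional layers with ReLU, max-pooling every two layers, and a softmax output over the 38 classes. The filter counts and input size are assumptions, since the paper does not list them.

```python
# Minimal Keras sketch of the five-convolution-layer classifier described
# above; filter counts and input size are assumptions, not stated values.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(38, activation="softmax"),   # 38 PlantVillage classes
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Class weights (inverse-frequency) and an EarlyStopping callback with
# restore_best_weights=True can be passed to model.fit, as described above.
```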


Fig. 8. Confusion matrix of the disease classifier

We can confirm that having separate models for each task is more time efficient and scalable. It is much faster to train the classification model alone than to train Mask R-CNN for both tasks. Also, this method gives us more room for upgradability, because the segmentation model does not need to be retrained every time a new category is added. The segmentation process could increase the performance of classification; however, this could not be proven in this work since the segmentation task of Mask R-CNN did not perform very well. This method also suffers from a few limitations. One of them is that the classification fails when leaves contain multiple diseases or when they are too small or too large. Also, the method relies on the availability and quality of the segmented images for classification.


4 Conclusion

In this paper, an approach for plant leaf segmentation and disease classification was developed based on computer vision and convolutional neural networks. Two datasets are used to demonstrate the ability of the proposed approach to detect and classify plant diseases. The discussed approach can be used in agricultural fields to help farmers improve their crop quality and yield. By embedding the model in a mobile application or with the help of a UAV, the farmer can have cheap and fast monitoring of his crops. In future work, a new segmentation method should be explored. In parallel, the results of Mask R-CNN can be improved by exploring other configurations and training on a dataset that combines a large variety of sources such as weather conditions and locations.

References
1. Mohanty, S., Hughes, D., Salathé, M.: Using deep learning for image-based plant disease detection. Front. Plant Sci. 7 (2016). https://doi.org/10.3389/fpls.2016.01419
2. UNEP: Smallholders, Food Security, and the Environment. International Fund for Agricultural Development (IFAD), Rome (2013). https://www.ifad.org/documents/10180/666cac2414b643c2876d9c2d1f01d5dd
3. Abdu, A., Mokji, M., Sheikh, U.: Machine learning for plant disease detection: investigative comparison between support vector machine and deep learning (2020). https://doi.org/10.11591/ijai.v9.i3
4. Fang, Y., Ramasamy, R.: Current and prospective methods for plant disease detection. Biosensors 5(3), 537–561 (2015). https://doi.org/10.3390/bios5030537
5. Reddy, P.R., Divya, S.N., Vijayalakshmi, R.: Plant disease detection technique tool - a theoretical approach. Int. J. Innov. Technol. Res. 4, 91–93 (2015)
6. Chaudhary, P., Chaudhari, A.K., Cheeran, A.N., Godara, S.: Color transform based approach for disease spot detection on plant leaf. Int. J. T. N
7. Tete, T.N., Kamlu, S.: Detection of plant disease using threshold, k-mean cluster and ANN algorithm. In: 2017 2nd International Conference for Convergence in Technology (I2CT), Mumbai, pp. 523–526 (2017). https://doi.org/10.1109/I2CT.2017.8226184
8. Sladojevic, S., Arsenovic, M., Anderla, A., Culibrk, D., Stefanovic, D.: Deep neural networks based recognition of plant diseases by leaf image classification. Comput. Intell. Neurosci. 2016, 1–11 (2016). https://doi.org/10.1155/2016/3289801
9. Singh, D., Jain, N., Jain, P., Kayal, P., Kumawat, S., Batra, N.: PlantDoc. In: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD (2020). https://doi.org/10.1145/3371158.3371196
10. How MaskRCNN Works? | ArcGIS for Developers. https://developers.arcgis.com/python/guide/how-maskrcnn-works/. Accessed 8 Aug 2020
11. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017). https://doi.org/10.1109/iccv.2017
12. Yamashita, R., Nishio, M., Do, R.K.G., Togashi, K.: Convolutional neural networks: an overview and application in radiology. Insights Imaging 9(4), 611–629 (2018). https://doi.org/10.1007/s13244-018-0639-9
13. Pawara, P., Okafor, E., Schomaker, L., Wiering, M.: Data augmentation for plant classification. Adv. Concepts Intell. Vis. Syst., pp. 615–626 (2017). https://doi.org/10.1007/978-3-319-70353-4_52

A Deep Convolutional Neural Network Approach for Plant Leaf Segmentation

1055

14. PlantVillage Disease Classification Challenge, crowdAI (2016). https://www.crowdai.org/cha llenges/1. Accessed 04 Sep 2020 15. Mohanty, S.: spMohanty/PlantVillage-Dataset. GitHub (2016). https://github.com/spMoha nty/PlantVillage-Dataset. Accessed 4 Sept 2020 16. Kuznichov, D., Zvirin, A., Honen, Y., Kimmel, R.: Data Augmentation for Leaf Segmentation and Counting Tasks in Rosette Plants. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2019). https://doi.org/10.1109/cvprw.2019. 00314 17. Goodfellow, I., Bengio, Y., Courville, A.: 6.2.2.3 Softmax Units for Multinoulli Output Distributions. Deep Learning. MIT Press, pp. 180–184 (2016). ISBN 978–0–26203561–3

Medium Resolution Satellite Image Classification System for Land Cover Mapping in Nigeria: A Multi-phase Deep Learning Approach

Nzurumike L. Obianuju1(B), Nwojo Agwu1, and Onyenwe Ikechukwu2

1 Nile University of Nigeria, Abuja, Nigeria
2 Nnamdi Azikiwe University Awka, Awka, Nigeria

Abstract. Deep learning convolutional neural network (CNN) algorithms have made considerable improvements beyond the previous state of the art for the automatic classification of satellite images for land cover mapping. However, their use brings significant new challenges for developing countries such as Nigeria. This is mainly due to the scarcity of labeled high-resolution datasets: training a high-performing model requires images with fine detail as well as a huge amount of labeled data to learn from. In this paper we designed and implemented a novel deep learning framework that can effectively classify pixels in a medium resolution satellite image into five major land cover classes. To achieve this, we developed a CNN model from a pretrained ResNet-50, retrained using EuroSAT, a large-scale medium resolution satellite dataset. We further integrated augmentation and ensemble learning techniques into the model to create a model able to generalize to unseen data. To test its performance, we created image patches from satellite images of cities in Nigeria (NigSAT). The final classification results show that our model achieves 96% and 80% accuracy on the EuroSAT and NigSAT test data, respectively. The model is expected to improve the efficiency and accuracy of satellite image classification tasks in developing countries where large-scale training datasets are scarce, thus providing more reliable and faster access to land cover information.

Keywords: Satellite image classification · Medium resolution · Deep learning

1 Introduction

Land cover mapping details how much of a region is covered by natural or artificial features such as forests, water bodies, bare land, built-up areas, farmland, etc. The extent of these features is constantly changing, driven by a combination of natural and anthropogenic causes. These changes can have significant impacts on people, the economy, and the environment, and the impact is made worse in most developing countries by population explosion, poverty, high rural-urban migration, etc. Furthermore, land cover information is an important variable for many studies involving the Earth's surface
and supports a wide range of applications, including land use planning, forest monitoring, natural disaster monitoring, population estimation, geology and agriculture. It is therefore very crucial for developing countries such as Nigeria to have information not only on existing land cover but also the capability to monitor the dynamics of land cover change in real time. This in turn will support relevant policies and decision-making for sustainable planning and development.

Satellite images of the Earth taken from the air and from space show a great deal about the planet's landforms, vegetation, and resources. Previous studies have shown that they offer a unique opportunity to investigate the physical features of the Earth's surface, known as land cover mapping, without the need to physically access the area. Today, satellite imagery is one of the most important tools for mapping the land cover of the Earth. The difficulty, however, lies in interpreting and analyzing the content of satellite imagery, otherwise known as satellite image classification. This task has posed a serious challenge in the past due to the complex and heterogeneous nature of satellite images.

With the advent of artificial intelligence (AI), a field of computer science that develops programs to solve problems in ways that appear to emulate human intelligence, new attention is being paid to its application to satellite image classification. As a broad subfield of AI, machine learning (ML) is concerned with algorithms and techniques that allow computers to analyze data, learn from the data and make predictions on new data using computational and statistical methods. Traditional ML techniques such as unsupervised, supervised and object-based image classification were proposed in the past [1]. These techniques were designed primarily on the basis of spectral features reflected by the physical properties of the material and require prior feature extraction; however, as with any classification problem, detecting good features/attributes can be difficult [2]. Other notable disadvantages of these methods include low accuracy, long training times, difficulty in understanding the structure of the algorithms and inability to generalize well to large-scale learning problems.

Recently, deep learning (DL) models have emerged as a powerful solution to many ML tasks, including satellite image classification [3]. Deep learning has become one of the most widely accepted emerging technologies and offers a compelling alternative to traditional classification techniques, with the ability to handle the growing volume of Earth observation data. Studies have shown that, among the many deep learning-based methods, the convolutional neural network (CNN) is the most established algorithm for classifying visual images. Its potential for satellite image classification has been demonstrated by many researchers in recent times for solving problems in the domains of geological mapping, land cover mapping, urban planning, geological image classification, population estimation, etc. The strength of CNN models lies in three important factors: (i) large-scale labelled training data, (ii) high resolution satellite data and (iii) a good graphics processing unit (GPU). However, labelled training data is very scarce, making it difficult for CNNs to be used for satellite image classification tasks [4].
To mitigate the problem of limited labelled training data, most researchers have used several techniques including transfer learning to leverage on either satellite image data repositories such as UC Merced (UCM) [5], EuroSAT [6] and more recently BigEarthNet
[7], or on natural image data repositories such as ImageNet or CIFAR-10. Data augmentation (DA) has also been used extensively to artificially expand the available training dataset [3]. However, these techniques can only be applied where there is at least a reasonable number of training samples to start from. While freely accessible medium resolution satellite images can be downloaded from Earth observation programs such as the European Space Agency's (ESA) and the United States National Aeronautics and Space Administration's (NASA), it is very expensive in terms of expertise, money and time for developing countries like Nigeria to label the quantity of satellite image patches needed to train a high-performing CNN model. This brings forward a significant new challenge for the application of deep learning to satellite image classification in developing countries where there is no labelled training data repository. Additionally, most of the available models for satellite image classification were trained on high resolution datasets and as a result cannot guarantee accurate results on low- to medium resolution satellite data, due to the heterogeneous appearance of satellite images, variation in geography and differences in the spatial detail of the training datasets with respect to the actual (target) dataset. There is a need to develop robust CNN models that can be used in cases where training data is unavailable.

Inspired by the above, the main contributions of this paper are:

1. We design and implement a novel deep convolutional neural network model that can accurately classify a medium resolution Nigerian satellite (NigSAT) dataset for land cover mapping in Nigeria, using the integration of three important techniques: transfer learning, augmentation and ensemble learning. Our focus is on medium resolution data because of its massive use as the primary source of data for land cover change analysis research in Nigeria.
2. We created land cover image patches from satellite images of Nigeria (NigSAT) for testing and evaluating the CNN models.
3. We detailed a methodology which can be followed to create more NigSAT image patches.

The remainder of this article is arranged as follows: Sect. 2 presents a review of related work; Sect. 3 discusses the methodology of the proposed work; Sect. 4 presents the experimental results and analyses; and finally Sect. 5 provides the conclusion.

2 Review of Related Works CNN, a deep learning model has extensively been used in computer vision applications in recent years and have achieved state of the art results in image classification tasks [8]. They are based on an end-to-end learning process, from raw data such as image pixels to semantic labels. Recent studies have demonstrated that advancement in sensor technology, computation power and deep learning-convolutional neural network model provides a powerful combination for automatic land cover mapping. However, a number of notable challenges are associated with its use in satellite image classification in developing countries. In this section we reviewed recent and relevant papers to highlight
the different advances made in designing CNN models with improved generalization performance on satellite image classification tasks. Our focus is on cases with limited or no training data. The findings of this preliminary study are discussed below under three headings.

2.1 CNN Model

The three main types of layers for building a CNN model are the convolutional layer, the pooling layer and the fully-connected layer. The arrangement of these layers plays a fundamental role in designing new models and boosts their ability to learn complex representations. Different architectural designs have been proposed and successfully used in the past, ranging from AlexNet, GoogLeNet and InceptionV3 to VGG-Net and ResNet. Each tried to overcome the shortcomings of previously proposed architectures in combination with new structural reformulations [9]. They mainly focused on increasing the depth of the layers, with the aim of improving the network structure [10]. Other popular techniques such as dropout, regularization and batch normalization have also been used to improve optimization as well as the overall performance of the models. Most of these earlier state-of-the-art models were used to classify large-scale everyday natural images [11], which differ significantly from satellite images. These models are not directly suitable for handling satellite datasets because of their complexity, large number of spectral bands and inherent high variability. Additionally, estimating the millions of parameters of a deep CNN requires a large number of labelled samples, which prevents such models from being applied directly to satellite image tasks with limited training data.

Some researchers have proposed novel models. W. Zhang et al. [12] proposed the capsule network (CapsNet), which uses a group of neurons as a capsule, or vector, to replace the neuron of a traditional neural network. These neurons can encode the properties and spatial information of features in an image to achieve equivariance, thus addressing the overfitting problem common to deep learning models; CapsNet also uses dynamic routing instead of the max pooling seen in a regular CNN. Paoletti et al. [13] proposed a deep CNN based on 3D convolution, M. Carranza-García et al. [3] proposed a 2D CNN, while H. Song et al. [10] proposed a patch-based light CNN (LCNN); to reduce the number of parameters as well as the computational cost, they omitted the fully connected layer and the pooling layer from the architecture. One consistent problem with these models, however, is the absence of a large-scale dataset for training, and as a result most of them are shallow. Experimental results show that deep models have an advantage over shallow models when dealing with complex data such as satellite images. Today, few researchers train a model from scratch; most recent research focuses on building on already existing (pre-trained) models using transfer learning and implementing architectural innovations, or on using other techniques to cope with limited satellite datasets.

2.2 Transfer Learning

Despite CNNs' robust feature extraction capabilities, in practice it is difficult to train them with a small quantity of data: a CNN needs to learn features from a large number of training samples to achieve satisfactory accuracy. An extensive review by Ball et al. [14] ranked
insufficient training data as a major challenge in the application of deep learning to the classification of remote sensing (satellite) imagery. Transfer learning (TL) is one of the most popular machine learning techniques for solving problems with limited training data. It involves training a CNN model on a base dataset for a specific task and then using the learned features/model parameters as the initial weights in a new classification task. These models, with millions or even hundreds of millions of parameters, are trained on huge volumes of data. TL relies on the fact that the features learned in the lower layers of a CNN, such as edges, curves or color, may be general enough to be useful for other classification tasks. The use of TL has facilitated the application of CNNs to scientific fields with less available data [15].

Very recent work has demonstrated that features learnt by successful pre-trained deep CNN models can also be transferred to satellite image classification tasks. For instance, R. P. De Lima and K. Marfurt [15] compared TL from CNNs pre-trained on natural images for remote-sensing scene classification against CNNs trained from scratch only on the remote sensing datasets themselves; the former outperformed the latter. Masoud Mahdianpari et al. [16] presented a detailed investigation of the capacity of seven well-known state-of-the-art pre-trained deep learning models for the classification of complex wetland mapping in Canada. Both studies concluded that learning can be transferred from one domain to another and therefore provides a powerful tool for satellite image classification. Zhong Chen et al. [17] utilized TL for airplane detection in remote sensing images: they used the pre-trained VGG16 network as the base network, replaced the fully-connected layer with a secondary network structure and finally fine-tuned with the limited number of available airplane training samples. M. Xie et al. [4] used a sequence of transfer learning steps to design a novel ML approach using satellite imagery: the first transfer learning problem P1 is object recognition on ImageNet; the second problem P2 is predicting nighttime light intensity from daytime satellite imagery; the third problem P3 is predicting poverty from daytime satellite imagery.

Although using a pretrained CNN and adapting it to a new target task is a valid alternative, a good strategy needs to be adopted for the transfer process to succeed. There are different strategies for performing transfer learning based on the domain, the task at hand and the availability of data: full training, feature extraction and fine-tuning. Using a CNN as a feature extractor involves using features extracted from a pre-existing model to train a new model of our choice, while in fine-tuning the weights of the pre-trained CNN model are preserved in some layers and tuned in the others. The traditional strategy for transferring pre-trained deep CNNs to remote scene classification is to freeze the shallow layers and fine-tune the last, deep layers [18]. The question that arises, however, is how depth affects the features of satellite images in the transfer process. This question was examined in a systematic review of transfer learning for scene classification by R. P. De Lima and K. Marfurt [15].
They evaluated how dependent the transfer learning process is on the transition from general to specific features by extracting features in three different positions denominated as “shallow”, “intermediate”, and “deep” for each one of the pretrained models. The shallow experiment uses the initial blocks of the base models to extract features and adds the top model. The intermediate experiment extracts the block somewhere in the middle of the base model. Finally, the
deep experiment uses all the blocks of the original base model except the original final classification layer. The experimental results showed that deep adaptation, involving several layers and careful fine-tuning, obtained better results than shallow learning or training a CNN model with randomly initialized weights. The depth of transferred CNNs enhances their generalization power and supports a general hypothesis for remote scene classification: careful fine-tuning involving several layers of the model provides very good results [18].

2.3 Data Augmentation

Data augmentation is a well-known machine learning technique used to artificially expand the size of a training dataset by creating modified versions of the images in the dataset [3]. It creates variations of the images using a range of operations such as geometric transformations, e.g. rotation, rescaling, flipping, translation, cropping, zooming and the addition of noise, while photometric transformations change the color channels with the objective of making the CNN invariant to changes in lighting and color [19]. The augmented data represents a more comprehensive set of possible features, which helps improve the ability of the model to generalize what it has learned to new images. The volume and diversity of the training data are essential for training a robust deep learning model. However, results by M. Abdelhack [11] show that while some image augmentation techniques commonly used on natural images can readily be transferred to satellite images, others can actually lead to a decrease in performance; it is useful to consider the nature of satellite data when deciding on an augmentation strategy. D. I. N. Ore [20] and Stivaktakis et al. [21] proposed a data augmentation enhanced CNN framework to increase the volume and diversity of remote sensing datasets by introducing three operations: flip, translation and rotation. These operations were used because they do not alter the spectral or topological information of the satellite images, which is important for a consistent classification result. Krizhevsky et al. [22] proposed a data augmentation method that alters the intensities of the RGB channels of raw data and achieved improved performance on the ImageNet benchmark.

A more recent data augmentation method for training CNNs, known as random erasing, was introduced by Zhong et al. [16]. The technique specifically addresses the problem of occlusion, which occurs when part of an image is blocked off. Such blocked images, when used for training, cause a model to generalize poorly when tested with unseen data (overfitting). Random erasing works by masking different rectangular portions of an image with random values, thus generating training images with various levels of occlusion. This technique reduces the risk of overfitting and makes the model robust to occlusion. J. M. Haut et al. [23] adopted this data augmentation technique to mitigate overfitting in hyperspectral satellite image (HSI) classification.

Generative adversarial networks (GANs), a class of neural networks, also offer an exciting opportunity to augment training datasets and help boost accuracy in image classification. A GAN offers a way to unlock additional information from a dataset by generating synthetic samples with the appearance of real images. Various variants of GANs
have been proposed, and experimental results have demonstrated their effectiveness [24]. Their limitation, however, is that GANs cannot be relied upon to produce images with the perfect fidelity obtained with traditional augmentation techniques. Some researchers have combined TL and augmentation techniques to design novel architectures. Han et al. [25] proposed a two-phase method combining TL and web data augmentation to classify natural images: the useful feature representation of a pre-trained network was efficiently transferred to a target task, and the original dataset was augmented with the most valuable internet images. A work similar to ours was done by Grant J. Scott et al. [26], who investigated the use of deep CNNs for land-cover classification in high-resolution satellite imagery; to overcome the lack of massive labeled datasets, they utilized two techniques in conjunction with the DCNN: TL with fine-tuning, and data augmentation tailored specifically for remote sensing imagery.

In summary, the techniques outlined above have been shown to work in cases where there are at least a few training datasets to be expanded, used to fine-tune a pre-trained model, or both. In contrast to existing satellite image classification research, which focuses on improving CNN algorithms in the case of limited training data, we propose a multi-phase deep learning approach that works when no local training data is available. It also provides additional functionality by integrating ensemble learning with transfer learning and augmentation. A system that can extract useful information from medium resolution satellite images is highly desirable, especially in developing countries.
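As an illustration of the random erasing operation described above, the following is a minimal NumPy sketch; the probability, area range and aspect-ratio bounds are illustrative choices rather than values taken from the cited papers.

```python
import numpy as np

def random_erase(image, p=0.5, area_range=(0.02, 0.2), aspect_range=(0.3, 3.3)):
    """With probability p, overwrite a random rectangle of an HxWxC float image
    with random values, simulating occlusion (random erasing)."""
    if np.random.rand() > p:
        return image
    h, w, c = image.shape
    for _ in range(10):                      # retry until a rectangle fits
        area = np.random.uniform(*area_range) * h * w
        aspect = np.random.uniform(*aspect_range)
        eh, ew = int(np.sqrt(area * aspect)), int(np.sqrt(area / aspect))
        if 0 < eh < h and 0 < ew < w:
            top = np.random.randint(0, h - eh)
            left = np.random.randint(0, w - ew)
            erased = image.copy()
            erased[top:top + eh, left:left + ew, :] = np.random.uniform(size=(eh, ew, c))
            return erased
    return image                             # give up if no rectangle fitted
```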

3 Methodology

The goal of this research is to propose a model for classifying pixels in medium resolution satellite images of Nigeria into five land cover classes for land cover mapping, using a multi-phase learning approach. First, we fine-tuned and retrained a pretrained ResNet-50 using EuroSAT, a large-scale medium resolution dataset. We then integrated augmentation and ensemble learning techniques into the model to create a model able to generalize to unseen NigSAT data. To test the performance of the model, we created image patches from satellite images of cities in Nigeria (NigSAT). We describe the data and the preprocessing steps in Sect. 3.1, the CNN model architecture choices in Sect. 3.2, and the methodology we follow to train, validate and test our models in Sect. 3.3.

3.1 Dataset

Two datasets were used in our study.

3.1.1 EuroSAT Dataset

EuroSAT, a benchmark dataset for land cover classification, was used for training, validating and testing our model. Introduced by Patrick Helber et al. [6], EuroSAT consists of labeled, geo-referenced Sentinel-2 image patches of 64 × 64 pixels taken from 34 European countries. The dataset originally consists of 10 land use and land cover
(LULC) classes, with between 2,000 and 3,000 images per class and 27,000 images in total. We reclassified similar land cover classes from the EuroSAT dataset to suit our land cover classes of interest, as shown in Table 1.

Table 1. Land cover class reclassification

Land cover class used in EuroSAT | New re-classified land cover class | Class label | Total number after reclassification
Highway | Bare land | 1 | 2500
Residential, industrial | Built-up | 2 | 5500
Annual crop, herbaceous vegetation, pasture, permanent crop | Farm land | 3 | 10500
Forest | Forest | 4 | 3000
Rivers, lakes, sea | Water bodies | 5 | 5500

To prevent class imbalance among the different land cover classes, the smaller classes were augmented, as shown in Table 2.

Table 2. Dataset statistics for training, validation and testing after data augmentation (split 80%, 10%, 10%)

Classes | Before augmentation | After augmentation | Training | Validation | Testing
Bare land | 2500 | 10502 | 8401 | 1050 | 1051
Built-up | 5500 | 10483 | 8385 | 1048 | 1050
Farmland | 10500 | 10501 | 8401 | 1050 | 1050
Forest | 3000 | 10501 | 8400 | 1050 | 1051
Water bodies | 5500 | 10503 | 8401 | 1050 | 1052
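A minimal sketch of the class remapping and the 80/10/10 split summarised in Tables 1 and 2 is given below. The dictionary keys follow the standard EuroSAT folder names, and the split routine is only one plausible implementation, since the paper does not describe the exact procedure used.

```python
import random

# EuroSAT class -> reclassified land cover class (Table 1)
REMAP = {
    "Highway": "Bare land",
    "Residential": "Built-up", "Industrial": "Built-up",
    "AnnualCrop": "Farm land", "HerbaceousVegetation": "Farm land",
    "Pasture": "Farm land", "PermanentCrop": "Farm land",
    "Forest": "Forest",
    "River": "Water bodies", "SeaLake": "Water bodies",
}

def split_80_10_10(image_paths, seed=42):
    """Shuffle a list of patch paths and split it into train/val/test (Table 2)."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_train, n_val = int(0.8 * len(paths)), int(0.1 * len(paths))
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```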

3.1.2 NigSAT Dataset

To evaluate and test the ability of our model to classify NigSAT images correctly, we generated image patches from a satellite image of Abuja, Nigeria. The methodology is described below.

• Download: We downloaded medium resolution Sentinel-2A satellite images of Abuja from the Copernicus Open Access Hub (https://scihub.copernicus.eu/dhus/#/home). The Sentinel-2 image was acquired in February 2020. The considered tile covers the six area councils of Abuja, Nigeria (Abaji, Abuja Municipal, Bwari, Gwagwalada, Kuje and Kwali).


• Image processing/preparation: Sentinel-2 captures multispectral imagery spanning 13 bands at three different resolutions (60 m, 20 m and 10 m). We decided to use the bands with 10 m resolution (bands 2, 3, 4 and 8), since the 10 m imagery provides the greatest level of detail available for medium resolution data. A color composite was created by combining three of these raster bands in ArcGIS version 10.5, as shown in Fig. 1.

Fig. 1. Satellite image of Abuja, Nigeria

• Image enhancement: The image was brightened to bring out clearer and more relevant image detail.
• Extracting and labeling patches: This step groups pixels with common spectral characteristics into a given category. We classified the images into the five land cover classes (Farm land, Built-up, Bare land, Water bodies and Forest). A total of 25 image patches were extracted, five for each class. The extracted features take the form of 64 × 64 image patches in GeoTIFF format. We resized all the image patches to the input dimension required by the ResNet-50 model, which is 224 × 224. Samples from the two datasets are shown in Fig. 2.
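The resizing step can be sketched as follows. It assumes the extracted patches can be read as three-band RGB images (plain TIFFs are readable with Pillow, while geo-aware GeoTIFFs may require a library such as rasterio); this is an assumption rather than the authors' stated tooling.

```python
import numpy as np
from PIL import Image

def load_patch(path, target_size=(224, 224)):
    """Read a 64x64 land cover patch and resize it to the ResNet-50 input size."""
    img = Image.open(path).convert("RGB")          # keep three bands
    img = img.resize(target_size, Image.BILINEAR)  # upsample 64x64 -> 224x224
    return np.asarray(img, dtype=np.float32) / 255.0
```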

Fig. 2. EuroSAT and NigSAT image patches for the five classes: Bare land, Built-up, Farmland, Forest and Water bodies

3.2 The Model Framework

Our model framework is based on Transfer Learning, Augmentation and Ensemble Learning, as shown in Fig. 3.

Fig. 3. Model framework: import a pretrained ResNet-50 model → fine-tune the model → train the model using EuroSAT data → re-train the model using three different data subsets and save the three model weights separately → combine the different model weights (data ensemble) by averaging → classify new images

We first import a pretrained ResNet-50 model, a 50-layer residual network variant designed by Microsoft researchers. The most important characteristics to consider when choosing a pretrained model are network accuracy, speed and size. After the celebrated victory of AlexNet, ResNet (short for Residual Network) was one of the most groundbreaking models in the computer vision and deep learning community; see [27] for more details on the structure and workings of a ResNet model. Considering our data, choosing a deeper network was also crucial for extracting more features.


Our model architecture consists of the base ResNet-50 model with the top layers replaced by two densely connected layers (of size 1024 and 5, respectively), separated by a 0.5 (50%) dropout layer, which zeroes out the activation values of randomly chosen neurons during training and makes the model learn more robust features. The model architecture is represented in Fig. 4.

Fig. 4. Model architecture
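A minimal Keras sketch of the architecture in Fig. 4 (ResNet-50 base, a 1024-unit dense layer, 50% dropout and a 5-way softmax output) is given below. Details such as the pooling layer and which base layers are frozen are not fully specified in the text and are assumptions here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_model(num_classes=5, input_shape=(224, 224, 3)):
    """ResNet-50 base with the classification head described in Sect. 3.2."""
    base = ResNet50(weights="imagenet", include_top=False,
                    pooling="avg", input_shape=input_shape)
    x = layers.Dense(1024, activation="relu")(base.output)
    x = layers.Dropout(0.5)(x)           # randomly zero out 50% of activations
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```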

3.3 Training the CNN Model

The training phase is where the network "learns" from the data it is fed. First, we trained the ResNet-50 architecture, pre-loaded with ImageNet weights, on the reclassified EuroSAT dataset over 10 epochs. Training was done using RMSprop, a gradient-based optimizer, at a learning rate of 1e−4.

To further improve the performance and robustness of the trained model on the NigSAT dataset, we used boosting, a technique from ensemble learning. Ensemble learning allows us to train multiple models instead of a single model and then combine them to make better predictions. We re-trained the model weights saved above with three different subsets of data, train_datagen1, train_datagen2 and train_datagen3, to generate three separate sets of model weights. The three subsets were generated from the original dataset using different data augmentation transformations from the Keras and TensorFlow libraries. The first augmentation randomly applies traditional physical transformations such as rescaling, rotations, flipping and zooming; the second uses photometric (color space) transformations such as hue, saturation, brightness and contrast randomization; the last concentrates on random erasing, which addresses the problem of occlusion by masking different portions of the image with rectangular regions of arbitrary size, thus generating training images with various levels of occlusion. Finally, the weights generated from the different trained models were saved and combined, and the combined model was used to test the NigSAT dataset (Figs. 5 and 6).
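The three augmentation policies and the weight-level combination can be sketched as follows; the specific transformation parameters are illustrative, and averaging the saved weights layer by layer is one straightforward reading of "combine the different model weights".

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Policy 1 (train_datagen1): geometric/physical transformations
train_datagen1 = ImageDataGenerator(rescale=1.0 / 255, rotation_range=90,
                                    horizontal_flip=True, vertical_flip=True,
                                    zoom_range=0.2)
# Policy 2 (train_datagen2): photometric transformations
train_datagen2 = ImageDataGenerator(rescale=1.0 / 255,
                                    brightness_range=(0.7, 1.3),
                                    channel_shift_range=30.0)
# Policy 3 (train_datagen3) would supply a random-erasing function (see the
# sketch in Sect. 2.3) through the preprocessing_function argument.

def average_model_weights(trained_models):
    """Combine models trained on the three augmented subsets by averaging
    their weight tensors layer by layer."""
    weight_sets = [m.get_weights() for m in trained_models]
    averaged = [np.mean(layer_weights, axis=0) for layer_weights in zip(*weight_sets)]
    merged = trained_models[0]
    merged.set_weights(averaged)
    return merged
```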


Fig. 5. Examples of images generated from train_datagen1.

Fig. 6. Data augmentation: different color transformations of Forest in train_datagen2. (Color figure online)

4 Results and Analysis

This section analyzes the results obtained from each experiment in our study. We performed four experiments, as follows.

4.1 To evaluate the effectiveness of the pre-trained ResNet-50 architecture, preloaded with ImageNet weights, on the reclassified EuroSAT dataset (in-domain testing). The training loss and accuracy were plotted, and the model was used to predict the validation and testing datasets. We observed an accuracy of about 95% on the validation set and 96% on the testing set. The predictions on the test dataset are presented in the confusion matrix and classification report shown in Tables 3 and 4, respectively; the performance measures used are precision, recall and F1-score.
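The per-class figures in Tables 3 and 4 correspond to standard scikit-learn evaluation calls, sketched below; the variable names are placeholders.

```python
from sklearn.metrics import classification_report, confusion_matrix

CLASSES = ["Bare land", "Built-up", "Farmland", "Forest", "Water bodies"]

def evaluate(y_true, y_pred):
    """Print the confusion matrix and precision/recall/F1 per class for
    integer-encoded true and predicted labels."""
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=CLASSES))
```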

Table 3. Confusion matrix for the EuroSAT test set

 | Bare land | Built-up | Farmland | Forest | Water bodies
Bare land | 991 | 24 | 15 | 5 | 16
Built-up | 4 | 1038 | 4 | 2 | 1
Farmland | 7 | 1 | 1032 | 1 | 9
Forest | 3 | 1 | 7 | 1014 | 26
Water bodies | 22 | 2 | 11 | 11 | 977

Table 4. Classification result for the EuroSAT test set

 | Precision | Recall | F1-score | Support
Bare land | 0.96 | 0.94 | 0.95 | 1051
Built-up | 0.97 | 0.99 | 0.98 | 1049
Farmland | 0.97 | 0.98 | 0.97 | 1050
Forest | 0.95 | 0.96 | 0.96 | 1051
Water bodies | 0.95 | 0.93 | 0.94 | 1052
Accuracy | | | 0.96 | 5253
Macro avg. | 0.96 | 0.96 | 0.96 | 5253
Weighted avg. | 0.96 | 0.96 | 0.96 | 5253

4.2 Testing the Ability of the Model from 4.1 to Classify NigSAT Images (Out-of-Domain Testing). The model was also used to predict the NigSAT data and, as expected, we observed a poor accuracy of 48%, a drop from 96% to 48%. We believe that this poor accuracy was caused by the variation between the EuroSAT and NigSAT datasets (Tables 5 and 6).

Table 5. Confusion matrix for the NigSAT test set

 | Bare land | Built-up | Farmland | Forest | Water bodies
Bare land | 0 | 2 | 3 | 0 | 0
Built-up | 0 | 4 | 1 | 0 | 0
Farmland | 0 | 2 | 2 | 1 | 0
Forest | 0 | 0 | 0 | 5 | 0
Water bodies | 0 | 2 | 1 | 1 | 1

Table 6. Classification result for the NigSAT test set

 | Precision | Recall | F1-score | Support
Bare land | 0.00 | 0.00 | 0.00 | 5
Built-up | 0.40 | 0.80 | 0.53 | 5
Farmland | 0.29 | 0.40 | 0.33 | 5
Forest | 0.71 | 1.00 | 0.83 | 5
Water bodies | 1.00 | 0.20 | 0.33 | 5
Accuracy | | | 0.48 | 25
Macro avg. | 0.48 | 0.48 | 0.41 | 25
Weighted avg. | 0.48 | 0.48 | 0.41 | 25

4.3 With our problem defined, we sought ways to improve the generalization ability of the model so as to obtain better prediction accuracy. We re-trained the model above with three different subsets of data, train_datagen1, train_datagen2 and train_datagen3, to generate three separate sets of model weights. A careful look at the recall and precision scores achieved by these weights shows that, depending on the augmentation strategy used, each model was able to predict a particular land cover class better than the rest.

4.4 Finally, using an ensemble learning technique, we combined the weights learned by the models and used the result to test the NigSAT dataset. The results of the predictions are presented in the confusion matrix and classification report shown in Tables 7 and 8, respectively. We observed an accuracy of about 80%, a gain of 32 percentage points.

Table 7. Confusion matrix for the NigSAT test set

 | Bare land | Built-up | Farmland | Forest | Water bodies
Bare land | 4 | 0 | 0 | 0 | 1
Built-up | 0 | 4 | 0 | 0 | 1
Farmland | 0 | 1 | 4 | 0 | 0
Forest | 0 | 0 | 0 | 5 | 0
Water bodies | 1 | 0 | 1 | 0 | 3

We implemented the model in Python using the Keras and TensorFlow deep learning libraries, which provide a complete deep learning toolkit for training and testing models. All experiments were performed on an HP Envy 17t (10th Gen Intel i7-1065G7, 16 GB DDR4, 1 TB HD + 256 GB NVMe SSD, NVIDIA GeForce 4 GB GDDR5).

Table 8. Classification result for the NigSAT test set

 | Precision | Recall | F1-score | Support
Bare land | 0.80 | 0.80 | 0.80 | 5
Built-up | 0.80 | 0.80 | 0.80 | 5
Farmland | 0.80 | 0.80 | 0.80 | 5
Forest | 1.00 | 1.00 | 1.00 | 5
Water bodies | 0.60 | 0.60 | 0.60 | 5
Accuracy | | | 0.80 | 25
Macro avg. | 0.80 | 0.80 | 0.80 | 25
Weighted avg. | 0.80 | 0.80 | 0.80 | 25

4.5 Discussions

Our research work was able to show the following:

a. Deep networks trained for image recognition on one task (ImageNet) can be efficiently transferred and used for satellite image classification. This was demonstrated on the EuroSAT dataset, where above 96% mean accuracy was achieved using a ResNet-50 model pretrained on ImageNet.
b. There is high variability among satellite images of different locations; this was demonstrated when the model trained on EuroSAT data and tested on NigSAT data yielded a drop in accuracy from 96% to 48%.
c. Augmentation techniques can produce more varied training data and thus a higher generalization ability.
d. Ensemble techniques can be successfully applied to deep CNN models for satellite image classification.

5 Conclusion

The lack of training data in developing countries is a major obstacle to harnessing the potential of DL in satellite image classification. In this preliminary research work, we designed and implemented a technique based on deep convolutional neural networks for medium resolution satellite image classification. The aim of our research is to improve the accuracy and speed of satellite image classification for land cover mapping in Nigeria. The model is based on the integration of three important techniques: transfer learning, augmentation and ensemble learning. Five land cover classes (Forest, Bare land, Farm land, Built-up and Water bodies) were classified. We used EuroSAT [6], a benchmark dataset for land cover classification, for training, testing and evaluating our model. Additionally, we detailed the method for creating a
benchmark dataset from satellite images of Abuja for evaluating our model. Although the NigSAT dataset is small, it is the very first indigenous dataset that can be used to test such a model in Nigeria, and the method can be adopted to create more datasets. The final classification results show that our model achieves up to 80% accuracy. The major challenges in classifying these NigSAT medium resolution satellite images are the non-availability of large-scale high-resolution training data and the excessive amount of background noise, including partial images of trees or cars, present in the NigSAT data but absent from the EuroSAT dataset. The study also illustrates the need for indigenous large-scale medium resolution satellite image patches to be produced for further research. In future work, we aim to experiment with training a model solely on NigSAT data. Many aspects of the proposed method have not yet been fully addressed and explored, and further investigations are still needed.

References

1. Karim, S., Zhang, Y., Asif, M.R., Ali, S.: Comparative analysis of feature extraction methods in satellite imagery. J. Appl. Remote Sens. 11(04), 1 (2017)
2. Zhang, C.: Deep Learning for Land Cover and Land Use Classification (2018)
3. Carranza-García, M., García-Gutiérrez, J., Riquelme, J.C.: A framework for evaluating land use and land cover classification using convolutional neural networks. Remote Sens. 11(3), 274 (2019)
4. Xie, M., Jean, N., Burke, M., Lobell, D., Ermon, S.: Transfer learning from deep features for remote sensing and poverty mapping (2014)
5. Yang, Y., Newsam, S.: Bag-of-visual-words and spatial extensions for land-use classification (2014)
6. Helber, P., Bischke, B., Dengel, A., Borth, D.: Introducing EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. Int. Geosci. Remote Sens. Symp. 2018, 204–207 (2018)
7. Song, J., Gao, S., Zhu, Y., Ma, C.: A survey of remote sensing image classification based on CNNs. Big Earth Data 3(3), 232–254 (2019)
8. Neware, R., Khan, A.: Survey on classification techniques used in remote sensing for satellite images. In: Proceedings of the 2nd International Conference on Electronics, Communication and Aerospace Technology (ICECA 2018), pp. 1860–1863 (2018)
9. Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architectures of deep convolutional neural networks, pp. 1–62 (2019)
10. Song, H., Kim, Y., Kim, Y.: A patch-based light convolutional neural network for land-cover mapping using Landsat-8 images, pp. 1–19 (2019)
11. Abdelhack, M.: An open-source tool for hyperspectral image augmentation in TensorFlow, pp. 1–4 (2020)
12. Zhang, W., Tang, P., Zhao, L.: Remote sensing image scene classification using CNN-CapsNet. Remote Sens. 11(5), 494 (2019)
13. Paoletti, M.E., Haut, J.M., Plaza, J., Plaza, A.: A new deep convolutional neural network for fast hyperspectral image classification. ISPRS J. Photogramm. Remote Sens. 145, 120–147 (2018)
14. Ball, J.E., Anderson, D.T., Chan, C.S.: Comprehensive survey of deep learning in remote sensing: theories, tools, and challenges for the community. 11(4) (2020)
15. De Lima, R.P., Marfurt, K.: Convolutional neural network for remote-sensing scene classification: transfer learning analysis (2020)
16. Mahdianpari, M., et al.: Very deep convolutional neural networks for complex land cover mapping using multispectral remote sensing imagery (2018)
17. Chen, Z., Zhang, T., Ouyang, C.: End-to-end airplane detection using transfer learning in remote sensing images. Remote Sens. 10(1), 1–15 (2018)
18. Utilization of deep convolutional neural networks for remote …. IntechOpen open access chapter
19. Taylor, L., Nitschke, G.: Improving deep learning using generic data augmentation (2017)
20. Ore, D.I.N.: Deep learning in remote sensing scene classification: a data augmentation enhanced CNN framework (2018)
21. Stivaktakis, R., Tsagkatakis, G., Tsakalides, P.: Deep learning for multilabel land cover scene categorization using data augmentation. IEEE Geosci. Remote Sens. Lett. 16(7), 1031–1035 (2019)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks (2012)
23. Haut, J.M., Paoletti, M.E., Plaza, J., Plaza, A., Li, J.: Hyperspectral image classification using random occlusion data augmentation. IEEE Geosci. Remote Sens. Lett. 16(11), 1751–1755 (2019)
24. Bowles, C., et al.: GAN augmentation: augmenting training data using generative adversarial networks
25. Han, D., Liu, Q., Fan, W.: A new image classification method using CNN transfer learning and web data augmentation. Expert Syst. Appl. 95, 43–56 (2018)
26. Scott, G.J., Marcum, R.A., Davis, C.H., Nivin, T.W.: Fusion of deep convolutional neural networks for land cover classification of high-resolution imagery. IEEE Geosci. Remote Sens. Lett. 14(9), 1638–1642 (2017)
27. Mikami, H., Suganuma, H., U-chupala, P., et al.: ImageNet/ResNet-50 training in 224 seconds (2018)

Analysis of Prediction and Clustering of Electricity Consumption in the Province of Imbabura-Ecuador for the Planning of Energy Resources

Jhonatan F. Rosero-Garcia(B), Edilberto A. Llanes-Cedeño, Ricardo P. Arciniega-Rocha, and Jesús López-Villada

Universidad Internacional SEK, Quito-Ecuador
Instituto Superior Tecnológico 17 de Julio, Urcuquí, Ecuador
[email protected]

Abstract. The electrical planning of a country is a growing need for its economic development. However, this analysis is complex to carry out because of the different consumption habits of the participants in the electricity market. In Imbabura, Ecuador, residential electricity consumption accounts for the greatest electricity demand. For this reason, a prediction and clustering analysis by municipality is presented, using machine learning algorithms, in order to determine consumer trends and present reports for proper electrical planning. As relevant results, support vector machine and random forest models proved to be suitable for this task, with a prediction error of less than 10%. For its part, the k-means algorithm was able to group four types of electricity consumption while representing 98% of the data variability.

Keywords: Electrical consumption · Electrical analysis · Regression models · Clustering

1 Introduction

One of the main components of the economic development of a country is its capacity to produce electricity reliably and at the lowest possible cost. For this, it is necessary to determine the current and future behavior of electricity demand for proper planning, which must consider political, economic, social, environmental and technological variables [14]. These analyses must be undertaken by government entities and the private sector in order to implement programs and plans that have a direct impact on electricity demand. This must be closely related to the demographics of each country, in conjunction with the different renewable or non-renewable natural resources to be used as sources of electricity production [2]. In this sense, forecasting electricity consumption allows all participants in this market to cooperate effectively with each other.
In this way, the industrial sector can obtain warnings about excessive consumption, and electricity supply companies can produce electricity optimally while avoiding overloading of generation equipment [18]. In the residential sector, the analysis of electricity demand is directly related to the expansion of urban areas within a city; with it, planning the installation of transformers and electric poles becomes a technical task. However, this electrical forecasting process is a complicated task to carry out because of the large amount of data obtained from users and the many variables involved [9]. As a consequence, the electrical system operates with a certain uncertainty that affects the cost of its production, taking into consideration that electrical energy produced on a large scale cannot be stored [7]. In addition, electricity consumption is variable because of its high demand at different times.

Another point to consider is the scarce knowledge of electricity consumption trends in the different sectors of a country. This is important to avoid an over-adjustment of policies and projects in the electricity sector: when groups of electricity consumption are established, planning does not focus on particularities and seeks generality in future electricity generation plans. Under this criterion, the electricity demand curve allows the real demand to be compared with the expected one. This process must be carried out using complex adaptive algorithms that permanently collect information [19], and the analysis must take into account the demographics of each country and the type of consumer.

Ecuador has a great electricity generation capacity thanks to the different natural resources it possesses, and its electricity consumption shows constant annual growth (4.5% in 2019 [1]). This results in the need to implement decision support systems that can forecast electricity consumption as closely as possible to the real values and, where applicable, recommend actions on changes in consumer habits. Large data repositories are available for this analysis, and the process can only be carried out under robust machine learning criteria, hosted on servers with high computational resources [20]. Therefore, an exhaustive study of the available information is proposed in order to find consumption trends in the different provinces of the country and to forecast consumption for each one of them [21]. With this, they can be grouped according to their main characteristics and relationships can be established between them.

This work focuses on forecasting electricity consumption by municipality within the province of Imbabura, Ecuador, in relation to the different customer consumption characteristics collected by the companies that supply the electricity service. To do this, the available information is collected from the Agency for the Regulation and Control of Energy and Non-Renewable Natural Resources, attached to the Ministry of Electricity of Ecuador. With this, a data analysis scheme is built with different criteria to determine the best solution to be implemented. It begins with a data cleaning process; subsequently, a comparison of multivariate regression algorithms is carried out to determine the model that best fits the data and has the best capacity to predict consumption. In addition, the municipalities are grouped to establish consumption habits.
All this is presented in a graphical interface using a decision support tool, for adequate decision-making by the country's regulatory entities. As relevant results, the support vector machine and random forest models proved to be adequate for this task, with a prediction error of less than 10%. For its part, the k-means algorithm was able to group four types of electricity consumption while representing 98% of the variability of the data.

The rest of the document is structured as follows: Sect. 2 presents the works related to the issue raised; materials and methods are shown in Sect. 3; Sect. 4 presents the results; finally, the conclusions and future work are given in Sect. 5.

2 Related Works

This topic has been analyzed in works such as [8], which presents an analysis of household electricity consumption data stored on a cloud server; in this case, the authors do not address a specific sector and focus on residential consumption. The work in [15] carries out an architectural study of buildings and their influence on electricity consumption based on user trends. The authors in [5,17] make use of mathematical models and simulations in different applications of electrical power systems. For its part, [3] uses time series and clustering to predict residential electricity consumption. Recently, [12] published a dataset on electricity consumption in Colombia and the different characteristics of consumers. In Ecuador, works such as [11] study electricity demand in general terms and its relationship with the CO2 emissions that its production entails; [16] presents a solution to predict electricity consumption in public institutions; [6] uses neural networks to predict electricity consumption in the different cantons of the province of Pichincha, Ecuador; and finally [10] carries out a clustering study by consumer characteristics in the same province as the previous work.

The aforementioned works present adequate solutions for electrical prediction and grouping by consumer. However, some problems remain open, such as the combination of various machine learning criteria, the formal presentation of results in interactive interfaces, and the relationship with national electricity generation systems, among others. For these reasons, the present work covers these criteria by presenting mathematical models for extracting the intrinsic knowledge present in the dataset of electricity consumption parameters in the province of Imbabura, Ecuador. The presentation of results in a decision support tool seeks to integrate academic sectors with public entities for the generation of state policies based on reliable information.

3 Materials and Methods

This section describes, on the one hand, the available database and its main characteristics and, on the other hand, the proposed data analysis scheme together with the prediction and clustering criteria.

3.1 Database Description

The main economic activity of the province of Imbabura is the production of wooden articles and services for medium-sized industries [13]. Regarding electricity production, the province has 11 medium-sized generation plants that provide an effective power of 115 MW, with a total of 130,000 subscribers in the residential sector and an annual consumption of 150 GWh. For this reason, Ecuador has an interconnected electrical system to supply the missing energy to the provinces through its emblematic projects, such as the Coca Codo Sinclair hydroelectric plant [1,4]. The information acquired through the open data repositories provided by Ecuador made it possible to obtain the consumption reports of residential customers from 2017 to 2020; this type of consumer was chosen since it is the largest group in the province. With this, we have a multivariate data matrix of the form Y ∈ R^{m×n}, where m is the number of instances and n is the number of variables. In this case m = 3500 and n = 8, the variables being: canton, municipality, type of electrical equipment (220 V electronic devices), number of customers by geographic area, billed energy, incremental consumption, residential consumption and billing.

3.2 Proposed Data Analysis

The proposed data analysis scheme focuses on three main components. The first is transformation: the data must be cleaned and standardized, because accented words in Spanish are sometimes stored with different encodings, generating spurious duplicate attribute values, and because empty or null records would cause errors in the prediction and clustering models (Sect. 3.3). The second component is the choice among different multivariate regression models to determine the most appropriate one; these are described in Sect. 3.4. Finally, the clustering algorithm selected is k-means, owing to its effectiveness for this task; however, the number of clusters must be chosen properly for the proposed database [10]. The whole process is shown in Fig. 1.

3.3 Transformation

The matrix Y ∈ R^{m×n} has three main drawbacks. First, it contains null data and empty cells; these records were removed. Second, some categorical variables contain accented characters, which causes the same category to take different values because of differing character encodings; a search-and-replace procedure was applied with an ETL tool to remove these characters and standardize the variables. Finally, the categorical variables must be encoded in order to train a multivariate regression model. For this, the One-Hot-Encoding method from the sklearn library of the Python programming environment was used. With this method, each category value is converted into a new column and a value of 1 or 0 (true or false) is assigned to that column, which avoids misinterpreting numeric codes as having some kind of hierarchy.

Fig. 1. Data analysis scheme

To avoid saturating the regression model with too many encoded variables, a new matrix X ∈ R^{p×s} was created, where p = 3450 after eliminating the null records and s = 12 after removing the municipality as a feature and adding one column per canton: IBARRA, ANTONIO ANTE, COTACACHI, OTAVALO, PIMAMPIRO and URCUQUÍ. Furthermore, the target variable (the total invoiced value) was removed from the model inputs and stored in a separate vector. For the cluster analysis, all categorical variables were removed in order to establish similarities in consumption characteristics; for this, a new matrix Z ∈ R^{p×t} was created, where p = 3450 (the number of records is unchanged) and t = 6 after eliminating the categorical variables.
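A minimal sketch of the accent clean-up and One-Hot-Encoding step described above, using pandas and scikit-learn; the file and column names are hypothetical, since the exact schema of the source files is not listed.

```python
import unicodedata
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def strip_accents(text):
    """Remove tildes/accents so identical categories share a single spelling."""
    return "".join(ch for ch in unicodedata.normalize("NFD", str(text))
                   if unicodedata.category(ch) != "Mn")

df = pd.read_csv("consumo_imbabura.csv")        # hypothetical file name
df = df.dropna()                                # drop null / empty records
df["canton"] = df["canton"].map(strip_accents)  # e.g. "URCUQUÍ" -> "URCUQUI"

encoder = OneHotEncoder(handle_unknown="ignore")
canton_columns = encoder.fit_transform(df[["canton"]]).toarray()  # one 0/1 column per canton
```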

3.4 Regression Models

To generate a model that predicts the total cost of electricity consumption, different multivariate linear and non-linear regression criteria can be used. The most important ones found in the literature are based on: (i) the covariance matrix, (ii) similarity functions, and (iii) frequency tables, among others. The first criterion (covariance matrix) is a simple linear model (model 1) given by:

y = b_0 + \sum_{i=0}^{s} x_i b_i \qquad (1)

where y is the dependent variable, the x_i are the independent variables (with s = 12) and the b_i are the coefficients that weight the influence of each variable in the model.


To improve the fit of a linear model, each attribute of the matrix can also be raised to higher powers, yielding a multivariate polynomial regression (model 2):

y = b_0 + \sum_{j=0}^{r} \sum_{i=0}^{s} x_i^{j} b_i \qquad (2)

where r = 2 is the degree to which the linear model has been raised. Another criterion is based on distances, through the k-NN algorithm. k-NN regression uses the usual Euclidean distance function, which has the form (model 3):

\sqrt{\sum_{i=0}^{s} (x_i - y_i)^2} \qquad (3)

Support vector machines (SVM) can be used as non-linear regression methods, since the SVM establishes tolerance margins to minimize the error, identifying the hyper-plane that maximizes the prediction margin. This criterion is given by Eq. 4, which can use different kernel functions:

y = b_0 + \sum_{i=0}^{s} (x_i - x_i^{*}) \cdot \mathrm{kernel} \qquad (4)

Here the term x_i^{*} is the variable mapped to a hyper-plane by a kernel function; model 4 uses a radial kernel and model 5 uses a sigmoid kernel. Finally, algorithms based on frequency tables, such as decision trees and random forests, allow a highly non-linear regression model (model 6), based on Eq. 5:

y = \frac{1}{B} \sum_{i=0}^{s} f_i \, x \qquad (5)

where B = 10 is the number of decision trees used and f_i is the resulting forecast frequency.
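The six regression criteria can be assembled with scikit-learn as sketched below; apart from the polynomial degree r = 2 and the B = 10 trees stated in the text, the hyper-parameters shown are illustrative defaults rather than the authors' settings.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

models = {
    "model 1 (linear)": LinearRegression(),
    "model 2 (degree-2 polynomial)": make_pipeline(PolynomialFeatures(degree=2),
                                                   LinearRegression()),
    "model 3 (k-NN, Euclidean distance)": KNeighborsRegressor(n_neighbors=5),
    "model 4 (SVR, radial kernel)": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "model 5 (SVR, sigmoid kernel)": make_pipeline(StandardScaler(), SVR(kernel="sigmoid")),
    "model 6 (random forest, B = 10)": RandomForestRegressor(n_estimators=10),
}
# Each model is then fitted on a training split of X and evaluated on the
# held-out split with the error measures of Sect. 4.1.
```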

3.5 Clustering

The k-means method aims to partition a set of n observations into k groups, where each observation belongs to the group whose mean is closest. However, the random initialization of the centroids can produce different groupings, so the value of k must be chosen carefully to group the data with the greatest similarity in their characteristics. The k-means++ initialization mitigates this problem. Given a set of observations (x_1, x_2, ..., x_n), where each observation is a d-dimensional real vector, k-means clustering aims to divide the n observations into k (≤ n) sets S = {S_1, S_2, ..., S_k} so as to minimize the within-cluster sum of squares (WCSS), i.e., the variance. This is expressed in Eq. (6) [10].

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 = \arg\min_{S} \sum_{i=1}^{k} |S_i| \, \mathrm{Var}(S_i) \qquad (6)

where µ_i is the mean of the points in S_i. Graphically, the variance can be observed for different values of k. The value of k should be chosen at the point beyond which the within-cluster variability no longer decreases substantially. In this case k = 4, as shown in Fig. 2.

Fig. 2. k analysis by WCSS technique
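The WCSS (elbow) analysis of Fig. 2 can be reproduced with a sketch like the following, assuming the numeric matrix Z from Sect. 3.3 is available as a NumPy array or DataFrame; scikit-learn's KMeans uses the k-means++ initialization shown here.

```python
# Sketch of the WCSS (elbow) analysis used to select k; inertia_ is the
# within-cluster sum of squares minimized in Eq. (6).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_analysis(Z, k_max=10):
    wcss = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
        km.fit(Z)
        wcss.append(km.inertia_)
    plt.plot(range(1, k_max + 1), wcss, marker="o")
    plt.xlabel("k")
    plt.ylabel("WCSS")
    plt.show()
    return wcss

# After inspecting the curve (k = 4 in this study), the final labels follow from:
# labels = KMeans(n_clusters=4, init="k-means++", n_init=10).fit_predict(Z)
```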

4 Results

The results are presented in terms of the regression models, the clustering, and the graphical interface.

4.1 Regression Models

To select the regression model that best represents the data set of the matrix X ∈ R^{p×s}, forecast error tests were performed. The data matrix was split randomly into training and test sets 10 times, training the model on one part and testing its predictions against the other. The metrics used are the cumulative sum of forecast errors (CFE), the mean absolute deviation (MAD), the mean square error (MSE) and the deviation in percentage terms (MAE). Table 1 shows the results obtained by each proposed model with a test set of 750 records. As can be seen, the variability (MSE and MAE) is very high in models 1, 2, 3 and 4, whereas models 5 and 6 show less variability and a forecast error below 10%, which is very acceptable considering the nature of the data.
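A sketch of the evaluation loop is given below. The error definitions follow the textual description and the signs reported in Table 1 and are assumptions, not the authors' exact formulas.

```python
# Hedged sketch of the repeated random train/test evaluation; the error
# definitions (CFE, MAD, MSE, MAE in %) are assumptions based on the text.
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(model, X, y, n_splits=10, test_size=750):
    cfe, mad, mse, mae_pct = [], [], [], []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model.fit(X_tr, y_tr)
        err = np.asarray(y_te) - model.predict(X_te)
        cfe.append(err.sum())                                # cumulative forecast error
        mad.append(err.mean())                               # mean (signed) deviation
        mse.append((err ** 2).mean())                        # mean square error
        mae_pct.append(100 * np.abs(err / np.asarray(y_te)).mean())  # % deviation
    return {name: float(np.mean(vals)) for name, vals in
            [("CFE", cfe), ("MAD", mad), ("MSE", mse), ("MAE_%", mae_pct)]}
```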

Table 1. Error type analysis in matrix X ∈ R^{p×s}

Errors type | CFE      | %CFE  | MAD     | MSE    | MAE
Model 1     | –83007.0 | ±7.6  | –111.72 | 152690 | 14%
Model 2     | –82117.9 | ±7.52 | –110.52 | 135300 | 12.43%
Model 3     | –83007.0 | ±7.6  | –111.72 | 152690 | 14%
Model 4     | –81037.0 | ±7.44 | –109.06 | 124017 | 11.39%
Model 5     | –61538.3 | ±5.65 | –82.82  | 106848 | 9.81%
Model 6     | –55041.5 | ±5.05 | –74.08  | 100914 | 9.27%

4.2 Clustering

With k = 4, the algorithm is run on the database to group the records by municipality from the matrix Z ∈ R^{p×t}. Four types of consumption are defined: EXCESSIVE, HIGH, MEDIUM and LOW. Figure 3 presents the clusters graphically, showing the affinity of the data and the distance between them. It confirms that k = 4 is adequate and that the clustering accounts for 98% of the variability in the data.

(a) Cluster Analysis by Total Energy Consumption

(b) Cluster Analysis by Number of Customers

Fig. 3. Cluster analysis variability

4.3 Interface

The complete data matrix has been loaded into the decision support tool along with the machine learning algorithms for generating reports. With this, personnel of public entities related to electricity generation can view and manage the results in a user-friendly way. This can be seen in Fig. 4, where the forecasts of models 5 and 6 are very adequate. In addition, this information can be viewed by canton or municipality for greater control and monitoring. On the other hand, using the two categorical variables and a treemap organized by clusters, all municipalities can be observed with respect

(a) Antonio Ante Municipalities Example Reports

(b) Imbabura Electric Consumption Prediction Models Reports

Fig. 4. Support decision interface about consumption predictions

to the group to which they belong and their predicted consumption per month. This information is very important for electrical planning in the province of Imbabura. This is shown in Fig. 5.

Fig. 5. Cluster Reports by Municipalities: *Excessive Consumption, *High consumption, *Normal Consumption, *Low consumption

Finally, using the geographic position of each municipality, each one can be colored according to the group assigned by the algorithm (EXCESSIVE, HIGH, MEDIUM or LOW). With this, each month the algorithm is re-run, changes in consumption habits can be observed and alerts can be raised. This is shown in Fig. 6.


Fig. 6. Cluster Reports: *Excessive consumption, *High consumption, *Normal consumption, *Low consumption

5 Conclusions and Future Works

The analysis of electricity consumption by means of machine learning algorithms provides robust data analysis for generating reports that support the planning and creation of electrical projects able to satisfy the increasing electricity demand. In this sense, non-linear regression models make it possible to train models with categorical variables, which are widely used to describe user characteristics. Support vector machines and random forests proved to be the most suitable for presenting trends in electricity consumption in the province of Imbabura, Ecuador, organized by cantons and municipalities. For its part, the k-means algorithm with k = 4 was adequate for determining consumption similarities by canton and municipality, making it possible to observe monthly changes in trends and to raise alerts about them. Finally, the reports in a decision support tool provide agile and efficient information for the electrical planning of a country. As future work, it is proposed to apply these same criteria to all provinces in order to have a national tool for predicting and clustering electricity consumption.

References

1. Ministerio de Electricidad y Energía Renovable – Ente rector del Sector Eléctrico Ecuatoriano
2. Dmitri, K., Maria, A., Anna, A.: Comparison of regression and neural network approaches to forecast daily power consumption. In: Proceedings - 2016 11th International Forum on Strategic Technology, IFOST 2016, pp. 247–250. Institute of Electrical and Electronics Engineers Inc., March 2017


3. Çetinkaya, Ü., Avcı, E., Bayindir, R.: Time series clustering analysis of energy consumption data. In: 2020 9th International Conference on Renewable Energy Research and Application (ICRERA), pp. 409–413 (2020)
4. Tello Urgilés, D.G.: Universidad Politécnica Salesiana Sede Cuenca, Facultad de Ingenierías, Carrera de Ingeniería Eléctrica. Technical report
5. Gong, F., Han, N., Li, D., Tian, S.: Trend analysis of building power consumption based on prophet algorithm. In: 2020 Asia Energy and Electrical Engineering Symposium (AEEES), pp. 1002–1006 (2020)
6. Guachimboza-Davalos, J.I., Llanes-Cedeño, E.A., Rubio-Aguiar, R.J., Peralta-Zurita, D.B., Núñez-Barrionuevo, O.F.: Prediction of monthly electricity consumption by cantons in Ecuador through neural networks: a case study. In: Botto-Tobar, M., Zamora, W., Larrea Plúa, J., Bazurto Roldan, J., Santamaría Philco, A. (eds.) Systems and Information Sciences. ICCIS 2020. Advances in Intelligent Systems and Computing, vol. 1273. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-59194-6_3
7. He, L., Song, Q., Shen, J.: K-NN numeric prediction using bagging and instance-relevant combination. In: Proceedings - 2nd International Symposium on Data, Privacy, and E-Commerce, ISDPE 2010, pp. 3–8 (2010)
8. Johnson, B.J., Starke, M.R., Abdelaziz, O.A., Jackson, R.K., Tolbert, L.M.: A method for modeling household occupant behavior to simulate residential energy consumption. In: IEEE PES Innovative Smart Grid Technologies Conference, ISGT 2014 (2014)
9. Liu, Y., Wang, W., Ghadimi, N.: Electricity load forecasting by an improved forecast engine for building level consumers. Energy 139, 18–30 (2017)
10. Núñez-Barrionuevo, O.F., et al.: Clustering analysis of electricity consumption of municipalities in the province of Pichincha-Ecuador using the k-means algorithm. In: Botto-Tobar, M., et al. (eds.) Systems and Information Sciences, pp. 187–195. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-59194-6_3
11. Narváez, R.P.: Factor de emisión de CO2 debido a la generación de electricidad en el Ecuador durante el periodo 2001–2014. Avances en Ciencias e Ingeniería 7(2) (2015)
12. Parraga-Alava, J., et al.: A data set for electric power consumption forecasting based on socio-demographic features: data from an area of southern Colombia. Data Brief 29, 105246 (2020)
13. Ministerio de Energía y Recursos no Renovables: Plan de mejoramiento de los sistemas de distribución de energía eléctrica (PMD) – Ministerio de Energía y Recursos Naturales no Renovables (2019)
14. Sauhats, A., Varfolomejeva, R., Lmkevics, O., Petrecenko, R., Kunickis, M., Balodis, M.: Analysis and prediction of electricity consumption using smart meter data. In: International Conference on Power Engineering, Energy and Electrical Drives, vol. 2015-September, pp. 17–22. IEEE Computer Society, September 2015
15. Rezaei, S., Sharghi, A., Motalebi, G.: A framework for analysis affecting behavioral factors of residential buildings' occupant in energy consumption. J. Sustain. Archit. Urban Design 5(2), 39–58 (2018)
16. Toapanta-Lema, A., et al.: Regression models comparison for efficiency in electricity consumption in Ecuadorian schools: a case of study. In: Botto-Tobar, M., Vizuete, M.Z., Torres-Carrión, P., Montes León, S., Guillermo, V.P., Durakovic, B. (eds.) Applied Technologies, pp. 363–371. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-42520-3_29


17. Wang, M., Kou, B., Zhao, X.: Analysis of energy consumption characteristics based on simulation and traction calculation model for the CRH electric motor train units. In: 2018 21st International Conference on Electrical Machines and Systems (ICEMS), pp. 2738–2743 (2018)
18. Ye, X., Wu, X., Guo, Y.: Real-time quality prediction of casting billet based on random forest algorithm. In: Proceedings of the 2018 IEEE International Conference on Progress in Informatics and Computing, PIC 2018, pp. 140–143. Institute of Electrical and Electronics Engineers Inc., May 2019
19. Yildiz, B., Bilbao, J.I., Dore, J., Sproul, A.B.: Recent advances in the analysis of residential electricity consumption and applications of smart meter data, December 2017
20. Zhang, X.M., Grolinger, K., Capretz, M.A., Seewald, L.: Forecasting residential energy consumption: single household perspective. In: Proceedings - 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, pp. 110–117. Institute of Electrical and Electronics Engineers Inc., January 2019
21. Zhao, H.X., Magoulès, F.: A review on the prediction of building energy consumption, August 2012

Street Owl. A Mobile App to Reduce Car Accidents Mohammad Jabrah , Ahmed Bankher , and Bahjat Fakieh(B) Information Systems Department, King Abdulaziz University, Jeddah, Saudi Arabia [email protected], [email protected]

Abstract. The steadily increasing rate of car accidents caused by the distraction of using smartphones while driving led this paper to propose a mobile application that would help to reduce car accidents by limiting smartphone functionality, while the car is moving, to basic needs only, such as calls or maps. The study started by analyzing several applications on the market. Then, a survey answered by 303 participants revealed the need for a dedicated mobile application to minimize smartphone distraction while driving. The results show some interesting information regarding the approximate rate of smartphone use while driving, the common reasons for using smartphones, and some critical reasons that lead drivers to avoid touching their mobile phones on the road. Ultimately, the paper proposes a mobile application based on the obtained results. Keywords: Accidents · Driving · Phone calls · Smart phones · Texting · Traffic

1 Introduction Traffic accidents are considered one of the top 10 causes of death worldwide [1]. In 2018, traffic injuries led to 1.35 million deaths [1]. In that same year in Saudi Arabia, according to the General Authority for Statistics, traffic accident counts reached a high of 352,464, resulting in 30,217 injuries and 6,025 deaths [2]. Out of the 352,464 accidents that took place that year, and according to the Saudi Standards, Metrology and Quality Org. (SASO), as well as their conducted studies and statistics, 161,242 traffic accidents were caused by the use of mobile phones while driving. [3]. These official numbers mean that 46% of the annual traffic accidents in Saudi Arabia were caused by drivers’ distraction with mobile phones while driving. This makes mobile phone use the main cause of traffic accidents leading to tragic consequences.

2 Literature Review 2.1 Background and Overview of Related Work The issue of car accidents in general, and distracted driving in particular, has been one of the major problems of the century. Humanity has suffered heavy losses due to behaviors that cause


drivers to lose concentration while driving. Among these dangerous behaviors is the use of mobile phones while the vehicle is in motion. Using a mobile phone while driving exposes not only the driver to the possibility of an accident, but also other passengers in the car, people driving other vehicles on the road, people walking down the street, and even parked cars. As this is considered a major issue, there have been many studies, technological solutions, and governmental attempts in many countries to reduce the scale of this behavior. In this chapter, mobile applications and technologies related to the issue are discussed, and an overview is provided to explain the relationship between the mentioned technologies, the issue, and this project. 2.2 Related Mobile Applications Drive Safe.ly Developed by iSpeech and running only in the United States, this application helps solve the problem of texting while driving by reading received SMS messages and emails aloud using iSpeech Text to Speech (TTS). It allows the user to respond by converting a recorded message into a text message and sending it. This feature is available only in the Pro version, which is not free of charge [4]. AT&T Drive Mode This is a free application developed by the American mobile telecommunications company AT&T for Android devices only, as shown in Fig. 1. The application was designed to prevent distracted driving caused by the use of mobile phones by silencing incoming phone calls and alerts so the driver can stay focused. The application starts working once the speed exceeds 15 mph (about 25 km/h), and it provides a specially designed interface allowing the driver to access navigation, music, and key contacts. The application also provides parental alerts, which send a text message to parents if the application is turned off or disabled, or if a new speed dial number is added [5].

Fig. 1. AT&T drive mode interface


Google Assistant Driving Mode This is an interface made by Google for Android devices. Figure 2 shows that the interface enlarges the size of the icons that represent the most common actions, such as navigation using Google Maps, phone, and media. Additional commands could be customized, such as returning a missed call, resuming previously played media, or navigating to a registered location [6].

Fig. 2. Interfaces of Google assistant driving mode

Cell Control The application blocks predefined services and behavior, such as texting, taking pictures and selfies, playing games, and using social media. It can also be used as a monitor to control how teens drive their cars and to check on whether they engage in any bad driving behavior [7]. OneTap OneTap is an application for Android that helps to control distracted driving during which an individual uses a mobile phone while the vehicle is in motion. The application automatically silences any texts and notifications received and can notify the user when friends or family are currently driving so that the user does not contact them [8]. Two snapshots of OneTap are represented in Fig. 3.


TextLimit This application is designed to prevent the use of a set of predefined features when the vehicle exceeds a certain speed. The application restores all the blocked features once the vehicle drops back under the predefined speed. The application works in tandem with its companion website, which is required for it to function completely [9], as shown in Fig. 4. Drive Scribe The application is designed to block incoming texts and calls while the vehicle is in motion. It also notifies drivers when they are traveling at high speeds [7].

Fig. 3. OneTap application interface

Fig. 4. TextLimit interfaces


True Motion The True Motion application – as in Fig. 5 – is a family-oriented driving app that allows the user to get a clear picture of the family’s safety. The application allows the user to get details regarding their children’s driving behavior, such as location, phone use, texting, aggressive driving, and speeding [10]. Text Buster This is a two-part system that includes a physical part to be installed in the car and software for an Android mobile phone. The system costs $200 and assumes that the driver is using the mobile phone. In that sequence, it sends a notification to the connected mobile phone and blocks texts and calls [11].

Fig. 5. True motion application interfaces

OMW This application for iOS allows you to share your location and estimated time of arrival (ETA) with the people whom you are heading to meet. The application aims to reduce mobile usage (e.g., calling or texting other people to inform them of your current location or ETA) while you are driving a vehicle [12]. Car Dashdroid A replacement for a car’s screen, this converts an Android device into a car home dock screen, as shown in Fig. 6. The application allows the user to reply to text messages such as SMS, WhatsApp, and Telegram without touching the device. The application uses large buttons and icons, as well as full voice commands, and allows the user to create more than 40 shortcuts for easy-access commands.


Fig. 6. Car dashdroid

2.3 Analysis of Related Work Case Study: Road Traffic Accidents Among Drivers in Abu Dhabi, UAE The study aims to evaluate and determine the factors related to deaths caused by road traffic accidents. Questionnaires were used to gather data and were deployed in the UK and UAE. Two versions were used: Arabic and English. Six hundred people from the two genders participated in the distributed versions in the city of Abu Dhabi. The study resulted in the identification of several factors causing road traffic accidents. This research points to the slow progress in traffic accident reduction in UAE and Gulf Cooperation Council (GCC) countries, and to the fact that young adults aged 18 to 35 constitute over 50% of traffic accident deaths in UAE. The first author of the study has experience as a police officer working with the traffic department in Abi Dhabi. Based on that experience, it was noted that the number of traffic violations is increasing, as is the number of traffic accidents involving injured young adults. The study also explains how much traffic accidents cost different countries in terms of their Gross National Product (GNP); they cost 2% of the GNP in high-income countries, 1.5% in middle-income countries, and 1% in low-income countries. In general, road traffic accidents cost the global community about US$518 billion. The study points to efforts made by the government of the United Arab Emirates to reduce traffic accidents and make roads safer for all drivers and passengers. The study explains how males are more likely to adopt risky driving behavior and to be involved in car crashes as compared to female drivers. Quantitative data methods were used through questionnaire surveys, in which drivers were asked to identify “relevant factors that contribute to the traffic accidents in Abu Dhabi.” The most commonly cited factors were: • A significant percentage of drivers from both genders do not wear seat belts regularly • Use of a mobile phone while driving

• Drinking alcohol and driving
• Aggressive driving behavior
• Unsafe driving behavior
• Lack of correct targeting of traffic campaigns.

The study concluded by highlighting the mentioned problems identified as major factors causing traffic accidents and suggested developing safety measures on roads, as well as developing educational programs and campaigns targeting the right groups so that results can be noticed [13]. Mobile Phone Usage and the Growing Issue of Distracted Drivers A report by the World Health Organization (WHO) focuses, in an accurate and detailed way, on the issue of mobile phone use while driving and aims to raise awareness of distracted driving. The report states that mobile phone use causes drivers to take their eyes off the road and their hands off the steering wheel, which has the biggest impact on driving behavior. The listed body of evidence includes longer reaction times for acts related to such things as identifying traffic signals and braking, the ability to keep the vehicle in the correct lane, shorter following distances, and the condition in which the driver is not aware of the surrounding conditions. The report indicates that the rapid growth of text messaging between drivers is a major problem that requires attention. Nevertheless, young drivers are thought to use mobile phones while driving more than older drivers do, which makes them more vulnerable to the effects of this behavior due to their lack of experience in reacting to road conditions. The report also indicates that different studies suggest that crash risk related to the use of mobile phones appears to be approximately 4 times higher, and that this risk is similar for both hand-held and hands-free devices, suggesting that the reason for this increased risk is not the act of holding the phone, but the loss of concentration while one is involved in a phone conversation. It is stated that, annually, 1.3 million people die and another 50 million are injured due to traffic accidents, alongside the significant casualties and economic losses that accompany the emotional toll. According to the WHO’s report, driver distraction can be split into four types: • Visual: “looking away from the road for a non-driving-related task” • Cognitive: “reflecting on a subject of conversation as a result of talking on the phone – rather than analyzing the road situation” • Physical: “when the driver holds or operates a device rather than steering with both hands, or dialing on a mobile phone or leaning over to tune a radio that may lead to rotating the steering wheel” • Auditory: “responding to a ringing mobile phone, or if a device is turned up so loud that it masks other sounds, such as ambulance sirens” The report also points to the lack of official and registered data regarding traffic accidents in relation to driver distraction, especially when crash details are registered by the police and no driver distraction data are noted. This makes it difficult for researchers and people who are seeking solutions to the issue.


There are two sources of driver distraction: in-vehicle (interior) distractions, which occur when the driver plays with the radio or mobile phone or changes music tracks, and external distractions, which occur when a driver’s eyes are taken off the road while the driver is looking at buildings. The report included a selection of studies suggesting that distraction is an important contributor to traffic accidents. These are quoted below: “An Australian study examined the role of self-reported driver distraction in serious road crashes resulting in hospital attendance and found that distraction was a contributing factor in 14% of crashes”. “In New Zealand, research suggests that distraction contributes to at least 10% of fatal crashes and 9% of injury crashes, with an estimated social cost of NZ$ 413 million in 2008 (approximately US$ 311 million). Young people are particularly likely to be involved in crashes relating to driver distraction”. “Insurance companies in Colombia reported that 9% of all road traffic crashes were caused by distracted drivers in 2006. Of all cases where pedestrians were hit by cars, 21% were caused by distracted drivers”. “In Spain, an estimated 37% of road traffic crashes in 2008 were related to driver distraction”. “In the Netherlands, the use of mobile phones while driving was responsible for 8.3% of the total number of dead and injured victims in 2004”. “In Canada, national data from 2003–2007 show that 10.7% of all drivers killed or injured were distracted at the time of the crash”. “In the United States, driver distraction as a result of sources internal to the vehicle was estimated to be responsible for 11% of national crashes that occurred between 2005 and 2007, although a smaller study involving 100 drivers found that driver involvement in secondary tasks contributed to 22% of all near crashes and crashes (25, 26). In 2008, driver distraction was reported to have been involved in 16% of all fatal crashes in the United States”. “In Great Britain, distraction was cited as a contributory factor in 2% of reported crashes. The difficulty for the reporting officer to identify driver distraction is likely to have led to an under-representation of this proportion, which is considered a subjective judgement by the police. In addition, contributory factors are disclosable in court and police officers would require some supporting evidence before reporting certain data, a factor likely to lead to an under-representation of the problem”. The report suggests that technological systems could be used within vehicles to prevent distraction in general, such as lane-changing warning features. The report concludes with an acknowledgment of the importance of mobile phones in improving communications, while also citing mobile phones as a major cause of distracted driving. It also urges governments to improve safety measures and current gains, in order to take advantage of the developing technologies and help mitigate the issue in the future [14].

3 Research Problem Traffic accidents are considered among the main killers of children and young adults worldwide, and mobile phones are considered one of the top 10 causes of traffic accidents


[15]. In the Kingdom of Saudi Arabia specifically, distraction caused by the use of mobile phones while driving is responsible for almost half of the recorded traffic accidents, which result in a high number of injuries, deaths, and damage to or loss of property. Thus, this research aims to address the following: 1. Highlighting the impact of mobile phone use while driving on traffic accident rates in Saudi Arabia. 2. Studying the issue of traffic accidents caused by distraction resulting from the use of mobile phones from both social and scientific perspectives and in relation to previous studies done in the same field. 3. Providing a mobile application to help minimize the likelihood of accidents.

4 Methodology This project is divided into two main phases. In the first phase, a questionnaire will be created to gather data from a specific community segment, i.e., drivers between 18 and 64. Then, the collected data will be analyzed to obtain an understanding of mobile phone use while driving and the accident ratio related to mobile phone use. A specialized software application such as QlikView will be used to analyze and visualize data. We will develop a mobile application using specialized software such as Android Studio or alternatives. A proper database will be built while progress is being made during the project’s timeline, which will be built using MySQL or similar tools.

5 Analysis 5.1 Data Collection A study questionnaire was conducted as part of the project. The purpose of the study is to measure the scale of society regarding the issue of distracted driving due to the use of mobile phones, and how widespread the issue is. The questionnaire was conducted online in the Arabic and English languages. Figure 7 presents the flow of the conducted study.


Fig. 7. The survey structure

5.2 Observation Description The questions were carefully worded to fulfill the needed output of the questionnaire. They were chosen to provide an overview of society’s reactions and the scale of the issue, as well as what solutions are being offered to solve it, for purposes of comparison with the proposed solution.


5.3 Observation Analysis Q1: Gender This question measures the number of respondents based on their gender, as in Fig. 8. In total, 142 respondents were female, with a percentage of 47.02%, while 161 were male, with a percentage of 52.98%.

Fig. 8. The gender of the participants

Q2: Age This question indicates the number of participants in relation to different age groups. Figure 9 shows that 18–25 represents the group of university students, 26–30 represents new graduates and master’s degree students, and the 31–40 and 40+ groups represent older adults.

Fig. 9. The age range of the participants

Q3: What do You Think are the Main Reasons for Car Accidents in Saudi Arabia? This question seeks participants’ opinions regarding the main reasons for car accidents in Saudi Arabia. Figure 10 indicates that distracted driving, which includes the use of a mobile phone while driving, represents the highest percentage, at 44%. Reckless driving was second, at 25.67%. Speeding came in third, at 15.67%.


Fig. 10. The main reasons for accidents, based on society’s opinion

Q4: Do You Drive a Motor Vehicle? This question was used as a filter to differentiate between participants who drove a vehicle and those who did not, so that people who drove could be targeted with more specific questions. Figure 11 shows that 64.90% of participants drove a vehicle, while 35.10% did not.

Fig. 11. Rate of participants who drive

Q5: Are You Considering Driving in the Future? This question was asked of people who answered “no” to the previous question, simply to

Fig. 12. The rate of non-driving participants who are considering driving


measure their desire to drive. The result in Fig. 12 indicates that 67.92% were considering driving and 32.08% were not. Q6: Do You Have a Driving License? This question was asked only of participants who drove, to measure the percentage of legal and illegal drivers. The result in Fig. 13 shows that 94.39% of the participants had a driving license, while 5.61% did not.

Fig. 13. Rate of driving-participants who have a driving license

Q7: What is the Reason for not Having a Driving License Yet? You can Select More Than One Answer This question was asked only of people who did not hold a driving license, to determine why they did not yet have one. A simple representation of the result is shown in Fig. 14.

Fig. 14. The reasons for driving without a driving license

Q8: Do You Use Your Mobile Phone While Driving? This question sought the percentage of participants who used their mobile phones while driving. The results in Fig. 15 indicate that 62.24% (over half) said yes, while 37.76% said no.


Q9: Under Which Circumstances do You Use Your Mobile Phone While Driving? You can Select More Than One Answer This question was asked to determine the reasons why drivers used their mobile phones, which helps in understanding the behavior. The results, as revealed in Fig. 16, indicated that making phone calls was the top reason. Navigation and Google Maps came in second, while “for emergencies only” was third. Playing music was the fourth reason, while chatting was the fifth. Social networks came in sixth, while gaming came in last, at seventh. Seventeen people said that they did not use mobile phones while driving.

Fig. 15. The rate of people who use a mobile phone while driving

Fig. 16. The reasons why people use their mobiles while driving

Q10: Have You Ever Done Any of the Below, or do You know Someone Who had an Experience With Any of the Below? You can Select More Than One Answer This question was asked to determine the scale of the impact of using mobile phones on lives, property loss or damage, traffic violations, and other circumstances. Figure 17 shows the results as the following: 41.3% of the participants had a traffic violation themselves or knew someone who did, because of the use of mobile phones; 28.2% of the participants had gotten into a traffic accident or almost gotten into one, or knew someone who did or almost did, because of the use of mobile phones; and 7.3% of the participants had run a red light themselves or knew someone who did because of the use of mobile phones. Meanwhile, 23.2% of the participants had not encountered any of the abovementioned circumstances.


Fig. 17. The rate of people who have experienced the results of distracted driving

Q11: Do You Think Traffic Accidents Caused by Drivers’ Distraction while Using Mobile phones Could be Avoided? This question was asked to determine society’s perspective regarding the possibility of avoiding traffic accidents caused by mobile phone use. The results in Fig. 18 show that 82.14% of the participants agreed that such accidents could be avoided, while 17.86% did not.

Fig. 18. The rate of people who think that distracted driving could be avoided

Q12: How Often do You Notice People Using Their Mobile Phones While Driving? This question was asked to determine how frequently participants noticed people using their mobile phones while driving, which could help in measuring the scale of the problem. The question used the scale method, which is represented by five degrees, as in Fig. 19. The first choice was “I don’t notice them” and the fifth choice was “I always notice a lot of them”. The fifth and fourth options were each chosen 73 times, while the third option was chosen 46 times. The second option was chosen four times. No participant selected the first option.


Q13: Are You Aware of Any Current Laws that Prohibit the Use of Mobile Phones While Driving in Saudi Arabia? This question was asked to measure society’s awareness regarding the laws that prevent the use of mobile phones while driving in Saudi Arabia. Figure 20 shows that 95.41% of participants were aware of the laws, while 4.59% were not.

Fig. 19. The rate of people who notice drivers using mobile phones

Fig. 20. The rate of awareness of laws

Q14: Are you Aware of any Penalty as a Result of the Use of Mobile Phones While Driving in Saudi Arabia?

Fig. 21. The rate of awareness of penalties


This question was asked to measure society's awareness regarding the penalties resulting from traffic violations related to the use of mobile phones while driving in Saudi Arabia. Figure 21 indicates that 94.90% of participants were aware of the penalties, while 5.10% were not. Q15: What Would Make Drivers Abstain From Using Their Mobile Phones while Driving? You can Select More Than One Answer This question was asked to determine society's opinion regarding the factors that would cause drivers to stop using their mobile phones when driving. Figure 22 shows that "Technology" was the most-chosen option, with 131 votes. "Penalty increasing" came in second with 100 votes, while the option "After being involved in an accident" was third with 88 votes. "Adaption to laws" came in fourth with 73 votes, while there were 11 votes for "nothing will change this habit".

Fig. 22. What would stop people from using their mobiles while driving

Q16: Do you think that Developing a Mobile Application that Could Help Drivers Control the Behavior of Using Mobile Phones While Driving could Help Minimize the Scale of this Behavior? This question was asked to measure society’s opinion regarding the provision of a mobile

Fig. 23. The rate of people who think a mobile app can help solve the issue


application as a solution to this problem. Figure 23 shows that 85.20% of participants agreed with the idea, while 14.80% did not. Q17: If we Developed such an App, Would You be Willing to use it? This question was asked to measure participants’ acceptance and to determine whether there was space for such an idea. Figure 24 shows that 81.63% of participants said that they would be willing to use such an app, while 18.37% were not willing to do so.

Fig. 24. The rate of people who are willing to use the proposed solution

Gender-Based Usage of Mobile Phones While Driving Based on the gathered results and after analysis, Fig. 25 shows that 72.5% of females from among the entire count of female participants did not use their mobile phones while driving, while 27.5% of females did. On the other hand, it was found that 28.8% of males from among the entire count of male participants did not use their mobile phones while driving, while 71.2% of males did use their mobile phones. The two percentages show that men use their mobile phones more frequently than women do.

Fig. 25. The relation between each gender and the usage of mobile phones while driving

5.4 Discussion The results of the performed study indicate that men use their mobile phones more than women do, and that women are more committed to the laws. Society has a high


acceptance level for the proposed solution, which is a mobile application to help drivers control the behavior of using mobile phones while driving. Society's awareness of the laws and penalties regarding the traffic violation of using mobile phones while driving is relatively high; nevertheless, there should be more advertisements and campaigns to educate current drivers and upcoming generations about the issue and why they should control this behavior, as some drivers do not understand what they are risking when they use their mobile phones while driving.

6 Interface Design To satisfy the users and deliver a well-designed product, the interface design should follow the principles of human-computer interaction (HCI). The application was named

A: The welcome page once the application is launched

B: The user chooses to start a new journey or view the developer's info

C: New journey interface Fig. 26. Street owl app interface


“Street Owl” and was designed to have a consistent appearance across the system, as shown in Fig. 26. The user can understand their current location while using the application. Navigation methods such as buttons, as well as other methods and elements, will be used to enable the communicating of information and the initiating of actions. Primary content—such as numbers, text, graphics, interactive elements, etc.—should be available on-screen, as well as easy to notice and understand. Scrolling, zooming, and other behavior should be enabled. Many types of controls—such as buttons, text fields, and progress indicators—will be used to make the experience easy for the user. Images and icons will be added using the standards of user experience (UX), in addition to the fonts’ sizes and colors, backgrounds colors, and other elements.

7 Usability Testing The usability testing went through five stages, which were: 1. Alpha Testing. This is the most common type of testing used in the software industry. The objective of this type of testing is to identify all possible issues or defects before releasing the product into the market or to the user. Alpha testing is carried out at the end of the software development phase but before the beta testing phase. Minor design changes may be made as a result of such testing. Alpha testing is conducted at the developer’s site. An in-house virtual user environment can be created for this type of testing. 2. Accessibility Testing. The aim of accessibility testing is to determine whether the software or application is accessible for disabled people. Here, disability means deafness, color blindness, mental disability, blindness, old age, and other disabilities. Various checks are performed, such as font size for the visually disabled, color and contrast for those with color blindness, etc. 3. Beta Testing: Beta testing is a formal type of software testing carried out by the customer. It is performed in the real environment before the product is released to the market for the actual end-users. Beta testing is carried out to ensure that there are no major failures in the software or product and that it satisfies the business requirements from an end-user perspective. Beta testing is successful when the customer accepts the software. Usually, this testing is done by end-users or others. It is the final testing done before a product is released for commercial purposes. Usually, the beta version of the software or product released is limited to a certain number of users in a specific area. Thus, the end-user actually uses the software and shares their feedback with the company. The company then takes necessary action before releasing the software worldwide. 4. Black Box: Testing internal system design is not considered in this type of testing. Tests are based on the requirements and functionality. Detailed information about the advantages, disadvantages, and types of black box testing can be seen here. 5. Gorilla Testing: Gorilla testing is a testing type performed by a tester and sometimes by the developer as well. In gorilla testing, one module or functionality in the module is tested thoroughly and heavily. The objective of this testing is to check the robustness of the application.


8 Conclusion Driving and using a mobile phone simultaneously could expose a driver, passengers, people walking on the side of the road, and even property to losses and damages. This behavior is considered one of the main causes of death worldwide, especially in Saudi Arabia. We believe that using technology to help monitor, control, and assist drivers in maintaining safe driving behavior is important. Street Owl could help people control themselves and monitor their behavior if they find it difficult to do so without help. More awareness should be spread—especially in large cities such as Jeddah, Riyadh, Makkah, and Khobar—regarding the results of using anything that would lead to the distraction of the driver. More efforts should be made by both the authorities and the people of Saudi Arabia to make our streets safer for everyone.

References
1. WHO: Global status report on road safety 2018 (2018). [cited 26 Sep 2019]. https://www.who.int/violence_injury_prevention/road_safety_status/2018/en/
2. Statistics, G.A.f.: Statistical Yearbook of 2018 | Chapter 14 | Transportation (2018). [cited 25 Sep 2019]; official traffic accident statistics published by GAS. https://www.stats.gov.sa/en/1020
3. SASO: 161,242 Traffic Accidents Take Place Annually Due to the Use of Mobile Phones While Driving (2019). [cited 25 Sep 2019]. https://www.saso.gov.sa/en/mediacenter/news/Pages/saso_news_973.aspx
4. Oltsik, J.: What's Needed for Cloud Computing? Enterprise Strategy Group, 10 (2010)
5. AT&T: AT&T DriveMode (2018). [cited 25 Oct 2019]. https://about.att.com/sites/it_can_wait_drivemode
6. Kerns, T.: Google is getting rid of Android Auto's smartphone UI (2019). [cited 25 Oct 2019]. https://www.androidpolice.com/2019/07/30/android-auto-app-going-away-assistant-driving-mode/
7. Winkler, V.J.R.: Securing the Cloud. Syngress (2011)
8. Buyya, R., et al.: Cloud computing and emerging IT platforms: vision, hype, and reality for delivering computing as the 5th utility. Futur. Gener. Comput. Syst. 25(6), 599–616 (2009)
9. MobileLifeSolutions: Text Limit (2013). [cited 26 Oct 2019]. https://apps.apple.com/us/app/text-limit/id612385844
10. TrueMotion: TrueMotion Family Safe Driving (2016). [cited 25 Oct 2019]. https://apps.apple.com/us/app/truemotion-family-safe-driving/id1121316964
11. Goodwin, A.: TextBuster (2013). [cited 26 Oct 2019]. https://www.cnet.com/pictures/textbuster/2/
12. Jorissen, K., Vila, F.D., Rehr, J.J.: A high performance scientific cloud computing environment for materials simulations. Comput. Phys. Commun. 183(9), 1911–1919 (2012)
13. Hammoudi, A., Karani, G., Littlewood, J.: Road traffic accidents among drivers in Abu Dhabi, United Arab Emirates. J. Traffic Logistics Eng. 2 (2014)
14. WHO: Mobile phone use: a growing problem of driver distraction (2014). [cited 26 Oct 2019]. https://www.who.int/violence_injury_prevention/publications/road_traffic/distracted_driving/en/
15. WHO: The top 10 causes of death (2018). [cited 5 Oct 2019]. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death

Low-Cost Digital Twin Framework for 3D Modeling of Homogenous Urban Zones Emad Felemban1(B) , Abdur Rahman Muhammad Abdul Majid2,3 , Faizan Ur Rehman2,3 , and Ahmed Lbath2 1 Computer Engineering Department, College of Computing and Information Systems, Umm

Al-Qura University, Mecca, Saudi Arabia [email protected] 2 Laboratoire d'Informatique de Grenoble, University of Grenoble Alpes, Grenoble, France [email protected], [email protected] 3 Institute of Consulting and Research Studies, Umm Al-Qura University, Mecca, Saudi Arabia

Abstract. Smart cities have been on the rise over the last decade. These cities can only be effective and sustainable when acquiring error-free information from multiple aspects and sources. Prone to errors and mistakes, traditional tools and observation methods require new assessment strategies to provide a clear perspective in simulation and a systematic evaluation for decision-making aligned with Industry 4.0. The Digital Twin can offer an efficient solution to this problem under the cyber-physical system paradigm by creating a virtual space identical to the physical region and providing predictive information on the city's current state. Such systems offer an accurate representation in a virtual environment while taking input from the real world and simulating it for future predictions. These near-identical systems are constructed at a very high cost, and the cost increases as more intricate details are added to the environment. This paper presents selected technologies that can potentially contribute to developing a low-cost intelligent environment and smarter urban management framework. It looks into analyzing, in three dimensions, the impact of different scenarios on a public event, which will be crucial to decision- and policymakers before a plan is approved. The proposed framework will be able to guide present and upcoming solutions to the administrative challenges. Keywords: Digital twin · Low-cost · 3D Modeling · Data collection · Smart city

1 Introduction Traffic, safety, and energy are fields in which many cities face increasing consumption. As a consequence, many problems are occurring, such as traffic congestion and water shortages. Europe, Japan, and the United States are among the regions investing heavily in the creation of smart cities and their related technologies, and China is also using intelligent cities for urban innovation. Smart cities and their related industries are expected to grow to $1.5 trillion in 2020 [1].


In such a smart city, computing technologies and information and communication technologies converge and are combined as a Digital Twin to provide a core joint platform. In other words, real-time simulation, artificial intelligence, IoT, and spatial information are utilized to build a converged digital twin platform. Information necessary for city operations is integrated and virtualized to modify the real world and solve various urban problems. The core infrastructure for implementing a smart city platform is spatial information, which enables accurate status analysis and results within the spatial infrastructure and is essential for a city's transformation into a smart city. This spatial information can be made visible and efficient using aerial lidar surveys, ground lidar surveys, vehicle lidar surveys, and multi-directional aerial surveys [2]. Research has been carried out to generate and construct actual 3D spatial information based on various sensors, and the demand for low-cost technology that yields useful 3D spatial information is increasing. Therefore, the purpose of this study is to provide a low-cost and high-quality 3D framework. A 3D model of the Mina area in Mecca, Saudi Arabia is built using multiple technologies and sensors to acquire information. In addition, the quality and efficiency of the built model as a digital twin are analyzed, and its efficacy and applicability in the application field are discussed. In the rest of the paper, the authors discuss current research trends related to digital twin technologies. The framework design methodology is then discussed in detail, focusing on each aspect of the digital twin framework, followed by a brief implementation section. The paper concludes by discussing the framework and the limitations of such a low-cost alternative, followed by concluding remarks and intended future work.

2 Research Trends The construction of a 3D digital model for a high-density, verticalized urban space is essential not only for building the spatial information base of smart cities but also for constructing three-dimensional intelligence and cyber-physical systems. It is regarded as a core element, and studies on manual, semi-automated, and automated approaches have been actively conducted since the 1990s based on various sensors, geared towards an Industry 4.0 revolution. As current policy has pushed the drone industry towards the fourth industrial revolution, research is actively underway to take advantage of drones for developing low-altitude, high-precision, and low-cost 3D spatial models, an enhancement that was not possible with the existing aerial surveying infrastructure. Looking at studies on the accuracy of drone photogrammetry, Lee et al. [3] produced orthogonal images and digital surface models with 4 cm/pixel spatial resolution using fixed-wing and low-cost rotary-wing drones for topographic surveying of civil engineering construction sites; the root-mean-square error was found to be around 10 cm in the x, y, and z directions. NO et al. [4] proposed underground utility mapping using ground control points that reduced the error bound of the acquired data. Guisado-Pintado et al. [5] mounted a general-purpose camera on a rotorcraft drone and obtained an accuracy of 9 cm horizontal (x, y) error and 7.7 cm vertical (z) error as a


result of error analysis on the inspection points in a study evaluating the effectiveness of the drone photogrammetry method. Verifying the digital topographic map produced by drone photogrammetry against the ground survey method yielded a plane error of 4 cm on average and 8 cm at most, and an elevation error of 11 cm on average and 31 cm at most. NO et al. [4] further compared the ground survey coordinates of the buried pipe in the experimental area with the 17 processing results obtained from drone data photographed 7 times to create an underground facility map; only two of the processing results exceeded the error limit. From this preliminary research, the accuracy of drone photogrammetry varies with the shooting height, camera resolution, and redundancy of the photography; however, the plane error is within 10 cm on average, which is highly accurate compared to the error limit of the plane reference point (a standard deviation within 14 cm and a maximum value within 28 cm). While comparing simulation modeling and the digital twin, Krasikov et al. [6] identified the main difference in the data flow structure of the model and proposed that the loop between the physical and the virtual space should be automated and continuous.

3 Framework Design Methodology As shown in Fig. 1, given an existing physical product, it generally takes several steps to create a fully functional and responsive digital twin. It should be made clear that, in practice, manufacturers and developers do not strictly follow such a sequence while developing a Digital Twin model [7]. Discussed below are the generalized steps that were taken while developing the economical yet practical digital twin framework. 3.1 Spatial Data Acquisition This step deals with the process of spatial data acquisition. The goal is to collect and acquire data quickly and economically. The initial step in acquiring data was an on-site inspection of the camps, which were measured using tools. The pedestrian bridge details were acquired from a vendor that provided realistic models of the bridge's floors and ramps. Intel's RealSense™ Lidar Camera1 is also used to collect data; such low-cost lidar scanning devices can help in providing realistic textures and measurements for validation. Other standard acquisition methods allow users to mount a scanning module on a moving vehicle that gathers information from the surroundings in the form of point clouds, which makes the data more accurate; however, that is not a cost-effective solution, as such modules have exorbitant prices that do not justify this scanning use case [8]. 3.2 Building a 3D Environment This process is enabled using the finalized computer-aided drawing (CAD) constructed from the measurements of the previous steps. Since the Mina area structures 1 Intel® RealSense™ LiDAR Camera L515 - link.


Fig. 1. Digital twin framework lifecycle

are mostly homogenous (see Fig. 2), these buildings can be drawn quickly. These measurements were also provided to the graphics designer to perform the extrusion of surfaces in Blender2, which allows the camps in the Mina area to be modeled effectively. Blender is a freely available 3D creation suite, which contributes to the cost-effective solution provided by the authors. 3D modeling is then performed on these CAD drawings to depict the camp and bridge structures. The virtual environment includes three aspects: elements, behaviors, and rules [9]. At the level of elements, the virtual space model mainly includes the physical environment's geometric model, dynamic or static entities, and the environment's user or administrator. At the level of behaviors, the authors analyze the behavior of environments and users, focusing on modeling the environment-user interaction generated by that behavior. The rules level mainly includes the evaluation, optimization, and forecasting models established following the laws of environment operation. 3.3 Data Processing to Facilitate Administration Data acquired from different sources (i.e., mainly from the physical environment and the Internet) are analyzed, integrated, and visualized. The massive data acquired from multiple sources are delivered using real-time databases such as Firebase Realtime Database3, which makes the data available almost instantly so that effective decisions can be made. Data analytics is essential to convert data into more factual information that can be queried directly by developers for decision-making. Furthermore, since environment data are collected from diverse sources, data integration helps discover the implicit patterns that cannot be uncovered 2 Blender - link. 3 Firebase Realtime Database – link.


Fig. 2. Homogenous construction of Mina tents [10]

based on a single data source. Moreover, data visualization technologies are incorporated to present the data more straightforwardly. Finally, advanced artificial intelligence techniques can be incorporated to enhance a digital twin's cognitive ability (e.g., reasoning, problem-solving, and knowledge representation), so that specific, relatively simple recommendations can be made automatically.
3.4 Environment Behaviors Simulation in a Virtual Environment
The enabling technologies include simulation and virtual reality (VR). The former is used to simulate critical functions and behaviors of the physical environment in the virtual world. In the past, simulation technologies have been widely used in environmental design. Virtual reality technologies, on the other hand, allow designers and even users to 'directly' interact with the virtual environment in the simulated setting. Recently, VR technologies have been increasingly employed to support virtual prototyping and environment design [11]. Many readily available VR hardware devices can be directly adopted for digital twins. For this framework, HTC's VIVE™ VR4 devices were used to provide users with an immersive experience. This methodology can be further enhanced by simulating virtual crowds in these 3D models to provide more information on the effectiveness of the 3D model [12].
3.5 Recommendation and Actions to the Physical World
Based on the digital twin's recommendations, the physical environment is equipped with a capability, using various actuators, to adaptively adjust its function, behavior, and structure in the physical world. Sensors and actuators are the two technological backbones of a digital twin. The former play a role in sensing the external world, whereas the latter play a role in executing the desirable adjustments requested by the digital twin.
4 HTC® VIVE™ Cosmos Elite Series - link.


In practice, the commonly used actuators that are suitable for consumer environments include, for example, hydraulic, pneumatic, electric, and mechanical actuators. Augmented reality (AR) technologies can also reflect some parts of the virtual environment back to the physical world. For example, AR enables end-users to view the real-time state of their environments. Recently, AR technologies have been increasingly applied in factory-domain production engineering [13].
3.6 Establishing Real-Time Connections Between the Physical and Virtual Environments
The connections are enabled by several technologies, such as network communication, cloud computing, and network security. Firstly, networking technologies enable the environment to send its ongoing data to the 'cloud' to power the virtual environment. The feasible networking technologies for consumers include, for example, Bluetooth, QR codes, barcodes, Wi-Fi, Z-Wave, and other similar technologies. Secondly, cloud computing enables the virtual environment to be developed, deployed, and maintained entirely in the 'cloud', conveniently accessed by both designers and users from anywhere with Internet access. Lastly, since environment data directly and indirectly concern user-environment interactions, it is critical to secure these connections. Much effort has been devoted to connecting physical and virtual environments in the Internet of Things, and this work can be adapted for digital twin research.
3.7 Multi-source Data Collection of Vital Information
Generally speaking, three types of environment-related data should be processed by the digital twin. For ordinary environments, the data are usually divided into environmental data, customer data, and interactive data. Customer data contain customer comments, viewing records, and download records. Interactive data consist of user-environment interactions, such as stress, vibration, etc. Sensor and IoT technologies can gather some of the above data in real time, while web-page browsing records, download records, and evaluation responses provide the rest. For this framework, one sensor used to collect data from the physical space is the Footfall® 3-D People Counter5. It provides vital information on the number of participants at a particular venue. The collected data are fed back to the step described in Sect. 3.2 to close the loop towards building more functional virtual environments.
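To make the data-delivery idea in Sect. 3.3 more concrete, the minimal sketch below posts a single people-counter reading to a Firebase Realtime Database through its standard REST interface. The project URL, node names, and the omission of authentication are illustrative assumptions rather than details of the deployed system.

```python
# Minimal sketch: pushing one occupancy reading into a Firebase Realtime
# Database over its REST interface. The database URL and node names are
# hypothetical; authentication is omitted for brevity.
import time
import requests

FIREBASE_URL = "https://example-digital-twin.firebaseio.com"  # hypothetical project URL

def push_reading(venue_id: str, people_count: int) -> None:
    """Append one time-stamped occupancy reading under /occupancy/<venue_id>."""
    payload = {"count": people_count, "timestamp": int(time.time())}
    # POST appends a new child node with an auto-generated key.
    resp = requests.post(f"{FIREBASE_URL}/occupancy/{venue_id}.json", json=payload, timeout=5)
    resp.raise_for_status()

if __name__ == "__main__":
    push_reading("mina_camp_17", 42)
```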

4 Implementation
One of the significant challenges on the path to smarter cities is developing multi-criterion decision-support and prediction models for the authorities. These models can further be utilized to simulate multiple scenarios. Companies such as ESRI are actively developing applications that can assist in making decisions based on several criteria. One such tool is ArcGIS Urban, which provides 3D urban environments for making significant decisions.
5 Footfall 3-D People Counters – link.


Furthermore, ArcGIS CityEngine assists with creating 3D content for urban planning, making it easier to support decisions and to export the model in multiple 3D formats. Figure 3 demonstrates the usage of an exported 3D model of the Jamarat Bridge in ESRI ArcGIS Pro. This model can be further utilized in decision-making applications for a better understanding of planned scenarios.

Fig. 3. Demonstrating digital twin at scale using ArcGIS Pro

Dashboards are also considered an essential component of a digital twin framework. They are extensively used to provide better insights for making informed decisions related to smart cities. The data for these dashboards are acquired from multiple sensors, both active and passive in nature. Using these dashboards, experts can monitor the actual situation and changes at a location and make wiser decisions [14]. The data received from the sensors can be enormous, resulting in big data sets that need to be analyzed using a cloud-based GIS platform. The data can also be reduced by determining the relevant and essential data and discarding the rest.

5 Discussion
As we move towards smart cities and intelligent cyber-physical systems, the resulting low-cost digital twin framework will provide an economically viable alternative to other options with considerably higher costs. The proposed framework is especially suitable for homogeneous urban locations where physical structures are similar to one another. This scenario expedites the creation of a smart ecosystem of smart sensors and IoT devices with prediction systems that provide decision-makers with all the necessary information to make well-informed decisions. There are certain limitations to this methodology. First, the point-cloud scan quality depends on the device being used to collect details of the structures. Each LiDAR device has limitations as to how many points it can collect per minute. Expensive


devices are capable of collecting more points, but they incur colossal costs for the project. In comparison, low-cost devices with a lower points-per-minute capability can still provide enough detail to generate a 3D model.

6 Conclusion and Future Work
In this paper, the authors presented a novel digital twin framework that is both economical and sustainable. Overall, the production and storage of spatial data demand innovation that can cater to a broader audience. Thus, it is necessary to provide real-time and precise information, including its spatial location. Such an information flow is entirely automated and continuous: data should flow from the physical space to the virtual environment, and commands and recommendations should be communicated back to the actual world. The framework also intends to focus on the convergence of multi-sensor data and its connectivity with the virtual environment. The framework will also be tested for the accuracy of the 3D model's spatial details by collecting and comparing results from multiple low-cost LiDAR scanning devices. The primary outcome of improvements in real-time sensor information is better prediction accuracy. Prediction models simulated on such a digital twin platform can be used to plot current hazards and forecast imminent ones, and these patterns can be used for accurate trend analyses. The development of a cost-effective digital twin of the urban area will guide present and upcoming solutions to the administrative challenges through the steps discussed earlier. Acknowledgment. The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia, for funding this research work through project number 0909.

References 1. Frost, L.A., Sullivan, D.L.: Smart cities. Frost Sullivan Value Propos. 1(1), 42–49 (2019) 2. Wang, Y., Chen, Q., Zhu, Q., Liu, L., Li, C., Zheng, D.: A survey of mobile laser scanning applications and key techniques over urban areas. Remote Sens. 11(13), 1540 (2019). https:// doi.org/10.3390/rs11131540 3. Lee, K.W., Son, H.W., Kim, D.I.: Drone remote sensing photogrammetry. Seoul Goomib (2016) 4. No, H.-S., Baek, T.-K.: Real-time underground facility map production using Drones. J. Korean Assoc. Geogr. Inf. Stud. 20(4), 2016–2017 (2017). https://doi.org/10.11108/kagis. 2017.20.4.039 5. Guisado-Pintado, E., Jackson, D.W.T., Rogers, D.: 3D mapping efficacy of a drone and terrestrial laser scanner over a temperate beach-dune zone. Geomorphology 328, 157–172 (2019). https://doi.org/10.1016/j.geomorph.2018.12.013 6. Krasikov, I., Kulemin, A.N.: Analysis of digital twin definition and its difference from simulation modelling in practical application. KnE Eng. 105–109 (2020). https://doi.org/10.18502/ keg.v5i3.6766 7. Tao, F., et al.: Digital twin-driven product design framework. Int. J. Prod. Res. 57(12), 3935– 3953 (2019). https://doi.org/10.1080/00207543.2018.1443229


8. Sofia, H., Anas, E., Faiz, O.: Mobile mapping, machine learning and digital twin for road infrastructure monitoring and maintenance: Case study of mohammed VI bridge in Morocco. In: Proceedings - 2020 IEEE International Conference of Moroccan Geomatics, MORGEO 2020 (2020). https://doi.org/10.1109/Morgeo49228.2020.9121882 9. Tao, F., Cheng, J., Qi, Q., Zhang, M., Zhang, H., Sui, F.: Digital twin-driven product design, manufacturing and service with big data. Int. J. Adv. Manufact. Technol. 94(9–12), 3563–3576 (2017). https://doi.org/10.1007/s00170-017-0233-1 10. Mina. – SoFlo Muslims. https://soflomuslims.com/mina/. Accessed 01 Oct 2020 11. Stark, R., Israel, J.H., Wöhler, T.: Towards hybrid modelling environments—Merging desktop-CAD and virtual reality-technologies. CIRP Ann. 59(1), 179–182 (2010). https:// doi.org/10.1016/j.cirp.2010.03.102 12. Majid, A.R.M.A., Hamid, N.A.W.A., Rahiman, A.R., Zafar, B.: GPU-based optimization of pilgrim simulation for Hajj and Umrah rituals. Pertanika J. Sci. Technol. 26(3), 1019–1038 (2018) 13. Nee, A.Y.C., Ong, S.K.: Virtual and augmented reality applications in manufacturing. IFAC Proc. 46(9), 15–26 (2013). https://doi.org/10.3182/20130619-3-RU-3018.00637 14. Shirowzhan, S., Tan, W., Sepasgozar, S.M.E.: Digital twin and CyberGIS for improving connectivity and measuring the impact of infrastructure construction planning in smart cities. ISPRS Int. J. Geo-Inf. 9(4), 240 (2020). https://doi.org/10.3390/ijgi9040240

VR in Heritage Documentation: Using Microsimulation Modelling Wael A. Abdelhameed(B) Applied Science University, Building 166, Road 23, Block 623, P.O. Box 5055, East Al-Ekir, Kingdom of Bahrain [email protected], [email protected]

Abstract. There are different digital methods and technologies that offer various applications and techniques in the field of heritage documentation and archiving. Virtual Reality (VR) is one of those technologies. VR has not been used intensively in heritage documentation, although its functions provide more potential than other digital methods and technologies. Through a literature review, the study classifies and reports the different digital methods and technologies used in architectural heritage documentation. The study then presents the details of a VR function developed by the author. The newly developed VR function is utilized in the heritage documentation of a case study. The case study is an existing archaeological site and its monuments, including their digital heritage resources of different formats, such as texts and images. The VR function makes it possible to control the display of different narrations and chronological changes of the case study, or part of it, in one VR model. Moreover, the VR function enables the user to visualise a certain era or a certain model location, with a display of the relevant digital heritage resources, whether text or image. The study concludes by demonstrating the effectiveness of the VR function. Keywords: Virtual reality · Microsimulation modelling · Architectural heritage · Heritage documentation · Microsimulation function

1 Introduction
The use of digital technology in heritage documentation has become inevitable due to the many advantages it provides. In the field of architectural heritage there is no global standardisation of heritage documentation, although several national institutions that deal with documentation at the national level have developed a few data standards. The main advantage of the 3D digital reconstruction of archaeological sites and historical buildings is the potential to document and archive the monuments in a more precise and efficient way.

2 Research Scope
The study focuses upon creating a VR tool for architectural heritage documentation; it introduces a new VR function inside the microsimulation method (microsimulation player) of a commercial VR programme. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 K. Arai (Ed.): Intelligent Computing, LNNS 284, pp. 1115–1123, 2021. https://doi.org/10.1007/978-3-030-80126-7_78


The study utilizes a case study and develops its VR model, including a model of the archaeological site, models of the changes that occurred in the site terrain, and a few monument models divided into different parts according to different eras, as well as heritage resources of different formats. The VR function makes it possible to chronologically display the substantial stages of excavation and reconstruction of the site's terrain and monuments, and to simultaneously link the VR display to the heritage resources, texts or images, according to their associated time and location.

3 Research Objectives
The research has the following two objectives:
• Developing a new VR function that controls the VR visualisation of the model's parts, both monuments and site, as well as the VR visualisation of some heritage resources.
• Applying the new VR function to the case study's VR model, to prove its effectiveness in fulfilling the requirements of the heritage documentation process, i.e. linking the VR display of chronological changes to their heritage resources.

4 Literature Review
Architectural heritage documentation aims to enable users to access and visualise heritage information. Its processes vary based on the monument's status: accessibility, size, complexity and existence. Besides the foregoing variables of the architectural monuments, there are other variables resulting from the available resources, which accordingly guide the selection of the modelling method and the technology used. The digital reconstruction method is the process of digitally generating the 3D model of an object by using the object's appearance, depth and colour [1]. Due to the large amount of data in a single project file, Boeykens et al. [2] had to divide the reconstruction process of architectural heritage into two stages, structure and detailing; the latter was also divided further into different files. Methods of parametric modelling are used in historical reconstruction [3]. A procedural modelling method has been used in urban modelling by Parish and Müller [4], and in virtual heritage modelling by Müller et al. [5]. El-Hakim et al. [6] introduced a method to reconstruct and model large-scale heritage sites in which several technologies are employed: image-based modelling for basic forms and structures, laser scanning for fine details and surfaces, and image-based rendering for landscapes and surroundings. VR technology is employed in a wide variety of applications and visual experiences. VR technology comprises several fields, namely augmented reality (AR), virtual reality, and mixed reality, the last of which refers to a combination of real and virtual elements that the user can interact with, or within which the user can even virtually exist inside the model. Sylaiou et al. [7] presented a system that integrates augmented reality with an interface tool and software to create web-based virtual museum exhibitions.


A number of techniques are associated with AR systems, such as photogrammetry, range-based modelling, image-based rendering and position-orientation tracking [8]. From the literature review, various methods can be identified, some of which can be combined into one heritage documentation method based on the requirements and resources. In addition, one disadvantage is apparent, namely the large file size, which forces the documentation file to be divided into separate files.

5 Details of the Case Study
The research paper employs a case study to prove the effectiveness of the new VR function. The case study is a 5000-year-old site that includes two main forts and different structures, some of which were built over the ruins of older structures. Although many archaeological ruins with different functions are identified: residential, public, and commercial, this research work and its VR model focus upon the main architectural monuments:
• the main architectural monument, on top of the 12-m-high mound,
• the coastal fortress, and
• different terrains of different heights, representing a few excavation stages.

6 Advantages of the VR Function
Using the new VR function results in one VR model/platform with the following advantages:
• The VR platform includes different model parts inside the same file, such as the different terrains of the site and the different models of each monument representing the changes that occurred through time periods. Each model part of the main VR model reflects a time period and concurrently solves the issue of the large model size.
• The VR model acts as an architectural heritage documentation tool. Different narrations of the archaeological site, as well as of the monuments, can be displayed, for example the excavation of the terrain and the reconstruction of the different monuments.
• The VR model documents and displays heritage resources of different formats, such as images and textual explanations, according to their relevant time and location in the VR model narration.

7 VR in Architectural Heritage Documentation
7.1 VR Modelling Process
To achieve flexibility in the heritage documentation for displaying different narrations, the different parts/sub-components of the site and its structures/monuments are modelled in separate files in the 3ds Max programme, including the different terrains and the related historical changes of the structures/monuments.


The files of the digital models, the parts/sub-components, are separately imported with their textures into the VR programme to make one VR model file. Developing the VR model of the case study site and its monuments starts with making the different terrains through the historical periods, and proceeds by adjusting each monument's parts according to their coordinates in the site. The monuments' changes are added to the VR model as well. The links between these model parts and the related heritage resources are created and stored in the VR model file. A new VR function, therefore, is needed in the microsimulation player of the VR programme to control the display of these model parts/sub-components based on their chronology and topography. Images of the main narration of the case study are shown in Fig. 1.
7.2 Details of the New VR Function
The microsimulation modelling inside the VR platform allows the execution of XML algorithms. The researcher developed the new function from an older version [9] using XML code to control the visualisation of the models' sub-components. The method used in the XML code is to define three variables for each sub-component inside the model: time, position, and direction with angle. The VR function uses the same sub-component IDs as the VR model. The VR function controls the display time of each sub-component, which is usually set up according to the chronological order and can easily be changed in the XML code to display another narration. The display position of each sub-component is set up using the spatial parameters X, Y, and Z inside the site, as well as the camera position inside the programme, Fig. 2. The time-control window inside the microsimulation player of the VR programme makes it possible to pause the visualisation process, which allows the user to navigate any part of the VR model at any time point, Fig. 3. The navigation can be through walkthrough, flight mode or programme-camera control.
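The following is a rough Python illustration of the scheduling logic just described, not the actual XML executed by the microsimulation player: each sub-component carries a display time, a position, and a direction, and its visibility at a given playback time follows the chronological order of the narration. All identifiers and values are hypothetical.

```python
# Illustrative sketch (not the paper's XML): per-sub-component scheduling
# with a display time window, a position, and a direction/angle. Visibility
# at playback time t follows the chronological order of the narration.
from dataclasses import dataclass

@dataclass
class SubComponent:
    component_id: str      # same ID as used in the VR model
    start_time: float      # playback time (s) at which it becomes visible
    end_time: float        # playback time (s) at which it is hidden again
    position: tuple        # (X, Y, Z) inside the site
    direction_deg: float   # orientation angle

def visible_components(schedule, t):
    """Return the sub-components that should be displayed at playback time t."""
    return [c for c in schedule if c.start_time <= t < c.end_time]

# Hypothetical narration: terrain before excavation, then the reconstructed main fort.
schedule = [
    SubComponent("terrain_pre_excavation", 0.0, 60.0, (0.0, 0.0, 0.0), 0.0),
    SubComponent("main_fort_reconstructed", 60.0, 180.0, (12.5, 40.0, 12.0), 90.0),
]
print([c.component_id for c in visible_components(schedule, 75.0)])
```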


(a) The Archaeological Site before Excavation

(b) During Excavation Processes

(c) During Reconstruction Processes

(d) Zoom in to Show the Two Forts and Central Excavation Area

Fig. 1. VR Model of the archaeological site showing its main structures, screenshots of the microsimulation player window of the VR programme.


(a) The Main Fort During Reconstruction Processes from a Certain Viewpoint

(b) The Main Fort after Reconstruction Processes from the Same Viewpoint
Fig. 2. The VR model chronologically showing the reconstruction process of the main fort of the case study, screenshots of the microsimulation player window of the VR programme.

7.3 Microsimulation Modelling
Using microsimulation modelling, by dividing the monuments' models into sub-components, makes it possible to document and display the sub-components and their related heritage resources. The control of the sub-components inside the VR model and the microsimulation player is achieved by XML algorithms. The algorithms of the new VR function control the display time of the heritage resources inside the VR model, in relation to the programme-camera position and the microsimulation player time. Using the VR function and microsimulation modelling overcomes the disadvantage of the large model size that appeared in the literature review. The effectiveness of the new VR function manifests itself in the control of the different monuments and parts of one VR model, as well as in linking the VR display of the heritage resources to their time period and in-model location, Fig. 4. The VR function has been adapted and included in the commercial version of the VR programme.


(a) Inside View of the Main Fort in the Walkthrough Mode of the VR Platform

(b) Outside View of the Coastal Fort in the Flight Mode of the VR Platform
Fig. 3. Walkthrough and flight modes in the VR model to display the model details, screenshots of the microsimulation player window of the VR programme.

Fig. 4. Heritage resources’ photos during the reconstruction process of the coastal fort, screenshots of the microsimulation player window of the VR programme.


8 Conclusion
The advantages of using microsimulation modelling in architectural heritage documentation are highlighted. The study proves the effectiveness of the new VR function. The presented VR function can be employed in disciplines and areas other than architectural heritage documentation, such as:
• developing an educational and learning tool for architectural history, or even for history in general, which offers a more attractive method for learners and educators than conventional methods. An interaction of the users with the proposed tool for education and learning can be developed, in order to address all the different learning styles: visual, aural, read/write and kinesthetic.
• making AR tours of archaeological sites and monuments. The user can be physically in the site, or away from it through a virtual model and the Internet. The physical presence of the user combined with the virtual presence in different eras, inside the architectural heritage model, introduces a mixed use of VR and AR.

Acknowledgments. I would like to express my appreciation to the Forum8 Company and the software developers for the assistance they offered in updating the Micro-simulation Plug-in of the programme to include my VR function.

References 1. Gomes, L., Bellon, O., Silva, L.: 3D reconstruction methods for digital preservation of cultural heritage: a survey. Pattern Recogn. Lett. (50), 3–14. Elsevier (2014). https://www.sciencedi rect.com/science/article/pii/S0167865514001032. Accessed 6 March 2020 2. Boeykens, S., Himpe, C., Martens, B.: A case study of using BIM in historical reconstruction, the Vinohrady synagogue in Prague. In: Digital Physicality, Physical Digitality, eCAADe and CVUT, Faculty of Architecture, pp. 729–738 (2012). https://core.ac.uk/download/pdf/ 34531367.pdf. Accessed 6 March 2020 3. Chevrier, C., Perrin, J.: Generation of architectural parametric components: cultural heritage 3D modeling. In: Tidafi, T., Dorta, T. (eds) Joining Languages, Cultures and Visions: CAADFutures pp. 105–118 (2009) 4. Parish, Y., Müller, P.: Procedural modeling of cities. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques - SIGGRAPH, ACM Press, Los Angeles, California, USA, pp. 301–308 (2001) 5. Müller, P., Vereenooghe, T., Wonka, P., Paap, I., Van Gool, L.: Procedural 3D reconstruction of puuc buildings in Xkipché. In: Proceedings of the 7th international Symposium on Virtual Reality, Archaeology and Intelligent Cultural Heritage - VAST2006, Nicosia, Cyprus, pp. 139– 146 (2006) 6. El-Hakim, S., Beraldin, J., Picard, M., Godin, G.: Detailed 3D reconstruction of large-scale heritage sites with integrated techniques. IEEE Comput. Graphics Appl. 24(3), 21–29 (2004). https://ieeexplore.ieee.org/abstract/document/1318815/. Accessed 6 March 2020 7. Sylaiou, S., Mania, K., Karoulis, A., White, M.: Exploring the relationship between presence and enjoyment in a virtual museum. Int. J. Hum.-Comput. Stud. 68(5), 243–253. (2010). Elsevier, https://www.sciencedirect.com/science/article/pii/S1071581909001761. Accessed 3 June 2020


8. Noh, Z., Shahriza, M., Sunar, M., Pan, Z.: A review on augmented reality for virtual heritage system, 50–61 (2009). Springer, Berlin, Heidelberg. https://link.springer.com/chapter/https:// doi.org/10.1007/978-3-642-03364-3_7. Accessed 3 June 2020 9. Abdelhameed, W.A.: Micro-simulation function to display textual data in virtual reality. Int. J. Archit. Comput. 10(2), 205–218 (2012)

MQTT Based Power Consumption Monitoring with Usage Pattern Visualization Using Uniform Manifold Approximation and Projection for Smart Buildings Ray Mart M. Montesclaros(B) , John Emmanuel B. Cruz, Raymark C. Parocha, and Erees Queen B. Macabebe Department of Electronics, Computer, and Communications Engineering, School of Science and Engineering, Ateneo de Manila University, 1108 Quezon City, Philippines [email protected], [email protected]

Abstract. The persistent electricity price hikes in the Philippines and the impact of climate change on the country's energy demands adversely affect consumers' finances on a regular basis. These form the foundation for energy efficiency initiatives in building-based electricity consumption. Such initiatives encompass a wide range of innovations, from energy generation to real-time monitoring, management and control. At its core, this study supplements institutional efforts on energy management through an electricity consumption monitoring system. A non-intrusive data acquisition system that monitors the aggregate electricity consumption and visualizes the appliance usage patterns in a building setting was developed. This Non-Intrusive Load Monitoring (NILM) technique allowed acquisition of data from a single point of measurement using only a single sensor clamped to the main powerline. The acquired data were streamlined to communicate with the IoT OpenHAB framework via the Message Queuing Telemetry Transport (MQTT) protocol, and the system was implemented through an actual deployment. Lastly, Uniform Manifold Approximation and Projection (UMAP) was used for dimension reduction. UMAP was applied to the raw time series data of the aggregate power consumption in order to visualize and determine appliance usage patterns and to effectively label data instances through a scatter plot. Keywords: Energy monitoring · NILM · IoT · OpenHAB · Smart buildings · UMAP · Dimension reduction

1 Introduction The emergence of “Smart Living” paved the way for people to use the available technologies to enhance their living experience. Two key concepts are closely related in Smart Living; Internet of Things (IoT) and Machine Learning. Incorporating the two allows improvements in the quantity and quality of data gathered and turns this data into useful and actionable information. The concept of smart living gave way to other smart © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 K. Arai (Ed.): Intelligent Computing, LNNS 284, pp. 1124–1140, 2021. https://doi.org/10.1007/978-3-030-80126-7_79


concepts, one of which is smart buildings. A smart building is any structure that makes efficient use of automated processes in order to optimize and minimize energy usage [1]. To make these optimizations possible, accurate and reliable data are crucial; they can be obtained through monitoring systems while also considering cost-efficient setups. Philippine energy consumption has been rising continuously over the past few years. Data show that electricity consumption has increased by 4% to 6% compared to the previous year [2]. Electric meters are usually positioned at locations that are hard to reach or concealed, which makes regular manual monitoring inconvenient. This prevents consumers from obtaining information and leaves them incapable of managing their energy usage efficiently. In order to address these issues, load monitoring systems have been developed so that consumers can obtain their power data and have a better understanding of their electricity consumption behavior [3]. The system utilized Non-Intrusive Load Monitoring (NILM), a low-cost, non-intrusive energy measuring approach which records the aggregate power consumption of a residence or a building without interfering with the existing electrical system. This study presents the design and development of a data acquisition system for electricity consumption using NILM. This system was integrated with the MQTT protocol, and the acquired data were streamlined to communicate with the OpenHAB framework for ease of access. Uniform Manifold Approximation and Projection (UMAP) was employed for dimension reduction in order to visualize and determine appliance usage patterns and to effectively label data instances through a scatter plot based on the raw time series data of the aggregate power consumption.

2 Theoretical Background
2.1 Total Aggregated Power Consumption
The total aggregated power consumption can be obtained from the individual power consumption of each appliance by:

$p_t(t) = \sum_{i=1}^{n} p_i(t) + e(t)$ (1)

where p_t(t) is the aggregated power consumption, p_i(t) is the power consumption of appliance i, and e(t) represents measurement errors and line loss [2]. The aggregated power consumption is thus simply the summation of the power consumption of each individual appliance. Figure 1 shows the power consumption patterns of individual appliances and of combinations of these appliances. Each appliance displays a unique characteristic called a "digital signature". As shown in Fig. 1, some appliances consume a lot more power than others, while some have several states with different power consumption. The digital signature of each appliance can be used to determine which specific appliance contributed to the aggregate power consumption [3]. These digital signatures can be attributed to the operational state of the appliance. Type-I, single-state or two-state, appliances operate in two states. Type-II, multi-state, appliances or finite-state machines (FSM) have more than two operating states, where each state has a different power consumption from the others. Type-III, continuously varying state, appliances do not have a well-defined, fixed number of states [4].
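For concreteness, a tiny numerical illustration of Eq. (1) is given below; the appliance wattages and the error term are invented for the example.

```python
# Tiny numeric illustration of Eq. (1): the aggregate reading is the sum of the
# individual appliance loads plus a measurement/line-loss term (values invented).
appliance_watts = [750.0, 1200.0, 60.0, 18.0]   # e.g. AC, computer row, projector, light
error_watts = 4.5                                # e(t): measurement error and line loss
p_total = sum(appliance_watts) + error_watts
print(p_total)  # 2032.5 W
```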


Fig. 1. Power vs time plot of the total power consumption [3].

2.2 MQTT
MQTT stands for Message Queuing Telemetry Transport, a machine-to-machine (M2M)/'Internet of Things' connectivity protocol designed as an extremely lightweight publish/subscribe message transport [5, 6]. MQTT works on top of the TCP layer and delivers messages from node to server; it is ideally used for (Internet of Things) IoT nodes which have limited capabilities and resources [7]. It consists of two message sets on a connection, 'Publish' and 'Subscribe' [8]. Any data published under a certain topic is received by any entity subscribed to that topic.
2.3 Time Series Clustering and Dimensionality Reduction
A time series is a series of data points that is ordered or graphed with respect to time. It is a sequence of data points taken at equal intervals of time [9]. Time series clustering refers to methods in which long time series data sets are clustered so that the data characteristics and features can be extracted [10]. Dimensionality reduction is the process of reducing the dimensions of a time series to enhance the efficiency of extracting patterns in the data [11]. A time series C of length n can be reduced to w dimensions by:

$\bar{C}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} C_j$ (2)

where $\bar{C}_i$ is the piecewise aggregate approximation of the time series vector (a short numerical sketch follows at the end of this subsection). Using this formula, the data is divided into w equal-sized frames. The mean value of the data that falls within each frame is calculated, and the vector of the calculated values becomes the data-reduced representation [12]. Uniform Manifold Approximation and Projection (UMAP) is a manifold learning technique developed for dimension reduction. In dealing with high dimensional data, UMAP creates a topological representation using local manifold approximations, and


then patches together their local fuzzy simplicial set representations, which are constructed from two processes: (1) approximation of a manifold on which the data is assumed to lie, and (2) construction of a fuzzy simplicial set representation of the said manifold. Given a low dimensional representation of the data, the same set of processes is used to construct an equivalent topological representation. UMAP will then optimize the layout of the data representation in low dimensional space, so as to minimize the cross-entropy between the two representations [13].
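Returning to Eq. (2), the short numpy sketch below implements the piecewise aggregate approximation it defines, under the simplifying assumption that the series length is an exact multiple of the target dimension w.

```python
# Minimal sketch of piecewise aggregate approximation (Eq. 2): a series of
# length n is reduced to w frame means. Assumes n is divisible by w.
import numpy as np

def paa(series: np.ndarray, w: int) -> np.ndarray:
    n = len(series)
    assert n % w == 0, "this sketch assumes n is a multiple of w"
    # Each of the w rows holds n/w consecutive points; the row mean is C-bar_i.
    return series.reshape(w, n // w).mean(axis=1)

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 14.0, 16.0])
print(paa(x, 2))  # [ 2.5 13. ]
```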

3 Methodology
3.1 System Architecture
The power consumption monitoring system was developed to measure the electricity consumption of an electrical network. Figure 2 shows the power consumption monitoring system architecture, which consists of six main parts. The locally connected devices in a computer laboratory within a school building, where the system was deployed, send their aggregate current reading to the

Fig. 2. Architecture of the power consumption monitoring system.


Data Acquisition Module via a non-invasive current sensor. The module then sends the data to a Raspberry Pi (RPi), which serves as the IoT gateway. The data, which contain power readings, undergo preprocessing for classification using dimension reduction via UMAP. From the IoT gateway, the total power consumption readings and the visualized usage patterns are sent and displayed in a Grafana instance using Node-RED via MQTT. The dashboard then displays the individual power consumption readings of each sensor and the total power consumption in kWh.
3.2 Hardware System Setup

Fig. 3. Hardware system block diagram.

The block diagram for the system’s hardware setup is shown in Fig. 3 and is based on the Semi-Automated Data Acquisition (SADA) system used by Bugnot and Macabebe [14] with a few modifications. The socket array unit and Arduino microcontroller were removed, and the data acquisition module was designed to read current values from a 3-phase power supply. The module allows non-intrusive gathering of current readings and calculates the aggregate power consumption for pre-processing and time-series clustering. The aforementioned values were sent through a localized network with static IP connection. Since Node-Red runs locally and subscribes to the same topic as the one being published by the data acquisition module, the logging unit stores the received and parsed data in InfluxDB. Lastly, the data stored is then displayed in a Grafana dashboard. The setup is divided into three main blocks based on their corresponding functionalities, namely: the sensor unit, the network unit and the logging unit. The sensor unit is composed of three SCT-013-050 non-invasive split-core current sensors and an ESP32 microcontroller. This unit was used to gather aggregate current data from the circuit breaker. The sensor has a rated input range of 0 A to 50 A rms, but can handle up to 60 A rms, and an output of 1 V AC rms. The sensor has a full scale accuracy of ±1% and operating frequencies of 50 to 1000 Hz. The aggregate power consumption was obtained by assuming 220.63 VAC as the voltage multiplier to the aggregate current. This voltage multiplier was the measured reading from the digital multimeter.
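As a sketch of the publishing side of this pipeline, the snippet below mirrors the sensor-unit logic in Python using the paho-mqtt package: the three RMS phase currents are combined, scaled by the measured 220.63 V multiplier, and published to the broker. The actual firmware runs on the ESP32, and the way the three currents are combined, the broker address, and the topic name are assumptions of this sketch.

```python
# Sketch of the publishing side (the real device is an ESP32; this mirrors its
# logic in Python, using the paho-mqtt 1.x style client API). Broker address,
# topic, and the current-combination rule are assumptions of this sketch.
import paho.mqtt.client as mqtt

VOLTAGE = 220.63          # measured line voltage used as the multiplier (V)
BROKER = "192.168.1.10"   # hypothetical local broker (the RPi gateway)
TOPIC = "lab/power"       # hypothetical topic

def aggregate_power(i1: float, i2: float, i3: float) -> float:
    """Aggregate power (W) from the three RMS phase current readings (A)."""
    return (i1 + i2 + i3) * VOLTAGE

client = mqtt.Client()
client.connect(BROKER, 1883, keepalive=60)
client.publish(TOPIC, f"{aggregate_power(8.2, 7.9, 1.1):.2f}", qos=1)
client.disconnect()
```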


The network unit is mainly composed of a switch and a router. This unit was intended for the purpose of transferring data packets between multiple network devices. This setup allows for easy access when troubleshooting connectivity issues and connects to the RPi remotely via ssh. Once the data from the ESP32 was successfully received by the RPi, the Node-Red instance would then parse the received character array data type to its corresponding double-precision value, and then store the parsed data to the database. To display the necessary metrics, Grafana was used. It loads the data that was stored in the database in InfluxDB, and outputs a real-time visualization for the current reading of each sensor, and the aggregate power consumption. Figure 4 shows the user interface data visualization using Grafana.

Fig. 4. Real time data visualization using Grafana.
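In the deployed system this parsing and logging is done in Node-RED; the sketch below shows an equivalent flow in Python using paho-mqtt and the InfluxDB 1.x client. Host names, the topic, and the measurement name are hypothetical.

```python
# Equivalent, in Python, of the Node-RED flow described above: subscribe to the
# power topic, parse the character payload to a float, and store it in InfluxDB.
# Host names, topic, and measurement name are hypothetical; paho-mqtt 1.x API.
import paho.mqtt.client as mqtt
from influxdb import InfluxDBClient  # InfluxDB 1.x client

db = InfluxDBClient(host="localhost", port=8086, database="power_monitoring")

def on_message(client, userdata, msg):
    watts = float(msg.payload.decode())          # parse character array -> double
    db.write_points([{
        "measurement": "aggregate_power",
        "fields": {"watts": watts},
    }])

sub = mqtt.Client()
sub.on_message = on_message
sub.connect("localhost", 1883)
sub.subscribe("lab/power")
sub.loop_forever()
```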

3.3 Software System Flow
The flow diagram for the software system is shown in Fig. 5. The data acquisition portion sends the raw time series data for pre-processing. The pre-processing portion converts the raw data to a form that can be used as input to the UMAP object. Once the preprocessed data is passed as an argument, the reducer is trained to learn the manifold. UMAP utilizes sklearn's API, so the fit method indicates which dataset the model needs to learn from [15]. The data points are then embedded on a 2D plane. The usage patterns of the appliances are then visualized using a standard scatterplot. Visualization was done by clustering readings with similar patterns using the processed raw time series data. The pre-processing methodology for dimension reduction on temporal data for visual analytics was used. The first part performs unity-based normalization, which sets the raw data to a common scale without distorting the differences in the values [11].


Fig. 5. Software flow diagram.

After the values were normalized, the Sliding Window Technique was implemented. This technique was done by incrementing the bounds of the observation window in the time-domain and as a result, also sliding the observation window across the length of the collected data-sample. This converts a 1D stream of data with N columns into a (N-(W-1)) x W matrix where W is the window size. It was observed that as the window size increases the clusters become more indistinguishable. This can be attributed to the fact that as the size of the observable window increases, more unique functions are taken into consideration making it more difficult to cluster data points with similar patterns. Since the window size was up to the proponents’ discretion, four window sizes were analyzed (2, 4, 60, 300). Among these sizes, a window size of 4 produced the most distinguishable clustering with the given dataset. In this case, the researchers used a window size of 4 which produced an (N − 3) × 4 matrix. After this, the resultant matrix from the sliding window technique underwent dimension reduction. The purpose was to provide additional visual analysis and to reduce the feature space into a 2D plane without compromising the shape characteristics of the original data. Through dimension reduction, the structure of the original data is preserved and it allows us to draw out certain intuitions about the data itself through visual means. This task can easily be done by using UMAP. After instantiating the UMAP class, the reducer was then trained to learn about the manifold. The fit method accepts the resultant matrix in order for the model to learn from it. The reducer accepts certain parameters before embedding the data into a 2D plane. After the reducer was trained, the transformed data was visualized. This was done by the transform method available in UMAP. After which, a UMAP projection was displayed by plotting the resulting embedding.
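A minimal sketch of the embedding step described above is given below, assuming the umap-learn and matplotlib packages and using the parameter values reported in Sect. 4.3; the windowed matrix is stood in for by random data.

```python
# Minimal sketch of the embedding step: fit UMAP on the (N-3) x 4 window matrix
# and scatter-plot the resulting 2D embedding. `windows` stands in for the real
# sliding-window output; parameter values follow Sect. 4.3.
import matplotlib.pyplot as plt
import numpy as np
import umap

windows = np.random.rand(1000, 4)   # placeholder for the real (N-3) x 4 matrix

reducer = umap.UMAP(n_neighbors=15, min_dist=0.0, random_state=42)
reducer.fit(windows)                 # learn the manifold
embedding = reducer.transform(windows)

plt.scatter(embedding[:, 0], embedding[:, 1], s=2)
plt.title("UMAP projection of windowed power readings")
plt.show()
```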


4 Results and Discussion 4.1 Data Collection Results The data acquisition module shown in Fig. 2 was deployed in a computer laboratory within the university. The computer laboratory had the following devices: 100 AT and 100 AF circuit breaker, (1) switch, (1) router, (1) projector, (2) air-conditioning units, (4) lights, and (24) desktop computers. The data acquisition module requires a power supply with 5 V input voltage and a recommended input current of 2 A. The module operates inside the circuit breaker. This was done by tapping into one of the breaker bus bars. The computer laboratory layout showing the devices in the room is seen in Fig. 6.

Fig. 6. Computer laboratory room layout.

Current Sensor Readings. Three current sensors were deployed and labeled, one for each corresponding phase wire inside the power box. Figure 7 shows the location of the current sensors connected to the wires in the circuit breaker box. Looking at the figure and considering the current readings of the three phase wires, the right and middle phase wires follow the same consumption trend, which can be attributed to the fact that the breaker switches for the appliances were supplied by these phase wires, making them contribute the most to the total power consumption. Each phase wire can supply 110 V, so two phase wires are needed to supply 220 V for each switch. The left phase wire shows far less current consumption and a near-constant reading, different from the trend of the other two phase wires.


Fig. 7. Current sensors connected to the phase wires inside the circuit breaker box.

Computer Laboratory Power Consumption Readings. Power consumption data were gathered, and Fig. 8 shows a sample of the room's power consumption within a day, where trends can be observed. A general observation from the gathered data is that there was a spike in power consumption at the start of the day, at around 8:00 am, when all appliances were powered on. This large spike was mainly caused by the air-conditioning units, which take time to stabilize given the initial temperature within the room. Moreover, the power consumption of the room is relatively stable, but a rising and falling trend can be observed. It can be attributed to the compressor of the air conditioner, which cycles on and off to complete a cooling cycle. Lastly, the schedule of the room greatly affects the power consumption, as consumption rises when a class is being held inside and the appliances are being used.

Fig. 8. Computer laboratory power consumption.

Cost Computation. One of the objectives was to design and develop a system that monitors building-based electricity consumption. To do this, it is imperative for the system to provide its users not only the necessary power consumption metrics but also their corresponding costs.


The cost computation is based on certain assumptions and estimations obtained from the summary of scheduled rates published by the electric utility company [16]. The service rate is used by Meralco to classify its customers and to compute their corresponding bills based on the approved rate schedule [17]. The service rate classification used for the computer laboratory was General Service A (GS-A) under the 0–200 kWh criteria. Table 1 lists the charges and their corresponding rates used to calculate the electricity rate per kWh. Calculating the total energy consumption for the power consumption data shown in Fig. 8, the total energy consumed during this day was 59.078 kWh, amounting to Php 480.51.

Table 1. Summary of charges and scheduled rates. Service rate type: General Service A (GS-A), 0–200 kWh. Rates in Php/kWh.
Generation Charge: 4.6632
Transmission Charge: 0.7859
Distribution Charge: 1.0012
Supply Charge: 0.5085
Metering Charge: 0.3377
System Loss Charge: 0.3528
Universal Charge UC-ME: 0.1561
Universal Charge UC-EC: 0.0025
Universal Charge UC-SCC: 0.1453
Universal Charge UC-SD: 0.0428
Fit-All (Renewable): 0.0495
Lifeline Rate Subsidy: 0.0880
Senior Citizen Subsidy: 0.0000
TOTAL RATE: 8.1335
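As a quick arithmetic check of the figures above, the total rate is simply the sum of the listed charges, and the day's cost follows from the 59.078 kWh consumption:

```python
# Arithmetic check of Table 1 and the reported daily cost.
charges = {
    "generation": 4.6632, "transmission": 0.7859, "distribution": 1.0012,
    "supply": 0.5085, "metering": 0.3377, "system_loss": 0.3528,
    "uc_me": 0.1561, "uc_ec": 0.0025, "uc_scc": 0.1453, "uc_sd": 0.0428,
    "fit_all": 0.0495, "lifeline": 0.0880, "senior_citizen": 0.0000,
}
total_rate = sum(charges.values())        # Php/kWh
daily_kwh = 59.078
print(round(total_rate, 4))               # 8.1335
print(round(daily_kwh * total_rate, 2))   # ~480.51 Php
```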

Test Cases. There were eight random combinations that were considered as test cases. For each test case, data were gathered for 15 min and the total power consumption in each case was obtained. The purpose was to gather data labels that characterize usage patterns of the appliances. Table 2 shows the combinations of appliances present for each case. Note that there are four desktop computers in a row as seen in Fig. 6.

Table 2. List of test cases and the state of the appliances (aircon: number of units and set temperature; lights and projector: on/off; computer rows: number of rows of four desktop computers in use).
Case 1: Aircon 2 at 16 °C, Lights On, Computer Rows 6, Projector On
Case 2: Aircon 1 at 16 °C, Lights On, Computer Rows 6, Projector On
Case 3: Aircon 0, Lights On, Computer Rows 6, Projector On
Case 4: Aircon 2 at 23 °C, Lights Off, Computer Rows 4, Projector On
Case 5: Aircon 2 at 23 °C, Lights Off, Computer Rows 0, Projector Off
Case 6: Aircon 2 at 23 °C, Lights On, Computer Rows 0, Projector On
Case 7: Aircon 1 at 16 °C, Lights On, Computer Rows 6, Projector On
Case 8: Aircon 0, Lights Off, Computer Rows 4, Projector Off

Fig. 9. Power consumption vs. time plots of (a) Case 1 and (b) Case 3.


Figure 9 shows the power consumption graphs of (a) Case 1 and (b) Case 3. Comparing the graphs of Cases 1 and 3, it can be seen that the major contributors to the power consumption of the room are the air conditioners. Looking at the difference in power consumption, the contribution of the air conditioners alone totals 9,000 W to 10,000 W.
4.2 Preprocessing Results
The raw time series data were preprocessed to convert them to a form that can be used as input to the UMAP object. In this section, the outputs of the unity-based normalization and the sliding window technique are discussed and analyzed. Normalization. The type of normalization used for the raw time series data was unity-based normalization. The purpose is to set all the values in the dataset within the range [0, 1]. Figure 10 shows an example of normalized power consumption data.

Fig. 10. Normalized power consumption readings.

Since there were 466,629 data points, a subset was also normalized to improve the speed of the analysis. In this case, the subset spanned data points from 12:00 PM until the last element of the training dataset. This was observed to be the time where the data acquisition module had the most stable readings. The whole dataset was split into two, 80% of the dataset was used for training and 20% of the dataset was used for testing. Sliding Window Technique. This process converts a 1D stream of data with N columns into a (N − (W − 1)) by W matrix where W is the window size. For this study, a window size of 4 was chosen and produced an (N − 3) × 4 matrix. Increasing the window size to a higher value drastically increased processing time and as a result slowed down the embedding and visualization process. The rationale behind using this technique was to extract more information about the behavior and the usage patterns of the appliances being used. More information can be extracted when the dataset is a continuous stream of points within a certain window size compared to independent or discrete data points within that same window size. The sliding window technique was able to preserve the original behavior of the dataset and can be used as input to the UMAP object.
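A minimal numpy sketch of these two pre-processing steps, unity-based (min-max) normalization followed by a window of size W = 4, is shown below; the short input stream is invented for illustration.

```python
# Sketch of the pre-processing described above: unity-based (min-max)
# normalization followed by a sliding window of size W = 4, turning a 1D
# stream of N points into an (N - 3) x 4 matrix.
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min())

def sliding_window(x: np.ndarray, w: int = 4) -> np.ndarray:
    return np.stack([x[i:i + w] for i in range(len(x) - w + 1)])

stream = np.array([120.0, 118.0, 4300.0, 4280.0, 4310.0, 150.0])
windows = sliding_window(normalize(stream), w=4)
print(windows.shape)   # (3, 4)
```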


4.3 Appliance Usage Patterns
After preprocessing, the data served as input to the UMAP object for visualization using the umap.plot package. First, the window size was varied in order to choose a plot that produced the most distinct clusters. For each window size, the following plots were used to visualize the trained UMAP model: (1) points, and (2) connectivity. The connectivity plots were used to visualize how the connectivity of the manifold relates to the resulting 2D embedding, since UMAP is a topological representation of the manifold from which the data were sampled. All plots used random_state = 42, min_dist = 0.0 and n_neighbors = 15.

Table 3. Appliance usage pattern point and connectivity visualizations with increasing window sizes (plot types: points and connectivity; window sizes: 2, 4, 60 and 300).

As seen in Table 3, the clusters become more indistinguishable as the window size increases. This can be attributed to the size of the observable window. As the size of


the observable window increases, more unique functions are taken into consideration, making it more difficult to cluster data points with similar patterns. In this study, four window sizes were analyzed. Among these sizes, a window size of 4 produced the most distinguishable clustering with the given dataset, while a window size of 2 produced clusters that were in close proximity to each other. Figure 11 shows the manifold projection of the power consumption dataset using window size = 4. In the figure, there are four main clusters and nine clusters in total (subclusters included). Since the data labels gathered were limited, partial labeling, or semi-supervised learning, using UMAP was employed. Random masking was performed on the target labels and the unlabeled data points were given a value of −1, which is the sklearn standard for unlabeled target information. After this, supervised learning was applied to the dataset, which learns accordingly by treating entries with −1 as unlabeled.
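A minimal sketch of this partial-labelling step is shown below: points without a test-case label are set to −1, sklearn's convention for unlabelled targets, and the masked vector is passed to UMAP's fit. The data and the masking fraction are placeholders.

```python
# Sketch of semi-supervised UMAP with partially masked labels: points without a
# test-case label are marked -1 (sklearn's "unlabelled" convention) and the
# masked vector is passed as the target to fit_transform().
import numpy as np
import umap

windows = np.random.rand(2000, 4)        # placeholder for the windowed readings
labels = np.random.randint(0, 8, 2000)   # placeholder test-case labels (8 cases)

masked = labels.copy()
unlabelled = np.random.rand(2000) < 0.9  # keep only ~10% of the labels
masked[unlabelled] = -1

reducer = umap.UMAP(n_neighbors=15, min_dist=0.0, random_state=42)
embedding = reducer.fit_transform(windows, y=masked)
print(embedding.shape)                   # (2000, 2)
```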

Fig. 11. Manifold projection of the power consumption dataset using window size = 4, showing 4 distinct clusters.

The embedding seen in Fig. 12 shows the resulting topographical representation when semi-supervised learning is applied to a smaller dataset. The eight data labels used were the appliances turned ON for a duration of 15 min per test case. As seen in the figure, similar data points were closer to each other. Although two ACs contributed highest in terms of power consumption, the cluster was still smaller compared to the “12 computers” case since most of the cases performed included computers being turned ON more frequently than cases which included ACs alone. However, when combining


all the clusters which include ACs, it is apparent that these appliances contribute to most of the power consumption readings inside the computer laboratory.

Fig. 12. Power consumption usage patterns embedded via UMAP using partial labels on a day’s worth of data.

5 Conclusion
This study presented an MQTT based power consumption monitoring system that can measure and display power consumption in real time. The data-acquisition module can measure the aggregate power consumption of a three-phase power supply. Moreover, it was successfully streamlined to the IoT home gateway over a protected local network via the MQTT protocol. The appropriate current and power metrics were then displayed on a Grafana dashboard. The power consumption data were analyzed to extract insights about the behavior of the appliances present in the computer laboratory. The energy consumption in kWh and the corresponding amount were determined using the service rates of the utility company. Using the data from the test cases, the two air conditioning units were identified as the major contributors to the increase and sudden fluctuations of the power readings. Lastly, the raw time series analysis was conducted by applying unity-based normalization and the sliding window technique to the data prior to sending them to the UMAP object for dimension reduction. This helped visualize appliance usage patterns. A variable window and step size provides flexibility, allowing modifications in the dimension reduction process without compromising the original behavior of the dataset. Through the 2D topological representations, it was observed that data instances which exhibited similar patterns tend to form clusters. This can be used as a basis to associate repetitive patterns. On the other hand, observed outliers can be used to streamline the identification of anomalies in the appliances being used inside the vicinity. As future work, categorical label information can be included to perform supervised dimension reduction. This will allow the system to classify appliances in real time


based on the device’s usage pattern. Automating the process of exhausting all possible combinations of appliance activation will reduce the data-collection procedure. Further improvements on testing and selecting other features for appliance recognition can speed up training time, reduce the complexity of the model, and increase the model’s accuracy. Moreover, hardware refinement by limiting and simplifying the design of the data acquisition module can reduce costs. Lastly, when scaling energy monitoring systems for buildings it is imperative to look into IoT fleet management platforms which will allow continuous development and updating of source code on multiple deployed devices. This will also allow admins to monitor multiple devices through a dashboard containing all the necessary metrics. Acknowledgment. This study was funded by the University Research Council (URC) of the Ateneo de Manila University under the URC project, “Development of an Energy Monitoring System for Smart Buildings (Phase 1).”


Deepened Development of Industrial Structure Optimization and Industrial Integration of China’s Digital Music Under 5G Network Technology

Li Eryong1(B) and Li Yukun2

1 Jiangxi University of Finance and Economics, Nanchang 330000, China
[email protected]
2 Universiti Putra Malaysia, 43400 Kuala Lumpur, Malaysia

Abstract. While 4G technology is still developing vigorously, 5G construction is already in full swing. It has attracted the attention of the industry for its advantages of fast transmission rates, low latency and massive Internet of Things connectivity. The shift to 5G will prompt all walks of life to actively deconstruct and restructure their related industries. The deep integration of 3D, VR, AR and AI with 5G technology will bring great changes to the creation, performance, dissemination and appreciation of digital music. This paper analyses the industrial structure changes and future development trends of China’s digital music industry under 5G network technology, aiming to provide some useful new ideas for China’s digital music industry. Keywords: 5G · Digital music industry · AI · Blockchain

1 Development Background and Status Quo of China’s Digital Music Industry

Digital music refers to music that uses digital technology in its production, distribution, and storage. It is mainly transmitted, downloaded, or enjoyed through the network. A computer or a mobile digital audio player decodes digital music content in the cloud or in downloaded MP3, WAV, FLAC, MPEG-2, AIFF and other formats to complete the transmission and consumption of the music. Its immediacy, reproducibility, mass downloading and unlimited playback effectively expand the range of music propagation and increase the speed at which it spreads. This article mainly reviews the development of China’s digital music market in the eras of online music and wireless music. Online music refers to digital music that can be played online, or that can be downloaded directly to a computer and transferred to other playback devices. Wireless music refers primarily to music services distributed over the mobile Internet, including mobile phone ringtone downloads, color ring back tones, music downloads and wireless audio-visual music services.


The era of online music (1999–2008): In the mid-1990s, MPEG Audio Layer 3 technology compressed music at a ratio of 1:10 or 1:12, easily compressing a CD into a few megabytes. Digital technology not only changed the survival mode of various fields, but also quietly guided the transformation of the traditional music industry. It greatly reduced the burden of carrying traditional music products (vinyl records, tapes, CDs), and the form of music carriers changed radically with the development of technology. However, at the end of the 20th century, China’s bandwidth was basically maintained at about 100 kb, download speeds stayed at tens of kb, and users with multimedia computers were very few. Therefore, online music did not reach a scale and influence that surpassed the mobile ring back tone and record markets. In the early 21st century, the market size of online music gradually expanded, owing to the popularity of personal computers, the increase in network speed, the emergence of mobile music players, and the establishment of P2P online music websites. Typical music websites such as Kugou and Baidu had average daily downloads of 5 million to 10 million in 2005. The new business model was mainly based on free downloads and trial listening, plus paid music. However, a series of problems always restricted the development of the digital music market: for example, the digitization of music brought unprecedented piracy problems, and users’ willingness to pay had not been cultivated, which restricted the development of digital music to a certain extent. Since 2008, China’s digital music market has entered a period of regulatory adjustment. At the same time, WAP and 3G websites have been increasing and the channels for wireless music have been continually expanded. The integration of the telecommunications, broadcasting, and mobile Internet industries will provide consumers with a more convenient way to obtain music.

The era of wireless music (2009–present): Most noteworthy is that the popularity of the mobile Internet has promoted the development of digital music. With the transformation of China’s digital music industry from 2G and 3G to 4G, the value of music has been constantly refreshed. China’s digital music market reached 1.79 billion yuan in 2009, with wireless music accounting for 92.1%; wireless music became an important revenue component for music content providers. The remaining problem was that the prevalence of piracy was not effectively controlled. The “Notice on Ordering Online Music Service Providers to Stop Unauthorized Dissemination of Music Works” issued by the National Copyright Administration in July 2015 is considered a key step in the standardization of the Chinese music copyright market. As of 2018, the size of China’s digital music market was 7.63 billion yuan, with a growth rate of 113.2%. Digital music users reached about 700 million people, and the number of paying users was nearly 40 million. China’s digital music market has formed a “one super, many strong” pattern. The platforms have begun to move from the monopolistic situation of the past toward mutual authorization of catalogues. Strengthening the production of upstream content and expanding diversified profit channels have become development priorities, and business models are gradually maturing.


2 Optimization of China Digital Music Industry Structure Under 5G Network

On November 14, 2017, the Ministry of Industry and Information Technology of China issued a 5G system frequency usage plan, which identified the 3300–3400 MHz, 3400–3600 MHz, and 4800–5000 MHz frequency bands as China’s 5G working frequency bands [1]. Compared with 4G, 5G forms a network system that integrates multiple technologies and multiple services through the characteristics of millisecond-level delay, high reliability, high speed, and large bandwidth, which can provide more extensive user groups with high-quality services, and will to a large extent promote the optimization and upgrading of the digital music industry structure and the integration and development of related industries [2] (see Table 1 for details).

Table 1. 2G–5G network performance indicators

| Name | Peak data rate | User experience rate | Connection density | System | Adapter |
| 2G | 30 kbps | 15 kbps | 100 users/km2 | GSM, CDMA | Voice interaction, text messaging |
| 3G | 2 Mbps | 300 kbps | 1000 users/km2 | TD-SCDMA, WCDMA, CDMA2000 | Meeting general-quality audio and video transmission |
| 4G | 100 Mbps | 20 Mbps | 2000 users/km2 | TD-LTE, FDD-LTE | Meeting high-quality audio and video transmission |
| 5G | 1 Gbps | 100 Mbps | 1 million users/km2 | NR/TD-LTE, GSM, LTE-FDD, WCDMA | Internet of Everything: AI, VR, AR, etc. |

In the 5G era, communication systems will gain the ability to interact with the environment, and the targets are expanded to the joint optimization of an ever-increasing number of key performance indicators (KPIs), including latency, reliability, connection density, and user experience [3]. This means that, with 5G network technology, users can realize the Internet of Everything in every scene of daily life. For the market, this is also a brand-new business model and a brand-new challenge.


2.1 5G-Based Artificial Intelligence Composition Technology

As 5G technology, big data and cloud computing are continuously applied, artificial intelligence composition technology has made great progress in the fields of music education, music creation, music performance and entertainment services. Artificial intelligence composition, originating from “algorithmic composition”, is an attempt to use a formal process to enable people (or composers) to use computers to achieve different degrees of automation in music composition [4, 5]. At present, computer-aided algorithmic composition systems developed abroad include Super Collider, C Sound, MAX/MSP, Kyma, Nyquist, AC Toolbox, etc. Domestic systems are mainly based on artificial neural networks (algorithmic mathematical models that imitate the behavioral characteristics of biological neural networks and perform distributed parallel information processing), genetic algorithms (global optimization algorithms that use fitness functions to evolve samples) and hybrid algorithms (combinations of genetic algorithms and artificial neural networks) [6]. For compositions generated from large-scale data material libraries by combining these algorithmic composition techniques with an artificial intelligence composition system, it is almost impossible to distinguish works created by human composers from those created by the intelligent composition system.

Most noteworthy is that an artificial intelligence composition system can create user-owned songs based on pictures, lyrics or a melody provided by the user, which is in effect a process of user-generated content (UGC). Users can upload information through the 5G network to the cloud, and the artificial intelligence composition system then generates the music works that users need (see Fig. 1).

Fig. 1. The process of creating songs by artificial intelligence composition system
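As a toy illustration of the genetic-algorithm idea mentioned above (and not the authors' composition system), the sketch below evolves a short melody against a hand-written fitness function; the preference for C-major scale tones and small melodic leaps is purely an assumption for the example.

```python
import random

SCALE = {0, 2, 4, 5, 7, 9, 11}           # C-major pitch classes (illustrative fitness target)

def fitness(melody):
    in_scale = sum(n % 12 in SCALE for n in melody)
    smooth = sum(abs(a - b) <= 4 for a, b in zip(melody, melody[1:]))
    return in_scale + smooth               # reward scale tones and small leaps

def evolve(length=16, pop_size=50, generations=200):
    pop = [[random.randint(60, 72) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                  # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)           # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:                   # mutation
                child[random.randrange(length)] = random.randint(60, 72)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())                                          # MIDI note numbers of the best melody
```

Real composition systems replace this toy fitness with models learned from large music corpora, which is where the big-data material library discussed above comes in.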

With the advent of artificial intelligence music composition systems, audiences without theoretical knowledge of music composition can readily create their own representative works. Users only need to prepare images, melodies or other relevant materials and upload them to the AI music composition system; upon receiving the instructions, the system will complete the creation of lyrics, music composition, arrangement and even the music video through composition methods such as the genetic algorithm,


artificial neural networks and so on, with the incorporation of big data. These processes are completed in an instant, which was beyond imagination in the past. Furthermore, as of December 2020, the number of terminals with established 5G connections had exceeded 200 million, and 5G mobile phone shipments had exceeded 144 million units. In 2020, 580,000 new 5G base stations were added, and all cities basically achieved 5G network coverage [7]. 5G networks will tailor their provisioning mechanisms for more applications and services, which makes them more challenging in terms of complicated configuration issues and evolving service requirements [8]. Specifically, the changes in the music market will reflect the following characteristics:

• The Feasibility of Grassroots Music Creation. In the past, music creation was generally carried out by professional composers, but nowadays it is difficult for people to distinguish between works created by artificial intelligence music composition systems and works created by musicians, indicating that the technology has gradually matured. In addition, although consumer applications of 5G are still in the initial stage of introduction, with the gradual advancement of the 5G market such an enormous user group will inevitably give birth to a series of innovative applications, which will actively promote 5G-powered video entertainment applications; it will be necessary to develop consumer applications through cooperation with Internet companies, including AR/VR live broadcasting, etc. In the future, with the popularization of 5G networks and artificial intelligence chips, developers will vigorously promote the marketization of artificial intelligence composition systems and the research and development of related mobile applications, turning previously unattainable music composition work into easy entertainment.

• Everyone Can Become a Copyright Recipient of Music Works. As blockchain technology matures, the integration of 5G network technology and blockchain technology will make the music market more transparent and diversify its earnings. With the assistance of blockchain’s secure and reliable database and its highly decentralized and liberalized management mode, the cost of maintaining the copyright of music works can be greatly reduced and the transaction of music products can be facilitated. At the 2019 Smart China Expo, the Blockchain and Digital Economy Joint Laboratory of China Telecom released “The White Paper on Blockchain Smart Phones in the 5G Era”, which describes a blockchain application ecology established by China Telecom that would lead the new wave of the digital economy with its distinctive decentralization technology. It provides a safer hardware infrastructure for 5G mobile phones and makes 5G mobile phones become nodes of decentralized operation and data, so as to form a new value transmission network and establish a more secure ecosystem for mobile phone users [9]. In this way, the practical utility of artificial intelligence composition technology will be brought into full play.
For example, in the context of blockchain, ordinary users can use artificial intelligence music composition platforms to create their own works and then release them to a blockchain platform supported by the 5G architecture. Such a platform, with its independent digital wallet and smart contracts (transparency for transactions between music performers, producers and fans), has a great market advantage: for example, when a user publishes a new work to the platform and a consumer purchases the copyright of the work,

1146

L. Eryong and L. Yukun

the funds will be automatically distributed to the accounts of the creators, promoters, performers and so on according to a preset proportion.

• It Will Replace Human Musical Activities to Some Extent. In September 2018, Google held a large-scale AI interactive experience exhibition at the Long Museum in Shanghai, where the audience could interact with AI robots through language, dancing or graffiti. What is amazing is that the artificial intelligence robots could extemporaneously play along with the performers when audience members sang or played at will. Owing to the powerful composition technology of artificial intelligence, it will be favored by many performing arts groups and performing arts companies in the future and, in anticipation, it will probably replace part of the music technicians who are active in the current market. Once artificial intelligence composition technology matures and is launched on the market, it will have a great impact on the industries of music performance, music creation and music production. To cite an obvious example, before MIDI composition technology was invented, if a singer needed orchestral accompaniment while performing a song, a large instrumental ensemble composed of string, wind and percussion instruments would generally be required, amounting to at least 20 to 30 people. With the introduction of MIDI composition technology, however, a single professional MIDI composer in front of a computer could handle the recording of all the sounds and instrumental parts; the cost was greatly reduced, and far less time and effort had to be spent. Similarly, the emergence of artificial intelligence composition technology will take over part of the work of composers; not only that, behind the stage, some of the work will be streamlined by this technology, and far fewer synthesizers, effect units and other electronic composition equipment will be needed, which will mark another technical change in the history of music technology.

2.2 Virtual Concert and AR Augmented Reality Experience

The 2019 World VR Industry Conference, with the theme “VR makes the world a better place—VR + 5G opens a new era of perception”, was held in Nanchang, Jiangxi Province on October 19, 2019. Guo Ping, Huawei’s rotating chairman, said in a speech: “VR/AR, as the first application in the 5G era, is a revolution in human-computer interaction and a revolutionary upgrade in computing power, connectivity and display.” AR (Augmented Reality) is a relatively new technology that promotes the integration of real-world information and virtual-world information content. It is a virtual reality technology developed in the 20th century. The requirements for panoramic real-time rendering and computation are extremely strict: not only are real-time recognition and automatic tracking of real scenes required, but it also consumes ultra-high bandwidth. Under traditional 4G networks, VR and AR cannot fully show their value. However, as the 5G commercialization process continues to advance, new opportunities will inevitably be ushered in [10].

Virtual concert: Faye Wong’s “Magic Music” concert on December 30, 2016 was the first time that VR was used for a live webcast in China, but the 4G network coverage of the time was unable to meet the massive data transmission requirements of VR and it was difficult to maintain stability.
In general, the live broadcast of music concerts is mainly guaranteed through wired networks. However, VR uses 5G to connect 1 million


users per square kilometer with massive connections, low latency and extremely high speed, so that fans can easily watch immersive live broadcasts at home [11]. The game Minecraft also recently hosted Fire Festival 2019, a live virtual music festival held entirely in the game. Over 50 artists performed at the event, and 6,500 people participated in the live interaction of the festival. The bars, pools, sea cliffs and private parties appearing in the game are virtual, but the audience has a real experience in the virtual game world. In addition, 5G + VR opens up many possibilities for musicians: it is no longer necessary to be in the same physical space to co-create the same piece of music, and a performer can accompany concert singers from a different place, which is enough to widen the horizon of the entire music industry.

AR mobile augmented reality experience: A high-speed, stable, low-latency 5G mobile communication network, deeply integrated with Wi-Fi, allows digital music to be online at any time, and the fusion of reality, augmented reality, augmented virtuality and virtual reality blurs the boundaries between the virtual world and the real world. Fans can experience the story behind the song and the music MV through VR. Mixing in the real world allows fans to deeply experience the story and content of the song across time and space, be immersed in the three-dimensional virtual world, and truly opens a new field of virtual reality mapping for the viewer [12].

2.3 5G-Based Music Blockchain

Blockchain technology refers to a technical solution for collectively maintaining a reliable database through decentralization and trust-free mechanisms. One of its biggest features is decentralization: the entire chain is an open-source system without a unified database management organization, and everyone can edit the data. In addition, with the application of distributed ledger technology, everyone backs up the same information, which inevitably forms a huge database. Traditional 4G technology cannot undertake these massive data transmission tasks, while 5G offers a more extreme experience and larger capacity and uses the radio frequency spectrum to transmit massive amounts of data with higher speed and reliability than previous technologies. It can therefore be expected that 5G will essentially help blockchain technology create a new three-dimensional digital environment. Blockchain technology and 5G technology, once applied in the music field, will break the traditional profit model and form a new music copyright trading platform and an efficient, free and open network environment. With the support of 5G technology, the blockchain will solve a series of problems such as the opacity of copyright holders’ benefits, long benefit periods, and the lack of a voice for copyright holders. Without the “intermediary agency”, the copyright owner does not need to deal with complicated intermediaries and will get the maximum benefit. In addition, the virtual wallet (virtual digital cryptocurrency), smart contracts (transparency between musicians and fans), and distributed databases (storing all relevant copyright data) of blockchain technology will form a new copyright transaction platform. Piracy and infringement will be significantly reduced.


3 In-Depth Development of Digital Music Industry Convergence Formats Under 5G Networks

Under traditional 4G network technology, blockchain, 3D, VR, and AR technologies have already made certain progress. The arrival of 5G is bound to accelerate the intelligent layout of human life and make smart interconnection, smart homes, and smart travel a new normal. Under 5G technology, digital music will work hand in hand with intelligent technologies such as the Internet of Things, artificial intelligence, and cloud computing to create a new way of life and work for humanity.

• Smart Home. 5G network technology has become a label of the era. It will drive the rapid layout and development of the smart home industry, foster new economic drivers, trigger profound changes in production methods and lifestyles, accelerate the digitization, networking, and intelligence of production activities, and promote the transformation and development of the real economy [13]. Digital music will appear in more scenes in the smart home field. The mobile client is used to personalize scene modes for different time periods and the corresponding music style. For example, in the early morning you can set a relaxed and cheerful music scene, and when you are holding a small party with friends you can set a lively music scene, and so on. All this will be completed independently by the artificial intelligence home system. The deep integration of digital music and smart home systems not only promotes the rapid and comprehensive development of the smart home industry, but also changes our lives in a completely new way.

• Transportation. Vehicle to Everything (V2X) is an information technology to connect vehicles with everything. V2X technologies attract industrial and academic efforts to provide wireless connectivity between all road entities and to support several V2X services [14]. Under 5G network technology, the use of on-board FM and on-board CD players will decline, and wireless digital music will become more and more popular, which will greatly promote the development of the digital music industry. With the development of the Internet, the year-on-year increase in car ownership, and the younger profile of car buyers, consumers are no longer satisfied with the basic safety functions of the Internet of Vehicles, and in-car music entertainment has become the first demand of the “Internet of Vehicles”. Therefore, the demand of consumers and vehicle terminal manufacturers for in-vehicle audio content has also increased rapidly.

• Medical Field. A large amount of previous clinical experience has shown that music therapy can treat human diseases from both the physical and psychological aspects, and has significant effects on poor thinking activity, lack of will, lack of interest, and attention disorders. For example, it can relieve the anxiety of cancer patients; in particular, treating patients with chronic schizophrenia with music therapy on top of antipsychotic drugs can greatly improve the patient’s condition. An intelligent system can monitor people’s sleep, exercise, heart rate, etc. in real time and adjust through intelligent terminals. In July 2017, the State Council issued the “New Generation Artificial Intelligence Development Plan”, which proposes that by


2030, China’s AI theory, technology and applications will generally reach world-leading levels. The continuous integration of digital music, artificial intelligence and medical treatment is bound to promote major changes in the field of human medicine.

4 Summary

The fusion of 5G, the Internet, the Internet of Things, and blockchain technologies, especially with virtual reality, augmented reality, mixed reality, robotics and other technologies, will greatly change the way humans watch performances. In addition, the continuous maturation of artificial intelligence composition technology will eventually make grassroots music creation possible; that is, everyone will be able to complete a more personalized musical composition through an artificial intelligence composition system. The deep integration of digital music with smart homes, medical care, and connected cars shows that, with the support of 5G, human production methods and lifestyles are moving towards digitalization, networking, and intelligence. Of course, the realization of the technology is just a prerequisite. To truly realize digitalization, networking, and intelligence, what we still need is technology grounded in the development of social productivity and the improvement of creativity. In the 5G era, first of all, we should attach great importance to the construction of a talent team for technology production and R&D, which is an important guarantee for China to fully enter the 5G era. Second, productivity and the market are needed as guarantees; the fundamental way to enable the digital market to transform and develop in the 5G era ultimately depends on the market. Finally, the government and enterprises need to play a leading role and construct a good system and institutional environment to provide the necessary guarantees.

References 1. Zhao, J., Sun, Y., Liu, S.F.: Research on 5G network deployment scheme. China New Telecommun. 9(10), 2–7 (2018) 2. Liu, C.F.: How 5G will rewrite the media industry. Media 3(06), 6–7 (2019) 3. Yang, W.J., Wang, M., Zhang, J.J., et al.: Narrowband wireless access for low-power massive internet of things: a bandwidth perspective. IEEE Wirel. Commun. 24, 138–145 (2017) 4. Gareth, L., Curtis, A.: Programming languages for computer music synthesis, performance, and composition. ACM Comput. Surv. 17(2), 235–265 (1985) 5. Baratè, A., et al.: 5G technology for music education: a feasibility study. IADIS Int. J. Comput. Sci. Inf. Syst. 14(1), 31–52 (2019). ISSN 1646-3692 6. Zhou, L., Deng, Y.: Research on the status quo and trend of the development of artificial intelligence composition. Arts Explor. 32(05), 107–111 (2018) 7. NetEase News. https://3g.163.com/news/article/FV5P1HJU075497TJ.html. Accessed 31 Dec 2020 8. SOHU. https://www.sohu.com/a/298792526_114949. Accessed 10 Sept 2019 9. You, X.H., Zhang, C., Tan, X., Jin, S., Wu, H.: AI for 5G: research directions and paradigms. Sci. China Inf. Sci. 62(2), 1–13 (2018). https://doi.org/10.1007/s11432-018-9596-5 10. Ge, X., Pan, L., Li, Q., Mao, G., Song, T.: Multipath cooperative communications networks for augmented and virtual reality transmission. IEEE Trans. Multimed. 19(10), 2345–2358 (2017). https://doi.org/10.1109/TMM.2017.2733461


11. Baratè, A., et al.: 5G technology for augmented and virtual reality in education. In: Conference: International Conference on Education and New Developments (2019) 12. Qiao, X., Ren, P., Dustdar, S., et al.: Web AR: a promising future for mobile augmented reality—state of the art, challenges, and insights. In: Proceedings of the IEEE, vol. 107, no. 4, pp. 651–666, April 2019. https://doi.org/10.1109/JPROC.2019.2895105 13. Chen, R.D.: 2019 digital market in 5G era. Spec. Plan. 11(04), 18–19, 1004–3381 (2019) 14. Abdel, S.A., Hakeem, A.A., Hady, H.K.: Current and future developments to improve 5GNewRadio performance in vehicle-to-everything communications. Telecommun. Syst. 75(3), 331–353 (2020). https://doi.org/10.1007/s11235-020-00704-7

Optimal Solution of Transportation Problem with Effective Approach Mount Order Method: An Operational Research Tool

Mohammad Rashid Hussain(B), Ayman Qahmash, Salem Alelyani, and Mohammed Saleh Alsaqer

Center for Artificial Intelligence (CAI), College of Computer Science, King Khalid University, P.O. Box 960, Abha 61421, Kingdom of Saudi Arabia
[email protected]

Abstract. This paper introduces a new approach to optimizing the cost per unit of product in the Transportation Problem to achieve better outcomes. We present a Basic Feasible Solution (BFS) approach comprised of five main steps: (1) create a matrix A = |Supply(si) − Demand(dj)|; (2) add the cost of each cell of the cost matrix C to the corresponding element of matrix A to create matrix B; (3) number the elements of matrix B in ascending order from 1 to m×n; (4) take ζ = min(si, dj) for the cell of matrix B carrying the smallest number from 1 to m×n, allocate ζ there, cross out the rest of the elements of the exhausted row or column, and subtract ζ from the remaining supply or demand; (5) repeat step 4 until all supply and demand become zero. This solution approach finds a basic feasible solution of the TP with the same complexity as Vogel’s approximation method (VAM). Keywords: Mount Order Method (MOM) · Transportation Problem (TP) · Basic Feasible Solution (BFS) · Optimal Solution (OS) · Stepping stone method

1 Introduction

A set of non-negative individual allocations which satisfies all the given constraints is termed a feasible solution. In linear programming, more importance is attached to a basic feasible solution than to a basic solution (a solution of a problem which satisfies all the conditions). Basic feasible solution (BFS): for an m × n transportation problem, if the number of allocations is m + n − 1, the solution is called a basic feasible solution. To compute the BFS of a Transportation Problem (TP), the most efficient existing method that has been observed is Vogel’s approximation method (VAM), which gives results nearest to the optimal solution. In a comparative study, the performance of the proposed Mount Order Method (MOM) is compared with VAM to prove the effectiveness of the proposed one. Operational research (OR) is a mathematical method, and the scientific study of OR leads to better decisions. It offers powerful tools to analyze and understand certain classes of problems to achieve better outcomes. An optimization problem (OP) identifies the values of the decision variables that


lead to the best outcomes for the decision maker within the assumptions of the model. A feasible solution that maximizes or minimizes the measure of goodness is an optimal solution (OS). The solution of a transportation problem (TP) has two stages: the first is identifying a basic feasible solution (BFS) and the second is obtaining the OS. A set of choices for which the conditions are satisfied is called a BFS; this paper presents an approach to find a BFS of a TP with a better result than the existing approaches. We first define the existing approaches and then suggest an intelligent approach that can give a better result compared to the existing methods. A brief review of the existing methods for the TP and the subsequent approach follows. We then offer a selective and better approach for identifying a BFS of the TP and provide some evidence by comparing it with the existing approaches. The proposed approach also suggests how operations researchers approach new problems, and we provide a brief survey of different techniques that have been developed by OR researchers to achieve better outcomes. The solution of a TP is about transporting a single item from a given set of supply points to a given set of destination points. In Table 1 and Table 2, the supply at supply point i is si and the requirement at demand (destination) point j is dj. The problem is to find the least-cost transportation plan from the supply points to the demand points, where cij is the unit cost of transportation and xij is the quantity of product transported from supply point i to demand point j.

Table 1. Transportation tableau: quantity transported from supply point i to demand point j.


The given supply at supply point i is ai and the requirement at demand point j is bj, and the aim is to find the least-cost solution of the problem, where cij is the per-unit cost of the product. The quantity of product supplied from point i to demand point j is the decision variable xij. Minimizing the total cost Σi Σj cij xij is the objective function, subject to the sets of constraints.

Σi ai = total availability, and Σj bj = total requirement.

If Σi ai ≥ Σj bj, and cij ≥ 0: the total availability exceeds the requirement, i.e. it is possible to fulfill all requirements for the different sets of product.

If Σi ai < Σj bj, and cij ≥ 0: the total availability is less than the requirement (then clearly not all requirements can be met).

To minimize the transportation cost, Σi Σj cij xij is the objective function subject to the supply constraints: for every supply point, the total quantity that leaves the point must not exceed what is available there. Here si is the quantity available at supply point i, dj is the requirement at demand (destination) point j, and xij is the quantity transported from i to j:

Minimize Σi Σj cij xij
subject to Σj xij ≤ si, for all i,
Σi xij ≥ dj, for all j,
xij ≥ 0, cij ≥ 0, ∀ i, j.

The TP is a slightly different version of the simplex algorithm. With Σi si = total availability and Σj dj = total requirement, if Σi si ≥ Σj dj, the total availability is more than what is required (it is possible to transport the entire requirement so that the demand of every destination point is met), and so it is possible to minimize the transportation cost. Transportation costs cannot be negative: cij ≥ 0, ∀ i, j. We will only end up sending exactly the amount that is required, and none of the destination points will receive even one unit more than it requires, because the extra unit would have to be transported from one of the supply points and that could only increase the cost of transportation.

“Cost optimization in the age of digital business means that organizations use a mix of IT and business cost optimization for increased business performance through wise technology investments”, says John Roberts, research vice president and distinguished analyst with Gartner’s CIO and Executive Leadership team [6]. “The key to effective enterprise cost optimization is to have proactive processes in place as part of business and technology strategy development to continually explore new opportunities” (John Roberts). In recent years OR has been playing a major role in different aspects of healthcare: it is used to optimize radiation treatment for cancer, to allocate resources for HIV prevention programs, and so on. The concept of obtaining a feasible solution and its variants has applications in many areas such as development, optimization, planning, and design making. Applications of operations research models for public health and healthcare are discussed in [1], and the clinical


problems are discussed in [3], where the aim is to optimize a system and software facilitates the problem design methods. The queuing theory concept of operations research has been applied to reduce patients’ waiting time through Monte Carlo modeling [2, 12]. The Total Opportunity Cost Method (TOCM) determines a distribution indicator for each cell of the TOCM table by subtracting the highest element of the corresponding row and column from the respective element, and the cell having the smallest distribution indicator (SDI) is allocated first [4]. Based on the TOCM approach, several methods have been developed, such as TOCM-MMM, TOCM-VAM, TOCM-EDM, TOCM-HCDM, and TOCM-SUM [7–9]. Finding a BFS is a prior requirement for obtaining an optimal solution [5]; some standard methods to find a BFS are NWC, LCM and VAM, and another method, called the Allocation Table Method (ATM), was developed by Ahmed et al. [10] in 2016. Some improvements are basically achieved by making changes to the available solution procedure of the classical LCM [11]. We observed that VAM is one of the best existing BFS methods as a precursor to obtaining an optimal solution; this method is based on a cost-cell approach, whereas NWC is based on position and does not consider the per-unit cost of the product. Our aim is to compare our approach with the best existing method to prove the performance of the proposed approach for finding a basic feasible solution of the transportation problem that is near to the optimal solution.

2 Problem Description

The results achieved by the different methods are compared and discussed. The validity and utility of the LP model can be judged to justify the performance of the proposed method. Validation by results consists of a comparison of the different methods and their solutions with the corresponding outcomes. A comparative study between the existing and proposed methods is carried out by solving a number of transportation problems, which shows that the proposed approach gives a better solution in comparison to the existing methods.

2.1 Basic Feasible Solution (BFS) of TP

A BFS can be found by:
• the Northwest Corner Rule,
• the Least Cost Method,
• Vogel’s Approximation Method,
• the proposed method: the Mount Order Method.

A BFS to a TP Satisfies the Following Conditions

i. The row-column (supply-demand) constraints are satisfied. ii. The non-negativity constraints are satisfied.


iii. The allocations are independent and do not form a loop.
iv. There are exactly m + n − 1 allocations.

All the existing and proposed methods satisfy the first two conditions.

• A loop can be broken in two ways, one of which will result in a solution with a lesser (non-increasing) cost. It is advisable to break the loop and obtain a better solution with lesser cost.
• If a feasible solution has more than m + n − 1 allocations, there is a loop, and we break it to get a basic feasible solution without a loop.
• We observed that none of the methods (existing and proposed) gives more than m + n − 1 allocations.
• Exactly m + n − 1 allocations also does not guarantee the absence of a loop; there might be a hidden loop.
• A degenerate basic feasible solution has fewer than m + n − 1 allocations. In that case an ε allocation is introduced in such a way that it does not form a loop.

2.2 Optimal Solution to a TP

There are two different methods that we use to find an optimal solution:
• the Stepping Stone Method,
• the Modified Distribution Method (MODI), or U-V Method.

These methods are used to determine the optimality of a basic feasible solution (obtained by the Northwest Corner Rule, the Least Cost Method, Vogel’s Approximation Method, or the proposed Mount Order Method). A feasible solution that maximizes or minimizes the measure of goodness is an optimal solution (OS). The optimal solution sends exactly the quantity that is needed to each of the destination points.

1. If Σi si = Σj dj (balanced transportation problem), the exact requirement of every destination point is met. This means that the inequalities become equations: Σj xij = si for all i, and Σi xij = dj for all j.
2. If the problem is not balanced (unbalanced), it needs to be converted into a balanced one and then solved:
(i) if Σi si ≤ Σj dj, the entire requirement cannot be met;
(ii) if Σi si ≥ Σj dj, the entire supply cannot be shipped.

Optimization is a synonym of mathematical programming, and computer programming is used to implement it. It offers an economy of compactness and predictive power, correctness, and coherence in which the problem is discussed to represent a real-world situation. Linear programming (LP) may or may not have a feasible solution: if the feasible region is empty, the LP does not have a solution. If a solution exists,


it means the feasible region is non-empty; whether the region is bounded is a separate condition. An optimization problem identifies the values of the decision variables that lead to the best outcomes for the decision maker within the assumptions of the model. The transportation problem is a slightly different version of the simplex algorithm. The solution of a TP has two stages: the first is identifying a basic feasible solution and the second is obtaining the optimal solution.

Stepping Stone Method. This is one of the methods used to determine the optimality of an initial basic feasible solution. Finding a BFS is the prior requirement for obtaining an optimal solution of the transportation problem. Three standard BFS methods are discussed in all standard books (Hillier, etc.): the Northwest Corner Rule (NWC), the Least Cost Method (LCM), and Vogel’s Approximation Method (VAM). In this paper, a new approach, the “Mount Order Method” (MOM), is proposed to obtain a BFS for the TP. The stepping stone method tries to enter each of the unallocated positions, under the assumption that if the present solution is not optimal then at least one of these unallocated positions should receive an allocation. It exhaustively looks for a decrease in cost (obtained by putting +1 in an unallocated position and completing the loop), enters that position, and tries to move on by reducing the objective function further. When, at the optimum, we are unable to find an entering unallocated position, we say that the optimum is reached.

3 Solution Approach

Representation of the Transportation Tableau

Table 2. Unit cost transportation tableau. cij: per unit cost of transportation, where cij ≥ 0 for all i, j, i = 1, 2, …, m and j = 1, 2, …, n.

| Demand → / Supply ↓ | D1 | D2 | D3 | … | Dn |
| S1 | c11 | c12 | c13 | … | c1n |
| S2 | c21 | c22 | c23 | … | c2n |
| S3 | c31 | c32 | c33 | … | c3n |
| … | … | … | … | … | … |
| Sm | cm1 | cm2 | cm3 | … | cmn |

(Si, Dj : cij) → set of supply and demand points with per unit costs.


xij → quantity transported from supply point i to demand point j.

Si : (ai), i = 1 to m → (a1, a2, a3, …, am)   (1)

Per unit costs of the product supplied from supply point i across the demand points j = 1 to n:
(Si)i=1 → (c11, c12, c13, …, c1n)
(Si)i=2 → (c21, c22, c23, …, c2n)
(Si)i=3 → (c31, c32, c33, …, c3n)
…
(Si)i=m → (cm1, cm2, cm3, …, cmn)   (2)

Dj : (bj), j = 1 to n → (b1, b2, b3, …, bn)   (3)

Per unit costs of the product demanded at demand point j across the supply points i = 1 to m:
(Dj)j=1 → (c11, c21, …, cm1)
(Dj)j=2 → (c12, c22, …, cm2)
…
(Dj)j=n → (c1n, c2n, …, cmn)   (4)

Minimize Σj=1..n cj xj
subject to Σj=1..n aij xj ≥ bj, for all j,
xj ≥ 0, for all j.   (5)

Algorithm: Mount Order Method (MOM)
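As a sketch of how the five steps listed in the abstract can be implemented, the function below builds B = C + |si − dj|, ranks the cells of B once, and allocates min(si, dj) in rank order. This is one plausible reading of steps 3–5 (the paper does not state whether the ranking is recomputed after each allocation), offered for illustration rather than as the authors' exact procedure.

```python
import numpy as np

def mount_order_method(cost, supply, demand):
    """Basic feasible solution of a balanced transportation problem,
    following one reading of the Mount Order Method: rank the cells of
    B = C + |s_i - d_j| and allocate min(supply, demand) in that order."""
    C = np.asarray(cost, dtype=float)
    s = np.asarray(supply, dtype=float).copy()
    d = np.asarray(demand, dtype=float).copy()
    m, n = C.shape

    A = np.abs(s[:, None] - d[None, :])       # step 1: A = |s_i - d_j|
    B = C + A                                  # step 2: B = C + A
    order = np.argsort(B, axis=None)           # step 3: rank the cells of B ascending
    alloc = np.zeros_like(C)

    for flat in order:                         # steps 4-5: allocate in rank order
        i, j = divmod(int(flat), n)
        if s[i] <= 0 or d[j] <= 0:             # row or column already crossed out
            continue
        q = min(s[i], d[j])                    # zeta = smaller of remaining supply, demand
        alloc[i, j] = q
        s[i] -= q
        d[j] -= q

    total_cost = float((alloc * C).sum())
    return alloc, total_cost
```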


4 Numerical Example

Example 1 of the Transportation Problem, to find the basic feasible and optimal solution:

Si : (ai), i = 1 to 3 → (10, 15, 40), using Eq. (1).
Dj : (bj), j = 1 to 3 → (20, 15, 30), using Eq. (3).

Per unit costs, using Eq. (2):
(Si)i=1 → (c11, c12, c13) = ($2, $2, $3)
(Si)i=2 → (c21, c22, c23) = ($4, $1, $2)
(Si)i=3 → (c31, c32, c33) = ($1, $2, $1)

Step 1:



Matrix A = [A] =
| 10  5 20 |
|  5  0 15 |
| 20 25 10 |

Matrix C = [C] =
| 2 2 3 |
| 4 1 2 |
| 1 2 1 |

Step 2:

Matrix B = [B] =
| 12  7 23 |
|  9  1 17 |
| 21 27 11 |

Steps 3–5: the cells of Matrix B are ranked from 1 to 9 and allocations are made in that order until all supply and demand are exhausted.

MOM Result
D1 ← 10 units from S1 + 10 units from S3
D2 ← 15 units from S2
D3 ← 30 units from S3
Total cost for D1 to D3 = $75
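Under the same reading of the steps used in the sketch above, the Example 1 data reproduces this allocation and total (a quick check, not part of the original paper):

```python
cost   = [[2, 2, 3], [4, 1, 2], [1, 2, 1]]
supply = [10, 15, 40]
demand = [20, 15, 30]

alloc, total = mount_order_method(cost, supply, demand)
print(alloc)   # 10 at (S1,D1), 15 at (S2,D2), 10 at (S3,D1), 30 at (S3,D3)
print(total)   # 75.0
```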

Existing methods (NWC, LCM, VAM)

Algorithm: North-West Corner Rule (NWC)
Step 1: Start in the upper left corner of the transportation table.
Step 2: Put x11 = min{a1, b1}.
Step 3: If x11 = a1, cross out row 1 (no more basic variables come from row 1) and put b1 = b1 − a1.
Step 4: If x11 = b1, cross out column 1 (no more basic variables come from column 1) and put a1 = a1 − b1.
Step 5: If x11 = a1 = b1, cross out either row 1 or column 1, but not both: either put b1 = 0 and cross out row 1, or put a1 = 0 and cross out column 1.
Step 6: Continue applying this approach to the most north-west corner cell of the table that does not lie in a crossed-out row or column. Ultimately, there will be only one cell left that may be assigned a value; assign this cell a value equal to its row or column demand, and cross out both its row and column.
Step 7: The BFS has now been obtained.

NWC Result
D1 ← 10 units from S1 + 10 units from S2
D2 ← 5 units from S2 + 10 units from S3
D3 ← 30 units from S3
Total cost for D1 to D3 = $115
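A direct transcription of these steps into code might look as follows (a sketch; in the degenerate case of Step 5 the column is crossed out first). On Example 1 it reproduces the $115 shown above.

```python
def north_west_corner(cost, supply, demand):
    """BFS by the North-West Corner Rule: allocate at the most north-west
    cell of the remaining tableau, then cross out the exhausted row or column."""
    s, d = list(supply), list(demand)
    m, n = len(s), len(d)
    alloc = [[0] * n for _ in range(m)]
    i = j = 0
    while i < m and j < n:
        q = min(s[i], d[j])
        alloc[i][j] = q
        s[i] -= q
        d[j] -= q
        if d[j] == 0 and j < n - 1:
            j += 1                        # column satisfied: move right
        else:
            i += 1                        # row exhausted (or last column): move down
    total = sum(alloc[i][j] * cost[i][j] for i in range(m) for j in range(n))
    return alloc, total
```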


Algorithm: Least Cost Method (LCM)

Algorithm: Vogel’s Approximation Method (VAM)
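For comparison, the following is a standard textbook formulation of Vogel's Approximation Method (penalty = difference between the two smallest remaining costs of a row or column), offered as a sketch rather than the authors' exact presentation; because tie-breaking conventions differ between presentations, its costs may not coincide with those reported in Table 6.

```python
def vogel_approximation(cost, supply, demand):
    """Textbook VAM sketch: repeatedly pick the row or column with the largest
    penalty (difference between its two smallest remaining costs) and allocate
    as much as possible in its cheapest remaining cell."""
    s, d = list(supply), list(demand)
    m, n = len(s), len(d)
    alloc = [[0] * n for _ in range(m)]
    rows, cols = set(range(m)), set(range(n))

    def penalty(values):
        v = sorted(values)
        return v[1] - v[0] if len(v) > 1 else v[0]

    while rows and cols:
        cand = [(penalty([cost[i][j] for j in cols]), True, i) for i in rows]
        cand += [(penalty([cost[i][j] for i in rows]), False, j) for j in cols]
        _, is_row, k = max(cand)                       # largest penalty wins
        if is_row:
            i, j = k, min(cols, key=lambda j: cost[k][j])
        else:
            i, j = min(rows, key=lambda i: cost[i][k]), k
        q = min(s[i], d[j])                            # allocate as much as possible
        alloc[i][j] = q
        s[i] -= q
        d[j] -= q
        if s[i] == 0:
            rows.discard(i)
        if d[j] == 0:
            cols.discard(j)

    total = sum(alloc[i][j] * cost[i][j] for i in range(m) for j in range(n))
    return alloc, total
```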

Optimal Solution: Stepping Stone Method over MOM

• x12, x13, x21, x23, and x32 are unallocated cells, or non-basic variables.


• The number of allocated cells is 4, which is less than m + n − 1 (a degenerate feasible solution exists).
• To restore a basic feasible solution, consider an ε allocation: select a position such that the m + n − 1 positions remain independent, and choose ε = 0 as the allocation.
• Consider the position x12 for ε, which does not form a loop with the allocated cells.
• Now, of the five non-basic positions, put +1 in each position one by one, which creates a loop, and evaluate the effect.
• Adding +1 to any unallocated cell creates a loop because the m + n − 1 condition is exceeded.

Net cost (increased/decreased). To evaluate the net cost of putting +1 in an unallocated cell, start from that cell with +1 and move to the other allocated cells with alternating signs to form a loop. This procedure is carried out one by one for all unallocated cells. The sign of the net cost shows whether the cost increases or decreases, as shown in Table 3. There are four unallocated cells, and putting +1 in each of them gives an increase. So none of them is capable of decreasing the objective function further from 75 to something lower, and the optimum is reached.

Optimal solution:
D1 ← 10 units from S1 + 10 units from S3
D2 ← 15 units from S2
D3 ← 30 units from S3
Total cost for D1 to D3 = $75

Hence, this shows that the developed method gives a better result than the existing methods.


Table 3. Net cost (increased/decreased): the sign of the net cost shows whether the cost increases or decreases.

| Unallocated cell | Loop of per unit cost cells | Net cost increased | Net cost decreased |
| x13 | c13 − c33 + c31 − c11 | 1 | – |
| x21 | c21 − c22 + c12 − c11 | 3 | – |
| x23 | c23 − c33 + c31 − c11 + c12 − c22 | 1 | – |
| x32 | c32 − c31 + c11 − c12 | 1 | – |
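The loops in Table 3 can be checked mechanically; the snippet below evaluates each loop on the Example 1 cost matrix and confirms that every net cost is positive, so the $75 solution is optimal.

```python
# Net-cost check for Example 1 (Table 3), using the stepping-stone loops listed there.
C = [[2, 2, 3],
     [4, 1, 2],
     [1, 2, 1]]
loops = {                      # unallocated cell -> its +/- loop over cost cells
    "x13": [(+1, 0, 2), (-1, 2, 2), (+1, 2, 0), (-1, 0, 0)],
    "x21": [(+1, 1, 0), (-1, 1, 1), (+1, 0, 1), (-1, 0, 0)],
    "x23": [(+1, 1, 2), (-1, 2, 2), (+1, 2, 0), (-1, 0, 0), (+1, 0, 1), (-1, 1, 1)],
    "x32": [(+1, 2, 1), (-1, 2, 0), (+1, 0, 0), (-1, 0, 1)],
}
for cell, loop in loops.items():
    net = sum(sign * C[i][j] for sign, i, j in loop)
    print(cell, "net cost", net)   # 1, 3, 1, 1 -> all positive, optimum reached
```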

5 Related Problem

Problem 1 is similar to all the other problems, which can be solved by the same approach.

Table 4. Supply–demand table of problems 1 to 6

| Problem → | 1 | 2 | 3 | 4 | 5 | 6 |
| Si : (ai) ∀i → | 10, 15, 40 | 40, 60, 50 | 40, 30, 30 | 22, 15, 8 | 15, 25, 10 | 200, 300, 500 |
| Dj : (bj) ∀j → | 20, 15, 30 | 20, 30, 50, 50 | 20, 30, 50 | 7, 12, 17, 19 | 5, 15, 15, 15 | 200, 400, 400 |

Table 5. Per unit cost (in $) of supplied product for problems 1 to 6

| Problem → | 1 | 2 | 3 | 4 | 5 | 6 |
| Si : (cij)i=1, ∀j → | 2, 2, 3 | 4, 6, 8, 8 | 2, 5, 2 | 5, 2, 4, 3 | 10, 2, 20, 11 | 2, 7, 5 |
| Si : (cij)i=2, ∀j → | 4, 1, 2 | 6, 8, 6, 7 | 1, 4, 2 | 4, 8, 1, 6 | 12, 7, 9, 20 | 3, 4, 2 |
| Si : (cij)i=3, ∀j → | 1, 2, 1 | 5, 7, 6, 8 | 4, 3, 2 | 4, 6, 7, 5 | 4, 14, 16, 18 | 5, 4, 7 |

In most of the cases, our proposed method performs better than the other methods and achieves results near to the optimal solution, as shown in Table 4, Table 5 and Table 6. For Problems 1, 3, 4, and 6, the BFS is exactly equal to the optimal solution, namely 75, 210, 104, and 3300. Table 7 shows, in percentage terms, how close each method comes to the optimal solution.


Table 6. Results of the proposed method are compared with existing methods of getting a basic feasible solution and also compared with the optimal solution (in $)

| Problem | MOM (proposed) | NWC | LCM | VAM | Optimal solution (stepping stone method) |
| 1 | 75 | 115 | 85 | 85 | 75 |
| 2 | 930 | 980 | 930 | 960 | 920 |
| 3 | 210 | 280 | 270 | 210 | 210 |
| 4 | 104 | 125 | 105 | 104 | 104 |
| 5 | 475 | 520 | 475 | 475 | 435 |
| 6 | 3300 | 4800 | 3300 | 3300 | 3300 |

Table 7. Results of the proposed method are compared with existing methods of getting a basic feasible solution and also compared with the optimal solution (in %)

Nearness (in %) of the BFS to the optimal solution:

| Method | Problem 1 | Problem 2 | Problem 3 | Problem 4 | Problem 5 | Problem 6 |
| Proposed method: MOM | 100% | 98.92% | 100% | 100% | 91.58% | 100% |
| Existing method: NWC | 65.22% | 93.88% | 75% | 83.2% | 83.65% | 68.75% |
| Existing method: LCM | 88.24% | 98.92% | 77.78% | 99.05% | 91.58% | 100% |
| Existing method: VAM | 88.24% | 95.83% | 100% | 100% | 91.58% | 100% |
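Tables 7 and 9 follow directly from Table 6: the percentage is the optimal cost divided by the BFS cost, and the "probably optimal" value in Table 9 is the fraction of the six problems for which a method's BFS equals the optimum. A quick check (not from the paper):

```python
# BFS costs per method and the optimal cost for each problem, taken from Table 6.
bfs = {
    "MOM": [75, 930, 210, 104, 475, 3300],
    "NWC": [115, 980, 280, 125, 520, 4800],
    "LCM": [85, 930, 270, 105, 475, 3300],
    "VAM": [85, 960, 210, 104, 475, 3300],
}
optimal = [75, 920, 210, 104, 435, 3300]

for method, costs in bfs.items():
    closeness = [round(100 * o / c, 2) for o, c in zip(optimal, costs)]   # Table 7 (%)
    prob_opt = round(sum(c == o for c, o in zip(costs, optimal)) / 6, 2)  # Table 9
    print(method, closeness, prob_opt)
```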

Table 8. Results of the proposed method are compared with existing methods of getting matched with optimal solution

A statistical (probability) test has been performed to check whether the proposed algorithm performs better than the other BFS algorithms. The performance of the proposed method is found to be significantly better than that of the other methods, with the optimal solution being reached in 67% (4/6) of the cases.

Fig. 1. Result analysis graph of Table 7

Table 9. Probably optimal: the proposed method is probably optimal; the methods are compared using different types of problems and the probably-optimal result is concluded

| Basic feasible solution method | Probably optimal |
| Proposed method: MOM | 0.67 |
| Existing method: NWC | 0 |
| Existing method: LCM | 0.17 |
| Existing method: VAM | 0.5 |

Summary

This paper introduces a new approach to finding a basic feasible solution of the transportation problem. The objective is to find a better feasible solution which is near to, or sometimes exactly equal to, the optimal solution. In order to minimize the transportation cost of the product, we had to find the allocation with the minimal per unit transportation cost; this was done by applying the proposed Mount Order Method. In addition, the optimal result for the selected problems was obtained. Notably, the time complexity remains the same as that of Vogel’s approximation method. It may be observed in Fig. 1 and Fig. 2 that the percentage of similarity is near or equal to the optimal result (from Table 8 and Table 9). The differences for the remaining two problems, which are not equal to the optimum, are also very small, i.e. 930 − 920 = 10 and 475 − 435 = 40. Since the proposed method’s results

Fig. 2. Result analysis graph of Table 9

are very similar to the optimal solution, it may be implied that the proposed method for finding a BFS is better than the existing methods.

6 Conclusion

Operations research (OR) is a scientific study that offers powerful tools to understand and analyze certain classes of problems to achieve better outcomes. The solution of a transportation problem (TP) has two stages: first, identifying a basic feasible solution, and next, obtaining the optimal solution (OS). A set of choices for which the conditions are satisfied is called a BFS. In this article, a new approach named the Mount Order Method (MOM) for finding a BFS of a TP is proposed. The popular methods used to find a BFS of a TP are the North-West Corner Rule (NWC), the Least Cost Method (LCM), and Vogel’s Approximation Method (VAM). We present and solve a set of six numerical problems using these three existing methods and compare the obtained results with the proposed MOM. The BFS is obtained using VAM and MOM, and the results of these BFS are then used to find the optimal solution. In some instances, the result obtained by the proposed MOM is very close to, or equal to, the optimal solution. The proposed algorithm shows reasonable complexity for finding a BFS. From the solution provided by MOM, the optimal solution can easily be reached in very few iterations using the Stepping Stone Method or the Modified Distribution Method (MODI).

References 1. Romero-Conradoa, A.R., Castro-Bolañoa, L.J., Montoya-Torres, J.R., Jiménez-Barros, M.Á.: Operations research as a decision-making tool in the health sector: a state of the art. DYNA 84(201), 129–137 (2017)


2. Khan, A.R., Vilcu, A., Uddin, Md.S., Ungureanu, F.: A Competent algorithm to find the initial basic feasible solution of cost minimization transportation problem. Bul. Inst. Politeh. Din Iasi Romania Sec¸tia Autom. Calcul. LXI(LXV) (2), 71–83 (2015) 3. Ehrgott, M., Holder, A.: Operation research methods for optimization in radiation oncology. J Radiat. Oncol. Inf. 6, 1–41 (2014) 4. Islam, M.A., Haque, M., Uddin, M.S.: Extremum difference formula on total opportunity cost: a transportation cost minimization technique. Prime Univ. J. Multidisc. Quest 6, 125–130 (2012) 5. Islam, M.A., Khan, A.R., Uddin, M.S., Malek, M.A.: Determination of basic feasible solution of transportation problem: a new approach. Jahangirnagar Univ. J. Sci. 35, 101–108 (2012) 6. Roberts, J.: Your guide to building a successful strategic plan. https://www.gartner.com/en/ insights/strategic-planning 7. Khan, A.R.: A re-solution of the transportation problem: an algorithmic approach. Jahangirnagar Univ. J. Sci. 34, 49–62 (2011) 8. Khan, A.R., Banerjee, A., Sultana, N., Islam, M.N.: Solution analysis of a transportation problem: a comparative study of different algorithms. Bull. Polytech. Inst. Iasi Romania Sect. Text. Leathersh. (2015) 9. Ahmed, M.M., Khan, A.R., Uddin, M.S., Ahmed, F.: A new approach to solve transportation problems. Open J. Optim. 5(1), 22–30 (2016) 10. Ahmed, M.M., Khan, A.R., Ahmed, F., Uddin, M.S.: Incessant allocation method for solving transportation problems. Am. J. Oper. Res. 6(3), 236–244 (2016) 11. Uddin, M., Khan, A.R., Kibria, C.G., Raeva, I.: Improved least cost method to obtain a better IBFS to the transportation problem. J. Appl. Math. Bioinf. 6(2), 1–20 (2016) 12. Thomas, S.J.: Capacity and demand models for radiotheraphy treatment machine. Clin. Oncol. 15(6), 353–358 (2003)

Analysis of Improved User Experience When Using AR in Public Spaces

Vladimir Barros(B), Eduardo Oliveira, and Luiz Araújo

Cesar School, Recife, Brazil
[email protected]

Abstract. The new immersive technologies have been offering several possibilities and creating new markets around the world. Projects using Augmented Reality offer different experiences in the most diverse segments, including culture. If well designed for public spaces, AR can become a powerful tool in situations that simulate, facilitate, and favour cultural learning, making it more interesting and interactive. This article aims to investigate the improvement of users’ relationship with sculptures in public spaces through the use of an AR tool. We build on this analysis to identify improvements promoted by the use of the AR tool in the learning and engagement process of users with relevant statues of remarkable personalities of the local culture. The idea was to integrate technology, culture, and design so that the public spaces which contain these monuments could be visited more and have their value given new meaning through information. Keywords: Augmented reality · User centered design · Culture

1 Introduction

Bleicher and Comarella [4] define immersive virtual environments as those that cross the limits of physical space and allow interaction to happen inside simulated realities. User engagement is much greater because of this opportunity for interaction and the endless approaches it creates. Augmented and Virtual Reality devices (Fig. 1) provide experiences that mix the physical and virtual worlds through three-dimensional images. Technically, Gnipper [8] describes the adoption of these immersive environments as a gradation that starts with Augmented Reality, whose market is projected to be much larger than Virtual Reality's by 2023. Due to its practicality, AR has a certain advantage over VR, which requires a headset to work. Its daily use will generate new experiences that attract more interested parties, moving towards Mixed Reality (AR plus VR) and, after a period of familiarization, arriving at Virtual Reality. In this way, users would have time to understand and distinguish these applications. It is generally agreed today that AR is mostly associated with advertising and entertainment. According to Silva [19], the American technology multinational PTC presented research on the AR marketing market.


Fig. 1. Augmented (a) and Virtual (b) Reality. Source: Sebrae website, 2017.

Economic movements reached US$ 7 billion in 2019, and by 2021 this figure is expected to reach US$ 63 billion, an increase of 800%. In the cultural area there are great opportunities for use; a local example is the Circuit of Poetry in Recife, a series of 12 life-size sculptures of great artists of Pernambuco literature and music scattered throughout the neighborhood (Fig. 2).

Fig. 2. Statue of Ascenso Ferreira from the Poetry Circuit of Pernambuco. Source: JETRO ROCHA/BORAALI website, 2017.

Some of these statues need structural repairs, notably to their information plaques. The plaques carry information about the lives of the honorees, which is lost to vandalism and wear and tear. The statue of Ascenso Ferreira (Fig. 2) was chosen as the object of study because of its strategic location in an area of intense pedestrian circulation. This facilitated the research process with the public and helped identify the beneficial possibilities of using AR in the public spaces where the monuments stand.

2 Literature Review

2.1 User-Centered Design

According to Lowdermilk [13], usability and user-centered design (UCD) change the way people interact with an application, creating a rich experience built on solid design principles. This transforms the digital age and viscerally affects all economic and social structures. Evans and Koepfler [7] also define UCD as a process that improves the human-world experience through technology and consider it a means of engaging people with their living environment. So instead of focusing design on technology, it focuses on how to augment reality with necessary, user-defined things. First, it is necessary to choose the public and to monitor and evaluate their behavior in the face of the technological application. The authors emphasize that, when defining the public, users should be investigated in their natural ("wild") environment rather than in a constructed one. When they engage in everyday activities in their own habitats, one can analyze how they are affected by AR and learn whether the experience is being augmented and improved. Such a study can bring advantages that minimize costs and foster innovations applied to real life.

This research also draws on Norman's design principles [15], which help to understand the aspects that shape the creation of products and services:

1. Visibility: everything the user can perform in sequence. When functions are out of sight, the user does not know how to use them.
2. Feedback: the return of information that lets the user continue a task; it can be presented in a sensory way with audio, tactile, or visual cues, or combinations of them. Without it, the user may perform improper actions or repeat them more than once.
3. Restrictions: by restricting the number of choices, the artifact becomes safer and easier to use. An objective, single path makes the process more intuitive.
4. Mapping: the relationship between two things, in this case between the virtual environment, the user, and the results of this relationship in the world.
5. Consistency: similar tasks are performed with visually similar operations and elements. This consistency of the interface gives greater control to the user.
6. Affordance: an object that lets people know how to use it through physical obviousness and suggested interaction.

According to Jacobs [10], urban space is where interactions between people are generated and where various individual and collective desires mix, making it an inseparable element of the city. Likewise, Santos and Vogel [18] argue that the city is formed from the view of its users, with the daily, variable, and complex relations between people and space forming the basis of urban knowledge. Moreira and Amorim [14] defend projects such as Archeoguide (Fig. 3), which recreates architectural monuments of the city of Olympia in Greece, home of the Olympic Games of antiquity and invaluable to world culture. The potential of AR for cultural heritage lies in documentation, publication, and information dissemination, allowing the user to visualize and interact with objects and historical data. Lima [12] notes that AR used in cultural itineraries brings a freer, contextualized, and more complete experience. Moreover, drawing on Azuma [3], Hugues [9], and Zilles Borba, Zuffo, and Mesquita [20, 22], this article treats AR as a tool to:

– virtually project images to augment the real-world experience;
– maintain the sense of presence in the real environment, without the immersion apparatus of other tools that simulate objects or spaces, making it accessible to more users;
– combine hybrid worlds, real and virtual, through an artifact;
– complement the real world rather than replace the physical environment.


Fig. 3. Temple of Hera in Olympia: (a) ruins, present day; (b) simulation in AR. Source: Vlahakis, website, 2002.

For Azuma [3], the real environment is augmented when AR is used to interact with the virtual one, attaching data-entry commands, image capture, and sensory and geolocation devices. This information is processed and assimilated by a peripheral that connects the two environments in real time. Pinho, Corrêa, Nakamura, and Júnior [16] affirm that it is necessary to observe both environments for certain tasks, emphasizing that interactive construction relies on techniques for elaborating an AR project assertively (Table 1). Along the same line, Lanter and Essinger [11] argue that intuitive design yields products represented by visual models suited to the UX and to the intended activity. Abras, Maloney-Krichmar, and Preece [1] also cite the use of UCD to build knowledge experiments that connect the real world and people's life experiences. For this, the user needs control over the exploration so that the exposed material is understandable and simple in its essence. All of the public's actions should be planned with possible errors in mind, and it is important to give users a chance to recover if they lose their way while using the product.

2.2 Cultural Application

Regarding culture, Arias [2] reveals the importance of understanding it through its roots. Assimilating these origins makes people more open to understanding the need to keep them alive in memory. According to Lóssio and Pereira [11], this context is lost over the course of life, as culture suffers interference from factors that lead to stagnation and a lack of appreciation of its popular nature. One can mention the influence of globalization, which has changed the way people behave towards the rituals and celebrations of popular culture. Tradition is valued by custom, but there is no custom of valuing its real meanings. The subject should therefore be discussed from the perspective of cultural sustainability and of how to make the segment pleasurable, so that it becomes a desirable and marketable product able to generate business and income.


Table 1. Adapted from Introduction to Virtual and Augmented Reality (2018) by the author, 2019.

Tasks | Real environment | Virtual environment
Manipulation of objects | Object manipulation is usually done with tools or by hand | Tool selection is complicated
Communication and interaction between users | Communication with other users by means of voice commands is of fundamental importance in the interactive process | Voice recognition technology is still precarious
Measurement of objects | The possibility of measuring environment objects is quite natural in real applications | Measuring objects in virtual environments is still difficult and little needed
Annotation of information about environment objects | Annotating text and graphic information on paper or notice boards is extremely simple and useful in the interaction process in real environments | The entry of text and numbers is poorly developed in virtual environments

The study thus justified reflecting on and producing effective alternatives that use AR to solve problems such as the low use of public spaces and the scarcity of cultural information about their monuments.

2.3 Cultural Application

The research was characterized as qualitative, since it analyzed the behavior of users during an immersive AR process at the monuments of Recife. A device called the Digital Circuit of Poetry (Fig. 4) was created as a prototype to present information about the work of Ascenso Ferreira to the user. Design methodologies were used to analyze, develop, implement, and observe solutions in order to achieve results.

Fig. 4. Identity of the digital poetry circuit project. Created by the author; 2019.

3 Research Methodology

3.1 Sample

The public that participated in the survey was selected based on the Global Mobile Consumer Survey 2019 produced by Deloitte [5]. This definition of the user profile for a first interview was intended to obtain a control group for the application of the tests. Using the document, the set of people who would start the tests was defined. Soon after, the Structural Survey Questionnaire was applied to define the control group for the future tests, and a semi-structured interview explored the users' habits and interests in more depth. Taking into account frequent and easy use of connected mobile devices, the public was segmented into male and female individuals aged between 18 and 36 years (Fig. 5).

Fig. 5. The user profile is defined for the survey based on DELOITTE data. Prepared by the Author 2019.

Thirty individuals were chosen according to Tullis and Albert [21], who classify samples into two kinds: those meant to identify usability problems in the interaction, where 3 or 4 participants are enough to detect the most significant points, and those meant to observe a process involving many tasks, which require far more than four users. This study of the degree of confidence in the sample (Table 2) defines the success of the research by the number of participants in a group who were able to perform the tasks; according to the authors, the closer one gets to a conclusive result, the more people are needed to identify the remaining problems.


Table 2. Example of how confidence intervals change as a function of the sample size. Prepared by the author; 2020.

Number successful | Number of participants | 95% low confidence | 95% high confidence
4 | 5 | 36% | 98%
8 | 10 | 48% | 95%
16 | 20 | 58% | 95%
24 | 30 | 62% | 91%
40 | 50 | 67% | 89%
80 | 100 | 71% | 86%

Note: These numbers indicate how many participants in a usability test successfully completed a given task and the confidence interval for that completion rate in the larger population.
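For readers who want to reproduce intervals of this kind, the sketch below computes an adjusted-Wald binomial confidence interval, an estimator commonly recommended for small-sample task-completion rates and consistent with the rounded figures in Table 2 (e.g. 24/30 successes gives roughly 62%-91% at 95% confidence). This is our illustration, not code from the authors.

```python
# Minimal sketch: adjusted-Wald 95% confidence interval for a task-completion
# rate, evaluated for the sample sizes shown in Table 2.
from math import sqrt
from scipy.stats import norm

def adjusted_wald_ci(successes, n, confidence=0.95):
    z = norm.ppf(1 - (1 - confidence) / 2)
    n_adj = n + z ** 2                      # "add z^2 trials" adjustment
    p_adj = (successes + z ** 2 / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

for successes, n in [(4, 5), (8, 10), (16, 20), (24, 30), (40, 50), (80, 100)]:
    lo, hi = adjusted_wald_ci(successes, n)
    print(f"{successes}/{n}: {lo:.0%} - {hi:.0%}")
```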

3.2 Research Steps

The general design of the research followed the model of Dresch, Pacheco, and Júnior [6] for technological studies and was developed as follows:

Problem identification: the use of immersive technologies in public environments was investigated to raise hypotheses that guided the construction of the research. Information was organized to plan the best way to collect data, and the urban spaces where the little-visited monuments are located were explored. Intrinsic problems were raised for each step towards the result, observing convergent points in similar studies that suggested an opportunity for further analysis.

Directed readings: articles and materials related to the three areas proposed by the problem were studied: AR, UCD, and public spaces.

Identification and proposition of artifacts and configuration of problem classes: at this stage, we sought to identify how AR could improve user interaction with the external environment, and how producing a device could give new meaning to a location that was previously little visited. Using an artifact would make the data-collection process simpler when observing possible solutions to the problem.

Artifact design: after researching the fields involved, and with the support of the directed studies, a brainstorming technique was used to propose the solution best suited to the problem. This phase involved intense ideation work, bearing in mind that the process had to be feasible and scalable.

Artifact development: at this point the creation of the artifact began: 3D modeling, UX, and mapping of the monument. Prototyping occurred in phases, with test cycles, until the material ran efficiently.

Artifact evaluation: the artifact was evaluated with experts by analyzing the effectiveness of the user experience in a real test field. This moment was crucial for the adjustments needed for an assertive application of the research.

Semi-structured interviews (1st part): as discussed in item 4.1, the control group was defined through applied research on users selected according to the Deloitte survey [5].


Semi-structured interviews (2nd part) and application of tests: for better understanding, this process was divided into four moments:

– Initially, the second part of the semi-structured interview was applied with Technical Questionnaire 1 (see item 4.2 of this article). In this phase we tried to understand the users' relationship with monuments directly, focusing on their habits during analog appreciation, without any interference from technology.
– In the second moment, the user was presented with the Augmented Reality artifact of the monument of Ascenso Ferreira declaiming his own poetry. The information needed to use the device, with all the steps, was made available for testing via a mobile device.
– In the third moment, after following the instructions for using AR and testing the material in all its possibilities and functions, Technical Questionnaire 2 was applied (see item 4.3 of this article). Following Rohrer's desirability study [17], a short user-experience questionnaire was created to evaluate the use of AR applied to the monument, and some points of User-Centered Design associated with the exposure of information were observed.
– In the last moment, Technical Questionnaire 3 (see item 4.4 of this article) was used to understand, more incisively, how the user perceives interactivity, usability, and desirability. Its objective questions concern the use of this type of technology applied to culture.

After applying this method (Fig. 6), it was observed that the data obtained with the use of the technology can address the problem proposed by this research.

Fig. 6. Method of the 2nd part of semi-structured interviews and application. Prepared by the Author; 2019.

– Explanation of learning: with the test results in hand, the feedback was cataloged to evaluate possible improvements.


– Conclusions: at this stage, all the results were formalized to generate a final assessment of the solution and of the new possibilities opened for future studies.
– Generalization to a class of problems: this stage aimed to present the possibility of implementation in similar situations.
– Communication of the results: all the lessons learned through the process and the possibilities for future developments are presented here.

As this research project was proposed in 2019, it is important to note that some of the processes were adapted on account of the COVID-19 pandemic. The development of the artifact, the interviews, and the application of the tests underwent changes to comply with the social-distancing guidelines suggested by the World Health Organization (WHO). All studies and methodologies had been conceived for a real environment, with tests applied in the public space, but it was necessary to find an alternative in which users could have an experience as close as possible to the initial idea within their own homes. Thus, the DSR was adjusted to guide the research and conduct the operation from the beginning, with planning, through to the analysis of the final data.

4 Results

4.1 Structural Questionnaire

This semi-structured questionnaire aimed to learn more about the habits and interests of the users in order to define the control group, selecting 30 interviewees to follow the remaining stages of the research, with the following results (Figs. 7, 8, 9, 10, 11, 12, 13, 14 and 15):

1. What is your cell phone's operating system?

Fig. 7. Structural questionnaire result P1.

2. What is your level of interaction with architectural works, sculptures, and monuments?


Fig. 8. Result structural questionnaire P2.

3. Do you consider the information presented about architecture, statues, and monuments easy to see?

Fig. 9. Structural questionnaire result P3

4. Is the information presented educational?

Fig. 10. Result structural questionnaire P4.

5. Are you satisfied with the information presented?


Fig. 11. Result structural questionnaire P5.

6. How do you get this information? (You can score more than one)

Fig. 12. Result structural questionnaire P6.

7. How often do you use your mobile phone to obtain tourist and cultural information?

Fig. 13. Result structural questionnaire P7.

8. What would you think of some application with interactive audio and video applied to these sculptures?


Fig. 14. Result structural questionnaire P8.

9. Do you think the use of smartphones would facilitate the process?

Fig. 15. Result structural questionnaire P9.

4.2 Technical Questionnaire 1

This first semi-structured technical interview aimed to understand the control group's direct relationship with the monuments, focusing on their habits during analog appreciation, without any interference from technology. We obtained the following results (Figs. 16, 17, 18, 19 and 20):

1. What is your profession? The 30 participants had a variety of occupations, including designers, engineers, journalists, and students.

2. What is your age group?

Fig. 16. Result technical questionnaire-1 P2.


3. About monuments and museums: How much time do you spend observing the monuments in the streets or museums?

Fig. 17. Result Technical Questionnaire-1 P3.

4. How often do you see innovative modes of presentation of monuments?

Fig. 18. Result technical questionnaire-1 P4.

5. My interest in the monuments is directly connected with the way they are presented.

Fig. 19. Result technical questionnaire-1 P5.


6. Would you revisit exhibits or monuments you have already seen?

Fig. 20. Result technical questionnaire-1 P6.

4.3 Technical Questionnaire 2

This second technical questionnaire was answered after the participants followed the instructions for using the Augmented Reality application and tested it (Fig. 4). It was expected to analyze the UX of AR applied to the monument, observing some points of UCD associated with the exposure of information. Based on the experience of using the AR of the Digital Circuit of Poetry, the scale below was used to measure the user's experience, marking the point closest to the adjective identified during the proposed test. We obtained the following results (Figs. 21, 22, 23, 24, 25, 26, 27 and 28):

1. What is your view on the use of AR in the proposed test (fluidity level)?

Fig. 21. Result technical questionnaire-2 P1.

2. What is your view on the use of AR in the proposed test (difficulty level)?

Fig. 22. Technical questionnaire-2 P2 Result.


3. What is your view on the use of AR in the proposed test (efficiency level)?

Fig. 23. Result technical questionnaire-2 P3.

4. What is your view on the use of AR in the proposed test (level of comprehensibility)?

Fig. 24. Result technical questionnaire-2 P4.

5. What is your view on the use of AR in the proposed test (level of dynamism)?

Fig. 25. Result technical questionnaire-2 P5.

6. What is your view on the use of AR in the proposed test (level of interest)?

Fig. 26. Result technical questionnaire-2 P6.

7. What is your view on the use of AR in the proposed test (level of innovation)?


Fig. 27. Result technical questionnaire-2 P7.

8. What is your view on the use of AR in the proposed test (temporal level)?

Fig. 28. Result technical questionnaire-2 P8.

4.4 Technical Questionnaire 3

This final step aimed to capture the user's view of the interactivity, usability, and desirability of the artifact more directly. Through semi-structured research, the experience with AR was related to objective questions about the use of this type of technology applied to culture. We obtained the following results (Figs. 29, 30, 31, 32, 33, 34, 35 and 36):

1. How interactive did you find the AR application in this context?

Fig. 29. Technical questionnaire-3 P1 result.

2. The interactivity presented by the application aroused my interest in appreciating the monument.


Fig. 30. Result technical questionnaire-3 P2.

3. How do you see the use of the application?

Fig. 31. Technical questionnaire-3 P3 result.

4. The possibility of pausing and resuming the poetry recital (via the button) made the experience easier to use.

Fig. 32. Result technical questionnaire-3 P4.

5. The presentation of the monument in AR, reciting a poem, aroused a desire to know the history of the work.


Fig. 33. Result technical questionnaire-3 P5.

6. Would you pay for this type of product in culture-oriented AR?

Fig. 34. Result technical questionnaire-3 P6.

7. Would you indicate this type of cultural product in AR?

Fig. 35. Result technical questionnaire-3 P7.

8. Why would you consume these cultural products? (You can score more than one)


Fig. 36. Result technical questionnaire-3 P8.

5 Recommendations

Overall, the numbers were very positive, considering that the application was tested with a group that consumes technology daily. This fact alone is a positive indicator for cultural proposals involving technology and public spaces. The structural questionnaire provided basic information about the cultural positioning of the public:

– they interact little with the city's monuments;
– time and again they find it difficult to visualize the information about the works;
– they see only partial gains in the educational aspect of the information provided;
– they feel some dissatisfaction with the amount of information presented;
– they use their mobile phones to facilitate the retrieval of information;
– they felt they would have greater interest if the works were more interactive.

In Technical Questionnaire 1, already closer to the AR experience, users generally reported that:

– they do not easily find innovative applications at monuments;
– most have a direct interest when monuments have a differentiated presentation;
– most would visit an exhibition again.

Technical Questionnaire 2 yields important technical considerations for the assertiveness of the research, since users considered AR positive in all aspects; following Tullis and Albert's table [21] (Table 2), we have 91% confidence, since all participants were able to complete the action defined in the test. They considered the AR application fluid during the process, easy to use, efficient for its purpose, clear in the content presented, dynamic in its interaction, and highly innovative. Finally, Technical Questionnaire 3 provided the interviewees' objective assessment of this application in the public space: they consider the application interactive; it made them interested in appreciating the monument; they see that the application would facilitate immersion; it aroused desire about the place and the monument; and most would pay for and recommend this type of product.


6 Conclusion and Future Research

The use of Augmented Reality as a forceful way of occupying public space was analyzed throughout the process. The expectation is to re-signify these places within the city through a more immersive experience, addressing the cultural context under a new light. Following this process also opens the possibility of expansion to other monuments within the Recife neighborhood, since it holds several other culturally relevant architectural works. The users approached showed great willingness to consume cultural products from their state; however, the way these products present themselves does not invite further searching or deeper immersion in the subject. The artifact visibly prompted users to inquire into the life of the poet honored in the Poetry Circuit. This fact alone makes it clear how an immersive approach can generate knowledge and interconnect information that usually goes unnoticed. The users tested even recommended including additional functionality to make the artifact more interactive: a menu giving access to an audiobook through which they could browse other poems; storytelling about the poet's life, in which he would recount his own history; and a set of questions and answers about the life of the statue's honoree. These possibilities show how positive the interaction with users was and how much they need stimuli to consume products that usually do not receive this type of approach.

It is important to highlight the possibility of scaling this project to other works in Recife, since the capital has several other works that are not part of the Poetry Circuit and suffer from the same abandonment of public space. This digitization of history would make these environments more visited and better prepared to receive local and outside visitors, strengthening knowledge of human history; after all, the identity of a people is intrinsically linked to its culture. It is expected that the object of study, the Digital Circuit of Poetry, can become part of local digital tourism, presenting the culture of Pernambuco in a more didactic and desirable way, always teaching about the roots of its ancestors. This time, not only for consumption at home, as was the case in the test, but in the places where the statues stand. Only then can we make public what is ours: the knowledge of our traditions and of the place where they were born.

References

1. Abras, C., Maloney-Krichmar, D., Preece, J.: User-centered design. In: Bainbridge, W., et al.: Berkshire Encyclopedia of Human-Computer Interaction. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.381&rep=rep1&type=pdf. Accessed 14 May 2019
2. Arias, P.G.: La cultura. Estrategias conceptuales para comprender la identidad, la diversidad, la alteridad y la diferencia. https://digitalrepository.unm.edu/cgi/viewcontent.cgi?article=1009&context=abya_yala. Accessed 13 July 2020
3. Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., Macintyre, B.: Recent advances in augmented reality. IEEE Comput. Graph. Appl. https://www.researchgate.net/publication/3208983_Recent_advances_in_augmented_reality_IEEE_Comput_Graphics_Appl. Accessed 26 Mar 2020
4. Comarella, R.L., Bleicher, S.: Experimentação de recursos didáticos. https://moodle.ead.ifsc.edu.br/mod/book/view.phpid=82391&chapterid=16198che. Accessed 25 Oct 2020


5. DELOITTE GLOBAL: A mobilidade no dia a dia do brasileiro. https://www2.deloitte.com/br/pt/pages/technology-media-and-telecommunications/articles/mobile-survey.html. Accessed 22 July 2020
6. Dresch, A., Pacheco, D., Antunes Júnior, J.A.V.: Design Science Research: método de pesquisa para avanço da ciência e tecnologia, 1st edn. Bookman, Porto Alegre (2015)
7. Evans, K., Koepfler, J.: The UX of AR: toward a human-centered definition of augmented reality. http://uxpamagazine.org/the-ux-of-ar/. Accessed 15 July 2020
8. Gnipper, P.: Especialista faz 5 previsões sobre o futuro das realidades virtual e aumentada. https://canaltech.com.br/rv-ra/especialista-faz-5-previsoes-sobre-o-futuro-das-realidades-virtual-e-aumentada-119024/. Accessed 20 Sept 2020
9. Hughes, O., Fuchs, P., Nannipieri, O.: New augmented reality taxonomy: technologies and features of augmented environment. https://hal.archives-ouvertes.fr/hal-00595204/document. Accessed 24 Aug 2020
10. Jacobs, J.: Morte e vida das grandes cidades, 3rd edn. WMF Martins Fontes, São Paulo (2014)
11. Lanter, D., Essinger, R.: User-centered design. In: Richardson, D., et al. (eds.) International Encyclopedia of Geography: People, the Earth, Environment and Technology. https://onlinelibrary.wiley.com/doi/full/10.1002/9781118786352.wbieg0432. Accessed 21 May 2018
12. Lima, Ma.C.G.V.: Realidade Aumentada Móvel e Património no Espaço Público/Urbano. Dissertação de Mestrado em Gestão e Programação do Património Cultural, apresentada à Faculdade de Letras da Universidade de Coimbra. https://eg.uc.pt/bitstream/10316/36094/1/Realidade%20Aumentada%20Movel%20e%20Patrimonio.pd. Accessed 25 Nov 2019
13. Lowdermilk, T.: Design Centrado no Usuário. Novatec Editora, São Paulo (2013)
14. de Souza Moreira, L.C., de Amorim, A.L.: Realidade Aumentada e patrimônio cultural: apresentação, tecnologias e aplicações. https://www.researchgate.net/publication/275153985_REALIDADE_AUMENTADA_E_PATRIMONIO_CULTURAL_APRESENTACAO_TECNOLOGIAS_E_APLICACOES. Accessed 29 Sept 2019
15. Norman, D.: The Design of Everyday Things. Basic Books, New York (1988)
16. Pinho, M., Corrêa, C., Nakamura, R., Júnior, J.: Introdução a Realidade Virtual e Aumentada. Editora SBC, Porto Alegre (2018)
17. Rohrer, C.P.: Desirability studies: measuring aesthetic response to visual designs. https://www.xdstrategy.com/desirability-studies/. Accessed 15 July 2020
18. dos Santos, C.N.F., Vogel, A.: Quando a rua vira casa: a apropriação de espaços de uso coletivo em um centro de bairro, 3rd edn. Projeto, Rio de Janeiro (1985)
19. Silva, B.: Marketing mais que real. https://www.istoedinheiro.com.br/marketing-mais-que-real/. Accessed 13 Apr 2020
20. Tori, R., da Silva Hounsell, M. (eds.): Introdução a Realidade Virtual e Aumentada. Editora SBC, Porto Alegre (2018)
21. Tullis, T., Albert, B.: Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics. Morgan Kaufmann, New York (2013)
22. Zilles Borba, E., Zuffo, M., Mesquita, F.: Uma nova camada na realidade: realidade aumentada, electrónica e publicidade. http://www.lasics.uminho.pt/ojs/index.php/cecs_ebooks/article/view/2897. Accessed 23 Aug 2020

Autonomous Vehicle Decision Making and Urban Infrastructure Optimization

George Mudrak(B) and Sudhanshu Kumar Semwal

Department of Computer Science, University of Colorado at Colorado Springs, Colorado Springs, CO 80918, USA

Abstract. Highly mobile populations can quickly overwhelm an existing urban infrastructure as large numbers of people move into a city. Urban road networks built for less traffic quickly become congested, leading to a diffusion of traffic and a wider spread of congestion as secondary roads are increasingly utilized. Unwanted and potentially fatal consequences, such as accidents, stress, driver anger, and road rage, also increase. Traditional methods of addressing the growth in traffic density, such as adding lanes and constructing new roads, can be costly. Moreover, traditional responses do not take into account the growing number of connected vehicles, and soon autonomous vehicles will interact with human-driven cars. In considering a driving population of autonomous vehicles, we identify an increased range of possible traffic management strategies for collaborative experiences. The vehicles themselves can operate according to their own goals, and the urban infrastructure can also be enhanced towards a greater degree of self-management. This paper explores the ideas, techniques, and approach behind creating an Agent Based Modelling environment to support the interaction of autonomous vehicles with a smarter urban infrastructure. Procedural content generation (PCG) and agent based modelling concepts are applied in establishing the modelling framework.

Keywords: Agent based modeling · Autonomous vehicles · Smart infrastructure · Smart cities · Procedural generated cities · Population noise maps

1 Introduction

Cities manage their infrastructures based on projections of future increases or decreases of the population over time. When a population rapidly increases for any reason, infrastructures become strained. In particular, rapid population growth with corresponding mobility increases traffic congestion within urban areas [1]. Congestion does not remain localized and isolated but rather flows along and disperses from influential road segments [2] and important road intersections, which are difficult to identify [3]. Congestion itself is only part of the concern, as increased traffic, stress, anger, fear, and anxiety directly impact driver and pedestrian safety [4, 5], at times for the worse, with fatal accidents. To address these and other concerns, we evaluate urban road infrastructure and seek to utilize the best tools possible. New tools will quickly be added to the set available to urban planners as autonomous vehicles (AVs) continue making their way into everyday use, and the influence of the autonomous vehicle will grow as the ubiquitous car continues making inroads.

Many researchers are already contributing to the study of AVs [6]. The automotive and IT industries [7], along with government support [8], have made self-driving cars a research priority. Between 2014 and 2017, more than $80 billion was spent on the AV field [9]; states such as Nevada, USA have authorized statewide testing of AVs [10], along with a number of cities across North America and Europe [11]. So what defines autonomous or self-driving vehicles? AVs can drive themselves from one location to another in "autopilot" mode, i.e. without driver operational inputs, using various in-vehicle technologies and sensors [12] together with decision-making software models. They operate on the road in the same fashion as human-operated vehicles. An immense contribution is expected in areas such as mobility for aging populations.

AVs must make ongoing decisions based on their inputs and goals in order to operate in real-world traffic. Obeying existing traffic laws is the foundation of what these vehicles must observe. For autonomous vehicles to gain acceptance in society, however, considerations must include more than vehicle dynamics and legal adherence; the passenger's priorities should be included. These priorities certainly include travel time: was the destination reached in a timely or even early manner? Was the trip comfortable for the passenger, or did the vehicle behave in a way that caused physical, mental, or emotional discomfort? These vehicles should be able to operate and adjust in a world with common safety and traffic laws but substantial cultural variance in adherence to the letter of the law. Imagine taking a 4902 mile road trip with your vehicle from Berlin, Germany to Delhi, India. While the core traffic laws remain consistent driving through southern Europe, across Turkey, Iran, Afghanistan and Pakistan, the local populations' adherence to laws, patience, tolerance for deviations, etc. vary considerably. The following paper discusses our Agent-Based AV implementation along with a review of existing research.

2 Background and Literature Review

2.1 Agent Based Modeling of Complex Systems

Complex Systems represent a growing domain of the computing sciences in which any number of components, systems, agents, etc. interact to represent a system, solve a problem, or explore an idea [14]. Complex systems are modeled using various techniques, though agent based modeling continues to find favor in a number of fields [15]. While each field has created its own nomenclature and characteristics for representation within complex systems, they have in common that they require autonomous agents and often represent some aspect of human-based interaction [13, 14, 16].


ABMs provide an ideal solution for modeling specific types of theory and implementation for a number of reasons: they establish a normalized environment for repeatability, especially when multiple disciplines are involved [17], and the economic impacts and/or ethical concerns of some studies require a participant-safe approach [13]. ABMs lend themselves quite readily to multi-domain modeling. From a multi-disciplinary view, ABM work takes into account environmental factors as well as complex human behaviors; examples include the development of predictive models of burglary [18], the population impacts of re-urbanization processes within metropolitan areas [19], and work in simulating transportation [69]. Economics-centric ABMs explore the various costs associated with retail and service oriented markets along with population impacts [20–22]. Work in especially sensitive areas with high levels of ethical and social concern benefits substantially from the use of ABMs; such work includes individual bullying behaviors [23] and group based bullying behaviors [24]. Models have also been presented that examine theories regarding the link between urban sprawl and income segregation [25]. Additional work illustrates the use of ABMs for the exploration of city planning and population dynamic zones [65]. Further, city and population ABMs have even been used within the gaming industry, for example the well-known SimCity series [66] and the current Cities Skylines [83] PlayStation 4 game. The use of agent based models requires the combination of concepts and tools from a variety of fields such as the computing sciences, sociology, psychology, economics, game theory, physics, biology, and medicine. The ABM contributions to both research and industry are made across multiple disciplines and are well established.

Complex Systems
For our purposes, Complex Systems adheres to the general views espoused by Mitchell [14] and Gilbert's book [13]. In brief, Mitchell [14] identifies three commonly exhibited properties of complex systems: complex collective behavior, signaling and information processing, and adaptation. She further provides the following insight into the basic elements comprising a complex system, as illustrated in Table 1:

Table 1. Elements of a complex system

Environment | The space in which the actors operate, inclusive of its own set of properties
Actors | Atomic individuals that inhabit the environment and possess their own set of properties
Behaviors | Actor behaviors to interact with other actors and the environment
Population | A collection of homogeneous agents

Taken as is, Mitchell's statement of the attributes of complex systems provides a framework in which agents, and hence agent based modeling, can operate. Gilbert brings us into the ABM space by elaborating on these elements and mapping them onto the general framework provided by Mitchell [13]. Gilbert's complementary view can be seen in Table 2:


Table 2. Complementary view for the elements of a complex system

Ontological correspondence | A direct correspondence between the computational agents in the model and real-world actors
Heterogeneous agents | A mix of dissimilar agents operating according to their defined preferences and rules
Environment | The environment in which the agents operate
Interaction | The ability for agents to affect one another
Bounded rationality | Limiting the cognitive abilities of the agent and thus the degree to which they can optimize their utility
Learning | Agents learn from their experiences through evolutionary and social learning
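To make these ingredients concrete in the traffic setting, the sketch below implements a classic Nagel-Schreckenberg-style ring-road model in Python: the road cells are the environment, vehicles with different top speeds are heterogeneous actors, the gap to the car ahead drives local interaction, and a random slowdown stands in for bounded rationality. It is an illustrative toy, not the NetLogo model described later in this paper, and all parameter values are arbitrary.

```python
# Minimal traffic ABM sketch (Nagel-Schreckenberg-style ring road).
import random

ROAD_LEN, N_CARS, STEPS, P_SLOW = 100, 20, 50, 0.2

# position -> [car id, speed, personal top speed]; heterogeneous top speeds
cars = {pos: [i, 0, random.choice([3, 4, 5])]
        for i, pos in enumerate(random.sample(range(ROAD_LEN), N_CARS))}

for _ in range(STEPS):
    ordered = sorted(cars)                     # car positions around the ring
    new_cars = {}
    for idx, pos in enumerate(ordered):
        car_id, speed, vmax = cars[pos]
        ahead = ordered[(idx + 1) % len(ordered)]
        gap = (ahead - pos - 1) % ROAD_LEN     # free cells to the car ahead
        speed = min(speed + 1, vmax)           # accelerate toward top speed
        speed = min(speed, gap)                # brake to avoid collision
        if speed > 0 and random.random() < P_SLOW:
            speed -= 1                         # random slowdown (bounded rationality)
        new_cars[(pos + speed) % ROAD_LEN] = [car_id, speed, vmax]
    cars = new_cars

mean_speed = sum(s for _, s, _ in cars.values()) / N_CARS
print(f"mean speed after {STEPS} steps: {mean_speed:.2f} cells/step")
```

Even this small model exhibits the emergent stop-and-go waves that the discussion below refers to as non-trivial, non-coded behavior arising from local interactions.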

An outcome of ABMs lies in ontological correspondence covering the emergence of non-trivial, non-coded behaviors resulting from the interaction of fundamental decisions by the autonomous agents as they grow, adapt, learn, and interact with their environment and one another. An additional benefit of Complex System ABMs is found in Epstein and Axtell's [16] critique of homogeneous agents, which highlights Gilbert's [13] specific "heterogeneous" criterion. In their work "Social Science from the Bottom Up", Epstein and Axtell argue that the use of rational homogeneous agents removes richness and unpredictability from a population. While such modeling reaffirms macro-level behaviors, it does not emphasize micro-level behaviors and as such represents a barrier to one of the hallmarks of complex systems: emergent behavior [16].

As is evident from the previous sections, the application of ABMs fits extremely well with modeling Autonomous Vehicles (AVs), given the multi-disciplinary nature of urban road infrastructure, vehicles, and human behavior. It is also readily apparent given the economic, market, and even ethical/social impacts that reaffirm the use of ABMs in this domain. Real-world physical research includes such areas as pedestrian activity [26], lane management [27], and other driver-assisting technologies; these are not included in this study as they are outside the scope of our research at this time.

As in real life, our ABM's vehicle agents are able to determine a possible route at reasonable cost given a starting location and a destination. Kohout and Erol [28] demonstrated two key points applicable to our approach in their work on agent-based vehicle routing: first, that a stochastic approach to routing (which is preferred in ABMs) yields a viable alternative to a centralized routing heuristic, which is usually avoided in favor of local interactions; and second, that routing can occur in a changing environment. This is significant in that an ABM uses stochastic behavioral responses based on environmental factors representing "real-time" traffic information, which vehicular agents use to model driving successfully [29]. This interdependence at every tick of the simulation has the capacity to exhibit complex and multi-dimensional behavior. Viewing the routing of a vehicle on the urban road system further, a connected graph in fact exists in which every intersection or dead-end represents a node, while the roads themselves represent the edges. Applying known shortest path algorithms to this graph ensures that our vehicles can travel from start to destination [30]. Further work has illustrated the impact that a change in population can have upon an urban environment [19] and reinforces the importance of being able to account for population growth or decline, especially considering that no city remains an enclosed static entity but rather undergoes dynamic organic changes over time.

ABM Modeling Tools
A large number of ABM modelling frameworks exist supporting both single and multiple domains such as teaching, visualization, financial markets, manufacturing, business, healthcare, social sciences, technologies, design, etc. [67, 68]. Focusing on the human element of modeling points to a framework in the social sciences domain, and our need to include custom code requires a framework that supports plug-ins for non-native algorithms and code. Additionally, a framework with an active user community provides options for support and example code. Further, it was critical that the framework support large numbers of differentiated, autonomous actors with a relatively reasonable learning curve for the language used. From prior experience and evaluation, it was known that the ABM framework NetLogo [64] could meet the necessary environment and interaction needs. We are quite familiar with the framework and programming language, having utilized it for prior ABM research and simulations [15, 23, 24, 65].

2.2 Behavioral Influencers
The modern vehicle operator must contend with driving distractions as a common component of everyday driving [31] and with the potential for those distractions to negatively impact their performance [32, 33]. For our purposes, we examine two sources of distraction and behavioral impact that affect driving performance and decision making.

Cultural Influencers Using ABMs
Driving in small-town America does not compare to driving in New Delhi or New York City traffic, for a number of reasons; neither does driving between large cities such as New Delhi and New York City. It is not difficult to project the impact of common, recognized in-vehicle distractions [34] given an obvious increase in vehicles. Culture represents an additional factor taken into account as an influencer on driving behavior using ABMs. Whether driving on the left- or right-hand side, the rules seem to be the same, with similar traffic laws encouraging similar flows. However, while the vast majority of countries have the same basic traffic laws, driver compliance, deviation tolerance, and behavioral responses can vary substantially across the world's countries. Research focusing on cultural differences has utilized the driver behavior questionnaire (DBQ) and has confirmed that behavioral responses are in fact country- and culture-dependent, with an understandable impact on the driving experience [35].


Further work has shown that, while behavioral responses vary, driver anger in response to perceived affronts remains consistent across cultural boundaries [36, 37].

Occupant Influencers
As discussed, the environment contains numerous distractions and other influencers that affect a driver's ability to safely operate a vehicle and can lead to stress manifesting itself in aggressive driving [38]. It has been shown that aggressive driving influenced by anger leads to traffic accidents [39], as angry drivers are slower in responding to various traffic events [40]. Anecdotally, it is known that people do not typically exhibit unchecked anger, and researchers therefore include a form of "forgiveness" factor in their work [41]. While de-escalation strategies, such as reminding drivers of their own past road rage, meet with success [42], with varying effectiveness across cultures [43], they warrant inclusion for us. Adjusting our view of anger, and giving it the general term impatience, allows us to examine the behavioral impact in a non-emotionally biased fashion. Combined with our understanding that both cultural and personal responses from drivers influence their performance and decision making, these types of factors are accounted for within the model. This allows us to address autonomous vehicle behaviors in a fashion similar to how drivers view situations, so that "progress" is made. Thus we recognize that autonomous vehicles themselves must have a concept of impatience in relation to meeting the primary goal of delivering passengers to their destinations on time. For example, an autonomous vehicle cannot allow itself to be stonewalled by pedestrian traffic, or stay on a predetermined route no matter the traffic situation.

2.3 Urban Planning
Approaches to urban planning take several factors into account when optimizing the road infrastructure, influenced by city, state, country, and region. For example, in the central and mountain portions of the United States, substantial road infrastructure exists given the distributed city planning and number of vehicles, while other parts of the world prioritize pedestrian traffic [44, 45], possibly reducing the number of available roads, along with other strategies for pedestrian safety [46]. Research into the domain of artificial city creation is ongoing and typically focused on the film and game industries [47, 66]. These cities are often created via procedural content generation (PCG) so that a consistent set of constraints and criteria is applied for a viable result [48]. By using the PCG approach and varying the input factors, we can create a near-infinite amount of content (cities) that adheres to the same construction and layout constraints [49]. We applied the principles behind PCG as a means of generating a stochastic, distributed starting population as our structure and of creating the road network as a function of that structure [50].

Population Density Visualization
Our primary concerns regarding population generation are two-fold.


First, we desired a PCG approach so that populations can be created quickly and repeatedly [50]; second, we needed a nonlinear, agglomerated population distribution, as that corresponds to real-world cities [51, 52]. Foundational work meeting these goals has been done in other areas, such as procedural image creation and effects [53], by Ken Perlin in an effort to create less "machine-like" computer graphics [54], through the creation of the algorithm now referred to as Perlin noise and an enhanced version, also by Perlin, referred to as Simplex noise. In both cases, the algorithm procedurally generates nonlinear "noise" exhibiting elements of agglomeration. One means of creating a population data set focuses on working with non-targeted data sets related to some portion of a population [55, 56]; in such cases, a variety of information feeds into an aggregated data set. For us, the limitations of this approach include the need for access to a number of data sources that we do not have, and an emphasis that would essentially shift to creating the "right" data set. We additionally evaluated a more recent approach to noise generation termed "blue noise". In the blue noise approach, Mitchell, Ebeida, Awad, et al. [57] illustrate an algorithm that generates noise distributions with good quality and performance across different dimensions. Their approach, while PCG friendly, yields uniformly distributed noise maps that do not exhibit the agglomeration we need; the noise is too uniform to correspond to a real-world, varied population density distribution, and therefore we did not utilize this approach [57]. We found a further approach in two-dimensional Perlin noise and used it to implement the city generation [58]. Since the Simplex noise algorithm generates a nonlinear matrix of smoothed and agglomerated values, it meets our criteria while also being a PCG friendly algorithm. As such, we have chosen to make use of this approach in generating our working populations.

Infrastructure Visualization
All road networks exist to move members of a population from point A to point B, and for our road generation we defined and met core requirements similar to those we established for population generation. First, we desired a PCG based approach to ensure that the process could be quick and repeatable. Second, we found and agreed that a heuristic based approach performs well for road network creation [59]. Road networks can be duplicated or even procedurally generated over existing maps with extracted semantics [60], or built up from image-derived templates or raster maps [61, 62]. Given a population density map produced with Simplex noise, we focus on the procedural and heuristic aspects of our roads. We found Parish and Muller's approach to road generation quite insightful, especially in its application of global and local rules to "craft" the road network [59]. Applying these rules, or viewed another way "constraints", lends a flow to the generated road networks, resulting in a greater correspondence with real road infrastructures. As a whole, for our urban planning, we recognized the need for a base population to exist as the foundation for generating an urban road network. We identified research highlighting the ability to procedurally generate a population and, from that data, a road network. Being able to achieve this procedurally and repeatedly provides substantial benefit,


as it lets us focus our exploration on the AV and passenger behaviors involved without the distraction of creating a perfect infrastructure. It further increases our ability to test our behaviors against stochastically created environments.
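For readers unfamiliar with noise-based population maps, the short sketch below generates an agglomerated density grid in the spirit of the Perlin/Simplex-noise approach described above. For self-containment it uses simple multi-octave value noise (bilinear interpolation of random coarse grids) rather than Perlin's exact gradient-noise algorithm, and the grid size, octave count, and seed are arbitrary choices of ours.

```python
# Sketch: multi-octave value noise as a stand-in for Perlin/Simplex noise,
# producing a smoothed, agglomerated population density map.
import numpy as np

def value_noise(size, cells, rng):
    """One octave: random values on a coarse grid, bilinearly upsampled."""
    coarse = rng.random((cells + 1, cells + 1))
    xs = np.linspace(0, cells, size, endpoint=False)
    i0 = np.floor(xs).astype(int)
    i1 = i0 + 1
    t = xs - i0
    c00 = coarse[np.ix_(i0, i0)]
    c10 = coarse[np.ix_(i1, i0)]
    c01 = coarse[np.ix_(i0, i1)]
    c11 = coarse[np.ix_(i1, i1)]
    tx, ty = t[:, None], t[None, :]
    top = c00 * (1 - tx) + c10 * tx
    bot = c01 * (1 - tx) + c11 * tx
    return top * (1 - ty) + bot * ty

def population_density(size=128, octaves=4, persistence=0.5, seed=7):
    """Sum several octaves of value noise and normalise to a 0..1 density map."""
    rng = np.random.default_rng(seed)
    density = np.zeros((size, size))
    amplitude, cells = 1.0, 4
    for _ in range(octaves):
        density += amplitude * value_noise(size, cells, rng)
        amplitude *= persistence
        cells *= 2
    density -= density.min()
    return density / density.max()

dens = population_density()
print("densest cell:", np.unravel_index(dens.argmax(), dens.shape))
```

The resulting density peaks can then serve as the seeds around which a heuristic road-generation pass, in the style of Parish and Muller, lays out the network.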

Fig. 1. Road infrastructure with vehicles present

Vehicles
NetLogo [63, 64] supports agent and link visualization, as illustrated in Fig. 1. These visuals are enabled through a standard library of off-the-shelf icons and the ability to create and edit custom icons and links [84]. This support allows vehicles to be represented graphically in a customizable fashion with respect to both the icon itself and its color. Links, the visual relationship between agents, can also be customized, allowing a variety of display options applicable to our use of links as road segments. This graphical capacity within NetLogo provides all we need for the visualization of vehicles and road segments within our ABM framework.

Vehicle Route Planning
Vehicle route planning within an urban road infrastructure is comparable to traversing an undirected connected graph in which all vertices, or intersections, are linked by one or more roads [85]. Essentially any path finding algorithm that guarantees a result could apply [70–72]. For our purposes, any well-known and established algorithm that met our need for finding a guaranteed route would suffice. We selected the well-known A* algorithm, as it delivers a guaranteed route, is well established, and serves as a baseline for other research [71]; a minimal illustration of this routing step is sketched at the end of this section.

Toll Planning
The topic of road pricing has been a consideration of economic and transportation research for years [73], with work performed to determine its acceptability [74]. Recent research demonstrates that road pricing can reduce traffic jams [75], with the additional benefit of reducing vehicle emissions [76]. Behavioral research has been conducted on what changes drivers will make and their acceptance of those changes [77], along with the economic impact in terms of costs [75] or through positive incentives [77]. We elected to exclude these driver behavioral factors from consideration while accepting the foundational position that road pricing via tolls does affect traffic. Therefore, we elected to include toll support within our framework.

Positional Currency
The use of tokens to change or incentivize traveler behavior has been occurring for some time [81], with recent research folding game theory into commuter behavior [82]. These tokens represent another approach to the concept of road pricing that can be applied per traveler. We rebranded the general "token" as a "commuter token". The commuter token represents one half of the solution for AVs; the other half can be found in the models behind game-theoretic negotiation, which has been examined as far back as 1961 [78], with newer research into robotic resource negotiation [79] and multi-agent enterprise collaboration [80]. Through the use of commuter tokens, we empower the AVs with a means to offer their own incentives, encouraging cooperation throughout the AV's lifetime [77]. The ability of AVs to negotiate position among themselves through a commuter token, or commuter currency, allows greater traffic flow.
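Returning to the route-planning step above, the following minimal sketch runs A* over a small grid-shaped road network: intersections are grid cells, blocked cells stand in for missing road segments, and the Manhattan distance serves as the admissible heuristic. The grid, costs, and start/goal positions are illustrative assumptions of ours, not the NetLogo road network used in the paper.

```python
# Minimal A* route planner over a 4-connected grid road network.
import heapq

def a_star(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_set = [(h(start), 0, start, [start])]               # (f, g, node, path)
    seen = set()
    while open_set:
        _, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nxt = (nr, nc)
                if nxt not in seen:
                    heapq.heappush(open_set,
                                   (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None  # no route exists

# 0 = road cell, 1 = no road
city = [[0, 0, 0, 1, 0],
        [1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
print(a_star(city, (0, 0), (4, 4)))
```

In the full framework, edge costs could also incorporate tolls or commuter-token incentives, biasing the planner toward cheaper or faster corridors.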

3 Conclusion
As the automotive domain continues its unstoppable drive towards truly autonomous vehicles, we have identified how providing support for a smarter urban infrastructure can, in combination with AV data, optimize traffic flow. Further, in this survey paper we have presented the idea that AVs must operate in more than just their primary culture, and thus an exploration of disparate cultural influences must take place. We also present visualization of AVs on the PCG-generated road map using NetLogo™. We feel that through agent-based modelling we can optimally explore the diverse operational values of both the autonomous vehicle and a smart urban infrastructure. This research would yield information applicable to the required autonomous vehicle transitional period as AVs approach saturation, and also to how a city could respond during and after the adoption phases.

References
1. Liu, C., Vu, H.L., Leckie, C., Anwar, T.: Spatial partitioning of large urban road networks. In: Proceedings of EDBT (2014)


2. Liu, C., Vu, H.L., Islam, S., Anwar, T.: RaodRank: traffic diffussion and influence estimation in dynamic urban road networks. In: CIKM, Melbourne, pp. 1671–1674 (2015) 3. Page, L., Brin, S.: The anatomy of a large-scale hypertextual web search engine. Comput. Netwo. ISDN Syst. 30(1–7), 107–117 (1998) 4. Schmidt-Daffy, M.: Fear and anxiety while driving: differential impact of task demands, speed and motivation. Transp. Res. F: Traffic Psychol. Behav. 16, 14–28 (2012) 5. Groeger, J.A., Stephens, A.N.: Do emotional appraisals of traffic situations influence driver behaviour? In: Behavioural Research in Road Safety: Sixteenth Seminar, pp. 49–62 (2006) 6. 2025AD The Automated Driving Community. https://www.2025ad.com/latest/top-universit ies-for-autonomous-driving/ 7. CBInsights. https://www.cbinsights.com/research/autonomous-driverless-vehicles-corporati ons-list/ 8. ABC News. https://abcnews.go.com/US/companies-working-driverless-car-technology/ story?id=53872985 9. Brookings. https://www.brookings.edu/research/gauging-investment-in-self-driving-cars/. Accessed 10 Oct 2020 10. Las Vegas SUN. https://lasvegassun.com/news/2017/jun/20/autonomous-vehicles-in-nevadaroll-forward-with-ne/ 11. Showbiz CheatSheet. https://www.cheatsheet.com/money-career/cities-have-driverless-cars. html/ 12. Gartnet IT Glossary. https://www.gartner.com/it-glossary/autonomous-vehicles/. Accessed 10 Oct 2020 13. Gilbert, N.: Agent-Based Models, Series: Quantitative Applications in the Social Sciences. SAGE Publications (2008) 14. Mitchell, M.: Complexity a Guided Tour, 1st edn. University Press, Oxford (2009) 15. Mudrak, G.: Modeling aggression & bullying: a complex systems approach. Master Thesis UCCS (2013) 16. Axtell, R., Epstein, J.M.: Growing Artificial Societies - Social Science from the Bottom Up. Brookings Institution Press (1996) 17. Axelrod, R.: Agent-based modeling as a bridge between disciplinse. Handb. Comput. Econ. 2, 1565–1584 (2006) 18. Evans, A., Jenkins, T., Malleson, N.: An Agent-based model of burglary. Environ. Plann. B: Urban Anal. City Sci. 36(6), 1103–1123 (2009) 19. Rauh, J., Rid, W., Hager, K.: Agent-based modeling of traffic behavior in growing metropolitan areas. Transp. Res. Proc. 10, 306–315 (2015) 20. Moreno, A., Nealon, J.: Agent-based applications in health care. In: Applications of Software Agent Technology in the Health Care Domain, pp. 3–18. Birkhäuser, Basel (2003) 21. Colson, A., Chisholm, D., Dua, T., Nandi, A., Laxminarayan, R., Megiddo, I.: Health and economic benefits of public financing of epilepsy treatment in India: an agent-based simulation model. Epilepsia 57(3), 464–474 (2016) 22. Evans, A.J., Birkin, M.H., Heppenstall, A.J.: Genetic algorithm optimisation of an agentbased model for simulating a retail market. Environ. Plann. B: Urban Anal. City Sci. 34(6), 1051–1070 (2007) 23. Mudrak, G., Semwal, S.: Modeling aggression and bullying: a complex systems approach. In: Annual Review of CyberTherapy and Telemedicine, pp. 187–191. IOS Press (2015) 24. Mudrak, G., Semwal, S.: Group aggression and bullying through complex systems agent based modeling. Ann. Rev. Cyberther. Telemed. 14, 189–194 (2016) 25. Buchmann, C.M., Schwarz, N., Guo, C.: Linking urban sprawl and income segregation findings from a stylized agent-based model. Environ. Plann. B: Urban Anal. City Sci. 46(3), 469–489 (2019)


26. Batty, M.: Agent-based pedestrian modeling. Environ. Plann. B. Plann. Des. 28(3), 321–326 (2001) 27. Tewolde, G., Zhang, X., Kwon, J., Dang, L.: Reduced resolution lane detection algorithm. In: IEEE AFRICON, pp. 1459–1465, November 2017 28. Erol, K., Kohout, R.: In-time agent-based vehicle routing with a stochastic improvement heuristic. In: AAAI, pp. 864–869 (1999) 29. Dia, H.: An agent-based approach to modelling driver route choice behaviour under the influence of real-time information. Transp. Res. Part C: Emerg. Technol. 10(5–6), 331–349 (2002) 30. Rothkrantz, L.: Dynamic routing using maximal road capacity. In: CompSysTech 2015 Proceedings of the 16th International Conference on Computer Systems and Technologies, pp. 30–37 (2015) 31. Feaganes, J., et al.: Driver’s exposure to distractions in their natural driving environment. Accid. Anal. Prev. 37(6), 1093–1101 (2005) 32. Feaganes, J., Rodgman, E., Hamlett, C., Reinfurt, D., Stuffs, J.: The causes and consequences of distraction in everyday driving. Assoc. Adv. Automot. Med. 47, 235–251 (2003) 33. Feaganes, J., et al.: Distractions in everyday driving (2003) 34. Regan, M., Young, K.: Driver distraction: a review of the literature. NSW: Australasian College of Road Safety, pp. 379–405 (2007) 35. Ozkan, T., Lajunen, T., Tzamalouka, G., Warner, H.W.: Cross-cultural comparison of drivers’ tendency to commit different aberrant driving behaviours. Transp. Res. F: Traffic Psychol. Behav. 14(5), 390–399 (2011) 36. Gras, M.E., Cunil, M., Planes, M., Font-Mayolas, S., Sullman, J.M.: Driving anger in Spain. Pers. Individ. Differ. 42(4), 701–713 (2007) 37. Yao, X., Jiang, L., Li, Y., Li, F.: Driving anger in China: psychometric properties of the driving anger scale (DAS) and its relationship with aggressive driving. Pers. Individ. Differ. 68, 130–135 (2014) 38. Shinar, D.: Aggressive driving: the contribution of the drivers and the situation. Transp. Res. F: Traffic Psychol. Behav. 1(2), 137–160 (1998) 39. Kontogiannis, T.: Patterns of driver stress and coping strategies in a Greek sample and their relationship to aberrant behaviors and traffic accidents. Accid. Anal. Prev. 38(5), 913–924 (2006) 40. Trawley, S.L., Madigan, R., Groeger, J.A., Stephens, A.N.: Drivers display anger-congruent attention to potential traffic hazards. Appl. Cogn. Psychol. 27(2), 178–189 (2012) 41. Dahlen, E.R., Moore, M.: Forgiveness and consideration of future consequences in aggressive driving. Accid. Anal. Prev. 40(5), 1661–1666 (2008) 42. Takaku, S.: Reducting road rage: an application of the dissonance-attribution model of interpersonal forgiveness. J. Appl. Soc. Psychol. 36(10), 2362–2378 (2006) 43. Roskova, E., Lajunen, T., Kovacsova, N.: Forgivingness, anger, and hostility in aggressive driving. Accid. Anal. Prev. 62, 303–308 (2014) 44. Fatian, H., Gota, S., Mejia, A., Leather, J.: Walkability and pedestrian facilities in asian cities. In: ADB Sustainable Development Working Paper Series, vol. 17, February 2011 45. Pinet-Peralta, L.M., Short, J.R.: No accident: traffic and pedestrians in the modern city. Mobilities 5(1), 41–59 (2010) 46. Roe, M., Shin, H., Viola, R.: New York City pedestrian safety study & action plan. New York City Department of Transportation, New York City (2010) 47. Kelly, G., McCabe, H., Whelan, G.: Roll your own city. In: DIMEA 2008 Proceedings of the 3rd international conference on Digital Interactive Media in Entertainment and Arts, pp. 534–535 (2008) 48. 
Meijer, S., Velden, J.V., Iosup, A., Hendrikx, M.: Procedural content generation for games: a survey. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 9(1), Article No. 1 (2013)


49. Smith, G.: Understanding procedural content generation: a design-centric analysis of the role of PCG in games. In: CHI 2014 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 917–926 (2014) 50. Kavak, H., Crooks, A., Kim, J.: Procedural city generation beyond game development. SIGSPATIAL Spec. 10(2), 34–41 (2018) 51. Lobo, J., Strumsky, D., West, G.B., Bettencourt, L.: Urban scaling and its deviations: revealing the structure of wealth, innovation and crime across cities. PLoS ONE 5, e13541 (2010) 52. Karmeshu, P.K., Sikdar, P.K.: On population growth of cities in a region: a stochastic nonlinear model. Environ. Plann. A: Econ. Space 14(5), 585–590 (1982) 53. Perlin, K.: An image synthesizer. SIGGRAPH Comput. Graph 19, 287–296 (1985) 54. Wikipedia Perlin Noise. https://en.wikipedia.org/wiki/Perlin_noise 55. Toint, P.L., Barthelemy, J.: Synthetic population generation without a sample. Transp. Sci. 47(2), 131–294 (2013) 56. Ferreira, J., Jr., Zhu, Y.: Synthetic population generation at disaggregated spatial scales for land use and transportation microsimulation. Transp. Res. Rec.: J. Transp. Res. Board 2429(1), 168–177 (2014) 57. Ebeida, M.S., et al.: Spoke-darts for high-dimensional blue-noise sampling. ACM Trans. Graph. (TOG) 37(2), Article No. 22 (2018) 58. Wijgerse, S.: Generating realistic city boundaries using two-dimensional Perlin noise. Doctoral thesis (2007) 59. Muller, P., Parish, Y.: Procedural modeling of cities. In: SIGGRAPH 2001 Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 301–308 (2001) 60. Bidarra, R., Teng, E.: A semantic approach to patch-based procedural generation of urban road networks. In: GDG 2017 Proceedings of the 12th International Conference on the Foundations of Digital Games, Article No. 71 (2017) 61. Yu, X., Bacui, G., Green, M., Sun, J.: Template-based generation of road networks for virtual city modelling. In: VRST 2002 Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pp. 33–40 (2002) 62. Knoblock, C.A., Chiang, Y.: Automatic extraction of road intersection position, connectivity, and orientations from raster maps. In: GIS 2008 Proceedigns of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Article No. 22 (2008) 63. The Center for Connected Learning and Computer-Based Modeling. http://ccl.northwestern. edu/Uri.shtml 64. NetLogo (2019). https://ccl.northwestern.edu/netlogo/index.shtml 65. Mudrak, G., Semwal, S.: AgentCity: an agent-based modeling approach to city planning and population dynamics. In: International Conference on Collaboration Technologies and Systems, CTS 2012, pp. 91–96 (2012) 66. Wilson, J.: The Official SimCity Planning Commission Handbook. Osborne McGraw-Hill (1994) 67. Abar, S., Theodoropoulos, G.K., Lemarinier, P., O’Hare, G.M.P.: agent based modelling and simulation tools: a review of the state-of-art software. Comput. Sci. Rev. 24, 13–33 (2017) 68. Kravari, K., Bassiliades, N.: A Survey of Agent Platforms. J. Artif. Soc. Soc. Simul. 18(1) (2015). http://jasss.soc.surrey.ac.uk/18/1/11.html 69. Bazzan, A., Klugl, F.: Multi-agent Systems for Traffic Transportation Engineering. IGI Global (2009) 70. Imai, H., Asano, T.: Efficient algorithms for geometric graph search problems. SIAM J. Comput. 15(2), 478–494 (2006) 71. Zhou, R., Hansen, E.A.: Sparse-memory graph search. In: InIJCAI - International Joint Conference on Artificial Intelligence, 9 August 2003, pp. 1259–1268 (2003)


72. Hansen, E.A., Zilberstein, S.: LAO*: a heuristic search algorithm that finds solutions with loops. Artif. Intell. 129(1), 35–62 (2001) 73. Morrison, S.: A survey of road pricing. Transp. Res. Part A: Gener. 20(2), 87–97 (1986) 74. Jaensirisak, S., Wardman, M., May, A.D.: Explaining variations in public acceptability of road pricing schemes. J. Transp. Econ. Policy (JTEP) 39(2), 127–154 (2005) 75. Martin, L.A., Thornton, S.: City-wide trial shows how road use charges can reduce traffic jams. The Conversation, November 2017. https://theconversation.com/city-wide-trial-shows-howroad-use-charges-can-reduce-traffic-jams-86324 76. Bigazzi, A.Y., et. Al.: Road pricing most effective in reducing vehicle emissions. Phys.Org. (2017). https://phys.org/news/2017-10-road-pricing-effective-vehicle-emissions.html 77. Ettema, D., Knockaert, J., Verhoef, E.: Using incentives as traffic management tool: empirical results of the ‘peak avoidance’ experiment. Int. J. Transp. Res. 2(1), 39–51 (2010) 78. Kuhn, H.W.: Game theory and models of negotiation. J. Conflict Resolut. 6(1), 1–4 (1962) 79. Cui, R., Guo, J., Gao, B.: Game theory-based negotiation for multiple robots task allocation. Robot 31(6), 923–934 (2013) 80. Trappey, A.J., Trappey, C.V., Ni, W.C.: A multi-agent collaborative maintenance platform applying game theory negotiation strategies. J. Intell. Manuf. 24(3), 613–623 (2013) 81. Kearney, A., De Young, R.: Changing commuter travel behavior: employer-initiated stragegies. J. Environ. Syst. 24(4), 373–393 (1996) 82. Kracheel, M., McCall, R., Koenig, V.: Studying commuter behaviour for gamifying mobility (2014) 83. Cities Skylines, Sony Playstation, June 2020. https://www.playstation.com/en-us/games/cit ies-skylines-ps4/ 84. NetLogo Shapes Editor. https://ccl.northwestern.edu/netlogo/docs/shapes.html 85. Wikipedia. Connected Graph. https://en.wikipedia.org/wiki/Connectivity_(graph_theory)# Connected_graph

Prediction of Road Congestion Through Application of Neural Networks and Correlative Algorithm to V2V Communication (NN-CA-V2V)

Mahmoud Zaki Iskandarani(B)
Al-Ahliyya Amman University, Amman 19328, Jordan
[email protected]

Abstract. Implementation of simulated Vehicle-to-Vehicle (V2V) communication results using Basic Safety Messages (BSMs) of 320 Bytes size for the purpose of congestion management and control is carried out. The obtained results are used as input to a neural network architecture. The neural structure uses the Weigend Weight Elimination Algorithm (WWEA) to achieve learning, predicting hops count, average network lifetime, and average route length and providing them as inputs to the Congestion Management Algorithm (CMA), which correlates them in order to detect congestion events. This will also aid road traffic designers in their efforts towards less congested roads and smart applications. Testing results showed that congestion can be determined and predicted in advance, such that a control mechanism can be initiated to re-route traffic, with traffic light timing also intelligently adjusted to alleviate expected traffic congestion and related problems. Keywords: Intelligent Transportation Systems · Connected vehicles · Congestion · Network lifetime · V2V · Neural networks

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. K. Arai (Ed.): Intelligent Computing, LNNS 284, pp. 1203–1221, 2021. https://doi.org/10.1007/978-3-030-80126-7_84

1 Introduction
Vehicular Ad hoc Networks (VANETs) are important networks whose required communication tolerance and delay become critical issues as more safety applications become available as part of the integrated safety strategy of the Intelligent Transportation System (ITS) paradigm. Vehicles in the connected-vehicle concept share safety-critical information as they communicate with both infrastructure (V2I) and other vehicles (V2V). Road traffic safety and traffic management applications are the two main applications associated with vehicular networks, with interrelation between them, as accidents and road incidents, which are safety issues, can also affect smooth traffic flow and need management strategies [1–5]. Safety applications are used to minimize traffic accidents and consequential injuries and fatalities. Various accidents are road related and some


of them are caused by ill-managed roads, particularly at intersections, where head-on, front, and rear collisions can occur. Vehicular network applications for connected vehicles provide information to vehicles regarding issues related to distance, speed, and location, among others, thereby helping to avoid accidents that would lead to congestion. During congestion, vehicle speed and location are affected; thus both safety applications and traffic control applications, under the appropriate algorithms, provide vital information regarding congestion, with traffic management applications dealing with traffic flow issues to avoid congestion under different conditions. The management and control of traffic operate in both spatial and temporal dimensions, with concentration on speed control and vehicle navigation [6–10]. Dedicated Short Range Communication (DSRC) is considered a viable option for wireless V2V communication. It is important to be able to quantify and assess communication links between connected vehicles in order to optimize communication and enable better traffic control and management. Thus it is of prime importance to examine the route or path length connecting two vehicles together with its associated parameter of Network Lifetime (NTLT), which is associated with the energy consumed in the communication process and reflects communication link usage in terms of time (path or route length) and duration of contact between the considered vehicles. Link sustainability is considered an essential metric in routing Basic Safety Messages (BSMs) among connected vehicles and is very important on multi-hop routes or paths. Communication link properties are closely related to vehicle speed, physical distance, route hops, energy consumed, and communication time, among others. This indicates that a change in any of these parameters can be used as a sign of traffic irregularity, such as congestion [11–15]. V2V communication using BSMs enables vehicles to exchange data covering main parameters such as speed, location, position, and direction, thus enabling decision making through data processing via vehicles' On-Board Units (OBUs). The exchanged information among vehicles, which covers status such as traffic ahead and distance between vehicles, among others, can be collected either directly from the vehicles (vehicular clouds) or through Vehicle-to-Infrastructure communication employing Road Side Units (RSUs) [16–21]. Neural networks (NN) form an adaptive system that changes its structure and response behavior as it learns during the training process. In essence, a neural network adopts the mechanisms and dynamic structure of biological neurons and processes information in a similar manner to the nervous system. Neural networks learn instead of being programmed: a neural network learns by example (input data) and has the ability, after training and inter-neuron connection optimization, to generalize and associate inputs with outputs [22–25]. Normally, congestion-related models are based on standard models and are not known for accuracy or optimization. Hence, neural networks can be an efficient and viable tool to optimize existing models and to provide results for complex cases that cannot be formulated easily. Neural network structures can be trained to predict data under different environments. Neural networks are able to model nonlinear systems using multi-layer neurons, with the ability to modify synaptic weights as needed to optimize the performance of the network and obtain good convergence [26–28].


In this paper, a neural network model is used to predict congestion on roads. The model uses the Weigend Weight Elimination Algorithm (WWEA) [29] and a multi-layer architecture, which is trained using simulated data resulting from V2V communication with BSM messaging. The simulation is based on an actual road structure. The resulting data are communicated to the traffic control center and to vehicles to re-route traffic and provide alternative roads, hence avoiding congestion before it actually occurs.

2 Methodology
The main objective of this work is to enable control and avoidance of road congestion as a function of traffic density, depending on V2V wireless communication using BSM messages of 320 Bytes size, through applications which incorporate the following parameters:

1. Hop Count
2. Route Length
3. Communication Time (Hop Time)
4. Network Lifetime (Consumed Energy)

The intention is to provide a managing algorithm interfaced to a neural network algorithm that acts on the data. The predictions provided by the neural network algorithm are to be used both in real-time management and for design purposes of smart areas. Figure 1 presents the neural network architecture used to feed the Congestion Management Algorithm (CMA); an illustrative structural sketch is given after Fig. 1.

Fig. 1. Neural networks structure for congestion control
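The following is a minimal structural sketch, under assumed layer sizes, of a feed-forward mapping from traffic density to the three quantities the CMA consumes. It is illustrative only: the weights here are untrained placeholders, and the actual learning uses WWEA as described next.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 8)), np.zeros(8)    # input (vehicle count) -> hidden layer
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)    # hidden layer -> 3 predicted outputs

def predict(n_vehicles):
    x = np.array([[n_vehicles / 1000.0]])        # crude input scaling (assumption)
    h = np.tanh(x @ W1 + b1)                     # hidden-layer activations
    return (h @ W2 + b2).ravel()                 # [hops count, avg route length, avg lifetime]

hops, avg_route_length, avg_lifetime = predict(600)   # placeholder outputs until trained
```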

The neural network is trained using the Weigend Weight Elimination Algorithm (WWEA), as shown in Fig. 2. WWEA focuses on weight pruning, which assists in uncovering the dominant feature(s) and contributes towards fine tuning of the neural structure, hence optimized results. The WWEA operates on the principles of weight decay with error minimization through an error function. This optimization is carried out by adding an overhead term to the original error function employed by the algorithm.


Fig. 2. Neural networks training interface

Based on the overhead, large weights are more affected, as they can contribute negatively towards network convergence depending on their position in the network structure. Weights with large values between the input layer and the hidden layer can result in possible discontinuities in the output values, resulting in a rough function mapping, while large weights between the hidden and output layers can result in instability and oscillations of the mapped output function. Thus, WWEA is an efficient algorithm for optimizing and pruning the network structure through its vital overhead function. The mathematical representation of the application of the overhead function through an error function (E_{WWE}) is described by Eq. (1):

E_{WWE} = E_{start} + E_{overhead}   (1)

The error function E_{start} is given by Eq. (2), and the optimizing part E_{overhead} by Eq. (3):

E_{start} = \frac{1}{2} \sum_{j} (r_j - a_j)^2   (2)

E_{overhead} = Alpha \cdot \sum_{ij} \frac{(w_{ij}/w_n)^2}{1 + (w_{ij}/w_n)^2}   (3)

where Alpha is the weight reduction operator, w_{ij} are the individual weights of the neural network model, w_n is a conditioning parameter computed by the WWEA, r_j is the required output, and a_j is the actual output. The dynamic weight change is calculated through a modified version of the gradient descent algorithm, as shown in Eq. (4):

\Delta w_{ij} = -LRate \cdot \frac{\partial E_{start}}{\partial w_{ij}} - Alpha \cdot \frac{\partial E_{overhead}}{\partial w_{ij}}   (4)

where LRate is the network learning rate (0 to 1). The conditioning parameter w_n operates such that the smallest weight is selected from the last iteration or epoch, and it can operate on a whole set of epochs. The idea is to push the weight values to zero, hence weight elimination. The weight reduction operator Alpha covers the network complexity aspects as a function of performance for the datasets under consideration. The adjustable operator behavior is described by Eq. (5):

Alpha = Alpha_0 \cdot \exp\left(-\lambda \left(1 - Q_p\right)\right)   (5)

where Alpha_0 is a scaling operator and can be set to 1, \lambda is a multiplication constant, and Q_p is the ratio of correctly classified patterns to the total number of patterns. The design change in the learning rate (LRate) can also be related to the network overall error through Eq. (6):

d(LRate) = \gamma \cdot E_{WWE}   (6)

where \gamma is a scaling factor and should be at least twice the error of the network. Equation (5) describes a dynamic relationship between Alpha and Q_p, such that an increase in Q_p will result in an increase in Alpha, thus increasing the influence of E_{overhead}, which pushes smaller weights to zero values (convergence and optimal performance). On the other hand, a reduction in Q_p will have the opposite effect, which will adversely affect the state of the network and will disable optimum performance and convergence. The choice of Alpha is critical for a proper pruning process to occur and not eliminate weights too early, thus losing accuracy and affecting prediction capability. Thus an indirect connection between Alpha and the network error function is established. Also, Eq. (6) applies control of the learning rate (LRate) in correspondence to the network error. Both Eq. (5) and Eq. (6) are tied up and correlated through Eq. (3) for an optimum network convergence. The Weigend Weight Elimination Algorithm (WWEA) is a bidirectional algorithm, which is critical for neural network optimization and design.
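The following is a hedged numpy sketch of the quantities in Eqs. (1)–(5). The function names and the assumption of a single flat weight vector are illustrative simplifications for exposition, not the paper's implementation.

```python
import numpy as np

def wwe_error(required, actual, weights, alpha, w_n):
    e_start = 0.5 * np.sum((required - actual) ** 2)          # Eq. (2)
    ratio_sq = (weights / w_n) ** 2
    e_overhead = alpha * np.sum(ratio_sq / (1.0 + ratio_sq))  # Eq. (3)
    return e_start + e_overhead                               # Eq. (1)

def wwe_weight_update(weights, grad_e_start, l_rate, alpha, w_n):
    # Analytic gradient of the overhead term: d/dw [(w/wn)^2 / (1 + (w/wn)^2)]
    ratio_sq = (weights / w_n) ** 2
    grad_overhead = (2.0 * weights / w_n ** 2) / (1.0 + ratio_sq) ** 2
    return -l_rate * grad_e_start - alpha * grad_overhead     # Eq. (4): delta w_ij

def update_alpha(alpha_0, lam, q_p):
    return alpha_0 * np.exp(-lam * (1.0 - q_p))               # Eq. (5)
```

Small weights see a penalty gradient close to zero and decay towards elimination, while the denominator caps the penalty on genuinely large, useful weights, which is the pruning behaviour described above.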


3 Results and Discussion
Tables 1, 2 and 3 present the initial simulated results for road traffic and V2V communication.

Table 1. 200 vehicles (nodes).

Route no. | Route length | Hops per route | Network lifetime
1 | 67.1 | 3 | 782
2 | 50.7 | 2 | 1270
3 | 50.6 | 2 | 1275
4 | 50.7 | 2 | 1274
5 | 50.8 | 2 | 1269
6 | 51.0 | 2 | 1260
7 | 51.3 | 2 | 1247
8 | 51.7 | 2 | 1230
9 | 52.2 | 2 | 1210
10 | 52.9 | 2 | 1186
11 | 59.5 | 3 | 966
12 | 55.0 | 2 | 1109
13 | 55.5 | 2 | 1091
14 | 56.1 | 2 | 1072
15 | 56.7 | 2 | 1052
16 | 68.4 | 2 | 754
17 | 70.3 | 3 | 718
18 | 64.9 | 3 | 829
19 | 66.4 | 3 | 797
20 | 61.7 | 3 | 906
21 | 62.1 | 3 | 897
22 | 62.9 | 3 | 877
23 | 64.1 | 3 | 849
24 | 65.6 | 3 | 814
25 | 67.3 | 3 | 776
26 | 69.4 | 3 | 736
27 | 71.6 | 3 | 694
28 | 74.0 | 3 | 653
29 | 72.3 | 3 | 682
30 | 79.0 | 3 | 579
31 | 80.9 | 3 | 554
32 | 82.9 | 3 | 530
33 | 83.1 | 3 | 528
34 | 85.0 | 3 | 505
35 | 95.4 | 4 | 407
36 | 95.5 | 4 | 406
37 | 95.8 | 4 | 404
38 | 96.3 | 4 | 400
39 | 96.9 | 4 | 395
40 | 97.8 | 4 | 388
41 | 98.7 | 4 | 381
42 | 106.6 | 4 | 329
43 | 101.0 | 4 | 365
44 | 99.8 | 4 | 374
45 | 101.4 | 4 | 362
46 | 103.2 | 4 | 350
47 | 105.8 | 4 | 334
48 | 107.8 | 4 | 322
49 | 109.8 | 4 | 311
50 | 111.9 | 4 | 300

Table 2. 520 vehicles (nodes).

Route no. | Route length | Hops per route | Network lifetime
1 | 47.1 | 2 | 1435
2 | 46.9 | 2 | 1448
3 | 46.6 | 2 | 1460
4 | 46.4 | 2 | 1470
5 | 46.3 | 2 | 1480
6 | 46.1 | 2 | 1488
7 | 46.0 | 2 | 1496
8 | 45.8 | 2 | 1503
9 | 43.5 | 2 | 1635
10 | 44.4 | 2 | 1583
11 | 45.4 | 2 | 1525
12 | 46.5 | 2 | 1465
13 | 47.8 | 2 | 1403
14 | 49.1 | 2 | 1341
15 | 50.5 | 2 | 1279
16 | 46.3 | 2 | 1475
17 | 46.2 | 2 | 1480
18 | 46.3 | 2 | 1475
19 | 46.6 | 2 | 1460
20 | 47.1 | 2 | 1435
21 | 47.8 | 2 | 1403
22 | 48.6 | 2 | 1364
23 | 49.6 | 2 | 1321
24 | 50.7 | 2 | 1274
25 | 51.9 | 2 | 1225
26 | 53.2 | 2 | 1175
27 | 54.5 | 2 | 1124
28 | 56.0 | 2 | 1075
29 | 57.3 | 2 | 1033
30 | 58.7 | 2 | 989
31 | 69.4 | 3 | 735
32 | 81.7 | 3 | 544
33 | 82.3 | 3 | 537
34 | 82.2 | 3 | 538
35 | 82.2 | 3 | 539
36 | 82.2 | 3 | 538
37 | 82.4 | 3 | 535
38 | 82.7 | 3 | 532
39 | 81.1 | 3 | 552
40 | 82.5 | 3 | 535
41 | 82.2 | 3 | 539
42 | 83.6 | 3 | 522
43 | 82.2 | 4 | 539
44 | 83.8 | 4 | 519
45 | 88.7 | 4 | 467
46 | 86.6 | 4 | 488
47 | 98.8 | 4 | 381
48 | 100.4 | 4 | 369
49 | 102.1 | 4 | 357
50 | 103.9 | 4 | 346
51 | 105.7 | 4 | 334
52 | 107.7 | 4 | 323
53 | 110.8 | 4 | 306
54 | 112.8 | 4 | 295
55 | 106.1 | 4 | 332
56 | 108.2 | 4 | 320
57 | 110.4 | 4 | 308
58 | 110.0 | 4 | 310
59 | 111.8 | 4 | 300
60 | 110.6 | 4 | 307
61 | 116.6 | 4 | 277

Table 3. 600 vehicles (nodes).

Route no. | Route length | Hops per route | Network lifetime
1 | 52.5 | 2 | 1200
2 | 51.2 | 2 | 1249
3 | 50.1 | 2 | 1298
4 | 50.6 | 2 | 1276
5 | 49.2 | 2 | 1339
6 | 47.8 | 2 | 1403
7 | 46.5 | 2 | 1468
8 | 45.3 | 2 | 1531
9 | 44.2 | 2 | 1592
10 | 43.2 | 2 | 1648
11 | 42.4 | 2 | 1699
12 | 41.8 | 2 | 1741
13 | 41.3 | 2 | 1773
14 | 41.0 | 2 | 1792
15 | 40.9 | 2 | 1799
16 | 41.0 | 2 | 1792
17 | 41.3 | 2 | 1773
18 | 41.8 | 2 | 1741
19 | 42.4 | 2 | 1699
20 | 43.2 | 2 | 1648
21 | 44.2 | 2 | 1592
22 | 45.3 | 2 | 1531
23 | 46.5 | 2 | 1468
24 | 47.8 | 2 | 1403
25 | 49.2 | 2 | 1339
26 | 50.6 | 2 | 1276
27 | 44.2 | 2 | 1592
28 | 44.9 | 2 | 1555
29 | 45.7 | 2 | 1510
30 | 46.6 | 2 | 1460
31 | 47.7 | 2 | 1406
32 | 48.9 | 2 | 1349
33 | 50.2 | 2 | 1292
34 | 51.6 | 2 | 1235
35 | 53.1 | 2 | 1178
36 | 54.0 | 2 | 1144
37 | 55.2 | 2 | 1100
38 | 56.6 | 2 | 1055
39 | 58.0 | 2 | 1011
40 | 74.7 | 3 | 642
41 | 76.6 | 3 | 613
42 | 64.9 | 3 | 828
43 | 66.5 | 3 | 793
44 | 68.2 | 3 | 758
45 | 69.9 | 3 | 725
46 | 71.7 | 3 | 693
47 | 73.5 | 3 | 662
48 | 75.3 | 3 | 633
49 | 77.2 | 3 | 605
50 | 79.1 | 3 | 579
51 | 79.2 | 3 | 577
52 | 81.2 | 3 | 551
53 | 83.3 | 3 | 525
54 | 88.0 | 3 | 474
55 | 104.7 | 4 | 341
56 | 101.6 | 4 | 361
57 | 102.8 | 4 | 353
58 | 104.1 | 4 | 345
59 | 105.5 | 4 | 336
60 | 106.9 | 4 | 327
61 | 101.2 | 4 | 364
62 | 103.2 | 4 | 350
63 | 105.2 | 4 | 338
64 | 107.2 | 4 | 326
65 | 107.6 | 4 | 323
66 | 108.9 | 4 | 316
67 | 110.3 | 4 | 308
68 | 111.7 | 4 | 301
69 | 109.6 | 4 | 312
70 | 111.5 | 4 | 302

Figures 3, 4 and 5 present simulated and fitted curves describing the relationship between route length and the V2V pattern of communication over time. From the plots it is realized that the vehicle under consideration forms an increasing exponential relationship as it travels through blocks of infrastructure. Such an exponential curve fit covers the overall path the vehicle travels relative to other vehicles, and at the same time it provides a mathematically reasonable illustration of the effect of traffic density on increasing the number of hops per route within the same infrastructure, which is nonlinear and agrees with real-time scenarios. The general fitting curve describing the relationship between V2V communication route length variation and sequential route connectivity (distance and time dependent) is approximated by Eq. (7):

V2V(N)_{RL} = \eta \cdot \exp(\beta \cdot SRC)   (7)

where SRC is the sequential route connectivity, N is the number of vehicles (nodes) in a spatial domain, \eta is a route length parameter dependent on traffic density, and \beta is a scaling parameter.

Fig. 3. Relationship between route length and connectivity over time for a traffic density of 200 vehicles (nodes); fitted curve: V2V(200 Nodes)_{RL} = 46.9 e^{0.018 \cdot SRC}.

Fig. 4. Relationship between route length and connectivity over time for a traffic density of 520 vehicles (nodes); fitted curve: V2V(520 Nodes)_{RL} = 37.7 e^{0.019 \cdot SRC}.

Fig. 5. Relationship between route length and connectivity over time for a traffic density of 600 vehicles (nodes); fitted curve: V2V(600 Nodes)_{RL} = 35.4 e^{0.016 \cdot SRC}.
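As an illustration of how fitted coefficients like those reported in Figs. 3–5 can be obtained, the sketch below fits Eq. (7) with scipy.optimize.curve_fit to the first ten routes of Table 1, using the route number as the SRC index. The pairing of route number with SRC and the initial guess are assumptions of the example, not the paper's exact fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

src = np.arange(1, 11)                                   # sequential route connectivity index
route_length = np.array([67.1, 50.7, 50.6, 50.7, 50.8,
                         51.0, 51.3, 51.7, 52.2, 52.9])  # Table 1, routes 1-10

def v2v_rl(src, eta, beta):
    return eta * np.exp(beta * src)                      # Eq. (7)

(eta, beta), _ = curve_fit(v2v_rl, src, route_length, p0=(50.0, 0.02))
print(f"eta = {eta:.1f}, beta = {beta:.3f}")   # fitting all 50 routes yields ~46.9 and ~0.018 (Fig. 3)
```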

Figures 6, 7 and 8 present the relationship between network lifetime and sequential route connectivity (SRC) with exponential fitting curves. The relationship is a decreasing exponential, as it should be, since the number of hops and the route length increase, causing a decrement in network lifetime as more energy is consumed through V2V communication. The fitted curve can be approximated by Eq. (8):

Network\,Lifetime(N) = \kappa \cdot \exp(-\rho \cdot SRC)   (8)

where SRC is the sequential route connectivity, N is the number of vehicles (nodes) in a spatial domain, \kappa is a network lifetime parameter dependent on traffic density, and \rho is a scaling parameter.

Fig. 6. Relationship between network lifetime and connectivity over time for a traffic density of 200 vehicles (nodes); fitted curve: Network Lifetime(200 Nodes) = 1487.7 e^{-0.0319 \cdot SRC}.

Fig. 7. Relationship between network lifetime and connectivity over time for a traffic density of 520 vehicles (nodes); fitted curve: Network Lifetime(520 Nodes) = 2157.8 e^{-0.034 \cdot SRC}.

By limiting the exponent in Eqs. (7) and (8) to one variable, and looking at ratios of values as a function of nodes, it is then sufficient to compare the values multiplying the exponential function, as shown in Table 4. Network lifetime is supposed to decrease as the number of nodes increases, based on the assumption that more BSMs are exchanged. However, if a congestion state occurs, then the number of BSMs will be reduced, as the vehicle under consideration is boxed in and surrounded by a finite number of vehicles, hence shorter routes of communication and fewer exchanged safety messages. Thus, from Table 4, we observe an increase in network lifetime and a decrease in route length.

Fig. 8. Relationship between network lifetime and connectivity over time for a traffic density of 600 vehicles (nodes); fitted curve: Network Lifetime(600 Nodes) = 2385.4 e^{-0.0286 \cdot SRC}.

Table 4. Testing nodes.

Number of nodes | V2VRL | Network lifetime
200 | 46.9 | 1487.7
520 | 37.7 | 2157.8
600 | 35.4 | 2385.4
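A small illustrative check of the comparison behind Table 4 (not from the paper's code): normalising the fitted prefactors to the 200-node case makes the opposing trends explicit.

```python
nodes = [200, 520, 600]
v2v_rl = [46.9, 37.7, 35.4]            # eta values from the Eq. (7) fits
lifetime = [1487.7, 2157.8, 2385.4]    # kappa values from the Eq. (8) fits

for n, rl, lt in zip(nodes, v2v_rl, lifetime):
    print(f"{n} nodes: route-length ratio {rl / v2v_rl[0]:.2f}, "
          f"lifetime ratio {lt / lifetime[0]:.2f}")
# 200 nodes: 1.00 / 1.00; 520 nodes: 0.80 / 1.45; 600 nodes: 0.75 / 1.60
```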

To distinguish between heavy traffic and congestion (traffic standstill), the presented neural network model is used to learn from the data in Tables 1, 2, 3 and 4 and to generate a prediction, classification, distinction, and confirmation of traffic status, depending on V2V communication, and to provide three types of traffic states:

1. Smooth traffic flow
2. Smooth to heavy traffic
3. Heavy to standstill traffic (congestion)

The obtained data is not descriptive of the actual situation; it is predictive, intended to prevent such conditions from occurring by feeding the classification back to the Traffic Management Center (TMC) and Traffic Operational Centers (TOC) in order to re-route traffic and avoid congestion, also by controlling traffic lights. Such predictive data is shown in Table 5. The data in Table 5 demonstrates the three mentioned cases: when smooth traffic is flowing, there is a dynamic change in hops count, average route length, and average network lifetime, up to the point where, regardless of the increase in vehicles, all other parameters stay constant at fixed values.

Table 5. Neural networks predicted data using WWEA.

Number of vehicles (nodes) | Hops count | Average route length | Average network lifetime
200 | 152 | 76 | 736
480 | 161 | 75 | 804
500 | 166 | 74 | 846
520 | 172 | 72 | 899
550 | 180 | 69 | 973
600 | 187 | 67 | 1027
750 | 190 | 66 | 1044
760 | 190 | 66 | 1044
800 | 190 | 66 | 1044
840 | 190 | 66 | 1044
850 | 190 | 66 | 1044
880 | 190 | 66 | 1044
900 | 190 | 66 | 1044
950 | 190 | 66 | 1044
1000 | 190 | 66 | 1044

Traffic status (final column of the original table, spanning row groups): Smooth traffic; Smooth to heavy traffic turning points; Heavy to standstill traffic turning points; Standstill traffic (congestion).

Figures 9, 10 and 11 illustrate the predicted data in Table 5, with the three states of traffic (smooth traffic, smooth to heavy traffic, and heavy to standstill traffic) clear in the plots. The concavity at the turning points shows opposite differential signs, further supporting the predicted data.
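The following is a hedged sketch of how a congestion-management check could flag the standstill turning point visible in Table 5 and Figs. 9–11: once traffic saturates, the predicted hop count stops changing, so consecutive differences drop to zero. The window size and the use of hop count alone are assumptions for the example, not the paper's CMA.

```python
vehicles = [200, 480, 500, 520, 550, 600, 750, 760, 800, 840, 850, 880, 900, 950, 1000]
hops     = [152, 161, 166, 172, 180, 187, 190, 190, 190, 190, 190, 190, 190, 190, 190]

def standstill_onset(x, y, window=2):
    """Return the first x at which y has stayed constant for `window` consecutive steps."""
    flat = 0
    for i in range(1, len(y)):
        flat = flat + 1 if y[i] == y[i - 1] else 0
        if flat >= window:
            return x[i - window]
    return None

print(standstill_onset(vehicles, hops))   # 750: where the predicted values plateau in Table 5
```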

Fig. 9. Predicted V2V route hops as a function of traffic density.

Fig. 10. Predicted V2V average lifetime as a function of traffic density.

Fig. 11. Predicted V2V average route length as a function of traffic density.

4 Conclusions
This work presents an intelligent approach to managing traffic by making use of V2V communication and the formed VANETs that connect vehicles. The neural network structure, designed around the Weigend Weight Elimination Algorithm (WWEA), is used to predict congestion using hops count, average network lifetime, and average route length as feeds to traffic controlling centers, whereby algorithms are used to better manage traffic and avoid congestion. Indirect correlation to network energy is shown through network lifetime. The results show a promising concept that, if further developed and applied, will enable smarter management of traffic and vehicles. Explicit energy correlation would result in a more accurate prediction. Also, modelling of different scenarios such as accidents and work zones would be very helpful. Correlating the number of lanes and weather conditions with the threshold decision of congestion or no congestion would focus the predictive algorithm further, as vehicle speeds are weather dependent.


References 1. Baek, M., Jeong, D., Choi, D., Lee, S.: Vehicle trajectory prediction and collision warning via fusion of multisensors and wireless vehicular communications. Sensors 20(288), 1–26 (2020) 2. Hussain, J.S., Wu, D., Xin, W., Memon, S., Bux, N., Saleem, A.: Reliability and connectivity analysis of vehicluar ad hoc networks for a highway tunnel. Int. J. Adv. Comput. Sci. Appl. 10(4), 181–186 (2019) 3. Li, T., Ngoduy, N., Hui, F., Zhao, X.: A car-following model to assess the impact of V2V messages on traffic dynamics. Transp. B Transport Dyn. 8(1), 150–165 (2020) 4. Bai, Y., Zheng, K., Wang, Z., Wang, X.: MC-safe: multi-channel real-time V2V communication for enhancing driving safety. ACM Trans. Cyber-Phys. Syst. 4(4), 1–27 (2020) 5. Guerber, C., Gomes, E., Fonseca, M., Munaretto, A., Silva, T.: Transmission opportunities: a new approach to improve quality in V2V networks. Wirel. Commun. Mob. Comput. 2019, 1–20 (2019). Article ID: 1708437 6. Jung, C., Lee, D., Lee, S., Shim, D.: V2X-communication-aided autonomous driving: system design and experimental validation. Sensors 20, 1–21 (2020) 7. Xie, D., Wen, Y., Zhao, X., Li, X., He, Z.: Cooperative driving strategies of connected vehicles for stabilizing traffic flow. Transp. B Transport Dyn. 8(1), 166–181 (2020) 8. Chen, Y., Lu, C., Chu, W.: A cooperative driving strategy based on velocity prediction for connected vehicles with robust path-following control. IEEE Internet Things J. 7(5), 3822– 3832 (2020) 9. Mertens, J., Knies, C., Diermeyer, F., Escherle, S., Kraus, S.: The need for cooperative automated driving. Electronics 9(754), 1–20 (2020) 10. Yu, K., Peng, L., Ding, X., Zhang, F., Chen, M.: Prediction of instantaneous driving safety in emergency scenarios based on connected vehicle basic safety messages. J. Intell. Connected Veh. 2(2), 78–90 (2019) 11. Gao, K., Han, F., Dong, P., Xiong, N., Du, R.: Connected vehicle as a mobile sensor for real time queue length at signalized intersections. Sensors 19, 1–22 (2019) 12. El Zorkany, M., Yasser, A., Galal, A.I.: Vehicle to vehicle “V2V” communication: scope, importance, challenges, research directions and future. The Open Transp. J. 14, 86–98 (2020) 13. Baek, M., Jeong, D., Choi, D., Lee, S.: Vehicle trajectory prediction and collision warning via fusion of multisensors and wireless vehicular communications. Sensors 20, 1–26 (2020) 14. Nampally, V., Sharma, R.: A novel protocol for safety messaging and secure communication for VANET system: DSRC. Int. J. Eng. Res. Technol. 9(1), 391–397 (2020) 15. Kim, H., Kim, T.: Vehicle-to-vehicle (V2V) message content plausibility check for platoons through low-power beaconing. Sensors 19, 1–20 (2019) 16. Sheikh, M., Liang, J., Wang, W.: Security and privacy in vehicular ad hoc network and vehicle cloud computing: a survey. Wirel. Commun. Mob. Comput. 2020, 1–25 (2020). Article ID: 5129620 17. Liu, X., Jaekel, A.: Congestion control in V2V safety communication: problem, analysis, approaches. Electronics 8, 1–243 (2019) 18. Seyd Ammar, A., Abolfazl, H., Hariharan, K., Farid, A., Ehsan, M.: V2V System Congestion Control Validation and Performance. IEEE Trans. Veh. Technol. 86(3), 2102–2110 (2019) 19. Son, S., Park, K.: BEAT: beacon inter-reception time ensured adaptive transmission for vehicle-to-vehicle safety communication. Sensors 19, 1–11 (2019) 20. Nahar, K., Sharma, S.: Congestion control in VANET at MAC layer: a review. Int. J. Eng. Res. Technol. 9(3), 509–515 (2020) 21. 
Singh, H., Laxmi, V., Malik, A., Batra, I.: Current research on congestion control schemes in VANET: a practical interpretation. Int. J. Recent Technol. Eng. 8(4), 4336–4341 (2019)


22. Zhang, T., Liu, S., Xiang, W., Xu, L., Qin, K., Yan, X.: A real-time channel prediction model based on neural networks for dedicated short-range communications. Sensors 19, 1–21 (2019) 23. Elwekeil, M., Wang, T., Zhang, S.: Deep learning for joint adaptations of transmission rate and payload length in vehicular networks. Sensors 19, 1–21 (2019) 24. Mignardi, S., Buratti, C., Bazzi, A., Verdone, R.: Path loss prediction based on machine learning: principle, method, and data expansion. Appl. Sci. 9, 1–18 (2019) 25. Thrane, J., Zibar, D., Christiansen, H.: Model-aided deep learning method for path loss prediction in mobile communication systems at 2.6 GHz. IEEE Access 8, 7925–7936 (2020) 26. Popoola, S., et al.: Determination of neural network parameters for path loss prediction in very high frequency wireless channel. IEEE Access 7, 150462–150483 (2019) 27. Cavalcanti, B., Cavalcante, G., de Mendonça, L., Cantanhede, G., de Oliveira, M., D’Assunção, A.: A hybrid path loss prediction model based on artificial neural networks using empirical models for LTE and LTE-A at 800 MHz and 2600 MHz. J. Microwaves Optoelectron. Electromagn. Appl. 16(3), 708–722 (2017) 28. Popoola, S., Adetiba, E., Atayero, A., Faruk, N., Calafate, C.: Optimal model for path loss predictions using feed-forward neural networks. Cogent Eng. 5, 1–19 (2018). Article: 1444345 29. Weigend, A., Rumelharta, A., Huberman, B.: Generalization by weight-elimination with application to forecasting. In: Advances in Neural Information Processing Systems Conference, pp. 875–882. ACM (1990)

Urban Planning to Prevent Pandemics: Urban Design Implications of BiocyberSecurity (BCS)

Lucas Potter1, Ernestine Powell2(B), Orlando Ayala3, and Xavier-Lewis Palmer1,4

1 Frank Batten College of Engineering and Technology, Biomedical Engineering Institute, Old Dominion University, 4111 Monarch Way, Norfolk, VA 23508, USA [email protected]
2 Department of Neuroscience, College of Natural and Behavioral Sciences, Christopher Newport University, Forbes Hall 1053, Newport News, VA 23606, USA [email protected]
3 Department of Engineering Technology, Frank Batten College of Engineering and Technology, Old Dominion University, 102 Kaufman Hall, Norfolk, VA 23529, USA
4 School of Cybersecurity, Old Dominion University, Norfolk, VA 23529, USA

Abstract. Recent pandemic complications have spurred significant conversation concerning how various organizations rethink their infrastructure and how people interact with it. More specifically, some city governments are considering how they can improve or re-purpose infrastructural design as they manage pandemics and potentially those to follow. In facilitating this effort, it is essential to look at effectively established principles and perspectives of modern designers to forge new functional and practical infrastructure that can accommodate behavioral changes within a pandemic. As smart cities increase in number, this is of particular concern due to their utilization of computing components in routine citizen and infrastructural interactions. This prevalent interface of citizens interacting with cyber-physical and remote cyber systems requires consideration of the intersection of biosecurity, particularly during a pandemic. Consequently, in this work, a proposed baseline by which city planners can draw inspiration for their efforts to boost resilience to pandemics is designed using First Principles, along with a collection of modern opinions. This exploratory work in progress aims to spur such questions in literature and provoke meaningful dialogue towards this endeavor. Keywords: Infrastructure · Biocybersecurity · Cyberbiosecurity · Urban · Bio-Security

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. K. Arai (Ed.): Intelligent Computing, LNNS 284, pp. 1222–1235, 2021. https://doi.org/10.1007/978-3-030-80126-7_85

1 Introduction
Cities have long been shaped by the needs of community health, from ancient times even to today. For instance, Roman baths were a focus, and health was a consistent feature of the seminal “Ten Books on Architecture” [1]. In another example, epidemics that struck urban environments between the 18th and 20th centuries substantially formed the modern urban setting. A more specific example from this period stems from


the beginning of the 20th century in Senegal [2] with attempts to ameliorate conditions like heart disease with urban planning [3]. In stark contrast, today, the collaboration and progress of public health and urban planning have become notably disparate, particularly regarding the spread of diseases. Ironically, two fields that began with noble intentions of decreasing preventable diseases face a threat found in the current pandemic that exhibits both fields’ flaws [4]. This occurrence is especially odd, considering that some consider epidemics to be the engine of urban planning [5]. However, it is time to acknowledge two unique facts of the world as it is today that exacerbate the spread of biological agents during a pandemic. Foremost is the massively sizable urban population in modern times that exponentially increases the number of possible interactions with people and their intersections compared to previous eras. The urban population was “54% of the total global population … and is expected to grow approximately 1.84% per year between 2015 and 2020, 1.63% per year between 2020 and 2025, and 1.44% per year between 2025 and 2030” [6]. The second contributing fact is the sheer breadth of interconnectedness that exists in today’s world. The prescient Arthur C. Clarke mentioned future potential changes in urban design in 1964, with uncanny accuracy [7]. Clarke proposed that a virtual means of personal interaction would arise from new wireless technologies, which are now known as the Internet. He also discussed how this technology could prove advantageous as the world adapts to a new normal for its interactions to confront the advent of viral pandemics [7]. Concerning the advent of COVID-19, communities around the world faced complexities embedded within balancing the need to efficiently preserve both community health and the economy by oscillating between implementing lock-downs and lifting said restrictions. Be the lift gradual or immediate, many communities face difficulties that improved urban planning can potentially mitigate by addressing the new reality of living in an urban environment with a hardy, airborne disease. This paper explores a possible urban planning effort and design notions that can be considered for more universal implementation for the current pandemic and for years beyond COVID-19. Additionally, digital technologies are not discussed at length within this paper due to this work being a supportive piece aimed towards those with background knowledge of such technologies. It is prudent that data, biometric and peripheral, collectible and accessible throughout, is kept in consideration for discussion.

2 Background
The most important thing to remember as we embark on our quest to design a disease-resistant city is that the vectors of diseases exist in both the temporal and spatial domains. For us, that means that an effective urban design would provide the ability to separate potential attack vectors in space, by increasing physical distance, and in time, by increasing temporal separation. This involves not only increasing gaps between public transit seating or having fewer spaces in restaurants, but also increasing the time between different individuals entering that space. In practice, this could mean having such a large number of public transportation options that only a certain number of people would use the same seat before it was adequately sanitized. To be clear, creating mechanisms to control the spread of disease in cities is hardly a novel concept. Specific zoning boards, health inspections, and metropolitan boards have all been formed in response to


the spread of diseases in the urban environment. As exemplified in the recent past, cities have often taken the initiative to control spread, although sometimes in a manner that further harmed relationships between the government and minority groups [8]. These swift changes, such as the rat-catching and sanitation initiatives encouraged by the local government during the San Francisco Bubonic Plague epidemic and using sewage to help track the number of Covid cases, demonstrate the innovation that arises when cities are forced to handle these crises [8, 9]. Furthermore, the essence of cities as gathering places could not be changed from their ancient roots as much as they can be changed in the modern world. That is to say that the health considerations of a city have nearly always been seen as side issues to the primary objective of keeping the city’s mechanisms running and its economy strong. Again, this can be seen in the San Francisco government’s strong denial of the Bubonic Plague, so as to better maintain trade relationships with other states and countries [8, 10]. Eventually, the plague spread outside of Chinatown and affected a wider demographic than before with increasing case numbers, forcing the local government to no longer suppress efforts to acknowledge the existence of the epidemic and cease suppressing efforts to resolve it [8]. This continuous prioritizing of a city’s economy in urban design or function over the well being of a city’s citizens may be an error in judgment to be reconsidered.

3 Methods
For this paper, the authors first gathered opinions and sought inspiration from past differences in urban design. This ranged as far back as the Roman Empire [1], to the cholera epidemics of the 1800s [1], and even to the last major cases of bubonic plague [10]. Second, we had to acknowledge that the future disease vectors that the urban setting may face might stem not only from naturally occurring diseases, but also from diseases designed by individual hostile actors, malicious groups, or even nation-states. In this case, we felt the need to acknowledge possible vectors which, while not seen as a natural or conventional way for a disease to enter the population, could be co-opted to deliver the microbe.

4 Results In the case of creating a disease-resistant urban environment, we separated the task into several components. This honors the complexity of this task by reflecting the conceptualization of urban environments as not just spaces, but the combination of processes that form them. The first component was to identify the present potential attack vectors that could be used to infect a given urban environment. Yet, this is not to say that these vectors have or will be used to introduce disease into the urban environment. In most cases, the vectors in question are simply routes of entry to bodies that are not native to the urban environment. Consider the simple example of food. Few cities could cultivate their crops on a scale necessary to feed their inhabitants. Yet, carefully designed and regulated vertical farms may resolve this, but such efforts remain to be seen. Regardless, upon such a route being viable, a city can consider how the food produced and later consumed could be a route


for harmful biological agents if the facilities in question are not vigilantly and properly monitored. Notable and excellent examples include farm culling and factory recalls of meats. The second component questions the relevant principles upon which cities are based. This includes our current understanding of what it means to be in an urban environment. In essence, it is a framework of developing reasons as to why people choose to live in an urban environment, and how those needs could be met in a more thoughtfully designed space. These choices can vary along lines of age, work, and more. For example, some employees may choose urban environments for higher pay, while some employers may do so for access to a diverse and robust supply of workers. Other individuals who choose along non-economic aligned angles may do so for aesthetic reasons or reasons aligned with other cultural groupings. Real-world examples include college or industrial parks that capture a wide range of citizen preferences [9]. The next component involves creating a system of responding to a pandemic-level biological threat in an urban environment. This is the culmination of the first two steps: the rigorous threat analysis of an idealized city, and the questioning of why the urban environment exists the way it does now. To further elaborate, city weaknesses and strengths, such as the age, grade, and integrity of the infrastructure, as well as the reasons behind its composition, are important to consider. These considerations can in turn empower planning how to leverage the strengths and mitigate the weaknesses of urban environments, especially when implemented into official policy with active effort and effective documentation [11]. Taking these components into consideration, in Table 1, we suggest a set of rules through which a disease-resistant city could be created. We additionally included guidance as to what such a city may look like. Below a list of potential attack vectors is given. 4.1 Potential Attack Vectors To begin this stage of our analysis, the authors hypothetically assumed the role of the biological antagonists, to simulate exploitations of vulnerabilities in infrastructure. The goal of this exercise would, at the least, imagine how a malicious agent or a naturally occurring, but contaminating biological agent might eventually immobilize a city and cause outstanding biological damage to urban structures. This in turn could thereby disable significant parts of a city, negatively impacting the population of an urban environment. This in turn presents discussion material for city planners to address in order to protect the public. It helps to differentiate a biological agent from the conventional ordinance, for understanding its effects on an environment. Vital importance in the capacity of a biological agent lies within its unlimited resource as a weapon, particularly as a canon in need of something to shoot. A piece of ordinance, once expended, typically no longer presents a threat. Yet, a virus or microbe can continuously infect a population long after the original delivery has been completed so long as it has access to hosts or a viable means of replication. This does rely on the biological agent not being fatal to the host before successful transmission and infection of another person. Keeping this in mind, the goal no longer becomes simple delivery, but the targeting and dissemination of the microbial

Table 1. Potential attack vectors

Common item | Purpose(s) | Means of vulnerability/exploit
Bench | Place to sit | Particles on surfaces
Bus | Transportation | Aerosols ejected from cushions as they are sat upon, provided that they can settle and become embedded without loss of function. Some buses have electronic kiosks through which pay is transferred, which can be a route of transfer if they are interacted with through skin contact (thus, pathogen transfer through skin from improper sanitation)
Toilets | Restroom relief | Aerosols via a toilet plume
Active equipment (especially outdoor) | Fitness | Pathogen transfer through skin from improper sanitation
Accessibility accommodations (railings, motorized scooter rentals, wheelchairs, patient call bells, kiosks) | Assist with individuals having more autonomy to complete their activities of daily living and assist with their quality of life | Pathogen transfer through skin from improper sanitation
Phone booths | Communication | Lingering aerosols in the enclosed space, resulting in a “hot box”-like phenomenon
Pool | Relief from hot temperatures, low-impact exercise | Pathogen transfer through skin or orally, through ingesting of pool water and mixed fluids
Hand dryers | Replacement for towels | Particles aerosolized by high-velocity air (in the case of motion-detected air dryers, this is harder to control as improper detection can spur aerosolization at less controllable rates)
Public, open computers and kiosks (ATMs, information booths, and similar) | Accessing information, money, or services | Pathogen transfer through skin from unclean surfaces or harvesting of biometrics (fingerprints, voice)

payload. In this case, urban planners must strive to understand the complexity of the problem. A person in a major metropolitan city could potentially be within the infection
range of thousands of people during their day. This is all without taking into account any changes in virulence, the severity of the disease, the contagiosity, the ability of the healthcare system and government to manage the infected individuals, the ability and willingness of facilities and individuals in helping to reduce the spread and practice proper disease control measures, or other means connected to the spread of diseases in an urban environment. This also has major implications in the delivery of medical services in an urban environment. For instance, say that a disease is virulent enough to cause a city’s major hospitals to send all other patients home to self-quarantine, leaving the hospitals free to deal with the patients infected with a single disease. The only other avenue of delivering medicine would be local pharmacies, pop-up treatment centers, and similar authorized distribution facilities. In this case, those local pharmacies, while they would be more diffused, present a much more vulnerable target for potential hostile actors as they contain less expensive equipment and patient records, per building to protect. The resources needed to purposefully infect and endanger those pharmacies would be larger, as the target is distributed and requires more resources. However, if this avenue was already composed of asymptomatic individuals or people with mild symptoms, then the problem of dissemination solves itself. Likewise, public transport, such as trains, trams, or buses, offers another route of spread for the biological agent. This can be unintentional due to close contact or improper sanitation in an appropriate time frame. It could also be intentional if the agent is stable enough to be planted in these public spaces. Additionally, biological agents can be intentionally spread via deliberate infection of gas stations or similar nodes at which transporters restock or relieve themselves. The primary issue, in this case, is that to support the current quality of life in an urban environment, a massive quantity of goods are needed, which are not currently cost-effective to produce locally in the present scheme of resource allocation. Each means of resource transport is more critical than the last and the ability to halt a city’s operations by severely affecting even a fraction of these resources can reduce or decimate the quality of life in urban environments. Thus, vigorous protection is of the utmost importance. To best account for insidious methods of spread, we considered likely ways to accomplish this task. Infection of surfaces that one contacts is a common concern, but is hardly the most insidious method. It could be partially addressed by the implementation of simple cleaning measures. Still, this remedy is vulnerable to a natural aspect of city life – wealth inequality. As many communities saw early during the COVID-19 pandemic, much publicly accessible infrastructure was ill-equipped to deal with a pandemic. For example, efforts focused on cleaning of public transportation and seating, such as subways and park benches, were unintentionally countered through frequent contamination of surfaces via direct touching of surfaces or expelling of particles via coughing or talking by infected individuals [12–15]. 
The volume and frequency of such contamination are difficult to stem, given that so much of it can be attributed to people who travel for work out of necessity, partially countering lock-downs because workers without access to adequate social safety nets simply cannot afford to stop working [16]. Even during a crisis, few of them received financial support sufficient to keep them from becoming unintentional deliverers of a viral payload. This
was and is amplified in cases where workers lack(ed) proper personal protective equipment [17]. Notice of this is particularly important within places where this is perhaps needed the most: hospitals, nursing homes, other long-term or specialized care facilities, and practically anywhere with front-line essential personnel [16]. An infection could be easily encouraged by deliberately infecting the poorest workers of a city, especially without their knowledge. In particular, the individuals who are directly responsible for sanitation procedures can be at an even higher level of risk. Even sanitation could be co-opted into a method of negative efforts for preventing the spread, as resources are wasted where they could be better spent on other infrastructural improvements or treatment. This is also to further ask, without addressing these base issues, what frequency of sanitation is even useful? To what degree is crowd control helpful in making even that number useful? For example, if a homeless person was to sleep in a train right after it gets sanitized what is the point of sanitizing it in the first place? What could be done to motivate governments to properly house the homeless to prevent this? A mere redesign of surfaces to make touching them unattractive, found most often under the term “Hostile Architecture”, has often shown to be counter-intuitive due to human ingenuity in times of desperate situations [18–21]. Reducing surfaces to touch, as well as funding smart, touch-less interfaces, warning devices, or monitoring systems that address the most critical of spaces to sanitize (areas that are the site of inordinate exposure to body fluids) may help address concerns in these spaces [15]. Nonetheless, as it currently stands, waste service professionals have little choice to continue doing their jobs amid a pandemic. This is especially true in cases without proper representation advocating for workers’ health. We term this idea – of the infection of the poorest members of a population, especially in a city where wealth inequality is large, a trickle-up attack. The primary concept is that while wealthy members of a population may have access to the resources to combat a biological agent with longevity, the poorer citizens may not. However, since one cannot live within a bubble in a city, eventually the poorer citizens might infect the wealthier members, leading to deeper penetration of the agent in the population. There is a commonality to the problem considered above. Why, exactly, do large populations want to put themselves at such risk by living so tightly together? The answer is likely too complex to be answered in an exploratory and commentary paper such as this, but a simple assumption is that humans are typically socially-centered and desire interaction with one another [22]. This interaction can take the form of either structured, purposeful groupings (companies, political organizations, or public services) or informal groupings (sporting or music events). To better understand how to view those interactions as a potential vector, we can turn to design from first principles [17]. 4.2 Designing from First Principles In this context, Principles of a City are: 1) to centralize resources in order to lay claim to a geographic area 2) to provide services to residents in an urban environment 3) to develop technologies, ability, or knowledge.

In this exercise, one first explores the First Principle of a city, which is to control a given geographic area by political means. In the case of a globally connected world, the presence of disease can quickly be spread from around the globe. The transition from traveling strictly on foot or via the assistance of an animal, to utilizing trains, cars, and planes has rendered the impact of geographic distance and features to be less of a concern to the spread of disease and more of a catalyst. Just one infected passenger is all that is needed to kickstart the spread of infection to another nation, which is why infrastructure capable of detecting, containing, and stopping the spread of infections remains important as the essential idea of distance or geographic features being able to inhibit or excite the spread of disease thus becomes negligible [8, 12, 13, 23]. Which poses the question: should the borders of regions, states, or even nations be rethought? For instance, the divisions of most states in the United States are either built upon relatively arbitrary geographic divisions of latitude and longitude, or divisions based on geological features. In post-colonial states like much of central Asia and Africa, some of these divisions exist in an even more arbitrary way, such as the reasoning behind the partitioning of the Ottoman Empire [18]. Why not make the divisions in such a way as to enable the ease of quarantine? This is not meant as a way to return to more insular communities, but rather as a way to corral the spread of diseases by exciting more local trade, with global trade still existing, but not being the lifeblood of any one community. Take this for an example- imagine two fictional states: East and West Dakota. Both of these states are primarily self-sufficient, but East Dakota lacks ready food supplies, and West Dakota has no energy capability. In the case of an emergent threat response, but especially in the case of a pandemic, why should these two states not cooperate? To a medical officer, does this political division do anything to aid them? Then why ask a medical professional to, if not tolerate it, have to work around it, instead of letting them provide medical support as best they can? Yes, hospital policies, as well as local, state and federal laws related to licensure impose restrictions concerning where medical professionals can work with legal and career consequences should they violate any laws. However, perhaps in times of crisis, such as during pandemics, there could be regulations made that better allow for monitored orchestration of aid across these borders and barriers. This has been the case across the U.S. by way of varying waivers, regulated by the states, to the telehealth telephone and video conference license requirement and waivers that allow physicians from different states to collaborate with each other [17]. This is especially helpful for locations with more Covid experience sharing knowledge with locations that have less or little experience [24, 25]. This network can also be a saving grace for healthcare facilities that are in dire need of supplies, if other facilities are able and willing to donate materials, especially personal protective equipment or PPE [26]. Now, we turn to our second principle. Cities are meant to serve the people in them, to enable the specialization of citizens. Expecting a neurosurgeon to start his day by harvesting his or her wheat would be a difficult demand to ask. 
However, some of these services are now found even in rural areas. Telecommunications enables the spread of news and novelties. Sanitation is found at similar levels around the developed world. Transportation can link rural and urban areas alike. This leads to the delivery of items being comparable in both environments. This is to say that many of the most essential services are now available even without participating in the urban environment. Even business, typically done in crowded urban marketplaces, has made a transition in part to the internet-enabled world. In this case, assuming the appropriate resource allocation to services like sanitation, communications, and consumer goods, the second principle of cities is negated in part in developed countries. The matter of using cities as hubs for commercial activity brings up another interesting point that concerns our third principle. Some believe that we are currently in an “Idea Economy” or “Economy of Ideas”, where concepts or thoughts are the most valuable commodities in the world’s economic system [27, 28]. Supporting this is the current trend of placing emphasis on innovative yet effective solutions due to the lack of, or decreased, productivity in several fields, particularly research, transportation, and architecture [19, 29]. This has in turn spurred an increase in funding and re-analysis of how to improve methods for the benefit of the city and its citizens [19, 29]. In this case, why are we bothering to spread ideas in such an archaic manner if virtual means suffice and have experienced success? Telemedicine has shown this to be possible to a certain extent in the medical field [16]. The rise and growth of companies like WebEx and Zoom are a testament that much work, even in research, can continue without the need for traditional, in-person means of idea exchange [27, 30]. For instance, many companies discovered during the different quarantine orders that just as much work, if not more, can be completed even without access to a physical space, or the physical presence of people in the same space. Also, many government research entities work closely with facilities across the country, and the adoption of ARPANET showed the potential of these research facilities to work within a virtual framework [31]. Also noteworthy, many workers have been found to be more productive working remotely [32], lending support to the idea that a virtual partnership or exchange is well-founded. The adoption of a completely virtual research group is a fascinating idea that may be the future of innovation. As an example, the very work you are reading required no physical presence on the part of any of the authors. All of these principles, however, neglect the comprehensive reality of life in a city. While many services are provided by public servants who are ultimately accountable to citizens, many features of a city are provided through privatization. This leads to another inquiry worth pondering. Should a private organization, whose ultimate purpose is profit, be allowed to make decisions that negatively affect employees’ health and well-being? More importantly, should it be allowed to make decisions that affect the health and well-being of individuals who are not compensated by that company, such as community members who are in contact with its workers and who may or may not be contagious? This is a question for each government to answer, as each individual will no doubt make contact with the infrastructure of the urban environment, creating a plethora of opportunities for the spread of infection. Now, though the formerly addressed principles are important, some sub-principles also deserve mention. Another important aspect of a city is political thought, specifically through political demonstrations.
One of the undeniable powers of a centralized location of a population is the ability to instigate great societal change in a relatively short amount of time, either through dialogue, protests, riots, or demonstrations. However, in the case of a pandemic situation, a large group of people can be seen as not just an example of
political will, but as a distributed denial of service (DDOS) attack in a physical space. This scenario assumes two things- that the group willingly gathers together, and that their motivation for doing so overpowers their motivation for safety. The former case (willingness to gather) could be a condition of income or labor relations. For instance, a group of factory workers may not be fiscally able to stop working. Does this mean that to prevent the spread of pandemics, paid sick leave should be a mandated guarantee? For that matter, should an amount of personal space be guaranteed, or required, to build or rent a home? The latter case assumes a pained motivation behind such a gathering. What if the group is forced together by circumstance? Then, should the amount of space a citizen of a city can expect be regulated in the defense of public health? If a city does mandate a minimum amount of space that a citizen is to be guaranteed, what happens to those citizens that fall in the cracks – the vagrants and those without homes? It is important to consider how design can motivate proper spacing so that such individuals can not only group safely, but also so that they are not encouraged to leave behind fomites. 4.3 Hypothetical Responses to Disease One of the ideas that came to mind when exploring the aforementioned principles of urban design were the ways that our cities respond to disease - and so the people that treat it. In this case, what should be planned when the first responders of a city become the vector of a disease? This is especially the case during the middle of the Covid-19 epidemic, when PPE shortages prevented medical providers from adequate protection and high infection rates among health workers placed doubt on the reasonable capacity for mass quarantining without significant losses in workforce output [23, 33, 34]. To what degree could telepresence alleviate some of this stress? Further, to what degree may disruptions occur, through natural and malicious actions, that need to be additionally safeguarded? Of course, future designs of medical robotics could, theoretically, enable the complete care of a patient from transport to treatment, but that is not a discussion for most citizens as of now and is not a financial consideration for more underserved hospitals, or even healthcare systems. As seen during this last epidemic, many elective surgeries have been canceled or postponed, and telemedicine has surged in use for more routine visits. The latter is likely to see a short- term sustained increase as the public moves to limit their liability and cut costs. However, the goal of most telepresence (to offer consultation support and treatment or guide their counterparts through procedures) is hampered by one massive consideration – that of privacy. As discussed in a previous work [35], telepresence devices could be exploited in ungainly ways to make them incompatible with the ideals of patient privacy, requiring re-designing telepresence practices in some instances, in the future; it is important to be mindful about how current practices may change. This is not currently an issue, for most commonly accessible, personal gadgets that allow a user to track their health are devices like pedometers and sleep trackers, which are of minimal importance. However, as more and more data becomes available, the commensurate risks grow. Perhaps an easier method could be to have citizens of a city become allies in preventing the disease. 
Imagine creating a smartphone app that plays a high-pitched, aggravating tone when in less than 2 m of another smartphone.

Another possibility is infrastructure that sounds an alarm based on proximity to devices that present a risk to the user or to others nearby. Larger entities such as companies or governments could also incentivize regional efforts to curb the spread of COVID-19 based on tracking through deposited waste or trash [36]. Efforts such as these could address, at the least, the most immediate risk factors of infection. They are only a start, but one worth considering and building upon.
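
To make the smartphone idea above concrete, the proximity check could be driven by Bluetooth received signal strength (RSSI). The following is a minimal Python sketch, not a statement of how such an app would actually be built: it assumes RSSI readings are already supplied by whatever Bluetooth scanning API the platform exposes, and the calibration constants (signal strength at 1 m, path-loss exponent) are illustrative placeholders that would need per-device tuning.

# Illustrative calibration constants (assumptions, not measured values):
# typical RSSI at 1 m and a path-loss exponent for a mixed indoor/outdoor setting.
MEASURED_POWER_DBM = -59
PATH_LOSS_EXPONENT = 2.4
ALERT_DISTANCE_M = 2.0

def estimate_distance_m(rssi_dbm):
    # Log-distance path-loss model: estimated distance grows as the signal weakens.
    return 10 ** ((MEASURED_POWER_DBM - rssi_dbm) / (10 * PATH_LOSS_EXPONENT))

def should_alert(rssi_readings):
    # True if any nearby phone appears closer than the 2 m guideline.
    return any(estimate_distance_m(rssi) < ALERT_DISTANCE_M for rssi in rssi_readings)

# Example with hypothetical RSSI values (dBm) from a scan of nearby devices.
if should_alert([-78, -55, -90]):
    print("Play warning tone: another device appears to be within 2 m")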

5 Further Discussion

The relationship between humanity and its technology is one of the most interdependent relationships that exists. Our heroes and villains, kings and commoners, poets and scientists are all defined by what they have, what they create, what they design, and what they use, leading to change in themselves and their environment. This applies to all individuals in their varying uses of technology, especially planned and dynamically generated infrastructure. COVID-19 has shown the world that much change is needed to phase out the risks that pandemics pose to the use of urban infrastructure. Certainly, current technology has prevented many deaths that would have occurred in pandemics centuries ago; the quality and amount of medical knowledge, training, and tools brought to bear in so short a time during the response to this crisis has likely never before been matched. And yet the solutions for making our cities more resilient and more sustainable in defiance of this disease are not solely technological, given the losses incurred despite the use of our current technologies. Instead, a new, more equitable social contract between people may be the next best way, vaccines aside, to prevent such a scourge from occurring once more. It is questionable whether there will ever be a single drug or injection that alone will obviate the need for medical treatment or make pandemics a thing of the past. Instead, it may be that the course of disease in the urban environment depends not on our mastery of the concrete jungle, but on our social response to those who navigate it, and on answering why they use it at disadvantageous times. Our optimal response points to holistic design and management of current and planned spaces, mindful of our fellow citizens and their capacity to adapt, insofar as we can assist.

6 Limitations

Clear limitations exist in that this paper is not a comprehensive treatment of the intersection of BCS and urban design. Further, there exists expansive commentary and development on first principles worth consulting that is not included herein. Moreover, the detailed mechanisms of the technologies discussed, how one may optimally use them, and their beneficial or malicious purposes are not expounded upon either, as such discussion is beyond the scope of this work and requires more careful writing. Lastly, proposed designs may be effective on paper, but once human free will and government politics are added, bringing a proposed plan to reality and fully realizing its potential benefits may not occur in the way envisioned. This paper is both commentary and exploration, but no more than that.

7 Conclusions

What one can take from this is that the arrival of Covid-19, and the contemplation of other potential pandemics, challenges us to rethink our urban planning and economic functions so that they can adequately deal with pandemics. This requires us to map out potential attack vectors, and to be creative in doing so, in order to reduce the natural and direct means of spread that may otherwise evade both countermeasures and capture. Further, we must reconsider how we design and modify our cities based on first principles, considering the basic aspects of what makes a city and working from there. Ultimately, we must then redesign our responses to disease outbreaks. The disruption produced is the new normal, and it is likely to remain that way unless we can use reflection on first principles to design a new normal to which we more willingly consent and which we can frame as progress towards a better prepared and managed city. It is hoped that this will provide a helpful basis for consideration in addressing urban planning designs.

8 Future Work

Future work entails a more robust discussion of vectors and a mapping of the opinions of those working in Biocybersecurity/Cyberbiosecurity. Specifically, this would take the form of a comprehensive review of recent papers at the intersection of BCS and urban design. As the evolution of urban design has incorporated new technologies and the threats that have emerged with them, those at the intersection with BCS will undoubtedly be no exception.

References 1. Pollio, V., Morgan, M.H., Warren, H.L., Pollio, V.: The Ten Books on Architecture. Harvard University Press, Columbia (2018) 2. Bigon, L.: A History of Urban Planning and Infectious Diseases: Colonial Senegal in the Early Twentieth Century, 21 February 2012. https://www.hindawi.com/journals/usr/2012/589758/. Accessed 13 July 2020 3. Barton, H., Tsourou, C.: Healthy Urban Planning. SPON Press, London (2000) 4. Corburn, J.: Confronting the challenges in reconnecting urban planning and public health. Am. J. Public Health 94(4), 541–546 (2004). https://doi.org/10.2105/ajph.94.4.541 5. Malone, C., Bourassa, K.: Americans Didn’t Wait for Their Governors To Tell Them To Stay Home Because of COVID-19, 8 May 2020. https://fivethirtyeight.com/features/americansdidnt-wait-for-their-governors-to-tellthem-to-stay-home-because-of-covid-19/ 6. Knorr, D., Khoo, C.S.H., Augustin, M.A.: Food for an urban planet: challenges and research opportunities. Front. Nutr. 4, 73 (2018) 7. Middleton, K.: Arthur C Clarke predicts the internet in 1964 [video]. YouTube, 22 December 2013. https://www.youtube.com/watch?v=wC3E2qTCIY8 8. Tirachini, A., Cats, O.: COVID-19 and public transportation: current assessment, prospects, and research needs. J. Public Transp. 22(1) (2020). https://doi.org/10.5038/2375-0901.22.1.1. https://scholarcommons.usf.edu/jpt/vol22/iss1/1 9. Marohn, K.: Tracking COVID-19 through sewage, 22 May 2020. https://www.mprnews. org/story/2020/05/22/meet-the-minnesota-scientiststrying-to-track-covid19-spread-throughsewage. Accessed 06 July 2020

10. Tansey, T.: Plague in San Francisco: rats, racism and reform, 24 April 2019. https://www.nat ure.com/articles/d41586-019-01239-x. Accessed 20 Feb 2021 11. Laurian, L.: Planning for active living: should we support a new moral environmentalism? Plann. Theory Pract. 7(2), 117–136 (2006) 12. Ayenew, B., et al.: Challenges and opportunities to tackle COVID-19 spread in Ethiopia. J. PeerScientist 2(2), e1000014 (2020). https://www.peerscientist.com/volume2/issue2/e10 00014/Challenges-and-opportunities-to-tackle-COVID-19-spread-in-Ethiopia.pdf 13. Bonful, H.A., Addo-Lartey, A., Aheto, J.M.K., Ganle, J.K., Sarfo, B., Aryeetey, R.: Limiting spread of COVID-19 in Ghana: compliance audit of selected transportation stations in the Greater Accra region of Ghana. PLoS ONE 15(9), e0238971 (2020). https://doi.org/10.1371/ journal.pone.0238971 14. Swanson, M.W.: The sanitation syndrome: bubonic plague and urban native policy in the Cape Colony, 1900–1909. J. Afr. Hist. 17(3), 387–410 (1977) 15. Ventura, J.R.: Virginia Residents Required To Wear Face Masks In Public Indoor Places, 27 May 2020. https://www.ibtimes.com/virginia-residentsrequired-wear-face-masks-public-ind oor-places-2983347 16. Larochelle, M.R.: “Is it safe for me to go to work?” Risk stratification for workers during the Covid-19 pandemic. New Engl. J. Med. (2020). https://doi.org/10.1056/nejmp2013413 17. Federal Association of State Medical Boards: U.S. States and Territories Modifying Requirements for Telehealth in Response to Covid-19, 5 February 2021. https://www.fsmb.org/sit eassets/advocacy/pdf/states-waiving-licensure-requirements-for-telehealth-in-response-tocovid-19.pdf. Accessed 10 Feb 2020 18. Baer, R.: See no evil: the true story of a ground soldier in the CIA’s war on terrorism (Reprint ed.). Broadway Books (2003) 19. Bloom, N., Jones, C., Van Reenen, J., Webb, M.: Great Ideas Are Getting Harder to Find: Unless we keep raising research inputs, economic growth will continue to slow in advanced nations, 20 December 2017. https://sloanreview.mit.edu/article/great-ideas-are-getting-har der-to-find/. Accessed 10 Feb 2021 20. Chellew, C.: Design paranoia. Ontario Plann. J. 31, 18 (2016) 21. Petty, J.: The London spikes controversy: homelessness, urban securitisation and the question of ‘hostile architecture.’ Int. J. Crime Justice Soc. Democr. 5(1), 67 (2016) 22. Chen, Y., et al.: High SARS-CoV-2 antibody prevalence among healthcare workers exposed to COVID-19 patients. J. Infect. 81(3), 420–426 (2020) 23. Shen, J., et al.: Prevention and control of COVID-19 in public transportation: experience from China. Environ. Pollut. 115291 (2020). https://doi.org/10.1016/j.envpol.2020.115291 24. Krishnamoorthy, Y., Nagarajan, R., Saya, G.K., Menon, V.: Prevalence of psychological morbidities among general population, healthcare workers and COVID-19 patients amidst the COVID-19 pandemic: A systematic review and meta-analysis. Psychiatry Res. 293, 113382 (2020) 25. Kumar, R., Nedungalaparambil, N.M., Mohanan, N.: Emergency and primary care collaboration during COVID-19 pandemic: a quick systematic review of reviews. J. Fam. Med. Primary Care 9(8), 3856–3862 (2020). https://doi.org/10.4103/jfmpc.jfmpc_755_20 26. Sen-Crowe, B., Sutherland, M., McKenney, M., Elkbuli, A.: A closer look into global hospital beds capacity and resource shortages during the COVID-19 pandemic. J. Surg. Res. 260, 56–63 (2021) 27. Barlow, J.: The economy of ideas, 1 March 1994. https://www.wired.com/1994/03/economyideas/ 28. 
Paneth, N., Vinten-Johansen, P., Brody, H., Rip, M.: A rivalry of foulness: official and unofficial investigations of the London cholera epidemic of 1854. Am. J. Public Health 88(10) (1998). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1508470/pdf/amjph000220105.pdf

29. Bloom, N., Jones, C., Van Reenen, J., Webb, M.: Are ideas getting harder to find? Am. Econ. Rev. 110(4), 1104–1144 (2020). https://doi.org/10.1257/aer.20180338 30. Omidi, M.: Anti-homeless spikes are just the latest in ‘defensive urban architecture’. The Guardian, 12 June 2014. https://www.theguardian.com/cities/2014/jun/12/antihomeless-spi kes-latest-defensive-urban-architecture 31. Rudan, I.: A cascade of causes that led to the COVID-19 tragedy in Italy and in other European Union countries. J. Glob. Health 10(1) (2020). https://doi.org/10.7189/jogh.10.010335 32. Hunter, P.: Remote working in research: an increasing usage of flexible work arrangements can improve productivity and creativity. 20(1) (2019) https://www.embopress.org/doi/full/10. 15252/embr.201847435 33. Chen, B., et al.: Basic psychological need satisfaction, need frustration, and need strength across four cultures. Motiv. Emot. 39(2), 216–236 (2014). https://doi.org/10.1007/s11031014-9450-1 34. Groat, L.N., Després, C.: The significance of architectural theory for environmental design research. In: Zube, E.H., Moore, G.T. (eds.) Advances in Environment, Behavior, and Design, pp. 3–52. Springer, Boston (1991). https://doi.org/10.1007/978-1-4684-5814-5_1 35. Roberts, L.: The Arpanet and computer networks. In: A History of Personal Workstations, pp. 141–172 (1988) 36. Mukand, S., Rodrik, D.: The political economy of ideas: on ideas versus interests in policymaking (no. w24467). National Bureau of Economic Research (2018)

Smart Helmet: An Experimental Helmet Security Add-On

David Sales1, Paula Prata1,2,3, and Paulo Fazendeiro1,2,3(B)

1 Departamento de Informática (UBI-DI), Universidade da Beira Interior, Covilhã, Portugal {david.sales,paulof}@ubi.pt, [email protected]
2 Instituto de Telecomunicações (IT), Universidade da Beira Interior, Covilhã, Portugal
3 C4 - Centro de Competências em Cloud Computing (C4-UBI), Universidade da Beira Interior, Covilhã, Portugal

Abstract. When it comes to riding a motorcycle, driver-centered road safety is quintessential; every year a remarkable number of accidents directly related to sleepiness and fatigue occur. With the objective of maximizing security on a motorcycle, the reported system aims to prevent sleepiness-related accidents and to attenuate the effects of a crash. The system was developed to be as unintrusive as possible, with sensors that allow the capture of reaction times to stimuli and the collection of acceleration values. To overcome the lack of data related to sleepiness during motorcycle riding, a machine learning system based on Artificial Immune Systems was developed. This way, resorting to a minimum amount of user input, a custom system is synthesized for each user, allowing the sleepiness level of each subject to be assessed individually.

Keywords: Smart Helmet · Road safety · Motorcycle · Sleepiness · Machine learning · Reaction time · Artificial immune systems

1 Introduction

With the ever-growing number of safety solutions that are currently either vehicle-embedded or mere phone applications, it is easy to find systems that can predict and/or detect accidents. However, the vast majority of these are found in automobiles, not in motorcycles. The system proposed in this work, Smart Helmet, was conceptualized and created with the intent of maximizing road safety for motorcyclists. The road safety problems tackled by this system are the driver’s sleepiness and the aftermath of a motorcycle accident. According to data available on the National Sleep Foundation’s website [1], from the 2005 Sleep in America poll, about 60% of adults have confessed to having driven while experiencing sleepiness (this corresponds to around 168 million people). The fact that sleepiness is a common road safety issue is
also emphasized by the National Highway Traffic Safety Administration (USA), having recorded 4111 fatalities (between 2013 and 2017), 795 deaths (only in 2017) and 91000 accidents (in 2017) all sleepiness-related [2]. In Portugal, the Road Safety National Authority (Autoridade Nacional de Seguran¸ca Rodovi´ aria), in its 2018 annual reports of victims in road accidents within 24 h and 30 d of the accident [3], reveals that of the 34235 accidents present in both reports, there were 508 deaths (24 h report) and 675 fatalities (30 d report) in accidents caused by sleepiness. Considering the above figures, we can say that there is a noticeable amount of people who get affected by the effects of sleepiness while driving, and also a considerable amount of accidents in motorcycles. Taking in mind these two issues, there is a need for a motorcycle safety system with the following features: – Capability to detect the driver’s sleepiness, through some techniques that will be discussed later on, with notifications associated, to prevent possible accidents. – Identification of crashes using an accelerometer built-in into the system (placed on the helmet). After the correct analysis of a crash, the system will notify the selected emergency contacts with the current location, aiming at quickly aiding the user and possibly reducing any crash related injury. Because one of the issues is related to the driver’s attention to the road, Smart Helmet was developed in a way that is as non-intrusive as it can be. This way, the system can be added in a daily setting to the helmet without conflicting with the motorcyclists and their driving. The purpose of this system is to prevent crashes, as well as adding safety measures after a crash, while being available to the whole spectrum of motorcyclists. Looking at the current market, at least the consumer grade one, there isn’t any product that can really provide the same sense of security that this system can. There are some “Smart” helmets, in the sense that they can satisfy the multimedia needs of the motorcycle user, as well as voice control, GPS navigation, and extra vision by capturing the rear view (examples include CrossHelmet [4] and Argon [5]). There are phone applications that provide that bit of security, after a crash occurs, like Realrider [6], that after detecting a motorcycle crash send an SOS signal with your location. The fatigue feeling that one has experienced while driving is not always sleepiness. A number of studies have been made around this matter (sleepiness) and, of those, many contemplate this feeling of sleepiness while driving. There are many ways of assessing this condition, such as: using an Electrocardiogram (or ECG) incorporated in the steering wheel of an automobile (as proposed in [7]), or measuring the sleepiness with the help of an Electroencephalogram (or EEG, used in an helmet in [8]). Our proposed solution Smart Helmet combines the findings of the research done in the sleepiness field (mostly regarding driving) with the simplicity and availability of the commercial grade helmet add-ons. It maintains a minimum direct interaction with the motorcyclists while assessing their sleepiness level, preventing an important cause of casualties.

The remainder of this paper is structured as follows: in Sect. 2, some background knowledge regarding the techniques behind sleepiness assessment is given. Section 3 presents the final developed system, explaining its features, its components, the Artificial Immune Systems approach to sleepiness detection, and the implemented algorithms. A selected set of experiments and their results, as well as the decisions taken during development, are presented in Sect. 4. Section 5 presents the conclusion and future work, both within and beyond the scope of this paper.

2 Background

In the context of driving, sleepiness is closely related to the propensity to stop paying attention to the road and to progressively be overtaken by sleep. There are four methods that can be considered to “measure” sleepiness (according to [9]):

– Subjective Methods: Based on questionnaires given to the individuals in question. A scale is presented to them and they self-evaluate their sleepiness level. The most common scale is the Karolinska Sleepiness Scale, reproduced in Table 1 below. This scale presents nine levels of sleepiness, from the lowest (extremely alert) to the highest (very sleepy, great effort to keep alert, fighting sleep), and it is used as the reference scale further on in this paper. Although effective, it is not practical for this purpose unless used together with another method.
– Vehicle-based Methods: Various sensors are placed on the parts of the vehicle that play a role in driving (for example, the brakes or the gas pedal); these give a correlation between the user’s tiredness and their behaviour with the vehicle, for example braking too soon (to compensate for their sleepiness) or too late. These methods are not very practical here because of their intrusiveness on the motorcycle.
– Behavioral Methods: Based on changes in the driver’s behavior, such as yawning or eye movements. The behavior is captured through a camera and, because of that, cannot be considered for this system, although it has had great success in automobiles.
– Physiological Methods: Based on various physiological signals of an individual (examples include heartbeat, electrical activity of the brain, and muscle activity), these measures are so accurate that it is possible to detect sleepiness in its early stages. The problem with this approach is its level of intrusiveness: every such sensor is intrusive to some degree and, for that reason, this method has to be discouraged.

Physiological measures are intrusive but accurate. According to [8], there is some correlation between the values obtained from the EEG and the subject’s reaction time. Moreover, [11] explores the quantification of sleepiness with and without EEG, stating that through reaction times it is possible to evaluate the subject’s sleepiness. However, the test has to be simple enough so that the driver

Table 1. The Karolinska Sleepiness Scale (KSS) as referred to in [15].

Level | Verbal description
1 | Extremely alert
2 | Very alert
3 | Alert
4 | Fairly alert
5 | Neither alert nor sleepy
6 | Some signs of sleepiness
7 | Sleepy, but no effort to keep awake
8 | Sleepy, but some effort to keep alert
9 | Very sleepy, great effort to keep alert, fighting sleep

can do it without disrupting their driving; moreover, the gathering of EEG signals from a helmet is cumbersome and somewhat ineffective (cf. [10]). In [12], drivers’ responses to danger stimuli after sleep deprivation are studied, and the conclusion is that reaction time increases with sleep deprivation, as does the KSS level. Another work [13], which focuses on sleep restriction and its correlation with attention and reaction time, concludes that reaction time is prolonged with sleep restriction. The work presented in [14] explores the correlation between a subject’s sleepiness and reaction time through a 10-minute auditory test performed before and after sleep deprivation. During those 10 min, various auditory stimuli are produced, and each subject has to press a button as fast as they can when they hear one. A correlation was found between the two: reaction time increased with sleepiness. With everything said so far, and keeping in mind the correlation between KSS levels and reaction time, it is possible to design a specific 10-minute test adapted to the motorcycle driving setting; during that time the average reaction time and its standard deviation are computed and classified as a KSS level.
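
As a small illustration of the test summarised above, the reaction times collected over one 10-minute session reduce to the two statistics that are later tied to a KSS level. The sketch below uses made-up reaction times; it is only meant to show the reduction, not the authors' implementation.

import statistics

def summarize_reaction_test(reaction_times_s):
    # Collapse one 10-minute test into the descriptor used for sleepiness assessment.
    return statistics.mean(reaction_times_s), statistics.pstdev(reaction_times_s)

# Hypothetical reaction times (in seconds) from roughly a dozen auditory stimuli.
test = [0.62, 0.71, 0.58, 0.80, 0.66, 0.74, 0.69, 0.77, 0.63, 0.72]
mean_rt, std_rt = summarize_reaction_test(test)
print(f"average reaction time = {mean_rt:.2f} s, standard deviation = {std_rt:.2f} s")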

3 Developed System

The data available to the system are the accelerometer values for each axis (x, y, z), extracted from a device latched to a helmet. This, along with a microcontroller/computer of moderate processing power, is enough. As an obvious initial approach, machine learning techniques could be applied to meet the objectives of this system. Because of the initial lack of historical accelerometer data, a user-based approach was taken: by letting the user interact with the agent, the agent is given the opportunity to grow its “knowledge” progressively. An Artificial Immune System (AIS) [16] was therefore used to represent each situation of the system. In the implemented AIS, self-nonself discrimination is a core concept: the ability to distinguish what is part of the immune system and what is not, self being the
system and non-self everything that isn’t. When defining an immune system, self-nonself discrimination techniques are used to identify threats to the system. For example, an unknown cell may or may not be harmful to a system, depending on whether the system perceives it as being part of itself. A clear example can be observed in the two yellow cells in Fig. 1.

Fig. 1. Example of the representation of the crash detection system in R^2. The blue circle represents the hypersphere. The green cell is the cell furthest from the centroid (the orange cell). Yellow cells represent two different scenarios.

By using such approach, a model of each situation is created per each individual, and this way an agent is specifically catered to each user’s way of interacting with the environment. AIS ’s are a way of representing systems, inspired by the immune systems of vertebrates, in a way that, in an euclidean space, a representation of a specific system is made, in which said system has cells with certain characteristics specific to the system in question. Figure 2 presents the final system’s architecture. The choice of the components that are depicted in Fig. 2) was as follows: – Raspberry Pi 3 B+ (RPI ): This small computer possesses enough processing power to engage in such problem. Because of its capabilities of networking and connectivity, bluetooth connections can be made to communicate with other devices, in this system an Android phone is considered. The whole AIS approach was implemented here, using Python programming language. Disadvantages associated with the RPI are the fact that it still is a rather large component to attach to an helmet, some adaptations have to be made in the future. – Buzzer and Button: To produce simple auditory stimuli, to gather the reaction times and evaluate one’s sleepiness level, a simple buzzer was added. Also, a button serves simple I/O functionalities, that will be expressed later on (regarding the progress of a motorcycle ride). – Sensors: The only sensor used was an accelerometer (LIS3DH [17] specifically, because it can reach 16G values, which are enough for the scope of the work), this because all the values gathered from it can be used to represent a normal situation and an accident situation (will be explained further on), as well as representing the stimulus answers, that will compose a sleepiness level. If head injury detection, or impact analysis, were to be added to this work, a

Fig. 2. System architecture of the whole “Smart Helmet” environment. The numbers refer to the RPi’s attachments; they are, respectively: accelerometer, buzzer and button.

higher-shock accelerometer would have to be used (much higher than 16G, in order to evaluate values that can be harmful to human beings); a minimal read-out sketch for this sensor is given just after this list. – Android Application: Because most users possess an Android phone, and the user interface on the RPI isn’t as friendly, an application was developed with the intent of communicating with the part of the system latched to the helmet. It lets the user interact with the AIS (RPi) through a Bluetooth connection. The user can train the agent and perform rides (meaning they can start the agent and set it on the lookout for possible accidents, as well as having the user’s sleepiness level assessed from time to time).
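
A minimal sketch of how the LIS3DH could be sampled from Python on the Raspberry Pi and turned into the acceleration norm used throughout the paper. The use of the Adafruit CircuitPython driver, the I2C wiring and the sampling period are assumptions made for illustration; the authors do not state which driver or rate they used.

import math
import time
import board
import adafruit_lis3dh  # assumed driver; the paper only names the LIS3DH sensor

i2c = board.I2C()
sensor = adafruit_lis3dh.LIS3DH_I2C(i2c)
sensor.range = adafruit_lis3dh.RANGE_16_G  # the 16G range mentioned in the text

def acceleration_norm():
    # sqrt(x^2 + y^2 + z^2) for one accelerometer sample (values in m/s^2).
    x, y, z = sensor.acceleration
    return math.sqrt(x * x + y * y + z * z)

while True:
    print(round(acceleration_norm(), 2))
    time.sleep(0.05)  # illustrative sampling period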

3.1 Mode of Operation

The “Smart Helmet” depends on its associated Android application (app) to turn on the helmet-side system and start using it. The first time it is launched, a small trial (of 15 min) has to be carried out to gather the common information needed to build the two models (crash detection and sleepiness evaluation). Before starting a trial, a subjective sleepiness self-evaluation has to be done on the KSS, Table 1. During a trial, accelerometer data is retrieved as the acceleration norm, which translates to sqrt(x^2 + y^2 + z^2), where x, y and z are the acceleration components, and used for the two models: crash detection (data representing the whole 15 min) and sleepiness detection (10 min, starting after the first 5, along with the KSS level given at the beginning). In order to use the modeled systems, one has to begin a motorcycle ride after pressing the respective button in the app and, before actually starting, the current sleepiness has to be self-evaluated. This is because the system needs to know how the user is feeling initially, and the amount of data available to evaluate
the user’s sleepiness level can be low (and initially it will be too low to make a correct assessment). During a ride, the system gathers accelerometer data every second and tries to find out whether it represents a situation out of the ordinary; if so, an accident might have been detected (the word might is used because various factors, such as bumps in the road, have to be considered). When a crash is detected, the user can indicate on the helmet, through the button, whether the accident has actually happened. One click and the system uses that data to re-model the crash detection system (not a crash); two clicks and it asks for help (a crash happened, and an SOS message is sent, along with the location, to the user’s selected emergency contacts, which can be set in the application). Still regarding crash detection, if the user does not respond within 30 s, an automatic SOS message is sent to the user’s emergency contacts. The user can also answer in the application, both to confirm a crash detection and to indicate whether an SOS message should be sent. Along a ride, sleepiness is evaluated: if it is mild or none (up to and including level 4 of the KSS) nothing happens; between levels 5 and 6 an auditory signal is played reminding the user that they may want to rest for a little bit; and at level 7 or above a signal is played advising the user to stop and rest for 30 min, or take a nap/sleep. During a ride not all levels of sleepiness will be correctly evaluated in the initial states of the system; because there is a lack of initial user data and the modeled sleepiness system cannot yet classify all of the incoming data (levels that are not yet modeled), it will try to give a response based on its current state. However, the response given will always correspond to the worst-case scenario, taking into account the current system state and the levels evaluated during the ride. This way the motorcyclist will never be undermined by the sleepiness evaluation. Needless to say, these evaluations will not be fully trustworthy, and so, at the end of a ride the user has to answer a questionnaire regarding them. While answering, the user can observe where and when each sleepiness assessment took place, and the answer has to be either true or false. This way, all the levels that are not yet fully trustworthy, but are correct, can be used to re-model the sleepiness evaluation system, allowing it to evaluate a wider range of sleepiness levels. As one can see, the system adapts over time, with fewer and fewer user interventions. The prototype of the developed system (helmet part) can be observed in Fig. 3.
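
The ride-time decision logic just described can be captured compactly. The sketch below encodes the KSS thresholds (no action up to level 4, a rest reminder for levels 5-6, a stop-and-rest warning at 7 or above) and the 30-second window for answering a crash prompt; the function names and injected callbacks are illustrative and not taken from the authors' code.

def sleepiness_action(kss_level):
    # Map an estimated KSS level to the auditory advice described in the text.
    if kss_level <= 4:
        return "none"
    if kss_level <= 6:
        return "reminder: consider resting for a little bit"
    return "warning: stop and rest for 30 minutes, or take a nap"

def handle_crash_prompt(wait_for_clicks, remodel_with, send_sos, cell, timeout_s=30):
    # One click: not a crash, fold the data back into the crash model.
    # Two clicks or no answer within 30 s: treat it as a crash and send the SOS message.
    clicks = wait_for_clicks(timeout_s)  # hypothetical helper returning 0, 1 or 2
    if clicks == 1:
        remodel_with(cell)
    else:
        send_sos()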

3.2 Crash Detection

This immune system is composed of various cells that are part of the definition of the self, viz. a set representing the normal state of operation. The characteristics of each cell are: average acceleration norm, maximum and minimum acceleration norm, and the standard deviation. Each cell is the representation of one second’s worth of data. For example, in a trial, a cell is created for each second of a 15-minute interval and, in the end, the cells are consolidated. That final consolidation is what constitutes the model: the system finds the central cell (or centroid) of the agglomerate of cells,

Fig. 3. Prototype of the developed system; the RPI is underneath the tape (at the side of the helmet). A power bank can also be observed (it is used to power the RPI).

and computes the distance from the centroid to the cell that is furthest away. With a center on the agglomerate and a maximum distance defined, a hypersphere is created (because each cell has four characteristics, the Euclidean space is R^4). This hypersphere then defines what a normal situation is: an input cell represents a normal situation, and not, for example, a crash, only if it lies inside the defined field (the hypersphere). In Fig. 1 a simplified representation of the crash detection system can be observed; in reality the system lives in R^4, because cells have four characteristics. In the figure, using self-nonself discrimination, cell number 1 is an example of, let’s say, a crash, and cell number 2 of a regular situation.
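
A minimal sketch of the crash-detection model as described here: each one-second cell carries four features (average, maximum and minimum acceleration norm, and standard deviation), the self is the centroid of the trial cells together with the distance to the furthest trial cell, and an input cell is considered normal only if it falls inside that hypersphere. The numbers below are made up; this is an illustration of the idea, not the authors' code.

import math
import statistics

def make_cell(norms_one_second):
    # Four features summarising one second of acceleration-norm samples.
    return (statistics.mean(norms_one_second), max(norms_one_second),
            min(norms_one_second), statistics.pstdev(norms_one_second))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_self_model(trial_cells):
    # Centroid of the trial cells plus the radius to the furthest cell (the hypersphere).
    centroid = tuple(statistics.mean(feature) for feature in zip(*trial_cells))
    radius = max(euclidean(centroid, cell) for cell in trial_cells)
    return centroid, radius

def is_normal(cell, model):
    # Self-nonself discrimination: inside the hypersphere means a normal riding situation.
    centroid, radius = model
    return euclidean(centroid, cell) <= radius

# One cell per second of the 15-minute trial would be collected; here only three.
trial = [make_cell([9.6, 9.9, 10.2, 9.8]),
         make_cell([9.3, 10.4, 10.0, 9.7]),
         make_cell([9.9, 10.1, 9.5, 10.3])]
model = build_self_model(trial)
print(is_normal(make_cell([9.7, 10.2, 9.4, 9.9]), model))   # True: ordinary vibration
print(is_normal(make_cell([2.0, 35.0, 0.5, 14.0]), model))  # False: far outside the hypersphere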

3.3 Sleepiness and Fatigue Evaluation

In this case a different representation from the crash detection one was chosen because, since there are various KSS levels, each level has its own immune system. All levels are similar systems, apart from the position of their cells and the level they represent. The cells that compose each level, called sleepiness cells, each represent one 10-minute reaction time test, which is composed of various reactions to stimuli, called reaction cells. So, from every 10-minute test, various reaction cells are extracted; these represent the processed data from shortly before the stimulus until a short time after it, and this processed data is the reaction time it took the user to answer the stimulus (the data processing is introduced later on). A sleepiness cell takes all of the reaction cells, which are essentially reaction times for the stimuli in the 10-minute test, and defines itself by the following characteristics: average reaction time and standard deviation. Each level is composed of an agglomerate of sleepiness cells; although the levels are represented separately, they all live in the same R^2 Euclidean space. The agglomerate that represents each level is a representation of that level’s immune system. When building a level, the average distance between every pair of its cells (in other words, the threshold) is computed and used to define the area of each sleepiness cell that composes that KSS level. So, a level is defined by the area of its sleepiness cells and, using self-nonself discrimination, a cell belongs to a certain level if it is within the area of that level. Other techniques to classify unknown cells are discussed further on. In Fig. 4, three levels are illustrated along with two yellow unknown cells, one of which belongs to level 3 while the other has to undergo the techniques described further on to be classified.
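
A minimal sketch of the per-level model as described in this subsection: each KSS level keeps its own set of sleepiness cells (average reaction time, standard deviation), the level's threshold is the average distance between its cells, and an unknown cell belongs to the level if it lies within that threshold of at least one of the level's cells. Cell values are invented for illustration.

import math
from itertools import combinations

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def level_threshold(cells):
    # Average distance between every pair of sleepiness cells of one KSS level.
    pairs = list(combinations(cells, 2))
    if not pairs:           # a level modeled from a single 10-minute test
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

def belongs_to_level(unknown, cells, threshold):
    # Self-nonself test: the unknown cell falls inside the area of some cell of the level.
    return any(distance(unknown, cell) <= threshold for cell in cells)

# Hypothetical KSS level 3, built from three tests: (mean reaction time, std) in seconds.
level3 = [(0.62, 0.08), (0.66, 0.10), (0.64, 0.07)]
thr3 = level_threshold(level3)
print(belongs_to_level((0.65, 0.09), level3, thr3))  # True: close to the level-3 cells
print(belongs_to_level((1.10, 0.30), level3, thr3))  # False: left to the prediction stages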

Fig. 4. Illustration of the sleepiness model; each agglomerate corresponds to a level. Each circle is a cell, and yellow cells are unknown cells without any particular size.

One of the problems associated with the sleepiness evaluation system is the fact that, data extracted from the real world signifies such small bits of data in the system itself, for example if four 10-min reaction time tests are done, a total of about 40 or 50 reactions are extracted but those are represented in a total of only four sleepiness cells, that can even only represent one level. Through some algorithms that predict that classification of a level, this problem can be resolved. Side note, the crash detection system doesn’t suffer from this because it ranks its cells in a binary classification as representing self or non-self. Because the reaction times regarding a certain KSS level can be very specific, the classification of a cell has to be very precise, and can never be predicted as a level that can be dangerously wrong, in the worst case scenario. So, later on, if the classification has to lean on the last stages, the worst case scenario (within the information available) is going to be predicted, meaning a higher level on the scale might be predicted instead of the current level. And so, to classify the level of an unknown cell, it undergoes a maximum of three stages, the first will try to classify it and the last two will try to give an approximation (it is important to state that predicted levels are levels that might not exist, were predicted through approximations, meaning that they might not still be modeled): – The first stage is based on the k-NN algorithm, where the premise of the closest neighbors is changed a little bit, tailoring the needs of this system. Every cell of every level is going to be analyzed, and so, for each level, the

distance between the unknown cell and each cell is going to be calculated, and based on a limit distance defined for each level (called plausible distance, that is 2 times the threshold of a level), a ratio (which corresponds to the division of the distance, between the unknown cell and the current cell analyzed, by the plausible distance of the current level analyzed) is calculated between each cell and the unknown cell. If the ratio is below or equal to 1, regarding each cell of each level, that ratio along with the level (tuple) is added to a list, else it is disregarded. Then, the closest level (with the highest number of cells in the list) is verified and if there are no ties the level is returned, else a random level is picked within the list of closest levels (using the random integer generator from the module random of Python). If there aren’t any close levels (the list is empty), the unknown cell fails this stage. For example: in Fig. 4 one can see that the top unknown cell would fit in this stage and be classified as part of level 3, and the bottom unknown cell would fail this prediction stage. – The second stage will try to predict on what level the unknown cell might be included or what level it could represent, based on how much the reaction time of the unknown cell has increased, in relation to the average reaction time of the current level. It will be executed if the unknown cell presents itself above the last evaluated sleepiness level (current), and if there are any higher levels (modeled in the system) than the one evaluated before. Otherwise it will proceed to the next, and last, stage. The algorithm finds out the growth ratio between the reaction time of the unknown cell and the average reaction time of the current sleepiness level. With that, it verifies if the cell might be enclosed by two levels, or if it is above any known level that is equal or higher to the current. And, according to its growth, it verifies to which level it might be closer to, or what it could classify. – The last stage really focuses on over-estimating the sleepiness level of the user, the growth ratio is again introduced. If the ratio between the unknown input cell and the current average reaction is below 20%, it is considered that the level has maintained the same, else it has probably risen. In case there weren’t predicted any levels before (through any of these stages), this stage returns the current level plus one, else it will compare with the last predicted level (that isn’t modeled). If the ratio (in relation to the reaction time of the last unknown cell, used to predict the previous level) is higher than 20% then the returned level is one level higher than the previous predicted, otherwise it stays the same level (previous predicted).
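
A minimal sketch of the first classification stage described just above: each level's plausible distance is twice its threshold, every cell whose distance ratio to the unknown cell is at most 1 votes for its level, the most-voted level wins, and ties are broken at random. The second and third (growth-ratio) stages are omitted, and the model values are invented; this is an illustration rather than the authors' implementation.

import math
import random
from collections import Counter

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def stage_one(unknown, levels):
    # levels maps a KSS level to (cells, threshold); returns a level, or None on failure.
    votes = []
    for level, (cells, threshold) in levels.items():
        plausible = 2 * threshold                  # limit distance for this level
        for cell in cells:
            ratio = distance(unknown, cell) / plausible
            if ratio <= 1:
                votes.append(level)                # this cell finds the unknown plausible
    if not votes:
        return None                                # fall through to the prediction stages
    counts = Counter(votes)
    best = max(counts.values())
    closest = [lvl for lvl, n in counts.items() if n == best]
    return random.choice(closest)                  # random pick only in case of a tie

# Hypothetical model with two already-built KSS levels: (cells, threshold).
levels = {3: ([(0.62, 0.08), (0.66, 0.10), (0.64, 0.07)], 0.034),
          6: ([(0.95, 0.20), (1.00, 0.22)], 0.054)}
print(stage_one((0.65, 0.09), levels))  # 3
print(stage_one((1.60, 0.50), levels))  # None: too far from every known level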

4 Selected Experiments and Results

The reported experiment had the sole intent of finding which patterns the acceleration values followed after the motorcyclist responded to a stimulus. The protocol consisted of extracting the accelerometer data over an extended period of time; every 30 to 60 s an auditory stimulus was produced, to which the user was told to respond in a comfortable manner (a simple nod).

It is worth adding that the experiment also aimed to maximize the quantity of data received and to remain as close as possible to a real-world scenario. The Savitzky-Golay filter (more specifically, the function signal.savgol_filter from SciPy [18]) was applied to all of the retrieved data, in order to smooth the rather noisy signal so it could be interpreted more easily. Noisy data is inevitable because of the constant vibration of the motorcycle's high-revving engine. The parameters of the filter were obtained empirically, always considering the factors involved, and chosen so that not too much data was filtered out. A real example of such filtering can be observed in Fig. 5.

Fig. 5. The red line is the raw, unfiltered accelerometer data; the blue line is the same data after filtering. The filter used was the SciPy implementation of the Savitzky-Golay filter, with parameters coefficient = 7 and polyorder = 1.
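
For reference, a minimal sketch of how such filtering can be applied with SciPy is given below. It assumes that the "coefficient" of 7 mentioned in Fig. 5 corresponds to the filter's window length, and the accelerometer signal used here is a synthetic placeholder rather than the data from the experiment.

import numpy as np
from scipy.signal import savgol_filter

# Synthetic placeholder for the norm of the accelerometer readings.
raw_acceleration = 9.8 + 0.3 * np.random.randn(1000)

# Smooth the noisy signal; window_length=7 is assumed to be the "coefficient"
# from Fig. 5, and polyorder=1 fits a straight line inside each window.
smoothed = savgol_filter(raw_acceleration, window_length=7, polyorder=1)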

For the real-world scenario, the subject was a motorcyclist riding their own motorcycle along a common route for a minimum of 15 min. Figure 6 illustrates the first 200 s of data recorded during a particular ride. In the figure, the yellow lines mark the exact moments at which an auditory stimulus was produced, the blue line represents the acceleration norm (y-axis) over time (in seconds, x-axis), the red rectangles highlight two arbitrary cases used for analysis later in this report (concerning how a reaction cell is obtained from raw data) and the green lines are the thresholds that the user's nod must surpass to be counted as a reaction to the stimulus. To facilitate the analysis, the left side of Fig. 7 shows a magnification of the regions marked in Fig. 6. A first obvious pattern can be observed: after every auditory stimulus there is a peak in the accelerometer data. Moreover, after a stimulus the accelerometer values appear to follow an approximately normal distribution, as depicted on the right side of the figure. In the time lapse prior to each stimulus the distribution of the values was calculated and, after the stimulus, the threshold was defined (in order to capture most of the values of that normal distribution, about 99.7% of them, it was set to the average ± 3σ of the prior distribution).

Fig. 6. Plotted acceleration values, already filtered, concerning the experiment described in this section. The “C” and “P” at the top stand for coefficient and polyorder, respectively.

Fig. 7. This figure concerns the two auditory stimuli highlighted in Fig. 6, respectively. On the left side the data is plotted: yellow lines mark the exact moment of a stimulus and green lines depict the threshold. On the right side, the distribution of the values before each stimulus was produced is shown for each case.

If the threshold was surpassed (at the top or at the bottom), the system received abnormal data, which was considered to be a nod from the user. From this, a reaction cell was created, to later become part of the conglomerate that constitutes a sleepiness cell. In the construction of a reaction cell, about 4 s of data (2 s before and 2 s after the stimulus) were analyzed: from the first 2 s the distribution of the data was calculated so that, in the last 2 s, the threshold could be defined, and the reaction time was measured from the moment of the stimulus until the threshold was first surpassed.

Every time an auditory stimulus is produced a reaction cell is created, and it is only used if a user response was identified.
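
To make the procedure concrete, the following is a minimal sketch of how a reaction cell's reaction time could be derived from the filtered signal, under the assumptions stated above (2 s of data before and 2 s after the stimulus, threshold at the average ± 3σ of the prior window). The sampling rate and the array layout are assumptions, since they are not specified in the text.

import numpy as np

def extract_reaction_time(acc_norm, stimulus_idx, fs=50.0):
    # acc_norm: 1-D array with the (already filtered) acceleration norm.
    # stimulus_idx: sample index at which the auditory stimulus was produced.
    # fs: sampling frequency in Hz (assumed value, not given in the paper).
    # Returns the reaction time in seconds, or None if no response was identified.
    window = int(2 * fs)  # 2 s before and 2 s after the stimulus
    before = acc_norm[stimulus_idx - window:stimulus_idx]
    after = acc_norm[stimulus_idx:stimulus_idx + window]

    # Threshold covering about 99.7% of the prior distribution: average +/- 3*sigma.
    mu, sigma = before.mean(), before.std()
    upper, lower = mu + 3 * sigma, mu - 3 * sigma

    # First sample after the stimulus that crosses the threshold (the nod).
    crossings = np.where((after > upper) | (after < lower))[0]
    if crossings.size == 0:
        return None  # no identified response: the reaction cell is discarded
    return crossings[0] / fs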

5 Conclusion

Drawing inspiration from various studies on road safety and sleepiness/drowsiness, a smart system was developed to improve motorcyclists' road safety. It is important to note that additional findings and studies are still much needed, namely in the sleepiness field. In our endeavour to build a non-intrusive, easy-to-use system, we adopted an indirect measurement of sleepiness, based on the speed of response to stimuli, instead of any of the proven direct physiological methods. This is a recognized trade-off since, as previously noted in other studies (e.g. [19]), a system that provides more stimulation to a sleepy driver breaks monotony and can significantly reduce subjective sleepiness, with a trend towards fewer incidents. In the proposed solution, Artificial Immune Systems were used to cope with the small amount of data available for both the crash detection and the sleepiness evaluation systems. A different set of machine learning techniques can be applied to build a more accurate and effective classification system once a larger amount of data has been collected. Nonetheless, a good solution was proposed: its initial use requires a fair amount of user interaction but, progressively, the system learns the task of evaluating the user's sleepiness and driving style. In this way, every user gets a system built specifically for them.

5.1 Future Work

Alternative ways to address the remaining challenges of this work include (i) rethinking the design of the casing so that the control unit can be attached to the helmet in an appealing way and (ii) assessing direct physiological sleepiness evaluation methods as possible replacements for the reaction time approach. If this system is introduced to the market, a cloud-oriented implementation can arise and, as it grows, a large amount of data can be gathered, contributing to further studies in both sleepiness and motorcycle crash prevention. Through a cloud-based implementation, this data could not only be used outside the scope of this project, but also to compute new models for each user (based on their personal characteristics and resorting to other machine learning techniques), in order to investigate whether there is a better way of classifying incoming data during a motorcycle ride. Because of the scalability and flexibility of cloud computing, various applications could be developed to share data and further increase the knowledge on this matter; with it comes a great and feasible opportunity to create a new way of detecting crashes and, more importantly, of preventing them.

Acknowledgments. This work was supported by operation Centro-01-0145-FEDER-000019 - C4 - Centro de Competências em Cloud Computing, co-financed by the European Regional Development Fund (ERDF) through the Programa Operacional Regional do Centro (Centro 2020), in the scope of the Sistema de Apoio à Investigação Científica e Tecnológica - Programas Integrados de IC&DT. This work was also funded by FCT/MCTES through the project UIDB/50008/2020.

References

1. Facts and Stats - Drowsy Driving - Stay Alert, Arrive Alive. https://drowsydriving.org/about/facts-and-stats/
2. Drowsy Driving - National Highway Traffic Safety Administration. https://www.nhtsa.gov/risky-driving/drowsy-driving
3. Autoridade Nacional de Segurança Rodoviária - Relatórios de Sinistralidade. http://www.ansr.pt/Estatisticas/RelatoriosDeSinistralidade/Pages/default.aspx
4. CrossHelmet X1 features: HUD, Bluetooth & much more. https://www.crosshelmet.com/features/index.html
5. Argon Transform - Smart Augmented Reality Riding. https://www.argontransform.com/
6. REALRIDER - The Motorcycle App - Safety - Routes. http://www.realrider.com/
7. Lee, B.G., Chung, W.Y.: A smartphone-based driver safety monitoring system using data fusion. Sensors (Basel) 12(12), 17536–17552 (2012). https://doi.org/10.3390/s121217536
8. Foong, R., Ang, K.K., Quek, C., Guan, C., Wai, A.A.P.: An analysis on driver drowsiness based on reaction time and EEG band power. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2015, pp. 7982–7985 (2015). https://doi.org/10.1109/EMBC.2015.7320244
9. Sahayadhas, A., Sundaraj, K., Murugappan, M.: Detecting driver drowsiness based on sensors: a review. Sensors (Basel) 12(12), 16937–16953 (2012). https://doi.org/10.3390/s121216937
10. Sun, Y., Yu, X.B.: An innovative nonintrusive driver assistance system for vital signal monitoring. IEEE J. Biomed. Health Inform. 18(6), 1932–1939 (2014). https://doi.org/10.1109/JBHI.2014.2305403
11. Penzel, T., Fietze, I., Schöbel, C., Veauthier, C.: Technology to detect driver sleepiness. Sleep Med. Clin. 14(4), 463–468 (2019). https://doi.org/10.1016/j.jsmc.2019.08.004
12. Mahajan, K., Velaga, N.R.: Effects of partial sleep deprivation on braking response of drivers in hazard scenarios. Accid. Anal. Prev. 142, 105545 (2020). https://doi.org/10.1016/j.aap.2020.105545
13. Choudhary, A.K., Kishanrao, S.S., Dadarao Dhanvijay, A.K., Alam, T.: Sleep restriction may lead to disruption in physiological attention and reaction time. Sleep Sci. 9(3), 207–211 (2016). https://doi.org/10.1016/j.slsci.2016.09.001
14. Lisper, H.O., Kjellberg, A.: Effects of 24-hour sleep deprivation on rate of decrement in a 10-minute auditory reaction time task. J. Exp. Psychol. 96(2), 287–290 (1972). https://doi.org/10.1037/h0033615
15. Sahayadhas, A., Sundaraj, K., Murugappan, M.: Drowsiness detection during different times of day using multiple features. Australas. Phys. Eng. Sci. Med. 36, 243–250 (2013). https://doi.org/10.1007/s13246-013-0200-6
16. Fernandes, D.A., Freire, M.M., Fazendeiro, P.A., Inácio, P.R.: Applications of artificial immune systems to computer security: a survey. J. Inf. Secur. Appl. 35, 138–159 (2017). http://hdl.handle.net/10400.6/8203, https://doi.org/10.1016/j.jisa.2017.06.007
17. Adafruit LIS3DH Triple-Axis Accelerometer Breakout. https://learn.adafruit.com/adafruit-lis3dh-triple-axis-accelerometer-breakout/overview
18. SciPy - SciPy v1.5.2 Reference Guide. https://docs.scipy.org/doc/scipy/reference/index.html
19. Baulk, S.D., Reyner, L.A., Horne, J.A.: Driver sleepiness - evaluation of reaction time measurement as a secondary task. Sleep 24(6), 695–698 (2001)

Author Index

A Abadi, Hossein Karkeh, 754 Abdelhameed, Wael A., 1115 Abeywardena, Kavinga, 793 Abubahia, Ahmed, 524 Agrawal, Ayush Manish, 678 Agwu, Nwojo, 1056 Alelyani, Salem, 1151 Alihodzic, Adis, 341 Aliyev, Elchin, 13 AlKhawaldeh, Fatima T., 661 Al-Mubaid, Hisham, 261 Alonso, Luis, 940 Alphinas, R. A., 463 Al-Saadi, Bushra, 357 Al-Saadi, Muna, 357 Alsaif, Hasan, 80 Alsaqer, Mohammed Saleh, 1151 Anagnostopoulos, Christos, 619 Anzilli, Luca, 210 Apaza-Alanoca, Honorio, 492 Araújo, Luiz, 1169 Arciniega-Rocha, Ricardo P., 1073 Ayala, Orlando, 1222 B Bader-El-Den, Mohamed, 524 Baimukhamedov, Malik, 39 Bandara, Eranga, 891 Bankher, Ahmed, 1085 Barros, Vladimir, 1169 Bel Hadj Youssef, Soumaya, 825 Beling, Peter A., 912 Beloff, Natalia, 858

Bhattacharya, Rituparna, 858 Black, Paul E., 881 Blazsik, Zoltan, 976 Bleile, MaryLena, 196 Boranbayev, Askar, 39 Boranbayev, Seilkhan, 39 Borbély, Tamás, 453 Boudriga, Noureddine, 825 Bunyard, Sara, 115 Burnaev, E., 591 C Camargo, Darcy, 927 Capossele, Angelo, 840 Castaldi, Paolo, 23 Chen, Haonan, 232 Chepkov, Roman, 97 Cho, Michael Cheng Yi, 65 Contreras, M. Leonor, 371 Crocket, Keeley, 49 Cruz, John Emmanuel B., 1124 Cui, Yuanjun, 743 Cunjalo, Fikret, 341 D D’Souza, Daryl, 174 da Silva, Isabela Ruiz Roque, 869 Das Gupta, Dipannoy, 295 Datta, Siddhartha, 637 De Zoysa, Kasun, 891 DeLise, Timothy, 139 Do, Thuat, 805 Duan, Paul, 743 Dzwonkowski, Adam, 284


E Ebrahimi, Mohammad Sadegh, 754 Elek, Istvan, 976 Elkatsha, Markus, 940 El-Sharkawy, Mohamed, 781 Erdem, Atakan, 273 Eryong, Li, 1141 F Fakieh, Bahjat, 1085 Farrell, Steven, 473 Farsoni, Saverio, 23 Fazendeiro, Paulo, 1236 Felemban, Emad, 1106 Flores, Anibal, 492 Fossi, Jose, 284 Fotouhi, Farshad, 408 Fouda Ndjodo, Marcel, 427 Fountas, Panagiotis, 619 Foytik, Peter, 891 Fry, John, 49 G Garcés-Báez, Alfonso, 952 Gerber, Luciano, 49 Gibson, Ryan M., 1032 Gil, Santiago, 1010 Giove, Silvio, 210 Gregorics, Tibor, 453 Grignard, Arnaud, 940 Griguta, Vlad-Marius, 49 Grooss, O. F., 463 H Haig, Ella, 524 Haiko, Svitlana, 97 Han, Yi, 743 Hasan, Mahady, 581 Hasanspahic, Damir, 341 He, Jing, 232 Heger, Tamas, 976 Henson, Cory, 325 Herman, Maya, 607 Holm, C. N., 463 Huang, Cayden, 743 Huang, Hsiu-Chuan, 65 Huang, Kevin, 325 Hussain, Mohammad Rashid, 1151 I Ikechukwu, Onyenwe, 1056 Ingabire, Winfred, 1032 Iskandarani, Mahmoud Zaki, 1203 Islam, Md. Saiful, 581

Islam, Tanhim, 581 Ivanovna, Sipovskaya Yana, 249 J Jabrah, Mohammad, 1085 Jackson, Karen Moran, 725 Jara-Figueroa, Cristian Ignacio, 940 Jayasinghe, D. P. P., 793 Jiang, Hui, 398 Jiang, Shenfei, 743 K Kayid, Amr, 678 Kazakov, Dimitar, 661 Kelefouras, Vasilios, 357 Khan, Asiya, 357 Kim, Ji Eun, 325 Kodati, Meenakshi, 129 Kolomvatsos, Kostas, 507, 619 Kurth, Thorsten, 473 Kuśmierz, Bartosz, 840 L Laakso, Mikko-Jussi, 174 Larijani, Hadi, 1032 Larson, Kent, 940 Lbath, Ahmed, 1106 Lenger, Daniel, 976 Levi, Ofer, 607 Lghoul, Rachid, 1044 Li, Xiaofeng, 648 Li, ZhiQiang, 764 Liang, Xueping, 891 Licea Torres, Luis David, 261 Lin, Siyu, 912 Lin, Wan-Yi, 325 Liyanage, Romesh, 793 Llanes-Cedeño, Edilberto A., 1073 López-López, Aurelio, 952 López-Villada, Jesús, 1073 Lu, Minggui, 743 Luna, Sanzida Mojib, 295 M Macabebe, Erees Queen B., 1124 Maguire, Phil, 152 Majid, Abdur Rahman Muhammad Abdul, 1106 Manna, Sukanya, 115 Masmoudi, Ilias, 1044 Matovu, Richard, 308 Mazyavkina, N., 591 Meng, Shawn, 743 Miao, YaBo, 398 Miller, Robert, 152 Montesclaros, Ray Mart M., 1124

Mousavi Mojab, Seyed Ziae, 408 Moustafa, S., 591 Mudrak, George, 1190 Müller, Sebastian, 840, 927 N Nadutenko, Maksym, 97 Ndognkon Manga, Maxwell, 427 Ng, Wee Keong, 891 Nurbekov, Askar, 39 Nwokeji, Joshua C., 308 O Obianuju, Nzurumike L., 1056 Oliveira, Eduardo, 1169 Omar, Nizam, 869 Osadchyy, Volodymyr, 987 Othman, Salem, 80, 284 P Pagoulatou, Tita, 562 Palmer, Xavier-Lewis, 1222 Pan, Yue, 743 Papa, Rosemary, 725 Parocha, Raymark C., 1124 Pearson, H. B. D. R., 793 Penzkofer, Andreas, 927 Pietroń, Marcin, 712 Pintér, Balázs, 453 Pirouz, Matin, 1 Popova, Maryna, 97 Potter, Lucas, 1222 Powell, Ernestine, 1222 Prata, Paula, 1236 Prykhodniuk, Vitalii, 97 Q Qahmash, Ayman, 1151 Qu, ShaoJie, 764 R Ranasinghe, Nalin, 891 Rehman, Faizan Ur, 1106 Roesch, Phil, 80 Romanyuk, Kirill, 221 Rosado, Bryan, 881 Rosero-Garcia, Jhonatan F., 1073 Rozas, Roberto, 371 Rzayev, Ramin, 13 S Saa, Olivia, 927 Salakoski, Tapio, 174 Sales, David, 1236 Salmanov, Fuad, 13

Sarwar, Hasan, 295 Sawarkar, Kunal, 129 Schmeelk, Suzanna, 881 Scola, Leila, 378 Semwal, Sudhanshu Kumar, 1190 Shah, Prasham, 781 Shams, Seyedmohammad, 408 Shetty, Sachin, 891 Sierra, Rodolfo García, 1010 Sikka, Harshvardhan, 678, 967 Sikka, Sidhdharth, 967 Simani, Silvio, 23 Sindely, Daniel, 976 Singh, Sahib, 678 Skuratovskii, Ruslan V., 987 Slater-Petty, Helen, 49 Smajlovic, Haris, 341 Soltanian-Zadeh, Hamid, 408 Souza, Caio, 694 Sriyaratna, Disni, 793 Stamoulis, George, 507 Stryzhak, Oleksandr, 97 Su, Yuan-Hsiang, 65 Suhi, Nusrat Jahan, 295 Sun, Shu, 648 T Tam, Eugene, 743 Tasnim, Marzouka, 295 Tendle, Atharva, 678 Tito-Chura, Hugo, 492 Tran, Tuan A., 325 Trofimov, I., 591 Tsai, Yu-Lung, 65 Tsanakas, Stylianos, 562 Tserpes, Konstantinos, 562 U Uvindu Sanjana, K. T., 793 V Varvarigou, Theodora, 562 Veerasamy, Ashok Kumar, 174 Velhor, Luiz, 694 Villarroel, Ignacio, 371 Violos, John, 562 W Walker, David J., 357 Wang, Chuyi, 232 Wang, Yunsong, 473 White, Martin, 858 Wiatr, Kazimierz, 712 Wild, Árpád János, 453 Williams, Samuel, 473 Wu, Shu-Fei, 166

X Xie, Jacke, 743 Xu, FangYao, 764 Y Yang, Charlene, 473 Yu, Jinsong, 743 Yuan, Tommy, 661 Yue, JiaQi, 764

Yukun, Li, 1141 Yurrita, Mireia, 940 Z Zagagy, Ben, 607 Zapata-Madrigal, Germán D., 1010 Zhang, Yan, 940 Zhang, Ziqi, 543 Zioviris, Georgios, 507 Żurek, Dominik, 712