Data Management, Analytics and Innovation: Proceedings of ICDMAI 2021, Volume 2 (Lecture Notes on Data Engineering and Communications Technologies, 71) 9811629366, 9789811629365

This book presents the latest findings in the areas of data management and smart computing, machine learning, big data management …


English Pages 548 [530] Year 2021


Table of contents:
Preface
Contents
About the Editors
Track I
Simulation of Lotka–Volterra Equations Using Differentiable Programming in Julia
1 Introduction
2 Dataset
3 Approach Overview
4 Results
5 Conclusion
References
Feature Selection Strategy for Multi-residents Behavior Analysis in Smart Home Environment
1 Introduction
2 Related Work
3 Methods
3.1 ARAS Datasets
3.2 Performance Measurements
3.3 Machine Learning Algorithms
3.4 Feature Selection Methods
4 Experimental Results and Discussions
4.1 Experimental Results
4.2 Experimental Discussions
5 Conclusion
References
A Comparative Study on Self-learning Techniques for Securing Digital Devices
1 Introduction
2 Intrusion Detection System
2.1 Types of IDS Technologies
2.2 Intrusion Detection Techniques
3 Machine Learning in IDS
3.1 Classification
3.2 Regression and Clustering
3.3 Deep Learning
3.4 Genetic Algorithms
3.5 Neural Networks and Fuzzy Logic
4 Datasets
4.1 KDD Cup 99 Dataset
4.2 NSL-KDD Dataset
4.3 Exploit Database (Exploitdb)
4.4 CICIDS2017 Dataset
5 Results and Discussion
6 Future Work
7 Conclusion
References
An Intelligent, Geo-replication, Energy-Efficient BAN Routing Algorithm Under Framework of Machine Learning and Cloud Computing
1 Introduction
2 Proposed Algorithm
3 Result Analysis and Discussion
4 Conclusion
References
New Credibilistic Real Option Model Based on the Pessimism-Optimism Character of a Decision-Maker
1 Introduction
2 Preliminaries
3 The Expected Value and the Weight Component
3.1 The Credibilistic Expected Value with Respect to the mλ-Measure
3.2 The Center-Of-Gravity Expected Value and the Weight Component
4 Numerical Example: Valuing M&A Synergies
5 Conclusions
References
Analysis of Road Accidents in India and Prediction of Accident Severity
1 Introduction
2 Literature Review
3 Methodology
3.1 Decision Tree
4 Result Analysis
4.1 Road Accidents in India (2000 to 2017)
4.2 Black Spots in 2013 and 2014
4.3 Road Accidents in 2017 and 2018
4.4 Maps
5 Prediction Model
5.1 Dataset
5.2 Proposed Model
5.3 Results and Performance
6 Conclusion
References
Mining Opinion Features and Sentiment Analysis with Synonymy Aspects
1 Introduction
2 Related Works
3 Proposed Algorithm
3.1 Feature Generation
3.2 Sentiment Analysis
4 Methodology and Methods
4.1 Data Collection
4.2 Feature Generation
4.3 Model Architecture
5 Result and Discussion
5.1 Aspect Generation
5.2 Sentiment Analysis
6 Conclusion
References
Understanding Employee Attrition Using Machine Learning Techniques
1 Introduction
2 Literature Review
3 Methodology
4 Feature Engineering
5 Result and Discussion
6 Conclusion
References
Track II
Fake News Detection: Experiments and Approaches Beyond Linguistic Features
1 Introduction
2 Related Work
2.1 What Is Fake News?
2.2 Challenges Encountered in Fake News Detection
2.3 Existing Datasets
2.4 Classification Methods
2.5 Evaluation Metrics
3 Description of Datasets Used
4 Experiments and Results
4.1 Regression Model
4.2 Siamese Network with BERT
4.3 Sequence Model
4.4 Enhanced Sequence Model
4.5 Sequence Model on FakeNewsNet
4.6 Convolutional Model for Linguistic and Visual Features
5 Conclusion
References
Object Recognition and Classification for Robotics Using Virtualization and AI Acceleration on Cloud and Edge
1 Introduction
2 Literature Survey
2.1 YOLO
2.2 Cyber Foraging and Cloudlets
2.3 Intel OpenVINO
2.4 Gabriel Cloud and VM
2.5 Raspberry Pi and Camera Module
3 Methodology
3.1 Preparation
3.2 Implementation Considerations
3.3 Stages of Implementation
4 Results
5 Discussion
6 Conclusion
References
Neural Networks Application in Predicting Stock Price of Banking Sector Companies: A Case Study Analysis of ICICI Bank
1 Introduction
1.1 Neural Networks
2 Objective of the Study
3 Review of Literature
3.1 Banking Sector
4 Hypothesis of the Study
5 Scope of the Study
6 Limitations of the Study
7 Research Methodology of the Study
8 Sources and Methods
9 Data Analysis
10 Findings and Suggestions of the Study
11 Conclusions and Suggestions of the Study
References
Epilepsy Seizure Classification Using One-Dimensional Convolutional Neural Networks
1 Introduction
2 Related Work
3 Methodology
3.1 Flow Design
3.2 Dataset
3.3 Evaluation Metrics
3.4 Training Model
3.5 Inception
3.6 ResNet
4 Experimental Results
4.1 Results Obtained
4.2 Results Compared to Machine Learning Techniques
4.3 Results Compared to Deep Learning Techniques
5 Conclusion and Future Scope
References
Syntactic and Semantic Knowledge-Aware Paraphrase Detection for Clinical Data
1 Introduction
2 Related Works
3 Proposed Knowledge-Aware Paraphrase Detection Model
3.1 Output Layer
3.2 Similarity Detection Using Knowledge Base
4 Experimental Set-Up
4.1 Results and Discussions
5 Conclusion
References
Enhanced Behavioral Cloning-Based Self-driving Car Using Transfer Learning
1 Introduction
2 Related Work
3 Proposed Approach
3.1 Network Pruning Using 1×1 Filter
3.2 Transfer Learning
4 Dataset Description and Preprocessing
5 Experimental Results
6 Conclusion
References
Early Detection of Parkinson’s Disease Using Computer Vision
1 Introduction
2 Dataset
3 Literature Review
3.1 Algorithms
3.2 Data Collection
3.3 Methodology
3.4 Analysis
4 Implementation
5 Results
6 Conclusion
References
Sense the Pulse: A Customized NLP-Based Analytical Platform for Large Organization—A Data Maturity Journey at TCS
1 Introduction
2 Literature Review Summary
3 Solution Delivered
3.1 Data
3.2 Technology
3.3 Process
3.4 People
4 Results and Analysis
5 Non-functional Requirements
6 Conclusion
References
Track III
Fact-Finding Knowledge-Aware Search Engine
1 Introduction
2 Related Work
3 Problem Statement
4 System Overview
5 Methodology
5.1 Data
5.2 Data Distiller
5.3 Document Search Engine
5.4 Knowledge Graph
6 Why Search Engine and Graph Need to Be Used Together?
7 Ranking and Evaluation
8 Conclusion
9 Future Work
References
Automated Data Quality Mechanism and Analysis of Meteorological Data Obtained from Wind-Monitoring Stations of India
1 Introduction
2 Related Works
3 Methodology
3.1 Flagging Mechanism
3.2 Automated QC Procedure
4 Primary Checks
4.1 Chronological Sorting and Handling of Duplicate Records
4.2 Handling Missing Data
4.3 Standardization of Data Type
5 Internal Consistency Tests
5.1 Physical Limit Test
5.2 Gradient Test
5.3 Deviation Test
6 Relational Consistency Tests
6.1 Consistency Between Wind Speeds at Different Levels
6.2 Storage and Extraction of QC Data
7 Analysis and Discussion
7.1 Interdependency of Sensors at Different Levels
7.2 Interdependency of Sensors at Same Level
7.3 Effect of Outliers
8 Results
9 Conclusion
References
Efficient and Secure Storage for Renewable Energy Resource Data Using Parquet for Data Analytics
1 Introduction
2 Related Works
3 Methodology
3.1 Different File Storage Formats
3.2 Dataset Description
4 Results and Discussion
5 Conclusion and Future Work
References
Data Processing and Analytics for National Security Intelligence: An Overview
1 Introduction
2 Intelligence in the Context of National Security
3 Intelligence in the Context of External Threats
3.1 ELINT Receivers
3.2 Processing in ELINT Receivers
3.3 Data Processing Techniques
4 Intelligence in the Context of Internal Threats
4.1 Asymmetric Warfare
4.2 Open-Source Intelligence (OSINT)
4.3 Data Processing of OSINT
5 Intelligence Derived from Multiple Sensors
5.1 Data Processing in Multiple Sensor Environment
5.2 Geolocation Accuracy
6 Conclusions
References
Framework of EcomTDMA for Transactional Data Mining Using Frequent Item Set for E-Commerce Application
1 Introduction
2 Literature Survey
3 Results and Findings
3.1 Existing System Structure
3.2 Proposed System Framework
3.3 EcomTDMA for Hash Base Frequent Item Set Mining
4 Conclusion
5 Future Scope
References
Track IV
A Survey on Energy-Efficient Task Offloading and Virtual Machine Migration for Mobile Edge Computation
1 Introduction
2 Literature Survey
2.1 Energy Efficiency
2.2 Energy Efficiency and Latency
2.3 Trade-Off Between Energy Efficiency and Latency
2.4 Virtual Machine Migration
3 Proposed Work
4 Conclusion
References
Quantitative Study on Barriers of Adopting Big Data Analytics for UK and Eire SMEs
1 Introduction
2 Literature Review
3 Proposed Work
3.1 Research Design
3.2 Questionnaire Design
3.3 Population and Sample of the Study
3.4 Administration and Distribution of the Questionnaire
4 Result
4.1 Data Analysis
4.2 Initial Analysis
4.3 Associations Between Demographics and Understanding of Big Data Analytics
4.4 Barriers to Big Data Analytics
4.5 Cronbach’s Alpha
4.6 Framework Refinement
4.7 Limitations
5 Conclusion
Appendix 1: Flow chart of the Number of Responses Included in the Analysis
Appendix 2: Demographics
Appendix 3: Significance Testing
Appendix 4: Cronbach's Alpha Test on the Five Pillars of the Big Data Analytics Strategic Framework for SMEs
References
Post-quantum Cryptography
1 Introduction
2 Literature Review
2.1 Classical Cryptosystems
2.2 Simon's Periodicity Algorithm
2.3 Shor's Algorithm
2.4 The Fallacy of the Classical Cryptosystems and the Advent of a Quantum Computer
2.5 Post-quantum Cryptosystems
2.6 Signature Schemes
2.7 Attacks on Signature Scheme
2.8 Hash Function
2.9 One-Time Signature Scheme
2.10 Merkle Static Tree
2.11 Advantage of Using Hashed-Based Signature Scheme
2.12 Comparison Between Quantum Cryptography and Post-quantum Cryptography: Quantum Key Distribution
3 Conclusion
References
A Comprehensive Study of Security Attack on VANET
1 Introduction
2 VANET Overview
2.1 VANET Architecture
2.2 VANET Model Overview
2.3 VANET Characteristics
2.4 VANET Communication Standards
2.5 Types of VANET Communication
3 Routing Protocols
4 Challenges and Security Attacks
4.1 Classification of Attackers
4.2 Classification of Attacks
5 Overview of Jamming Attack
5.1 Effectiveness of Jamming
6 Related Works
7 Conclusion
References
Developing Business-Business Private Block-Chain Smart Contracts Using Hyper-Ledger Fabric for Security, Privacy and Transparency in Supply Chain
1 Introduction
1.1 Smart-Contracts (SC) for B2B
1.2 Hyper-Ledger Fabric (HF)
1.3 Problem Formulation
1.4 Methodology
2 Implementation
3 Sample of Screen Shots
4 Conclusion
References
Data-Driven Frameworks for System Identification of a Steam Generator
1 Introduction
2 Methodology
2.1 Ranking of Variable
2.2 Multiple Linear Regression
2.3 Multi-layer Perceptron Model
3 Case Study
3.1 Boiler Pilot Plant Configuration
3.2 System Identification
3.3 Model Identification
4 Conclusion
References
Track V
An Efficient Obstacle Detection Scheme for Low-Altitude UAVs Using Google Maps
1 Introduction
1.1 Unmanned Aerial Vehicles
1.2 Obstacle Detection in UAVs
1.3 Limitations of Live Analysis Techniques
1.4 Proposed Approach
1.5 Contribution of this Paper
1.6 Organisation of the Paper
2 Related Work
3 Experiment and Results
3.1 Considerations
3.2 Definitions
3.3 Simulation Environment
3.4 Proposed Methodology
3.5 Set Initial Mission
3.6 Get Map
3.7 Image Preprocessing
3.8 Segmentation
3.9 Modifying Mission
4 Observations
5 Conclusions and Future Scope
References
Estimating Authors’ Research Impact Using PageRank Algorithm
1 Introduction
2 Related Work
3 Methodology
3.1 Dataset
3.2 PageRank Calculation of a Research Paper
3.3 Calculating Authors’ Impact
4 Results and Analysis
4.1 Top-20 Papers Based on Their PageRank Values
4.2 Top-30 Authors Based on PageRank Values
4.3 Top-30 Authors Based on Citation Counts
4.4 Relation Between Authors’ Cumulative Citation Counts and Cumulative PageRank
4.5 Comparing PageRank-Based Author Impact with h-index
5 Conclusions and Further Scope
References
Research Misconduct and Citation Gaming: A Critical Review on Characterization and Recent Trends of Research Manipulation
1 Introduction
2 Characterization: Types of Possible Research Misconduct
2.1 Citation Malpractices
2.2 Detection of Plagiarism
2.3 Figure or Image Manipulation
2.4 Honorary or Ghost Authorship
2.5 Biases in Peer-Review Process
3 Future Scope of Study: How Computational Intelligence, Data Science, and Analytics Can Help?
References
Dynamic Price Prediction of Agricultural Produce for E-Commerce Business Model: A Linear Regression Model
1 Introduction
2 Literature Review
3 Motivation of the Work
4 Materials and Method
4.1 Linear Regression
4.2 Outliers Detection
5 Result and Discussion
6 Conclusion and Future Work
References
Real-Time Facial Recognition Using SURF-FAST
1 Introduction
2 Related Work
3 Methodology
3.1 Pre-Processing
3.2 Training
3.3 Face Identification on Real-Time Video
4 Experiment
5 Conclusion
6 Future Work
References
Microblog Analysis with Machine Learning for Indic Languages: A Quick Survey
1 Introduction
2 Machine Learning Overview
3 Microblog Analysis Stages
3.1 Collection of Tweets
3.2 Pre-processing of Tweets
3.3 Language Identification
3.4 Topic Modelling/Classification
3.5 Opinion Mining
4 Conclusion and Future Direction
References
Author Index


Lecture Notes on Data Engineering and Communications Technologies 71

Neha Sharma · Amlan Chakrabarti · Valentina Emilia Balas · Alfred M. Bruckstein, Editors

Data Management, Analytics and Innovation Proceedings of ICDMAI 2021, Volume 2

Lecture Notes on Data Engineering and Communications Technologies Volume 71

Series Editor Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain

The aim of the book series is to present cutting edge engineering approaches to data technologies and communications. It will publish latest advances on the engineering task of building and deploying distributed, scalable and reliable data infrastructures and communication systems. The series will have a prominent applied focus on data technologies and communications with aim to promote the bridging from fundamental research on data science and networking to data engineering and communications that lead to industry products, business knowledge and standardisation. Indexed by SCOPUS, INSPEC, EI Compendex. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/15362

Neha Sharma · Amlan Chakrabarti · Valentina Emilia Balas · Alfred M. Bruckstein Editors

Data Management, Analytics and Innovation Proceedings of ICDMAI 2021, Volume 2

Editors

Neha Sharma, Analytics and Insights, Tata Consultancy Services, Pune, Maharashtra, India
Amlan Chakrabarti, A. K. Choudhury School of Information Technology, Kolkata, West Bengal, India
Valentina Emilia Balas, Aurel Vlaicu University of Arad, Arad, Romania
Alfred M. Bruckstein, Faculty of Computer Science, Technion—Israel Institute of Technology, Haifa, Israel

ISSN 2367-4512 ISSN 2367-4520 (electronic) Lecture Notes on Data Engineering and Communications Technologies ISBN 978-981-16-2936-5 ISBN 978-981-16-2937-2 (eBook) https://doi.org/10.1007/978-981-16-2937-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

These two volumes constitute the proceedings of the International Conference on Data Management, Analytics and Innovation (ICDMAI 2021), held from 15 to 17 January 2021 on a virtual platform due to the pandemic. ICDMAI is a signature conference of the Society for Data Science (S4DS), which is a not-for-profit professional association established to create a collaborative platform for bringing together technical experts across industry, academia, government laboratories and professional bodies to promote innovation around data science. ICDMAI is committed to creating a forum which brings data science enthusiasts on the same page and envisions its role towards its enhancement through collaboration, innovative methodologies and connections throughout the globe.

This year is special, as we have completed 5 years, and it gives us immense satisfaction to put on record that we could successfully create a strong data science ecosystem. In these 5 years, we could bring 50 doyens of data science as keynote speakers, and another set of 50 technical experts contributed towards workshops and tutorials. Besides, we could engage around 200 experts as reviewers and session chairs. Till date, we have received around 2093 papers from 42 countries, out of which 361 papers have been presented and published, which is just 17% of the submitted papers.

Now, coming to the specifics of this year, we witnessed participants from 13 countries, 15 industries and 121 international and Indian universities. A total of 63 papers were selected after a rigorous review process for oral presentation, and Best Paper Awards were given for each track. We tried our best to bring a bouquet of data science through various workshops, tutorials, keynote sessions, plenary talks, panel discussions and paper presentations by the experts at ICDMAI 2021.

The chief guest of the conference was Prof. Ashutosh Sharma, Secretary, DST, Government of India, and the guests of honour were Prof. Anupam Basu, Director, NIT Durgapur, and Mr. Ravinder Pal Singh, CEO, Merkhado RHA and GoKaddal. The keynote speakers were top-level experts like Phillip G. Bradford, Director, Computer Science Program, University of Connecticut, Stamford; Sushmita Mitra, IEEE Fellow and Professor, Machine Intelligence Unit, Indian Statistical Institute, Kolkata; Sandeep Shukla, IEEE Fellow and Professor, Department of CSE, Indian Institute of Technology Kanpur, Uttar Pradesh; Regiane Relva Romano, Special Adviser to the Ministry of Science, Technology and Innovation, Brazil; Yogesh Kulkarni, Principal Architect (CTO Office), Icertis, Pune; Dr. Aloknath De, Corporate Vice President of Samsung Electronics, South Korea, and Chief Technology Officer of Samsung R&D Institute India, Bangalore; Sourabh Mukherjee, Vice President, Data and Artificial Intelligence Group, Accenture; Pallab Dasgupta, Professor, Department of Computer Science and Engineering, IIT Kharagpur; and Alfred M. Bruckstein, Technion—Israel Institute of Technology, Faculty of Computer Science, Israel. The pre-conference was conducted by Dipanjan (DJ) Sarkar, Data Science Lead at Applied Materials; Usha Rengaraju, Polymath and India's first woman Kaggle Grandmaster; Avni Gupta, Senior Data Analyst—IoT, Netradyne; Kranti Athalye, Sr. Manager, University Relations, IBM; Sonali Dey, Business Operations Manager, IBM; Amol Dhondse, Senior Technical Staff Member, IBM; and Vandana Verma Sehgal, Security Solutions Architect, IBM. All the experts took the participants through various perspectives of data and analytics.

The force behind organizing ICDMAI 2021 was the general chair Dr. P. K. Sinha, Vice-Chancellor and Director, IIIT, New Raipur; Prof. Amol Goje, President, S4DS; Prof. Amlan Chakrabarti, Vice President, S4DS; Dr. Neha Sharma, Secretary, S4DS; the Executive Body Members of S4DS—Dr. Inderjit Barara, Dr. Saptarsi Goswami, Mr. Atul Benegiri; and all the superactive volunteers. There was strong support from our technical partner—IBM, knowledge partner—Wizer, academic partners—IIT Guwahati and NIT Durgapur, and publication partner Springer. Through this conference, we could build a strong data science ecosystem. Our special thanks go to Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain (Series Editor, Springer, Lecture Notes on Data Engineering and Communications Technologies) for the opportunity to organize this guest-edited volume. We are grateful to Springer, especially to Mr. Aninda Bose (Senior Publishing Editor, Springer India Pvt. Ltd.), for the excellent collaboration, patience and help during the evolvement of this volume. We are confident that the volumes will provide state-of-the-art information to professors, researchers, practitioners and graduate students in the areas of data management, analytics and innovation, and all will find this collection of papers inspiring and useful.

Neha Sharma, Pune, India
Amlan Chakrabarti, Kolkata, India
Valentina Emilia Balas, Arad, Romania
Alfred M. Bruckstein, Haifa, Israel

Contents

Track I
Simulation of Lotka–Volterra Equations Using Differentiable Programming in Julia (Ankit Roy)
Feature Selection Strategy for Multi-residents Behavior Analysis in Smart Home Environment (John W. Kasubi and D. H. Manjaiah)
A Comparative Study on Self-learning Techniques for Securing Digital Devices (Dev Kumar, Shruti Kumar, and Vidhi Khathuria)
An Intelligent, Geo-replication, Energy-Efficient BAN Routing Algorithm Under Framework of Machine Learning and Cloud Computing (Annwesha Banerjee Majumder, Sourav Majumder, Somsubhra Gupta, and Dharmpal Singh)
New Credibilistic Real Option Model Based on the Pessimism-Optimism Character of a Decision-Maker (Irina Georgescu, Jani Kinnunen, and Mikael Collan)
Analysis of Road Accidents in India and Prediction of Accident Severity (Sajal Jain, Shrivatsa Krishna, Saksham Pruthi, Rachna Jain, and Preeti Nagrath)
Mining Opinion Features and Sentiment Analysis with Synonymy Aspects (Sourya Chatterjee and Saptarsi Goswami)
Understanding Employee Attrition Using Machine Learning Techniques (Agnibho Hom Chowdhury, Sourav Malakar, Dibyendu Bikash Seal, and Saptarsi Goswami)

Track II
Fake News Detection: Experiments and Approaches Beyond Linguistic Features (Shaily Bhatt, Naman Goenka, Sakshi Kalra, and Yashvardhan Sharma)
Object Recognition and Classification for Robotics Using Virtualization and AI Acceleration on Cloud and Edge (Aditi Patil and Nida Sahar Rafee)
Neural Networks Application in Predicting Stock Price of Banking Sector Companies: A Case Study Analysis of ICICI Bank (T. Ananth Narayan)
Epilepsy Seizure Classification Using One-Dimensional Convolutional Neural Networks (Gautam Manocha, Harit Rustagi, Sang Pri Singh, Rachna Jain, and Preeti Nagrath)
Syntactic and Semantic Knowledge-Aware Paraphrase Detection for Clinical Data (Sudeshna Jana, Abir Naskar, Tirthankar Dasgupta, and Lipika Dey)
Enhanced Behavioral Cloning-Based Self-driving Car Using Transfer Learning (Uppala Sumanth, Narinder Singh Punn, Sanjay Kumar Sonbhadra, and Sonali Agarwal)
Early Detection of Parkinson’s Disease Using Computer Vision (Sabina Tandon and Saurav Verma)
Sense the Pulse: A Customized NLP-Based Analytical Platform for Large Organization—A Data Maturity Journey at TCS (Chetan Nain, Ankit Dwivedi, Rishi Gupta, and Preeti Ramdasi)

Track III
Fact-Finding Knowledge-Aware Search Engine (Sonam Sharma)
Automated Data Quality Mechanism and Analysis of Meteorological Data Obtained from Wind-Monitoring Stations of India (Y. Srinath, Krithika Vijayakumar, S. M. Revathy, A. G. Rangaraj, N. Sheelarani, K. Boopathi, and K. Balaraman)
Efficient and Secure Storage for Renewable Energy Resource Data Using Parquet for Data Analytics (A. G. Rangaraj, A. ShobanaDevi, Y. Srinath, K. Boopathi, and K. Balaraman)
Data Processing and Analytics for National Security Intelligence: An Overview (G. S. Mani)
Framework of EcomTDMA for Transactional Data Mining Using Frequent Item Set for E-Commerce Application (Pradeep Ambavane, Sarika Zaware, and Nitin Zaware)

Track IV
A Survey on Energy-Efficient Task Offloading and Virtual Machine Migration for Mobile Edge Computation (Vaishali Joshi and Kishor Patil)
Quantitative Study on Barriers of Adopting Big Data Analytics for UK and Eire SMEs (M. Willetts, A. S. Atkins, and C. Stanier)
Post-quantum Cryptography (Sawan Bhattacharyya and Amlan Chakrabarti)
A Comprehensive Study of Security Attack on VANET (Shubha R. Shetty and D. H. Manjaiah)
Developing Business-Business Private Block-Chain Smart Contracts Using Hyper-Ledger Fabric for Security, Privacy and Transparency in Supply Chain (B. R. Arun Kumar)
Data-Driven Frameworks for System Identification of a Steam Generator (Nivedita Wagh and S. D. Agashe)

Track V
An Efficient Obstacle Detection Scheme for Low-Altitude UAVs Using Google Maps (Nilanjan Sinhababu and Pijush Kanti Dutta Pramanik)
Estimating Authors’ Research Impact Using PageRank Algorithm (Arpan Sardar and Pijush Kanti Dutta Pramanik)
Research Misconduct and Citation Gaming: A Critical Review on Characterization and Recent Trends of Research Manipulation (Joyita Chakraborty, Dinesh K. Pradhan, and Subrata Nandi)
Dynamic Price Prediction of Agricultural Produce for E-Commerce Business Model: A Linear Regression Model (Tumpa Banerjee, Shreyashee Sinha, and Prasenjit Choudhury)
Real-Time Facial Recognition Using SURF-FAST (Showmik Setta, Shreyashee Sinha, Monalisa Mishra, and Prasenjit Choudhury)
Microblog Analysis with Machine Learning for Indic Languages: A Quick Survey (Manob Roy)

Author Index

About the Editors

Neha Sharma is working with Tata Consultancy Services and is also a Founder Secretary of the Society for Data Science. Prior to this, she worked as Director of a premier institute of Pune that runs post-graduation courses like MCA and MBA. She is an alumnus of a premier College of Engineering and Technology, Bhubaneshwar, and completed her PhD from the prestigious Indian Institute of Technology, Dhanbad. She is an ACM Distinguished Speaker, a Senior IEEE member and Secretary of the IEEE Pune Section. She is the recipient of a "Best PhD Thesis Award" and a "Best Paper Presenter at International Conference Award" at the national level. Her areas of interest include Data Mining, Database Design, Analysis and Design, Artificial Intelligence, Big Data, Cloud Computing, Block Chain and Data Science.

Prof. Amlan Chakrabarti is a Full Professor in the School of I.T. at the University of Calcutta. He was a Post-Doctoral fellow at Princeton University, USA, during 2011–2012. He has almost 20 years of experience in engineering education and research. He is the recipient of the prestigious DST BOYSCAST fellowship award in Engg. Science (2011), the JSPS Invitation Research Award (2016), the Erasmus Mundus Leaders Award from the EU (2017) and the Hamied Visiting Professorship from the University of Cambridge (2018). He is an Associate Editor of the Elsevier Journal of Computers and Electrical Engineering and a Guest Editor of a Springer Nature journal in Applied Sciences. He is a Sr. Member of IEEE and ACM, an IEEE Computer Society Distinguished Visitor, a Distinguished Speaker of ACM, Secretary of the IEEE CEDA India Chapter and Vice President of the Data Science Society.

Prof. Valentina Emilia Balas is currently a Full Professor in the Department of Automatics and Applied Software at the Faculty of Engineering, "Aurel Vlaicu" University of Arad, Romania. She is the author of more than 300 research papers. Her research interests include intelligent systems, fuzzy control, soft computing, smart sensors, information fusion, modeling and simulation. She is the Editor-in-Chief of the IJAIP and IJCSysE journals in Inderscience. She is the Director of the Department of International Relations and Head of the Intelligent Systems Research Centre at Aurel Vlaicu University of Arad.


Professor Alfred M. Bruckstein BSc, MSc in EE from the Technion IIT, Haifa, Israel, and PhD in EE, from Stanford University, Stanford, California, USA, is a Technion Ollendorff Professor of Science, in the Computer Science Department there, and is a Visiting Professor at NTU, Singapore, in the SPMS. He has done research on Neural Coding Processes, and Stochastic Point Processes, Estimation Theory, and Scattering Theory, Signal and Image Processing Topics, Computer Vision and Graphics, and Robotics. Over the years he held visiting positions at Bell Laboratories, Murray Hill, NJ, USA, (1987–2001) and TsingHua University, Beijing, China, (2002–2023), and made short time visits to many universities and research centers worldwide. At the Technion, he was the Dean of the Graduate School, and is currently the Head of the Technion Excellence Program.

Track I

Simulation of Lotka–Volterra Equations Using Differentiable Programming in Julia Ankit Roy

Abstract In this paper, we explore the usage of differentiable programming in computer programs. Differentiable programming allows a program to be differentiated, meaning you can set a certain task that you want to be optimized. This capability is powerful, as it gives the author of the program the ability to choose which areas need to be optimized. Differentiable programming also allows for easy parallelism, allowing parallel parts of a program to be run together. We use Julia and its Flux library to simulate the Lotka–Volterra equations, also known as the predator–prey equations, to show the capability of differentiable programming to simulate two equations simultaneously, and we discuss the benefits of our approach.

A. Roy (B) Westfield High School, Chantilly, VA, USA
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_1

1 Introduction

Automatic differentiation is defined as the process in which computer programs take the derivatives of certain equations, usually through the chain rule [1]. Automatic differentiation differs, though, as it involves more than simple chains of operations, working as the bridge between programming and calculus. As a result, a programming technique built directly to handle automatic differentiation is needed. Problems arise with existing popular programming languages: when taking multiple gradients, as is common in neural networks, a program that is itself differentiable is required. As described by Liao, Liu, Wang, and Xiang, the idea of differentiable programming emerges from deep learning, but it can be applied to more than simply training neural networks. By using differentiable programming, one can compute higher-order derivatives of the program accurately and efficiently using automatic differentiation. Differentiable programming means that you can set a certain task that you need to optimize, calculate the gradient with respect to that task, and then fine-tune the task in the direction of the gradient. Differentiability is what enables deep learning. Instead of trying a brute-force method, which can turn into an extremely expensive process given a few hundred parameters, differentiable programming allows us to take a pseudo-walk around parameter space to find a good set of parameters to optimize. As a result, differentiable programming in deep learning means that you can not only easily shift heavily parameterized models into much simpler structures, but also heavily reduce the time and increase the efficiency of a program. Additionally, differentiable programming exists at the intersection between programming and calculus; it is a technique and language built specifically for the optimization of various differential equations.

Existing popular languages for artificial-intelligence models, such as Python, also lack efficient intrinsic parallelism, meaning that a program is not efficient at running two or more parallel tasks at the same time. In differentiable programming, speed is necessary as you differentiate a program, allowing quick and easy tasks to be run through. Though existing Python libraries such as PyTorch or TensorFlow are fast at running various models, such as a CNN or an RNN, they lack the speed to execute networks built up of smaller operations. As a result, languages such as Swift and Julia have become popular for their differentiable programming implementations.

In this paper, we aim to show the efficiency of differentiable programming by running simulations of the Lotka–Volterra equations:

dx/dt = αx − βxy
dy/dt = δxy − γy

We use the Flux library for differentiable programming within Julia in order to simulate differential equations. The Lotka–Volterra equations are defined as two parallel first-order differential equations, and we aim to use differentiable programming to simulate both differential equations simultaneously.

2 Dataset

The dataset is automatically generated, approximating a pendulum or a sinusoidal curve, based on factors such as the percentage of error contribution in the data, the sparsity of data points, and the number of data points. After generating a possible dataset, we then apply differentiable programming techniques to generate simulations of the datasets (Figs. 1 and 2).
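The generator of Fig. 1 is not reproduced in this extraction. The following is only a minimal sketch, under stated assumptions, of how such a dataset could be produced in Julia: a sinusoidal curve is sampled at a chosen number of points, thinned according to a sparsity factor, and perturbed by a relative error percentage. The function name, the time interval and the default values are illustrative, not the author's.

```julia
# Illustrative sketch only; the paper's actual generator (Fig. 1) is not shown here.
function generate_dataset(n_points; error_pct = 0.05, sparsity = 0.8)
    t = collect(range(0.0, 10.0; length = n_points))       # assumed time interval
    y = sin.(t)                                             # sinusoidal "ground truth"
    keep = rand(n_points) .< sparsity                       # randomly thin the samples
    noise = 1 .+ error_pct .* (2 .* rand(sum(keep)) .- 1)   # relative error within ±error_pct
    return t[keep], y[keep] .* noise
end

t_obs, y_obs = generate_dataset(200; error_pct = 0.1, sparsity = 0.7)
```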


Fig. 1 Julia code generating datasets

3 Approach Overview

We aim to show an example of the Lotka–Volterra model, differential equations that describe the relationship between a predator and prey in a biological ecosystem. The first-order differential equations of the model are defined as:

dx/dt = αx − βxy
dy/dt = δxy − γy

where x is the current number of prey, y is the current number of predators, dx/dt and dy/dt represent the change in prey and predators over time, respectively, and α, β, γ, δ are parameters that describe biological interactions between the two species. We first set up these equations in Julia.
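The paper's code listing is not reproduced in this extraction. A minimal sketch of the step just described, written as the in-place ODE function expected by Julia's DifferentialEquations.jl, could look as follows; the function name and the ordering of the parameters are assumptions.

```julia
# Sketch only (not the author's exact listing): the Lotka–Volterra model as an
# in-place ODE function, filling du with the right-hand sides of the two equations.
function lotka_volterra!(du, u, p, t)
    x, y = u                # u[1] = prey, u[2] = predators
    α, β, δ, γ = p          # interaction parameters
    du[1] = α*x - β*x*y     # dx/dt
    du[2] = δ*x*y - γ*y     # dy/dt
end
```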


Fig. 2 Generation of datasets based on different factors

After setting up the equations, we use the ordinary differential equation (ODE) solver in existing libraries in Julia. The ODE solver aims to solve the differential equation:

du/dt = f(u, p, t)


where p represents parameters and t represents a time interval. Setting this up in Julia, instead of simply passing one differential equation to the ODE solver, we pass the entire Lotka–Volterra model, allowing us to work in parallel on both equations simultaneously. Tsit5() represents the algorithm used, the Tsitouras 5/4 Runge–Kutta method.
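A hedged sketch of this step using the DifferentialEquations.jl API (ODEProblem, solve, Tsit5); the initial condition, time span and parameter values are assumptions, since the paper's listing is not reproduced here.

```julia
using DifferentialEquations

u0    = [1.0, 1.0]              # assumed initial prey and predator populations
tspan = (0.0, 10.0)             # assumed simulation interval
p     = [1.5, 1.0, 3.0, 1.0]    # assumed initial guesses for α, β, δ, γ
prob  = ODEProblem(lotka_volterra!, u0, tspan, p)
sol   = solve(prob, Tsit5())    # Tsit5 = Tsitouras 5/4 Runge–Kutta method
```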

With the differential equations set up and represented as an ODE problem, we turn to the Flux library for solving and representing the models. We first set up our parameters α, β, γ, δ, explained above, with Flux.
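Continuing the sketch above, the parameter vector can be registered with Flux so that it is treated as trainable; this mirrors the step described here but is not the author's listing.

```julia
using Flux

# Collect the ODE parameter vector p (α, β, δ, γ) defined above into a trainable set.
params = Flux.params(p)
```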

In order to set up a trainable problem, we create our predict function, represented with the solve() function in the ODE earlier, as well as a defined loss function in Flux using the predict function. We also use the generated datasets to set up data to further train our model on later.
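A sketch of the predict and loss functions described here, in the style of the DiffEqFlux.jl tutorials rather than the author's own listing. The name `target` stands for the generated dataset of Sect. 2 (assumed to be shaped to match the solver output), the grid spacing and iteration count are assumptions, and `remake` simply rebuilds the problem with the current parameter values.

```julia
using DiffEqFlux   # assumed here to supply the sensitivity rules that make solve() differentiable

# Solve the ODE with the current parameters, saving the solution on a fixed grid.
predict() = Array(solve(remake(prob, p = p), Tsit5(), saveat = 0.1))

# Squared-error loss against the generated data (`target`: assumed 2×N array).
loss() = sum(abs2, predict() .- target)

# "Data" here is just a counter for the number of training iterations (assumed 200).
data = Iterators.repeated((), 200)
```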

Finally, we set up the model to train in Flux to generate the Lotka–Volterra graphs. We use the ADAM optimizer and train the model using Flux, passing the loss function, parameters, data, optimizer, and a function to display data.
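Finally, a sketch of the training call described in this paragraph: Flux.train! is passed the loss, the trainable parameters, the iteration data, the ADAM optimizer and a callback that displays progress. The learning rate and the callback body are assumptions.

```julia
opt = ADAM(0.1)                 # assumed learning rate
cb  = () -> @show(loss())       # display the loss as training progresses

Flux.train!(loss, params, data, opt, cb = cb)
```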


Fig. 3 Depiction of the Lotka–Volterra model

4 Results

After training our model in Flux, a graph is auto-generated showing the relationship between the two differential equations. In the Lotka–Volterra model, we expect a semi-inverse relationship between the predator and prey. As stated before, the model shows the relationship between the populations of two species. The equations estimate that a decrease in the number of "predators" (shown as u2) leads to an eventual increase in the number of "prey" (shown as u1). This eventual increase of prey leads to an increase of predators, which leads to the fluctuations and cycles seen in the graph. This graph shows the power of differentiable programming in relation to mathematical modeling: being able to simulate two differential equations simultaneously (Fig. 3).

5 Conclusion

In this paper, we discussed the advantages of differentiable programming for solving and representing differential equations over other traditional programs. We discussed the usage of Julia for differentiable programming due to its capabilities with existing libraries. We showed the advantage through the simulation of the Lotka–Volterra model with Flux, which is optimized to work with two parallel equations. Differentiable programming is a very promising field at the intersection of calculus and computer programming. More work needs to be done in improving differentiable programming; limitations of existing frameworks make it difficult to implement this technique in models of higher complexity. Nevertheless, we hope that the demonstration of the simulations provided key insights into the usefulness of differentiable programming.

Acknowledgements The author of this paper would also like to thank Dr. Himadri Nath Saha for his help in the idea of the paper and the support shown throughout writing the paper.

References

1. Abadi M, Plotkin GD, A Simple Differentiable Programming Language
2. Chen RTQ, Rubanova Y, Bettencourt J, Duvenaud D, Neural Ordinary Differential Equations
3. Wang F, Zheng D, Decker J, Wu X, Essertel GM, Rompf T, Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator
4. Li T-M, Gharbi M, Adams A, Durand F, Ragan-Kelley J, Differentiable Programming for Image Processing and Deep Learning in Halide
5. Innes M, Saba E, Fischer K, Gandhi D, Rudilosso MC, Joy NM, Karmali T, Pal A, Shah V, Fashionable Modelling with Flux
6. Innes M, Flux: Elegant machine learning with Julia
7. Besard T, Foket C, De Sutter B, Effective Extensible Programming: Unleashing Julia on GPUs
8. Hernandez A, Amigo J, Differentiable programming and its applications to dynamical systems

Feature Selection Strategy for Multi-residents Behavior Analysis in Smart Home Environment John W. Kasubi and D. H. Manjaiah

Abstract Feature selection (FS) plays a vital role in reducing the computational complexity of models caused by irrelevant features in the data, with the aim of developing better predictive models. The process involves selecting the significant features to use in machine learning model building, removing redundant features, and creating new features. This study focused on developing a predictive model that performs best for activities of daily living (ADLs) using the Activity Recognition with Ambient Sensing (ARAS) dataset. In this regard, we used feature importance, univariate, and correlation matrix feature selection to prepare the ARAS dataset before modeling the data. The following algorithms were used to assess the accuracy of the selected features: Logistic Regression (LR), SVM and KNN. The results show that SVM outperformed the other algorithms in both House A and House B. SVM performed best on univariate feature selection with 10 features compared to 5 features, with an accuracy of 100% for both House A and House B, while on feature importance selection SVM performed best with 5 features compared to 10 features, with an accuracy of 99% for House A and 100% for House B. Feature selection has improved the prediction accuracy on the ARAS dataset compared to previous results, which achieved an average accuracy of 61.5% for House A and 76.2% for House B.

J. W. Kasubi (B) · D. H. Manjaiah Department of Computer Science, Mangalore University, Karnataka 574199, India D. H. Manjaiah e-mail: [email protected] J. W. Kasubi Local Government Training Institute, Dodoma, Tanzania © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_2


1 Introduction

Feature selection (FS) is a very important practice in model development that greatly affects a model's efficiency. It offers many benefits, such as simplification of the model, better understanding of the data for easy interpretation, improved model accuracy, reduced overfitting, and shorter training time. Selecting the features that contribute most to the performance of the model can be done automatically or manually; in this study we opted for automatic techniques to prepare our data before modeling [1]. FS comprises three families of methods, namely embedded, wrapper-based and filter-based. Filter-based techniques select features based on a performance measure, irrespective of the machine learning algorithm that will be employed later; wrapper techniques select the features to fit to a given dataset based on a specific machine learning algorithm; and the embedded approach incorporates the benefits of both filter and wrapper techniques. In this study, we deployed filter techniques for feature selection because of their advantages over the others: they are user friendly and provide good results. Although wrapper-based methods can provide more accurate results than filter methods, they increase the cost of processing [2].

The smart home plays a vital role in human activity recognition and, as a result, helps diagnose diseases at an early stage. Smart homes not only provide healthcare services; they also utilize IoT technologies to prevent dangerous activities from occurring at home and to control and monitor the energy and water usage taking place at home. We can make the home a better place to live by automating whatever we need to automate; the limitation is where our imagination stops [3, 4]. Human activity recognition (HAR) is used to explore the different activities performed by humans within the smart home in the presence of sensors and, in this regard, plays a significant role in monitoring the daily activities of human life, with benefits for healthcare, security, and electricity and water usage [5].

This study was carried out using the ARAS dataset, which involves 27 different types of activities. The ARAS dataset was collected in a smart home using sensors installed on different household locations and appliances, such as the refrigerator, kitchen, sitting room, bedroom, toilet and laundry, which in the end generated a huge amount of data (5,184,000 instances) [6]. The research purposes of the ARAS dataset are to enhance the quality of life and maintain the comfort of residents; for this, the smart home must be able to capture all behavior changes performed by residents in their daily activities so that hidden knowledge and insights can be extracted using machine learning algorithms. Healthcare experts strongly agree that monitoring for changes in the ADLs is one of the better ways to detect potential health problems before they become uncontrollable [7]. Recognizing the activities performed by smart home residents and their activities of daily living can significantly assist in offering healthcare, security, grid and water usage monitoring, and automation, and is of great importance for the quality of human life. Feature selection plays a vital role in the ARAS dataset: it shapes how feature values influence the activity recognition performance of the models and allows testing the relationship between different activities and the activity recognition accuracy rate.

This paper contributes by performing feature selection on the ARAS dataset, which is used to prepare our data before modeling [8]. Three machine learning algorithms were subsequently used to evaluate the effectiveness of the selected features in the ARAS dataset for predicting future outcomes, and the results demonstrate that higher recognition rates are produced by the proposed techniques. This work is structured as follows: the relevant works are briefly explained in Sect. 2; Sect. 3 presents the resources and methods applied; Sect. 4 presents the experimental outcomes and discussions; and Sect. 5 presents the conclusions and gives suggestions for future study.

2 Related Work This part explains previous related works reviewed in relation to smart home and feature selection techniques, the reviewed articles are as shown below: Alberdi et al. [9] suggested feature selection to be used to detect the multimodal symptoms in smart home, classification models were developed to recognize correct complete modification of scores that predict symptoms, for this matter different algorithms were used to resolve levels difference. The results show that feature selection boosted the model accuracy and not all behavioral patterns contribute equally to a symptom’s prediction. Shangfeng et al. [10] conducted a study on human activity in smart home with evaluation to develop human activity model using extreme learning machine (ELM), using CASAS dataset. The outcome shows that ELM in ADL was improved after conducting feature selection and hidden neural networks lead to distinct recognition accuracy. Labib et al. [11] provided a study on activity recognition in smart home based on basic activities performed by residents in the smart home such that old living people residents may facilitate their own homes such as cooking meals or watching TV independently. Experimental evaluation used Kastern and CASAS (kyoto1 and Kyoto7) dataset to carter for the same. The results show that Kyoto7 dataset obtained accuracy of 77%, while Kasteren and Kyoto1 achieved accuracy of 93% and 97%, respectively; the model accuracy was good after the researcher performing feature selection to the respective datasets. Hameed et al. [12] applied feature selection using DT and RFE to remove irrelevant features in the dataset, for this matter; Enhanced ELR classifier, LR and MLP were used in prediction, as a result ELR classifier outperformed compared to LR and MLP after feature selection. Manoj et al. [13] proposed ACO and ANN for feature selection by using hybrid method, in order to remove unnecessary and redundant features from the dataset. The experimental results outperformed by using hybrid algorithms as compared to the previous results after performing feature selection.


Liu et al. [14] presented the idea to recognize ADL by applying feature selection in smart home using PCC; researcher engaged three machine learning algorithms to test the efficiency of the recommended approach in ADL detection. The experimental results show that the recommended method produces better outcome of the ADL recognition. Tanaka et al. [15] suggested a swarm optimizer to support feature selection model for KLR. The proposed technique reduces unnecessary and redundant features, as the result the experiment demonstrates the increase in the generalization efficiency of the method proposed. Fang et al. [16] presented BP algorithm outperformed on feature selection based on smart home, the study used three machine learning techniques. The results concluded that after performing feature selection on the research dataset, the ADL detection performance of the NN using the BP algorithm is robust than NB and HMM Model. Abudalfa et al. [17], the researcher, evaluated semi-supervised clustering for ADL detection with different ML approaches to carter for the same. The use of feature selection improved the performance, the values increased and the confidence level is decreased. The experimental results show that the presented technique provided remarkable accuracy; the performance was improved significantly when applying more sophisticated feature selection techniques. Pablo et al. [18] presented feature selection for IoT using wrapper-based feature selection method by merging RFE and GBTs that were able to pick the most significant attributes automatically. The experimental shows better results, compared before performing feature selection. Oukrich [19] conducted a research on feature selection for ADL recognition in Smart Home using Cairo and Aruba datasets. Researchers commonly tested several methods, such as VBN, KNN, Hidden Markov, DT, SVM, CRFs. The outcomes show that the accuracy attained in Aruba dataset was 90.05% and 88.49% for Cairo, suggestions were given to use deep learning techniques, as compared to other traditional machine learning algorithms, they are proved to be effective and robust. Minor and Cook [20] compared the performance of three classifiers; regression tree, linear regression and SVM and applied them to CASAS datasets for forecasting the future occurrence of activities. The experimental outcome discovered that regression tree produces better predicts of activity with lower error, faster training time and capacity to handle more complex datasets than SVM and linear regression classifiers after running feature selection in the dataset. In order to enhance activity forecasts with less complexity, researchers proposed adapting other methods of classifiers and combining numerical forecasting techniques in future work. In fact, the selection of features is the first phase in model development, which is used to decrease the model’s complexity by selecting the proper features by computing the value of each one in the dataset in order to provide a good predictive model output [21].


Table 1 List of activities in ARAS dataset

ID  Activity             ID  Activity          ID  Activity
1   Others               10  Having snack      19  Laundry
2   Going out            11  Sleeping          20  Shaving
3   Preparing breakfast  12  Watching TV       21  Brushing teeth
4   Having breakfast     13  Studying          22  Talking on the phone
5   Preparing lunch      14  Having shower     23  Listening to music
6   Having lunch         15  Toileting         24  Cleaning
7   Preparing dinner     16  Napping           25  Having conversation
8   Having dinner        17  Using Internet    26  Having guest
9   Washing dishes       18  Reading book      27  Changing clothes

3 Methods

This part presents the methods used in this work during the development of the predictive model for multi-resident behavior in smart homes using machine learning techniques.

3.1 ARAS Datasets Table 1 provides characteristics of the ARAS dataset which was composed of two residential houses, which involved 27 different types of activities and it was generated in Turkey in 2013. Activities collected from House A and B every day were as follows; having shower, toileting, preparing breakfast, having breakfast, preparing dinner, having dinner, going out, sleeping, having snacks, watching TV, studying, and reading books while on the other side of the activities that were not performed in every day were such as washing dishes, napping, laundry, shaving, talking on the phone, listening to music, having conversation and having guest. In House A and House B, the aggregate number of occurrences of activities is 2177 and 1023, respectively. For each day there were 86,400 data points, that consist of the time stamp, and in this study the prediction on both House A and B show that going out, sleeping, studying, watching TV and having breakfast were frequently performed by the residents [22].

3.2 Performance Measurements

The following four assessment measures were used for the evaluation of the proposed approach:


Accuracy = (TP + TN) / (TP + TN + FP + FN)   (1)

Precision = TP / (TP + FP)   (2)

Recall = TP / (TP + FN)   (3)

F1-Measure = 2 ∗ (Precision ∗ Recall) / (Precision + Recall)   (4)

whereby TP represents the number of True Positives, TN the number of True Negatives, FP the number of False Positives and FN the number of False Negatives.
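The authors report implementing their experiments in Python; purely as an illustration of Eqs. (1)–(4), the four measures can be computed from the confusion-matrix counts as follows. The sketch is in Julia (the language used elsewhere in this volume) and the example counts are made up.

```julia
# Illustration of Eqs. (1)–(4) only; not the authors' code (their experiments used Python).
accuracy(tp, tn, fp, fn) = (tp + tn) / (tp + tn + fp + fn)
prec(tp, fp)             = tp / (tp + fp)
recall(tp, fn)           = tp / (tp + fn)
f1(p, r)                 = 2 * p * r / (p + r)

tp, tn, fp, fn = 90, 85, 10, 15                       # assumed example counts
p, r = prec(tp, fp), recall(tp, fn)
println((accuracy = accuracy(tp, tn, fp, fn), precision = p, recall = r, f1 = f1(p, r)))
```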

3.3 Machine Learning Algorithms

In this work we employed different machine learning algorithms to see which works best with the ARAS dataset after performing feature selection using the filter method. Three algorithms, namely LR, SVM and KNN, were evaluated to check which model performs best on the ARAS dataset for both House A and House B.

3.4 Feature Selection Methods

The feature selection was performed using the Python programming language to decrease the complexity of the predictive model and to understand the influence of the characteristics on the overall ARAS dataset forecast. The purpose of the FS process is to identify the characteristics that provide the best score and to remove unnecessary features that are likely to increase the complexity of the model, which may lead to poor model performance [23]. In this study, we deployed filter techniques for feature selection due to their advantages over the others, such as being easy to use and giving good results; although the wrapper-based method provides enhanced results compared with filter methods, it increases the cost of processing. In this regard, we used different filtering techniques, namely univariate, feature importance and correlation matrix analysis, to prepare our data before modeling the data [24]. To test the accuracy of the proposed FS techniques on the ARAS dataset, three machine learning algorithms were used. The outcome shows that the suggested technique provides better outcomes (Fig. 1).


Fig. 1 Proposed feature selection approach

4 Experimental Results and Discussions

4.1 Experimental Results

To evaluate the performance of the proposed approaches on the ARAS dataset for both House A and House B, we employed LR, SVM and KNN to develop models and compared the results obtained for the two houses.

4.1.1 Filter Methods

• Univariate Feature Selection: Univariate feature selection explores each feature independently to evaluate the strength of the feature's relationship with the response variable. It is generally easy to run and understand, and makes it simple to calculate the significance of the data. Tables 2 and 3 show the best features selected by the univariate feature selection technique in House A and House B.

• Feature Importance Selection: Feature importance is a feature selection technique that assigns a score to each input characteristic according to how useful it is for predicting the dependent variable. Figures 2 and 3 show the best features selected by the feature importance technique in House A and House B.


Table 2 Best 10 attributes selected by univariate feature selection in House A

Activity | Attribute name (House A) | Feature score
9 | Sensor_ID_11 | 831,382.401054
18 | Sensor_ID_20 | 463,988.373077
19 | Sensor_ID_12 | 315,096.408903
3 | Sensor_ID_05 | 263,208.650265
16 | Sensor_ID_18 | 262,665.258971
11 | Sensor_ID_13 | 243,929.874075
2 | Sensor_ID_04 | 197,382.065389
4 | Sensor_ID_06 | 91,999.530473
14 | Sensor_ID_16 | 82,968.851006
8 | Sensor_ID_10 | 75,974.156652

Table 3 Best 10 attributes selected by univariate feature selection in House B

Activity | Attribute name (House B) | Feature score
19 | Sensor_ID_12 | 1.919454e+06
6 | Sensor_ID_8 | 9.256252e+05
13 | Sensor_ID_15 | 6.811450e+05
11 | Sensor_ID_13 | 6.332835e+05
3 | Sensor_ID_05 | 5.924828e+05
15 | Sensor_ID_17 | 4.995963e+05
4 | Sensor_ID_06 | 4.919505e+05
12 | Sensor_ID_14 | 4.712039e+05
14 | Sensor_ID_16 | 4.603801e+05
7 | Sensor_ID_09 | 4.388673e+05

• Correlation Matrix Feature Selection: Correlation-based feature selection refers to a method that measures the correlation between any two variables in the dataset. Figures 4 and 5 show the best features selected by the correlation-based feature selection technique in House A and House B.


Fig. 2 Best 10 features selected by feature importance selection in House A

Fig. 3 Best features selected by feature importance selection in House B

4.2 Experimental Discussions

The prediction was performed in order to obtain accuracies and the related metric measures on the ARAS dataset for House A and House B after applying the feature selection techniques. The outcome shows that SVM outperformed the other algorithms for both House A and House B. Univariate feature selection performed best with 10 features (compared to 5 features), reaching an accuracy of 100% for both House A and House B with the SVM, while feature importance selection performed best with 5 features (compared to 10 features), reaching an accuracy of 99% for House A and 100% for House B with the SVM. This implies that the SVM algorithm performed best on the ARAS dataset with univariate feature selection using 10 features and with feature importance selection using 5 features (Tables 4, 5 and 6).


Fig. 4 Best features selected by correlation matrix with heatmap in House A

From Table 7, the results show that after performing feature selection on the ARAS dataset, the performance of the model increased compared to previous studies: the present work achieved an average score of 100% for both House A and House B, whereas the previous results achieved an average accuracy of 61.5% for House A and 76.2% for House B.

Fig. 5 Best features selected by correlation matrix with heatmap in House B

5 Conclusion

The prediction results for feature selection in a smart home environment using the ARAS dataset from House A and House B showed that the SVM algorithm outperformed the other algorithms, Logistic Regression (LR) and KNN. SVM performed best on univariate feature selection with 10 features compared to 5 features, with an accuracy of 100% for both House A and House B, while on feature importance selection SVM performed best with 5 features compared to 10 features, with an accuracy of 99% for House A and 100% for House B. Feature selection improved the prediction accuracy on the ARAS dataset compared to the previous results, which achieved an average accuracy of 61.5% for House A and 76.2% for House B. For future work, we suggest that different algorithms and feature selection methods, such as wrapper-based and embedded methods, be applied to the ARAS dataset for comparison and further improvement of accuracy.

Table 4 Prediction results for the logistic regression (LR) model for House A and B (accuracy, precision, recall and F1-score for each house, reported for univariate feature selection, importance feature selection and correlation-based feature selection with 10 and 5 features)


Table 5 Prediction results for the SVM model for House A and B (accuracy, precision, recall and F1-score for each house, reported for univariate feature selection, importance feature selection and correlation-based feature selection with 10 and 5 features). Bold refers to the outperformance of the SVM algorithm in feature selection compared to the other algorithms


Table 6 Prediction results for the KNN model for House A and B (accuracy, precision, recall and F1-score for each house, reported for univariate feature selection, importance feature selection and correlation-based feature selection with 10 and 5 features)


Table 7 Comparisons of prediction results on the ARAS dataset with previous research work

Research study | Accuracy, average score, House A (%) | Accuracy, average score, House B (%)
Current study | 100 | 100
Previous study | 61.5 | 76.2

References 1. Brownlee J (2016) Machine learning mastery with Python: understand your data, create accurate models, and work projects end-to-end. Machine Learning Mastery 2. Raschka S, Mirjalili V (2017) Python machine learning. Packt Publishing Ltd 3. Kwon M-C, Choi S (2018) Recognition of daily human activity using an artificial neural network and smartwatch. Wirel Commun Mobile Comput 2018 4. Oukrich N, Maach A et al (2019) Human daily activity recognition using neural networks and ontology-based activity representation. In: Proceedings of the Mediterranean symposium on smart city applications. Springer, pp 622–633 5. Wang J, Chen Y, Hao S, Peng X, Lisha Hu (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recogn Lett 119:3–11 6. Igwe OM, Wang Y, Giakos GC (2018) “Activity learning and recognition using margin setting algorithm in smart homes. In: 2018 9th IEEE annual ubiquitous computing, electronics & mobile communication conference (UEMCON). IEEE 7. Fang H, Srinivasan R, Cook DJ (2012) Feature selections for human activity recognition in smart home environments. Int J Innov Comput Inf Control 8:3525–3535 8. Alemdar H, Ersoy C (2017) Multi-resident activity tracking and recognition in smart environments. J Ambient Intell Humaniz Comput 8(4):513–529 9. Alberdi A et al (2018) Smart home-based prediction of multidomain symptoms related to Alzheimer’s disease. IEEE J Biomed Health Inf 22(6):1720–1731 10. Chen S, Fang H, Liu Z (2020) Human activity recognition based on extreme learning machine in smart home. J Phys Conf Ser 1437(1) 11. Fahad LG, Tahir FT (2020) Activity recognition in a smart home using local feature weighting and variants of nearest-neighbors classifiers. J Ambient Intell Humanized Comput, 1–10 12. Hameed J et al (2020) Enhanced classification with logistic regression for short term price and load forecasting in smart homes. In: 2020 3rd international conference on computing, mathematics and engineering technologies (iCoMET). IEEE 13. Manoj RJ, Anto Praveena MD, Vijayakumar K (2019) An ACO–ANN based feature selection algorithm for big data. Cluster Comput 22(2):3953–3960 14. Liu Y et al (2020) Daily activity feature selection in smart homes based on Pearson correlation coefficient. Neur Process Lett, 1–17 15. Tanaka K, Kurita T, Kawabe T (2007) Selection of import vectors via binary particle swarm optimization and cross-validation for kernel logistic regression. In: 2007 international joint conference on neural networks. IEEE 16. Fang H et al (2014) Human activity recognition based on feature selection in smart home using back-propagation algorithm. ISA Trans 53(5):1629–1638 17. Abudalfa S, Qusa H (2019) Evaluation of semi-supervised clustering and feature selection for human activity recognition. Int J Comput Digital Syst 8(6) 18. Rodriguez-Mier P, Mucientes M, Bugarín A (2019) Feature selection and evolutionary rule learning for Big Data in smart building energy management. Cogn Comput 11(3):418–433 19. Oukrich N (2019) Daily human activity recognition in smart home based on feature selection, neural network and load signature of appliances. PhD thesis 20. Minor B, Cook DJ (2017) Forecasting occurrences of activities. Pervasive Mobile Comput 38:77–91


21. Zainab A, Refaat SS, Bouhali O (2020) Ensemble-based spam detection in smart home IoT devices time series data using machine learning techniques. Information 11(7):344 22. Alemdar H et al (2013) ARAS human activity datasets in multiple homes with multiple residents. In: 2013 7th international conference on pervasive computing technologies for healthcare and workshops. IEEE 23. Tang S et al (2019) Smart home IoT anomaly detection based on ensemble model learning from heterogeneous data. In: 2019 IEEE international conference on big data (big data). IEEE 24. Mohammadi M et al (2018) Deep learning for IoT big data and streaming analytics: a survey. IEEE Commun Surv Tutor 20(4):2923–2960

A Comparative Study on Self-learning Techniques for Securing Digital Devices

Dev Kumar, Shruti Kumar, and Vidhi Khathuria

Abstract In the present time, the use of technology in our daily activities is imperative, as we employ various technological solutions for activities like banking, communication, business and e-governance. All of this requires large network infrastructures, and maintaining their security is still challenging, as there is no dearth of malicious attacks. Our reliance on these systems and the volume of activity that they facilitate make them more vulnerable to cyber-attacks. Apart from traditional firewalls and anti-malware software, tools like the intrusion detection system (IDS) can be very helpful in increasing the security of our networks and information systems. There are different techniques by which an intrusion can be detected in a network; among them, a number of machine learning solutions can be used to develop a robust IDS, as they are capable of detecting an intrusion in a network efficiently when historical data is available. In this paper, a comparative study has been conducted between such techniques that have been leveraged to build an intelligent IDS. The performance of these models is compared using the accuracy rate, and it is observed that artificial neural networks give the best accuracy rate, i.e. 98%, for network intrusion detection. Most of these experiments were conducted by their respective authors using the KDDCup99 dataset or the improved NSL-KDD dataset, both of which are relatively old. In order to build our version of an intelligent IDS, we aim to leverage deep learning algorithms along with recently developed datasets such as the CICIDS2017 dataset. This comparative study will be helpful to compare and contrast the various techniques available to develop a competent IDS.

D. Kumar (B) · S. Kumar · V. Khathuria
Department of Information Technology, Thadomal Shahani Engineering College, Bandra (W), Mumbai 400050, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_3

1 Introduction

As more and more people get access to the Internet every day, they are introduced to a vast resource of knowledge and information. But with this comes the threat of being a victim of a cyber-attack. Security has become an important and unavoidable


characteristic of the Internet and other networks. Advancements in information technology (IT) have brought a lot of convenience to our lives. Many activities that are vital to us are now made simpler by various IT applications, and many individuals interact with applications pertaining to e-banking, e-commerce, e-government, etc. on a daily basis. These interactions involve a lot of sensitive information and are conducted on a global scale. It is therefore imperative that the security measures taken while planning a network of such scale are of supreme quality.

In this age of technological advancement, there are numerous methods for securing data and devices. These methodologies aim to revamp the security infrastructure of information systems and networks of digital devices. Many applications and services adopted from these methodologies are used in industry to ensure network security. These include, but are not restricted to, firewalls, anti-virus software, network segmentation, data loss prevention (DLP) systems, etc. They can be regarded as part of the primary security layer. Firewalls and anti-malware software are commonly installed on individual devices in a network, but as enterprises grow larger, these networks are scaled up and new devices are connected to them. Firewalls and anti-malware software alone are not enough to protect an entire network from malicious attacks; networks of such scale demand a system of a more complex nature to ensure security of the utmost quality. One such system that best fits this purpose is the intrusion detection system (IDS).

IDS is an application or a system that monitors and analyses traffic across networks and large systems to detect any anomalous or suspicious activity. IDS generally comes after the primary security layer, viz. firewalls, anti-malware software, DLP systems, etc. IDS looks for familiar and recognizable activities that pose a threat to the network; if any such activity or threat is detected, it is immediately brought to notice by sending alerts. In this paper, we first try to understand the common and well-known classifications of IDS. We further go on to see how different self-learning and predictive technologies can be used to develop the desired model for the purpose of intrusion detection in a network.

2 Intrusion Detection System

An intrusion detection system (IDS) scans the network for dubious or suspicious behaviour and notifies the administrator of potential threats. It is a mechanism that monitors network traffic for data breaches or security hazards, if any. While an IDS tracks the network for potentially harmful activities, false alarms are equally possible and unavoidable. As a result, proper configuration of intrusion detection systems is needed to understand and differentiate between regular network traffic and any anomalous or malicious behaviour. Intrusion detection systems often track packets entering the network and check them for possible anomalies, sending warning alerts to the administrator if any deviation is observed in the packets. IDS can be


Fig. 1 Intrusion detection system and its types

classified into a variety of categories, which are discussed as follows. Figure 1 provides a summary of the below-mentioned classification.

2.1 Types of IDS Technologies

IDS technologies are distinguished fundamentally by the entity that they examine and the methods by which their features are achieved. Broadly, they are categorized as follows:

(a) Network-Based
These systems keep an eye on the network traffic for a section of the network. They also examine network and application protocol activity to point out anomalous behaviour. Network intrusion detection systems (NIDS) are established at a suitable point in a network to monitor the traffic of all users in the network. A NIDS tracks the movement on the entire subnet and compares it with previous movements on the subnets in order to detect any previously seen, and thus known, attacks. In case a suspicious activity or an attack is detected, it may be brought to the administrator's notice. For example, a NIDS may be installed on a subnet where firewalls are also set up, to keep watch in case an intruder is trying to penetrate a firewall.

(b) Host-Based
These IDS monitor the host and the events that occur within the host. Host intrusion detection systems (HIDS) operate on a network of separate servers or computers. A HIDS tracks the flow of packets only from the system and warns the user if any malicious or anomalous behaviour is observed. It records the details of files in the


current device and contrasts it with existing records. If the analytical system files have been tampered with, it is brought to the administrator’s notice. For example, HIDS are used in devices that are vital and required to keep their architecture intact for smooth functioning of the system.

2.2 Intrusion Detection Techniques

Intrusion detection techniques mainly follow two detection methodologies, signature-based and anomaly-based. Most systems use a combination of the techniques mentioned below to reduce detection errors.

(a) Signature-Based Detection
A signature is a pattern that has been developed with the knowledge of known threats. These signatures can be mapped onto the ongoing activity in the network; if they appear similar, the ongoing activity can be classified under the same category and thereby labelled as a threat. In signature-based detection, we compare signatures known to be threats against the events being observed in the network.

(b) Anomaly-Based Detection
An anomaly-based intrusion detection system is primarily based on statistical techniques. It works in such a way that it can identify previously unknown anomalous patterns as well. It detects an attack based on its irregular pattern in the network, which enables the detection system to detect and raise an alarm about newer, unknown threats and anomalous activities. To achieve this kind of functioning in an intrusion detection system, it makes sense to leverage self-learning techniques such as deep learning, machine learning and other artificial intelligence algorithms.
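Since the paragraph above frames anomaly detection as a statistical problem, the following toy sketch illustrates the basic idea of profiling normal behaviour and flagging deviations. The packet-rate numbers are invented; this is only an illustration of the principle, not a production IDS component.

import numpy as np

# Baseline profile of "normal" traffic (hypothetical packets-per-second samples)
normal_packet_rates = np.array([120, 115, 130, 125, 118, 122, 127, 121])
mu, sigma = normal_packet_rates.mean(), normal_packet_rates.std()

def is_anomalous(rate, threshold=3.0):
    """Flag an observation whose z-score exceeds the chosen threshold."""
    return abs(rate - mu) / sigma > threshold

print(is_anomalous(124))   # normal-looking traffic -> False
print(is_anomalous(900))   # sudden burst -> True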

3 Machine Learning in IDS

It is imperative for an intrusion detection system to have a robust model capable of identifying various kinds of anomalous activities. To develop such a model, we have to investigate a number of available mathematical and statistical techniques that have been worked upon by various researchers. Today, machine learning is being leveraged in a number of domains ranging from health care to finance. Machine learning is used for solving various analytical and statistical problems that revolve around classification, clustering and self-learning techniques; it therefore seems capable of providing varied and optimal solutions for intrusion detection in our network. We will compare and contrast various types of mathematical and statistical


models that have been developed and used specifically for the task of intrusion detection. In this review, we further look at solutions that have been provided to develop a robust intrusion detection system and evaluate the best practices available to us.

3.1 Classification

Anish Halimaa et al. [1] explore the techniques that can be used to develop a machine learning-based IDS. They emphasize the importance of accuracy as a key factor in the performance of the system and propose an approach with reduced false alarms (false positives) to improve the detection rate. Out of the available machine learning techniques, they apply support vector machine (SVM) and Naive Bayes to the NSL-KDD knowledge discovery dataset, a refined version of the benchmark KDDCup99 dataset. They designed an experiment with three different approaches in order to examine the efficiency of the two algorithms, SVM and Naive Bayes; the metrics are the accuracy rate and the misclassification rate of the model. We discuss this in detail in the subsequent paragraph.

The first approach uses each algorithm directly to build the SVM and Naive Bayes models for intrusion detection. To get the best out of the dataset, the authors then incorporate feature reduction and normalization, which yields the other two approaches. For the second approach, CfsSubsetEval [2] is adopted for feature reduction, a technique that helps extract the most relevant attributes; this gives two updated models, SVM-CfsSubsetEval and Naive Bayes-CfsSubsetEval. For the third and final approach, they apply normalization to the dataset, which results in SVM-normalization and Naive Bayes-normalization. As a result of this experiment, the authors conclude that the SVM-based models significantly outperform those based on Naive Bayes. This holds true even for the models obtained after feature reduction and normalization, and the conclusion is also evident when the data obtained from the experiment are examined.
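As a rough illustration of the kind of comparison described above, the sketch below trains an SVM and a Naive Bayes classifier on normalized records and reports accuracy and misclassification rates. The synthetic 41-feature data is only a stand-in for NSL-KDD, and the Weka CfsSubsetEval step is not reproduced here.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for preprocessed NSL-KDD style connection records
X, y = make_classification(n_samples=5000, n_features=41, n_informative=15,
                           weights=[0.6, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = MinMaxScaler().fit(X_train)          # normalization step
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for name, clf in [("SVM", SVC()), ("Naive Bayes", GaussianNB())]:
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}, misclassification = {1 - acc:.3f}")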

3.2 Regression and Clustering

Regression analysis deals with the relationship between the output variable and the set of input features. Clustering, on the other hand, means forming clusters or groups of data exhibiting similar features, which can further help us classify different data groups. In our setting, these data groups correspond to benign and malicious connections or activities in the network. A similar approach is observed in the paper presented by Dikshant Gupta et al. [3]. The paper deals with two techniques, linear regression and K-means clustering, in


order to develop a model capable of detecting an intrusion in a network. The model was developed and trained on the NSL-KDD dataset. In their experiment, the authors first perform adequate data preprocessing, which includes transforming the nominal features into numerical inputs, a favourable condition for the techniques involved. The dataset is also preprocessed using mean normalization before the algorithms are applied. Linear regression gives an accuracy rate of 80.14% (for a cost variation of alpha = 0.005), which is significant but not satisfying, while K-means clustering displayed a tolerable accuracy rate of 67.5%. These results may be sufficient for experimental purposes, but they may not satisfy industrial requirements. In order to achieve a better accuracy rate, we may look to multi-level hybrid models or other self-learning techniques (some of which are discussed further in this review).
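The sketch below illustrates, under stated assumptions, how the two techniques can be applied to labelled connection records: a linear regression whose output is thresholded into benign/malicious, and a two-cluster K-means whose clusters are mapped to the majority class they contain. The synthetic data and preprocessing are placeholders, not the setup of the cited work.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=4000, n_features=20, random_state=1)
X = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # mean normalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Linear regression: predict a continuous score and threshold it into benign/malicious
reg = LinearRegression().fit(X_train, y_train)
y_pred = (reg.predict(X_test) >= 0.5).astype(int)
print("Linear regression accuracy:", accuracy_score(y_test, y_pred))

# K-means: two clusters, then map each cluster to the majority class it contains
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X_train)
cluster_to_label = {c: int(np.round(y_train[km.labels_ == c].mean())) for c in (0, 1)}
y_km = np.array([cluster_to_label[c] for c in km.predict(X_test)])
print("K-means accuracy:", accuracy_score(y_test, y_km))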

3.3 Deep Learning

Deep learning belongs to a broader set of machine learning techniques primarily consisting of artificial neural networks (ANN), which are inspired by the functions and structure of the neurons present in the brain and the central nervous system [4, 5]. Unlike traditional machine learning, where indicating the significant features is an important prerequisite, feature extraction in deep learning is performed implicitly by the model. ANNs have been used in different types of classification problems in various domains [6]. This is why deep learning seems more beneficial than traditional machine learning solutions: the core structure of deep learning methods enables this step to be done implicitly, as we see in the biological behaviour of a brain.

Shenfield et al. [6] present a technique to detect malicious network traffic using an ANN in an IDS. The dataset utilized in this experiment is procured from the online exploit and vulnerability repository exploitdb [7]. In the experiment, the proposed ANN architecture, a two-hidden-layer multi-layer perceptron (MLP), is capable of performing the classification task that is important for the IDS. The metrics used for this experiment were the average accuracy rate and the area under the receiver operator characteristic (AUROC) curve. After the experiment, the author(s) claimed an average accuracy rate of 98% for their ANN model, with an average AUROC of 0.98; the higher the AUROC value, the better the classifier is at differentiating between the two classes, malicious and benign traffic.

Kim et al. [8] proposed a long short-term memory-recurrent neural network (LSTM-RNN) classifier developed on the KDDCup99 dataset. While an advanced self-learning algorithm is extremely important, the dataset used to train the model has equal significance, if not more. That being said, the proposed model gives a detection rate of 98.88%, a false alarm rate of 10.04%, and an accuracy rate of 96.93%. While these metrics are commendable, the ANN architecture trained on the NSL-KDD dataset, a refined


version of the KDDCup99 dataset, gave an accuracy rate of 98%, which is slightly higher than that of the proposed LSTM-RNN model. It would be interesting to see what the results would be if a different dataset (for instance, the NSL-KDD dataset or a newer dataset) were used to train a classifier based on the proposed model.
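As a hedged sketch of the kind of two-hidden-layer MLP described by Shenfield et al., the code below trains scikit-learn's MLPClassifier on synthetic data and reports accuracy and AUROC; the layer sizes, data and hyper-parameters are illustrative assumptions, not the published architecture.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

X, y = make_classification(n_samples=6000, n_features=30, n_informative=12, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

mlp = MLPClassifier(hidden_layer_sizes=(30, 15),   # two hidden layers (assumed sizes)
                    activation="relu", max_iter=500, random_state=7)
mlp.fit(X_train, y_train)

proba = mlp.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, mlp.predict(X_test)))
print("AUROC:  ", roc_auc_score(y_test, proba))     # area under the ROC curve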

3.4 Genetic Algorithms

A genetic algorithm is a series of steps inspired by the process of natural selection, based on the theory of natural evolution proposed by Charles Darwin. Genetic algorithms are used for optimization and searching in evolutionary computation [9]. Similar to the biological evolutionary process, in which the best genetic information is taken from the pool and transferred to the next generation, the best features of the data are selected for the new generation, resulting in optimization of information and better computational solutions. One of the first mentions of this algorithm can be traced back to 1957, when Fraser [10] recommended that genetic systems be modelled in computers. Like neural networks, it is inspired by a natural biological process, and it aims to solve computational problems by mimicking processes found in nature.

Resende et al. [11] describe an anomaly-based IDS which is adaptive and uses genetic algorithms to select the appropriate attributes for profiling the 'normal' behaviour of a network. Any activity which deviates from this normal behaviour can be classified as anomalous behaviour in the network. As per the author(s), this process of classification is efficient in detecting intrusions in a network; the role of the genetic algorithm in this approach is to help extract the relevant attributes necessary for profiling the normal behaviour of a system. In the experiments conducted on the CICIDS2017 dataset [12], their approach gave an accuracy rate of 92.85% and a false positive rate of 0.69%. In order to boost the effectiveness of the model in a real implementation, they propose evolving the initial population over a large number of generations (greater than 1000) and over a dataset consisting of different combinations of attacks.
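To make the idea concrete, the toy sketch below evolves a binary attribute mask with selection, one-point crossover and mutation so that a simple classifier scores well on the retained attributes. It is a generic illustration of genetic-algorithm-based attribute selection on invented data, not the profiling method of Resende et al.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1500, n_features=25, n_informative=8, random_state=0)

def fitness(mask):
    # Fitness of a mask = cross-validated accuracy of a simple classifier on kept attributes
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1]))            # initial population of masks
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)][-10:]                # keep the fittest half
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, X.shape[1])                  # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05               # mutation
        child = np.where(flip, 1 - child, child)
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("Selected attributes:", np.flatnonzero(best))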

3.5 Neural Networks and Fuzzy Logic

Fuzzy logic originates from fuzzy set theory, according to which reasoning is approximate rather than reliably deduced from classical predicate logic. This enables fuzzy techniques to be used in anomaly and/or intrusion detection, because the features to be extracted and examined for solving this problem can behave as fuzzy variables.


Midzic et al. [13] propose a hybrid structure consisting of neural networks along with an implementation of fuzzy logic. The crux of this architecture is a self-organizing map (SOM) block cascade linked with fuzzy systems, developed using the KDDCup99 dataset. In order to enhance the effectiveness of the model, a corrector block has also been introduced into the architecture. The SOM block is divided into two layers, and the neural networks in those layers are cascade linked. The corrector block consists of a fuzzy system and an automatic corrector, and its primary role is determining the unknown samples forwarded from the SOM blocks. This hybrid solution achieves a total accuracy rate of 94.3%, which is higher than the accuracy rates of previously developed models based on the same dataset. It is also observed that the correction and the inherent fuzzy system improve the classification of the R2L attack compared with all other classes of attacks in the data (Table 1).

4 Datasets

In order to implement each of the different techniques discussed above, datasets play a crucial role in determining the performance of the algorithm and thereby of the system. Procuring and applying appropriate data is an onerous yet imperative task. The dataset should contain instances of the different possible scenarios so that the system can make sense of them and act appropriately in case of any anomaly. All in all, the system should be capable of labelling the traffic based on the features provided in the dataset. Over the years, a number of datasets have been put into practice for the purpose of detecting anomalous behaviour in a network. The datasets taken into consideration for our study are as follows:

4.1 KDD Cup 99 Dataset

This dataset was used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining [14]. It is built on the data captured in the Defense Advanced Research Projects Agency '98 IDS evaluation and consists of tcpdump data recorded over 7 weeks. Each record is described by 41 features (basic, content and traffic features) and is labelled as normal or as an attack; the attacks are further classified as follows:

(a) Denial of Service Attack
In this attack, memory or computing resources are made unavailable to legitimate user(s). This is achieved by consuming resources crucial to the system, such as bandwidth and/or memory. The motive behind this attack is to freeze or avert the access of legitimate user(s) by initiating needless consumption of system resources.


Table 1 Comparative study of various self-learning techniques for network intrusion detection

Year | Author | Technique used | Dataset | Experimental approach | Performance results
2019 | Anish Halimaa et al. [1] | Classification | NSL-KDD knowledge discovery dataset | SVM | Accuracy rate = 97.29%; misclassification rate = 2.705%
 | | | | SVM-CfsSubsetEval | Accuracy rate = 93.95%; misclassification rate = 6.04%
 | | | | SVM-normalization | Accuracy rate = 93.95%; misclassification rate = 2.705%
2016 | Dikshant Gupta et al. [3] | Regression and clustering | NSL-KDD dataset | Linear regression | Accuracy rate = 80.14%
 | | | | K-means clustering | Accuracy rate = 67.5%
2018 | Shenfield et al. [6] | Deep learning | Online exploit and vulnerability repository exploitdb | ANN | Average accuracy rate = 98%; average AUROC = 0.98
2016 | Kim et al. [8] | Deep learning | KDDCup99 dataset | LSTM-RNN | Accuracy rate = 96.93%; detection rate = 98.88%
2018 | Resende et al. [11] | Genetic algorithms | CICIDS2017 dataset | Coupled with various anomaly-based intrusion detection methods | Accuracy rate = 92.85%; false-positive rate = 0.69%
2016 | Midzic et al. [13] | Neural networks and fuzzy logic | KDDCup99 dataset | SOM block cascade linked with fuzzy system (neural network and fuzzy logic) | Accuracy rate = 94.3%

(b) User to Root Attack
The attacker gains root privileges on the system by using techniques like keylogging, buffer overflow attacks, etc. This attack is essentially a shift of the attacker's account type from a general-purpose account to one with access to sensitive information and other root privileges.

(c) Remote to Local Attack
The attacker gains local access to the machine remotely, for example by using password cracking. This access is unauthorized in nature and can be detrimental to the functioning of the system.

(d) Probing Attack
In a probing attack, the attacker gathers information about the target machine, network or information system before initiating the attack. This type of attack falls into the category of passive attacks.

4.2 NSL-KDD Dataset

The previously discussed KDD Cup '99 dataset has been the most commonly used dataset for the detection of anomalous behaviour in a network or a system. For a long time, it was considered the benchmark dataset. However, it contained several redundancies and information that needed to be updated. The University of New Brunswick addressed the problems existing in the KDD Cup 1999 dataset and produced the NSL-KDD dataset [15]. The number of records was reduced by removing redundant information; there are 43 features in each record with no redundancy whatsoever. This enables the classifier to deal with less frequent records and heavily improves its performance. This dataset made evaluations from different research works comparable and consistent, and it thus proved advantageous over the benchmark KDD Cup '99 dataset.

4.3 Exploit Database (Exploitdb)

The exploit database is a project intended for community use by Offensive Security [16], a company that specializes in information security training and related certifications. The exploit database is an archive of public exploits and software that is vulnerable to cybersecurity attacks. This archive can be very helpful for infrastructure security researchers and other cybersecurity enthusiasts willing to experiment with various security measures. The experiment concerned with the development of an IDS using these exploits used the shellcodes available in this repository. A set of benign traffic data was also included in order to make the model familiar with normal behaviour; this included logs, images and other miscellaneous files.


4.4 CICIDS2017 Dataset

The intrusion detection evaluation dataset, popularly known as the CICIDS2017 dataset, is a collection of benign data points as well as the most recent and familiar malicious attacks. The results of the analysis of the network traffic performed with CICFlowMeter [17] are also included in this dataset, along with data whose labelling is based on source and destination IPs and ports as well as timestamps. Details on the various protocols and attacks are also present in the dataset. The data collected is based on the emulated behaviour of 25 users over varied protocols; protocols like HTTP and HTTPS along with FTP and SSH were incorporated. A wide range of attacks that are considered up-to-date and likely to take place is also included in the dataset; these include DoS, DDoS, Botnet, Infiltration, Web Attack, Brute Force FTP, Brute Force SSH and Heartbleed [18].

5 Results and Discussion

In the preceding sections, we discussed a few recent and relevant self-learning techniques, ranging from mathematical and statistical models to systems and algorithms inspired by evolutionary biology. These techniques were investigated for their potential in creating state-of-the-art IDSs. To summarize the comparison between the different techniques, we first select the best experimental approach within each technique wherever more than one experimental approach is available. Out of the three classification approaches analysed by Anish Halimaa et al., it is clear from Fig. 2 that the accuracy rate of SVM is greater than that of the other two models, SVM-CfsSubsetEval and SVM-normalization; hence, we use SVM for further discussion and comparison with the other models considered in this study. From Fig. 3, it is evident that linear regression outperforms K-means clustering by a significant margin, and it is therefore considered for further comparison. After selecting the best of the first two techniques and comparing the findings of the other available techniques, it is observed from Fig. 4 that the ANN used in the experimentation by Shenfield et al. gives the best results for intrusion detection in a network, with an accuracy rate of 98%. This is followed by the SVM classifier proposed by Anish Halimaa et al., with an accuracy rate of 97.29%. The LSTM-RNN deep learning approach proposed by Kim et al. is also highly functional, with an accuracy rate of 96.93%. Out of the top three performing models, two are deep learning solutions; thus, it is safe to say that deep learning solutions give a significantly high accuracy rate for network intrusion detection.


Fig. 2 Performance results of the classification techniques proposed by Anish Halimaa et al

Fig. 3 Performance results of techniques proposed by Dikshant Gupta et al.

6 Future Work

A homogeneous experimental setting, with an updated dataset such as the NSL-KDD dataset (a refined version of the KDD Cup 1999 dataset) applied across all considered techniques, might yield different results. Overcoming the differences between the experiments and establishing similar conditions for all the techniques


Fig. 4 Accuracy rates of various self-learning techniques for network intrusion detection

will give us a clearer picture. With the current analysis in hand, however, neural networks and deep learning techniques seem to be the way to go. With this study, we aimed at gaining insight into the gaps that exist in recent research on applying self-learning techniques for the purpose of developing smart and intelligent systems for intrusion detection in a network, and at understanding the best techniques available. Our future work will involve dealing with vast amounts of data, i.e. big data, for which we might consider using big data processing and analytics engines such as Spark and Hadoop. The current prototypes that have given significant results in our comparative study are expected to be further refined in order to turn them into production-quality systems. Intrusion detection in a network can be enhanced using new data mining solutions such as DBSCAN and deep neural networks. Feature selection techniques can be used to assess the weight of the features used in other IDSs and likely boost their performance by removing noisy features; it is possible to improve time performance as well as accuracy through feature selection. Techniques like information gain, which depends on information entropy and is used to evaluate the significance of a feature, can give us better insight into how to use the available data effectively. The NSL-KDD dataset is outdated and has several shortcomings, which newer, well-derived datasets like the CICIDS2017 dataset attempt to solve. Promising techniques like deep learning will be further evaluated using these relatively new datasets.
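As a small illustration of the information-gain idea mentioned above, the sketch below ranks features by their mutual information with the class label using scikit-learn; the synthetic records merely stand in for a flow-level dataset such as CICIDS2017.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=3000, n_features=15, n_informative=5, random_state=3)
gain = mutual_info_classif(X, y, random_state=3)   # information-gain style scores
ranking = pd.Series(gain, index=[f"feature_{i}" for i in range(X.shape[1])]).sort_values(ascending=False)
print(ranking.head(5))   # the most informative features; the rest could be dropped as noise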


7 Conclusion

We investigated different types of techniques available for building an intelligent intrusion detection mechanism, and as per the comparative study, we can conclude that deep learning algorithms have given the highest accuracy rate. However, the technique or the algorithm used is not the only deciding factor. Even though deep learning algorithms have outperformed the other self-learning techniques, an intelligent system capable of accurate intrusion detection requires not only a very good self-learning algorithm but also a dataset that is accurate enough and provides the model with adequate points to train upon. While there are a number of datasets available for this purpose, the KDDCup99 dataset is still considered a benchmark. But as technology evolves, this dataset tends to lose its relevance, and an improved version, the NSL-KDD dataset, is being used for experimentation; NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD Cup 1999 dataset, which are discussed in [19]. In a recent paper, Iman Sharafaldin et al. [12] introduced a relatively newer dataset, the CICIDS2017 dataset, which can be used as an alternative, as it has recently been developed and contains more relevant data points. Hence, in our intended experimental research project and for any other future endeavours, choosing the appropriate dataset will be crucial, as it may further bolster the performance of the algorithm.

Acknowledgements This study was done in order to investigate and present the available solutions for building a self-configuring learning system for securing digital devices. All the work with respect to this project will be pursued under the guidance of Dr. G. T. Thampi.

References 1. Anish Halimaa A, K. Sundarakantham: Machine Learning Based Intrusion Detection System. In: Proceedings of the Third International Conference on Trends in Electronics and Informatics, pp. 916–920. IEEE Xplore, Tirunelveli, India (2019). 2. CfsSubsetEval, https://weka.sourceforge.io/doc.dev/weka/attributeSelection/CfsSubsetEval. html 3. Gupta D, Singhal S, Malik S, Singh A (2016) Network intrusion detection system using various data mining techniques. In: International conference on research advances in integrated navigation systems (RAINS—2016). R. L. Jalappa Institute of Technology, Doddaballapur, Bangalore, India, pp 1–6 4. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5(4):115–133 5. Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386–408 6. Shenfield A, Day D, Ayesh A (2018) Intelligent intrusion detection systems using artificial neural networks. ICT Express 4(2):95–99 7. Exploit database, https://www.exploit-db.com/shellcodes


8. Kim J, Kim J, Thi Thu HL, Kim HL (2016) Long short term memory recurrent neural network classifier for intrusion detection. In: 2016 international conference on platform technology and service (PlatCon), Jeju, pp 1–5. https://doi.org/10.1109/PlatCon.2016.7456805 9. Holland JH (1962) Outline for a logical theory of adaptive systems. J ACM 9(3):297–314 10. Fraser AS (1957) Simulation of genetic systems by automatic digital computers. I. Introduction. Austr J Biol Sci 10(4):484–491 11. Resende PAA, Drummond AC (2018) Adaptive anomaly-based intrusion detection system using genetic algorithm and profiling. Secur Privacy 1(4):e36. https://doi.org/10.1002/spy2.36 12. Sharafaldin I, Lashkari AH, Ghorbani AA (2018) Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: 4th international conference on information systems security and privacy (ICISSP), Portugal, January 2018 13. Midzic A, Avdagic Z, Omanovic S (2016) Intrusion detection system modeling based on neural networks and fuzzy logic. In: 20th Jubilee IEEE international conference on intelligent engineering systems (INES). IEEE Xplore, Budapest, Hungary, pp 189–194. https://doi.org/ 10.1109/INES.2016.7555118 14. KDD Cup 1999 Data, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html 15. NSL-KDD dataset, https://www.unb.ca/cic/datasets/nsl.html 16. Offensive Security Community Projects, https://www.offensive-security.com/community-pro jects/ 17. CICFlowMeter, https://www.unb.ca/cic/research/applications.html#CICFlowMeter 18. CICIDS2017 dataset, https://www.unb.ca/cic/datasets/ids-2017.html 19. Tavallaee M, Bagheri E, Lu W, Ghorbani AA (2009) A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications, pp 1–6, Ottawa, ON (2009). https://doi.org/10.1109/CISDA.2009.5356528

An Intelligent, Geo-replication, Energy-Efficient BAN Routing Algorithm Under Framework of Machine Learning and Cloud Computing

Annwesha Banerjee Majumder, Sourav Majumder, Somsubhra Gupta, and Dharmpal Singh

Abstract Health care is a basic need for every individual. However, the reality is not always good: there is still a lack of proper medical services and a scarcity of efficient medical practitioners. Critical patients need 24 × 7 monitoring and support. In this paper, we propose a model for remote patient monitoring using an intelligent, geo-replicated and energy-efficient BAN routing mechanism under the framework of machine learning and cloud computing. The proposed model monitors the patient by collecting information from sensors embedded in the body and routes the collected information using an intelligent routing scheme based on Naïve Bayes. The proposed model uses the geo-replication feature of cloud computing to share critical patient data across the globe.

1 Introduction

As per a report by the World Health Organization (WHO), by 2035 there will be a shortage of about 12.9 million healthcare workers [1]. An increasing amount of support is required in the case of diseases like cancer, heart disease and stroke. Computer science and machine learning can contribute to this problem through remote patient monitoring. In our proposed method, we address this area with a model that is energy efficient, intelligent and uses a geo-replicated BAN routing algorithm. For efficient node selection, our model uses the statistical supervised learning model Naive Bayes.

A. B. Majumder (B)
Department of Information Technology, JIS College of Engineering, Kalyani, West Bengal, India
S. Majumder
Wipro Technology Ltd., Kolkata, India
S. Gupta
Amity University, Kolkata, India
D. Singh
Department of Computer Science and Engineering, JIS College of Engineering, Kalyani, West Bengal, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_4


Fig. 1 Wireless body area network [4]

Body Area Network: A body area network, commonly known as a wireless body area network (WBAN), is a wireless network consisting of wearable components such as sensors and actuators [2, 3]. Through this technology, critical patients can be monitored remotely, and alarms can even be sent to hospitals (Fig. 1).

Naïve Bayes: Naïve Bayes is a probabilistic classifier based on Bayes' theorem [5]. It is a statistical model of supervised learning:

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \qquad (1)

where
P(A|B) is the posterior probability,
P(B|A) is the likelihood, i.e. the probability of B given that A is true,
P(A) is the prior probability, and
P(B) is the predictor prior probability.

Geo-replication: Geo-replication is the process by which information is replicated across multiple regions around the world. In this technique, data is ideally created at one place and the same information is then replicated over the cloud across multiple regions. This keeps the same information available in multiple regions so that it can be used as and when required.


Fig. 2 Geo-replication architecture [6]

The geo-replication feature of the cloud service would be used for sharing critical patient information globally, so that opinions from world-renowned doctors can be obtained instantly without worrying about the setup and infrastructure needed to maintain the data at different locations around the world. It will save time, save cost, improve the efficiency of diagnosis and might also help in building a repository to fight various critical diseases [6–8] (Figs. 2 and 3).

Plenty of works have already proved their efficiency in the field of body area networks and remote patient monitoring. Geetha and Ganesan have proposed a BAN routing algorithm that is energy efficient and priority based, using a cooperative communication method; their main focus was reliability and energy efficiency [9]. Majumder and Gupta have proposed another algorithm for body area networks which considers the priority, energy level and hop count of each node for transferring information [10]. A cluster-based BAN routing algorithm has been proposed in [11], which is energy efficient as well; there, the nodes for sending the data are selected by the level of data criticality. Rani Kumari and Parma Nand have proposed a BAN routing algorithm in which they consider the number of hops, energy and quality of the link to choose the optimal path, projecting reliable energy economical adaptation (REEA) [12]. In [13], the authors have proposed another energy-efficient wireless body area network algorithm considering the location and residual energy of each node, with the destination node assumed to be enriched with energy. Samaneh Movassaghi et al. have proposed an algorithm for body area networks which is thermal and power aware [14]. Another routing algorithm is sink-mobility based


Fig. 3 Geo-replication interface [6]

and also energy efficient [15]. Rakhee and Srinivas have proposed an algorithm for body area networks using breadth-first search; the authors also used ant colony optimization in the proposed method [16]. An ANN-based integrated cost measure has been proposed in [17], which is also an energy-efficient algorithm. An algorithm for wireless body area networks based on a dynamic duty cycle has been proposed in [18]. Vahid Ayatollahitafti, Md Asri Ngadi, Johan bin Mohamad Sharif and Mohammed Abdullahi proposed an effective algorithm for multi-hop body area networks, using both link cost and hop count to choose the best next node [19]. In [20], the authors propose NetBAN, a network of body area networks; the proposed model relays packets according to link quality and energy consumption. Jafarizadeh et al. have proposed a method for wireless sensor networks to optimally select the cluster head node based on Naïve Bayes [21]. In [22], the authors propose a least-loaded algorithm using Naïve Bayes that works for circuit-switched networks.

2 Proposed Algorithm

The proposed model is for the remote monitoring of critically ill heart patients using an intelligent, geo-replicated and energy-efficient BAN routing algorithm. The proposed mechanism has three distinct parts; the flow diagram of the proposed model is depicted in Fig. 4.

An Intelligent, Geo-replication, Energy-Efficient … Collect Patient Data from sensor

Finding optimal Path using Naïve Bayes

47 Geo Replication feature of Cloud Service for storing the Information at various locations globally

Node Sleep and Awaking for energy saving

Fig. 4 Flow diagram of our proposed model

energy. In our proposed mechanism, each node will be in sleep mode for an amount of time according to its priority. Nodes with higher priority will be in sleep mode for less time than nodes with lower priority. In our proposed model, monitoring of pulmonary artery pressures has the highest priority.

\text{Sleep Time: } ST(N_i) = P_i \times T \qquad (2)

where P_i is the priority of the ith node and T is a constant.

Part 2: The sink nodes will route the collected information by choosing the most competent node. This is one of the major challenges in BAN routing, as critical data is being routed. In our proposed method, routes are chosen by considering the following four factors:

1. Hop count
2. Congestion level
3. Energy level of the node
4. Priority

By considering these factors, Naïve Bayes supervised learning is used to search for the competent node to transmit data. The following figures depict the model development steps. The model used 80% of the dataset as the training set and the remainder for testing.

Data Set Description: As mentioned above, hop count, congestion level, energy level of the node and priority are considered to choose the competent node. Hop count and congestion level are inversely proportional factors, as high values for hop count and congestion level mean a less competent node. On the other hand, energy level and priority are directly proportional factors, as high values of energy level and priority indicate a more competent node. The following figures describe the analysis of the above-mentioned factors.
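A minimal sketch of this node-selection step is given below: a Gaussian Naive Bayes classifier trained on the four factors with an 80/20 split. The synthetic node records and the labelling rule are assumptions made only to keep the example self-contained; they are not the authors' dataset.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(1)
n = 1000
nodes = pd.DataFrame({
    "hop_count":  rng.integers(1, 6, n),   # lower is better
    "congestion": rng.random(n),           # lower is better
    "energy":     rng.random(n),           # higher is better
    "priority":   rng.integers(1, 4, n),   # higher is better
})
# Hypothetical labelling rule: a node is "competent" (1) when energy and priority
# are high while hop count and congestion are low.
score = nodes["energy"] + nodes["priority"] / 3 - nodes["congestion"] - nodes["hop_count"] / 5
nodes["selected"] = (score > score.median()).astype(int)

X, y = nodes.drop(columns="selected"), nodes["selected"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = GaussianNB().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
print(confusion_matrix(y_test, model.predict(X_test)))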


Fig. 5 Hop count feature

Fig. 6 Congestion level feature

Fig. 7 Energy level feature


Fig. 8 Priority feature

Fig. 9 Distribution of selecting and non-selecting node to send data

Fig. 10 Training the model


Part 3: Critical patients need extreme care. It may happen that a particular medical practitioner cannot take the exact decision in a certain case, so it is always better to have views from other practitioners. The data coming out of the model would be replicated globally at various locations (regions) in the cloud using the geo-replication feature, so that medical practitioners across the globe can refer to the data and provide the guidance deemed fit for critical patients.

3 Result Analysis and Discussion

In the proposed model, the constrained nodes, i.e. the sensors embedded in the body, send the data and also go into sleep mode for a certain time in accordance with their priority, so that energy can be used in an optimal way. A sample execution screenshot is shown in Fig. 11. For the selection of nodes, the proposed mechanism uses the Naïve Bayes machine learning classifier. Through this mechanism, the accuracy achieved is 80%. The confusion matrix of our proposed model is shown in Fig. 12.

Fig. 11 Nodes sleep mode calculation

Fig. 12 Confusion matrix of proposed model


Table 1 Comparative analysis of different proposed methods

Proposed method | Observation
A neural network approach in sensor networks [23] | R value: 0.62 (15 hidden nodes), 0.99 (20 hidden nodes)
KNNR [24] | Greater probability of message delivery; higher average latency
An intelligent opportunistic routing algorithm for wireless sensor networks [25] | Good latency, good throughput, good network lifetime
Genetic-based routing algorithm with priority constraints [26] | Improvement in the selected path with a priority constraint compared to one without; finding a path based on multiple constraints with GA is efficient
Weighted energy-balanced efficient routing algorithm for wireless body area network [27] | The dynamic routing algorithm is sometimes unstable; the improved DRA solves the energy-saving and balance issues
Our proposed method | Energy efficient (nodes go to sleep mode at certain intervals); accuracy in choosing competent nodes: 0.80

4 Conclusion

The proposed model is an energy-efficient BAN routing algorithm in which the sensor nodes go into sleep mode according to their priority. Naïve Bayes has been applied in the search for the competent node. A further interesting and useful feature of the proposed model is geo-replication. In the future, we would like to apply an optimization technique, namely a genetic algorithm, to route finding for data transfer.

References

1. https://www.who.int/mediacentre/news/releases/2013/health-workforce-shortage/en/
2. http://www.ieee802.org/15/pub/TG6.html
3. Hernandez M, Mucchi L (2014) Body area networks using IEEE 802.15.6. https://www.sciencedirect.com/topics/engineering/body-area-network
4. Mukherjee P, Mukherjee A (2019) Sensors for health monitoring. https://www.sciencedirect.com/topics/engineering/body-area-network
5. McCallum A (2019) Graphical models, Lecture 2: Bayesian network representation (PDF). Retrieved 22 Oct 2019. https://people.cs.umass.edu/~mccallum/courses/gm2011/02-bn-rep.pdf
6. https://docs.microsoft.com/en-us/azure/azure-sql/database/active-geo-replication-overview
7. https://searchwindowsserver.techtarget.com/definition/geo-replication
8. https://en.wikipedia.org/wiki/Geo-replication
9. Geetha M, Ganesan R (2020) CEPRAN-cooperative energy efficient and priority based reliable routing protocol with network coding for WBAN. Wireless Pers Commun. https://doi.org/10.1007/s11277-020-07798-x
10. Majumder AB, Gupta S (2018) An energy-efficient congestion avoidance priority-based routing algorithm for body area network. In: Bhattacharyya S, Sen S, Dutta M, Biswas P, Chattopadhyay H (eds) Industry interactive innovations in science, engineering and technology. Lecture notes in networks and systems, vol 11. Springer, Singapore. https://doi.org/10.1007/978-981-10-3953-9_52
11. Gupta S, Majumder AB, Sarkar I, Majumder S (2019) Body area network using cluster based energy efficient routing. IJRAR: Int J Res Anal Rev 6(1):42–45, E-ISSN 2348-1269, P-ISSN 2349-5138. Available at: http://www.ijrar.org/IJRAR19H1009.pdf
12. Kumari R, Nand P (2017) An optimized routing algorithm for BAN by considering hop-count, residual energy and link quality for route discovery. In: 2017 international conference on computing, communication and automation (ICCCA), Greater Noida, pp 664–668. https://doi.org/10.1109/CCAA.2017.8229884
13. Kumari J (2015) An energy efficient routing algorithm for wireless body area network. IJWMT 5(5):56–62. https://doi.org/10.5815/ijwmt.2015.05.06
14. Movassaghi S, Abolhasan M, Lipman J (2012) Energy efficient thermal and power aware (ETPA) routing in body area networks. In: 2012 IEEE 23rd international symposium on personal, indoor and mobile radio communications (PIMRC), Sydney, NSW, pp 1108–1113. https://doi.org/10.1109/PIMRC.2012.6362511
15. Jose DV, Sadashivappa G (2014) A novel energy efficient routing algorithm for wireless sensor networks using sink mobility. Int J Wirel Mobile Netw (IJWMN) 6(6)
16. Rakhee, Srinivas MB (2016) Cluster based energy efficient routing protocol using ANT colony optimization and breadth first search. Procedia Comput Sci 89. https://doi.org/10.1016/j.procs.2016.06.019
17. Sharma R (2017) ANN based framework for energy efficient routing in multi-hop WSNs. Int J Adv Res Comput Sci 8(5)
18. Kim J, Song I, Jang E, Choi S (2012) A dynamic duty cycle MAC algorithm for wireless body area networks. Int J Bio-Sci Bio-Technol 4(2)
19. Ayatollahitafti V, Ngadi MdA, bin Mohamad Sharif J, Abdullahi M (2016) An efficient next hop selection algorithm for multi-hop body area networks. Published online 15 Jan 2016. https://doi.org/10.1371/journal.pone.0146464. PMCID: PMC4714909, PMID: 26771586
20. Manirabona A, Boudjit S, Fourati L (2018) NetBAN, a concept of network of BANs for cooperative communication: energy awareness routing solution. Int J Ad Hoc Ubiquitous Comput 28:120. https://doi.org/10.1504/IJAHUC.2018.092655
21. Jafarizadeh V, Keshavarzi A, Derikvand T (2017) Efficient cluster head selection using Naïve Bayes classifier for wireless sensor networks. Wireless Netw 23:779–785. https://doi.org/10.1007/s11276-015-1169-8
22. Li L, Zhang Y, Chen W, Bose SK, Zukerman M, Shen G (2019) Naïve Bayes classifier-assisted least loaded routing for circuit-switched networks. IEEE Access 7:11854–11867. https://doi.org/10.1109/ACCESS.2019.2892063
23. Turčaník M (2013) Neural network approach to routing in sensor network. Adv Military Technol 8(2). http://aimt.unob.cz/articles/13_02/13_02%20(7).pdf
24. Sharma DK, Aayush, Sharma A, Kumar J (2017) KNNR: K-nearest neighbour classification based routing protocol for opportunistic networks. In: Proceedings of 2017 tenth international conference on contemporary computing (IC3), 10–12 Aug 2017, Noida, India
25. Bangotra DK, Singh Y, Selwal A, Kumar N, Singh PK, Hong W-C (2020) An intelligent opportunistic routing algorithm for wireless sensor networks and its application towards e-healthcare. Sensors 20:3887. https://doi.org/10.3390/s20143887. www.mdpi.com/journal/sensors
26. Baklizi M, Al-wesabi O, Kadhum M, Abdullah N (2017) Genetic-based routing algorithm with priority constraints. Int J Networking Virtual Organ 17:3. https://doi.org/10.1504/IJNVO.2017.10004175
27. Li Z, Xu Z, Mao S, Tong X, Sha X (2016) Weighted energy-balanced efficient routing algorithm for wireless body area network. Int J Distrib Sens Netw. https://doi.org/10.1155/2016/7364910

New Credibilistic Real Option Model Based on the Pessimism-Optimism Character of a Decision-Maker Irina Georgescu, Jani Kinnunen, and Mikael Collan

Abstract Fuzzy real options analysis has gained increasing attention among investment practitioners as well as investment theory-focused academics. The strength of the real option valuation (ROV) models, when compared to the more traditional net present value methods, is that they can account for flexibility, which is often available in long-term real investment opportunities. So called fuzzy pay-off methods (FPOMs) represent the most intuitive and easy-to-apply real options techniques published during the last decade. As part of the methodological FPOM family, the original credibilistic approach to real option valuation was published in 2012. In this paper, the credibilistic approach is extended by using the mλ -measure, which is built on a linear combination of necessity and possibility measures to deal with the problem of a decision-maker or, say, an expert analyst, neither being fully optimistic nor fully pessimistic; instead, the λ ∈ [0,1] parameter will represent the level of optimism of a decision-maker leading to a range of real option values estimated for an investment under analysis. The paper presents the new credibilistic ROV model with R code and compares it to the original credibilistic approach, as well as, to the recent center-of-gravity fuzzy pay-off model (CoG-FPOM) through a numerical example of valuing operational synergies developed during a corporate acquisition process. Finally, some future research opportunities are expressed.

I. Georgescu Bucharest University of Economics, Calea Dorobant, i 15-17, Sector 1, 010552 Bucharest, Romania e-mail: [email protected] J. Kinnunen (B) Åbo Akademi University, Tuomiokirkontori 3, 20500 Turku, Finland M. Collan Lappeenranta-Lahti University of Technology, P.O. Box 20, 53851 Lappeenranta, Finland VATT Institute for Economic Research, Arkadiankatu 7, 00101 Helsinki, Finland M. Collan e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_5


1 Introduction

Fuzzy real option models were introduced in the late twentieth century, and the development started with fuzzy versions (cf. [1, 2]) of the famous Black–Scholes model [3, 4], which were still rather impractical for real investments. Carlsson [5] discusses the successful integration of scenario thinking with real options methodology: managers found the new fuzzy models more intuitive with more valuable outcomes; the models were perceived as more in line with reality and the managerial experience than the traditional net present value (NPV) approaches. The most intuitive and practical fuzzy pay-off methods for real option valuation have been developed roughly in the last ten years, starting from Collan and others in 2009 [6], by transforming the probabilistic framework of Datar and Mathews [7], developed at Boeing Corporation, into a possibilistic environment. Many variants of the pay-off method have been published previously. The credibilistic fuzzy pay-off method, introduced in 2012 [8], was built on possibility and necessity measures, which gives it certain computational advantages over the purely possibilistic approaches. This paper extends the credibilistic pay-off method for real option valuation. Some recent extensions of the credibilistic approach have been modeled to account for interval-valued triangular and trapezoidal fuzzy numbers [9, 10].

The family of fuzzy pay-off models operates under a similar valuation logic and components: the real options value (ROV) is obtained by weighting some fuzzy expected mean value of the positive NPV outcomes, i.e., (cf. [6]):

$$\mathrm{ROV} = \text{Weight} \times \text{Expected value} = \frac{\int_0^{\infty} A(x)\,dx}{\int_{-\infty}^{\infty} A(x)\,dx} \times E(A_+). \qquad (1)$$

Fig. 1 Triangular NPV distribution, scenario inputs, and the real option value

The weight component is the same over the different pay-off model variants, adjusting only to the form of the fuzzy NPV distribution (represented by a triangular fuzzy number A(a, b, c) in Fig. 1), and is obtained by the ratio of the two integrals of Eq. (1): the area of the positive NPV pay-off distribution (cf. positive NPV, A+, in Fig. 1) and the total area of the distribution (sum of the negative NPV, A−, and the positive NPV, A+, in Fig. 1). By definition, this is the key feature of the real options valuation logic, as the pay-off distribution is truncated at NPV = 0, because an owner of a real option has the right to withdraw from the investment if it is out-of-the-money, i.e., NPV < 0. The expected value of the positive side E(A+) varies depending on the approach. It can be, e.g., a possibilistic, credibilistic, or center-of-gravity type expected value. The fuzzy NPV distribution in Fig. 1 is built by an expert/analyst, who evaluates a strategic investment and builds three scenarios, in the case of a triangular fuzzy distribution, with possible cashflows. Triangular and trapezoidal fuzzy distributions are the most typically applied distributions in fuzzy investment modeling. In Fig. 1, b represents the most likely base scenario, which is the cumulative NPV of total cashflows, a represents the downside, i.e., the pessimistic scenario with the smallest possible NPV, and c represents the optimistic scenario with the largest possible NPV of a strategic investment.

The rest of the paper is organized as follows. Section 2 presents the technical preliminaries on credibility, possibility, and necessity needed for the subsequent sections. Section 3.1 deals with the new credibilistic expected value component, and Sect. 3.2 will recall the corresponding center-of-gravity expected values together with the weight component, which is independent of the model variant (credibilistic, possibilistic, or center-of-gravity) and will be the same for all types, assuming only the same triangular underlying net present value distribution estimated for a strategic investment problem under real option analysis. Section 4 will present a numerical example on valuing synergies arising from a merger and acquisition. R code is presented to compute the new credibilistic model. The model outcome will be compared to the original credibilistic model [8] and to the recent center-of-gravity model [11]. Section 5 concludes the paper and identifies some future research opportunities.

2 Preliminaries

Let X be a non-empty universe and P(X) the family of its subsets. The elements of X can be objects, individuals, states, alternatives, etc. An event is a subset A of X, i.e., A ∈ P(X). In this paper, we shall assume that X ⊆ R. A fuzzy variable is an arbitrary function ξ : X → R. Recall that a possibility measure on X is a function Pos : P(X) → [0, 1], such that:

(Pos 1) $\mathrm{Pos}(\emptyset) = 0$; $\mathrm{Pos}(X) = 1$;
(Pos 2) $\mathrm{Pos}\left(\bigcup_{i \in I} A_i\right) = \sup_{i \in I} \mathrm{Pos}(A_i)$, for any family $(A_i)_{i \in I}$ of events.

The function Nec : P(X) → [0, 1], defined by Nec(A) = 1 − Pos(A^c) for any event A, where A^c is the complement (opposite) event, is called the necessity measure associated with Pos. For any λ ∈ [0, 1], consider the function m_λ : P(X) → [0, 1], defined by

$$m_\lambda(A) = \lambda\,\mathrm{Pos}(A) + (1 - \lambda)\,\mathrm{Nec}(A), \quad \text{for all } A \in P(X) \qquad (2)$$


This new measure m_λ, introduced by Yang and Iwamura [12], is a convex combination of the possibility measure Pos and the necessity measure Nec by means of the weight λ. If λ = 1/2, then one obtains the notion of credibility measure in the sense of the monograph of Liu [13]:

$$\mathrm{Cred}(A) = \frac{1}{2}\big(\mathrm{Pos}(A) + \mathrm{Nec}(A)\big), \quad \text{for all } A \in P(X) \qquad (3)$$

A possibility distribution on X is a function μ : X → [0, 1], such that sup_{x∈X} μ(x) = 1; μ is said to be normalized if μ(x) = 1 for some x ∈ X. Let us fix a normalized possibility distribution μ : X → [0, 1]. Then, one can associate to μ a possibility measure Pos and a necessity measure Nec:

$$\mathrm{Pos}(A) = \sup_{x \in A} \mu(x), \quad \text{for any } A \in P(X); \qquad (4)$$

$$\mathrm{Nec}(A) = \inf_{x \in A} \mu(x), \quad \text{for any } A \in P(X). \qquad (5)$$

Thus, for each parameter λ ∈ [0, 1], the measure m_λ defined by Eq. (2) has the form:

$$m_\lambda(A) = \lambda \sup_{x \in A} \mu(x) + (1 - \lambda) \inf_{x \in A} \mu(x), \quad \text{for all } A \in P(X). \qquad (6)$$

We say that the normalized distribution μ is the membership function associated with the fuzzy variable ξ if for any event A we have

$$\mathrm{Pos}(\xi \in A) = \sup_{x \in A} \mu(x). \qquad (7)$$

In this case, the following hold:

$$\mathrm{Nec}(\xi \in A) = \inf_{x \in A} \mu(x); \qquad (8)$$

$$m_\lambda(\xi \in A) = \lambda \sup_{x \in A} \mu(x) + (1 - \lambda) \inf_{x \in A} \mu(x). \qquad (9)$$

3 The Expected Value and the Weight Component

In this section, the main components needed to compute the real option value (ROV) with the pay-off method and the general ROV formula (1) are presented, i.e., the expected values and the respective weights in all possible (four) cases.


Section 3.1 introduces the new credibilistic expected values, and Sect. 3.2 recalls the center-of-gravity expected values together with the weight components, which are the same for both the new credibilistic model and the center-of-gravity model, which will then be put in comparison in Sect. 4.

3.1 The Credibilistic Expected Value with Respect to the m_λ-Measure

In this section, we shall recall from [14] the definition of the expected value E_λ(ξ) of a fuzzy variable ξ w.r.t. the measure m_λ. Then, we shall define the expected value E_λ(ξ+) of the positive part of ξ w.r.t. the measure m_λ, and this indicator is computed for a triangular fuzzy variable A. We fix a parameter λ ∈ [0, 1]. Let ξ be a fuzzy variable, μ its membership function, and m_λ the measure defined by Eq. (6). Following [14] and [15], the expected value of ξ w.r.t. m_λ is defined by

$$E_\lambda(\xi) = \int_{-\infty}^{0} \left[m_\lambda(\xi \ge x) - 1\right]dx + \int_{0}^{\infty} m_\lambda(\xi \ge x)\,dx \qquad (10)$$

Recall from [13, p. 13] that a triangular fuzzy number A = (a, b, c), with a ≤ b ≤ c, is defined as

$$A(x) = \begin{cases} \dfrac{x-a}{b-a} & \text{if } a \le x \le b \\[4pt] \dfrac{x-c}{b-c} & \text{if } b \le x \le c \\[4pt] 0 & \text{otherwise.} \end{cases} \qquad (11)$$

Lemma 3.1 [14] For any triangular fuzzy number A = (a, b, c), we have

$$m_\lambda(A \ge x) = \begin{cases} 1 & \text{if } x \le a \\[4pt] \dfrac{\lambda(x-a)+b-x}{b-a} & \text{if } a \le x \le b \\[4pt] \dfrac{\lambda(c-x)}{c-b} & \text{if } b \le x \le c \\[4pt] 0 & \text{if } c \le x. \end{cases} \qquad (12)$$

Remark 3.2 [14] By using Lemma 3.1, the expected value E_λ(A) has the form

$$E_\lambda(A) = (1-\lambda)\frac{a}{2} + \lambda\frac{c}{2} + \frac{b}{2}. \qquad (13)$$
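The paper's computations are presented as R code later (Fig. 7); the following short Python sketch is only an illustrative re-expression of Lemma 3.1 and Remark 3.2 (assuming a < b < c), not the authors' code.

```python
# Illustrative sketch of Eq. (12) and Eq. (13) for a triangular fuzzy number A = (a, b, c).
def m_lambda_geq(x, a, b, c, lam):
    """m_lambda(A >= x) from Lemma 3.1 (assumes a < b < c)."""
    if x <= a:
        return 1.0
    if x <= b:
        return (lam * (x - a) + b - x) / (b - a)
    if x <= c:
        return lam * (c - x) / (c - b)
    return 0.0

def e_lambda(a, b, c, lam):
    """Closed-form expected value E_lambda(A) from Eq. (13)."""
    return (1 - lam) * a / 2 + lam * c / 2 + b / 2

# With lam = 0.5 the expected value reduces to the credibilistic mean (a + 2b + c) / 4.
a, b, c = -300, 300, 1000
print(e_lambda(a, b, c, 0.5), (a + 2 * b + c) / 4)   # both 325.0
```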

Definition 3.3 The expected value of the positive part of ξ w.r.t. m_λ is defined by the formula

$$E_\lambda(\xi_+) = \int_0^{\infty} m_\lambda(\xi \ge x)\,dx. \qquad (14)$$

Proposition 3.4 The expected value of the positive part of a triangular fuzzy number A = (a, b, c), with a ≤ b ≤ c, is given by the formula:

$$E_\lambda(A_+) = \begin{cases} (1-\lambda)\dfrac{a}{2} + \lambda\dfrac{c}{2} + \dfrac{b}{2} & \text{if } 0 \le a \\[6pt] \dfrac{1}{b-a}\left[\lambda\left(\dfrac{b^2}{2} - ab\right) + \dfrac{b^2}{2}\right] + \dfrac{\lambda(c-b)}{2} & \text{if } a \le 0 \le b \\[6pt] \dfrac{\lambda c^2}{2(c-b)} & \text{if } b \le 0 \le c \\[6pt] 0 & \text{if } c \le 0 \end{cases} \qquad (15)$$

Proof In order to compute E_λ(A+), we consider the following four cases.

Fig. 2 Fully positive triangular fuzzy number in case 1

Fig. 3 Triangular fuzzy number with positive peak (b > 0) in case 2

Case 1, 0 ≤ a. In this case, the triangular fuzzy number is fully positive, as depicted in Fig. 2, and Eq. (14) becomes $E_\lambda(A_+) = E_\lambda(A) = (1-\lambda)\frac{a}{2} + \lambda\frac{c}{2} + \frac{b}{2}$.

Case 2, a ≤ 0 ≤ b. By using Lemma 3.1 and Fig. 3, the following holds:

$$E_\lambda(A_+) = \int_0^{\infty} m_\lambda(A \ge x)\,dx = \int_0^{b} \frac{\lambda(x-a)+b-x}{b-a}\,dx + \int_b^{c} \frac{\lambda(c-x)}{c-b}\,dx = \frac{1}{b-a}\left[\lambda\left(\frac{b^2}{2} - ab\right) + \frac{b^2}{2}\right] + \frac{\lambda(c-b)}{2}.$$

Case 3, b ≤ 0 ≤ c. According to Lemma 3.1, the expected value of the positive side of the fuzzy number A, as depicted in Fig. 4, becomes $E_\lambda(A_+) = \int_0^{\infty} m_\lambda(A \ge x)\,dx = \int_0^{c} \frac{\lambda(c-x)}{c-b}\,dx = \frac{\lambda c^2}{2(c-b)}$.

Case 4, 0 ≥ c. In this case, where the triangular fuzzy number A is fully negative, as depicted in Fig. 5, it is clear that the positive side of the expected value is E_λ(A+) = 0. □

Fig. 4 Triangular fuzzy number with negative peak (b < 0) in case 3

Fig. 5 Fully negative triangular fuzzy number in case 4

Fig. 6 Triangular fuzzy number A = (a − α, a, a + β)

Remark 3.5 If the triangular fuzzy number A is written as A = (a − α, a, a + β), where a is the peak, α is the left width, and β is the right width, α, β ≥ 0, as depicted in Fig. 6, then the formula for the expected positive side becomes:

$$E_\lambda(A_+) = \begin{cases} (1-\lambda)\dfrac{a-\alpha}{2} + \dfrac{\lambda(a+\beta)}{2} + \dfrac{a}{2} & \text{if } 0 \le a-\alpha \\[6pt] \dfrac{1}{\alpha}\left[\lambda a\left(\alpha - \dfrac{a}{2}\right) + \dfrac{a^2}{2}\right] + \dfrac{\lambda\beta}{2} & \text{if } a-\alpha \le 0 \le a \\[6pt] \dfrac{\lambda(a+\beta)^2}{2\beta} & \text{if } a \le 0 \le a+\beta \\[6pt] 0 & \text{if } a+\beta \le 0 \end{cases} \qquad (16)$$

Remark 3.6 Using the triangular fuzzy number A = (a − α, a, a + β) and formula (16) from Remark 3.5 and fixing λ = 1/2, we obtain the formula of the original credibilistic expected value E_C(A+) for the positive part as proved in [8] and discussed in [16, pp. 126–128] and [17]:

$$E_C(A_+) = \begin{cases} a + \dfrac{\beta-\alpha}{4} & \text{if } 0 \le a-\alpha \\[6pt] \dfrac{a}{2} + \dfrac{a^2}{4\alpha} + \dfrac{\beta}{4} & \text{if } a-\alpha \le 0 \le a \\[6pt] \dfrac{a}{2} + \dfrac{a^2}{4\beta} + \dfrac{\beta}{4} & \text{if } a \le 0 \le a+\beta \\[6pt] 0 & \text{if } a+\beta \le 0 \end{cases} \qquad (17)$$
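As a quick illustrative check (again not the authors' R code), the following Python sketch implements Eq. (16) in the (a − α, a, a + β) parametrization and verifies numerically that λ = 0.5 reproduces the original credibilistic formula (17).

```python
# Sketch of Eq. (16): expected positive side of A = (a - alpha, a, a + beta) w.r.t. m_lambda.
def e_lambda_plus(a, alpha, beta, lam):
    if 0 <= a - alpha:                       # fully positive fuzzy number
        return (1 - lam) * (a - alpha) / 2 + lam * (a + beta) / 2 + a / 2
    if a >= 0:                               # a - alpha <= 0 <= a
        return (lam * a * (alpha - a / 2) + a ** 2 / 2) / alpha + lam * beta / 2
    if a + beta >= 0:                        # a <= 0 <= a + beta
        return lam * (a + beta) ** 2 / (2 * beta)
    return 0.0                               # fully negative fuzzy number

def e_cred_plus(a, alpha, beta):
    """Original credibilistic expected positive side, Eq. (17)."""
    if 0 <= a - alpha:
        return a + (beta - alpha) / 4
    if a >= 0:
        return a / 2 + a ** 2 / (4 * alpha) + beta / 4
    if a + beta >= 0:
        return a / 2 + a ** 2 / (4 * beta) + beta / 4
    return 0.0

# lambda = 0.5 coincides with Eq. (17) in every one of the four cases:
for a, alpha, beta in [(500, 200, 300), (300, 600, 700), (-100, 200, 500), (-600, 100, 200)]:
    assert abs(e_lambda_plus(a, alpha, beta, 0.5) - e_cred_plus(a, alpha, beta)) < 1e-9
```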

3.2 The Center-Of-Gravity Expected Value and the Weight Component

For the comparative purposes of the subsequent Sect. 4, we recall the recent center-of-gravity fuzzy pay-off model, CoG-FPOM, of Borges and others [11], which is the triangular special case of the more general formulations of the interval extension of [18] and the trapezoidal extension of [9]. The real option valuation formula (1) can be adapted for the center-of-gravity ROV (cf. [11, 18, 19]) and becomes:

$$\mathrm{ROV}_{\mathrm{CoG}} = \frac{\int_0^{\infty} A(x)\,dx}{\int_{-\infty}^{\infty} A(x)\,dx} \times E_{\mathrm{CoG}}(A_+) = \frac{\int_0^{\infty} A(x)\,dx}{\int_{-\infty}^{\infty} A(x)\,dx} \times \frac{\int_0^{\infty} x A(x)\,dx}{\int_0^{\infty} A(x)\,dx}, \qquad (18)$$

where the triangular fuzzy number A = (a, α, β) has the membership function defined in [20–22] as:

$$A(x) = \begin{cases} 1 - \dfrac{a-x}{\alpha} & \text{if } a-\alpha \le x \le a \\[6pt] 1 - \dfrac{x-a}{\beta} & \text{if } a \le x \le a+\beta \\[6pt] 0 & \text{otherwise.} \end{cases} \qquad (19)$$

The expected NPV of the triangular fuzzy number of the original fuzzy pay-off model, E_FPOM(A+), is based on Zadeh's [23, 24] possibility theory and Carlsson and Fullér's [16] mean. In the center-of-gravity model, the expected values E_CoG(A+) are (cf. [11, 18, 19]):

$$E_{\mathrm{CoG}}(A_+) = \begin{cases} \dfrac{3a - \alpha + \beta}{3} & \text{if } 0 \le a-\alpha \\[6pt] \dfrac{\alpha(a+\beta)^3 - a^3(\alpha+\beta)}{3\alpha(a+\beta)^2 - 3a^2(\alpha+\beta)} & \text{if } a-\alpha \le 0 \le a \\[6pt] \dfrac{a+\beta}{3} & \text{if } a \le 0 \le a+\beta \\[6pt] 0 & \text{if } a+\beta \le 0 \end{cases} \qquad (20)$$

The corresponding weight components for the four cases can be presented in line with [6] and [8] for the original possibilistic and credibilistic models, as well as [11] and [9] for the center-of-gravity model (the weights are the same in all these model variants):

$$\text{Weight} = \frac{\int_0^{\infty} A(x)\,dx}{\int_{-\infty}^{\infty} A(x)\,dx} = \begin{cases} 1 & \text{if } 0 \le a-\alpha \\[6pt] \dfrac{2a - \frac{a^2}{\alpha} + \beta}{\alpha+\beta} & \text{if } a-\alpha \le 0 \le a \\[6pt] \dfrac{(a+\beta)^2}{\beta(\alpha+\beta)} & \text{if } a \le 0 \le a+\beta \\[6pt] 0 & \text{if } a+\beta \le 0 \end{cases} \qquad (21)$$

The center-of-gravity expected values (20) and weights (21) for the presented four triangular cases can be found partly in [11] and [18] and fully, including also R codes, in [9]. For the credibilistic case, the formulas are presented in [8–10] and [25]. These R codes will be utilized together with a new R algorithm written for our credibilistic lambda-based model, and the results will be compared in the next section.
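The cited R codes are not reproduced here; purely as an illustrative sketch, the two components above can be written in Python directly from Eqs. (20) and (21), using the same (a, α, β) parametrization.

```python
# Sketch of the weight, Eq. (21), and the center-of-gravity expected positive side, Eq. (20).
def weight(a, alpha, beta):
    if 0 <= a - alpha:
        return 1.0
    if a >= 0:                                   # a - alpha <= 0 <= a
        return (2 * a - a ** 2 / alpha + beta) / (alpha + beta)
    if a + beta >= 0:                            # a <= 0 <= a + beta
        return (a + beta) ** 2 / (beta * (alpha + beta))
    return 0.0

def e_cog_plus(a, alpha, beta):
    if 0 <= a - alpha:
        return (3 * a - alpha + beta) / 3
    if a >= 0:
        num = alpha * (a + beta) ** 3 - a ** 3 * (alpha + beta)
        den = 3 * alpha * (a + beta) ** 2 - 3 * a ** 2 * (alpha + beta)
        return num / den
    if a + beta >= 0:
        return (a + beta) / 3
    return 0.0

# Example values used in the next section (a = 300, alpha = 600, beta = 700):
print(weight(300, 600, 700))      # ~0.8846
print(e_cog_plus(300, 600, 700))  # ~389.9
```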

4 Numerical Example: Valuing M&A Synergies

Synergies are one of the most often announced rationales for corporate acquisitions. Thus, an ex-ante evaluation of potential operating cost-reducing and sales-enhancing synergies is a necessary action, which can be conducted by framing an M&A process


Table 1 Cumulative NPV of sales synergies

| Year | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Optimistic scenario | 500 | 650 | 750 | 800 | 1000 |
| Base scenario | 160 | 200 | 220 | 260 | 300 |
| Pessimistic scenario | −160 | −200 | −220 | −260 | −300 |

as a real options investment problem (cf. [26–28]). There are various ways to value the needed components, and this example is built on the evaluation procedure developed in [17, 28] and [9]: firstly, an acquisition target is evaluated as a stand-alone company, assuming it operates as usual without an acquisition; it operates on a certain market with some market share, which develops together with its costs and sales, affected by the markets and managerial efforts. Secondly, cash-flow scenarios are built for the potential different types of cost and sales synergies based on the stand-alone company's and the acquirer's combined financial and strategic resources and the corporate development actions taken during an M&A process. Different risks may apply to sales and costs, as cost development is often better foreseen than future sales, and a discounting factor may be selected accordingly, or it may be based, e.g., on the weighted average cost of capital, WACC.

In Table 1, cumulative NPV scenarios are shown for potential sales synergies evaluated by the acquiring company's analysts. In the base case, the first post-acquisition year is expected to increase sales by $160.000; in two years, the sum is expected to be $200.000, and so on until year 5, when the base case expectation is to have achieved a discounted sum of total sales synergies of $300.000. This may be because of the cross-selling potential of the acquirer's products to the target company's customers, or vice versa, or some other synergistic increase in sales expected to be realized. Similarly, in the pessimistic scenario, sales synergies are expected to turn out equally negative, −$300.000, e.g., due to problems in the integration process and/or possibly by losing the customer base through a destroyed brand of the target's products under the brand of the acquiring company. In the optimistic scenario, sales may increase by as much as $1 million.

For the numerical example, we have written an R algorithm, shown in Fig. 7, to compute the new lambda-credibilistic real option values using the weights from Eq. (21) and the expected values from Eq. (16). The algorithm is easily adaptable for most spreadsheet software. The algorithm is in line with Remarks 3.5 and 3.6, which defined a triangular number as A = (a − α, a, a + β), meaning that the required parameters include the optimism–pessimism parameter λ (lambda in Fig. 7) and the net present value of synergies in the base (a = 300), pessimistic (a − α = −300), and optimistic scenarios (a + β = 1000), as presented for the year-5 cumulated NPVs in Table 1.

Fig. 7 R code for ROVCred(λ)

Calling the algorithm with a = 300, α = 600, β = 700 and varying the optimism–pessimism parameter λ from 0.25 to 0.5 and 0.75, together with the center-of-gravity ROV utilizing Eq. (20) for comparative purposes, the following real option values are obtained:

ROVCred(λ=0.25) = $193.510
ROVCred(λ=0.50) = $320.673
ROVCred(λ=0.75) = $447.837
ROVCoG = $344.872

We notice that with λ = 0.5, the ROV of sales synergies is $320.673. This ROV coincides with the original credibilistic ROV [8] and is 7.5% lower than the center-of-gravity ROV [11]. However, with the high optimism level of λ = 0.75, the ROV is 39.7% higher, while with dominating pessimism and λ = 0.25, the ROV is 39.7% lower.

Fig. 8 Sensitivity of ROV to base scenario NPV

In Fig. 8, the base scenario NPV varies roughly from −$0.8 million to $1.2 million, a = [−800, 1200], while the other parameters are the same as before: the upper dashed line shows the variation of ROV with a high level of optimism, ROVCred(λ=0.75); the lower dashed line shows ROVCred(λ=0.25) with a high level of pessimism; and in the middle of the range, the solid black line represents ROVCred(λ=0.5). The red line shows the respective center-of-gravity ROVCoG. This shows that ROVCoG ≥ ROVCred(λ=0.50) systematically over all four possible cases.
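The authors' R implementation (Fig. 7) is not reproduced here; the following self-contained Python sketch inlines the case a − α ≤ 0 ≤ a of Eqs. (16), (20) and (21), which is the case of this example, and reproduces the reported values (NPVs are in thousands of dollars, following Table 1).

```python
# Sketch reproducing the example: a = 300, alpha = 600, beta = 700 (year-5 NPVs, in $1,000).
# Only the case a - alpha <= 0 <= a of Eqs. (16), (20) and (21) is needed here.
a, alpha, beta = 300, 600, 700

weight = (2 * a - a ** 2 / alpha + beta) / (alpha + beta)                       # Eq. (21)

def rov_cred(lam):
    e_plus = (lam * a * (alpha - a / 2) + a ** 2 / 2) / alpha + lam * beta / 2  # Eq. (16)
    return weight * e_plus

e_cog = (alpha * (a + beta) ** 3 - a ** 3 * (alpha + beta)) / \
        (3 * alpha * (a + beta) ** 2 - 3 * a ** 2 * (alpha + beta))             # Eq. (20)

for lam in (0.25, 0.50, 0.75):
    print(f"ROV_Cred(lambda={lam}): {rov_cred(lam):.3f}")   # 193.510, 320.673, 447.837
print(f"ROV_CoG: {weight * e_cog:.3f}")                     # 344.872
```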

5 Conclusions

We introduced a new credibilistic real options model using the optimism–pessimism measure [12]. By taking into account the optimism characteristic of an analyst, a new dimension was added to real options analysis. The new model was proved to coincide with the original credibilistic real options model [8] when setting λ = 0.5. The credibilistic pay-off model with its range of real option values was shown to value investments lower than (or equally to) the recent center-of-gravity variant when λ ∈ [0, 0.5], over all possible cases. However, with a high level of optimism, the center-of-gravity ROV was surpassed significantly. This is interesting, as the comparative CoG method used here has been shown to value investments higher than other pay-off methods, i.e., the original fuzzy pay-off method, FPOM [6], the original credibilistic method [8], and their many variants (cf. [9, 10]).

For future research, some obvious extensions of the introduced new credibilistic model would be toward allowing intervals and generalizing to trapezoidal distributions. Also, comparisons not only to the center-of-gravity model, but also to the new fully possibilistic FPOM of [29], which has corrected the problems of the original FPOM of [6] by utilizing the Luukka-Stoklasa-Collan transformation [30] between the center-of-gravity and possibilistic means, may be found valuable. Due to the significant effect on ROVs, the optimism–pessimism measure could be put under a qualitative and multi-criteria evaluation to possibly attach qualitative descriptions to different levels of optimism/pessimism. Further, interesting applications of the new credibilistic model can appear in various application domains, ranging from patent valuations to oil extraction or to governmental policy opportunities, among many other possibilities.

Acknowledgements This research is supported by the Finnish Strategic Research Council at the Academy of Finland project Manufacturing 4.0 grants 313349 and 313396.


References

1. Carlsson C, Fullér R (2001) On possibilistic mean value and variance of fuzzy numbers. Fuzzy Sets Syst 122(2):315–326
2. Collan M, Carlsson C, Majlender P (2003) Fuzzy Black and Scholes real options pricing. J Decis Syst 12:391–416
3. Black F, Scholes M (1970) The pricing of options and corporate liabilities. J Polit Econ 81(3):637–654
4. Merton RK (1973) Theory of rational option pricing. Bell J Econ Manag Sci 4:141–183
5. Carlsson C (2019) Digital coaching to make fuzzy real options methods viable for investment decisions. In: Pelta DA, Corona CC (eds) Soft computing based optimization and decision models. Studies in fuzziness and soft computing, vol 360. Springer Verlag, Heidelberg, pp 153–175
6. Collan M, Fullér R, Mézei J (2009) Fuzzy pay-off method for real option valuation. J Appl Math Decision Syst 2009:14
7. Datar V, Mathews S (2007) A practical method for valuing real options: the Boeing approach. J Appl Corp Financ 19:95–104
8. Collan M, Fullér R, Mezei J (2012) Credibilistic approach to the fuzzy pay-off method for real option analysis. J Appl Oper Res 4(4):174–182
9. Kinnunen J, Georgescu I (2020) Fuzzy real options analysis based on interval-valued scenarios with a corporate acquisition application. Nordic J Bus 69(1):44–67
10. Kinnunen J, Georgescu I (2020) Credibilistic real options analysis using interval-valued triangular fuzzy numbers. Int J Adv Comput Eng Netw 8(5):1–6
11. Borges REP, Dias MAG, Dória Neto AD, Meier A (2018) Fuzzy pay-off method for real options: the center of gravity approach with application in oilfield abandonment. Fuzzy Sets Syst 353:111–123
12. Yang L, Iwamura K (2008) Fuzzy chance-constrained programming with linear combination of possibility measure and necessity measure. Appl Math Sci 46:2271–2288
13. Liu B (2004) Uncertainty theory: an introduction to its axiomatic foundations. Springer-Verlag, Berlin
14. Dzuche J, Tassak CD, Sadefo J, Fono LA (2020) The first moments and semi-moments of fuzzy variables based on an optimism-pessimism measure with application for portfolio selection. New Math Nat Comput 16(2):271–290
15. Dzuche J, Tassak CD, Sadefo Kamdem J, Fono LA (2021) On two dominances of fuzzy variables based on a parametrized fuzzy measure and application to portfolio selection. Ann Oper Res 300:355–368. https://doi.org/10.1007/s10479-020-03873-5
16. Carlsson C, Fullér R (2011) Possibility for decision: a possibilistic approach to real life decisions. Springer-Verlag, Berlin-Heidelberg
17. Kinnunen J, Georgescu I (2019) Decision support system for evaluating synergy real options in M&A. In: Proceedings (CD-ROM) of the international conference on management and information systems (ICMIS-19), Bangkok, Thailand
18. Kinnunen J, Georgescu I, Collan M (2020) Center-of-gravity real options method based on interval-valued fuzzy numbers. In: Kahraman C, Onar SÇ, Öztayşi B, Sari IU, Çebi S, Tolga AC (eds) Proceedings of the intelligent and fuzzy techniques: smart and innovative solutions conference (INFUS-20). Springer, Izmir, Turkey, pp 1292–1300
19. Georgescu I, Kinnunen J (2021) The digital effectiveness on economic inequality: a computational approach. In: Dima AM, D'Ascenzo F (eds) Business revolution in a digital era. Springer proceedings in business and economics. Springer, Cham. https://doi.org/10.1007/978-3-030-59972-0
20. Dubois D, Prade H (1980) Fuzzy sets and systems: theory and applications. Academic Press, New York
21. Dubois D, Prade H (1988) Possibility theory: an approach to computerized processing of uncertainty. Plenum Press, New York
22. Georgescu I (2012) Possibility theory and the risk. Springer-Verlag, Berlin-Heidelberg
23. Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning-I. Inf Sci 8(3):199–249
24. Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1:3–28
25. Georgescu I, Kinnunen J (2011) Credibility measures in portfolio analysis: from possibilistic to probabilistic models. J Appl Oper Res 3(2):91–102
26. Bruner R (2004) Applied mergers and acquisitions. Wiley, New York
27. Loukianova A, Nikulin E, Vedernikov A (2017) Valuing synergies in strategic mergers and acquisitions using the real options approach. Bus Perspect 13(1):236–247
28. Collan M, Kinnunen J (2011) A procedure for the rapid pre-acquisition screening of target companies using the pay-off method for real option valuation. J Real Options Strategy 4(1):117–141
29. Stoklasa J, Luukka P, Collan M (2021) Possibilistic fuzzy pay-off method for real option valuation with application to research and development investment analysis. Fuzzy Sets Syst 409:153–169. https://doi.org/10.1016/j.fss.2020.06.012
30. Luukka P, Stoklasa J, Collan M (2019) Transformations between the center of gravity and the possibilistic mean for triangular and trapezoidal fuzzy numbers. Soft Comput 23(10):3229–3235

Analysis of Road Accidents in India and Prediction of Accident Severity Sajal Jain, Shrivatsa Krishna, Saksham Pruthi, Rachna Jain, and Preeti Nagrath

Abstract Road accidents are a global menace, and no country can curb it. In this paper, an attempt has been made to study the various factors associated with a road accident and its effect on the cause and severity of the accident by analyzing the road accidents occurring in the nation of India from 2000 onwards. The severity of accidents can be measured in terms of human loss as well as economic loss. Further, the data is visualized on the map of India using Folium python library for the convenience of comparison between various states and better visualization. In this paper, decision tree classifier has been implemented for the prediction of the severity of a road accident. For each road accident, different parameters such as lighting conditions, vehicle type, etc., have been taken into consideration. All of the tasks have been deployed on a webpage with the help of the Flask web application framework. The proposed model achieves a testing accuracy of 79.45%.

S. Jain · S. Krishna (B) · S. Pruthi: Electronics and Communication Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India
R. Jain · P. Nagrath: Computer Science and Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India; e-mail: [email protected]; P. Nagrath e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_6

1 Introduction

Road accidents are one of the global causes of increasing deaths. In a developing country such as India, where road transport is one of the most widely used modes of transport, it is an even bigger issue. According to reports by the WHO, India accounts for nearly 10% of the world's accident-related deaths. Besides the loss of human life, road accidents also lead to the loss of private and public property in the form of infrastructure and economic losses. Road accidents


are a serious concern in developing countries because of rapid industrialization and urbanization and because of the lack of proper safety measures. The major causes of an accident are driver, vehicle, and environmental conditions [1]. From the analysis of many datasets, some of the factors affecting the severity of an accident have been found to be driver characteristics, road type, vehicle type, lighting conditions, road obstructions, and a few others. At present, road accidents rank ninth among the most serious causes of death in the world, and this situation will not improve unless new initiatives to enhance road safety are undertaken [2]. In India, the total number of road accidents increased by 27.64% from 2000 to 2010. After 2010, instances of road accidents have gradually decreased, with minor fluctuations and an exception of a rise in the number of cases in the year 2015. The rest of the paper is organized as follows: Sect. 2 discusses the related work in the field of road accident analysis and accident severity prediction. Section 3 gives the proposed methodology and contains a brief description of the modeling approach and its algorithm. After this, Sect. 4 states the various observations drawn from the analysis of the road accidents in India from 2000 onwards. The road accidents are then further characterized based on different aspects associated with an accident. The analysis is visualized with the help of numerous graphs, bar plots, and pie charts. The state-wise data is visualized on the map of India using Folium, a Python library. Section 5 describes the dataset used for the prediction model and explains the various steps followed along with the development of the model. It also lists the experimental results, the accuracy of our classification model, and comparisons with other existing techniques. Finally, Sect. 6 presents the conclusions of the analysis and the model, also emphasizing its applications and future scope.

2 Literature Review Analysis of 5 year PTW road accident was carried out in the state of Uttarakhand using the decision tree algorithm. The decision tree algorithm is employed due to its better prediction accuracy than support vector machine and Naïve bayes [3]. In India, the deaths and injuries caused by road accidents are a serious issue, serious enough for the government to draw a plan to reduce road accidents. Not just from the loss of human life point of view, but also from the perspective of the economic effects of road accidents, there is an urgent need to properly analyze and prevent road accidents. Around Rs. 60 billion economic loss is caused each year in India due to road accidents [4]. Another study was conducted in Odisha on NH-55(India) which connects to various major industries and mines. The study indicated that major accidents on the highway were due to trucks and other heavy transport vehicles. It also shed light on the fact that there are three reasons behind an accident, the driver, vehicle, and the environment where an accident occurs [5]. A study was also done on the impact of weather conditions on accidents. It mentions that weather conditions have a significant impact on road accident and casualties with different effects depending on the road type. According to literature, weather can cause about


5% variability. It used a two-stage approach for the climatic factor. An analysis was done for Netherlands, France, and Athens on a daily and monthly basis. The number of accident injuries was collected from the police of these regions, and averaged weather effects were determined [6]. Road patterns in India are quite different from those in developed countries. The increasing dependence of the population on motor vehicles can be attributed to an increase in population, urbanization, and modern infrastructure. Increasing economy means a greater number of motor vehicles on the road which will result in more quantity of road accidents [7]. The higher number of males in various social and economic activities can be considered as a reason for the higher accident rates caused by males since such activities require a high degree of movement and travel. Taking such factors into consideration is necessary before for arriving at any judgment concerning sex role in road accidents [8]. An accident prediction model is an equation that expresses the average accident frequency of an object in terms of traffic flow, weather conditions, and other road characteristics. APMs are used to estimating the expected number of accidents in places like intersections and road crossings. These estimates help in the identification of places or areas for possible safety treatment and the evaluation of such treatments [9]. A study has been done on comparing two model techniques, regression models, and Bayesian network in accident severity modeling. The results were applied to predict accident severity to identify and analyze the contributing factors in a road accident [10]. Another case study was done on the state of Tamil Nadu, wherein accident severity Index (ASI) was calculated based on the data from 1997 to 2008. The calculation was done based on various attributes associated with a road accident. A study of the accident severity based on vehicle type gave interesting result such as clear weather increased the injury severity and collisions with pedestrians and bicycles also increased the severity [11]. A study on road accidents and prevention was done using K-means clustering to show and analyze accident hotspots in an area. The study attempted to classify 428 hotspots into relative types on basis of their environmental characteristics. The clustering created 15 clusters and 5 groups and tabulates the number of hotspots in each cluster [12].

3 Methodology

3.1 Decision Tree

Decision tree is one of the most popular and powerful tools for classification and prediction. It has a structure similar to a flowchart, wherein each internal node implies a test on an attribute and each branch signifies an outcome of that test. A leaf node holds a class label, the outcome. The tree is built by repeatedly splitting the data on attribute tests; this process is called recursive partitioning and is repeated on each subset recursively. The process gets completed when there is


the same value of the subset at a node compared to the target variable, or when no value is added to the prediction after splitting. The decision tree classifier construction does not require any knowledge of the specific domain or parameter setting, and is therefore appropriate for exploratory knowledge discovery. Decision trees are known for handling high-dimensional data, and a decision tree classifier generally has high accuracy. Decision tree induction is a typical inductive approach to learning classification knowledge.

Algorithm
1. Start with the parent node at the root node.
2. Split the parent node at the feature x_i to maximize information gain by minimizing the sum of the child node impurities (see the sketch after this list).
3. Assign the training samples to the child nodes.
4. Repeat steps 1 and 2 for every new child node and stop when the stopping criteria are satisfied or the leaf nodes are pure.

Rules for Stopping
1. When all leaf nodes are pure.
2. When a maximum node depth is achieved.
3. When splitting a node yields no information gain.
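As an illustrative sketch only (not the authors' code), the entropy impurity and information gain referred to in step 2 can be computed as follows.

```python
# Sketch of the entropy impurity and the information gain used to choose a split.
import numpy as np

def entropy(labels):
    """Entropy of a label array: 0 for a pure node, 1 for a balanced binary node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Parent impurity minus the weighted impurity of the two child nodes."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # mixed node -> entropy 1.0
left, right = parent[:4], parent[4:]          # one candidate split
print(entropy(parent), information_gain(parent, left, right))
```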

4 Result Analysis

4.1 Road Accidents in India (2000 to 2017)

In India, the total number of road accidents increased by 27.64% from 2000 to 2010. After 2010, instances of road accidents have gradually decreased, with minor fluctuations and an exception of a rise in the number of cases in the year 2015 (Fig. 1). The number of persons killed, in contrast, increased by a huge 70.46%, from 78,911 in the year 2000 to 134,513 in 2010. The increase in persons injured was much lower than that of persons killed: the increase was 32.12%, from 399,265 in 2000 to 523,193 in 2010. Out of the total persons involved in road accidents between 2000 and 2017, 19.7% suffered injuries, while 80.3% were killed. From 2001 through 2014, the highest number of accidents occurred during 1500–1800 h during the day. During the same interval, the state of Tamil Nadu registered an average of 10,114 road accidents during 1500–1800 h, the highest among all the states. The least number of accidents occurred during 0000–0300 h, with an average of 796 accidents among all states. The Lakshadweep Islands witnessed almost 0 accidents during 0000–0300 h, the lowest among all states (Fig. 2).


Fig. 1 Number of road accidents (2000–2017)

Fig. 2 Count plot of persons killed and injured (2000–2017)

4.2 Black Spots in 2013 and 2014

Identification of zones which are at accident risk is a crucial step toward the prevention of road accidents. National highways run hundreds of kilometers and have certain black spots which are more prone to accidents than other parts because of several reasons such as bad road condition, poor lighting, inaccessibility of emergency response vehicles, etc. (Fig. 3). During 2013, a total of 726 black spots were identified across the major cities of India. In the very next year, the number was significantly reduced to 63. There was a


Fig. 3 Black spots in various states (2013 and 2014)

considerable decrease in the number of black spots in each state except for Madhya Pradesh where the number increased from 25 in 2013 to 29 in 2014.

4.3 Road Accidents in 2017 and 2018

4.3.1 Weather Condition

While the weather might not be the principal agent behind the reason for an accident, it is still an important environmental component [13]. Weather plays an important factor in determining the occurrence as well as the severity of any road accident. The change in weather during extreme weather conditions influences road safety and the resulting numbers of crashes and casualties. Understanding and analyzing the impact that weather conditions have on road accidents is necessary while drawing up road safety measures [14] (Figs. 4 and 5). In the year 2018, surprisingly 74.54% of the total accidents occurred in Sunny/Clear weather conditions, while the least number of accidents, i.e., 0.88%


Fig. 4 Number of accidents based on weather (2017 and 2018)

of total accidents occurred in Hail weather conditions. The lesser number of accidents in adverse weather condition can be explained by the human nature of choosing to walk or use public transport in unfavorable weather conditions.

4.3.2 Vehicle Type

Vehicle type has an obvious direct impact on the accident severity, as the vehicle is the main element involved in a road accident. The type of vehicle determines the degree of damage. The number of occupants of a vehicle is directly involved and determines the human loss in an accident, if any. The persons involved have to suffer physical, economic as well as emotional pain [15]. In the years 2017 and 2018, two-wheelers were involved in the largest number of road accidents, followed by Car/Jeep/Van/Taxi. In 2017, two-wheelers accounted for a 33.9% share of the total number of vehicles involved in the


Fig. 5 Weather conditions in 2018

Fig. 6 Vehicles involved in accidents (2018)

accidents and 29.8% in the year 2018. Hand-drawn and animal-drawn carts and bicycles accounted for the least in both years, with the combined percentage share of the three being less than one percent. A study of accident severity based on vehicle type gave interesting results, such as that clear weather increased the injury severity and that collisions with pedestrians and bicycles also increased the severity [11] (Fig. 6).

4.3.3 Traffic Rules Violation

Traffic rules are in place for the safety and security of drivers and pedestrians on the road. Any violation of these rules imposes a risk of an accident. Characteristics of


Fig. 7 Traffic violations (2017 and 2018)

the driver’s knowledge and behavior play a key role in the occurrence and prevention of an accident. Based on a study, male drivers, young drivers, and drivers having a valid license for less than 10 years tend to have a casual and carefree attitude toward road safety than other drivers. These same groups displayed ‘non-ideal’ behavior in traffic and were more likely to be involved in road accidents [16] (Fig. 7). In 2017–18, the greatest number of accidents occurred due to over-speeding and the least due to jumping a red light. 70.4% of total accidents in 2017 occurred due to over-speeding; while in 2018, it was 66.7%. The share of drunken driving was 3% and 3.2% in 2017 and 2018, respectively (Fig. 8).

4.3.4 Age Profile

According to recent studies, age does not play a factor in relating attitude with accident risk. However, age, sex, and years holding a license also contribute to some extent on the severity of an accident [16]. The working age group of 25–35 were more involved in road accidents than the other age groups. In 2017, of the total numbers of people killed in road accidents, 25.3% are in the age group of 25–35, and in the year 2018, the share was increased to 26.7% (Fig. 9).


Fig. 8 Traffic violations leading to accidents (2018)

Fig. 9 Age profile of persons involved in accidents (2018)

4.3.5 Gender Profile

Although the gender of persons involved in an accident would not have much impact on the accident or the severity of the accident but from a statistical point of view, it seems worthwhile to look at the gender distribution. There has been a considerable number of investigations on the driving behavior of female and male drivers [8]. The males of the rural areas in India constituted more than half of the total persons killed in road accidents in India in 2018. The second major gender group was female in the rural areas followed by male and female in


Fig. 10 Gender distribution of persons involved in accidents (2018)

the urban areas. It can be safely said that a greater number of males were involved in road accidents in rural as well as urban areas. It can also be observed that rural areas are more prone to accidents because of poor infrastructure, fewer facilities, and improper implementation of traffic rules (Fig. 10).

4.3.6 Injury Type

The type of injury suffered by a person in the accident can be a key element to determine the severity of an accident, as it is directly related to the human loss involved. For the sake of example, take the case of pedestrians. For pedestrians, 61% of injuries were to the head and skull fractures accounted for 37%. These were mostly caused by contact with the windshield and front components of the colliding vehicle, and also due to ground contact post-impact [17] (Fig. 11). The year 2015 recorded the highest numbers of almost all types of accidents based on injury type. Most of the people involved suffered minor injuries which constituted of 38.4% of the total accidents occurring in 2015. 26.3% of the accidents were fatal. After 2015, the number of accidents decreased with minor fluctuations. 2017 and 2018 had almost the same number of cases with just a 0.5% increase in 2018. However, the number of fatal and grievous injury accidents was highest in 2018 during the period of 2014–18. The percentage share of fatal being 29.5% and that of grievous injuries being 26.8%.


Fig. 11 Trend of accidents based on injury type (2014–2018)

4.4 Maps

Folium is a powerful Python library used to visualize data on an interactive Leaflet map. It enables both visualization of and interaction with the data with the help of markers, and it provides a wide variety of tools for visualizing geospatial data. The available state-wise data is analyzed and plotted on the map of India for better visualization and deployed on a webpage with the help of Flask, a Python web framework (Maps 1, 2 and 3).
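The paper's own map and web code is not shown; as a minimal illustrative sketch, a Folium map can be saved to HTML and served through Flask roughly as follows. The coordinates, counts and file name are placeholders, not the paper's data.

```python
# Minimal sketch: plot state-level markers with Folium and serve the map page with Flask.
# Coordinates, counts and the output file name are placeholders, not the paper's dataset.
import folium
from flask import Flask

state_accidents = {"State A": (11.1, 78.7, 120), "State B": (28.6, 77.2, 45)}

m = folium.Map(location=[22.0, 79.0], zoom_start=5)      # roughly centred on India
for state, (lat, lon, count) in state_accidents.items():
    folium.CircleMarker(location=[lat, lon], radius=8,
                        popup=f"{state}: {count} accidents",
                        color="crimson", fill=True).add_to(m)
m.save("accidents_map.html")                              # static Leaflet HTML page

app = Flask(__name__)

@app.route("/")
def show_map():
    # Serve the saved map; a larger app could instead render a template.
    with open("accidents_map.html", encoding="utf-8") as f:
        return f.read()

if __name__ == "__main__":
    app.run(debug=True)
```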

5 Prediction Model

5.1 Dataset

The Kaggle 'accidents_india' dataset [18] consists of 10 columns and 684 rows, making the total number of data entries 6,840; 86% of the dataset has been used for training, and the remaining 14% for testing. The following figure shows the bar graph representing the feature importance of the dataset used for the training and testing phases. A target value of 1 indicates that the severity of the accident is major, and 0 implies that the severity of the accident is minor. From Fig. 12, it can be observed that, although the day feature logically has the highest importance, it is of not much value in determining the severity of the accident. The next important feature, road type, plays a major role in determining the severity of an accident, as it is directly linked to the accident: bad road conditions will lead to severe accidents.


Map 1 State-wise information of number persons killed in rural and urban areas (2018)

The speed limit is yet another important feature, since a vehicle at high speed is difficult to control and can cause an accident. Since a vehicle is the primary entity involved in an accident, taking its type into account is crucial for the evaluation of an accident.

5.2 Proposed Model

In this paper, an effort has been made to develop a single generic model for predicting accident severity for the entire country of India after careful examination of already existing models of various other countries. For a developing country like India, such a model can be very beneficial in reducing the severity of road accidents, if not the number of road accidents itself.


Map 2 State-wise number of black spots (2013 and 2014)

Various algorithms were employed and it was observed that the decision tree algorithm had the maximum score for accuracy. The first task was improving the dataset. Firstly, the null values in each column were found out. With columns having numeric entries, the null values were replaced with the mean of the column for uniformity in the data entries. For the columns which had string as data entries, label encoding was performed to assign the different entries in those columns with numbers so that they could be computed by the model. This would make the work of fitting and training the data fairly easy. After performing label encoding, train_test_split was employed to split the dataset into two parts:


Map 3 Total number of accidents (2001–2012)

Fig. 12 Feature importance of the attributes


Fig. 13 Decision tree graph structure

1. Training Dataset
2. Testing Dataset

The test size was taken as 0.14, which means that 86% of the total dataset was used to train the model, and tests were performed on the remaining 14% of the dataset. The algorithm used in the prediction model is the decision tree algorithm. A decision tree is built top-down starting from a root node, and it involves splitting the given data into subsets having properties similar to the parent node. The criterion used is entropy. The criterion parameter determines how the impurity of a split will be measured. When the given sample is completely homogeneous, the entropy of the sample is zero, and when the sample is divided equally, it has an entropy of one (Fig. 13). Decision tree is one of the most popular and powerful tools for classification and prediction. It is structured like a flowchart, wherein each internal node implies a test on an attribute, and each branch signifies an outcome of that test. A leaf node holds a class label, the outcome. Our model can perform well as it uses the decision tree machine learning algorithm, which processes data by splitting the source set into subsets based on an attribute value test. This process is called recursive partitioning and is repeated on each subset recursively. The process gets completed when there is the same value of the subset at a node compared to the target variable, or when no value is added to the prediction after splitting. In our project, we have used the decision tree algorithm for various reasons. Primarily, using this model gave us the maximum accuracy, which would be better for future predictions. For large datasets, it is very helpful, as one needs a low level of data preparation before beginning the project. A decision tree does not require scaling of data, and it is very intuitive. By its very nature, a decision tree is easier to explain to all parties involved. One of the other major reasons as to why we gave


preference to decision tree is that NULL or NOT or missing values present in the dataset do not affect the accuracy of producing a decision tree by a considerable extent although data cleaning and organizing is always a good thing.
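As an illustrative sketch (not the authors' code), the preprocessing and training steps described above translate into scikit-learn roughly as follows. The file name 'accidents_india.csv' and the 'Severity' target column are assumed names, since the exact schema of the Kaggle dataset is not given here.

```python
# Sketch of the described pipeline: mean-impute numeric nulls, label-encode strings,
# 86/14 train/test split, and an entropy-based decision tree.
# 'accidents_india.csv' and the 'Severity' target column are assumed names.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("accidents_india.csv")

# Replace nulls in numeric columns with the column mean.
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].mean())

# Label-encode string columns so the model can use them.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

X = df.drop(columns=["Severity"])      # assumed target column: 1 = major, 0 = minor
y = df["Severity"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.14, random_state=0)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```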

5.3 Results and Performance

From the feature importance, it can be observed that, if we ignore the day on which the accident occurred, the next major factor in determining the severity of an accident is road type. This seems logical, since better road conditions will automatically result in fewer accidents and lower damage in case of an accident as compared to bad roads. The second most contributing factor is the number of passengers in the impacting vehicle. The number of passengers directly corresponds to the injuries and human life lost in an accident and is naturally an important factor, as when we talk about accidents the first thought that comes to mind is how many people were involved and what their conditions are post-impact. The next determining factor is the speed limit: the higher the speed of the vehicle, the more chance it has of going out of control. The speed limit is followed by lighting conditions, which can also be thought of as visibility. The better the lighting conditions, that is, well-lit roads, the better the visibility, which enables the driver to avoid obstacles and also helps in ensuring the safety of pedestrians. Our model uses the decision tree machine learning algorithm, which processes data by splitting the original dataset into subsets. Various algorithms were employed, and it was observed that the decision tree algorithm had the maximum score for accuracy. Our model has an accuracy of 79.45% (Fig. 14). The heat map represents the data matrix in graphical form; it is used to represent two-dimensional data as colors. Our model is able to identify 161 true positives, 140 false positives, 138 true negatives, and 149 false negatives. The confusion matrix of the model was plotted using the seaborn library's heat map (Fig. 15).

Fig. 14 Comparison between different models
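A minimal sketch of how such a heat map can be drawn with seaborn is shown below, using the counts reported in the text arranged in one common (scikit-learn-style) layout; it mirrors, but is not, the authors' plot.

```python
# Sketch: plot a 2x2 confusion matrix as a seaborn heat map.
# Counts are those reported in the text (TP=161, FP=140, TN=138, FN=149);
# the row/column layout is an assumed convention, not taken from the paper.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

cm = np.array([[138, 140],    # row: actual minor  (TN, FP)
               [149, 161]])   # row: actual major  (FN, TP)

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["pred. minor", "pred. major"],
            yticklabels=["actual minor", "actual major"])
plt.title("Confusion matrix of the decision tree model")
plt.tight_layout()
plt.show()
```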


Fig. 15 Experimentally obtained confusion matrix
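The following is a minimal sketch of how such a confusion-matrix heat map can be drawn with seaborn; the counts are the ones stated in the text, while the class labels are illustrative assumptions.

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Rows: actual class, columns: predicted class (counts taken from the text).
    cm = np.array([[161, 149],    # actual positive: 161 true positives, 149 false negatives
                   [140, 138]])   # actual negative: 140 false positives, 138 true negatives

    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=["Predicted severe", "Predicted non-severe"],   # assumed labels
                yticklabels=["Actual severe", "Actual non-severe"])
    plt.title("Confusion matrix")
    plt.show()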

The proposed model was developed after studying various available models. Our research found that no particular model for accident severity prediction exists in India. Although the dataset available at present is neither adequate nor sufficient for nationwide prediction, it gives a very good idea of what can be achieved if proper data is collected about the various factors associated with a road accident.

6 Conclusion

Various graphs and maps were visualized after a careful and thorough analysis of the road accidents occurring in India from 2000 onwards. It can be concluded that road accidents are a serious issue in India, that there has not been much recent decline in their number, and that better road conditions and better safety measures are the need of the hour. For the prediction model, various algorithms were employed, and it was observed that the decision tree algorithm had the maximum accuracy score; our model has an accuracy of 79.45%. The dataset currently available is small, not very detailed and not up to the mark for practical use, but with a larger and more detailed dataset, higher accuracy can be achieved and employed for practical applications such as more accurate mapping of accident-prone areas and devising efficient plans and countermeasures to reduce the severity of an accident. Another potential application could be deploying the model in smart cars for live prediction of the probability of an accident in India. Since India accounts for nearly 10% of the world's accident-related deaths, there is an urgent need to collect accurate and sufficient data for analysis and prediction if we want to reduce the number of fatalities and other economic losses inflicted by road accidents.



Mining Opinion Features and Sentiment Analysis with Synonymy Aspects Sourya Chatterjee and Saptarsi Goswami

Abstract In recent times, sentiment analysis has gained much importance in the field of natural language processing (NLP), and we expect its importance to grow further in the near future. In this paper, our aim is to handle the problem of synonymy aspects: with many popular aspect mining algorithms, we sometimes obtain synonymy aspects, which leads to improper results. An algorithm to distinguish synonymy aspects and to carry out sentiment analysis with such aspects is proposed here. For sentiment analysis, aspect level and document level categorization experiments are performed, with promising outcomes. We also report sentiment polarity and sentiment uncertainty per review, which helps the product/business owner obtain proper feedback.

1 Introduction

Nowadays, customer reviews are one of the main concerns in any kind of business. With quick growth in the field of e-commerce, it is becoming essential to know customer satisfaction with a product, whether in the sector of health care, gaming, advertisement, information technology, etc. It is very helpful for a business to evaluate its product/service through customer feedback, and more helpful still when that feedback provides not only the polarity but also the positivity and negativity scores of the review. After obtaining the polarity, the product owner can analyze the improvement areas of his product; however, he will remain unsure about those improvement areas until the customer reviews are broken down by feature/aspect. Sentiment analysis is basically text mining, which extracts the key information and helps the business/product owner gauge the social sentiment of their product or brand while monitoring customer feedback/reviews. It is a classification process with different levels, e.g., document level, sentence level, and aspect level [1].


Our approach is based on aspect level analysis combined with document level analysis. As per Tractica's analysis, the top use case categories for sentiment analysis include customer experience, product/market research, customer service, health care, gaming, education, and automotive [2]; the expected growth of sentiment analysis in the near future is also discussed there. The term 'opinion mining' goes hand in hand with the term 'sentiment analysis': opinion mining helps us extract people's opinions, while sentiment analysis identifies the sentiment expressed in the text. So our target is to find opinions, identify sentiments, and classify their polarity (e.g., positive, negative, and neutral). In this paper, a solution for synonymy aspect-based sentiment analysis is provided, where we find aspects and obtain the polarity for each aspect as well as for the whole review. The positivity and/or negativity score and the polarity uncertainty of each aspect in the review are also shown. Our contributions, discussed mainly in Sect. 3, are: (1) an algorithm to distinguish synonymy aspects and mine aspects on that basis; (2) a mathematical approach to find the polarity uncertainty. The structure of the rest of the paper is as follows. Literature survey and background work are discussed in Sect. 2. In Sect. 4, we discuss data collection (the various data sources and how we split data in our database), feature generation (a lexicon-based approach with the skip-gram methodology to find aspects), and model architecture (Naïve Bayes and VADER sentiment analysis models). In Sect. 5, we discuss our results and compare them with other methodologies. In Sect. 6, we summarize our approach and discuss its applications. As the sentiment analysis trend has been increasing rapidly over the last decade, we hope this work contributes to future studies.

2 Related Works

For feature mining, there are mainly two types of approaches: machine learning-based and lexicon-based. For the lexicon-based approach, Hu and Liu [3] suggested an algorithm using a product review dataset, performing opinion mining with a 1-step association rule mining algorithm to obtain aspects. Ding et al. [4] improved that lexicon-based approach with a holistic approach, which allowed the system to handle context-dependent words, and built a new system, 'Opinion Observer', based on it. Rambocas and Gama [5] suggested a keyword-based approach, mainly identifying adjectives to perform aspect determination. Kaviani and Dhotre [6] surveyed machine learning approaches such as Naïve Bayes; different Naïve Bayes applications, advantages, and disadvantages are discussed in their paper. Behdenna et al. [7] presented a survey of document level sentiment analysis, mainly by differentiating between positive and negative polarity words. Dey et al. [8] discussed two supervised machine learning


algorithms, the K-NN classifier and the Naïve Bayes algorithm, and concluded that Naïve Bayes gives better accuracy than K-NN in the field of sentiment analysis. Nigam and Yadav [9] classified tweets into positive and negative polarity with a lexicon-based approach by determining a semantic score, using the R language to display histograms, etc. Almatarneh and Gamallo [10] proposed an unsupervised dictionary-based approach for analyzing sentiments, searching for extreme opinions to perform sentiment analysis. Suppala and Rao [11] presented sentiment analysis of tweets by Naïve Bayes classification in a survey paper; they also showed the percentage of polarity (negativity and positivity) of any tweet.

3 Proposed Algorithm

In this section, our proposed algorithm for feature generation and sentiment analysis is discussed. Figure 1 shows the flow of our algorithm.

Fig. 1 Flow diagram of aspect-based sentiment analysis approach


3.1 Feature Generation

Algorithm 1: Synonymy aspect generation
Require: Noun database (Noun.csv), Review database
1.  for every row in Noun.csv do
2.      perform lemmatization and stemming
3.  end for
4.  declare a1 = {}, a2 = {}, a3 = {}, a = {}, syn = hashmap<string, list(string)>
5.  perform association rule mining on Noun.csv
6.  insert words with support >= 0.7 into a1
7.  insert words with support >= 0.5 into a2
8.  for word in a2 do
9.      insert synonyms of word into syn
10.     for synonym in syn[word] do
11.         if synonym not in Noun.csv then
12.             remove synonym from syn
13.         else
14.             perform word embedding between synonym and word on the review database
15.             if cosine similarity < 0.5 then
16.                 remove synonym from syn
17.             end if
18.         end if
19.     end for
20.     for synonym in syn[word] do
21.         replace synonym with word in Noun.csv
22.     end for
23. end for
24. perform association rule mining on Noun.csv
25. insert words with support >= 0.8 into a3
26. a = a1 ∪ a3
27. return a

We propose Algorithm 1 for finding synonymy aspects. Figure 2 briefly shows the flow of feature generation. After lemmatization and stemming, the Apriori algorithm is run twice for association rule mining. First, with support 0.7, the result is put in set a1; next, the same algorithm is run with support 0.5 and the result is put in set a2. The synonyms of the words in set a2 are then found and put in a hashmap with a single key and multiple values. Let a2 = {x, y, …, z} and let the hashmap be map = {x: [x1, x2], y: [y1, y2, y3], …, z: [z1, z2]} (x and z have 2 synonyms, y has 3 synonyms). Next, synonyms that are not in the transaction set of Noun.csv are removed; after removal, map = {x: [x1], y: [y1, y3], …, z: [z1, z2]} (x2 and y2 are not in Noun.csv). Next, word embedding is performed with skip-gram [12], choosing a threshold value of 0.5, and synonyms with cosine similarity less than 0.5 are removed. In addition, if more than one word has the same synonym, we keep only the pair with the higher cosine similarity; e.g., 'time' and 'film' both have the common synonym 'show', where 'show' and 'time' have cosine similarity 0.501 and 'show' and 'film' have cosine similarity 0.653, so we keep only ('film': 'show').


Fig. 2 Flow diagram of aspect generation

Now map = {x: [x1], y: [y3], …, z: [z1]} (y1 and z2 do not meet our word embedding threshold value). Next, the elements of a2 are replaced with their mapped synonyms in Noun.csv, and the Apriori algorithm is run with support 0.8; the result of that operation is put in set a3. The flow of obtaining synonyms of aspects is shown in Fig. 3. In the final step of aspect/feature generation, we take the final set a as a1 ∪ a3.
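The following is a simplified sketch of this synonym-merging step, given as one possible reading of Algorithm 1 rather than the authors' own code; it assumes WordNet (via NLTK) for candidate synonyms and a gensim skip-gram word2vec model trained on the tokenised reviews, with the 0.5 cosine-similarity threshold from the text.

    from nltk.corpus import wordnet
    from gensim.models import Word2Vec

    def merge_synonym_aspects(candidates, noun_vocab, review_sentences, threshold=0.5):
        """Map valid synonyms of each candidate aspect onto the candidate itself."""
        # Skip-gram (sg=1) embeddings learned from the tokenised review sentences.
        w2v = Word2Vec(review_sentences, vector_size=100, window=5, sg=1, min_count=1)

        mapping = {}                                  # synonym -> canonical aspect word
        for word in candidates:
            if word not in w2v.wv:
                continue
            synonyms = {l.name().lower() for s in wordnet.synsets(word) for l in s.lemmas()}
            for syn in synonyms:
                # Keep only synonyms that occur in the noun transactions and the embeddings.
                if syn == word or syn not in noun_vocab or syn not in w2v.wv:
                    continue
                sim = w2v.wv.similarity(word, syn)
                if sim < threshold:
                    continue
                # If two candidates share a synonym, keep the pair with higher similarity.
                if syn not in mapping or sim > w2v.wv.similarity(mapping[syn], syn):
                    mapping[syn] = word
        return mapping

    # Every mapped synonym would then be replaced by its canonical word in Noun.csv,
    # and Apriori run again with the higher support threshold (0.8).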

3.2 Sentiment Analysis

Our algorithm takes the aspects and the review as input and outputs polarity and uncertainty. Here we use the sentiment scores of words obtained from the Naïve Bayes algorithm instead of a fixed lexical dictionary score. We give extra emphasis to aspect words and their neighbour words (window size 5) as well as to special idioms and emoji. The uncertainty is computed with the following formula:

$$\text{Uncertainty} = 1 - \frac{\sum(\text{Polarity of words})}{\sqrt{\left(\sum \text{Polarity of words}\right)^{2} + (\text{number of words})}}$$


Fig. 3 Getting synonyms of probable aspects

The term uncertainty differs from the term entropy in the following sense: entropy measures the randomness of a variable, i.e., the randomness of positivity and negativity themselves, whereas uncertainty measures the randomness of the effect created by the positive and negative words. It tells us about the fuzziness of the polarity of the sentence; less uncertainty means we can be more sure about the polarity of the sentence. Suppose a sentence 'x' has positivity 0.7 with 10 words and a sentence 'y' has positivity 0.7 with 20 words. Then the uncertainty of sentence 'y' is greater than the uncertainty of sentence 'x', even though the positivity is the same in both cases. Figure 4 illustrates the difference between entropy and uncertainty with an example.
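A small worked check of this behaviour, assuming the reconstructed formula above (with "polarity" taken as the summed word polarity of the sentence):

    import math

    def uncertainty(polarity_sum, num_words):
        return 1 - polarity_sum / math.sqrt(polarity_sum ** 2 + num_words)

    print(uncertainty(0.7, 10))   # sentence x -> ~0.784
    print(uncertainty(0.7, 20))   # sentence y -> ~0.845 (higher, as stated in the text)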

4 Methodology and Methods

In this section, we discuss the methods and methodologies of our proposed technique. First, data collection is described; thereafter, the feature extraction process and the sentiment analysis models used here are discussed.


Fig. 4 Example of uncertainty and entropy

4.1 Data Collection

50,000 movie reviews are taken with their polarity from the IMDB dataset and put in a csv file.
1. First, part of speech tagging of the words from the sentences (review dataset plus our newly added reviews) is done.
2. A database of nouns is made from the 'Noun' part of speech of our review database, e.g., Noun.csv.
3. A database of adjectives is made from the 'Adjective' part of speech of our review database, e.g., Adjective.csv.
4. A database of neighbour words of window size 4 around all nouns and adjectives (nouns and adjectives only) is made, e.g., Neighbourwords.csv (Tables 1, 2 and 3).
We can visualize this with an example. Let the sample sentence be: 'A powerful and tragical comedy, created by Sam Mendes and a wonderful cast of Kevin Spacey.'

Table 1 Row added in Noun.csv for sample sentence
Comedy | Sam | Mendes | Cast | Kevin | Spacey

Table 2 Row added in Adjective.csv for sample sentence
Powerful | Tragical | Wonderful


Table 3 Row(s) added in Neighbour.csv for sample sentence
Powerful  | Tragical | Comedy
Comedy    | Sam      |
Sam       | Mendes   |
Wonderful | Cast     | Kevin
Cast      | Kevin    | Spacey

4.2 Feature Generation

In the feature generation process, frequent features are found by association rule mining; the Apriori algorithm is used for this purpose. The term 'support' expresses how frequently an itemset appears in the transactions:

$$\text{Support}(\{X\} \rightarrow \{Y\}) = \frac{\text{Transactions containing both } X \text{ and } Y}{\text{Total number of transactions}}$$

Confidence is the conditional probability of occurrence of the consequent given the antecedent:

$$\text{Confidence}(\{X\} \rightarrow \{Y\}) = \frac{\text{Transactions containing both } X \text{ and } Y}{\text{Transactions containing } X}$$
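As an illustration of this step, the sketch below mines frequent noun itemsets with the Apriori implementation from mlxtend; it is a minimal sketch under the assumption of a one-hot transaction table built from Noun.csv and toy transactions, not the authors' exact pipeline.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    # Each transaction is the list of nouns extracted from one review (toy data).
    transactions = [["film", "story", "music"],
                    ["film", "character"],
                    ["film", "story", "character"]]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)

    # Frequent itemsets at the two support thresholds used in Algorithm 1.
    a1 = apriori(onehot, min_support=0.7, use_colnames=True)
    a2 = apriori(onehot, min_support=0.5, use_colnames=True)
    print(a1)
    print(a2)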

Then we proceed further by finding synonyms of the features and performing word embedding [12] on them. Word embedding is a vector representation of words; we use the skip-gram method, which tries to predict the surrounding words when an input word is given, with the hidden layer defining the number of dimensions. Figure 5 illustrates the skip-gram word embedding concept. We then perform feature pruning to drop incorrect features. Two types of pruning are used [3]:
1. Compactness pruning: checks features containing a feature phrase (more than one word) and removes meaningless ones.
2. Redundancy pruning: removes redundant features consisting of a single word.

Next we find out compact phrases and infrequent features with Neighbourwords.csv and Adjective.csv, respectively [3].

4.3 Model Architecture

In VADER sentiment analysis [13], we give the review as an input string, and as output, we receive the polarity of the sentence as well as how intense that polarity is. This analysis uses its own lexical dictionary for the polarity of words. This methodology


Fig. 5 Skip-gram word embedding concept

gives some extra emphasis to special idioms. It has its own sentiment score for its dictionary words, and each emoji also has a sentiment score. The Naïve Bayes algorithm [6] determines the polarities of words by taking the joint probabilities of classes and words; here we determine polarities via a logarithmic approach. For an independent feature vector (x1, …, xn), the final decision rule can be defined as [14]

$$\hat{Y} = \arg\max_{k}\left[\ln P(C_k) + \sum_{i=1}^{n} \ln\frac{N_{ki} + \alpha}{N_k + \alpha n}\right]$$

Here, N_ki is the number of times feature i appears in sample class k, N_k is the total count of all features in class k, and α is the smoothing prior (α = 1 is Laplace smoothing). For document level analysis, after training our model with the Naïve Bayes approach, we generate aspects for the test data and then determine the polarity of the sentence.
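The two models named in this section can be exercised with off-the-shelf implementations; the sketch below is a minimal illustration assuming the vaderSentiment and scikit-learn packages and toy training data, not the authors' exact configuration.

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    review = "A powerful and tragical comedy, created by Sam Mendes and a wonderful cast of Kevin Spacey."

    # VADER: rule/lexicon-based polarity and intensity scores for the whole review.
    print(SentimentIntensityAnalyzer().polarity_scores(review))

    # Multinomial Naive Bayes with Laplace smoothing (alpha = 1), trained on labelled reviews.
    train_texts = ["great film, wonderful cast", "boring script and weak actors"]   # toy data
    train_labels = ["positive", "negative"]
    vec = CountVectorizer()
    clf = MultinomialNB(alpha=1.0).fit(vec.fit_transform(train_texts), train_labels)
    print(clf.predict(vec.transform([review])))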

5 Result and Discussion

5.1 Aspect Generation

Existing lexicon-based methodologies [3, 4] give us aspects, but in that case we obtain synonymy aspects (e.g., both 'film' and 'movie') when run on the movie review dataset.


In our process, we remove duplicate synonyms. For example, 'Mining opinion features in customer reviews' by Hu and Liu gives the following result on our dataset: aspect = {'character', 'film', 'time', 'story', 'music', 'movie'}, whereas our approach gives: aspect = {'character', 'film', 'time', 'story', 'music', 'scene'}. In other popular algorithms, all words are considered when finding aspects, which has two types of disadvantages:
1. In some cases, synonymy aspects are all present in the aspect list (e.g., 'cinema', 'film' and 'movie' all appear). If we try to find the top five aspects or features, we then lose two spots to synonymy aspects.
2. In some cases, we do not obtain the correct aspect list because synonymy aspects with the same meaning are not considered together. For example, suppose we try to find aspects (rule: support > 0.75) over three sentences, where one sentence contains the word 'cinema', another contains 'film', and another contains 'movie'. Then each has support 0.33, and nothing is considered an aspect.

In our approach, we overcome these disadvantages.
Existing method [3], after the 1st-step association rule mining / compactness pruning / p-support pruning: Recall: 0.67, Precision: 0.79.
Our method (adding the 2nd-step association rule mining to these steps): Recall: 0.69, Precision: 0.84 (Table 4).
With the precision and recall of the 2nd-step association rule mining, if we then apply the FBS, OPINE or Opinion Observer method [4] to get the final result, we obtain the results in Table 5.

Table 4 Precision and recall comparison after 1st step and 2nd step association rule mining

Existing method [3]: Precision 0.67 | Recall 0.79 | F-score 0.73
Our method: Precision 0.69 | Recall 0.84 | F-score 0.77

Table 5 Precision and recall comparison for FBS, OPINE, and opinion observer
                   Existing method [3, 4]               Our method
                   Precision | Recall | F-score         Precision | Recall | F-score
FBS                0.93      | 0.76   | 0.83            0.92      | 0.82   | 0.87
OPINE              0.86      | 0.89   | 0.87            0.86      | 0.9    | 0.88
Opinion observer   0.92      | 0.91   | 0.91            0.92      | 0.91   |


Fig. 6 Sample result of aspect wise sentiment analysis

5.2 Sentiment Analysis

With our approach, we obtain results like Fig. 6 for a review with the aspect set [film, script, actor]. Here we can see that we obtain the uncertainty of the polarity as well as the positivity and negativity. For document level analysis, by matching the predicted and actual labels, we obtain the final result described in Fig. 7. In Table 6, a comparison of Naïve Bayes and our algorithm is shown.

Fig. 7 Result of document level sentiment analysis

Table 6 Sentiment analysis comparison with Naïve Bayes algorithm
            Naïve Bayes                          Our method
            Precision | Recall | F-score         Precision | Recall | F-score
Negative    0.88      | 0.83   | 0.85            0.90      | 0.82   | 0.86
Positive    0.84      | 0.89   | 0.86            0.84      | 0.91   | 0.87


6 Conclusion

In this paper, we propose a technique for mining aspects and performing sentiment analysis on them. For aspect determination, we use a lexical approach, whereas for sentiment analysis we use a machine learning approach. For mining aspects, we distinguish synonymy aspects and perform a 2-step mining process, which gives better results. Our sentiment analysis process gives not only the opinions and polarity but also the uncertainty of the polarity. As a result of our approach, one can rank the most discussed aspects and find the opinions (positive/negative) based on them. We believe our approach can help business/product owners in many ways: a restaurant can rank the foods most discussed by customers along with their opinions, a brand can follow the most discussed subjects on social media together with the opinions expressed, and for a product, the most discussed characteristics can be ranked along with their opinions.

References
1. Kamble SS, Itkikar AR (2018) Study of supervised machine learning approaches for sentiment analysis. IRJET
2. Tractica (2018) Press release on "Emotion recognition and sentiment analysis market to reach 3.8 billion by 2025"
3. Hu M, Liu B (2004) Mining opinion features in customer reviews. AAAI 4:755–760
4. Ding X, Liu B, Yu PS (2008) A holistic lexicon-based approach to opinion mining. In: Proceedings of the 2008 international conference on web search and data mining. ACM, pp 231–240
5. Rambocas M, Gama J (2017) Marketing research: the role of sentiment analysis. ISSN: 0870-8541
6. Kaviani P, Dhotre S (2017) Short survey on naïve bayes algorithm. Int J Adv Res Comput Sci Manag 04(11)
7. Behdenna S, Belalem G, Barigou F (2018) Document level sentiment analysis: a survey. EAI Endorse Trans Cont Aware Syst Appl 4(13):154339
8. Dey L, Chakrobarty S, Biswas A, Tiwary S (2016) Sentiment analysis of review datasets using Naïve Bayes and K-NN classifier. Int J Inf Eng Electron Bus 8(4):54–62
9. Nigam N, Yadav D (2018) Lexicon-based approach to sentiment analysis of tweets using R language. In: Second international conference, ICACDS 2018, Dehradun, India, 20–21 Apr, Revised Selected Papers, Part I
10. Almatarneh S, Gamallo P (2018) A lexicon based method to search for extreme opinions. PLoS ONE 13(5):e0197816
11. Suppala K, Rao N (2019) Sentiment analysis using Naïve Bayes classifier. Int J Innov Technol Explor Eng (IJITEE) 8(8). ISSN: 2278-3075
12. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
13. Hutto CJ, Gilbert E (2013) VADER: a parsimonious rule-based model for sentiment analysis of social media text
14. Rubtsova Y (2015) Constructing a corpus for sentiment classification training. Softw Syst 109(1):72–78

Understanding Employee Attrition Using Machine Learning Techniques Agnibho Hom Chowdhury, Sourav Malakar, Dibyendu Bikash Seal, and Saptarsi Goswami

Abstract Attrition, in human resource terminology, refers to the phenomenon of employees leaving the company. It is usually measured with a metric called the attrition rate, which is the number of employees moving out of the company (voluntarily resigning or being laid off by the company). To study this problem in depth and, if possible, improve employee retention, we have developed two predictive models for two different datasets based on supervised machine learning algorithms. This article therefore aims to provide a framework for predicting employee attrition by studying various aspects of an employee's behavior and attributes using classification techniques like naïve Bayes, k-nearest neighbors, and random forest.

1 Introduction

High attrition is a cause of concern for a company as it presents a cost to the company. The company loses the amount it spent to recruit and select these employees and to train them for their respective jobs, and it may also have to spend additional money to fill the vacancies left open by these employees. Hence, it becomes critical for a company to keep a tab on the attrition rate, which down-sizes the employee base. There are several reasons why employees consider the option of moving out of their current organization. Some of the main reasons are better pay and job opportunities outside the organization. Improper work–life balance can cause a high attrition rate, as can impolite behavior of managers and peers, ineffective team management, stagnancy in career growth, and poor quality of work-life. Inadequate and poor working conditions lead to a lack of motivation.


In today's competitive business environment, attrition can affect both a business's bottom line and its morale. It decreases overall performance, as there is no time to train the new employee who is to take over the job, and the whole team is affected; it can directly be seen as an overall decrease in the performance of the team. Daily task management is another problem: it becomes difficult to manage daily tasks, and the remaining employees suffer since they have to pick up the workload and deal with the disruption to daily routines. There is an increased cost associated with every level of the process—losing and paying the previous employee, hiring a new one, and training the new employee. Employee turnover also causes a lack of knowledgeable employees: it goes without saying that when employees leave an organization, they take with them the experience they have gained over time. Even when experienced employees are hired, they may struggle with critical business matters as they are new to the company's policies, culture, and current employees. Attrition also creates a negative image and may even lead to a drastic change in customer relationships: customers connect with employees in an organization, and employees leaving all of a sudden may create doubts in customers' minds as well. Employee development also suffers when there is a disturbance within the organization due to employees leaving, affecting the development process for all; employee development plans take time and huge investments, and a change mid-way mostly means loss of the past work done, which benefits no one.

According to the KPMG Annual Compensation Trends Survey India 2018,
• Figure 1 depicts the share of voluntary staff turnover among Indian companies.
• Figure 2 shows the average employee attrition rate in Indian companies and organizations based on job level (n = 256).

Fig. 1 Share of voluntary staff turnover among Indian companies


Fig. 2 The average employee attrition rate based on job level

• The above data and the corresponding findings are based on 272 Indian companies and organizations spread across 18 sectors.
The rest of the paper is organized as follows. In Sect. 2, a detailed attrition-based study is performed. In Sect. 3, we discuss the materials and methods employed in setting up the empirical study. In Sect. 4, the most important attributes for the target variables are selected through feature selection. In Sect. 5, the classification performance of the models is critically analyzed and discussed, and finally, in Sect. 6, the concluding remarks are presented.

2 Literature Review

There have been many studies regarding attrition in recent years. In a meta-analytic review of voluntary turnover studies [1], it was found that the strongest predictors of voluntary turnover were age, tenure, pay, overall job satisfaction, and employees' perceptions of fairness. Other similar research findings suggested that personal or demographic variables, such as age, gender, ethnicity, education, and marital status, were important factors in the prediction of voluntary employee turnover [2–6]. Other characteristics that studies focused on are salary, working conditions, job satisfaction, supervision, advancement, recognition, growth potential, burnout, etc. [7–10].


3 Methodology

This section has two subsections. In the first subsection, the source of the data is outlined; in the second, the appropriate classification model is selected.

• Input datasets
IBM HR Analytics Employee Attrition and Performance: This dataset has a total of 1470 observations and 35 variables (attributes) related to an employee's life [11].
HR Attrition: The data includes 14,999 observations and 12 variables (features) for each employee record [12].

• Model selection
The above problem is a classification problem. There are several machine learning algorithms for classification, such as logistic regression (LR), the naïve Bayes classifier (NB) [13], support vector machine (SVM) [14], random forest (RF) [15], and decision tree (DT) [16]. Any of these classification algorithms could be used; however, we have chosen RF because it is an ensemble learning method for classification, regression, and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. We have also used up-sampling, which is the process of randomly duplicating observations from the minority class to reinforce its signal, as sketched below.
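A minimal sketch of this setup, assuming a scikit-learn workflow in which the minority class is up-sampled only within the training split; the file name and column names are illustrative, not the datasets' exact schema.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.utils import resample

    df = pd.read_csv("hr_attrition.csv")                 # hypothetical input file
    train, test = train_test_split(df, test_size=0.3, random_state=42)

    # Up-sample the minority class (attrition = 1) by random duplication.
    majority = train[train["attrition"] == 0]
    minority = train[train["attrition"] == 1]
    minority_up = resample(minority, replace=True, n_samples=len(majority),
                           random_state=42)
    train_bal = pd.concat([majority, minority_up])

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(train_bal.drop(columns=["attrition"]), train_bal["attrition"])
    print("Test accuracy:", rf.score(test.drop(columns=["attrition"]), test["attrition"]))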

4 Feature Engineering

Now that the models have been trained, we can move forward to testing them. Before that, we need to select the attributes or features that have a positive impact on our target variable and remove those having a negative impact or no impact at all.

IBM HR Analytics Employee Attrition and Performance: For feature selection, we have kept only those variables whose mean decrease accuracy is greater than the median mean decrease accuracy; in our case, this median is 3.32. In Fig. 3, we show the mean decrease accuracy for all the input features, and in Table 1, the finally selected features are presented.

HR Attrition: In this case, the median mean decrease accuracy is 77.10. In Fig. 4, we show the mean decrease accuracy for all the input features, and in Table 1, the finally selected features are presented.
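Mean decrease accuracy is the drop in accuracy observed when a feature's values are permuted. A rough sketch of the above selection rule, using scikit-learn's permutation importance as a stand-in (an assumption about the tooling, which the paper does not state), continuing the previous sketch:

    import numpy as np
    from sklearn.inspection import permutation_importance

    # rf and test come from the previous sketch (fitted random forest, held-out split).
    X_test, y_test = test.drop(columns=["attrition"]), test["attrition"]

    result = permutation_importance(rf, X_test, y_test, scoring="accuracy",
                                    n_repeats=10, random_state=42)
    importances = result.importances_mean

    # Keep only features whose mean decrease accuracy exceeds the median value.
    threshold = np.median(importances)
    selected = [c for c, imp in zip(X_test.columns, importances) if imp > threshold]
    print("Selected features:", selected)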


Fig. 3 Selected features for IBM HR analytics employee attrition dataset

Table 1 Important features for the classification model
Dataset | Selected features
IBM HR Analytics Employee Attrition and Performance | Age, attrition, business travel, environment satisfaction, job involvement, job level, job role, marital status, monthly income, number of companies worked, overtime, stock option level, total working years, years at company, years in current role, years since last promotion, and years with current manager
HR Attrition | Satisfaction level, last evaluation, number project, average monthly hours, and time spend company


Fig. 4 Selected features for HR attrition dataset

5 Result and Discussion

From Table 2, we observe that naïve Bayes has the best recall, i.e., it correctly predicts 61.93% of employee attrition, but the least accuracy, meaning that the total number of correct predictions made by it was the lowest. k-nearest neighbors (KNN) has an overall accuracy of 83.85% and precision of 36.86%, but a very low recall of 6.23%. RF is the best classifier among these with an accuracy of 86.52%, a recall of 25.96%, and a precision of 66.75%.

Table 2 Classification result for dataset one

Model         | Accuracy (test set) | Recall (test set) | Precision (test set)
Naïve-Bayes   | 0.77                | 0.61              | 0.37
K-NN          | 0.83                | 0.60              | 0.36
SVM           | 0.84                | 0.26              | 0.37
Random forest | 0.86                | 0.25              | 0.66


Table 3 Classification result for dataset two
Model         | Accuracy (test set) | Recall (test set) | Precision (test set)
Naïve-Bayes   | 0.80                | 0.43              | 0.62
K-NN          | 0.97                | 0.94              | 0.95
SVM           | 0.97                | 0.94              | 0.97
Random forest | 0.98                | 0.94              | 0.99

From Table 3, we see that naïve Bayes is the poorest classifier, with a very low recall of 43.87%, an accuracy of 80.5%, and a precision of 62.97%. k-nearest neighbors (KNN) achieves quite a good accuracy of 97.68%, a recall of 94.3%, and a precision of 95.96%. RF is the best classifier in all three measures, with an accuracy of 98.59%, a recall of 94.74%, and a precision of 99.25%. From Fig. 5, it can be observed that SVM is a better classifier than both naïve Bayes and k-nearest neighbors; however, random forest marginally betters the overall measures of model evaluation. From Fig. 6, we see that naïve Bayes is the poorest classifier; decision tree is marginally better than k-nearest neighbors, but RF is somewhat more accurate at predicting correctly compared to the remaining three.

Fig. 5 Classification result for dataset one on accuracy, recall, and precision


Fig. 6 Classification result for dataset two on accuracy, recall, and precision

6 Conclusion

In this paper, we have performed an empirical study by designing classification models for two different datasets using machine learning algorithms for correctly predicting employee attrition rate. The major findings of our study are as follows:
• It has been observed that for the first dataset RF has achieved the best accuracy and precision compared to others. Also, for the second dataset, RF has achieved the highest accuracy, recall, and precision.
• Overtime, monthly income, total working years, age, etc., are the most dominant attributes for the first dataset. For the second dataset, attributes like satisfaction level, last evaluation, time spend company, etc., are the best performing attributes.
Hence, the above discussion suggests that random forest has performed the best compared to others.

References
1. Cotton JL, Tuttle JM (1986) Employee turnover: a meta-analysis and review with implications for research. Acad Manag Rev 11(1):55–70
2. Finkelstein LM, Ryan KM, King EB (2013) What do the young (old) people think of me? Content and accuracy of age-based metastereotypes. Eur J Work Organ Psychol 22(6):633–657
3. Holtom BC, Mitchell TR, Lee TW, Eberly MB (2008) 5 turnover and retention research: a glance at the past, a closer review of the present, and a venture into the future. Acad Manag Annals 2(1):231–274
4. von Hippel C, Kalokerinos EK, Henry JD (2013) Stereotype threat among older employees: relationship with job attitudes and turnover intentions. Psychol Aging 28(1):17


5. Peterson SL (2004) Toward a theoretical model of employee turnover: a human resource development perspective. Hum Res Dev Rev 3(3):209–227
6. Sacco JM, Schmitt N (2005) A dynamic multilevel model of demographic diversity and misfit effects. J Appl Psychol 90(2):203
7. Allen DG, Griffeth RW (2001) Test of a mediated performance–turnover relationship highlighting the moderating roles of visibility and reward contingency. J Appl Psychol 86(5):1014
8. Liu D, Mitchell TR, Lee TW, Holtom BC, Hinkin TR (2012) When employees are out of step with coworkers: how job satisfaction trajectory and dispersion influence individual- and unit-level voluntary turnover. Acad Manag J 55(6):1360–1380
9. Swider BW, Zimmerman RD (2010) Born to burnout: a meta-analytic path model of personality, job burnout, and work outcomes. J Vocation Behav 76(3):487–506
10. Heckert TM, Farabee AM (2006) Turnover intentions of the faculty at a teaching-focused university. Psychol Rep 99(1):39–45
11. www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/
12. www.kaggle.com/mahapatran/hr-attrition/
13. Zhang J (2004) The optimality of naive bayes. In: Proceedings seventeenth international Florida artificial intelligence research society conference, FLAIRS 2004, vol 1, pp 1–6
14. Steinwart I, Christmann A (2008) Support vector machines. Springer Science and Business Media
15. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
16. Ross Quinlan J (2014) C4.5: programs for machine learning. Elsevier

Track II

Fake News Detection: Experiments and Approaches Beyond Linguistic Features Shaily Bhatt, Naman Goenka, Sakshi Kalra, and Yashvardhan Sharma

Abstract Easier access to the Internet and social media has made disseminating information through online sources very easy. Sources like Facebook, Twitter, online news sites and blogs of self-proclaimed journalists have become significant players in providing news content. The sheer amount of information and the speed at which it is generated online makes it beyond the scope of human verification. There is, hence, a pressing need to develop technologies that can assist humans with automatic fact-checking and reliable identification of fake news. This paper summarises the multiple approaches that were undertaken and the experiments that were carried out for the task. Credibility information and metadata associated with the news article have been used for improved results. The experiments also show how modelling justification or evidence can lead to improved results. Additionally, the use of visual features in addition to linguistic features is demonstrated. A detailed comparison of the results showing that our models perform significantly well when compared to robust baselines, and state-of-the-art models are presented.

1 Introduction

News consumption from social media is nothing short of a double-edged sword. The minimal effort, simple access and quick dispersal of data on the Internet and social


media are increasingly encouraging people to switch from traditional sources of news to online ones. Sources like Facebook, Twitter, online news sites, other social media platforms and personal blogs of self-proclaimed journalists have become significant players in providing news content. The sheer amount of information and the speed at which it is generated and propagated online makes it practically beyond the scope of human verification. There is, hence, a pressing need to develop technologies that can assist humans with automatic fact-checking and reliable identification of fake news. “Fake news” is not a new term. Journalists often define fake news as “viral posts based on fictitious accounts made to look like news reports” [1]. A recent study gave the definition as “news articles that are intentionally and verifiably false and could mislead readers” [2]. While what categorises as fake news is an open debate and has a broad spectrum of social, psychological and factual perspectives attached to it. However, limiting to a simplified definition of fake news as “content that has been intentionally created to mislead the readers” will suffice for the scope of this paper [2]. Here, an empirical study to develop models for fake news detection by leveraging the latest breakthroughs of deep learning, more specifically, neural networks for processing language and visual data is presented. The main contribution of this paper is to substantiate the three hypotheses: First, features other than those of language can significantly improve performance of fake news detection models. Metadata and information about the author/speaker have been employed to create a credibility index which together with linguistic features leads to significant improvement in results. Second, modelling evidence or justification, along with the supervised labels of the articles, improves the model performance significantly. Finally, visual features can benefit the task performance. For substantiating this, experiments exploiting visual features for creating multimodal fake news detection models are presented. Standard datasets which are well accepted within the research communities including LIAR, LIAR-Plus and FakeNewsNet have been used. The techniques used range from traditional machine learning-based approach for establishing a baseline to more advanced and complex neural networks. Throughout the paper, results are compared with robust baselines and state-of-the-art (SOTA) models.

2 Related Work 2.1 What Is Fake News? What categorises as fake news is open to debate on a variety of levels such as psychological, social and contextual foundations [2]. Authenticity and intent are two critical features in fake news [2], and thus, a narrow definition of fake news is “intentionally written misleading content”. A similar definition of fake news is “content whose

Fake News Detection: Experiments and Approaches Beyond …

115

veracity is compromised by intentional deception” [1]. From the perspective of factchecking, fake news can be defined as “news that is verifiably false”. A broader definition would include satire, parodies, unverifiable claims and unintentional rumours as fake news. What categorises as fake news is an open and evolving debate that has a broad spectrum of social, cultural and psychological aspects tied to it. The most common feature accepted by all definitions is “the intentional spread of verifiably false information”. For the scope of this paper, the narrow definition of fake news will be followed.

2.2 Challenges Encountered in Fake News Detection A comprehensive analysis of the challenges faced in automatic fake news detection is described in [3]. The various challenges faced are: First, the involvement of multiple players in the news ecosystem increases the difficulty of building computational, technological and business strategies that can cope with the dynamic and quality information. Second, malicious or adversarial intent is tough to detect. Thus, it is difficult to segregate false information that is spread with the intent of misleading readers from the news that contain false information due to honest mistakes. Third, the lack of public awareness and the vulnerability of the audience is a key factor in the spread of false information. Fourth, social and cultural differences play a role in psychological and contextual interpretation. There may be differences in perspectives which make it challenging to categorise news articles as fake or real. Such news also attacks vulnerable emotions of the audience, hence determining veracity from the style of news content may prove to be unreliable. Furthermore, the dynamic nature of propagation complicates the matters. False information spreads at tremendous rates and changes rapidly as it passes from one user to another. Finally, a significant challenge occurs due to fast-paced developments. Systems relying on external knowledge need to retrieve information on newly emerging facts continuously. This causes static models to suffer from what is known as “concept drift” in machine learning models which means that the data on which the model was trained becomes obsolete.

2.3 Existing Datasets There are several standard datasets publicly available for the research community [3–6]. The most used datasets include BuzzFeedNews, BS Detector, PHEME, CREDBANK, BUZZFACE, FacebookHoax, LIAR and FakeNewsNet.

116

S. Bhatt et al.

The major challenges faced in the creation of datasets from these techniques are: First, crowd-sourced datasets have a degree of doubt associated with the ground truth label itself. This makes models built of such datasets unreliable. Second, there is no algorithm to label websites generating news content as malicious or authentic. The probability of incorrect labels is significant, again making datasets obtained from such websites unreliable. Furthermore, fact-checking websites often focus on specific topics like politifact.com is for political news only. Thus, it is not possible to obtain comprehensive datasets from such websites. Finally, expert fact-checking and human annotation are extremely time-consuming as well as costly. In this paper, the LIAR dataset [7] is used for experimentation on linguistic features with source/author credibility and metadata. The LIAR-Plus [8] which is built on top of the LIAR dataset and contains additional justification or evidence for the label associated with the news is also used. For experiments involving visual features, the FakeNewsNet dataset [9] which contains both news pieces and images, among other features is utilised.

2.4 Classification Methods Fake news detection can been considered as a classification problem: their goal is to provide labels fake or real to a particular news piece. In recent literature, authors have used techniques ranging from machine learning, both supervised and unsupervised, to deep learning. Data mining, time-series analysis and utilising external knowledge bases are also prevalent. Features used include those from linguistic analysis [10], semantic and contextual understanding of language, metadata, multimodal data, network analysis [11] and among others. Authors in [12] report a traditional machine learning-based technique. K-means is used for feature selection, and a supervised learning-based technique, support vector machine (SVM) has been used to classify the fake news from the corpus. The paper [13] addresses the problem of labelled benchmarked datasets by applying a two-path semi-supervised technique. One of the paths is supervised, and the other is unsupervised. For the extraction of the features, a shared CNN has been used. Both the paths are jointly optimised to complete semi-supervised learning. The authors of [14] have proposed a model called FNDnet, which leverages a deep convolutional model for classification. The model achieves the highest accuracy of 98.36 comparable to the SOTA methods by evaluation on the Kaggle fake news dataset. The limitation, however, in this case, is that the model was not tested with other benchmark datasets which are commonly used and accepted by the research community. The paper [15] addresses the problem by applying a capsule neural network which has been previously used in the computer vision tasks and is now receiving attention for use in language tasks. Different embedding models for news items of different lengths have been used, and distinct levels of n-grams have been used for the feature extraction. The model has been tested on LIAR, and ISOT datasets and performance

Fake News Detection: Experiments and Approaches Beyond …

117

better than SOTA is reported in the paper. Comparison of our model performance to this model is presented in later sections. The authors of [16] propose Fakedetector, a novel deep diffusive neural network and perform experiments on a dataset obtained from politifact.com. They obtain the best accuracy in comparison to a number of competitive methods that use textual information for prediction. The dataset used in their experiments has been obtained from the same website from which LIAR and LIAR-Plus have been created. Hence, comparisons of our models to the Fakedetector model described in this paper are given in later sections. In the paper [12], author reported a novice multimodal architecture by considering both the text and image features, and model is evaluated on the self-generated dataset named r/Fakeddit, which is collected from Reddit. Pre-trained InferSent and BERT have been used for the text feature extraction, and VGG16, ResNet 50 and EfficentNet have been used for image feature extraction. We plan to test our multimodal model on this dataset in the future. In [17], authors develop a novel network, a multimodal variational autoencoder (MVAE) to learn features from text and images jointly. The network learns probabilistic latent variable models and couples it with a binary fake news classifier. The model is tested on data from datasets obtained from Twitter and Weibo and reports state-of-the-art results. Authors in [18] have built a multimodal architecture, Similarity-aware multimodal fake news detection model, SAFE, that considers the relationship or similarities between the text and images in the news articles. First, a neural network is used for the text and image feature extraction. Secondly, the relationship among the extracted features across different modalities is investigated, and based on similarities and mismatches, news article is classified. This model has been tested on the FakeNewsNet dataset and outperforms baselines and competitive models to give the best performance in all cases. Comparison of our multimodal results to those obtained in this paper is presented in later sections.

2.5 Evaluation Metrics

Standard metrics for evaluating classifiers, defined by the formulas given here, are precision, recall, accuracy and F1 score. In binary classification, fake news is taken to be the positive class. Thus, when a fake news piece is predicted fake, the example is a true positive (TP); when a real news piece is predicted real, the example is a true negative (TN); when a real news piece is predicted fake, the example is a false positive (FP); and when a fake news piece is predicted real, the example is a false negative (FN).

$$\text{Precision} = \frac{|TP|}{|TP| + |FP|}$$

$$\text{Recall} = \frac{|TP|}{|TP| + |FN|}$$

$$\text{Accuracy} = \frac{|TP| + |TN|}{|TP| + |TN| + |FP| + |FN|}$$

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Specifically, accuracy measures how well the classifier distinguishes between fake and real news, while precision measures the fraction of all predicted fake news that is actually fake. Often, fake news detection datasets are skewed, allowing fewer positive predictions to result in a high precision. Recall is the fraction of actual fake articles predicted as fake and is used to measure the sensitivity of the classifier. F1 provides an overall performance measure by combining precision and recall. Better classifier performance is indicated by a higher value of each of the evaluation metrics.

3 Description of Datasets Used For the experiments of this paper, the datasets used are: LIAR [7], LIAR-Plus [8] and FakeNewsNet [9]. The focus is on the news content and auxiliary features of the dataset. Contextual and linguistic features from both datasets are explored, extracted and modelled. Visual features are utilised from FakeNewsNet. The LIAR dataset contains features including statement (or claim), label (six classes) subject, speaker and auxiliary information about the speaker such as job, political affiliation, state, context/venue of the claim and credit counts. The LIARPlus [7] builds on this by adding a justification column where evidence is provided for why a particular claim is labelled into a particular category. This information is referred to as evidence or justification throughout the paper. Both these datasets have a six-grained labelling with classes as—true, mostly true, half true, barely true, false and pants-on-fire. The FakeNewsNet [9] repository has many features including news content features, network features and spatio-temporal information. In this paper, the focus is on the news content by exploiting visual-based features along with linguistic and contextual features.

4 Experiments and Results

Several techniques of increasing complexity, using different features of the above-mentioned datasets, are explored in our experiments. First, four models on the LIAR and LIAR-Plus datasets that take into consideration only linguistic features are proposed. These


Table 1 Results obtained on LIAR-Plus using regression models
        Six-way classification               Binary classification
Model   Mean accuracy | Variance             Mean accuracy | Variance
LR      0.2287        | 4.092e−05            0.6500        | 6.546e−06
LoR     0.3157        | 4.396e−05            0.6321        | 1.050e−05
OLR     0.2384        | 3.76e−05             0.6500        | 6.266e−06

are—regression model, Siamese network with BERT base [19], sequence model and an enhanced sequence model. Then, experiment with two models on the FakeNewsNet dataset is presented, one of which uses only contextual linguistic features (Sequence model) and another that additionally uses visual features (CNN model). This provides us with a comparison of whether visual features are useful in distinguishing fake news. The specifics of techniques and results obtained for each of the techniques are described as follows.

4.1 Regression Model

To begin with, one of the most elementary machine learning techniques, regression, is used. The purpose of this experiment is to establish a baseline for automatic fake news detection. GloVe embeddings [20] are used to encode the text, and the model is tested on the LIAR-Plus dataset. Standard pre-processing steps such as removal of stop words, neglecting casing, substituting missing values with the average and ignoring words not present in GloVe were applied. Note that these pre-processing steps also apply to all models described in later sections. The regression models used are linear regression (LR), logistic regression—one versus rest (LoR)—and ordinal logistic regression (OLR). The results obtained with fivefold cross validation gave the highest mean accuracy of 31.57% on six-way classification. The model is then adapted to binary classification: the FALSE category includes the classes pants-on-fire, false and barely true, and the TRUE category includes half true, mostly true and true. The highest mean accuracy on fivefold cross validation is 65%. The results are shown in Table 1.
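A minimal sketch of this baseline, assuming pre-loaded GloVe vectors in a dictionary and a scikit-learn one-vs-rest logistic regression evaluated with fivefold cross validation; the variable names and dimensionality are illustrative placeholders, not the paper's exact pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def encode(statement, glove, dim=300):
        """Average the GloVe vectors of known words; ignore out-of-vocabulary words."""
        vecs = [glove[w] for w in statement.lower().split() if w in glove]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    # glove: dict mapping word -> vector; statements, labels: lists from LIAR-Plus (assumed loaded).
    X = np.vstack([encode(s, glove) for s in statements])
    y = np.array(labels)                     # six-way or binarised labels

    lor = LogisticRegression(multi_class="ovr", max_iter=1000)
    scores = cross_val_score(lor, X, y, cv=5, scoring="accuracy")
    print("Mean accuracy:", scores.mean(), "Variance:", scores.var())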

4.2 Siamese Network with BERT An artificial neural network that uses the same weights while working on two different inputs to produce comparable outputs is known as a Siamese neural network. The following Siamese models with BERT in the base architecture were used on the LIAR-Plus dataset: one-branch, two-branch and triple-branch Siamese networks


were utilised. The best results were obtained with the triple-branch network. The model architectures shown in Fig. 1 are described below:

1. Single Branch: The input sequence is first passed through a pre-trained BERT model. The BERT architecture is fine-tuned by passing the tensor from BERT through a linear fully connected (FC) layer. This gives a binary output for fake or true labels. Here, no metadata or justification is used for training; only a single branch with the news statement is used. A testing accuracy of 60% is obtained on binary classification.
2. Two Branch: The news statement and justification were used to create the two branches of this network. These are passed through a linear FC layer after concatenation. Both branches share weights. This architecture makes use of the justification along with the statement, giving a better result than the single branch. A binary classification accuracy of 65.4% and a six-way classification accuracy of 23.6% were obtained using this method. As evident, this is an improvement over the single-branch model, substantiating the hypothesis that using justifications can give better results.

Fig. 1 Architecture of BERT-based Siamese network (panels: Single Branch, Two Branch, Triple Branch)

3. Triple Branch: An additional branch is added in this approach. This branch takes as input the additional available metadata such as speaker, source, affiliation, etc. The authenticity of the publisher is considered using the feature "credit score" (CS) as defined in Eq. (1). The CS was added to the concatenation of the outputs from the three branches. The length of the input sequence of each branch is set equal to the average input length in that branch. The six-way classification accuracy improved by a huge margin: 37.4% and 77.2% were the highest accuracies obtained for six-way and binary classification, respectively. The binary classification accuracy is 7.2% higher than the accuracy obtained by the original authors.

As evident, the accuracy of the two-branch network is higher than that of the one-branch network, substantiating the claim that justification modelling can lead to improved accuracies. Further, the triple-branch accuracy is even better, as both metadata and justification are used. Once again, this supports the hypothesis that features beyond linguistic ones can help improve model performance. There is scope for further improvement by finding better ways to integrate metadata and credit score and by further fine-tuning the model. These results are further improved in the next experiments with the sequence model and the enhanced sequence model.

Definition of Credit Score (CS): The scalar credit score (CS) is indicative of the credibility of the author, calculated from the counts of false news propagated by the author in the past. A weighted aggregate of the six-grained counts provided for every author, followed by a tanh activation, is used to calculate CS. The weights are taken as hyperparameters and are not tuned by the model. The scalar is defined as in Eq. (1):

$$\mathrm{CS} = \tanh\!\left(w \cdot \frac{0.2\,\mathrm{MTC} + 0.5\,\mathrm{HTC} + 0.75\,\mathrm{BTC} + 0.9\,\mathrm{FC} + 1\,\mathrm{PFC}}{\mathrm{MTC} + \mathrm{HTC} + \mathrm{BTC} + \mathrm{FC} + \mathrm{PFC}} + b\right) \quad (1)$$

Here, MTC refers to the mostly true count for the speaker, HTC to the half-true count, BTC to the barely true count, FC to the false count and PFC to the pants-on-fire count. The credit score is passed to a 1-neuron dense layer to learn the relative importance of the credit score in determining the final claim; w and b are the weight and bias learned during training. The scalar is biased towards authors with more false counts due to the progressively higher weights from mostly true to pants-on-fire counts. The rationale behind choosing such a weighting scheme is that knowledge of an author having made false statements is intuitively more critical for judging the credibility of his/her statements. The credit score is useful in distinguishing fake and real news by creating a relative difference in the activation outputs, because the higher the credit score, the less reliable the person making the claim.
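A small sketch of Eq. (1) is shown below. The fixed weights come from the equation above, while w and b would normally be learned by the 1-neuron dense layer; here they are plain placeholders.

```python
import numpy as np

def credit_score(mtc, htc, btc, fc, pfc, w=1.0, b=0.0):
    """Credit score of Eq. (1): weighted share of a speaker's past false-leaning counts.

    mtc/htc/btc/fc/pfc: mostly-true, half-true, barely-true, false and pants-on-fire counts.
    w, b: weight and bias of the 1-neuron dense layer (learned during training; placeholders here).
    """
    total = mtc + htc + btc + fc + pfc
    if total == 0:
        return 0.0
    weighted = 0.2 * mtc + 0.5 * htc + 0.75 * btc + 0.9 * fc + 1.0 * pfc
    return float(np.tanh(w * weighted / total + b))

# A speaker with many pants-on-fire counts gets a score close to 1 (less reliable).
print(credit_score(mtc=2, htc=1, btc=0, fc=5, pfc=10))
```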


4.3 Sequence Model Sequence models have been successfully used in various NLP tasks. Two models are trained here: one without using the justifications, as in LIAR, and another using the justifications from LIAR-Plus. The data was pre-processed with standard techniques like removing stop words, neglecting casing, substituting missing values with the average, etc. GLoVe vector embeddings [20] of dimension 100 were used to input the statements and justifications, and the rest of the data was fed to a feed-forward neural network. The architectures of these models are shown in Fig. 2. In both architectures, the first input branch is for the encoded statement and the second input branch is for speaker-related metadata. The third input branch in the model with justification corresponds to the encoded justification from LIAR-Plus. In the model architecture shown in the figure, LSTM nodes refer to a standard LSTM layer with 128 cells. Dropout nodes refer to regularisation dropout, with probability 0.15 in the statement branch and 0.2 in the justification branch, to prevent overfitting. Dense nodes refer to fully connected feed-forward layers with 32 units in the statement and justification branches and 64 units in the metadata branch, followed by ReLU activation. The concatenate node is a concatenation layer that combines the outputs of the branches. The final dense node after concatenation is a feed-forward layer with softmax activation for the output. The binary classification model was trained for 120 epochs with a batch size of 512, and the six-grained classification model was trained for 40 epochs.

Fig. 2 Architecture of sequence models (left: model without justification; right: model with justification)


Table 2 Results obtained by sequence model

Classification | Dataset   | Justification | Accuracy (Training) | Accuracy (Testing)
Binary         | LIAR      | No            | 0.8192              | 0.7862
Binary         | LIAR-Plus | Yes           | 0.8559              | 0.8205
Six-way        | LIAR-Plus | Yes           | 0.5439              | 0.5015

The data was distributed into 16,000 training rows, 4000 validation rows and 1744 testing rows. The results are shown in Table 2. As expected, the sequence model shows significant improvements on both binary and six-grained classification on the LIAR as well as LIAR-Plus datasets. Also as expected, accuracy with justification is better in both binary and six-way classification.

4.4 Enhanced Sequence Model The sequence model is enhanced by introducing an additional branch with the "credit score" (CS) as defined in Eq. (1). The model architecture is shown in Fig. 3. The four branches take as inputs the statement (S branch), metadata (M branch), justification (J branch) and credit score (C branch). The sizes of the input layers are as mentioned in Fig. 3. The LSTM node is a standard LSTM layer with 128 cells in the S and J branches. The dropout node is regularisation dropout to prevent overfitting in the S and J branches, with dropout probabilities of 0.15 and 0.21, respectively. The dense layer is a feed-forward fully connected layer with 32 units each in the S and J branches, 64 units in the M branch and 1 unit in the C branch. This is followed by a ReLU activation in the first three branches and a tanh activation in the C branch, as in the definition of the credit score. The concatenate node concatenates the results of the S, M and J branches, which are then passed to the add node, which adds them to the result from the C branch. Finally, a single dense layer, with sigmoid activation in the case of binary classification and softmax in the case of six classes, gives the final label output. Categorical cross-entropy loss and the ADAM optimiser were used for training this model for multi-class fine-grained classification; binary cross-entropy loss and the ADAM optimiser were used for binary classification. Early stopping via a Keras callback, with a patience of 15 epochs and validation loss as the monitored quantity, was used to prevent overfitting. For binary classification, the model was trained for 500 epochs, and for multi-class classification the model was stopped early at the 230th epoch. The ADAM learning rate was tuned via grid search to 0.001 for both classification tasks. All results were verified using fivefold stratified cross-validation in both classification tasks. The batch size was tuned via grid search to 256. Table 3 outlines our best results achieved.
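The following Keras sketch illustrates this four-branch layout under stated assumptions: vocabulary size, sequence lengths and metadata width are placeholders, the embeddings would be 100-d GLoVe vectors in practice, and the C branch receives the precomputed weighted count ratio of Eq. (1).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, EMB_DIM = 20000, 100            # placeholders; the paper uses 100-d GLoVe vectors
S_LEN, J_LEN, META_DIM = 60, 120, 20   # assumed sequence lengths / metadata width

def text_branch(length, dropout_rate, name):
    """Statement / justification branch: Embedding -> LSTM(128) -> Dropout -> Dense(32, relu)."""
    inp = layers.Input(shape=(length,), name=name)
    x = layers.Embedding(VOCAB, EMB_DIM)(inp)
    x = layers.LSTM(128)(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Dense(32, activation="relu")(x)
    return inp, x

s_in, s_out = text_branch(S_LEN, 0.15, "statement")       # S branch
j_in, j_out = text_branch(J_LEN, 0.21, "justification")   # J branch

m_in = layers.Input(shape=(META_DIM,), name="metadata")   # M branch (speaker metadata)
m_out = layers.Dense(64, activation="relu")(m_in)

c_in = layers.Input(shape=(1,), name="credit_ratio")      # C branch: weighted ratio from Eq. (1)
c_out = layers.Dense(1, activation="tanh")(c_in)          # 1 unit + tanh gives the credit score

merged = layers.Concatenate()([s_out, m_out, j_out])
# Add the scalar credit score to the concatenated representation (broadcast over features).
added = layers.Lambda(lambda t: t[0] + t[1])([merged, c_out])
out = layers.Dense(1, activation="sigmoid")(added)        # use Dense(6, "softmax") for six-way

model = Model([s_in, m_in, j_in, c_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                              restore_best_weights=True)
# model.fit([...], y, batch_size=256, epochs=500, validation_data=..., callbacks=[early_stop])
model.summary()
```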


Fig. 3 Architecture of enhanced sequence model with details of layers and hyperparameters

Table 3 Results of enhanced sequence model

Classification | Accuracy (Training) | Accuracy (Testing) | Precision (Testing) | Recall (Testing) | F1 score (Testing)
Binary         | 0.8403              | 0.8297             | 0.734               | 0.712            | 0.722
Six-way        | 0.5370              | 0.5272             | 0.43a               | 0.42a            | 0.42

a Refers to the macro average over all six classes

Significantly better performance than the scores reported in the papers [7, 8] is obtained. These results clearly show that using a weighted aggregate credit score can give better performance. Finally, a comparison of our results to the performance of Fakedetector [16], which reports SOTA performance (to the best of our knowledge)


after comparison to various baselines and competitive models, is described here. The authors use a dataset with ~14k examples (slightly fewer than the ~16k examples in LIAR-Plus) which has also been obtained from PolitiFact.com, along with justification and other metadata. This is a fair comparison due to the strong similarity in dataset size, available features and dataset sources. Fakedetector obtains a maximum accuracy of 0.64 on binary classification and 0.29 on six-way classification. Our scores using the triple-branch Siamese network as well as the enhanced sequence model exceed these results. This improvement can be attributed to the effectiveness of the newly introduced credibility index, the credit score (Eq. 1), which is not used in Fakedetector.

4.5 Sequence Model on FakeNewsNet The FakeNewsNet dataset contains a variety of features including linguistic and visual features. First, experiments with a sequence model similar to the one used in the previous sections are presented, to establish a baseline score for comparison with the multimodal model. An F1 score of 98.74% is obtained on training, and 93.71% on validation, for binary classification. The scores of this sequence model are not compared with those of the previous one because different datasets are used in the two cases. The scores of models trained on FakeNewsNet are better owing to the larger number of examples in the dataset. The dataset also does not have justifications or six-grained labels, and hence those comparisons are not made. The purpose of this model is solely to establish a baseline against which the performance of the multimodal model will be compared. The model loss is shown in Fig. 4.

4.6 Convolutional Model for Linguistic and Visual Features A CNN is trained on the features of text and images simultaneously [21]. Multiple convolutions are employed to capture the hidden features of text and images. Features are classified as latent (hidden) or explicit. Two parallel CNNs extract features from text and images, respectively. The latent and explicit features are then projected onto the same feature space, and these representations are fused to give the output. The model loss is depicted in Fig. 5. An F1 score of 99.2% is obtained in training, and 96.3% in validation. It is clearly evident from these results that using visual features in addition to linguistic features for fake news detection can lead to a performance enhancement. This substantiates the assumption that the images often associated with news posts on social media can be an important indication of the veracity of the news item.
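A minimal sketch of such a two-stream model is given below, assuming a 1-D text CNN over word embeddings and a small 2-D image CNN whose representations are projected onto a shared space and fused; the layer sizes are placeholders and not the exact configuration of [21].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

TEXT_LEN, VOCAB, EMB = 200, 20000, 100     # placeholders
IMG_H, IMG_W = 64, 64

# Text CNN: 1-D convolutions over the word-embedding sequence.
t_in = layers.Input(shape=(TEXT_LEN,), name="text")
t = layers.Embedding(VOCAB, EMB)(t_in)
t = layers.Conv1D(128, 5, activation="relu")(t)
t = layers.GlobalMaxPooling1D()(t)

# Image CNN: 2-D convolutions over the associated image.
i_in = layers.Input(shape=(IMG_H, IMG_W, 3), name="image")
i = layers.Conv2D(32, 3, activation="relu")(i_in)
i = layers.MaxPooling2D()(i)
i = layers.Conv2D(64, 3, activation="relu")(i)
i = layers.GlobalAveragePooling2D()(i)

# Project both modalities onto the same feature space, then fuse for the final decision.
t_proj = layers.Dense(128, activation="relu")(t)
i_proj = layers.Dense(128, activation="relu")(i)
fused = layers.Concatenate()([t_proj, i_proj])
out = layers.Dense(1, activation="sigmoid")(fused)

model = Model([t_in, i_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```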


Fig. 4 Loss of sequence model on FakeNewsNet

Fig. 5 Loss of CNN model on FakeNewsNet

Finally, a comparison of our results to the SOTA performance reported in the literature is described as follows. The MVAE, a multimodal model proposed in [17], reports an F1 score of 73% on a multimodal dataset from Twitter and 83.7% on a multimodal dataset from Weibo. Our model gives a better F1 performance; however, this difference may be partly due to the difference in datasets. The authors in [18] propose a multimodal model, SAFE, which they rigorously test on the FakeNewsNet dataset. They compare their model to multiple baselines as well as competitive models and


report SOTA performance (to the best of our knowledge) with an average F1 score of 0.8955. Our model improves on this performance by 6 points using the parallel CNN method described in this section.

5 Conclusion In this paper, an empirical study of multiple models of varied complexities for fake news detection is presented. We attempt to go beyond using just linguistic features to improve model performance on the task. The hypotheses proved using experiments are the following. First, a credibility index of the source/speaker and the associated metadata can significantly improve model performance. Second, modelling evidence or justification along with the news claim significantly improves model performance. Finally, multimodal models that exploit visual features from images associated with news articles can perform better than models that utilise only linguistic and contextual features. Experiments are carried out using the LIAR, LIAR-Plus and FakeNewsNet datasets, and comparisons of our results with baselines and SOTA models are presented. Our models gave results comparable to, and in several cases better than, SOTA models. The best accuracies on binary and six-way classification on the LIAR-Plus dataset are obtained using an enhanced LSTM-based sequence model which uses linguistic features, credit scores, metadata and justification. An improvement in performance on the FakeNewsNet dataset is obtained using a multimodal model as compared to the model that uses only linguistic features. The findings in the case of the multimodal model are particularly encouraging. In the future, we plan to explore the further integration of features from visual data for building better fake news detection systems.

References
1. Tandoc EC Jr, Lim ZW, Ling R (2018) Defining "fake news": a typology of scholarly definitions. Digit J 6(2):137–153
2. Allcott H, Gentzkow M (2017) Social media and fake news in the 2016 election. J Econ Perspect 31(2):211–236
3. Cardoso Durier da Silva F, Vieira R, Garcia AC (2019) Can machines learn to detect fake news? A survey focused on social media. In: Proceedings of the 52nd Hawaii international conference on system sciences
4. Shu K, Sliva A, Wang S, Tang J, Liu H (2017) Fake news detection on social media: a data mining perspective. ACM SIGKDD Explorat Newsl 19(1):22–36
5. Bondielli A, Marcelloni F (2019) A survey on fake news and rumour detection techniques. Inf Sci 497:38–55
6. Sharma K, Qian F, Jiang He, Ruchansky N, Zhang M, Liu Y (2019) Combating fake news: a survey on identification and mitigation techniques. ACM Trans Intel Syst Technol (TIST) 10(3):1–42


7. Wang WY (2017) "Liar, liar pants on fire": a new benchmark dataset for fake news detection. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 2: short papers), pp 422–426
8. Alhindi T, Petridis S, Muresan S (2018) Where is your evidence: improving fact-checking by justification modeling. In: Proceedings of the first workshop on fact extraction and verification (FEVER), pp 85–90
9. Shu K, Mahudeswaran D, Wang S, Lee D, Liu H (2018) FakeNewsNet: a data repository with news content, social context and dynamic information for studying fake news on social media. arXiv preprint arXiv:1809.01286
10. Pérez-Rosas V, Kleinberg B, Lefevre A, Mihalcea R (2018) Automatic detection of fake news. In: Proceedings of the 27th international conference on computational linguistics, pp 3391–3401
11. Conroy NK, Rubin VL, Chen Y (2015) Automatic deception detection: methods for finding fake news. Proc Assoc Inf Sci Tech 52(1):1–4
12. Yazdi KM, Yazdi AM, Khodayi S, Hou J, Zhou W, Saedy S (2020) Improving fake news detection using K-means and support vector machine approaches. Int J Electron Commun Eng 14(2):38–42
13. Dong X, Victor U, Qian L (2020) Two-path deep semi-supervised learning for timely fake news detection. arXiv preprint arXiv:2002.00763
14. Kaliyar RK, Goswami A, Narang P, Sinha S (2020) FNDNet: a deep convolutional neural network for fake news detection. Cogn Syst Res 61:32–44
15. Goldani MH, Momtazi S, Safabakhsh R (2020) Detecting fake news with capsule neural networks. arXiv preprint arXiv:2002.01030
16. Zhang J, Dong B, Yu PS (2020) Fakedetector: effective fake news detection with deep diffusive neural network. In: 2020 IEEE 36th international conference on data engineering (ICDE). IEEE
17. Khattar D et al (2019) MVAE: multimodal variational autoencoder for fake news detection. In: The world wide web conference
18. Zhou X, Wu J, Zafarani R (2020) SAFE: similarity-aware multi-modal fake news detection. arXiv preprint arXiv:2003.04981
19. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
20. Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
21. Yang Y, Zheng L, Zhang J, Cui Q, Li Z, Yu PS (2018) TI-CNN: convolutional neural networks for fake news detection. arXiv preprint arXiv:1806.00749

Object Recognition and Classification for Robotics Using Virtualization and AI Acceleration on Cloud and Edge

Aditi Patil and Nida Sahar Rafee

Abstract With the development of cloud robotics, a much broader scope of multidisciplinary applications for creating smart systems is now available. The "artificially intelligent" system's brain resides in the cloud, which can hold data centres, deep learning models, communication support, etc. With the help of edge computing and VM-based cloudlets, distributing a deep learning deployment is a more practical option than one single system doing all the tasks. Mobile applications and IoT devices often produce streaming data which requires real-time analysis and control. When the application involves end devices such as Raspberry Pi boards and laptops working at the edge, acceleration of the result generation is also necessary. This paper presents observations from the implementation of one such machine learning application, object detection and recognition with You-Only-Look-Once (YOLO), in a robotic environment with a number of client and server ends. Differentiating cloud and edge, we demonstrate the analysis and results, where a clear gain in output efficiency is seen with AI acceleration using toolkits such as Intel's OpenVINO.

1 Introduction Time-critical applications and latency go hand in hand. In understanding edge-native applications [1], we realized that AI-based time-critical applications are interfaced differently. By using newer techniques and implementation methods, we can reduce latency. Latency is the time it takes for data or a request to go from the source to the destination; equivalently, latency measures the delay between an action and a response.


Edge computing implementations promise cloud-like resources with low latency and high bandwidth, resulting in low network overheads [2]. In general, latency can arise from the network and from the application.

1. Latency because of the network [3]:
   a. Transmission mediums such as WAN or fiber optic cables all have limitations and can affect latency simply because of their nature.
   b. Propagation is the time it takes for a packet to travel from one source to another (at the speed of light).
   c. Routers take time to analyze the header information of a packet and, sometimes, add additional information. Each hop a packet takes from router to router increases the latency time.
   d. Storage delays can occur when a packet is stored or accessed, resulting in a delay caused by intermediate devices like switches and bridges.

2. Application latency:
   a. Application design: the design of the application and the data transmission between the logical layers add to the latency.
   b. Database response: the time taken to retrieve records from the database, the type of database used and complex data schemes can add to the database latency.
   c. Computation: more powerful machines will compute data faster, while machines with lower compute capacity will take more time. Hence, the hardware plays an important role.

One aspect to consider is that network latency due to distance (the speed of light) can be reduced by bringing computers closer to the end-user. Doing so reduces the number of network hops and hence the delay in getting the response. Another aspect to consider is computational latency. While computation depends on factors like hardware, processing power and the needs of the application itself, if we speed up the processing then latency can be reduced. To work on this hypothesis, taking cloud robotics as a use case, we isolated the AI inference models and worked on moving the computational inference to the Raspberry Pi using cloudlets [4]. Warehouse robotics, on the other hand, uses embedded AI which does localized computation but reduces the battery life of the system. A typical need for robots in warehouses is object recognition and making decisions based on the recognized objects. For this need, we use YOLOv3, a well-known object detection model. In our case, object recognition is required by the robots in the warehouse. By trying out different methodologies, we observed latency variance in different cases of moving servers between the cloud and the edge. Similarly, we observed latency variance in computation when using AI acceleration, here Intel OpenVINO.


In this paper, we cover the different configurations that we evaluated and the corresponding results.

2 Literature Survey 2.1 YOLO J. Redmon, S. Divvala et al. proposed and developed an object detection technique named You Only Look Once (YOLO) [5]. In YOLO, object detection is treated as a regression problem, so that every detected object has a bounding box associated with its object class. In a single evaluation of the frame, the YOLO algorithm uses its unified architecture of a single neural network to predict bounding boxes and class probabilities directly from full images.

2.2 Cyber Foraging and Cloudlets Grace Lewis, Sebastián Echeverría, et al. have described tactical cloudlets and also mentioned the experimental results for five tactical cloudlets provisioning mechanisms [6]. They help in moving cloud computing concepts closer to the edge for enhanced decision making even at the edge. In this implementation, a very thin client running on a mobile device and a computer-intensive server working on the cloudlet are the static partition done in the design. These two applications are the key elements called mobile client and cloudlet host for the architecture (refer to Fig. 1). The first stage of the working of the architecture is cloudlet discovery, where the cloudlet client finds the appropriate cloudlet for offload and connection. The second stage involves cloudlet provisioning, which is classified into five different ways using: optimized VM synthesis, application virtualization, cached VM, cloudlet push, or on-demand VM provisioning. The last stage involves the execution of the application, where the client app is notified that it is ready for execution once the service VM is started.

2.3 Intel OpenVINO The full name of OpenVINO is the Open Visual Inference and Neural Network Optimization toolkit (formerly the Intel Computer Vision SDK) [7]. The Intel OpenVINO deep learning workbench provides, under one umbrella, a range of tools in the OpenVINO deep learning development toolkit (DLDT) [8]. The OpenVINO DLDT is an inferencing toolkit which ensures the fast deployment of cognitive machine learning applications


Fig. 1 Cloudlet model (Elijah)

on Intel hardware. The purpose of using the OpenVINO toolkit [9] is to enhance the performance of deep learning applications. The OpenVINO deep learning development toolkit creates these enhancements by streamlining the workflow of a neural model for optimal execution on Intel hardware. OpenVINO uses a model optimizer to make inference faster (refer to Fig. 2). The role of the OpenVINO model optimizer [10] is to load a model into memory, read it, and build an internal representation of the model. It optimizes the model and produces an intermediate representation in the form of two files. The optimized intermediate representation removes layers that are necessary during training but redundant during inference. The model optimizer also fuses groups of operations into a single operation, hence decreasing the inference time.
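The following is a small sketch, under stated assumptions, of how the two intermediate-representation files (.xml/.bin) produced by the model optimizer can be loaded and run with the Inference Engine Python API of the OpenVINO releases from around 2020; file names are placeholders and newer OpenVINO versions expose a different API.

```python
import numpy as np
from openvino.inference_engine import IECore   # Inference Engine API (OpenVINO 2020-era releases)

ie = IECore()
net = ie.read_network(model="yolov3.xml", weights="yolov3.bin")   # IR files from the model optimizer
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
n, c, h, w = net.input_info[input_name].input_data.shape

frame = np.random.rand(n, c, h, w).astype(np.float32)   # stand-in for a preprocessed camera frame
results = exec_net.infer(inputs={input_name: frame})    # dict: output layer name -> ndarray
for name, blob in results.items():
    print(name, blob.shape)
```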

Fig. 2 OpenVINO model optimizer


Fig. 3 Backend processing of Gabriel on the cloudlet

2.4 Gabriel Cloud and VM Gabriel is a high-level design for cognitive assistance systems. In robotic implementations, synthesis varies from real-time analysis to multiple sensor inputs which involves a diversity of data used. It should process all these inputs for effective execution in parallel for a single synthesized output. While implementing machine learning algorithms on edge-native applications for interaction in real-life scenarios, Gabriel works as a platform as a service (PaaS) layer on cloudlets [11]. Gabriel encapsulates each type of application supported by them, namely cognitive engines, along with its operating system, dynamically linked libraries, supporting toolchains and applications, configuration files, and data sets in its virtual machine (VM) [12]. One single control VM works as a single interaction unit through which the mobile IoT device will communicate to the cloudlet (refer Fig. 3). The VM, hence, processes input and gives the output back to the end-user device. This cloudlet runs the Gabriel system and provides practical cognitive assistance.

2.5 Raspberry Pi and Camera Module Raspberry Pi is a family of microcomputers that can be coded and directly operated using an OS such as Raspbian. We can consider it as a computer with limited capacity in terms of memory and processing [13]. Along with this system, for object detection, we use a Raspberry Pi camera module [14]. A Raspberry Pi camera module is a portable, lightweight camera that supports Raspberry Pi. Using the MIPI camera


serial interface, it communicates with the Raspberry Pi computer. Hence, to get visual input, the Raspberry Pi and the Raspberry Pi camera module work together.

3 Methodology

3.1 Preparation

(a) YOLO algorithm: The implementation of the YOLO algorithm is loaded with the help of three files: a weight file, a configuration file and a names file. The image frames used for analysis are converted into a blob to extract features and resize them through the deep neural network layers with the help of the OpenCV library. The confidence of each detection is calculated and compared to a threshold value. In order to avoid detecting the same object multiple times, a non-maximum suppression function is used. Accordingly, the object is detected with the YOLO algorithm, and its label and confidence are displayed. A sketch of this step is shown after this list.
(b) Localhost, client–server coordination: The client and server coordination works in two steps. The Gabriel server creates a socket with port number 9099 and waits for the client-side device to join using the same server host and port number. Once the client joins, it establishes the connection and sends the image frames to the server end for processing by the YOLO algorithm. The client and server codes are written in Python.
(c) OpenVINO (setup): OpenVINO is set up on a functioning Gabriel VM on an EC2 instance running Ubuntu Server. On the server, a swap space is created, which helps allocate a small additional amount of RAM by moving inactive pages in memory to the swap space. After this, OpenVINO is installed on the server, followed by configuring the OpenVINO model optimizer. The OpenVINO model optimizer generates two files with the intermediate representation, which is then loaded into memory.
(d) AWS setup (EC2): AWS EC2 is chosen as the cloud instance running remotely. The cloud here refers to an Amazon EC2 instance in the Singapore data centre. On the same cloud instance, the Gabriel server is configured with the same client–server method, and then OpenVINO is configured with the model optimization. The EC2 is a t2.micro instance.
(e) Raspberry Pi: A Raspberry Pi 4 is used as an edge server placed locally in the same network as the client. The same steps of Gabriel server configuration are repeated first, and then OpenVINO is additionally set up.
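The sketch below illustrates step (a) with OpenCV's DNN module: the three files are loaded, a blob is built from a frame, low-confidence detections are filtered and non-maximum suppression removes duplicates. The file names and thresholds are assumptions, not the exact ones used in the project.

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")   # configuration + weight files
classes = open("coco.names").read().strip().split("\n")            # names file

image = cv2.imread("frame.jpg")
h, w = image.shape[:2]
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
layer_outputs = net.forward(net.getUnconnectedOutLayersNames())

boxes, confidences, class_ids = [], [], []
for output in layer_outputs:
    for det in output:                        # det = [cx, cy, bw, bh, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        conf = float(scores[class_id])
        if conf > 0.5:                        # confidence threshold
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(conf)
            class_ids.append(class_id)

# Non-maximum suppression removes duplicate detections of the same object.
keep = cv2.dnn.NMSBoxes(boxes, confidences, score_threshold=0.5, nms_threshold=0.4)
for i in np.array(keep).flatten():
    print(classes[class_ids[i]], confidences[i], boxes[i])
```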


3.2 Implementation Considerations

In order to maintain uniformity of the results, consistent performance, environment and input parameters are maintained throughout the different stages of implementation.

(a) Same model: The same YOLOv3 model is used throughout all the stages, and the same model is optimized using OpenVINO.
(b) Latency tabulation: A log file is used to tabulate the time taken to process and retrieve the data. This maintains consistency in the inputs and outputs. A small sketch of this logging is shown after this list.
(c) Same hardware and resources: The same cloud server and Raspberry Pi server were used to check the output results, ensuring that there is no unintended variance in compute or network latency.
(d) Same input parameters: To ensure the same input parameters, the same images were used while generating the results.
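The following is a minimal, illustrative way to log per-run latency to a file and average it over trial runs, in the spirit of item (b); the file name, format and stage labels are assumptions.

```python
import time
import statistics

def timed(fn, *args, log_file="latency.log", stage="gabriel+cloud"):
    """Run fn, append the elapsed time (in ms) to a log file and return the result."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    with open(log_file, "a") as f:
        f.write(f"{stage},{elapsed_ms:.4f}\n")
    return result

def average_latency(log_file="latency.log", stage="gabriel+cloud"):
    values = [float(line.split(",")[1]) for line in open(log_file) if line.startswith(stage)]
    return statistics.mean(values) if values else None

# Example: wrap a dummy processing call for 20 trial runs, then average.
for _ in range(20):
    timed(lambda: sum(range(100000)))
print("mean latency (ms):", average_latency())
```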

3.3 Stages of Implementation

In order to show the latency improvement using our methodology compared with processing without it, the same inference model is computed in multiple setups, and at every stage different results are obtained.

(a) Laptop as a server as well as client (localhost): The YOLO algorithm implementation is divided into two different processes as the very first step towards the effective and collective work of a distributed system. A socket connection is established between server and client. The visual input, an RGB image of 416 × 416, is captured by the web camera and given to the localhost address; the web camera works as the client side. The input is received by the server side, where the YOLO algorithm is executed, and the output with the correct detections and bounding boxes is returned to the client side. This is the first step for checking the successful implementation of the YOLO algorithm after dividing the entire process between the client and the server end.
(b) Laptop as a server and cloud as client: In order to baseline the numbers and results, it was essential to do vanilla testing with no VM loading features and without any AI acceleration. The client code was moved to the cloud and executed to baseline the YOLO algorithm between the laptop and the cloud. The camera of the laptop was used as the input stream, and the inference was generated on the cloud (refer to Fig. 4).


Fig. 4 Client–server implementation a using Gabriel and b using Gabriel and OpenVINO on cloud

(c) Gabriel server (cloud) and laptop as client: The Gabriel server instance works as the cognitive assistant implementing the YOLO algorithm for visual input. This visual input is sent by the laptop web camera, which acts as the client in this case, in byte-string format. The Gabriel server executes the YOLO algorithm, and the output of detected objects is given back to the client. It returns the result in byte-string format with labels and confidence values, which are displayed with the bounding boxes on the client screen.
(d) Gabriel server (cloud + OpenVINO) and laptop as client: The next stage offloads the operation to the cloud with OpenVINO configured alongside the Gabriel server. Both factors, the offload to the cloud and OpenVINO's integrated working, create a significant change in latency (refer to Fig. 4).
(e) Gabriel server (Raspberry Pi) and laptop as client: At this stage, the server end is changed from the laptop to a Raspberry Pi (refer to Fig. 5). The Raspberry Pi acts as a local VM closer to the end-user. The client connects to Gabriel's port 9099 to establish a connection. The visual input is sent in byte-string form to the VM, which executes the YOLO algorithm and sends back the visual output with labels and confidence values to the client device in byte-string format. Figure 6 shows the outputs.

Fig. 5 Client–server implementation a using Gabriel and b using Gabriel and OpenVINO using Raspberry Pi


Fig. 6 Latency variance in different stages

(f) Gabriel server (Raspberry Pi + OpenVINO) and laptop as client: The change in this stage is that the server end is an IoT device, i.e. a Raspberry Pi 4, and the computation is done by the Gabriel server residing on the Raspberry Pi with the acceleration improvements of OpenVINO (refer to Fig. 5). Multiple IoT devices can be connected to the same Raspberry Pi instead of a single laptop client.

A minimal sketch of the frame exchange between client and server used across these stages is shown below.
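This sketch only illustrates the pattern of sending an encoded frame as a length-prefixed byte string to port 9099 and receiving a result string back; it is not the Gabriel protocol itself, and the frame bytes and result string are placeholders.

```python
import socket
import struct
import threading
import time

HOST, PORT = "127.0.0.1", 9099   # Gabriel server port mentioned in the paper

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            size = struct.unpack(">I", conn.recv(4))[0]        # 4-byte length prefix
            payload = b""
            while len(payload) < size:
                payload += conn.recv(size - len(payload))
            # Here the frame bytes would be decoded and passed to the YOLO model.
            conn.sendall(b"label:person confidence:0.93")       # placeholder result string

def client(frame_bytes):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(struct.pack(">I", len(frame_bytes)) + frame_bytes)
        print(cli.recv(1024).decode())

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)                                   # give the server a moment to start listening
client(b"\xff\xd8 fake jpeg bytes \xff\xd9")      # stand-in for an encoded camera frame
```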

4 Results

After running object detection for tables, chairs and elements such as pen and paper, the same model produced different numbers for the different implementations. We averaged out all the numbers. The latency using Gabriel, Raspberry Pi and OpenVINO is the lowest. The averages of 20 trial runs are indicated in Fig. 7. The number of frames per second (FPS) also improved: in the case of EC2 the FPS was about two, but when we moved to the Raspberry Pi with OpenVINO and Gabriel it was about six.

Fig. 7 Results of the implementation

Implementation stage                | Latency (ms)
Laptop (as server + client)         | 1.15
Laptop + cloud                      | 1.32
Gabriel + cloud                     | 1.28
Gabriel + Raspberry Pi              | 0.345
Gabriel + cloud + OpenVINO          | 0.045
Gabriel + Raspberry Pi + OpenVINO   | 0.0253


Fig. 8 Relationship of optimization and efficient resource location with latency

As indicated in Fig. 8, the results are very clear. As the optimization of the application increases (virtualization, intermediate representation and moving the hardware close to the location of computing), the latency steadily decreases. By moving the computation from cloud to edge, there is a latency improvement of about 70% on average. With AI acceleration, there is a further improvement in both cloud and edge latencies, making this a solution for large-scale time-critical deployments at a significantly lower price.

5 Discussion Running the same methodology with other models such as TinyYOLOv3 produced the same trends. There are other factors that can be evaluated: performance utilization and different inference models such as MobileNet and YOLOv4. In this paper, the focus is video analytics, but this can be expanded to more areas like speech and text analytics. The future scope of the work includes running on cloud platforms like AWS DeepLens and Google Cloud AI. Another related research expansion, into accelerated performance toolkits for AI and inference, could yield exciting results. For robotics, it is well known that embedded AI is well suited to time-critical applications, helping the robots work better. However, by using the methodology discussed in this paper, we can achieve similar latency numbers at a lower cost and with lower battery drain. It will be interesting to benchmark this research against embedded AI toolkits for cost and battery life.


6 Conclusion There are two main factors that were taken into account: the proximity of the location where the computation is done, and the AI acceleration. The lowest latency is achieved when the computing device is closest to the device requesting the processing. The latency reduction is boosted when AI acceleration is applied. Further, with just AI acceleration the application latency is reduced, but with a reduction in network hops there is a massive improvement in overall latency. The third factor is VM offloading. Hence, in order to get the best results, all three implementation factors, the cloudlet, Intel OpenVINO and a proximity-based edge server, are necessary. Acknowledgements The authors would like to thank Intel and JustRobotics for being partners in the research and case study. The authors would like to express their gratitude to Sheryl Lim from Intel for providing access to the OpenVINO APIs and educating us on their usage. The authors would like to thank Raghu and Ribin Mathew of JustRobotics for guiding us through their robotics customer case and providing a valuable contribution in helping us with the implementation of the project. In order to implement and execute the research, the different teams at Nife Labs have helped to drive conversations; hence the authors would like to thank the team at Nife Labs.

References
1. Wang J, Feng Z, George S, Iyengar R, Pillai P, Satyanarayanan M (2019) Towards scalable edge-native applications, pp 152–165. https://doi.org/10.1145/3318216.3363308
2. Chen Z, Hu W, Wang J, Zhao S, Amos B, Wu G, Ha K, Elgazzar K, Pillai P, Klatzky R, Siewiorek D, Satyanarayanan M (2017) An empirical study of latency in an emerging class of edge computing applications for wearable cognitive assistance. In: Proceedings of the second ACM/IEEE symposium on edge computing. Association for Computing Machinery, New York, NY, USA, Article 14, pp 1–14. https://doi.org/10.1145/3132211.3134458
3. Network latency issues—what is latency? https://www.keycdn.com/support/what-is-latency
4. Satyanarayanan M (2013) Cloudlets: at the leading edge of cloud-mobile convergence. In: Proceedings of the 9th international ACM SIGSOFT conference on quality of software architectures (QoSA '13), pp 1–2. https://doi.org/10.1145/2465478.2465494
5. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, pp 779–788. https://doi.org/10.1109/CVPR.2016.91
6. Lewis G, Echeverría S, Simanta S, Bradshaw B, Root J (2014) Tactical cloudlets: moving cloud computing to the edge. In: 2014 IEEE military communications conference, Baltimore, MD, pp 1440–1446. https://doi.org/10.1109/MILCOM.2014.238
7. Intel's OpenVINO toolkit getting started documentation. https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit/get-started.html
8. OpenVINO deep learning deployment toolkit (2019). https://github.com/opencv/dldt
9. OpenVINO toolkit (2020). https://software.intel.com/enus/openvino-toolkit


10. OpenVINO model optimizer—image flow description. https://docs.openvinotoolkit.org/latest/openvino_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html
11. Elijah home page. http://elijah.cs.cmu.edu/
12. Ha K, Chen Z, Hu W, Richter W, Pillai P, Satyanarayanan M (2014) Towards wearable cognitive assistance. In: MobiSys 2014—proceedings of the 12th annual international conference on mobile systems, applications, and services. https://doi.org/10.1145/2594368.2594383
13. Nadaf R, Bonal V (2019) Smart mirror using Raspberry Pi as a security and vigilance system. In: 2019 3rd international conference on trends in electronics and informatics (ICOEI), Tirunelveli, India, pp 360–365. https://doi.org/10.1109/ICOEI.2019.8862537
14. Raspberry Pi camera module hardware documentation. https://www.raspberrypi.org/documentation/hardware/camera/

Neural Networks Application in Predicting Stock Price of Banking Sector Companies: A Case Study Analysis of ICICI Bank

T. Ananth Narayan

Abstract This paper presents a survey of the application of neural networks to the stock markets of the banking domain. In this article, various domains and applications of neural networks are clearly discussed. Predicting stock price movement is difficult because of the complicated, non-stationary, chaotic and non-linear dynamics of the stock market. Prediction is the process of estimating unknown future values. Predicting stock performance is a very large and potentially profitable area of study. The stock market is one of the familiar investment avenues owing to its expected high returns. Prediction of stock market returns by minimizing risk while increasing profit is a vital issue in the finance domain. The present study is supported and illustrated by the results obtained.


1 Introduction Neural networks are mathematical models originally inspired by biological processes within the human brain. They are created from a number of simple processing elements interconnected by weighted pathways to form networks. Each element computes its output as a nonlinear function of its weighted inputs. When combined into networks, these processing elements can implement arbitrarily complicated, nonlinear mappings which may be used to solve classification, prediction and optimization problems.

1.1 Neural Networks

Definition of a neural network: a neural network is a massively parallel distributed processor that has a natural tendency for storing experiential knowledge and making it accessible for use. It resembles the brain in two respects:

1. Knowledge is obtained by the network through a learning process.
2. Interneuron connection strengths, known as synaptic weights, are used to store the knowledge.


2 Objective of the Study To study the application of neural networks in predicting the stock price of banking sector companies, with a case study analysis of ICICI Bank.

3 Review of Literature Neural networks have been extensively applied to accounting, finance and other business studies in areas such as forecasting, pattern recognition and classification (Wong et al. 1997; O'Leary 1998; Vellido et al. 1999). Tam and Kiang (1992) find neural networks to be a superior approach in bankruptcy prediction. Echoing Tam and Kiang (1992), other researchers in economics and finance recognize the strength of neural networks in handling nonlinear relationships and accommodating various probability distributions. Azoff (1994) recommends the neural network approach as a "multivariate nonlinear nonparametric inference technique that is data driven and model free." Beltratti et al. (1996) provide a more fundamental explanation for the appeal of the neural network in economic modeling. Kim and McLeod (1999) demonstrated the superiority of neural network models in bankruptcy prediction, especially when nonlinear patterns exist in the datasets. In addition, numerous bond rating studies (Know et al. 1995; Singleton and Surkan 1995; Maher and Sen 1997) have demonstrated that neural networks are a reliable alternative to traditional statistical techniques such as discriminant analysis for business classification problems. Hill et al. (1994) and Tang and Fishwick (1993) suggest replacing conventional statistical techniques with neural networks in building financial forecasting models.


3.1 Banking Sector Most banks today are under pressure to stay profitable and, at the same time, to understand the needs, wants and preferences of their customers. Lately, many financial institutions have adopted new models that can help them sustain themselves in the market. Banks will then have to go beyond traditional business reporting and sales forecasting to be able to ascertain a set of crucial success factors. The application of data mining and predictive analytics to extract actionable insights and quantitative predictions can help banks gain insights into all kinds of customer behaviour, including channel transactions, account opening and closing, default, fraud and customer churn. Insights into these banking behaviours can be discovered through multivariate descriptive analytics and predictive analytics, such as the assignment of a credit score. Banking analytics, or the application of data mining in banking, enhances the performance of banks by improving how they segment, target, acquire and retain customers.

4 Hypothesis of the Study H1: Neural networks can be applied in predicting the stock price of banking sector companies, illustrated through a case study analysis of ICICI Bank.

5 Scope of the Study The project is focused on ICICI Bank. The practical output metrics reflect the varied methods employed by the bank.

6 Limitations of the Study For the company analysis, the data has been taken for a limited time period, as the data is drawn from secondary sources.


7 Research Methodology of the Study

1. Data is taken from inception till date on a daily basis.
2. The total data sample is split between training (70%), testing (20%) and validation (10%) on a random basis.
3. Relative error is ascertained for each of the dependent variables.
4. For those dependent variables where the relative error is fairly low (…

…2 h and possess capacity to consent themselves
The child and parent or legal guardian is able to provide assent and/or consent
Willingness and ability to sign informed consent document
Provide assent and have a legal guardian that will participate and provide parental permission
Able to comprehend and willing to provide written consent

Syntactic and Semantic Knowledge-Aware Paraphrase …

171

trials. Such tasks are commonly addressed through paraphrase detection. Paraphrases are word sequences that convey the same meaning using alternative language expressions [4, 5]. We quantify paraphrases by computing the amount of semantic overlap exhibited by a pair of textual expressions. This typically involves measuring the extent to which a pair of words, phrases or sentences are semantically related to each other [6–8]. A host of works have been already done in developing models for paraphrase detection [9–13]. Such models have been trained and tested over well-known openly available datasets. However, standard models for paraphrase detection do not work well with domain-specific texts particularly texts containing technical contents. For example, clinical domain texts often contain multiple mentions of clinical terms such as conditions, interventions and chemical compositions that mean the same (Example: stomach flu and food poisoning, Hyperinsulinemia and Type-II diabetes). Such technical terms play an important role in determining whether a sentence pair is paraphrased (“Obese with Hyperinsulinemia can be included in the study” and “Patients with Type-II Diabetes can be included”. In order to identify the similarity between the above two sentences, it is important for a model to detect the semantic relation between the concepts Hyperinsulinemia and Type-II diabetes. At the same time, it is also important to understand that concepts that look very similar may not be that much semantically close. For example, the sentence pair “Patients with only Type-I diabetes can participate in the test” and “Person with Type-II diabetes can also participate in the trial”. It becomes a challenging problem for a computational model to automatically detect such complicated semantic relationships between concepts. Therefore, it is increasingly becoming important to involve domain-specific knowledge bases to retrieve such type of semantic relatedness among domain-specific concepts. Such semantic information in combination with the standard syntactic and structural analysis of sentence pairs may give us an holistic view about the relatedness of a pair of sentences. In this paper, we have explored syntactic and semantically knowledge-aware deep neural network-based models for the automatic detection of paraphrases between pairs of sentences. Accordingly, we have proposed a clinical BERT + Tree-LSTMbased model. The Tree-LSTM model is specifically used to learn the complex dependency structure within a sentence that the standard Clinical BERT model [14] fails to capture. We have also observed that for domain-specific terms, both Clinical BERT and Tree-LSTM [15] fail to capture the inherent semantic concepts. Therefore, we have augmented a domain-specific knowledge base with the proposed model. The knowledge base is typically used to capture any semantic relatedness between domain terms occurring in the pair of sentences. The clinical BERT architecture is fined-tuned over the TREC 2018 precision medicine track clinical trial dataset. Our preliminary investigation shows that our proposed knowledge-aware model surpasses the existing state-of-the-art neural network architecture in detecting paraphrases in the clinical domain.

172

S. Jana et al.

2 Related Works There has been a large body of work investigating automated methods identifying paraphrases in text documents. Tai et al. [8] proposed a Tree-LSTM model that gives sentence representations for each sentence in the pair over each sentence parse tree (using a trained parser) and predicts similarity score using a neural network that considers distance and angle between the pair. Mueller and Thyagarajan [6] proposed siamese adaptation of LSTM network for semantic similarity between sentences. Socher et al. [7] introduced a method for paraphrase detection using recursive autoencoders (RAE) to find the vector representation of phrases in syntactic trees. Zhuang and Chang [5] proposed an attention-based RNN model at the SemEval 2017 cross-lingual semantic textual similarity (STS) task. Pawar and Mago [4] presented semantic similarity of sentences in different domains by considering word similarity, sentence similarity and word order similarity. Creutz and Aulamo [12] proposed a paraphrase detection models on subtitle data from the Opusparcus corpus comprising six European languages. They trained word-averaging (WA) and a gated recurrent averaging network (GRAN) models for sentence embedding. Agarwal et al. [9] proposed a method detecting paraphrases in short texts having language irregularity and noise. For that, they introduced a CNN-RNN-based architecture. Mohamed and Oussalah [11] proposed sentence-to-sentence semantic similarity model in which they derived word level semantic similarity using CatVar_aided WordNet-based model and semantic similarity between named-entities using Wikipedia entity cooccurrence. In more recent works, a BERT-based model was proposed by Devlin et al. [10]. This pretrained BERT model can be fine-tuned by adding an output layer on top of it and build state-of-art models for different NLP tasks such as question answering and paraphrase detection. Zhang et al. [13] proposed an improved language representation model, semantics-aware BERT (SemBERT), in which pretrained semantic role labelling is combined with BERT, for considering contextual semantics.

3 Proposed Knowledge-Aware Paraphrase Detection Model Our model is composed of three components: clinical BERTBASE model, child-sum Tree-LSTM and knowledge base model. We fit our sentence pair into a pretrained clinical BERTBASE model by tokenizing the sentences using BERT tokenizer, which prepend a [CLS] token at the beginning for classification and append [SEP] token to the end of each sentence. The embedding output of the final transformer corresponding to [CLS] token fed into the softmax classifier to find the probability of being paraphrased or not. In order to get fine-grained structural similarity among sentences, we obtained a dependency relation of words within a sentence. These dependencies were provided as an input to the child-sum Tree-LSTM model described in [8] on dependency tree structure topology. The Tree-LSTM network is fed with embedding of words. In our paper, we use word2vec embedding along with character-level

Syntactic and Semantic Knowledge-Aware Paraphrase …

173

embedding for better understanding of clinical domain base words. The word2vec embedding is created over the entire TREC 2018 precision medicine task dataset.1 The Tree-LSTM unit processes any dependency tree from its leaves to its root nodes and provides sentence representation on the root node. Then, the representations are fed into a fully connected neural network, which outputs a score of each possible class. To identify the relation on domain-specific terms, we extract several clinical entities using CliNER [16] and create a similarity matrix of two sentences using several clinical databases. Then, we feed each of these output vectors as input to a fully connected neural network to get the corresponding probability vector, which represents the probability that the sentences are paraphrased or not. In the following sections, the components will be discussed in detail. We use a publicly released clinical BERTBASE model which is pretrained on clinical notes of MIMIC-III V1.4 database. This model has 12 layers of transformer blocks, 768 hidden units and 12 self-attention heads. The pretrained model is finetuned on the training dataset. We need to do some adjustment to use the clinical pretrained BERT model for our classification task. We set the early stopping of finetuning to 800 steps to prevent overfitting. We use a batch size of 32, a maximum sequence length of 128 and a learning rate of 2 × 10−5 for fine-tuning this model. Finally, in post-processing step, we took the output corresponding to [CLS] token from the 12th transformer layer of the pretrained BERT model and fed it as input to a fully connected layer with a sigmoid activation function. We have used binary cross-entropy loss function for this model, and finally, the BERT model gives the probability score of two sentences for each of the labels. Along with the BERT model, we have also used the child-sum Tree-LSTM to get dependency relations between words in a text. Here, each Tree-LSTM unit takes input as a pair of vectors (wi , wj ) corresponding to the word sequence pair of the sentence. The model then tries to learn the syntactic structure to predict the similarity among the word sequences. In our application, each wj is a vector representation of a word in a sentence, and in our model, we use word2vec representation with character embedding. Each Tree-LSTM unit (index by j) contains input and output gates ij and oj , respectively, a memory cell cj , hidden state hj and forget gate f jk for each child k. The input word at each node depends on the dependency tree structure which is used for the Tree-LSTM network (Fig. 1). Two sentences S 1 and S 2 pass through two different Tree-LSTM units and find sentence vectors for those two sentences, say hS1 and hS2 . Then, we feed these sentence representations to a fully connected neural network and predict the similarity score of two sentences. We have used the loss function described in [8] defined as m 1   k (k)  λ KL p || pˆ θ + ||θ ||22 j(θ ) = m k=1 2


Fig. 1 Overview of the knowledge-aware BERT + Tree-LSTM model of paraphrase detection 

where m is the number of training pairs, p^{(k)} is the target distribution, \hat{p}_{\theta}^{(k)} is the distribution predicted by the model, and λ is an L2 regularization hyperparameter. The word embeddings were constructed using the word2vec model [17] over the entire clinical trial dataset of the TREC 2018 precision medicine track. All words that occur at least five times in the corpus are included. We further incorporated character-level embeddings to mitigate the loss due to the unavailability of embeddings for domain-specific terms. For example, texts mentioning drug names like METHOTREXATE or FLUOXETINE can be better represented by character-level embeddings. Using character-level embeddings helps in detecting sub-word similarities useful for identifying misspellings such as consaltant (consultant) and also correctly recognizes new or unknown domain words like mucositis, pharyngitis, bronchitis, etc. It also helps to model drug dosages like "10 mg", "12 cc" and "1:3 ratio" very effectively.
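To make the BERT branch concrete, the following is a minimal, hypothetical PyTorch sketch of the fine-tuning head described above; it is an illustration under stated assumptions, not the authors' released code. The checkpoint identifier is an assumption, and only the hyperparameters explicitly stated in the text (batch size 32, maximum length 128, learning rate 2 × 10−5, sigmoid output trained with binary cross-entropy) are taken from the paper.

```python
# Hedged sketch of the clinical BERT branch: the [CLS] embedding of a tokenized
# sentence pair is mapped to a paraphrase probability by a fully connected layer
# with a sigmoid. The checkpoint id below is an assumed public clinical BERT model.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "emilyalsentzer/Bio_ClinicalBERT"  # assumption, not taken from the paper

class BertParaphraseScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = AutoModel.from_pretrained(CHECKPOINT)        # 12 layers, 768 hidden units
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]           # embedding of the [CLS] token
        return torch.sigmoid(self.classifier(cls))  # paraphrase probability in (0, 1)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
enc = tokenizer("Chest CT showed pneumonia", "Chest scan revealed pneumonia",
                truncation=True, max_length=128, return_tensors="pt")
model = BertParaphraseScorer()
prob = model(**enc)  # train with nn.BCELoss(), batch size 32 and learning rate 2e-5
```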

3.1 Output Layer

The final output layer takes the paraphrase output from the BERT classifier, the paraphrase output from the child-sum Tree-LSTM and the collapsed matrix from the unsupervised similarity detection unit, and trains a neural network with a sigmoid activation function to limit the possible scores to the range (0, 1). Finally, the loss is computed using the cross-entropy loss defined by

L = -\sum_{i=1}^{2} \bar{y}_i \log(y_i)

where \bar{y} is the actual output and y is the predicted output.
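As an illustration only, the fusion step can be sketched as a small network that concatenates the two model scores with the collapsed knowledge-base vector. The hidden size and layer count below are assumptions that are not given in the paper.

```python
# Hedged sketch (assumed sizes) of the final output layer: the BERT score, the
# Tree-LSTM score and the flattened 10 x 10 knowledge-base similarity matrix
# (100 values) are concatenated and passed through a fully connected network
# ending in a sigmoid, trained with the cross-entropy objective above.
import torch
import torch.nn as nn

class FusionOutputLayer(nn.Module):
    def __init__(self, kb_dim=100, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + kb_dim, hidden),  # BERT score + Tree-LSTM score + KB vector
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                   # keeps the score in (0, 1)
        )

    def forward(self, bert_score, tree_score, kb_vector):
        return self.net(torch.cat([bert_score, tree_score, kb_vector], dim=1))

layer = FusionOutputLayer()
score = layer(torch.rand(4, 1), torch.rand(4, 1), torch.rand(4, 100))
loss = nn.BCELoss()(score, torch.ones(4, 1))  # binary cross-entropy training loss
```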

3.2 Similarity Detection Using Knowledge Base

In order to identify semantic relatedness between domain terms, we have used a number of biomedical knowledge bases. The algorithm to detect such relatedness is as follows. We first extract the key concepts from the pair of textual mentions, using CliNER [16] to extract concepts such as the disease name or associated conditions, laboratory test reports, and proposed treatments and their reports. For example, the CliNER output of the sentence "Chest CT of the patient showed pneumonia, antiviral therapy was required for that." is Problem: "pneumonia", Test: "Chest CT" and Treatment: "antiviral therapy". The clinical concepts corresponding to the two sentences are represented as {e_1, e_2, ..., e_k} and {e'_1, e'_2, ..., e'_l}, where for simplicity k, l ≤ 10. We then construct a k × l similarity matrix M = (m_{i,j}), where m_{i,j} = sim(e_i, e'_j) is the semantic similarity between a pair of key clinical concepts. The computation of sim(e_i, e'_j) is defined below. We use several clinical databases, namely the UMLS semantic network, the comparative toxicogenomics database (CTD) and the DrugBank database [18–20]. The CTD database contains entities such as chemicals, genes and diseases, their associations and their IDs. For a specific entity, we extract the corresponding ID, alternate IDs and parent IDs, selecting the longest database entity name that entirely contains the CliNER entity name. For example, the disease "Acrodermatitis enteropathica" has the ID "MESH:C538178", no alternate disease ID, and the parent disease IDs MESH:D000169, C16.131.831.066/C538178, C17.800.174.100/C538178, C17.800.804.066/C538178, C16.131.831.066, C17.800.174.100 and C17.800.804.066. We then compute the similarity score based on four sets of IDs A1, A2, P1 and P2, where the first two contain the corresponding and alternate IDs of the two concepts and the last two contain their parent IDs. If A1 ∩ A2 ≠ ∅, we set the similarity value to 1; if A1 ∩ A2 = ∅ and P1 ∩ P2 ≠ ∅, we set it to 0.5; and if A1 ∩ A2 = ∅ and P1 ∩ P2 = ∅, the similarity value is 0. We pad the similarity matrix to a 10 × 10 size, and the matrix is collapsed to a one-dimensional array.
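A hedged Python sketch of this scoring step is given below. The ID-lookup helpers are hypothetical stand-ins for queries against CTD, UMLS or DrugBank; only the 1/0.5/0 scoring rule and the padded, flattened 10 × 10 matrix follow the description above.

```python
# Sketch of the knowledge-base similarity step (assumed helper functions).
import numpy as np

def concept_similarity(ids_a, ids_b, parents_a, parents_b):
    """1 if the concepts share an ID/alternate ID, 0.5 if only parent IDs overlap, else 0."""
    if set(ids_a) & set(ids_b):
        return 1.0
    if set(parents_a) & set(parents_b):
        return 0.5
    return 0.0

def kb_similarity_vector(concepts_1, concepts_2, get_ids, get_parent_ids, size=10):
    m = np.zeros((size, size))
    for i, e1 in enumerate(concepts_1[:size]):
        for j, e2 in enumerate(concepts_2[:size]):
            m[i, j] = concept_similarity(get_ids(e1), get_ids(e2),
                                         get_parent_ids(e1), get_parent_ids(e2))
    return m.flatten()   # collapsed 100-dimensional input for the output layer

# Toy usage with placeholder IDs standing in for real database lookups.
ids = {"osteomyelitis": {"ID:100"}, "bone infection": {"ID:100"}}
parents = {"osteomyelitis": {"ID:900"}, "bone infection": {"ID:900"}}
vec = kb_similarity_vector(["osteomyelitis"], ["bone infection"],
                           get_ids=lambda e: ids.get(e, set()),
                           get_parent_ids=lambda e: parents.get(e, set()))
print(vec[0])  # 1.0, since the placeholder IDs overlap
```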


Table 2 Dataset statistics: |V| is the number of sentence pairs

Train: MSRP, |V| = 4075
Validate: MSRP, |V| = 1725
Test: TREC 2018 precision medicine track, |V| = 3160

4 Experimental Set-Up

In our implementation, we have used the Microsoft Research Paraphrase Corpus (MSRP, https://www.microsoft.com/en-us/download/details.aspx?id=52398) introduced by Dolan et al. [21]. This corpus has 5800 sentence pairs labelled "1" if they are paraphrases and "0" if not, and the average sentence length is 21 words. We split the data into 4075 training pairs and 1725 validation pairs. We used the TREC 2018 precision medicine task clinical trial dataset (http://www.trec-cds.org/2018.html) to test our proposed models [22]. We randomly picked around 3160 pairs of eligibility conditions and annotated them with two classes, "1" if they are paraphrases and "0" if not. The annotations were done by a group of three annotators, and we took a majority vote for each eligibility criterion. For every criterion, the average length of the text is 11 words. Table 2 summarizes the statistics of the datasets.

4.1 Results and Discussions

We try to study the following: (a) whether the performance of the prediction task improves when BERT-based classification models are augmented with a domain-specific knowledge base; (b) whether additional syntactic information provides any improvement over the standard BERT-based paraphrase prediction model; (c) how the proposed model compares with standard baseline neural network models; and (d) the effect of different word-level and character-level embeddings on this specific paraphrase detection task. Table 3 reports the F1 scores of the proposed knowledge-aware BERT-Tree-LSTM model. We can observe that incorporating the CTD and DrugBank knowledge bases over the standard BERT model indeed improves the prediction performance. In fact, most of the errors that the model generated were due to the unavailability of knowledge about specific clinical concepts, for example, "Scheduled for elective, open abdominal or vaginal hysterectomy" and "Scheduled for bilateral cervical hysterectomy". A considerable share of errors (around 7%) is due to the non-standard reporting



Table 3 Results comparing the F1 scores of the proposed models with respect to the baseline systems (P = precision, R = recall, F = F1 score; only a single P/R/F set is reported for BERT and BERT + KB)

Model | W2V P | W2V R | W2V F | FastText P | FastText R | FastText F
BiLSTM Siamese | 0.51 | 0.48 | 0.49 | 0.53 | 0.55 | 0.53
CNN | 0.45 | 0.47 | 0.45 | 0.49 | 0.52 | 0.50
CNN-BiLSTM Siamese | 0.60 | 0.68 | 0.63 | 0.58 | 0.59 | 0.58
BERT | 0.79 | 0.80 | 0.79 | - | - | -
BERT + KB | 0.80 | 0.81 | 0.80 | - | - | -
BERT + Tree-LSTM | 0.79 | 0.77 | 0.77 | 0.80 | 0.78 | 0.78
BERT + Tree-LSTM + KB | 0.81 | 0.80 | 0.80 | 0.81 | 0.82 | 0.81

Fig. 2 Histogram of the difference between probability scores of two classes. Figure a shows the overall difference distribution for all the sentence pairs. Figure b shows the distribution of the lowest difference between the ranges of 0.0–0.4

of clinical test results, such as BMI > 30 versus BODY MASS INDEX in the range of 29–40. In addition, incorporating the Tree-LSTM model along with BERT performs better in learning the syntactic structure and long-distance dependencies of the input text. This may indicate that the transformer layers in BERT alone do not significantly boost the predictive performance of the model in this respect. Along with the classification output, we also analysed the model's performance with respect to the output class probability distribution. As discussed earlier, the proposed model predicts probabilities for both class-0 and class-1, where class-1 represents paraphrase and class-0 represents not a paraphrase. For each sentence pair, the models predict the probabilities of class-0 (P0) and class-1 (P1). If P0 > P1, we assume that the sentence pair is not a paraphrase; on the other hand, if P1 > P0, the sentence pair is considered a paraphrase. With respect to this, we analysed the difference in probabilities between class-0 and class-1. The objective is to observe the difference between the probabilities of the two classes: the higher the difference, the greater the model's confidence, whereas a low difference implies uncertainty in the prediction.
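The decision rule and confidence margin can be illustrated with a few lines of NumPy. The second probability pair below is a made-up value used only to contrast a confident prediction with the near-tied example quoted from the text.

```python
# Illustrative sketch of the class decision and the |P1 - P0| confidence margin.
import numpy as np

probs = np.array([[0.4930, 0.5070],   # near-tied pair (rounded values from the text)
                  [0.0500, 0.9500]])  # a confidently predicted pair (illustrative)
pred_class = probs.argmax(axis=1)           # 1 = paraphrase, 0 = not a paraphrase
margin = np.abs(probs[:, 1] - probs[:, 0])  # low margin -> uncertain prediction
print(pred_class, margin)
```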


Figure 2a shows the overall distribution of the probability difference over the test sentence pairs for the three different models, namely BERT, Tree-LSTM and BERT + Tree-LSTM. We have observed that, across the models, in around 88% of the cases the probability difference between the classes is close to 1, which signifies high prediction confidence. In the remaining cases the difference is small; for example, the probability scores of the sentence pair {"age > 18 years old", "age > or = 18 years"} were 0.49298167 and 0.50701833, and those of {"age: 18, 70", "age 18–70"} were 0.49949166 and 0.50050837, for class "0" and class "1", respectively. Figure 2b depicts the number of sentences for which the class probability difference was between 0.0 and 0.4, which implies that for such sentences the model was not confident enough in its prediction. We have also observed that BERT + Tree-LSTM performed best among all the models; there are only a few sentence pairs for which the difference of the class probability scores was very low. Most of these ambiguities occurred because of symbolic expressions in the text. Table 4 lists the most similar test-set examples found by our model, and Fig. 3 illustrates the distribution of paraphrases of the top ten eligibility criteria. The model is quite good at identifying active-passive equivalence, but without the knowledge base it failed to distinguish clinical concepts; for example, it could not tell the difference between "allergic conjunctivitis" and "perennial allergic rhinitis" and labelled {"history of allergic conjunctivitis to ragweed for at least one year", "subjects with a history of perennial allergic rhinitis for at least 1 year"} as paraphrases. Table 5 shows examples of paraphrases that were detected using the KB in our model. In general, we have observed that word2vec (W2V) embeddings trained over the TREC dataset perform better than FastText embeddings across all the models. However, we do not observe any significant improvement in performance by incorporating the character-level embeddings in the model; this is likely because the W2V model is trained on the domain-specific text and may therefore already generate fairly good word embeddings for out-of-vocabulary words. We also observe that the proposed model outperforms some of the popular baseline models discussed in the paraphrase detection literature; the primary difference in prediction performance is due to the presence of a knowledge base for the specific domain. Table 6 shows some examples in which our model faced issues in predicting the class. In most of the cases, we have observed that the model fails to detect paraphrases due to a lack of domain knowledge about the key terms. For example, consider the two sentences (1) and (2) below:

1. No evidence of eyelid abnormality or infection.
2. Patient has no history of blepharitis.
3. No history of anaphylaxis.
4. No history of allergic rhinitis to birch pollen and/or grass pollen.

We understand that both statements refer to the same issue; however, our existing knowledge base fails to detect the similarity between "eyelid abnormality" and "blepharitis". As a result, the model missed this paraphrase sentence pair.


Table 4 Examples of paraphrases for some criteria, where |S| is the number of sentences for the criterion and |P| is the number of sentences which are semantically similar

Age (|S| = 520, |P| = 342): "Age at least 18 years"; "Age 18 and over"; "18 years of age or older"
English speaking (|S| = 340, |P| = 336): "Ability to speak and understand English sufficiently to allow for the completion of all study assessments"; "Be proficient in English and willing to take part in group sessions"; "English can be a second language provided that the participant feels that he/she is able to understand all the questions used in the assessment measures"
No weight loss (|S| = 143, |P| = 107): "Not taking weight loss medications"; "Not currently following a weight loss or other special/non-balanced diets"; "No health problems that make weight loss or unsupervised exercise unsafe"
Consent (|S| = 536, |P| = 402): "Ability to give written informed consent to participate in the study"; "Patients must have signed an informed consent form"; "Subjects who provide written informed consent to participate in the study"
Hearing problem (|S| = 54, |P| = 22): "No hearing aids in the last 2 years"; "Patients using hearing aids will be excluded"; "If they wear hearing aids, they can be removed for the scan"
Smoking history (|S| = 230, |P| = 183): "Current smokers or ex-smokers with at least 10 pack-year smoking history"; "Smoking history of at least 10 pack years"; "Subjects with a current or prior history of 10 pack years of cigarette smoking at screening"
History of infection (|S| = 432, |P| = 125): "No evidence of active infection and no serious infections within the past month"; "No clinical evidence of active infection at the time of study entry"; "No active infection for which the subject is receiving treatment"
Pregnancy (|S| = 532, |P| = 320): "Female subjects of childbearing potential must not be pregnant at screening"; "Negative pregnant test in case of fertile women"; "Who are neither pregnant nor breastfeeding, nor planned to become pregnant during the study"
History of surgery (|S| = 334, |P| = 207): "Patients who had recovered from prior major surgery are eligible if all surgical wounds have healed"; "No prior major surgery within 4 weeks of randomization from which the patient has not recovered"; "At least 4 weeks since prior major surgery and recovered from all toxicity prior to randomization"
History of allergy (|S| = 698; |P| = 26, 34 and 11 for three paraphrase groups): "Clinical history of perennial allergic rhinitis for at least 1 year" / "Have a history of perennial allergic rhinitis for a minimum of 1 year before study entry"; "No known history of allergic reactions attributed to compounds of similar chemical or biological composition" / "Subject has no history of chemical allergy"; "No history of allergic conjunctivitis within the last 2 years" / "Patient does not have any chronic allergic diseases in past 2 years"
History of brain injury (|S| = 398, |P| = 102): "Patients with a history of brain metastases are eligible if they have been treated with radiation and have stable brain metastases at least 3 months after radiation and must also be off steroids"; "Patients with prior history of treated brain metastases who are off steroids and have stable metastatic brain disease for at least 3 months are eligible"
Cancer (|S| = 200; |P| = 26 and 14 for two paraphrase groups): "Patients with clinically localized high risk prostate cancer (ct > 2, bx gs > 7, psa > 10 ng/ml) scheduled for radical prostatectomy" / "Intermediate or high risk prostate cancer patients who have > t2a or gleason > 6 or psa > 10"; "High risk of colorectal cancer (as determined by ct or mri)" / "patients with high risk of colon cancer"

abnormality” and “blepharitis”. As a result of this, the model missed the paraphrase sentence pairs. Similar issues were observed for sentence pairs (3) and (4) below. In general, our model also has some difficulty in identifying relations between some clinical domain terms like “upper neck region” and “supraclavicular region”. Another type of errors that we have faced is due to the occurrences of semi-structured data in human-readable text form like symbolic expression. For example, expressions like “age: 18–60”, “age ≥ 18”, “values < 3X the normal range” are difficult to interpret by the existing models and thus remains undetected. This remains a future challenge of the present work.


Fig. 3 Distribution of top ten eligibility criteria for clinical trials

Table 5 Examples of paraphrases using domain knowledge

Examples of paraphrases | BERT + Tree-LSTM probability score of class-1 | BERT + Tree-LSTM + KB probability score of class-1
Has no sign of osteomyelitis and no clinical signs of bone infection | 0.2947278 | 0.99342
Documented history of traumatic brain injury / veterans with a history of craniocerebral trauma | 0.068816185 | 0.848675
Have a history of brain metastasis / patients may have a history of secondary brain tumours | 0.87470096 | 0.2586
Clinical history of allergic rhinitis for at least 1 year and a history of nasal congestion for at least 1 year | 0.31535 | 0.909375

5 Conclusion

In this work, we have proposed a knowledge-aware neural network-based approach towards detecting paraphrases from domain-specific eligibility criteria in clinical trials. We have specifically observed that incorporating knowledge into standard BERT-based paraphrase detection models indeed improves the prediction performance.


Table 6 Examples of some incorrect predictions of our model

Sentence pair | Gold label | Predicted label
"Age 18–70 years" / "Age: 18, 70" | Paraphrase | Not paraphrase
"Patients must provide written informed consent to participate in the study" / "Subject's legal guardian has given informed consent to" | Paraphrase | Not paraphrase
"A history of allergic rhinitis (hay fever)" / "History of seasonal allergic rhinitis" | Paraphrase | Not paraphrase
"Prior surgery in the supraclavicular region" / "Prior surgery on the upper neck region" | Paraphrase | Not paraphrase

In addition to this, we have also observed that the Tree-LSTM model better learns the syntactic structure and long-distance dependencies of the input text, thereby further improving the model's performance. However, a fine-grained analysis of the experiments and results is yet to be done as future work.

References 1. Cao YG, Liu F, Simpson P, Antieau L, Bennett A, Cimino JJ, Ely J, Yu H (2011) Askhermes: an online question answering system for complex clinical questions. J Biomed Inform 44(2):277– 288 2. Weng C, Tu SW, Sim I, Richesson R (2010) Formal representation of eligibility criteria: a literature review. J Biomed Inform 43(3):451–467 3. Boland MR, Johnson SB, Luo Z, Weng C, Theodoratos D, Wu X (2011) Elixr: an approach to eligibility criteria extraction and representation. J Am Med Inform Assoc 18(1):116–124 4. Pawar A, Mago V (2018) Calculating the similarity between words and sentences using a lexical database and corpus statistics. arXiv preprint arXiv:1802.05667 5. Zhuang WL, Chang E (2017) Neobility at semeval-2017 task 1: an attention-based sentence similarity model. arXiv preprint arXiv:1703.05465 6. Mueller J, Thyagarajan A (2016) Siamese recurrent architectures for learning sentence similarity. In: Thirtieth AAAI conference on artificial intelligence 7. Socher R, Huang EH, Pennin J, Manning CD, Ng AY (2011) Dynamic pooling and unfolding recursive autoencoders for paraphrase detection, In: Advances in neural information processing systems, pp 801–809 8. Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 9. Agarwal B, Ramampiaro H, Langseth H, Ruocco M (2018) A deep network model for paraphrase detection in short text messages. Inf Process Manage 54(6):922–937 10. Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 11. Mohamed M, Oussalah M (2019) A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics. Lang Res Eval 54(2):1–29 12. Sjöblom E, Creutz M, Aulamo M (2018) Paraphrase detection on noisy subtitles in six languages. arXiv preprint arXiv:1809.07978 13. Zhang Z, Wu Y, Zhao H, Li Z, Zhang S, Zhou X, Zhou X (2019) Semantics-aware bert for language understanding. arXiv preprint arXiv:1909.02209


14. Alsentzer E, Murphy JR, Boag W, Weng W-H, Jin D, Naumann T, McDermott M (2019) Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323 15. Li D, Huang L, Ji H, Han J (2019) Biomedical event extraction based on knowledge-driven tree-LSTM. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, vol 1. Long and Short Papers, pp 1421–1430 16. Boag W, Wacome K, Naumann T, Rumshisky A (2015) Cliner: a lightweight tool for clinical named entity recognition, In: AMIA joint summits on clinical research informatics (poster) 17. Goldberg Y, Levy O (2014) Word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 18. Davis AP, Murphy CG, Johnson R, Lay JM, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, Rosenstein MC, Wiegers TC et al (2013) The comparative toxicogenomics database: update 2013. Nucl Acids Res 41(D1):D1104–D1114 19. McCray AT (1989) The UMLS semantic network. In: Proceedings symposium on computer applications in medical care. American Medical Informatics Association, pp 503–507 20. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z et al (2018) Drugbank 5.0: a major update to the drugbank database for 2018. Nucl Acids Res 46(D1):D1074–D1082 21. Dolan B, Quirk C, Brockett C (2004) Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: COLING 22. Voorhees EM (2013) The TREC medical records track. In: Proceedings of the international conference on bioinformatics, computational biology and biomedical informatics. ACM, p 239

Enhanced Behavioral Cloning-Based Self-driving Car Using Transfer Learning

Uppala Sumanth, Narinder Singh Punn, Sanjay Kumar Sonbhadra, and Sonali Agarwal

Abstract With the growth of artificial intelligence and autonomous learning, the self-driving car is one of the most promising areas of research and is emerging as a center of focus for automobile industries. Behavioral cloning is the process of replicating human behavior via visuomotor policies by means of machine learning algorithms. In recent years, several deep learning-based behavioral cloning approaches have been developed in the context of self-driving cars, specifically based on the concept of transfer learning. Concerning the same, the present paper proposes a transfer learning approach using the VGG16 architecture, which is fine-tuned by retraining the last block while keeping the other blocks non-trainable. The performance of the proposed architecture is further compared with NVIDIA's existing architecture and its pruned variants (pruned by 22.2 and 33.85% using a 1 × 1 filter to decrease the total number of parameters). Experimental results show that the VGG16 transfer learning architecture outperforms the other discussed approaches with faster convergence.

1 Introduction

The end-to-end deep learning model is the most popular choice among researchers to deal with large-volume data [1–4]. Conventionally, deep learning approaches decompose the problem into several subproblems, solve them independently, and combine all the outputs to draw the final decision. Many automobile companies, like Hyundai and Tesla, are trying to bring millions of self-driving or autonomous cars onto the road by utilizing deep learning approaches.


Fig. 1 End-to-end learning

In this frantic race to come up with fully safe self-driving cars, some organizations like NVIDIA are following the end-to-end approach [5] as shown in Fig. 1, whereas Google is following a mid-to-mid approach [6]. Following these notions, the main objective of the present research work is to predict the steering angle of the car from the front-facing camera. Behavioral cloning [7] is the process of reproducing human-performed tasks with a deep neural network; it is achieved by training the network on data of a human subject performing the task. In 1989, a self-driving car based on neural networks was developed by Pomerleau [8]. For most of the past 130 years, however, automobile manufacturers paid little attention to replacing the driver, who is the most vulnerable part of the car. The automotive companies tried to make cars safer by including many safety features like antilock braking systems, shatter-resistant glass, and airbags [9], but they did not succeed in developing driver-less intelligence. Self-driving cars are the most desirable revolutionary change of the twenty-first century for a fully safe driving experience and will change the way of transportation. According to the World Health Organization's report "Global status report on road safety 2018," every year around 1.35 million humans lose their lives due to road accidents [10]. Self-driving cars will bring this number down and also enable people with disabilities to commute easily. Convolutional neural networks (CNN) have revolutionized pattern recognition [11] and can process the 2D images captured in the context of self-driving cars. The greatest advantage of a CNN is that it automatically extracts the important features needed to interpret the surrounding environment from the images, which can be utilized to develop an intelligent driving system. In the present research, to establish the importance of the transfer learning approach for self-driving cars, a novel end-to-end VGG16-based approach is proposed which is fine-tuned to predict the steering angle based on the environmental constraints. Later, the proposed approach is compared with NVIDIA's architecture and its pruned variants. Due to the lesser number of parameters in the pruned architectures, their training time reduces significantly compared to the baseline architecture. Since the transfer learning approach starts from a pre-trained model where only a part of the network is trained, significant computational time is saved without compromising performance. It has been observed that if the tasks are similar, then the weights of the initial few


layers are similar and the last layers carry the information relevant to the task [12], making transfer learning a better way of saving training time. The paper is organized into various sections, including the related work, which briefly discusses the available approaches applied to self-driving cars and highlights the drawbacks and advantages of the research carried out so far. The proposed approach section presents the various techniques utilized in the process of generating a novel model which accurately drives the car. The dataset and preprocessing techniques are discussed in the subsequent sections. At the end, the experimental results are elaborated with concluding remarks.

2 Related Work

The process of reconstructing a human subcognitive skill through a computer program is referred to as behavioral cloning. Here, the actions of a human performing the skill are recorded along with the situation that gave rise to the action [13]. There are two popular methods for behavioral cloning. In the first method, the skill is learned through a series of dialogues with an operator; in the case of an autonomous vehicle, the operator is expected to explain the full set of skills needed to control the vehicle. This method is challenging because a manual description of skills is not perfectly possible due to human limitations. Alternatively, the skill can be reconstructed from recorded actions that are maintained in a structured way, by using learning algorithms on various manifestation traces [14–17] to reproduce the skilled behavior. The Defense Advanced Research Projects Agency (DARPA) initiated the DARPA autonomous vehicle (DAVE) [18] project, which included a radio-controlled model truck fitted with sensors and lightweight video cameras to test the vehicle in a natural environment having trees, heavy stones, lakes, etc. The test vehicle was trained with 225,000 frames of driving data; however, in test runs, DAVE crashed on average every 20 m. In 1989, Pomerleau built the autonomous land vehicle in a neural network (ALVINN) model using the end-to-end learning methodology, and it was observed that a vehicle can be steered by a deep neural network [8]. NVIDIA started its research work in self-driving inspired by the ALVINN and DARPA projects. The motivation for their work was to create an end-to-end model which enables steering of the car without manual intervention [5, 19], while recording human driving behavior along with the steering angle every second using the controller area network (CAN) bus. Based on the NVIDIA-proposed architecture PilotNet (as shown in Fig. 2), Texas Instruments released JacintoNet, i.e., an end-to-end neural network for embedded-system vehicles such as tiny humanoid robots [20]. In 2020, a group of researchers from Rutgers University proposed a feudal network based on hierarchical reinforcement learning that performs similarly to the state-of-the-art models with a simpler and smaller architecture, which reduces training time [21]. Jelena et al. [22] have proposed a network which is 4 times smaller than the NVIDIA model and about 250 times smaller than AlexNet [23]. The model

Fig. 2 Baseline NVIDIA architecture


is developed only for use in embedded automotive platforms. To study the working of these end-to-end models, Kim et al. [24] researched the regions of the images that contribute to predicting the steering angle. Learning to drive from such a system alone would not suffice for a self-driving car; the driving system should also address issues such as how to get back onto the road if the vehicle goes off the road by mistake, otherwise the vehicle will eventually drift off the road. Therefore, the images provided by the dataset are combined with additional images to visualize the vehicle in different fields of view on and off the road. The datasets are usually augmented with new images generated by view transformations, for example by flipping the images, to cover the maximum possible scenarios [5]. For the transformed images, the steering angle is changed in such a way that the vehicle would come back to the right position and direction within 2 s. The NVIDIA model proved to be quite powerful by achieving 98% autonomy time on the road. However, the results observed from NVIDIA's model, consisting of only five convolution layers followed by three dense layers, exhibited limited performance, and thus it is evident that complex tasks require complex deep neural network structures with more layers to achieve better performance.

3 Proposed Approach

A competent human is required for controlling any intricate system such as a helicopter or a bike. The competency is learnt through experience and develops within the subconscious capability of the brain. These subcognitive skills are challenging and can only be described roughly and inconsistently. In the case of frequently occurring actions, the competency can be achieved by a system by learning the recorded common patterns using deep learning techniques. Extracting and replicating such patterns from a human subject performing the task is called behavioral cloning [25]. Following the idea of behavioral cloning, a novel end-to-end transfer learning-based VGG16 approach (as shown in Fig. 3) is proposed to predict the appropriate steering angle. The proposed model is compared with the NVIDIA baseline model and its pruned variants, built by chopping off the baseline NVIDIA model by 22.2 and 33.85% using a 1 × 1 convolution filter. Figure 4 presents the overall schematic representation of the proposed approach.

3.1 Network Pruning Using 1 × 1 Filter

The use of 1 × 1 convolution for network pruning. Pooling is used to downsample the contents of feature maps; it reduces their height and width while retaining the salient features. The number of feature maps of a convolutional neural network increases as its depth increases [26], and this leads to an increased number of parameters, which increases the training time. This problem

Fig. 3 Architecture of VGG16 with transfer learning


Fig. 4 Schematic representation of proposed approach

can be solved using a 1 × 1 convolution layer that performs channel-wise pooling, also called projection. This technique can be used for network pruning in CNNs [26, 27] and to increase the number of features after classical pooling layers. The 1 × 1 convolution layer can be used in the following three ways:

• A linear projection of feature maps can be created.
• Since the 1 × 1 layer works as channel-wise pooling, it can be used for network pruning.
• The projection created by the 1 × 1 layer can also be used to increase the number of feature maps.

Downsampling with a 1 × 1 filter. A 1 × 1 filter has only a single parameter (weight) for each channel of the input, which leads to a single-channel output value. The filter acts as a single neuron with input from the same position in each of the feature maps, and it is applied from left to right and top to bottom, resulting in a feature map with the same height and width as the input [28]. The idea of using a 1 × 1 filter to summarize the input feature maps is inspired by the Inception network proposed by Google [29]. The use of a 1 × 1 filter allows effective control of the number of feature maps in the resulting output; hence, the filter can be used anywhere in the network to control the number of feature maps, and it is also called a channel pooling layer. In the two models shown in Fig. 5, the network size is pruned by 22.2 and 33.85% with the help of this downsampling, as sketched below.
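The following is a minimal Keras sketch of the pruning idea, not the exact pruned PilotNet used in the paper: a 1 × 1 convolution projects the 64 feature maps of a 3 × 3 convolution down to 32 channels, so every subsequent layer sees fewer parameters while the spatial dimensions stay unchanged. Filter counts other than the 64-to-32 projection are illustrative.

```python
# Hedged illustration of channel pruning with a 1 x 1 convolution.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(24, (5, 5), strides=2, activation="relu",
                  input_shape=(66, 200, 3)),      # 66 x 200 x 3 YUV input
    layers.Conv2D(64, (3, 3), activation="relu"), # 64 feature maps
    layers.Conv2D(32, (1, 1), activation="relu"), # 1 x 1 projection: 64 -> 32 maps
    layers.Flatten(),
    layers.Dense(1),                              # steering-angle output
])
model.summary()  # the 1 x 1 layer reduces the parameters of all following layers
```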


Fig. 5 Pruned NVIDIA architectures by using 1 × 1 filter

3.2 Transfer Learning

Training a deep neural network needs massive computational resources. To minimize this effort, transfer learning has been explored, which makes it possible to reuse neural networks implemented by large companies with abundant resources; the trained models they provide can be used for academic research projects and startups [30]. Recent publications have highlighted the significance of transfer learning for image recognition, object detection, and classification [31–33]. In the transfer learning approach, a pre-trained model is adopted and fine-tuned to solve the desired problem, i.e., by freezing some layers and training only a few layers. Studies show that models trained on huge datasets like ImageNet should


Fig. 6 Transfer learning with VGG16

generally work well for other image recognition problems [34]. It has also been shown that initializing a model with pre-trained weights leads to faster convergence than initializing it with random weights [12]. To implement the transfer learning mechanism, VGG16 has been used and all blocks are frozen except the last block, which contains a max-pooling layer and three convolution layers, as highlighted in Fig. 6. When deep neural networks are trained on a huge set of images, the initial layer weights are similar regardless of the objective, whereas the final layers generally learn more problem-specific features. The initial layers of a CNN learn hidden edges, patterns, and textures [27]; they identify features that can serve as generic feature extractors for recognizing the desired patterns and help in analyzing the complex environment for developing an intelligent driver-less system. VGG16 with transfer learning. VGG16 [35] is a state-of-the-art deep CNN model which was the runner-up in the ILSVRC (ImageNet) competition in 2014 [36]. Compared to other models proposed in ILSVRC, like ResNet50 [37] and Inception [29], the VGG16 model has a smaller number of parameters because of the way the convolution filters are arranged, i.e., 3 × 3 filters with stride 1 followed by 2 × 2 max-pool filters with stride 2. This arrangement of convolution followed by max pooling is used consistently throughout the network, whereas the two fully connected layers form the decision layer, which aggregates to 138 million parameters. In the proposed approach, the initial four convolution blocks of VGG16 are frozen and the last convolution block, i.e., block 5, is fine-tuned to predict the appropriate steering angle based on the surrounding conditions acquired from the captured frames.
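A hedged Keras sketch of this setup is given below: the ImageNet-pretrained VGG16 convolutional base is loaded, blocks 1-4 are frozen, block 5 is left trainable, and a small regression head predicts the steering angle. The head sizes and the optimizer learning rate are assumptions not stated in the paper.

```python
# Sketch of VGG16 transfer learning with only block 5 trainable (assumed head).
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(66, 200, 3))
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")  # freeze blocks 1-4, fine-tune block 5

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(100, activation="relu"),  # assumed regression head
    layers.Dense(1),                       # predicted steering angle
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
model.summary()
```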

4 Dataset Description and Preprocessing

The dataset is a sequence of front-camera dashboard-view images captured in traffic around Rancho Palos Verdes and San Pedro, California [38]. It contains 45,400 images

Table 1 Dataset description

Image: the path of the image present on the disk
Steering angle: a value in the range of −90 to +90 indicating the steering angle

Fig. 7 Snapshot of original and processed images

and the associated steering angle, as described in Table 1. In this research, 80% of the images are used for training and the remaining 20% for validation testing. The steering angle ranges between −90 and +90, where +90 indicates that the steering is tilted toward the right and −90 indicates that it is tilted toward the left. The data is preprocessed to get the images into the desired format, suitable for the network to learn from and to help in the prediction of the appropriate steering angle. The original and preprocessed images are shown in Fig. 7. The images are preprocessed by performing the following steps (a minimal sketch follows this list):

• Remove unnecessary features by cropping the image.
• Convert the image to YUV format.
• Reduce the dimensions of the image to 66 × 200 × 3.
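The sketch below illustrates these steps with OpenCV. The crop margins and the final pixel scaling are assumptions; only the YUV conversion and the 66 × 200 × 3 target size come from the list above.

```python
# Hedged sketch of the preprocessing pipeline (assumed crop margins and scaling).
import cv2
import numpy as np

def preprocess(bgr_image):
    cropped = bgr_image[60:-25, :, :]               # drop sky and car hood (assumed margins)
    yuv = cv2.cvtColor(cropped, cv2.COLOR_BGR2YUV)  # convert to YUV colour space
    resized = cv2.resize(yuv, (200, 66))            # width x height -> 200 x 66
    return resized.astype(np.float32) / 255.0       # scale pixels to [0, 1] (assumption)

frame = np.zeros((160, 320, 3), dtype=np.uint8)     # stand-in for a frame read with cv2.imread
x = preprocess(frame)                               # shape (66, 200, 3)
```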

5 Experimental Results

A series of experiments has been carried out with the baseline NVIDIA model, its pruned variants and the proposed approach, as described below:

1. By decreasing the number of feature maps from 64 to 32 and from 64 to 16 while keeping the height and width constant, we pruned the network by 22.2% and 33.85%, respectively.
2. The transfer learning approach is adopted with the convolution blocks of VGG16, training only the last block (3 convolution layers and 1 max-pooling 2D layer).

Table 2 Performance comparison of learning models

S. No. | Model | MSE | Trainable parameters
1 | NVIDIA model | 29.24848 | 252,219
2 | NVIDIA model pruned by 22.2% with 1 × 1 filter | 41.61325 | 196,699
3 | NVIDIA model pruned by 33.85% with 1 × 1 filter | 38.67840 | 166,859
4 | VGG16 with transfer learning | 23.97599 | 10,373,505

The training of the models uses the stochastic gradient descent (SGD) [39] approach with Adam as the learning-rate optimizer [40]. For robust training, a fourfold validation technique is applied along with early stopping to avoid the problem of overfitting [41]. The performance of the models is evaluated using the mean squared error (MSE) given in Eq. 1:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i)^2 \qquad (1)

where y_i stands for the actual steering value and x_i stands for the predicted steering angle. A lower MSE indicates a higher learning ability, whereas a higher MSE means the model is not learning well from complex environments. The experimental results show that the pruned networks do not perform better than the baseline model. It is also observed that the proposed VGG16 model with transfer learning (training only the last convolution block) is trained within 40 epochs, compared to the other models, which are trained for 100 epochs. The experimental results thus indicate that the VGG16 model with transfer learning works better than the other NVIDIA models. As observed from Table 2, the novel transfer learning-based approach achieved a better MSE score than NVIDIA and its pruned variants. Due to the deep nature of VGG16, the architecture is able to learn complex patterns, whereas the shallowness of the NVIDIA models restricts their ability to adapt to such complex environment conditions. Figure 8 highlights the training behavior of the models at each iteration, where it is observed that the proposed approach achieved comparatively minimal loss with the least number of training epochs. It is also observed that, due to the adoption of trained weights, the model starts with a better loss and converges faster.
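For completeness, the evaluation metric of Eq. 1 can be computed in a few lines of NumPy; the values below are made-up numbers used only to check the arithmetic.

```python
# Short illustration of the MSE metric in Eq. 1.
import numpy as np

def mse(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean((actual - predicted) ** 2)

print(mse([10.0, -5.0, 0.0], [12.0, -4.0, 1.0]))  # (4 + 1 + 1) / 3 = 2.0
```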


Fig. 8 Validation-loss versus epochs of all the models

6 Conclusion

A novel approach based on transfer learning with VGG16 is proposed, which is fine-tuned by retraining the last block while keeping all the other layers non-trainable. The proposed model is compared with NVIDIA's architecture and its pruned variants developed by applying a 1 × 1 filter. Since the proposed transfer learning architecture starts with a minimal initial loss and converges in just 40 epochs, compared to NVIDIA's architecture which took 100 epochs, the experimental results show that the transfer learning-based approach works better than NVIDIA and its pruned variants. Naturally, driving patterns also depend on several other environmental conditions like weather and visibility. To handle these challenging conditions with a limited number of samples, generative adversarial networks (GAN) can be explored in the future to generate vivid weather conditions for more robust driver-less solutions.

References 1. Chopra R, Roy SS (2020) In: Advanced computing and intelligent engineering. Springer, Singapore, pp 53–61 2. Lee MJ, Ha Yg (2020) In: 2020 IEEE international conference on big data and smart computing (BigComp). IEEE, pp 470–473


3. Chen Z, Huang X (2017) In: 2017 IEEE intelligent vehicles symposium (IV). IEEE, pp 1856– 1860 4. Glasmachers T (2017) arXiv preprint arXiv:1704.08305 5. Bojarski M, Yeres P, Choromanska A, Choromanski K, Firner B, Jackel L, Muller U (2017) arXiv preprint arXiv:1704.07911 6. Bansal M, Krizhevsky A, Ogale A (2018) arXiv preprint arXiv:1812.03079 7. Sheel S (2017) Behaviour cloning (learning driving pat-tern)—CarND. https://medium.com/ @gruby/behaviour-cloning-learning-driving-pattern-c029962a0bbf. Accessed 04 June 2020 8. Pomerleau DA (1989) Advances in neural information processing systems, pp 305–313 9. W. Contributors (2020) Automotive safety. https://en.wikipedia.org/wiki/Automotive_safety. Accessed 04 June 2020 10. World Health Organization et al (2018) Global status report on road safety 2018: summary. Tech. rep., World Health Organization 11. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Neural Comput 1(4):541 12. Yosinski J, Clune J, Bengio Y, Lipson H (2014) Advances in neural information processing systems, pp 3320–3328 13. Torabi F, Warnell G, Stone P (2018) arXiv preprint arXiv:1805.01954 14. Sammut C, Webb GI (2011) Encyclopedia of machine learning. Springer 15. Michie D, Camacho R (1994) Mach Intell 13 16. Kulic R, Vukic Z (2006) In: IECON 2006—32nd annual conference on IEEE industrial electronics. IEEE, pp 3939–3944 17. Michie D (1993) In: Intelligent systems. Springer, pp 1–19 18. LeCun Y, Cosatto E, Ben J, Muller U, Flepp B (2004) Courant Institute/CBLL. Tech. Rep., DARPA-IPTO Final Report. http://www.cs.nyu.edu/yann/research/dave/index.html 19. Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B, Goyal P, Jackel LD, Monfort M, Muller U, Zhang J et al (2016) arXiv preprint arXiv:1604.07316 20. Viswanath P, Nagori S, Mody M, Mathew M, Swami P (2018) In: 2018 IEEE 8th international conference on consumer electronics-Berlin (ICCE-Berlin).IEEE, pp 1–4 21. Johnson F, Dana K (2020) Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 1002–1003 22. Koci´c J, Joviˇci´c N, Drndarevi´c V (2019) Sensors 19(9):2064 23. Krizhevsky A, Sutskever I, Hinton GE (2012) Advances in neural information processing systems, pp 1097–1105 24. Kim J, Canny J (2017) In: Proceedings of the IEEE international conference on computer vision, pp 2942–2950 25. Bain M, Sammut C (1995) In: Machine intelligence, vol 15, pp 103–129 26. Brownlee J (2019) A gentle introduction to 1 × 1 convolutions to manage model complexity. https://machinelearningmastery.com/introduction-to-1x1-convolutions-to-reduce-thecomplexity-of-convolutional-neural-networks/. Accessed 04 June 2020 27. Brownlee J (2019) Transfer learning in keras with computer vision models. https:// machinelearningmastery.com/transfer-learning-for-deep-learning/. Accessed 04 June 2020 28. Lin M, Chen Q, Yan S (2013) arXiv preprint arXiv:1312.4400 29. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9 30. Brownlee J (2020) A gentle introduction to transfer learning for deep learning. https:// machinelearningmastery.com/transfer-learning-for-deep-learning/. Accessed 04 June 2020 31. Punn NS, Agarwal S (2020) ACM Trans Multimed Comput Commun Appl (TOMM) 16(1):1 32. Punn NS, Agarwal S (2019) In: 2019 twelfth international conference on contemporary computing (IC3). IEEE, pp 1–6 33. 
Punn NS, Sonbhadra SK, Agarwal S (2020) arXiv preprint arXiv:2005.01385 34. Brownlee J (2017) A gentle introduction to transfer learning for deep learning. https:// machinelearningmastery.com/transfer-learning-for-deep-learning/. Accessed 04 June 2020


35. Simonyan K, Zisserman A (2014) arXiv preprint arXiv:1409.1556 36. Competition L (2009) Large scale visual recognition challenge (ILSVRC). http://www.imagenet.org/challenges/LSVRC/. Accessed 04 June 2020 37. He K, Zhang X, Ren S, Sun J (2016) In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 38. Chen S (2017) Driving dataset. https://drive.google.com/file/d/0BKJCaaF7elleG1RbzVPZWV4Tlk/view. Accessed 04 June 2020 39. Bubeck S (2014) arXiv preprint arXiv:1405.4980 40. Kingma DP, Ba J (2014) arXiv preprint arXiv:1412.6980 41. Caruana R, Lawrence S, Giles CL (2001) Advances in neural information processing systems, pp 402–408

Early Detection of Parkinson's Disease Using Computer Vision

Sabina Tandon and Saurav Verma

Abstract Non-invasive tests like handwriting tests using spiral patterns have been used over the years to study the writing speed and strength of patients affected by Parkinson's. Handwriting tests can be administered at various times during a patient's treatment to understand the severity of the disease, its progression, and the effect of treatment. In this paper, we study handwriting traces and analyze them using machine learning methods to distinguish Parkinson's patients from healthy subjects.

1 Introduction

According to the United Nations, approximately 16% of people worldwide suffer from various neurological problems ranging from Parkinson's, strokes, Alzheimer's, multiple sclerosis, dementia, and epilepsy to migraine, brain injuries, and neuro-infections, with approximately 9–14 million deaths each year. The UN World Health Organization (WHO) study "Neurological disorders: public health challenges" shows that people in all countries, irrespective of age, sex, education, or income, are affected [1]. On average, Parkinson's affects 50% more men than women. Parkinson's disease is a progressive, debilitating disease that affects the central nervous system. The root cause of the disease is the loss of dopamine in the substantia nigra (SN), which in turn leads to a decreased level of dopamine in the striatum. The striatum is a region of the human brain that is responsible for necessary human behaviors like decision-making, emotional and motivational behaviors, movement, and muscle control. Since the striatum manages various actions of the body, including voluntary motor control, a patient suffering from this disease experiences loss of coordination and control of the limbs and tremors that become debilitating with the passage of time. Over a few years, patients lose complete control of movement, often resulting in hospice care.


In India, currently, a million people suffer from Parkinson's at various severity levels. Parkinson's disease is identified by four primary symptoms: (1) tremors in the arms, legs, hands, and jaw; (2) rigidity, i.e., stiffness in the body, arms, and legs; (3) bradykinesia, i.e., slow movements; and (4) postural instability, i.e., loss of balance and coordination. The disease progresses slowly until the patient loses control over his movements, balance, and muscles, loses coordination between all parts of the body, and has difficulty using the hands, arms, and legs. Due to a lack of dopamine, patients often undergo personality changes. Some people also suffer from other problems like hallucinations, depression, anxiety, and compulsive and impulsive behaviors. Early symptoms of Parkinson's disease are very subtle, and it is possible to dismiss them easily as normal or as some minor temporary problem. Neurodegeneration progresses slowly over the years and may increase in intensity, for example as shaking in various limbs or tremors that may interfere with daily activities. Other symptoms may include difficulty in speaking, chewing, and swallowing, unrestful sleep, urinary problems, and skin changes.


People suffering from Parkinson's can also show some other secondary symptoms, like light-headedness or dizziness from suddenly standing up after sitting for some time. All of these indicate impaired communication between muscles and brain and loss of activity in the central nervous system. Different people exhibit different symptoms of this disease in different order and intensity of manifestation, which is why it is difficult to diagnose Parkinson's early. Typically, there are five identifiable and overlapping phases in the development of Parkinson's disease [2]:

Stage 1: The person has mild and sporadic symptoms that do not interfere with daily activities. Tremors and movement problems occur on one side of the body. There can be changes in walking and facial expressions.
Stage 2: Symptoms like tremors, rigidity, walking problems, and poor posture get worse with time and affect both sides of the body.
Stage 3: This stage is characterized by loss of balance and slow movements.
Stage 4: Symptoms get severe. The patient requires the use of walking aids like a walker for moving around. They are unable to live alone and require assistance for the smallest of daily activities.
Stage 5: This is the most advanced stage, in which the patient ultimately becomes unable to do anything on his own. The patient may experience dementia, hallucinations, or depression. They are entirely bedridden and need round-the-clock assistance.

2 Dataset

For this study, data has been taken from the UCI machine learning repository. The PD and control handwriting database consists of 62 people with Parkinson's (PWP) and 15 healthy individuals who appealed at the Department of Neurology in


Cerrahpasa Faculty of Medicine, Istanbul University. From all subjects, three types of handwriting recordings, namely the static spiral test (SST), the dynamic spiral test (DST), and the stability test on a certain point (STCP), are taken. The handwriting dataset was constructed using a Wacom Cintiq 12WX graphics tablet (Hahne et al. 2009), which is basically a graphics tablet and LCD monitor rolled into one: it can display a PC's screen on its monitor and interacts only with digitized pens. Special software was designed for recording handwriting drawings and testing the coordination of the PD patients using the recordings. In this study, three different kinds of tests were developed for data collection via the graphics tablet. The first one is the static spiral test (SST), which is frequently used in clinical research for different purposes, like determining motor performance (Wang et al. 2008), measuring tremor (Pullman 1998), and diagnosing PD (Saunders et al. 2008). In this test, three wound Archimedean spirals appear on the screen of the tablet, and patients were asked to retrace the same spiral as well as they could using the digital pen. The second test is the dynamic spiral test (DST). Unlike in the SST, the Archimedean spiral appears and disappears at certain time intervals, in other words, the spiral blinks. This forces the patient to keep the pattern in mind and continue to draw. The purpose of this test is to determine the change in the patient's drawing performance and pause times, since it is more difficult to retrace the Archimedean spiral in this case. The third test is the stability test on a certain point (STCP). In this test, there is a red point in the middle of the screen, and the subjects are asked to hold the digital pen on the point without touching the screen for a certain time. The purpose of this test is to determine the patient's hand stability or hand tremor level [3, 4]. The dataset of drawings is divided into training and testing sets for both classes of subjects, PWP and healthy. The training dataset was used for feature extraction and training of the model, and the testing dataset was used to derive test accuracies. Images of spiral drawings of PWP and healthy subjects are shown in Figs. 1 and 2.

3 Literature Review Considering the disabling, chronic, and progressive nature of this disease, it is of utmost importance to have methods for early detection and treatment. Over the years, there has been extensive research in Parkinson’s, resulting in the design and development of various methods to improve testing and effective diagnosis, prevent further deterioration, treatment, and care of patients. There has also been extensive research to develop new drugs and clinical trials of these drugs in different parts of the world. Other innovative ways of treatment have also been invented, studied, and researched extensively and are being used to treat people. Some of the innovative methods for diagnosis and treatment are as follows:


Fig. 1 Spiral test on patients with Parkinson’s

Fig. 2 Spiral test on healthy people


1. One current theory is that the earliest signs of Parkinson's are found in parts of the brain, the enteric nervous system, the medulla, and the olfactory bulb, which controls the sense of smell [2]. One method of earliest diagnosis can therefore be based on testing the capability of the sense of smell.
2. Discovery of biomarkers that can identify molecules of the protein alpha-synuclein, which accumulates in the brain, initially causing loss of the sense of smell, and later migrates to the substantia nigra, resulting in chronic Parkinson's.
3. The protein associated with Parkinson's can also be found in different parts of the body, like the skin, colon, and gut.
4. A brain-imaging test called DaTscan can be used to identify dopamine activity in the brain. The scan can identify a reduction in the number of dopamine cells through a reduction in dopamine transport. Diagnosis of Parkinson's using DaTscan and clinical diagnosis have been found to be equally accurate.
5. Development of a smell test for identification of the disease. Sebum is an oil that humans excrete from the skin, and the presence of specific molecules in the sebum can help in early diagnosis of the disease.
6. One promising treatment is deep brain stimulation to treat movement problems.


7. Clinical trials are underway in Japan to treat Parkinson's using engineered stem cells from the patient itself. In this approach, cells from skin or blood are harvested and treated to change back to pluripotent stem cells. They are programmed to develop as dopamine receptor cells and injected into the brain to increase dopamine levels. This method is currently being tested on various animals before clinical trials.
8. What we eat and environmental factors have an effect on our health. Foods high in sugar and carbohydrates and low in essential nutrients have a detrimental effect on our health, while nutritionally balanced diets like the Mediterranean diet slow down health issues. The ketogenic diet is helpful in controlling another debilitating brain condition called epilepsy. Other helpful diets are low in protein and high in vegetables, and intermittent fasting also helps the body.
9. Developing new drugs is a tedious and expensive process, with timelines that can run into years. A lot of existing drugs used to treat certain conditions have been found to treat other conditions effectively; this is called drug repurposing.

In order to identify people who have Parkinson's, a handwriting trace (HT) gathered from the control group and patients is compared to an exam template (ET). Various metrics are calculated to infer the type of patient for each type of handwritten trace, and all data is analyzed using the random forest classification algorithm. The handwriting trace and exam template are stored as picture files. OpenCV is a Python library used for image processing and classification; using this library, handwriting and exam trace templates are read and analyzed. Each image has to be preprocessed before applying any algorithm. Preprocessing includes resizing all images to a common size, applying filters for edge detection, and normalizing them. For edge detection, Canny is used, which is currently the most versatile filter. Normalization helps to bring features with higher values and features with lower values onto a comparable scale for easier comparison, and most algorithms work better with scaled values. The complete dataset is split into training and testing sets. For analysis of the training dataset, the preferred algorithm is random forest.
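The following is a hedged sketch of this pipeline, not the authors' implementation: each handwriting-trace image is resized, passed through the Canny edge detector, normalized, flattened and fed to a random forest classifier. The 64 × 64 target size, the Canny thresholds and the synthetic stand-in images are illustrative assumptions; real traces would be loaded with cv2.imread from the spiral-drawing dataset described in Sect. 2.

```python
# Hedged sketch of the preprocessing + random forest pipeline (assumed parameters).
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def trace_features(image, size=(64, 64)):
    """Resize, edge-detect (Canny) and normalize one grayscale handwriting trace."""
    resized = cv2.resize(image, size)
    edges = cv2.Canny(resized, 100, 200)  # edge detection
    return edges.flatten() / 255.0        # normalized feature vector

# Synthetic stand-ins for loaded spiral images and labels (1 = PWP, 0 = healthy).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(20, 120, 120), dtype=np.uint8)
labels = rng.integers(0, 2, size=20)

X = np.array([trace_features(img) for img in images])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X[:3]))
```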

In order to identify people who have Parkinson's, the handwriting trace (HT) gathered from the control group and the patients is compared to an exam template (ET). Various metrics are calculated to infer the type of patient for each type of handwritten trace. All data is analyzed using the random forest classification algorithm. The handwriting trace and exam template are stored as picture files. OpenCV is a Python library used for image processing; using this library, the handwriting and exam trace templates are read and analyzed. Each image has to be preprocessed before applying any algorithm. Preprocessing includes resizing all images to a common size, applying filters for edge detection, and normalizing them. For edge detection, the Canny detector is used, a widely used and versatile filter. Normalization scales features with higher value ranges to be comparable with features having lower ranges, which makes comparison easier; most algorithms work better with scaled values. The complete dataset is split into training and testing sets. For analysis of the training dataset, the preferred algorithm is random forest.
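As an illustration of this preprocessing stage, a minimal Python sketch using OpenCV is shown below. The file name, target size, and Canny thresholds are illustrative assumptions, not values taken from the paper.

```python
import cv2
import numpy as np

def preprocess_drawing(path, size=(128, 256)):
    """Load a spiral drawing, resize it, detect edges and normalize the result."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)            # read as grayscale
    image = cv2.resize(image, size)                           # common size for all drawings
    edges = cv2.Canny(image, threshold1=50, threshold2=150)   # Canny edge detection
    features = edges.astype(np.float32) / 255.0               # scale values to [0, 1]
    return features.flatten()

# hypothetical file path, for illustration only
vector = preprocess_drawing("spiral_patient_01.png")
```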

3.1 Algorithms

3.1.1 Random Forest

The random forest algorithm was first proposed by Breiman [5]. In his paper, Breiman defined a random forest classifier as one consisting of a group of tree-structured classifiers, each generating a classification vote that is independent of the other trees in the ensemble. The result is obtained by finding the most popular class. A random forest model consists of multiple decision trees. It is based on three things—a random sample of data points, a random sample of features, and averaged results. Random forest uses two key concepts—(1) multiple random samples are created from the data points of the training dataset, and (2) multiple datasets are created with different features from the training dataset. The samples are made with replacement, which means that data points can be used in multiple samples. This is done to reduce variance and bias. Decision tree classification models have low bias and high variance and suffer from overfitting of the training data. The model tries to split the training data along data points and features. The random forest model is based on two concepts—bagging and subspace sampling.

• In bagging, also called bootstrap aggregation, multiple datasets are created from the original dataset as samples. Each sample has the same number of data points as the original dataset, drawn with replacement. The random forest model is trained on these multiple sample datasets, and different results are drawn for regression and classification models. For a regression model, the mean or median is used; for a classification model, the mode is used. Averaging many models reduces the variance and increases prediction accuracy. Bagging is a technique used to reduce the variance component of an estimated prediction function. The essential idea is to average many noisy predictions to reduce the variance [6].
• The second concept is subspace sampling. A random forest model created only with bagging produces similar results because each data point in the dataset is used multiple times. In order to produce varied results, some features are used instead of all features in the dataset. This is based on the premise that certain features in the dataset may be more important than the rest. Even with samples with different data points and features, some features emerge as more important; this can be measured by the gain in information between the two models.
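These two ideas, bootstrap samples of data points and random subsets of features, map directly onto scikit-learn's RandomForestClassifier parameters. A minimal sketch on synthetic data follows; the feature matrix here is a stand-in, not the spiral dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# stand-in data: 200 samples, 64 features, two classes (healthy vs PWP)
X, y = make_classification(n_samples=200, n_features=64, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,     # number of trees whose votes are aggregated
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of data points
    max_features="sqrt",  # subspace sampling: a random subset of features per split
    random_state=0,
)
model.fit(X, y)
print(model.predict(X[:5]))
```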

3.1.2 AdaBoost

Boosting is one of the most powerful techniques for creating ensemble models. It is considered powerful because it sequentially combines the results of a group of weak classifiers to create a strong model that is a vast improvement over each weak model. A weak classifier is one whose error rate is only slightly better than random guessing [6]. The predictions from all of them are combined through a weighted majority vote to produce the final prediction. In the AdaBoost algorithm, a weight is assigned to each training observation that determines how significant it is in the training set. The larger the weight, the more the observation contributes during the model creation process, and vice versa. A classification model is evaluated on the number of classes identified correctly [7–10].
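A minimal AdaBoost sketch in scikit-learn, again on stand-in data; by default the weak learner is a depth-1 decision tree (a stump), and each boosting round re-weights the misclassified observations as described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=64, random_state=0)

# default weak learner: a decision stump; 50 sequential boosting rounds
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```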


3.2 Data Collection The objective of this study is to administer a non-invasive test that may help in the early diagnosis of Parkinson's in the test subjects. It is a simple test in which the subject has to trace a spiral pattern on a tablet using an electronic pen. For a normal healthy human being, this test would not be a challenge, and the spirals would be drawn easily. For a person who has Parkinson's, the stiffness in the arms, wrists, hands, and fingers would prevent them from making a clear pattern. For the purpose of this study, we needed pictures of spiral pattern drawings of patients with Parkinson's, both to diagnose the disease well in time and to assess the extent of its progression in case the person is suffering from it.

3.3 Methodology The methodology used to read the drawings and identify patterns is the histogram of oriented gradients with random forest. The histogram of oriented gradients is a powerful method for selecting features from images. The random forest model is then used to build a classifier based on the features found and to predict the probability of a drawing belonging to a healthy person or to someone who suffers from Parkinson's. To analyze the drawings, it is essential to extract correct and relevant features and discard noisy ones. Since we are analyzing images, we use the distribution of the directions of gradients to find useful features. The gradients found in an image are important features for identifying images and, in this case, drawings. The procedure for feature extraction by applying the histogram of oriented gradients and analyzing the drawings is shown in the flowchart in Fig. 3.

3.4 Analysis In order to do the analysis of the images, the following steps have to be performed:

1. Preprocessing of images—all images are cropped and scaled to a common size. The images are scaled to maintain an aspect ratio of 1:2, with one unit of width and double the height.
2. Calculate the gradients—the following kernels are applied to calculate the horizontal and vertical gradients. The purpose of applying gradients is to remove noise from the image.
3. Image partitioning—the image is partitioned into 8 × 8 cells for analysis and for removing noise from the portion covered by each cell.
4. Compute the histogram of oriented gradients—HOG is calculated for each of the cells. To find the direction of the gradient inside a cell, we build a histogram for that portion of the cell. In total, we get 64 values (8 × 8) of gradient directions and another 64 values (8 × 8) of magnitude.
5. Build the histogram—there are three cases to be considered while building the histogram of oriented gradients:
   • The angle is less than 80° and less than halfway between two classes. In this case, the angle is added to the right-hand side of the histogram.
   • The angle is less than 80° and precisely between two classes. In this condition, we consider equal division between the two closest classes and divide the magnitude into two halves.
   • The angle is greater than 80°. In this case, the pixel contributes equally to 80° and 0°.
6. Normalization of block—in order to normalize the image, each value of the 8 × 8 HOG is divided by the L1 norm.
7. Creating the model—the random forest classifier is applied on the resulting values to create and train the model on the input images.
8. Predicting based on the model—use the trained model to predict a new image.

Fig. 3 Methodology (flowchart steps: Start; Read Image; Convert Image to Grayscale; Resize Image; Apply Thresholding; Apply HOG – Magnitude & Orientation; Train the model – Random Forest; Predict new Image using Model; Drawing classified as being from a healthy or diseased person; Stop)
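A compact sketch of this HOG-plus-random-forest pipeline using scikit-image and scikit-learn is given below. The file names, labels, and HOG parameters (9 orientation bins, 8 × 8 cells, L1 block normalization) are illustrative assumptions chosen to mirror the steps above, not the exact settings used by the authors.

```python
import numpy as np
from skimage.feature import hog
from skimage.io import imread
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def hog_features(path):
    """Grayscale, resize to a 1:2 aspect ratio and compute HOG over 8x8 cells."""
    image = imread(path, as_gray=True)
    image = resize(image, (256, 128))        # height is twice the width
    return hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),       # 8 x 8 cells as in the text
               cells_per_block=(1, 1),
               block_norm="L1")              # L1 normalization of blocks

# hypothetical file paths and labels (0 = healthy, 1 = Parkinson's)
train_paths, train_labels = ["healthy_01.png", "patient_01.png"], [0, 1]

X = np.array([hog_features(p) for p in train_paths])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, train_labels)
print(clf.predict([hog_features("new_drawing.png")]))   # predict a new image
```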

Table 1 Results

Algorithm | Precision | Recall | Accuracy
Random forest classifier | 0.82 | 0.80 | 0.80
AdaBoost classifier | 0.80 | 0.80 | 0.80
SVM + AdaBoost classifiers | 0.76 | 0.77 | 0.76

4 Implementation This implementation of image classification has been done in Python in the Google Colaboratory environment, using a CPU. The Python libraries used are NumPy, pandas, scikit-learn, and scikit-image.

5 Results The model was trained using images of spirals from 62 persons suffering from Parkinson's and 15 healthy people. The dataset was divided into training and validation datasets, with 80% in training and the rest in validation. Two separate models were created—one using random forest and the other using AdaBoost. The models were trained and tuned using cross-validation on the training dataset. Table 1 shows the accuracies achieved by the various learning models. The results indicate that the bagging and boosting algorithms perform best. In both ensemble techniques, a set of weak learners (low bias and high variance) is combined to create strong models that deliver better results with high accuracy. The main causes of error in machine learning models are noise, bias, and variance; ensemble models help in minimizing these factors. Bagging aims to decrease variance, and boosting aims to reduce bias.

6 Conclusion As part of the early diagnosis of Parkinson's, people undergo a non-invasive handwriting exam. This exam is based on tracing spiral patterns using a digitized device. If a person is healthy, the trace will be clean and precise. If a person suffering from Parkinson's traces the pattern, the trace will be uneven because the patient is gradually losing motor control. The purpose of this project was to create a model able to predict whether a person actually suffers from Parkinson's or not. The dataset consisted of spiral patterns traced by healthy as well as Parkinson's subjects. The main features of the images were extracted using the histogram of oriented gradients. The features were then used


to build and train models for prediction. Then, test images from both healthy and diseased people were used to predict if they have the disease or not. Accuracies achieved in this project can be improved upon in future by using better algorithms to process images and extract features.

References
1. https://news.un.org/en/story/2007/02/210312-nearly-1-6-worlds-population-suffer-neurological-disorders-un-report
2. https://www.parkinson.org/Understanding-Parkinsons/What-is-Parkinsons/Stages-of-Parkinsons
3. Isenkul ME, Sakar BE, Kursun O (2014) Improved spiral test using digitized graphics tablet for monitoring Parkinson's disease. In: The 2nd international conference on e-health and telemedicine (ICEHTM-2014), pp 171–175
4. Erdogdu Sakar B, Isenkul M, Sakar CO, Sertbas A, Gurgen F, Delil S, Apaydin H, Kursun O (2013) Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE J Biomed Health Inform 17(4):828–834
5. Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
6. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer, New York
7. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system, pp 785–794. https://doi.org/10.1145/2939672.2939785
8. http://rob.schapire.net/papers/explaining-adaboost.pdf
9. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139
10. Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28:337–407. https://doi.org/10.1214/aos/1016218223

Sense the Pulse: A Customized NLP-Based Analytical Platform for Large Organization—A Data Maturity Journey at TCS Chetan Nain, Ankit Dwivedi, Rishi Gupta, and Preeti Ramdasi

Abstract With an ever-growing total strength of 4.53 lakh employees at the end of September 2020, the company sees growth in the future [1]. Aligned to TCS's data-driven business approach and data maturity operating target model (DATOM™), this exercise has built a scalable and customizable platform. It integrates data preparation and preprocessing frameworks, AI-ML-based sentiment analysis algorithms, an analytical data model and a visualization tool. The platform interprets emotions within unstructured text curated from performance appraisal conversational data, categorizing them as positive, negative or neutral; tapping this previously untapped potential gives an overview of employee opinions pan TCS. The target consumers are business group heads and HR management. These insights serve as an important guideline to strategically plan HR initiatives and actions toward balancing employee satisfaction and reducing the attrition rate in the organization.

1 Introduction Large organizations like Tata Consultancy Services pay great attention to employee satisfaction, career progression, learning opportunities and related activities. HR interventions are an important and integral part of the employee journey within the organization.

C. Nain · A. Dwivedi · R. Gupta · P. Ramdasi (B) TCS Data Office, Analytics and Insights Department, Tata Consultancy Services, Pune, Maharashtra, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_16


The performance appraisal system is one of the important and mature processes being practiced, where employees converse with their supervisors and managers. It is a free-flowing exchange of opinions, views and comments on those views. A performance appraisal (PA) is also referred to as a performance review, performance evaluation, career development discussion or employee appraisal. While practices are already established to digitize such conversations, discussions and feedback can be initiated at any time during the year, making it a continuous process, in contrast to the earlier yearly or half-yearly activity. At the same time, as the world is getting ready for digital transformation, TCS has brought the trademarked DATOM™ model to the market to analyze the data maturity level of any organization. An organization's data maturity level is an extremely important parameter to know before the start of the digital journey. With existing processes and the availability of data systematically flowing to the intended stakeholders, the first maturity level is achieved for the said performance appraisal system.

Business Challenges
• Employees' thoughts, opinions, facts and feedback are captured digitally through the in-house Ultimatix SPEED application. Ensuring the technical capabilities (1 ml), it is important to traverse the data maturity levels one by one as defined by DATOM™.
• As the employee strength grows exponentially, it is important for HR to take the help of the latest tools and technology to strategically plan initiatives, take data-driven decisions and continuously monitor and analyze key performance indicators and risk indicators.
• It is a globally known fact that a business grows well and much faster when employees are happy and have a positive attitude; thus the need for a holistic view of employee satisfaction.
• In the process, a lot of textual conversation takes place between the appraisee and the appraiser. Generally, the conversation is highly context specific and related to the project on which the associate is working. This makes the problem of understanding the sentiments around the continuous conversation more challenging, as the collective text involved is of a diversified nature. The problem becomes even more challenging with the ever-growing size of the dataset: in a big organization like TCS, with 450,000 plus employees and around 30–40 lines of text comments for each associate, this calls for a highly customized, organization- and context-specific sentiment analysis model for understanding the overall sentiment of the organization.

The experiment mentioned in this paper addresses the above business challenges. The paper discusses the solution delivered to address the given challenges in terms of the four pillars of the DATOM™ model: data, technology, process and people. The analytical architecture, the methodology and the deployment architecture are explained in detail, followed by an example. A sample dashboard that displays a few of the key performance indicators generated through a visualization tool adds to the knowledge.


2 Literature Review Summary Applying sentiment analysis techniques to mine the huge amount of data from the appraisal process has become an important research problem. Business organizations are putting in efforts to improve techniques for sentiment analysis. Although some algorithms have been used in sentiment analysis with good results [2–4], there is still no technique able to find the sentiment of comments specific to the appraisal process of TCS. Further work is needed to improve both the accuracy of the sentiment classification and the rate at which this huge amount of information can be processed.

3 Solution Delivered 3.1 Data Siloed data exists with the HR department and individual BUs; this data is referred to develop an analytical data model. Thus, some of the information management functions are established, namely data architecture and modeling, and data quality.

3.2 Technology
• AI-based natural language processing [5] algorithms are customized to suit TCS internal data, processes and environment. This uncovers the insights hidden in the word streams.
• Talend and SAP HANA [6] are the ETL tool and landing space for sourced data.
• The Qlik Sense visualization tool is used to develop automated self-service analytical dashboards.
• Python is used for developing advanced analytical solutions.
• The application is available to be used on mobile phones and laptops.

Service | Software
Front end reporting | Qlik Sense
Back end—database | HANA Studio version 2.3.24
Back end—text mining and sentiment analysis | Python 3.5.9


Fig. 1 Functional diagram of the process

3.3 Process A simplified data landscape is established through data office processes. This connects data architecture to the consumption layer. RAD—Rapid application development approach is taken.

3.4 People Established a preliminary level of 'strategic visibility' to further reach operational excellence, the next level of data maturity. The application is consumed globally by the HRs of the respective IOUs and geographies. Looking at the benefits, stakeholders are moving toward improving data quality, and hence, no data is discarded.

a. High-Level Functional Diagram

A high-level functional architecture illustrates the long-term vision of the target state architecture (Fig. 1).

b. Functional Components

Continuous feedback stored in the PI DB database is the core data source for sentiment analysis (Fig. 2).
Data Ingestion and Integration: This is done by creating a view in HANA and establishing a connection between the same database and Qlik Sense.
Reports and Dashboard: The dashboards generate reports by fetching data from HANA CVs. The Qlik app is refreshed on a daily/weekly/monthly basis by a scheduled task implemented in Qlik Sense. Reports and dashboards are accessible to authorized users in the Ultimatix portal through a browser.
Python Component: The Python code takes input from the HANA database and stores the processed data in a new table within the same database. The Python code is executed automatically on a daily basis through the Rundeck scheduler.


Fig. 2 Functional components

c. Analytical Architecture

Sentiment analysis is the extraction of thoughts, attitudes and subjectivity from script or text to identify polarity, i.e., positive, negative or neutral [7]. There are three methods available for sentiment analysis: supervised, lexicon-based and a hybrid approach, where the supervised method surpasses the lexicon-based method in performance and the hybrid is a combination of both. The performance of the supervised method is extremely reliant on the quality and size of the training data, while, on the other hand, several lexical items appear positive in the text of one domain while appearing negative in another; therefore, lexicon-based analysis does not yet have high accuracy, and optimizing it is still a very interesting research topic in the domain of sentiment analysis. The labeled comments were provided to us by the GSPEED team. The data consisted of 5000–6000 labeled comments. This data was refined using a Python algorithm to remove articles, pronouns and some other neutral words. TCS-specific words such as SLA, need to, personal calls and a few other words were included. The data used for creating the lexicons was randomly picked from the last 5–6 years so as to get a good combination. Application of a lexicon is one of the two main approaches to sentiment analysis, and it involves calculating the sentiment from the semantic orientation of the words or phrases that occur in a text. With this approach, a dictionary of positive and negative words is required, with a positive or negative sentiment value assigned to each of the words. Generally speaking, in lexicon-based approaches, a piece of text is represented as a bag of words. Following this representation of the message, sentiment values from the dictionary are assigned to all positive and negative words or phrases within the message. A combining function, such as sum or average, is applied in order to make the final prediction regarding the overall sentiment of the message.


Apart from the sentiment value, the local context of a word is usually taken into consideration, such as negation or intensification. The fact that a lexicon-based approach can be more easily understood and modified by a human is considered a significant advantage for our work. We found it easier to generate an appropriate lexicon than to collect and label a relevant corpus. The following five sub-sections describe in detail the development of the algorithm applied in this study.

d. Methodology

The section below explains the approach used to analyze the comments exchanged between the appraisee and the manager (Fig. 3).

Preprocessing
Preprocessing helps in cleaning the data by removing noisy and unwanted text. The preprocessing block is formed by the following subcomponents:
i. Tokenization: The tokenization block helps in breaking the complete sentence into small tokens, and these tokens are then used to determine the overall sentiment of the sentence.
ii. Punctuation Removal: Punctuation by and large does not give much helpful information. This step therefore deletes the punctuation characters from the sentences.
iii. Case Conversion: Case conversion helps in changing the text into a uniform case, be it upper or lower.
iv. Stop Word Removal: Stop words comprise prepositions, auxiliary verbs, articles, etc. They commonly do not contribute to analyzing opinions and are eliminated from the content.

Fig. 3 Methodology flow diagram

Stemming
Words that have the same origin or the same root word and appear in a different form can be replaced with a generic word. The stemming [8] step helps in avoiding redundant labeling of words and helps in faster identification of words.

Sentiment Library
The lexicon library constructed contains about 7000+ words and was developed by the associates keeping the VADER sentiment library as a baseline. Each word of the library is assigned a sentiment, which in turn is calculated as the average of the sentiments denoted by 10 different associates who were part of the team designated to prepare this library. The value assigned in the lexicon represents sentiment in the range of −4 (most negative) to +4 (most positive).

Sentiment Computation Module
In this step, every comment given in the system is assigned a score by adding the sentiments of the tokens of each word in the comment. This score is then used to assess the appraiser's or appraisee's sentiment. We take a variable 'n' in which the total polarity of positive and negative words in the appraiser's comment is accumulated. For example, if the word 'abstruse' comes in a sentence only one time, its frequency is 1 while its disposition is negative; hence, the sentiment of the word abstruse is −1. Then again, the words extraordinary, capable and useful also appear once, so their frequency is 1. Since their sentiment is positive, the sentiment of each of these words is 1. The sentiment score is computed by adding the sentiments of all positive and negative words while ignoring the neutral ones; the final score is the summation of the sentiment of every opinion word in an input.

Opinion Classifier
The opinion classifier helps us in labeling the entire sentence as positive, neutral or negative. The overall sentiment calculated is normalized such that it lies between −1 and +1. Then, based on the overall score 'N,' the classification is done on the following basis:
• If N > 0.05, then positive.
• If −0.05 < N < +0.05, then neutral.
• If N < −0.05, then negative.
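To make the computation and classification steps concrete, a minimal lexicon-based sketch is given below. The lexicon entries and their values are toy examples (the real library described above contains 7000+ TCS-specific words), and the squashing function used to map the raw sum into [−1, +1] is a VADER-style normalization assumed here for illustration.

```python
# toy lexicon; values follow the -4 (most negative) to +4 (most positive) scale
LEXICON = {"good": 1.9, "support": 1.7, "escalation": -1.5, "unnecessary": -1.3}
NEGATIONS = {"no", "not", "n't", "never", "none"}

def classify(comment, alpha=15.0):
    tokens = comment.lower().replace(".", " ").split()
    score = 0.0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0.0)                 # neutral words contribute 0
        if i > 0 and tokens[i - 1] in NEGATIONS:      # local context: flip on negation
            value = -value
        score += value
    normalized = score / ((score ** 2 + alpha) ** 0.5)   # squash into [-1, +1]
    if normalized > 0.05:
        return "positive", round(normalized, 3)
    if normalized < -0.05:
        return "negative", round(normalized, 3)
    return "neutral", round(normalized, 3)

print(classify("Accepted. Keep up the good work"))        # positive
print(classify("Avoid unnecessary internal escalation"))  # negative
```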

e. Physical/Deployment Architecture


The Python script can be run on a server containing Python with the required libraries installed. Instead of running it directly on the server, we can run it through a Talend job. The Talend job contains a component that runs the execution command of the Python file. With the help of a context file (in which the server credentials and password are stored in encrypted form), it connects to the Python server and processes the code. The code connects to the HANA database, takes the input from HANA, processes it and connects to HANA again, and the output is stored in HANA. From the HANA table, data is loaded to Qlik views, from where Qlik Sense retrieves the data (Fig. 4).

f. Example

The following set of selected statements is taken as an example to explain their categorization as positive, neutral, negative and compound. Each statement has these four parameters tagged to it; the parameter with the maximum score is tagged to the statement and contributes to the overall score of the conversation (Table 1).

Fig. 4 Physical architecture

Table 1 Sample conversational statements and categorization

Statement | Positive | Neutral | Negative | Compound
All the appraisal related activities were completed on time | 0.39 | 0.61 | 0.0 | 0.5393
Please plan to meet the target T factor | 0.0 | 0.71 | 0.286 | −0.34
Accepted. Keep up the good work | 0.367 | 0.0 | 0.0 | 0.444
Good to hear. Continue your support in forthcoming days too | 0.265 | 0.556 | 0.0 | 0.6808
Avoid unnecessary internal escalation | 0.0 | 0.577 | 0.423 | −0.296


Table 2 Confusion matrix of predicted and actual scores for the sample statements

Actual \ Predicted | Negative | Neutral | Positive | All
Negative | 61 | 1 | 16 | 78
Neutral | 26 | 15 | 52 | 93
Positive | 3 | 3 | 138 | 144
All | 90 | 19 | 206 | 315

Fig. 5 Sentiment analysis dashboard developed using Qlik sense visualization tool

Confusion Matrix
The confusion matrix [9] shown in Table 2 compares the predicted and the actual scores for the sample statements given in Table 1.

Sample Dashboards
The following sample visualization dashboard displays a typical set of yearly data, selected for a period of three months and termed quarterly data. The overall score is shown through a donut chart. The entire dataset is available for performing self-service analytics, such as employee grade-wise or geography-wise sentiment analysis, as per the business goals of the stakeholders (Fig. 5).

4 Results and Analysis To evaluate the said customized sentiment analysis classification model, the multiclass classification metrics of precision, recall and F1 score [10] were used. For a given English statement, the classifier would return the sentiment as either positive,

Table 3 Confusion matrix for the sentiment analysis model on the sample dataset

Actual \ Predicted | Negative | Neutral | Positive
Negative | 588 | 58 | 126
Neutral | 134 | 635 | 167
Positive | 136 | 155 | 1151

Table 4 Measures to evaluate model accuracy

Evaluation criteria | Experimental values
Precision | 0.79
Recall | 0.86
F1-score | 0.82

negative or neutral. Additionally, to visualize the performance of the model in tabular format, the confusion matrix (error matrix) is also shown below. The original sample dataset annotated by experts contains 3150 comments, of which 1442 are positive, 772 negative and 936 neutral. From the confusion matrix, we can see that out of 1442 positive comments, 1151 were correctly classified as positive; out of 772 negative comments, 588 were correctly classified as negative; and out of 936 neutral comments, 635 were correctly classified as neutral (Table 3). In statistical analysis of classifier performance, precision is the number of true classifications divided by the total number of elements labeled as that class (including both correct and incorrect classifications). Recall is the number of true classifications divided by the total number of elements that are known to belong to the class; low recall is an indication that known elements of a class were missed. The F1-score is the harmonic mean of precision and recall and represents the overall accuracy (Table 4). Precision indicates how precise the model is: out of the predicted positives, how many are actually positive. The proposed sentiment analysis model is doing reasonably well with a precision value of 0.79. Recall measures how many of the actual positives the model captures by labeling them as positive (true positives). The recall of the proposed model on the sample dataset was 0.86, which means it performs even better for the positive class; this can also be verified from the statistics shown in the confusion matrix. Finally, we have the F1-score, which takes both precision and recall into account to measure the overall accuracy of the model. The F1-score of the model was 0.82, which indicates that the overall model accuracy on the sample dataset is good.
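As a quick sanity check, the reported F1-score follows directly from the reported precision and recall values via the harmonic mean:

```python
precision, recall = 0.79, 0.86
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))   # 0.82, matching Table 4
```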


5 Non-functional Requirements The mentioned platform also addresses the following non-functional requirements for the application to be reusable and scalable.
a. Reduced Complexity: The code is scalable to the extent of implementing segmentation [11], thereby becoming aware of the load put on the server per unit time. Incremental data load implemented at the database end fetches the differential data loaded in the backend through Python code.
b. Bandwidth Requirement: The bandwidth required by the server to run this code depends on the segment size. It is kept to the maximum to reduce time lost in thrashing and to further reduce the time and effort in fetching the data one by one. Stress testing revealed that the code can fetch 75 lakh records in one go.
c. Scalability for Future Growth: The code as well as the server is very much scalable for multiple and future applications. The platform provides ease in incorporating new algorithms on top of the existing ones or even replacing the existing algorithms.
d. Horizontal Scaling: The system is scalable horizontally. As the dataset grows, having two or more servers instead of one will help in load division and balancing. Horizontal scaling [12] will increase the performance and help in achieving real-time results.
e. Vertical Scaling: The system is vertically scalable [13] and can implement hardware enhancement as and when required.
f. Data Archival and Purging: No data is stored on the Python server, so there is no need for data archival or purging [14].

Reusability
The model that has been built and improved can be reused to derive sentiment and empathy scores [15] for various use cases. For any domain or industry, it will only require some customization of the lexicon/dictionary built as a domain/industry-specific dictionary. The common keywords can be kept as they are, whereas domain/problem-specific keywords may need to be changed and added to enrich the dictionary.

6 Conclusion The mentioned analytical platform based on AI-ML NLP techniques has brought forward a foundation for an automated, intelligent and cognitive-driven organization. It has delivered a noticeable business value as these are small steps toward creating a


digitally empowered, de-siloed, collaborative organization. In one of the areas, the digital data maturity has grown to the second level, as the data is being used systematically for the purpose. This has crossed the silos, and the insights are being consumed for a larger purpose. The given method and approach are very much specific to the culture of the said organization, TCS; however, there are scalable components that increase the reusability of the platform as well as reduce the complexity of building a new domain-specific solution. The solution also addresses other non-functional requirements as mentioned in the previous section. Reusability within TCS: There are multiple opportunities within TCS through different channels such as pulse survey analytics and feedback analytics captured for every survey, training, webinar and event. An example of external reusability, mentioned for the reader's detailed study, is deriving empathy scores of relationship managers in banks, who provide various services via telephonic conversations. There is a huge scope in multiple areas, where domain-specific customization is the key to success. To mention a few, they are brand monitoring, improving customer support, product analytics, monitoring market research, analyzing the competition and uncovering brand influencers. Acknowledgements We acknowledge the Head of the TCS Data Office for support, encouragement and mentorship, the solution architect and the entire team who have put efforts into continuously exploring the best-optimized algorithms and reviewing the results against the set benchmark. We also acknowledge the business groups, HR and the end consumers of the analytical dashboards for their valuable inputs.

References
1. https://economictimes.indiatimes.com/markets/expert-view/we-have-a-talent-revolutiongoing-on-in-tcs-milind-lakkad/articleshow/71535411.cms
2. Liu B (2012) Sentiment analysis and opinion mining. Morgan & Claypool, San Rafael, CA
3. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1):1–135
4. Hutto CJ, Gilbert E (2015) VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of the 8th international conference on weblogs and social media. ICWSM 2014
5. Gurusamy V, Subbu K (2014) Preprocessing techniques for text mining. Int J of Comput
6. Sree R (2019) https://www.talend.com/blog/2019/11/06/how-to-snowflake-query-pushdowntalend/. Accessed Nov 2019
7. Liu B (2009) Handbook chapter: sentiment analysis and subjectivity. In: Handbook of natural language processing. Marcel Dekker, Inc., New York
8. Pande B, Dhami H (2011) Application of natural language processing tools in stemming. Int J Comput Appl 27. https://doi.org/10.5120/3302-4530
9. Carrell D, Halgrim S, Tran D-T, Buist D, Chubak J, Chapman W, Savova G (2014) Using natural language processing to improve efficiency of manual chart abstraction in research: the case of breast cancer recurrence. Am J Epidemiol 179. https://doi.org/10.1093/aje/kwt441


10. Powers D, Ailab (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2:2229–3981. https://doi.org/10.9735/2229-3981
11. Pak I, Teh P (2018) Text segmentation techniques: a critical review. https://doi.org/10.1007/978-3-319-66984-7_10
12. Roy C, Barua K, Agarwal S, Pandey M, Rautaray S (2019) Horizontal scaling enhancement for optimized big data processing. In: Proceedings of IEMIS 2018, vol 1. https://doi.org/10.1007/978-981-13-1951-8_58
13. Sharma R, Mathur M (2010) Achieving vertical scalability: a hindrance to cloud computing
14. Borgerud C, Borglund E (2020) Open research data, an archival challenge. Arch Sci 20:279–302. https://doi.org/10.1007/s10502-020-09330-3
15. Neumann D, Chan R, Boyle GJ, Wang Y, Westbury R (2015) Measures of empathy. https://doi.org/10.1016/B978-0-12-386915-9.00010-3

Track III

Fact-Finding Knowledge-Aware Search Engine Sonam Sharma

Abstract Search engines based on knowledge graphs have become increasingly popular and in demand. One of the most popular aspects is the relations between entities. What is needed is a smarter and more apt enterprise search which helps to provide all the right answers using a knowledge graph and NLP. Fact-finding signifies that, apart from getting a document returned by the search engine, the in-place knowledge graph helps in finding facts for a user-specific query when it finds a match in the knowledge graph. In this paper, we present a scalable, open-source approach which takes unstructured data and helps in creating a search platform that refines large text for search, together with a knowledge graph and a question–answer system. It helps in getting related document(s) based on the searched query. The complete application can be used and integrated into many use-cases, since search is an integral part of most (almost all) applications.

1 Introduction Since the emergence of search engines, we have become accustomed to keyword searching. Keyword searching helps in getting apt answers for what we are looking for. More convoluted and longer questions are less likely to receive an answer. For example, "How much does Starbucks best-selling coffee cost?" would be unlikely to yield results, but re-phrasing the question to "Starbucks Vanilla latte cost" would give more relevant documents and can sometimes provide direct answers. As a result, we are looking for search engines which are fact-finding in nature. Now, search engines do not just return the valid matched documents; wherever possible, they inject facts/answers into end-user queries. This helps to shorten the time for the end user to access relevant information or facts. When it comes to understanding the idea of the search engine, knowledge graphs (KGs) are identified as the most essential component.

S. Sharma (B) Research and Innovation, A&I, Tata Consultancy Services, New Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_17

A knowledge graph is most


commonly defined as a knowledge base (KB), which is the combination of an ontology and instances of the classes in the ontology, consisting of a large number of facts about entities. Besides knowledge graphs which are openly available, e.g., DBpedia [1], Yago [2, 3] and WordNet [4], domain-specific KGs, which can help in representing relations between domain entities, have become an increasingly popular research direction toward cognition and human-level intelligence. KGs are not only applicable to a set of users or an enterprise; any person having access to the Web can leverage KG capabilities. The discussed system extends beyond traditional information retrieval systems for the retrieval of documents. The system aids in modeling relationships and co-occurrences between entities automatically, based on the semantic relations present in the underlying corpus of documents. The graph component of the system encodes triples, entities, extracted concepts and their relationships. This further helps in getting the facts for a specific user query if they are present in the system. To find answers from a large set of documents and further pin-point the exact location, fact extraction is the way to go, together with storing the facts for fast access. Facts are generally stored in the form of triples in KGs. They go together with the search engine for a complete search experience. This paper is organized as follows: in Sect. 2, we discuss related work and the critical research gaps highlighted in this paper. Section 3 describes the objective and problem statement. An overview of the system is provided in Sect. 4. The methodology used to solve the problem is described in Sect. 5. Then, we explain why we are doing it with graphs. Finally, we discuss results and conclusions in Sect. 8.

2 Related Work An ontology is a formal description of knowledge as a set of concepts within a domain and the relationships that hold between them. To enable such a description, we need to formally specify components such as individuals (instances of objects), classes, attributes and relations, as well as restrictions, rules and axioms. As a result, ontologies not only introduce a sharable and reusable knowledge representation but can also add new knowledge about the domain [5]. Building ontologies has become a common activity across various domains for vocabulary building and understanding of domain concepts. This further helps in building knowledge of the underlying domain using the ontology structure. The data model of an ontology is applied to a set of individual facts while building a knowledge graph—a collection of different entities and their types, where the relationships between them are mostly expressed by nodes and edges between these nodes; when we describe the structure of the knowledge in a domain using an ontology structure, we build the knowledge graph structure and fill data into it using the same. Ontologies can be classified into three main types [6], namely (i) formal ontologies, which are distinguished by axioms and formal definitions, stated in logic; (ii)


prototype-based ontologies, which are based on typical instances or prototypes rather than axioms and definitions in logic; (iii) lexicalized (or terminological) ontologies, which are specified by subtype–supertype relations and describe concepts by labels or synonyms rather than by prototypical instances, e.g., WordNet. Most large-scale KGs are constructed using ontologies and structured Web resources like FreeBase [7], DBpedia [1], etc. Building an ontology is a very laborious task which is still not automated, so whenever any new domain data comes into the picture, a special team is usually formed to create an ontology for that specific domain. Further, it is necessary to maintain it, and it generally remains incomplete. There are current ontology learning systems in place [8, 9] where both domain knowledge and technical expertise are required to build the complete ontology, which is non-evolving and fixed. They are error-prone, have multiple redundant facts which are very noisy, and are unreliable and manual, because of which they provide very low accuracy. Current ontology building systems do not adapt to the dynamic nature of the data and emerging trends; for example, if a relation was not present before, it becomes a very tedious task to add new relationships. Furthermore, variations in the meaning of words, i.e., nuanced meanings, are not considered in building these ontologies, because of which the context of the data is lost. Our work improves upon the above-described task since we use machine learning techniques to capture the meaningful information from the data source to build the complete knowledge base, with the extraction of triples into entities (domain and general) and relationships among these entities. After graph creation, we use graph modeling to further capture hidden trends in the data and get it into its most meaningful form. Subsequently, the mere presence of an ontology for any domain does not do any good until we make good use of it, and there our search engine helps by providing insights into the available data, which furthermore helps in converting the available data into knowledge. This can also be used as a question answering system. The final system helps in getting the documents from the available searched content and additionally returns facts out of the data whenever and wherever they match the facts present in our knowledge base. A similar document retrieval system discussed in the literature uses a hidden Markov model for the retrieval of information, which is said to outperform tf-idf. That approach is probabilistic at every step, and it sometimes becomes difficult to explain why and how the provided answer was reached. We instead followed multiple approaches and used simpler, deterministic functionality wherever required. Similarly, in most question answering systems like [10], all decision making is left to the model, which we felt would not be realistic and would not fit every kind of data. Neural networks tend to hallucinate; by hallucinating, we mean that neural networks often give strange outputs and predict weird results, and for these results we are not able to know how we got where we are. We have focused on building the kind of system which provides confidence in the expected results.


3 Problem Statement Given a large collection of text documents in an enterprise, we wish to build a search engine with a collection of facts containing named entities and relations among them. We further assume that we have initial relevant information on domain entities and documents. Documents contain one or more entities, and entities contribute to the creation of facts. We first create an enterprise knowledge graph. It provides a 360° view of the available unstructured text in the organization. Natural language processing (NLP) has come a long way since its inception in the twentieth century. We decided to use this subfield of artificial intelligence to solve our problem. NLP has proven to work very well in the past few years due to the development of fast processors, GPUs and sophisticated model architectures. Using NLP, we have also created a document search engine which helps in getting the appropriate document for a search query. The combination of the two helps us in solving the problem at hand, i.e., getting the relevant document for the search query and getting the facts from the same document by analyzing the query and running it on the knowledge graph. This helps us not only in getting the relevant document for the search but also in answering the user's question if a fact matches a relation in the knowledge graph against a document. The main objective is to build an end-to-end platform which supports preprocessing of unstructured documents and extraction of triples, entities, co-references, entity co-occurrences and entity relations to build the knowledge graph. For search, we use natural language processing techniques to build a document search engine.

4 System Overview The overall architecture of system flow is shown in Fig. 1. An assumption is made that an appropriate document store is available to store and retrieve documents. We

Fig. 1 High-level flow of the system. Documents are stored, preprocessed and used to create the search engine. The raw documents are also used to create the knowledge graph, which is stored in Neo4j; its query engine finally returns the desired results


can also use Hadoop, since the available documents for an enterprise are usually large. In the execution layer, we work on various techniques to get the knowledge out of the documents. What we want is for the platform to actually understand the semantics of what the user is trying to search for and then return the most helpful results. We extract entities, relations and their co-occurrences from the unstructured text. We have used the spaCy Python package for most of the modeling tasks, because of its easy-to-use nature and availability.

5 Methodology In this section, we explain the procedure which we followed to build the underlying platform.

5.1 Data We have used a dataset from the text retrieval conference (TREC), which is primarily a text retrieval community. The data is selected from the 2019 deep learning track. The dataset contains 367 k queries and a corpus of 3.2 million documents. For our research purposes, we had to reduce the dataset quite a lot to fit our needs. We have used the files msmarco-docs.tsv, which contains the documents, msmarco-doctrain-queries.tsv, which contains the queries, and msmarco-doctrain-top100, which contains the top 100 documents per query.

5.2 Data Distiller The data is cleansed, and all unnecessary junk is removed from the dataset. The data is cleansed in two sets: one for building the search engine and another for building the knowledge graph.
• For building the search engine, we do not focus on named entity recognition and extraction; we focus on words and sentences and their positions in the documents to build a better search. We have tried to replicate the techniques used by Lucene search in order to have a better, fast and accurate search engine. Here, the presence of special characters disrupts the process, so we remove them.
• For knowledge graph creation, it is necessary to get sentence-level information for each entity and the role it plays in those sentences. We apply automatic co-reference resolution wherever applicable. Co-occurrences of the entities are also required. We have found that it is necessary to keep special characters at this stage in order to build the knowledge base.


5.3 Document Search Engine In order to solve the semantic similarity search task at hand, we found a way to convert the text data into a vector space so that we can use it as features in a predictive model. This is called vectorizing. Common approaches to vectorizing text data include the bag-of-words approach and the TF-IDF approach; however, these are very sparse representations. In a vector of length 300, around 299 values would be 0. Thus, plugging it into our neural network would not work, as the gradients would vanish. Therefore, for neural networks, we go for word embeddings. These are fixed-length dense vector representations which work very well with neural networks. We experimented with different options for pre-trained word embeddings, such as:

1. Google's Universal Sentence Encoder
2. ELMo contextual word embeddings
3. Word2vec embeddings
4. FastText embeddings

However, we found that these embeddings were not able to adapt to the data very well. This may be due to the fact that the domain-specific data uses a very specific vocabulary. The pre-trained embeddings above are trained on plain English text such as a Wikipedia or news corpus and almost never encounter the words we are hoping to model. Therefore, it is better to build embeddings from scratch. We tried both word2vec and FastText embeddings, and word2vec gave the best results. The word embeddings are then used to train a classifier, which predicts the most suitable topics from the domain for a user query. Finally, the trained word embeddings are used to calculate similarity measures between all the available titles and a user query and then retrieve the most suitable results. While typical semantic search implementations utilize "cosine similarity" to rank results, we have come up with a custom measure which is specifically created for our task. It takes into account factors such as post popularity and sentiment while ranking the results. The weights assigned to each of them were tuned manually after several experiments based on sanity checks (explained in Fig. 2).
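A minimal sketch of this idea is given below: word2vec embeddings are trained from scratch on the corpus with gensim, and titles are ranked by a custom score that blends cosine similarity with popularity and sentiment. The corpus, the popularity/sentiment inputs and the blending weights are illustrative assumptions, not the tuned values used in the study.

```python
import numpy as np
from gensim.models import Word2Vec

# toy corpus: each document title as a list of tokens
titles = [["starbucks", "vanilla", "latte", "cost"],
          ["coffee", "price", "survey"],
          ["wind", "turbine", "maintenance"]]

w2v = Word2Vec(sentences=titles, vector_size=100, window=5, min_count=1, epochs=50)

def embed(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def rank(query, popularity, sentiment, w_sim=0.7, w_pop=0.2, w_sent=0.1):
    """Blend semantic similarity with popularity and sentiment (illustrative weights)."""
    q = embed(query.lower().split())
    scores = [w_sim * cosine(q, embed(t)) + w_pop * p + w_sent * s
              for t, p, s in zip(titles, popularity, sentiment)]
    return sorted(zip(scores, titles), reverse=True)

print(rank("starbucks latte price", popularity=[0.9, 0.5, 0.1], sentiment=[0.2, 0.0, 0.1]))
```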

5.4 Knowledge Graph

5.4.1 Extraction of SVO Triples

Triple extraction from unstructured data is done using the Python programming language. We have used the spaCy large Web model (en_core_web_lg) to extract triples from the text. Each document is taken into consideration individually to extract triples. Triples are extracted for each sentence, and there can be multiple triples for a specific sentence based on the nature of the sentence. Following are the steps to extract triples from unstructured text.


Fig. 2 Above figure explains the process flow of our document search engine

(1) Since most of the textual data is highly unstructured, there is a requirement of repeating the process multiple times based on the data.
(2) We have not removed all the stop words, since when it comes to building triples, negations against a verb help while extracting a relation.
(3) Multiple times we need to join words with no separator between them, so we apply the wordninja library on a subset of words after basic cleaning of the data.
(4) It is required to lemmatize the words to their base form.
(5) All non-ASCII characters are required to be removed since they will be of no further help.
(6) Repetitions of words are removed (when the same word occurs twice or more one after the other), and the same is done for characters. This case happens only when the data was created by an end user, e.g., survey data, chat data or form data.
(7) Next are the grammar rules (dependency markers) for subject–verb–object extraction. For triple extraction, we try to understand the distribution of the count of words that are available in Subject, Object (Fig. 4) and Verbs.

1. SUBJECTS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}
2. OBJECTS = {"dobj", "dative", "attr", "oprd"}
3. VERB = {"CCONJ", "VERB"} (POS tags that will break adjoining items)
4. NEGATIONS = {"no", "not", "n't", "never", "none"} (words that are negations)
• Triple Boundaries (1)
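A simplified sketch of subject–verb–object extraction with spaCy and the dependency labels listed above is shown below. It omits the triple-boundary and conjunction-splitting rules described in the paper and is only a minimal illustration, assuming the en_core_web_lg model is installed.

```python
import spacy

SUBJECTS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}
OBJECTS = {"dobj", "dative", "attr", "oprd"}
NEGATIONS = {"no", "not", "n't", "never", "none"}

nlp = spacy.load("en_core_web_lg")   # the large English model mentioned above

def extract_triples(text):
    """Return (subject, verb, object) triples per sentence using dependency labels."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subs = [c for c in token.children if c.dep_ in SUBJECTS]
            objs = [c for c in token.children if c.dep_ in OBJECTS]
            negated = any(c.dep_ == "neg" or c.lower_ in NEGATIONS
                          for c in token.children)
            verb = ("not " if negated else "") + token.lemma_
            triples += [(s.text, verb, o.text) for s in subs for o in objs]
    return triples

print(extract_triples("The company did not acquire the startup. TCS builds platforms."))
```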

1 7.5 m/s

6 m/s < WSS ≤ 7.5 m/s

50–100 m

This is relational level 1.

And the physical combinations by wind speed measured at 100 m height are: 100–50 m (this is relational level 1) and 100–10 m (this is relational level 2).

From the above, it is evident that the relational level is a mere comparison between levels. A single-step level is considered as level 1, and a dual-step or double level is considered as level 2. The plausible limits considered for the relational test between wind speeds at different heights are given in Table 4.

6.2 Storage and Extraction of QC Data The QC data is stored in NIWE's local storage in the form of parquet files. This file format ensures high security, higher query performance and efficient space management. These parquet files can be read using various programming languages such as R and Python by using pre-defined attributes set by the programmer. Hence, a user-friendly interface was developed to extract the data based on the requirements, viz. WMS, time, meteorological parameter, raw or QC data. A provision has also been made to set invalid flags as "NaN" values, or to set "NaN" values where the corresponding QC flag is other than zero. A screenshot of the interface is given in Fig. 5.
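As an illustration of how such QC-flagged parquet data can be consumed in Python, a minimal pandas sketch is given below. The file name and column names are assumptions for illustration; the actual layout used at NIWE may differ.

```python
import pandas as pd

# illustrative path and column names
df = pd.read_parquet("wms_station_001.parquet")

# keep only records that passed the QC tests (flag == 0) for a chosen period
mask = (df["timestamp"] >= "2020-06-01") & (df["timestamp"] < "2020-07-01")
qc_ok = df[mask & (df["ws_100m_qc_flag"] == 0)]

# alternatively, mask invalid values as NaN instead of dropping them
df.loc[df["ws_100m_qc_flag"] != 0, "ws_100m"] = float("nan")

print(qc_ok[["timestamp", "ws_100m"]].describe())
```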

7 Analysis and Discussion Meteorological parameters exhibit high hourly, monthly and seasonal variations. It is therefore important to consider the interdependence of the different parameters obtained from a single WMS. Figure 6 shows the standard time series data of all parameters (average) collected from different heights of a typical WMS during the peak windy season. It can be seen that the wind speed increases with an increase in height. However, this statement is valid in plateau/plain terrain, but not necessarily in complex terrain. Second, the temperature is usually low during the night and rises steadily throughout the day. In contrast, the pressure


Fig. 5 Screenshot of data extraction package

Fig. 6 Time series data of all the parameters (average) obtained from various heights from a typical WMS during peak windy season


reaches its maximum during the day and shows a distinct decrease during the night. The interdependence of sensors at different levels, the interdependence of sensors at the same level and the impact of extreme outliers are explained in detail in this section.

7.1 Interdependency of Sensors at Different Levels This analysis mainly focuses on ensuring plausible values between various sensors. To understand the correlation between recorded parameters measured at different heights along the MET-MAST, a scatter matrix is plotted to compare the wind speeds at 10 m, 50 m, 80 m and 100 m with and without applying QC, as shown in Figs. 7 and 8. In Fig. 7, very distinct outliers and a slight deviation in the pattern are observed. For this particular station, it is evident that the outliers are mainly caused by implausible values measured at 10 m height, i.e. higher wind speeds at 10 m compared to the higher measurement levels. Therefore, it is necessary to identify and remove such outliers. After applying the necessary QC, it is observed that the trend lies more or less along the 1:1 identity line.
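A small sketch of such a scatter-matrix comparison with pandas is shown below; the file name and channel column names are assumptions for illustration.

```python
# Sketch of the scatter-matrix comparison between wind speeds at different
# heights; column names are hypothetical placeholders for the WMS channels.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.read_parquet("wms_station01_2020.parquet")
heights = ["ws_10m", "ws_50m", "ws_80m", "ws_100m"]

scatter_matrix(df[heights], alpha=0.2, diagonal="hist", figsize=(8, 8))
plt.suptitle("Wind speed cross-plots before QC")
plt.show()

# Repeating the same call on the QC'd frame (flagged values masked) should
# cluster the points along the 1:1 identity line.
```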

Fig. 7 Scatter matrix comparing the wind speeds at 10 m, 50 m, 80 m and 100 m without applying QC for a typical WMS


Fig. 8 Scatter matrix comparing the wind speeds at 10 m, 50 m, 80 m and 100 m after applying QC for a typical WMS

In addition, a high correlation is observed between wind speeds at neighbouring heights, i.e. the correlation of level 1 wind speeds is higher than the correlation of wind speeds at level 2. It is important to note that this validation is not necessarily applicable to all sites and at all times. Therefore, it is recommended to report and verify the data for localized weather conditions such as cyclones, light wind conditions and so on. This helps reduce the risk of eliminating correct data.

7.2 Interdependency of Sensors at Same Level The sensors are typically mounted facing the prevailing wind direction. Sometimes, when more than one wind direction prevails, two sensors are mounted at the same height. The difference between the data obtained from such sensors should be as small as possible. This form of dual measurement can be used for early detection of sensor drift, sensor failure or tower shading. For example, Fig. 9 shows the cross-validation between the wind speed data obtained at 100 m south and 100 m north for a period of one year. The data within the appropriate range is considered legitimate, and the points beyond the threshold limits are deemed suspicious.


Fig. 9 Cross-validation between 100 S and 100 N wind speed for a typical WMS. The data point exceeding limit is filtered (red)

The upper and lower limits can be derived either by evaluating historical data or by using the standard deviation. For the particular station in Fig. 9, 0.25% of the total data is flagged as suspect and the resulting correlation is 0.99, which is within acceptable limits.
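The following sketch illustrates one way such limits could be derived from the standard deviation of the difference between the two sensors; the 3-sigma band, file name and column names are assumptions, not the thresholds used by the authors.

```python
# Sketch of the same-level cross-validation check: the difference between two
# sensors at the same height is bounded by mean +/- 3*std of the historical
# difference; column names are hypothetical.
import pandas as pd

df = pd.read_parquet("wms_station01_2020.parquet")
diff = df["ws_100m_south"] - df["ws_100m_north"]

upper = diff.mean() + 3 * diff.std()
lower = diff.mean() - 3 * diff.std()

suspect = (diff > upper) | (diff < lower)
print(f"{suspect.mean() * 100:.2f}% of records flagged as suspect")
print(f"correlation = {df['ws_100m_south'].corr(df['ws_100m_north']):.2f}")
```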

7.3 Effect of Outliers In order to understand the wind potential and patterns, it is important to estimate the average, median, standard deviation, minimum and maximum values at different heights, but the presence of outliers could distort the outcome of the study. It is therefore important to identify and eliminate major errors as far as possible for an accurate analysis. To better understand the effect of outliers on these statistics, violin plots are drawn at different heights for both non-QC data and QC data in Fig. 10. It can be noted that the operating range is different for the different scenarios. In addition, it should be noticed that the data points are clustered at lower wind speeds, while the violin plots have tapered ends at higher heights. This analysis forms the basis of the physical limit test, the variance test and the gradient test. It is also noted that similar results are obtained for the other meteorological parameters, i.e. temperature and pressure.
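A brief sketch of how such a violin-plot comparison could be produced with seaborn is given below; the two input files and the long-format reshaping are illustrative assumptions.

```python
# Sketch of the violin-plot comparison of wind speed with and without QC;
# file names and column names are placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

raw = pd.read_parquet("wms_station01_2020_raw.parquet")
qc = pd.read_parquet("wms_station01_2020_qc.parquet")

heights = ["ws_10m", "ws_50m", "ws_80m", "ws_100m"]
long = pd.concat([
    raw[heights].melt(var_name="height", value_name="wind_speed").assign(data="raw"),
    qc[heights].melt(var_name="height", value_name="wind_speed").assign(data="qc"),
])

sns.violinplot(data=long, x="height", y="wind_speed", hue="data", split=True)
plt.title("Wind speed distribution before and after QC")
plt.show()
```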


Fig. 10 Violin plots for wind speed data with and without applying QC

8 Results The data quality algorithm was successfully applied to 20 wind-monitoring stations over a span of one year. The findings of the study are listed briefly in Table 1. Below is a snapshot of the wind speed data flagging method for a specific location of the site (Fig. 11).

Fig. 11 Flagging mechanism showing wind speed time series data for a particular height and the associated error flags for a typical WMS


Figures 12a, 13a, 14a and 15a provide an overview of the amount of good data (flag 1) at the end of QC for all the measured parameters across the 20 WMS. Figures 12b, 13b, 14b and 15b provide a detailed view of the rate of erroneous/suspected data (flag 1 and flag 2) observed for each test over all 20 WMS. Figure 12a, b shows the percentage distribution of flags for the average wind speed parameter for all stations.


Fig. 12 a Rate of correct values observed in wind speed average data for all 20 wind-monitoring stations. b Rate of error flags (flag 1 and 2) observed in wind speed data for all 20 wind-monitoring stations



Fig. 13 a Rate of correct values observed in pressure data for all 20 wind-monitoring stations. b Rate of error flags (flag 1 and 2) observed in pressure data for all 20 wind-monitoring stations

It is evident that the maximum failure occurred in the relational consistency test and the minimum failure in the physical limit test. It can be noted, however, that the total number of error flags is between 1 and 3%. On average, 98.6% of the wind speed data across all stations included in this study is classified as good. Figure 13a, b shows the percentage distribution of flags for the average pressure parameter for all the stations.



Fig. 14 a Rate of correct values observed in temperature data for all 20 wind-monitoring stations and b Rate of error flags (flag 1 and flag 2) observed in temperature data for all 20 wind-monitoring stations

A high error can be noticed in Stations 16 and 19 due to the physical limit test and the relational test. The data from these stations was transferred through a manual/offline mode, and hence there is a possibility of a sensor failure that remained undetected until the data was collected and processed. This is a major drawback of the offline mode of data transfer.



Fig. 15 a Rate of correct values observed in wind direction data for all 20 wind-monitoring stations and b Rate of error flags (flag 1 and flag 2) observed in wind direction data for all 20 wind-monitoring stations

Figure 14a, b shows the percentage distribution of flags for the average temperature parameter for all stations. It can be observed that the percentage of error flags is very high for the gradient test: on average, about 2% of the temperature data is flagged as incorrect/suspicious by the gradient test.


Similar tests were conducted for the average wind direction parameter in Fig. 15a, b. While some stations performed very well, with an error percentage close to zero, the missing data for some stations is as high as 3%. Table 5 shows that a specific data value can fail one or more tests. The benefit is that this not only accounts for the seriousness of the problem, but also ensures the maximum detection and flagging of inaccurate and suspicious data: if invalid data is not detected by one test, it can still be captured by another test. Two observations were made while performing the analysis. First, it was found that there is a close interaction between the different wind parameters. While this is not included in this report, the authors recommend that data quality checks be included to verify the interdependence of different parameters within the same WMS. This will be useful for filling gaps in the data as well as for ensuring that readings obtained from different sensors do not contradict each other. Second, for a few sites, a spatial correlation between similar parameters from adjacent stations has been observed. The correlation varies from station to station, which may be due to the landscape and local weather conditions. Because historical data was not available for all WMS over the same span of time, only a small dataset was considered, and thus a conclusion could not be formulated. A robust data quality control algorithm will help to create an effective data imputation algorithm to fill invalid or missing data. The authors propose creating a region-wise catalogue containing error threshold limits based on geographical location and seasonality, thus reducing the false positive rate. In addition, the implementation of a monitoring tool would provide easy access to the status of all operating wind-monitoring stations, which would enable fast error detection and rapid correction.
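The conclusion below notes that a 16-bitwise flag word records the pass/fail outcome of each test, so a value failing several tests simply has several bits set. The sketch that follows is only an illustration of that idea; the bit assignments and test names are assumptions, not the authors' exact layout.

```python
# Illustrative sketch of a 16-bit flag word in which each bit records the
# pass/fail result of one QC test; bit assignments are assumptions.
TESTS = {
    "missing": 0,
    "duplicate": 1,
    "physical_limit": 2,
    "gradient": 3,
    "deviation": 4,
    "relational": 5,
}

def encode_flags(failed_tests):
    """Pack the names of failed tests into a single 16-bit integer."""
    word = 0
    for name in failed_tests:
        word |= 1 << TESTS[name]
    return word & 0xFFFF

def decode_flags(word):
    """Return the list of test names whose bit is set in the flag word."""
    return [name for name, bit in TESTS.items() if word & (1 << bit)]

flag = encode_flags(["physical_limit", "relational"])
print(flag, decode_flags(flag))  # 36 ['physical_limit', 'relational']
```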

9 Conclusion A quality control algorithm has been developed and applied to data from 20 meteorological wind-monitoring stations obtained from the National Institute of Wind Energy. The QC tests used in this work include the primary tests (missing and duplicate records), the internal consistency tests (physical limit test, gradient test and deviation test) and the relational consistency test (consistency between wind speeds at different levels). The limits have been set based on a number of literature surveys and on historical statistical analysis. A 16-bitwise flagging method was used to indicate whether the measurement data had passed or failed each test, along with the reason for acceptance or rejection. Efforts have been made to automatically identify all apparent outliers, thus reducing manual involvement. Regular testing and calibration of the sensors are expected to improve the accuracy of the measured values. Reliable and high-quality data can pave the way for understanding long-term climate change and help to define the exact wind capacity for wind farm planning.

Table 5 Statistics of errors observed for all wind parameters from 20 WMS

Meteorological parameter | Recorded parameter | Good data (%) | Primary test: Missing data (%) | Internal consistency test: Physical limit test (%) | Gradient test (%) | Deviation test (%) | Relational consistency test: Correlation test (%) | Failed in two or more tests (%)
Wind speed     | AVG   | 98.67 | 0.25 | 0.0001 | 0.63  | 0.0351 | 0.392 | 0.03
Wind direction | AVG   | 99.30 | 0.70 | 0.00   | 0.00  | 0.0000 | –     | 0.00
Wind direction | MAX   | 99.30 | 0.70 | 0.00   | 0.00  | 0.0000 | –     | 0.00
Wind direction | STDEV | 99.42 | 0.58 | 0.00   | 0.00  | 0.0018 | –     | 0.000
Temperature    | AVG   | 97.67 | 0.27 | 0.05   | 1.984 | 0.0000 | –     | 0.030
Pressure       | AVG   | 99.41 | 0.26 | 0.12   | 0.11  | 0.0000 | –     | 0.106


Acknowledgements The installation of MET-MAST is funded by Ministry of New and Renewable Energy (MNRE), Government of India. The authors gratefully acknowledge Dr. Rajesh Katyal (DDG & DH, NIWE), Mr. Haribhaskar (DDT, NIWE), Mr. J. Bastin (DDT, NIWE) and their entire team for their constant support and valuable inputs. The authors would like to express sincere thanks to all the engineers in NIWE who have been part of meteorological MAST installation and data collection process.


Efficient and Secure Storage for Renewable Energy Resource Data Using Parquet for Data Analytics A. G. Rangaraj, A. ShobanaDevi, Y. Srinath, K. Boopathi, and K. Balaraman

Abstract This is the era of time series data, as its applications and interpretation have recently become increasingly relevant in different domains and fields. Numerous fields of industry and science rely on collecting and examining vast quantities of time series data: finance and economics, the Internet of Things, medicine, environmental protection and hardware surveillance, among several others. The main structural problems in time series data analysis are the collection and retrieval of vast quantities of data. In addition, time series data analysis is of tremendous value, since previous patterns are helpful in predicting future results. Owing to the considerable latency in data volume changes, the frequency of persistence and the absence of inherent structure in time series, the traditional hierarchical data retrieval method does not seem able to analyse time series data efficiently. Moreover, many traditional data storage solutions do not support time series-based operators, which results in unreliable time series service. This paper describes effective and stable local data storage for time series data. The study of wind/solar time series data is carried out by changing the number of rows, the number of days and the number of columns with time resolutions of 10, 15, 30 and 60 min. The results of the experiment show the time series data access time, write time and storage size of the parquet data storage format, which has proved to be an effective and reliable data storage option for a local centralized server. The data storage model of Apache

A. G. Rangaraj (B) · A. ShobanaDevi · Y. Srinath · K. Boopathi · K. Balaraman National Institute of Wind Energy, Chennai, India e-mail: [email protected] A. ShobanaDevi e-mail: [email protected] Y. Srinath e-mail: [email protected] K. Boopathi e-mail: [email protected] K. Balaraman e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_19


Parquet appears to be effective as compared to other file storage formats such as CSV, HDF5 and FEATHER.

1 Introduction Time series information is highly valuable. Numerous mechanisms exist to enable the collection, distribution and management of time series-based data. In addition, some science-oriented domain systems have metadata capabilities that provide further perspective for data interpretation and comprehension. Apart from the scientific field, other resources and databases based on time series have been developed; they typically have reduced metadata capacity, but they can be scaled to enhance productivity. Real-time visualization tools and other scenarios involving rapid data extraction and data discovery may benefit from a standards-based architecture that provides rich metadata as well as fast performance. In recent times, with the widespread usage of sensors in broad networks, time series data is generated in large volumes and data processing is often needed. The implementation of an effective database management system improves the overall efficiency and effectiveness of an organization. An efficient data management framework makes the organization more profitable and helps to carry out successful data review as a whole; a poorly managed data system, on the other hand, can contribute to a rather dysfunctional organization. Finally, it allows fully data-driven actions to be taken within the organization and increases its overall productivity. An effective data management system:
• Permits data structures that can be stored for easy retrieval and future reference.
• Makes it simpler for employees to discover and understand the data that they require to do their tasks.
• Permits staff to conveniently validate their conclusions or results.
• Gives structure to data so that it can be effectively shared with other people or customers.
• Makes the business more cost-efficient.
• Provides consistent data, improved data quality, reduced time and cost, increased data accuracy, quicker decision making, improved data sharing and data integration.
Another benefit of an efficient data management system is that it makes the organization more cost-effective, since it enables the company to prevent needless replication of results. Because all data is handled and preserved, it is easily ascertainable whether employees are repeating a practice, research or study that has previously been carried out by another employee. On the other hand, efficient data processing also reveals the weaknesses in data collection when managing such immense data sources. An assortment of literature surveys has been performed, and most of the studies tend to utilize cloud-driven and distributed databases, such as OpenTSDB, InfluxDB and FluteDB, for large quantities of time series data.


Cloud storage has allowed users to automatically transfer and view data without access to their local workstations. However, cloud computing has certain downsides compared to local data systems:
• Too much dependence on the Internet connection: data files can be transferred to the cloud server only when the Internet connection is active at all times, and it becomes difficult to transfer data to the remote server when the Internet faces technical issues.
• Costs of cloud server or storage: cloud service providers may charge additional costs for downloading and uploading files from the cloud server and also for cloud backups and data recovery.
• Loss of vital data to hackers: malware threats and cybercrimes are ever increasing these days, and there is a possibility of losing vital data to hackers while using cloud servers. The organization must seriously verify the malware protection and data security of the provider before purchasing storage from cloud service providers.
• Compliance and integration issues: cloud-based storage is sometimes not the most suitable option for publicly traded organizations when considering the advanced features ingrained in cloud storage products.
In distributed database systems, the databases are placed at different locations and interconnected using a network. The distributed databases at each site are capable of processing and accessing both local and remote data. Though a distributed DBMS is proficient in data sharing and effective communication, it still suffers from several drawbacks, as given below.
• Complex in nature: a distributed database is relatively more complex than a centralized database and requires complex software; moreover, it must control data replication, which adds further complexity to its nature.
• Overall cost: hardware, communication, labour, maintenance and procurement costs, among others, increase the overall cost and make it more expensive than traditional databases.
• Security issues: the network used in distributed databases can be attacked for data misuse and theft.
• Integrity control: maintaining data consistency is an important concern, since all changes made to data at one host must be reflected at all hosts.
• Lacking standards: there are no standard rules and protocols to convert a centralized database into a large distributed database, and the lack of standards reduces the potential of distributed data storage.
Based on these drawbacks of cloud storage and distributed data storage for time series data, the efficient storage of large volumes of data in a local centralized server has been investigated. The local storage system also has both advantages and disadvantages. The benefits of local storage are:


• The organization has complete authority over access to data and files. In this way it is truly secure in contrast with cloud-based storage, where the organization does not have full information regarding data storage and data access security.
• Data can be accessed effectively, easily and rapidly.
• The user does not need an Internet connection to access data from the server for further investigation.
Disadvantages:
• The organization has to take regular backups of the data to avoid any loss or corruption; the organization is entirely responsible for data safety.
• It is more complex to share data with other people: it is necessary to create a host server to upload the data and afterwards send either the data or a link to the intended client, e.g. through email.
• It consumes additional storage space while storing the data on the local storage server.
Even though local storage has some disadvantages, most organizations still prefer storing data on a local server, and some prefer a hybrid method, storing data on both local and cloud/remote servers. In that perspective, this article gives an overview of storing and managing time series data effectively on the National Institute of Wind Energy (NIWE) centralized server. It provides a state of the art for storing data in different file storage formats with varying data sizes, i.e. number of rows, number of columns and temporal resolution of the sample dataset. The following sections present the methodology and analysis results of efficient and secure data storage for Renewable Energy (RE) resource data on a local server.

2 Related Works Struckov et al. chose four databases with different data models and carried out a theoretical and experimental assessment of their suitability for various time series workloads. The assessment indicated that the preferred database differs from one scenario and viewpoint to another; hence the most significant conclusion of that study was that, to pick the most suitable database, one should evaluate the candidates with one's own data, workloads and types of queries. To choose the right tool it is important to define a testing methodology: the test data should be as close as possible to the real data, since test results based on different data cannot be compared, and the operations should correspond to the scenarios foreseen for the chosen tools, because different use cases require a different set of mechanisms within the tool. Each of the examined databases has its own purpose. InfluxDB, owing to its excellent compression technique and storage architecture, suits small-scale monitoring systems well and is extremely simple to operate, integrate and deploy. ClickHouse proved to be a stable, solid enterprise solution intended for systems with high data write rates.


It is more complicated to integrate into an existing system, but its excellent stability justifies the effort. TimescaleDB is a good choice when a database has to serve more queries than data writes; it offers all the advantages of PostgreSQL with additional functionality oriented towards storing and processing time series data, and it can be installed as an extension on existing PostgreSQL instances. OpenTSDB is essentially an extension of the HBase framework, designed to exploit the advantages of HBase-compatible solutions such as Google's BigTable [1]. Jensen et al. observed that the collection of time series data could drive the deployment of additional automation and monitoring, with applications ranging from Internet of Things (IoT) devices placed in household items to sizeable distributed Cyber-Physical Systems (CPSs) generating large amounts of data at high velocity. To store and analyse these large quantities of data, dedicated Time Series Management Systems (TSMSs) have been developed to overcome the limitations of conventional Database Management Systems (DBMSs) for the management of time series data. The authors give a thorough classification and evaluation of TSMSs developed through industrial or academic research and documented in journal publications; the classification is based on the architectures observed during their evaluation. Furthermore, they provide an outline of every system, with an emphasis on the motivational scenario that drove its development, the storage capability and implementation of time series queries, the system modules, and each system's capabilities with respect to Approximate Query Processing (AQP) and stream processing. They summarize the research directions offered by various researchers in this field and their vision for the next generation of TSMSs. In summary, they suggest that a distributed TSMS offering analytical abilities similar to a data warehouse should be developed specifically for use with time series data; such a TSMS needs to offer functionality for updating data in real time, user-defined functions to support stream data processing, and execution of queries on both incoming and historical data through AQP at interactive query speed [2]. Recent growth of the Internet and of the generated data volume drives a rising demand for storage solutions with large capacity. Even though cloud storage has produced new approaches for storing, managing and accessing data, there is a requirement for an efficient, inexpensive and effective storage solution applicable in particular to the analysis and management of big data. In this research work, the authors go one step further with an in-depth evaluation of the important features of big data storage facilities for semi-structured and unstructured data, and also discuss construction and deployment procedures; their recommended data storage model offers a substantial benefit for storage providers by merging prevailing techniques into a single tailored framework for storing big data [3]. The quantity of data produced daily by industries, large organizations and research institutes is growing at a fast rate.


These massive volumes of generated data need to be stored not only for analysis purposes but also in accordance with service and legal agreements to preserve and protect the data. Data management and storage are therefore major concerns in this big data era. Selecting the right storage devices, data management tools and efficient methods largely determines how well this growth can be handled, and the approach taken to big data management and storage can considerably influence the whole organization. Organizations and business units are now more concerned with how to proficiently retain and store their data. The ability of storage strategies to scale in order to meet this data growth rate, improve access times and sustain data transfer rates is correspondingly challenging, and these aspects to a substantial extent decide the complete performance of data management and storage. The requirements for big data storage are demanding and need a holistic methodology to alleviate the challenges. In this work, the authors survey big data management and storage challenges, inspect currently existing big data management and storage platforms, and offer valuable suggestions to overcome those challenges [4]. Another review paper gives a state-of-the-art outline of cloud-driven big data solutions along with methodologies for data storage, and attempts to highlight a genuine comparison between them in terms of enhanced support for big data management; the authors' own contribution on data placement for big data is also introduced. The in-depth investigation of every fundamental area is backed by well-made evaluation criteria, which also aids a better classification of the individual methodologies (or technologies associated with each part). The survey enables readers to better understand which solution could be used under which non-functional requirements, thereby supporting the development of client-specific big data management frameworks according to the non-functional requirements posed, and the authors describe significant challenges, drawn from the conducted investigation, that can pave the way for the proper development of such frameworks in the future [5]. Water quality is a continuous concern and remote water quality sensing promises societal benefits. One study contributes a low-cost water quality sensing framework, with a specific focus on the selection of a database intended for storing water quality data. Time series databases have gained popularity recently; the paper formulates criteria for a comparison, measures the chosen database systems and derives a recommendation for a particular database system. A low-cost, low-power server such as a Raspberry Pi is able to handle as many as 450 sensors' data simultaneously by using the InfluxDB time series database. The presented design uses a web server only for visually querying the database system and makes the sensors send their requests directly to the database port. This work thus revealed that, using an adequate design and a carefully chosen TSDB, a low-cost, low-power server such as a Raspberry Pi can handle water quality monitoring installations of considerable size [6].


The Expressive Stream Language for Time Series (ESL-TS) and its query optimization strategies address these issues efficiently and are part of the data stream management system prototype developed at UCLA. Time series queries occur frequently in data stream applications; however, they are not supported well by the SQL-based continuous query languages proposed by most current data stream management systems. The authors present ESL-TS, which can express powerful time series queries through simple extensions of SQL, and also examine optimization techniques for ESL-TS, showing that they can be used to limit execution time and memory for intra-query and inter-query optimization [7]. Another paper describes a hybrid approach to achieve efficient delivery of large time series datasets with complex metadata. The authors use three subsystems within a single system of systems: a Python proxy, an efficient time series database (InfluxDB) and a Sensor Observation Service (SOS) implementation (52 North SOS). The proxy processes standard SOS queries and submits them to either InfluxDB or 52 North SOS for handling; responses are returned directly from the 52 North SOS, or indirectly from InfluxDB via the Python proxy, and converted into the WaterML format. This marries the scalability and performance advantages of a time series database with the complex metadata handling of an SOS system. Testing showed that a recent version of the 52 North SOS configured with a Postgres database performs well, but an implementation combining InfluxDB and 52 North SOS in a hybrid architecture performs roughly several times faster. The study also compared the performance of a standalone 52 North SOS instance with a prototype of the hybrid design combining 52 North SOS instances with the InfluxDB system; in these tests the hybrid framework was again several times faster. Besides the performance benefits of the hybrid architecture compared with a standalone 52 North SOS instance, another advantage is that the loose coupling between components readily permits the integration of new technologies and the reuse of data in existing deployments [8]. In addition, the genuine dependability of databases is also challenging to ensure. FluteDB is a dependable and efficient time series database storage engine composed of several time-series-optimized submodules. The validation of all submodules demonstrated that their optimized procedures significantly outperform existing methods in a real time series environment, while the complete FluteDB uses various measures to ensure its dependability and achieves a higher overall storage efficiency than state-of-the-art time series databases. In particular, compared with existing time series data storage solutions, FluteDB can sustain higher write rates, occupies fewer storage and processing resources, and has sufficient flexibility and dependability. The proposed procedures are only some of the significant components of a time series storage engine, and the future work will primarily concentrate on designing and implementing other improvements


(e.g., an indexing mechanism) to improve the overall efficiency of the whole database system; the storage and dependability techniques for time series in distributed systems are also worth further consideration [9]. Gorilla is an in-memory time series database that has been developed and deployed at Facebook. Gorilla optimizes for remaining highly available for writes and reads, even in the face of failures, at the expense of possibly dropping small amounts of data on the write path. To improve query efficiency, it aggressively leverages compression methods, such as delta-of-delta timestamps and XOR'd floating point values, to reduce Gorilla's storage footprint by 10×. This allows Gorilla's data to be stored in memory, reducing query latency by 73× and improving query throughput by 14× compared with a traditional database (HBase)-backed time series store. Gorilla functions as a write-through cache for up to 26 h of monitoring data gathered across all of Facebook's systems, and has allowed production query latency to be reduced by over 70× compared with the previous on-disk TSDB. Gorilla has enabled new monitoring tools including alerts, automated remediation and an online anomaly checker. It has been in deployment for the past 18 months and has successfully doubled in size twice in this period without much operational effort, demonstrating the scalability of the solution; its fault-tolerance capabilities were also verified through several large-scale simulated failures as well as real disaster situations, during which Gorilla remained highly available for both writes and reads, supporting site recovery [10]. Another study provides a state-of-the-art review of cloud-centric big data placement together with data storage methodologies, highlighting the correlation between them with respect to big data management support; the emphasis is on the aspects of data management seen through the prism of non-functional properties. Readers can thus appreciate the deep examination of the particular technologies identified with big data management, which offers a guide towards their selection with regard to fulfilling the non-functional requirements of their applications; challenges are also presented, highlighting the current gaps in big data management and the way it should evolve in the future [5]. Most conventional data structures do not support time series operators, which results in less efficient time series data access. Consequently, traditional database management systems experience issues in managing large amounts of data, leading to the necessity of massively distributed software that executes on many server systems. This led the authors of Chronos to implement a time series database based on in-memory key-value pairs. The software is implemented in C++, and its design relies on parallelism, temporal procedures and RAM-based data storage procedures. The results show that using RAM for data storage and for accessing the time-based index of the keys in the Chronos software translates into a 40–50% increase in efficiency compared with other traditional databases such as MongoDB and MySQL [11].


The concepts of data handling and data management in an organization are explained in [12]. Options for storing time series data in Cloudant, with a step-by-step example of the storage procedure, are illustrated in [13]. Data management using the Bluepencil software for different organizations is explained in [14]. The major objective of a data storage and management system is to limit the uncertainties that enter the centralized server at various stages, namely the collection, transmission and storage of data. Identified and regulated data storage and management procedures help in deriving useful information. This paper mainly focuses on the data storage procedure, i.e. the effective storage of meteorological observations obtained from wind/solar monitoring stations across India. The methodology section describes the different storage formats adopted and the dataset considered for this analysis, while the performance and outcome of the data storage procedure are discussed in the results and discussion section.

3 Methodology The analysis was carried out using a sample of RE resource data with different numbers of days and parameters (columns) at various time resolutions. The raw data in CSV format is converted to parquet, HDF5 and feather. An Excel file storage format was not included in the study, since it is challenging to read or write when the number of columns/rows in the sample dataset is large. File read/write time analysis and data size measurements were carried out to determine the reliability and consistency of the different file storage formats. The study used 3 iterations and 10 iterations to determine the average read time, write time and storage size of the file. The average values over 10 iterations provided reliable findings and are shown in Figs. 3a to 5e and Tables 1a to 3e as three different cases in the following sections.
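A hedged sketch of this benchmarking procedure is given below; the random sample data, file names and the exact set of pandas calls are assumptions used for illustration, not the authors' own code.

```python
# Hedged sketch of the benchmarking procedure: for each file format, time the
# write and read of the same sample DataFrame over several iterations and
# record the on-disk size. Requires pyarrow (parquet/feather) and pytables (hdf5).
import os
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(365 * 144, 10),          # ~1 year of 10-min records
                  columns=[f"param_{i}" for i in range(10)])

formats = {
    "csv":     (lambda p: df.to_csv(p, index=False),              lambda p: pd.read_csv(p)),
    "parquet": (lambda p: df.to_parquet(p),                       lambda p: pd.read_parquet(p)),
    "feather": (lambda p: df.to_feather(p),                       lambda p: pd.read_feather(p)),
    "hdf5":    (lambda p: df.to_hdf(p, key="data", mode="w"),     lambda p: pd.read_hdf(p, key="data")),
}

for fmt, (write, read) in formats.items():
    path = f"sample.{fmt}"
    w_times, r_times = [], []
    for _ in range(10):                         # 10 iterations, as in the study
        t0 = time.perf_counter(); write(path); w_times.append(time.perf_counter() - t0)
        t0 = time.perf_counter(); read(path);  r_times.append(time.perf_counter() - t0)
    size_mb = os.path.getsize(path) / 1e6
    print(f"{fmt:8s} write {np.mean(w_times):.3f}s  read {np.mean(r_times):.3f}s  size {size_mb:.1f} MB")
```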

3.1 Different File Storage Formats This section provides the salient features and characteristics of various file storage formats. The authors have considered the Parquet, Feather and HDF5 file formats for this analysis, as they can be read and processed using various programming languages such as Java, Python and R, and they are platform independent. Based on the requirement, size and purpose, the most efficient data storage format is chosen.

3.1.1 Parquet

Apache Parquet is a columnar data storage format accessible to any project irrespective of the choice of programming language, data model or data processing framework.


Apache Parquet is implemented using the Apache Thrift framework, which enhances its flexibility. Parquet works with a number of programming languages such as C++, Java, Python and PHP. Parquet features:

• Columnar file format
• Supports nested data structures
• Not tied to any commercial framework
• Accessible by Hive, Spark, Drill, Python
• R/W in HDFS or local file system
• Gaining strong usage.

Parquet design objectives:

• Interoperability
• Space efficiency
• Query efficiency.
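As a brief illustration of these objectives in practice, the snippet below writes and reads a small parquet file with pandas; the pyarrow engine, snappy compression and file name are assumptions chosen to match the setup described later in the paper.

```python
# Minimal parquet round trip with pandas; the file name is a placeholder.
import pandas as pd

df = pd.DataFrame({"ws_100m": [7.8, 6.2, 5.0], "temp_5m": [29.1, 28.7, 28.4]})

df.to_parquet("sample.parquet", engine="pyarrow", compression="snappy")
restored = pd.read_parquet("sample.parquet", columns=["ws_100m"])  # column pruning
print(restored)
```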

3.1.2 Hierarchical Data Format 5 (HDF5)

Hierarchical Data Format 5 (HDF5) is a distinctive open-source framework appropriate for managing data collections of any complexity and size. HDF5 has been explicitly designed for:
• High-volume and more complex data (although it can also be used for simple and low-volume data).
• Any type of system and all sizes (it is portable).
• Flexible, efficient I/O and data storage.
• Accommodating novel data models and allowing applications to evolve while using HDF5.
• Use as a file storage format tool kit.
HDF5 shares highlights with other formats yet offers considerably more. HDF5 is similar to XML in that HDF5 files allow users to specify complex data dependencies and relationships and are self-describing; unlike XML files, HDF5 files can store binary data and permit direct access to parts of the file without parsing the entire contents. Compared with the tables used in relational databases, HDF5 additionally permits data objects to be expressed in a natural, hierarchical manner: relational databases use tables, whereas HDF5 supports n-dimensional datasets in which every element can itself be a complex object. Relational database systems are best at queries based on matching fields and are not well suited to the sequential processing of records. The HDF5 library implements the objects of the HDF5 abstract data model; some of these objects include attributes, datasets and groups. The application program maps


Fig. 1 Data storage system of HDF5 library

Fig. 2 Parquet data storage system process flow

its data structures to the hierarchical system of HDF5 objects; every application makes the mapping most appropriate to its purposes. The data storage system of the HDF5 library API is shown in Fig. 1.
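A small example of this hierarchical organization using pandas' HDFStore (PyTables backend) is sketched below; the keys and station names are illustrative assumptions.

```python
# Hierarchical storage with HDF5 via pandas' HDFStore; keys act like folders.
import pandas as pd

wind = pd.DataFrame({"ws_100m": [7.8, 6.2], "ws_50m": [6.9, 5.8]})
solar = pd.DataFrame({"ghi": [812.0, 645.0]})

with pd.HDFStore("resource_data.h5") as store:
    store.put("wms/station01", wind)     # groups form a hierarchy
    store.put("sms/station07", solar)
    print(store.keys())                  # e.g. ['/sms/station07', '/wms/station01']

recovered = pd.read_hdf("resource_data.h5", key="wms/station01")
```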

3.1.3 Feather

Feather is an easy to use, lightweight and fast binary file storage format for data frame storage. It has a few explicit design objectives:
• Lightweight, minimal API: it makes pushing data frames in and out of memory as simple as possible.
• Language agnostic: Feather files are the same whether written from Python or R code, and other scripting languages can read and write feather files as well.
• High read and write performance. Whenever possible, feather-based operations should be bound by local disk performance.
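A minimal feather round trip with pandas is shown below; the file name and columns are placeholders.

```python
# Feather write/read with pandas (pyarrow backend); file name is illustrative.
import pandas as pd

df = pd.DataFrame({"ws_100m": [7.8, 6.2, 5.0]})
df.to_feather("sample.feather")
same_df = pd.read_feather("sample.feather")
```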


3.2 Dataset Description The renewable energy wind and solar resource data from our organization's (National Institute of Wind Energy, NIWE) centralized server has been considered for experimentation and analysis. The wind/solar data used for this analysis is received from wind/solar monitoring stations in India with an original temporal resolution of 1 min for solar and 10 min for wind. Data from multiple stations has been combined by varying the number of parameters from 10 to 90 and the number of days from 30 to 7300, with temporal resolutions of 10, 15, 30 and 60 min.
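The sketch below shows one way such sample datasets at 10, 15, 30 and 60 min resolutions could be prepared with pandas resampling; the random 1-min input frame is an assumption about the data layout, not the NIWE dataset itself.

```python
# Prepare sample datasets at the temporal resolutions used in the study from
# a 1-min record; the DataFrame with a DatetimeIndex is an assumed layout.
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=365 * 1440, freq="min")   # 1 year at 1 min
raw = pd.DataFrame({"ghi": np.random.rand(len(idx))}, index=idx)

samples = {res: raw.resample(f"{res}min").mean() for res in (10, 15, 30, 60)}
print({res: len(frame) for res, frame in samples.items()})
```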

3.2.1 Customized Parquet-Based Data Management System

National Institute of Wind Energy (NIWE) has indigenously developed a customized Python package named RDAF_DBMS to store/retrieve RE wind and solar resource data in the parquet file storage format on the NIWE centralized server. The sample raw data is extracted, cleaned and pre-processed using the RDAF_DBMS package before analysis. The package contains the following functions:
• write_data: writes the data into the RDAF data management system.
• read_parquet: reads single-station data from the RDAF data management system.
• bulk_read_db: reads data from many stations.
• apply_QC: executes the Quality Check (QC) routine to verify the RAW data. Currently, the QC is applied to WMS and SMS data; QC for generation data is planned to be included in this package in the future.
The raw data is compressed using snappy and stored in the parquet format in month-wise folders on the server. Metadata is generated while writing and storing data in the parquet file format; this metadata provides a summary of the data, viz. station name, columns, period of data availability and last updated date and time. The overall flow of data storage using parquet is shown in Fig. 2.
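For illustration only, the sketch below shows what a month-wise parquet write with snappy compression and a small metadata summary could look like; it is not the actual RDAF_DBMS implementation, and all paths, keys and column names are assumptions.

```python
# Hypothetical month-wise parquet writer with snappy compression and a JSON
# metadata summary; NOT the actual RDAF_DBMS code.
import json
from pathlib import Path
import pandas as pd

def write_station_data(df, station, root="rdaf_store"):
    """Write a station's DataFrame (DatetimeIndex) into month-wise parquet
    folders and refresh a small JSON metadata summary."""
    for period, chunk in df.groupby(df.index.to_period("M")):
        folder = Path(root) / station / str(period)      # e.g. rdaf_store/WMS01/2020-06
        folder.mkdir(parents=True, exist_ok=True)
        chunk.to_parquet(folder / "data.parquet", compression="snappy")

    meta = {
        "station": station,
        "columns": list(df.columns),
        "start": str(df.index.min()),
        "end": str(df.index.max()),
        "last_updated": str(pd.Timestamp.now()),
    }
    (Path(root) / station / "metadata.json").write_text(json.dumps(meta, indent=2))
```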

4 Results and Discussion The results attained with the proposed methodology have been categorized into three cases: (1) file read time analysis with varying number of days and number of columns (case 1); (2) file write time analysis with varying number of days and number of columns (case 2); (3) file storage size analysis with varying number of days and number of columns (case 3). The average values of read time, write time and storage size over 10 iterations are illustrated, since they show consistent results in all scenarios of the analysis.


Case 1 Comparison of the average file read time for the different storage formats, with temporal resolutions of 10, 15, 30 and 60 min, the number of days ranging from 30 to 7300 and the number of columns ranging from 10 to 90 parameters.
• In this scenario, the parquet file storage format shows a consistent increase in file read time and takes less time (in secs) compared to the other storage formats. The results for 365, 1825, 3650, 5475 and 7300 days with 10, 20, 40, 60 and 90 columns and temporal resolutions of 10, 15, 30 and 60 min are shown in Fig. 3a–e.
• The next best data storage format is feather, which shows only a slight variation in file read time compared to the parquet format.
• The HDF5 file format is the third best and also seems to be a good storage option for large volumes of data.
• The csv file storage format shows an exponential increase in data retrieval time as the number of rows of the sample dataset increases, so it takes much more retrieval time for long-term data; csv performs better only for short-term data of about 30 days.
• Figure 3a–e provides the detailed analysis results: the parquet data storage format has the least file read time (in secs) for sample datasets of one, five, ten, fifteen and twenty years compared to the other storage formats.
• Reading a file with 90 columns in a single I/O operation (1.27 s) performs better than reading the same 90 columns in multiple I/O operations of 10 columns each (0.30 * 9 = 2.70 s).

Case 2 Comparison of the average file write time for the different storage formats, with temporal resolutions of 10, 15, 30 and 60 min, the number of days ranging from 30 to 7300 and the number of columns ranging from 10 to 90 parameters.
• The parquet file storage format shows a consistent increase in file write time and takes less time (in secs) compared to the other storage formats. The graphical view for 365, 1825, 3650, 5475 and 7300 days with 10, 20, 40, 60 and 90 columns and temporal resolutions of 10, 15, 30 and 60 min is shown in Fig. 4a–e.
• The next best data storage format is feather, which shows only a slight variation in file write time compared to the parquet format.
• The HDF5 file format is the third choice of storage system for large volumes of data.
• The csv file storage format shows an exponential increase in file writing time as the number of rows of the sample dataset increases, so it takes much more write time for long-term data; csv performs better only for short-term data of about 30 days.

Fig. 3 a File read time analysis of 365 days. b File read time analysis of 1825 days. c File read time analysis of 3650 days. d File read time analysis of 5475 days. e File read time analysis of 7300 days (plots of average file read time in secs versus number of columns/temporal resolution for the parquet, hdf5, feather and csv formats)

Fig. 4 a File write time analysis of 365 days. b File write time analysis of 1825 days. c File write time analysis of 3650 days. d File write time analysis of 5475 days. e File write time analysis of 7300 days (plots of average file write time in secs versus number of columns/temporal resolution for the parquet, hdf5, feather and csv formats)

• The analysis results are provided in Fig. 4a–e, which shows that the parquet data storage format has the least file write time (in secs) for sample datasets of one, five, ten, fifteen and twenty years compared to the other storage formats.
• Writing a file with 90 columns in a single I/O operation (0.951 s at 60 min resolution) performs better than writing the same 90 columns in multiple I/O operations of 10 columns each (0.612 * 9 ≈ 5.51 s).

Fig. 4 (continued)

Case 3 Comparison of the file storage size for the different storage formats, with temporal resolutions of 10, 15, 30 and 60 min, the number of days ranging from 30 to 7300 and the number of columns ranging from 10 to 90 parameters.
• In this scenario as well, the parquet file storage format shows a consistent increase in file storage size and takes less storage space (in megabytes), by compressing the data, compared to the other formats. The graphical view for 365, 1825, 3650, 5475 and 7300 days with 10, 20, 40, 60 and 90 columns and temporal resolutions of 10, 15, 30 and 60 min is shown in Fig. 5a–e.
• The next best data storage format is feather, which shows only a slight variation in file storage size compared to the parquet format.
• The HDF5 file format is the third choice of storage system for large volumes of data.
• The csv file storage format shows an exponential increase in file storage size as the number of rows/columns of the sample dataset increases.
• The analysis results in Fig. 5a–e show that the parquet data storage format has the least file storage size (in megabytes) for sample datasets of one, five, ten, fifteen and twenty years compared to the other storage formats.

5 Conclusion and Future Work The primary purpose of this research is to present a comprehensive analysis of both data storage and time series data management in a local centralized server. The evolving area of machine learning entails testing multiple alternatives and algorithms for various parameters, from data cleaning to model validation, for decision making and prediction problems. Python programmers sometimes load a complete data collection into a Pandas data frame without ever modifying the stored data, and this loading phase can seem comparatively long when working with time series data. This article discusses the various alternatives for data storage in terms of file read time, file write time and disk storage capacity. The experiment tested the loading and writing times of small to medium sample datasets stored in various file formats (CSV, Feather, Parquet and HDF5). The file size on disk has also been evaluated for each storage format and the findings displayed. It is important to note that the study does not deal with very large data collections, but rather with typical small to medium time series datasets. From this performance assessment, the Parquet file format has been shown to be a reasonable option in most cases concerning loading time, writing time and disk storage capacity. This also allows Python developers to develop a tailored package using parquet to store data as parquet files on the server before creating any machine learning and artificial intelligence application


Fig. 5 a File storage size analysis of 365 days. b File storage size analysis of 1825 days. c File storage size analysis of 3650 days. d File storage size analysis of 5475 days. e File storage size analysis of 7300 days (average file storage size in MB versus number of columns/temporal resolution for the parquet, hdf5, feather and csv formats)


Fig. 5 (continued) b File storage size analysis of 1825 days


Fig. 5 (continued) c File storage size analysis of 3650 days


Fig. 5 (continued) d File storage size analysis of 5475 days


Fig. 5 (continued) e File storage size analysis of 7300 days


Such a package would reduce the loading time of the entire dataset. Future work includes maintaining the efficiency of the Parquet format by tuning the iterations used for the analysis, and comparing the Parquet file format with several other time series storage management systems using massive data volumes.
Acknowledgements The installation of METMAST for RE resource data collection is funded by the Ministry of New and Renewable Energy (MNRE), Government of India. The authors would like to express sincere thanks to all the engineers in NIWE who have been part of the meteorological mast installation and data collection process used for this analysis.
Conflict of Interest The authors disclose no potential conflicting interests regarding the publication of this work.


Data Processing and Analytics for National Security Intelligence: An Overview G. S. Mani

Abstract Data processing and analytics is one of the most trending terminologies in business research today. The term is mostly used in the context of processing, analysing and understanding important trends in businesses, with the objective of making agile decisions in real time. Modern techniques involving Artificial Intelligence, Artificial Neural Networks and Deep Learning enable the extraction of high-level abstractions and provide insight into complex patterns of an order that was not possible until recently. The present paper provides an overview of the data processing and analytic methodologies used for gaining intelligence in the context of national security. Modern-day national security is threatened on two fronts. One is from external agencies, traditionally known to be hostile. The other is from terrorism and insurgency groups mostly operating within the physical boundaries of the country. The former is dominated by high-technology systems, and hence intelligence in that context is mainly related to gaining information superiority regarding hostile systems and tactics. Dealing with the latter has to be done under a totally different framework, and solutions to these conflicts do not lie in the conventional battlefield. The objective of the present paper is to provide a perspective view of the data-related techniques and technologies used for gaining intelligence for national security in the above contexts.

1 Introduction
Thomas Davenport defines analytics as 'the extensive use of data, statistical and quantitative analysis, explanatory and predictive models and fact-based management to drive decisions and actions' [1]. It is well known that data analytics refers to the analysis of data to aid decision-making and is applicable to all disciplines. However, most people associate it with 'business data analytics'. One probable reason is that companies are able to expand their businesses by collecting and analysing data on customers' purchase behaviour and predicting future trends. Data processing and analytics directly impact growth across all sections of business, including retail,


finance, consumer products, travel and entertainment. Apart from driving present-day business intelligence mechanisms, these techniques have also been used in certain big data domains including speech recognition [2], computer vision [3] and health informatics [4]. Analysis has always been at the core of all innovations, developments and improvements. With the growth of Information Technology (IT) tools, new methodologies based on Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL) and big data have been developed for exploring analytical methodologies in disciplines related to security issues such as cybersecurity [5, 6], fraud detection [7], crime analysis [8], underwater mine warfare [9], intrusion detection [10] and anomaly detection [11]. Safeguarding the nation both from external threats and from internal uprisings has been an issue of concern, where analysis plays a dominating role. Scientific and technical intelligence obtained by the qualitative analysis of technical data associated with a source, emitter or sender plays a great role in ensuring national security. This paper is an attempt to provide an overall view of the processing and analytic techniques used for extracting intelligence from data in the above context. The paper is organized in the following manner. Section 2 deals with intelligence in the context of national security and discusses a generic data analytic model for the same. Section 3 is about intelligence gathered from enemy equipment like radar systems. It discusses the concepts of electronic intelligence, emitter identification and radar fingerprinting, and the data processing methodologies followed in these techniques are presented. Section 4 is about how data processing of open sources can provide intelligence for fighting asymmetric warfare. Section 5 is about data processing involving multiple sensors. The concluding section summarizes the various aspects discussed in the paper.

2 Intelligence in the Context of National Security
National security depends, to a large extent, on prior information collected by people authorized to collect and prepare the required information in a proper, easily interpretable format. This information, generally termed intelligence, is based on large volumes of relevant data collected from different sources and managed properly. Just as a large customer database can provide business intelligence essential for boosting the business only when analysed properly, it is essential that the data relevant for national security is processed properly before it becomes 'intelligence' useful to the security agencies. It would not be possible to mount effective operations without knowing the 'who and where' about those who may be threatening national security and the 'what and how' about the use of hostile weaponry. Strategic and Tactical Intelligence: From the temporal standpoint, intelligence requirements can be classified as strategic or tactical. Strategic intelligence is required for formulating security policies and plans and is based on the vision


Fig. 1 A conceptual model for collection, extraction, analysis of information for National security

and foresight of the commanders [12]. Tactical intelligence is required for conducting battle engagements as they are being executed. Thus, the former focusses on long-term issues, whereas the latter focusses on short-term contingencies without losing sight of the long-term goals. Generic Data Analytic Model: A generic data analytic model for the collection, extraction and analysis of national security information is shown in Fig. 1. The source of primary data depends on the type of intelligence one is looking for. It can be a sophisticated electronic sensor, an encrypted communication channel, a simple blog or a Twitter message. Extraction involves parsing to obtain the kind of intelligence sought. This is followed by a filtering process to eliminate unwanted items or noise, retaining only those that need to be processed further. This is an important step, since it can always be assumed that useful content will normally be only a very small portion of the collected data. In many cases, it would be like 'searching for a needle in a haystack', and hence filtering is controlled by a 'discriminant' which defines the bandwidth of intelligence sought. Retained items can then be stored in a database for future query-based analysis, or they can be analysed immediately for online processing and further action, such as initiating a jamming signal against an incoming threat. The conceptual model is very generic and needs to be adapted based on the needs of the application.
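A hedged sketch of the conceptual model in Fig. 1 is given below: raw items are parsed, filtered by a 'discriminant' predicate, and either stored for later query or acted upon immediately. All function names, the keyword-based discriminant and the sample messages are illustrative assumptions, not part of the model itself.

```python
from typing import Callable, Iterable, List

def run_pipeline(raw_items: Iterable[str],
                 extract: Callable[[str], dict],
                 discriminant: Callable[[dict], bool],
                 store: Callable[[dict], None],
                 act_now: Callable[[dict], None],
                 urgent: Callable[[dict], bool]) -> None:
    """Collection -> extraction -> filtering -> storage or immediate analysis."""
    for raw in raw_items:
        item = extract(raw)              # parsing step
        if not discriminant(item):       # filter: drop noise outside the intelligence bandwidth
            continue
        if urgent(item):
            act_now(item)                # e.g. cue an immediate response
        else:
            store(item)                  # retain for later query-based analysis

# Illustrative plumbing: a keyword discriminant over free-text messages.
database: List[dict] = []
run_pipeline(
    raw_items=["routine weather report", "suspicious convoy near border"],
    extract=lambda raw: {"text": raw},
    discriminant=lambda item: "convoy" in item["text"] or "border" in item["text"],
    store=database.append,
    act_now=lambda item: print("ALERT:", item["text"]),
    urgent=lambda item: "border" in item["text"],
)
print(len(database), "item(s) retained")
```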

3 Intelligence in the Context of External Threats
Intelligence gathered about offensive equipment or weapons forms an integral part of dealing with any external threat. Today's warfare has turned mostly digital, with most offensive weapons, including missiles and artillery equipment, being controlled by electronic means. Ground-based, air-borne and ship-borne radars are the main equipment controlling most of the actions in a modern digital battlefield. In this section, we discuss issues related to gathering intelligence about the enemy's radar capabilities. This


is called electronic intelligence (ELINT) and is based on deriving the characteristics of the radar, which can help in estimating the adversary's capability and in planning defence measures accordingly. Intelligence gathered from communication equipment is called COMINT; it deals with the content being shared in the communication channel, which can be useful in planning the strategy of national defence. In this paper, we discuss only ELINT.

3.1 ELINT Receivers
The US Joint Chiefs of Staff define ELINT as 'Technical and geolocation intelligence derived from foreign noncommunications electromagnetic radiations emanating from sources other than nuclear detonations or radioactive sources' [13]. The following aspects play an important role in ELINT.
(a) The radars of interest are located in enemy territory and access to them is not possible. Hence, the only method of gathering details about them is by intercepting the signals they radiate, using ELINT receivers placed within friendly territory. In that sense, the term can be related to remote sensing of the enemy's radars.
(b) Many of the enemy's radars will be located deep within foreign territory, and hence the signal strength received by the ELINT receivers will be very low. The receivers must therefore be capable of receiving, processing and extracting the required intelligence under very low signal-to-noise ratio (SNR) conditions.
(c) Since a priori information about the radars is not likely to be available, the receivers are expected to operate in wide-open mode (they must be able to receive signals from all directions and also have a large instantaneous bandwidth).

Signal Environment: Basic radar functioning is shown in Fig. 2a. The radar transmitter sends out an electromagnetic (EM) pulse stream, which gets reflected by an object or target. The reflected echo pulses are detected by the radar receiver, from which the distance and velocity of the target can be found. The angular location of the target is obtained using the directive antenna of the radar. The typical signal environment faced by an ELINT receiver is shown in Fig. 2b. In general, it will be receiving signals from various transmitters (called emitters) from different directions, located at different distances and operating at different frequencies. The objective of the receiver is to receive, interpret, locate and identify all hostile radar emitters. Emitter identification and radar fingerprinting: Within the signal environment of an ELINT receiver, there can be many types of radar emitters. Emitter identification refers to identifying the type of the radar, based on the characteristics derived from the pulse stream intercepted by the ELINT receiver. This is often done based on the pulse descriptor word (PDW), derived from a batch of pulses processed by the pulse processor in the ELINT receiver.


Fig. 2 a Basic radar functioning, b ELINT receiver receiving signals from large number of sources

Signals identified as emanating from some of the radars may require further processing for initiating offensive or evasive action. As an example, signals of a tracking radar may sometimes require further online processing to find finer details about the radar. This is referred to as radar fingerprinting and would be based on a batch of received pulses or, in some cases, on processing the signal within each pulse as well. A typical frequency fingerprint of a commercial ATC radar is given in Fig. 3a. The X-axis refers to the time scale and the Y-axis to frequency. The signal strength is colour coded, with darker regions showing the presence of signal. A similar fingerprint from a simulated frequency-modulated source is shown in Fig. 3b [14]. In practice, similar fingerprints will be required for other important characteristics of the radar, such as pulse width (PW) and pulse repetition interval (PRI). All these together form the PDW, which is described later in the section.

Fig. 3 Radar frequency fingerprints (from [14]): a ATC radar, b simulated FM emitter


3.2 Processing in ELINT Receivers
The data processing methodology at the ELINT receiver is shown in Fig. 4. The receiver consists of three parts:
(a) Pulse deinterleaving
(b) Pulse descriptor word formation
(c) Emitter identification and fingerprinting.

Due to the several emitters in the signal environment, a single input stream at the ELINT receiver will consist of a combination of pulses coming from all radars operating at that time. In order to derive intelligence about the different radar emitters, one of the first tasks of the receiving system is to separate the combined, or interleaved, pulses into individual pulse streams corresponding to each radar. This process is known as pulse deinterleaving and is shown in Fig. 5a [15, 16]. The top two plots represent periodic pulse trains emitted from two individual radars. The centre plot shows how the interleaved signals appear at the EW receiver. The bottom plots represent the successful deinterleaving of the received signal, which in a perfect scenario should identically match the top plots. After pulse deinterleaving, the characteristics of each pulse are measured and packed into a structure labelled the pulse descriptor word (PDW), which is further processed for emitter identification.
Pulse Descriptor Word: Every pulse received at the receiver is described by a pulse descriptor word of the form

PDW = [par1, par2, par3, …, parN]   (1)

where par1, par2, par3, …, parN represent the parameters of the pulse. Some of the parameters could be time of arrival (TOA), angle of arrival (AOA), carrier frequency (RF), pulse amplitude (PA), pulse width (PW) and pulse repetition

Fig. 4 Data processing at the ELINT receiver


Fig. 5 Pulse deinterleaving, Pulse parameters and PDW format

interval (PRI). Some of these, with reference to a radar pulse, are shown in Fig. 5b. A typical PDW format is shown in Fig. 5c. The AOA of the signal pulse provides the direction of the emitter and is taken as the basic descriptor of the emitter, based on which clustering is done. AOA is a relatively stable parameter and can be used to segregate the pulse stream into clusters belonging to the same angular sectors. Further, the angular sectors can be simplified by sorting on the RF and PW parameters. The PRI and antenna-based information of the emitter are derived from successive TOA and PA measurements and are used for multi-dimensional clustering. Figure 6 shows the representation of emitters in a three-dimensional space, where the three dimensions of AOA, RF and PW are shown along the three axes. The figure shows three pulses described by

V1 = [0, f0, PW0]   (2)
V2 = [3π/8, f2, PW0]   (3)
V3 = [π/8, f0, PW1]   (4)

Pulses 1 and 3 have the same frequency, whereas Pulses 1 and 2 have the same PW.


Fig. 6 Representation of pulses in a three-dimensional space
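As a concrete illustration of the representation in Fig. 6, the sketch below describes each intercepted pulse by an (AOA, RF, PW) triple and groups the pulses into candidate emitters with K-means, one of the standard clustering methods mentioned in Sect. 3.3. The simulated parameter values, the feature scaling and the use of scikit-learn are assumptions for illustration only, not part of any receiver described here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulate interleaved pulses from three emitters, each with its own
# nominal AOA (rad), carrier frequency (MHz) and pulse width (us).
emitters = [(0.0, 9200.0, 1.0), (3 * np.pi / 8, 9600.0, 1.0), (np.pi / 8, 9200.0, 2.5)]
pulses = np.vstack([
    np.column_stack((rng.normal(aoa, 0.02, 200),
                     rng.normal(rf, 5.0, 200),
                     rng.normal(pw, 0.05, 200)))
    for aoa, rf, pw in emitters
])

# Scale the three dimensions so that AOA, RF and PW contribute comparably,
# then sort the pulses into clusters (candidate emitters).
features = StandardScaler().fit_transform(pulses)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

for k in range(3):
    aoa, rf, pw = pulses[labels == k].mean(axis=0)
    print(f"cluster {k}: AOA={aoa:5.2f} rad  RF={rf:7.1f} MHz  PW={pw:4.2f} us")
```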

3.3 Data Processing Techniques
Analysis of the pulsed data intercepted by an ELINT receiver and fingerprinting of the emitter hold the key in modern electronic warfare. Success of a mission depends on accomplishing this task reliably and in real time. A typical application occurs when deception jamming has to be initiated during an offensive mission after establishing the technical identity with minimum ambiguity. That is the reason most of the activities in this discipline are classified and not available in open literature. Some of the techniques reported in the open literature are discussed below.
(a) Sorting based on multi-parameter clustering: Histogram-based methods using simple differences are useful for PRI-based deinterleaving. Mardia et al. proposed a discrete histogram-based method based on accumulation and differentials, called Cumulative Differential (CDIF) [17]. Efforts to decrease the computational cost of the same method led to Sequential Differential (SDIF) histogramming [18]. Standard methods such as Support Vector Clustering (SVC) and K-means clustering can be adapted in simple cases. As the signal environment becomes dense, the amount of data to be handled becomes very large, and treating all the data as training samples makes the adjacency matrix of the SVC clustering algorithm enormous, which affects the speed of processing. In such cases, processing can be speeded up by adopting joint K-means and SV clustering. The complexity of the signal environment is quantified as entropy in such cases, which helps in quantification and macroscopic analysis of the environment [19]. This is shown in Fig. 7. Pulse data processing starts with the block former accumulating pulses from the ELINT front end. The block of pulses is submitted to the multi-parameter clustering sorter (MPCS) after a certain pre-determined number of pulses or after a pre-determined time interval. MPCS deinterleaves the radar pulse sequence based on multi-dimensional attributes.


Table 1 Some commonly-used PRI types in radar

Type | Feature | Comments
Constant PRI | Peak variations < 1% of mean PRI is taken as constant PRI | Mostly non-military applications
Jittered PRI | Variations of up to 30% of mean PRI | Mostly used for ECCM; detection of jittered PRI in ELINT receivers is difficult
Dwell & Switch PRI | The radar transmits pulses at a constant PRI for a dwell time and switches its PRI for the next dwell time | Used to resolve range ambiguities in pulse Doppler radars
Staggered PRI | Use of two or more PRIs in a fixed sequence | Used to eliminate blind speeds in Moving Target Indicator (MTI) systems
Sliding PRI | Monotonically increasing or decreasing PRI followed by a rapid jump to one extreme limit when the other extreme limit is reached | Can improve radar's functionality and also eliminate blind ranges
Pulse Group | Radars transmit groups of closely spaced pulses separated by longer time intervals | Resolve range and velocity ambiguity problems

Fig. 7 ESM Data processing using clustering analysis (adapted from [19])

MPCS and the TOA-difference histogram deinterleave the pulses in the block into pulse chains. The deinterleaving has two stages: (i) MPCS splits each block of pulses into a number of batches of pulses; (ii) the batches are then processed sequentially by the TOA-difference histogram and split into individual pulse chains. The parameters to be entered in the emitter table are then evaluated for each deinterleaved pulse chain by the pulse chain characterizer. The parameters of the characterized pulse chains are then compared with those in the current emitter table by the emitter table updater. Simulation shows that the system can sort a highly dense and complex pulse environment [19, 20].
(b) Data fusion based on PDW: Modern radars are designed to change their PRI modulation to resolve ambiguities or to improve their countermeasure capabilities, which makes them difficult to identify. Some of the common PRI types used in radars are given in Table 1. Highly staggered PRI (SPRI) is one of the most difficult signals to identify, since most deinterleaving algorithms cannot deinterleave such complex pulse data [21, 22]. Data fusion based on the PDW can help in solving this problem. After detecting the sub-PRIs of a complex SPRI based on overlapped PRI bins, data fusion is carried out to distinguish between the SPRI and jittered PRI signals according to their characteristics in the set of PRI values. This method can be extended to establish a framework which can be applied to fingerprint staggered, jittered as well as sliding PRI signals. Tian et al. have applied this technique of Emitter Description Word (EDW) fusion to deinterleave SPRI signals with more than 10 levels of staggering. They claim to have applied this technique to real-time data collected from a ship-based phased array radar [23].
(c) Double adaptive thresholding: Due to the increasing number of LPI radars, the inherent noise limitations of ELINT receivers and the overlapping pulses in a dense environment, some pulses will be missed at the receiver input. It is observed that a histogram method based on adaptive thresholding can provide a robust and reliable approach in such conditions. Two thresholds, the first to extract constant and staggered PRIs and the second to reveal the jittered pulses, can be integrated adaptively to balance the correct detection probability against the false alarm rate. It is reported that such a system can handle 'missed pulses' much better than SDIF or FFT-based methods, as shown in Fig. 8.
(d) RNN/DL-based techniques: Recurrent neural networks (RNN) are useful for processing discrete sequences and have been used for machine translation and text comprehension [25, 26]. Since PRI sequences in pulse groups have formulations similar to word sequences, they have been used to extract high-dimensional temporal features of the pulses for emitter identification. A deep learning (DL) model based on PRI quantization, vectorization, a gated recurrent unit (GRU) and classification is shown in Fig. 9. The model can extract high-dimensional sequential patterns hidden in pulse trains with agile parameters and has shown robust performance in recognizing radars that are hardly distinguishable according to their statistical parameters [27]. This has been further extended for recognizing multi-function radars based on hierarchical mining [28].
(e) Modified Self-organizing Feature Map (SOFM): The self-organizing feature map (SOFM) neural network is a major branch of Artificial Neural Networks (ANN), with self-organizing and self-learning features. Self-organizing maps differ from other ANNs in that they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent).


Fig. 8 Comparison of efficiencies of Adaptive threshold method with other methods (reconstructed from [24])

Fig. 9 Recurrent Neural network structure for radar classification (from [28])

They use a neighbourhood function to preserve the topological properties of the input space. SOFM has been used to optimize Wireless Sensor Networks (WSN) by dynamically adjusting the transmitting power of the cluster head nodes [29]. This has been adapted for deinterleaving and clustering of radar pulses [30]. However, the setting of the conventional SOFM size depends on prior information, and the network structure cannot adjust itself, so it is not very useful for meeting the needs of modern electronic warfare. An SOFM with self-adaptive network topology (SANT-SOFM) has been proposed which can dynamically adjust its topology as the input changes. Initially it starts with a small map size and gradually gets optimized as the input changes. Structural optimizations including neuronal elimination, merging and division are used to optimize the topology of the SOFM network, constructing an optimal topology, as shown in Fig. 10. Simulation results show that the proposed algorithm could not only adapt to the complex and variable EW environments,


Fig. 10 Neuronal position and weight adjustments (from [29])

but also obtain better clustering effects, thus improving the deinterleaving performance effectively.
(f) Real-time processing: Deinterleaving and emitter identification in a dense environment on a real-time basis require processing with low latency. The maximum allowable processing latency will be a function of the PRFs of the expected emitters in the environment and also the total number of emitters expected to be processed. Theoretical latency requirements of an ELINT receiver with 250 MHz bandwidth for different conditions are shown in Fig. 11.

The PRFs of the emitters considered in this case are 2 kHz (low), 15 kHz (medium) and 200 kHz (high). The largest tolerable processing latency corresponds to low-PRF emitters (high PRI), since the time between adjacent pulses of a low-PRF emitter is

Fig. 11 Latency requirements of processing receiver


large. With increasing density, the latency requirements become more stringent. Up to 40 µs may be tolerable in an environment of 12 emitters with all emitters working in low-PRF mode, but handling high-PRF radars in the same environment will require 500 ns. In general, a practical latency requirement for an ELINT receiver can be taken as a maximum of 1 µs. Robust low-latency deinterleaving algorithms are necessary for reliable identification and engagement of the correct targets, since any misidentification or false identification can be disastrous. Though fuzzy-based self-organizing neural networks may be desirable, hardware implementation for achieving low latency may be difficult. In general, FPGA technology lends itself well to the low-latency real-time requirements of deinterleaving in modern EW systems.
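To make the deinterleaving discussion concrete, the sketch below shows the TOA-difference histogram step that CDIF/SDIF-style methods build on: differences between pulse times of arrival are histogrammed, and peaks indicate candidate PRIs. The simulated pulse trains, the bin width and the detection threshold are illustrative assumptions; this is not a full CDIF/SDIF implementation and ignores missed pulses and jitter.

```python
import numpy as np

# Two interleaved pulse trains with PRIs of 100 us and 37 us (arbitrary values).
toa = np.sort(np.concatenate([np.arange(0, 10000, 100.0),
                              np.arange(5, 10000, 37.0)]))

# First-order TOA differences between successive pulses; higher orders
# (differences across 2, 3, ... pulses) are accumulated as in CDIF/SDIF.
max_order = 4
diffs = np.concatenate([toa[k:] - toa[:-k] for k in range(1, max_order + 1)])

# Histogram the differences and flag bins whose count exceeds a simple threshold.
bin_width = 1.0  # us
bins = np.arange(0.0, 200.0 + bin_width, bin_width)
counts, edges = np.histogram(diffs, bins=bins)
threshold = 0.5 * counts.max()           # naive fixed fraction, for illustration only
candidate_pris = edges[:-1][counts > threshold]

print("candidate PRIs (us):", candidate_pris)
```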

4 Intelligence in the Context of Internal Threats
This section is about intelligence gathered from people, in particular those suspected of being associated with terrorist or insurgency activities. In the last couple of decades, terrorism has become a world-wide phenomenon. From 2006 to 2013, there were approximately 90,000 terrorist attacks causing about 130,000 fatalities throughout the world [31]. In 2019 alone, there were 32,836 casualties due to terrorism. A major way to take pro-active action against the extremists is by collecting intelligence about terrorist agencies and organizations.

4.1 Asymmetric Warfare
Coined at the beginning of the century, 'asymmetric warfare' is an umbrella term that includes engagements with insurgency, militancy, proxy war and terrorist campaigns. However, war between two unequal sides has been going on for a long time. Out of a total of 294 wars between 1816 and 1899, 262 (about 89%) were asymmetric in nature [32]. The asymmetry arises from three aspects: strength, terms of reference and the weaponry used. It is the weaker side that usually resorts to asymmetric war to offset its disadvantage. The group, being smaller in numbers, uses non-traditional tactics, weapons or technology, seeking major psychological impact such as shock or confusion among the general public. Such actions also curb the initiative, freedom of action and willpower of the operating officials. The last-mentioned factor may cause frustration among the operational forces, which causes further damage among the agencies responsible for security. According to Henry Kissinger, 'The conventional army loses if it does not win. The guerrilla wins if he does not lose' [33]. Thus, combating asymmetric war requires knowledge of the enemy's weaknesses. Though there can be many offensive measures based on a high-technology, high-firepower militarized approach, the defensive measures could be based on synergising high-quality intelligence [34] (see Fig. 12).


Fig. 12 Synergising quality Intelligence is a major Defensive measure in Asymmetric Warfare (based on [34])

4.2 Open-Source Intelligence (OSINT)
Open-source intelligence is the intelligence that can be extracted from publicly available resources. Before the digital era, the most prolific OSINT sources were television, radio and print media. With the Internet gathering momentum, the Internet became the major source of data, but it has since been joined by many social media networks including Facebook, Twitter, Instagram, various blogs, video streaming services and many others [35]. It is estimated that the two social networks Twitter and Facebook contribute more than 650 million daily tweets and 4 billion daily messages, respectively. Apart from that, Google searches (more than 5 billion per day) and YouTube video watching (4 million a minute) add to the open data resource. In modern times, OSINT means utilizing publicly and commercially available information in various social media and coupling it with rapidly improving big data analysis tools, so that meaningful intelligence can be gathered about the enemy forces, their partners and other key players. In irregular conflicts, intelligence gathered from foreign social media platforms and the dark web can also be very useful in locating the real enemy, as well as others who attempt to avoid the negative consequences of defined allegiances while still benefiting from security and services. Advanced analytical techniques present the opportunity of identifying where the support of each citizen truly lies, which may become the key to the success of the operation.

4.3 Data Processing of OSINT
OSINT can be used to identify events, activities and patterns which become the basic building blocks for synergising quality intelligence useful in asymmetric warfare.


For example, the tracking of hashtags can be used to gauge public opinion and attitudes as they relate to the plans behind actions by miscreant agencies [36]. Admittedly, the feedback from open-source intelligence will need to be carefully measured and interpreted. Simpler techniques like Natural Language Processing (NLP) alone cannot accomplish this task; that is where modern data analytic tools need to be intelligently deployed, and deep learning can be useful in this approach. It is understood that an Amazon Web Services-built cloud platform was used by the CIA to perform open-source intelligence and big data experiments in partnership with industry experts. This could enable police and homeland security to scan electronic chatter linked to a crime in more than 200 languages, including emojis [37]. Big data technology can be used to crunch enormous amounts of data and, when used correctly, can identify hard-to-detect patterns of terrorist groups or terrorist attacks, allowing users to either prevent or respond to terrorism. This technology has been used to link the different databases of different organizations and bridge the gaps, extracting intelligence useful for tracking terrorists' movements [38]. Palantir Technologies, a Palo Alto technology company working on this technology, has given a demonstration of how all useful information could be linked into a single portal for the intelligence and law enforcement communities to act upon. A condensed version of a hypothetical situation that Palantir created to show this OSINT capability is shown in Table 2 [39].
Table 2 A hypothetical situation linked through big data analytics (from [39])
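The hashtag-tracking idea mentioned above can be illustrated with a few lines of code: hashtags are pulled out of a message stream and their frequencies tallied, giving a crude indicator of shifts in public attention. The sample messages and the regular expression are illustrative assumptions; real systems would add language handling, deduplication and far more careful interpretation.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)", re.UNICODE)

def hashtag_counts(messages):
    """Tally hashtag frequencies over a batch of messages (case-insensitive)."""
    counts = Counter()
    for text in messages:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts

sample = [
    "Crowd gathering downtown #protest #cityname",
    "Traffic blocked near the square #Protest",
    "Festival preparations under way #cityname #festival",
]
for tag, n in hashtag_counts(sample).most_common(5):
    print(f"#{tag}: {n}")
```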


5 Intelligence Derived from Multiple Sensors
Data obtained from sensors forms the basic input for deriving intelligence. The quality and relevance of this intelligence can be improved by deriving data from multiple sensors. These sensors can be on a single platform [40] or on multiple platforms in space, on land, on the sea surface or even on sub-surface platforms. Their integration with a networked decision support system helps in inflicting maximum damage on the enemy. This section discusses data processing aspects related to data emerging from different sensors. Different types of sensors that can provide intelligence are shown in Fig. 13. They are based on different subdisciplines operating at different wavelengths, from radio waves through microwave, infrared and visible light to the X-ray and gamma-ray region [41]. The operational benefits that can accrue from multi-sensor integration are shown in Table 3. Intelligence derived from such a combination of sensors can be useful for threat assessment in a variety of situations, such as countering incoming offensive weapons [42], sensing toxic environments [43] or even detecting improvised explosive devices [44]. Guardian Angel was a technology demonstration programme based on fusion of imagery and near-imagery sensors deployed to detect improvised explosive devices (IEDs) in Iraq and Afghanistan. NC3S Vigilare was an Australian system blending information from about 50 sensors, including radar systems, with database information from military and civil aircraft. It is understood that Raytheon's Cooperative

Fig. 13 Sensors from different subdisciplines useful for collecting Intelligence


Table 3 Benefits from multi-sensor integration

Type of Benefit | Mechanism | Operational Advantage
Operational robustness | When a sensor becomes non-operational, others can contribute | Increased probability of detection of hostile object; graceful degradation
Extended spatial coverage | Different sensors can cover different spatial sectors | Increased probability of detection of hostile object; increased survivability
Extended temporal coverage | One sensor can detect while another can measure | Increased probability of detection of hostile object
Reduced ambiguity | Joint information from multiple sensors can reduce ambiguity | Accuracy in attack; target prioritization; reduced pilot workload
Improved sensitivity | Integrated output from various sensors | Increased reaction time; increased survivability
Enhanced spatial resolution | Multiple sensors can provide a synthetic aperture effect | Accuracy in attack
Improved system reliability | Redundancy | Graceful degradation
Increased dimensionality | Less vulnerable to natural conditions, like weather, etc. | All-weather operation; better reliability

Engagement Capability (CEC), an air surveillance and defence system based on multi-sensor fusion, can provide a composite Single Integrated Air Picture (SIAP) to yield tracking data of sufficient fidelity to provide weapons firing solutions for ships, aircraft and land vehicles.

5.1 Data Processing in a Multiple Sensor Environment
In the new war paradigm, where both conventional and asymmetric wars are prominent, an integrated intelligence, surveillance and reconnaissance (ISR) system plays a decisive role. This requires that data from multiple sensors are acquired, integrated and processed, and that the extracted intelligence is made available to decision-makers and intelligence analysts. It is also essential that the intelligence provided is timely, accurate, relevant and coherent to support the commander's conduct of activities. It is reported that a properly designed system can provide a signal-to-noise ratio (SNR) improvement of 4 to 8 dB, factor-of-10 reductions in convergence and identification times, and as much as 100 times better geolocation accuracy than a single sensor [45]. Figure 14 shows an Integrated Sensor-collected Intelligence Architecture [44] for handling data acquired from different sensors. The main features are metadata tagging, assured communication and net centricity.


Fig. 14 Integrated Sensor-collected Intelligence Architecture (Adapted from [45])

Metadata: This is required for integration and reusability. Most data from sensors are unstructured, and thus the need arises for the data to be correlated in some way so that they can be meaningfully integrated. Often, time, space and calibration data are useful bases for integration. Calibration enables the common use of measurement units, reference points and so forth. Aligning metadata can be accomplished by aligning the standards with which the metadata are created at the point of collection, or the metadata can be 'translated' later to allow data from multiple sensors with misaligned metadata to be 'mapped' onto one another downstream in the process. Well integrated, calibrated data enables the data to be reused.
Assured Communication: Assured, high-capacity communications are essential for making ISR information available to military forces and national security agencies based on need. Because of the large and rapidly growing ubiquity of unmanned aerial vehicles, the demand on the communication infrastructure is increasing. Three major issues in communication networking are bandwidth, latency and security. In some bandwidth-starved environments, sensor data are processed or semi-processed at the sensor front end, subject to time-availability constraints. This enables communications-disadvantaged users to pull only the data they need in near real time. Information latency is one of the critical issues that need to be considered during sensor data integration and processing. Detailed processing for applications like target-signature analysis will generally need to be done at the application end only. This will help raw or lightly processed data to be used by other users without


delay. Further, because of the criticality, protection must be provided against the full range of vulnerabilities, including physical attack on key nodes, electronic attack (e.g. jamming and spoofing) and cyberattack. The US DoD is implementing a net-centric architecture to facilitate collaboration and enable enterprise-wide discovery of and access to data by authorized users. The Transformational Communications Architecture comprises terrestrial fibre, secure satellite communications and software-programmable radios with networking capability [44].
Net centricity: Network-Centric Operations (NCO) is the use of a network to connect decision-making across multiple domains, including ISR, command, control, communications, computers and intelligence (C4I) and precision engagement (PE), among others. ISR involves merging all sensors into spatially aware databases across networks to obtain a common operational and tactical picture. C4I supports timelier and better decisions by assessing, analysing and planning actions. PE focusses on the coordination of strike assets in time and space to achieve the commander's desired effect. This requires a services-oriented architecture that provides transparent interoperability between the various domains. The aim is to make discrete capabilities available, but with each capability supported by a common information and tool infrastructure. The key features would include spatial data production, data reuse, interoperability, an enterprise IT platform, multi-DBMS support and a multi-architecture information system. Such a system would help in the continuous update of situational awareness across many platforms.
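As a concrete illustration of metadata-based correlation, the sketch below tags two sensor reports with time and location and applies the decision logic described in Sect. 5.2: same place and same time suggests fusion, same time at different places suggests 'stitch and hand off', and same place at different times suggests cross-cueing. The report structure, the thresholds and the flat-earth distance approximation are illustrative assumptions, not part of any fielded architecture.

```python
from dataclasses import dataclass
import math

@dataclass
class Report:
    sensor: str
    t: float      # time of observation, seconds
    lat: float    # degrees
    lon: float    # degrees

def correlate(a: Report, b: Report,
              dt_max: float = 5.0,       # seconds, assumed tolerance
              dist_max: float = 500.0):  # metres, assumed tolerance
    """Classify a pair of reports as fuse / stitch / cross-cue / unrelated."""
    # Small-area flat-earth approximation for the separation in metres.
    dx = (a.lon - b.lon) * 111_320 * math.cos(math.radians((a.lat + b.lat) / 2))
    dy = (a.lat - b.lat) * 111_320
    same_place = math.hypot(dx, dy) <= dist_max
    same_time = abs(a.t - b.t) <= dt_max
    if same_place and same_time:
        return "fuse"         # same location, same time: correlate and fuse
    if same_time:
        return "stitch"       # different locations, same time: stitch and hand off
    if same_place:
        return "cross-cue"    # same location, different times: cue upstream integration
    return "unrelated"

r1 = Report("radar", t=100.0, lat=18.5204, lon=73.8567)
r2 = Report("eo",    t=102.0, lat=18.5206, lon=73.8569)
print(correlate(r1, r2))   # -> fuse
```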

5.2 Geolocation Accuracy
This refers to ensuring that the intelligence is gathered from a specified target without any ambiguity, which is one of the primary objectives of a multi-sensor intelligence gathering system. Metadata Usage: Accurate metadata using time and location helps in many cases where ambiguity resolution is required. This is shown for a two-sensor case in Fig. 16. The key question is when and from where the two sensors are getting the data. Correlation and fusion happen if the two sensors are looking at the same location at the same time. When the coverage is on different locations at the same time, this corresponds to 'stitch and hand off', where end-to-end coverage is possible over time. When the coverage is on the same location at different times, this corresponds to 'cross-cueing', which requires upstream integration. In both cases, further processing is required to optimize situational awareness. Model-driven Approach: This approach is used in cases where a particular target needs confirmation. In this approach, first the complicated intelligence problem is


Fig. 16 Ambiguity resolution based on time and location observations

broken down into its constituent pieces. The sensor data is then used to check the existence and confirmation of these parts. The procedure involves the following steps:
(a) Model the target through its processes
(b) Translate the model into potential observables
(c) Generate the generic model from the potential signatures
(d) Match the intelligence obtained from the sensor data to the generic model
(e) Check if the target is true or false.

This can be explained through the example of identifying a potential chemical weapon manufacturing site. The process model could include estimates of the types and quantities of input and output chemicals at the site, the plant's power and water consumption, communications, manpower, temporal sequences, etc. The potential observables could be physical observables based on the above, along with probable subtle observables such as thermal signatures, waste gas generation and electronic signals generated by specialized equipment used at the site. Based on these, a generic model would be generated which can then be matched with the intelligence obtained from the sensors. From temporal sequence analysis, the quantity of manufacture could also be estimated. Target confirmation or rejection is not a one-step process. As data build up, the identification process may become more complicated, with appropriate integrated intelligence emerging from the system. Finally, any alternate hypothesis must be rejected with confidence. Such a model-driven approach is possible only for specific sites. Building a generic template of signatures requires detailed collection of site characteristics and the knowledge of discipline experts. An important aspect is ruling out alternate hypotheses, since any wrong decision can lead to serious global-level repercussions. For instance,


mistaking a baby food manufacturing plant for a chemical weapon manufacturing site can lead to disastrous consequences. For complex, subtle intelligence problems, a wealth of data as well as sophisticated, reliable data processing is essential. It is understood that this technique has been extended for use by a set of UAVs to obtain persistent surveillance of a target and its environment. Compensation for Doppler shift is a serious issue when the platforms are moving continuously. Calibration of metadata based on the speeds and altitudes of the platforms plays a significant role in processing the sensor data in such cases.

6 Conclusions
The security of many nations across the globe is threatened on two fronts: one needs to be engaged through conventional warfare and the other through what is now known as asymmetric warfare. In both cases, intelligence can play a decisive role. Conventional warfare mostly involves high technology, and intelligence inputs about the enemy's radar assets are most helpful. Data processing in such cases involves acquiring pulse data and processing it through pulse deinterleaving and emitter identification. Data analytics aims at signature analysis of enemy radars through radar fingerprinting techniques. Asymmetric warfare uses non-traditional tactics and aims at creating a psychological impact among the general public through shock and confusion. Open sources can provide the data for building intelligence to counter this type of threat. The objective of processing such data is to identify events, activities and patterns that can lead to information about the miscreants. Analysis can reveal their specific plans through tracking their movements. The quality and relevance of intelligence to counter both types of threat can be largely improved by deriving data from multiple sensors simultaneously. Data processing in such a multiple sensor environment requires an integrated architecture based on metadata tagging, assured communication and net centricity. Efficient utilization of the acquired intelligence can be achieved through data reuse involving ISR, C4I and precision engagement. Modern data processing techniques and analytic tools play a prominent role in providing intelligence, which is vital for the security of a nation.

References 1. Davenport TH, Harris JG (2007) Competing on analytics: the new science of winning. Harvard Business Review Press.ISBN: 1422103323 2. Hinton G, Deng L, Yu D, Mohamed A-R, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath T, Dahl G, Kingsbury B (2012) Deep neural networks for acoustic modelling in speech recognition: The shared views of four research groups. Signal Process Mag IEEE 29(6):82–97


3. Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, vol 25. Curran Associates, Inc., pp 1106–1114 4. Ravı D, Wong C, Deligianni F, Berthelot M, Andreu-Perez J, Lo B, Yang G-Z (2016) Deep learning for health informatics. IEEE J Biomed Health Inf 5. Alazab M, Tang MJ (eds) (2019) Deep learning applications for cyber security. Springer 6. Berman DS, Buczak AL, Chavis JS, Corbett CL (2019) A survey of deep learning methods for cyber security. Information 10:122. https://doi.org/10.3390/info10040122 7. Raghavan P, El Gayar N (2019) Fraud detection using machine learning and deep learning. In: 2019 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE), December, 2019. 8. Kim S, Joshi P, Kalsi PS, Taheri P (2018) crime analysis through machine learning. In: IEEE 9th annual information technology, electronics and mobile communication conference (IEMCON), November 2018 9. Denos K, Ravaut M, Fagette A, Lim H-S (2017) Deep learning applied to underwater mine warfare. In: OCEANS 2017-Aberdeen. IEEE, pp 1–7 10. Kumar G, Kumar K, Sachdeva M (2010) The use of artificial intelligence-based techniques for intrusion detection: a review. Artif Intell Rev 34(4):369–387 11. Mascaro S, Nicholso AE, Korb KB (2014) Anomaly detection in vessel tracks using Bayesian networks. Int J Approx Reason 55(1):84–98 12. Rolington A (2013) Strategic intelligence for the 21st century: the mosaic method. Oxford University Press 13. US Department of Defense (2007) Joint Publication 1-02 Department of Defense Dictionary of Military and Associated Terms 14. Agarwal RC, Syam Kumar KSVM, Divakar N (2001) Finger printing Techniques for unique identification of emitters. In: Proceedings of the seminar on emerging trends in electronic warfare 15. Lin S, Thompson M, Davezac S, Sciortino Jr JC (2006) Comparison of time of arrival vs. multiple parameter-based radar pulse train deinterleaves. In: Proceedings of SPIE Vol. 6235. Signal Processing, Sensor Fusion, and Target Recognition XV 16. Tsui J (2004) Digital techniques for wideband receivers, 2nd edn. SciTech Publishing Inc., Raleigh, NC 17. Hk M (1989) New techniques for deinterleaving repetitive sequences. Proc IEE, Part F 136(4):149–154 18. Milojevic DJ, Popovic BM (1992) Improved algorithm for the deinterleaving of radar pulses. Proc IEE, Part F, 98–104 19. Guo Q, Chen W, Zhang X, Li Z, Guan D (2006) Signal sorting based on SVC & K-means clustering in ESM systems. In: King I, Wang J, Chan LW, Wang D (eds) Neural information processing. ICONIP 2006. Lecture Notes in Computer Science, vol 4233. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893257_67 20. Kutman D (2011) Solutions for radar pulse deinterleaving. Carleton University, Ottawa, Ontario 21. Ata’ A, Abdullah S (2007) Deinterleaving of radar signals and PRF identification algorithms. IET Radar Sonar Navig 1(5):340–347 22. Han W, Hu J, Ni J (2011) A study on signal sorting algorithm based on PRI spectrum signatures for staggered radars. Radar ECM 1(2):39–42 23. Tian T, Ni J, Jiang Y (2019) Deinterleaving method of complex staggered PRI radar signals based on EDW fusion. IET International Radar Conf (IRC 2018). J Eng 2019(20):6818–6822 24. Ahmed UI, ur Rehman T, Baqar S, Hussain I, Adnan M (2018) Robust pulse repetition interval (PRI) classification scheme under complex multi emitter scenario. In: 2018 22nd international microwave and radar conference (MIKON), May 2018 25. 
Cho K, Merrienboer B et al (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. Comput Sci, 1–15 26. Lin Z, Feng M et al (2007) A structured self-attentive sentence embedding. In: International conference on learning representations, Toulon, France, pp 1–15


27. Liu Z-M, Yu PS (2019) Classification, denoising and deinterleaving of pulse streams with recurrent neural networks. IEEE Trans Aerosp Electron Syst 55(4):1624–1639 28. Liu Z-M (2020) Recognition of multi-function radars via hierarchically mining and exploiting pulse group patterns. IEEE Trans Aerosp Electron Syst. https://doi.org/10.1109/TAES.2020. 2999163 29. Chen Z, Li S, Yue W (2014) SOFM neural network based hierarchical topology control for wireless sensor networks. J Sens 2014, Article ID 121278, 6 p. https://doi.org/10.1155/2014/ 121278 30. Jiang W, Fu X, Chang J, Qin R (2020) An improved de-interleaving algorithm of radar pulses based on SOFM with self-adaptive network topology. J Syst Eng Electron 31(4):712–721 31. Statistics and Facts—Terrorism. https://www.statista.com/topics/2267/terrorism/ 32. Kalyanaraman S (2012) Asymmetric warfare: a view from India. Strateg Anal 36(2):193–197 33. https://www.oxfordreference.com/view/10.1093/acref/9780191826719.001.0001/q-oro-ed400006377 34. Sudhir MR (2008) Asymmetric war: a conceptual understanding. CLAWS J 35. United Nations Office on Drugs and Crime (2012) The use of the Internet for Terrorist purposes. United Nations, September 2012 36. Kilcullen D (2017) The accidental Guerrilla: fighting small wars in the midst of a big one. C. Hurst & (Publishers), London 37. https://www.washingtonpost.com/business/economy/for-this-company-online-surveillanceleads-to-profit-in-washingtons-suburbs/2017/09/08/6067c924-9409-11e7-89fa-bb822a46d a5b_story.html?utm_term=.8f16d241929a 38. Counter-terrorism Tools Used to Spot Fraud, Richard Waters. Financial Times, https://www. ft.com/content/796b412a-4513-11e2-838f-00144feabdc0 39. https://digital.hbs.edu/platform-rctom/submission/defeating-terrorism-with-big-data/ 40. Shukla AK, Parthasarathy T, Rao PNAP (2003) Use of multisensor fusion technology to meet the challenges of emerging EO and RF threats to a combat aircraft. In: Proceedings of the SPIE 5099, Multisensor, multisource information fusion: architectures, algorithms, and applications 2003, (1 April 2003); https://doi.org/10.1117/12.486878 41. Center for MASINT Studies and Research (2020) “MASINT: the intelligence of the future”, Air Force Institute of Technology. Archived from the original on 2007–07–07. Retrieved 3 Sept 2020 42. Imam N, Barhen J, Glover C (2012) Optimum sensors integration for multi-sensor multi-target environment for ballistic missile defense applications. In: 2012 IEEE international systems conference SysCon 2012, Vancouver, BC, pp 1–4. https://doi.org/10.1109/SysCon.2012.618 9519 43. Mani GS (2013) Mapping Contaminated clouds using UAV—a simulation study. In: IEEE conference INDICON, Bombay, December 2013 44. Hong J, Liu C (2019) Intelligent electronic devices with collaborative intrusion detection systems. IEEE Transactions on Smart Grid 10(1):271–281. https://doi.org/10.1109/TSG.2017. 2737826 45. Integrating Sensor-Collected Intelligence, Report of the Joint Defense Science Board Intelligence Science Board Task Force, November 2008

Framework of EcomTDMA for Transactional Data Mining Using Frequent Item Set for E-Commerce Application Pradeep Ambavane, Sarika Zaware, and Nitin Zaware

Abstract In data mining, finding frequent item sets is an essential task. These frequent item sets are useful in problems such as association rule mining and correlation analysis. Existing systems use specific algorithms to obtain frequent item sets, but when very large data are encountered they are inefficient in communicating and balancing the load, and automatic parallelization is not feasible with these algorithms either. There is a need to build an algorithm that addresses these difficulties and provides the lacking characteristics, such as automatic parallelization, load balancing and fair data distribution. In this paper we use a new technique to discover frequent item sets using MapReduce. A modified Apriori algorithm is used with the HDFS framework, called the EcomTDMA technique. By using a decomposition method, the MapReduce approach can operate on partitions individually and simultaneously in this system, and the reducers then produce the final output of the approach. Three algorithms, the base Apriori, FP-growth and our improved Apriori, were used in the experiments; the process was performed on both a standalone computer and a distributed environment, and the results show that the recommended algorithm performs better than the standard algorithms.

1 Introduction

Data mining ideas and strategies can be applied in various domains such as marketing, medicine, real estate, customer relationship management, engineering and Web mining.

P. Ambavane (B)
Neville Wadia Institute of Management Studies and Research, Pune 411001, India
S. Zaware
AISSM's Institute of Information Technology, Pune 411001, India
N. Zaware
RIIM—The Academy School of Business Management, Pune 411033, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_21


Data mining (DM) has as its primary objectives clustering, classification, prediction and relation analysis (association), which are used to produce non-obvious yet valuable knowledge for decision-makers from large volumes of data. With these approaches, several kinds of information, such as association rules, classifications and clusters, can be discovered [1, 2]. One of the most important applications of data mining is association rule mining, which produces information that helps top-level management or stakeholders make successful decisions in business organizations; such rules clearly benefit e-business at every level. A careful analysis of existing algorithms is carried out in this research, and practical algorithms are suggested for multilevel association rule mining, which successfully searches a given dataset for interesting relationships between items at different levels and is especially helpful for the e-market [3, 4]. A number of effective data mining techniques have been proposed to infer association rules and frequently occurring item sets. Still, with the rapid arrival of the big data era, traditional data mining algorithms have not been able to meet the requirements of large dataset analysis [5]. The efficiency and accuracy of parallel processing must be increased, thus minimizing execution time. The result should also be insensitive to changes in any single personal record, which helps prevent privacy leakage from the outcomes [6]. Therefore, there is a need to combine a traditional frequent item set mining approach with the HDFS system and privacy protection techniques. Association rule mining remains a powerful data mining application for producing information that assists top-level management or stakeholders in making successful business decisions [7]. Datasets are excessively large in this modern era, so sequential algorithms alone cannot process large databases; they struggle to interpret the data correctly and often suffer from performance degradation. A new parallel frequent item set mining algorithm based on MapReduce, called EcomTDMA, is used to solve this problem. This approach improves both storage and computation.

2 Literature Survey

Han et al. [8] proposed the frequent pattern (FP) tree, which stores the compressed data in an extended prefix tree structure so that repeated patterns are kept in a compact form, and introduced an FP-tree-based mining strategy known as FP-growth. The suggested algorithm extracts frequent item sets without candidate set generation. Three techniques are used to achieve efficient mining. First, a massive database is compressed into a smaller data structure to avoid repeated database scans, which are said to be expensive. Second, a pattern-fragment growth strategy is introduced to avoid generating large and costly candidate sets. Third, the mining task is divided into smaller tasks, which greatly reduces the search space. FP-tree-based mining also has various open research problems, such as a highly flexible SQL-based FP-tree structure, the mining of recurrent patterns with constraints and the use of the FP-tree structure for sequential pattern mining [9].


Pursuant to Li et al. [10], the parallel FP-growth (PFP) algorithm separates the mining task into independent partitions. Each partition is supplied to a different machine and is processed independently. It was proposed to resolve the difficulties faced by the FP-growth algorithm, such as memory consumption, computation distribution and the very expensive computation of parallel FP-growth. The PFP algorithm comprises five stages. Initially, the database is divided into smaller sections. In the second phase, the mapper and reducer are used for parallel counting. The frequent items are grouped in the third stage. The frequent pattern (FP) tree is constructed in the fourth step, and the frequent item sets are mined. The local frequent item sets are aggregated in the fifth step. The PFP algorithm is successful in mining tag-tag and WebPage-WebPage associations that are used as part of query recommendation and other analyses [11]. Other authors addressed the problem of extracting frequent item sets from very large databases. In addition to a confidence bound, they discovered rules that have lower support. They designed an algorithm that deliberately samples the collection of items in a single pass; it can trade off the amount of data missed against the item sets that are counted in a pass, and it uses a pruning structure to avoid counting certain item sets. A point of concern is that it uses a support management method that does not fit in memory in one pass, so it switches to the following pass, and there is no repetition [12, 13]. Moreover, an improved approach computes Apriori efficiently as a MapReduce algorithm. MapReduce is used in either homogeneous or heterogeneous clusters for parallel mining of large-size information. Compared to current systems, MapReduce distributes redundant data between map and reduce functions and enables better overall resource utilization [14], so it has become a common method for parallel mining. The authors suggested three algorithms, namely SPC, FPC and DPC, through the use of MapReduce, applying the Apriori algorithm within the MapReduce functions. The DPC algorithm dynamically accepts various data lengths, which is an advantage, and it shows better results than the other two algorithms, SPC and FPC. These three algorithms also scale up linearly with the size of the dataset [15, 16]. Zhang et al. [17] noted that frequent patterns in the vertical format are mined using the Eclat algorithm. The algorithms for mining frequent patterns in horizontal-format databases are not the same as vertical database mining algorithms such as Eclat. In order to obtain the frequent item sets from large amounts of data, a parallel MREclat algorithm that uses a MapReduce approach has been proposed [18]. The MREclat algorithm is composed of three steps. In the initial step, the frequent 2-item sets and their TID lists are obtained from the transaction database. The second step is the balanced grouping step, where the frequent 1-item sets are divided into groups. The third stage is the parallel mining process, where the information obtained in the initial step is redistributed to different computing nodes, and each node runs an improved Eclat to mine frequent item sets.


Finally, MREclat gathers and arranges the final results returned from each computing node. MREclat uses the enhanced Eclat to process data with a similar prefix, and it has been shown to have high scalability and a high speed-up ratio [19, 20]. Furthermore, frequent item set mining is an integral component of association rules and many other fundamental data mining applications [21]. Unfortunately, existing mining algorithms struggle to deal with such large databases as the datasets grow. A parallel FP-growth algorithm, BPFP [3, 22], an extension of the PFP algorithm [1], was proposed, in which FP-growth is parallelized under the MapReduce paradigm. BPFP adds load balancing to PFP, which improves parallelization and naturally improves execution; by improving the grouping method of PFP, BPFP provides more impressive performance, and the heavy load is parallelized with a well-balanced algorithm [23]. Another efficient scheme for mining frequent item sets is FIUT, a highly efficient frequent item set mining (FIM) technique referred to as the frequent items ultrametric tree [4]. It involves two main database scans. In the first scan, it computes the support count of all items in the wide database. In the second scan, it applies the pruning method and keeps only the frequent items. Phase two then assembles small ultrametric trees, and the frequent item sets are mined from these trees. FIUT's benefit is that it quickly builds the K-FIU tree [24]. There are four basic points of interest in FIUT. First, by scanning the database only twice, it reduces I/O overhead. Second, the search space is reduced. Third, FIUT provides frequent item sets as output for each processing batch. Fourth, using this strategy, the user obtains only frequent item sets, as each leaf provides the frequent item sets within the cluster for each data transaction [25]. Similarly, another method uses an extended MapReduce structure. Splitting the mass data file yields a number of subfiles. A bitmap computation is performed on each subfile to acquire its frequent patterns, and the frequent patterns of the overall mass data file are assembled by combining the results of all subfiles. When processing each subfile, an arithmetic analysis approach is used to prune the insignificant patterns. It has been demonstrated that this strategy is scalable and efficient for mining frequent patterns in big data [26]. An improved parallel association rules algorithm based on the MapReduce framework for big data was proposed by [27]. The proposed algorithm is compared with the traditional Apriori algorithm, using execution time to compare the computational complexity of both algorithms; the proposed algorithm has been shown to be more efficient than the traditional one [28]. Furthermore, Liao et al. [29] elaborated a parallel algorithm that executes on the Hadoop platform. MRPrePost is an improved PrePost algorithm that uses the MapReduce structure and is used to discover association rules by mining substantial datasets. There are three stages to the MRPrePost algorithm. In the first step, the database is split into data blocks called shards, which are distributed to each worker node. The FP-tree is built in the second stage. In the last step, the FP-tree is mined to obtain the frequent item sets. Test results have shown that the MRPrePost algorithm is the fastest [29].


In the proposed approach, large datasets are mined using the MapReduce scheme. The BigFIM algorithm is altered to obtain the ClustBigFIM algorithm, which gives the scalability and speed required to extract useful information from large datasets [30]. This valuable information can be used to make better business decisions. There are four basic steps in the suggested ClustBigFIM algorithm. In the first step, the algorithm uses the K-means algorithm to produce clusters. The frequent item sets are mined from the clusters in the second step. The global TID list is obtained by constructing the prefix tree. Finally, the sub-trees of the prefix tree are mined to obtain the frequent item sets. Compared to the BigFIM algorithm, the proposed ClustBigFIM algorithm turned out to be more successful [31]. The infrequent item set mining problem is to discover item sets whose frequency is not greater than a maximum threshold, and various approaches to mining infrequent item sets have been examined and compared. Data mining is the process of extracting patterns from large amounts of data [32]; it is a mechanism for discovering facts and outlining valuable information from different perspectives, and finding common patterns hidden in a database plays a large role in many data mining tasks. Two types of models are planned in the data mining market. Additionally, customer purchase prediction in a grocery store with machine learning methods has been suggested. Two representative machine learning methods, the naive Bayes classifier and the support vector machine (SVM), are used by that system on real-world data. A strategy for extracting customer buying behaviour was also carried out: important methodological problems related to the use of RFID data to predict purchasing behaviour with support vector machines were investigated using RFID data collected from individuals in a Japanese supermarket [33–35]. Liu and Shi [36] discussed consumer purchase intention prediction based on machine learning. A naive Bayesian algorithm, used in the current stage of the method, has the benefits of rapid execution and high classification performance; however, it relies too much on the distribution of the sample in the sample space and can be unstable. The decision tree approach is used to solve the problem of interest classification, and local HTML5 storage technologies are used innovatively to obtain the experimental data necessary for this purpose. The classification method utilizes the information entropy of the training dataset through a simple search to build the classification model and then classifies unknown data items. Liao et al. [29] also proposed the parallel MRPrePost algorithm tailored for large-scale data mining, discussed above, which runs on Hadoop and uses MapReduce: the database is split into shards allocated to each worker node, the FP-tree is generated in the second stage, and the FP-tree is mined in the last step to obtain the frequent item sets. Trial results have shown that MRPrePost is the fastest [37].


Gole and Tidke [21] applied frequent item set mining with the ClustBigFIM algorithm to big data from social media. In their work, substantial datasets are mined using the MapReduce system; the BigFIM algorithm is modified into ClustBigFIM, which offers the adaptability and speed needed to handle data from large datasets, and the resulting information can be used to choose optimal business development options. The approach follows the same four steps described above: K-means clustering, mining the frequent item sets from the groups, constructing the prefix tree to obtain the global TID list, and mining the sub-trees of the prefix tree to obtain the frequent item sets. ClustBigFIM again proved more reliable than the BigFIM algorithm. Siddique Ibrahim and Priyanka [38] proposed a survey on infrequent weighted item set mining approaches. The issues of selecting rare and weighted item sets are discussed in that paper. The infrequent item set mining problem is to discover item sets whose frequency is not greater than a maximum threshold. The paper reviews the distinctive mining methods for rare item sets and finally compares the strategies. Data mining is defined as extracting interesting patterns or knowledge from colossal amounts of data; it is the method of viewing data from various perspectives and summarizing the information into a usable form. Finding common patterns hidden in a database plays a pivotal role in several data mining tasks, and two types of models are anticipated in data mining tasks [39]. Natarajan and Sehar [40] distributed the FP-ARMH algorithm in the Hadoop MapReduce system. The proposed algorithm effectively uses the cluster to mine general patterns from large databases; the workload between the nodes is handled using the distributed Hadoop structure, and the large database is stored in the Hadoop distributed file system [41]. An incremental FP-growth mining strategy for dynamic threshold values and databases based on MapReduce [42] has been proposed by Wei et al. Using a MapReduce method, large-scale information is processed, and the proposed incremental approach remains effective when the threshold or the database changes [43].

3 Results and Findings

3.1 Existing System Structure

In this section, we review the essential concepts relating to association rule mining (ARM) and the MapReduce technique, and present a review of the literature related to FPM in big data.

3.1.1 Association Rule Mining

Association rule mining discovers rules that describe relationships between otherwise unconnected frequent items in databases, and it has two key measurements: support and confidence [25]. An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent is an item found in the data; a consequent is an item found in combination with the antecedent. These rules are generated by analysing the data for frequent patterns and then using the support and confidence criteria to identify the most important relationships. Support indicates how frequently an item set appears in the database, while confidence indicates how often the rule has been found to be true. Item sets whose support is greater than or equal to a minimum support threshold are called frequent item sets, and rules whose confidence is greater than or equal to a minimum confidence threshold are reported. For mining frequent item sets, the threshold values are generally assumed to be given. ARM is concerned with discovering all rules that exceed the minimum support and minimum confidence thresholds. There are two key steps in association rule mining: the first step is to find all item sets with sufficient support, and the second step is to generate association rules by combining the frequent item sets. In the conventional system, the threshold estimates are assumed to be given; without any prior knowledge, it is extremely difficult to set the threshold value and achieve appropriate outcomes. Setting the threshold very high produces very few rules, whereas setting the threshold too low produces a very large number of rules and the result takes a long time to compute [44]. The outcome of such rules can be obtained as key-value pairs, and this output is then mapped; the Map and Reduce technique is used for this mapping.
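As a concrete illustration of the two measures described above, the following minimal Python sketch computes support and confidence for one candidate rule over a small transactional dataset; the toy transactions and helper names are illustrative only and are not taken from the paper.

```python
# Toy transactional dataset (illustrative only, not from the paper)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent union consequent) divided by support of the antecedent."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# Rule {bread} -> {milk}
print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))   # 0.75
```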

3.1.2 Parallel Algorithms of Frequent Pattern Mining

The main problem with tree-based algorithms is that they use a lot of memory. Even for comparatively small datasets, the main tree can expand to billions of nodes if the minimum threshold is set to a small value, leading to high memory consumption [45]. Therefore, major research efforts have been devoted to parallel implementations of frequent pattern mining algorithms, making it possible to create and mine smaller trees on multiple machines in parallel. Recall that each conditional tree has no computational dependencies on the other conditional trees; a number of parallel implementations of FP-growth have exploited this property [46, 47].
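The independence property mentioned above can be sketched with a small, self-contained example; this is not the implementation used in any of the cited systems, but it shows the idea that each frequent item's conditional database can be mined by a separate worker process, since no worker needs results from another.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

MIN_SUP = 2  # absolute minimum support (illustrative)

TRANSACTIONS = [
    ["a", "b", "c"], ["a", "b"], ["a", "c", "d"],
    ["b", "c"], ["a", "b", "c", "d"],
]

def mine_conditional(item):
    """Mine frequent item sets whose last item (in lexicographic order) is `item`.

    The conditional database of `item` contains the transactions that include it,
    restricted to items that precede `item` in the ordering. Because this database
    is independent of every other item's conditional database, the calls can run
    in parallel without any coordination."""
    cond_db = [sorted(i for i in set(t) if i < item)
               for t in TRANSACTIONS if item in t]
    counts = Counter()
    for prefix in cond_db:
        for r in range(1, len(prefix) + 1):
            for combo in combinations(prefix, r):
                counts[combo] += 1
    return [(set(combo) | {item}, sup)
            for combo, sup in counts.items() if sup >= MIN_SUP]

if __name__ == "__main__":
    items = sorted({i for t in TRANSACTIONS for i in t})
    with Pool() as pool:                       # one worker per conditional database
        results = pool.map(mine_conditional, items)
    for itemsets in results:
        for itemset, sup in itemsets:
            print(sorted(itemset), sup)
```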


3.2 Proposed System Framework

3.2.1 Problem Description

Mining has been a focus of computer science research, and there are different theories behind this rapidly growing field. The main problem with tree-based algorithms is that they use a lot of memory: even for a relatively small dataset, setting the minimum threshold to a low value would grow the main tree to billions of nodes, resulting in high memory usage. Therefore, major research efforts have been made to build precise frequent pattern mining algorithms in parallel implementations, making it possible to construct and mine small trees in parallel on multiple machines. From the point of view of computer science, the two dominant e-commerce issues are as follows:

(1) how to efficiently extract information, which is the subject of the data mining algorithm;
(2) how the data mining outcomes can be used and benefited from, which is the subject of e-commerce.

The aim of this research is to present a software design and implementation that addresses these issues for e-commerce business improvement. The proposed work first examines data mining methods such as Apriori, FP-tree and FIUT and identifies the problems of current systems. The framework also addresses database security concerns such as structured query language injection, together with top-K parallel data mining retrieval techniques on an in-house HDFS platform with multiple data nodes, under the name EcomTDMA.

3.2.2 Proposed System Architecture

Generating frequent 1-item sets for the specified database is the preliminary stage of frequent item set generation. Middleware is used to find the frequent 1-item sets, and the support count constraint is then used to build larger item sets, as discussed in detail here. For the computation of the frequent-1 item set by each mapper, shown in Fig. 1, a cache is included in the map step to preserve the support count tree, in order to reduce the cost of the MapReduce job. Compared with the original MapReduce tasks, this reduces the total time it takes to count the frequent-1 item sets, since it combines the shuffle and sort work of each mapper. A combiner is added so that the support counts can be accumulated in the cache, further increasing the efficiency of FIM generation. As the cache data can be easily retrieved, an updated MapReduce algorithm is presented for this (a sketch of the mapper-side counting is given after the module list below). The modules of the system are the following.

Fig. 1 Proposed system architecture

I. Authentication of Device: This is the starting page of the application, where the user must sign in to use the system. It accepts the registered user's username, identity and password. If a user does not have the correct identification and password, the system does not allow any further use. This is done for the sake of security.
II. Uploading File with Hadoop Method: Clicking on the "File Upload" option allows the system to use the dataset. The MapReduce functionality is initiated by clicking on the "Hadoop Method" button; with this step, MapReduce starts to function.
III. Data on Inventory: By selecting the system's display-data option, the item codes of the items bought by the user are displayed.
IV. Hash-Based Apriori: The "Method" option allows the algorithm used for computing frequent item sets to be chosen. It displays two choices, including the name of the proposed algorithm. The frequent item sets, obtained with reduced time, can be seen by clicking on the corresponding option.
V. Set of Frequent-1-Items: This is the first MapReduce step. The user obtains the values by giving the minimum support value and the dates. The frequent-1 item sets are obtained by using the modified Apriori algorithm.
VI. Set of Frequent-k-Items: These are the final frequent item sets, in the form of frequent-k item sets. The value of k is given by the user.
VII. Largest Item Set: By clicking on this option, the worker obtains the largest set of products purchased by the user on a specific date.
VIII. Graph of Comparison of the Methodologies: This graph illustrates the difference in running time between EcomTDMA, the base Apriori and the FP-tree algorithm.
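The in-mapper caching idea described above can be sketched as plain Python map/combine/reduce functions; the function names, the comma-separated transaction format and the toy splits are assumptions for illustration, since the chapter does not give code.

```python
from collections import defaultdict

def map_transactions(lines):
    """Map phase with an in-mapper cache: instead of emitting (item, 1) for every
    occurrence, each mapper accumulates local counts and emits them once."""
    cache = defaultdict(int)          # local support-count cache
    for line in lines:
        for item in line.strip().split(","):
            cache[item] += 1
    return cache.items()              # (item, partial_count) pairs

def reduce_counts(pairs, min_support):
    """Reduce phase: sum partial counts and keep only the frequent-1 item sets."""
    totals = defaultdict(int)
    for item, count in pairs:
        totals[item] += count
    return {item: c for item, c in totals.items() if c >= min_support}

# Toy run simulating two mappers over two input splits
split1 = ["milk,bread", "bread,butter", "milk,bread,butter"]
split2 = ["milk,butter", "bread"]
pairs = list(map_transactions(split1)) + list(map_transactions(split2))
print(reduce_counts(pairs, min_support=3))   # {'milk': 3, 'bread': 4, 'butter': 3}
```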

3.3 EcomTDMA for Hash-Based Frequent Item Set Mining

Input: Transactional dataset DBset, minimum support generator Den, min_Sup_req_items m_k;
Output: produce Tset item set.


• Step 1: For each (Ti_List in DBset) do
• Step 2: item[] ← split(Ti_List)
• Step 3: Generate the minimum support dynamically: Support = (Ti_List.amount/100) * Den
• Step 4: Create the hash table hash_table = {T_i_List + 1, …, T_i_List + n}
• Step 5: Add all item occurrences corresponding to Ti
• Step 6: Generate the two-pair groups for all item sets
• Step 7: Generate the third-pair groups of item sets
• Step 8: Generate multiple (n) pair groups for the final iteration
• Step 9: Prune the hash table hash_table until the Top_K item set is generated
• Step 10: Save all top-k items generated from hash_table

With the application of the above steps, the following structure of datasets is supported for data mining (Table 1). We used a generative model with short frequency patterns that mimics market basket data; the other datasets are real data that are dense in long frequent patterns. These datasets have also been used in previous analyses of association rule mining. The experimental results of this system are obtained at lower minimum support levels (or, proportionally, larger data sizes) than previously considered, and these enhancements come at no extra execution cost, as our implementation takes less time compared to other methods. We therefore use the EcomTDMA algorithm; compared with current systems, it can be the best algorithm to give precise results, and for massive databases the proposed framework algorithm demonstrates faster execution. We could also build our own large dataset on which experiments can be carried out, and the cost of doing so is negligible. The data in the collection of Web documents comes from a real domain and is therefore relevant. In our experiments, we used five sets of data. Three of these are the synthetic datasets T10I4D100K, T25ITEM10D10K and T40ITEM10D100K. The other two datasets are real data (groceries) that are dense in long frequency trends. The grocery dataset was downloaded for the mining association rule analysis from http://www.jbtraders.in, some generic datasets of the electronic item base were taken from the Web, and the sports dataset was taken from www.sports365.in. We have conducted numerous observations, which are depicted in the graphs below, showing the efficiency of EcomTDMA with various support denominators and different datasets in terms of the time needed in seconds. Table 2 shows the time taken in seconds by the suggested EcomTDMA algorithm for the grocery, electronic and sports datasets for various support values. For the three separate item sets (grocery, electronic and sports), Fig. 2 shows the findings of Table 2 for support values of 5, 8, 10 and 15%, respectively; it displays the time in seconds necessary to obtain the frequent item sets with the updated dataset.
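The hash-table pairing and pruning steps listed above can be sketched in Python as follows. This is a minimal single-machine sketch of the listed steps, assuming list-of-items transactions and a percentage-based support denominator Den; it is not the authors' HDFS/MapReduce implementation, and the function name is illustrative.

```python
from collections import defaultdict
from itertools import combinations

def ecomtdma_sketch(transactions, den, top_k):
    """Single-machine sketch of the hash-based steps: dynamic minimum support,
    a hash table of k-item groups of growing size, pruning, and top-k selection."""
    # Step 3: dynamic minimum support derived from the dataset size and Den
    min_support = (len(transactions) / 100.0) * den

    # Steps 4-8: hash table keyed by item groups, grown pair by pair
    hash_table = defaultdict(int)
    k = 1
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while current:
        counts = defaultdict(int)
        for t in transactions:
            tset = set(t)
            for group in current:
                if group <= tset:
                    counts[group] += 1
        # Step 9: prune groups below the dynamic minimum support
        survivors = {g: c for g, c in counts.items() if c >= min_support}
        frequent.update(survivors)
        hash_table.update(survivors)
        # build (k+1)-item candidate groups from the surviving items
        items = sorted({i for g in survivors for i in g})
        k += 1
        current = {frozenset(c) for c in combinations(items, k)}

    # Step 10: keep only the top-k most supported item sets
    return sorted(frequent.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

transactions = [["milk", "bread"], ["bread", "butter"],
                ["milk", "bread", "butter"], ["milk", "butter"]]
for itemset, support in ecomtdma_sketch(transactions, den=50, top_k=5):
    print(sorted(itemset), support)
```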


Table 1 Structure of datasets for data mining application

Step 5: All item existences corresponding to Ti
Support D | Ti_List + 1 | Ti_List + 2 | Ti_List + 3 | … | Ti_List + 10
Total of items | {ITEM_1}, {ITEM_n} | {ITEM_1}, {ITEM_n} | {ITEM_1}, {ITEM_n} | … | {ITEM_1}, {ITEM_n}

Step 6: Two-pair groups for all item sets
Support D | Ti_List + 1 | Ti_List + 2 | … | Ti_List + 10
Total of items | {ITEM_1, ITEM_2}, {ITEM_1, ITEM_2} | {ITEM_1, ITEM_2}, {ITEM_1, ITEM_2} | … | {ITEM_1, ITEM_2}, {ITEM_1, ITEM_2}

Step 7: Third-pair group of item sets
Support D | Ti_List + 1 | Ti_List + 2 | … | Ti_List + 10
Total of items | {ITEM_1, ITEM_2, ITEM_3}, {ITEM_1, ITEM_2, ITEM_3} | {ITEM_1, ITEM_2, ITEM_3}, {ITEM_1, ITEM_2, ITEM_3} | … | {ITEM_1, ITEM_2, ITEM_3}, {ITEM_1, ITEM_2, ITEM_3}

Step 8: N-pair groups for the final iteration
Support D | Ti_List + 1 | Ti_List + 2 | … | Ti_List + 10
Total of items | {ITEM_1, ITEM_2, ITEM_3, ITEM_n}, {ITEM_1, ITEM_2, ITEM_3, ITEM_n} | {ITEM_1, ITEM_2, ITEM_3, ITEM_n}, {ITEM_1, ITEM_2, ITEM_3, ITEM_n} | … | {ITEM_1, ITEM_2, ITEM_3, ITEM_n}, {ITEM_1, ITEM_2, ITEM_3, ITEM_n}

Table 2 Time in seconds required to generate frequent item sets with different support for all three datasets (records = 2500)

Support value (%) | Grocery | Electronic | Sport
5 | 452 | 506 | 604
8 | 302 | 366 | 402
10 | 201 | 299 | 305
15 | 106 | 186 | 208

In implementing the EcomTDMA algorithm for general item set mining with a retail item set for various support values, the approach is very beneficial. The experimental results show that the time required to extract large item sets using EcomTDMA decreases as the support increases.

[Figure 2 is a bar chart of time in seconds against support values of 5, 8, 10 and 15 for the grocery, electronic and sport datasets.]

Fig. 2 Number of item sets extracted with various support values from the transactional dataset

4 Conclusion

The main objective of this research work is the challenge of mining candidate item sets from larger datasets efficiently and effectively in the Hadoop framework. The constrained frequent item set EcomTDMA algorithm has been suggested because a huge number of patterns or rules are frequently produced in frequent item set mining, and it is implemented on big data using Hadoop's MapReduce jobs. Most FPM algorithms spend half of their time computing frequent 1-item sets. With the help of the advanced hash-based algorithm, a simple and accessible way to perform the support calculation has been introduced, which decreases the time needed to compute frequent 1-item sets. This technique can easily be incorporated into any of the optimization schemes aimed at mining to obtain the frequent 1-item sets and their corresponding counts. The updated MapReduce EcomTDMA is meant to reduce the error rate of retrieving frequent patterns from big data using MapReduce. A cache is also included in the map process to preserve the support count tree between each mapper's frequent-1 item set calculations. This decreases the total time to compute the frequent item sets, as the shuffle, sort and combine tasks of every mapper in the original MapReduce jobs are bypassed. An additional function is also introduced to find combined frequent item sets from multiple files.

5 Future Scope

Future researchers can concentrate on parallel networking with shared Hadoop environments using slot set-up for system development. The allocation and ordering of runtime slots will optimize the use of resources and increase the accuracy of the results.


References

1. Stoenescu LV (2013) Social analytics role in high-tech business. Dissertation
2. Martelli A (2009) System dynamics modeling and data mining analyses: a possible integration
3. Nowduri S (2011) Management information systems and business decision making: review, analysis, and recommendations. J Manage Marketing Res 7:1
4. Dinsmore PC, Cooke-Davies TJ (2005) Right projects done right: from business strategy to successful project implementation. Wiley, New York
5. Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
6. Zhu J et al (2018) Review and big data perspectives on robust data mining approaches for industrial process modeling with outliers and missing data. Ann Rev Control 46:107–133
7. Gan W et al (2017) Data mining in distributed environment: a survey. Wiley Interdiscip Rev Data Mining Knowl Discov 7(6):e1216
8. Han JW, Pei J, Yin YW (2000) Mining frequent patterns without candidate generation. In: International conference on management of data, vol 29(2), pp 1–12
9. Nisbet R, Elder J, Miner G (2009) Handbook of statistical analysis and data mining applications. Academic, London
10. Li H, Wang Y, Zhang D, Zhang M, Chang E (2008) PFP: parallel FP-growth for query recommendation. In: Proceedings of the 2008 ACM conference on recommender systems, pp 107–114
11. Waghere SS, Rajarajeswari P (2017) Parallel frequent dataset mining and feature subset selection for high dimensional data on hadoop using map-reduce. Int J Appl Eng Res 12(18):7783–7789
12. Menon S, Sarkar S (2016) Privacy and big data: scalable approaches to sanitize large transactional databases for sharing. MIS Quart 40(4)
13. Wang YR, Madnick SE (1989) The inter-database instance identification problem in integrating autonomous systems. In: ICDE
14. Bajcsy P et al (2013) Terabyte-sized image computations on hadoop cluster platforms. In: 2013 IEEE international conference on big data. IEEE
15. Shah AA (2019) Performance optimization of big data processing using heterogeneous distributed computing. Dissertation, Maharaja Sayajirao University of Baroda (India)
16. Ahmad R (2013) Engineering machine translation for deployment on cloud. Dissertation, International Institute of Information Technology Hyderabad, India
17. Zhang Z, Ji G, Tang M (2013) MREclat: an algorithm for parallel mining frequent itemsets. In: 2013 international conference on advanced cloud and big data
18. Huang K-Y, Chang C-H (2008) Efficient mining of frequent episodes from complex sequences. Inf Syst 33(1):96–114
19. What is data mining? In: Data mining: concepts and techniques. Morgan Kaufmann, pp 559–569
20. Buehrer GT (2008) Scalable mining on emerging architectures. Dissertation, The Ohio State University
21. Gole S, Tidke B (2012) Frequent itemset mining for big data in social media using ClustBigFIM algorithm. In: International conference on pervasive computing
22. Allemang D, Hendler J (2011) Semantic web for the working ontologist: effective modeling in RDFS and OWL. Elsevier, Amsterdam
23. Helbing D (2015) The automation of society is next: how to survive the digital revolution. Available at SSRN 2694312
24. Gahar RM et al (2017) ParallelCharMax: an effective maximal frequent itemset mining algorithm based on mapreduce framework. In: 2017 IEEE/ACS 14th international conference on computer systems and applications (AICCSA). IEEE
25. Chen H, Lin TY, Zhang Z, Zhong J (2013) Parallel mining frequent patterns over big transactional data in extended MapReduce. In: 2013 IEEE international conference on granular computing
26. Mazumder S (2016) Big data tools and platforms. In: Big data concepts, theories, and applications. Springer, Cham, pp 29–128
27. Zhou X, Huang Y (2014) An improved parallel association rules algorithm based on MapReduce framework for big data. In: 2014 11th international conference on fuzzy systems and knowledge discovery
28. Yang XY, Liu Z, Fu Y (2010) MapReduce as a programming model for association rules algorithm on Hadoop. In: The 3rd international conference on information sciences and interaction sciences. IEEE
29. Liao J, Zhao Y, Long S (2014) MRPrePost—a parallel algorithm adapted for mining big data. In: 2014 IEEE workshop on electronics, computer and applications. IEEE
30. Gole S, Tidke B Frequent item set mining for big data in social media using ClustBigFIM algorithm. In: International conference on pervasive computing
31. Ambavane PK, Zaware N (2021) Data mining using hadoop distributed file system (HDFS) for e-commerce marketing strategy
32. Siddique Ibrahim SP, Priyanka R (2015) A survey on infrequent weighted item set mining approaches. IJARCET 4:199–203
33. Zuo Y, Yada K, Shawkat Ali ABM (2016) Prediction of consumer purchasing in a grocery store using machine learning techniques. In: 2016 3rd Asia-Pacific world congress on computer science and engineering (APWC on CSE). IEEE
34. Lee I, Shin YJ (2020) Machine learning for enterprises: applications, algorithm selection, and challenges. Business Horizons 63(2):157–170
35. Ma L, Sun B (2020) Machine learning and AI in marketing–connecting computing power to human insights. Int J Res Mark 37(3):481–504
36. Bing L, Yuliang S (2016) Prediction of user's purchase intention based on machine learning. In: 2016 3rd international conference on soft computing & machine intelligence (ISCMI), 23 Nov 2016. IEEE, pp 99–103
37. Baderiya MH, Chawan PM (2018) Customer online buying prediction using frequent item set mining
38. Ibrahim S et al (2015) A survey on infrequent weighted itemset mining approaches. IJARCET 4:199–203
39. Fayyad U, Haussler D, Stolorz P (1996) Mining scientific data. Commun ACM 39(11):51–57
40. Natarajan S, Sehar S (2013) Distributed FP-ARMH algorithm in Hadoop MapReduce framework. IEEE
41. Lu Z et al (2018) IoTDeM: an IoT Big Data-oriented MapReduce performance prediction extended model in multiple edge clouds. J Parallel Distrib Comput 118:316–327
42. Wei X et al (2014) Incremental FP-growth mining strategy for dynamic threshold value and database based on MapReduce. In: Proceedings of the 18th IEEE international conference on computer supported cooperative work in design
43. Sakr S et al (2011) A survey of large scale data management approaches in cloud environments. IEEE Commun Surv Tutor 13(3):311–336
44. Li X, Wang Y, Li D (2019) Medical data stream distribution pattern association rule mining algorithm based on density estimation. IEEE Access 7:141319–141329
45. Braun P et al (2019) Pattern mining from big IoT data with fog computing: models, issues, and research perspectives. In: 2019 19th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGRID). IEEE
46. Peter B et al (2018) An innovative framework for supporting frequent pattern mining problems in IoT environments. In: International conference on computational science and its applications. Springer, Cham
47. Leung CK-S, MacKinnon RK, Jiang F (2017) Finding efficiencies in frequent pattern mining from big uncertain data. World Wide Web 20(3):571–594

Track IV

A Survey on Energy-Efficient Task Offloading and Virtual Machine Migration for Mobile Edge Computation Vaishali Joshi and Kishor Patil

Abstract Technological developments in mobile devices show that the global trend is changing from desktop computing to mobile computing. Although there has been a tremendous increase in the potential of mobile devices, they still face several challenges, such as high power consumption, the bandwidth of the wireless medium and computational complexity, and these challenges have been addressed with the arrival of cloud technology. Cloud computing is incorporated with the mobile domain to create mobile cloud computing (MCC). MCC is a dominant paradigm in which mobile devices are connected to the Internet via a wireless network and communicate with the distant cloud. The performance of mobile applications can be enhanced by offloading their subcomponents to the cloud, which is resource rich in terms of storage and computation speed. However, offloading a task to the cloud is not always beneficial to the user equipment (UE), because the data rate may drop due to slow movement and channel fading, and the offloading process may increase the energy required for transmission. There is also a trade-off between energy consumption and communication latency. The proposed method designs an energy-efficient task offloading with delay awareness (EETOWDA) scheme, where tasks which cannot be executed by the mobile devices are partitioned and offloaded to the MEC under delay and energy constraints. Further, a VM migration scheme is also addressed, which prevents service degradation by copying a task when a user moves from one node to another.

1 Introduction

Mobile applications are increasing in various categories like health, games, travel and entertainment. With the advancement in technology, mobile applications have evolved from static applications to real-time ones.

V. Joshi (B) · K. Patil
Sinhgad Academy of Engineering Kondhwa, Savitribai Phule Pune University, Ganesh Khind, Pune, Maharashtra 411007, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_22


Several real-time applications, such as location-based applications and sensor-based applications, require extensive processing, resources, computation and energy [1]. These factors create hurdles both in creating applications and in availing the services of the applications. The technical and computational demands cannot be met with the mobile phone or user equipment (UE) architecture alone. These problems are addressed by cloud computing (CC) [2]. The inherent property of cloud computing, where computation-intensive applications or tasks are offloaded from resource-constrained platforms to the cloud, can be utilized for handling the mobile applications which cannot be processed by the mobile architecture [3]. Mobile cloud computing (MCC) is the paradigm in which cloud technology is utilized to process resource-hungry applications. For example, when a mobile wants to run a service like Gmail, the device acts as a client connecting to the server through a wireless technology like 3G, and the requested services are processed in the cloud. The services or tasks requested by the mobile users are placed on the cloud through a process named offloading [4]. The presence of a huge number of devices in the cloud may cause problems while providing service to multiple users. When computation-intensive tasks from a UE are offloaded to the cloud, several challenges arise, such as long latency and energy drain in the mobile devices [5]. The quality of MEC offloading is also affected by the quality of the wireless links, which further adds delay to the offloading [6]. The dynamic quality of the wireless channel has to be considered while offloading a task [7]. The limited computation and energy of mobile devices raise the important problem of the energy-delay trade-off, which has received attention in the research community since these are the prime decision-making factors for an efficient MEC system. Several researchers have proposed task offloading solutions for single and multiple devices through various techniques, with the focus on reducing energy consumption, improving latency, cost and so on [8]. Another operational issue in MEC is VM migration when the user is moving from one network to another. The movement of the user may affect the QoS of the MEC. Migration of a VM under mobility involves stateful or stateless migration: stateless migration involves allocating a separate instance when a mobile user moves from one location to another, while stateful migration involves migrating a running application to another VM. Interactive services prefer the second type, since they need continuous usage of resources. This paper is arranged as follows: the following section discusses the related work, Sect. 3 focuses on the proposed work, and Sect. 4 gives the conclusion.

2 Literature Survey

The decision on computation offloading results in one of three cases: no offloading, partial offloading or full offloading. This decision depends upon several factors, such as the architecture used, the bandwidth of the communication medium, the type of application to be executed, the amount of data to be processed, and whether the offloading is static or dynamic. The main objective of computational offloading is to minimize the battery consumption of the UE for executing computationally intensive tasks with the cloud.


However, this offloading process introduces extra communication and computation overheads. These overheads depend upon several factors, such as the resources of the UE (which depend on the time and location of the UE because of mobility), network conditions, available bandwidth, data rate, and available cloud resources. While reducing the battery consumption of the UE, offloading should not introduce additional overhead in terms of communication latency, otherwise it will reduce the QoE for the UE; communication latency increases if the offloading process takes more time. This section gives a detailed survey of computation offloading methods by focusing on three key parameters: the first is energy efficiency alone (communication latency is not addressed), the second is energy efficiency along with latency, and the third is the trade-off between these two.
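As an illustration of how these parameters interact, the following minimal Python sketch compares local execution with full offloading in terms of energy and latency before making a decision. All parameter values and the linear energy/latency models are assumptions for illustration, not taken from any of the surveyed papers.

```python
from dataclasses import dataclass

@dataclass
class Task:
    cycles: float        # CPU cycles required
    data_bits: float     # input data to transmit when offloading

@dataclass
class Device:
    cpu_hz: float            # local CPU frequency
    energy_per_cycle: float  # joules per cycle of local computation
    tx_power_w: float        # transmission power
    uplink_bps: float        # current uplink data rate

def local_cost(task, dev):
    latency = task.cycles / dev.cpu_hz
    energy = task.cycles * dev.energy_per_cycle
    return energy, latency

def offload_cost(task, dev, edge_cpu_hz):
    tx_time = task.data_bits / dev.uplink_bps
    latency = tx_time + task.cycles / edge_cpu_hz
    energy = dev.tx_power_w * tx_time          # UE only pays for transmission
    return energy, latency

def decide(task, dev, edge_cpu_hz, deadline_s, weight=0.5):
    """Weighted energy-latency comparison under a hard deadline."""
    e_l, t_l = local_cost(task, dev)
    e_o, t_o = offload_cost(task, dev, edge_cpu_hz)
    candidates = [("local", e_l, t_l), ("offload", e_o, t_o)]
    feasible = [c for c in candidates if c[2] <= deadline_s]
    if not feasible:
        return "reject"
    # the smaller weighted sum of (energy, latency) wins
    return min(feasible, key=lambda c: weight * c[1] + (1 - weight) * c[2])[0]

task = Task(cycles=2e9, data_bits=4e6)
dev = Device(cpu_hz=1e9, energy_per_cycle=1e-9, tx_power_w=0.5, uplink_bps=5e6)
print(decide(task, dev, edge_cpu_hz=10e9, deadline_s=1.5))   # "offload" here
```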

2.1 Energy Efficiency

This section surveys current research on the computation offloading decision while focusing on only one parameter, energy efficiency. The main objective of these papers is to reduce the energy consumption of the UE while executing computation-intensive tasks. A computation offloading scheme of the multiple servers, multiple users and multiple tasks type is proposed by Huang et al. [9]. A linear programming relaxation-based (LR-based) algorithm and a heterogeneous distributed deep learning-based offloading (DDLO) algorithm are introduced and compared. Results show that DDLO is more efficient than the LR-based approach: in DDLO, if the number of UEs increases from 1 to 7, the computation time increases only from 0.63 to 0.74 ms, which is negligible, whereas in the LR-based approach it increases from 0.33 to 5.8 s. A heterogeneous deep learning approach is introduced for the first time in the literature, but the mobility of the UE is not considered. Chen et al. [10] proposed efficient multiple user computation offloading (MUCO) for MEC. This work highlighted the offloading computation problem among the users as a MUCO game and the space constraint of the cloudlet system, and a novel MEC computing approach was introduced to address these challenges. However, mobility patterns play a significant role in the problem formulation, and the system is technically challenging. You et al. [11] introduce the wireless energy transfer concept for energy efficiency in mobile cloud computing. The proposed system is differentiated by three factors: full-duplex transmission, a single computing task, and the fact that extending the work from a single UE to multiple UEs requires the joint design of radio and computational resource allocation for mobile cloud computing. Consequently, the proposed system consumes much less energy and is lightweight; however, it is not suitable for computation-intensive tasks. Shiraz et al. [12] introduced an energy-efficient computational offloading scheme for the MCC system. A distributed architecture is suggested by the authors to address the issue of additional energy consumption in computational offloading for MCC.


The proposed scheme has a dual operating nature, which gives the flexibility and robustness of a distributed and elastic model for intensive mobile use in MCC. As a result, the proposed system reduces system cost and computational overhead, but it faces some limitations regarding the consistency of simultaneous application execution and seamless application execution. Chen et al. [13] introduce a decentralized computation offloading game (DCOG) for mobile CC. The decision-making problem among the mobile device users is formulated as a DCOG using game theory. By evaluating the structural properties of the game, they show that a Nash equilibrium is always achieved. This system achieves efficient offloading performance, even though augmenting the computation capability of the mobile device and handling user mobility remain challenging. Table 1 gives the summary of papers on computation offloading with energy efficiency.

2.2 Energy Efficiency and Latency

This section surveys papers on the computation offloading decision that focus on two parameters, energy efficiency and communication latency. A selective computation offloading strategy with multiple UEs and a single edge server has been suggested by Zhao et al. [14]. Even though the computation capability of the MEC is tremendous compared to the UE, it decreases with the number of offloaded tasks; in other words, if fewer tasks are migrated to the MEC, it gives the best performance in terms of computation and storage. The authors propose estimating the computation capability of the MEC for the offloading decision with the ABSO algorithm. Actually estimating the computation capability is very difficult, so predicting it with the ARIMA-BP method has been proposed. This method gives better energy efficiency, but the mobility of the UE is not considered in this paper. Smart decisions about computation offloading and adaptive monitoring of metrics have been proposed by Paulo et al. [15]. The authors discuss three challenges: when and where to offload, which metrics must be monitored, and the user's mobility. A decision tree is created based on historical data, and the mobile device parses the tree for the optimal decision. The drawback of this paper is that only single-user equipment is considered; a scenario with simultaneous offloading by multiple UEs should be considered. An energy-efficient dynamic offloading and resource scheduling (eDors) policy, under hard constraints on application completion time, has been proposed by Gao et al. [16]. A dynamic partitioning and scheduling scheme is used for task offloading. To study the interdependence of tasks, acyclic graph theory is used, and the authors also consider the task ready time along with the execution time. The eDors algorithm functions with three sub-algorithms, i.e. computation offloading selection, clock frequency control and transmission power allocation. The DVFS technique is employed for the transmission power update. The weak point of this paper is that the authors use the MCC technique; better energy efficiency with reduced delay can be achieved if the MEC architecture is used.

Table 1 Summary of individual papers of computation offloading decision for energy efficiency

Ref no. and year | Objective | Proposed solution | Architecture | No. of UEs | No. of servers | Environment | Partitioning | Mobility support | Remark
[9] 2019 | Energy efficiency | LR-based linear programming and DDLO distributed deep learning | MEC and MCC | MUE | MMEC | Static | NA | NA | Offloading decision is insensitive to no. of WDs
[10] 2016 | Energy efficiency | Distributed computation offloading game theory | MCC | MUE | SMEC | NA | Full | NA | Mobility of UE and joint power control is not addressed
[11] 2015 | Energy efficiency | MPT | No MEC | SUE | MCC | NA | Full | NA | Multi-user and multi-task scenario is not considered
[12] 2015 | Energy efficiency | EECOF | No MEC | SUE | MCC | Dynamic | Full | Yes | Data tx size and energy consumption cost is reduced by 84 and 69.9%, respectively
[13] 2013 | Energy efficiency | Distributed computation offloading game theory | Only MEC | MUE | SMEC | NA | Full | NA | Mobility of UE and joint power control is not addressed


A multi-user mobile edge cloud architecture is proposed by You et al. [17] for minimizing the weighted sum of mobile energy consumption while satisfying the delay constraint. A TDMA and OFDMA resource allocation policy is proposed in this paper. Two cases are considered: finite cloud computation capability and infinite cloud computation capability. For both, the authors suggest that a threshold-based policy should be adopted for resource allocation, so that the offloading problem becomes a binary problem. A one-dimensional search algorithm is discussed by Liu et al. [18] to minimize the processing delay. The optimal decision is taken by analysing the buffer queuing state and the available processing power of the user equipment and the MEC. The weak point of this approach is that feedback to the UE is required for the offloading decision, which increases the signalling overhead. The use of renewable energy sources along with an energy harvesting device is proposed by Mao et al. [19]. A Lyapunov optimization-based dynamic computation offloading (LODCO) algorithm is suggested for dynamic offloading with low task failure. The drawback of this paper is the use of an energy harvesting (EH) device, which is an additional device the user has to carry along with the UE. Deng et al. [20] perform computation offloading for service workflows in MCC. They introduce a novel offloading system with mobility and a trade-off for the computation offloading. Moreover, this approach considers the relations among component services and optimizes the energy consumption and execution time of mobile services. A genetic algorithm is designed for the offloading and implemented after modifying parts of the standard genetic algorithm. The proposed system balances waiting for network reconnection against the optimization problem and improves the quality of the final solutions, but it is not suitable for computationally intensive tasks. Satyanarayanan et al. discussed improving the capability of resource-poor mobile devices and providing cognitive assistance to users by leveraging the processing and storage of a distant cloud [21]. A single-user virtual machine (VM)-based cloudlet system is proposed in this paper to minimize energy consumption, latency and jitter; the use of a cloudlet system is proposed by the authors to obtain LAN-level latencies. Table 2 summarizes the papers on the computation offloading decision with energy efficiency and latency.

2.3 Trade-Off Between Energy Efficiency and Latency

The joint optimization of the task offloading and computation scaling problem has been studied by Paulo et al. [22]. A system model with a single UE and multiple servers is used to address the trade-off between energy consumption and communication latency. A Markov approximation framework is proposed to solve the task assignment decision problem. Multiple UEs should be considered, and the mobility of the UE is not addressed.

Table 2 Summary of individual papers of computation offloading decision for energy efficiency and latency

Ref no. and year | Objective | Proposed solution | Architecture | No. of UEs | No. of servers | Environment | Partitioning | Mobility support | Remark
[14] 2019 | Energy efficiency, latency | ARIMA-BP and ABSO strategy | Only MEC | MUE | SMEC | NA | Full | NA | Selective offloading by estimating the MEC capacity
[15] 2019 | Energy efficiency, latency | Entropy and information gain | Multiple cloudlets | SUE | Cloudlet, public cloud | NA | Full | Mobility support | 50% reduction in energy consumption
[16] 2019 | Energy efficiency, latency | eDors | No MEC | SUE | MCC | Dynamic | Full | Fixed waypoints | Dynamic mobility patterns are not addressed
[17] 2016 | Energy efficiency, latency | TDMA and OFDMA resource allocation | SMEC | MUE | MEC | NA | Full | NA | Offloading decision is done at BS
[18] 2016 | Energy efficiency, latency | Markov decision approach | SMEC | SUE | MEC | NA | Full | NA | MUE and MMEC servers not addressed
(continued)


Table 2 (continued)

Ref no. and year | Objective | Proposed solution | Architecture | No. of UEs | No. of servers | Environment | Partitioning | Mobility support | Remark
[19] 2016 | Energy efficiency, latency | LODCO | SMEC | SUE | MEC | Dynamic | Full | NA | Energy harvesting device
[20] 2015 | Energy efficiency, latency | GA-based offloading method | No MEC | MUE | MCC | Static | Full | Yes | Mobility path is known in advance
[21] 2009 | Energy efficiency and latency | VM-based cloudlets | Cloudlet | SUE | Cloudlet | NA | Full | NA | Mobility is not addressed

340 V. Joshi and K. Patil

A Survey on Energy-Efficient Task Offloading and Virtual Machine …

341

Optimization of the task allocation decision with a semidefinite relaxation-based approach is addressed by Dinh et al. [23]. Here, tasks are offloaded to multiple access points (APs), but only a single user equipment is considered and the mobility of the UE is not addressed. A collaborative storage architecture is suggested in [24] to enhance the storage capabilities of the mobile edge cloud; storage capacity is scheduled among different nodes with ACMES for optimum time and energy. However, the results are not robust, as the approach involves unnecessary iterations. Table 3 gives the summary of papers on the trade-off between energy efficiency and latency.

2.4 Virtual Machine Migration

In MEC, optimal utilization of system resources while maintaining a required level of QoE is a great challenge, and virtual machines (VMs) are central to addressing it. Virtualization has recently become popular in the design of network systems [25]. VMs are used to partition and share physical resources such as computing power, storage capacity, and network bandwidth. Moreover, user mobility must be considered in an MEC system, since mobile users (MUs) are able to move freely. To ensure a certain performance criterion, a VM on a physical node must be migrated to another node without interrupting the application being executed on the VM. Therefore, to achieve maximum performance in an MEC system, an essential issue is how best to migrate VMs between nodes. Machen et al. [26] illustrate a VM migration technique for MEC. In this work, live service migration is elaborated along with the impact of mobility, and the number of available resources is considered when handling the live migration. A layered model is introduced to perform the live migration of a VM from one cloud to another; however, this technique requires more time to design the three-layered model. Power control measures for a VM migration policy are discussed by Rodrigues et al. [27]. An appropriate VM is chosen to reduce the processing delay of migration; user mobility is considered together with the transmission, service, backhaul, and processing delays, and a VM is selected from the adjacent MEC which ensures the least delay. The simulation results show reduced transmission, processing, and backhaul delays with the selected VM. A mobility-aware MEC scheme which focuses on the cost factor is proposed by Ouyang et al. [28]. A Lyapunov optimization algorithm is used to solve the mobility issue without requiring future information, and computing and communication delays are targeted to improve QoS. Since the optimization is an NP-hard problem, two heuristics are used: Markov approximation and a best-response update technique. Zhang et al. [29] discuss a VM migration approach for MEC with the migration algorithms M-All and M-Edge and the optimization algorithms M-weight and M-predict. Two types of user mobility patterns are considered: a certain moving trajectory and an uncertain moving trajectory.

Table 3 Summary of individual papers of computation offloading decision for trade-off execution time and energy efficiency

| Ref no. and year | Objective | Proposed solution | Architecture | No. of UEs | No. of servers | Environment | Partitioning | Mobility support | Remark |
|---|---|---|---|---|---|---|---|---|---|
| [22] 2019 | Trade-off, task execution time, and energy efficiency | Markov approximation joint optimization problem | MMEC | SUE | MEC | NA | Full | NA | Algorithm arrives at real optimal solution but with static case |
| [23] | Trade-off, task execution time, and energy efficiency | Exhaustive search-based and semidefinite relaxation-based approach | Multiple access points | SUE | Multiple access points | NA | Yes | NA | Not suitable for multiple UEs |
| [24] | Trade-off, task execution time, and energy efficiency | Algorithm of collaborative mobile edge storage | MEC | SUE | Multiple | NA | NA | NA | Not robust and includes unnecessary iterations |

The approach of Zhang et al. [29] is itself computation-intensive; however, the results demonstrate that the network overhead caused by the mobility of the VM during migration can be reduced.

3 Proposed Work

The conventional methods which handle task offloading have proposed several techniques to reduce energy consumption and delay. The priority-based task offloading techniques of conventional methods may increase the delay when the user requests computation-intensive tasks. Most of the conventional methods, such as game-theoretic distributed computation offloading, EECOF, and LR-based linear programming, fail to focus on the communication delay experienced while offloading the task to the MEC server. Further, the effect of mobility on the communication link and the reduction of the task burden on the MEC server are not discussed by the existing approaches. The proposed research handles the task offloading policy by considering both the processing and the communication delay. The proposed approach offloads complex tasks to the MEC server and processes local tasks on the mobile device itself. A matching-based task offloading policy allocates the user to the MEC server which can process the task within the shortest interval of time, and tasks are handed to the respective MEC server only over quality links. Together with the dynamic voltage and frequency scaling energy reduction policy, reducing the communication and processing delay helps to reduce the energy consumption of the mobile user. The VM migration policies of conventional methods give insight into live migration of services from one MEC to another. Migration while an application is running in a VM is not discussed in existing techniques such as the ARIMA-BP and ABSO strategy, TDMA/OFDMA resource allocation, and the Markov decision approach. The proposed migration policy issues a request for resources before a mobile user enters another network; based on the request, the MEC responds as to whether a VM is available to execute the tasks. This helps to decrease the delay as well as ensure higher QoS while allocating a VM for a particular service. The proposed approach consists of two phases: the first is task/service offloading to networks, and the second is the design of a VM migration scheme. Task offloading is the process by which requested tasks are sent to the MEC server because the mobile device is unable to process the request due to its resource-intensive nature. Task offloading encounters energy and delay problems for various reasons: delay is incurred by processing the task at the MEC and by transmitting the task to the MEC server while uploading it, and the power consumption of mobile devices is high when streaming services are used, increasing with the transmission and processing delay. The proposed task offloading process therefore designs a task-server matching policy to reduce the delay and improve the energy efficiency of the device. Figure 1 shows the proposed model, where users move from one MEC to another. The heterogeneous nature of the users may lead to demand for local as well as resource-intensive tasks.

Fig. 1 MEC model (cloud, MEC 1 to MEC n, and migration between MECs)

At a given instant of time, the proposed task offloading policy splits the requested task into a part for local computation and a part to be offloaded to the MEC. Whether a task can be computed locally is determined by the task size; streaming applications are offloaded to the MEC since they cannot be processed locally by the mobile user. The splitting of tasks removes unnecessary wastage of power and delay while transmitting and computing the requested tasks over the uplink and downlink. Power consumption is reduced with dynamic voltage and frequency scaling (DVFS), which adjusts the CPU-cycle frequency [30]. Each CPU core of the MEC server is allocated to one user in a single time slot. The communication links also play a major role in the delay of transferring tasks from users to the MEC. A user-server matching policy is therefore designed based on the quality of the communication link and the computation capability of the server, and the mobility aspect is also taken into account in the task offloading mechanism. Figure 2 shows the proposed model for task offloading and VM migration policy. In order to offload the task without delay, the proposed task offloading policy matches the respective user device with the MEC server by taking account of constraints such as the wireless link strength, the computation capability of the server, and the task size demanded by the user. The wireless link is selected based on the SINR of the signal, and computation delay is avoided by allocating the tasks to an MEC which possesses sufficient resources to execute the task.
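To make the matching concrete, the sketch below is an illustrative, simplified reading of this policy rather than the authors' implementation: each task is assigned to the MEC CPU core that gives the earliest estimated finish time (upload delay plus processing delay), and links below an assumed SINR threshold are excluded. All names, task sizes and parameter values are hypothetical, and DVFS is not modelled here.

```python
# Illustrative sketch (not the authors' implementation): greedy user-server matching
# that assigns each task to the MEC CPU core giving the earliest estimated finish time,
# considering only links whose SINR exceeds a quality threshold. Task sizes, SINR values
# and CPU frequencies are hypothetical placeholders.
import math
from dataclasses import dataclass

@dataclass
class Task:
    user: str
    data_bits: float      # D: input size to upload
    cpu_cycles: float     # C: computation demand

@dataclass
class Core:
    mec: str
    freq_hz: float        # f: CPU-cycle frequency of the core
    free_at: float = 0.0  # time at which the core becomes available

def uplink_rate(bandwidth_hz: float, sinr: float) -> float:
    """Shannon-style rate estimate R = B * log2(1 + SINR)."""
    return bandwidth_hz * math.log2(1.0 + sinr)

def match(tasks, cores, sinr, bandwidth_hz=10e6, sinr_min=2.0):
    """Assign each task to the feasible core with the earliest completion time."""
    assignment = {}
    for task in tasks:
        best = None
        for core in cores:
            link_sinr = sinr[(task.user, core.mec)]
            if link_sinr < sinr_min:          # offload only over quality links
                continue
            tx_delay = task.data_bits / uplink_rate(bandwidth_hz, link_sinr)
            proc_delay = task.cpu_cycles / core.freq_hz
            finish = core.free_at + tx_delay + proc_delay
            if best is None or finish < best[0]:
                best = (finish, core)
        if best:
            finish, core = best
            core.free_at = finish             # one user per core per time slot
            assignment[task.user] = (core.mec, finish)
    return assignment

if __name__ == "__main__":
    tasks = [Task("u1", 2e6, 5e8), Task("u2", 1e6, 2e8)]
    cores = [Core("MEC1", 2e9), Core("MEC2", 3e9)]
    sinr = {("u1", "MEC1"): 8.0, ("u1", "MEC2"): 1.5,
            ("u2", "MEC1"): 4.0, ("u2", "MEC2"): 6.0}
    print(match(tasks, cores, sinr))
```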

Fig. 2 Proposed model for task offloading and VM migration (task classification into local and MEC tasks, user-server matching policy, VM migration policy)

The matching policy is handled between the set of devices and the MEC servers, and the MEC CPU core which can handle the task at the earliest is selected. The proposed policy avoids delay by selecting the appropriate MEC as well as a quality transmission link, thereby limiting both the processing and the transmission delay. The proposed matching scheme selects the best MEC for a requested task; the delay reduction helps to save the energy of the device, and in addition the energy consumption is reduced with the DVFS policy. The second stage is a VM migration policy based on a request for resources, so that the user's migration from one cloud to another does not affect the performance of the application. Increasing load and the need for more resources can also lead a VM to migrate from one cloudlet to another. Task handoff is handled with a handoff algorithm together with a fuzzy logic approach for choosing a network in a heterogeneous environment. The MEC consists of numerous devices which are mobile in nature, and the mobility of users can take a user from the sensing range of a certain network to the outer region. With advances in technology, devices are able to work with multiple network technologies such as Wi-Fi, LTE, and GPRS. The task handoff algorithm first receives the request from the user and checks whether the user is moving at a high velocity; if the device is communicating over GPRS and moving at a speed greater than the threshold speed, the task handoff manager decides that the user may enter another area where the present network strength will be reduced, and the proposed technique hands off the particular task to another network which is available in that area. The threshold speed is defined as the ratio of the maximum of the sensing area to the total distance covered under the current network. The task handoff manager uses a fuzzy logic-based approach to hand the task to a network which is available; to maintain QoS, a network must be connected at all times so that the task can be computed while the terminal is moving. The input parameters are the bandwidth of the wireless link, the received signal strength, and the velocity of the mobile node. The rule-based fuzzy logic considers the quality of service while choosing a network from a set of networks such as Wi-Fi, UMTS, 5G, and LTE, and its output is the selection of an available network which has the capability and resources to process the task. The second stage also includes designing a VM migration scheme within the task migration scheme. The VM migration policy is necessary when a VM from one area enters another; during migration, the MEC has to consider the mobility of devices moving from one network to another. The VM migration policy of the proposed model is based on a predictive approach. The VM migration algorithm first collects the total resources needed to execute a particular task, i.e. memory, bandwidth, etc. It then requests a VM with the necessary prerequisites for the task from the adjacent clouds in the other network, and the matching of a VM with the requested requirements is carried out in the nearby edge cloud.
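The request-based migration just described can be pictured with a minimal sketch, assuming a simple message flow in which the VM manager forwards the task's resource requirements to the adjacent edge clouds and selects one that confirms it can host a matching VM. The class names, resource fields and figures below are hypothetical, not values taken from the paper.

```python
# Illustrative sketch (assumed message flow, not the authors' protocol): before a user
# crosses a network boundary, the VM manager sends the task's remaining resource needs
# to the adjacent edge clouds and migrates to the first MEC that confirms a matching VM.
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    memory_mb: int
    bandwidth_mbps: int
    cpu_cores: int
    progress: float        # share of the task already executed on the current VM

class EdgeCloud:
    def __init__(self, name, memory_mb, bandwidth_mbps, cpu_cores):
        self.name = name
        self.free = {"memory_mb": memory_mb,
                     "bandwidth_mbps": bandwidth_mbps,
                     "cpu_cores": cpu_cores}

    def can_host(self, req: ResourceRequest) -> bool:
        return (self.free["memory_mb"] >= req.memory_mb
                and self.free["bandwidth_mbps"] >= req.bandwidth_mbps
                and self.free["cpu_cores"] >= req.cpu_cores)

def request_migration(req: ResourceRequest, adjacent_clouds):
    """Return the first adjacent MEC able to provide a VM with the requested resources."""
    for cloud in adjacent_clouds:
        if cloud.can_host(req):
            return cloud.name
    return None  # no VM available: the task stays on the current MEC (or falls back to the cloud)

if __name__ == "__main__":
    req = ResourceRequest(memory_mb=512, bandwidth_mbps=20, cpu_cores=2, progress=0.6)
    neighbours = [EdgeCloud("MEC2", 256, 50, 4), EdgeCloud("MEC3", 1024, 100, 8)]
    print(request_migration(req, neighbours))   # -> "MEC3"
```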
The VM migration policy also ensures that the service is not lost partway through while the user moves from one network to another.


The VM migration technique monitors the boundary of the network through location sensing. Once a task is in process and the edge device enters a new network via the task handover method, the VM manager sends the details of the total resources the task needs to execute and of the resources the task has already consumed in the current VM since it started. Thus, through request-based resource allocation to the adjacent clouds, the tasks are allocated a VM.
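Finally, the network selection performed by the task handoff manager described earlier in this section can be illustrated with the following simplified sketch, which replaces the fuzzy controller with crisp weighted rules over the same inputs (bandwidth, received signal strength and node velocity) and applies the threshold-speed check loosely as defined above. The weights, normalisation ranges and example figures are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' fuzzy controller): a crisp, rule-based stand-in
# for the fuzzy network selection used during task handoff. Candidate networks are scored
# from link bandwidth, received signal strength (RSS) and the node's velocity.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str            # e.g. "Wi-Fi", "LTE", "5G", "UMTS"
    bandwidth_mbps: float
    rss_dbm: float       # received signal strength

def threshold_speed(max_sensing_range_m: float, distance_covered_m: float) -> float:
    """Loose reading of the paper's threshold: ratio of the maximum sensing range
    to the total distance covered under the current network."""
    return max_sensing_range_m / max(distance_covered_m, 1e-9)

def needs_handoff(velocity, max_sensing_range_m, distance_covered_m) -> bool:
    return velocity > threshold_speed(max_sensing_range_m, distance_covered_m)

def select_network(cands, velocity_mps, w_bw=0.5, w_rss=0.3, w_mob=0.2):
    """Score each candidate; fast-moving nodes are steered away from short-range Wi-Fi."""
    def score(c: Candidate) -> float:
        bw = min(c.bandwidth_mbps / 100.0, 1.0)                # normalise to [0, 1]
        rss = min(max((c.rss_dbm + 100.0) / 50.0, 0.0), 1.0)   # -100..-50 dBm -> 0..1
        mobility_fit = 0.3 if c.name == "Wi-Fi" and velocity_mps > 5 else 1.0
        return w_bw * bw + w_rss * rss + w_mob * mobility_fit
    return max(cands, key=score)

if __name__ == "__main__":
    velocity = 12.0  # m/s
    candidates = [Candidate("Wi-Fi", 80, -60), Candidate("LTE", 40, -75), Candidate("5G", 120, -85)]
    if needs_handoff(velocity, max_sensing_range_m=300.0, distance_covered_m=40.0):
        print("hand off to:", select_network(candidates, velocity).name)
```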

4 Conclusion

A detailed literature survey has been carried out by considering three use cases of MCC: computation offloading, handoff, and VM migration. A heterogeneous network has been proposed by considering multiple user equipment and multiple MEC servers. To achieve energy efficiency, a user-server matching policy has been proposed which takes account of the channel condition, the available bandwidth, and the capacity of the MEC server. A request-based VM migration policy has been proposed for seamless service.

References

1. Mao Y, You C, Zhang J, Huang K, Letaief KB (2017) A survey on mobile edge computing: the communication perspective. IEEE Commun Surv Tutor 19(4):2322–2358
2. Abbas N, Zhang Y, Taherkordi A, Skeie T (2017) Mobile edge computing: a survey. IEEE Internet Things J 5(1):450–465
3. Mach P, Becvar Z (2017) Mobile edge computing: a survey on architecture and computation offloading. IEEE Commun Surv Tutor 19(3):1628–1656
4. Chen M, Hao Y (2018) Task offloading for mobile edge computing in software defined ultra-dense network. IEEE J Sel Areas Commun 36(3):587–597
5. Wang Z, Liang W, Huang M, Ma Y (2018) Delay-energy joint optimization for task offloading in mobile edge computing. arXiv preprint arXiv:1804.10416
6. Tao X, Ota K, Dong M, Qi H, Li K (2017) Performance guaranteed computation offloading for mobile-edge cloud computing. IEEE Wirel Commun Lett 6(6):774–777
7. Chen Y, Zhang N, Zhang Y, Chen X, Wu W, Shen XS (2019) Energy efficient dynamic offloading in mobile edge computing for Internet of Things. IEEE Trans Cloud Comput
8. Liu CF, Bennis M, Debbah M, Poor HV (2019) Dynamic task offloading and resource allocation for ultra-reliable low-latency edge computing. IEEE Trans Commun 67(6):4132–4150
9. Huang L, Feng X, Zhang L, Qian L, Wu Y (2019) Multi-server multi-user multi-task computation offloading for mobile edge computing networks. Sensors 19:1446
10. Chen X, Jiao L, Li W, Fu X (2016) Efficient multi-user computation offloading for mobile-edge cloud computing. IEEE/ACM Trans Networking 24(5):2795–2808
11. You C, Huang K, Chae H (2016) Energy efficient mobile cloud computing powered by wireless energy transfer. IEEE J Sel Areas Commun 34(5):1757–1771
12. Shiraz M, Gani A, Shamim A, Khan S, Ahmad RW (2015) Energy efficient computational offloading framework for mobile cloud computing. J Grid Comput 13(1):1–8
13. Chen X (2015) Decentralized computation offloading game for mobile cloud computing. IEEE Trans Parallel Distrib Syst 26(4):974–983


14. Zhao M, Zhou K (2019) Selective offloading by exploiting ARIMA-BP for energy optimization in mobile edge computing networks. Algorithms 12:48
15. Rego PAL, Trinta FAM, Hasan MZ, de Souza JN (2019) Enhancing offloading systems with smart decisions, adaptive monitoring, and mobility support. Hindawi Wirel Commun Mob Comput 2019, Article ID 1975312
16. Guo S, Xiao B, Yang Y, Yang Y (2016) Energy-efficient dynamic offloading and resource scheduling in mobile cloud computing. In: Proceedings of the IEEE international conference on computer communications (INFOCOM), San Francisco, CA, USA, 10–14 April 2016, pp 1–9
17. You C, Huang K, Chae H, Kim BH (2017) Energy-efficient resource allocation for mobile-edge computation offloading. IEEE Trans Wireless Commun 16(3):1397–1411
18. Liu J, Mao Y, Zhang J, Letaief KB (2016) Delay-optimal computation task scheduling for mobile-edge computing systems. In: IEEE international symposium on information theory (ISIT), pp 1451–1455
19. Mao Y, Zhang J, Letaief KB (2016) Dynamic computation offloading for mobile-edge computing with energy harvesting devices. IEEE J Sel Areas Commun 34(12):3590–3605
20. Deng S, Huang L, Taheri J, Zomaya AY (2015) Computation offloading for service workflow in mobile cloud computing. IEEE Trans Parallel Distrib Syst 26(12):3317–3329
21. Satyanarayanan M, Bahl P, Caceres R, Davies N (2009) The case for VM-based cloudlets in mobile computing. IEEE Pervasive Comput 8(4):14–23
22. Zhou W, Fang W, Li Y, Yuan B, Li Y, Wang T (2019) Markov approximation for task offloading and computation scaling in mobile edge computing. Hindawi Mobile Inform Syst 2019, Article ID 8172698
23. Dinh TQ, Tang J, La QD, Quek TQS (2017) Adaptive computation scaling and task offloading in mobile edge computing. In: 2017 IEEE wireless communications and networking conference (WCNC), San Francisco, CA, pp 1–6. https://doi.org/10.1109/WCNC.2017.7925612
24. Wu G, Chen J, Bao W, Zhu X, Xiao W, Wang J (2017) Towards collaborative storage scheduling using alternating direction method of multipliers for mobile edge cloud. J Syst Software. https://doi.org/10.1016/j.jss.2017.08.032
25. Choi HW, Kwak H, Sohn A, Chung K (2008) Autonomous learning for efficient resource utilization of dynamic VM migration. In: Proceedings of the 22nd annual international conference on supercomputing, 7 June 2008. ACM, pp 185–194
26. Machen A, Wang S, Leung KK, Ko BJ, Salonidis T (2017) Live service migration in mobile edge clouds. IEEE Wirel Commun 25(1):140–147
27. Rodrigues TG, Suto K, Nishiyama H, Kato N, Temma K (2018) Cloudlets activation scheme for scalable mobile edge computing with transmission power control and virtual machine migration. IEEE Trans Comput 67(9):1287–1300
28. Ouyang T, Zhou Z, Chen X (2018) Follow me at the edge: mobility-aware dynamic service placement for mobile edge computing. IEEE J Sel Areas Commun 36(10):2333–2345
29. Zhang F, Liu G, Zhao B, Fu X, Yahyapour R (2019) Reducing the network overhead of user mobility-induced virtual machine migration in mobile edge computing. Software Practice Experience 49(4):673–693
30. Renugadevi T, Geetha K, Prabaharan N, Siano P (2020) Carbon-efficient virtual machine placement based on dynamic voltage frequency scaling in geo-distributed cloud data centers. Appl Sci 10(8):2701

Quantitative Study on Barriers of Adopting Big Data Analytics for UK and Eire SMEs M. Willetts, A. S. Atkins, and C. Stanier

Abstract Big data analytics has been widely adopted by large companies, enabling them to achieve competitive advantage. However, small and medium-sized enterprises (SMEs) are underutilising this technology due to a number of barriers, including financial constraints and lack of skills. Previous studies have identified a total of 69 barriers to SMEs' adoption of big data analytics, rationalised to 21 barriers categorised into five pillars (Willetts M, Atkins AS, Stanier C (2020a) A strategic big data analytics framework to provide opportunities for SMEs. In: 14th International technology, education and development conference, pp 3033–3042. 10.21125/inted.2020.0893). To verify the barriers identified from the literature, an electronic questionnaire was distributed to over 1000 SMEs based in the UK and Eire using the snowball sampling approach during the height of the COVID-19 pandemic. The intention of this paper is to provide an analysis of the questionnaire, specifically applying the Cronbach's alpha test to ensure that the 21 barriers identified are positioned in the correct pillars, verifying that the framework is statistically valid.

1 Introduction

SMEs account for 99.9% of all businesses in the UK, employ 60% of the workforce and generate £2168 billion; this represents 52% of the turnover of all businesses in the UK [20]. Similarly, in Eire, SMEs make up 99.8% of all businesses, account for 70.1% of employment and contribute €91.9 billion, 41.5% of value added [7]. This paper discusses the use of a questionnaire to collect primary data for use in the validation of the big data analytics adoption framework for SMEs proposed by Willetts et al.


[27]. The resulting data is then used to assess the internal consistency of the pillars of the strategic framework, using Cronbach’s alpha statistical analysis, to test the validity of the framework. This will allow poor internal consistency to be addressed by restructuring the framework. The individual barriers can then be further assessed and ranked in order of relative importance, in order to identify those barriers that present challenging issues to the implementation of big data analytics at SMEs. The structure of this paper is as follows: Sect. 2 provides a literature review. Section 3 describes the construction, revision and distribution of the questionnaire. Section 4 outlines the data analysis, statistical techniques employed and the revision to the strategic framework. Section 5 provides a conclusion to the paper and discusses future work.

2 Literature Review Big data is defined as “an umbrella term used to describe a wide range of technologies that capture, store, transform and analyse complex data sets which can be of a high volume, generated at a high velocity in a variety of formats” [27, p. 3034]. Big data analytics refers to the variety of software tools and techniques which are used to extract insights from big data sources. Mikalef et al. [17, p. 262] state that a widely used definition of big data analytics is “a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis”. There are many case studies of large companies achieving a variety of benefits from the adoption of big data analytics including significant savings from unplanned downtime [16] and increased efficiencies [22], but there are also case studies of SMES utilising the technology resulting in increased sales [26]. However, there are a number of barriers to SMEs adoption including lack of understanding, shortage of in-house data analytic expertise and financial barriers [4]. Sixty-nine barriers to big data analytics have been identified through a previous literature review which was rationalised to 21 barriers through the utilisation of the thematic analysis process [27, 28]. This study outlines a quantitative analysis and statistical validation of the results to develop a holistic assessment framework to assist SMEs in adopting big data analytics to provide competitive advantage.

3 Proposed Work 3.1 Research Design A questionnaire was developed to validate the barriers identified from a thematic analysis [27]. The questionnaire development approach documented in Moore and


Benbasat [18] was followed, which divides the process into three stages: item construction, reviewing process and testing.

1. Stage One: Item Construction. Sixty-nine barriers to SMEs adopting big data analytics were identified from a literature review. A thematic analysis was performed which rationalised the barriers from 69 to 21 barriers and grouped them into five pillars: business, environmental, human, organisational and technological. The pillars originated from three overlapping theoretical frameworks: technology–organisation–environment (TOE) [24]; human, technology, organisation-fit (HOT-fit) [30]; and the information systems strategy triangle (ISST) [19]. TOE provides the technological, organisational and environmental pillars, HOT-fit provides the human pillar, and ISST provides the business pillar. These were therefore used as measurement items for the questionnaire. One of the most widely reported barriers to SMEs adopting big data analytics in the literature is lack of awareness, and it was anticipated that some SMEs would not know what big data analytics is [23]; for this reason, an "I do not know" option was added to the Likert items. Although "I do not know" or "not applicable" options are not recommended for online surveys, as they may increase the number of "I do not know" responses [14], such responses may provide valuable information in this case, as they may suggest that lack of big data analytics awareness is one of the most significant barriers to SMEs adopting the technologies. Similarly, it was considered that failing to provide an "I do not know" option could have resulted in a lower response rate, as participants who did not understand big data or big data analytics may have chosen not to complete the questionnaire.

2. Stage Two: Reviewing Process. The purpose of the reviewing process is to evaluate the content validity of the questionnaire. Content validity is the extent to which the items of the questionnaire provide adequate coverage of the investigative questions [21]. This can be achieved by reviewing the available literature to identify content items and by consulting experts in the field [21]. To provide further validity, the questionnaire was reviewed by subject matter experts to ensure that the questions represented the barriers to big data analytics adoption by SMEs. Five IT professionals reviewed the questionnaire during this stage to check its content from a technical perspective. They suggested that additional questions be added to the Likert questions to ensure participants understood the difference between presenting data and choosing a suitable big data analytics solution, as this distinction may not be clear to the participant, depending on their understanding of big data analytics.

3. Stage Three: Testing. The final stage of the questionnaire development is testing. The minimum sample size for a student pilot questionnaire is suggested as 10, due to the lack of financial or time resources required for large-scale field trials [21]. Isaac and Michael (1995) suggest that small sample sizes are suitable when it is not economically feasible to collect a large sample, and that sample sizes of 10 to 30 are sufficient [12]. Therefore, a pilot sample size of ten was considered sufficient for this study. A pilot study of the questionnaire was distributed to two groups. The first group consisted of five IT professionals who were asked to review the content of the questionnaire. The second group consisted of five non-IT professionals working for SMEs; the aim was to test the usability of the Qualtrics questionnaire system and obtain their feedback. The pilot questionnaire was successful, as all participants completed the questionnaire without encountering any technical issues and the content of the questionnaire was understood, with minor amendments suggested, such as formatting changes, which were subsequently implemented.

3.2 Questionnaire Design

The questionnaire consisted of 42 questions divided into five parts. The first part acted as a coversheet, stating that participants remained anonymous, that participation was voluntary and that they were not required to answer every question if they did not wish to. Information was provided regarding the storage and use of the data provided, following the University's ethical guidance. The second part consisted of demographic questions, and the third section contained questions relating to the data captured and analysed, software applications, IT support and the IT budget. The fourth part consisted of Likert questions relating to the 21 barriers to SMEs adopting big data analytics, and the final part provided a thank you message and the author's contact details. A Likert scale of 1 to 5 [3] was adopted for these questions, where 1 is strongly agree and 5 is strongly disagree.

3.3 Population and Sample of the Study

The research population is the set of items, people, objects or organisations which will be the subject of the study [25]. However, depending on the nature of the study, it is rarely feasible to collect data from the entire population, for example due to limitations of time, money or access [21]. Therefore, a sample that represents the research population needs to be selected [21]. The aim of this study is to develop a strategic framework to assist SMEs in adopting big data analytics, and therefore the research population is all SMEs based in the UK and Eire. The study focuses on the UK and Eire for several reasons. Firstly, the definition of an SME can vary between countries; for example, in Australia a business which employs up to 200 staff is regarded as an SME, while in the United States it is up to 499 people [1]. In addition, SMEs in different countries may encounter different challenges, and hence the barriers encountered by UK SMEs may not be applicable to SMEs in other countries, raising issues of consistency. Similarly, the trading conditions may vary


from country to country, including legislation such as the General Data Protection Regulation in the European Union. As the researcher was located in the UK and their SME contacts are all located in the UK and Eire, it was more feasible to limit the study to SMEs based in the UK and Eire.

3.4 Administration and Distribution of the Questionnaire

The questionnaire was designed and developed using an online surveying platform, Qualtrics. Evans and Mathur [9] outline a number of advantages of utilising an online questionnaire: physical reach, as participants can be located anywhere; flexibility, as survey applications allow questionnaires to be developed relatively easily without the need to write programming or mark-up code; convenience; speed and timeliness; question diversity, as multiple question formats can be utilised; and the ease of obtaining large sample sizes. Physical reach was a key advantage here, as the questionnaire was distributed during the COVID-19 pandemic and it was not possible to meet face to face with interview subjects during the lockdown period. Online questionnaires are recommended when interviewer interaction with respondents is not required or desirable, and therefore interviewer bias and errors are eliminated [8]. Despite these advantages, online questionnaires also have a number of weaknesses, including the perception that the emails distributed to participants are junk mail, the impression that surveys are impersonal, privacy issues, and low response rates [9]. The FluidSurveys [10] sample size calculator was utilised to calculate a sample size using the population of all UK SMEs reported in 2019, which is stated as 4.86 million [20], with a confidence level of 95% and a margin of error of 5%. The recommended sample size generated by this calculation was 385. However, Gorsuch (1983) and Kline (1979) recommend that 100 is a sufficient sample size [15], and this has been recommended by other authors for statistical techniques, including factor analysis [29]. Due to the constraints of time and the COVID-19 pandemic, it was decided that the Gorsuch (1983) and Kline (1979) sample size of 100 provided a sufficient sample for the statistical analysis and was more feasible to acquire than 385. The questionnaire was distributed at the height of the COVID-19 pandemic, when many businesses were closed or had non-essential staff furloughed, resulting in businesses not being able to complete, or not prioritising, a questionnaire over other work. The questionnaire was distributed using "snowball sampling", which is a social-chain approach to sampling whereby participants assist in identifying further participants to grow the sample size [21]. This technique is employed for studying hard-to-reach populations and has been utilised in a variety of disciplines [11], including other big data analytics studies [6]. However, a potential disadvantage of the snowball approach is that participants are likely to invite other participants who have similar characteristics to themselves, introducing the possibility of bias [21].


Kirchherr and Charles [13] provide a number of recommendations for increasing sample diversity in snowball samples, including utilising personal contacts, issuing reminders and ensuring that the initial sample seed is diverse. Therefore, to promote diversity in the sectors represented in the sample, a wide range of SMEs operating in different sectors, including manufacturing, retail, financial services and business services, were initially contacted, in an attempt to maximise diversity at the genesis of the "snowballs". Invitation emails containing a link to the online questionnaire were distributed in May 2020 to participants from the researcher's personal contacts who worked for UK-based SMEs. The researcher also utilised contacts on their personal LinkedIn profile, who were sent messages containing a link to the online questionnaire. Where appropriate, contacts were asked to invite members of their own network of SME contacts to participate in the study. The British Computer Society was contacted in May 2020 for assistance in distributing the questionnaire; they agreed to help and distributed the details and link to the questionnaire in an email to all members of the Data Management Special Interest Group. The Chambers of Commerce for the Black Country, Staffordshire, Shropshire and Birmingham were also contacted for assistance in distributing the questionnaire to local businesses. Other charities located in the West Midlands which support SMEs were contacted, but they were unable to assist with the distribution of the questionnaire. To further increase the number of respondents, businesses were randomly selected using Google Maps in the West Midlands, East Midlands, London, Glasgow and Staffordshire areas. This had the additional benefit of increasing the geographical distribution of SMEs covered in the questionnaire. Each business selected was reviewed on the Companies House website [5], which displays the accounts submitted by each business at the end of each financial year. Using this report, it could be determined whether the selected companies were SMEs, based on the turnover, assets and number of staff documented in the annual accounts report.
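For reference, the recommended sample size of 385 quoted earlier in this subsection is consistent with Cochran's standard sample-size formula (a generic statistical result, not a formula given in the paper), using z = 1.96 for 95% confidence, p = 0.5 and a 5% margin of error:

\[
n_0 = \frac{z^{2}\,p(1-p)}{e^{2}} = \frac{1.96^{2} \times 0.5 \times 0.5}{0.05^{2}} \approx 384.2 \;\Rightarrow\; 385,
\qquad
n = \frac{n_0}{1 + (n_0 - 1)/N} \approx 384.1 \quad (N = 4.86 \text{ million}),
\]

so the finite-population correction for 4.86 million UK SMEs leaves the recommended sample of 385 effectively unchanged.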

4 Result 4.1 Data Analysis A total of 224 questionnaire responses were received. The results were coded and analysed utilising IBM’s Statistical Package for Social Sciences (SPSS) version 27. 102 fully completed responses from SMEs were received; however, an additional five responses identified as large companies were excluded from the analyses. Similarly, 46 responses were mostly complete or had not completed the Likert questions, and therefore, these were utilised for the initial analysis. Appendix 1 shows a flow chart detailing the exclusions made for the various phases of the analysis. The included responses were then assessed for validity. The questionnaire utilised predominantly closed questions and Likert questions, which restricted the data that


could be entered in response to each question. Hence, it was not possible for respondents to give unrelated responses. In addition, comparisons were made between questions to identify any combinations of responses that would be anomalous. For example, if a participant stated that their business utilises big data analytics, it would not be reasonable for them to also state that they did not know what big data is, as it is assumed that to utilise big data analytics the user would need to understand what it is. No such anomalies were detected.

4.2 Initial Analysis The majority of respondents were in senior roles at the business (owner: 38.0%, director: 24.8%), with the remainder generally in managerial or IT-based roles. As such, it is likely that most respondents would have been in a position of sufficient knowledge to accurately complete the questionnaire. A breakdown is shown in Fig. 1. A diverse range of business sizes were also represented, with 39.4, 37.2 and 23.4% of respondents from companies employing 1–9 (micro), 10–49 (small) and 50–249 (medium-sized) staff, respectively. A total of 24 sectors were represented in the respondents (with two additional respondents stating “other” sectors), indicating that some degree of diversity had been achieved by the sampling methodology. However,

Fig. 1 Number of staff


Fig. 2 Participants by sector

there was a clear preponderance of respondents based in the technology (21.9%) and business services (20.4%) fields, displayed in Fig. 2. Appendix 2 provides a full breakdown of the demographics. As big data analytics can be utilised to analyse a variety of data stored in different formats, therefore it was important to understand the data captured and analysed by SMEs. The majority of participants analyse customer data (72.3%), sales data (62.0%) and Website data (51.1%). Other categories of data were less widely utilised including supplier data (38.0%), competitor data (23.4%), social media data (40.9%), images (21.2%). and sensor data (5.1%). It was also important to identify the software applications currently utilised by SMEs to analyse data. The majority of businesses utilised spreadsheet applications (83.9%) and over half utilised Google analytics (51.8%). However, Twitter analytics (18.2%), Microsoft Power BI (15.3%) and data warehouses (13.1%) are not widely adopted by the SMEs surveyed. An option for the participants to input other data analytics software was provided in the form of a free text input box. Some of the applications utilised by SMEs to analyse data include SPSS, Sage, GDS, Tableau, Snowflake, Zoho Analytics, QuickBooks, Bullhorn (a recruitment system), Xero, Survey Monkey, Qualtrics, SNAP, Askia, Crystal Knows, Qlik, Salesforce, Python, R, Mailchimp, Cube19, Snap Surveys and QuenchTec. As there are a number of technical barriers to the adoption of big data analytics, there were several questions relating to IT support and the IT budget. A total of 37.2% of participants stated that their business had their own IT department; however, with 30 of the businesses taking part in the survey being based in the technology sector, it would be expected that many of these will have their own IT department. 35.8% of the businesses outsource their IT support, and 7.3% combine IT support with another role. 19.7% of businesses do not have any dedicated IT support, as shown in Fig. 3. This suggests that the skills required to implement a big data analytics solution may be lacking without dedicated IT staff.


Fig. 3 IT support

Only 45.3% of the participants stated that their business has an IT budget, 42.3% stated that they do not have a budget, and the remaining 12.4% did not know. Sixty-one of the 137 respondents in the sample completed the follow-on question asking how much their IT budget is. Of the 61, 24.6% stated that their IT budget was more than £50,000, 23.0% have an IT budget between £10,000 and £50,000, 16.4% have an IT budget between £5000 and £10,000, and 13.1% have an IT budget of less than £5000. 23% did not know if their business has a dedicated IT budget. Figure 4 displays a breakdown for the participants who answered the follow-on question regarding the IT budget as a percentage of their business's annual turnover. Participants were asked if they knew what big data and big data analytics are. Most participants (63.5%) understood what big data is, 14.6% were unsure and 21.9% did not know. Similarly, 61.1% of participants understood what big data analytics was, but only 9.6% of businesses were using it. A recent study reported that one in ten SMEs in the European Union are using big data analytics [2], which is consistent with these results. A survey of 15 manufacturing SMEs based in South Wales revealed that only 46.7% were aware of big data analytics, of which 75% had a vision of how they would use it [23], which suggests that the level of knowledge of big data analytics amongst SMEs has increased. However, 28.7% did not know what big data analytics was and 14% were unsure, suggesting that the lack of awareness may be a barrier, as shown in Fig. 5.


Fig. 4 IT budget percentage of turnover

Fig. 5 Awareness of big data analytics



4.3 Associations Between Demographics and Understanding of Big Data Analytics The “Do you know what Big Data Analytics is?” question was correlated against the questions from the first two parts of the questionnaire to identify if there is a relationship between the knowledge of big data analytics and other factors such as the sector the business operates in or the role of the participant. Appendix 3 displays the results for the significance tests. The relationship between the role of the participant and knowledge of big data analytics was insignificant, with a p-value of 0.492 reported, suggesting that there was not a significant relationship between these values. However, there was a very significant relationship between the sector the business operates in and the participant’s knowledge of big data analytics with a p-value < 0.001 calculated. The majority of participants in the sectors like communications and technology (87.9%) have an understanding of big data analytics, with a large proportion in business services (85.7%) and marketing and media (78.9%). The relationship between number of staff and knowledge of big data analytics was insignificant with a p-value of 0.627. The relationship between IT support and big data analytics knowledge was very significant with a p-value of p < 0.001. The majority of participants with a dedicated IT department or staff (88.2%) or where IT support was combined with another role (100%) stated that they know what big data analytics is, suggesting that when IT support is provided in-house, there is a greater knowledge computer technology internally than for firms who outsource their IT support. The relationship between knowledge of big data analytics and IT budget was also significant with a p-value of 0.019 reported. However, the relationship between big data analytics and either IT decision-making or IT budget amount was insignificant as both categories scored p-values greater than 0.05. There was a very significant relationship between whether a firm analyses data and knowledge of big data analytics, with a p-value of p < 0.01 reported. Of the businesses which analyse data, 81.6% of participants report that they know what big data analytics is, suggesting that if a business analyses data then there is a high probability they will be aware of big data analytics. Very strong relationships were reported between big data analytics knowledge and social media (0.020) and images (0.020). Both types of data are classified as big data analytics; this suggests that if businesses analyse these categories of data, they are likely to understand what big data analytics is. However, there was no significant relationship between big data analytics knowledge and analysis of customer data (0.149), sales data (0.186), supplier data (0.722), competitor data (0.599), Website data (0.054) and sensor data (0.085).
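As an illustration of the chi-squared tests summarised above and tabulated in Appendix 3, the short sketch below reproduces the IT-support cross-tabulation (counts as reconstructed in Appendix 3) using SciPy. The use of SciPy here is an assumption for illustration only; the paper's analysis was performed in SPSS.

```python
# Illustrative sketch of the kind of chi-squared test reported in Appendix 3, using the
# IT-support-by-knowledge cross-tabulation as reconstructed there.
from scipy.stats import chi2_contingency

# Rows: dedicated IT dept/staff, outsourced IT, IT combined with another role, no dedicated IT
# Columns: knows what big data analytics is (Yes, No)
observed = [
    [45, 6],
    [25, 23],
    [10, 0],
    [17, 10],
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# The paper reports this association as very significant (p < 0.001).
```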


4.4 Barriers to Big Data Analytics

The final set of questions comprised the Likert questions representing the barriers to big data analytics adoption. The final section of the questionnaire contained 23 Likert questions, 21 of which represented the 21 barriers identified to SMEs adopting big data analytics. 102 SME participants fully completed the Likert questions. The "I do not know" responses were removed from the Cronbach's alpha calculation used to test internal consistency. Table 1 shows the data used to calculate the Cronbach's alpha scores, where the difference between N and 102 represents the "I do not know" answers. A further Likert question in the final stage of the analysis addressed an issue separate from the barriers to SMEs adopting big data analytics; the 102 complete responses were utilised for this analysis. This question asked participants how strongly they agreed with the statement "My business would benefit from Big Data Analytics". Almost half of the participants agreed that their business would benefit from big data analytics, with 11.8% strongly agreeing and 33.3% moderately agreeing; 18.6% of participants were neutral, 7.8% strongly disagreed and 7.8% moderately disagreed. 20.6% of participants did not know, suggesting that they do not know what big data analytics is or how it would benefit their business.

4.5 Cronbach's Alpha

The reliability of the questionnaire was tested by examining the internal consistency between the questionnaire items. The analysis based on Cronbach's alpha used a complete-cases approach, in that only respondents who gave a substantive response (i.e. excluding "I do not know" responses) to all of the barriers within a pillar were included in the analysis of that pillar; as such, the number of respondents included in the analysis of each pillar ranged from 64 to 85. The Cronbach's alpha test was performed on each of the five pillars to test the relationship between the barriers. Appendix 4 displays the results of the test. Where there are more than two barriers in a pillar, the Cronbach's alpha score for the pillar is also displayed with each barrier removed in turn. Internal reliability was highest for the technological pillar, with a Cronbach's alpha score of 0.91; the removal of barriers in this pillar had minimal effect, with the alpha for items removed ranging from 0.88 to 0.90. A similarly high Cronbach's alpha of 0.86 was observed for the organisational pillar, with an alpha of 0.82 to 0.85 when items were removed. The Cronbach's alpha for the environmental pillar was 0.65. There was some evidence that the ethical concerns in data use barrier was less consistent with the other barriers in this pillar, as removing it improved the alpha to 0.71; however, this was deemed to be acceptable. Two pillars did not meet the acceptable threshold of 0.5, namely the human pillar at 0.46 and the business pillar at 0.37.


Table 1 Data used for the Cronbach's alpha test

| Pillar | Barrier | N | Mean | Strongly disagree | Moderately disagree | Neutral | Moderately agree | Strongly agree |
|---|---|---|---|---|---|---|---|---|
| Business | Financial barriers | 84 | 2.5 | 17 (20.2%) | 28 (33.3%) | 23 (27.4%) | 13 (15.5%) | 3 (3.6%) |
| Business | Lack of business cases | 82 | 3.4 | 8 (9.8%) | 10 (12.2%) | 20 (24.4%) | 29 (35.4%) | 15 (18.3%) |
| Environmental | Ethical concerns in data use | 86 | 3.4 | 10 (11.6%) | 10 (11.6%) | 15 (17.4%) | 34 (39.5%) | 17 (19.8%) |
| Environmental | Inability to assess and address digital risks | 79 | 3.5 | 4 (5.1%) | 12 (15.2%) | 15 (19.0%) | 35 (44.3%) | 13 (16.5%) |
| Environmental | Regulatory issues | 81 | 3.4 | 6 (7.4%) | 17 (21.0%) | 10 (12.3%) | 34 (42.0%) | 14 (17.3%) |
| Environmental | The lack of common standards | 73 | 3.2 | 6 (8.2%) | 14 (19.2%) | 19 (26.0%) | 27 (37.0%) | 7 (9.6%) |
| Human | Lack of in-house data analytics expertise | 87 | 3.2 | 10 (11.5%) | 10 (11.5%) | 24 (27.6%) | 35 (40.2%) | 8 (9.2%) |
| Human | Shortage of consultancy services | 87 | 3.5 | 12 (13.8%) | 7 (8.0%) | 13 (14.9%) | 35 (40.2%) | 20 (23.0%) |
| Organisational | Change management | 81 | 3.5 | 5 (6.2%) | 5 (6.2%) | 24 (29.6%) | 39 (48.1%) | 8 (9.9%) |
| Organisational | Cultural barriers | 86 | 3.4 | 8 (9.3%) | 6 (7.0%) | 25 (29.1%) | 35 (40.7%) | 12 (14.0%) |
| Organisational | Insufficient volumes of data to be analysed | 84 | 2.5 | 24 (28.6%) | 21 (25.0%) | 16 (19.0%) | 18 (21.4%) | 5 (6.0%) |
| Organisational | Lack of managerial awareness and skills | 87 | 2.7 | 15 (17.2%) | 22 (25.3%) | 24 (27.6%) | 22 (25.3%) | 4 (4.6%) |
| Organisational | Lack of top management support | 84 | 3.3 | 8 (9.5%) | 11 (13.1%) | 26 (31.0%) | 26 (31.0%) | 13 (15.5%) |
| Organisational | Management of technology | 80 | 3.2 | 11 (13.8%) | 15 (18.8%) | 16 (20.0%) | 25 (31.3%) | 13 (16.3%) |
| Organisational | Talent management | 82 | 2.8 | 19 (23.2%) | 21 (25.6%) | 8 (9.8%) | 24 (29.3%) | 10 (12.2%) |
| Technological | Complexity of data | 75 | 2.8 | 17 (22.7%) | 15 (20.0%) | 13 (17.3%) | 23 (30.7%) | 7 (9.3%) |
| Technological | Data scalability | 81 | 2.4 | 27 (33.3%) | 19 (23.5%) | 14 (17.3%) | 14 (17.3%) | 7 (8.6%) |
| Technological | Data silos | 84 | 3.0 | 16 (19.0%) | 14 (16.7%) | 20 (23.8%) | 24 (28.6%) | 10 (11.9%) |
| Technological | Infrastructure readiness | 78 | 2.9 | 19 (24.4%) | 10 (12.8%) | 17 (21.8%) | 20 (25.6%) | 12 (15.4%) |
| Technological | Lack of suitable software | 83 | 2.7 | 21 (25.3%) | 15 (18.1%) | 19 (22.9%) | 20 (24.1%) | 8 (9.6%) |
| Technological | Poor data quality | 82 | 3.2 | 11 (13.4%) | 13 (15.9%) | 20 (24.4%) | 25 (30.5%) | 13 (15.9%) |

These two pillars were therefore investigated further to assess whether the barriers populating them could be rearranged to improve the Cronbach's alpha of the pillars as a whole. The four barriers forming the business and human pillars were moved into each of the other pillars in turn and the Cronbach's alpha was recalculated to determine where they fit best. Moving financial barriers to the environmental pillar increased the Cronbach's alpha by 0.014. Moving lack of business cases to the organisational pillar slightly decreased the Cronbach's alpha score, by 0.020, but from a theoretical perspective, business cases are required by the organisation's decision-makers to influence their decision to adopt big data analytics. Moving the lack of in-house data analytics expertise barrier to the organisational pillar increased the Cronbach's alpha score by 0.024. The shortage of consultancy services remained in the human pillar on its own, as moving it to any of the other pillars reduced the Cronbach's alpha. Table 2 shows the breakdown of the Cronbach's alpha test following the relocation of the four barriers.
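The complete-cases procedure and the "alpha if item removed" check described in this section can be illustrated with the following sketch; it is not the authors' SPSS procedure, and the column names and demo data are hypothetical. Cronbach's alpha is computed in the standard way, as k/(k-1) multiplied by one minus the ratio of the summed item variances to the variance of the total scale score.

```python
# Illustrative sketch: Cronbach's alpha for one pillar using a complete-cases approach,
# i.e. dropping respondents who gave an "I do not know" answer to any barrier in the pillar.
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def pillar_alpha(responses: pd.DataFrame, barriers, dont_know=-1):
    """Complete-cases alpha plus 'alpha if item removed' for each barrier in the pillar."""
    pillar = responses[barriers].replace(dont_know, np.nan).dropna()
    alpha = cronbach_alpha(pillar)
    if_removed = {}
    if len(barriers) > 2:   # with only two items, removing one leaves alpha undefined
        if_removed = {b: cronbach_alpha(pillar.drop(columns=b)) for b in barriers}
    return alpha, if_removed, len(pillar)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 1-5 Likert responses (with -1 marking "I do not know") for three barriers;
    # random demo data, so the alpha value itself is not meaningful.
    demo = pd.DataFrame(rng.integers(1, 6, size=(100, 3)), columns=["b1", "b2", "b3"])
    demo.iloc[::17, 0] = -1
    print(pillar_alpha(demo, ["b1", "b2", "b3"]))
```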

4.6 Framework Refinement

The big data analytics strategic framework for SMEs has been refined utilising the feedback received from SMEs participating in the questionnaire. The barriers which had a low Cronbach's alpha score have been moved to pillars which increased their score. This suggests that, statistically, the barriers are in their correct position alongside the barriers to which they are related, ensuring that the framework is intuitive. The Cronbach's alpha test has been widely utilised in studies across a variety of fields, and therefore it provides confidence in its results. Using the Cronbach's alpha test also helps with the validation and evaluation of the strategic framework.


Table 2 Cronbach's alpha test on the four pillars of the revised big data analytics strategic framework for SMEs

| Pillar / barrier | N included | Cronbach's alpha | Alpha (if item removed) |
|---|---|---|---|
| Environmental | 63 | 0.67 | |
| - Ethical concerns in data use | | | 0.66 |
| - Financial barriers | | | 0.66 |
| - Inability to assess and address digital risks | | | 0.63 |
| - Regulatory issues | | | 0.58 |
| - The lack of common standards | | | 0.55 |
| Human | N/A* | N/A* | |
| - Shortage of consultancy services | | | N/A* |
| Organisational | 62 | 0.87 | |
| - Change management | | | 0.85 |
| - Cultural barriers | | | 0.84 |
| - Insufficient volumes of data to be analysed | | | 0.85 |
| - Lack of business cases | | | 0.89 |
| - Lack of in-house data analytics expertise | | | 0.84 |
| - Lack of managerial awareness and skills | | | 0.86 |
| - Lack of top management support | | | 0.84 |
| - Management of technology | | | 0.86 |
| - Talent management | | | 0.86 |
| Technological | 64 | 0.91 | |
| - Complexity of data | | | 0.88 |
| - Data scalability | | | 0.89 |
| - Data silos | | | 0.90 |
| - Infrastructure readiness | | | 0.89 |
| - Lack of suitable software | | | 0.89 |
| - Poor data quality | | | 0.89 |

* Not applicable because Cronbach's alpha requires at least two barriers to be calculable

Figure 6 displays how the strategic framework has been revised based on the questionnaire feedback and statistical analysis, as follows:

1. The original version of the strategic framework was developed by undertaking a literature review to identify the barriers to SMEs adopting big data analytics [27]. The barriers were refined utilising a thematic analysis, and the barriers were categorised into pillars identified from theoretical frameworks.

2. The second version of the framework was developed from the feedback received from the questionnaire. The Cronbach's alpha suggested that three of the barriers needed to be relocated; for example, financial barriers moved from the business pillar to the environmental pillar. The business pillar was removed because both of its barriers were moved to other pillars, and therefore the revised version of the framework contains four pillars. The environmental, organisational and technological pillars can be considered internal, as all of the barriers contained within these pillars relate to the internal constraints of the organisation. The human pillar, which contains shortage of consultancy services, can be considered an external pillar as it refers to factors outside of an organisation's control.

Fig. 6 Framework refinement

4.7 Limitations One limitation of this study is the sample seed. The initial participants selected were the author’s contacts, primarily based in the technology and business services sectors, and therefore, the questionnaire was not evenly distributed amongst sectors. As the snowball technique was used to distribute the surveys to the participants’ contacts, it is likely that their contacts were also located in the same sector in which they operate. Additionally, as the data collection was conducted during the COVID-19 pandemic, this may have resulted in a lower response rate than if the questionnaire had been distributed prior to the pandemic.

5 Conclusion

The analysis of the questionnaire has demonstrated that, although the majority of the participants understand the concepts of big data and big data analytics, less than 10% of the participants have adopted big data analytics. It has also shown that SMEs in the UK are diverse, with some businesses having dedicated IT staff and utilising software for the analysis of data, suggesting that they may be more receptive to big data analytics. Similarly, only 45.1% of businesses stated that they believe big data analytics would be beneficial for their businesses, which may suggest that the relevance of this technology depends on the nature of the business, or that participants are not aware of the potential benefits. The 21 barriers to big data analytics have also been verified. This study has resulted in a revised strategic framework for SMEs' adoption of big data analytics utilising the feedback from a statistical analysis. Future work will require qualitative data to be captured from SME practitioners to provide further verification of the barriers identified. The intention of this framework is to help make SMEs aware of the barriers outlined and assist them in overcoming these to provide competitive advantage.

Acknowledgements The authors would like to thank James Hodson for his statistical advice and assistance.


Appendix 1: Flow Chart of the Number of Responses Included in the Analysis

Questionnaire responses (N = 224)
  excluding large companies (N = 11)
  excluding responses that were completely blank or did not progress beyond the demographic section (N = 76)
Included in initial analysis (N = 137)
  excluding responses that did not complete the barriers to big data analytics adoption questions (N = 35)
Included in analysis of barriers to big data analytics adoption (N = 102)

Appendix 2: Demographics

Total responses: 137

Role (N = 137)
  Owner: 52 (38.0%)
  Director: 34 (24.8%)
  Senior manager (not IT-related): 19 (13.9%)
  IT manager/head of IT: 6 (4.4%)
  Manager (not IT-related): 10 (7.3%)
  Line manager: 2 (1.5%)
  IT specialist: 4 (2.9%)
  Other staff: 10 (7.3%)

Sector (N = 137)
  Aerospace and defence: 3 (2.2%)
  Asset and wealth management: 1 (0.7%)
  Automotive: 2 (1.5%)
  Business services: 28 (20.4%)
  Capital projects and infrastructure: 1 (0.7%)
  Charity: 2 (1.5%)
  Education: 6 (4.4%)
  Engineering and construction: 6 (4.4%)
  Financial services: 5 (3.6%)
  Government and public services: 1 (0.7%)
  Health care: 2 (1.5%)
  Horticulture: 1 (0.7%)
  Hospitality and leisure: 1 (0.7%)
  Insurance: 2 (1.5%)
  Manufacturing: 9 (6.6%)
  Marketing: 11 (8.0%)
  Media and entertainment: 8 (5.8%)
  Pharmaceutical and life sciences: 2 (1.5%)
  Power and utilities: 3 (2.2%)
  Real estate: 2 (1.5%)
  Retail and consumer: 3 (2.2%)
  Technology: 30 (21.9%)
  Telecommunications: 3 (2.2%)
  Transport and logistics: 3 (2.2%)
  Other: 2 (1.5%)

Number of staff (N = 137)
  1–9: 54 (39.4%)
  10–49: 51 (37.2%)
  50–249: 32 (23.4%)


Appendix 3: Significance Testing

Chi-squared tests of the relationship between knowledge of big data analytics, demographics and IT. Counts are respondents who know / do not know big data analytics; p-values below 0.050 were shown in bold in the original.

Role (N = 136, p = 0.492)
  Owner/director: 62 (72.1%) / 24 (27.9%)
  Manager (not IT-related): 19 (63.3%) / 11 (36.7%)
  IT manager/head of IT: 6 (100.0%) / 0 (0.0%)
  IT specialist: 3 (75.0%) / 1 (25.0%)
  Other: 7 (70.0%) / 3 (30.0%)

Sector (N = 136, p < 0.001)
  Communications and technology: 29 (87.9%) / 4 (12.1%)
  Business services: 24 (85.7%) / 4 (14.3%)
  Financial: 6 (60.0%) / 4 (40.0%)
  Construction and manufacturing: 8 (36.4%) / 14 (63.6%)
  Marketing and media: 15 (78.9%) / 4 (21.1%)
  Others: 15 (62.5%) / 9 (37.5%)

Number of staff (N = 136, p = 0.627)
  1–9: 41 (75.9%) / 13 (24.1%)
  10–49: 34 (68.0%) / 16 (32.0%)
  50–249: 22 (68.8%) / 10 (31.3%)

How is IT supported (N = 136, p < 0.001)
  Dedicated IT department or staff: 45 (88.2%) / 6 (11.8%)
  Outsourced IT or other third-party staff: 25 (52.1%) / 23 (47.9%)
  IT support combined with another role: 10 (100.0%) / 0 (0.0%)
  No dedicated IT support staff: 17 (63.0%) / 10 (37.0%)

IT decision-makers (N = 134, p = 0.253)
  The owner: 40 (71.4%) / 16 (28.6%)
  Senior management: 38 (67.9%) / 18 (32.1%)
  IT manager/head of IT: 19 (86.4%) / 3 (13.6%)

IT budget? (N = 119, p = 0.019)
  Yes: 51 (82.3%) / 11 (17.7%)
  No: 36 (63.2%) / 21 (36.8%)

IT budget amount (N = 47, p = 0.801)
  Less than £5000: 7 (87.5%) / 1 (12.5%)
  £5000 to £10,000: 8 (80.0%) / 2 (20.0%)
  £10,000 to £50,000: 12 (85.7%) / 2 (14.3%)
  More than £50,000: 14 (93.3%) / 1 (6.7%)

Chi-squared tests of the relationship between knowledge of big data analytics, software used and data analysed. Rows split respondents by whether they use the software or analyse the data; counts are respondents who know / do not know big data analytics; p-values below 0.050 were shown in bold in the original.

Analyse data? (N = 136, p < 0.001)
  No: 17 (44.7%) / 21 (55.3%)
  Yes: 80 (81.6%) / 18 (18.4%)

Data warehouse (N = 136, p = 0.020)
  No: 80 (67.8%) / 38 (32.2%)
  Yes: 17 (94.4%) / 1 (5.6%)

Spreadsheet applications (N = 136, p = 0.608)
  No: 14 (66.7%) / 7 (33.3%)
  Yes: 83 (72.2%) / 32 (27.8%)

Google Analytics (N = 136, p = 0.202)
  No: 43 (66.2%) / 22 (33.8%)
  Yes: 54 (76.1%) / 17 (23.9%)

Twitter analytics (N = 136, p = 0.567)
  No: 78 (70.3%) / 33 (29.7%)
  Yes: 19 (76.0%) / 6 (24.0%)

Microsoft Power BI (N = 136, p = 0.113)
  No: 79 (68.7%) / 36 (31.3%)
  Yes: 18 (85.7%) / 3 (14.3%)

Customer data (N = 136, p = 0.149)
  No: 23 (62.2%) / 14 (37.8%)
  Yes: 74 (74.7%) / 25 (25.3%)

Sales data (N = 136, p = 0.186)
  No: 33 (64.7%) / 18 (35.3%)
  Yes: 64 (75.3%) / 21 (24.7%)

Supplier data (N = 136, p = 0.722)
  No: 59 (70.2%) / 25 (29.8%)
  Yes: 38 (73.1%) / 14 (26.9%)

Competitor data (N = 136, p = 0.599)
  No: 73 (70.2%) / 31 (29.8%)
  Yes: 24 (75.0%) / 8 (25.0%)

Social media (N = 136, p = 0.020)
  No: 51 (63.8%) / 29 (36.3%)
  Yes: 46 (82.1%) / 10 (17.9%)

Website data (N = 136, p = 0.054)
  No: 42 (63.6%) / 24 (36.4%)
  Yes: 55 (78.6%) / 15 (21.4%)

Sensor data (N = 136, p = 0.085)
  No: 90 (69.8%) / 39 (30.2%)
  Yes: 7 (100.0%) / 0 (0.0%)

Images (N = 136, p = 0.020)
  No: 83 (77.6%) / 24 (22.4%)
  Yes: 14 (48.3%) / 15 (51.7%)

Appendix 4: Cronbach's Alpha Test on the Five Pillars of the Big Data Analytics Strategic Framework for SMEs

Each pillar lists the number of responses included and its Cronbach's alpha; each barrier shows the alpha if that item is removed.

Business (N = 76, alpha = 0.37)
  Financial barriers: N/A^a
  Lack of business cases: N/A^a

Environmental (N = 65, alpha = 0.65)
  Ethical concerns in data use: 0.71
  Inability to assess and address digital risks: 0.56
  Regulatory issues: 0.49
  The lack of common standards: 0.56

Human (N = 85, alpha = 0.46)
  Lack of in-house data analytics expertise: N/A^a
  Shortage of consultancy services: N/A^a

Organisational (N = 67, alpha = 0.86)
  Change management: 0.84
  Cultural barriers: 0.84
  Insufficient volumes of data to be analysed: 0.84
  Lack of managerial awareness and skills: 0.85
  Lack of top management support: 0.82
  Management of technology: 0.85
  Talent management: 0.85

Technological (N = 64, alpha = 0.91)
  Complexity of data: 0.88
  Data scalability: 0.89
  Data silos: 0.90
  Infrastructure readiness: 0.89
  Lack of suitable software: 0.89
  Poor data quality: 0.89

^a Not applicable because Cronbach's alpha requires at least two barriers to be calculable

References

1. Alkhoraif A, Rashid H, MacLaughlin P (2018) Lean implementation in small and medium enterprises: literature review, p 100089. https://doi.org/10.1016/j.orp.2018.100089
2. Bianchini M, Michalkova V (2019) OECD SME and entrepreneurship papers no. 15. Data analytics in SMEs: trends and policies. https://doi.org/10.1787/1de6c6a7-en
3. Boone HN, Boone DA (2012) Analyzing Likert data. J Extension 50(2)
4. Coleman S et al (2016) How can SMEs benefit from big data? Challenges and a path forward. Qual Reliab Eng Int 32(6):2151–2164. https://doi.org/10.1002/qre.2008
5. Companies House (2020) Companies House - GOV.UK. Available at: https://www.gov.uk/government/organisations/companies-house. Accessed 1 May 2020
6. Côrte-Real N, Oliveira T, Ruivo P (2017) Assessing business value of big data analytics in European firms. J Bus Res 70:379–390. https://doi.org/10.1016/j.jbusres.2016.08.011
7. European Commission (2020) 2019 SBA fact sheet: Ireland. European Commission, pp 1–19. Available at: https://ec.europa.eu/docsroom/documents/32581/attachments/21/translations/en/renditions/native
8. Evans JR, Mathur A (2005) The value of online surveys. Internet Res 15(2):195–219. https://doi.org/10.1108/10662240510590360
9. Evans JR, Mathur A (2018) The value of online surveys: a look back and a look ahead. Internet Res 854–887. https://doi.org/10.1108/IntR-03-2018-0089
10. FluidSurveys (2020) Survey sample size calculator - FluidSurveys. Available at: http://fluidsurveys.com/university/survey-sample-size-calculator/. Accessed 18 Aug 2020
11. Heckathorn DD (2011) Comment: snowball versus respondent-driven sampling. Sociol Methodol 41:355–366. Available at: http://www.jstor.org/stable/41336927
12. Hill R (1998) What sample size is "enough" in internet survey research? Interpers Comput Technol Electron J 21st Century 6(3):1–10. Available at: http://cadcommunity.pbworks.com/f/whatsamplesize.pdf. Accessed 20 Aug 2020
13. Kirchherr J, Charles K (2018) Enhancing the sample diversity of snowball samples: recommendations from a research project on anti-dam movements in Southeast Asia. PLoS ONE 13(8):e0201710. https://doi.org/10.1371/journal.pone.0201710
14. de Leeuw ED, Hox JJ, Boevé A (2016) Handling do-not-know answers: exploring new approaches in online and mixed-mode surveys. Soc Sci Comput Rev 34(1):116–132. https://doi.org/10.1177/0894439315573744
15. MacCallum RC et al (1999) Sample size in factor analysis. Psychol Methods 4(1):84–99. https://doi.org/10.1037/1082-989X.4.1.84
16. Mathew B (2016) How big data is reducing costs and improving performance in the upstream industry. Available at: https://www.worldoil.com/news/2016/12/13/how-big-data-is-reducing-costs-and-improving-performance-in-the-upstream-industry. Accessed 5 Dec 2019
17. Mikalef P et al (2019) Big data analytics and firm performance: findings from a mixed-method approach. J Bus Res 98:261–276. https://doi.org/10.1016/j.jbusres.2019.01.044
18. Moore GC, Benbasat I (1991) Development of an instrument to measure the perceptions of adopting an information technology innovation. Inf Syst Res 2(3):192–222. https://doi.org/10.1287/isre.2.3.192
19. Pearlson K (2001) Managing and using information systems: a strategic approach. Wiley, New York, 278 p. Available at: https://catalog.hathitrust.org/Record/004209999
20. Rhodes C (2019) Briefing paper 2019: business statistics. UK statistics: a guide for business users. https://doi.org/10.4324/9780429397967-7
21. Saunders M, Lewis P, Thornhill A (2012) Research methods for business students, 6th edn. Pearson, Harlow. Available at: https://www-dawsonera-com.focus.lib.kth.se/readonline/9780273750802. Accessed 7 May 2020
22. Sena V et al (2019) Big data and performance: what can management research tell us? Br J Manage 30(2):219–228. https://doi.org/10.1111/1467-8551.12362
23. Soroka A et al (2017) Big data driven customer insights for SMEs in redistributed manufacturing. Procedia CIRP 692–697. https://doi.org/10.1016/j.procir.2017.03.319
24. Tornatzky LG, Fleischer M, Chakrabarti AK (1990) The processes of technological innovation. Issues in organization and management series. Lexington Books. https://doi.org/10.1016/j.cca.2013.05.002
25. Walliman N (2017) Research methods: the basics. Taylor & Francis Group, Florence. Available at: http://ebookcentral.proquest.com/lib/staffordshire/detail.action?docID=5015633
26. Walsh J (2017) How a small company used big data to increase its sales: an Australian pool company mined customer insights on a budget and turned its tides. Available at: www.imd.org. Accessed 2 May 2020
27. Willetts M, Atkins AS, Stanier C (2020a) A strategic big data analytics framework to provide opportunities for SMEs. In: 14th international technology, education and development conference, pp 3033–3042. https://doi.org/10.21125/inted.2020.0893
28. Willetts M, Atkins AS, Stanier C (2020b) Barriers to SMEs adoption of big data analytics for competitive advantage. In: The fourth international conference on intelligent computing in data sciences (ICDS2020), Fez, Morocco
29. Williams B, Onsman A, Brown T (2010) Exploratory factor analysis: a five-step guide for novices. J Emerg Primary Health Care 8(3):1–13. https://doi.org/10.33151/ajp.8.3.93
30. Yusof MM et al (2008) An evaluation framework for health information systems: human, organization and technology-fit factors (HOT-fit). Int J Med Inform 77(6):386–398. https://doi.org/10.1016/J.IJMEDINF.2007.08.011

Post-quantum Cryptography: A Brief Survey of Classical Cryptosystems, Their Fallacy and the Advent of Post-quantum Cryptography, with a Deep Insight into the Hash-Based Signature Scheme

Sawan Bhattacharyya and Amlan Chakrabarti

Abstract Cryptography is the art of writing secret text. Using cryptography, human-readable text is converted into ciphertext (coded text) by means of a consistent algorithm. The essence of a good cryptosystem lies in consistent, easy-to-understand encryption and decryption algorithms. Before the advent of quantum computers, most cryptographic algorithms depended on principles of number theory, abstract algebra and probability theory, mainly the factorization problem and the discrete logarithm problem. With the advances in quantum computing, these classical cryptosystems are no longer secure, which poses a threat to global security. In this brief survey, we look at some of the principal classical cryptosystems, their fallacy, and the advance of post-quantum cryptography, especially the hash-based signature scheme. The review also examines the difference between post-quantum cryptography and quantum cryptography.

1 Introduction

Privacy has been an indispensable part of human civilization. From the very first phases of communication, humans have focused on privacy in order to share valuable information and facts. The prime objective has not changed since then: no one except the parties directly involved in the communication should be able to access the information. Modern cryptosystems make it possible to communicate even over insecure channels. The first known evidence of the use of cryptography can be traced back to 1900 BC, in an inscription carved in the main chamber of the tomb of the nobleman Khnumhotep II in Egypt.


The human-readable message is called plaintext. The plaintext is encrypted by a consistent (encryption) algorithm into a coded text known as ciphertext, using keys which can be either private or public. The ciphertext is decrypted by a consistent (decryption) algorithm back into the original plaintext, using either the same keys used during encryption or other keys. The keys must be exchanged safely between the parties involved in the communication using some protocol; the most famous, "New Directions in Cryptography" (IEEE, 1976), is widely known as the Diffie–Hellman key exchange protocol. It is worth mentioning that both the encryption and decryption algorithms are public; the only secret is the key, which makes the need for a secure key exchange protocol clear. There are three cryptographic techniques based on the key and encryption:

1. Private key cryptosystems: a single secret key shared between the two parties.
2. Public key cryptosystems: two keys, one secret and one public.
3. Hash functions: one-way cryptographic hash functions.

Before the development of quantum computers, much of cryptography was built on number theory and abstract algebra, and the security of the pre-quantum cryptographic era relies on two main mathematical problems:

1. Factoring problem: factoring the product of two large primes.
2. Discrete logarithm problem: given A = g^b mod p, where A comes from some group with generator g, find the value of b.

In 1994, however, Peter Shor of Bell Laboratories published the paper "Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer", popularly known as Shor's algorithm [5], which showed that a quantum computer, using the principles of quantum superposition and quantum entanglement, can solve both of these mathematical problems [14]. The security of our classical cryptosystems rests on the hardness of these two problems, but they can now be solved even when the numbers involved are quite large; quantum computers therefore put our pre-existing cryptosystems, and consequently global security, at risk.

Private key encryption offers no key exchange mechanism of its own, which is a major problem: if a communication network has n users, then (n choose 2) = n(n - 1)/2 secret keys are needed, which is not feasible when n is large (for n = 1000 users this is already 499,500 keys); hence private key encryption on its own is impractical for key establishment. Quantum computers, however, seem to have very little effect on private key encryption methods: the Advanced Encryption Standard (AES) is believed to be safe against quantum attacks if somewhat larger keys are chosen, and Shor's algorithm has never been applied to AES. There also exists another quantum algorithm, Grover's algorithm, but it is not as dramatically fast as Shor's algorithm, and cryptographers can easily compensate for it by choosing a somewhat larger key size.


Grover's algorithm provides a quadratic speedup for quantum search in comparison with the search algorithms available on a classical computer. The practical relevance of Grover's algorithm is not as clear-cut as that of Shor's algorithm, but if it becomes practically relevant, doubling the key size will suffice. Furthermore, it has been shown that an exponential speedup for quantum search is impossible [5], which suggests that symmetric key algorithms and hash-based cryptography can be safely used in the quantum era. The security of public key cryptosystems relies on the use of trapdoor one-way functions, as proposed by Diffie and Hellman in 1976: we need a function that is easy to compute in one direction (anybody can encrypt a message) and hard to invert unless we possess some additional information, called a trapdoor. Given that our pre-existing classical cryptosystems are effectively dead, there is a need for new cryptosystems that are resistant to both quantum and classical attacks [15]. The following four classes of cryptosystem are quite promising:

1. Hash-based signature schemes
2. Lattice-based cryptosystems
3. Multivariate cryptosystems
4. Code-based cryptosystems

Apart from these four classes, a few more cryptosystems are resistant to quantum attack; one such cryptosystem involves the evaluation of isogenies of supersingular elliptic curves [6]. The discrete logarithm problem on an elliptic curve can easily be solved using a quantum computer, but not the isogeny problem on a supersingular elliptic curve.

2 Literature Review

2.1 Classical Cryptosystems

The classical cryptosystems are primarily based on the factorization and discrete logarithm problems. There are three cryptographic techniques based on the key and encryption:

1. Private key cryptosystems: there is a single key, used in both the encryption and decryption processes and kept secret between the receiver and the sender; this is also called symmetric key cryptography. Examples are the Caesar cipher, substitution ciphers, the Playfair cipher, the Vigenère cipher and AES-128/192/256 (refer Fig. 1). The data is encrypted as well as decrypted using the shared secret key.


Fig. 1 Private key cryptosystem

Fig. 2 Public key cryptosystem

2. Public key cryptosystems: there is a pair of keys, one private and one public, held by both the receiver and the sender. Encryption is done using the receiver's public key and decryption using the receiver's private key; this is also called asymmetric key cryptography. Examples are RSA, Knapsack, elliptic curve cryptography, the McEliece cryptosystem, NTRU and the Lamport signature scheme (refer Fig. 2). The data is encrypted using the receiver's public key and decrypted using the receiver's private key; the two keys are distinct but mathematically related.

3. Hash functions: hash functions map data of arbitrary size to data of fixed size, and the values returned by a hash function are called hash values, hash codes, digests, or simply hashes. Hashing is a method of one-way encryption, e.g., SHA-256, XMSS, Leighton–Micali (LMS), BPQS (refer Fig. 3). The one-way cryptographic hash function in Fig. 3 maps the original image of a cat to a seemingly random hash.


Fig. 3 Cryptographic hash function

Much of classical cryptography is based on two mathematical problems which were perfectly secure before the advent of quantum computing and of the famous Shor's algorithm, which revolutionized the era of computing but also put global security at risk. The two main mathematical problems at the root of classical cryptosystems are:

1. Factoring problem: given n = p * q, where n is made public and p and q are two large primes, factoring n into p and q is quite infeasible for a classical computer using bits. The famous RSA scheme, developed by Rivest, Shamir and Adleman at MIT in the paper "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems", relies on this factoring problem. Consider the RSA scheme and how its security rests entirely on the factorization problem, starting with the key generation step. In any public key cryptosystem such as RSA, encryption is done with the public key generated by the receiver and decryption with the receiver's private key, so key generation is carried out by the receiver alone. For RSA, key generation, encryption and decryption are carried out through the mechanism in Algorithm 1. The public key is (n, e) and the private key is (p, q, d). It is now obvious that if anyone can recover p and q from n, then φ(n), and consequently d, become known and the whole security is broken. For example, assign 3 and 5 to p and q, so that n = 15. Since n is public, the prime factorization of n reveals that p and q can only be 3 and 5, and RSA is broken. We therefore need two large primes p and q such that there is no practical way to determine p and q from n, i.e., the factorization of n does not reveal p and q.

2. Discrete logarithm problem: given a large prime p, a generator g of the group (Z_p^*, *), where Z_p^* = {x : gcd(x, p) = 1 and x ≠ 0}, and an element A ∈ Z_p^*, finding an integer b with 0 ≤ b ≤ p − 2 such that A = g^b mod p is hard for a classical computer using bits when the numbers involved are large.


Algorithm 1: RSA

(a) Key generation
• Choose two large primes p and q and compute n = p * q.
• Determine Euler's phi function of n, φ(n) = (p − 1) * (q − 1), the number of positive integers less than n that are relatively prime to n.
• Choose an integer e (the encoding number) such that e and φ(n) are relatively prime, i.e., gcd(e, φ(n)) = 1.
• Determine d (the decoding number) as the multiplicative inverse of e modulo φ(n).

(b) Encryption
• Encrypt the message m using the encoding number e as m′ = m^e mod n.

(c) Decryption
• Decrypt the encoded message m′ using the decoding number d by computing (m′)^d mod n = m^{ed} mod n.
• The basic theorem of modular arithmetic relates ed to φ(n): ed ≡ 1 (mod φ(n)), i.e., ed = k·φ(n) + 1 for some integer k. Hence m^{ed} mod n = (m^{φ(n)} mod n)^k · m.
• Applying Euler's theorem, m^{φ(n)} mod n = 1, so (m′)^d mod n = 1^k · m = m, the original message.
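As a concrete illustration, the following is a minimal Python sketch of Algorithm 1 using the toy primes p = 3 and q = 5 mentioned in the text (real RSA uses primes hundreds of digits long); the message value m = 7 and the exponent e = 3 are arbitrary illustrative choices, and the built-in pow() supplies modular exponentiation and (from Python 3.8 onwards) the modular inverse.

# Toy RSA following Algorithm 1 (illustrative only; real RSA uses very large primes).
p, q = 3, 5                  # two (tiny) primes
n = p * q                    # public modulus n = 15
phi = (p - 1) * (q - 1)      # Euler's phi(n) = 8

e = 3                        # encoding number with gcd(e, phi) = 1
d = pow(e, -1, phi)          # decoding number d = e^(-1) mod phi(n) = 3, since 3*3 = 9 = 1 mod 8

m = 7                        # plaintext, 0 <= m < n
c = pow(m, e, n)             # encryption: c = m^e mod n = 13
m_rec = pow(c, d, n)         # decryption: c^d mod n recovers 7

print(n, e, d, c, m_rec)     # 15 3 3 13 7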

The famous Diffie–Hellman key exchange protocol is based on this discrete logarithm principle. Consider the Diffie–Hellman key exchange protocol and how its security depends completely on the discrete log problem, i.e., how keys are shared between two parties without compromising security. Note that Diffie–Hellman is only a protocol for key sharing; it is not a complete cryptosystem with encryption and decryption algorithms for sending secret messages over an insecure channel. The Diffie–Hellman key exchange is carried out through the mechanism in Algorithm 2, where (p, g, A, B) are the public values and (a, b) are the private values.

Algorithm 2: Diffie–Hellman key exchange protocol

(a) One of the two parties chooses a large prime p, so that (Z_p^*, *) is a cyclic group with a generator g; (p, g) is transferred to the other party over the public insecure channel.
(b) The two parties choose secret integers a and b, respectively, with 0 < a, b < p − 1, compute A = g^a mod p and B = g^b mod p, and transfer A and B over the public insecure channel.
(c) Each party now holds two values: the one it computed (A or B) and the one it received (B or A).
(d) Each party computes the common secret key K = B^a mod p = A^b mod p = g^{ab} mod p.
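Below is a minimal Python sketch of Algorithm 2 with the small values used in the Fig. 4 example (p = 11, g = 7). The secret exponents a = 6 and b = 9 are not stated in the paper; they are one illustrative choice consistent with the values A = 4, B = 8 and K = 3 discussed below. Real deployments use primes of thousands of bits.

# Toy Diffie-Hellman following Algorithm 2 (illustrative only).
p, g = 11, 7          # public: prime p and a generator g of Z_p^*

a, b = 6, 9           # private exponents chosen by the two parties (assumed values)
A = pow(g, a, p)      # A = g^a mod p = 4, sent over the insecure channel
B = pow(g, b, p)      # B = g^b mod p = 8, sent over the insecure channel

K1 = pow(B, a, p)     # party 1 computes K = B^a mod p
K2 = pow(A, b, p)     # party 2 computes K = A^b mod p
assert K1 == K2 == 3  # shared secret K = g^(ab) mod p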


Fig. 4 Diffie–Hellman key exchange protocol

Note that in the Diffie–Hellman key exchange protocol there is no need to transmit the secret key itself: the two parties each compute the common secret key from their private values (a, b) and the public values (p, g, A, B). The actual encryption and decryption are then carried out with the common secret key K. The adversary has access only to (p, g, A, B) and cannot decrypt the ciphertext unless he or she obtains the common secret key K, i.e., unless he or she has access to the private values (a, b) (refer Fig. 4). The security of the Diffie–Hellman key exchange protocol rests on the discrete logarithm problem for A or B: the attacker cannot compute a or b given A = g^a mod p or B = g^b mod p. In the example of Fig. 4, the chosen prime is 11 and the generator of the group is 7; A is computed to be 4, B is computed to be 8, and the common key K is calculated to be 3. On any classical computer, both the factorization and the discrete logarithm problems are infeasible to compute.

A closer look at Shor's algorithm shows how it uses the principles of quantum mechanics to find the factors of a large number. Shor's algorithm is based on the fact that the factoring problem can be reduced to finding the period of a function, and this is where quantum computing enters: the actual quantum speedup lies in finding the period of the modular function f_{a,N}(x) with the help of the QFT. Note that not all steps of Shor's algorithm need a quantum computer; only the step of finding the period of the modular function f_{a,N}(x) = a^x mod N, for x ∈ N with x ≤ q, needs a quantum computer, with its ability to be in a superposition, in order to evaluate f_{a,N}(x) for all needed x when N is large (about 100 digits long). The basis of finding the period of a periodic function can be traced back to Simon's periodicity algorithm, which is all about finding a pattern in a periodic function.


Table 1 Simon's periodic table for n = 3 and c = 101

Y     C     X = Y ⊕ C
000   101   101
001   101   100
010   101   111
011   101   110
100   101   001
101   101   000
110   101   011
111   101   010

2.2 Simon's Periodicity Algorithm

Simon's algorithm is a combination of classical and quantum procedures. Suppose we are given a function f : {0, 1}^n → {0, 1}^n that can be evaluated but is treated as a black box, and that there exists a hidden binary string c = c_0 c_1 c_2 ... c_{n−1} such that

f(x) = f(y)  if and only if  y ⊕ c = x,  for all x, y ∈ {0, 1}^n   (1)

where ⊕ is the bit-wise exclusive-or operation. Thus the function repeats itself in a pattern determined by c; c is called the period of the function, and the goal of Simon's algorithm is to find c (refer Table 1).

Example. Let us work out an example with n = 3 and c = 101; Table 1 lists the corresponding values of x, y and c. One might think that finding the period of a periodic function is an easy task. That is true only for slowly varying continuous periodic functions (e.g., the sine function), whose values at a small sample of points within a period give powerful clues about what the period might be. A periodic function that returns apparently random values from one integer to the next within the period c gives no such hint for determining c [9].
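The promise in Eq. (1) can be illustrated classically with a toy black box; the following Python sketch assumes an illustrative function f (my own construction, constant on the pairs {x, x ⊕ c}) with hidden string c = 101, and only checks the structure of Table 1. The quantum part of Simon's algorithm, which recovers c with only O(n) queries, is not simulated here.

# Classical illustration of Simon's promise f(x) = f(y) iff y = x XOR c (n = 3, c = 101).
n, c = 3, 0b101

def f(x):
    # Toy black box: map each pair {x, x XOR c} to a common value.
    return min(x, x ^ c)

# Every input collides exactly with its partner x XOR c, as in Table 1.
for x in range(2 ** n):
    assert f(x) == f(x ^ c)

# Classically, recovering c means hunting for a colliding pair, which needs
# exponentially many queries in n in general.
pairs = sorted({(min(x, x ^ c), max(x, x ^ c)) for x in range(2 ** n)})
print(pairs)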

2.3 Shor's Algorithm

Peter Shor of AT&T Labs invented a quantum algorithm for efficiently finding the prime factors of a large number: the best known classical factoring algorithms take super-polynomial time in log N, whereas Shor's algorithm runs in time polynomial in log N. Keep in mind that Shor's algorithm is for factoring composite integers; there are polynomial-time algorithms for integer multiplication, but no classical polynomial-time algorithm is known for factoring.


Table 2 Available information with the sender and receiver

Sender: m; e; n (n = p * q); m′ ≡ m^e mod n.
Receiver: m; e and d satisfying ed ≡ 1 mod φ(n); p and q; m ≡ (m′)^d mod n.

Quantum algorithms are faster than classical ones for this problem and are based on quantum superposition and the quantum Fourier transform (QFT). The running time of factoring on a classical computer is of exponential order, O(exp[L^{1/3} (log L)^{2/3}]), whereas on a quantum computer it is of polynomial order, O(L^3), where L is the number of bits of the integer N. Before looking at how quantum computing can efficiently calculate periods, let us examine how knowing the period leads to the breakage of the factorization and discrete logarithm problems. Apart from deriving the factors from the period (discussed in detail in the upcoming sections), there is an alternative way to decrypt the ciphertext by exploiting knowledge of the period. It helps first to recall the information involved in encryption and decryption: Table 2 summarizes the information available to the communicating parties, and Table 3 summarizes the breaking of RSA using quantum computing. The key generation step is carried out by the receiver, who chooses the encoding number e to have an inverse d modulo φ(n), where d is called the decoding number. Since m′ is a power of m and vice versa, each has the same order in the group G_n, say r. The receiver has chosen e so that it has no factor in common with φ(n); since the encoded message m′ lies in G_n, its order r is a factor of the order (p − 1)(q − 1) = φ(n) of G_n, so e has no factor in common with r either, and therefore e has an inverse, say d′, modulo r. Writing ed′ = 1 + ar for some integer a,

(m′)^{d′} ≡ m^{e d′} ≡ m^{1 + ar} ≡ m  (mod N)

where every member v of a finite group G_N is characterized by its order r, the smallest positive integer for which v^r ≡ 1 mod N [10]. Shor's algorithm depends on two mathematical principles.


Table 3 Breaking RSA using quantum computing

Adversary without a quantum computer: m′; e; n (n = p * q).
Adversary with a quantum computer: m′; e; n (n = p * q); the quantum computer calculates the order r of m′, i.e., (m′)^r ≡ 1 mod n; a classical computer computes d′ such that ed′ ≡ 1 mod r; then (m′)^{d′} ≡ m^{ed′} ≡ m^{1 + ar} ≡ m mod N.

1. Modular arithmetic: the basis of finding the pattern, or period, of a periodic function was covered in the previous section on Simon's periodicity algorithm. In Shor's algorithm, one has to calculate the period of the modular function f_{a,N}(x) = a^x mod N, for x ∈ N with x ≤ q, where a is co-prime to the given number N and less than N, i.e., it has no nontrivial factor in common with N. This can be tested with Euclid's algorithm by computing gcd(a, N): if the gcd is 1, a is co-prime to N and we proceed; if it is not 1, we have already found a factor of N and we are done. A theorem of classical number theory states that for any a co-prime to N, the modular function f_{a,N}(x) = a^x mod N outputs 1 for some r < N, where r is called the period of the function. After hitting 1, the values of the function simply repeat, i.e., f_{a,N}(r + s) = f_{a,N}(s).

2. Quantum Fourier transform (QFT): the heart of Shor's algorithm is a super-fast quantum Fourier transform, which can be carried out by a spectacularly efficient quantum circuit built entirely out of 1-qubit and 2-qubit gates. It is a linear transformation on qubits and is the quantum analogue of the discrete Fourier transform (DFT), which converts a finite sequence of equally spaced samples of a function into a same-length sequence of samples of the discrete-time Fourier transform, a complex-valued function of frequency. The DFT maps a sequence of N complex numbers {x_n} = x_0, x_1, ..., x_{N−1} into another sequence of complex numbers {y_k} = y_0, y_1, ..., y_{N−1}, defined by

y_k = Σ_{n=0}^{N−1} x_n e^{−i 2π k n / N} = Σ_{n=0}^{N−1} x_n [cos(2π k n / N) − i sin(2π k n / N)]   (2)

The quantum Fourier transform acts on a quantum state Σ_{i=0}^{N−1} x_i |i⟩ and maps it to the quantum state Σ_{i=0}^{N−1} y_i |i⟩ according to the formula

y_k = (1/√N) Σ_{j=0}^{N−1} x_j e^{i 2π k j / N}   (3)

Note that only the amplitudes of the state are affected by this transformation. We are now in a position to collect these facts and describe Shor's algorithm, using the odd integer n = 15 as an example and applying the quantum Fourier transform.

1. Determine with an efficient classical algorithm whether the given number is even, a prime, or an integral power of a prime; if so, Shor's algorithm is not used.
2. Choose an integer q such that n^2 < q < 2n^2, say 256 (225 < 256 < 450).
3. Choose a random integer x such that gcd(x, n) = 1, say 7 (7 and 15 are co-prime).
4. Create two quantum registers, an input register and an output register, which must be entangled so that a collapse of the output register corresponds to a collapse of the input register.
   • Input register: must contain enough qubits to represent numbers as large as q − 1; for q = 256 we need 8 qubits.
   • Output register: must contain enough qubits to represent numbers as large as n − 1; for n = 15 we need 4 qubits.
5. Load the input register with an equally weighted superposition of all integers from 0 to q − 1, i.e., 0 to 255.
6. Load the output register with the initial state |0⟩. The total state of the system at this point is (1/√256) Σ_{a=0}^{255} |a, 0⟩.
7. Apply the modular function f_{x,n}(a) = x^a mod n to each number in the input register, storing the result of each computation in the output register. Table 4 summarizes the states of the input and output registers for n = 15, q = 256, x = 7.
8. Applying the modular function and storing the register contents in a table is fine, but it is not efficient when N is large; this is where the principles of quantum mechanics are used to calculate the period. Measure the output register; this collapses the superposition to just one of the values produced by the modular function, call it c. Our output register will collapse to one of |1⟩, |4⟩, |7⟩ or |13⟩. For simplicity, assume it collapses to |1⟩.
9. Since the two registers are entangled, measuring the output register directly affects the state of the input register, which partially collapses into an equal superposition of all states between 0 and q − 1 that yield c, where c is the value of the collapsed output register.


Table 4 States of the input and output registers after applying the modular function f_{7,15}(a) = 7^a mod 15 to the input state a (x = 7, n = 15)

Input register    f_{7,15}(a) = 7^a mod 15    Output register
|0⟩               7^0 mod 15                  1
|1⟩               7^1 mod 15                  7
|2⟩               7^2 mod 15                  4
|3⟩               7^3 mod 15                  13
|4⟩               7^4 mod 15                  1
|5⟩               7^5 mod 15                  7
|6⟩               7^6 mod 15                  4
|7⟩               7^7 mod 15                  13

In our example the output register collapses to |1⟩, and the input register then collapses to (1/√64)|0⟩ + (1/√64)|4⟩ + (1/√64)|8⟩ + (1/√64)|12⟩ + ... The amplitudes in this case are 1/√64, since the register is now in an equal superposition of 64 values (0, 4, 8, 12, …, 252). Now apply the quantum Fourier transform (QFT) to the partially collapsed input register. The QFT takes an input state |a⟩ and transforms it into

(1/√q) Σ_{c=0}^{q−1} e^{i 2π a c / q} |c⟩

The final state after applying the QFT is (1/√m) Σ_{a∈A} |a⟩ |c⟩, where m is the cardinality of the set A of all inputs that the modular function maps to the value of the collapsed output register, and where each |a⟩ stands for

(1/√q) Σ_{c=0}^{q−1} e^{i 2π a c / q} |c⟩

In our example q = 256, c = 1, A = {0, 4, 8, 12, …, 252} and m = |A| = 64, so the final state of the input register after the QFT is

(1/√64) Σ_{a∈A} (1/√256) Σ_{w=0}^{255} e^{i 2π a w / 256} |w⟩, |1⟩

The QFT essentially peaks the probability amplitudes at integer multiples of q/r, where r is the desired period; in our case r = 4 and the peaks are at |0⟩, |64⟩, |128⟩, |192⟩, so we no longer have an equal superposition of states. Measure the state of register one and call this value t; with high probability it is a multiple of q/r.


With our knowledge of q and t, there are several methods of calculating the period (one is the continued-fraction expansion of the ratio between t and q).

10. Now that we have the period, the factors of N can be determined by taking the greatest common divisors of N with x^{r/2} + 1 and x^{r/2} − 1; this computation is done on a classical computer. In our example r = 4, so gcd(7^2 + 1, 15) = 5 and gcd(7^2 − 1, 15) = 3, and the factors of 15 are 5 and 3.
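The peaking behaviour used in steps 8–10 can be checked numerically without any quantum simulation. The following illustrative Python sketch (the function name dft and the tolerance are my own choices) computes the DFT of the collapsed input register of the example, a uniform superposition over 0, 4, 8, …, 252 with q = 256, and confirms that all the probability lands on multiples of q/r = 64.

# DFT of the post-measurement input register: amplitude 1/sqrt(64) on multiples of r = 4.
import cmath

q, r = 256, 4
amp = [1 / 64 ** 0.5 if a % r == 0 else 0.0 for a in range(q)]   # collapsed register

def dft(x):
    # Same convention as Eq. (3): y_k = (1/sqrt(N)) * sum_j x_j * exp(i*2*pi*k*j/N)
    N = len(x)
    return [sum(x[j] * cmath.exp(2j * cmath.pi * k * j / N) for j in range(N)) / N ** 0.5
            for k in range(N)]

probs = [abs(y) ** 2 for y in dft(amp)]
peaks = [k for k, pr in enumerate(probs) if pr > 1e-9]
print(peaks)                          # [0, 64, 128, 192]: multiples of q/r
print(sum(probs[k] for k in peaks))   # ~1.0: all probability sits on the peaks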

Algorithm 3: Shor's algorithm
Result: factors of n
Choose a random integer x < n;
if gcd(x, n) = 1 then
  use the QFT to determine the unknown period r of the modular function f_{x,n}(a) = x^a mod n, a ∈ N;
  if r is even then
    use Euclid's algorithm to calculate gcd(x^{r/2} + 1, n) and gcd(x^{r/2} − 1, n);
  else
    choose a new x;
  end
else
  choose a new x;
end
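For illustration, here is a minimal Python sketch of the classical skeleton of Algorithm 3 (the helper names find_order and shor_factor are mine). The order r is found by brute force, which is exactly the step a quantum computer would replace with the QFT; the sketch also returns immediately when the random x already shares a factor with n, the shortcut noted in the text.

# Classical skeleton of Shor's algorithm; only the order-finding step would be quantum.
from math import gcd
from random import randrange

def find_order(x, n):
    # Smallest r > 0 with x^r = 1 (mod n); exponential classically, polynomial with the QFT.
    r, y = 1, x % n
    while y != 1:
        y = (y * x) % n
        r += 1
    return r

def shor_factor(n):
    while True:
        x = randrange(2, n)
        g = gcd(x, n)
        if g != 1:
            return g, n // g                 # lucky draw: x already shares a factor with n
        r = find_order(x, n)
        if r % 2 == 0 and pow(x, r // 2, n) != n - 1:
            a = gcd(pow(x, r // 2) + 1, n)
            b = gcd(pow(x, r // 2) - 1, n)
            if 1 < a < n:
                return a, n // a
            if 1 < b < n:
                return b, n // b
        # otherwise choose a new x, as in Algorithm 3

print(shor_factor(15))   # e.g. (5, 3); for x = 7, r = 4 and gcd(7^2 + 1, 15) = 5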

2.4 The Fallacy of the Classical Cryptosystems and the Advent of the Quantum Computer

Having discussed the Shor and Simon algorithms in detail, we are now in a position to explain how a quantum computer employs the principles of quantum mechanics in computation, and how these principles support the quantum Fourier transform that forms the heart of Shor's algorithm and makes it possible to factorize large numbers that would otherwise be infeasible for a classical computer, which does not employ the principles of quantum mechanics. Before that, it is important to discuss three things.

1. Quantum bits (qubits): classical bits, the smallest units of information in a classical computer, can exist in only two binary states, 0 or 1, i.e., true or false. A physical implementation of a bit could use two energy levels of an atom: an excited atom, with an electron in a higher energy level than the ground state, denotes the 1 state, and the atom in its ground state, with all electrons in their ground states, denotes the 0 state. These states are written in Dirac ket notation: state 0 is denoted |0⟩ and state 1 is denoted |1⟩. Quantum mechanics allows a superposition of |0⟩ and |1⟩, each with its own amplitude. This is what is called a qubit: a qubit is a superposition of |0⟩ and |1⟩.


• 1 qubit: a superposition of 2 possible states, |0⟩ and |1⟩
• 2 qubits: a superposition of 4 possible states, |00⟩, |01⟩, |10⟩ and |11⟩
• 3 qubits: a superposition of 8 possible states, |000⟩, |001⟩, |010⟩, |011⟩, …
• n qubits: a superposition of 2^n possible states, described by a "wavefunction", i.e., a vector of 2^n amplitudes

2. Quantum superposition: quantum superposition is a fundamental principle of quantum mechanics; it states that any two quantum states can be added, or superimposed, and the result is another valid quantum state. A pure qubit state is a coherent superposition of the basis states, which means that a single qubit can be described by a linear combination of |0⟩ and |1⟩:

|ψ⟩ = α|0⟩ + β|1⟩   (4)

where α and β are probability amplitudes and, in general, complex numbers. According to the Born rule, the probability of outcome |0⟩ (value 0) is |α|^2 and the probability of outcome |1⟩ (value 1) is |β|^2. Because the absolute square of an amplitude is a probability, it must hold that |α|^2 + |β|^2 = 1. Note that a qubit in superposition does not have a value "between" 0 and 1; rather, there is a probability |α|^2 that it is found in state 0 and a probability |β|^2 that it is found in state 1. In other words, superposition means there is no way, even in principle, to tell which of the two possible states forming the superposition pertains.

3. Quantum entanglement: quantum entanglement is a physical phenomenon that occurs when a pair or group of particles is generated or interacts in such a way that the quantum state of each particle cannot be described independently of the state of the others. The simplest system displaying quantum entanglement is a system of two qubits; for two entangled qubits in the state

(1/√2) (|00⟩ + |11⟩)

called an equal superposition, there are equal probabilities of measuring the product state |00⟩ or |11⟩, since |1/√2|^2 = 1/2; i.e., there is no way to tell whether the first qubit has the value 0 or 1, and the same holds for the second qubit.
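As a small numerical illustration of Eq. (4) and the Born rule, the following Python fragment uses an arbitrary pair of amplitudes (not taken from the paper) and checks the normalisation condition.

# A single-qubit state |psi> = alpha|0> + beta|1> and its Born-rule probabilities.
import math

alpha = 1 / math.sqrt(3)                 # arbitrary illustrative amplitude
beta = complex(1, 1) / math.sqrt(3)      # amplitudes may be complex

p0 = abs(alpha) ** 2                     # probability of measuring 0
p1 = abs(beta) ** 2                      # probability of measuring 1
assert abs(p0 + p1 - 1) < 1e-12          # normalisation |alpha|^2 + |beta|^2 = 1
print(p0, p1)                            # 0.333..., 0.666...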


Now that we have discussed the QFT, Simon's algorithm, quantum entanglement and quantum superposition, it is time to bring everything together and state formally why the classical cryptosystems, which fundamentally rely on the factoring problem and the discrete log problem, are no longer safe in the post-quantum era. This paper has focused primarily on the factorization problem, using Shor's period-finding algorithm to break the heavily used and well-renowned RSA; Shor's algorithm can also be modified to solve the discrete logarithm problem using hidden subgroup problems and group homomorphisms. RSA relies on the problem of factoring the product of two large primes (here p and q). Shor's algorithm, in its final step, calculates the factors as gcd(x^{r/2} + 1, n) = a and gcd(x^{r/2} − 1, n) = b, where a and b are the factors and r is the period of the modular function f_{x,n}(a) = x^a mod n, 0 ≤ a < q, with x a random integer less than n. To calculate the period r, instead of proceeding classically by exhaustive search, which is not feasible when the chosen p and q are large, we proceed with the quantum Fourier transform. The QFT is applied to the partially collapsed input register to convert its final state to

(1/√m) Σ_{a∈A} (1/√q) Σ_{w=0}^{q−1} e^{i 2π a w / q} |w⟩, |c⟩

where m is the cardinality of the set A of all inputs that the modular function maps to the value of the collapsed output register, q satisfies n^2 ≤ q ≤ 2n^2, and c is the value of the collapsed output register. The QFT peaks the probability amplitudes at integer multiples of q/r, where r is the desired period. Measuring register one gives a value t, and by continued-fraction expansion of the ratio of t and q one can find the period; this last step can be done classically and does not really depend on the quantum computer. The only genuinely quantum step is the peaking of the probability amplitudes at integer multiples of q/r in the collapsed input register, which happens only because of the entanglement between the input and output registers, something not possible on a classical computer implementing bits. Likewise, the collapse of the output register to just one of the superimposed values after applying the modular function, and the loading of the input register with an equally weighted superposition of all values between 0 and q − 1, would not be possible without qubits and the principle of quantum superposition, since classical bits cannot exist as a superposition of 0 and 1 (refer Fig. 5).


Fig. 5 Fallacy of the classical cryptosystem

2.5 Post-quantum Cryptosystems

1. Hash-based signature schemes: hash-based cryptographic approaches, which rely on a hash function for their security, are so far limited to digital signature schemes. A digital signature scheme provides a means of authenticating the sender rather than protecting the secrecy of the message. The entire security relies on the underlying hash functions.

2. Lattice-based cryptosystems: lattice-based cryptosystem is the generic term for cryptographic primitives that involve lattices, either in the construction of the cryptosystem itself or in its security argument [11]. Mathematically, a lattice is the subset of R^n consisting of all integer linear combinations of basis vectors b_1, b_2, ..., b_n ∈ R^n, thus

L = { Σ_i a_i b_i : a_i ∈ Z }   (5)

where the b_i are linearly independent vectors over R and the ordered n-tuple (b_1, b_2, ..., b_n) is a basis of the lattice. The security of lattice-based cryptosystems rests on the hardness of certain lattice problems. Some general lattice problems used in cryptographic primitives are (a) the shortest-vector problem, (b) the closest-vector problem, and (c) the shortest independent vectors problem.
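To make Eq. (5) concrete, the following Python sketch enumerates a few integer combinations of an illustrative two-dimensional basis (the basis vectors are my own example, not taken from the paper) and picks a shortest nonzero vector by brute force over small coefficients, which is easy in two dimensions but intractable in the high dimensions used by real schemes.

# Lattice points L = { a1*b1 + a2*b2 : a1, a2 integers } for an illustrative basis in R^2.
from itertools import product

b1, b2 = (2, 1), (1, 3)                      # example basis vectors (linearly independent)

def combo(a1, a2):
    return (a1 * b1[0] + a2 * b2[0], a1 * b1[1] + a2 * b2[1])

points = [combo(a1, a2) for a1, a2 in product(range(-2, 3), repeat=2)]
print(sorted(points))                        # 25 lattice points around the origin

# The shortest-vector problem asks for a nonzero lattice point of minimum norm.
shortest = min((p for p in points if p != (0, 0)), key=lambda p: p[0] ** 2 + p[1] ** 2)
print(shortest)                              # a shortest nonzero vector (squared norm 5 here)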


3. Multivariate cryptosystems: multivariate cryptosystems are based on multivariate polynomials, i.e., polynomials in more than one indeterminate (variable), usually quadratic, over a finite field F = F_q with q elements [13]. Consider a system of m multivariate quadratic polynomials in n variables:

p^{(1)}(x_1, ..., x_n) = Σ_{i,j=1}^{n} p^{(1)}_{ij} x_i x_j + Σ_{i=1}^{n} p^{(1)}_i x_i + p^{(1)}_0,
p^{(2)}(x_1, ..., x_n) = Σ_{i,j=1}^{n} p^{(2)}_{ij} x_i x_j + Σ_{i=1}^{n} p^{(2)}_i x_i + p^{(2)}_0,
  ⋮
p^{(m)}(x_1, ..., x_n) = Σ_{i,j=1}^{n} p^{(m)}_{ij} x_i x_j + Σ_{i=1}^{n} p^{(m)}_i x_i + p^{(m)}_0   (6)
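To make the notation of Eq. (6) concrete, here is a minimal Python sketch that evaluates a toy MQ system over GF(2) and finds its common zeros by brute force; the coefficients are arbitrary illustrative values (not from any real scheme), and brute force is only feasible because n = 3 here.

# Evaluate a toy MQ system over GF(2): two quadratic polynomials in three variables.
from itertools import product

# p(x) = sum_ij P[i][j]*x_i*x_j + sum_i L[i]*x_i + c, all arithmetic mod 2.
def eval_poly(P, L, c, x):
    quad = sum(P[i][j] * x[i] * x[j] for i in range(len(x)) for j in range(len(x)))
    lin = sum(L[i] * x[i] for i in range(len(x)))
    return (quad + lin + c) % 2

# Illustrative coefficients for p1 and p2.
system = [
    ([[0, 1, 0], [0, 0, 1], [0, 0, 0]], [1, 0, 0], 1),
    ([[1, 0, 1], [0, 0, 0], [0, 0, 0]], [0, 1, 0], 0),
]

# Brute-force search for common zeros: easy for n = 3, infeasible for cryptographic n.
solutions = [x for x in product((0, 1), repeat=3)
             if all(eval_poly(P, L, c, x) == 0 for P, L, c in system)]
print(solutions)   # [(1, 0, 1)] for these coefficients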

The security of multivariate cryptosystems depends on the hardness of the MQ (multivariate quadratic) problem, which can be stated simply as follows: given a system of m multivariate quadratic polynomials p^{(1)}(x), ..., p^{(m)}(x) in n variables x_1, ..., x_n, find a common solution, i.e., a vector x̄ = (x̄_1, ..., x̄_n) such that p^{(1)}(x̄) = ... = p^{(m)}(x̄) = 0. Solving such systems of polynomials (the MQ problem) is proven to be NP-hard over any field and is believed to be hard on average for both classical and quantum computers.

4. Code-based cryptosystems: code-based cryptography is the fourth candidate, along with hash-based digital signature schemes, lattice-based cryptosystems and multivariate cryptosystems, that is believed to be safe from quantum attacks [12]. Formally, code-based cryptosystems are the family of cryptosystems, both symmetric and asymmetric, whose security rests entirely on the hardness of decoding an encoded message in a linear error-correcting code chosen with a particular structure, for instance the Goppa codes used in the well-known McEliece cryptosystem, the first proposed code-based cryptosystem. The original idea behind the McEliece cryptosystem was to use a codeword of a chosen linear error-correcting code (a binary Goppa code) to which random errors are added to obtain the ciphertext (refer Fig. 6).


Fig. 6 McEliece cryptosystem

Error-correcting codes are codes that can be used to detect and correct errors, which may occur either through unavoidable physical alteration during transmission of the message over the channel or through the intervention of an adversary. Correction is made possible by adding extra information, called redundancy, which makes it easier to detect and rectify errors [12]; in coding theory, the redundancy is also called parity, or a parity check. An arbitrary basis of the code (a generator matrix) is used as the public key: the message is encoded with it, and errors are then added to obtain the final ciphertext [2]. Authorized parties can use a fast trapdoor function to remove the errors and decode the intermediate codeword to recover the original message. The security of code-based cryptosystems rests on two computational aspects: (a) the generic decoding problem, which is hard on average even for a quantum computer, and (b) identification of the generator matrix, since the public-key generator matrix is hard to distinguish from a random matrix. Even though McEliece is safe in the pre-quantum era as well, it has not been preferred over RSA because of its large key sizes.

2.6 Signature Schemes

In most of our day-to-day activity, apart from the privacy of the messages or information we share, we also need authentication of the sender. In many cases the privacy of the message is not the prime concern: we do not focus on keeping the message secret but on ensuring that what we receive and accept really comes from the person we expect, and not from an anonymous party. In the physical world we use signatures, where the sender's signature is part of the message and provides authentication of the author. The message need not be private, and the receiver can verify the sender's authenticity by comparing the signature contained in the shared document with a pre-existing copy of the sender's signature, for example on an ID card already held by the receiver; this prevents an adversary from sending false information.


In the electronic setting we achieve the same by signing the document with a secret key available only to the sender, together with a public key that anyone can use to verify the signature, and with additional information such as time and place so that a copied signature is no longer valid. The general signature scheme works as follows:

1. Key generation: the sender develops a pair of keys using some trapdoor one-way function; one is kept secret and the other is made public.
2. Signature generation: the sender signs the document using the secret (private) key to produce the signature of message m, s = d_sk(m), using the decryption function.
3. Signature verification: the receiver verifies the sender's signature by using the sender's public verification key to recover a temporary message from the signature, z = e_pk(s), and comparing it with the available message m.

Note that we are no longer interested in the privacy of the message but in the authentication of the sender.
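These three steps can be instantiated with any trapdoor permutation. The following Python sketch uses a fresh toy RSA key pair purely for illustration (the parameter values are mine); the hashing of the message, which Sect. 2.8 introduces, is omitted here for brevity.

# Toy "sign with the private key, verify with the public key" flow (RSA-style, illustrative).
p, q = 61, 53
n = p * q                          # 3233
phi = (p - 1) * (q - 1)            # 3120
e = 17                             # public verification exponent
d = pow(e, -1, phi)                # private signing exponent

m = 1234                           # message (in practice a hash of the message, see Sect. 2.8)
s = pow(m, d, n)                   # signature generation: s = d_sk(m)
z = pow(s, e, n)                   # signature verification: z = e_pk(s)
assert z == m                      # accept iff the recovered value matches the message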

2.7 Attacks on Signature Schemes

Attacks on signature schemes can be broadly classified into two main categories [2]:

1. Key-only attack: the adversary knows only the sender's public key.
2. Message attack: the adversary knows some messages or signatures in addition to the sender's public key. Message attacks can be further classified as follows:
   • Known message attack: the adversary knows the signatures of a set of messages m_1, m_2, ..., m_n that is not chosen by the adversary.
   • Generic chosen message attack: the adversary can obtain the signatures of a chosen set of messages m_1, m_2, ..., m_n; the messages do not depend on the sender's public key and are chosen before the adversary has seen any signature.
   • Directed chosen message attack: the adversary can obtain the signatures of a message set chosen by him or her before seeing any signature, but the messages may depend on the sender's public key.
   • Adaptive chosen message attack: the adversary can request from the sender signatures that depend on the sender's public key and, additionally, on previously obtained signatures.

There is one more attack, of limited value but not to be ignored, which depends on constructing a fraudulent pair of signature s and message m and using it over the channel to mislead the receiver. The attacker


may arbitrarily choose a value s for the signature and use the sender's public key to derive a false message m = e_pk(s) that was never sent by the sender; (m, s) is then a valid-looking pair that can mislead the receiver. The symptoms of a broken signature scheme are:

1. Total break: the adversary can recover the sender's private key.
2. Universal forgery: the adversary has found an efficient signing algorithm functionally equivalent to the sender's signing algorithm, using equivalent, possibly different, trapdoor information.
3. Selective forgery: the adversary can sign any message of his or her choice.
4. Existential forgery: the adversary can create a valid signature for at least one message chosen by another person.

The invention of Grover's algorithm showed that the classical cryptosystems used to construct signatures are no longer secure, but hash-based cryptography can be used securely in the quantum era, and hash-based cryptosystems, when used in a signature scheme, provide a secure means of authentication even in the quantum era.

2.8 Hash Function

A cryptographic hash function maps strings of arbitrary length to strings of a fixed length n, typically between 128 and 512 bits. For example, SHA-256 (Secure Hash Algorithm 256) is considered one of the most secure hash functions and is believed to be safe against attacks by the most powerful supercomputers and even quantum attacks [2]. A hash function is typically denoted h : {0, 1}^* → {0, 1}^n. Before a hash function can be used in cryptography, it has to resist the following attacks:

1. Preimage: given y = h(x), find a string x′ such that h(x′) = y.
2. Second preimage: given x, find a string x′ ≠ x such that h(x′) = h(x).
3. Collision: find two strings x and x′ with x ≠ x′ such that h(x) = h(x′).

All three properties can be broken by brute force: about 2^n evaluations of h() break the preimage and second-preimage problems, and about 2^{n/2} evaluations of h() break the collision problem. A hash function can be used very efficiently in a signature scheme for author authentication: we simply compute m′ = h(m), where m is the original message, and perform all comparisons on m′. Using a hash function not only prevents the attack in which the adversary creates a false pair, by burdening the adversary with solving the preimage problem, but also deals with the case where the original message is very large.
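As a short illustration with Python's standard hashlib module: inputs of arbitrary length map to 256-bit digests, and any change to the input yields an unrelated digest (the sample strings are arbitrary).

# SHA-256 maps arbitrary-length input to a fixed 256-bit (32-byte) digest.
import hashlib

d1 = hashlib.sha256(b"post-quantum cryptography").hexdigest()
d2 = hashlib.sha256(b"post-quantum cryptography!").hexdigest()  # one character changed

print(len(d1) * 4)      # 256 bits, regardless of input length
print(d1 == d2)         # False: even a tiny change gives an unrelated digest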


Hash functions are believed to be quantum safe, and many of them, for example SHA-256, are also believed to be safe against attacks by supercomputers; thus the security of hash-based signature schemes in the quantum era rests on the security of the underlying hash functions.

2.9 One-Time Signature Schemes

A one-time signature scheme is a set of algorithms in which a key may be used only once, because signing reveals part of the secret signature key. The basic idea is to use a one-way function f : {0, 1}^n → {0, 1}^n and a signature key x made up of n-bit strings; the verification key y, which is made public, is computed from x with the one-way function f(). The signature s is computed from the signature key x and the hash function h(); the message m and the signature s are made public, and verification consists of applying f() to the revealed parts of the signature key x and comparing the result with the verification key y [1]. One further requirement is that the underlying hash function h() must be collision resistant: otherwise a dishonest author could find two different messages m and m′ with h(m) = h(m′), so that m and m′ have the same signature, sign m and later claim to have signed m′ instead. A collision-resistant hash function is therefore an essential requirement for hash-based signature schemes.

2.9.1 Lamport One-Time Signature Scheme

The idea of using a hash function in a digital signature scheme originated with Lamport, who proposed the first practical hash-based signature scheme in personal communication with Diffie. In the algorithm below, x_{i,m_i} denotes the part of the signature key revealed for message bit m_i.

Algorithm 4: Lamport one-time signature scheme
1. Key Generation
• Choose 2q random n-bit strings x_{i,j} for 0 ≤ i < q and j ∈ {0, 1}
• Compute y_{i,j} = f(x_{i,j}) for 0 ≤ i < q and j ∈ {0, 1}
• Authenticate and make public the y_{i,j} for 0 ≤ i < q and j ∈ {0, 1}
The secret key is (x_{0,0}, x_{0,1}, ..., x_{q−1,1}) and the public verification key is (y_{0,0}, y_{0,1}, ..., y_{q−1,1})
2. Signature Generation
Sign the message m by revealing x_{i,m_i} for 0 ≤ i < q
3. Signature Verification
Verify the signature by computing z_i = f(x_{i,m_i}) and checking that z_i = y_{i,m_i} for 0 ≤ i < q
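A minimal Python sketch of Algorithm 4, assuming SHA-256 both as the one-way function f and as the message hash h, and signing the q = 256 bits of the digest; the function and variable names are illustrative, not part of the original scheme description.

```python
import hashlib, secrets

Q = 256  # number of message-digest bits to sign

def f(x: bytes) -> bytes:
    """One-way function f (instantiated here with SHA-256)."""
    return hashlib.sha256(x).digest()

def keygen():
    sk = [[secrets.token_bytes(32) for _ in range(2)] for _ in range(Q)]   # x_{i,j}
    pk = [[f(sk[i][j]) for j in range(2)] for i in range(Q)]               # y_{i,j} = f(x_{i,j})
    return sk, pk

def bits(message: bytes):
    d = hashlib.sha256(message).digest()               # sign the digest m' = h(m)
    return [(d[i // 8] >> (7 - i % 8)) & 1 for i in range(Q)]

def sign(sk, message: bytes):
    return [sk[i][b] for i, b in enumerate(bits(message))]   # reveal x_{i, m_i}

def verify(pk, message: bytes, sig):
    return all(f(sig[i]) == pk[i][b] for i, b in enumerate(bits(message)))

sk, pk = keygen()
sig = sign(sk, b"one-time message")            # the key pair must never sign a second message
print(verify(pk, b"one-time message", sig))    # True
```

As the next paragraph explains, reusing the same key pair for a second message reveals enough secret values for an attacker to assemble signatures on further messages.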


Each key pair can be used to sign at most one message. If two messages are signed with the same key, an attacker may be able to construct a valid signature for a further message from the information revealed by the two signatures. For example, let m_1 = 110 and m_2 = 101 be two messages signed with the same private signature key; the signature for m_1 is (x_{0,1}, x_{1,1}, x_{2,0}) and for m_2 is (x_{0,1}, x_{1,0}, x_{2,1}). From these, a valid signature for the new message m_n = 111, namely (x_{0,1}, x_{1,1}, x_{2,1}), can be assembled, which puts the security of the scheme at risk.

2.9.2 Merkle's Development over Lamport's One-Time Signature Scheme

American computer scientist Ralph Merkle made several improvements to Lamport's one-time signature scheme. The core idea is to use a signature key of the form X = (x_0, x_1, ..., x_{q−1}), where each x_i is a random n-bit string, instead of the pairs (x_{0,0}, x_{0,1}, ..., x_{q−1,1}) used in Lamport's original scheme. A message m is signed in Merkle's improved scheme by revealing all the x_i for which m_i = 1, 0 ≤ i < q [1]. To prevent an attacker from forging a signature for a message whose 1-bits form a subset of the 1-bits of m, the number of 0 bits in m is appended to the end of the original message, and one defines

k' = k + ⌊log2 q⌋ + 1    (7)

for the development of the keys instead of using k as in the original Lamport scheme [3].

Algorithm 5: Merkle's development over Lamport's one-time signature scheme
1. Key Generation
• Choose q' random n-bit strings x_i for 0 ≤ i < q'
• Compute y_i = f(x_i) for 0 ≤ i < q'
• Authenticate and make public the y_i for 0 ≤ i < q'
The secret key is (x_0, x_1, ..., x_{q'−1}) and the public verification key is (y_0, y_1, ..., y_{q'−1})
2. Signature Generation
• Count the number of 0 bits in the message m, call this number a, let ab be the binary representation of a using the additional ⌊log2 q⌋ + 1 bits, and define m' = (m‖ab), where ‖ denotes concatenation
• Sign m by revealing x_i for all i, 0 ≤ i < q', such that m'_i = 1
3. Signature Verification
• Compute a and ab from m and generate m'
• Compute z_i = f(x_i) and compare the result with the public key y_i for all i such that m'_i = 1
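A small sketch of the message encoding used in Algorithm 5 (appending the count of 0 bits); the function names are illustrative. The usage lines reproduce the worked example given below in the text (k = 8, m = 10111100).

```python
from math import floor, log2

def encode(m_bits: str) -> str:
    """Append the zero-bit count a of m (in binary) so no bit can be flipped 1 -> 0 unnoticed."""
    k = len(m_bits)
    a = m_bits.count("0")
    width = floor(log2(k)) + 1             # number of bits reserved for the counter
    return m_bits + format(a, f"0{width}b")

def signed_indices(m_bits: str):
    """Indices i with m'_i = 1; the signer reveals x_i exactly for these positions."""
    return [i for i, b in enumerate(encode(m_bits)) if b == "1"]

print(encode("10111100"))          # 101111000011  (a = 3 zero bits, appended as 0011)
print(signed_indices("10111100"))  # [0, 2, 3, 4, 5, 10, 11] -> reveal x0,x2,x3,x4,x5,x10,x11
```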


Here x_i is the revealed part of the secret key, and the message m is concatenated with the number of its 0 bits. If a message had more 0 bits, the binary representation of a (the number of 0 bits in the original m) would contain additional 1 bits, so an attacker cannot sign any false message m_o whose 1-bit positions coincide with those of the original message m.

Example: Let k = 8, so that k' = k + ⌊log2 8⌋ + 1 = k + 4 = 12, and let m = (10111100)_2. Then a = 3, ab = (0011)_2, and m' = m‖ab = (101111000011)_2. The signature is therefore s = (x_0, x_2, x_3, x_4, x_5, x_10, x_11). The attacker can no longer construct a false message with a valid signature; the stand-alone message m is no longer used, and the whole algorithm operates on m'. The key length is now roughly n(k + log2(k)) bits, and the signature, which depends on the message m, is about n(k + log2(k))/2 bits on average. It is also apparent from the two algorithms above that if a secret key is used many times, more and more of it is revealed, and the probability that an attacker can forge signatures from the exposed parts of the key increases.

Several further proposals build on the hash-based one-time signature scheme, such as Bleichenbacher and Maurer's digital signatures based on directed acyclic graphs, described in their paper "Directed Acyclic Graphs, One-Way Functions and Digital Signatures"; Dods, Smart, and Stam analyzed this proposal at the 10th IMA International Conference "Cryptography and Coding," December 19–21, 2005, and found that it does not perform as well as the Winternitz scheme. Bos and Chaum proposed another variant of the Lamport scheme in 1992, and Reyzin and Reyzin proposed a similar variant in 2002; the key difference is that Bos–Chaum focuses on minimizing the public key, while Reyzin–Reyzin focuses on minimizing the signature size.

2.9.3 Winternitz One-Time Signature Scheme for Smaller Key Size

Lamport's hash-based one-time signature scheme is efficient and easy to adopt, but its main drawback is the large key size of 2kn bits, whereas Winternitz's proposal requires only about n bits. The basic idea is to use a single string of the one-time signature scheme to sign several bits of the message digest (the output of the hash function) simultaneously [3]. The scheme first appears in Ralph Merkle's thesis, "Secrecy, Authentication, and Public Key Systems," where Merkle notes that the method was suggested to him by Winternitz in 1979. Like the Lamport scheme, the Winternitz one-time signature scheme uses a one-way function f : {0, 1}^n → {0, 1}^n and a cryptographic hash function


h : {0, 1}^* → {0, 1}^n. The main differences lie in the key generation and in the way the signature is developed.

Algorithm 6: Winternitz one-time signature scheme for smaller key size
1. Key Generation
• Choose a Winternitz parameter w ≥ 2, the number of bits to be signed simultaneously; compute t_1 = ⌈n/w⌉, t_2 = ⌈(⌊log2 t_1⌋ + 1 + w)/w⌉, and t = t_1 + t_2
• Choose t random n-bit strings x_i for 0 ≤ i < t
• Compute, authenticate, and make public y_i = f^{2^w − 1}(x_i) for 0 ≤ i < t
• The secret signature key is (x_0, x_1, ..., x_{t−1}) and the public verification key is (y_0, y_1, ..., y_{t−1})
2. Signature Generation
• Prepend zeroes to the message m so that its length is divisible by w, and split the extended m into t_1 w-bit blocks b_{t−1}, b_{t−2}, ..., b_{t−t_1} such that m = b_{t−1}‖...‖b_{t−t_1}
• Identify each b_i with an integer in {0, 1, ..., 2^w − 1} and compute the checksum

c = Σ_{i = t−t_1}^{t−1} (2^w − b_i)    (8)

It can be shown that c ≤ t_1 · 2^w, so the binary representation of c has length at most ⌊log2 t_1⌋ + 1 + w. Prepend the minimum number of zeroes to c so that its length is divisible by w, and split the extended string into t_2 w-bit blocks b_{t_2−1}, b_{t_2−2}, ..., b_0 such that c = b_{t_2−1}‖b_{t_2−2}‖...‖b_0
• The signature is computed as s = (f^{b_{t−1}}(x_{t−1}), ..., f^{b_1}(x_1), f^{b_0}(x_0))
3. Signature Verification
Given the signature s = (s_0, s_1, ..., s_{t−1}), compute the blocks b_0, b_1, ..., b_{t−1} as above and check that

(f^{2^w − 1 − b_0}(s_0), ..., f^{2^w − 1 − b_{t−1}}(s_{t−1})) = (y_0, y_1, ..., y_{t−1})
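A compact sketch of Algorithm 6 for one illustrative parameter choice (n = 256, w = 4), assuming SHA-256 as the one-way function f and as the message hash; the names and parameters are illustrative.

```python
import hashlib, secrets
from math import ceil, floor, log2

N_BITS, W = 256, 4                                   # digest length n and Winternitz parameter w
T1 = ceil(N_BITS / W)                                # blocks from the message digest
T2 = ceil((floor(log2(T1)) + 1 + W) / W)             # blocks for the checksum
T = T1 + T2

def f_chain(x: bytes, times: int) -> bytes:
    """Apply the one-way function f (SHA-256 here) 'times' times."""
    for _ in range(times):
        x = hashlib.sha256(x).digest()
    return x

def to_blocks(value: int, n_blocks: int):
    """Split an integer into n_blocks base-2^w digits, most significant first."""
    return [(value >> (W * (n_blocks - 1 - i))) & (2**W - 1) for i in range(n_blocks)]

def blocks(message: bytes):
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big")
    b = to_blocks(digest, T1)
    checksum = sum(2**W - bi for bi in b)            # c = sum(2^w - b_i), as in Algorithm 6
    return b + to_blocks(checksum, T2)

def keygen():
    sk = [secrets.token_bytes(32) for _ in range(T)]
    pk = [f_chain(x, 2**W - 1) for x in sk]          # y_i = f^(2^w - 1)(x_i)
    return sk, pk

def sign(sk, message: bytes):
    return [f_chain(sk[i], bi) for i, bi in enumerate(blocks(message))]   # s_i = f^(b_i)(x_i)

def verify(pk, message: bytes, sig):
    return all(f_chain(sig[i], 2**W - 1 - bi) == pk[i]                    # f^(2^w-1-b_i)(s_i) = y_i
               for i, bi in enumerate(blocks(message)))

sk, pk = keygen()
sig = sign(sk, b"hello")
print(verify(pk, b"hello", sig), verify(pk, b"tampered", sig))   # True False
```

With these parameters t = 67, so a signature consists of 67 hash values of n = 256 bits each, matching the t·n signature size listed in Table 5.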

Table 5 summarizes the one-time signature schemes discussed above. A one-time signature scheme on its own is not practical; in practice we need a many-time signature scheme


Table 5 One-time signature schemes

Name of the scheme | Public key size | Signature size
Lamport scheme | 2kn | kn
Merkle's improvement of Lamport | n(k + log2(k)) | n(k + log2(k))/2
Winternitz | n | tn

that can be used for signing in practice, provided the security remains intact and the scheme offers good memory management. Several messages can certainly be signed with a one-time signature scheme, but then a very large number of verification keys must be stored, which costs a large amount of memory. One idea is to transmit the verification key together with the signature; this solves the memory problem but raises the problem of authenticating the verification keys. In 1979, Merkle proposed a method that sacrifices neither security nor memory nor authentication. The basic idea is to use a complete binary hash tree that reduces the validity of an arbitrary but fixed number of one-time verification keys to the validity of one single public verification key, the root of the hash tree. The method just described is the static Merkle tree; the number of valid signatures it can produce is finite, say 2^d (taken as a power of two for simplicity, since a binary tree of height h has 2^h leaves). There is also a provision for signing an unbounded number of messages by using a dynamically expanding tree that grows along with the number of signatures made; the drawback is that the signature size grows after each signature, which is not only inefficient but also insecure, as it reveals the total number of signatures made.

2.10 Merkle Static Tree
Let the total number of signatures be finite, say 2^d, and choose a one-way hash function h : {0, 1}^* → {0, 1}^n. To sign 2^d messages (m_1, m_2, ..., m_{2^d}) we need 2^d key pairs (s_i, p_i) for 0 ≤ i < 2^d, where d is called the height or depth of the tree [8]. The tree is built entirely on the hash function h(); if the underlying hash function is secure, then the public verification key is also secure, so the security of the scheme ultimately depends on the hash function and remains safe even in the quantum era. Assuming each message is signed using one of the one-time key pairs (s_i, p_i) for 0 ≤ i < 2^d, the question arises of how verification can be done using the single public key: one must authenticate the verification keys p_i against the public verification key Y, the root of the hash tree, as described below.


Algorithm 7: Merkle's tree for generating the public key
1. For Signing Messages
For signing 2^d messages, generate that many secret–public one-time key pairs (s_i, p_i) for 0 ≤ i < 2^d
2. Building the Binary Tree
• The leaves of the tree are computed as y_0^i = h(p_i) for 0 ≤ i < 2^d
• The nodes at height j are computed as y_j^i = h(y_{j−1}^{2i} ‖ y_{j−1}^{2i+1}) for 0 ≤ i < 2^{d−j} and 1 ≤ j ≤ d
3. Making the Public Key
Authenticate and make public the root of the tree, Y = y_d^0. The root Y = y_d^0 is the only public key; all the key pairs (s_i, p_i), 0 ≤ i < 2^d, remain secret
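A minimal sketch of Algorithm 7, assuming SHA-256 as h and treating the one-time public keys p_i as opaque byte strings; the placeholder keys are illustrative.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(ot_public_keys):
    """Algorithm 7: leaves y_0^i = h(p_i); parents y_j^i = h(y_{j-1}^{2i} || y_{j-1}^{2i+1})."""
    level = [h(pk) for pk in ot_public_keys]        # 2^d one-time verification keys
    while len(level) > 1:
        level = [h(level[2 * i] + level[2 * i + 1]) for i in range(len(level) // 2)]
    return level[0]                                  # Y = y_d^0, the single public key

# 2^3 = 8 placeholder one-time public keys p_0 .. p_7 (illustrative values only)
p = [bytes([i]) * 32 for i in range(8)]
Y = merkle_root(p)
print(Y.hex())
```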

The verification keys p_i are authenticated against the public verification key Y, the root of the hash tree. This is typically done using an authentication path that follows the path between Y and p_i. There are two possible directions for authenticating a key:
1. from the root down to the leaves, and
2. from the leaves up to the root.
As an example, consider the authentication of p_4 in the Merkle tree of height d = 3 (and thus 8 key pairs) depicted below. The first possibility follows the path from the root of the tree; Y = y_3^0 is an authenticated public key:
1. The signer reveals y_2^0 and y_2^1; the verifier computes y_3^0, and if it matches the given Y, y_2^1 is verified.
2. The signer reveals y_1^2 and y_1^3; the verifier computes y_2^1, and if it matches the previously verified value of y_2^1, y_1^2 is verified.
3. The signer reveals y_0^5; the verifier computes y_0^4 from p_4 using the hash function, then computes y_1^2, and if it matches the previously verified value of y_1^2, p_4 is verified.
Half of the transmission is redundant, namely the elements y_2^1 and y_1^2. The second possibility follows the path from the leaf to the root:
1. The verifier computes y_0^4 from p_4, and the signer reveals y_0^5.
2. The verifier computes y_1^2 from y_0^4 and y_0^5, and the signer reveals y_1^3.
3. The verifier computes y_2^1 from y_1^2 and y_1^3, and the signer reveals y_2^0.
In this second approach all of the terms that were redundant in the first approach are computed on the fly, so the verifier never needs to store them. The revealed elements, i.e., y_0^5, y_1^3, and y_2^0, form the authentication path. The second method requires transmitting only d values, compared with 2d for the first. The method is due to Merkle.


Table 6 Authentication paths in the tree of height 3

p_i | Authentication path
p_0 | y_2^1, y_1^1, y_0^1
p_1 | y_2^1, y_1^1, y_0^0
p_2 | y_2^1, y_1^0, y_0^3
p_3 | y_2^1, y_1^0, y_0^2
p_4 | y_2^0, y_1^3, y_0^5
p_5 | y_2^0, y_1^3, y_0^4
p_6 | y_2^0, y_1^2, y_0^7
p_7 | y_2^0, y_1^2, y_0^6

The bold entries are the non-redundant elements; the rest are redundant

With this approach only the public verification key Y_d^0 has to be stored for 2^d messages; there is no need to store all 2^d one-time verification keys (refer Table 6). The leaf-to-root approach is well established, but one observation remains: the authentication path of p_{i+1} shares a large portion of the authentication path of p_i. It therefore saves both memory and time to compute the authentication paths for all the p_i in order, starting from p_0 and going up to p_{2^d−1}, skipping the path elements already obtained for previous leaves and keeping only the elements that are new for the current p_i, so that the paths are developed by recurrence (refer Fig. 7).

2.10.1 Signature Generation and Verification Using the Static Version of the Merkle Tree

The Merkle signature scheme above uses the Lamport scheme and its variants. The average signature size is roughly 2n^2 + n·log2 N, where n is the digest length of the one-way hash function h : {0, 1}^* → {0, 1}^n and N = 2^d is the total number of messages to be signed; the 2n^2 term comes from the one-time signature and the n·log2 N term from the authentication path. Typical values are n = 256 and N = 2^20. The basic version of the Merkle scheme uses Lamport and its variants, but an extended version was developed by Buchmann, Dahmen, and Hülsing, called the "eXtended Merkle Signature Scheme," or XMSS for short.
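A quick arithmetic check of the size estimate for the typical values quoted above; the byte conversion is only illustrative.

```python
from math import log2

n, N = 256, 2**20                       # typical values quoted in the text
sig_bits = 2 * n**2 + n * int(log2(N))  # 2n^2 (one-time signature) + n*log2(N) (authentication path)
print(sig_bits, sig_bits // 8)          # 136192 bits, i.e. roughly 17 kB per signature
```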


Fig. 7 Merkle Static Tree of height 3

Algorithm 8: Signature generation and verification using the static version of the Merkle tree
1. Signature Generation
• The signature is produced as in a one-time signature scheme: the signature for the ith message consists of the index i, the one-time signature s_i for the ith message generated with its secret key sk_i, the public key pk_i for the ith message, and the corresponding authentication path Auth_i, giving
(i, s_i, pk_i, Auth_i)
2. Signature Verification
The signature is verified in a similar fashion as in a one-time signature scheme:
• Verify that m_i was signed under pk_i using the one-time signature scheme described earlier in the paper.
• Authenticate the public key pk_i using the authentication path by computing Y_d^0 and comparing it with the published value.
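A minimal sketch of the authentication step of Algorithm 8: recomputing Y_d^0 from p_i and its authentication path, here assumed to list the sibling nodes from the leaf level upward; the function names are illustrative.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def root_from_auth_path(leaf_index: int, ot_public_key: bytes, auth_path):
    """Recompute Y_d^0 from p_i and its authentication path (siblings, leaf level upward)."""
    node = h(ot_public_key)                       # y_0^i = h(p_i)
    for sibling in auth_path:
        if leaf_index % 2 == 0:                   # node is a left child
            node = h(node + sibling)
        else:                                     # node is a right child
            node = h(sibling + node)
        leaf_index //= 2
    return node

# Tiny usage: a height-1 tree with two leaves p0, p1.
p0, p1 = b"\x00" * 32, b"\x01" * 32
Y = h(h(p0) + h(p1))
print(root_from_auth_path(0, p0, [h(p1)]) == Y)   # True -- the verifier accepts p0
```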

2.10.2 eXtended Merkle Signature Scheme or XMSS

The main differences between the basic version and XMSS are smaller signatures, collision resilience, and the use of the Winternitz scheme instead of the Lamport scheme. The main advantage is that pk_i no longer needs to be included: it can be computed from the signature and compared with the given value; that is, instead of storing or transmitting pk_i, it is recomputed from the signature, and the computed values are then used to recompute the root value Y_d^0 [7]. This reduces the signature size from 2n^2 + n·log2 N to n^2 + n·log2 N, with the Winternitz parameter w ∈ N (the number of bits to be signed simultaneously) as a tunable parameter. Having looked briefly at some fundamentals of hash-based signatures, we now define stateful and stateless hash-based signature schemes.
Stateful hash-based signature scheme: a scheme that maintains a state keeping track of the one-time signature keys already used, so that no key is used to sign more than one message, e.g., XMSS, LMS.
Stateless hash-based signature scheme: a scheme that does not maintain a state; its signatures are much larger and computationally more intensive, e.g., SPHINCS.

2.11 Advantages of Using a Hash-Based Signature Scheme
1. Reliable hash functions: The ultimate security of hash-based cryptography lies in the security of the underlying hash functions, and many of the hash functions approved by NIST are quantum safe.
2. Standardization: Stateful hash-based schemes are the most likely to be standardized soon by NIST and can therefore be implemented.
3. Use of currently available hardware: Unlike the other post-quantum cryptosystems, a hash-based cryptosystem spends the majority of its computation on the hash function, and most hash functions are already optimized in existing architectures.
4. Smaller key size: The public and private keys are much smaller than in other post-quantum cryptosystems, and by using a Merkle tree the need to store a large number of verification keys is reduced to a single key.
5. Easy to adapt and implement: If a security issue arises with any hash-based cryptosystem, it suffices to change the underlying hash function; no complex hardware implementations or mathematical computations are required.


2.12 Comparison Between Quantum Cryptography and Post-quantum Cryptography: Quantum Key Distribution
The two terms quantum cryptography and post-quantum cryptography sound similar, yet there is a considerable difference between them. Quantum cryptography refers to the use of fundamental laws of quantum mechanics, such as quantum superposition, quantum entanglement, the Heisenberg uncertainty principle, the dual behavior of matter and radiation, and the de Broglie hypothesis, in the encryption and decryption processes and in establishing the security of the system. Quantum key distribution (QKD) is the best-known quantum encryption method: the data (the keys) are transferred using photons, and the photons' no-cloning property is harnessed for security. A third party cannot extract the keys from the photons, and if an attacker does try to access the keys, the attempt inevitably changes the state of the photons, so the cryptosystem alerts the communicating parties that their data has been tampered with by an adversary. Post-quantum cryptography, by contrast, refers to cryptographic algorithms that are believed to be safe from quantum attacks, i.e., resistant to Shor's and Grover's algorithms. Post-quantum cryptography is about building quantum-safe cryptosystems by extending our present knowledge of mathematics and computer science [6]. Most post-quantum cryptosystems have not yet received much scrutiny from the cryptographic community, with the exception of hash-based cryptography, whose security is well understood and which is most likely to be deployed on a large scale in the coming years.

3 Conclusion
The era of quantum computing is not far away; Google has already claimed quantum supremacy. Several institutions and companies, such as D-Wave, Google, AT&T, Atos, and Honeywell, are currently working on quantum computing, so it would be no surprise if we enter the quantum era within the next few years, perhaps within this decade. Our classical cryptosystems are no doubt safe for now, but research on post-quantum cryptography cannot be paused, and judging by the current trend, hash-based cryptography is the most promising direction compared with the alternatives, which involve much more complex hardware and mathematical machinery. It can therefore safely be argued that, rather than searching for yet other post-quantum cryptosystems, we should first focus on what we already have: build more secure hash functions, optimize architectural designs for their implementation, and develop new schemes.


References
1. Gauthier Umana V (2011) Post quantum cryptography, vol 322. Doctoral Thesis, Technical University of Denmark, pp 79–87
2. Bernstein DJ, Buchmann J, Dahmen E (2009) Post quantum cryptography, vol 10. Springer, pp 35–43
3. Roy KS, Kalatia HK (2019) A survey on post-quantum cryptography for constrained devices. Int J Appl Eng Res 14(11):1–7. ISSN 0973-4562
4. Quan NTM (2000) Intuitive understanding of quantum computation and post-quantum cryptography, vol 9, pp 8–9. arXiv:2003.09019
5. Shor PW (1996) Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer, vol 2, pp 5–23. arXiv:quant-ph/9508027v2
6. Chen L, Jordan S, Liu Y-K, Moody D, Peralta R, Perlner R, Smith-Tone D (2016) Report on post-quantum cryptography. NISTIR 8105, vol 2, pp 1–7. https://doi.org/10.6028/NIST.IR.8105
7. Hülsing A, Gazdag S-L, Butin D, Buchmann J. Hash-based signatures: an outline for a new standard. NIST, pp 1–3. https://csrc.nist.gov/csrc/media/events/workshop-on-cybersecurity-in-a-post-quantum-world/documents/papers/session5-hulsing-paper.pdf
8. Buchmann J, Ding J (eds) (2008) Post-quantum cryptography, second international workshop, PQCrypto 2008, Cincinnati, OH, USA. Lecture notes in computer science, vol 5299. Springer
9. Yanofsky NS, Mannuci MA (2008) Quantum computing for computer scientists. Cambridge University Press, pp 204–209. ISBN 978-0-521-879965
10. Mermin ND (2007) Quantum computer science: an introduction. Cambridge University Press, pp 63–69. ISBN-13 978-0-521-87658-2
11. Pradhan PK, Rakshit S, Dutta S (2019) Lattice-based cryptography, its applications, areas of interest and future scope. In: Third international conference on computing methodologies and communication. ISBN 978-1-5386-7808-4
12. Sendrier N (2017) Code-based cryptography: state of the art and perspectives. IEEE Secur Privacy 15(4):44–50. https://doi.org/10.1109/MSP.2017.3151345
13. Ding J, Petzoldt A (2017) Current state of multivariate cryptography. IEEE Secur Privacy 15(4):28–36. https://doi.org/10.1109/MSP.2017.3151328
14. Arun G, Mishra V (2014) A review on quantum computing and communication. In: 2014 2nd international conference on emerging technology trends in electronics, communication and networking, Surat, 2014, pp 1–5. https://doi.org/10.1109/ET2ECN.2014.7044953
15. Mailloux LO, Lewis CD II, Riggs C, Grimaila MR (2016) Post-quantum cryptography: what advancements in quantum computing mean for IT professionals. IT Prof 18(5):42–47. https://doi.org/10.1109/MITP.2016.77
16. Borges F, Reis PR, Pereira D (2020) A comparison of security and its performance for key agreements in post-quantum cryptography. IEEE Access 8:142413–142422. https://doi.org/10.1109/ACCESS.2020.3013250
17. Gorbenko Y, Svatovskiy I, Shevtsov O (2016) Post-quantum message authentication cryptography based on error-correcting codes. In: Third international scientific-practical conference on problems of infocommunications science and technology (PIC S&T), Kharkiv, 2016, pp 51–54. https://doi.org/10.1109/INFOCOMMST.2016.7905333

A Comprehensive Study of Security Attack on VANET

Shubha R. Shetty and D. H. Manjaiah

Abstract The most trending technology among wireless networks is the vehicular ad hoc network (VANET), which is taking the automobile industry to new heights. Intelligent transportation systems (ITS) make smart use of wireless communication standards to automate vehicles. VANET, being a most promising technology, has curbed problems related to road safety, application-based services, and more, and its varied features leave the doors of research wide open. In this paper, we outline VANET through its framework. The basic VANET model is studied in order to understand its service system, and the VANET standards, which form the backbone of vehicular services, are reviewed, followed by the types of communication in VANET and its routing protocols. A broad study of the various security attacks and attackers is carried out to understand the security issues associated with wireless communication. The jamming attack and its effectiveness are analyzed in detail with the help of different parameters. In addition, the future scope of research associated with the jamming attack and possible developments is considered.

1 Introduction
An ad hoc network is a local area network that forms connections spontaneously as devices connect. A vehicular ad hoc network (VANET) is a distinct type of ad hoc network that can be used to improve vehicle safety, upgrade traffic efficiency, and provide information and entertainment on the go. VANET is often referred to as a subcategory of the mobile ad hoc network (MANET) [1]. VANETs came into existence in 2001 under "car-to-car ad hoc mobile service and networking" applications, in which a network is established and information is broadcast between cars. Vehicle-to-vehicle relaying and vehicle-to-roadside-device connections exist to implement roadway safety, navigation, and transport-related services together with entertainment. VANET is a prominent aspect of ITS and is referred to as an intelligent transportation network.


Major global issues such as road accidents, congestion, fuel consumption, and pollution have driven VANET's popularity in recent years. Intelligent transportation systems (ITS) introduced VANETs to design a risk-free framework for automobiles. VANETs focus on roadway protection and the systematic management of road obstructions, together with entertainment and comfort for both driver and passenger. The wireless access in vehicular environments (WAVE) protocol, established on the IEEE 802.11p standard, provides the basic radio standard for dedicated short-range communications (DSRC) operating within the 5.9 GHz band [2].

2 VANET Overview
VANET, considered a derivative of the mobile ad hoc network (MANET), is an emerging technology focused on roadway reliability and congestion control. A 2018 global status report of the World Health Organization (WHO) reveals a horrifying truth about the gradual annual rise in deaths caused by road accidents, which cause unbearable losses to both families and nations in terms of finances and emotions [3]. Hence, ITS, through VANET, is trying its best to reduce these losses and manage day-to-day traffic issues effectively. The following discussion focuses on the VANET architecture with diagrams, the various aspects that make VANET a unique communication network, and the communication standards that enable precise communication between vehicles.

2.1 VANET Architecture
VANET comprises two kinds of wireless points with to-and-fro communication among themselves, sharing information related to traffic, amusement, and much more. The road side unit and the on-board unit are the two units serving this two-way communication [4]. Information exchange in the entire VANET architecture takes place wirelessly through wireless access in the vehicular environment (WAVE). WAVE promises passenger safety by disseminating timely messages between vehicles and other units, covering traffic flow and other road-safety information [5]. VANET basically consists of three components, RSU, OBU, and TA, which are described below with a suitable diagram.
• Road Side Units (RSUs): Fixed, mounted infrastructure established on either side of the road. RSUs furnish information dissemination among nodes with proper network coverage. Most importantly, RSUs are the units that disperse safety messages within their coverage area. They deliver all the services provided by VANET by linking vehicles with each other. The employed communication protocol influences the placement and density of RSUs along the road [6].
• On-Board Units (OBUs): Vehicles with OBUs are capable of exchanging messages on roads; i.e., packets are transferred between the vehicles with the help


Fig. 1 Architecture of VANET [10]

of OBUs. The messages can be diverse, either safety related or application-based service messages [7]. When these two wireless nodes (RSUs and OBUs) are combined, they offer two types of bidirectional communication in VANET: vehicle-to-vehicle (V2V) and vehicle-to-RSU (V2R). An OBU incorporates a resource command processor (RCP) together with components to store and retrieve information, namely read/write memory and a user interface. The RCP provides an exclusive interface linking it with other OBUs and network components over the IEEE 802.11p radio frequency channel for dedicated short-range wireless communication [8].
• Trusted Authorities (TAs): Management of the entire VANET system rests on the shoulders of the TA. The TA registers the vehicles that enter the VANET as well as the RSUs and OBUs. Its duty is to confirm the authenticity of the driver through the user ID and to check the OBU ID in order to avoid future security threats. The TA requires high power and a large amount of memory. During any distrustful behavior of a vehicle, the TA rescues the entire VANET system by providing the OBU ID [9] (Fig. 1).

2.2 VANET Model Overview
Numerous units are available for VANET deployment, and the important actions are performed between vehicles and other entities via communication. Information dissemination in VANET typically happens in the infrastructure and ad hoc environments with the help of the VANET communication standards [10]. Smart vehicles are modeled


Fig. 2 VANET model [12]

with a set of sensors, such as reverse and forward radars, to access environmental information that is beyond a driver's knowledge. The global positioning system (GPS) helps to identify the geographical location and assists the driver. The infrastructure environment is composed of those entities that are constantly linked to support traffic and other services. Manufacturers in this environment explicitly register the vehicle. A trusted third party (TTP) provides various services such as credentials and time stamping. A legal authority undertakes vehicle registration and handles offence cases. Digital video broadcasting (DVB) and location-based services (LBS) are provided by service providers. The ad hoc environment permits inter-vehicle communication; units such as the OBU, the trusted platform module (TPM), and sensors play a crucial part in it. Sensors look after road safety, identify events and environmental changes, and immediately pass this information on to other vehicles. Calculation and storage management are handled by the TPM to preserve security. Figure 2 illustrates the two environments in which VANET communication is enabled [11].

2.3 VANET Characteristics
VANET has distinctive characteristics in contrast to other mobile ad hoc networks. These distinct features not only provide rich functionality but also pose challenges [13]. The following are a few of the attributes of VANET:


• Very flexible: Since vehicles move at high speed, predicting a vehicle's or node's position and preserving node privacy becomes a challenge.
• Rapidly changing network topology: The structure of the network constantly changes due to the uneven positions of nodes and their varying speeds on the road; hence, a fixed network topology is practically impossible.
• Unlimited network size: VANET can be deployed in both rural and urban regions, so the structure and size of the network are not geographically constrained.
• Continuous information interchange: The foremost goal of VANET is communication between vehicles or nodes; hence, VANET encourages nodes to communicate with road side units, with constant information exchange.
• Broadcast communication: VANET is ad hoc in nature and wirelessly exchanges messages among nodes and road side units, so securing the entire communication system becomes the priority.
• Time critical: Well-timed message delivery to the respective nodes helps vehicles take decisions and act accordingly.
• Sufficient energy: VANET faces little risk of an energy or computing-resource crisis, which smooths the functioning of road side units and the other devices involved in the communication process with a sufficient supply of power.

2.4 VANET Communication Standards
VANET has a strong communication setup built on various wireless standards. These standards include protocols to send and receive messages and cover diverse tools across safety specifications [14]. They are deployed over cellular networks such as 3G/4G/5G, synchronizing among protocols and labeling the various services. VANET standards refine product design and encourage the automobile industry to come up with the best products at minimum price [15]. The various VANET standards are listed below:
• Dedicated Short-Range Communication (DSRC): In ITS and further frameworks such as V2V and V2I, DSRC allows vehicles to communicate among themselves wirelessly. For DSRC, the Federal Communications Commission (FCC) in 1999 granted the band from 5.850 to 5.925 GHz, a spectrum of 75 MHz. The 75 MHz DSRC spectrum is segmented into seven channels (Ch. 172 to Ch. 184). The control channel is Ch. 178, carrying safety applications. The service channels are the remaining six (172, 174, 176, 180, 182, and 184); the highest-priority safety messages use Ch. 172 and 184, although the other channels carry both safety and non-safety information [5] (a short channel-role summary is sketched after Fig. 4 below).
• Wireless Access in Vehicular Environments (WAVE): The IEEE 1609 WAVE family describes the formation of the VANET framework along with the interactions, protocol sets, and other services that are mandatory for vehicles to communicate wirelessly [16]. It includes RSUs and OBUs, i.e., both stationary and mobile units. Orthogonal frequency division multiplexing (OFDM) is used in WAVE


to divide the signal into numerous narrow-band subchannels, supporting data rates of 3, 4.5, 6, 9, 12, 18, 24, and 27 Mbps in a 10 MHz channel [17].
• IEEE 802.11p: In addition to the IEEE 1609 standards, IEEE broadened its 802.11 protocol family by incorporating 802.11p to support wireless vehicular communication as per the DSRC standards [12].
The figures below show the channel distribution in DSRC and the WAVE framework [5] (Figs. 3 and 4).

Fig. 3 Channel distribution in DSRC [5]

Fig. 4 WAVE framework [5]
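For reference, a small lookup that restates the DSRC channel roles summarized above and in Fig. 3; the channel numbers come from the text, while the dictionary and function names are illustrative.

```python
# DSRC channel roles as summarized in the text (FCC 5.9 GHz band plan, 7 x 10 MHz channels)
DSRC_CHANNELS = {
    172: "service (high-priority safety)",
    174: "service",
    176: "service",
    178: "control channel (safety applications)",
    180: "service",
    182: "service",
    184: "service (high-priority safety)",
}

def role(channel: int) -> str:
    return DSRC_CHANNELS.get(channel, "outside the 75 MHz DSRC allocation")

print(role(178))   # control channel (safety applications)
```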


2.5 Types of VANET Communication
The components of VANET contribute to building various forms of communication within it. The system comprises RSUs, OBUs, TAs, proxy servers, geographic services, and various administration and application servers. Both the public and the private sectors of the automobile industry are part of the VANET environment at the transportation level. A link can be framed between vehicles (V2V), between vehicles and a road side unit or infrastructure point (V2R/V2I), and even between infrastructure points (I2I). The figure below depicts the communication model [18].
• Vehicle-to-Vehicle communication (V2V): Here, automobiles wirelessly exchange information related to congestion, vehicle speed, and area-based services. All this information is conveyed through OBUs with the assistance of dedicated short-range communication (DSRC) and the IEEE 802.11p protocol under the scope of particular RSUs [19]. V2V communication supports safety applications by perceiving dangers caused by traffic, terrain, weather, and the like.
• Vehicle-to-Infrastructure communication (V2I/V2R): Here, information is broadcast between vehicles and road side units, including the network infrastructure. V2R, along with various hardware, software, and firmware, furnishes information such as lane markings, road signs, and traffic signals to vehicles. The prime technologies behind V2R communication are DSRC and the wireless local area network (WLAN) [19]. A major point to note is the installation of a huge number of road side units, which increases the investment in road side infrastructure [20]. Single-hop or multi-hop communication is facilitated between vehicles and infrastructure based on the link between them [21].
• Infrastructure-to-Infrastructure communication (I2I): In I2I communication, RSUs exchange messages with each other. In general, we presume that all RSUs are attached to the mainstream network by a physical line. In scenarios such as rural regions, isolated RSUs may fail to connect to the mainstream network; RSUs then have to exchange messages among themselves, with a main RSU, and with the mainstream network to dispatch safety messages, traffic management data, and emergency information [22] (Fig. 5).

3 Routing Protocols
With the help of the various forms of communication, the vehicles on the road exchange diverse information over the network; information dissemination therefore plays a prominent role in VANET. Routing protocols come into the picture when information propagates between vehicles and road side infrastructure. These protocols are standards or conventions that decide the packet routes between different nodes. Hence, designing a dynamic routing protocol is itself a challenge


Fig. 5 VANET communication model [18]

due to the changing network topology in VANET [23]. Routing protocols can be classified based on topology, position, broadcasting, clustering, and geocasting. The three major services of ITS are safety, non-safety, and infotainment, with safety as the utmost priority; hence, the various routing protocols need to be designed robustly to meet safety needs [24]. The diagram below shows the protocol classification [23] (Fig. 6).

Fig. 6 Classification of routing protocol in VANET


• Topology-Based Routing Protocol: This protocol depends on the topology of the underlying network and maintains a routing table. It uses a store-and-forward methodology and works more slowly than the other types of routing protocol. There are two types of topology-based protocol, proactive and reactive routing protocols, e.g., fisheye state routing (FSR) and dynamic source routing (DSR).
• Position-Based Routing Protocol: Also referred to as geographic routing, here every node is aware of its adjacent nodes and their positions with the help of GPS [25]. It makes use of GPS services for routing and is a promising method compared with the other routing techniques. It uses greedy forwarding to find an optimal path, e.g., connectivity-aware routing (CAR) and vehicle-assisted data delivery (VADD).
• Broadcast-Based Routing Protocol: This routing technique is used in emergency situations such as accidents or severe weather [26]. Messages can be delivered beyond the transmission range. A flooding mechanism is applied to guarantee packet delivery, but it results in excess use of bandwidth, e.g., BROADCOMM, urban multi-hop broadcast (UMB), zone routing protocol (ZRP), vector-based tracking detection (V-TRADE), and history-enhanced V-TRADE (HV-TRADE).
• Clustering-Based Routing Protocol: This protocol partitions the nodes of the network into various intersecting or disjoint clusters in a distributed manner. A cluster head is selected to communicate with other cluster heads or cluster nodes, e.g., clustering for open IVC networks (COIN) and hierarchical cluster-based routing (HCB).
• Geocast-Based Routing Protocol: A modification of the multicast technique in which packets are transferred to every node within a predefined geographical region. It guarantees low-cost packet delivery, e.g., robust vehicular routing (ROVER) and inter-vehicle geocast (IVG).

4 Challenges and Security Attacks
VANETs have distinguishing features such as mobility constraints, an infrastructure-less architecture, and frequent loss of network connections between nodes, all of which contribute to serious network security threats. Hence, developing secure algorithms that promote safe network links and provide better security is a challenge [28]. The security system must be built in such a way that every message processed and sent is secured, with end-to-end encryption. Total security of the VANET structure can be guaranteed only with a clear picture of the possible attacks and the prevention measures that counter them. Road side units, on-board units, drivers, and certifying authorities may directly or indirectly contribute to security threats and disrupt the normal functioning of VANET [29]. In order to reduce security attacks, meeting certain security requirements is the priority: authentication, confidentiality, nonrepudiation, and similar requirements can themselves be attacked, disturbing the entire communication network [30].


4.1 Classification of Attackers
Attackers launch attacks depending on their understanding of the system, their skill set, workforce, equipment, and their vicinity. Listed below are a few of the capabilities an attacker may possess [31].
• Scientific: Relying on his or her competency and technical skill set, an attacker experiments with attacks to mislead the network; the skills can be programmatic, such as encrypting and decrypting code.
• Assets: The major resources needed to initiate an attack are finance, equipment, and workforce, without which an attacker can never successfully plant an attack.
• Location: Acquiring a location and launching an attack depend on the category of the attack and the attacker's aptitude; an attack can reach from a few hundred to a thousand meters.
VANET is ad hoc in nature, and this wireless communication is prone to various attacks that tamper with data privacy, integrity, and confidentiality, which becomes an obstacle to VANET applications. Hence, attackers and the attacks they launch are a major concern where security is involved. Attackers can be grouped based on who they are, their intention, the impact of the attack on VANET, and whether the attack is bounded or not. The following is the classification of attackers based on these variables [32]:
• Insider versus Outsider: An insider is a genuine member of the VANET launching an attack on the network, whereas an outsider is a non-genuine member.
• Malicious versus Rational: A malicious attacker seeks no personal benefit and aims only to harm the network, while a rational attacker seeks a personal benefit and is therefore more predictable.
• Active versus Passive: An active attacker generates new packets or destroys packets already in the network, whereas a passive attacker only eavesdrops on the wireless transmission.
• Local versus Extended: A local attacker can launch an attack only locally, i.e., within a specific border, whereas an extended attacker can launch an attack beyond that network scope.

4.2 Classification of Attacks
Security attacks on a vehicular ad hoc network can happen in several forms, and there is a great variety of attacks. A professional or nonprofessional attacker will organize an attack using various strategies. Some attacks target packets: a packet may be dropped, delayed, or forwarded to a false destination. Numerous attacks alter existing data by adding, deleting, or injecting false data into a packet, which results in network chaos and may end in a life-threatening scenario such as a road accident. Furthermore, some attacks may hamper different technologies in VANET that


will tamper with VANET functionality [33]. An attack may be launched in VANET for numerous reasons: some attackers merely eavesdrop on conversations over the network, while others intend to damage the information sent in packets across the network. VANET works with vulnerable, life-critical information; hence, this network attracts and nourishes different malevolent attackers who launch several kinds of attack. Below are the categories of VANET attacks [34].
• Network Attack: This class of attack vandalizes the performance of the entire network link and is considered the most dangerous; examples are the Sybil and DoS attacks.
• Application Attack: This group of attacks focuses on the types of information exchanged and harms the different kinds of application provided by VANET; examples are eavesdropping and bogus information.
• Timing Attack: The intention of this attack is to delay message dissemination between nodes by modifying the timing.
• Social Attack: A form of psychological attack on drivers that interrupts smooth driving and ultimately creates chaos on the road, ending up by hampering the necessary safety specifications.
• Monitoring Attack: Keen monitoring of vehicles, keeping track of every task of the system, and later launching an attack based on the surveillance carried out is the specialty of this attack; faking identity, passive attack, and session hijacking are attacks of this class.
Multiple researchers have classified attacks based on various parameters. The diagram below shows five different categories of attack together with their subcategories. In the first category, an authenticated user's identity information is used by the attackers to launch attacks. The second group of attacks is based on sending faulty or manipulated pieces of information. The third group involves delaying or eliminating necessary packets or sending those packets to a different destination. The fourth group comprises attacks that clog or collect the messages conveyed in the forum. The last group of attacks aims to hamper the VANET architecture. The diagram below shows the classification of security attacks in VANET [35].
• Availability Attack: This category consists of the denial of service (DoS) attack, jamming attack, greedy behavior, broadcast tampering, malware attack, spamming, black-hole attack, and others, i.e., attacks that prevent the network from functioning in its normal mode and make the services of the communication channel unavailable. These attacks deny users their network privileges and services and may take various forms, such as jamming the communication channel and hindering the delivery of safety-related messages to the user. A DoS attack can be accomplished through various other attacks such as jamming, black-hole, greedy behavior, grey-hole, wormhole, and malware attacks.
• Authentication and Verification Attack: Attacks in this category include Sybil, replay, GPS spoofing, position faking, masquerading, tunneling, and others. A malicious node may hold duplicate or multiple identities, which


creates erroneous messages or fakes congestion. Lying about position, speed, or location and thereby curbing road safety is the foremost intention of these attacks.
• Confidentiality Attack: Confidentiality means that only the true owner has the right to access the information, and an attack on confidentiality defeats this purpose. Spying, data gathering, and traffic analysis are its various forms, in which an attacker deliberately overhears the confidential communication on a channel and misuses it. Encrypting confidential data over the network can, to some extent, put a stop to such privacy violations.
• Integrity and Data Trust Attack: Integrity and data trust mean sending and receiving the original, authenticated message. Attackers try to hamper data integrity in numerous ways, such as masquerading, fabricating or altering data, suppressing messages, and replaying. In a masquerading attack, the contender fakes his identity as a legal source and disseminates fallacious messages. Likewise, a fabrication attack transgresses security requirements such as nonrepudiation and integrity; its sole intention is to alter the original data and disperse bogus messages throughout the channel.
• Nonrepudiation/Accountability Attack: The attacker here makes it tedious to keep an account of, or trace, the numerous events that take place in the network and that are essential for road safety [36] (Fig. 7).

Fig. 7 Classification of attack

5 Overview of Jamming Attack
Among the various types of attacks and attackers, the jamming attack is considered a matter of particular concern and has become a research topic in its own right. A jamming attack occurs when a jammer consciously sends radio signals to disrupt the link by reducing the signal-to-noise ratio. The term jamming is used in order to differentiate


Fig. 8 Jamming scenario in VANET [38]

it from coincidental jamming, which is called interference. Here, we aim to examine jamming attacks launched to damage communication purposefully; hence, we need to understand the jamming strategies adopted by various jammers [37]. The diagram represents the jamming scenario in a VANET environment [38] (Fig. 8). Jamming attackers can be organized into five different groups depending on their behavior: constant, deceptive, random, reactive, and intelligent jammers [39].
• Constant jammer: Constant jammers continually emit undirected information on the medium, ignoring its state (idle or not) [37]. They repeatedly release radio signals to reduce the signal-to-noise ratio and can congest the whole information exchange, causing severe harm; yet identifying this attacker is comparatively easy, as it consumes energy inefficiently [39].
• Deceptive jammer: A deceptive jammer continuously injects a stream of undirected packets with no space between consecutive packets [37]. Defying the MAC-layer access rules and generating a denial of service (DoS) attack are characteristic of this attacker; the channel appears ever busier to the authorized nodes. Unlike the constant jammer, the deceptive jammer limits the forwarding of raw noise [39].
• Random jammer: This jammer alternates between jamming and sleeping to preserve energy and works on the physical layer [37]. It is hard to locate since its energy intake is low [39].
• Reactive jammer: This jammer activates only when the communication channel is busy and in use, and otherwise goes idle. It is tough to identify because it functions only during authorized channel activity [39].
• Intelligent jammer: This jammer specifically targets packets and introduces noise into a packet so as to corrupt it. The jammer is efficient enough to examine the current traffic and is aware of the protocol; it is a complex jammer and hard to identify [40].


5.1 Effectiveness of Jamming
The jamming attack is a security threat to the vehicular ad hoc network that can seriously impact the network and, in the worst case, result in fatalities. In order to detect jamming in a network, it is necessary to study the impact made by the jamming attack, its severity, and its ultimate result. The following three metrics can be used to determine whether an ad hoc network is under attack by a jammer; they are tightly associated with each other and reflect network performance [41].
• Packet Delivery Ratio (PDR): The ratio between the packets received at the receiver and the packets sent out by the sender. The density of vehicles on the road in VANET keeps fluctuating depending on road conditions; for example, during peak hours or peak days the traffic is considerably high, leading to a reduced packet delivery ratio. A prominent fall in PDR also occurs when jamming prevails, caused by the noise from the jamming attack. Accordingly, it is hardly possible to rely on PDR alone to distinguish a reduction caused by a bottleneck in the network from one caused by a jammer.
• Packet Send Ratio (PSR): The ratio of packets actually sent out at the MAC layer to the packets an authorized user intended to send. A drop in the packet send ratio can happen in two situations. First, when roads are congested, vehicles travel at lower speed, resulting in extended communication time between nodes; simultaneously, the channel encounters control signals such as request-to-send/clear-to-send (RTS/CTS), leading to a larger fall in PSR. Second, when jamming prevails in the network, the channel frequently appears busy because of the noise induced by the jammer, resulting in back-off timers and latency in receiving CTS feedback. Whether the roads are obstructed or jammed, packets are buffered or discarded when new packets arrive or when a time-out occurs, so PSR drops. Hence, PSR alone cannot be used as a tool to identify a jamming attack either.
• Signal Strength (SS): SS measures the quality of the radio-frequency signal sent from source to destination and is a powerful measurement tool at the receiver's end. Depending on the deployed protocol, wireless points can sample SS at any point in time, and SS serves as a great tool for researchers to identify jamming attacks in a wireless network. Vehicles move freely in VANET, randomly entering and leaving the communication range; an increase in SS can therefore occur due to high vehicle density (congestion) or on purpose.
Numerous works address the jamming problem over the 802.11p relay, yet no comprehensive definition of a jammer's capability exists. A common presumption is that a jammer can jam wireless transmissions without restriction at an unspecified or stable rate and can move without restriction, and that the effect of jamming and its adaptability can be closely studied. PDR, PSR, and SS are network metrics that can reveal jamming and can be applied in various scenarios such as congested roads and highways.
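As a minimal illustration of the two ratio metrics above, the following sketch computes PDR and PSR from simple packet counters; the counter names and example numbers are illustrative assumptions, not measurements from the text.

```python
def packet_delivery_ratio(packets_received: int, packets_sent: int) -> float:
    """PDR: packets successfully received by the receiver / packets sent by the sender."""
    return packets_received / packets_sent if packets_sent else 0.0

def packet_send_ratio(packets_sent_out: int, packets_intended: int) -> float:
    """PSR: packets actually sent out at the MAC layer / packets the node intended to send."""
    return packets_sent_out / packets_intended if packets_intended else 0.0

# Illustrative numbers: a sharp drop in both ratios, combined with unusually high measured
# signal strength, is the pattern the text associates with jamming rather than congestion.
print(packet_delivery_ratio(120, 400))   # 0.3
print(packet_send_ratio(350, 400))       # 0.875
```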


6 Related Works
VANET, while gaining wide popularity, is exposed to enormous security issues. Safety is prioritized, and researchers worldwide are working to strengthen wireless network security. Handling the various attacks on VANET has become a challenge, and the jamming attack is one of them. The following research papers related to security attacks have been reviewed in order to sharpen the research topic and the objectives to be achieved.
M. Raya et al. (2007) [42] proposed a scheme that recognizes the most applicable service features. The proposed scheme has a secure framework that takes the protocols into consideration and focuses on vehicular ad hoc network security and privacy.
S. Zeadally et al. (2012) [17] surveyed current research issues such as quality of service, routing protocols, and broadcasting with respect to vehicular ad hoc networks. Standards for wireless access are reviewed, and trials and major deployments in various parts of the world are studied. Various VANET simulation tools are examined with respect to their performance and efficiency, and research-related challenges are highlighted.
R. S. Raw et al. (2013) [43] reviewed VANET security issues and the numerous challenges prevailing in ad hoc networks that hinder free-flowing communication among vehicles on the road. Diverse attacks and solutions to them are closely examined, dissimilar variables are used to compare the solutions, and the mechanisms used to solve attack issues are highlighted.
N. Panjrath et al. (2017) [44] made a brief study of the vehicular ad hoc network concerning its architecture, security issues, and design issues. A survey of VANET's communication system is also pursued; various hardware, software, and sensor units are studied in depth; the routing protocols supported by the ad hoc network are analyzed; and an insight into the service domain of VANET supports the survey.
Muhammad Sameer Sheikh et al. (2017) [5] surveyed VANET architectures, the service model and standards, and the latest simulation models and their performance. A deep study of various security attacks on VANET is presented, together with the future scope of research on security attacks.
Hamssa Hasrouny et al. (2017) [45] highlighted the three major components of the VANET security structure. The security framework and associated protocols are elaborated for an in-depth grasp, along with a recent categorization of security attacks and the solutions associated with them.
Ajay N. Upadhyaya et al. (2018) [46] inspected various security attacks on VANET. The features and conditions of VANET are studied thoroughly, and VANET attackers are grouped and depicted according to security essentials.
A. Aijaz et al. (2006) [47] introduced a prototype of the attacks that occur when vehicles relay messages. The framework is used to clarify the system archetype of


Network-On-Wheels (NoW). This model is also used to discover possible deficiencies of the NoW communication system during its specification stage. The work also uncovered several hard-to-recognize attacks on various units such as hardware, software, and even sensors, and highlights certain interesting issues and challenges that require closer attention. The attack tree serves as a key to analyzing system security. To strengthen overall security, two mechanisms were found worthwhile: a plausibility check conducted in the cars and a regular system check, specifically on the road side units. The plausibility check compares received messages with internal sensor data, assesses messages from diverse sources of information about the same event, and builds structures relating unique traffic events to known facts. Simulations show that the plausibility check and the regular system check together increase the effort required of attackers, but a model that covers every application is still needed. The regular system check verifies the working of each device, which reduces defects in any unit, and it includes software updates as well.

C. Chen et al. (2009) [48] brought forward a robust Sybil attack detection (RobSAD) tool based on the normal flow of vehicles as well as on suspicious movement. Considering the restricted planning of a vehicular ad hoc network, an attack on each node can be independently perceived by correlating the digital signatures of neighboring vehicles. Simulation results show that the presented framework performs outstandingly with low network requirements and detection cost, handles overhead efficiently, and is robust.

Lv Humeng et al. (2012) [49] proposed a Distributed Beacon Frequency Control algorithm for vehicular ad hoc networks, which adjusts the frequency of beacon messages at two different instants. The proposed algorithm can detect and estimate the network load and can control the frequency of the beacon messages sent in each time interval depending on the domain. To deal with the congestion caused by periodic messages, a halve-the-frequency-each-time method is adopted. The algorithm was simulated in NS2 to check its performance under heavy traffic conditions, and the beacon reception probability increased because the bandwidth is used more effectively.

Yi Qian et al. (2012) [50] designed an efficient medium access control (MAC) protocol for timely message dissemination in VANET, which in turn enhances network security and road safety. The protocol is designed for various kinds of applications with messages of different priority accessing the DSRC channels. Simulations show the protocol can furnish secure communication with reliability, low latency, integrity, non-repudiation, and privacy of the source or sender.

S. O. Tengstrand et al. (2014) [51] surveyed radio signal interference to study system performance and its influence on the IEEE 802.11p system, where bit rate, packet delay, and packet error probability are taken into account. Since the VANET system is built on automatic repeat request (ARQ) and carrier sense multiple access (CSMA), the interference problem is closely associated with packet delay at a constant data rate. Hence, radio signal interference remains a challenge in the near future when it comes to safety


or non-safety-related messages. Three different types of radio interference signals are studied: continuous wave (CW), pulse-modulated sine signal, and additive white Gaussian noise (AWGN). On IEEE 802.11p, CW causes severe degradation compared with normal spread noise such as AWGN, while the pulse-modulated sine signal is considered to cause the least degradation of the three. Ignoring this degradation is not the solution; instead, the duty factor of the interfering signal should be considered, and measuring only the root mean square (RMS) value of the interference is not good practice. Using an impulsiveness correction factor (ICF) is one possible solution to this problem. When time-critical events such as safety applications are considered, delay plays a prominent role; hence, the radio system must be selected wisely depending on the type of application.

N. Lyamin et al. (2014) [52] presented a recent model to detect denial-of-service (DoS) attacks in vehicular ad hoc networks using the IEEE 802.11p communication standard. Platooning is considered, and a straightforward algorithm is developed for real-time identification of jamming of the periodic positioning beacons exchanged among vehicles traveling in a group. The probability of detecting jamming and the false-alarm probability are evaluated for two different attacker models. Simulations show a detection probability of not less than 0.9, with no false alarms for any of the jamming cases.

A. M. S. Abdelgader et al. (2014) [53] concentrated on the PHY layer of IEEE 802.11p and provide an overview of the frequency band, specifications, and block diagrams of the WAVE PHY. Since VANET is emerging as a revolution in the transportation domain, network security and privacy along with road safety play a key role, and focusing on the PHY layer is a prime consideration for achieving network safety.

F. Nyongesa et al. (2015) [54] conducted a survey of Doppler shift compensation schemes and their challenges in VANET. The survey covers high-mobility wireless networks and presents a recent classification consisting of six different schemes. The compensation schemes are partitioned into time-domain, space-domain, and frequency-domain approaches, which are further categorized as time-partitioning based, modulation and coding based, autocorrelation and interference-end based, beam-forming based, and diversity-combining based. The bit error rate is calculated for all six schemes, based on which VANET performance is assessed.

M. Sun et al. (2017) [55] presented a new data trust scheme that can wisely determine the trustworthiness of a message on the network and is capable of distinguishing erroneous messages, tracking vehicles on the road even when messages are reported imprecisely. The underlying idea is to sense, through the wireless physical layer and reliable sensing techniques, the effect of the information reported by vehicles. The mechanism is tied to a dynamic vehicle tracking system based on an extended Kalman filter, and a Chi-square test is used to determine false information. Simulation results are highly productive for a highway traffic scenario with gentle curves and when a genuine neighboring vehicle exists.

Huong Nguyen-Minh et al. (2015) [56] proposed a jamming detection technique to detect packet loss that occurs either through a jamming attack or accidentally. A systematic model is also introduced to review the productivity


and accuracy of the proposed system. It is specifically used to detect reactive jamming that targets beacon messages, operates on multiple channels, and receives both safety and non-safety-related messages.

Ali Hamieh et al. (2009) [57] proposed a model capable of recognizing a particular type of jamming attack in which the jammer communicates only through workable radio signals radiated from its own device. The model is constructed on the distribution of errors. VANET is largely prone to denial-of-service attacks, and it becomes important to detect the attackers and the kinds of attacks performed. The new detection model is established by relying on the correlation between the errors and their reception times to determine the existence of a jamming attack in VANET. Simulation results show that jamming can be detected with high confidence.

M. A. Karabulut et al. (2017) [58] developed an analytical scheme to evaluate the performance of the IEEE 802.11p enhanced distributed channel access function (EDCAF) for vehicular ad hoc networks. The relation between performance factors and EDCAF parameters is obtained, and the analysis uses a 2D Markov chain model. The Markov chain model examines all the prime parameters which have an impact on IEEE 802.11p performance, including back-off counter freezing, the contention window, the arbitration inter-frame space number (AIFSN), etc. Simulation is carried out using this performance scheme, and the packet drop ratio, latency, and throughput are obtained.

M. S. Mohamed et al. (2017) [59] examined hybrid jamming attacks in vehicular ad hoc networks and presented an enhanced voting algorithm for them. To increase the reliability of the network, a voting-based scheme is implemented, and the effect of hybrid jamming on the voting-based approach is explored. A practical examination based on the attack model is conducted with the help of commercially available devices such as onboard units. The newly presented algorithm ranks highly with respect to the time taken to decide on reliability. The EEBL application is used as an example during the experiment, and the decision time was improved to 3.3 seconds, which is a great leap because every second makes a difference for safety-oriented messages; safety messages are time critical and delays may prove fatal.

A. Benslimane et al. (2017) [37] introduced a new analytical model to investigate the depth of reactive jamming attacks and the impact of jamming on broadcasting. The proposed system can distinguish the performance of the network in a normal scenario from that in a jammed situation. To support safety-related applications in VANET, a MAC-based detection method is proposed which outperforms the threshold method used for the detection task. The proposed system can more precisely differentiate whether collisions are caused by contention, attack, or interference. With a reduced probability of false alarms, jamming attack detection is simplified; simulation results and calculations are used to assess the detection system, and false alarms are largely avoided in order to improve the execution of the detection method. The detection methodology is investigated both in a platoon situation and in a normal domain, and the algorithm shows better performance in the platoon


environment, where the vehicle count is stable and monitored using platooning-related technology.

I. K. Azogu et al. (2013) [60] brought forward a hideaway scheme suitable for anti-jamming in vehicular ad hoc networks. To analyze the effectiveness of jamming, a new security metric is defined and employed to design the protection structure. The calculations and service functions are performed at the road side unit rather than at the onboard unit. Simulations clearly show that the presented system is more capable of improving VANET efficiency against jamming than traditional systems such as channel surfing. The new system is intended as a defense mechanism, and the road side unit is used in varied and efficient ways in the new model.

7 Conclusion VANET is a promising technology for vehicular communication networks due to its extensive applications and vivid safety-related services, and it has undoubtedly become a subject of analysis in the automotive industry, educational institutions, etc. With the growing popularity of VANET, security has become a major concern. After analyzing various attacks, handling jamming appears to be a particularly challenging task: maximizing road safety and minimizing jamming attacks on vehicles are the major concerns. Jamming is a consequential threat to VANET security. A jammer frequently sends signals in the affected zone to hinder the communication between nodes in the network. A jamming attack keeps the communication channel busy, and the victim perceives it the same way; hence, a jammed node can neither send nor receive information in the affected region. When jamming is present, the transmitter can still dispatch packets without interruption, while the recipient may not receive them. These packets may carry vital information such as road conditions, weather, and accidents, and failing to receive such packets may lead to danger. Since VANET has high mobility and a rapidly changing network topology, defending it against jamming attacks is a challenge. The jammer does not abide by any protocol, and its mobility is unrestricted; moreover, attackers have full control to start a jamming attack and then go into sleep mode in order to obscure their existence. The combination of all these issues has made jamming hard to detect and resolve. Taking note of the above issues and the review carried out, our further research focuses on studying existing jamming attacks and their effectiveness and on developing a jamming detection model that is potent enough to detect jamming attacks in VANET. Acknowledgements My sincere thanks to Dr. Manjaiah D. H. and his area of specialization team for their kind enlightenment and constant support. I further extend my regards to the Department of Post-Graduate Studies and Research in Computer Science, Mangalore University, Karnataka, India for their continued guidance. Conflict of Interest We declare that the paper describes original work carried out by us. The paper acknowledges all contributors other than the named authors. To the best of our knowledge, the work described in the paper is original and no part of it has been copied or taken from other


sources without the necessary permissions. The present manuscript has not been published anywhere in any form (in any language) and is not presently under consideration for publication elsewhere.

References 1. Laouiti A, Qayyum A, Saad MNM (2014) Vehicular ad-hoc networks for smart cities: first international workshop, 2014, vol 306. Springer 2. Shrestha R, Bajracharya R, Nam SY (2018) Challenges of future VANET and cloud-based approaches. Wirel Commun Mob Comput 2018 3. Sharma S et al (2019) Vehicular ad-hoc network: an overview. In: 2019 international conference on computing, communication, and intelligent systems (ICCCIS). IEEE, pp 131–134 4. Li W, Song H (2016) Art: an attack-resistant trust management scheme for securing vehicular ad hoc networks. IEEE Trans Intell Transp Syst 17(4):960–969 5. Sheikh MS, Liang J, Wang W (2019) A survey of security services, attacks, and applications for vehicular ad hoc networks (VANETs). Sensors 19(16):3589 6. Zaidi T, Faisal S (2018) An overview: various attacks in VANET. In: 2018 4th international conference on computing communication and automation (ICCCA). IEEE, pp 1–6 7. Jameel F, Hamid Z, Jabeen F, Javed MA (2018) Impact of co-channel interference on the performance of VANETs under α-μ fading. AEU—Int J Electron Commun 83:263–269 8. Zheng K, Zheng Q, Chatzimisios P, Xiang W, Zhou Y (2015) Heterogeneous vehicular networking: a survey on architecture, challenges, and solutions. IEEE Commun Surv Tutor 17(4):2377–2396 9. Ghosh M, Varghese A, Kherani AA, Gupta A (2009) Distributed misbehavior detection in VANETs. In: 2009 IEEE wireless communications and networking conference. IEEE, pp 1–6 10. Latif S, Mahfooz S, Jan B, Ahmad N, Cao Y, Asif M (2018) A comparative study of scenariodriven multi-hop broadcast protocols for VANETs. Veh Commun 11. De Fuentes JM, Gonza´lez-Tablas AI, Ribagorda A (2011) Overview of security issues in vehicular ad-hoc networks. In: Handbook of research on mobility and computing: evolving technologies and ubiquitous impacts. IGI Global, pp 894–911 12. Mejri MN, Ben-Othman J, Hamdi M (2014) Survey on VANET security challenges and possible cryptographic solutions. Veh Commun 1(2):53–66 13. Dinesh D, Deshmukh M (2014) Challenges in vehicle ad hoc network (VANET). Int J Eng Technol Manag Appl Sci 2(7):76–88 14. Malebary S, Wenyuan Xu (2015) A survey on jamming in VANET. Int J Sci Res Innov Technol 2:142–156 15. Singh GD, Tomar R, Sastry HG, Prateek M (2018) A review on VANET routing protocols and wireless standards. In: Smart computing and informatics. Springer, pp 329–340 16. Sheikh MS, Liang J (2019) A comprehensive survey on VANET security services in traffic management system. Wirel Commun Mob Comput 2019 17. Zeadally S, Hunt R, Chen Y-S, Irwin A, Hassan A (2012) Vehicular ad hoc networks (VANETs): status, results, and challenges. Telecommun Syst 50(4):217–241 18. Arif M, Wang G, Bhuiyan MZA, Wang T, Chen J (2019) A survey on security attacks in vanets: Communication, applications and challenges. Veh Commun 19:100179 19. Ali I, Gervais M, Ahene E, Li F (2019) A blockchain-based certificateless public key signature scheme for vehicle-to-infrastructure communication in VANETs. J Syst Archit 99:101636 20. Santa J, Go´mez-Skarmeta AF, Sa´nchez-Artigas M (2008) Architecture and evaluation of a unified v2v and v2i communication system based on cellular networks. Comput Commun 31(12):2850–2861


21. Cunha F, Villas L, Boukerche A, Maia G, Viana A, Mini RAF, Loureiro AAF (2016) Data communication in VANETs: protocols, applications and challenges. Ad Hoc Netw 44:90–103 22. Huang L, Jiang H, Zhang Z, Yan Z, Guo H (2017) Efficient data traffic forwarding for infrastructure-to-infrastructure communications in VANETs. IEEE Trans Intell Transp Syst 19(3):839–853 23. Devangavi AD, Gupta R (2017) Routing protocols in VANET—a survey. In: 2017 international conference on smart technologies for smart nation (SmartTechCon). IEEE, pp 163–167 24. Sirola P, Joshi A, Purohit KC (2014) An analytical study of routing attacks in vehicular ad-hoc networks (VANETs). Int J Comput Sci Eng (IJCSE) 3(4):210–218 25. Paul B, Islam MJ (2012) Survey over VANET routing protocols for vehicle to vehicle communication. IOSR J Comput Eng (IOSRJCE) 7(5):1–9 26. Singh S, Agrawal S (2014) VANET routing protocols: issues and challenges. In: 2014 recent advances in engineering and computational sciences (RAECS). IEEE, pp 1–5 27. Yogarayan S, Razak SFA, Azman A, Abdullah MFA, Ibrahim SZ, Raman KJ (2020) A review of routing protocols for vehicular ad-hoc networks (VANETs). In: 2020 8th international conference on information and communication technology (ICoICT). IEEE, pp 1–7 28. Barskar R, Chawla M (2015) Vehicular ad hoc networks and its applications in diversified fields. Int J Comput Appl 123(10) 29. Goyal AK, Agarwal G, Tripathi AK (2019) Network architectures, challenges, security attacks, research domains and research methodologies in VANET: a survey. Int J Comput Netw Inf Secur 11(10) 30. Abassi R (2019) VANET security and forensics: challenges and opportunities. Wiley Interdiscip Rev: Forens Sci 1(2):e1324 31. Al Junaid MAH, Syed AA, Warip MNM, Azir KNFK, Romli NH (2018) Classification of security attacks in VANET: a review of requirements and perspectives. In: MATEC web of conferences, vol 150. EDP Sciences, p 06038 32. Shahid MA, Jaekel A, Ezeife C, Al-Ajmi Q, Saini I (2018) Review of potential security attacks in VANET. In: 2018 Majan international conference (MIC). IEEE, pp 1–4 33. Ahmed W, Elhadef M (2017) Securing intelligent vehicular ad hoc networks: a survey. In: Advances in computer science and ubiquitous computing. Springer, pp 6–14 34. Mishra R, Singh A, Kumar R (2016) VANET security: issues, challenges and solutions. In: 2016 international conference on electrical, electronics, and optimization techniques (ICEEOT). IEEE, pp 1050–1055 35. Rasheed A, Gillani S, Ajmal S, Qayyum A (2017) Vehicular ad hoc network (VANET): a survey, challenges, and applications. In: Vehicular ad-hoc networks for smart cities. Springer, pp 39–51 36. Ekedebe N, Yu W, Lu C, Song H, Wan Y (2015) Securing transportation cyber-physical systems. In: Securing cyber-physical systems. CRC Press, pp 163–196 37. Benslimane A, Nguyen-Minh H (2017) Jamming attack model and detection method for beacons under multichannel operation in vehicular networks. IEEE Trans Veh Technol 66(7):6475–6488 38. Karagiannis D, Argyriou A (2018) Jamming attack detection in a pair of RF communicating vehicles using unsupervised machine learning. Veh Commun 13:56–63 39. Alturkostani H, Chitrakar A, Rinker R, Krings A (2015) On the design of jamming-aware safety applications in VANETs. In: Proceedings of the 10th annual cyber and information security research conference, pp 1–8 40. Xu W, Trappe W, Zhang Y, Wood T (2005) The feasibility of launching and detecting jamming attacks in wireless networks. 
In: Proceedings of the 6th ACM international symposium on mobile ad hoc networking and computing. ACM, pp 46–57 41. Malebary S, Xu W, Huang C-T (2016) Jamming mobility in 802.11p networks: modeling, evaluation, and detection. In: 2016 IEEE 35th international performance computing and communications conference (IPCCC). IEEE, pp 1–7 42. Raya M, Hubaux J-P (2007) Securing vehicular ad hoc networks. J Comput Secur 15(1):39–68


43. Raw RS, Kumar M, Singh N (2013) Security challenges, issues and their solutions for VANET. Int J Netw Secur Appl 5(5):95 44. Panjrath N, Poriye M (2017) A comprehensive survey of VANET architectures and design. Int J Adv Res Comput Sci 8(5) 45. Hasrouny H, Samhat AE, Bassil C, Laouiti A (2017) VANET security challenges and solutions: a survey. Veh Commun 7:7–20 46. Upadhyaya AN, Shah JS (2018) Attacks on VANET security. Int J Comput Eng Technol 9(1):8–19 47. Aijaz A, Bochow B, Dötzer F, Festag A, Gerlach M, Kroh R, Leinmüller T (2006) Attacks on inter vehicle communication systems—an analysis. In: Proceedings—WIT, pp 189–194 48. Chen C, Wang X, Han W, Zang B (2009) A robust detection of the Sybil attack in urban VANETs. In: 2009 29th IEEE international conference on distributed computing systems workshops. IEEE, pp 270–276 49. Lv H, Ye X, An L, Wang Y (2012) Distributed beacon frequency control algorithm for VANETs (DBFC). In: 2012 second international conference on intelligent system design and engineering application (ISDEA). IEEE, pp 243–246 50. Qian Y, Lu K, Moayeri N (2008) A secure VANET MAC protocol for DSRC applications. In: Global telecommunications conference, 2008. IEEE GLOBECOM 2008. IEEE, pp 1–5 51. Tengstrand SO, Fors K, Stenumgaard P, Wiklundh K (2014) Jamming and interference vulnerability of IEEE 802.11p. In: 2014 international symposium on electromagnetic compatibility (EMC Europe). IEEE, pp 533–538 52. Lyamin N, Vinel A, Jonsson M, Loo J (2014) Real-time detection of denial-of-service attacks in IEEE 802.11p vehicular networks. IEEE Commun Lett 18(1):110–113 53. Abdelgader AM, Lenan W (2014) The physical layer of the IEEE 802.11p WAVE communication standard: the specifications and challenges. In: Proceedings of the world congress on engineering and computer science, vol 2, p 71 54. Nyongesa F, Djouani K, Olwal T, Hamam Y (2015) Doppler shift compensation schemes in VANETs. Mob Inf Syst 2015 55. Sun M, Li M, Gerdes R (2017) A data trust framework for VANETs enabling false data detection and secure vehicle tracking. In: 2017 IEEE conference on communications and network security (CNS). IEEE, pp 1–9 56. Nguyen-Minh H, Benslimane A, Rachedi A (2015) Jamming detection on 802.11p under multichannel operation in vehicular networks. In: 2015 IEEE 11th international conference on wireless and mobile computing, networking and communications (WiMob). IEEE, pp 764–770 57. Hamieh A, Ben-Othman J, Mokdad L (2009) Detection of radio interference attacks in VANET. In: Global telecommunications conference, 2009. GLOBECOM 2009. IEEE, pp 1–5 58. Karabulut MA, Shahen Shah AFM, Ilhan H (2017) Performance modeling and analysis of the IEEE 802.11 DCF for VANETs. In: 2017 9th international congress on ultra modern telecommunications and control systems and workshops (ICUMT). IEEE, pp 346–351 59. Mohamed MS, Hussein S, Krings A (2017) An enhanced voting algorithm for hybrid jamming attacks in VANET. In: 2017 IEEE 7th annual computing and communication workshop and conference (CCWC). IEEE, pp 1–7 60. Azogu IK, Ferreira MT, Larcom JA, Liu H (2013) A new anti-jamming strategy for VANET metrics-directed security defense. In: 2013 IEEE Globecom workshops (GC Wkshps). IEEE, pp 1344–1349

Developing Business-Business Private Block-Chain Smart Contracts Using Hyper-Ledger Fabric for Security, Privacy and Transparency in Supply Chain

B. R. Arun Kumar

B. R. Arun Kumar: Department of Computer Science and Engineering, BMS Institute of Technology and Management, Affiliated to Visvesvaraya Technological University, Belagavi; Doddaballapura Main Road, Avalahalli, Yelahanka, Bengaluru, Karnataka 560064, India; e-mail: [email protected]

Abstract The inception of block-chain technology (BCT) has added new dimensions to business, trade, and technology. Being a distributed technology, block-chain (BC) has the capability to enable secure, transparent transactions and to implement an immutable, shared ledger that enhances trust among the stakeholders. In this research work, the investigation involves a systematic review of the current state of the art of the smart contract (SC) and BCT, explores a case study, and applies contemporary technology concepts and features to a supply chain problem, namely a transportation and logistics problem, in order to implement trust in the business activities of the consortium. This work has adopted Hyper-ledger, an open platform widely used in industry, which is expected to support the elements required to implement trust in different sectors. The results show that the Hyper-ledger framework and tools employed are effective in building trust among the stakeholders by addressing the said key issues and, further, advance the state of the art in transportation/logistics supply chain management (SCM) with unique and desired solutions.

1 Introduction Block-chain technology (BCT) has globally grabbed the attention of a very large number of researchers across different sectors of industry and business [1, 2]. The block-chain (BC) offers a unique combination of characteristics: it is distributed, decentralized, and persistent, and it provides anonymity and auditability [2]. The security of records of ownership is expected to revolutionize and greatly redefine solutions [3]. Such redefined solutions can save cost in terms of labour/man-hours and solve problems in near real time, and hence appear promising in industries such as finance, insurance, energy, transportation, and logistics. Different versions of the BCT have


created several avenues for investigating its adoption in the concerned fields as per their requirements. Further, since the technology is open to innovation, a customized version of the BCT can be adopted, which enables relevant implementation of the solution and catalyses the disruption brought by the technology, including in administration and governance. Transportation, logistics, and supply chain management (SCM) also demand integrity and transparency of transactions, and privacy and confidentiality of the business, to establish trust among the stakeholders. Integrity applies not only to the correctness of the values but also to the terms and conditions of the contract, including legal aspects and obligations. As per recent reports, BCs are being extensively investigated and adopted, which has created a collaborative SCM framework with better trust and efficiency [4]. Creating verified, validated, immutable, and transparent documentation for every instance of data at every step from production to consumption brings collaborative modalities to complex SCM, which addresses the above-mentioned points [5, 6]. In a consortium, the multiple parties involved may include malicious users who can counterfeit the database, which leads to unsecured data, and inefficient data validation leads to loss of faith (Figs. 1 and 2). When many organizations adopt the block-chain (BC), the state of the ledger and the history of transactions are shared with technology-enforced security, restricted access, immutability, transparency, and trust among the stakeholders. However, it is worth noting that public BCs are not suitable for a business consortium due to issues such as the privacy of business-sensitive data, ever increasing

Fig. 1 Business network without block-chain. Source [7]

Fig. 2 Business network with block-chain. Source [7]


huge storage space, lack of governance, and the huge computing power required to execute consensus algorithms, while in most cases a consortium requires the approval of only a few stakeholders. Further, the smart contract (SC) needs to be private, allowing only authorized stakeholder interactions and enabling privacy controls to be exercised on business-sensitive data [8].

1.1 Smart-Contracts (SC) for B2B The idea of the SC was proposed in 1994 by Nick Szabo: a set of terms and conditions agreed between the buyer(s) and the seller(s), or among the business enterprises involved in a particular business, executed using programmed, automated transaction protocols [9, 10]. The unique feature of the SC is that anonymous parties can execute trusted (transparent, traceable, and irreversible) transactions and agreements without the involvement of external administrative or legal entities. The SC enables smooth implementation of the legal aspects involved and replaces traditional cumbersome methods without compromising authenticity and credibility among enterprises and persons. BCT with an SC implementation helps the organizations of the consortium run a secured business without leaving any of the layers open to public access. Encroaching on privacy means, more precisely, using another's personal information without the consent of the person concerned. Transparency comes through good governance, approved policy, transaction details, and immutable ledgers made available to the consortium. Even though the original idea of BCT is to be openly public, the concepts of private BC and SC are significant, with restricted access. The degree of privacy and transparency depends on the kind of BC and on implementation decisions made for the application [11]. The SC on BCT is preferred in TL as it is capable of automating a given contract and allows a third party to validate the integrity of the transactions/contracts [7] (Fig. 3). SC on BC is anticipated to handle the complex and dynamic activities of the transportation and logistics (TL) supply chain. This work aims to build a working model that brings more reliability and robustness to the TL sector (TLS). The various adoptions of SC-BCT, including the following, are the motivation for taking up this work as applied to the TLS [1–6, 8–10]. Based on research reports, SC-BCT for a pharmaceuticals supply chain can work efficiently by significantly reducing the waiting time of retailers (from 60–90 days to 24–48 h). An IBM-curated BC can track the supply chain of items, encrypt the records, and establish the transparency and authenticity of the transactions. BC-SC can innovatively influence and redefine many security solutions. The SCs that run on each node enable secure, customized communication among untrusted and anonymous parties along with the elimination of external third parties [12, 13]. The implementation of an SC on BCT can be simple or complex for beginners, as in the case of Ethereum [5, 8, 13, 14]. As discussed, data privacy is an issue which can be


Fig. 3 SC in logistics industry. Source [7]

handled by programming: the owner of the data and the organization can program the smart contract in line with the agreed terms and conditions and ensure protection of private data, preserving the privacy of the person or organization [16]. BC coupled with SC can enable IoT devices to implement security solutions efficiently [15–18]. Transparency in BC refers to digital ledger immutability, where data cannot be altered or deleted. This is, in fact, a key feature that leads to several applications, including track and trace of products [19]. SC-based BC is expected to revolutionize almost every industry and will be referred to as the "transparency-privacy-trust internet value chain" in the future [19–25].

1.2 Hyper-Ledger Fabric (HF) Due to its high adaptability, HF is being used as the BC solution for private networks of enterprises in top use cases such as pharmaceuticals supply, trade financing, education and training, energy management, and many more [26]. The high adoptability comes from the modular architecture of HF, which allows pluggable components, such as identity management of customers, to be added to a permissioned BC. HF can scale up to a considerable number of organizations, channels, and transactions [27], and can be further optimized for higher scalability and performance as there is no proof of work or crypto mining. For the above reasons, HF is adopted in the implementation of SC-BCT for the TL use case (Fig. 4).


Fig. 4 Modular architecture of HF. Source [12]

1.3 Problem Formulation The main purpose of this research work is to build a permissioned, private BC network with an SC implementation for the consortium. The problem is to provide security, privacy, and transparency among the business entities through the SC on the BC. The SC shall have an associated endorsement policy which specifies a set of peers; these peers, belonging to the organizations participating in the business, must execute the SC and endorse the execution results. Each peer validating a transaction shall verify that the transaction carries the appropriate endorsements as specified in the endorsement policy and that the endorsing peers are valid (i.e., have valid signatures from valid certificates).
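As an illustration of what such an endorsement policy could look like, the expression below uses Hyperledger Fabric's policy syntax; the MSP identifiers for the four organizations are hypothetical placeholders rather than the exact names used in this work. In Fabric 2.x an expression of this form can be supplied when the chaincode definition is approved and committed (for example through the --signature-policy option of the peer lifecycle chaincode commands).

```
AND('OrgAMSP.peer', OutOf(2, 'OrgBMSP.peer', 'OrgCMSP.peer', 'OrgDMSP.peer'))
```

Under this hypothetical policy, a transaction is valid only if it carries an endorsement from a peer of organization A together with endorsements from peers of at least two of the other three organizations.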

1.4 Methodology The methodology adopted has two parts. The first part involves a systematic literature review focusing on relevant indexed papers on TL and SCM, on SC and BCT aspects, and on their applications. After the critical review, the research gap is identified and the adoption of SC and BCT for the business consortium to establish trust is explored. In the second part, the widely used "case-study approach" is adopted [11]. The SC–BCT is applied to the real-time context of organizations A, B, C, and D, which are involved in the supply chain of the item "Diamond-Item-X" (DIX). The negotiated and agreed terms and conditions of the consortium business for the demand and supply of DIX are codified in the SC. The concerned departments of the organizations, namely logistics/purchase, were responsible for verifying and validating the transactions as per the SC on the Hyper-ledger Fabric (HF), which serves as the backbone for information flow and for automating and approving the transactions.


2 Implementation Each organization is represented by a node in the BC. Each identified node gets an authority certificate from the certificate authority (CA) server, and the SC is programmed as per the terms and conditions, with the legal obligations, using Golang, which is the preferred tool as claimed in [28, 29]. The channel is configured to indicate the identified members, the ordering nodes that can add blocks on the channel, new members, the opening of a new channel if a majority of members vote for it, and the policy adopted as per consortium approval; this configuration is stored on the ledger and is made available as a transaction in the ledger [30–33]. The elements of the BC network are the set of peers; they host the ledgers and SCs and record SC transactions in the ledgers. The SCs are referred to as chain-codes in HF. The BC network consisting of peers (P1, P2, P3), with their copies of the distributed ledger (L1, L2, L3), which use the same chain-code S1 to access the ledger, is shown in Fig. 5. Each term and condition of the contract is implemented without misinterpretation or omission to avoid possible loopholes in the SC. The SC code is verified and validated for bugs and errors to avoid unforeseen repercussions at a later point in time, and the possibility of exploitation by hackers is taken care of. This implementation has adopted the architecture recommendations of the Hyperledger Architecture Working Group (HAWG). The HAWG-identified layers of BC application components are: "Consensus Layer, Smart Contract Layer, Communication Layer, Data Store Abstraction, Crypto Abstraction, Identity Services, Policy Services, APIs, and Interoperation layer which operates between different BC instances" [31] (Fig. 6). Figure 6 shows the processing as consisting of inputs, the contract interpreter, and outputs. Valid requests are accepted, and the generated output updates the state. The SC validates each transaction request by verifying whether it obeys the policy/contract. An invalid request is dropped and generally not included in the BC; if a transaction is rejected, no delta changes are applied in the block.

Fig. 5 Elements of BC networks. Source [30]


Fig. 6 A smart contract processing the request. Source [31]

Logical errors in the transactions are logged for auditing, to decide whether execution should continue. This is essential to avoid exceptions such as double spending. Inconsistencies in the contract layer need to be ordered; this ordering is handled by the consensus layer. Smart contracts generate soft interrupts/notifications which can raise alerts regarding possible side effects of processing the request (Fig. 7). The consensus layer sends a proposal which indicates the contract to be executed, the transaction identity, and other information such as any dependencies. The SC layer refers to the state of the ledger, the inputs given by the consensus layer, and the identity information provided by the identity layer to authenticate and authorize the entity that requested SC execution. Successful execution of a transaction results in a delta change in state as returned by the SC layer; otherwise "transaction rejected" is returned [31].

Fig. 7 SC interaction with other layers. Source [31]


The chain-code is the principal component of the system in which the consortium business transactions take place. After the initial configuration, an instance of the chain-code is created for the particular channel, and the corresponding functions are invoked to query and execute transactions on the HF. If required, an additional special chain-code for management functions and parameters can be created. Implementation using HF Each organization (A, B, C, and D) of the consortium, called a "member", sets up the HF. Each member sets up its peers for participation and configures them with cryptographic information such as the Certificate Authority (CA). A client (for example, an application or portal) sends transaction invocation requests (TIR) to the peers using the HF development kit, or uses HF through a REST interface. The peers use the chain-code to process the TIR. Every peer maintains its own ledger/channel; however, the peers take on different roles: the Endorser validates the transaction and executes the SC, the Anchor broadcasts the received updates, and the Orderer peer maintains the reliable ledger state across the network, creates the block, and delivers it to all the peers [31–33] (Fig. 8). Only endorsed transactions are committed and update the state.
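To make the chain-code concrete, the sketch below shows a minimal Go contract written against the fabric-contract-api-go package. The asset fields, function names, and the restriction that only a hypothetical "OrgAMSP" may register a new consignment are illustrative assumptions for this case study, not the exact contract deployed in this work.

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/hyperledger/fabric-contract-api-go/contractapi"
)

// Consignment is an illustrative record for one Diamond-Item-X (DIX) shipment.
type Consignment struct {
	ID     string `json:"id"`
	Owner  string `json:"owner"`
	Status string `json:"status"` // e.g. CREATED, IN_TRANSIT, DELIVERED
}

// SmartContract implements the chain-code functions invoked by the peers.
type SmartContract struct {
	contractapi.Contract
}

// CreateConsignment records a new DIX consignment in the channel's world state.
// As an illustrative access control, only clients of the hypothetical OrgAMSP may create one.
func (s *SmartContract) CreateConsignment(ctx contractapi.TransactionContextInterface, id, owner string) error {
	mspID, err := ctx.GetClientIdentity().GetMSPID()
	if err != nil {
		return err
	}
	if mspID != "OrgAMSP" {
		return fmt.Errorf("organization %s is not allowed to create consignments", mspID)
	}
	data, err := json.Marshal(Consignment{ID: id, Owner: owner, Status: "CREATED"})
	if err != nil {
		return err
	}
	// The write takes effect only after the transaction gathers the required
	// endorsements and is ordered into a block.
	return ctx.GetStub().PutState(id, data)
}

// ReadConsignment returns the current state of a consignment by its ID.
func (s *SmartContract) ReadConsignment(ctx contractapi.TransactionContextInterface, id string) (*Consignment, error) {
	data, err := ctx.GetStub().GetState(id)
	if err != nil {
		return nil, err
	}
	if data == nil {
		return nil, fmt.Errorf("consignment %s does not exist", id)
	}
	var c Consignment
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	chaincode, err := contractapi.NewChaincode(&SmartContract{})
	if err != nil {
		panic(err)
	}
	if err := chaincode.Start(); err != nil {
		panic(err)
	}
}
```

Each exported method becomes an invokable transaction. Business-sensitive fields such as a negotiated price could additionally be written with ctx.GetStub().PutPrivateData to a private data collection, so that only the authorized organizations hold the cleartext while other peers store only its hash.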

Fig. 8 HF workflow. Source [32]


3 Sample of Screen Shots The dashboard in Fig. 9 shows a screen shot of the SC–BC implemented on the HF platform. The number of transactions executed per second/minute/hour is visualized graphically. Along with the navigation tabs, the channel drop-down, the list of peers, and the metrics, activity, and transactions-by-organization panels can be seen (Figs. 9 and 10). The network view contains the list of the different kinds of nodes connected to the system, including the orderer node, and the network screen lists the properties with which the channel is configured.

Fig. 9 Dashboard of the B2B block-chain smart-contract implemented using HF

Fig. 10 The HF BC network


4 Conclusion The HF architecture offered chain-code (CC) trust with respect to both the BC application and the orderers, and handled misbehaving nodes. SCs (CC) that have confidentiality requirements with respect to transaction content and updates are found to be supported by HF. The ordering services were implemented to allow pluggable consensus using Hyper-ledger. The SC is implemented by taking care of logical errors as well as the legal aspects applicable to the business. HF could eliminate the hurdles of public networks and of centralized traditional databases, where transaction and data sharing is unsecured, time-consuming, and uncontrollable. The B2B Hyper-ledger chain-code manages the supply chain efficiently, ensuring that the transactions follow the contracts with all the advantages of the block-chain. Preliminary testing of the system shows that the Hyper-ledger CC has established trust among the stakeholders of the supply chain by offering transaction transparency, traceability, and more agility. The techniques of cryptography and consensus algorithms, coupled with the smart contract implementation, have contributed to implementing security, transparency, and privacy and to enhancing trust. For the considered scenario of four organizations with multiple channels, transaction scalability was excellent; however, the same may be tested on a larger scale in future work. Further, the performance of the SC can be evaluated under exceptional conditions in complex supply chain management. This work addressed the legal aspects of the SC while implementing the agreement applicable to the parties involved in the business through the software layer, without the involvement of a human element. However, the legal validity of the SC itself is debatable and jurisdiction dependent, which is not the focus of this work. Legal and regulatory aspects across jurisdictions need to be addressed, since TL happens across borders. Further, SC implementation at the various jurisdiction levels needs to be investigated, as illegal activities such as smuggling, hacking, and terrorism could be conducted by leveraging SC features such as self-execution and the anonymity of smart contracts. Future work includes analysing the performance by adopting metrics such as transaction throughput, modular consensus, and the ability to curb known attacks. If implemented by addressing all the issues from the different dimensions and perspectives mentioned, it can bring radical changes in business and trade at national and international levels. The BC has huge potential to create new opportunities and innovative applications that contribute significantly to transforming society. Acknowledgements The author, Dr. Arun Kumar B. R., would like to place on record his sincere thanks to BMSET, the Principal, and colleagues who directly/indirectly encouraged this work. My special acknowledgement to my student Mr. Shoel Khan for his contributions.


References 1. Perera S, Nanayakkara S, Rodrigo MNN, Senaratne S, Weinand R (2020) Blockchain technology: is it hype or real in the construction industry? J Ind Inf Integr 17. https://www.scienc edirect.com/science/article/pii/S2452414X20300017?via%3Dihub 2. Underwood S (2016) Blockchain beyond bitcoin. Commun ACM 59(11):15–17. https://doi. org/10.1145/2994581 3. Hamida EB, Brousmiche KL, Levard H, Thea E (2017) Blockchain for enterprise: overview, opportunities and challenges. Presented at the Thirteenth International Conference on Wireless and Mobile Communications (ICWMC 2017), Nice, France, July 2017. Google Scholar 4. Petersson E, Baur K (2018) Impacts of blockchain technology on supply chain collaboration. https://www.diva-portal.org/smash/get/diva2:1215210/FULLTEXT01.pdf. 21 May 2018. https://www.diva-portal.org/smash/get/diva2:1215210/FULLTEXT01.pdf 5. Agarwal S (2018) S.M. in Engineering and Management, Massachusetts Institute of Technology, System Design and Management Program. https://dspace.mit.edu/handle/1721.1/ 118559 6. Litke A, Anagnostopoulos D, Varvarigou T (2019) Blockchains for supply chain management: architectural elements and challenges towards a global scale deployment. Logistics 3:5 [CrossRef] 7. http://blockgeeks.com/guides/what-is-blockchain-technology/ 8. Wood G (2014) Ethereum: a secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper, 151 9. Swezey M What is a smart contract? https://medium.com/pactum/what-is-a-smart-contract10312f4aa7de 10. Sillaber C, Waltl B, Treiblmaier H et al (2020) Laying the foundation for smart contract development: an integrated engineering process model. Inf Syst E-Bus Manag. https://doi.org/ 10.1007/s10257-020-00465-5 11. Yin RK (2017) Case study research and applications: design and methods. Sage publications, Thousand Oaks, CA 12. Hu Y, Liyanage M, Manzoor A, Thilakarathna K, Jourjon G, Seneviratne A (2019) Blockchainbased smart contracts—applications and challenges. https://www.researchgate.net/publication/ 328230865_Blockchain-based_Smart_Contracts_-_Applications_and_Challenges/link/5d0 78eda299bf1f539c9560e/download. Accessed 26 Aug 2020 13. Script. https://en.bitcoin.it/wiki/Script. Accessed 21 May 2019 14. Ur Rahman M, Baiardi F, Guidi B, Ricci L Protecting personal data using smart contracts. https://www.researchgate.net/publication/335741721_Protecting_Personal_Data_ using_Smart_Contracts 15. Zhang Y, Wen J (2017) The IoT electric business model: using blockchain technology for the internet of things. Peer-to-Peer Netw Appl 10(4):983–994 16. Zhou Q, Yang Y, Chen J, Liu M (2018) Review on blockchain application for internet of things. In: Sun X, Pan Z, Bertino E (eds) Cloud computing and security. ICCCS 2018. Lecture notes in computer science, vol 11067. Springer, Cham. https://doi.org/10.1007/978-3-030-00018-9_64 17. Biswas K, Muthukkumarasamy V (2017) Securing smart cities using blockchain technology. In: 2016 IEEE 18th international conference on high performance computing and communications; IEEE 14th international conference on smart city; IEEE 2nd international conference on data science and systems, Sydney. IEEE, pp 1392–1393 18. Yuan Y, Wang FY (2016) Towards blockchain-based intelligent transportation systems. In: 2016 IEEE 19th international conference on intelligent, Paris. IEEE, pp 2663–2668 19. Jabbari A, Kaminsky P (2018) Blockchain and supply chain management. Department of Industrial Engineering and Operations Research University of California, Berkeley. 
https://www.mhi.org/downloads/learning/cicmhe/blockchain-and-supply-chain-management.pdf 20. Tapscott D, Tapscott A (2016) How the technology behind bitcoin is changing money, business, and the world. Portfolio Penguin, New York


21. Camerinelli E (2016) Blockchain in the supply chain. Finextra, 13 May 2016 [Online]. Available at: https://www.finextra.com/blogposting/12597/blockchain-in-the-supply-chain 22. Parker L (2015) Ten companies using the blockchain for non-financial innovation. Brave New Coin, 20 Dec 2015 [Online]. Available at: https://bravenewcoin.com/news/ten-compan ies-using-the-blockchain-for-non-financial-innovation/ 23. Migrov R (2016) The supply circle: how blockchain technology disintermediates the supply chain. ConsenSys, 09 March 2016 [online]. Available at: https://media.consensys.net/the-sup ply-circle-how-blockchain-technology-disintermediatesthe-supply-chain-6a19f61f8f35 24. https://www.icaew.com/technical/technology/blockchain/blockchain-articles/blockchaincase-studies 25. https://openblockchain.readthedocs.io/en/latest/biz/usecases/ 26. https://openledger.info/insights/hyperledger-enterprise-solutions-top-5-real-use-cases/ 27. Ferris C (2019) Does hyperledger fabric perform at scales? https://www.ibm.com/blogs/blockc hain/2019/04/does-hyperledger-fabric-perform-at-scale/. 2 April 2019 28. White paper on “Hyperledger”. https://blockchainlab.com/pdf/Hyperledger%20Whitepaper. pdf 29. Lyam M (2020) Why Golang and not Python? Which language is perfect for AI? 4 Feb 2020. Available at: https://www.rtinsights.com/why-golang-and-not-python-which-languageis-perfect-for-ai/ 30. https://hyperledger-fabric.readthedocs.io/en/release-2.2/create_channel/create_channel_c onfig.html 31. The Hyperledger Architecture Working Group would like to thank the following people for contributing to this paper: Bharathan V, Bowman M, Cole S, Davis S, George N, Graham G, Harchandani L, Jagadeesan R, Kuhrt T, Liberman S, Little T, Middleton D, Montgomery H, Nguyen B, Panicker V, Parzygnat M, Quaresma V, Wallace G. https://www.hyperledger.org/ wp-ontent/uploads/2018/04/Hyperledger_Arch_WG_Paper_2_SmartContracts.pdf 32. Mamun M (2018) How does hyperledger fabric work? https://medium.com/coinmonks/howdoes-hyperledger-fabric-works-cdb68e6066f5. 17 Apr 2018 33. Lahoti N (2018) All you need to know about smart contracts and their applications in logistic industry. Blog Posted on 30 Aug 2018. https://mobisoftinfotech.com/resources/blog/smart-con tracts-and-their-applications-in-logistics/

Data-Driven Frameworks for System Identification of a Steam Generator

Nivedita Wagh and S. D. Agashe

N. Wagh · S. D. Agashe: Department of Instrumentation and Control, College of Engineering Pune, Savitribai Phule Pune University, Pune, India; S. D. Agashe e-mail: [email protected]

Abstract This paper presents system identification of a steam generator using a data-driven framework. An in-house experimental set-up of the steam generator was used to generate data for steam pressure, steam temperature, steam flow rate, volume of water, and volume of steam as functions of time. Using these data, a variable ranking approach was implemented to obtain the parameters which affect the performance of the steam generator in terms of steam pressure. It is observed that steam flow rate and steam temperature are the most dominant factors affecting the performance of the steam generator. Scatter plots are obtained for these parameters to develop the regression model. Multiple-variable regression and multi-layer perceptron models are then implemented for system identification. It is noticed that the multi-layer perceptron model provides a much better prediction of the estimated pressure.

1 Introduction Steam generators are important devices traditionally used for generating steam through the combustion of fuel. This steam is further used for power generation through a steam turbine, and thermal power stations predominantly use steam boilers for steam production. Because of rapid changes in the demand for steam for power generation and for the process industry, boilers are subjected to fluctuating loads very frequently. Their efficiency therefore decreases with time due to bad combustion, fouling, and poor operation and maintenance. Boiler performance is mainly evaluated in terms of thermal efficiency, combustion efficiency, and overall efficiency. The thermal efficiency of a boiler is the ratio of the heat content in the outgoing steam to the heat supplied to the boiler. The heat content of the steam is a


function of the steam pressure. It is therefore possible to predict the performance of a boiler through pressure prediction. The combustion efficiency reflects the performance of the fire bed and corresponds to its ability to burn the fuel completely; the better the combustion efficiency, the better the boiler performance in terms of steam pressure. The overall efficiency considers all the energy losses which occur during the generation of steam, including radiation and convection losses, the energy lost with the ash, and other energy losses. Fuel and water quality also lead to poor boiler performance. Predicting the boiler efficiency helps us understand the deviation of the achieved efficiency from the best efficiency. Conventionally, the efficiency of the boiler is measured by direct or indirect methods. One of the major factors affecting the boiler efficiency is steam pressure. The expected steam pressure obtained here is estimated from a data-driven model based on the values of pressure obtained through experimentation done previously. Various modeling approaches are reported in the literature for predicting the performance parameters of a steam generator; a model-based approach is important for dynamic and fluctuating load conditions, and model-based control has been reported by many researchers during the last few decades [1]. Many modeling approaches are presented in the literature with the aim of enhancing overall boiler performance. This work reports a data-driven system identification framework for a steam boiler with the aim of improving boiler performance. Such frameworks have been proposed by many researchers for different processes. Niknam et al. [2] proposed a framework for optimizing energy management to minimize the cost and exhaust emissions; a scenario-based stochastic model is implemented to predict the load demand, the output power of wind and photo-voltaic units, and the market price. Sabio [3] reported a framework, which includes mixed integer nonlinear programming (MINLP), to handle uncertainty explicitly in multi-objective optimization (MO2) problems for low-cost applications of industrial processes. Kim et al. [4] proposed a framework to carry out thermal performance analysis which differs from the conventional method of performance monitoring; their methodology is effective for diagnosis, monitoring, and prognosis during condition-based maintenance (CBM), and they used historical data to predict the expected performance in terms of efficiency. Nikula et al. [5] reported a general data-driven framework for steam generation control with the aim of improving performance in terms of boiler efficiency, which is essentially a function of pressure; they provide the framework with an adaptation mechanism to correct the prediction of expected efficiency during control. A neural network model is a data-driven framework, so it requires experimental data to build the model. Li and Wang [6] proposed a data-driven framework in which five main variables are used and their effect on the performance of boiler combustion is evaluated; they used principal component analysis and the respective contributions of the five variables to the performance parameters of the process, analyzed the effect of each parameter on control performance, and suggested corresponding measures. The response surface method (RSM) was effectively implemented by Heydar et al. [7], who used two input data sets in their neural network model to predict the output.
They argued that boiler optimization is possible through such a neural network model.


Fig. 1 Boiler and heat exchanger pilot plant in advanced process laboratory at COE, Pune

Table 1 Boiler plant configuration

Parameter                  Value          Units
Diameter                   0.303          m
Length                     0.294          m
Volume                     0.021199346    m3
Wet area                   0.301118298    m2
Metal density              7860           kg/m3
Metal mass                 41             kg
Feed water temperature     80             °C
Residence time of steam    10             s

The response surface methodology (RSM) is an effective mathematical and statistical method for modeling and analyzing a process in which the output is affected by multiple variables. The aim of the present work is to provide a data-driven framework that can predict the deviation of the boiler pressure from the actual boiler pressure. This information can be used to obtain improved boiler performance and is also useful for plant operators and supervisors. The proposed framework is developed using the in-house pilot plant boiler shown in Fig. 1, with the dimensions documented in Table 1. Figure 2 shows the schematic of the boiler.

2 Methodology This section reports the identification framework for predicting the boiler pressure. The framework has a system identification part as shown in Fig. 3. The variable ranking approach is effectively used to implement such a framework and is presented here.


Fig. 2 Schematic of boiler plant

Fig. 3 Framework for system identification

Multiple regression and a neural network-based multi-layer perceptron are used to predict the pressure.


2.1 Ranking of Variables

Understanding the dominating relations between variables is key to data analytics. Variable ranking is a mathematical technique to decide the relative dependence of a particular performance parameter on the multiple governing parameters. In this method, Shannon's entropy and the two-variable mutual information are obtained. An expression for the mutual information between discrete random variables is

I(X_i; y) = \sum_{j=1}^{n} \sum_{k=1}^{n} p(X_{ij}, y_k) \ln \frac{p(X_{ij}, y_k)}{p(X_{ij})\, p(y_k)}    (1)

Here p(.) denotes the probability function of the discrete random variables, the index i represents the ith governing variable, and y is the output parameter. The single-vector Shannon entropy is defined as

H(X_i) = -\sum_{j=1}^{n} p(X_{ij}) \ln p(X_{ij})    (2)

Averaging the mutual information between every governing parameter in the subset S and the output parameter gives

I_{stat}(S; y) \approx \frac{1}{m_f} \sum_{X_i \in S} I(X_i; y)    (3)

where m_f is the number of governing parameters within S = \{X_i, i = 1, \ldots, m_f\}. By rearranging,

\hat{H}(S, y) = \sum_{X_i \in S} H(X_i) + H(y) - I_{stat}(S; y)    (4)

The entropy \hat{H}(S, y) is the total of the entropies of the governing variables X_i and the entropy of the output vector y, minus I_{stat}(S; y). The dominance of a governing variable is finally expressed through the similarity metric R:

R(S, y) = 1 - \frac{I_{stat}(S; y)}{\hat{H}(S, y)}    (5)

Values of R vary between 0 and 1. These normalized values decide the dominance of the governing parameters that affect the output variable, which is the boiler pressure in the present case. Values of R near zero indicate strong dependence, whereas values near 1 indicate independence between the governing and output parameters. In the present work, the open-source software WEKA is used to calculate the ranking of the input variables and select the highest-ranking variables for the regression analysis.
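The variable ranking computation is straightforward to reproduce. The sketch below is not the WEKA implementation used in the paper; it is a minimal illustration of Eqs. (1)-(5), assuming the continuous measurements are first discretized into histogram bins (the bin count and function names are choices made here, not taken from the paper).

import numpy as np

def mutual_information(x, y, bins=10):
    """Estimate I(X; y) from samples using 2-D histogram probabilities (Eq. 1)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal p(X)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal p(y)
    nz = p_xy > 0                              # avoid log(0)
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))

def entropy(x, bins=10):
    """Shannon entropy H(X) of a single variable (Eq. 2)."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def similarity_metric(X, y, bins=10):
    """Similarity metric R(S, y) of Eq. 5 for a set of candidate inputs (columns of X)."""
    mi = [mutual_information(X[:, i], y, bins) for i in range(X.shape[1])]
    i_stat = np.mean(mi)                                            # Eq. 3
    h_total = sum(entropy(X[:, i], bins) for i in range(X.shape[1])) \
              + entropy(y, bins) - i_stat                           # Eq. 4
    return 1.0 - i_stat / h_total                                   # Eq. 5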

2.2 Multiple Linear Regression

Multiple linear regression is a mathematical tool for understanding the relation between governing and output variables. More than one regressor is involved in multiple linear regression: a linear equation relates the dependent variable to two or more independent variables. The multiple regression model is expressed as

y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i    (6)

where i = 1, 2, \ldots, n indexes the observations, y is the output, X_{ij} are the governing (input) variables, \beta are the regression coefficients, and \varepsilon is the error term. The prediction of boiler pressure using multiple linear regression is presented in this study.
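As an illustration of how Eq. (6) is fitted in practice, the following sketch uses scikit-learn rather than the WEKA tool employed in the paper; the numeric arrays are placeholder values for demonstration only and are not the pilot-plant data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data for illustration: columns are steam flow rate (m3/s) and steam
# temperature (°C); the target is boiler pressure (MPa). In the paper these values
# come from the pilot-plant log, not from this snippet.
X = np.array([[0.8, 95.0], [1.6, 110.0], [2.3, 128.0], [2.5, 143.0]])
y = np.array([0.305, 0.312, 0.321, 0.330])

model = LinearRegression().fit(X, y)          # estimates beta_0 ... beta_p of Eq. 6
print("intercept (beta_0):", model.intercept_)
print("coefficients (beta_1, beta_2):", model.coef_)
print("predicted pressure:", model.predict([[2.0, 120.0]]))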

2.3 Multi-layer Perceptron Model

The MLP is a category of artificial neural network that comprises multiple layers of perceptrons with a minimum of three layers of nodes between input and output: an input layer, one or more hidden layers, and an output layer. Except for the input nodes, each neuron uses an activation function that maps its weighted inputs to its output. The inputs are fed to the first layer of neurons, whose outputs feed the next layer, and so on. If every layer of a multi-layer perceptron uses a linear activation function, linear algebra shows that any number of perceptron layers can be reduced to an equivalent two-layer input-output model; hence nonlinear activation functions are used in MLPs. These functions were originally developed to model the frequency of action potentials, or firing, of biological neurons. The commonly used activation functions are sigmoids, described by y(v_i) = \tanh(v_i) and y(v_i) = (1 + e^{-v_i})^{-1}. The rectified linear unit (ReLU) is often used to overcome the numerical issues associated with sigmoids. The first sigmoid is the hyperbolic tangent, which ranges between -1 and 1; the other is the logistic function, which ranges between 0 and 1 and is similar in shape. Here, y_i is the performance parameter, or output, of the ith neuron, and v_i is the weighted sum of its input connections. The literature also proposes other activation functions, such as the rectifier and softplus functions; more specialized choices include radial basis functions.
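A minimal multi-layer perceptron regressor of the kind described above can be set up as follows. This is not the WEKA model used in the paper; the hidden-layer size, activation choice, and the synthetic stand-in data are assumptions made purely for illustration.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: [steam flow rate (m3/s), steam temperature (°C)] -> pressure (MPa).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 5, 200), rng.uniform(30, 145, 200)])
y = 0.3 + 0.004 * X[:, 0] + 0.0002 * X[:, 1] + rng.normal(0, 0.002, 200)

# Hidden-layer size, tanh activation, and iteration count are assumptions, not the paper's settings.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                 max_iter=5000, random_state=0),
)
mlp.fit(X, y)
print("predicted pressure:", mlp.predict([[2.5, 140.0]]))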

3 Case Study

In the present work, the framework for boiler pressure prediction is implemented on a laboratory-based pilot plant located in the Advanced Process Laboratory of the College of Engineering Pune (India). The configuration details of the boiler pilot plant are given in Sect. 3.1. The system identification framework is presented as a case study in Sect. 3.2. Section 3.3 provides the model identification and the diagnosis of the control process state.

3.1 Boiler Pilot Plant Configuration

The pilot plant under consideration is designed mainly for research purposes, where it acts as a utility unit for steam generation. It is cascaded with a heat exchanger plant that uses steam and water as the media. The components of the pilot plant include a feed water tank, a boiler drum, and a heating system. Feed water is supplied by a reciprocating pump and enters the boiler drum from the feed water tank. The heat supplied to the boiler is controlled through silicon-controlled rectifier (SCR) firing of the heater. The feed water is heated in the drum, and saturated steam is supplied to the heat exchanger. The parameters used for the boiler drum design are documented in Table 1. Cold water flows through the inner pipe and steam through the outer pipe of the heat exchanger, where the steam transfers heat to the cold water.

3.2 System Identification

The pilot plant steam generator was run for 3.46 h to generate the data used in the current work for developing the system identification model. The entire dataset is used for system identification, from which a model predicting the steam pressure as a function of steam flow rate and steam temperature is obtained. Using this model, the pressure is then predicted as a function of time. The parameters used for training the model were selected using the variable ranking method. The trained model is further used to predict the performance parameters on the testing data. In the present work, the boiler pressure is used as the performance parameter because the efficiency of the steam generator is largely governed by the boiler pressure.


Fig. 4 Variation of a steam temperature and b steam flow rate with time

The implementation of the variable ranking method using WEKA is briefly discussed in this section.

3.2.1 Boiler Data

The in-house boiler with its heater and feed water system is the subject of the case study. The data obtained for this boiler are shown in Fig. 4. It is noticed from Fig. 4a that the temperature shoots up at around 2500 s and then stabilizes. The steam flow rate, however, stabilizes between about 4000 and 5000 s, remains stable up to 10,000 s, and then rises again before stabilizing once more. The variation of the actual pressure with time, obtained through measurements, is shown in Fig. 5.

3.2.2 Variable Ranking

All the parameters that could affect the boiler pressure were tested with the variable ranking method, as documented in Table 2. It is noticed that the steam flow rate and the steam temperature have the highest values of the similarity metric (R); hence, these two parameters were selected to train the regression and multi-layer perceptron models. Similarity metric values close to one are considered most suitable for model training. Thus, only the two highest-ranking parameters, namely steam flow rate and steam temperature, show a strong correlation with the boiler pressure. Scatter plots are used to study the relationship between the governing and performance parameters based on the experimental data; they help in understanding how one parameter varies with respect to another. The pilot plant boiler unit was operated to obtain the data for developing the scatter plots.


Fig. 5 Variation of pressure with time

Table 2 Ranking of variables

S. No.   Parameter           Mean of R   Ranking
1        Steam flow rate     0.957       3
2        Steam temperature   0.957       2
3        Volume of water     0.224       5
4        Volume of steam     −0.224      4

The linear correlations between steam pressure and steam temperature, as well as between steam pressure and steam flow rate, are clearly visible from the scatter plots shown in Fig. 6. However, in the case of the steam temperature, the variation becomes linear only above 90 °C. It is thus noticed that the variables with the highest rank exhibit a linear correlation with the boiler pressure for the boiler studied.

Fig. 6 Scatter plots for highest ranking variable a steam temperature and b steam flow rate

Table 3 Coefficients of regression analysis obtained using WEKA and operating ranges of the variables

Observations                   Regression model results            Multi-layer perceptron model results
Regression coefficients        β0 = 0.0014, β1 = 0.0368            –
Time taken to test the model   2.88 s                              2.84 s
Steam flow rate                Min: 0 m³/s, Max: 5.01 m³/s         Min: 0 m³/s, Max: 5.01 m³/s
Steam temperature              Min: 29.81 °C, Max: 143.61 °C       Min: 29.81 °C, Max: 143.61 °C
Predicted pressure             Min: 0.311 MPa, Max: 0.329 MPa      Min: 0.302 MPa, Max: 0.346 MPa

3.3 Model Identification

The highest-ranking variables were used for developing the state matrices presented in Table 2. The range [0, 1] of each selected variable is discretized, and the expected pressure for every discretized value is obtained. The regression coefficients obtained through WEKA are documented in Table 3. It is observed that the time taken to test the regression model is slightly higher than the time taken by the multi-layer perceptron model. The operating ranges of steam flow rate and steam temperature considered in both models are the same. The estimates of the expected pressure are arranged in a two-dimensional array where each dimension corresponds to one variable (steam flow rate or steam temperature), as shown in Fig. 7. It is noticed from the data that the maximum pressure reaches 0.35 MPa at a steam temperature of about 145 °C and a flow rate of 2.5 m³/s. The variation of the actual pressure is compared with the pressure predicted by multiple regression and by the multi-layer perceptron in Fig. 8. Data covering a total of 12,500 s are used for prediction of the expected pressure. It is observed from the figure that the pressure predicted by the multi-layer perceptron model is close to the actual boiler drum pressure, while the pressure predicted by the regression model overshoots between 9500 and 10,000 s; overall, the regression prediction is unstable in nature. If more extensive data were available for this type of operation, the accuracy of the models would improve further. The error estimates for both the multi-regression and the multi-layer perceptron models are documented in Table 4. The root mean square error is substantially reduced when the multi-layer perceptron model is used for the system identification of the steam generator, as compared to the multi-regression model. The relative absolute error, which indicates the accuracy of the model, is also improved with the multi-layer perceptron model.


Fig. 7 Expected steam pressure at the boiler outlet. Symbols represent the values of pressure at the corresponding steam flow rate and steam temperatures

Fig. 8 Comparison of actual boiler pressure and the pressure predicted using multiple regression analysis and multi-layer perceptron

Table 4 Error estimation of regression model and multi-layer perceptron model

Parameter                 Regression model   Multi-layer perceptron model
Correlation coefficient   0.9359             0.9835
Absolute error            0.0117             0.0041
RMS error                 0.0151             0.0048
Relative error            6.8162%            2.3761%
RRS error                 8.7468%            2.7958%

4 Conclusion

The system identification of a steam generator using a data-driven framework is presented and validated in this work. It is noticed that the steam flow rate and the steam temperature are the most dominant factors affecting the performance of the steam generator in terms of steam pressure. The variable ranking approach is used to decide the dominant parameters for the prediction of steam pressure. Multi-regression and multi-layer perceptron models were implemented for the system identification, and the multi-layer perceptron model provides a much better prediction of the expected pressure. Data-driven frameworks have rarely been applied to steam generators, which are important continuous-process devices. The present work shows that such a framework can be effectively deployed for the system identification and monitoring of steam generators. The system monitoring part of the steam generator is being implemented and will be presented during the conference.

References

1. Astrom KJ, Bell RD (2000) Drum-boiler dynamics. Automatica 36:363–378
2. Taher N, Azizipanah-Abarghooee R, Narimani MR (2012) An efficient scenario-based stochastic programming framework for multi objective optimal micro-grid operation. Appl Energy 99:455–470
3. Sabio N, Pozo C, Guillén-Gosálbez G, Jiménez L, Karuppiah R, Vasudevan V, Sawaya N, Farrell JT (2014) Multi-objective optimization under uncertainty of the economic and life-cycle environmental performance of industrial processes. AIChE J 60:2098–2121
4. Kim H, Na MG, Heo G (2014) Application of monitoring, diagnosis, and prognosis in thermal performance analysis for nuclear power plants. Int J Nucl Eng Technol 46:737–752
5. Riku-Pekka N, Ruusunen M, Kauko L (2016) Data-driven framework for boiler performance monitoring. Int J Appl Energy 183:1374–1388
6. Shizhe Li, Wang Y (2018) Performance assessment of a boiler combustion process control system based on a data-driven approach. Processes 6:200
7. Heydar M, Milad S, Mohammad HA, Kumar R, Shahaboddin S (2019) Modeling and efficiency optimization of steam boilers by employing neural network and response surface method. Mathematics

Track V

An Efficient Obstacle Detection Scheme for Low-Altitude UAVs Using Google Maps

Nilanjan Sinhababu and Pijush Kanti Dutta Pramanik

Abstract Unmanned aerial vehicles (UAVs) have shown great potential in fast shipping and delivery, including delivering emergency support and services to disaster-hit areas (natural or manmade) where manual reach is infeasible. For accurate and effective emergency service delivery at such adverse sites, the UAVs need to fly close to the ground. Due to the low-altitude flight, there may be many stationary obstacles (e.g., trees and buildings) on the path of a UAV. Detecting these obstacles is crucial for successful mission accomplishment and for avoiding crashes. The existing obstacle detection methods limit the flying speed of a UAV due to the latency in processing and analysing the in-flight sensed data. To mitigate this, we propose to equip the UAV with prior information about the obstacles on its trajectory. Where there is an obstacle, the UAV slows down to avoid it; otherwise, it travels at a much higher speed. As an experiment, we fed the UAVs with satellite images from Google Maps. It is observed that the proposed approach improves the overall flying speed of the UAVs to a great extent.

1 Introduction

1.1 Unmanned Aerial Vehicles

Unmanned aerial vehicles (UAVs) are increasingly being used in several applications such as military, government, commercial, agriculture, and recreation. One of the most promising uses of UAVs is in freight. Companies like Amazon are successfully exploring the use of UAVs for express shipping and delivery.

N. Sinhababu Reliability Engineering Centre, Indian Institute of Technology, Kharagpur, India P. K. D. Pramanik (B) Department of Computer Science and Engineering, National Institute of Technology, Durgapur, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_28


UAVs can also be very useful for critical and emergency purposes, such as supporting infrastructure and delivering supplies to affected sites after natural disasters or terrorist attacks, where the infrastructure supply lines are cut or disabled [1]. UAVs can deliver emergency supplies to locations that are not easily reachable or are risky for humans.

1.2 Obstacle Detection in UAVs

For precision delivery and support at emergency sites, and in other applications such as building or terrain mapping, urban development, agriculture, and search and rescue operations, low-altitude UAVs are more suitable [2, 3]. But flying UAVs closer to the ground has an obvious challenge: mitigating natural (trees) and manmade (buildings and towers) obstacles. Several techniques, like LIDAR, optical flow, and stereoscopic vision, have been proposed to detect and avoid obstacles [4–6]. In general, these techniques use some kind of sensor to monitor the flight path of the UAV. The sensed data are continuously processed and analysed by the onboard system-on-chip (SoC) of the UAV to recognise an obstacle. Since the obstacles are detected on-the-fly, these schemes are known as live analysis techniques.

1.3 Limitations of Live Analysis Techniques

The major problem with the live technique is that it limits the speed of the UAV. It is said that UAVs can only "see" 10–100 feet away, which severely limits their flying speed while maintaining the ability to detect obstacles. In fact, the real-time sensing and the onboard processing and analysis of the sensed data involve some latency. If the flying speed is too high for this latency, the UAV will not be able to detect the obstacles correctly. The reduced speed delays the delivery of emergency services and drains more power from the UAV battery.

1.4 Proposed Approach

To minimise the speed loss, we propose to supply the UAV with static information about the obstacles on its trajectory. The UAV travels at a higher speed where there is no obstacle and slows down only when an obstacle approaches. In this paper, we do not propose any avoidance mechanism; we assume that there is an inbuilt obstacle avoidance logic that decides what to do when the UAV encounters an obstacle. The benefit of the proposed approach is that the UAV flight mission is faster and safer.


1.5 Contribution of this Paper

For the experiment in this paper, we considered only trees as obstacles. To achieve our goal and address the limitations, we derived the following objectives:
1. Get Google Maps image data and perform transformations to determine trees.
2. Intelligently tweak missions before uploading them to UAVs to avoid obstacles, considering that drones already have obstacle avoidance sensors.

1.6 Organisation of the Paper

The rest of the paper is organised as follows. Section 2 discusses the related work. Section 3 presents the details of the simulation environment and considerations, data preprocessing, and the experiment. The experimental outcome is observed and discussed in Sect. 4. The paper is concluded in Sect. 5.

2 Related Work

Some of the notable work related to the present work is briefed below.

UAV mission planning: Sampedro et al. [7] designed a scalable and flexible architecture for real-time mission planning and dynamic agent-to-task assignment for a swarm of UAVs. The proposed architecture is evaluated in simulation and real indoor flights, demonstrating its robustness in different scenarios and its flexibility for real-time mission replanning and dynamic agent-to-task assignment. Hadi et al. [8] developed an autonomous system for a mission in which the UAV has to drop payloads at target points that become known only when the UAV is in flight. The results showed that the UAV autonomously flies point to point while searching for the true drop point using a camera; further, integrating the UAV with the ground control system made it possible to know the true point from the ground control station. Ramirez-Atencia and Camacho [9] improved the automation of mission planning and real-time replanning using a human–computer interface (HCI) that facilitates the visualisation and selection of plans to be executed by the vehicles. They propose an extension to QGroundControl that adds a mission designer, permitting the operator to build complex missions with tasks and other scenario items. Ryan and Hedrick [10] proposed a path planning algorithm combined with an off-the-shelf autopilot system. The path planning algorithm was developed keeping in mind the various domains of application. Simulations were performed with a nonlinear UAV model and a commercial autopilot system in the control loop; the desired trajectories were commanded to the autopilot as a series of waypoints. However, it was observed that the UAV was unable to track the trajectories accurately.


Castelli et al. [11] present a novel approach towards enabling safe operation of UAVs in urban areas. Their method uses geodetically accurate dataset images with geographical information system (GIS) data of road networks and buildings provided by Google Maps to compute the shortest path from the start to the end location of a mission. They simulated 54 missions and showed significant improvement in maximising the UAV's standoff distance to moving objects, with a quantified safety parameter over 40 times better than naive straight-line navigation. Birk et al. [12] presented various missions of UAVs in realistic safety, security, and rescue field settings. The UAV is capable of autonomous waypoint navigation using onboard GPS processing, and live video feed from the vehicle is used to create photo maps at the operator end using an enhanced version of Fourier Mellin-based registration. Bielecki et al. [13] identified the problem of implementing a vision system for UAVs in the context of industrial inspection tasks. A syntactic algorithm for two-dimensional object vectorisation and recognition is proposed by the authors, and an algorithm for two-dimensional map recognition is introduced as well. The algorithms have been tested using both synthetic and real satellite image data. Haque et al. [14] focussed on the capability of low-weight and low-cost autonomous UAVs for delivering parcels ordered online. They developed a navigation system using an Android device as its core onboard processing unit and Google Maps to locate and navigate to the destination.

Obstacle detection mechanisms in UAVs: Gageik et al. [15] demonstrated a low-cost solution for obstacle detection and collision avoidance in UAVs using low-cost ultrasonic and infrared range finders. Improved data fusion, inertial, and optical flow sensors are used as a distance derivative for reference. McGee et al. [16] proposed a vision-based obstacle detection system for small UAVs. Obstacles are detected by segmenting the image into sky and non-sky regions, with the non-sky regions classified as obstacles; they showed the successful operation of avoiding an obstacle. Odelga et al. [17] presented a self-sufficient collision-free indoor navigation algorithm for teleoperated multirotor UAVs. The algorithm is able to track detected obstacles based on measurements from an RGB-D camera and a Bin-Occupancy filter capable of tracking an unspecified number of targets. Saha et al. [18] proposed a mathematical model using monocular vision for real-time obstacle detection and avoidance for a low-cost UAV in an unstructured, GPS-denied environment. Model validation was performed with real-time experiments under both stationary and dynamic motion of the UAV during its flight. Zheng et al. [19] developed a velocity estimation method based on polynomial fitting to estimate the position of the lidar as it scans each point and then correct the twisted point cloud. Besides, a clustering algorithm based on relative distance and density (CBRDD) is used to cluster point clouds with uneven density. To prove the effectiveness of the obstacle detection method, both simulation and actual experiments were carried out, and the results show that the method detects obstacles well. Hrabar [20] presented a synthesis of techniques for rotorcraft UAV navigation through unknown environments, using a 3D occupancy map that is updated online using stereo data, to fly safely close to the structures being inspected. Both simulation and real flight demonstrations of successful navigation were provided in the results.


Identification of trees using satellite images: Branson et al. [21] presented a fully automated tree detection and species recognition pipeline that can process thousands of trees within a few hours using publicly available aerial and street view images of Google Maps. Convolutional neural network (CNN) is used to automatically learn the features from publicly available tree inventory data. Experiments show more than 70% accuracy for street trees, assign correct species to > 80% for 40 different species, and correctly detect and classify changes in > 90% of the cases. Wegner et al. [22] focussed on designing a computer vision system that will help us search, catalogue and monitor public infrastructure, buildings and artefacts using map services like Google Maps. Authors explore the architecture, feasibility, and technical challenges of such a system and introduce a solution that adapts state-of-the-art CNN-based object detectors and classifiers. Authors test the proposed system on “Pasadena Urban Trees” dataset and show that combining multiple views significantly improves both tree detection and tree species classification, rivalling human performance. Li et al. [23] presented a novel method to estimate the shade provision of street trees in the downtown area of Boston, Massachusetts using quantification of the sky view factor (SVF) from street-level imagery. Google Street View panoramas were used to represent the street canyon unit and to compute the SVF, and an estimate of the shading effect of street trees is done. The results showed that the street trees in Boston decrease the SVF by 18.52% in the downtown areas.

3 Experiment and Results

3.1 Considerations

For this experiment, we considered the following suppositions:
• The UAVs are used to carry some item from one location to another.
• We considered only straight paths (more details in Sect. 3.9).
• A UAV may encounter different types of obstacles (e.g. buildings, trees, towers) on its way, but in this paper, we limited our experiment to detecting trees only.

3.2 Definitions

In this paper, we used a few specific terms, the definitions of which are given below:
• Mission: The objective provided to a UAV to move from one location to another.
• Default initial mission: The usual path for a UAV between two locations (not considering the proposed method).


• Modified mission: The updated mission that contains information on the locations of possible obstacles and instructions for a safe and robust flight.
• Maximum speed: The maximum speed achieved by the UAV in ideal conditions, i.e. no wind and optimal operational temperature.
• Optimal speed: The speed at which a UAV covers the maximum distance with maximum energy efficiency while detecting all the obstacles.
• Sensor speed: The maximum speed at which the sensors of the UAV can detect and avoid obstacles effectively.

3.3 Simulation Environment

For conducting the simulated experiment, we used the Python 3.6 (64-bit Windows version) environment with the following system specifications:
• OS: Windows 10 Professional Build 17763
• CPU: Intel Core i7-8700K Processor
• RAM: 32 GB DDR4
• GPU: NVIDIA GeForce® GTX 1080 Ti

The coding was done in IPython notebooks throughout the experiment. Some of the algorithms used are already available in the Python 3 environment as PIP packages. Important modules used in the experiment include NumPy and Matplotlib.

3.4 Proposed Methodology

The proposed obstacle detection approach comprises several modules that are executed stepwise, as shown in Fig. 1. Each of them is briefed below:
1. Default mission: To set the mission, the system requires the initial geolocation of the UAV and the final destination. By default, the flight path is initially generated automatically as a straight line between the two coordinates.
2. Get map: This function gets the coordinates of the overall flight path using satellite data and then captures a geolocation image by cropping the required area under the flight path. The labels and roadways are removed in the process.
3. Image preprocessing: Image segmentation-related tasks usually require preprocessing to improve the quality of the image as per requirement. For geolocation image preprocessing, we considered only contrast and brightness optimisation.
4. Segmentation: This is the most important phase of the process, as it detects the possibility of the presence of trees in an area and marks it.


Fig. 1 Proposed methodology

5. Modified mission: This phase modifies the initial plan based on the segmentation that identifies trees, making the mission more robust and improving obstacle avoidance. It tells the flight controller to slow down to the sensor speed when approaching an obstacle.

3.5 Set Initial Mission

For any autonomous UAV mission, the starting position must be set using the current GPS coordinates. Usually, at least eight GPS satellites must be locked to get a good reference for the starting point coordinates. An example showing the initial UAV position (represented by A) in Google Maps is given in Fig. 2a. This UAV has been given an autonomous mission to deliver some item to a point B on the map. Autonomous missions can also be set using multiple waypoints, which can be useful when the area is already known, so that the UAV can follow multiple paths to complete a particular mission. But in most cases, the optimal path is not known in advance. The easiest approach in an unknown environment is to consider a straight path from the starting position to the destination, as shown in Fig. 2b. The blue markers labelled A and B represent the starting and destination points, respectively. The yellow line represents the path of the UAV, with a constant speed throughout the entire mission.


Fig. 2 a Get the current GPS coordinates to initialise mission; b planned mission after setting destination coordinates

3.6 Get Map

To identify trees on the map, we require aerial images of the entire mission area. The Google Maps API provides the necessary aerial images for the given coordinates. An example of the satellite image for the map overlay in Fig. 2b is shown in Fig. 3a. After applying the initial mission and cropping the collected satellite image, the flight path is shown in Fig. 3b. Algorithm 1 describes the entire map collection process.


Fig. 3 a RGB satellite geolocation image; b cropped RGB satellite geolocation image


Algorithm 1: Get map of the flight path
Input: Coordinates of start and destination
Output: Google Maps satellite image
1. Initialise the Google Static Maps API
2. Provide the API key
3. Set the Google Maps URL in Requests
4. Set the desired zoom level of 70% for tree visibility
5. Find the centre coordinates:
   a. Get the coordinates of the starting point
   b. Get the coordinates of the destination
   c. Convert the coordinates (in radians) to Cartesian coordinates for the starting point and the destination
   d. Compute the average of the coordinates
   e. Convert the average coordinate back to latitude and longitude
6. Calculate the total area size, considering a percentage slack factor between the path and the image crop range
7. Initialise the request with parameters: URL, centre, zoom level, API key
8. Save the image
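A rough Python sketch of steps 1-8 is given below. It uses the public Google Static Maps API endpoint with an integer zoom level and a fixed image size, both of which are assumptions (the paper's 70% zoom and slack-factor cropping are not reproduced here), and the API key and output file name are placeholders.

import math
import requests

def midpoint(lat1, lon1, lat2, lon2):
    """Geographic midpoint of start and destination (step 5 of Algorithm 1)."""
    xs, ys, zs = [], [], []
    for lat, lon in ((lat1, lon1), (lat2, lon2)):
        la, lo = math.radians(lat), math.radians(lon)
        xs.append(math.cos(la) * math.cos(lo))
        ys.append(math.cos(la) * math.sin(lo))
        zs.append(math.sin(la))
    x, y, z = sum(xs) / 2, sum(ys) / 2, sum(zs) / 2   # average Cartesian coordinates
    lon_c = math.atan2(y, x)
    lat_c = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat_c), math.degrees(lon_c)

def fetch_satellite_image(start, end, api_key, zoom=17, size="640x640"):
    """Request a satellite tile centred on the flight path via the Static Maps API."""
    lat_c, lon_c = midpoint(*start, *end)
    params = {
        "center": f"{lat_c},{lon_c}",
        "zoom": zoom,                 # integer zoom level; an assumption, not the paper's 70% setting
        "size": size,
        "maptype": "satellite",       # satellite imagery without labels and roads
        "key": api_key,
    }
    r = requests.get("https://maps.googleapis.com/maps/api/staticmap", params=params)
    r.raise_for_status()
    with open("flight_path.png", "wb") as f:
        f.write(r.content)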

3.7 Image Preprocessing

3.7.1 Brightness Enhancement

The images collected from the Google Maps API have three colour channels: red, green, and blue (RGB). Although the imagery has improved greatly compared to the early years, the images have a very low light intensity throughout. Such low intensities cannot be used directly for contrast enhancement, as contrast enhancement requires the images to have a moderate intensity. Moreover, most contrast enhancement techniques tend to reduce the brightness further, resulting in an inferior image for the later segmentation stages. To overcome this issue, a basic brightness optimisation was performed on the image. To adjust the brightness, a positive constant was added to all pixel values in each channel. The constant for a particular channel was calculated from the gap between the channel's current maximum intensity and the maximum possible intensity. The brightness-enhanced geo image captured by the Get Map module is shown in Fig. 4.

3.7.2 Contrast Enhancement

As already discussed, contrast enhancement techniques are prone to reducing image brightness. The image intensity has already been improved by making the raw geo image brighter, but it is still necessary to select a contrast enhancement technique that improves the contrast while keeping the desired level of brightness.


Fig. 4 Brightness enhancement

The segmentation of the image is based on image visibility and contrast; hence, this is a very important stage in this work. The most popularly used technique for improving image contrast is histogram equalisation. However, this technique is not perfectly suitable for certain applications due to its tendency to reduce the mean brightness of the image. Hence, several variations of histogram equalisation have been developed; in particular, the brightness-preserving bi-histogram equalisation technique tends to perform better for images having regions with close intensity and colour profiles. In maps, although grass and trees both have strong green channel values, they should be treated differently for segmentation. The contrast-enhanced geo image is shown in Fig. 5.

Fig. 5 Contrast enhancement
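The brightness and contrast steps can be sketched as follows. The per-channel offset follows one reading of the description above, and OpenCV's CLAHE is used only as a stand-in for the brightness-preserving bi-histogram equalisation named in the text; file names and parameter values are assumptions, not the authors' settings.

import cv2
import numpy as np

def brighten(img):
    """Add a per-channel constant based on the gap to the maximum possible intensity."""
    out = img.astype(np.int16)
    for c in range(3):
        offset = int(255 - out[..., c].max())      # headroom of this channel
        out[..., c] += offset
    return np.clip(out, 0, 255).astype(np.uint8)

def enhance_contrast(img):
    """Contrast boost on the luminance channel (CLAHE used here as a BBHE stand-in)."""
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

img = cv2.imread("flight_path.png")                # cropped satellite image from the Get Map step
pre = enhance_contrast(brighten(img))
cv2.imwrite("flight_path_pre.png", pre)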


Fig. 6 a Identification of trees using tone and texture; b mono-texture preserved and tuned segmentation; c overlay of segmentation and preprocessed satellite image

3.8 Segmentation

3.8.1 Phase I: Identify

The images in Google Maps are contour-like: similarly coloured objects with different heights or depths are represented by lighter and darker shades of the same colour. For example, both trees and grass are coloured green, but grass on the ground is represented by a lighter shade than trees. This contouring is applied across the Google Maps satellite image. The objective of this phase is to identify the darker green shades and mark those portions with linear strokes of a bright colour that is not part of the contour. To perform the operation, the tones in the preprocessed satellite image ranging from shadow to mid-grey are replaced by bright linear strokes of red (foreground colour), and the lighter values by grey (background colour). The balance of the stroke amount, along with the pressure of the stroke, is set to cover the entire area. Higher values for the strokes increase the number and variations of tones drawn with these colours. Higher values for the pressure produce crisper transitions between tones, while lower values give a higher-contrast result. The parameters were tuned to obtain the desired result, as shown in Fig. 6a. These settings were used for the entire set of tests performed, and consistent results were obtained.

3.8.2 Phase II: Channel Filtering

This phase keeps a particular type of texture and removes the rest of the textures present in the image. In the first phase of segmentation, a special texture filter was applied to make the foreground colour and texture different from the rest. The foreground in these images consists of trees, which are high enough to cause a rapid intensity variation. The objective of this phase is to keep the red markers (mono-texture) and remove the rest of the area. The texture extraction is performed by combining the effects of the Emboss and Grain filters.


Dark areas in the image appear as holes in the top layer of the image, which in turn reveal the background colour. Further fine-tuning of the image balance, graininess, and relief is performed to get the desired result, as shown in Fig. 6b.

3.8.3 Phase III: Overlay

In the overlay phase, the channel-filtered image is passed as input, and its background is removed by selecting all the white pixels and setting them transparent. The preprocessed image is then overlaid by replacing all the segmented pixels in the original image with the red channel, as shown in Fig. 6c. This is the final environmental description that the system uses to generate and plan the new mission.
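The three filter-based phases above are tool-specific; as a simplified approximation of their end result, the sketch below marks dark-green pixels in red using an HSV threshold and a small morphological clean-up. The threshold bounds and file names are assumptions, and this is not the authors' exact tone/texture filter chain.

import cv2
import numpy as np

def mark_trees(img_bgr, low=(35, 40, 20), high=(90, 255, 120)):
    """Mark probable tree pixels (dark green shades) in red on the preprocessed image.

    The HSV bounds are assumptions; the paper uses tone replacement plus
    Emboss/Grain texture filtering rather than a plain colour threshold.
    """
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(low), np.array(high))
    # Remove isolated pixels so grass-like noise is not flagged as trees.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    overlay = img_bgr.copy()
    overlay[mask > 0] = (0, 0, 255)            # red markers (BGR)
    return overlay, mask

img = cv2.imread("flight_path_pre.png")
overlay, tree_mask = mark_trees(img)
cv2.imwrite("flight_path_trees.png", overlay)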

3.9 Modifying Mission

From the segmentation, regions with a possibility of trees are identified and marked. There are then two options. Firstly, the UAV can be routed through the portions where there are no trees. Secondly, the UAV moves along the same straight line as before, but with speed variations at the locations where there are trees, so that the onboard obstacle avoidance sensors get sufficient time to perform. The second option seems more appealing due to the several limitations of the first option:
1. More complicated path generation algorithms are required, meaning more processing power and time [24].
2. With more turns, the total distance covered by the UAV may not be optimal.
3. Without prior knowledge of a safe flight path, the straight path is the best option.

Considering the flight path to be straight, the selection of the speed along that path is the most important criterion. The optimal speed of the UAV changes with the payload and the UAV design. We assume that the optimal flight speed and the sensor speed are already known; they are taken as the initial velocity and the final velocity, respectively. The deceleration (a) is determined from a test flight. The distance (s) to be covered to reach the final velocity (v) from the initial velocity (u) is then calculated using Eq. (1):

s = \frac{u^2 - v^2}{2a}    (1)

The calculated distance must be expressed in the form of coordinates, and the default initial mission is updated by adding the coordinates of the obstacles and the coordinate of the starting point of the deceleration of the UAV.


Fig. 7 Modified mission path speed

In Fig. 7, the orange colour represents the deceleration of the drone from its optimal speed to the sensor speed, and the red colour indicates that the UAV is in the obstacle zone and should fly at the sensor speed.
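A minimal sketch of Eq. (1) and the slow-down point computation is given below; the deceleration value, speeds, and obstacle distance are illustrative assumptions, not measured values from the paper.

def braking_distance(u_kmph, v_kmph, decel):
    """Distance (m) needed to slow from optimal speed u to sensor speed v (Eq. 1)."""
    u, v = u_kmph / 3.6, v_kmph / 3.6          # km/h -> m/s
    return (u ** 2 - v ** 2) / (2 * decel)

def slowdown_point(obstacle_dist_m, u_kmph=60, v_kmph=25, decel=2.0):
    """Distance from the start (m) at which deceleration must begin for an obstacle.

    decel (m/s^2), speeds, and the obstacle distance are assumptions here;
    the paper obtains the deceleration from a test flight.
    """
    return max(0.0, obstacle_dist_m - braking_distance(u_kmph, v_kmph, decel))

# Example: an obstacle zone starts 900 m into the mission.
print(braking_distance(60, 25, 2.0))   # about 57 m of braking distance
print(slowdown_point(900))             # begin slowing roughly 843 m from the start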

4 Observations

The problem of segmenting the geo-image captured using the Google Maps API can be solved in three standard ways: supervised segmentation, unsupervised segmentation, and rule-based segmentation. Depending on the application, CNN-based supervised segmentation has proved superior to the others. However, CNNs in general require an ample amount of tagged data to perform well, and Google Maps data currently lack the required tagging information; hence, this method does not perform well here. Therefore, in our experiment, only manual identification and visual inspection of the results were performed. It is observed that the proposed technique provides a very clear segmentation of the trees in the satellite images, without missing any spot as per visual inspection. We compared the flight time of the default mission and the modified mission. For this, the processing time of each stage was logged, but for the analysis, only the time requirements for the get map and segmentation stages were considered; other stages, like preprocessing and mission planning, consumed negligible time. We considered seven missions with different distances. Each mission was tested for five consecutive iterations, the mean results of which are provided in Table 1. Further, the overall time requirement for the missions is shown in Fig. 8.


Table 1 Mission planning time log data

Mission distance (km)   Getting map (s)   Segmentation (s)
1                       2                 17
5                       12                45
10                      24                108
15                      33                204
20                      46                272
25                      57                391
30                      69                532

Fig. 8 Total time for mission planning with varying distance (total time in seconds versus mission distance in kilometres)

It is seen that the mission planning time increases almost linearly with distance. The increase in processing time is due to the fact that the system requires images at a constant 70% zoom level. Increasing the distance also increases the area covered under the zoomed image, which is undesirable for segmentation. Hence, an image stitching technique is applied to all the satellite images captured along the mission path; stitching the images gives the impression of a single image. The optimal speed of drones usually lies between 45 and 70 kmph, while the maximum speed at which most sensors can perform reliable obstacle detection is about 25 kmph. Considering these constraints, the planned mission statistics are analysed as shown in Fig. 9. From the analysis, it can be seen that a UAV flying at the sensor speed (SS) lags behind a UAV completing the mission at the average optimal speed (OS) by at least a factor of two in overall mission time. This impacts not only the security of the UAV but also its performance. The minimum speed at which a UAV would fly equals the obstacle avoidance speed, and the maximum speed equals the optimal speed of that particular UAV. It should be noted that the time required by the UAV to process the environment may vary to a certain extent depending on the total number of sensors, the type of sensors, the onboard SoC, etc. A UAV equipped with a better SoC will be able to perform the onboard computation much faster. Also, the maximum speed and optimal speed can vary depending upon the UAV model.
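The comparison in Fig. 9 can be approximated from Table 1 and the stated speeds. The snippet below assumes the UAV covers the whole distance at a constant speed and ignores the deceleration near obstacles, so it is only a rough check of the trend, not a reproduction of the measured results.

# Rough reproduction of the Fig. 9 comparison: total mission time =
# planning time (get map + segmentation, Table 1) + flight time at a constant speed.
planning_s = {1: 2 + 17, 5: 12 + 45, 10: 24 + 108, 15: 33 + 204,
              20: 46 + 272, 25: 57 + 391, 30: 69 + 532}

for dist_km, plan_s in planning_s.items():
    for speed_kmph in (25, 45, 70):              # sensor speed vs. optimal speeds
        total_min = plan_s / 60 + dist_km / speed_kmph * 60
        print(f"{dist_km:>2} km @ {speed_kmph} km/h: {total_min:5.1f} min")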


Fig. 9 Overall performance statistics (mission time in minutes versus distance in kilometres, comparing optimal speeds of 45 and 70 kmph with the sensor speed)

5 Conclusions and Future Scope

In this paper, we presented a simple and efficient approach to identify the obstacles faced by a low-altitude UAV in its flight path. Instead of depending on the on-the-fly obstacle detection system, which restricts the UAV speed to a limit determined by the capability of the sensors and the onboard SoC, we provide the UAV with information about the probable obstacles on the path beforehand. This helps the UAV maintain its maximum speed throughout the flight except near the obstacles, where it decelerates as an obstacle approaches. For marking the obstacles on the path, Google Maps proves to be a useful and straightforward means. The proposed method saves a significant amount of time, which might be crucial in critical and emergency situations. In this experiment, we considered only trees as obstacles; in reality, there might be different types of obstacles, and this work can be extended to detect all possible obstacles.

References

1. Erdelj M, Król M, Natalizio E (2017) Wireless sensor networks and multi-UAV systems for natural disaster management. Comput Netw 124:72–86
2. Martin P, Payton O, Fardoulis J, Richards D, Yamashiki Y, Scott T (2016) Low altitude unmanned aerial vehicle for characterising remediation effectiveness following the FDNPP accident. J Environ Radioact 151(Part 1):58–63
3. Djimantoro MI, Suhardjanto G (2017) The advantage by using low-altitude UAV for sustainable urban development control. IOP Conf Ser Earth Environ Sci 109(012014)
4. Chen J, Zhou Y, Lv Q, Deveerasetty KK, Dike HU (2018) A review of autonomous obstacle avoidance technology for multi-rotor UAVs. In: IEEE international conference on information and automation (ICIA), Wuyishan, China
5. Pham H, Smolka SA, Stoller SD, Phan D, Yang J (2015) A survey on unmanned aerial vehicle collision. arXiv: 1508.07723
6. Yasin JN, Mohamed SAS, Haghbayan M-H, Heikkonen J, Tenhunen H, Plosila J (2020) Unmanned aerial vehicles (UAVs): collision avoidance systems and approaches. IEEE Access 8:105139–105155
7. Sampedro C, Bavle H, Sanchez-Lopez JL, Fernández RAS, Rodríguez-Ramos A, Molina M, Campoy P (2016) A flexible and dynamic mission planning architecture for UAV swarm coordination. In: International conference on unmanned aircraft systems (ICUAS), Arlington, USA
8. Hadi G, Varianto R, Trilaksono B, Budiyono A (2014) Autonomous UAV system development for payload dropping mission. J Instrum Autom Syst 1(2):72–22
9. Ramirez-Atencia C, Camacho D (2018) Extending QGroundControl for automated mission planning of UAVs. Sensors 7(18):23–39
10. Ryan A, Hedrick J (2005) A mode-switching path planner for UAV-assisted search and rescue. In: 44th IEEE conference on decision and control, Seville, Spain
11. Castelli T, Sharghi A, Harper D, Tremeau A (2016) Autonomous navigation for low-altitude UAVs in urban areas. arXiv: 1602.08141v1
12. Birk A, Wiggerich B, Bülow H, Pfingsthorn M (2011) Safety, security, and rescue missions with an unmanned aerial vehicle (UAV). J Intell Rob Syst 64(1):57–76
13. Bielecki A, Buratowski T, Śmigielski P (2013) Recognition of two-dimensional representation of urban environment for autonomous flying agents. Expert Syst Appl 40(9):3623–3633
14. Haque M, Muhammad M, Swarnaker D, Arifuzzaman M (2014) Autonomous quadcopter for product home delivery. In: International conference on electrical engineering and information & communication technology, Dhaka, Bangladesh
15. Gageik N, Benz P, Montenegro S (2015) Obstacle detection and collision avoidance for a UAV with complementary low-cost sensors. IEEE Access 3:599–609
16. McGee TG, Sengupta R, Hedrick K (2005) Obstacle detection for small autonomous aircraft using sky segmentation. In: IEEE international conference on robotics and automation, Barcelona, Spain
17. Odelga M, Stegagno P, Bülthoff HH (2016) Obstacle detection, tracking and avoidance for a teleoperated UAV. In: IEEE international conference on robotics and automation (ICRA), Stockholm
18. Saha S, Natraj A, Waharte S (2014) A real-time monocular vision-based frontal obstacle detection and avoidance for low cost UAVs in GPS denied environment. In: IEEE international conference on aerospace electronics and remote sensing technology, Yogyakarta, Indonesia
19. Zheng L, Zhang P, Tan J, Li F (2019) The obstacle detection method of UAV based on 2D lidar. IEEE Access 7:163437–163448
20. Hrabar S (2008) 3D path planning and stereo-based obstacle avoidance for rotorcraft UAVs. In: IEEE/RSJ international conference on intelligent robots and systems, Nice, France
21. Branson S, Wegner J, Hall D, Lang N, Schindler K, Perona P (2018) From Google maps to a fine-grained catalog of street trees. ISPRS J Photogramm Remote Sens 135:13–30
22. Wegner JD, Branson S, Hall D, Schindler K, Perona P (2016) Cataloging public objects using aerial and street-level images-urban trees. In: The IEEE conference on computer vision and pattern recognition, Las Vegas, USA
23. Li X, Ratti C, Seiferling I (2018) Quantifying the shade provision of street trees in urban landscape: a case study in Boston, USA, using Google street view. Landsc Urban Plan 169:81–91
24. Prasetia AS, Wai RJ, Wen YL, Wang Y (2019) Mission-based energy consumption prediction of multirotor UAV. IEEE Access 7:33055–33063

Estimating Authors’ Research Impact Using PageRank Algorithm

Arpan Sardar and Pijush Kanti Dutta Pramanik

Abstract Citation count is the most popular and straightforward way to assess the influence of a research paper and the credentials of its authors. But this one-dimensional statistic alone does not truly reflect the research impact of the author. It may happen that a researcher scores low on the popularly used citation-based metrics such as the h-index and i-10 index but has a great impact, directly or indirectly, on his or her research field. To address this problem, we used Google's PageRank algorithm to obtain a more meaningful publication ranking and to quantify the author's impact in a citation network. The PageRank algorithm embodies the reasonable notion that citations from more important publications should contribute more to the ranking of the cited paper and its author(s) than those from less important ones. Applying the PageRank algorithm to a paper-to-paper citation network, we calculated the PageRank values of the research papers. Further, we quantified authors' impact based on (a) shared cumulative PageRank values (based on their authored papers' PageRank values) and (b) shared cumulative citation counts, and compared the two outcomes. We also compared our PageRank-based author ranking with the h-index-based author ranking.

A. Sardar · P. K. D. Pramanik (B)
Department of Computer Science and Engineering, National Institute of Technology, Durgapur, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Lecture Notes on Data Engineering and Communications Technologies 71, https://doi.org/10.1007/978-981-16-2937-2_29

1 Introduction

In the academic and research community, a researcher's competence and reputation are typically assessed and appraised through their research publications. The publication credentials are measured by different factors such as the number of publications, the quality of the journals where the articles are published, how many papers authored by other researchers have cited the papers (citations received), etc. While the citation count gives a statistical measure of the impact of a publication, it often fails to provide a full picture of the publication's influence. Due to career pressure in academics and research, citation counts are often manipulated (not necessarily with ill intent) in the form of self-citations (X cites many of X's older papers whenever X writes a new paper) or buddy citations (X cites Y's paper and Y cites X's in return).


This opens up the obvious question: is it reasonable to evaluate papers and journals solely based on the number of citations? In fact, considering only the citation counts is a one-dimensional measurement where each citation is treated equally. But what if a paper is cited by a low-grade journal versus a very high-grade journal? How should this be differentiated and evaluated? One option is to use a methodology similar to the PageRank algorithm [1] used by Google to rank and index websites, applied to citation networks or similar weighted and directed networks. In fact, SJR uses the PageRank algorithm to assess the impact of papers more sensibly. Similarly, to assess the research impact of a researcher, there exist several well-known techniques such as the h-index [2], i-10 index, g-index [3], and DS index [4]. But like the journal ranking methods, these techniques also rely heavily on citation counts alone; in other words, the more the citations, the greater the impact of the researcher. But this may not be so straightforward: along with other factors, the quality of the citations should also be considered. In this paper, we propose a PageRank-based approach to estimate the impact of researchers. We first calculate the rank of the papers and use this ranking to rank their authors, also considering the citation impact. The algorithm naturally takes into account the following two factors:
• The effect of receiving a citation from a more important paper should be greater than that of a citation from a less popular one;
• Citation links coming from a paper with a long reference list should count less than those coming from one with a short list. In other words, the importance of a paper should be divided over the number of references that inspired the line of research.
The rest of the paper is organised as follows. Section 2 mentions some related works. Section 3 presents the detailed methodology for calculating the ranks of the research papers and their authors using the PageRank algorithm. The experimental results and their analysis are presented in Sect. 4. Section 5 concludes the paper.

2 Related Work

Complex network analysis has attracted considerable interest in recent years and plays an important role in many disciplines, including citation networks. The introduction of the PageRank algorithm by Google for ranking web pages brought a paradigm shift in ranking algorithms for citation networks, amongst others. There have been quite a few studies in the past where the PageRank algorithm has been used to quantify authors' impact. Ding et al. [5] introduced a PageRank-based author ranking method. They considered the 108 most cited authors from the domain of information retrieval and ranked them based on PageRank with varying damping factors.


Singh et al. [6] introduced a PageRank-based efficient paper ranking algorithm. Ding [7] pointed out the close relationship between paper and author ranking methods in a broader perspective and showed that it is better if we can rank both of them in parallel. Paul et al. [8] worked on the temporal analysis of author ranking using a citation-collaboration network. Dey et al. [9] proposed a DEA-based ranking method that distinctively ranks the authors. Unlike other indexing schemes such as h-index, i-10 index, DS index and g-index, the proposed method is capable of impacting the ranking even for a single citation addition. Senanayake et al. [10] proposed a new metric based on the PageRank algorithm where they used both a simulated system of authors and real-world databases to establish its comparative merits. They primarily focused on comparing PageRank-index with h-index. Fiala and Tutoky [11] proposed a new variant of h-index, PageRank-index. Their PR-index was developed based on both h-index and PageRank to evaluate an author’s impact from an objective point of view. They replaced the h-index’s consideration of citation counts with the PageRank score. In our study, taking the PageRank algorithm into account, we proposed a different approach to quantify the authors’ research impact and rank them in a paper-to-paper citation network. Our work is closely related to that of the work by Ding et al. [5]. However, in our study, we worked on a much larger dataset and quantified authors’ impact based on their papers’ shared PageRank values counted cumulatively and ranked them accordingly. We observed some interesting outcomes with this methodology.

3 Methodology

3.1 Dataset

We used the InspireHEP¹ database for the experiment. Upon extracting the downloaded .zip file, a 1.90 GB .json file containing the metadata of 1,362,496 publications was acquired. An example of the metadata for a record in the dataset is shown in Table 1. We preprocessed the dataset to remove unwanted records: we discarded papers having no citations or references and then dropped fields such as 'abstract', 'free_keywords', 'standardized_keywords' and 'creation_date' from each record for this experiment.

¹ http://old.inspirehep.net.


Table 1 Example of a record from the InspireHEP dataset

Attribute               Value
free_keywords           []
standardized_keywords   []
Citations               [107392, 48129, 72835, 50824, 1436908, 114955, 1190417, 51403, … 12396, 52333, 54513, 46579, 1376629, 68598, 1439995, 663146, 1443029]
Recid                   51
Title                   'Neutron-Proton Scattering Below 20-meV'
References              [47202, 43619, 47204, 46245, 47400, 45692, 1669570, 48175, 458288, 3218, 48052, 46838, 9144, 1476836, 46874, 47483, 46684, 47418, 46885]
Abstract                'Critical examination and analysis of existing … might suffice; this minimal program is briefly discussed'
Authors                 ['Noyes, H. Pierre']
creation_date           '1963'
Co-authors              []
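A minimal sketch of this preprocessing step is given below. It assumes the metadata has already been parsed into a list of records whose field names match Table 1; the file name, the jsonlite call and the field names are illustrative assumptions, not the authors' actual code.

    library(jsonlite)

    # Parse the extracted metadata file into a list of records (assumed structure:
    # a JSON array with one object per publication, fields as in Table 1)
    records <- fromJSON("inspirehep_metadata.json", simplifyVector = FALSE)

    # Keep only papers that have both citations and references
    has_links <- function(r) length(r$Citations) > 0 && length(r$References) > 0
    records <- Filter(has_links, records)

    # Drop the fields not used in the experiment
    drop_fields <- c("Abstract", "free_keywords", "standardized_keywords", "creation_date")
    records <- lapply(records, function(r) r[setdiff(names(r), drop_fields)])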

3.2 PageRank Calculation of a Research Paper

3.2.1 The PageRank Algorithm

According to Google: 'PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.' In a nutshell, PageRank centrality gives higher weights to nodes which have been endorsed by a large number of nodes, which in turn have been endorsed by many others, and so on. So, if a node X is endorsed by some other node Y of great importance (i.e. Y has been endorsed by many other significant nodes), then node X is assigned a heavier weight. The basic structure on which the PageRank algorithm works best is a directed network graph.

The following is a simple example to briefly review how PageRank works. Suppose there are three nodes (pages), A, B and C. Initially, the total PageRank (equal to 1) is divided equally amongst the three nodes, i.e. PR(A) = PR(B) = PR(C) = 1/3. Now suppose B has a link to A and to C. This implies PR(A) = PR(C) = 1/3 + PR(B)/L(B), where L(B) is the total number of nodes linked by B. In general, if a node has m outlinks, it transfers 1/m of its importance to each of the nodes it references. So, A and C have the same importance in this small network, and B is less important as it has no incoming edges referring to it. For example, in Fig. 1, node D has three outlinks, so it transfers 1/3 of its importance to each of A, C and E. The initial importance of each node is uniformly distributed as 1/5.

Estimating Authors’ Research Impact Using PageRank Algorithm

475

Fig. 1 Directed network with five nodes

As each incoming link increases the PageRank value of a node, the value is updated by adding the importance of the incoming links to the current value, and this process is iterated until all the values converge to a steady value. The calculation of PageRank for each node is given in Table 2, and the PageRank values are given in Table 3. It is observed that convergence is achieved after 21 iterations, and thus we obtain the ranking of the nodes in terms of their importance in this network as B, C, D, A and E. The details of the PageRank calculation can be found in [12]. In short, the algorithm shows that nodes which are referred to by other nodes, which are themselves referred to by many others, are more important than nodes that are not referred to by many.

Table 2 Calculations for updating PageRank values of nodes

I0: A0 = B0 = C0 = D0 = E0 = 1/5 = 0.20
I1: A1 = D0/3 + E0 = 0.27;  B1 = A0 + C0 = 0.40;  C1 = B0/2 + D0/3 = 0.17;  D1 = B0/2 = 0.10;  E1 = D0/3 = 0.07
I2: A2 = D1/3 + E1 = 0.10;  B2 = A1 + C1 = 0.44;  C2 = B1/2 + D1/3 = 0.23;  D2 = B1/2 = 0.20;  E2 = D1/3 = 0.03
I3: A3 = D2/3 + E2 = 0.10;  B3 = A2 + C2 = 0.33;  C3 = B2/2 + D2/3 = 0.28;  D3 = B2/2 = 0.22;  E3 = D2/3 = 0.07
…
IN (convergence achieved): AN = DN−1/3 + EN−1;  BN = AN−1 + CN−1;  CN = BN−1/2 + DN−1/3;  DN = BN−1/2;  EN = DN−1/3
The values at iteration N are the final PageRank values of A, B, C, D and E, respectively.


Table 3 PageRank values being updated through iterations until convergence

Iteration   A          B          C          D          E
0           0.200000   0.200000   0.200000   0.200000   0.200000
1           0.266667   0.400000   0.166667   0.100000   0.066667
2           0.100000   0.433333   0.233333   0.200000   0.033333
3           0.100000   0.333333   0.283333   0.216667   0.066667
4           0.138889   0.383333   0.238889   0.166667   0.072222
…           …          …          …          …          …
20          0.124999   0.375004   0.249998   0.187496   0.062502
21          0.125001   0.374997   0.250001   0.187502   0.062499
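As a small illustration, the following base R sketch reproduces the iteration of Tables 2 and 3. The edge set of the five-node network is inferred from the update rules in Table 2 (the figure itself is not reproduced here), and, as in the worked example, no damping factor is used.

    # Power-iteration sketch of the five-node example (Tables 2 and 3).
    # Edge set inferred from Table 2: A->B, B->C, B->D, C->B, D->A, D->C, D->E, E->A.
    nodes <- c("A", "B", "C", "D", "E")
    edges <- matrix(c("A","B",  "B","C",  "B","D",  "C","B",
                      "D","A",  "D","C",  "D","E",  "E","A"),
                    ncol = 2, byrow = TRUE)

    pr <- setNames(rep(1 / length(nodes), length(nodes)), nodes)  # uniform start: 1/5 each
    outdeg <- table(factor(edges[, 1], levels = nodes))           # out-degree of every node

    for (it in 1:21) {
      new_pr <- setNames(rep(0, length(nodes)), nodes)
      for (k in seq_len(nrow(edges))) {
        src <- edges[k, 1]; dst <- edges[k, 2]
        # each node passes 1/outdegree of its current importance along every outlink
        new_pr[dst] <- new_pr[dst] + pr[src] / outdeg[src]
      }
      pr <- new_pr
    }
    round(pr, 6)  # ~ A = 0.125, B = 0.375, C = 0.25, D = 0.1875, E = 0.0625 (cf. Table 3)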

3.2.2 Applying the PageRank Algorithm

To quantify the scientific impact of papers by taking advantage of the PageRank algorithm, we created a directed network graph in which each node represents a paper: if paper i cites paper j, a directed edge (i, j) is created, and similarly an edge (k, i) is created if paper k cites paper i. The collection of edges thus forms a directed network in which every edge is of the form citing_paper → cited_paper. We applied the PageRank algorithm to this directed graph and sorted the results on the basis of the PageRank scores. The papers, ranked by their respective PageRank scores, are then saved in a comma-separated values (CSV) file with the attributes 'Title', 'Authors', 'Citations' and 'PageRank'. Table 4 shows the top-20 papers (from the InspireHEP dataset) according to their calculated PageRank.
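A minimal sketch of this step is shown below, assuming the preprocessed records have been flattened into an edge list of recids; the file names, column names and the use of the igraph package are illustrative assumptions rather than the authors' exact implementation.

    library(igraph)

    # Edge list of recids, one row per citation (assumed file and column names):
    # citing, cited  ->  directed edge citing_paper -> cited_paper
    edge_list <- read.csv("citation_edges.csv", colClasses = "character")

    g  <- graph_from_data_frame(edge_list, directed = TRUE)
    pr <- page_rank(g, directed = TRUE)$vector         # PageRank score per paper (recid)

    ranked <- data.frame(recid = names(pr), PageRank = pr, stringsAsFactors = FALSE)
    ranked <- ranked[order(-ranked$PageRank), ]        # highest-ranked papers first

    # Join with the Title/Authors/Citations metadata before saving, as described above
    write.csv(ranked, "paper_pagerank.csv", row.names = FALSE)
    head(ranked, 20)                                   # cf. Table 4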

3.3 Calculating Authors' Impact

Based on the calculated PageRank values of the papers, we calculated the authors' impact by the following two methods.

Calculating authors' PageRank-based impact: To quantify authors' impact on the basis of PageRank in the entire citation network, each paper's PageRank is divided equally amongst all its authors and co-authors. In this way, every author and co-author gets an equal share of the PageRank value of that particular publication, and when the same individual is associated with another publication, his/her current PageRank value increases cumulatively by the new publication's shared PageRank value, and so on. In this way, the cumulative PageRank values for all the authors are calculated. Based on the results, we plotted a bar graph showing the top-30 impactful authors in terms of their cumulative PageRank values.

Estimating Authors’ Research Impact Using PageRank Algorithm

477

Table 4 Top-20 papers ranked based on PageRank

Title | Author(s) | Citations | PageRank
Partial symmetries of weak interactions | ['Glashow, S. L.'] | 7989 | 0.00963840
Dynamical model of elementary particles based on an analogy with superconductivity. 1 | ['Nambu, Yoichiro', 'Jona-Lasinio, G.'] | 5304 | 0.00868591
Unitary symmetry and leptonic decays | ['Cabibbo, Nicola'] | 6297 | 0.00678616
On gauge invariance and vacuum polarisation | ['Schwinger, Julian S.'] | 4928 | 0.00566731
The S matrix in quantum electrodynamics | ['Dyson, F. J.'] | 711 | 0.00413933
Quantised singularities in the electromagnetic field | ['Dirac, Paul Adrien Maurice'] | 2091 | 0.00407080
The radiation theories of Tomonaga, Schwinger and Feynman | ['Dyson, F. J.'] | 453 | 0.00400779
Mach's principle and a relativistic theory of gravitation | ['Brans, C.', 'Dicke, R. H.'] | 3309 | 0.00358553
Consequences of Dirac's theory of positrons | ['Heisenberg, W.', 'Euler, H.'] | 1784 | 0.00332409
Theory of strong interactions | ['Sakurai, J. J.'] | 937 | 0.00321739
Symmetries of baryons and mesons | ['Gell-Mann, Murray'] | 1588 | 0.00309483
Symmetry breaking through Bell–Jackiw anomalies | ['t Hooft, Gerard'] | 3639 | 0.00308335
Invariant theoretical interpretation of interaction | ['Utiyama, Ryoyu'] | 980 | 0.00300105
Foundations of the new field theory | ['Born, M.', 'Infeld, L.'] | 1326 | 0.00291870
Problem of strong P and T invariance in the presence of instantons | ['Wilczek, Frank'] | 3411 | 0.00290888
μ → eγ at a Rate of one out of 10⁹ Muon Decays? | ['Minkowski, Peter'] | 3685 | 0.00289482
Remarks on the unified model of elementary particles | ['Maki, Ziro', 'Nakagawa, Masami', 'Sakata, Shoichi'] | 3900 | 0.00289384
Computation of the quantum effects due to a four-dimensional pseudoparticle | ['t Hooft, Gerard'] | 4039 | 0.00282572
Theory of superconductivity | ['Bardeen, John', 'Cooper, L. N.', 'Schrieffer, J. R.'] | 1613 | 0.00277212
On massive neutron cores | ['Oppenheimer, J. R.', 'Volkoff, G. M.'] | 1577 | 0.00259969


Calculating authors’ citation count-based impact: To quantify authors’ impact on the basis of citation counts in the entire citation network, we made sure that each paper’s citation count gets equally divided amongst all its authors and co-authors. By doing that, all the authors and co-authors get an equal share of citation count value for that particular publication, and when the same individual is associated with another publication, his/her already existing citation count value increases cumulatively with the later publication’s shared citation count value and so on. In this way, the cumulative citation count values for all the authors are calculated. Based on the results, we plotted a bar graph showing the top-30 impactful authors in terms of their cumulative citation counts.

4 Results and Analysis

4.1 Top-20 Papers Based on Their PageRank Values

An interesting observation can be drawn from Table 4: while the most highly cited papers have definitely been impactful, a PageRank-based measure of impact captures something different from, and more interesting than, citation counts alone. For example, in our output file, the fifth paper, 'The S matrix in quantum electrodynamics' by Dyson, F. J., has only 711 citations, while the seventh paper, 'The Radiation Theories of Tomonaga, Schwinger, and Feynman', by the same author, has only 453 citations. These counts are considerably lower than those of the other top publications in our output. How can this be interpreted? It is certainly not an anomaly. 'The S matrix in quantum electrodynamics' is a breakthrough paper in its field; in fact, Dyson, F. J. is mentioned on the Wikipedia page for 'S matrix' itself. So, it is fair to state that even though the paper does not have an overwhelming number of citations, it still deserves higher importance due to its breakthrough nature. Papers like this attract the attention of prominent researchers working in the same domain and thus get cited by important authors. Moreover, according to Wikipedia, Freeman John Dyson (Dyson, F. J.) was an English-American theoretical physicist and mathematician known for his work in quantum electrodynamics, solid-state physics, astronomy and nuclear engineering; the Dyson series in physics is even named after him. His works get cited by other prominent authors, and for that reason our evaluation could assign them their justified credit despite their lower citation counts. 'The Radiation Theories of Tomonaga, Schwinger, and Feynman' is also a very famous publication, so even though it does not have an impressive citation count, it is not surprising that the paper is still among the top results, because of the significance of the thesis stated in the paper as well as the impact of its author, Dyson, F. J., who is widely renowned in his field.

Estimating Authors’ Research Impact Using PageRank Algorithm

479

So, here we see that the PageRank algorithm is able to determine the scientific impact of publications more accurately, not merely by counting a paper's citations but also by taking the importance of the papers citing it into consideration.

4.2 Top-30 Authors Based on PageRank Values

Figure 2 shows the top-30 impactful authors in terms of their shared cumulative PageRank values. It is very interesting to notice that these authors are not ordered by their citation counts; in fact, this result reflects the ordering of authors in terms of their reputation. The first two authors, S. L. Glashow and Julian S. Schwinger, are Nobel laureates. Following them, F. J. Dyson is also a highly reputed physicist. Steven Weinberg is a Nobel laureate as well, along with many others in this list who have received this prestigious honour. Out of the top-30 impactful authors, twelve physicists are Nobel laureates. Even though some of them might not have the highest citation counts, they are still the most impactful authors in the entire citation network. So, our method picks out highly influential scientists, including those with the highest impact, and it is notable that the method proves effective at identifying Nobel Prize winners, for instance. This is indeed an interesting outcome, in contrast to identifying top authors by relying on citation counts alone.

Fig. 2 Top-30 authors based on PageRank values


Fig. 3 Top-30 authors based on citation counts

4.3 Top-30 Authors Based on Citation Counts

Figure 3 shows the top-30 impactful authors in terms of their shared cumulative citation counts. S. L. Glashow tops this chart too, as he has the highest cumulative citation count in the network. However, the authors following him are ordered differently from the previous chart, as this chart is based purely on the cumulative citation counts of all the authors in the network; their reputations are not taken into consideration. If an author is part of many papers having significant citation counts, then he/she is expected to be ranked higher in this scheme. If we analyse Figs. 2 and 3 further, we notice that, apart from the ordering, there is not much change in the list of top authors. This is reasonable, because a reputed author is expected to receive more citations for his/her work than others, and an author with significant citation counts is most likely an eminent author in his/her field of study. So, it is intuitive that a strong correlation exists between these two attributes.

4.4 Relation Between Authors' Cumulative Citation Counts and Cumulative PageRank

After determining the authors' impact in the network, in order to draw a conclusion, we checked to what degree the authors' cumulative citation counts were linearly related to their cumulative PageRank values. We therefore calculated the Pearson correlation between them and visualised their relation through a scatter plot, as shown in Fig. 4.

Estimating Authors’ Research Impact Using PageRank Algorithm

481

Fig. 4 Authors’ cumulative citation counts versus cumulative PageRank

It is observed that the correlation between cumulative PageRank and cumulative citation counts is very high, i.e. 0.91807. This outcome is consistent with our intuition about the relationship between these two metrics.
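A short sketch of this check, assuming a data frame 'author_impact' with one row per author and columns 'cum_citations' and 'cum_pagerank' holding the cumulative shared values described earlier (the names are illustrative):

    # Pearson correlation and scatter plot of the two author-level metrics (cf. Fig. 4)
    cor(author_impact$cum_citations, author_impact$cum_pagerank, method = "pearson")

    plot(author_impact$cum_citations, author_impact$cum_pagerank,
         xlab = "Cumulative citation count", ylab = "Cumulative PageRank")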

4.5 Comparing PageRank-Based Author Impact with h-index

The h-index is one of the most popular yet simple metrics, reported by Google Scholar, for evaluating researchers based on the citation counts of their articles. The h-index is defined as the maximum value of h such that the given author/journal has published h papers that have each been cited at least h times. For example, if a researcher has five publications A, B, C, D and E with 10, 8, 5, 4 and 3 citations, respectively, the h-index is equal to 4, because the fourth publication has four citations and the fifth has only 3. In contrast, if the same publications have 25, 8, 5, 3 and 3 citations, then the h-index is 3, because the fourth paper has only three citations. These two examples can be written as:

f(A) = 10, f(B) = 8, f(C) = 5, f(D) = 4, f(E) = 3  ⇒  h-index = 4
f(A) = 25, f(B) = 8, f(C) = 5, f(D) = 3, f(E) = 3  ⇒  h-index = 3

If the function f is ordered in decreasing fashion from the largest value to the lowest one, the h-index can be computed as max{i ∈ ℕ : f(i) ≥ i}.

The h-index depends only on citation counts and is independent of any subject- or domain-specific point of view. In that respect, our PageRank-based evaluation excels, as it captures the importance of an author based on the importance of the referring authors, and a paper being cited by other important authors suggests that their subject/domain of research is closely related. Table 5 lists the top-10 authors based on their PageRank values, along with their h-index values. It can be seen that the two rankings differ considerably. Unlike the h-index, our evaluation does not depend on the number of publications an author has published or the number of citations received; rather, the PageRank-based ranking takes into account softer, less obvious factors, which is precisely what the PageRank algorithm is designed to capture. The h-index is also particularly harsh on newcomers, since it is based on long-term observations of an author's career path, and thus newcomers do not get high h-indexes.
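A minimal base R sketch of the h-index computation described above; the two calls reproduce the worked examples.

    # h-index = max{ i : f(i) >= i } for f sorted in decreasing order
    h_index <- function(citations) {
      f <- sort(citations, decreasing = TRUE)
      sum(f >= seq_along(f))
    }

    h_index(c(10, 8, 5, 4, 3))   # 4, as in the first example
    h_index(c(25, 8, 5, 3, 3))   # 3, as in the second example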


Table 5 Comparing PageRank- and h-index-based rankings of the top-10 authors Authors

PageRank Value

h-index Ranking

Value

Ranking

Glashow, S. L.

0.011393227

1

69

3

Schwinger, Julian S

0.010044943

2

52

6

Dyson, F. J.

0.008603087

3

25

9

Weinberg, Steven

0.008460972

4

108

1

Nambu, Yoichiro

0.007415281

5

32

8

Cabibbo, Nicola

0.006941089

6

56

5

‘t Hooft, Gerard

0.006068329

7

59

4

Gell-Mann, Murray

0.005384869

8

47

7

Jona-Lasinio, G.

0.004342957

9

11

10

Wilczek, Frank

0.004151247

10

105

2

If a newcomer publishes a breakthrough or a really good paper, he/she is expected to get cited by many other important researchers working in the same domain, and thus he/she will also get a fair share of the importance of the authors citing his/her work. So, an author's impact can be quantified purely on the basis of the quality of the publications, irrespective of citation counts.

5 Conclusions and Further Scope

Methodologies based on Google's PageRank algorithm hold great promise for quantifying the impact of scientific publications and authors. They provide a meaningful extension to traditionally used measures, such as the total number of citations, the impact factor and the h-index. The PageRank algorithm embodies the reasonable notion that citations from more important publications should contribute more to the rank of the cited paper, and in turn of its authors, than citations from less important ones. In this paper, we ranked research papers using the PageRank algorithm, then quantified the authors' overall impact in a citation network and observed some interesting results. A further scope of this study is to design a recommendation system for faculty hiring and promotion committees to help them make apt decisions. Such a recommendation system may include features or indicators for predicting highly ranked papers or highly impactful authors, validated on held-out test data and taking into consideration the attributes discarded in our experiment, in order to support smarter and better decisions.

Estimating Authors’ Research Impact Using PageRank Algorithm

483

References

1. Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab
2. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569–16572
3. Egghe L (2006) Theory and practise of the g-index. Scientometrics 69(1):131–152
4. Farooq M, Khan HU, Iqbal S, Munir EU, Shahzad A (2017) DS-Index: ranking authors distinctively in an academic network. IEEE Access 5:19588–19596
5. Ding Y, Yan E, Frazho A, Caverlee J (2009) PageRank for ranking authors in co-citation networks. J Am Soc Inf Sci Technol 60(11):2229–2243
6. Singh AP, Shubhankar K, Pudi V (2011) An efficient algorithm for ranking research papers based on citation network. In: 3rd conference on data mining and optimization (DMO), Putrajaya, Malaysia
7. Ding Y (2011) Scientific collaboration and endorsement: network analysis of coauthorship and citation networks. J Informetrics 5(1):187–203
8. Paul PS, Kumar V, Choudhury P, Nandi S (2015) Temporal analysis of author ranking using citation-collaboration network. In: 7th international conference on communication systems and networks (COMSNETS), Bangalore, India
9. Dey AK, Pramanik PKD, Choudhury P, Bandopadhyay G (2020) Distinctive author ranking using DEA indexing. Qual Quant
10. Senanayake U, Piraveenan M, Zomaya A (2015) The PageRank-index: going beyond citation counts in quantifying scientific impact of researchers. PLoS ONE 10(8):e0134794
11. Fiala D, Tutoky G (2017) PageRank-based prediction of award-winning researchers and the impact of citations. J Informetrics 11(4):1044–1068
12. Dai K (2009) PageRank lecture note, 22 June 2009. [Online]. Available: https://www.ccs.neu.edu/home/daikeshi/notes/PageRank.pdf [Accessed 15 Nov 2020]

Research Misconduct and Citation Gaming: A Critical Review on Characterization and Recent Trends of Research Manipulation

Joyita Chakraborty, Dinesh K. Pradhan, and Subrata Nandi

Abstract Research integrity is under threat. Apart from traditional research malpractices, which include fabrication, falsification, and plagiarism, recent reports reflect intentionally biased citations and other complex malpractices used to inflate impact factors mutually (excessive self-cites, citation stacking, cartels, cabals, and rings). Such journals have been blacklisted annually by the Thomson Reuters firm since 2009. This paper highlights and categorizes all possible cases of research misconduct. It aims to create awareness of the urgent need to introduce new monitoring and evaluation standards based on advanced computational intelligence techniques. Further, existing metrics, traditionally based upon citation and publication counts as key quantifiers, need to be redefined.

1 Introduction

Scholarly publishing is at the heart of academic research. Research integrity is upheld by equitable and ethical behavior at all stages of the research process: from problem formulation, conducting experiments and collecting data, accurately describing the methodology, data and results, citing correct references, and choosing an appropriate venue, to a fair peer-review and editorial process up to final publication. However, recent studies have reported several author, editor, and journal misconducts [1–4], going beyond the traditional questionable research practices that mainly include fabrication, falsification, and plagiarism.

A sharp rise in the number of articles published on research integrity between 1982 and 2019 reflects the community's growing concern. Considering the Web of Science (WoS) data¹, 10 articles were published annually on the 'Research Integrity' topic in 1985.

¹ https://clarivate.com/webofsciencegroup/solutions/web-of-science/.


By 2019, this had risen to 200 articles per year. The minor cases include strategically adding superfluous references, either to oneself or to a partner researcher, whereas the major issues include fabricating data and results; the most devastating is the tampering of clinical trial results [5].

One reason for such an abrupt increase in research misconduct cases is tremendous publication and citation pressure [1, 6]. The entire evaluation system in academia is based upon publication and citation counts as key quantifiers. This includes author ranking indices such as the h-index, g-index, i-index and C3-index, and journal ranking indices such as the impact factor and immediacy index. Decisions on hiring, promotion, funding and the ranking of authors, journals, and institutes are based upon these single-point metrics. The pressure to perform sometimes lures an author or editor into several malpractices: more publications, more citations, more grant proposals, more popularity, and more power are the crucial seductions for gaming the system. Recently, China has approved reforms, such as using more comprehensive evaluation systems rather than single-point metrics [7]. Moreover, with exponential growth in all scientific entities (publications, researchers, journals, conferences, institutes, and interdisciplinary fields of study), the complexity of detecting such anomalies increases ten-fold [6].

We present an early example. Four Brazilian journals were blacklisted by the Thomson Reuters indexing firm in 2011 and tagged as a case of 'citation stacking' [8]. Seven review papers were published, adding hundreds of superfluous references to each other's journals. Mauricio Rocha-e Silva, the editor of one of the journals, 'Clinics', blames the CAPES agency in the education ministry: the agency encourages students and local researchers to publish in high impact factor journals and partly evaluates their graduation programs on this basis. As a result, researchers do not want to publish in new, low-ranking, rising journals, which inherently lures the editors of such journals to boost their impact factor artificially.

Moreover, the boundary of research ethics is so loosely defined that the difference between acceptable and unacceptable behavior is not clearly distinguishable; sometimes, authors, editors, and publishers are themselves ignorant of their unethical actions. Thus, the role of every stakeholder in this publication system also needs to be clearly defined and documented. In this paper, we comprehensively present all possible categories of research misconduct with examples and discuss how computational intelligence, data science, and analytics can help detect them.

2 Characterization: Types of Possible Research Misconduct

In this section, we present all possible kinds of research malpractice.


2.1 Citation Malpractices

Any reference in a manuscript that does not directly contribute to or enrich an article's technical content, and is given to artificially boost the impact factor of a journal, is regarded as a 'citation malpractice'. Broadly, the possible kinds of anomalies can be categorized into excessive self-citation and citation cartels. Excessive self-citation in journals can also occur for legitimate reasons, such as narrow field specialization, authors submitting their own works, or citing papers after reading recent editions of the same journal; nevertheless, an excessive self-citation count is unacceptable.

2.1.1 Excessive Self-Citation

Excessive self-citation is a phenomenon in which a journal excessively cites its own recently published papers within a time window of 2–5 years; consequently, the impact factor of the journal increases abruptly. Impact-factor-biased self-citation is not acceptable [2, 9]. Further, Bai et al. [3] report a relationship analysis showing a direct relationship between self-citation and the h-index, and between self-citation and the impact factor. Excessive self-citation can occur in several forms, as discussed below.

• Author self-citation: Recent work [10] analyzes trends in author self-citations by considering the highly cited researchers of 2019 and using standard outlier detection methods (the box-plot method). Out of 250 highly cited authors in the Chemistry domain, three authors give 'excessive self-citations' beyond Q3 + 1.5 × (interquartile range) of their total citations. However, such percentile-distribution techniques are domain- and context-specific. Further, on January 29, 2020, one of the world's renowned biophysicists, 'Kuo-Chen Chou' [11], a top-cited author, was removed from the editor-in-chief position of the 'Journal of Theoretical Biology' (publisher, Elsevier). Since 2003, he had collected a total of 58,000 citations for 602 papers. Being in the editorial position, he forced authors to add more than 50 citations to his publications as a condition of publication. It is also reported that he even asked authors to change paper titles and modify them according to algorithms formulated by him. Other malpractices include creating false alias names, reviewing colleagues from the same institute and biasing their reference lists, choosing known reviewers for his own articles, and adding himself as a co-author in the last stage of the review process. However, apart from the retraction of the papers themselves, there is still no proper methodology to identify and remove such anomalous citations; the final decision lies in the hands of the publisher, Elsevier.
• Coercive induced self-citation: This refers to the malpractice where an editor forces an author to add irrelevant references, either to papers from the editor's own journal or from a supplied reference list, without being specific about how and why the review of such works is required [12]. A survey conducted in 2008 reports that out of 283 authors, 22.7% were asked by at least one of the reviewers to add irrelevant references. It is seen that business journals have the highest percentage of coercive journals compared with other fields such as economics, sociology, and psychology. On the other hand, the business and medicine fields are most likely to add irrelevant citations to grant proposals. Moreover, junior scholars are more likely to give coercive citations than senior researchers. Such gaming benefits both the author and the editor [12]: for the author, the submitted manuscript is accepted for publication; for the editor, the journal receives complimentary citations and its impact factor is boosted artificially.
• Citation from a conflict-of-interest group (co-author and collaboration group citation): Bai et al. [3] propose a PageRank-based algorithm to find the citation rank of articles, which significantly reduces the weight (paper rank) of a citation if there is a CoI or suspected CoI relationship. Biased citations from co-authors and the collaboration group (researchers working under the same research group) come under a CoI relationship, while a colleague relation between authors working at the same affiliation is considered a suspected CoI relation [3].
• Journal self-citation: A new measure, 'Impact Factor Biased Self-Citation Practices (IFBSCP)', is devised by Chorus and Waltman [9]. A disproportionate increase in this measure occurs due to several editor misconducts, such as coercive citations, compulsively publishing a larger number of editorial or review papers, or merely accepting papers with hundreds of superfluous references to recent publications of the journal itself. Such instances are also field-dependent; for example, the life sciences publish more editorial papers, leading to an abnormally increased IFBSCP measure compared with the physical and social sciences [9].

2.1.2 Citation Cartel

A citation cartel is a phenomenon in which a group of journals mutually cite recent publications of each other's journals within a time window of 2–5 years to artificially boost their impact factors. On a microscopic scale, a band of authors, editors, and publishers groups together for their mutual benefit. Such cartels are difficult to detect. Some texts also refer to them as citation stacking, cabals, or rings.

• Author and editor misconduct: In the last 8 years, several cases of author and editor misconduct have been reported. In 2015, a case involving the two journals Asia Pacific Journal of Tourism Research (APTR) and Journal of Travel and Tourism Marketing (JTTM) was reported. Strikingly, it was found that an article published in APTR gave 161 out of its 172 references to JTTM; reciprocating, JTTM gave 130 out of 161 references back to APTR [5]. Both papers were co-authored by Rob Law, who is editor-in-chief of one of the journals and an editorial board member of the other. A more recent case, detected in 2017, concerns two journals of the European Geosciences Union (EGU), Solid Earth and Soil. Artemi Cerda, an editor and reviewer of these journals, coercively asked authors to add 622 additional references; 64% of the authors agreed to his requests, adding 399 superfluous citations [13].


• Publisher misconduct: In the physics domain, Petr Heneberg identified a citation nexus between three journals of the same publisher, 'Editura Academiei Romane'. The 2-year impact factor abruptly rose from 0.088 to 1.658 for one of the journals (Proceedings of the Romanian Academy, Series A). Simultaneously, for the other two journals, Romanian Journal of Physics and Romanian Reports in Physics, the impact factor rose from 0.279 to 0.924 and from 0.458 to 1.517, respectively [9].

2.2 Detection of Plagiarism

The first software for detecting plagiarism came into use in 1989 for educational purposes. Today, the research community deploys more computationally robust software that matches a given paper against a massive text corpus on the web. Despite such advances, several complex issues in this direction remain unresolved [14]. First, this software cannot distinguish between literal plagiarism (directly copying text) and intellectual plagiarism (paraphrasing). Second, cross-language plagiarism cannot be detected by such software. Some works [15] have used semantic information from citation flow to overcome such challenges. Almost every editorial board now uses plagiarism detection to cross-check ambiguous submissions.

2.3 Figure or Image Manipulation

The last decade has seen the onset of the use of tampered images in scholarly publishing [16]. This includes duplicating images (with color changes), cloning, cropping an image, and making selective adjustments such as changes to color brightness and contrast. In the bio-medical field, researchers have started using machine learning and other advanced deep learning techniques to scrutinize such image manipulations by comparing them against repositories of past publications.

2.4 Honorary or Ghost Authorship

Honorary authorship is a research malpractice in which author names are added, sometimes coercively, without any research contribution. Some researchers, especially junior scholars, tend to add the names of authors in senior administrative positions to induce favorable reviews, popularity, citations, and prestige. Further, names are added to create mutual relationships, in expectation of authorship on another paper in the future, or to favor a known colleague. Additionally, honorary authorship can help academicians with a history of receiving funds from prestigious institutes to secure multiple research grants [4]. Overall, it is a kind of malpractice that threatens research integrity, yet it is widely practiced today. In one study, it is reported that the health care and medicine fields, as well as the marketing and management disciplines, practice honorary authorship more than other disciplines; it is also reported that females use it 38% more than men [4].

2.5 Biases in Peer-Review Process

Peer review is at the core of scholarly publishing, yet even the gatekeepers of the peer-review process can be biased. Recent reports reveal that some researchers create false aliases by setting up fake e-mail accounts to review their own articles or the articles of known co-authors and collaborators. Further, there is still a lack of efficient reviewer-to-paper assignment strategies and of robust software to completely combat conflicts of interest (CoI) between reviewers and the authors of papers. Mostly, these involve author networks working under the same research group, institution, or supervisor-supervisee link, or within a close geographical boundary. Existing conference assignment systems such as EasyChair and CyberChair even now mostly rely on the CoIs self-declared by reviewers and paper authors [17]. Open-access platforms such as ScholarOne and Publons provide some tools based on reviewing data to cross-verify peer-reviewer identity; in this regard, unique ORCID or other researcher IDs can be used.

3 Future Scope of Study: How Computational Intelligence, Data Science, and Analytics Can Help

Recent literature has reported on many complex issues beyond the traditional research malpractices of fabrication, falsification, and plagiarism, such as the formation of strategic citation cartels and rings, coercive citations due to author or editor misconduct, duplicated images, ghost authorship, and biased peer-review processes. For instance, Fister et al. [18] propose to detect such citation cartels readily from multi-level graphs using online web tools; they use the resource description framework (RDF) format and the RDF query language to identify anomalous citation links in paper and author citation networks. Moreover, in our recent work [6], we have used unsupervised machine learning techniques to identify such cases of extreme outliers in any given large bibliographic dataset. We have seen that when two journals give directed mutual citations to each other, an abrupt increase in the publication rate of the donor journal and a consequent inflation in the impact factor of the recipient journal over a specific period are characteristic of such outliers. Some other works [2] use statistical significance tests (the Welch F-test) to determine the effect of self-citation on the impact factor and immediacy index journal metrics. Many works devise algorithms to detect self-citations; further algorithms need to be devised to detect more of such complex anomalous citation patterns.
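As a toy illustration of this kind of screening (not the method of [6] or [18]), the sketch below counts the citations exchanged between every pair of journals per year and flags pairs whose mutual exchange jumps sharply from one year to the next. The data frame 'cites' with columns citing_journal, cited_journal and year, and the jump threshold of 100, are illustrative assumptions.

    # Count citations per (citing journal, cited journal, year)
    pair_counts <- as.data.frame(table(citing = cites$citing_journal,
                                       cited  = cites$cited_journal,
                                       year   = cites$year),
                                 responseName = "n", stringsAsFactors = FALSE)

    # Attach the reverse direction so each row carries both X->Y and Y->X counts
    reverse <- data.frame(citing = pair_counts$cited, cited = pair_counts$citing,
                          year = pair_counts$year, n_back = pair_counts$n,
                          stringsAsFactors = FALSE)
    mutual <- merge(pair_counts, reverse, by = c("citing", "cited", "year"))
    mutual$exchange <- mutual$n + mutual$n_back

    # Year-over-year jump in the exchanged volume for every ordered journal pair
    mutual <- mutual[order(mutual$citing, mutual$cited, mutual$year), ]
    mutual$jump <- ave(mutual$exchange, mutual$citing, mutual$cited,
                       FUN = function(x) c(NA, diff(x)))

    subset(mutual, citing != cited & !is.na(jump) & jump > 100)  # candidate pairs to audit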


New auditing standards need to be defined to detect such problematic anomalous practices. Here, the need for computational intelligence techniques, such as advanced machine learning, deep learning and fuzzy-logic algorithms, is even more evident [19]. The ill fate of detected anomalies is the retraction of papers and journals. However, a drop in the percentage of retractions is observed from 1980 to 2019; in 2019, 10 out of every 10,000 papers were retracted. 'How much is too much?' [10] Apart from a comprehensive definition of research integrity itself, acceptable and unacceptable behavior in research ethics needs to be described more explicitly at each stage. Moreover, the roles of the stakeholders (researchers as authors, co-authors, collaborators and reviewers; editorial board members; publishers; institutions; data service providers; governing bodies; and funding agencies) need to be defined clearly. Honesty, openness, and transparency in reporting methodology, code, data for reproducibility, and results are the need of the hour; open-access publication is one step toward it. Database providers and publication houses can work in coordination to ensure accurate, unambiguous, transparent, and open data before any analysis and evaluation. One of the key reasons behind such anomalous activity is fake identity, which can be partially resolved by verifying the identity of a researcher using block-chain technologies [19]; the identity of a journal's editorial board members or of reviewers in a peer-review process can also be verified to some extent using such techniques. The essential point is that such gaming can lead to colossal technological, scientific, health, and economic losses to society.

References

1. Chakraborty J, Pradhan D, Dutta HS, Nandi S, Chakraborty T (2018) On good and bad intentions behind anomalous citation patterns among journals in computer sciences. arXiv preprint arXiv:1807.10804
2. Heneberg P (2016) From excessive journal self-cites to citation stacking: analysis of journal self-citation kinetics in search for journals, which boost their scientometric indicators. PLoS One 11(4):e0153730
3. Bai X, Xia F, Lee I, Zhang J, Ning Z (2016) Identifying anomalous citations for objective evaluation of scholarly article impact. PLoS One 11(9):e0162364
4. Fong EA, Wilhite AW (2017) Authorship and citation manipulation in academic research. PLoS One 12(12):e0187394
5. Mongeon P, Waltman L, Rijcke S (2016) What do we know about journal citation cartels? A call for information
6. Chakraborty J, Pradhan DK, Nandi S (2020) On the identification and analysis of citation pattern irregularities among journals. Expert Syst e12561
7. Zhang L (2020) For China's ambitious research reforms to be successful, they will need to be supported by new research assessment infrastructures. Impact Soc Sci Blog
8. Van Noorden R (2013) Brazilian citation scheme outed. Nature 500(7464):510–511
9. Chorus C, Waltman L (2016) A large-scale analysis of impact factor biased journal self-citations. PLoS One 11(8):e0161021
10. Szomszor M, Pendlebury DA, Adams J et al (2020) How much is too much? The difference between research influence and self-citation excess. Scientometrics 123(2):1119–1147
11. Van Noorden R (2020) Highly cited researcher banned from journal board for citation abuse. Nature 578(7794):200–201
12. Wilhite AW, Fong EA (2012) Coercive citation in academic publishing. Science 335(6068):542–543
13. Lockwood M (2020) Citation malpractice
14. Alzahrani SM, Salim N, Abraham A (2011) Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Trans Syst Man Cybern Part C (Appl Rev) 42(2):133–149
15. Gipp B (2014) Citation-based plagiarism detection. In: Citation-based plagiarism detection. Springer, pp 57–88
16. Bik EM, Casadevall A, Fang FC (2016) The prevalence of inappropriate image duplication in biomedical research publications. MBio 7(3)
17. Pradhan DK, Chakraborty J, Choudhary P, Nandi S (2020) An automated conflict of interest based greedy approach for conference paper assignment system. J Informetrics 14(2):101022
18. Fister I Jr, Fister I, Perc M (2016) Toward the discovery of citation cartels in citation networks. Front Phys 4:49
19. Pradhan DK, Chakraborty J, Nandi S (2019) Applications of machine learning in analysis of citation network. In: Proceedings of the ACM India joint international conference on data science and management of data, pp 330–333

Dynamic Price Prediction of Agricultural Produce for E-Commerce Business Model: A Linear Regression Model

Tumpa Banerjee, Shreyashee Sinha, and Prasenjit Choudhury

Abstract Price instability due to erratic supply, seasonality, and information disparity is the fundamental issue of the agricultural commodity market. To achieve quality returns from investment, sellers must quote the optimal price of their products. The key to digital market success for an agro-seller is a continually adjusting dynamic pricing mechanism that adapts to market fluctuations. This paper describes the development of a dynamic pricing model that leverages the strengths of linear regression to forecast the optimal price of agricultural commodities. In particular, the model dynamically factors in the total current supply of the products in the market. Further, the paper analyzes the accuracy of the model and explores the potential factors behind the attained performance. Requiring little computation power, this simple technique is designed to benefit Indian farmers, who have limited access to premium technology and resources.

1 Introduction

Agriculture is the backbone of India's economy, as it is the livelihood of more than 50% of the country's people, directly or indirectly. Its commodity prices are woven intricately with people's lives [1]. The volatility in prices due to several factors has severe detrimental effects and causes chaos among farmers and producers. Farmers are unsure about future returns and often make suboptimal investment decisions. Nevertheless, the factors that create the demand-supply imbalance are unlikely to vanish in the future; e.g., unpredictable weather variations or unidentified plant diseases are complications that cannot be fully resolved. Hence, securing stable, fair prices for agricultural goods for the growers is a renewed concern and the primary objective of the various reforms of agricultural marketing systems. Government ventures like the electronic National Agricultural Market (eNAM) and new bills to permit agrarian produce to the APMC market demonstrate a substantial interest in protecting the growers' returns and valuing the consumers' capital. It is now evident that to prevent threats that challenge growers' viability, it is imperative to make calculated decisions based on intelligent predictions [2]. Additionally, close prediction of agricultural commodity prices is critical to ensure a thriving social economy [3], since a long-term prognosis of future prices helps in taking plantation decisions, whereas short-term predictions guarantee the investment return, bridging the gap between a grower's toil and the outcome.

The existence of middlemen who buy from farmers and sell to the market is a disadvantage for both farmers, who sell at little or no profit, and consumers, who buy at a high price. With the advent of Internet retail, agricultural e-trade is gaining ground in countries like China and India to streamline the economy, provide better revenue to the farmer, and offer a better price to the consumer. The government-driven eNAM is gaining popularity in India and works on both B2B and B2C segments to make the agri-market transparent. However, a considerable amount of work remains in standardization, trust, and the smooth circulation of cash and produce [4]. The prevalent fear of unsold and surplus goods and of huge agro-wastage prevents stakeholders from engaging in e-trading. The introduction of data science in the digital market has attempted to reduce similar risks [5] by leveraging statistical and machine learning techniques. Historical data is useful for finding pricing trends and establishing the relationship between price and its influencing parameters. Predicting financial risk in the stock market [6], maximizing profit in the supply chain [7], and modeling house pricing [8] are some use cases of such data analysis. For instance, e-commerce giants like Amazon or Walmart employ dynamic pricing mechanisms based on past data and current demand to skim off the surplus and maximize their profit [9], and such mechanisms permeate an extensive range of industries [10]. Dynamic pricing that continually varies with the total supply, market time, and quality of the product is imperative for online retailers to grow their business and stay competitive in the market.

This paper uses data analytics to address the farmers' problems and proposes a predictive model to forecast agri-product prices based on the current market condition. Although there is no shortage of sophisticated deep learning techniques for building a predictive model, this paper sticks to classical linear regression because of its low computation expense, ease of implementation, relatively short training time, and lower risk of overfitting. Without the need for high computation power, the proposed model can run on low-cost smartphones.

2 Literature Review

An ample amount of research has been conducted in the field of agricultural e-commerce. For instance, the mobile application WeChat, running in China, provides social networking services to promote agricultural e-trade [11]; WeChat collects data on customer behavior, frequency of purchase, and product prices to provide adequate support for advertising and to promote e-business. In [12], the author proposed an agent-based agricultural e-commerce model in which the agent helps a consumer search for an agri-e-commerce website with better product pricing; this allows the consumer to purchase a product at an optimal price but does not help the seller set the best price. In [13], the authors used freshness, the elasticity of demand with respect to price, and demand sensitivity to price to introduce dynamism into supermarket pricing. Market demand has played a key role in a dynamic pricing revenue distribution model for the fresh agricultural products supply chain [14]. In [15], only the shelf life of the fresh agri-product is used to control the dynamism in price, since each customer selects the product based on the utility of its price and quality. The authors in [16] employed a dynamic pricing mechanism based on competitors' prices. A multi-agent model has been developed for simulating a dynamic pricing strategy in a competitive market for perishable products [17]. Plenty of dynamic pricing models for fresh products have been proposed by researchers in various countries, but they are mostly targeted at supermarkets or stores and hardly deal with the e-commerce setting [16]. In particular, dynamic pricing models for selling fresh agri-products on online platforms in India have not received much attention from researchers. The significance of dynamic pricing in Internet retailing has been demonstrated by popular e-commerce websites such as Amazon and Flipkart. A dynamic pricing mechanism can reassure the seller of maximum profit despite surplus and decay, which is essential for better demand management and inventory management [18].

3 Motivation of the Work

The usage of smartphones in areas such as e-commerce and trade has increased greatly. In the field of agriculture as well, they have been employed for recording produce, farm journaling, and farm management tasks [19]. Smartphone cameras, GPS, and sensors make it possible to collect tangible data which can be used to optimize produce quality while decreasing waste. Since the potential users are going to be mainly farmers and growers, this study targets smartphones, as they are readily available even in areas lacking urban infrastructure. This research focuses on low-cost smartphone-based technology due to its prevalence among farmers.

The sale price could depend on the production cost, the total supply, and myriad other factors concerning the product and the market. Studies suggest that the pricing of any product or service in any market correlates positively with demand; i.e., more demand for a product may cause a higher price, and a lower price may escalate demand. From this, it can be safely said that pricing is the most challenging task, because the price should be low enough to lure the customer to keep purchasing without reducing the product's proper value, and high enough to generate increasing profit without shrinking the demand. Hence, supply and demand are the critical parameters for price determination. The total quantity demanded of a product on a particular day equals the amount of that product depleted, and the surplus amount gets marketed the next day. However, agricultural products are perishable in nature; an extra shelf day may affect their quality and freshness. Consequently, marketing such goods can become challenging for sellers, who strive to sell their bulk within the stipulated time in addition to generating revenue. They reduce the product's price to create demand when they have leftover product, i.e., the stock in hand determines the agricultural product's price and demand. This paper studies pricing behavior based on the total supply of the product. The price of a product depends not only on the availability of that product but also on the supply of other products in the market, since a scarcity of goods leads to higher demand and results in a high price; this inspires us to determine the price based on the overall market supply.

4 Materials and Method

4.1 Linear Regression

Machine learning models require historical data to generalize over the dataset. Due to the lack of a fine-grained dynamic pricing dataset, such as price records for each day and each hour, time series models like ARIMA and LSTM cannot be employed. The dataset contains the daily records of the total supply, the wholesale price, and the retail price of the products. This paper therefore builds a model using linear regression based on the availability of the product. This linear regression model is not intended to surpass sophisticated dynamic pricing models; rather, it aims to highlight to researchers the usefulness of simple linear regression. Additionally, a low-cost device would not possess enough computational power to support deep learning techniques. Broadly, two approaches could be used to perform these predictions on such a device:

1. Deploy a pre-trained model for performing the inference on the device, or employ a cloud model to infer via the mobile network.
2. Train a computationally inexpensive model on the device itself by collecting real-time data.

The lack of GPU availability on low-cost devices can make even inference difficult on them, and network availability continues to be a challenge in rural areas; network latency in devices where the operations happen in the cloud might cause a delay, during which dynamic prices might change. Thus, this research explores the second option and performs the data collection and model training on the device itself. For this, linear regression is used, as it is computationally inexpensive and easy to implement.

Linear regression is useful for establishing the relationship between a dependent variable and independent variables. The dependent variable is symbolized by y, and the set of p independent variables is expressed by x1, x2, …, xp. Forecasting signifies the prediction or estimation of the target (dependent) variable using the linear relationship established with the predicting (independent) variables. The linear model relating the target variable to the predictors can be represented as

    y = β0 + β1 x1 + β2 x2 + ⋯ + βp xp + ε

Estimating this relationship, i.e., computing the values of the parameters βi, is the objective of regression analysis. The standard approach is ordinary least squares (OLS) estimation, where the estimates are chosen to minimize the difference between the actual values yi and the estimated values ŷi; the least squares method provides the best-fit line for the given dataset:

    minimize  Σ (yi − ŷi)²,  summed over i = 1, …, n

The term (yi − ŷi)² specifies the squared deviation of the estimated value from the actual value. Simply summing the deviations does not give the total deviation, since they can be positive or negative; squaring makes all the terms positive.

Measuring the accuracy of the regression relationship

The accuracy of the relationship can be measured from the R² value,

    R² = regression sum of squares / total sum of squares = 1 − (residual sum of squares / total sum of squares)

R² lies between 0 and 1, since it is a proportion. When the residual sum of squares tends to zero, R² tends to 1, which means the data points lie exactly on the regression line; when the residual sum of squares grows larger, R² tends to zero, which means high variation between the fitted and actual values. R² is biased by the number of regressors (regression coefficients), so the unbiased adjusted R² is used to measure the strength of the regression:

    Adjusted R² = R² − (1 − R²) · p / (n − p − 1)

Data Source

The dataset has been collected from the National Horticulture Board (nhb.gov.in) website. The National Horticulture Board is an Indian government organization which helps in the production and market development of fresh horticultural produce. The website publishes daily, weekly, monthly and yearly price and arrival reports for several horticultural products marketed in different states/regions. These reports were collected for the products traded in Kolkata, West Bengal, for the period January 2018 to January 2019. The experiment is conducted in RStudio (Version 1.1.463), using R 3.5.3. R is used since it is an open-source programming language and is convenient for data analysis, as many packages and tools are available for statistical computing and machine learning algorithms. Visual representation of data in graphical form is also straightforward for datasets of any size.
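A minimal sketch of the modelling step described in Sect. 4.1, using this kind of daily supply-price series; the data frame and column names, and the illustrative supply figure, are assumptions rather than the authors' actual code.

    # Assumed data frame 'bg' (one product, e.g., bitter gourd) with daily
    # 'supply' and 'price' columns from the NHB arrival/price reports.
    cor(bg$supply, bg$price, method = "pearson")  # strength of the supply-price relation

    model <- lm(price ~ supply, data = bg)        # OLS fit: price = b0 + b1 * supply
    summary(model)   # coefficients, residual standard error, adjusted R-squared, p-value (cf. Table 1)

    # Log-transformed variant used for the skewed bitter gourd data (Sect. 5)
    log_model <- lm(log(price) ~ log(supply), data = bg)

    new_day <- data.frame(supply = 1500)          # illustrative supply figure
    predict(model, newdata = new_day)             # forecast price for that supply level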

4.2 Outliers Detection

An outlier is a data point, recorded as the value of an attribute of a sample, that lies at an unusual distance from the other values in the population. Outliers may arise from erroneously recorded or computed data, or they may be genuine observations. The presence of outliers may change the results drastically; hence, checking for them in the dataset is crucial before proceeding with the experiment. A value is treated as an outlier if

    value < Q1 − 1.5 × IQR   or   value > Q3 + 1.5 × IQR

where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range. The box plot (Fig. 1) is useful to depict the existence of outliers. Outliers may be removed if the data is mistakenly recorded, but not when they carry a significant association. Agricultural produce is a daily need of human beings; still, demand varies with the price of the product, and the price of the product fluctuates with the total supply, the supply of other products, the freshness of the product, and demand. Further, demand may deviate from its normal level due to festive seasons, natural calamities or other constraints. Hence, in this case, outliers cannot be considered incorrect records and should not be dropped or replaced by other values.
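A short sketch of the interquartile-range rule above on the assumed 'bg' data frame:

    q   <- quantile(bg$price, c(0.25, 0.75))
    iqr <- IQR(bg$price)

    lower <- q[1] - 1.5 * iqr      # Q1 - 1.5*IQR
    upper <- q[2] + 1.5 * iqr      # Q3 + 1.5*IQR
    outliers <- bg$price[bg$price < lower | bg$price > upper]

    boxplot(bg$price, bg$supply, names = c("price", "supply"))  # cf. Fig. 1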

Fig. 1 Boxplot of (a) price and (b) supply


5 Result and Discussion

A logarithm transformation has been used to transform the highly skewed data into a more normalized dataset. A log transformation of the bitter gourd data has been used to build the linear regression model, and the model attributes are explained. Table 1 summarizes the results of the linear regression models fitted to the actual data for bitter gourd, cabbage, brinjal, and cauliflower. The strength of the association between any two variables is estimated through correlation coefficients; the Pearson correlation coefficient is widely used in statistics for calculating the strength of a relationship. Its value ranges from −1.0 to +1.0; a positive value indicates a positive correlation, and a negative value signifies a negative correlation. Correlation coefficient values greater than 0.8 or less than −0.8 imply a strong relationship. Before estimating the model, the correlation coefficient between the total supply of the product and its price is calculated to check the strength of the relationship between them. Correlation can also be visualized with a scatterplot: the graph shows that a higher supply results in a low price and a low supply is followed by a high price, i.e., a negative correlation.

Table 1 Summary of the linear regression models for bitter gourd, cabbage, brinjal, and cauliflower

                          Bitter gourd   Cabbage     Brinjal    Cauliflower
Intercept (β0)            4345.784       1096.6873   4351.747   2859.9739
Coefficient (β1)          −115.883       2.2593      −71.040    −18.0690
Adjusted R-squared        0.624          0.5384      0.7315
Residual standard error   441.8
p-value