Smart Trends in Computing and Communications: Proceedings of SmartCom 2019 [1st ed. 2020] 978-981-15-0076-3, 978-981-15-0077-0

This book gathers high-quality papers presented at the International Conference on Smart Trends for Information Technology and Computer Communications (SmartCom 2019).


English Pages XXIV, 499 [498] Year 2020


Table of contents :
Front Matter ....Pages i-xxiv
Comparison of Different Image Segmentation Techniques on MRI Image (Afifi Afandi, Iza Sazanita Isa, Siti Noraini Sulaiman, Nur Najihah Mohd Marzuki, Noor Khairiah A. Karim)....Pages 1-9
PSO-ANN-Based Computer-Aided Diagnosis and Classification of Diabetes (Ratna Patil, Sharvari C. Tamane)....Pages 11-20
Round Robin Scheduling Based on Remaining Time and Median (RR_RT&M) for Cloud Computing (Mayuree Runsungnoen, Tanapat Anusas-amornkul)....Pages 21-29
Business-Driven Blockchain-Mempool Model for Cooperative Optimization in Smart Grids (Marius Stübs, Wolf Posdorfer, Julian Kalinowski)....Pages 31-39
Research of Institutional Technology Diffusion Rules Based on Patent Citation Network—A Case Study of AI Field (Zhao Rongying, Li Xinlai, Li Danyang)....Pages 41-49
Impact on the Information Security Management Due to the Use of Social Networks in a Public Organization in Ecuador (Segundo Moisés Toapanta Toapanta, Félix Gustavo Mendoza Quimi, Leslie Melanie Romero Lambogglia, Luis Enrique Mafla Gallegos)....Pages 51-64
Appropriate Security Protocols to Mitigate the Risks in Electronic Money Management (Segundo Moisés Toapanta Toapanta, María Elissa Coronel Zamora, Luis Enrique Mafla Gallegos)....Pages 65-74
Acceptance and Readiness of Thai Farmers Toward Digital Technology (Suwanna Sayruamyat, Winai Nadee)....Pages 75-82
Neural Network Classifier for Diagnosis of Diabetic Retinopathy (Gauri Borkhade, Ranjana Raut)....Pages 83-88
Comparative Analysis of Data Mining Classification Techniques for Prediction of Heart Disease Using the Weka and SPSS Modeler Tools (Atul Kumar Ramotra, Amit Mahajan, Rakesh Kumar, Vibhakar Mansotra)....Pages 89-97
An Automated Framework to Uncover Malicious Traffic for University Campus Network (Amit Mahajan, Atul Kumar Ramotra, Vibhakar Mansotra, Maninder Singh)....Pages 99-108
Comparative Analysis of K-Means Algorithm and Particle Swarm Optimization for Search Result Clustering (Shashi Mehrotra, Aditi Sharan)....Pages 109-114
Design and Implementation of Rule-Based Hindi Stemmer for Hindi Information Retrieval (Rakesh Kumar, Atul Kumar Ramotra, Amit Mahajan, Vibhakar Mansotra)....Pages 115-122
Research on the Development Trend of Ship Integrated Power System Based on Patent Analysis (Rongying Zhao, Danyang Li, Xinlai Li)....Pages 123-132
Detection of Data Anomalies in Fog Computing Architectures (K. Vidyasankar)....Pages 133-142
Cloud Data for Marketing in Tourism Sector (Pritee Parwekar, Gunjan Gupta)....Pages 143-153
Road Travel Time Prediction Method Based on Random Forest Model (Wanchao Song, Yinghua Zhou)....Pages 155-163
Video Synchronization and Alignment Using Motion Detection and Contour Filtering (K. Seemanthini, S. S. Manjunath, G. Srinivasa, B. Kiran)....Pages 165-177
Mutichain Enabled EHR Management System and Predictive Analytics (Meghana Nagori, Aditya Patil, Saurabh Deshmukh, Gauri Vaidya, Mayur Rahangdale, Chinmay Kulkarni et al.)....Pages 179-187
Quick Insight of Research Literature Using Topic Modeling (Vrishali Chakkarwar, Sharvari C. Tamane)....Pages 189-197
Secure Cloud-Based E-Healthcare System Using Ciphertext-Policy Identity-Based Encryption (CP-IBE) (Dipa D. Dharamadhikari, Sharvari C. Tamane)....Pages 199-209
Security Vulnerabilities of OpenStack Cloud and Security Assessment Using Different Software Tools (Manisha P. Bharati, Sharvari C. Tamane)....Pages 211-220
Smart Physical Intruder Detection System for Highly Sensitive Area (Smita Kasar, Vivek Kshirsagar, Sagar Bokan, Ninad Rathod)....Pages 221-229
Two-level Classification of Radar Targets Using Machine Learning (Aparna Rathi, Debasish Deb, N. Sarath Babu, Reena Mamgain)....Pages 231-242
A Cognitive Semantic-Based Approach for Human Event Detection in Videos (K. Seemanthini, S. S. Manjunath, G. Srinivasa, B. Kiran, P. Sowmyasree)....Pages 243-253
Analysis of Adequate Bandwidths to Guarantee an Electoral Process in Ecuador (Segundo Moisés Toapanta Toapanta, Johan Eduardo Aguilar Piguave, Luis Enrique Mafla Gallegos)....Pages 255-265
Load and Renewable Energy Forecasting for System Modelling, an Effort in Reducing Renewable Energy Curtailment (Dipam Chaudhari, Chaitanya Gosavi)....Pages 267-275
RAM: Rotating Angle Method of Clustering for Heterogeneous-Aware Wireless Sensor Networks (Kameshkumar R. Raval, Nilesh Modi)....Pages 277-286
GWO-GA Based Load Balanced and Energy Efficient Clustering Approach for WSN (Amruta Lipare, Damodar Reddy Edla, Ramalingaswamy Cheruku, Diwakar Tripathi)....Pages 287-295
Proof of Authenticity-Based Electronic Medical Records Storage on Blockchain (Mustafa Qazi, Devyani Kulkarni, Meghana Nagori)....Pages 297-306
Hardware Implementation of Elliptic Curve Cryptosystem Using Optimized Scalar Multiplication (Rakesh K. Kadu, Dattatraya S. Adane)....Pages 307-316
Efficient Resource Provisioning Through Workload Prediction in the Cloud System (Lata J. Gadhavi, Madhuri D. Bhavsar)....Pages 317-325
An Approach for Environment Vitiation Analysis and Prediction Using Data Mining and Business Intelligence (Shubhangi Tirpude, Aarti Karandikar, Rashmi Welekar)....Pages 327-338
Preserving Authentication and Access Control by Using Strong Passwords Through Image Fusion Mechanism (Vijay B. Gadicha, Abrar S. Alvi)....Pages 339-349
Performance Improvement of Direct Torque and Flux Controlled AC Motor Drive (Jagdish G. Chaudhari, Sanjay B. Bodkhe)....Pages 351-363
Cryptocurrency: A Comprehensive Analysis (Gaurav Chatterjee, Damodar Reddy Edla, Venkatanareshbabu Kuppili)....Pages 365-374
Challenges in Recognition of Online and Off-line Compound Handwritten Characters: A Review (Ratnashil N. Khobragade, Nitin A. Koli, Vrushali T. Lanjewar)....Pages 375-383
Novel Idea of Unique Key Generation and Distribution Using Threshold Science to Enhance Cloud Security (Devishree Naidu, Shubhangi Tirpude, Vrushali Bongirwar)....Pages 385-392
Information Retrieval Using Latent Semantic Analysis (Rahul Khokale, Nileshsingh V. Thakur, Mahendra Makesar, Nitin A. Koli)....Pages 393-404
Can Music Therapy Reduce Human Psychological Stress: A Review (Nikita R. Hatwar, Ujwalla H. Gawande)....Pages 405-411
Web Mash-Up Development and Security Using AOP (Manjusha Tatiya, Sharvari C. Tamane)....Pages 413-419
Design Consideration of Malay Text Stemmer Using Structured Approach (Mohamad Nizam Kassim, Shaiful Hisham Mat Jali, Mohd Aizaini Maarof, Anazida Zainal, Amirudin Abdul Wahab)....Pages 421-432
Enhanced Text Stemmer with Noisy Text Normalization for Malay Texts (Mohamad Nizam Kassim, Shaiful Hisham Mat Jali, Mohd Aizaini Maarof, Anazida Zainal, Amirudin Abdul Wahab)....Pages 433-444
Modified Moth Search Algorithm for Portfolio Optimization (Ivana Strumberger, Eva Tuba, Nebojsa Bacanin, Milan Tuba)....Pages 445-453
Towards the Adoption of Self-Driving Cars (Omayma Alqatawneh, Alex Coles, Ertu Unver)....Pages 455-461
An Overview on Privacy Preservation and Public Auditing on Outsourced Cloud Data (Sonali D. Khambalkar, Shailesh D. Kamble, Nileshsingh V. Thakur, Nilesh U. Sambhe, Nikhil S. Mangrulkar)....Pages 463-470
Segmentation of Handwritten Text Using Bacteria Foraging Optimization (Rajesh Agrawal, Prashant Sahai Saxena, Vijay Singh Rathore, Saurabh Maheshwari)....Pages 471-479
Problems with PIR Sensors in Smart Lighting+Security Solution and Solutions of Problems (Pinak Desai, Nilesh Modi)....Pages 481-486
Multi-level Thresholding and Quantization for Segmentation of Color Images (Shailesh T. Khandare, Nileshsingh V. Thakur)....Pages 487-496
Back Matter ....Pages 497-499


Smart Innovation, Systems and Technologies 165

Yu-Dong Zhang · Jyotsna Kumar Mandal · Chakchai So-In · Nileshsingh V. Thakur
Editors

Smart Trends in Computing and Communications Proceedings of SmartCom 2019


Smart Innovation, Systems and Technologies Volume 165

Series Editors
Robert J. Howlett, Bournemouth University and KES International, Shoreham-by-Sea, UK
Lakhmi C. Jain, Faculty of Engineering and Information Technology, Centre for Artificial Intelligence, University of Technology Sydney, Sydney, NSW, Australia

The Smart Innovation, Systems and Technologies book series encompasses the topics of knowledge, intelligence, innovation and sustainability. The aim of the series is to make available a platform for the publication of books on all aspects of single and multi-disciplinary research on these themes in order to make the latest results available in a readily-accessible form. Volumes on interdisciplinary research combining two or more of these areas are particularly sought. The series covers systems and paradigms that employ knowledge and intelligence in a broad sense. Its scope is systems having embedded knowledge and intelligence, which may be applied to the solution of world problems in industry, the environment and the community. It also focusses on the knowledge-transfer methodologies and innovation strategies employed to make this happen effectively. The combination of intelligent systems tools and a broad range of applications introduces a need for a synergy of disciplines from science, technology, business and the humanities. The series will include conference proceedings, edited collections, monographs, handbooks, reference books, and other relevant types of book in areas of science and technology where smart systems and technologies can offer innovative solutions. High quality content is an essential feature for all book proposals accepted for the series. It is expected that editors of all accepted volumes will ensure that contributions are subjected to an appropriate level of reviewing process and adhere to KES quality principles.

** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/8767

Yu-Dong Zhang · Jyotsna Kumar Mandal · Chakchai So-In · Nileshsingh V. Thakur
Editors

Smart Trends in Computing and Communications
Proceedings of SmartCom 2019

Editors

Yu-Dong Zhang, Department of Informatics, University of Leicester, Leicester, UK
Jyotsna Kumar Mandal, Department of Computer Science and Engineering, University of Kalyani, Kalyani, India
Chakchai So-In, Department of Computer Science, Khon Kaen University, Khon Kaen, Thailand
Nileshsingh V. Thakur, Nagpur Institute of Technology, Nagpur, Maharashtra, India

ISSN 2190-3018  ISSN 2190-3026 (electronic)
Smart Innovation, Systems and Technologies
ISBN 978-981-15-0076-3  ISBN 978-981-15-0077-0 (eBook)
https://doi.org/10.1007/978-981-15-0077-0

© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Organizing Committee

Conference Chairs Dharm Singh, Namibia University of Science and Technology, Namibia J. K. Mandal, University of Kalyani, West Bengal, India Simon James Fong, University of Macau, Macau

Conference Secretary Amit Joshi, Chair, InterYIT, IFIP

Advisory Committee

Members Chandana Unnithan, Victoria University, Melbourne, Australia Dr. Aynur Unal, Stanford University, USA Dr. Y. C. Bhatt, MPUAT, Udaipur, India Chih-Heng Ke, MIEEE, NKIT, Taiwan Tarek M. Sobh, Dean, School of Engineering, University of Bridgeport, USA Z. A. Abbasi, Department of Electronics Engineering, AMU, Aligarh, India Manjunath Aradhya, Department of MCA, SJCE, Mysore Mr. Prem Surana, Chairman, Deepshikha Group, Jaipur, India Mr. Anshu Surana, Vice Chairman, Deepshikha Group, Jaipur, India Prof. Min Xie, Ph.D. (Quality), Fellow of IEEE


Prof. Devesh Kumar Srivastava, Manipal University, Jaipur, India Mustafizur Rahman, Endeavor Research Fellow, Institute of High Performance Computing, Agency for Science Technology and Research Ashok Arora, MRIU, Faridabad, India C. Arunachalaperumal, Associate Professor, S. A. Engineering College, Chennai, India Chandana Unnithan, Deakin University, Melbourne, Australia Dr. Pawan Lingras, Professor, Saint Mary’s University, Canada Mohd. Atique, Amravati, Maharashtra, India Puneet Azad, New Delhi, India Hoang Pham, Professor and Chairman, Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ Dr. Suresh Chandra Satapathy, Chairman, Division V, CSI Dr. Hemant Purohit, George Mason University, USA Dr. Naeem Hannoon, Universiti Teknologi MARA, Malaysia Dr. Nagaraj Balakrishnan, Professor, Karpagam College of Engineering, Myleripalayam, Coimbatore, India; Prashant Bansod, SGSITS, Indore Prof. Hipollyte Muyingi, Namibia University of Science and Technology, Namibia Dr. Nobert Jere, Namibia University of Science and Technology, Namibia Shalini Batra, Computer Science and Engineering Dept., Thapar University, Patiala, Punjab, India Ernest Chulantha Kulasekere, Ph.D., University of Moratuwa, Sri Lanka Shajulin Benedict, Director, HPCCLoud Research Laboratory, St. Xavier’s Catholic College of Engineering, Chunkankadai District, Nagercoil, Tamil Nadu James E. Fowler, Mississippi State University, Mississippi, USA Dr. Majid Ebnali-Heidari, Shahrekord University, Shahrekord, Iran Rajendra Kumar Bharti, Assistant Prof. Kumaon Engg. College, Dwarahat, Uttarakhand, India Prof. Murali Bhaskaran, Dhirajlal Gandhi College of Technology, Salem, Tamil Nadu Pramod Parajuli, Nepal College of Information Technology, Nepal Prof. Komal Bhatia, YMCA University, Faridabad, Haryana, India Lili Liu, Automation College, Harbin Engineering University, Harbin, China Brooke Fisher Liu, Department of Communication, University of Maryland, College Park, Maryland, USA Prof. S. R. Biradar, Department of Information Science and Engineering, SDM College of Engineering and Technology, Dharwad, Karnataka A. K. Chaturvedi, Department of Electrical Engineering, IIT Kanpur, India Margaret Lloyd, Faculty of Education School of Curriculum, Queensland University of Technology, Queensland Hoi-Kwong Lo, University of Toronto, Ontario, Canada Pradeep Chouksey, Principal, TIT college, Bhopal, MP, India


Shashidhar Ram Joshi, Ph.D., Institute of Engineering, Pulchowk Campus, Pulchowk, Nepal Chhaya Dalela, Associate Professor, JSSATE, Noida, Uttar Pradesh, India Jayanti Dansana, KIIT University, Bhubaneswar, Odisha, India Desmond Lobo, Computer Engineering Department, Faculty of Engineering at Kamphaengsaen, Kasetsart University, Thailand Sergio Lopes, Industrial Electronics Department, University of Minho, Braga, Portugal Soura Dasgupta, Department of TCE, SRM University, Chennai, India Dr. Apurva A. Desai, Veer Narmad South Gujarat University, Surat, India V. Susheela Devi, Senior Scientific Officer, Department of Computer Science and Automation, Indian Institute of Science, Bangalore Subhadip Basu, Ph.D., Visiting Scientist, The University of Iowa, Iowa City, USA Vijay Pal Dhaka, Jaipur National University, Jaipur, Rajasthan Chih-Heng Ke, MIEEE, NKIT, Taiwan Dr. Nobert Jere, Namibia University of Science and Technology, Namibia Mr. Mignesh Parekh, Kamma Incorporation, Gujarat, India Kok-Lim Low, National University of Singapore, Singapore Dr. Koteswara rao K, PVPSIT, India Dr. Rajan Patel, Sankalchand Patel College of Engineering, Visnagar, India Dr. Sudhir Babu V, PVP Siddhartha Institute of Technology, India Sharvari C. Tamane, MGMs Jawaharlal Nehru Engineering College, India Ritesh Kumar Verma, Pune Institute of Business Management, Pune, India Dr. Meghana V. Kshirsagar, Government College of Engineering, Aurangabad, India Dalila Durães, High School of Technology and Management of Felgueiras at Polytechnic Institute of Porto, Portugal Sunita Singhal, Manipal University, Jaipur, India

Technical Program Chairs Prof. Yu-Dong Zhang, University of Leicester, UK Prof. Chakchai So-In, Khon Kaen University, Thailand Prof. Nileshsingh V. Thakur, Nagpur Institute of Technology, India

Program Secretary Mihir Chauhan—Global Knowledge Research Foundation


Technical Program Committee Prof. Ting-Peng Liang, National Chengchi University, Taipei, Taiwan Nedia Smairi, CNAM Laboratory, France Prof. Subhadip Basu, Visiting Scientist, The University of Iowa, Iowa City, USA Prof. Abrar A. Qureshi, Ph.D., University of Virginia, USA Prof. Louis M. Rose, Department of Computer Science, University of New York, USA Dr. Ricardo M. Checchi, University of Massachusetts, Massachusetts, USA Prof. Brent Waters, University of Texas, Austin, Texas, USA Prof. Prasun Sinha, Ohio State University Columbus, Columbus, OH, USA Prof. N. M. van Straalen, VU University Amsterdam, Amsterdam, Netherlands Prof. Rashid Ansari, University of Illinois, USA Prof. Russell Beale, School of Computer Science–Advanced Interaction, University of Birmingham, England Prof. Dan Boneh, Computer Science Dept., Stanford University, California, USA Prof. Alexander Christea, University of Warwick, London, UK Prof. Mustafizur Rahman, Endeavor Research Fellow, Australia Prof. Hoang Pham, Rutgers University, Piscataway, NJ, USA Prof. Ernest Chulantha Kulasekere, University of Moratuwa, Sri Lanka Prof. Shashidhar Ram Joshi, Institute of Engineering, Pulchowk, Nepal Dr. Ashish Rastogi, Higher College of Technology, Muscat, Oman Dr. Aynur Unal, Stanford University, USA Prof. Ahmad Al-Khasawneh, The Hashemite University, Jordan Dr. Bharat Singh Deora, JRNRV University, India Prof. Jean Michel Bruel, Departement Informatique IUT de Blagnac, Blagnac, France Prof. Ngai-Man Cheung, Assistant Professor, University of Technology and Design, Singapore Prof. J. Andrew Clark, Computer Science, University of York, UK Prof. Babita Gupta, College of Business California State University, California, USA Prof. Shuiqing Huang, Department of Information Management, Nanjing Agricultural University, Nanjing, China Prof. Yun-Bae Kim, Sungkyunkwan University, South Korea Prof. Sami Mnasri, IRIT Laboratory, Toulouse, France Prof. Anand Paul, The School of Computer Science and Engineering, South Korea Dr. Krishnamachar Prasad, Department of Electrical and Electronic Engineering, Auckland, New Zealand Prof. Louis M. Rose, Department of Computer Science, University of York Dr. Haibo Tian, School of Information Science and Technology, Guangzhou, Guangdong, China Er. Kalpana Jain, CTAE, Udaipur, India


Prof. Philip Yang, Price water house Coopers, Beijing, China Prof. Sunarto Kampus UNY, Yogyakarta, Indonesia Dr. Ashok Jetawat, CSI Udaipur Chapter, India Dr. Neetesh Purohit, Member SIG-WNs CSI, IIT Allahabad, India Prof. Dr. Ricardo M. Checchi, University of Massachusetts, Massachusetts, USA Mr. Jeril Kuriakose, Manipal University, Jaipur, India Prof. R. K. Bayal, Rajasthan Technical University, Kota, Rajasthan, India Prof. Martin Everett, University of Manchester, England Prof. Feng Jiang, Harbin Institute of Technology, China Prof. Prasun Sinha, Ohio State University Columbus, Columbus, OH, USA Dr. Savita Gandhi, Professor, Gujarat University, Ahmedabad, India Mr. Chintan Bhatt, Changa University, Gujarat, India Prof. Feng Tian, Virginia Polytechnic Institute and State University, USA Prof. XiuYing Tian, Instrument Lab, Yangtze Delta Region Institute of Tsinghua University, Jiaxing, China Prof. Xiaoyi Yu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China Prof. Abdul Rajak A. R., Department of Electronics and Communication Engineering, Birla Institute of Technology and Sciences, Abu Dhabi Mr. Ajay Choudhary, IIT Roorkee, India Dr. Manju Mandot, CSI Udaipur Chapter, India Prof. D. A. Parikh, Head, CE, LDCE, Ahmedabad, India Dr. Paras Kothari, Samarth Group of Institutions, Gujarat, India Dr. Harshal Arolkar, Immd. Past Chairman, CSI Ahmedabad Chapter, India Mr. Bhavesh Joshi, Advent College, Udaipur, India Prof. K. C. Roy, Principal, Kautilya, Jaipur, India Dr. Mukesh Shrimali, Pacific University, Udaipur, India Dr. Sanjay M. Shah, GEC, Gandhinagar, India Salam Shuleenda Devi, NIT Silchar, India Amira Ashour, Tantra University, Egypt Dr. S. Mishra, M-SIG-WNs, CSI, KEC Dwarahat, Uttarakhand, India Dr. Chirag S. Thaker, GEC, Bhavnagar, Gujarat, India Mr. Nisarg Pathak, SSC, CSI, Gujarat, India Mrs. Meenakshi Tripathi, MNIT, Jaipur, India Prof. S. N. Tazi, Govt. Engineering College, Ajmer, Rajasthan, India Shuhong Gao, Mathematical Sciences, Clemson University, Clemson, South Carolina Sanjam Garg, University of California, Los Angeles, California Faiez Gargouri, Sfax University Tunisia, Tunisia, North Africa Dr. A. Garrett, Department of Mathematics, Computing, and Information Sciences, Jacksonville State University, Jacksonville, Alabama Leszek Antoni Gasieniec, University of Liverpool, Liverpool, England


Ning Ge, School of Information Science and Technology, Tsinghua University, Beijing, China Garani Georgia, University of North London, UK Hazhir Ghasemnezhad, Electronics and Communication Engineering Department, Shiraz University of Technology, Shiraz, Iran Andrea Goldsmith, Professor of Electrical Engineering, Stanford University, California Dr. Saeed Golmohammadi, Assistant Professor in University of Tabriz, Tabriz, Iran Prof. K. Gong, School of Management, Chongqing Jiaotong University, Chongqing, China Crina Gosnan, Associate Professor, Department of Computer Science, Babes-Bolyai University, Cluj-Napoca, Romania Mohamed Gouda, The University of Texas, Computer Science Department, Austin, Texas Mihai Grigore, Department for Management, Technology and Economics Group for Management, Information Systems, Zürich, Switzerland Cheng Guang, Southeast University, Nanjing, China Venkat N. Gudivada, Weisburg Division of Engineering and Computer Science, Marshall University Huntington, Huntington, West Virginia Sankhadeep Chatterjee, UEM Kolkata, India Ambika Annavarapu, GRIET, Hyderabad, India Prof. Wang Guojun, School of Information Science and Engineering of Zhong Nan University, China Prof. Nguyen Ha, Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatchewan, Canada Dr. Z. J. Haas, School of Electrical Engineering, Cornell University, Ithaca, New York Prof. Mohand Said Hacid, Lyon University, France Prof. Haffaf Hafid, University of Oran, Oran, Algeria Prof. M. Tarafdar Hagh, Department of Electrical Engineering, Islamic Azad University, Ahar, Iran Ridha Hamdi, University of Sfax, Sfax, Tunisia, North Africa Prof. Dae Man Han, Green Home Energy Center, Kongju National University, Republic of Korea Prof. Xiangjian He, University of Technology, Sydney, Australia Prof. Richard Heeks, University of Manchester, Manchester, United Kingdom Mr. Walid Khaled Hidouci, Ecole Nationale Supérieure d’Informatique, Algeria Sayan Chakraborty, BCET, Durgapur, India Simona Moldovanu, Universitatea Dunarea de Jos Galati, Galaţi, Romania Prof. Achim Hoffmann, Prof. Achim Hoffmann School of Computer Science and Engineering, The University of New South Wales, Australia


Ma Hong, Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China Hyehyun Hong, Department of Advertising and Public Relations, Chung-Ang University, South Korea Qinghua Hu, Harbin Institute of Technology, China Honggang Hu, School of Information Science and Technology, University of Science and Technology of China, P. R. China Fengjun Hu, Zhejiang Shuren University, Zhejiang, China Dr. Qinghua Huang, School of Electronic and Information Engineering, South China University of Technology, China Chiang Hung-Lung, China Medical University, Taichung, Taiwan Kyeong Hur, Dept. of Computer Education, Gyeongin National University of Education, Incheon, Korea Wen-Jyi Hwang, Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei Gabriel Sebastian Ioan Ilie, Computer Science and Engineering Dept., University of Connecticut, Mansfield, Connecticut Sudath Indrasinghe, School of Computing and Mathematical Sciences, Liverpool John Moores University, Liverpool, England Ushio Inoue, Dept. of Information and Communication Engineering, Tokyo Denki University, Tokyo, Japan Dr. Stephen Intille, Associate Professor, College of Computer and Information Science and Dept. of Health Sciences, Northeastern University, Boston, Massachusetts Soumen Banerjee, UEM Kolkata, India Prasenjit Chatterjee, MCKV, Kolkata, India Dr. M. T. Islam, Institute of Space Science, Universiti Kebangsaan Malaysia, Selangor, Malaysia Lillykutty Jacob, Professor, Department of Electronics and Communication Engineering, NIT, Calicut, Kerala, India Anil K. Jain, Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan Dagmar Janacova, Tomas Bata University in Zlín, Faculty of Applied Informatics, nám. T. G., Czech Republic, Europe Kairat Jaroenrat, Faculty of Engineering at Kamphaengsaen, Kasetsart University, Bangkok, Thailand Don Jyh-Fu Jeng, Assistant Professor, Institute of International Management, National Cheng Kung University, Taiwan Minseok Jeon, Department of Computer Science, Yonsei University, Seoul, South Korea Prof. Guangrong Ji, College of Information Science and Engineering, Ocean University of China, Qingdao, China Yoon Ji-Hyeun, Department of Computer Science, Yonsei University, Seoul, South Korea Zhiping Jia, Computer Science and Technology, Shandong University, Jinan, China


Syeda Erfana Zohora, Tri’f University, KSA Amartya Mukherjee, IEM Kolkata, India Samarjeet Borah, Sikkim Manipal University, India Sarwar Kamal, East West University, Bangladesh Liangxiao Jiang, Department of Computer Science, China University of Geosciences, Wuhan, China David B. Johnson, Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania Prof. Chen Junning, Electronic Information and Engineering, Anhui University, Hefei, China Seok Kang, Associate Professor, University of Texas, San Antonio, Texas Ghader Karimian, Assistant Professor, Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran S. Karthikeyan, Department of Information Technology, College of Applied Science, Sohar, Oman, Middle East Michael Kasper, Fraunhofer Institute for Secure Information Technology, Germany L. Kasprzyczak, Institute of Innovative Technologies EMAG, Katowice, Poland Zahid Khan, School of Engineering and Electronics, The University of Edinburgh, Mayfield Road, Scotland Jin-Woo Kim, Department of Electronics and Electrical Engineering, Korea University, Seoul, Korea Muzafar Khan, Computer Science Department, COMSATS University, Pakistan Jamal Akhtar Khan, Department of Computer Science, College of Computer Engineering and Sciences, Salman bin Abdulaziz University, Kingdom of Saudi Arabia Kholaddi Kheir Eddine, University of Constantine, Algeria Dr. Fouad Khelifi, School of Computing, Engineering and Information Sciences, Northumbria University, Newcastle upon Tyne, England Shubhalaxmi Kher, Arkansas State University, College of Engineering, Jonesboro, Arkansas Sally Kift, James Cook University, Townsville, Queensland Sunkyum Kim, Department of Computer Science, Yonsei University, Seoul, Korea Leonard Kleinrock, University of California, Los Angeles, Computer Science Department, California Dirk Koch, School of Computer Science, University of Manchester, Manchester, England Zbigniew Kotulski, Warsaw University of Technology, Faculty of Electronics and Information Technology, Institute of Telecommunications, Warszawa, Poland Ray Kresman, Bowling Green State University, Bowling Green, OH, USA Ajay Kshemkalyani, Department of Computer Science, University of Illinois, Chicago, IL Madhu Kumar, Associate Professor, Computer Engineering Department, Nanyang Technological University, Singapore Anup Kumar, Professor, Director MINDS Lab, University of Louisville, Kentucky, USA


Md Obaiduallh Sk, Aliah University, Kolkata, India Kaiser J. Giri, Islamic University, India Sirshendu Hore, HETC, Hooghly, India Hemanta Dey, TICT, India James Tin-Yau Kwok, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong Zhiling Lan, Department of Computer Science, Illinois Institute of Technology, Chicago, IL Hayden Kwok-Hay So, Department of Electrical and Electronics Engineering, University of Hong Kong, Hong Kong Zhiling Lan, Department of Computer Science, Illinois Institute of Technology, Chicago, Illinois K. G. Langendoen, Delft University of Technology, Netherlands, Europe Michele Lanza, REVEAL Research Group, Faculty of Informatics, University of Lugano, Switzerland Shalini Batra, Computer Science and Engineering Dept., Thapar University, Patiala, Punjab, India Shajulin Benedict, Director, HPCCLoud Research Laboratory, St. Xavier’s Catholic College of Engineering Chunkankadai District, Nagercoil, Tamil Nadu Rajendra Kumar Bharti, Assistant Prof. Kumaon Engg. College, Dwarahat, Uttarakhand, India Prof. Murali Bhaskaran, Dhirajlal Gandhi College of Technology, Salem, Tamil Nadu, India Prof. Komal Bhatia, YMCA University, Faridabad, Haryana, India Prof. S. R. Biradar, Department of Information Science and Engineering, SDM College of Engineering and Technology, Dharwad, Karnataka Prayag Tiwari, National University of Science and Technology MISIS, Moscow, Russia Surekha B, KS Institute of Technology, Bangalore, India A. K. Chaturvedi, Department of Electrical Engineering, IIT Kanpur, India Jitender Kumar Chhabra, NIT Kurukshetra, Haryana, India Pradeep Chouksey, Principal, TIT college, Bhopal, MP, India Chhaya Dalela, Associate Professor, JSSATE, Noida, Uttar Pradesh, India Jayanti Dansana, KIIT University, Bhubaneswar, Odisha, India Soura Dasgupta, Department of TCE, SRM University, Chennai, India Dr. Apurva A. Desai, Veer Narmad South Gujarat University, Surat, India V. Susheela Devi, Senior Scientific Officer, Department of Computer Science and Automation, Indian Institute of Science, Bangalore Dr. Bikash Kumar Dey, Department of Electrical Engineering, IIT Bombay, Powai, Maharashtra Vijay Pal Dhaka, Jaipur National University, Jaipur, Rajasthan


Dr. Anup Palsokar, Department of Computer Applications, SIES College of Management Studies, India Dr. Ramesh Prajapati, Gujarat Technological University (GTU), India Dr. Nageswara Rao Moparthi, Velagapudi Ramakrishan Siddhartha Engineering College, AP, India Shashi Mehrotra, KL University, Vaddeswaram, India A. M. Viswa Bharathy, Jyothishmathi Institute of Technology and Science, India Dr. K. Srujan Raju, Prof. and HOD Dept. of CSE, Dean SWF, CMR Technical Campus, Hyderabad, India

Preface

The Third International Conference on Smart Trends for Information Technology and Computer Communications (SmartCom-2019) targets state-of-the-art as well as emerging topics pertaining to information, computer communications and effective strategies for its implementation for engineering and managerial applications. The conference attracts a large number of high-quality submissions and stimulates the cutting-edge research discussions among many academic pioneering researchers, scientists, industrial engineers and students from all around the world and provides a forum to researchers for proposing new technologies, sharing their experiences and discussing future solutions for design infrastructure for ICT, providing common platform for academic pioneering researchers, scientists, engineers and students to share their views and achievements, enriching technocrats and academicians by presenting their innovative and constructive ideas and focusing on innovative issues at international level by bringing together the experts from different countries. The conference was held during January 24–25, 2019, at Hotel Novotel, Siam Square, Bangkok, Thailand, and organized by Global Knowledge Research Foundation, and associated/supported partners were Springer, InterYIT, IFIP and Nagpur Institute of Technology, Nagpur, India. Research submissions in various advanced technology areas were received, and after a rigorous peer review process with the help of program committee members and 31 external reviewers for 170 papers from 12 different countries including India, China, Malaysia, Philippines, Thailand, Germany, Ecuador, Canada, Viet Nam, Serbia, South Korea and UK, 51 were accepted with an acceptance ratio of 0.16. This event's success was possible only with the help and support of our team and organizations. With immense pleasure and honor, we would like to express our sincere thanks to the authors for their remarkable contributions, all the Technical Program Committee members for their time and expertise in reviewing the papers within a very tight schedule, and the publisher Springer for their professional help. This is the third conference of the series SmartCom in which proceedings are published as a SIST volume by Springer. We are overwhelmed by our two


distinguished scholars and appreciate them for accepting our invitation to deliver keynote speeches to the conference and six technical session chairs for analyzing the research work presented by the researchers. Most importantly, we are also grateful to our local support team for their hard work for the conference. This series has already been made a continuous series which will be hosted at different locations every year.

Yu-Dong Zhang, Leicester, UK
Jyotsna Kumar Mandal, Kalyani, India
Chakchai So-In, Khon Kaen, Thailand
Nileshsingh V. Thakur, Nagpur, India

25 January 2019

Contents

1 Comparison of Different Image Segmentation Techniques on MRI Image (Afifi Afandi, Iza Sazanita Isa, Siti Noraini Sulaiman, Nur Najihah Mohd Marzuki and Noor Khairiah A. Karim) . . . 1
2 PSO-ANN-Based Computer-Aided Diagnosis and Classification of Diabetes (Ratna Patil and Sharvari C. Tamane) . . . 11
3 Round Robin Scheduling Based on Remaining Time and Median (RR_RT&M) for Cloud Computing (Mayuree Runsungnoen and Tanapat Anusas-amornkul) . . . 21
4 Business-Driven Blockchain-Mempool Model for Cooperative Optimization in Smart Grids (Marius Stübs, Wolf Posdorfer and Julian Kalinowski) . . . 31
5 Research of Institutional Technology Diffusion Rules Based on Patent Citation Network—A Case Study of AI Field (Zhao Rongying, Li Xinlai and Li Danyang) . . . 41
6 Impact on the Information Security Management Due to the Use of Social Networks in a Public Organization in Ecuador (Segundo Moisés Toapanta Toapanta, Félix Gustavo Mendoza Quimi, Leslie Melanie Romero Lambogglia and Luis Enrique Mafla Gallegos) . . . 51
7 Appropriate Security Protocols to Mitigate the Risks in Electronic Money Management (Segundo Moisés Toapanta Toapanta, María Elissa Coronel Zamora and Luis Enrique Mafla Gallegos) . . . 65
8 Acceptance and Readiness of Thai Farmers Toward Digital Technology (Suwanna Sayruamyat and Winai Nadee) . . . 75
9 Neural Network Classifier for Diagnosis of Diabetic Retinopathy (Gauri Borkhade and Ranjana Raut) . . . 83
10 Comparative Analysis of Data Mining Classification Techniques for Prediction of Heart Disease Using the Weka and SPSS Modeler Tools (Atul Kumar Ramotra, Amit Mahajan, Rakesh Kumar and Vibhakar Mansotra) . . . 89
11 An Automated Framework to Uncover Malicious Traffic for University Campus Network (Amit Mahajan, Atul Kumar Ramotra, Vibhakar Mansotra and Maninder Singh) . . . 99
12 Comparative Analysis of K-Means Algorithm and Particle Swarm Optimization for Search Result Clustering (Shashi Mehrotra and Aditi Sharan) . . . 109
13 Design and Implementation of Rule-Based Hindi Stemmer for Hindi Information Retrieval (Rakesh Kumar, Atul Kumar Ramotra, Amit Mahajan and Vibhakar Mansotra) . . . 115
14 Research on the Development Trend of Ship Integrated Power System Based on Patent Analysis (Rongying Zhao, Danyang Li and Xinlai Li) . . . 123
15 Detection of Data Anomalies in Fog Computing Architectures (K. Vidyasankar) . . . 133
16 Cloud Data for Marketing in Tourism Sector (Pritee Parwekar and Gunjan Gupta) . . . 143
17 Road Travel Time Prediction Method Based on Random Forest Model (Wanchao Song and Yinghua Zhou) . . . 155
18 Video Synchronization and Alignment Using Motion Detection and Contour Filtering (K. Seemanthini, S. S. Manjunath, G. Srinivasa and B. Kiran) . . . 165
19 Mutichain Enabled EHR Management System and Predictive Analytics (Meghana Nagori, Aditya Patil, Saurabh Deshmukh, Gauri Vaidya, Mayur Rahangdale, Chinmay Kulkarni and Vivek Kshirsagar) . . . 179
20 Quick Insight of Research Literature Using Topic Modeling (Vrishali Chakkarwar and Sharvari C. Tamane) . . . 189
21 Secure Cloud-Based E-Healthcare System Using Ciphertext-Policy Identity-Based Encryption (CP-IBE) (Dipa D. Dharamadhikari and Sharvari C. Tamane) . . . 199
22 Security Vulnerabilities of OpenStack Cloud and Security Assessment Using Different Software Tools (Manisha P. Bharati and Sharvari C. Tamane) . . . 211
23 Smart Physical Intruder Detection System for Highly Sensitive Area (Smita Kasar, Vivek Kshirsagar, Sagar Bokan and Ninad Rathod) . . . 221
24 Two-level Classification of Radar Targets Using Machine Learning (Aparna Rathi, Debasish Deb, N. Sarath Babu and Reena Mamgain) . . . 231
25 A Cognitive Semantic-Based Approach for Human Event Detection in Videos (K. Seemanthini, S. S. Manjunath, G. Srinivasa, B. Kiran and P. Sowmyasree) . . . 243
26 Analysis of Adequate Bandwidths to Guarantee an Electoral Process in Ecuador (Segundo Moisés Toapanta Toapanta, Johan Eduardo Aguilar Piguave and Luis Enrique Mafla Gallegos) . . . 255
27 Load and Renewable Energy Forecasting for System Modelling, an Effort in Reducing Renewable Energy Curtailment (Dipam Chaudhari and Chaitanya Gosavi) . . . 267
28 RAM: Rotating Angle Method of Clustering for Heterogeneous-Aware Wireless Sensor Networks (Kameshkumar R. Raval and Nilesh Modi) . . . 277
29 GWO-GA Based Load Balanced and Energy Efficient Clustering Approach for WSN (Amruta Lipare, Damodar Reddy Edla, Ramalingaswamy Cheruku and Diwakar Tripathi) . . . 287
31 Proof of Authenticity-Based Electronic Medical Records Storage on Blockchain (Mustafa Qazi, Devyani Kulkarni and Meghana Nagori) . . . 297
32 Hardware Implementation of Elliptic Curve Cryptosystem Using Optimized Scalar Multiplication (Rakesh K. Kadu and Dattatraya S. Adane) . . . 307
33 Efficient Resource Provisioning Through Workload Prediction in the Cloud System (Lata J. Gadhavi and Madhuri D. Bhavsar) . . . 317
34 An Approach for Environment Vitiation Analysis and Prediction Using Data Mining and Business Intelligence (Shubhangi Tirpude, Aarti Karandikar and Rashmi Welekar) . . . 327
35 Preserving Authentication and Access Control by Using Strong Passwords Through Image Fusion Mechanism (Vijay B. Gadicha and Abrar S. Alvi) . . . 339
36 Performance Improvement of Direct Torque and Flux Controlled AC Motor Drive (Jagdish G. Chaudhari and Sanjay B. Bodkhe) . . . 351
37 Cryptocurrency: A Comprehensive Analysis (Gaurav Chatterjee, Damodar Reddy Edla and Venkatanareshbabu Kuppili) . . . 365
38 Challenges in Recognition of Online and Off-line Compound Handwritten Characters: A Review (Ratnashil N. Khobragade, Nitin A. Koli and Vrushali T. Lanjewar) . . . 375
39 Novel Idea of Unique Key Generation and Distribution Using Threshold Science to Enhance Cloud Security (Devishree Naidu, Shubhangi Tirpude and Vrushali Bongirwar) . . . 385
40 Information Retrieval Using Latent Semantic Analysis (Rahul Khokale, Nileshsingh V. Thakur, Mahendra Makesar and Nitin A. Koli) . . . 393
41 Can Music Therapy Reduce Human Psychological Stress: A Review (Nikita R. Hatwar and Ujwalla H. Gawande) . . . 405
42 Web Mash-Up Development and Security Using AOP (Manjusha Tatiya and Sharvari C. Tamane) . . . 413
43 Design Consideration of Malay Text Stemmer Using Structured Approach (Mohamad Nizam Kassim, Shaiful Hisham Mat Jali, Mohd Aizaini Maarof, Anazida Zainal and Amirudin Abdul Wahab) . . . 421
44 Enhanced Text Stemmer with Noisy Text Normalization for Malay Texts (Mohamad Nizam Kassim, Shaiful Hisham Mat Jali, Mohd Aizaini Maarof, Anazida Zainal and Amirudin Abdul Wahab) . . . 433
45 Modified Moth Search Algorithm for Portfolio Optimization (Ivana Strumberger, Eva Tuba, Nebojsa Bacanin and Milan Tuba) . . . 445
46 Towards the Adoption of Self-Driving Cars (Omayma Alqatawneh, Alex Coles and Ertu Unver) . . . 455
47 An Overview on Privacy Preservation and Public Auditing on Outsourced Cloud Data (Sonali D. Khambalkar, Shailesh D. Kamble, Nileshsingh V. Thakur, Nilesh U. Sambhe and Nikhil S. Mangrulkar) . . . 463
48 Segmentation of Handwritten Text Using Bacteria Foraging Optimization (Rajesh Agrawal, Prashant Sahai Saxena, Vijay Singh Rathore and Saurabh Maheshwari) . . . 471
49 Problems with PIR Sensors in Smart Lighting+Security Solution and Solutions of Problems (Pinak Desai and Nilesh Modi) . . . 481
50 Multi-level Thresholding and Quantization for Segmentation of Color Images (Shailesh T. Khandare and Nileshsingh V. Thakur) . . . 487
Author Index . . . 497

About the Editors

Prof. Yu-Dong Zhang received his B.S. and M.S. degrees from Nanjing University of Aeronautics and Astronautics in 2004 and 2007, prior to completing his Ph.D. in Signal and Information Processing at Southeast University in 2010. From 2010 to 2012, he worked at Columbia University as a postdoc. From 2012 to 2013, he worked as a research scientist at Columbia University and New York State Psychiatric Institute. From 2013 to 2017, he worked as a Full Professor and doctoral advisor at the School of Computer Science and Technology at Nanjing Normal University. Since 2018, he has been a Full Professor at the Department of Informatics, University of Leicester, UK.

Prof. Jyotsna Kumar Mandal received his M.Sc. in Physics from Jadavpur University and his M.Tech. in Computer Science from the University of Calcutta, prior to completing his Ph.D. in Computer Science & Engineering at Jadavpur University in 2000. He began his career as a lecturer at NERIST, Arunachal Pradesh, in 1988. With 32 years of teaching and research experience, he has over 400 articles and seven books to his credit, and 23 scholars have completed their Ph.D. under him. Presently, he is working as a Professor of Computer Science & Engineering at Kalyani University, West Bengal.

Prof. Chakchai So-In has been with the Department of Computer Science at Khon Kaen University since 2010. Dr. So-In received his B.Eng. and M.Eng. degrees from Kasetsart University (KU), Bangkok, Thailand, in 1999 and 2001, respectively. He also completed M.S. and Ph.D. degrees in Computer Engineering at the Department of Computer Science and Engineering, Washington University in St. Louis (WUSTL), MO, USA, in 2006 and 2010. He has more than 100 publications in prominent journals and proceedings, along with 10 books. He also acts as an editor, TPC member, and reviewer for a number of journals and conferences. His research interests include Mobile Computing/Sensor Networks, Internet of Things, Computer/Wireless/Distributed Networks, Cyber Security, Intelligent Systems and Future Internet. He is also a senior member of IEEE and ACM.


Prof. Nileshsingh V. Thakur is a Professor and Principal of Nagpur Institute of Technology, Nagpur, India. He completed his B.Eng. at the Government College of Engineering, Amravati, in 1992; his M.Eng. at SGB Amravati University in 2005; and his Ph.D. in Computer Science and Engineering at Visvesvaraya National Institute of Technology (VNIT), Nagpur, India, in 2010. His research interests are in Image & Video Processing, Advanced Optimization, Neural Networks, Sensor & Ad Hoc Networks, Pattern Recognition & Data Mining, etc. He has 27 years of teaching and research experience, and has published more than 80 papers in these areas.

Chapter 1

Comparison of Different Image Segmentation Techniques on MRI Image

Afifi Afandi, Iza Sazanita Isa, Siti Noraini Sulaiman, Nur Najihah Mohd Marzuki and Noor Khairiah A. Karim

Abstract Image processing techniques have become important and widely used for image analysis due to advances in computer learning. Processed images are analyzed to obtain quantitative information or data from them. Segmentation is the part of image processing for which many different techniques and algorithms are available. Generally, the main purpose of segmentation is to improve or change the image digitally so that useful information is easier to analyze. However, evaluating the effectiveness of segmentation algorithms is a priority for obtaining reliable results. Therefore, this study compares different image segmentation techniques on MRI images. Different clustering algorithms are tested on MRI images to identify the white matter hyperintensities (WMH) region of the human brain, and the identification accuracy of several MRI image segmentation techniques is compared. The best-performing technique is suitable for implementation in a computer-aided tool for medical monitoring or analysis.

A. Afandi · I. S. Isa (B) · S. N. Sulaiman · N. N. M. Marzuki · N. K. A. Karim
Faculty of Electrical Engineering, Universiti Teknologi MARA, Penang Campus, 13500 Permatang Pauh, Pulau Pinang, Malaysia
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_1

1.1 Introduction

One of the main purposes of digital image processing is to obtain the required information and data from the tested images [1]. In the simplest terms, the input images are first manipulated to ease interpretation by the user and to accomplish the intended goals. Digital image processing basically operates on two-dimensional (2D) squares known as pixels [2]. Segmentation techniques are part of image processing and computer vision. Generally, segmentation is not only used as an algorithm in itself but also supports several other tasks. Image segmentation divides the subject into multiple segments, and useful data from each segmented region are obtained during the process in the form of color, intensity or texture [3].


Besides, image segmentation also plays a main role as the basis of image recognition. The process depends on specific criteria to classify the image regions that the user is interested in [4]. There are various techniques for performing the segmentation process, but over time it has become clear that no single technique is practical for all types of images. In this study, a comparison between image segmentation algorithms is made on medical images. Through the years, many researchers have proposed different types of segmentation algorithms. The segmentation process can be classified into two parts, that is, discontinuity and similarity. The research by S. Kannan, V. Gurusamy and G. Nalini (2015) found that discontinuity properties basically partition images based on the level of intensity, while the similarity properties concern the division of an image into regions [5]. Digital images are usually exposed to numerous types of distortion that can affect the image quality. Therefore, a proper image segmentation technique meant for a specific image type, particularly medical images, is important to retain the useful information of the image.

1.2 Methodology

Generally, white matter hyperintensities (WMH) are commonly found in the deep white matter on scanned MRI images. Early detection of WMH development is an important step to prevent brain damage due to small vessel disease at early stages [6]. WMH usually occur among adults and older persons and may lead to some critical conditions such as stroke, dementia, cognitive decline and Alzheimer's disease [7]. Therefore, an intelligent and efficient automated identification technique needs to be implemented for processing medical diagnostic images. The proposed study focuses on comparing different segmentation techniques on MRI images of the human brain. Several segmentation methods are performed to obtain quantitative data and information on the white matter (WM). Hence, the comparison is made based on the performance of the algorithms in identifying the WM region. The segmentation process is performed on the T2-weighted images (T2-WI) of MRI scans obtained from the Advanced Medical and Dental Institute (AMDI), Universiti Sains Malaysia, Pulau Pinang. Figure 1.1 shows the overall idea of this study, where the main objective is to compare different clustering algorithms for classifying the WMH region in the WM of the brain on MRI images.
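As an illustration of this input stage only (the paper does not describe its loading tooling), the following sketch reads a single T2-WI slice from a DICOM file with pydicom and rescales it to an 8-bit grayscale array; the function name and the example file name are assumptions.

```python
# Hypothetical helper for loading one T2-WI DICOM slice; not from the paper.
import numpy as np
import pydicom

def load_t2wi_slice(path):
    """Read a DICOM slice and rescale its intensities to 8-bit grayscale."""
    ds = pydicom.dcmread(path)                       # parse the DICOM file
    img = ds.pixel_array.astype(np.float64)          # raw pixel intensities
    rng = max(img.max() - img.min(), 1e-12)          # avoid division by zero
    img = (img - img.min()) / rng                    # normalize to [0, 1]
    return (img * 255).astype(np.uint8)              # 8-bit grayscale slice

# Example call with a hypothetical file name:
# t2_slice = load_t2wi_slice("subject01_t2wi_slice07.dcm")
```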

1.2.1 Preprocessing

Initially, preprocessing is done on the MRI images to remove background noise. Generally, denoising and enhancement of images are included in the preprocessing techniques. There are many different denoising approaches, such as filtering and morphological operations [8]. In the next step, the grayscale image is converted into a binary image before performing the segmentation technique. All the pixels of the 8-bit grayscale image are mapped between the values 1 (white) and 0 (black). These image properties make the process much easier, as the image consists of only a single intensity value for each pixel. To enhance the performance of the analysis, structuring element decomposition is done by breaking the image into a region of interest; the toolbox contains many shapes that can be used as the region of interest. Besides, morphological processing from the MATLAB Image Processing Toolbox is performed on the MRI images. These steps are done to remove the skull and scalp, as shown in Fig. 1.2. The operations, including erosion, erode away the boundaries of regions in the image pixels. The same preprocessing is applied before all of the segmentation methods.

Fig. 1.1 An overview of the proposed segmentation algorithm methods on MRI images

Fig. 1.2 Preprocessing removal of skull and scalp: (i) before, (ii) after
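The paper performs this step with MATLAB's Image Processing Toolbox (a disk-shaped strel plus erosion), so the following is only a rough Python analogue of the described skull and scalp removal; the threshold, disk radius and largest-component heuristic are illustrative assumptions rather than the authors' exact procedure.

```python
# A rough analogue of the described preprocessing; parameter values are assumptions.
import numpy as np
from skimage.morphology import disk, binary_erosion
from skimage.measure import label

def strip_skull(gray_slice, threshold=30, radius=10):
    """Suppress skull/scalp by eroding a binary mask and keeping the largest region."""
    mask = gray_slice > threshold                  # grayscale -> binary image
    mask = binary_erosion(mask, disk(radius))      # disk-shaped structuring element erosion
    labels = label(mask)                           # connected components of the eroded mask
    if labels.max() > 0:
        sizes = np.bincount(labels.ravel())
        sizes[0] = 0                               # ignore the background label
        mask = labels == sizes.argmax()            # keep the largest component (the brain)
    return gray_slice * mask                       # zero out everything outside the brain
```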

1.2.2 Segmentation Techniques

Due to the development of technology, image segmentation techniques have become important in digital image processing. Several image segmentation techniques have already been developed by researchers and scientists through the years [9], and some of them are very well known and widely used for the segmentation process. After the preprocessing is completed, the segmentation is performed using different algorithms. In this study, the segmentation process is done using three different clustering algorithms, namely K-means, fuzzy C-means and the superpixels algorithm. Basically, the clustered images are converted into colored images for viewing the ROI, as colored images are formed of three basic color components, that is, red, green and blue (RGB). Each clustered ROI is determined by the combination of these colors and their intensity properties. Therefore, the segmented regions can be seen clearly as different color regions.

K-Means Clustering

The K-means (KM) algorithm is a clustering technique that performs segmentation and classifies an image into different classes of pixels. KM is a simple and efficient unsupervised learning algorithm with low computational complexity. The number of clusters can be decided by the user, which makes this technique very convenient to use. The algorithm works by measuring the distance of the data to initial points and clustering the data based on that distance. By using normalization techniques for the selection of the initial clusters, the number of iterations can be reduced [10]. The initial values of the clustering centers are chosen based on the closest intensities; hence, the system becomes more efficient with less processing time [11]. The mathematical formulation of the KM algorithm is given by Eq. (1.1):

D(v) = \sum_{j=1}^{k} \sum_{i=1}^{n} \left( \left\| x_i - v_j \right\| \right)^2    (1.1)

where n is the number of cluster centers, k is the number of data points and ||x_i − v_j|| is the Euclidean distance from each pixel to a center. Generally, the number of clusters is defined by the user, and the centroid values are set randomly. The algorithm iterates to minimize the distance between the points and the centroids: every iteration assigns each point to the nearest cluster centroid, and the centroids are then recalculated until their values no longer change.
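As a concrete sketch of these iterations (an illustration, not the authors' implementation), the NumPy code below clusters the pixel intensities of a slice with K-means and maps the resulting labels to RGB colors for display, as described earlier; the choice of four clusters and the color palette are assumptions.

```python
# Illustrative K-means on pixel intensities, following the iteration described above.
import numpy as np

def kmeans_segment(gray_slice, k=4, iters=100, seed=0):
    """Cluster pixel intensities into k classes and return the label map and centers."""
    x = gray_slice.reshape(-1, 1).astype(np.float64)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        d = np.abs(x - centers.T)                            # distance of every pixel to every center
        labels = d.argmin(axis=1)                            # assign each pixel to the nearest centroid
        new_centers = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # stop when centroids no longer change
            break
        centers = new_centers
    return labels.reshape(gray_slice.shape), centers

# Hypothetical palette so each cluster (e.g., background, CSF, GM, WM) gets its own color.
PALETTE = np.array([[0, 0, 0], [0, 0, 255], [0, 255, 0], [255, 0, 0]], dtype=np.uint8)
# labels, centers = kmeans_segment(t2_slice); colored = PALETTE[labels]
```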

μ_ij = 1 / Σ_{k=1}^{c} (d_ij / d_ik)^(2/(m−1))    (1.2)

where c is the number of cluster centers, d_ij is the distance between the ith data point and the jth cluster center, m is the fuzziness index and μ_ij is the membership of the ith data point in the jth cluster. The algorithm works by assigning a membership to every data point for each cluster. At first, the c cluster centers are selected randomly. Then, the fuzzy memberships are calculated using Eq. (1.2), and the fuzzy centers V_j are updated using Eq. (1.3), for j = 1, 2, …, c:

V_j = ( Σ_{i=1}^{n} (μ_ij)^m · x_i ) / ( Σ_{i=1}^{n} (μ_ij)^m )    (1.3)

Fuzzy segmentation benefits the user as it keeps more information from the original image than hard segmentation [13]. FCM is able to detect the rough shape of a region by using the Euclidean distance.

Superpixels region-based segmentation. Superpixels (SP) segmentation groups pixels into regions that carry more meaningful information than single pixels, which makes the method more efficient. The algorithm works by creating boundaries on the image without losing any quantitative information through the segmentation. Superpixels are a preferred method for many vision tasks on compact and complex images. The algorithm works well and gives the best results when the number of superpixels is chosen appropriately; as the number of superpixels grows for higher-level tasks, the application becomes more complex.
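As a small illustration of the clustering step described above (not the authors' MATLAB pipeline), the sketch below segments a 2-D grayscale slice with scikit-learn's K-means and with a minimal fuzzy C-means loop that follows the membership update of Eq. (1.2) and the center update of Eq. (1.3); the function names, the choice of four clusters and the use of NumPy/scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_segment(img, k=4, seed=0):
    """Hard K-means labels for a 2-D grayscale slice (one intensity per pixel)."""
    x = img.reshape(-1, 1).astype(float)
    labels = KMeans(n_clusters=k, random_state=seed).fit_predict(x)
    return labels.reshape(img.shape)

def fcm_segment(img, c=4, m=2.0, iters=100, tol=1e-5, seed=0):
    """Fuzzy C-means on pixel intensities; returns the hard label of each pixel."""
    x = img.reshape(-1, 1).astype(float)
    rng = np.random.default_rng(seed)
    u = rng.random((x.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)              # memberships of each pixel sum to 1
    for _ in range(iters):
        um = u ** m
        v = (um.T @ x) / um.sum(axis=0)[:, None]   # Eq. (1.3): update cluster centers
        d = np.abs(x - v.T) + 1e-12                # distances to each center (1-D intensities)
        # Eq. (1.2): mu_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        u_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(u_new - u).max() < tol:
            u = u_new
            break
        u = u_new
    return u.argmax(axis=1).reshape(img.shape)
```

Superpixels can be obtained in a similar spirit with skimage.segmentation.slic, which likewise only needs the image and a desired number of superpixels.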

1.3 Results and Discussion

The results of the segmentation techniques are discussed in terms of their performance and are categorized into two types, qualitative and quantitative.

Qualitative Results. Qualitative data are usually assessed through their characteristics and qualities; such data cannot be expressed in terms of numbers. In this study, the qualitative evaluation is done by visual inspection, and every segmented image is assessed for each technique. The segmentation is made to identify the WMH in the human brain; hence, each image is clustered into four regions, namely white matter (WM), gray matter (GM), background and cerebrospinal fluid (CSF). For the analysis, the images are converted from digital imaging and communication in medicine (DICOM) format to grayscale images, as grayscale images represent


only a single value for each pixel. So the analysis becomes simpler compared to color images. Then the morphological structuring element operation is done to remove the skull and scalp; the strel function creates a disk-shaped structuring element for the region of interest. Table 1.1 shows the qualitative segmentation results of the three clustering algorithms, KM, FCM and SP.

Table 1.1 Results of different clustering algorithms (original MRI images and the corresponding KM, FCM and SP segmentations for Subjects 1–5)


Quantitative Results. Quantitative data are measured for a more specific assessment, so that the information obtained is valuable; these data can be represented numerically and statistically. Overall, five subjects were provided by AMDI, and 10–15 slices were extracted for each subject, so about 75 slices of T2-WI images have been tested in this study. The measurement results are summarized in Tables 1.2 and 1.3; the data tabulated are the average of each subject over all slices.

Table 1.2 Measurement of segmented area

Region | Technique | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5
WMH | KM | 42.5 | 55.67 | 49.9 | 72.6 | 10.6
WMH | FCM | 169.81 | 14.8 | 64 | 5 | 142.9
WMH | SP | 145.84 | 88.5 | 23 | 144.25 | 199.44
WM | KM | 1655.5 | 17,248 | 96,183 | 21,413 | 11,569
WM | FCM | 6009 | 937 | 1949 | 64,055 | 3434
WM | SP | 7402.6 | 2557 | 18,662.5 | 41,265 | 33,554
GM | KM | 10558 | 2852.9 | 17683.3 | 1515.2 | 1761.5
GM | FCM | 1835.8 | 1850.2 | 1756.98 | 1703.2 | 1787.3
GM | SP | 18,322 | 9428.4 | 10901.6 | 8120 | 24,242
CSF | KM | 398.62 | 136.2 | 105.4 | 172.4 | 396.77
CSF | FCM | 151.75 | 241.63 | 336.33 | 105.88 | 771.27
CSF | SP | 480.33 | 658.77 | 1146.2 | 713.29 | 2246.4

Table 1.3 Measurement of mean intensity

Region | Technique | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5
WMH | KM | 80.05 | 103.12 | 79.96 | 136.2 | 83.25
WMH | FCM | 77.2324 | 58.7 | 64.2 | 53.4 | 63.5
WMH | SP | 73.307 | 69.5 | 49.35 | 61 | 76.44
WM | KM | 68.932 | 83.9 | 69.81 | 88.65 | 58.47
WM | FCM | 87.378 | 192.3 | 158.28 | 79.62 | 86.867
WM | SP | 79.83 | 166.38 | 58.5 | 69.5 | 53.5
GM | KM | 104.68 | 133.2 | 186.27 | 124.3 | 1761.5
GM | FCM | 131.164 | 103.36 | 112.1 | 115.9 | 21.59
GM | SP | 109.35 | 116.56 | 96.63 | 134.65 | 93.714
CSF | KM | 56.682 | 43.43 | 79.8 | 56.64 | 396.7
CSF | FCM | 169.224 | 74.74 | 336.33 | 163.1 | 127.48
CSF | SP | 166.571 | 78.1 | 77.13 | 146.7 | 85.6


Table 1.4 Measurement of time processed

Technique | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 | Average
KM | 35.358 s | 22.414 s | 21.387 s | 19.700 s | 23.547 s | 24.4818 s
FCM | 36.143 s | 35.860 s | 36.253 s | 34.457 s | 32.892 s | 35.1213 s
SP | 20.986 s | 18.625 s | 19.206 s | 19.676 s | 19.061 s | 19.511 s

The comparison of the different clustering algorithms is made to evaluate the performance of each technique. Table 1.4 shows the processing time for each segmentation technique; the time taken by each algorithm to process the images is recorded for every subject. The results show that the superpixels clustering algorithm provides the fastest average processing time compared to KM and FCM.
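A minimal sketch (not the authors' MATLAB timing code) of how a per-subject processing time such as the one in Table 1.4 can be recorded: each segmentation routine is wrapped with a wall-clock timer and the per-subject times are averaged. The function names and the data layout are assumptions for illustration.

```python
import time

def time_for_subject(segment_fn, slices):
    """Seconds needed to segment every slice of one subject."""
    start = time.perf_counter()
    for img in slices:
        segment_fn(img)
    return time.perf_counter() - start

def average_time(segment_fn, subjects):
    """subjects: list of per-subject slice lists; returns the mean time per subject."""
    times = [time_for_subject(segment_fn, s) for s in subjects]
    return sum(times) / len(times)
```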

1.4 Conclusion

In conclusion, this study presents a comparison between several segmentation techniques, all tested on MRI images of the human brain. The segmentation algorithm plays an important role in WMH detection: the best segmentation technique produces the fastest and most precise delineation of the detected region, whereas conventional manual delineation by a radiologist is time-consuming and gives variable outcomes. Overall, this study analyses the performance of the segmentation algorithms based on their processing time.

Acknowledgements The authors highly show gratitude to Advanced Medical and Dental Institute (AMDI), Kepala Batas, Universiti Sains Malaysia, and special thanks to Faculty of Electrical Engineering, Universiti Teknologi MARA P. Pinang. Special appreciation to the Human Research Ethics Committee of USM (JEPeM) for approval under code USM/JEPeM/16090293. Also special thanks to research interest group of Advanced Rehabilitation Engineering in Diagnostic and Monitoring Research Group (AREDiM).

References 1. Tanwar, B., Kumar, R., Gopal, G.: Clustering techniques for digital image segmentation. 7(12), 55–60 (2016) 2. Sharma, P., Suji, J.: A review on image segmentation with its clustering techniques. Int. J. Signal Process. Image Process. Pattern Recognit. 9(5), 209–218 (2016) 3. Khan, W.: Image segmentation techniques: a survey. J. Image Graph. 1(4), 166–170 (2014) 4. Yuheng, S., Hao, Y.: Image segmentation algorithms overview. 1 (2017) 5. Kannan, S., Gurusamy, V., Nalini, G.: Review on image segmentation techniques. Int. J. Sci. Res. Eng. Technol. 26(9), 1277–1294 (2015)


6. Wardlaw, J.M., Valdés Hernández, M.C., Muñoz-Maniega, S.: What are white matter hyperintensities made of? Relevance to vascular cognitive impairment. J. Am. Heart Assoc. 4(6), 001140 (2015) 7. Brickman, A.M., et al.: White matter hyperintensities and cognition: testing the reserve hypothesis. Neurobiol. Aging 32(9), 1588–1598 (2011) 8. Beaulah Jeyavathana, R., Balasubramanian, R., Pandian, A.A.: A survey : analysis on preprocessing and segmentation techniques for medical images. Int. J. Res. Sci. Innov. III(July), 2321–2705 (2016) 9. Heena, A., Biradar, N., Maroof, N.M.: A novel approach to review various image segmentation techniques. Int. J. Innov. Res. Comput. Commun. Eng. ISO Certif. Organ. 32975(2), 266–269 (2007) 10. Isa, I., Sulaiman, S., Tahir, N., Mustapha, M., Karim, N.A.: A new technique for K-means cluster centers initialization of WMH segmentation. In: Proceedings of the International Conference on Imaging, Signal Processing and Communication (2017) 11. Papadopoulos, S., et al.: Image clustering through community detection on hybrid image similarity graphs. In: Proceedings of International Conference on Image Processing ICIP, pp. 2353– 2356 (2010) 12. Suganya, R., Shanthi, R.: Fuzzy C-means algorithm—a review. Int. J. Sci. Res. 2(11), 1–3 (2012) 13. Ajala Funmilola, A., Oke, O., Adedeji, T., Alade, O., Adewusi, E.: Fuzzy k-c-means clustering algorithm for medical image segmentation. J. Inf. Eng. Appl. 2(6), 21–33 (2012)

Chapter 2

PSO-ANN-Based Computer-Aided Diagnosis and Classification of Diabetes Ratna Patil and Sharvari C. Tamane

Abstract Feature selection (FS) is indeed a tough, challenging and demanding task due to the large exploration space. It moderates and lessens the number of features. It also eliminates insignificant, noisy, superfluous, repetitive and duplicate data and provides reasonably adequate classification accuracy. Present feature selection approaches do face the difficulties like stagnation in local optima, delayed convergence and high computational cost. In machine learning, particle swarm optimization (PSO) is an evolutionary computation procedure which is computationally less costly and can converge quicker than other existing approaches. PSO can be effectively used in various areas, like medical data processing, machine learning and pattern matching but its potential for feature selection is yet to be fully explored. PSO improves and optimizes a candidate solution iteratively with respect to a certain degree of quality. It provides a solution to the problem by having an inhabitant of swarm particles. By applying mathematical formulas, velocity and position of swarm particles are calculated and these particles are moved in the search space. The movement of individual swarm particle is inclined by its local finest known position and is also directed to the global finest known position in the exploration space. These positions are updated as improved positions, which are found by other particles. These improved positions are then used to move the swarm in the direction of the best solutions. The aim of the study is to inspect and improve the competence of PSO for feature selection. PSO functionalities are used to detect subset of features to accomplish improved classification performance than using entire features set.

R. Patil (B) National Institute of Electronics & Information Technology, Government of India, Aurangabad, India e-mail: [email protected] S. C. Tamane Jawaharlal Nehru Engineering College, Aurangabad, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_2


2.1 Introduction 2.1.1 Literature Survey A swarm is a flock of similar agents having cooperation between them locally. They do not have a central control inside their environment, which gives rise to develop a global motivating conduct. Swarm-based processes have come into view newly as a family of nature-stimulated inhabitants-created algorithms (NIA) which provided fast, less expensive and strong results to numerous multifaceted problems [1]. Social behaviour of swarms in nature such as groups of ants, bees, fish and flock of birds is used as a swarm intelligence to develop a mathematical model. Although these particles (creatures or swarm entities) are comparatively inexperienced with partial proficiencies on their own, they are cooperated by convinced behavioural forms to supportively reach tasks essential for their existence. In the past, several classification models were developed which uses PSO. Sousa et al. (2004) developed a first application of PSO, which uses rule-based technique. In this experiment the author developed a sequential covering method, where the unification of features is done to form rules [2]. Sensitivity and specificity were computed and their product was used to measure the quality of rules like AntMiner. The stopping criteria were considered as 30, that is, the quality of a rule is constant for 30 iterations. After meeting with the stopping criteria, the solution was stored in the list of rules. The next step was pruning to delete the features which do not affect the performance. The final default rule guesses the remaining majority class. In 2008, Holden and Freitas used ACO in combination with PSO to develop a rule-based classification model. This model dealt with nominal as well as continuous variables efficiently [3]. While developing PSO nominal variables were encoded as binary variables, whereas in ACO implementations these variables involve discretization. This technique was the refinement of the first version which was proposed by the same authors, Holden and Freitas (2005, 2007). PSO usage has been in practice to train both ANN and SVM models. PSO was used to train ANN by Kennedy and Eberhart (1995) [4]. In 2009, Samanta and Nataraj developed a PSO-based feature selection model [5]. Shih-Wei Lin et al. developed SVM + PSO for classification of datasets downloaded from UCI. They have compared the classification accuracy of the model developed with grid search methods [6]. Experimental results showed greater results of classification accuracy rates obtained using developed approach than those of grid search (2016). This paper has been drafted in the following way: Sect. 2.2 presents the approaches used for feature selection. Section 2.3 includes particle swarm optimization. Section 2.4 gives the architecture of proposed feature selection algorithm. In Sect. 2.5, experimental results are analysed. In this section results attained by the proposed method are compared with the results got by other methods. Lastly, final interpretations are prepared in Sect. 2.6.


2.2 Feature Selection Approaches

In machine learning, feature selection is used when dealing with a high-dimensional feature space. Its sole objective is to simplify a dataset by reducing its dimensionality and finding the significant principal features without losing predictive accuracy. The complete search space comprises all possible subsets of features; its size is 2^n, where n denotes the number of features. Based on the evaluation technique, feature selection processes are divided into three groups.

• Filter approach: It uses statistical characteristics of the data to evaluate the fitness value, and the feature selection search does not depend on any classification process. These approaches are computationally less expensive and more general than wrapper approaches [7].
• Wrapper approach: It includes a classification process as part of the evaluation module to assess the goodness of the selected attribute subset (a minimal sketch of this idea is given below). Wrappers are preferred over filter approaches in terms of classification performance.
• Hybrid approach: In this approach a learning algorithm and a feature selection method are interleaved [8].

These approaches are categorized into forward selection, backward elimination, forward stepwise selection, backward stepwise elimination and random mutation. Other methods are centred on genetic processes [9], simulated annealing, ant colony optimization and particle swarm optimization.
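The following is a minimal illustration of the wrapper idea (not code from the paper): the quality of a candidate feature subset is measured by training a classifier on only those columns and taking its cross-validated accuracy. The choice of a k-nearest-neighbour classifier and five folds is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_score(X, y, mask):
    """mask: boolean vector over columns; returns the mean 5-fold CV accuracy of the subset."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():                      # an empty subset carries no information
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()
```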

2.3 Particle Swarm Optimization (PSO)

PSO is an evolutionary computing (EC) technique proposed by Kennedy and Eberhart in 1995. In PSO, a population, also known as a swarm of candidate solutions, is encoded as particles in the search space. It is inspired by the social and cooperative behaviour exhibited by several kinds of species to fulfil their needs in the search space. Each particle has a position and a corresponding fitness value evaluated by the fitness function to be optimized. The particles move from one position to another according to their most recent velocity vector. This velocity vector is determined by the particle's personal experience and also considers the experience of other particles through the best positions met so far. Further, the experiences are weighted by two factors C1 and C2 and two random numbers R1 and R2 produced in the range [0, 1].

Algorithm 1: Basic flow of PSO
(1) Set an arbitrary position for each particle and the initial velocity of the swarm in each dimension of the solution space.
(2) Set the PSO parameters.
(3) Set the iteration count K = 1.
(4) Evaluate the fitness function to optimize the position of every particle of the swarm.
(5) If the current location of an individual particle is better than its historically best one so far, Pi, then update that particle's best position.
(6) Identify and update the swarm's globally best particle, having the swarm's best fitness value; update its index as g and its position as Pg.
(7) Use Eq. (2.1) to update the velocities of all the particles.
(8) Move each particle to its new position using Eq. (2.2).
(9) Iterate steps 4–8 until convergence or a stopping condition is encountered (e.g. the maximum number of permitted iterations is reached, an adequately good fitness value is reached, or the process has not improved its performance for a number of successive iterations).
(10) Return the global best fitness value achieved by any particle of the whole swarm.

Vid(k + 1) = w · Vid(k) + C1 · R1 · (Pid(k) − Xid(k)) + C2 · R2 · (Pgd(k) − Xid(k))    (2.1)

Xid(k + 1) = Xid(k) + Vid(k + 1)    (2.2)

where
• w denotes the inertia weight.
• Vid is the rate of change (velocity) of the position of the ith particle in the dth dimension, and k signifies the iteration count.
• Xid denotes the position of the ith particle in the dth dimension. Xi refers to the ith particle itself, as a vector of its locations in all dimensions of the problem space.
• Pid denotes the historically best location of the ith particle in the dth dimension.
• Pgd denotes the location of the swarm's global best particle (Xg) in the dth dimension.
• R1 and R2 are two n-dimensional vectors with random values uniformly selected in the range [0.0, 1.0]. These two vectors provide useful randomness for the exploration strategy.
• Positive constants C1 and C2 are weighting parameters, also called the cognitive and social parameters, respectively. They control the relative importance of a particle's private experience versus its social experience. To improve the search process, the weighting parameters usually fall in the range [0, 4] with C1 + C2 = 4. The C1 and C2 values can affect the search ability of PSO remarkably by biasing the new position of Xi towards its historically best location or the globally best position.

The PSO search mechanism in a multidimensional search space is shown in Fig. 2.1.
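The update rules above translate almost directly into code. The following is a minimal, self-contained sketch of the basic PSO loop of Algorithm 1 and Eqs. (2.1)–(2.2) for minimising an arbitrary fitness function; the parameter values and the clipping of positions are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def pso(fitness, dim, n_particles=30, iters=100, w=0.7, c1=2.0, c2=2.0,
        lo=-1.0, hi=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))       # positions X_id
    v = np.zeros((n_particles, dim))                  # velocities V_id
    pbest = x.copy()                                  # personal best positions P_id
    pbest_val = np.array([fitness(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()              # global best position P_gd
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))           # R1
        r2 = rng.random((n_particles, dim))           # R2
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)   # Eq. (2.1)
        x = np.clip(x + v, lo, hi)                               # Eq. (2.2)
        val = np.array([fitness(p) for p in x])
        better = val < pbest_val
        pbest[better], pbest_val[better] = x[better], val[better]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()
```

For example, pso(lambda p: float(np.sum(p**2)), dim=5) drives the swarm towards the zero vector.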


Fig. 2.1 Multidimensional search space using PSO

2.4 Proposed Feature Selection Algorithm

PSO is chosen for feature selection in this study, as it has an inherent potential to adjust to a varying environment: beyond finding optima in static surroundings, it can further track them in dynamic surroundings. It is also characterized by its fast convergence behaviour. A comprehensive flow diagram of PSO-ANN is given in Fig. 2.2.
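The flow of Fig. 2.2 is not reproduced here, but the general idea of driving feature selection with PSO can be sketched as follows (an illustrative adaptation, not the authors' implementation): each particle is a bit vector marking which features are kept, velocities are mapped to bit probabilities with a sigmoid transfer function, and a particle's fitness is the classification score of its feature subset, for example the wrapper_score function shown earlier.

```python
import numpy as np

def binary_pso_select(fitness, n_features, n_particles=20, iters=50,
                      w=0.9, c1=2.0, c2=2.0, seed=0):
    """fitness(bits) -> score to maximise; bits is a 0/1 vector over features."""
    rng = np.random.default_rng(seed)
    x = (rng.random((n_particles, n_features)) < 0.5).astype(float)
    v = np.zeros((n_particles, n_features))
    pbest = x.copy()
    pbest_val = np.array([fitness(p) for p in x])
    g = pbest[pbest_val.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(v.shape), rng.random(v.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)     # Eq. (2.1) applied to bits
        prob = 1.0 / (1.0 + np.exp(-v))                           # sigmoid transfer
        x = (rng.random(v.shape) < prob).astype(float)            # re-sample bit vectors
        val = np.array([fitness(p) for p in x])
        better = val > pbest_val
        pbest[better], pbest_val[better] = x[better], val[better]
        g = pbest[pbest_val.argmax()].copy()
    return g.astype(bool), pbest_val.max()
```

Called, for instance, as binary_pso_select(lambda bits: wrapper_score(X, y, bits), X.shape[1]), it returns a mask of selected features together with its score.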

2.5 Experimental Results

The experiment is carried out in MATLAB on the PIMA dataset from the UCI repository. To assess the network performance, the mean squared error (MSE) and the correct classification percentage are used. In this experiment, 70% of the instances in the dataset are used as the training set, 15% as the validation set and the remaining 15% as the test set. In the first phase of the experiment, all attributes are used for computing the network performance: without using PSO for feature selection, the Levenberg-Marquardt (LM) algorithm is used for training on all the features of the training dataset, and the performance of the network is assessed by computing the MSE and the correct classification percentage. The average MSE obtained is 0.151650 and the correct classification percentage is 77.343750%.


Fig. 2.2 Training process of PSO-ANN


The best validation performance is attained at epoch 3 which is 0.18192. The performance metrics achieved for training on all the features using LM algorithm is shown in Fig. 2.3. To validate the results attained, the factors like receiver operating characteristic (ROC) curve and confusion matrix are taken into consideration. ROC helps in visualizing the performance of model. In this graph X-axis shows false-positive rate and Y-axis shows true positive rate. Figure 2.4 indicates the ROC using all the features and with selected features using PSO. Training confusion matrix, validation confusion matrix, test confusion matrix and all confusion matrix with all the features taken into consideration and using only those features selected by PSO are shown in Fig. 2.5. In the next stage, PSO is applied for the feature selection on the dataset. The PSO parameters are adjusted as follows: Swarm size: 10–100 Maximum generations: 20–500

Fig. 2.3 Comparison of performance (MSE) of training/validation/testing set

Fig. 2.4 Comparison of ROC with the set of all features and with reduced set of features


Fig. 2.5 Comparison of confusion matrix with using all features and with reduced set of features using PSO

Acceleration constant parameters (C1 and C2 ): 2–2.05 Inertial weight: 0.9–0.4 Initial velocity: 10% of position Table 2.1 shows the number of features selected and the corresponding MSE and correct classification percentage. It is observed that the performance is good when the number of features chosen using PSO is 4.

Table 2.1 MSE and correct classification (%) values obtained with the number of features selected using PSO

No. of features selected using PSO | MSE based on Levenberg-Marquardt algorithm learning (average) | Correct classification (%)
3 | 0.137848 | 77.2656
4 | 0.1344 | 79.56
5 | 0.135264 | 77.8906
6 | 0.13698 | 77.73
7 | 0.1342 | 78.32


Fig. 2.6 Best cost versus iteration

The following graph in Fig. 2.6 depicts that the best cost reduces with increase in iteration of PSO. The best validation performance is 0.18192 without reducing the features. After reducing the number of features to four using PSO, the best validation performance is improved to 0.1302 at epoch 11 as shown in Fig. 2.3. The classification rate obtained using all the features is compared with the classification rate on reduced features set after applying the PSO for feature reduction. PSO has improved the classification rate from 77.3 to 80.3% for all set. The ROC curve is plotted with all the features and using only the features selected using PSO is shown in Fig. 2.4. After the comparison of performances, we observed that the results were favourable on the selected features obtained using PSO.

2.6 Conclusion

The objective of this work is to decrease the number of attributes of the dataset without affecting the classification accuracy. The particle swarm optimization algorithm is selected for the feature reduction phase. Removing irrelevant data and retrieving the required information from medical data would help medical practitioners in several ways. It has been noticed that the performance of the neural network is not degraded even after the number of attributes is reduced. The experiments demonstrate that the PSO algorithm improves feature selection and increases the overall network performance. When PSO is used for feature selection, the classification percentage rises from 77 to 80%, and the number of selected features is reduced to half. A GA-RBF NN was implemented in the past, which uses GA for feature selection with a radial basis function neural network and provided 77.39% accuracy [10]. The comparative analysis of available classification algorithms provided in the earlier work [11] is compared with the PSO-ANN-based feature selection; this comparison shows an improvement with PSO-ANN up to 80% from


78%. The prediction accuracy could be further improved using hybrid PSO with rough set theory and fuzzy logic, which may produce better results. In future work, the experiments will be carried out on a real-time dataset under the supervision of a concerned general practitioner.

References 1. Aghdam, M.H., Heidari, S.: Feature selection using particle swarm optimization in text categorization. JAISCR 5(4), 231–238 (2015). https://doi.org/10.1515/jaiscr-2015-003. Research Polish Neural Network Society and the University of Social Sciences 2. Sousa, T., Silva, A., Neves, A.: Particle swarm based data mining algorithms for classification tasks. Parallel Computing 30(5), 767–783 (2004). https://doi.org/10.1016/j.parco.2003.12.015. Elsevier 3. Holden, N., Freitas, A.: A hybrid PSO/ACO algorithm for discovering classification rules in data mining. J. Artif. Evol. Appl. (2008). http://dx.doi.org/10.1155/2008/316145. Hindawi Publishing Corporation Volume 4. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of IEEE International Conference on Neural Networks, Piscataway, NJ, pp. 1942–1948 (1995) 5. Samanta, B., Nataraj, C.: Application of particle swarm optimization and proximal support vector machines for fault detection. Swarm Intell. 3(4), 303–325 (2009). https://doi.org/10. 1007/s11721-009-0028-6 6. Lin, S.-W., Ying, K.C.: Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert. Syst. Appl. 35(4), 1817–1824 (2008). https://doi. org/10.1016/j.eswa.2007.08.088. ScienceDirect, Elsevier 7. Indriyani, W.G., Rakhmadi, A.: Filter-wrapper approach to feature selection using PSO-GA for Arabic document classification with naive bayes multinomial. IOSR J. Comput. Eng. (IOSRJCE) 17(6), Ver. VI, 45–51 (2015). e-ISSN: 2278-0661, p-ISSN: 2278-8727 8. Menghour, K., Souici-Meslati, L.: Hybrid ACO-PSO based approaches for feature. Sel. Int. J. Intell. Eng. Syst. 9(3), 65–79 (2016). https://doi.org/10.22266/ijies2016.0930.07 9. Patil, R., Tamne, S.: Upgrading the performance of KNN and naïve bayes in diabetes detection with genetic algorithm for feature selection. Int. J. Sci. Res. Comput. Sci. 3(1), 1971–1981 (2018). ISSN: 2456-3307. Engineering and Information Technology IJSRCSEIT 10. Choubey, D.K., Paul, S.: GA_RBF NN: a classification system for diabetes. Int. J. Biomed. Eng. Technol. 23(1), 71–93 (2017). https://doi.org/10.1504/IJBET.2017.10003045 11. Patil, R., Tamne, S.: A comparative analysis on the evaluation of classification algorithms in the prediction of diabetes 8(5), 3966–3975 (2018). ISSN: 2088-8708. https://doi.org/10.11591/ ijece.v8i5.pp3966-3975

Chapter 3

Round Robin Scheduling Based on Remaining Time and Median (RR_RT&M) for Cloud Computing Mayuree Runsungnoen and Tanapat Anusas-amornkul

Abstract Cloud computing is a system that is flexible to adjust infrastructure as needed with low costs. Quality of service is a challenge for a cloud service provider. A scheduling algorithm has a direct impact on the quality of service in cloud computing. Therefore, this work focuses on studying scheduling algorithms for cloud environments and proposes a new algorithm based on a round robin algorithm called round robin based on remaining time and median (RR_RT&M). It is compared with other algorithms, such as first-come, first-served (FCFS), and smarter round robin (SRR) algorithms. The performance metrics are makespan, execution time, and waiting time. The experiments were conducted in CloudSim simulator, and the results showed that RR_RT&M performed the best for all metrics and the percentage of improvement for makespan was between 16 and 72%. For execution time and waiting time, the improvement percentages were 31–73 and 0–73%, respectively.

3.1 Introduction Today, information technology (IT) plays a major role in doing a business over the internet. It is a tool to streamline communications, facilitate strategic thinking, store and safeguard valuable information, and cut costs and eliminate waste [1]. In order to use IT efficiently, businesses bring cloud computing to support their infrastructure. The basic concepts for cloud computing are to distribute processing power or resources to computers in a network without locating in the same location [2], and to work with virtualization technology, which does not limit the performances of individual computers [3]. A cloud user is able to change a number of computers or computer hardware as needed without considering the real hardware or infrastructure. The user only pays for the services that he needs, called pay-as-you-go. In addition, the user can use the services anytime anywhere using a computer or a smart phone that can connect to the internet. Cloud service models are categorized into three services, that are M. Runsungnoen · T. Anusas-amornkul (B) King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand e-mail: [email protected]; [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_3


software-as-a-service (SaaS), platform-as-a-service (PaaS), and infrastructure-as-aservice (IaaS) [4, 5]. Cloud service providers focus on quality of service, resource usage maximization, and load balance because cloud users can request to process data in the cloud all the time [2]. Scheduling algorithms have an impact on cloud performance [6, 7]. The challenge of scheduling algorithms is to reduce waiting time, response time, makespan, and load balance [3, 5, 8]. Examples of heuristic algorithms to solve scheduling problems are first-come, first-served (FCFS), shortest-job-first (SJF), and round robin (RR). However, each algorithm is suitable for different scenarios. A research problem is to study scheduling algorithms and a new scheduling algorithm for cloud computing is proposed. In this paper, five scheduling algorithms, that are FCFS, RR, round robin based on the average burst time of tasks (RR_ABT), smarter round robin (SRR), and modified median round robin algorithm (MMRRA) are studied. Moreover, round robin based on remaining time and median (RR_RT&M) algorithm is proposed to enhance the performance of cloud computing. The organization of this paper is as follows. The following section presents literature review from researchers in the related topic. In Sect. 3.3, scheduling algorithms are described in detail along with our proposed algorithm. Then, the experimental setup, performance metrics, and the results are described and discussed. In the last section, a summary of the work and a future work are presented.

3.2 Related Works In [5], an improved round robin CPU scheduling algorithm with varying time quantum was proposed to minimize waiting time, turnaround time, response time, context switching and maximize CPU utilization. The shortest job first and round robin scheduling with varying time quantum was used to improve the traditional round robin algorithm. Results showed that the waiting time and turnaround time was reduced. In [9], Mahmoud et al. proposed multi-criteria strategy for job scheduling and resource load balancing in cloud computing environment that focused on job scheduling and resource load balancing. Three stages were proposed using three scheduling techniques, that are min–min, max–min and suffrage with genetic algorithm. The results showed that the proposed work enhanced the performance of jobs scheduling and resource load balancing. In [10], Hitcham and Chaker proposed round robin based on average burst time of tasks (RR_ABT) comparing with FCFS and original RR. Later, the authors proposed another scheduling algorithm called smarter round robin (SRR) [11] by adjusting time quantum depending on number of tasks in a waiting list. The results showed that SRR performed the best when compared to RR_ABT, SJF, FCFS, and RR in terms of average turnaround time, average waiting time, number of context switching, and average response time.


In [12], Mora et al. proposed modified median round robin algorithm (MMRRA) for CPU scheduling. The idea was to calculate the proposed modified median and use it as time quantum to the RR algorithm. The results showed that MMRRA performed best among other algorithms, such as classical RR, IMRRSJF, and IRR. From all related works, many scheduling algorithms were proposed, especially for cloud computing environments to improve the cloud performance.

3.3 Scheduling Algorithms A scheduling algorithm is a process to assign tasks to be processed with the objective to use resource efficiently by reducing average response time (ART) and average turnaround time (ATAT) [12]. In CloudSim, the first step is to initialize a CloudSim package. The next steps are to create a data center, brokers, virtual machines (VMs), and cloudlets. Then, a broker communicates with VMs and scheduling algorithms [13]. In this work, a scheduling algorithm called round robin based on remaining time and median (RR_RT&M) is proposed and compared with five other algorithms, that are FCFS, RR, RR_ABT, SRR, and MMRRA. Some scheduling algorithms were proposed for cloud environments and some were proposed for CPUs but they were interesting to study and compare in this work. The details of these algorithms are as follows: 1. First-Come First-Served (FCFS) is the simplest scheduling algorithm that orders tasks depending on the time of arrival [4]. 2. Round Robin (RR) is similar to FCFS but it divides tasks into time quantum in order to process each task equally [14]. 3. Round Robin based on Average Burst Time of Tasks (RR_ABT) is a modified version of RR by calculating time quantum based on ABT of tasks instead of fixed time quantum [10]. 4. Smarter Round Robin (SRR) uses a concept of RR but it dynamically changes the time quantum depending on number of tasks in a queue [11]. If the number is less than or equal to 3, SRR uses the SJF algorithm to allocate the time. However, if the number is greater than 3 and the number is an even number, SRR uses ABT. Otherwise, SRR uses median of task burst time instead. 5. Modified Median Round Robin Algorithm (MMRRA) is another RR but it allocates time quantum dynamically by calculating a modified median of tasks [12]. The time quantum is calculated as shown in Eq. (3.1). Time quantum =



√(median × highest burst time)    (3.1)

6. Round Robin based on Remaining Time and Median (RR_RT&M) is proposed to change the time quantum based on remaining time and median of tasks. If the number of tasks is less than 3, the time quantum is the longest remaining


First, all processes are sorted in ascending order based on the burst time of tasks (the time needed to execute each task) and sent to a VM.
nt = number of tasks
Second, if the number of tasks in a waiting queue is less than or equal to three:
while (waiting queue) {
  IF nt

➀ When OutDegree > 0 and InDegree = 0, the institution is of the technology-output type; ➁ When OutDegree = 0 and InDegree > 0, the institution is of the technology-input type; ➂ When OutDegree > 0 and InDegree > 0, the institution is of the technology-comprehensive type
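As an illustrative sketch only (not the paper's code), the classification rule above can be applied mechanically once the institution-level citation network is held as a directed graph; the use of the networkx library and the pair-list input format are assumptions.

```python
import networkx as nx

def classify_institutions(edges):
    """edges: (source institution, target institution) pairs of the citation network,
    oriented however the network was built; each node is labelled by the rule above."""
    g = nx.DiGraph()
    g.add_edges_from(edges)
    labels = {}
    for node in g.nodes:
        out_d, in_d = g.out_degree(node), g.in_degree(node)
        if out_d > 0 and in_d == 0:
            labels[node] = "technology-output"
        elif out_d == 0 and in_d > 0:
            labels[node] = "technology-input"
        else:
            labels[node] = "technology-comprehensive"
    return labels
```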

Table 5.1 Degree centrality of patentees in technical diffusion network (columns: patent assignee code, enterprise name, OutDegree). The assignees listed include the codes NUAN-Non-standard, IBMC, MICT, MATU, AMTT and CANO, the enterprise names International Business Machines Corporation (IBM), Microsoft Corporation, AT&T Inc. and Panasonic, and OutDegree values ranging from 406 to 1365.


Fig. 5.2 Patent citation network

Table 5.2 TDB value of institution (part)

Patent assignee code | TDB | Patent assignee code | TDB | Patent assignee code | TDB | Patent assignee code | TDB
IBMC | 94 | PHIG | 71 | MOTI | 58 | NELE | 49
MATU | 86 | FUIT | 67 | DRAG-Nonstandard | 56 | XERO | 46
NIDE | 84 | NITE | 66 | LUCE | 56 | INTT | 44
SONY | 81 | HITA | 63 | RICO | 54 | TELF | 44
MICT | 80 | MITQ | 62 | OYNO | 51 | ITLC | 42
TOKE | 79 | SHAF | 60 | SMSU | 51 | NPDE | 42
AMTT | 76 | SIEI | 60 | APPY | 50 | XERF | 42
CANO | 75 | TEXI | 60 | GOOG | 49 |  | 

to TDI value. When a patent is cited by the same patentee multiple times, citation amount would be increased, thus affecting the TDI of the institution.

5.4 Conclusion and Limitation

At the level of patent institutions, this study takes patent data from the AI field as an example to build a patent citation network between patenting institutions and citing institutions, and performs visualization analysis and indicator analysis to find the technology diffusion rules. The following rules can be found through


Table 5.3 TDI value of institution (part)

Patent assignee code | TDI | Patent assignee code | TDI | Patent assignee code | TDI | Patent assignee code | TDI
BBNB-Nonstandard | 38 | COPQ | 23 | LEAI | 17 | SOFT-Nonstandard | 15
ACCE-Nonstandard | 33 | RUTF | 20 | TOSH-Nonstandard | 17 | SYLV | 15
SRIN | 32 | CSEL | 19 | ATRO-Nonstandard | 16 | MINA | 15
ALLV-Nonstandard | 28 | STRD | 19 | USGO | 16 | DIGI | 14
FORU-Nonstandard | 27 | BELL-Nonstandard | 19 | SUNQ-Nonstandard | 16 | VCSI-Nonstandard | 14
CACT | 26 | TREN-Nonstandard | 18 | BZMR | 16 | ENVI-Nonstandard | 14
SUME | 25 | HTCM | 18 | ROCW | 16 | AMER-Nonstandard | 14
SCAT | 23 | IVTC | 17 | BURS | 15 |  | 

citation patent analysis of 8654 records. In the matrix diagram constructed from patent quantity and patent quality, the technical strength of the institutions varies greatly; most institutions are still at a lower stage and need to further improve their technical strength. In the technology diffusion network, there are three types of institutions according to the citation relationship: technology-output, technology-input and technology-comprehensive. However, it is found that the two indicators TDB and TDI of most institutions are relatively low. Although some conclusions and insights are obtained in this paper, there are still some problems and deficiencies in the research process. The purpose, function, motivation and behavior of citation are complex, so there may be unreasonable and irregular citations among patents. Patent citation is also only a part of technology diffusion; patent transfer and patent authorization can represent technology diffusion as well. In a follow-up study, these two problems should be taken into consideration: one is to eliminate unreasonable and irregular citation patents; the second is the comprehensive consideration of patent citation, patent transfer, patent authorization and other technology diffusion behaviors. In this way, the technology diffusion network can reflect the technology diffusion process more accurately and comprehensively, and the technology diffusion indicators can truly reflect the contribution of an institution to the technology.



Chapter 6

Impact on the Information Security Management Due to the Use of Social Networks in a Public Organization in Ecuador Segundo Moisés Toapanta Toapanta , Félix Gustavo Mendoza Quimi , Leslie Melanie Romero Lambogglia and Luis Enrique Mafla Gallegos Abstract It analyzed the management for the dissemination of government information in social networks and security management models applied to communicate with society. The problem is the lack of a model that integrates a security scheme to expose information from a government organization on social networks. The objective is to define a generic algorithm prototype for security policies application. It used the deductive method and exploratory research to analyze the information of the reference articles. It turned out the following: Guidelines for defining information security policies, General scheme of information security policies, Mathematical description to measure impact, and Generic algorithm prototype for the security policies application using flowchart techniques. It was concluded that for the exposure and socialization of information from a government organization to society, it is recommended to have alternatives that allow mitigating the risks, threats, and vulnerabilities of information such as integrated security models, security policies, responsible, technologies, and adequate cryptographic algorithms to improve the confidentiality, integrity, and authenticity of the information based on its mission, vision, strategic objectives of each government organization, and legal provisions such as the LOTAIP 2017 among others.

S. M. T. Toapanta (B) · F. G. M. Quimi · L. M. R. Lambogglia Department Computer Science, Universidad Politécnica Salesiana Del Ecuador, Chamber, 227 Y 5 de Junio, Guayaquil, Ecuador e-mail: [email protected] F. G. M. Quimi e-mail: [email protected] L. M. R. Lambogglia e-mail: [email protected] L. E. M. Gallegos Department Computer Science, Escuela Politécnica Nacional, Ladrón de Guevara E11-253, Quito, Ecuador e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_6


6.1 Introduction Information and Communication Technologies (ICT) are important in the strategy of public communication, governance, democracy, citizen participation, and electronic government [1]; Web 2.0 serves to innovate in this interactive communication; public or private companies use Social Networks (SN) to reach more people with more products or services; within the evolution of ICTs, SNs are the strongest means to distribute information to local or foreign citizens with diversity of comments, audio, video, and interactions; In these networks, community activities are measured for descriptive statistics. Governments are authorized to deliver their information through SNs; also, the impacts that information has on society are social, political, cultural, legal, economic, corporate, academic, health, and security; in these spaces people express their emotions, attitudes, and desires, be they positive or negative. The analysis of SN provides research for its success, data mining, web mining, use analysis, opinion analysis, geographic information, social content sharing, analysis of key terms, and product surveys. In addition, it facilitates the entry of innovative ideas to the government or private entity. Various proposals for use given to NS are emergency dissemination models, distributed networks to ensure information, analysis of crimes through algorithms [2], prediction of links [3], probability of communications [4], and analysis applied to the health area [5]. Government agencies and officials use the SN as communication way for post of authorized information, image of the government, public administration, and attention and supervision of citizens in public services [6]. Today, government organizations have their own web pages and accounts in SN to disclose their activities, government processes, or other information of national interest; this helps the citizen to be closer to the government, this interactive relationship through technologies is known as G2C. When the governments started in SN, the changes made were logical variation of services to citizens and companies, cooperation base, knowledge base, availability, openness and decentralization, and management and control of processes [7]. Everyone knows the relevant characteristics or benefits when government organizations disclose information through SN; What about the Information Security (IS) of government that is exposed in third-party media? In these spaces, the content of the government and citizens is in a repository of an external company. The IS is the composition of systems, operations, and internal controls to guarantee the Confidentiality, Integrity, and Availability (CIA) of company information [8]; it ensures business continuity and minimizes the loss and impact of incidents on business experience. Within government organizations, information management is restricted and structured by professionals; some of this information is classified to be confidential or public by some means. There may be processes, techniques, or technologies to ensure the sending of information from databases to the public environment. It is important that ICTs are considered at a strategic level, where technological transformation provides a more adequate, simple, and quick response service


to the citizen; Corporate-level strategies must be aligned with the company mission, vision, and objectives; also consider structures, processes, and mechanisms of research related to ICT, COBIT, ITIL, and ISO where the metrics for governance, services, and IS for a government or private company is defined. The COBIT standards recommend that the Chief Executor Officer (CEO) and Chief Information Officer (CIO) be at the same level with different responsibilities; the CIO of a government organization must be aligned with general guidelines at the country level and also have guidance on public service, guide ICT to the problems of the organization, technological assessment, administrative capacity, and the facility to communicate information to the citizen. In Ecuador, the Organic Structure of the Executive Function of the Presidency is organized as follows: national secretariats, ministries, secretariats, research institutes, promotion institutes, control agencies, companies, technical secretariats, public banks, councils, and services among others [9]. The profitability of the government sector isn’t economic, but social; the objectives of each government organization are different, this research aims to establish some measure of IS that is sent to social networks. Why is it necessary to define an IS mechanism when posted in SN of a government organization in Ecuador? To identify management tasks, define methods of classification, and publication in SN, increase the security level and monitor of posted information. The objective is to define a generic algorithm prototype for security policies applied to increase the level of IS due to the use of SN in a government organization in Ecuador. The articles reviewed and related to IS, government organizations, information schemes, security policies, and SN are Facebook and Public Health: A Study to Understand Facebook Post Performance with Organizations Strategy [10], Online Social Networks: Threats and Solutions [11], A Generic Framework for Information Security Policy Development [12], eSecurity Development Lifecycle in the Context of Accreditation Policies and Standards [13], Information Security Policies: A Review of Challenges and Influencing Factors [14], Security Policy Alignment: A Formal Approach [15], Notes on the Evolution of Computer Security Policy in the US Government, 1965–2003 [16], Regulatory Framework Creation Analysis to Reduce Security Risks The Use of Social Media in Companies [17]. It is used the deductive method and exploratory research to analyze the information of the reference articles. The results are the following: Guidelines for defining IS policies, General scheme of IS policies, Mathematical description to measure impact, and Generic algorithm prototype for the security policies application using flowcharts techniques. It is concluded that for the exposure and socialization of information from a government organization to society, it is recommended to have alternatives that allow mitigating the risks, threats, and vulnerabilities of information such as integrated security models, security policies, responsible, technologies, and adequate cryptographic algorithms to improve the CIA of the information based on its mission,


vision, strategic objectives of each government organization, and legal provisions such as the LOTAIP 2017 among others.

6.2 Materials and Methods

In Materials, we first review works on the administration of IS in SN, the procedures with which companies must filter their information toward SN, and the security processes used to deliver information. In Methods, tools are proposed to generate an IS model for the use of SN, namely access statistics of the Ecuadorian society to the Internet, the number of SN used by government organizations, a shared government information scheme, phases of IS, and a model of IS elements.

6.2.1 Materials The authors analyzed conceptual frameworks, corporate communication strategies, and creation of collective value in communities in SN; reviewed and presented a shared health information scheme; in the tests, they examined public health care information on Facebook; they determined that visual material such as photos and videos are more effective elements to communicate [10]. The authors presented classic, modern, or combined threats of security and privacy of SN users; described solutions of SN operators, commercial companies, and academic researchers; the work was descriptive, comparative on the protection capacity of solutions; generated recommendations to protect information in SN [11]. The authors proposed a general framework of IS policies which is developed in phases, processes, and tasks; this study was carried out in 30 universities, they reviewed the development of the policies and content; they used a code process to simplify the information; the case study was given to three universities with different populations and categories; they consider a process for the implementation of security policies [12]. The authors conducted a descriptive analysis of IS models, they reviewed the phases, components, policies, procedures, comparisons of phases; also the difference between information security and information assurance; they emphasized the strengths of the models; they proposed the need for following general models: information security, risk management, detection and analysis of threats to information assets, business operations, and plan management; these models for both the governmental and private sector [13]. The authors reviewed several IS policies to determine the challenges these policies may have; they also determined the organizational factors and human factors that influence the behavior of users; they concluded with the need for to train and educate of employees in these policies and dynamic monitor of the level of compliance [14]. The authors formalized the alignment of security policies for socio-technical systems to find weaknesses in the systems; they integrated local and global policies to relate security models with security policies; the research review was carried out


from approaches such as alignment of policies, refinement of policies, consistency and integrity, and systems models [15]. The author described the computer history in his country, the congress highlighted three topics, economic acquisition, costly assets, and information security; with emphasis on this last issue, the government and its dependent offices have made efforts in guidelines, procedures, or policies on information security; its conclusion of operations in the virtual space affect the real world [16]. The authors reviewed the SN risks that are used by companies, propose a regulatory framework to mitigate risk through the use of SN; they concluded in the socialization of regulations for the exposure of business data at all institutional levels; in order to improve user behavior, they encourage to train and policy communications [17].

6.2.2 Methods

According to the Organic Structure of the Executive Function of the Presidency, as of 2017 there were 130 current entities [9]; each government organization has accounts in at least three social networks (Twitter, Facebook and YouTube) to deliver information to the citizen, and their web pages follow standardized formats, so there are at least 390 accounts on social networks used as communication channels. Other government organizations that receive part of the Ecuadorian state resources, such as governorships, prefectures, municipalities, parish councils, regional councils and provincial councils, were not considered. According to the National Institute of Statistics and Census (INEC) of Ecuador, in 2017, 58.30% of the population had Internet communication [18]; Fig. 6.1 details the accesses since 1995. According to the World Bank, in 2017, 57.27% of the population had Internet communication [19]; Fig. 6.2 details the accesses since 1995. According to

Fig. 6.1 INEC, Ecuadorian society access to the Internet


Fig. 6.2 World Bank, Ecuadorian society access to the Internet

the Telecommunications Regulation and Control Agency (Arcotel), it was informed that by 2017, 63.10% of the population had Internet communication [20], Fig. 6.3 describes the accesses since 2010. This increase has led to government companies to provide their services and information through the SN. In addition, until December 2017, the percentage of people who used the Internet according to ages are from 5 to 15 years old is 50.4%; from 16 to 24 years old is 85.2%; from 25 to 34 years old is 73.9%; from 35 to 44 years old is 59.6%; from 45 to 54 years old is 44.0%; from 55 to 64 years old is 27.2%; 65 years old or older is 7.8%. 74.7% of Ecuadorians used the Internet for less than once a day [18]. It was deduced that more than 59% of Ecuadorians between 16 and 44 years old accessed and continuously used the Internet to obtain information, general communication, education, learning, and others.

Fig. 6.3 Arcotel, Ecuadorian society access to the Internet


Fig. 6.4 Average access of the Ecuadorian society to the Internet

In Fig. 6.1, based on the information obtained from the INEC, it was shown how the Internet from 1995 to 2015 in the Ecuadorian society has increased the number of users. For this analysis, data from the INEC and the World Bank have annual Internet access statistics extracted from the references, but the information was shown in blocks of 5 years and 2017; from which the information was taken for the analysis of the sample. From the analysis of Fig. 6.1, it was deduced that in the phase from 2010 to 2015, the increase of Internet users in Ecuadorian society was three times more than in the blocks of previous years. In Fig. 6.2, the information issued by the World Bank regarding Internet communication by Ecuadorian society showed aggressive growth from 2010 to 2015. In Fig. 6.3, the information provided by Arcotel is until 2017 and showed growth in greater proportion from the year 2015 to 2017. Figure 6.4 presents the average access applied to the three previous institutions, since 2010; it was obtained that in 2017, 59.56% of the Ecuadorian society communicated with the Internet and its services. Accord to the INEC in 2017, 31.90% of the Ecuadorian population used SN from their smartphone [18]. With these data, • Users number on SN for information published by organizations of the executive function, • Important percentage of Internet access by Ecuadorians, and • Important percentage of use of SN by Ecuadorians, the importance and motivation to propose an IS measure by the use of SN in a government organization was determined. Figure 6.5 describes a general scheme of government information; the data of government organization is shared to society through the SN. Another tool that was proposed for an IS model is the application of three phases


Fig. 6.5 Scheme of shared government information

Evaluation phase: this phase is in accordance with the government organization strategies; the database information is organized with the following tasks: classified, ordered, located, marked, and valued; the valuation is fiscal, administrative, legal, informative, or historical. The entity authority issues its approval for diffusion, after prior consultation with the information collectors.

Development phase: this phase performs the normalization of the content, with the following tasks: identification, context, content, structure, access conditions and uses, notes, and descriptions control; it then proceeds to the diffusion of the information.

Diffusion phase: this phase involves selective emission, periodic emission, or informative emission, with its respective monitoring of the information posted in the SN.

The IS has many elements; one of them is the use of SN to disseminate information through the Internet, which is a communication channel. It was proposed to use a scheme as a security mechanism when disseminating information. Figure 6.6 describes, in a model, the elements that are considered in the IS.

6.3 Results

In this phase, the following results were obtained:

• Guidelines for defining IS policies.
• General scheme of IS policies.
• Mathematical description to measure impact.
• Generic algorithm prototype for the application of the security policies using flowchart techniques.


Fig. 6.6 Information security element model

1. Guidelines for defining IS policies.

• The information posted in SN must agree with the mission, vision, and objectives of the government organization.
• Plan time for continuous updates and expansions.
• Reasonable participation in SN to increase citizen participation.
• The scheme must be dynamic and continuously improved in order to keep up with SN changes.
• Improve the IS management processes in the government organization.
• Generate diffusion and monitoring strategies in SN.

2. General scheme of IS policies. Table 6.1 describes the processes of a security scheme divided into phases. The processes aim to condense large quantities of information into less content to post in SN.

3. Mathematical description to measure impact. Among the various indicators to measure the impact of posted information, we can mention: engagement, new users, comments, shares, and positive responses. The following general formula to measure engagement was proposed, that is, to get an idea of the impact of the organization's actions or how interested citizens are in the information offered in SN.

Table 6.1 Processes of a security scheme

Phase       | Main process                      | Subprocess
Evaluation  | Retrieve information              | Insert in the database; Update in the database; Recover from the database
Evaluation  | Organize information              | Classification; Ordination; Location; Establish assessment
Evaluation  | Authorization                     | Request authorization; Request publication
Development | Normalize content                 | Identification; Context; Content and structure; Condition of access and uses; Notes and descriptions control
Diffusion   | Information format                | Present the authorized data
Diffusion   | Selective diffusion               | Deliver on a regular basis
Diffusion   | Periodic diffusion                | Successive emissions
Diffusion   | Information and reference service | Provide advice
Diffusion   | Monitor                           | Evaluate comments; Review indicators; Follow-up of activities

Engagement = (number of positive responses + number of comments + number of shares) / (number of followers or contacts) × 100    (6.1)

Most SN provide these general numbers, others generate their own indicators, and there are tools in the market to generate other types of indicators for specific SN.

4. Generic algorithm prototype for the application of the security policies using flowchart techniques. Figure 6.7 describes the steps for the application of the scheme in which the three phases were integrated: evaluation, development, and diffusion.


Fig. 6.7 Steps for application of the security scheme

In this case, the flowchart techniques were applied to present the description of a prototype algorithm. The algorithm steps are detailed below (a minimal illustrative sketch of this flow, together with the engagement metric of Eq. (6.1), follows the list):

• Recover information: an operation in the database that verifies the existence and selection of the data.
• Organize information: one or several operations on the obtained data.
• Request authorization and publication: the communicator of the government organization must be aligned with and respect the Communication Law of Ecuador for the diffusion of the content.
• Normalize content: application of formats or standards to the information content.
• Information: considers that the information is ready for diffusion.
• Diffusion: the selection of the diffusion type will be according to the scope.
• Monitor: follow and measure the impact caused by the posted information.
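The paper presents this algorithm only as a flowchart; the following Python sketch is an assumed, simplified rendering of the three phases and of the engagement formula of Eq. (6.1), not the authors' implementation. All function and field names are hypothetical stand-ins for the tasks named above.

```python
# Minimal sketch (assumed, not the authors' implementation) of the evaluation-development-
# diffusion flow of Fig. 6.7, with the monitor step computing engagement as in Eq. (6.1).
from dataclasses import dataclass

@dataclass
class Post:
    content: str
    authorized: bool = False
    normalized: bool = False

def evaluation_phase(post: Post) -> Post:
    # classify, order, locate, mark and value the information, then request authorization
    post.authorized = True            # stands in for the approval issued by the entity authority
    return post

def development_phase(post: Post) -> Post:
    # normalize the content: identification, context, structure, access conditions, notes
    post.normalized = True
    return post

def diffusion_phase(post: Post) -> None:
    # selective, periodic or informative emission of the authorized, normalized content
    if post.authorized and post.normalized:
        print("publishing to SN:", post.content)

def engagement(positive_responses: int, comments: int, shares: int, followers: int) -> float:
    # Eq. (6.1): impact of a posted item relative to the size of the audience
    return (positive_responses + comments + shares) / followers * 100

diffusion_phase(development_phase(evaluation_phase(Post("public budget report"))))
print(f"engagement = {engagement(120, 35, 60, 5000):.2f}%")   # hypothetical monitoring figures
```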


6.4 Discussion

This document presented the Internet use of Ecuadorian society and a descriptive analysis of IS works on exposing data in the SN of a government or private organization.

• In Ecuador, government companies operate in the following sectors: electricity, health, sports, labor, education, inclusion, defense, industries, agriculture, economy, transport, commerce, and tourism, where the risks and vulnerabilities of the information are different for each sector.
• The results of this research agreed with the references on the following topics: the use of a shared information scheme in a SN [10], the adoption of security measures with phases, processes and subprocesses [12], the components of IS [13], and the application of guidelines to disseminate company data [17].
• The proposed security model is applicable to countries with a culture similar to Ecuador's.
• The main task is to make employees aware of the importance of applying security mechanisms when posting information about the organization in SN.

6.5 Future Work and Conclusion

In order to mitigate the risks, vulnerabilities, and threats in the management of public information in SN in the short term, integrated security models should be generated as an alternative and cryptographic algorithms should be used to ensure that information is managed with Identity, Authenticity, Authorization, and Audit (IAAA).

1. It was concluded that, for the exposure and socialization of information from a government organization to society, it is recommended to have alternatives that allow mitigating the risks, threats, and vulnerabilities of information, such as integrated security models, security policies, responsible parties, technologies, and adequate cryptographic algorithms, to improve the CIA of the information based on the mission, vision, and strategic objectives of each government organization and legal provisions such as the LOTAIP 2017, among others.
2. The model aims to help with the classification and exploitation of the government organization's information when it is posted in SN; for this, a generic algorithm that applies a security scheme was developed.
3. The model increases the level of security of the posted information because the human factor is vulnerable in the treatment of information.

Acknowledgements The authors thank Universidad Politécnica Salesiana del Ecuador, the research group of the Guayaquil Headquarters “Computing, Security and Information Technology for a Globalized World” (CSITGW) created according to resolution 142-06-2017-07-19, and Secretaría de Educación Superior Ciencia, Tecnología e Innovación (Senescyt).


References 1. Rahmanto, A., Dirgatama, C.: The implementation of e-government through social media use in local government of Solo Raya. In: 2018 International Conference on Information and Communications Technology (ICOIACT), January 2018, vol. 83, pp. 765–768 (2018). https:// doi.org/10.1109/icoiact.2018.8350763 2. Davidekova, M., Gregus, M.: Social network types: an emergency social network approach—a concept of possible inclusion of emergency posts in social networks through an API. In: 2017 IEEE International Conference on Cognitive Computing (ICCC), 40–47 (2017). https://doi. org/10.1109/ieee.iccc.2017.13 3. Ghodpage, N., Mante, R.: Privacy preserving and information sharing in decentralized online social network. In: 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pp. 152–155 (2018). https://doi.org/10.1109/icicct. 2018.8473268 4. Alzahrani, T., Horadam, K.: Analysis of two crime-related networks derived from bi-partite social networks. In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pp. 890–897 (2014). https://doi.org/10.1109/ asonam.2014.6921691 5. Jain, V., Sakhuja, S.: Structural investigation of a healthcare value chain: a social network analysis approach. In: 2014 IEEE International Conference on Industrial Engineering and Engineering Management, January 2015, pp. 179–183 (2014). https://doi.org/10.1109/ieem. 2014.7058624 6. Rong, Y., Xu E.: Strategies for the management of the government affairs microblogs in China based on the SNA of fifty government affairs microblogs in Beijing. In: 2017 International Conference on Service Systems and Service Management (2017). https://doi.org/10.1109/icsssm. 2017.7996282 7. Al-Wahaibi, H., Al-Mukhaini, E., Al-Badi, A., Ali, S.: A case study of the employment of social media in government agencies in Oman. In: 2015 IEEE 8th GCC Conference and Exhibition, pp. 1–4 (2015). https://doi.org/10.1109/ieeegcc.2015.7060089 8. Sari, P., Prasetio, A.: Knowledge sharing and electronic word of mouth to promote information security awareness in social network site. In: 2017 International Workshop on Big Data and Information Security (IWBIS), pp. 113–117 (2017). https://doi.org/10.1109/iwbis.2017. 8275111 9. Secretaria Nacional de Planificacion y Desarrollo. Quito: Plan Nacional Buen Vivir 2013–2017 (2013) 10. Straton, N., Vatrapu, R., Rao Mukkamala, R.: Facebook and public health: a study to understand facebook post performance with organizations’ Strategy. In: 2017 IEEE International Conference on Big Data (BIGDATA) Facebook C(60) (2017). https://doi.org/10.1109/bigdata. 2017.8258288 11. Fire, M., Goldschmidt, R., Elovici, Y.: Online social networks: threats and solutions. IEEE Commun. Surv. Tutor. 16(4), 2019–2036 (2014). https://doi.org/10.1109/comst.2014.2321628 12. Ismail, W., Widyarto, S., Ahmad, R., Ghani, K.: A generic framework for information security policy development. In: 2017 4th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), vol. 4, pp. 324–329 (2017). https://doi.org/10.1109/ eecsi.2017.8239132 13. Kalaimannan, E., Gupta, J.: The security development lifecycle in the context of accreditation policies and standards. IEEE Secur. Priv. 15(1), 52–57 (2017). https://doi.org/10.1109/msp. 2017.14 14. Alotaibi, M., Furnell, S., Clarke, N.: Information security policies: a review of challenges and influencing factors. 
In: 2016 11th International Conference for Internet Technology and Secured Transactions (ICITST), pp. 352–358 (2016). https://doi.org/10.1109/icitst.2016.7856729 15. Pieters, W., Dimkov, T., Pavlovic, D.: Security policy alignment: a formal approach. IEEE Syst. J. 7(2), 275–287 (2013). https://doi.org/10.1109/jsyst.2012.2221933


16. Warner, M.: Notes on the evolution of computer security policy in the US government, 1965– 2003. IEEE Ann. Hist. Comput. 37(2), 8–18 (2015). https://doi.org/10.1109/mahc.2015.25 17. Prayitno, O., Cizela da Costa Tavares, O., Damaini, A., Setyohadi, D.: Regulatory framework creation analysis to reduce security risks the use of social media in companies. In: 2017 4th International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE), pp. 235–238 (2017). https://doi.org/10.1109/icitacee.2017.8257709 18. Tecnologías de la Información y Comunicación Contenido. INEC (2017) 19. Personas que usan Internet (% de la población)|Data. Datosbancomundialorg. https://datos. bancomundial.org/indica-dor/IT.NET.USER.ZS?end=2017&locations=EC&start=1995. Accessed 5 Jan 2019 20. Ecuador, A.: Agencia de Regulación y Control de las Telecomunicaciones|Ecuador Servicio de acceso a internet (SAI). Agencia de Regulación y Control de las Tele-comunicaciones|Ecuador (2019). http://www.arcotel.gob.ec/servicio-acceso-internet/. Accessed 5 Jan 2019

Chapter 7

Appropriate Security Protocols to Mitigate the Risks in Electronic Money Management Segundo Moisés Toapanta Toapanta, María Elissa Coronel Zamora and Luis Enrique Mafla Gallegos

Abstract This paper discusses, in a general way, the protocols used by retailers to provide security when making online purchases. The problem addressed is the risks that exist at the time of using electronic money. The aim is to analyze communication protocols that can be used, with the benefits they present in security and implementation costs, to provide robustness against failures and third-party attacks and to mitigate vulnerabilities that can expose sensitive information of both the internal client and the external user. The deductive method and exploratory research were used in order to analyze the information in the articles to which this document refers. It turned out that the 3D Secure protocol, being a mixture of various security protocols and leveraging the benefits each offers, was the one that allowed electronic money to be managed in a safer manner for both the user and the merchant. It was concluded that, on several occasions, companies opt to bet on lower prices while leaving the security of the information in the background and, as a result of this, a number of computer frauds occur.

7.1 Introduction

The Internet is the means most often chosen by companies to publicize their products, and users can also make bank transfers or shop quickly without having to leave the comfort of their homes [6].

S. M. T. Toapanta (B) · M. E. C. Zamora Computer Science Department, Universidad Politécnica Salesiana Ecuador, Robles 107, Chambers, Guayaquil, Guayas 042590630, Ecuador e-mail: [email protected] M. E. C. Zamora e-mail: [email protected] L. E. M. Gallegos Faculty of Engineering Systems, Escuela Politécnica Nacional, Ladrón de Guevara E11-253, Quito, Pichincha 022976300, Ecuador e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_7


All this is because the Internet has positioned itself as a powerful means of distribution, in addition to its constant development, which allows better communication and interrelation between customers and suppliers. The number of transactions carried out by means of a virtual platform, and the money handled in them, grows every day, and e-commerce becomes more vulnerable to computer attacks; for this reason, the safety that must exist in the platforms is not only fundamental but must also make the proposal functional for both parties [5]. The aim is to promote the use of electronic money as a driving mechanism for the inclusion of marginalized sectors into the financial system.

Four actors are involved in electronic payments during a card-not-present transaction. The customer (also called the cardholder) browses the web site of the merchant, called the service provider (or SP), to buy an online service. These two actors each have a payment provider, called, respectively, the issuing bank and the acquiring bank. However, in the majority of online payment schemes, other actors are involved. They are usually used as a trusted third party, with several roles. For example, it can be an interoperability layer, as in 3D Secure, or an identity provider operated by the banks themselves, as in the BankID system; an authentication system for payment providers is also required to prevent fraud [6]. There are various proposals to make payments by electronic media secure: credit card payments, the proposed use of electronic money, and the proposed implementation of micropayments or wallet cards [8, 9].

What are the protocols used to mitigate the risks in the management of electronic money?

• SSL (Secure Sockets Layer)
• SET (Secure Electronic Transaction)
• 3D Secure
• 3D Secure V2

The objective is to analyze each of the security protocols and, according to their pros and cons, to define which is best. The articles analyzed for this document are: Mastercard and Visa make major push with biometric cards; Monedas Virtuales Se Suman Al Comercio Electronico; A scheme to improve security of SSL [4]; El Comercio Electronico [5]; El dinero electronico en el Ecuador; Electronic payment system design based on SET and TTP; A comparative study of card-not-present e-commerce architectures with card schemes: What about privacy? [6]; Supported Credit Card Types 3D Secure Process for Payment Pages 2.0 [7]; Factores que afectan la confianza de los consumidores por las compras a través de medios electrónicos [10]; Trusting SSL in practice [1]; Electronic banking and payments [9]; Punto de vista de CA [3]; Thales e-Security launches 3d-Security Module to secure Visa and Mastercard Internet transactions [2]; Visa puts its faith in wallets [12]; E-commerce transaction security model based on cloud computing; Sistema de dinero electrónico, un medio de pago al alcance de todos [11]; Electronic payment system design based on SET and TTP [13].


The deductive method is used to analyze the information of the articles related to this investigation. It is concluded that it is necessary to use security protocols to guarantee the integrity and security of electronic transactions. They each have their advantages and disadvantages, and although they provide a certain degree of security and confidentiality of information, none of them is 100% secure; despite this, 3D Secure V2 provides a high level of security in addition to easy interaction for both the user and the business.

7.2 Materials and Methods

7.2.1 Materials

The information of the articles was used to analyze the existing security protocols and the current use of electronic commerce. In this phase, the information of each of the articles referred to in this document is used, the basic structure of the e-commerce platforms was analyzed [11], and the security problems were presented [10]. We also analyzed some basic concepts about the protocols, their forms of encryption, and their security processes for a better understanding of the subject [9].

TPV: Point of Sale Terminal.
XML: Extensible Markup Language, a simple and very flexible text format used to exchange a wide variety of data on the Web and elsewhere.
ACS: Access Control Server.
MD (Merchant Data): the data of the merchant that the 3D Secure ACS will return at the end of the cardholder authentication process.
MPI (Merchant Plug-in): software designed to facilitate 3D Secure checks and prevent credit card fraud.

SSL (Secure Sockets Layer): protocol designed to allow applications to transmit round-trip information using encryption keys. SSL is not a robust secure protocol in the real world. The first reason is that SSL does not follow the strict PKI trust model; the PKI principle is very complicated and requires a strict trust model to ensure security. However, the cost of maintaining a strict trust model is too high to be accepted by the public [4] (Fig. 7.1). It is practical and easy to implement (a minimal connection example is sketched after this list), but despite its simplicity, it leaves aside some important aspects such as

1. The protocol works correctly only if there are just two parties involved.
2. There are no validations of customer recognition.
3. The electronic stores have no way of knowing if the card that is entered actually has funds to carry out the transaction [1].
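As a simple illustration of the round-trip encryption that SSL/TLS provides (not taken from the paper; the host name is hypothetical and any modern Python 3 installation should suffice), a client can open a protected channel and inspect the negotiated protocol and the server certificate:

```python
# Minimal sketch, assuming a reachable HTTPS host: open a TLS-protected socket,
# verify the server certificate, and report the negotiated protocol version.
import socket
import ssl

host = "shop.example.com"                      # hypothetical merchant site
context = ssl.create_default_context()         # verifies the certificate chain and host name

with socket.create_connection((host, 443), timeout=5) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
        print("negotiated protocol:", tls_sock.version())     # e.g. 'TLSv1.3'
        print("server certificate subject:", tls_sock.getpeercert().get("subject"))
```

Note that, as the limitations above point out, such a channel only protects the transport between two parties; it says nothing about whether the cardholder is legitimate or whether the card has funds.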


Fig. 7.1 Secure sockets layer protocol (SSL)

SET (Secure Electronic Transaction): Protocol that offers data packages for all transactions and each transaction is authenticated with a digital signature. By using symmetric key, digital signature, Hash technology, and digital envelope, SET can guarantee confidentiality, authentication, data integrity, and non-repudiation of the transaction. But there are still some deficiencies in this protocol (Fig. 7.2).

Fig. 7.2 Secure electronic transaction protocol (SET)


With SET, the customer information travels to an e-commerce server, but it is not the platform that performs the payment validations; instead, the card number travels directly to the vendor's bank, where the information of the user's bank account can be known, and this verification is done in real time [13]. SET takes into account some aspects that SSL leaves aside, such as:

• It gives security not only to the buyer but also to the seller that the transactions carried out are guaranteed with the bank.
• We know that with SET the transactions are reliable; it is complicated for the information to be manipulated by a third party [10].

3D Secure V1: form of payment developed by Visa and MasterCard that enables secure purchases on the Internet and authenticates the buyer as the legitimate owner of the card he is using. 3D Secure is supported by the SSL protocol (Africa et al. 2018) [7]. It was called 3D Secure since three domains act in it:

– Issuer (the bank of the cardholder)
– Acquirer (the merchant or acquirer of the transaction)
– Interoperability (the brand) (Fig. 7.3).

3D Secure V1 asks buyers for a PIN that has been previously validated with the bank that issued the credit card. When validating this key and knowing that the client has credit available to carry out the transaction in question, the bank allows the purchase [10]. The integrity of the information is guaranteed between the e-commerce platform, the issuing bank of the card, the receiving bank, and the seller. Implementing this protocol

Fig. 7.3 3D Secure protocol


is simple: it is not necessary to purchase any additional software or modify existing software, just install a plugin in the e-commerce server and acquire a certificate to be identified by the financial entity [7].

3D Secure V2: it provides additional data that improves the ability of risk-based authentication solutions to identify the behavior of devices and cardholders [3]. It was launched in response to the security problems presented by the previous protocols and seeks to facilitate “frictionless shopping” that improves the speed of transactions, in addition to replacing the old system of passwords with tokens or biometrics. The new version of 3DS differs from the previous version in some aspects, such as:

• It improves the client's authentication system.
• It is not necessary to use static passwords.
• Secure payments with 3DS V2 can be made on mobile devices, browser-based solutions, and other devices connected to the consumer.
• 3D Secure 2.0 allows mobile payment methods, in-application and in the digital wallet.
• Greater benefits in the prevention of fraud.
• It eliminates the initial registration process.
• There are no pop-up windows to test for security vulnerabilities.

The protocol has been accepted satisfactorily to date. The authorization of a transaction using 3D Secure V2 costs 0.25 cents per transaction, and the issuer of the card is responsible for paying this value. Businesses benefit since the problem of shopping cart abandonment was greatly reduced, and for the consumer it is an easy method.

7.2.2 Methods

In this research, the deductive method was used in order to know the security protocols currently used and how risks have been mitigated at the time of using electronic money:

• First, the general concept of each of the protocols
• Second, the encryption algorithms that were used
• Third, the cost of implementing each of them
• Fourth, the simplicity of implementation on the e-commerce platforms.


7.3 Results

As a result of this research, the advantages and disadvantages of the security protocols when making an electronic purchase were analyzed, along with the payment methods most used and their acceptance by users (Table 7.1). A comparison was made of the security, reliability, and integrity attributes that the protocols should provide, and it was possible to deduce that 3D Secure provides multiple benefits (Table 7.2).

In the second instance, it was observed that the 3D Secure protocol, by integrating SSL, SET, and an authentication plugin inside it, provides all the benefits of the previous protocols; in this way, the transactions are simpler and safer and do not represent an additional cost for their implementation, hence businesses, buyers, and banking entities have been able to trust that the majority of transactions will be completed successfully.

Figure 7.4 indicates that, when the purchase is confirmed and the credit card information is entered, this information travels to the merchant's website, which in turn communicates with the issuing bank for the authentication process (3D Secure or 3D Secure V2); if the validated information is correct, it is forwarded to the payment gateway, which later returns the payment result to the website (SET). All the pages redirected in this process carry an SSL certificate, generating a purchase process that is much more secure due to the combination of these three protocols. A simplified sketch of this flow is given below.
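The following Python sketch is an assumed, highly simplified model of the flow just described, not a real payment integration: every name (functions, card number, one-time code) is hypothetical, and in practice each exchange runs over TLS-protected connections between the merchant, the issuer's ACS, and the payment gateway.

```python
# Simplified sketch of the Fig. 7.4 purchase flow: merchant site -> issuer authentication
# (3D Secure / 3D Secure V2) -> payment gateway (SET-style exchange). Assumed, toy logic only.
def authenticate_with_issuer(card_number: str, one_time_code: str) -> bool:
    # stands in for the issuing bank's ACS validating the cardholder
    return one_time_code == "123456"            # hypothetical one-time code check

def process_with_gateway(card_number: str, amount: float) -> str:
    # stands in for the exchange with the acquiring bank / payment gateway
    return "approved" if 0 < amount <= 5000 else "declined"

def checkout(card_number: str, one_time_code: str, amount: float) -> str:
    if not authenticate_with_issuer(card_number, one_time_code):
        return "authentication failed"          # purchase stops before reaching the gateway
    return process_with_gateway(card_number, amount)

print(checkout("4111111111111111", "123456", 25.0))   # -> "approved" in this toy example
```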

Table 7.1 Comparison between the different forms of payment within the e-commerce

Payment method         | Acceptance of the client
Credit card/debit card | Highly accepted
Third-party payment    | Average acceptance
Electronic money       | Low acceptance
Dual signature         | Low acceptance

Table 7.2 Comparison of security protocols

Attribute                                                      | SSL | SET | 3D Secure
Confidentiality                                                | YES | YES | YES
Integrity                                                      | YES | YES | YES
Authentication to credit cards                                 | YES | YES | YES
Bank authentication                                            | X   | YES | YES
Authentication to the shops                                    | YES | YES | YES
Verification that the customer is able to use the credit card  | X   | X   | YES


Fig. 7.4 Flowchart 3D Secure


7.4 Discussion

From the analysis of the protocols and the results obtained from this research, the following points are proposed for discussion:

• The 3D Secure protocol, in its latest version, was considered the most viable alternative for security since its combined features and its implementation of token-based authentication offer payment security, data backup during the transaction, and a better experience for the user.
• The robustness of the security process depended heavily on the protocol used.
• It was observed that, although there is not yet a security protocol that guarantees that all transactions will be secure, the 3D Secure versions are very close to this concept, and it is expected that the security breaches that still exist can be resolved in future versions.

7.5 Future Works and Conclusion

7.5.1 Future Works

As future work, it is proposed to study in-depth and extend the analysis to new security protocols applied to electronic transactions.

7.5.2 Conclusions

In order to determine the protocol that provides a greater degree of security in online purchases and to mitigate the risks in the use of electronic money, it is recommended to adopt 3D Secure in its latest version. From the research carried out, we can conclude the following:

1. The 3DS protocol in its latest version is presented as the most secure protocol.
2. The concepts of security and privacy do not usually go hand in hand with the protocols studied.
3. Future versions of 3DS could solve the faults that still exist in the security of electronic commerce.
4. With the improvement of the protocol, all the parties involved in the purchase process have benefits:
   – Consumers have a better shopping experience
   – There were fewer cases of abandonment of the purchase
   – Banking entities allow a greater number of transactions.


Acknowledgements The authors thank Universidad Politécnica Salesiana del Ecuador, the research group of the Guayaquil Headquarters “Computing, Security and Information Technology for a Globalized World” (CSITGW) created according to resolution 142-06-2017-07-19, and Secretaría de Educación Superior Ciencia, Tecnología e Innovación (Senescyt).

References 1. Ali Aydın, S.: Trusting SSL in practice. In: Proceedings of the 6th International Conference on Security of Information and Networks (2013). https://doi.org/10.1145/2523514.2523594. Accessed 7 Jan 2019 2. Anon.: Thales e-security: Thales e-security launches 3d-security module to secure Visa and Mastercard internet transactions; 3d-security module provides cost-effective hardware for verified by Visa and Mastercard SecureCode (2003) 3. Anon: Punto de vista de CA (n.d.) 4. Huawei, Z., Ruixia, L.: A scheme to improve security of SSL—IEEE Conference Publication (2009). https://doi.org/10.1109/PACCS.2009.148. Accessed 7 Jan 2019 5. Lossa, M.: El Comercio Electronico (2013). Mendeley.com. https://www.mendeley.com/ catalogue/el-comercio-electronico/. Accessed 7 Jan 2019 6. Plateaux, A., Lacharme, P., Vernois, S., Coquet, V., Rosenberger, C.: A comparative study of card-not-present e-commerce architectures with card schemes: what about privacy? (2018). https://doi.org/10.1016/j.jisa.2018.01.007. Accessed 7 Jan 2019 7. Release, C., Support, Z., Secure, T., Standard, I.: Supported credit card types 3d secure process for payment (2019) 8. Schlager, C., Nowey, T., Montenegro, J.: A reference model for authentication and authorisation infrastructures respecting privacy and flexibility in b2c eCommerce—IEEE Conference Publication (2006). https://doi.org/10.1109/ARES.2006.13. Accessed 7 Jan 2019 9. Skipper, J.: Electronic banking and payments-IEE Colloquium eCommerce—Trading but not as we know it! (1998). https://doi.org/10.1049/ic:19980779. Accessed 7 Jan 2019 10. Sánchez, J., Montoya, L.: Factores que afectan la confianza de los consumidores por las compras a través de medios electrónicos. Pensamiento y gestión (2016). https://doi.org/10.14482/pege. 40.8809. Accessed 7 Jan 2019 11. Valencia, F.: Sistema de dinero electrónico, un medio de pago al alcance de todos. Boletín Cemla (2015) https://doi.org/10.8. Accessed 7 Jan 2019 12. Visa puts its faith in wallets (2002). https://doi.org/10.1016/S0965-2590(01)00219-5. Accessed 7 Jan 2019 13. Yong, X., Jindi, L.: Electronic Payment System Design Based on SET and TTP—IEEE Conference Publication (2010). https://doi.org/10.1109/ICEE.2010.77. Accessed 7 Jan 2019

Chapter 8

Acceptance and Readiness of Thai Farmers Toward Digital Technology Suwanna Sayruamyat

and Winai Nadee

Abstract Recently, mobile applications have been continuously released in the market to solve consumer problems. In agribusiness and farming, the Ministry of Agriculture and Cooperative proposes the AgriMap mobile application (AMMA), which provides integrated information to help farmers in deciding which plant they should grow in certain areas. This paper conducted a random survey with 727 farm households in Ang Thong province in May 2018 within three districts (Phothong, Wisetchaichan and Samko). This study analyzes the digital literacy of the farmers via skills, understanding and usage of social media applications including AMMA. The technology acceptance model (TAM) was applied to capture the attitude and perception of the farmers. The influence of society and behavioral intention, which contributes to the acceptance of digital technology, on Thai farmers was also included. This study found that the majority of farmers have smart phones, but less than 30% use social media and only 10% of the farmers know AMMA. Although the farmers show a positive attitude and perception toward using digital technology, the adoption rate of social media and other applications is still significantly low. This comparative analysis indicates the conflict between acceptance and readiness aspects in Thai farmers. Therefore, other supportive approaches should be involved to promote the usage of digital technology among Thai farmers.

8.1 Introduction

In 2017, the Thai government initiated a policy to transform the country toward the digital era. This policy extended the Industry 4.0 concept and was called Thailand 4.0. It has


been cascaded through all government service units and has stimulated the awareness of Thai society. To steer forward this flagship 4.0, one interpretation is to develop applications or platforms (web-based or mobile-based). This also raised the expectation of creating big data for the purposes of analytics and forecasting. Turning to the Thai farmers' aspect, the transformation has been slowly progressing due to certain limitations. One reason is that the Thai farmers are aging people (over 50 years old) who might not consider digital transformation as necessary. On the technological aspect, telecommunication infrastructure in Thailand is striving toward the digital age, as the 4G mobile network has been 100% implemented in the metropolitan area and is being expanded aggressively to remote areas. Considerably, the availability of technology might influence those aging farmers to transform their behavior, that is, throw away their analog or 2G mobile phone and shift to using a smart phone instead. However, there is no guarantee of this change. Turning to the aspect of smart phone consumption in Thailand, it has changed constantly due to the convenience of owning a smart phone (on average ≥ 1 smart phone per person in Bangkok) and high-speed Internet access via the 4G network. People's behavior change can be clearly observed by comparing the number of audiences of TV shows with the number of views of shows on YouTube. This example emphasizes the growing interaction between humans and technology. Radically, consumers have recognized their right to choose not to receive information in which they are not interested. People will then just swipe through to the next news feed or press the next button to view the next video [4]. This phenomenon has drawn attention from both individuals and businesses to apply and use digital technology. One of the main purposes of individuals is, of course, self-entertainment, while an obvious purpose for businesses is to engage customers. In the digital aspect, the core element is the "ecosystem", which most individuals and businesses use and participate in for their particular purposes. An instance of an ecosystem might, in other words, be called a "platform". For smart phones, there are two major platforms that play dominant roles in the market: "Android" from Google Inc. and "iOS" from Apple Inc. In these platforms, the application (or "app") is the most crucial element to serve consumer demands [1, 5]. Given the popularity of smart phones and the use of apps in daily life, most government organizations (e.g., the Ministry of Agriculture) view this as an opportunity to achieve their mission, one part of which is to support Thai farmers. There are applications in the market such as "Know Land", "Fertilizing Calculator" and the "Agri-Map Mobile Application" (AMMA). In this paper, AMMA is mainly focused on. AMMA is an agricultural map application that offers integrated data from various sources, that is, the department of land development, the department of agricultural support, and so on. For example, this app synthesizes soil information and area-based weather, with constant updates and an analytical interface. The government unit which offers this app expects Thai farmers to use the app features in making more accurate decisions on plant planning. However, AMMA has yet to be properly introduced to the farmers. Owing to the competitiveness in the smart phone ecosystems, consumers have been bombarded with newly introduced


apps every day. It has become more difficult for users to choose an app to serve their needs. To obtain user awareness and acceptance for an app, there are tools, tasks and activities to be implemented. For example, the customer acquisition and activation processes in the AARRR matrix [9] will be useful in helping to promote AMMA to Thai farmers or to predict the usage intention of AMMA.

8.2 Research Model

In exploring this phenomenon, the technology acceptance model (TAM) [2, 10] was applied to evaluate the intention to use AMMA in connection with the attitudes and perception of Thai farmers. Subjective norms were also included in the model because Thai society is considered collectivist. According to TAM, perception was measured in the aspects of usefulness, ease of use and result demonstrability. The structural model (adapted from TAM) is shown in Fig. 8.1. The model also illustrates the relationships between constructs and can be presented as the following list of hypotheses.

– H1: Attitudes toward using AMMA have influence on intention to use AMMA.
– H2: Subjective norms have influence on intention to use AMMA.
– H3: Perceived usefulness has influence on intention to use AMMA.
– H4: Perceived ease of use has influence on perceived usefulness of AMMA.
– H5: Result demonstrability of AMMA has influence on perceived usefulness of AMMA.
– H6: Attitudes toward using AMMA mediate the relationship between perceived usefulness of AMMA and intention to use AMMA.
– H7: Subjective norms mediate the relationship between perceived usefulness of AMMA and intention to use AMMA.

Fig. 8.1 Structural model demonstrates intention to use of AMMA for farm management

8.3 Method

This research collected data using a survey approach with a random sampling technique in 2018. A total of 727 farming families were visited in three districts of Ang Thong province (Phothong: 469, Wisetchaichan: 148, Samko: 110), of which 64.6% of the respondents were female; 33.9% of the respondents were between 51 and 60 years old and 30.9% were between 61 and 70. Interestingly, 56.0% of them have grade 4 as their highest level of education.

8.3.1 Questionnaire Instrument

The questionnaire consists of four parts. Part one includes demographic questions; part two includes questions assessing digital skills and technology possession, that is, smart phone, tablet or laptop. Digital skills include awareness, perception and use of popular social media apps, that is, Facebook, Line, Twitter, Instagram and WhatsApp, categorized into three levels (aware and use, aware but never use, and not aware). Part three assesses the acceptance of AMMA using TAM question items: two items assess intention to use AMMA, three items assess perceived usefulness, three items assess perceived ease of use, three items evaluate attitudes toward using AMMA, three items assess subjective norms and two items obtain result demonstrability. All items in part three were obtained using a 5-point Likert scale. Part four assesses awareness and use of 10 chosen agriculture apps: (1) Project plants, (2) Digital farmer, (3) Know land, (4) Farmer info, (5) Fertilizer usage calculator, (6) Thai farmer, (7) OAE AgriInfo, (8) Rice production technology, (9) WMSC and (10) Go forward by the division of academic farming, categorized into three levels (aware and use, aware but never use, and not aware).

8.3.2 Data Analysis This paper uses descriptive statistics (in the form of percentage, frequency and average) to summarize the trend of using AMMA. The analysis also includes the influence of psychological factors on intention to use AMMA, that is, attitudes of farmers toward using AMMA, perception of usefulness and ease of use of AMMA, including result demonstrability according to support farm management aspect. Those factors were tested and analyzed by using confirmatory factor analysis (CFA). The purpose


of applying CFA in this analysis is to evaluate the relationships between the observed variables and also the mediating effects. Then, the identified relationships will be used to create the structural model according to the structural equation modeling (SEM) technique (see the path diagram in Fig. 8.1). The model includes six TAM constructs: intention to use, attitudes, subjective norms, perceived usefulness, perceived ease of use and result demonstrability of AMMA. Both the CFA and SEM analyses apply maximum likelihood estimation (MLE) as the main estimation technique. Five indexes, that is, Chi-square, goodness of fit index (GFI), Tucker–Lewis index (TLI), comparative fit index (CFI) and root mean square error of approximation (RMSEA), were used to assess the fit between the structural model and the survey data, which indicates the validity of the structural model. Hair [6] suggests thresholds for analyses with a large sample size and between 12 and 30 observed variables. Those thresholds are Chi-square (p < 0.05), GFI, CFI, TLI > 0.95 and RMSEA < 0.05, indicating a good fit model. A minimal sketch of how such a model can be specified is given below.
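The paper does not state which software was used for the CFA/SEM estimation; as an illustration only, the structural part of Fig. 8.1 could be specified in Python with the semopy package roughly as follows. The item names (pu1, att1, ...) and the file name are hypothetical stand-ins for the questionnaire items described in Sect. 8.3.1.

```python
# Illustrative sketch (assumed tooling and item names): TAM-style SEM with maximum
# likelihood estimation, reporting chi-square and fit indexes such as CFI, TLI and RMSEA.
import pandas as pd
import semopy

model_desc = """
# measurement model (hypothetical item names)
PU  =~ pu1 + pu2 + pu3
PEU =~ peu1 + peu2 + peu3
RD  =~ rd1 + rd2
ATT =~ att1 + att2 + att3
SN  =~ sn1 + sn2 + sn3
BI  =~ bi1 + bi2
# structural part following H1-H7 of Fig. 8.1
PU  ~ PEU + RD
ATT ~ PU
SN  ~ PU
BI  ~ ATT + SN + PU
"""

df = pd.read_csv("farmer_survey.csv")   # hypothetical file of 727 Likert-scale responses
model = semopy.Model(model_desc)
model.fit(df)                           # maximum likelihood estimation by default
print(model.inspect())                  # parameter estimates (paths and loadings)
print(semopy.calc_stats(model))         # chi-square, CFI, TLI, RMSEA, GFI, etc.
```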

8.4 Results

The results of the survey indicate the digital literacy and digital technology usage (Line, Facebook, Twitter, etc.) of 727 Thai farmers in Ang Thong province in Thailand. The results show that 64.4% possess a smart phone but only 8.3% possess a tablet, 12.1% own a laptop computer and 18.3% have a desktop computer. The purpose of having a tablet, laptop or desktop computer is mainly to support their children's education. Although more than half of the respondents possess a smart phone, still less than 30% use a social media app. Line is the most used app (26.3%), followed by Facebook (23.8%). Table 8.1 indicates that the number of farmers who are aware of and use a social media app is not different from the number of farmers who are aware of social media but never use the apps. Surprisingly, 50% of the respondents do not know the Line or Facebook apps, which was significantly different at 99% confidence (Pearson's Chi-square [degree of freedom (df) = 2] = 134.5), and the number of samples who possess a smart phone is significantly different from the number who do not possess a smart phone at 99% confidence (Pearson's Chi-square (df = 2) = 129.5).

Table 8.1 Frequency of smart phone possession, awareness and use of Line, Facebook and AMMA applications

Smart phone | Aware and use                        | Aware but never use                   | Not aware                              | Total
No          | Line = 12, Facebook = 10, AMMA = 1   | Line = 50, Facebook = 50, AMMA = 20   | Line = 197, Facebook = 199, AMMA = 238 | 259
Yes         | Line = 180, Facebook = 162, AMMA = 8 | Line = 123, Facebook = 136, AMMA = 47 | Line = 165, Facebook = 170, AMMA = 413 | 468
Total       | Line = 192, Facebook = 172, AMMA = 9 | Line = 173, Facebook = 186, AMMA = 67 | Line = 362, Facebook = 369, AMMA = 651 | 727
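The Pearson chi-square statistics above can, in principle, be reproduced from the contingency counts in Table 8.1. The short check below (an assumed illustration, not the authors' code) runs the test on the Line-app counts split by smart phone possession; the exact figures depend on which rows and columns are crossed, so it is only meant to show the mechanics of the df = 2 tests reported in the text.

```python
# Illustrative Pearson chi-square test on Table 8.1 counts (Line app, by smart phone possession).
from scipy.stats import chi2_contingency

#            aware & use   aware, never use   not aware
observed = [[12,           50,                197],     # does not possess a smart phone
            [180,          123,               165]]     # possesses a smart phone

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p_value:.3g}")   # df = 2, as in the reported tests
```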

For the perception and use of AMMA, the results show that only 76 farmers were aware of AMMA, which is 10.5% of the overall sample. Furthermore, only nine farmers have used AMMA. The Chi-square test indicates that awareness of AMMA does not differ significantly between farmers who possess a smart phone and those who do not (Pearson's Chi-square (df = 2) = 3.07); see Table 8.1.

The evaluation of the theoretical constructs found that attitude toward using AMMA is the only highly influential factor, while the other constructs (perceived usefulness, perceived ease of use, result demonstrability, subjective norms and intention to use) have medium influence (see Table 8.2). When we consider the mode values of the farmers' intention to use AMMA, the value is low (21.05%), while perceived usefulness, result demonstrability and attitudes toward using AMMA show high mode values. This indicates positive perception and attitudes toward using AMMA, which possibly increase the intention to use AMMA, as shown in Table 8.2.

When conducting the factor analysis, a modification index threshold of more than 10 was used as the criterion for model improvement, and a factor loading of more than 0.5 was the criterion for including each theoretical construct. The resulting Chi-square test value is 272.8 with 81 df, which indicates that the model is significantly acceptable with 99% confidence (Chi-square = 272.8, df = 81; p = 0.00). The index values are GFI = 0.957, TLI = 0.98, CFI = 0.987 and RMSEA = 0.056, which are acceptable as recommended by Hair [6]. This confirms that the measurement model from the factor analysis statistically fits the empirical data and is appropriate for further developing the structural model. Next, the SEM analysis indicates that the Chi-square value of the model is 207.65 with 81 degrees of freedom at 99% confidence (Chi-square = 207.65, df = 81; p = 0.00). The index values are GFI = 0.968, TLI = 0.987, CFI = 0.991 and RMSEA = 0.046, which confirm that this structural model for the factors predicting intention to use AMMA statistically fits the empirical data.

For hypothesis testing, the results show that the perceived usefulness of AMMA does not influence the intention to use AMMA (H3: b = −0.046, p = 0.548). This confirms the mediating role of attitudes and subjective norms (H6, H7), which show positive effects on the intention to use AMMA. It is also found that subjective norms have a higher parametric value than attitudes.

Table 8.2 Mean, mode and evaluation of TAM constructs

Construct            | Mode (average %) | Mean (SD)   | Evaluation
PU                   | 4 (34.58)        | 3.31 (1.08) | Medium
PEU                  | 3 (21.90)        | 2.85 (1.08) | Medium
Result demonstration | 4 (26.30)        | 3.00 (1.20) | Medium
Subjective norms     | 2 (21.05)        | 2.78 (1.11) | Medium
Attitudes            | 4 (26.57)        | 3.38 (1.09) | High
Intention to use     | 1 (28.04)        | 2.43 (1.23) | Medium

Fig. 8.2 Standardized regression weight of constructs influencing intention to use AMMA for farm management ( p < 0.01, p > 0.1)

When considering the indirect/mediating effects, perceived usefulness has the highest influence, followed by result demonstrability and perceived ease of use (see Fig. 8.2). In addition, the other apps considered for farm management, that is, Thai farmer (5.7%), WMSC (4.4%) and Go forward by the division of academic farming (3.2%), show lower awareness compared to AMMA. The reason why farmers do not use or are not aware of these apps might be the lower possession of smart phones and tablets. This indicates a gap between farmers and technology access (digital divide) [8]. Moreover, the farmers raised an issue about the difficulty of using AMMA during the demonstration, although they perceive the benefits of AMMA.

8.5 Conclusion

This study investigates the digital skills of Thai farmers, and the results show gaps in the farming industry. To achieve Thailand 4.0, as mentioned in the introduction section, there is a need for an improvement policy to help effectively transform attitudes toward and acceptance of digital technology for the sustainable development of the Thai farming industry. There are recommendations for further study. Social influence seems to be a dominant factor in stimulating intention to use AMMA or other agri-apps. Design theories related to behavior change should be involved to help promote digital transformation in the Thai farming industry [3]. Furthermore, effective marketing


tools/techniques should be included, that is, influencer [7] to foster the adoption of newly introduced technology. Acknowledgements This research has been approved by the Department of Agricultural and Resource Economics—Research Ethics Committee, Faculty of Economics, Kasetsart University to ensure its compliance with ethical standards. The authors would like to thank the Field Internship Team of the Department of Agricultural and Resource Economics, Faculty of Economics, Kasetsart University for supporting data collection for this research.

References 1. Apple Inc.: iTunes charts (2018). https://www.apple.com/itunes/charts/free-apps/. Accessed 5 Nov 2018 2. Davis, F.D., Bagozzi, R.P., Warshaw, P.R.: User acceptance of computer technology: a comparison of two theoretical models. Manag. Sci. 35(8), 982–1003 (1989). https://doi.org/10.1287/ mnsc.35.8.982 3. Eyal, N.: Hooked: How to Build Habit-Forming Products. Penguin, London (2014) 4. Godin, S.: Purple Cow: Transform Your Business by Being Remarkable. Penguin, London (2005) 5. Google Inc.: Top free in android apps (2018). https://play.google.com/store/apps/top?hl=en. Accessed 5 Nov 2018 6. Hair, J.F.: Multivariate Data Analysis. Pearson, Essex (2014) 7. Hennessy, B.: Influencer: Building Your Personal Brand in the Age of Social Media. Citadel Press, New York (2018) 8. Rogers, E.M.: Diffusion of Innovations. Free Press, New York (2003) 9. StartItUp: AARRR (startup metrics) (2018). http://startitup.co/guides/374/aarrr-startupmetrics. Accessed 1 Oct 2018 10. Venkatesh, V., Davis, F.D.: A theoretical extension of the technology acceptance model: four longitudinal field studies. Manag. Sci. 46(2), 186 (2000). https://doi.org/10.1287/mnsc.46.2. 186.11926

Chapter 9

Neural Network Classifier for Diagnosis of Diabetic Retinopathy Gauri Borkhade and Ranjana Raut

Abstract Optical field sensitivity test results are essential for the accurate and efficient diagnosis of blinding diseases. The classification of eye diseases in retinal images is the focus of several studies in the field of medical image processing. Diabetic retinopathy is a disease caused by the disorder of diabetes. The vision of the patient begins to weaken as diabetes progresses and leads to retinopathy; early detection is a must for effective treatment. Multiple detection techniques for eye diseases have been surveyed and play a vital role as screening tools. Anomalies of the retina due to diabetes are detected through numerous techniques. An artificial neural network is proposed in this paper as an optimal binary classifier. The sets of constraints which describe the EEG eye states in the database are covered in this investigation. Indeed, performances are classified as normal and diseased. Artificial neural networks are often used as powerful and intelligent classifiers for the early detection and accurate diagnosis of diseases. Thus, the results conclude that the support vector machine (SVM) model is effective for the classification of eye states, with a total accuracy of 90%.

9.1 Introduction

The human eye is a very sensitive organ; it broadens the vision of individuals and permits them to gain information about the outer world beyond what the other senses provide. If eye disorders are not detected and cured at an early stage, they may cause vital damage and blindness. These days, medical diagnosis error is an important issue for ophthalmologists. Reducing these errors, improving the quality of medical services, and assuring patient safety are serious duties of ophthalmologists. Retinal or fundus photography is used to certify the health of the eye and in the diagnosis of certain eye

G. Borkhade (B) Ramrao Adik Institute of Technology, Navi Mumbai, India e-mail: [email protected]
R. Raut Department of Applied Electronics, SGB Amravati University, Amravati, India e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_9


conditions. Diabetes involves high blood sugar for two reasons: when the patient's body does not produce an ample amount of insulin, or when the body does not respond to the insulin created. In diabetes, retinopathy is a common complication. In retinopathy, small blood vessels are damaged, which results in vision loss. The possibility of retinopathy increases with aging; hence, diabetic retinopathy is more common in middle-aged and older diabetics. Non-proliferative diabetic retinopathy (NPDR) and proliferative diabetic retinopathy (PDR) are the stages of retinopathy. NPDR is an early stage of retinopathy, in which blood leaks from the tiny blood vessels of the retina within the retina. The retina swells due to the fluid leak and exudates form from the deposits. In PDR, the retinal blood vessels are damaged; in response, new fragile blood vessels grow in the retina. But these new blood vessels grow on the surface of the retina and are abnormal, so they are not able to resupply blood to the eyes. Some diseases such as retinopathy, cataract, and glaucoma can be distinguished by using effective segmentation and training of a neural network. Neural network classifiers are widely used for planning automated diagnosis systems. Medical experts work with ease using ANNs, as they have learning properties and are self-organized. A decision support system (DSS) helps consultants in their conclusions. A DSS is software which processes input data under the required conditions and produces conclusions. In medical diagnosis, artificial intelligence has been practically successful. To enrich health care, a DSS provides logically filtered, precise statistics. The safety and quality of medical analysis and treatment can be improved by incorporating clinical guidelines. The ANN model is used as a troubleshooter and tool in various real-world applications, like financial estimation, health diagnostics, and speech recognition. Various medical errors caused by human intervention could be avoided by computer classification. ANNs, modeled after biological neural systems, are intelligence paradigms that are tested machine learning tools. The support vector machine algorithm, which employs gradient-based learning, is used for classification. This NN is used to predict the diagnosis of diabetic retinopathy. Parsaei and Moradi [1] compared the statistics for various eye diseases such as glaucoma, scotoma, homonymous defects, and lesions of the optical nerves with SVM, MLP, PNN, and radial-basis ANNs. The SVM classifier showed encouraging performance. Kumar and Venugopal [2] presented a classifier for the dreaded eye disease glaucoma through an ANN. This model implemented the classifier with back-propagation and a multilayer feed-forward network. Glaucoma detection by image processing is proposed by using a feed-forward back-propagation neural network. Sheeba and George [3] used MATLAB to extract the required features. Povilas and Saltenis [4] suggested an ANN as an eye disease classifier. An investigation of glaucomatous and healthy eyes was done in their study. Results are achieved through the application of the Levenberg–Marquardt learning algorithm and the logsig activation function. A superior set of input vectors was used for accurate valuation for neural network teaching and justification [5]. Numerous investigators used the Levenberg–Marquardt learning process for classification. Osareh et al. [6] associate two NN classification techniques for retinal images, which are classified after segmenting exudate sections.
To achieve good class separation between exudates and non-exudates classes, support vector machine classifier


was used [5]. Guven et al. presented a diagnosis method using electrooculography (EOG) signals for subnormal eyes. The Levenberg–Marquardt back-propagation algorithm was used by the authors. The outcomes are classified into two classes, normal and subnormal eye. The proposed classification structure has about 93.3% specificity and 94.1% sensitivity [7]. Vallabha et al. proposed a technique for the classification of diabetic vascular irregularities [8]. This classifier provides automated detection of abnormalities. Hitz and Reitsamer proposed a linear discriminant analysis. The analysis includes a classification tree and a forward stepwise variable selection algorithm for glaucoma subjects, with and without optical defects. Visual fields are verified with the Humphrey analyzer. Measurement of the optic nerve is achieved by laser scanning topography. Both the training and test samples reported the generalization error [9]. An automatic detection method was presented by Abdel-Haleim et al. [10], in which the location of the optic disc is detected in digital retinal fundus images. The method starts by normalizing luminosity and contrast throughout the image. Image processing techniques are proposed for detection and testing in the biomedical stream [5]. This research used illumination and adaptive histogram equalization image processing methods.

9.2 Classification

This study classifies the diabetic retinopathy dataset, in which the sets of parameters define the EEG states of the eyes; the performances are classified as normal and diseased [5]. Artificial neural networks are often used as powerful and intelligent classifiers for early detection and accurate diagnosis of diseases, and decision modeling problems are widely solved by artificial neural networks. An ANN has various capabilities and features such as input–output mapping, adaptivity, non-parametric estimation, and nonlinearity. These properties make it a superior substitute for resolving difficult tasks in comparison to statistical techniques [5]. ANNs are non-parametric, make no assumptions about the distribution of the data, and thus permit the data to react according to the environment and provide a distributed structure. The ANN is key for modeling complex health problems where huge databases of significant medical statistics are available. The information extracted from the NN is provided as guidelines, which assist experts in the diagnosis of retinopathy; the guidelines hold information for categorizing eye disease and are based on knowledge developed by the ANN from previously trained samples. A support vector machine is used for binary classification, here for classifying normal and diseased performances. The diabetic retinopathy database from the UCI machine learning repository is used as input to the classifier. This dataset comprises features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy; it includes 1151 samples and a total of 20 attributes with two attribute characteristics. This paper evaluates the performance of all neural networks based on SVM with polynomial-3 kernel, Gaussian SVM, ensemble classifier, and complex tree algorithms. It is found that the percentage accuracy is maximum for the ensemble classifier. Figure 9.1 presents a comparison of the performance of the trained models on unknown samples for SVM with polynomial-3 kernel, Gaussian SVM, ensemble classifier, and complex tree algorithms, and Fig. 9.2 presents the accuracies of the four algorithms. In this section, the results are obtained from the dataset run in MATLAB 2014, which was used for the implementation of the classifiers. The accuracy of the SVM is compared to determine the best classifier for the dataset. The performance comparison showed that the SVM network achieved an accuracy of 90% for the diabetic retinopathy dataset; therefore, the classifier for the diagnosis of diabetic retinopathy is designed using SVM. Figure 9.3 shows the comparison of classification loss for the four classification techniques. It was found that the SVM NN classifier has the benefit of reducing misclassifications between the classes and matches the other NN classifiers [5]. It has delivered consistent classification accuracy for healthy and unhealthy instances.
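As a rough illustration of this comparison, the sketch below trains the four classifier families discussed above on the UCI (Messidor-derived) diabetic retinopathy feature set using scikit-learn. The original experiments were run in MATLAB 2014, so this is only an analogous open-source setup; the file name and column layout are assumptions.

```python
# Hypothetical sketch: comparing the four classifier families on the
# UCI "Diabetic Retinopathy Debrecen" (Messidor features) dataset.
# The CSV path/format is an assumption; the original study used MATLAB.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumed layout: feature columns followed by a binary class label.
data = pd.read_csv("messidor_features.csv", header=None)
X, y = data.iloc[:, :-1].values, data.iloc[:, -1].values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM (polynomial-3)": SVC(kernel="poly", degree=3),
    "Gaussian SVM": SVC(kernel="rbf"),
    "Ensemble (bagged trees)": BaggingClassifier(n_estimators=50, random_state=0),
    "Complex tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {acc:.3f}")
```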

Fig. 9.1 Comparison of performances of the trained model for unknown samples

Fig. 9.2 Classification accuracies of various algorithms for the dataset

Fig. 9.3 Comparison of classification loss for the four classification techniques

9.3 Conclusion

Diabetes is very common among the population all across the world. In this paper, the diagnosis of diabetic retinopathy disease is performed with an SVM-based optimal neural network. The results were obtained from the network classification approach by testing samples of different eye states of diabetic patients. These results show that the support vector machine (SVM) model is effective in the classification of diabetic retinopathy, with an overall classification accuracy of 90%. It can be concluded that the SVM NN model is the best classifier model according to classification accuracy and computational time. Artificial neural networks are used for the diagnosis of eye diseases and validation of the results of standard automated primary data. Further improvements in every phase of the algorithm are needed to improve the performance of diagnosis systems and computer-assisted recognition.

References 1. Parsaei, H., Moradi, H.M.: Development and verification of artificial neural network classifiers for eye diseases diagnosis. ICBME (2008) 2. Kumar, H.P., Venugopal, H.: Diagnosis of glaucoma using artificial neural networks. Int. J. Comput. Appl. 180(30), 29–31 (2018) 3. Sheeba, O., George, J.: Glaucoma detection using artificial neural network. IACSIT Int. J. Eng. Tech. 6(5), 158–161 (2014) 4. Povilas, T., Saltenis, V.: Neural network as an ophthalmologic disease classifier. Inf. Technol. Control 36, 365–371 (2007) 5. Borkhade, G., Raut, R.: Application of neural network for diagnosing eye disease. Int. J. Electron. Commun. Soft Comput. Sci. Eng. 174–176 (2015) 6. Osareh, A., Mirmehdi, M., Thomas, B., Markham, R.: Comparative exudates classification using support vector machine and neural networks. In: Medical Image Computing and Computer Assisted Intervention. Lecture Notes in Computer Science, Vol. 2489, pp. 413–420 (2002) 7. Guven, A., Kara, S.: Diagnosis of the macular diseases from pattern electroretinography signals using artificial neural networks. Expert Syst. Appl. 361–366 (2006) 8. Vallabha, D., Dorairaj, R., Namuduri, K.R., Thompson, H.: Automated detection and classification of vascular abnormalities in diabetic retinopathy. In: Asilomar Conference on Signals, Systems and Computers, vol. 38, no. 2, pp. 1625–1629 (2001)


9. Hitz, W., Reitsamer, H.A.: Application of discriminant, classification tree and neural network analysis to differentiate between potential Glaucoma suspects with and without visual field defects. J. Theor. Med. 5(3), 161–170 (2003) 10. Abdel-Haleim, A., Youssif, A.-R., Ghalwash, A.Z., Sabry, A.A., Ghoneim, A.-R.: Optic disc detection from normalized digital fundus images by means of a vessels direction matched filter. IEEE Trans. Med. Imag. 27(1), 11–18 (2008)

Chapter 10

Comparative Analysis of Data Mining Classification Techniques for Prediction of Heart Disease Using the Weka and SPSS Modeler Tools

Atul Kumar Ramotra, Amit Mahajan, Rakesh Kumar and Vibhakar Mansotra

Abstract The healthcare sector generates enormous data related to electronic medical records containing detailed reports, tests, and medications. Research in the field of health care is being carried out to utilize the available healthcare data effectively using data mining. Every year, heart disease causes millions of deaths around the world. This research paper intends to analyze a few important parameters and utilize data mining classification techniques to predict the presence of heart disease. Data mining techniques are very useful in identifying the hidden patterns and information in a dataset. Decision Tree, Naïve Bayes, Support Vector Machines, and Artificial Neural Networks classifiers are used for the prediction in the Weka and SPSS Modeler tools, and a comparison of results is done on the basis of sensitivity, specificity, precision, and accuracy. The Naïve Bayes classifier achieved the highest accuracy of 85.39% in the Weka tool, and in the SPSS Modeler tool, the SVM classifier achieved the highest accuracy at 85.87%.

10.1 Introduction

With growing volumes of large healthcare data, there is a need for a computational tool to assist in extracting meaningful knowledge and information from the available data. The ability to extract important and relevant information hidden in these large volumes of data and to make useful predictions based on this knowledge is becoming important, as decisions made on these predictions prove to be more accurate. Data mining is gaining popularity among scientists and researchers for generating deep insights into these large healthcare datasets, which contain all the information and can be effectively used to improve decision-making by discovering patterns and trends in the data [1, 2]. Data mining techniques in health care are used for improving the quality of patient care and also for reducing healthcare costs [3]. Many data mining classification algorithms are utilized for the early diagnosis of diseases related to the heart and of other diseases, such as different types of cancer, kidney diseases, and diabetes [4]. Diseases related to the heart are the main cause of the high mortality rate both in India and abroad [5]. In most cases, heart disease is detected in the final stages. Detection of heart disease at an early stage can help in decreasing the mortality rate, as most people are unaware of it due to lack of knowledge. The treatment cost for heart disease is also very expensive and not everyone can afford it. Therefore, it becomes most important to detect heart disease at an early stage so that proper treatment can be given. By utilizing various data mining classification techniques, detection of heart disease can be done at an early stage [6]. Data mining offers various classification techniques. Classification comprises two phases. The training phase is the first phase of classification; it is a learning process, where rules and patterns are created from the larger portion of the available dataset. The second phase is known as the test phase, where the remaining data is used to check the accuracy of the classification models [7]. Our work presents the implementation of Decision Tree, Naïve Bayes, Support Vector Machines, Artificial Neural Networks, and k-Nearest Neighbor in the Weka and SPSS Modeler tools to predict heart disease. A comparison of results is done to identify the algorithms with the highest accuracy in both tools.

10.2 Literature Review

The healthcare field collects huge amounts of patient data with the help of computer-based record systems. Data mining can help in extracting valuable knowledge from this collection. In the next section, data mining classification techniques are explained briefly, their applications for predicting heart-related diseases are described, and the accuracies achieved by these classification algorithms on the Cleveland heart disease dataset are also discussed.

10.2.1 Data Mining Classification Techniques

Classification techniques are supervised learning methods of data mining. A classification technique assigns each item in a collection to one of the target categories or classes. Classification techniques aim at accurately predicting the target class for every case of the dataset based on a training set. Classification techniques are able to process large datasets in data mining, and the accuracy is calculated as the total percentage of test data that is correctly classified [8]. The data mining classification techniques used in the study are:

Decision Tree. The internal structure of a decision tree is like a tree, where the internal nodes represent a test on the attributes, the branches represent the possible test results, and the leaf nodes denote the final classes. Different attribute selection measures help in selecting the best attribute for splitting the data according to classes. In data mining, various types of decision tree algorithms are available; the main difference is in the mathematical model used for the selection of an attribute in rule extraction. The C4.5 algorithm is a successor of the ID3 decision tree algorithm. C4.5 adopts a greedy method for the construction of the tree using the divide-and-conquer approach in a top-down manner. The gain ratio is used as the attribute selection measure by C4.5, which is an extension of information gain; information gain is used by ID3 as the attribute selection measure [9]. The gain ratio is calculated using relations (10.1)–(10.3) as

\[ \text{Gain Ratio}(X) = \frac{\text{Gain}(X)}{\text{SplitInfo}_X(A)} \tag{10.1} \]

\[ \text{SplitInfo}_X(A) = -\sum_{m=1}^{n} \frac{|A(m)|}{|A|} \times \log_2 \frac{|A(m)|}{|A|} \tag{10.2} \]

\[ \text{Gain}(X) = \text{Entropy}(A) - \text{Entropy}_X(A) \tag{10.3} \]

where "A" is the data partition containing the training samples, "X" is the attribute of the partition with "n" distinct values {z1, z2, z3, z4, …, zn}, and A(m) is the set of samples in "A" with outcome zm of "X". Aljaaf et al. [10] developed a multi-level risk assessment model for the prediction of heart failure. The model can predict risk at five levels, from extremely high risk to no risk at all, using the C4.5 algorithm and 10-fold cross validation. The dataset used is from the Cleveland Clinic Foundation, with a total of 297 instances, and the authors claimed to have achieved 86.30% overall precision. Alexopoulos et al. [11] proposed a model that uses a machine learning approach for predicting stroke disease by employing the C4.5 classifier; the authors used 1000 patient records as input, with 114 attributes describing five different diagnoses. Table 10.1 shows the sensitivity, specificity, precision, and accuracy achieved by using a decision tree algorithm in the Weka and SPSS Modeler tools on the Cleveland heart disease dataset.

Table 10.1 Results achieved by decision tree algorithm employed in Weka and SPSS Modeler

Decision tree classifier   Sensitivity   Specificity   Precision   Accuracy (%)
Weka                       0.742         0.775         0.742       74.15
SPSS Modeler               0.833         0.750         0.784       79.35
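As a quick illustration of relations (10.1)–(10.3), the following sketch computes entropy, information gain, and gain ratio for a single categorical attribute; the attribute and class values in the example call are hypothetical, not taken from the Cleveland data.

```python
# Sketch of the gain-ratio calculation from relations (10.1)-(10.3).
# The example attribute/class values below are made up for illustration.
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(attribute_values, labels):
    total = len(labels)
    # Entropy_X(A): weighted entropy of the partitions induced by attribute X.
    partitions = {}
    for v, y in zip(attribute_values, labels):
        partitions.setdefault(v, []).append(y)
    entropy_x = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - entropy_x                      # relation (10.3)
    split_info = entropy(attribute_values)                  # relation (10.2)
    return gain / split_info if split_info > 0 else 0.0     # relation (10.1)

# Hypothetical example: chest pain type vs. presence of heart disease.
cp = ["typical", "atypical", "atypical", "asympt", "asympt", "asympt"]
disease = [0, 0, 1, 1, 1, 0]
print(round(gain_ratio(cp, disease), 3))
```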


The accuracy achieved by the decision tree classifier in Weka is 74.15%, as compared to the 79.35% accuracy achieved in SPSS Modeler.

Naïve Bayes. Statistical techniques are used for the Naïve Bayes classifier. The prediction of the membership class is done using the probability theory of Bayes' theorem. The Naïve Bayes algorithm is built on the assumption of conditional independence, i.e., the independence of the values of the attributes for a given class from the values of the other attributes. For the initial training dataset, the posterior probability of the response variable is calculated, and computations of the conditional probability for the other variables are also done. Then, for each test dataset sample, the probability of occurrence is calculated for all the cases of the response variable, and the response variable value having the highest probability of occurrence is selected [12]. The posterior probability and conditional independence are calculated using relations (10.4) and (10.5), respectively, as

\[ P(C(i) \mid X) = \frac{P(X \mid C(i))\, P(C(i))}{P(X)} \tag{10.4} \]

\[ P(X \mid C(j)) = \prod_{i=1}^{n} P(X(i) \mid C(j)) \tag{10.5} \]

where X = {z1, z2, z3, …, zn} is the attribute vector of each sample in the training data. Orphanou et al. [13] developed a model for diagnosing coronary heart disease based on the Naïve Bayes classifier in combination with temporal association rules (TARs); for each TAR, both the horizontal support and the mean feature representations are compared, and the Stulong dataset was used in the study. Alizadehsani et al. [14] diagnosed coronary artery disease by developing a feature creation method; the dataset used by the authors is the Z-Alizadeh dataset, having 303 patient records and 54 features, and the results of the SMO algorithm, Bagging, the Naïve Bayes classifier, and ANN are compared on the dataset. In the Weka tool, the Naïve Bayes classifier achieved an accuracy of 85.39%, and in SPSS Modeler an accuracy of 72.83% was achieved. Table 10.2 shows the results achieved from both tools.

Table 10.2 Results achieved by Naïve Bayes classifier employed in Weka and SPSS Modeler

Naïve Bayes classifier   Sensitivity   Specificity   Precision   Accuracy (%)
Weka                     0.854         0.863         0.854       85.39
SPSS Modeler             0.812         0.736         0.795       72.83
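To make relations (10.4) and (10.5) concrete, the sketch below scores a test sample with a hand-rolled categorical Naïve Bayes; the tiny training table is invented purely for illustration and no smoothing is applied.

```python
# Minimal categorical Naive Bayes following relations (10.4) and (10.5).
# The toy records below are hypothetical, not from the Cleveland dataset.
from collections import Counter, defaultdict

train = [  # (chest_pain, high_blood_sugar) -> heart_disease class
    (("typical", "yes"), 1), (("asympt", "no"), 0),
    (("typical", "no"), 1), (("atypical", "no"), 0),
]

# Prior counts for P(C(i)) and per-attribute counts for P(X(i) | C(j)).
priors = Counter(c for _, c in train)
cond = defaultdict(Counter)
for features, c in train:
    for i, v in enumerate(features):
        cond[(i, c)][v] += 1

def posterior(features, c):
    # P(C|X) is proportional to P(C) * prod_i P(X(i)|C), per (10.4)-(10.5);
    # P(X) is a common denominator and can be ignored for the argmax.
    p = priors[c] / len(train)
    for i, v in enumerate(features):
        p *= cond[(i, c)][v] / priors[c]
    return p

test = ("typical", "no")
print(max(priors, key=lambda c: posterior(test, c)))  # predicted class
```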

Support Vector Machines (SVM). SVM is a classification method that can deal with both linear and nonlinear forms of data. The SVM algorithm uses a boundary-determining technique for the classification and separation of data. A problem that cannot be solved in a low-dimensional linear space is mapped by SVM into a higher-dimensional space [14]. SVM transforms the training data into a higher dimensional space using nonlinear mapping and searches for a decision boundary, known as the linear optimal separating hyperplane, which separates the classes. Support vectors are used to find the hyperplane, which is characterized with the help of an equation that determines the boundary for each class. The main aim of SVM is to search for a data boundary that is not only the best but also has the maximum possible distance between the classes. Yang et al. [15] proposed a model to diagnose early-stage heart failure by employing a support vector machine. The Bayesian principal component method was used to handle missing data, and the final results were classified into healthy, heart-failure-prone, and heart-failure groups according to the stages of cardiac dysfunction. The dataset used is from Zhejiang Hospital, containing 289 samples, and the parameters used for prediction were the heart rate variability test, electrocardiography test, blood test, echocardiography test, six-minute walk distance test, chest radiography test, and physical test. Alty et al. [16] proposed a model for assessing the arterial stiffness of patients for cardiovascular disease risk prediction; the authors classified patients into two separate groups, one with a high risk of CVD and the other with a low risk of CVD, from features of the digital volume pulse using a support vector machine classifier. In our experimental setup, the SVM classifier achieved an accuracy of 80.89% in the Weka tool and 85.87% in the SPSS Modeler tool, as presented in Table 10.3.

k-Nearest Neighbor (KNN). The nearest-neighbor algorithm works on the technique of analogy by comparing the data fixed for testing with the data fixed as training data based on similarity, and is hence known as a memory-based technique. For a given test sample, the KNN algorithm searches the pattern space for the closest k training samples. Distance is commonly calculated using a distance metric known as the Euclidean distance [9]. The Euclidean distance between two samples is calculated using relation (10.6) as

\[ \text{dist}(X1, X2) = \sqrt{\sum_{n=1}^{m} \left( x(1)n - x(2)n \right)^2 } \tag{10.6} \]

where X1 = (x11, x12, x13, …, x1m) and X2 = (x21, x22, x23, …, x2m).
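A small sketch of relation (10.6) and its use in a nearest-neighbor lookup is shown below; the feature vectors are arbitrary examples rather than real patient records.

```python
# Euclidean distance of relation (10.6) and a 1-nearest-neighbor lookup.
# The sample vectors are arbitrary illustrations, not real patient records.
import math

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

train = {(63.0, 145.0, 233.0): 1, (41.0, 130.0, 204.0): 0}  # features -> class
query = (57.0, 140.0, 241.0)

nearest = min(train, key=lambda x: euclidean(x, query))
print(euclidean(nearest, query), train[nearest])
```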

Table 10.3 Results achieved from SVM classifier employed in Weka and SPSS Modeler

SVM classifier   Sensitivity   Specificity   Precision   Accuracy (%)
Weka             0.809         0.80          0.810       80.89
SPSS Modeler     0.851         0.868         0.901       85.87

Masetic et al. [17] proposed a model to classify normal and congestive heart failure using data mining on ECG signals from the PTB Diagnostic ECG and BIDMC Congestive Heart Failure databases. Autoregressive Burg is used as the feature extraction method, and the SVM classifier, the k-NN algorithm, random forest, C4.5, and neural networks were used for the classification. An accuracy of 82.02% is achieved in Weka using the KNN classifier, whereas in SPSS Modeler an accuracy of 82.61% is achieved, as presented in Table 10.4.

Table 10.4 Results achieved by KNN classifier employed in the Weka and SPSS Modeler tools

KNN classifier   Sensitivity   Specificity   Precision   Accuracy (%)
Weka             0.820         0.826         0.820       82.02
SPSS Modeler     0.818         0.837         0.882       82.61

Artificial Neural Networks (ANN). An ANN is a metaphorical representation of the human brain used for processing information. The neural network comprises a connected set of input and output units. Using training data as input, the network is trained to find a pattern. Weights are associated with the interconnections, and by making adjustments to the weights, the network learns and is able to correctly predict the class labels. A multilayer feed-forward network is the type of ANN in which learning is performed by the back-propagation algorithm. A neural network is called a feed-forward network if there is no cycle in the connections. It contains three layers: the first and the last layer are known as the input and the output layer, respectively, and the layer present between these two is known as the hidden layer. All these layers have interconnections. Attributes of the training samples are passed as input to the network [18]. The inputs along with the weights (weighted inputs) are then sent to the hidden layers, and the end layer, that is, the output layer, consists of the weighted output from the preceding last hidden layer, which represents the predicted class labels. Gharehchopogh et al. [19] proposed a model to predict heart-related problems using neural networks; the authors used the clinical records of 40 patients, the parameters used for prediction were blood pressure, gender, age, and smoking, and the model was able to predict 85% of the cases correctly. Table 10.5 presents the sensitivity, specificity, precision, and accuracy for the ANN classifier. Employing the ANN algorithm on the Cleveland heart disease dataset achieved an accuracy of 80.89% in Weka and 83.70% in SPSS Modeler.

Table 10.5 Results achieved by ANN classifier employed in Weka and SPSS Modeler

ANN classifier   Sensitivity   Specificity   Precision   Accuracy (%)
Weka             0.808         0.789         0.813       80.89
SPSS Modeler     0.810         0.882         0.921       83.70


10.3 Experimental Analysis

10.3.1 Dataset

The dataset for this study is collected from the UCI Machine Learning Repository and was created by the V.A. Medical Center, Long Beach and the Cleveland Clinic Foundation (Robert Detrano, M.D., Ph.D.) [20]. The dataset contained 303 records and 76 attributes. After data preprocessing and removal of records with missing values, a total of 297 records are considered for the study, with 13 input attributes: age, sex, chest pain type (Cp), resting blood pressure (Trestbps), serum cholesterol (Chol), fasting blood sugar (Fbs), resting electrocardiographic results (Restecg), exercise-induced angina (Exang), maximum heart rate achieved (Thalach), ST depression induced by exercise relative to rest (Oldpeak), the slope of the peak exercise ST segment (Slope), number of major vessels colored by fluoroscopy (Ca), and Thal. "num" is used as the target variable.
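The study's simulations were run in Weka and SPSS Modeler; an analogous open-source setup in scikit-learn is sketched below for reference. The file name, column handling, and classifier hyperparameters are assumptions, so the numbers it prints will not match the paper exactly.

```python
# Hypothetical scikit-learn analogue of the Weka/SPSS Modeler experiments:
# five classifiers evaluated with 10-fold cross validation on the
# preprocessed Cleveland data (297 records, 13 attributes, "num" target).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
        "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("processed.cleveland.data", names=cols, na_values="?").dropna()
X = df.drop(columns="num").values
y = (df["num"] > 0).astype(int).values  # presence vs. absence of heart disease

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "ANN": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```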

10.3.2 Result Analysis

Decision Tree, Naïve Bayes, SVM, k-NN, and ANN data mining classifiers are used to perform simulations on the dataset using a 70% split and 10-fold cross validation in the Weka and SPSS Modeler tools. A comparative analysis of the accuracy achieved by these data mining classification techniques in the Weka and SPSS Modeler tools is shown in Figs. 10.1, 10.2, 10.3, 10.4, and 10.5. The Naïve Bayes classifier achieved the highest accuracy of 85.39% in the Weka tool, and in the SPSS Modeler tool, the SVM classifier achieved the highest accuracy of 85.87%.

Fig. 10.1 Comparative analysis of Decision Tree classifier

Fig. 10.2 Comparative analysis of Naïve Bayes classifier

Fig. 10.3 Comparative analysis of SVM classifier

Fig. 10.4 Comparative analysis of KNN classifier

Fig. 10.5 Comparative analysis of ANN classifier

References 1. Liao, S.H., Chu, P.H., Hsiao, P.Y.: Data mining techniques and applications—a decade review from 2000 to 2011. Expert. Syst. Appl. 11303–11311 (2012) 2. Koppad, S.H., Kumar, A.: Application of big data analytics in healthcare system to predict COPD. In: International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1–5 (2016) 3. Singh, P., Mansotra, V.: Data mining based tools and techniques in public health care management: a study. In: 11th International Conference on Computing for Sustainable Global Development, India Com (2017) 4. Silwattananusarn, T., Tuamsuk, K.: Data mining and its applications for knowledge management: a literature review from 2007 to 2012. Int. J. Data Min. Knowl. Manag. Process. 13–24 (2012) 5. World Health Organization: Non-communicable diseases. http://www.who.int/mediacentre/ factsheets/fs355/en/. Accessed 18 June 2018 6. Peter, T.J., Somasundaram, K.: An empirical study on prediction of heart disease using classification data mining techniques. In: IEEE-International Conference on Advances in Engineering, Science and Management, pp. 514–518 (2012) 7. Mastrogiannis, N., Boutsinas, B., Giannikos, I.: Methods for improving the accuracy of data mining classification algorithms. Comput. Oper. Res. 2829–2839 (2009) 8. Arumugam, P., Christy, V.: Analysis of clustering and classification methods for actionable knowledge. Mater. Today Proc. 1839–1845 (2018) 9. Han, J., Kamber, M.: Data Mining Concepts and Techniques. Morgan Kaufmann Publishers (2006) 10. Aljaaf, A.J., Al-Jumeily, D., Hussain, A.J., Dawson, T., Fergus, P., Al-Jumaily, M.: Predicting the likelihood of heart failure with a multi level risk assessment using decision tree. In: Third International Conference on Technological Advances in Electrical, pp. 101–106. Beirut, Lebanon (2015) 11. Alexopoulos, E., Dounias, G., Vemmos, K.: Medical diagnosis of stroke using inductive machine learning. Mach. Learn. Appl. 20–23 (1999) 12. Jabbar, M.A., Samreen, S.: Heart disease prediction system based on hidden Naïve Bayes classifier. In: 2016 International Conference on Circuits, Controls, Communications and Computing, pp. 1–5 (2016) 13. Orphanou, K., Dagliati, A., Sacchi, L., Stassopoulou, A., Keravnou, E., Bellazzi, R.: Incorporating repeating temporal association rules in Naïve Bayes classifiers for coronary heart disease diagnosis. J. Biomed. Inform. 74–82 (2018) 14. Alizadehsani, R., Habibi, J., Hosseini, M.J., Mashayekhi, H., Boghrati, R., Ghandeharioun, A., Bahadorian, B., Sani, Z.A.: A data mining approach for diagnosis of coronary artery disease, pp. 52–61 (2013) 15. Yang, G., Ren, Y., Pan, Q.: A heart failure diagnosis model based on support vector machine. In: 3rd International Conference on Biomedical Engineering and Informatics (2010) 16. Alty, S.R., Millasseau, S.C, Chowienczyk, P.J., Jakobsson, A.: Cardiovascular disease prediction using support vector machines. In: 46th Midwest Symposium on Circuits and Systems. IEEE J. Biomed. Health Inform. (2003) 17. Masetic, Z., Subasi, A.: A congestive heart failure detection using random forest classifier. Comput. Methods Prog. Biomed. 54–64 (2016) 18. Lu, H., Setiono, R., Liu, H.: Effective data mining using neural networks. IEEE Trans. Knowl. Data Eng. 957–961 (1996) 19. Gharehchopogh, F.S., Khalifelu, Z.A.: Neural network application in diagnosis of patient: a case study. In: International Conference on Computer Networks and Information Technology, pp. 245–249, Abbottabad (2011) 20. 
Heart Attack Dataset from http://archive.ics.uci.edu/ml/datasets/HeartDisease. Accessed 11 Sept 2018

Chapter 11

An Automated Framework to Uncover Malicious Traffic for University Campus Network

Amit Mahajan, Atul Kumar Ramotra, Vibhakar Mansotra and Maninder Singh

Abstract The paper presents an automated framework that stimulates the campus network traffic to detect and prevent malicious network activities, with visualization of the logs using customized reporting dashboards on a real-time basis over the university campus network. The framework combines open-source tools to give a realistic analysis of the network traffic using the detection and prevention engine. The malicious events detected by the engine are then processed by the elastic cluster for visualization of the threats. The framework measures the detection of the events and generates alerts, which shows that the engine performs better with an elastic cluster that works on NoSQL for real-time incident reporting. Once the system gets trained, the framework automatically blocks an attack, as per the severity of the threat, from further propagation over the network in the future. This helps to secure and increase the performance of campus networks using open-source libraries and reduces the financial burden of commercial threat detection and prevention systems.

11.1 Introduction

In recent years, the incidences of intrusions have increased manyfold due to network security breaches. Organizations are coping with the situation and are discovering ways to safeguard their information and networks to reduce the danger from threats [1], therefore forcing organizations to implement security policies that take control of their users' system activities as part of their IT policies [2]. The inspection factor decides precisely how and what happens based on the data from the detection system. This system gathers information in order to classify the security violators, which makes packet capturing the key component for implementing Network Detection Systems (NDS) and for performing Security Information and Event Management with visualization [3]. In general, all organizations have stressed the implementation of an IDS (Intrusion Detection System) that monitors the activities of the network and generates reports on the observation of any security violations [4]. The key responsibility of an Intrusion Detection System (IDS) is to detect and report unwanted and malicious events by generating alerts. However, Intrusion Detection Systems (IDSs) are not very effective in today's scenario, where incident reporting alone is not adequate. Therefore, in their place a much more proficient system, called the Intrusion Detection and Prevention System (IDPS), is considered [2]. Intrusion Detection and Prevention Systems (IDPSs) are not a new technology, but an evolved form of Intrusion Detection Systems (IDS). They combine IDS and improved firewall technologies placed in-line to prevent threats, making access control decisions based on application content rather than IP addresses or ports [5]. Intrusion Detection and Prevention Systems (IDPSs) have been endorsed as a cost-effective way to serve as a network monitoring system for the detection and blocking/stopping of malicious traffic, acting as a network sanitizing agent [4]. Identification of possible incidents, logging information about them, attempting to stop them, and reporting them to security administrators can be done by using a central Security Information and Event Management (SIEM) system, which combines Security Information Management (SIM) and Security Event Management (SEM) [6]. It provides real-time analysis of network vulnerabilities and applications by interpreting and visualizing the logs on customizable dashboards as per the administrator's requirements [7].

11.2 Study Goals

Although many commercial solutions such as Intrusion Detection/Prevention Systems (IDPS) and Security Information and Event Management (SIEM) systems are available in the market, these are very costly and most of them are subscription-based, which limits many organizations in implementing them to secure their networks or forces them to pay hefty amounts for subscriptions to counter these intrusions and threats. The goal of this study was to set up and implement a framework using a combination of open-source tools. The study presents the effectiveness of the system for detecting network intruders in real time; it uses visualization of the logs collected by the Intrusion Detection and Prevention Engine and supports customizable dashboards for visualization of the campus network traffic captured by the engine [8]. To validate the developed framework, a real-time implementation of the framework over the campus network is required to understand the vulnerabilities and the types of threats that are penetrating the network and to give deep insights. The development of such a technological solution can offer a wide variety of benefits for individual users as well as for large educational institutions that are exploited by attackers.

11.3 Methodology

To set up the environment, a Network Intrusion Detection and Prevention System (NIDPS) and Security Information and Event Management (SIEM) are to be established. The monitoring tool, NIDPS, collects information about all network activities within the campus network and uses rule sets with signatures to match and identify the malicious traffic, triggering alerts whenever suspicious events occur. For the development of the framework, a Threat Intrusion Detection and Prevention Engine (TIDPE) is designed that should support multi-threaded, multi-CPU capabilities and hardware acceleration, so that it can perform network traffic analysis with increased speed and efficiency. In Fig. 11.1, the incoming and outgoing university campus network traffic from the core switch is mirrored to the interface Eth1 of the Threat Intrusion Detection and Prevention Engine (TIDPE). The TIDPE processes the mirrored traffic and triggers alerts based on the packets captured from the campus network. The captured packets are matched against the provided rule sets with signatures, and after matching the rules and signatures, alerts are generated by the TIDPE and stored in a log file. Since logs are a very crucial aspect of any security device, the engine developed should be capable of storing logs in various formats. In our study, we have configured the engine to store logs in JSON (JavaScript Object Notation) format. JSON is a lightweight data interchange format and does not require any relational database to store the information. The logs generated by the TIDPE are parsed, indexed, and stored using the elastic cluster, consisting of tools such as Filebeat, Logstash, Elasticsearch, and Kibana, which are used to create and visualize the output in customizable dashboards with the help of the Kibana plugin. The visualization of logs on the dashboard helps to quickly gain insights into potential network vulnerabilities. This helps to define the different types of network activities that could potentially indicate suspicious behavior and to block/stop such traffic from further network penetration.
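As a rough sketch of the log-processing step, the snippet below reads alert events from a Suricata-style EVE JSON log and tallies them by signature and severity before they are shipped to the elastic cluster; the log path and field layout follow Suricata's usual EVE format but should be treated as assumptions here.

```python
# Sketch: summarizing TIDPE alert logs stored in (Suricata-style) EVE JSON.
# The path and exact field names are assumptions for illustration.
import json
from collections import Counter

ALERT_LOG = "/var/log/suricata/eve.json"   # one JSON object per line

by_severity = Counter()
by_signature = Counter()
with open(ALERT_LOG) as fh:
    for line in fh:
        event = json.loads(line)
        if event.get("event_type") != "alert":
            continue
        alert = event["alert"]
        by_severity[alert.get("severity")] += 1
        by_signature[alert.get("signature")] += 1

print("Alerts by severity:", dict(by_severity))
print("Top 10 signatures:", by_signature.most_common(10))
```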

Fig. 11.1 Workflow of the framework to uncover the threats over the campus network


11.4 Experimental Setup

The core switch of the university campus LAN is configured with a mirror port to the TIDPE framework for capturing the packets. Figure 11.2 shows the architecture of the designed framework. The framework consists of the following hardware components: an Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60 GHz with 8 CPU cores, 32 GB RAM, 2 TB storage, and a 64-bit OS with two gigabit network interfaces. The open-source software used for the development of the framework includes Ubuntu, the Threat Intrusion Detection and Prevention Engine (TIDPE) built on Suricata, external rule sets with signatures from Emerging Threats and Snort, Beats, Logstash, Elasticsearch, and Kibana. The other parameters were developed and tuned on the engine to configure it as per the university campus network. The configuration is able to capture the traffic over the network in real time as well as in offline mode; this helps administrators analyze the network traffic and fine-tune the network settings as per the organizational requirements. The architectural design of the framework shown in Fig. 11.2 is divided into three segments: (1) collection of raw packets from the university campus LAN, (2) the Threat Detection and Prevention Engine, and (3) Network Security Information and Event Management (SIEM) for visualization and analysis.

Fig. 11.2 Shows the architecture for the implementation of the framework


All three segments work together to create a framework that is very helpful for the study and analysis of the campus network. The traffic flowing over the campus network is tapped by the university campus LAN core Cisco Catalyst switch, which forwards the packets to the Threat Detection and Prevention Engine, Suricata, developed by the Open Information Security Foundation. It is the heart of the framework: a rule-based Intrusion Detection System (IDS) converted into an Intrusion Prevention System (IPS) that uses externally developed rule sets from Emerging Threats and also supports the Snort rule set [9, 10], which contains rules with alert signatures to monitor the network traffic. The alerts produced by the engine warn the administrator when suspicious events occur. It is a multi-threaded engine that offers increased speed and efficiency in network traffic analysis. The detection engine inspects the network traffic by using powerful signature-based rules for the detection of complex threats and performs classification with the help of multi-pattern matching (MPM) algorithms that increase the detection rate. The packet processing system combines the processing when acquiring packets from the network. It has a separate decoding function that allows it to inspect the data stream at the application layer, which is then applied to the detection strings. Iptables and scripting perform the filtration for both incoming and outgoing traffic packets and store logs in various formats [2]. For hardware acceleration, the engine is built to support a multicore CPU environment and uses the available cores of the CPU to increase performance. The logs processed by the TIDPE require proper investigation; for that purpose, an elastic cluster is configured with the IDPS. The elastic cluster consists of three open-source software libraries: Logstash, Elasticsearch (NoSQL), and Kibana [6]. Figure 11.3 shows the log interpretation using the elastic cluster. The output from the TIDPE is written in logs that are understandable by the system; these logs are stored in JSON format in the TIDPE. Elastic clustering is used as a security information and event management system to visualize the log data in the form of charts and graphs that are convenient for conducting analysis and making decisions. The logs are transferred to Beats (syslog) in the elastic cluster, which ships the data to Logstash for parsing, filtering, and transformation. The filtered and transformed data from Logstash is forwarded to Elasticsearch, which stores and indexes the data in NoSQL format. The indexed data from Elasticsearch is sent to the Kibana dashboard, a plugin for visualization in the form of customizable dashboards, which allows the administrator to easily add more functionality to the system as per the requirements [7].
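In the deployed framework this shipping step is handled by Beats and Logstash; purely as an illustration of the same idea, the sketch below pushes parsed alert documents into an Elasticsearch index directly with the Python client. The host, index name, and client version (8.x-style calls) are assumptions, not part of the described setup.

```python
# Illustration only: indexing parsed TIDPE alerts into Elasticsearch so that
# Kibana can visualize them. In the actual framework this path is covered by
# Beats -> Logstash -> Elasticsearch; host/index names here are assumptions.
import json
from elasticsearch import Elasticsearch  # elasticsearch-py, 8.x style calls

es = Elasticsearch("http://localhost:9200")

with open("/var/log/suricata/eve.json") as fh:
    for line in fh:
        event = json.loads(line)
        if event.get("event_type") == "alert":
            # Each alert becomes one document; Kibana dashboards can then
            # aggregate on fields such as alert.severity or src_ip.
            es.index(index="tidpe-alerts", document=event)
```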

Fig. 11.3 Shows the elastic cluster flow of logs generated by the IDPS


11.5 Observations and Analysis of Framework in Campus Network on Real-Time Basis

To test and validate the framework, the system is deployed on the real-time campus network of the University of Jammu (UOJ), which has more than 5000 wireless and LAN network users. This helps to understand the framework's capabilities and the different types of threats penetrating the University of Jammu (UOJ) campus network, and to give insight through customized dashboards for visualization of the various threats.

11.5.1 Dashboard Visualization for Analysis

The network was monitored by the framework for three months, 24 * 7. The data captured by the TIDPE is analyzed by creating various visualization dashboards to interpret the logs for better insight. A total of 4,325,874 attack counts were captured over the period, and these were categorized by the severity level of the threat: the high severity level 1 contains about 564,268 attack counts, the medium severity level 2 contains 498,035 attacks, whereas the low severity level 3 contains 3,253,571 attacks, as shown in Fig. 11.4. Besides these, the TIDPE captured some of the latest threats, such as ransomware, brute-force, and DoS attacks. Although the count of these attacks is small compared to the others, the framework has shown its capability to detect such attacks, which are discussed below.

Fig. 11.4 Total number of attacks captured, with the severities triggered over the testing period and the corresponding counts

The visualization dashboards help to reveal suspicious activity in the network on a real-time basis for conducting analysis and making decisions on various parameters. We chose the following charts on the dashboard for visualizing the data. The bar chart in Fig. 11.5 showing the connection count per 30 s gives the total number of event counts per 30 s; an increasing number of counts can be a sign of abnormal activity. The customizable dashboard shows various traffic trends for the top 10 attacks, signatures, protocols, severities, source IP addresses with ports, and destination IP addresses with ports, which are visualized for the analysis of the University of Jammu network traffic. The top protocols graph shows how much and what type of traffic is flowing over the network. If the TCP protocol traffic remains high, then the network is not experiencing many attacks, but if the traffic on other protocols like UDP gets high, it indicates that suspicious activity is high. The dashboard also shows the ports with the largest number of requests; an increasing number of connections and requests per minute to some ports may indicate suspicious activity. It also shows geotags of IP packets, to enhance the capabilities of communication networks and of location-based services, which can improve and strengthen the security of the network.

Fig. 11.5 Customizable dashboards of the threat detection and intrusion prevention system for visualization of threats

Figure 11.6 shows the memory consumption and CPU load at 1, 5, and 15 min time intervals; the information from these dashboards is very useful for knowing the hardware consumption during real-time network monitoring. Figure 11.7 shows the capture of a DoS attack by the IDPS engine, the signatures associated with the rule, the source and destination, and the action taken by the engine, which shows that the threat is automatically blocked from broadcasting into the network. Figure 11.8 shows the capture of a brute-force attack by the IDPS engine, the signatures associated with the rule, the severity level of the attack with source and destination, and the action taken by the engine, which shows that the threat is automatically blocked from broadcasting into the network. Figure 11.9 shows the capture of a ransomware attack, with the source and destination, the signatures associated with the rule, the severity level of the attack, and the action taken by the IDPS engine, which shows that the threat is automatically blocked from broadcasting into the network.
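A lightweight way to reproduce figures of this kind outside Kibana is sketched below: it buckets parsed alert timestamps into 30-second windows and counts events per severity level. The field names, the tiny record list, and the matplotlib output are illustrative assumptions, not part of the deployed framework.

```python
# Sketch: event counts per 30-second window and per severity level from
# parsed alert records; 'alerts' would come from the EVE JSON parsing step.
import pandas as pd
import matplotlib.pyplot as plt

alerts = [  # hypothetical parsed records: (timestamp, severity)
    {"timestamp": "2019-03-01T10:00:05", "severity": 1},
    {"timestamp": "2019-03-01T10:00:21", "severity": 3},
    {"timestamp": "2019-03-01T10:00:44", "severity": 2},
]
df = pd.DataFrame(alerts)
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp")

per_window = df.resample("30s").size()          # events per 30-second window
per_severity = df.groupby("severity").size()    # totals per severity level

per_window.plot(kind="bar", title="Event counts per 30 s")
plt.tight_layout()
plt.savefig("events_per_30s.png")
print(per_severity)
```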

Fig. 11.6 CPU load at different intervals and memory consumed

Fig. 11.7 DoS attack signature and capture shot by IDPS engine


Fig. 11.8 Brute-force attack signature with source IP and destination captured by the engine

Fig. 11.9 Capturing of ransomware attack. Source and destination of ransomware with time stamp

11.6 Conclusion

The framework developed is capable of monitoring and detecting different kinds of threats from compromised hosts in real time. More than 43 lakh (4.3 million) attacks within a span of 3 months were classified as per their severity level. During the testing period, the framework detected and automatically blocked some severe attacks, such as ransomware, brute-force, and DoS attacks, in real time; although the frequency of such attacks was not high, the framework identified and captured this traffic. The SIEM framework visualized the threats in customizable dashboards. Once the system gets trained, the framework automatically blocks an attack, as per the severity of the threat, from further propagation over the network in the future. The developed framework is tested and validated on a real-time basis on the University of Jammu campus network; therefore, it can be utilized by educational institutions with limited financial resources for IT to secure their networks.

11.7 Future Scope

In the future, some performance analysis can be conducted to further fine-tune the engine. The CPU shall be replaced with a GPU to further enhance the performance of the Threat Intrusion Detection Prevention Engine, and machine learning can be introduced into the visualization framework.


References 1. Gaigole, M.S., et al.: The study of network security with its penetrating attacks and possible security mechanisms. Int. J. Comput. Sci. Mobile Comput. 4(5), 728–735 (2015) 2. Stanger, J.: Detecting intruders with Suricata. http://www.admin-magazine.com/Articles/ Detecting-intruders-with-Suricata. Accessed 2018 3. Kostrecová, E., Bínová, H.: Security information and event management. Paripex-Indian J. Res. 4(2) (2015) 4. Mohamed, A.B., Idris, N.B., Shanmugum, B.: A brief introduction to intrusion detection system. In: International Conference on Intelligent Robotics, Automation, and Manufacturing, CCIS 330, pp. 263–271 (2012) 5. Blumenthal: Intrusion-Prevention Systems and Enterprise Architecture, 29 Jan 2008. http:// www.andyblumenthal.com/2008_01_01_archive.html. Accessed 2018 6. Waagsnes, H.: SCADA intrusion detection system test framework. Master’s thesis, Department of Information and Communication Technology, Faculty of Engineering and Science University of Agder Grimstad, 21 May 2017 7. Waagsnes, H., Ulltveit-Moe, N.: Intrusion detection system test framework for SCADA Systems. In: Proceedings of the 4th International Conference on Information Systems Security and Privacy (ICISSP 2018), pp. 275–285. https://doi.org/10.5220/0006588202750285 8. Saif, A.: IDPS—Intrusion Detection Prevention Systems, COURSE TITLE CSCE 522. https:// www.coursehero.com/file/14996872/IDPS/. Accessed 2018 9. Ish, J., Jaentsch, K.: py-idstools: Snort and Suricata rule and event utilities in python. https:// github.com/jasonish/py-idstools (2013). Accessed 2018 10. Khamphakdee, N., Benjamas, N., Saiyod, S.: Improving intrusion detection system based on snort rules for network probe attack detection. In: 2014 2nd International Conference on Information and Communication Technology (ICoICT), Bandung, pp. 69–74 (2014)

Chapter 12

Comparative Analysis of K-Means Algorithm and Particle Swarm Optimization for Search Result Clustering

Shashi Mehrotra and Aditi Sharan

Abstract Clustering is being used to organize search results into clusters with the aim of helping a user access relevant information. The paper performs a comparative analysis of the most common traditional clustering algorithm, k-means, and a nature-inspired algorithm, Particle Swarm Optimization (PSO). Experiments are conducted over the well-known AMBIENT dataset, used for topic clustering. Experimental results show that the highest recall and F-measure are achieved by PSO. Though the highest precision is achieved by the k-means algorithm, in most of the topics PSO shows a better result than the k-means algorithm.

12.1 Introduction

Due to the tremendous rise in the use of the Internet and web data, there is a need for an efficient method for the retrieval of relevant information. Most search engines display search results in a list form, where searching for relevant information may be time-consuming and tedious. Another problem with search results is polysemy, i.e., queries with more than one meaning [18], which requires converging on a focused subset of the results. Rather than presenting the information in list form, grouping results into various meaningful folders makes it easier to access them, and this is where clustering is used. Clustering is a process of partitioning data into groups or categories where patterns in the same cluster are of the same nature while being different from those in other clusters [5, 9, 19]. Given a set of patterns X = {x1, …, xj, …, xN}, X may have K partitions C = {C1, …, CK} (K ≤ N), such that

1. Ci ≠ ∅ for i = 1, …, K, and C1 ∪ C2 ∪ ⋯ ∪ CK = X;
2. Ci ∩ Cj = ∅ for i = 1, …, K, j = 1, …, K and i ≠ j.

These days, clustering has various applications, such as clustering patient records to identify healthcare trends and clustering astronomical data to discover new classes of stars [11]. A lot of clustering algorithms exist. The k-means algorithm, published in 1955, is the most popular clustering algorithm due to its simplicity [8, 17], and its phases can be modified easily [2]. It is recognized as the most suitable algorithm for use on large datasets [12]. Experimental results in [13–16] show that the k-means clustering algorithm performed best among the hierarchical, canopy, and EM clustering algorithms. Let X = {x1, x2, …, xn} be the set of data, which is to be clustered into K clusters, i.e., C = {c1, c2, …, ck}. The data are partitioned such that the squared error between the mean of each cluster and the data in it is minimized [8]. Particle Swarm Optimization (PSO) is based on the concept of the social behavior of fish schooling or bird flocking, and its main property is information exchange among particles [3]. The PSO method is population-based and relies on coordination among particles; it contains a swarm of particles and is initialized with solutions known as a population. The implementation complexity of PSO is low [14, 16]. The position of a particle changes according to the following three parameters: (1) the best particle among all the particles, (2) the particle's own best value, and (3) the particle's acceleration [7]. PSO often does not get stuck at local optima. The paper structure is as follows: Sect. 12.1 covers an introduction to clustering, the need for clustering in information retrieval, polysemy, and an introduction to the k-means algorithm and the Particle Swarm Optimization approach.
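A minimal sketch of the PSO position/velocity update described above, applied to optimizing a set of cluster centroids against the within-cluster squared error, is given below; the inertia and acceleration coefficients are conventional default choices, not values taken from this study.

```python
# Minimal PSO step for centroid optimization (standard update equations;
# w, c1, c2 are conventional defaults, not parameters from this study).
import numpy as np

rng = np.random.default_rng(0)

def sse(centroids, data):
    # Within-cluster sum of squared errors: each point to its nearest centroid.
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def pso_step(positions, velocities, pbest, gbest, w=0.72, c1=1.49, c2=1.49):
    r1, r2 = rng.random(positions.shape), rng.random(positions.shape)
    velocities = (w * velocities
                  + c1 * r1 * (pbest - positions)
                  + c2 * r2 * (gbest - positions))
    return positions + velocities, velocities

# Each particle encodes K=3 centroids for 2-D data (toy example).
data = rng.random((50, 2))
positions = rng.random((10, 3, 2))          # 10 particles
velocities = np.zeros_like(positions)
pbest = positions.copy()
gbest = min(positions, key=lambda p: sse(p, data)).copy()

for _ in range(20):
    positions, velocities = pso_step(positions, velocities, pbest, gbest)
    for i, p in enumerate(positions):
        if sse(p, data) < sse(pbest[i], data):
            pbest[i] = p.copy()
    gbest = min(pbest, key=lambda p: sse(p, data)).copy()

print("best SSE:", round(sse(gbest, data), 3))
```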

12.2 Related Work

Clustering is an approach where a set of objects is organized in such a way that objects within a cluster are of the same type and are distinct from the objects in other clusters [4–6, 9, 19, 20]. Thus, clustering aims to maximize similarity among the objects belonging to the same cluster and to reduce similarity with the objects in other clusters [9]. Jensi and Jiji [10] presented a study of optimization approaches, the text document clustering procedure, various similarity measures, and evaluation metrics. The paper discusses how to preprocess the data, which includes converting all the data into capital letters and removing stop words; words that do not have semantic relevance are known as stop words. They pointed out that by using the Porter stemmer the performance of text clustering can be improved; stemming is reducing words to their root form. It also presented text document encoding using a vector space model, where documents are converted to a document term matrix (DTM). The DTM model represents each document as a row and the terms as dimensions. The paper discussed the dimension reduction technique Latent Semantic Indexing (LSI), and some soft computing techniques for document clustering, for example, the Bees Algorithm, Ant Colony Optimisation, and Particle Swarm Optimization.

Table 12.1 Data description

Number of instances   Number of attributes   Name of attributes
309                   4                      (1) ID (2) URL (3) Title (4) Snippet

12.3 Experimental Result and Analysis

The study used the standard benchmark dataset AMBIENT for the experiments and for the comparative evaluation of the k-means algorithm and the PSO-based model.

12.3.1 Dataset Details

The study used the AMBIENT dataset for the experiments. The AMBIENT dataset was created by Carpineto and Romano [1] (Table 12.1).

12.3.2 Preprocessing

For the experiments in this study, preprocessing of the AMBIENT data is performed as follows: (i) there are some blank rows in the file, which are removed; (ii) the snippets are filtered and non-alphanumeric characters, such as "$", "%", and "#", are removed; (iii) the data is converted to lowercase so that words such as "ABC" and "abc" are treated as the same word; (iv) stop words and punctuation are removed (stop words are words such as is, that, the, on, upon, and an, which have no relevance when checking for similarity); (v) stemming is performed to remove word endings such as es, ing, etc.; (vi) the vector space model is used, where each row represents a document and the columns represent terms. The document term matrix is prepared, where each column entry gives the count of the word in the document, and "0" when the word does not exist in the document. TF-IDF is used to calculate the weight of each term, which is further utilized for the experiments.
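A condensed version of this preprocessing pipeline can be expressed with scikit-learn as sketched below; the input file, the snippet column name, and the use of scikit-learn's built-in English stop-word list (rather than the exact list and stemmer used in the study) are assumptions.

```python
# Sketch of the preprocessing pipeline: lowercasing, stop-word removal,
# TF-IDF weighting of the document-term matrix, then k-means clustering.
# File/column names and stop-word list are assumptions; no stemming here.
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = pd.read_csv("ambient_snippets.csv")["snippet"].dropna()
docs = docs.map(lambda s: re.sub(r"[^a-zA-Z0-9 ]", " ", s))  # strip $, %, # ...

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
dtm = vectorizer.fit_transform(docs)        # rows = documents, columns = terms

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(dtm)
print(dtm.shape, kmeans.labels_[:10])
```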


12.3.3 Performance Evaluation

Performance evaluation of the k-means algorithm and the PSO-based model is performed using well-known metrics: precision, recall, and F-measure are computed for each cluster, and B-Cubed measures are used. We compared the results of the PSO-based model and the k-means algorithm, which are presented in Figs. 12.1, 12.2, and 12.3. By analyzing Fig. 12.1, we notice that the k-means algorithm has achieved a marginally higher B-Cubed precision than the PSO model, while the PSO-based model has achieved a better result over 17 of the 21 topics. From Fig. 12.2, it can be noticed that the PSO-based model has achieved a better B-Cubed recall than the k-means algorithm, though k-means performed better over most of the topics. By analyzing Fig. 12.3, it is observed that, although the PSO-based model has obtained a higher B-Cubed F-measure than the k-means algorithm, k-means shows a better result in 11 of the 21 topics.
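For reference, a compact implementation of the B-Cubed precision, recall, and F-measure used for this evaluation is sketched below; the standard item-averaged definitions are assumed, and the tiny label lists are illustrative only.

```python
# Sketch of B-Cubed precision/recall/F-measure, averaged over items.
# Assumes the usual definitions; the example clusterings are made up.
def b_cubed(pred_clusters, true_classes):
    n = len(pred_clusters)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if pred_clusters[j] == pred_clusters[i]]
        same_class = [j for j in range(n) if true_classes[j] == true_classes[i]]
        correct = sum(1 for j in same_cluster if true_classes[j] == true_classes[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_class)
    precision, recall = precision / n, recall / n
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

pred = [0, 0, 1, 1, 1, 2]              # cluster ids assigned by k-means or PSO
gold = ["a", "a", "a", "b", "b", "b"]  # true topic labels
print(b_cubed(pred, gold))
```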

Fig. 12.1 Comparative B-Cubed precision of PSO and k-means for each topic

Fig. 12.2 Comparative B-Cubed recall of PSO and k-means for each topic

Fig. 12.3 Comparative B-Cubed F-measure of PSO and k-means for each topic

12.4 Conclusion

The study performs a comparative analysis of the traditional clustering algorithm k-means and the nature-inspired approach PSO. Each algorithm has its own features, merits, and demerits. Performance evaluation of the PSO-based model and the k-means algorithm has been performed over a well-known dataset used for topic clustering, and the experimental results of the two are compared. The results of the experiment show that the highest B-Cubed recall and B-Cubed F-measure are achieved by the PSO-based model. Though the highest precision is achieved by k-means, the k-means algorithm also achieved better precision over more of the topics.

Future Work
Both of the above-discussed algorithms have some merits and demerits. As future goals, we intend (i) to use a hybrid approach for various domains of text data, such as the medical and finance domains, (ii) to use the hybrid approach for community detection from the perspective of societal benefit, and (iii) to introduce some novel metrics for cluster evaluation in the case of text data.

References 1. Carpineto, C., Romano, G.: Ambient dataset. http://search.fub.it/ambient (2008) 2. Celebi, M.E., Kingravi, H.A., Vela, P.A.: A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl. 40(1), 200–210 (2013) 3. Chuang, L.Y., Lin, Y.D., Yang, C.H.: An improved particle swarm optimization for data clustering. In: Proceedings of the International MultiConference of Engineers & Computer Scientist 2012 I, IMECS (2012) 4. Das, S., Abraham, A., Konar, A.: Automatic clustering using an improved differential evolution algorithm. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 38(1), 218–237 (2008) 5. De Carvalho, F.D.A., Lechevallier, Y., De Melo, F.M.: Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recognit. 45(1), 447–464 (2012)



6. Grira, N., Crucianu, M., & Boujemaa, N. (2004). Unsupervised and semi-supervised clustering: a brief survey. A review of machine learning techniques for processing multimedia content, 1, 9–16 7. Huang, C.L., Huang, W.C., Chang, H.Y., Yeh, Y.C., Tsai, C.Y.: Hybridization strategies for continuous ant colony optimization and particle swarm optimization applied to data clustering. Appl. Soft Comput. 13(9), 3864–3872 (2013) 8. Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 31(8), 651–666 (2010) 9. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. (CSUR) 31(3), 264–323 (1999) 10. Jensi, R., Jiji, D.G.W.: A survey on optimization approaches to text document clustering. Int. J. Comput. Sci. Appl. (IJCSA) 3(6), 31–44 (2013) 11. McCallum, A.K.: Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering (1996) 12. Mehrotra, S., Kohli, S., Sharan, A.: To identify the usage of clustering techniques for improving search result of a website. Int. J. Data Min. Model. Manag. 10(3), 229–249 (2018) 13. Mehrotra, S., Kohli, S.: Comparative analysis of k-means with other clustering algorithms to improve search result. In: 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), pp. 309–313. IEEE (2015) 14. Mehrotra, S., Kohli, S.: The study of the usage of data analytic and clustering techniques for web elements. In: Proceedings of the ACM Symposium on Women in Research 2016, pp. 118–120. ACM (2016) 15. Mehrotra, S., Kohli, S.: Identifying evolutionary approach for search result clustering. In: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pp. 3778–3782. IEEE (2016) 16. Mehrotra, S., Kohli, S.: Application of clustering for improving search result of a website. In: Information Systems Design and Intelligent Applications, pp. 349–356. Springer, New Delhi (2016) 17. Mehrotra, S., Kohli, S.: Data clustering and various clustering approaches. In: Intelligent Multidimensional Data Clustering and Analysis, pp. 90–108. IGI Global (2017) 18. Mehrotra, S., Kohli, S., Sharan, A.: An intelligent clustering approach for improving search result of a website. Int. J. Adv. Intell. Parad. (in press). https://doi.org/10.1504/ijaip.2018. 10011466 19. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005) 20. Zeng, H.J., He, Q.C., Chen, Z., Ma, W.Y., Ma, J.: Learning to cluster web search results. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 210–217. ACM (2004)

Chapter 13

Design and Implementation of Rule-Based Hindi Stemmer for Hindi Information Retrieval
Rakesh Kumar, Atul Kumar Ramotra, Amit Mahajan and Vibhakar Mansotra
Abstract Stemming is a process that maps morphologically similar words to a common root/stem word by removing their prefixes or suffixes. In Natural Language Processing, stemming plays an important role in Information Retrieval, Machine Translation, Text Summarization, etc. Stemming reduces an inflected word to its root form without doing any morphological analysis of the word, so stemming does not always produce meaningful dictionary root words, whereas a lemmatizer always provides meaningful dictionary words. For example, the Hindi word "pakshion" is formed from a root plus a suffix; if we remove this suffix, it reduces to "paksh", which is not a meaningful Hindi dictionary word. In the context of information retrieval, the stemmer reduces varied (morphologically inflected) words to a common form, thereby reducing the index size of the inverted file and increasing recall. In this paper, the researchers have attempted to develop a rule-based Hindi stemmer based on a suffix stripping approach for Hindi Information Retrieval. A Python-based web interface has been designed to implement the proposed algorithm. The developed stemmer has also been tested for accuracy and efficiency in two scenarios, first as an independent stemmer and second as a supporting module for indexing in Hindi Information Retrieval. The proposed stemmer has shown an accuracy of 71% as an individual stemmer and also reduced the index size by approximately 26% when used in indexing.

R. Kumar (B) · A. K. Ramotra · A. Mahajan · V. Mansotra University of Jammu, Jammu 180006, Jammu and Kashmir, India e-mail: [email protected] A. K. Ramotra e-mail: [email protected] A. Mahajan e-mail: [email protected] V. Mansotra e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_13




13.1 Introduction
Stemming reduces morphologically similar words to their root/common stem. For example, several inflected Hindi words derived from a common root word are all mapped by stemming to that root form [1]. Both stop word elimination and stemming play a significant role in an Information Retrieval System. Stop word elimination removes grammatical or functional words, while stemming reduces inflected words to their common root word. With stemming, two conditions need to be handled: under stemming and over stemming. When two related/morphologically inflected words do not map to the same stem, it is a case of under stemming. When two unrelated/different words map to the same stem, thereby causing a match between a query and irrelevant documents, it is a case of over stemming. Stemming is a basic step in processing textual data before doing any task in NLP such as information retrieval, text mining, text summarization, machine translation, etc. [2]. Since many words are mapped to one common word/stem, the number of entries in the index table designed for IR is also reduced, because instead of storing all the inflected word forms, only their root form is stored. The stemming algorithm presented in this paper is based only on some language-dependent rules in the context of the information retrieval task. The paper is organized as follows: the next section looks at various approaches to stemming. Section 13.3 discusses related work on stemming for Indian languages. Section 13.4 discusses the proposed methodology and implementation. Section 13.5 covers analysis and conclusion.

13.2 Stemming Approaches There are different approaches used for stemming. Figure 13.1 shows the classification of the different stemming approaches on the basis of their methodology. A few approaches are discussed in this section.

13.2.1 Affix Removal Approach This is one of the simplest approaches used for stemming. Affix removal refers to removing either the suffix or the prefix from the word entered by the user. This method works on two principles, iteration and the longest match. An example of a rule under this approach is as follows: if a word ends with the suffix "E", then remove "E". Similarly, Hindi words that end with a common suffix reduce to their common root form when that suffix is removed. In the same way, one can make more rules for Hindi words.



Fig. 13.1 Taxonomy of stemming approaches

13.2.2 N-Gram This approach was given by Adamson and Boreham [3]. It uses the concept of n-grams, where n is 1, 2, 3, and so on. N = 2 means bigrams (digrams, bi-grams, di-grams), representing a sequence of two grams (two characters, two words, or two syllables occurring consecutively in the text), and n = 3 means trigrams (three characters in the text). In this approach, we first find the unique digrams of each word and then associate word pairs on the basis of the unique digrams they both possess. For example, the terms "mala" and "malaon" can be broken into digrams as follows.

Thus, "mala" has three unique digrams and "malaon" has five unique digrams; the two words share three unique digrams.



Dice's coefficient [4] is used as the similarity measure, with the formula S = 2C/(A + B), where A and B are the numbers of unique digrams in the two words and C is the number of unique digrams they share.
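A small sketch of this digram-based similarity is given below, using the transliterated forms for illustration (the original method operates on the Devanagari strings):

```python
def digrams(word):
    # unique pairs of adjacent characters (digrams) of a word
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(w1, w2):
    a, b = digrams(w1), digrams(w2)
    c = len(a & b)                      # C: shared unique digrams
    return 2 * c / (len(a) + len(b))    # S = 2C / (A + B)

print(dice_similarity("mala", "malaon"))   # 3 and 5 unique digrams, 3 shared -> 0.75
```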



13.2.3 Suffix Stripping Approach This approach is based on removing suffixes from the word if the word is not present in vocabulary/dictionary. In this approach, a suffix list is required for removing suffixes from the word. To make a decision, these approaches check the existence of the word in a vocabulary/dictionary. The nonexistence of the word may cause the algorithm to apply suffix stripping rules and there may be cases where two or more suffix stripping rules can be applied to the same input term [3].

13.2.4 Suffix Substitution Approach Suffix substitution approach is an improvement upon the Suffix Stripping Approach. For this, one needs to create a substitution rule that replaces one suffix with another suffix [3].

13.2.5 Brute-Force Approach Stochastic algorithm is another stemming approach that uses probability to identify the root form of a word. To develop a probabilistic model, these algorithms are trained, so that one can perform stemming by inputting inflected form of a word to this model and generate the root form of that word according to its internal rule set [3].

13.2.6 Hybrid Approach The hybrid approach uses a combination of two or more approaches.

13.3 Related Work
Stemming is not a new idea; it has been studied since 1968, and a lot of work has been done on stemming for the English language, but for Indian languages stemming still needs to be explored further. Most stemming approaches use morphological characteristics of the target language, where suffix removal is also controlled by quantitative restrictions (e.g., "ing" is removed only when the resulting stem has more than three letters, as in "thinking" but not in "king"), as in the Porter stemmer for the English language [4]. For Indian languages, Mishra and Prakash



[5] proposed the stemmer "MAULIK" for the Hindi language, which is purely based on the Devanagari script. This stemmer uses a hybrid approach and gives an accuracy of 91.59%. Ramanathan and Rao [6] developed a lightweight stemming algorithm for Hindi that uses a suffix list on a longest-pattern-match basis. For testing, documents were chosen from varied domains such as business, sports, films, health, and politics. This stemmer reports under stemming and over stemming error rates of 4.68% and 13.84%, respectively. Kumar and Rana [7] used the same technique for the Punjabi language, and their stemmer gives an average accuracy of 80.73%. Gupta [8] developed a Hindi rule-based stemmer for nouns, tested on 100 news documents taken from popular Hindi newspapers and using 16 Hindi suffixes. This stemmer gives a 16.35% error rate due to the absence of some noun suffixes, so the author leaves improvement for future work; by adding more Hindi suffixes, the error can be reduced to some extent. Shahid Husain [9] suggested and developed unsupervised stemmers for the Urdu and Marathi languages, using two approaches for suffix stripping, namely length-based stripping and frequency-based stripping. For the Urdu language, frequency-based suffix stripping gives an accuracy of 85.36%, whereas the length-based suffix stripping algorithm gives 79.76%; for Marathi, the accuracy is 63.5% with frequency-based stripping and 82.5% with length-based stripping. Paul et al. [10] suggested and developed a rule-based lemmatizer for Hindi, creating 124 rules applied in such a way that the suffix is first removed from the input word and, if required, a new character is then added. For analysis, the system was tested on 2500 words and gives an accuracy of 89.08%. Rastogi and Khanna [11] developed a morphological analyzer for Hindi; the approach used is both rule-based and corpus-based, and the analyzer works for both Hindi words and sentences.

13.4 Proposed Methodology and Implementation
In this paper, the researchers have implemented a rule-based stemming algorithm. A Python-based web interface has been designed that can process a single word, a paragraph, or a file. Initially, the input is scanned to remove invalid characters such as parentheses, commas, or any other special symbols. The input is then scanned again to remove stop words by matching against an external list of stop words. After this, each word is matched against the suffix list on a longest-match basis, applying the list of rules. The rules were drawn from the previous literature, and some more rules were added to improve the accuracy of the proposed algorithm. Table 13.1 shows a few of the stop words used by our algorithm, and Table 13.2 shows a few of the suffixes. Table 13.3 shows the output of our algorithm. Figure 13.2 shows a snapshot of the interface developed using the Python-based Django framework; the stemming module is just one part of this interface.



Table 13.1 List of a few stop words used in our algorithm

(par), (in), (vah), (yeh), (pura), (ityadi), (dwara), (inhe), (inho), (hui), (isme), (jitna), (dusra), (kitna), (ke), (krne), (kiya), (liye), (apne), (jise), (sabse), (hone), (krte), (bahut), (varg)

Table 13.2 List of a few suffixes used in our algorithm



Table 13.3 Output of the proposed stemmer

Inflected word → Output
(pakshion) → (paksh)
(malaon) → (mal)
(sevaian) → (sev)
(kahaniyan) → (kahan)

Fig. 13.2 Snapshot of the interface for stemming a paragraph

This interface is being developed for the implementation of Hindi IR. In the stemming module, the user can choose to input either a single word or a file. We have used a list of 172 stop words and 68 suffixes for this algorithm. The stop words are stored in a text file and the suffixes are stored in a Python list.
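A minimal sketch of the longest-match suffix stripping step described above is shown below. The stop word and suffix lists are placeholders, since the actual Devanagari lists used in the study are not reproduced here, and the character filtering is only illustrative.

```python
import re

# Placeholders: the study uses 172 Hindi stop words and 68 Devanagari suffixes,
# which are not reproduced in this sketch.
STOP_WORDS = {"..."}
SUFFIXES = sorted({"..."}, key=len, reverse=True)   # try longest suffixes first

def stem(word):
    # strip the longest matching suffix, if any, leaving a non-empty stem
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

def stem_text(text):
    # remove special symbols, drop stop words, then stem each remaining token
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```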



13.5 Analysis and Conclusion The stemming algorithm we have implemented works efficiently, and its accuracy has been tested using the Leipzig Corpus [12], which contains 1M Hindi sentences with 230,931 unique items. We have tested our algorithm in two different scenarios. First, we tested it as an independent stemming module, where it shows an accuracy of 71%. In the second case, we tested it as part of the indexing module of IR, where it helps to reduce the index size by approximately 26%. By adding more suffixes to the suffix list and combining further approaches, its performance can be improved considerably. This stemming algorithm will thus be helpful in the field of Information Retrieval and can act as a basic resource in the field of language research [13]. In the future, this work will be extended by combining a suitable additional approach to develop a robust Hindi stemmer and address the problem of low recall and precision in Hindi retrieval.

References 1. Sharma, A., Kumar, R., Mansotra, V.: Proposed stemming algorithm for Hindi information retrieval. Int. J. Innov. Res. Comput. Commun. Eng. (An ISO Certif. Organ.) 3297(6), 11449–11455 (2016) 2. Estahbanati, S., Javidan, R., Nikkhah, M.: A new multi-phase algorithm for stemming in Farsi language based on morphology. Int. J. Comput. Theory Eng. 3(5), 623–627 (2011) 3. Giridhar, N.S., Prema, K.V., Subba Reddy, N.V.: A prospective study of stemming algorithms for web text mining. GANPAT Univ. J. Eng. Technol. 1(1), 28–34 (2011) 4. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980) 5. Mishra, U., Prakash, C.: MAULIK: an effective stemmer for Hindi language. Int. J. Comput. Sci. Eng. 4(05), 711–717 (2012) 6. Ramanathan, A., Rao, D.D.: A lightweight stemmer for Hindi. In: Workshop on Computational Linguistics for South-Asian Languages, EACL (2003) 7. Kumar, D., Rana, P.: Design and development of a stemmer for Punjabi. Int. J. Comput. Appl. 11(12), 18–23 (2010) 8. Gupta, V.: Hindi rule based stemmer for nouns. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 4(1) (2014). ISSN: 2277-128X 9. Shahid Husain, M.: An unsupervised approach to develop stemmer. Int. J. Nat. Lang. Comput. 1(2), 15–23 (2012) 10. Paul, S., Tandon, M., Joshi, N., Mathur, I.: Design of a rule based Hindi lemmatizer, pp. 67–74 (2013) 11. Rastogi, M., Khanna, P.: Development of morphological analyzer for Bangla. Int. J. Comput. Appl. 95(17), 1–5 (2014) 12. Eckart, T., Quasthoff, U.: Statistical corpus and language comparison on comparable corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds.) Building and Using Comparable Corpora. Springer, Heidelberg (2013) 13. Hafer, M., Weiss, S.: Word segmentation by letter successor varieties. Inf. Storage Retr. 10, 371–385 (1974)

Chapter 14

Research on the Development Trend of Ship Integrated Power System Based on Patent Analysis Rongying Zhao, Danyang Li and Xinlai Li

Abstract By studying the status of foreign patents on the ship integrated power system, we can understand the technical strength and invention dynamics of other countries in this field and clarify the current technical themes and core technologies, so as to accumulate experience and provide a reference for the development of ship technology in China. This paper demonstrates the development trend from the perspectives of the technology development life cycle and the direction of research and development, using methodologies and tools such as social network clustering, the technology life cycle S curve, and visual analysis. The direction of research and development is illustrated from three aspects: technology concentration, industry concentration, and regional diffusion. The foreign ship integrated power system has entered a phase of rapid growth, and Japan and South Korea are very competitive in this domain.

14.1 Introduction
The ship integrated power system (IPS) combines the traditionally separate propulsion power and electric power systems of a ship, achieving unified supply, distribution, use, and management of the whole ship's energy: a single power network supplies the propulsion load, pulse loads, communication, navigation, and daily-service equipment in the form of electric energy [1, 2]. Since the late 1980s, electric propulsion has been used in more than 30% of newly built passenger ships, ferries, and icebreakers. In terms of military ships, the Type 45 Destroyer of the United Kingdom is the first combat ship in the world to adopt an integrated power system. The first aircraft carrier with integrated electric propulsion, the "Elizabeth", was commissioned on December 7, 2017, while the DDG-1000 destroyer commissioned by the United States on October 15, 2016 also adopts an integrated power system.

R. Zhao · D. Li (B) · X. Li Research Center for Chinese Science Evaluation, Wuhan University, Wuhan 430072, China e-mail: [email protected] School of Information Management, Wuhan University, Wuhan 430072, China © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_14




On the one hand, facing the rapid development of new integrated electric propulsion technology for ships around the world and the urgent needs of the Chinese navy, research topics and equipment development aimed at applying the integrated power system to new types of ships have been advancing steadily in the twenty-first century. On the other hand, judging from the foreign ships already in service with the integrated power system, there are still many risks in operating integrated power ships, and system failures are frequent. For example, the integrated electric propulsion (IEP) system exposed a series of reliability and performance problems in the six Type 45 Destroyers successively commissioned by the United Kingdom from July 2009 to September 2013; the whole ship lost power and faults occurred many times [3]. According to statistics, from entering service to the end of 2017, the Type 45 ships suffered several serious power system failures and lost power during navigation or exercises, requiring them to be towed back to base by tugboat for maintenance. The specific situation is shown in Table 14.1. Patents are one of the largest sources of technical information in the world and the legal expression of technological innovation; they are cutting-edge and timely. Researching and using patent information makes it possible to analyze the current state of technological development of an industry and to predict its development trend [4].
Table 14.1 Breakdown of the British Type 45 Destroyer

Service time

Fault time

Fault location

Fault types

D32

July 2009

Nov. 2009

The Atlantic Ocean

Feb. 2012

The Persian Gulf

Power system failure, loss of power

May 2012

Senegal, Africa

D33

June 2010

Feb. 2014

South coast, UK

D34

Apr. 2012

Nov. 2017

The Mediterranean Sea

Propulsion system failure

D35

Apr. 2012

June 2016

The Persian Gulf

Power system failure, loss of power

April 26, 2017

Portsmouth Base

The sparks in the cabin caused a big fire, which was put out in time without causing serious

Nov. 2016

Jurassic Coast, Southern England

Propulsion system failure

D37

Sep. 2013



14.2 Data and Methods
The research object of this paper is the integrated power system of foreign ships, which requires obtaining patent data from many countries around the world except China, especially the leading shipbuilding countries (the United States, Japan, South Korea, etc.). Searching the databases of the individual national patent offices would require cleaning and integrating a large amount of data in different formats, so we chose to search the Derwent Innovations Index (DII). An exhaustive search strategy based on English keywords related to the topic was adopted to improve the comprehensiveness of the results, adjusted to the search characteristics of the database. As of April 12, 2018, a total of 1,773 records had been retrieved; 1,349 remained after deduplication. Although the exhaustive strategy ensured good coverage of the retrieval, its precision was limited. We therefore took all DC classification numbers, formed their co-occurrence matrix, and conducted an aggregation subgroup analysis to obtain the clusters of classification numbers related to the topic, as shown in Fig. 14.1. Based on the co-occurrence matrix, 168 DC classification numbers were divided into eight clusters. DC classification numbers with a frequency greater than 50, mostly concentrated in three regions of the network, were selected for further analysis. We consider the classification numbers of these three regions to be related to the research topic of this paper, 48 DC classification numbers in total. To improve the precision of the technical analysis, the deduplicated data were screened according to these classification numbers, and 738 records with strongly related classification numbers were finally obtained.

Fig. 14.1 Aggregation subgroup analysis of patent DC classification number of integrated power system



The patent data filed at the China Patent Office were then removed, and the remaining 453 records were used for the following analysis of the development trend of the patent technology.
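A sketch of how such a classification co-occurrence matrix can be assembled before the subgroup clustering is given below. The per-record lists of DC codes are placeholders, and the actual clustering and visualization in the paper were carried out with social network analysis tools.

```python
from itertools import combinations
from collections import Counter

# One list of DC classification codes per retrieved patent record (placeholder values)
patent_dc_codes = [["A1", "B2"], ["A1", "B2", "C3"], ["C3", "A1"]]

pair_counts = Counter()
for codes in patent_dc_codes:
    for a, b in combinations(sorted(set(codes)), 2):
        pair_counts[(a, b)] += 1                      # two DC codes assigned to the same patent

codes = sorted({c for pair in pair_counts for c in pair})
index = {c: i for i, c in enumerate(codes)}
matrix = [[0] * len(codes) for _ in codes]
for (a, b), n in pair_counts.items():
    matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = n   # symmetric co-occurrence matrix
```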

14.3 Results and Discussion
14.3.1 Life Cycle Analysis of Patent Technology Development of Foreign Ship Integrated Power System
Technology development generally follows an S curve. By fitting the S curve, the life cycle values of the 453 strongly related records can be calculated, the current growth stage of this technology in foreign countries can be judged, and technical predictions can be made [5]. Here, this paper uses Loglet Lab 4 and the logistic model equation y = l/(1 + αe^(−βT)), and draws the life cycle from the cumulative number of patents and the year of first appearance, to determine the development stages of the life cycle in this field [6]. The bubble distribution in Fig. 14.2 is the scatter diagram of the cumulative number of patents against the year, and the curve is the fitted logistic curve; the two are basically consistent. The saturation point, turning point, and growth time are 1152, 2022, and 42.6, respectively [7, 8]. Substituting these three values into the formula gives t1 = 1990, t2 = 2011, t3 = 2032, and t4 = 2053.
Fig. 14.2 Fitting diagram of technology development life cycle
The first patent appeared in 1975, so the following can be concluded: 1975–1989 was the embryonic stage, the initial stage of the technology. There were few researchers in this stage, the number of patents was small and unstable, and



the research results were few. The period from 1990 to 2010 was a period of slow growth. During this period, the ship integrated power system attracted more and more attention, the number of patents began to increase slowly, and the number of participants grew. It is predicted that 2011–2031 will be a period of rapid growth, in which the integrated power system may become a research hotspot in the ship domain; the number of patents increases rapidly and the upward trend is obvious. At present, the patent activity is probably in the early part of this rapid growth period. It is further predicted that 2032–2053 will be the mature period of the technology: social recognition will be high, the growth of the cumulative patent count will begin to slow down, technological innovation will become difficult, the core technology will be concentrated in the hands of a few patentees, and it will no longer be attractive for new enterprises to enter this field. The cumulative number of patents is expected to peak at 1,152 around 2053, before the technology enters a decline [9].
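The S-curve fit described above can be reproduced with a standard nonlinear least-squares routine; the sketch below assumes SciPy and uses illustrative placeholder values rather than the paper's data (the paper itself used Loglet Lab 4).

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, l, alpha, beta):
    # y = l / (1 + alpha * exp(-beta * t)), with l the saturation level
    return l / (1.0 + alpha * np.exp(-beta * t))

# Illustrative placeholder series (not the paper's data): year -> cumulative patent count
years = np.array([1975, 1985, 1995, 2005, 2015], dtype=float)
cumulative = np.array([3, 20, 80, 250, 600], dtype=float)

t = years - years.min()                       # shift the origin for numerical stability
params, _ = curve_fit(logistic, t, cumulative,
                      p0=[cumulative.max() * 2, 10.0, 0.1], maxfev=10000)
l, alpha, beta = params
turning_year = years.min() + np.log(alpha) / beta   # inflection point of the fitted curve
print("saturation level:", l, "turning point year:", turning_year)
```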

14.3.2 Analysis of Patent Technology Research and Development Direction of Foreign Ship Integrated Power System
With the help of the patent technology development life cycle, the strongly related records on the foreign ship integrated power system (453 records) are divided into three development stages: the budding period (1975–1989), the slow growth period (1990–2010), and the fast growth period (2011–2018). Trend analysis charts and regional diffusion maps of patent concentration are drawn from three aspects: technology concentration (IPC classification numbers), industry concentration (patentees), and regional diffusion (country/region of patent application). The development level of the integrated power system in each stage is explored through inflection point analysis and trend line analysis [10, 11].
Technology concentration analysis. The IPC classification numbers of the patents in each stage were counted, and the first five technology branches (IPC classification numbers) of each stage were used to calculate the technology concentration degree. After sorting, Fig. 14.3 was obtained [12].
Fig. 14.3 Technology concentration analysis
As shown in Fig. 14.3, from point A to point B there is a slowly declining trend. This is because, in the initial stage, the exploration of a new field tends to focus on a few major issues and the research areas are relatively concentrated, whereas in the slow growth stage the new field begins to receive extensive attention and the number of patent applications increases. From point B to point C, there is an extremely rapid rise, because in the period of rapid growth the field of the integrated power system gradually became a research hotspot and the core technology branches gradually formed; the newly emerging technology branches are scattered and have not yet reached a certain scale, so the degree of technology concentration is greatly improved.
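The paper does not give an explicit formula for the concentration degree; one common definition, the share of all patents accounted for by the top-k classes (or patentees), can be sketched as follows for illustration only.

```python
from collections import Counter

def concentration(class_labels, k=5):
    """Share of all patents that fall in the k most frequent classes."""
    counts = Counter(class_labels)
    top_k = sum(n for _, n in counts.most_common(k))
    return top_k / sum(counts.values())

# e.g. concentration(ipc_codes_of_one_stage, k=5) for each development stage (hypothetical input)
```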



Analysis of concentration degree of the patentee. The patentees were sorted for the three stages: the budding stage (1975–1989), the slow growth stage (1990–2010), and the fast growth stage (2011–2018). By calculating the concentration degree of the first six patentees in each stage [13], Fig. 14.4 was obtained. As shown in Fig. 14.4, from point A to point B the concentration rises almost linearly, because the number of patentees in the initial stage was relatively small, the patentees ranking high in frequency grew rapidly, and the concentration of patentees kept increasing. From point B to point C there is a slowly rising trend, because in the period of rapid growth the industry tends to stabilize and some patentees, through their early development, have formed competitive advantages and gradually established a dominant position.
Analysis of concentration degree of dominant patentees. For the patentees with an outstanding concentration degree, a separate analysis was made for each period, giving Fig. 14.5. As can be clearly seen from that figure, in terms of integrated power system patent applications, Daewoo Shipbuilding & Marine Engineering (DEWO-C) and Hyundai (HHIH-C) showed a sharp increase in the rapid growth period and took first and third place, so they can be regarded as rising stars in this field. Fuji Electric (FJIE-C), Nishishiba Electric Co., Ltd. (NISH, non-standard code), Siemens (SIEI-C), and Samsung Heavy Industries (SMSU-C) maintained a steady growth trend across the three periods, so these four companies show steady development of integrated power system technology.

Fig. 14.4 Analysis of concentration degree of the patentee



Fig. 14.5 Concentration of representative patentees in each period

Different from the development trend of the above companies, the patentee concentration of General Electric (GENE-C), Toshiba (TOKE-C), and Mitsubishi Heavy Industries (MITO-C) dropped significantly and remained at a low level during the rapid growth period; it is speculated that these three companies are currently in decline.
Regional diffusion analysis. After sorting the application date of each patent and its patent office, the regional diffusion of the patents was calculated for the three stages: the budding stage of the patented technology (1975–1989), the slow growth stage (1990–2010), and the rapid growth stage (2011–2018). The regional diffusion maps of the three stages were thus generated [14], as shown in Figs. 14.6, 14.7, and 14.8.
In the early stage, as shown in Fig. 14.6, patent applications are mainly concentrated in Europe, where the internal links are especially close, with Germany as the center and diffusion to Britain, France, and Norway. Japan and Korea also have many patents, but their density and number of connections with other countries are lower than in Europe and North America.

Fig. 14.6 Regional diffusion in the germination stage



Fig. 14.7 Regional diffusion in the slow growth stage

Fig. 14.8 Regional diffusion in the rapid growth period

In addition, the former Soviet Union had a certain number of patent applications, and a few of its patents were also filed for protection in other countries. We speculate that, because the embryonic stage coincided with the Cold War, there was a gap between the US-led capitalist camp and the Soviet-led socialist camp in terms of technical and patent cooperation, and some technologies were protected, especially military technologies involved in the shipbuilding industry. In the period of slow development (see Fig. 14.7), with the end of the Cold War and the trend toward a multipolar world, technical exchanges and cooperation between countries became more extensive. Besides Europe and North America, Japan and South Korea cooperated even more closely on patents, so three main patent application regions gradually formed: Europe, represented by Germany and the European Patent Office, North America, and East Asia; the connections between these regions also became noticeably closer and deeper.



In the period of rapid growth (see Fig. 14.8), East Asia, represented by China, Japan, and South Korea, has become the core region for patent applications on technologies related to the ship integrated power system, followed by the United States, the European Patent Office, and the World Intellectual Property Organization. Comparing the three stages shows that the regional diffusion pattern of these patents has gradually shifted from the Atlantic Rim to East Asia, which is consistent with the regional shift of technology and industry in shipbuilding. Japan and South Korea show a strong momentum of development, and their research on the integrated power system of ships reflects the importance they attach to the shipbuilding industry.

14.4 Summary At present, the integrated power system and its related core technologies are in a period of rapid growth. In the next few years, the number of related technology patents will increase rapidly, and more and more enterprises will participate in this field. While the technical field continues to deepen, some advantageous enterprises are advancing simultaneously. The technology field of the integrated power system of foreign ships is highly competitive and has great potential for development. To keep abreast of the latest cutting-edge developments, we should pay close attention to the patented technologies of companies from Japan and South Korea. By studying the patents of European and American enterprises, especially the British General Electric Company and the German Siemens AG, we can explore the research core and technical accumulation of the ship integrated power system in the European and American regions, so as to avoid deviations in the direction of research and development. Acknowledgements This paper is supported by the National Social Science Foundation of China (Grant No. 18ZDA325)

References 1. Lijun, F., Liu, L., Wang, G., Ma, F., Ye, Z., Ji, F., Liu, L.: The research progress of the medium voltage DC integrated power system in China. Chin. J. Ship Res. 11(01), 72–79 (2016) 2. Ma, W.: Electromechanical power conversion technologies in vessel integrated power system. J. Electr. Eng. 10(04), 3–10 (2015)



3. Project Napier sees twin-track plan adopted to resolve Type 45 problems, Warship Technology. https://www.rina.org.uk/2016_Editions3.html, 2016(7/8), 13–16 (2018) 4. Brockhoff, K.: Instruments for patent data analysis in business firms. Technovation 12, 41–58 (1992) 5. Kong, D., Wang, K.: To Assess a technology investment based on patent data——from an empirical analysis on express logist. J. Tech. Econ. Manage. 8, 14–20 (2018) 6. Stratopoulos, T., Charos, E., Chasros, K.: A translog estimation of the average cost function of the steel industry with financial accounting data. Int. Adv. Econ. Res. 6(2), 271–286 (2000) 7. Li, C., Huang, B.: Judgment for the technology life cycle of 3D printing based on S Curve. Sci. Technol. Econ. 2, 91–95 (2017) 8. Shi, M., Wang, J., Zhou, X., Yiyuan, H., Zhao, Z.: An empirical research on the field of wireless charging based on patent life length. High Technol. Lett. 1, 95–102 (2018) 9. Yang, X., Yu, X., Liu, X.: A study on patent intelligence analysis approach based on multi-level perspective theory—taking graphene technology as an example. J. Intell. (8), 64–70, 91 (2018) 10. Zhang, T., Chi, H., Ouyang, Z.: Study on the competitive situation of surgical robot based on patent analysis. China Med. Equip. 7, 119–123 (2018) 11. He, Y.D., Fan, W., Yu, J., Wang, Y., Yu, X.Z.: Technology innovation information research based on patent life cycle. J. Intell. (7), 73–77, 72 (2017) 12. Wang, J., Zhang, Y.: Technology prediction of China’s robot development based on conceptual model of patent analysis. J. Intell. 33(11), 83–87, 45 (2014) 13. Liu, D., Chen, H.: Study on the R&D trends, life cycle, advanced technology and factors of coal liquefaction technology: evidence from patent analysis. J. Intell. 7, 52–58 (2017) 14. Dengke, Yu., Xiong, S., Chen, L.: Space-time laws of emerging agricultural technology diffusion: taking patent technology of agricultural drones for example. J. Anhui Agr. Sci. 26, 5–10 (2018)

Chapter 15

Detection of Data Anomalies in Fog Computing Architectures K. Vidyasankar

Abstract Fog computing architectures provide a new platform for distributed stream processing in Internet of Things (IoT) applications. In a hierarchical infrastructure, processing of stream data arriving from sensors starts at the lowest level edge nodes and proceeds to the intermediate level fog nodes and eventually to the top-level cloud node. The goal is to do as much processing as possible at the lower level nodes and react to unexpected or interesting input values as early as possible. The unexpected values are referred to as anomalies. They may occur due to malfunctioning of sensors, which may be due to accidents or intentional attacks, or changes in the environment. The anomalies must be detected and proper actions must be taken quickly to bring the application to a steady state. We describe a generic framework for detection of data anomalies in a fog hierarchy in this paper. Our framework can be adapted to any application and other fog architectures.

15.1 Introduction Fog computing architectures [2–4, 10] provide a new platform for distributed stream processing in Internet of Things (IoT) applications. The architecture consists of several levels. In a hierarchical infrastructure, processing of stream data arriving from sensors starts at the lowest level edge nodes and proceeds to the intermediate level fog nodes and eventually to the top-level cloud node. IoT applications span many day-to-day functions, improving their effectiveness and efficiency. Examples are traffic light control, parking management, electrical power distribution, supply chain management, weather forecasting, firefighting, home climate control, and patient health care [1, 5, 7, 12], to name a few. Two core characteristics of IoT applications are (i) monitoring (the state of the environment through the processing of the sensor values) and (ii) possible actuations (reacting to some unexpected or interesting values observed while monitoring). The unexpected K. Vidyasankar (B) Department of Computer Science, Memorial University, St. John’s, NF A1B 3X5, Canada e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_15




values are referred to as anomalies. They may occur for various reasons. Source input data are produced by various sensor devices. Most sensors are inexpensive hardware with limited energy, storage, processing, and communication capabilities. They are deployed in various harsh environments. The probability of their producing anomalous data is high. Malfunctioning sensors may produce anomalous data continually. The same is true for intruder attacked sensors. When the sensors are functioning properly, anomalies could be due to changes in the environment. Many IoT applications are delay-sensitive. It is important that anomalies are detected early to ensure the consistency of the executions, in addition to rectifying anomalous sources and to react to the changes in the environment. Anomaly detection can be done efficiently in the cloud. However, with a very large number of sensors, even billions in some applications, and hence the enormous amount of data they generate, very high bandwidth would be required to send the input data to the cloud for anomaly detection. Edge devices will not have such capability. Further, most devices will not have 24/7 connectivity to the cloud. For minimal latency, anomalies should be detected at the sources themselves, at the edge nodes or the fog nodes at the lower levels of the hierarchy. However, the storage and processing capabilities of those nodes are typically low. Hence the anomaly detection process must be distributed carefully to the nodes in several levels. In this paper, we discuss a framework for doing so. We discuss both the detection of anomalous data and rolling back the computations done with those data if needed. We give an overview of the fog architecture in Sect. 15.2. We discuss anomalies and outline our framework for anomaly detection in Sect. 15.3. An anomaly detection algorithm is given in Sect. 15.4. Some related works are discussed in Sect. 15.5. Section 15.6 concludes the paper.

15.2 Fog Architecture Overview We use the hierarchical fog architecture given in [14]. We consider a fog hierarchy (rooted tree) of n levels, n > 1. First, we illustrate the basic definitions and concepts by taking a single input source, namely, a hierarchy consisting of a simple path of length n − 1. Here, v j will refer to the node in jth level. Then, v1 refers to the edge, vn to the cloud, the intermediate nodes to the fog. The source input batches are input to the edge node and are sequentially numbered. Batch bi refers to the ith batch. Each batch bi is processed in several nodes, starting from level 1. The computation for batch bi will be denoted C(bi ). It is decomposed into subcomputations as C(bi ) = c1,i + c2,i + · · · + cn,i where c j,i is executed at level j. An input batch I n(c j,i ) and an output batch Out (c j,i ) are associated with each c j,i . A local program state p j is associated with the executing node v j . The computation c j,i changes the program state from p j (c j,i−1 ) to p j (c j,i ).



Now we describe the notations and definitions for a tree hierarchy. For simple exposition, we consider the case where all the leaves are at the same distance from the root. The source input batches are processed at level 1. They send their outputs to their parents. The nodes in level j, for j between 2 and n − 1, will process the inputs of their respective children and send output batches to their parents. We consider the case where the source data are generated synchronously and processed in batches at the leaf level nodes. (We note that in many applications, multivariate sensors are deployed [8]. They measure several parameters like temperature, humidity, air pressure, lighting level, etc. and send them simultaneously. This amounts to synchronous arrival of those data.) The set of batches processed at one time constitutes a batch set. The batch sets are indexed sequentially. A batch set with index i is referred to as Bi . In each level j, the computation at a node v j is the combined computations c j (x) for all the source input batches x which are input to v j ’s descendents in a batch set. Then C(Bi ) refer to the computations required for Bi . They are decomposed into c1 (Bi ) + c2 (Bi ) + · · · + cn (Bi ). In general, several computations, for example, C1 (Bi ), C2 (Bi ), etc., may be performed on the batch sets. For simplicity, we will refer to only one computation in this paper. We also note that when a batch contains only one tuple, both the computation and the anomaly detection are with respect to that single input.
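A minimal sketch of this decomposition for the simple-path case is given below: each level j applies its sub-computation c_j to the output of level j−1 and keeps its own program state. The sub-computations themselves are application-specific placeholders.

```python
class LevelNode:
    def __init__(self, level, compute):
        self.level = level          # 1 = edge, n = cloud
        self.compute = compute      # c_j: (state, input batch) -> (new state, output batch)
        self.state = None           # local program state p_j

    def process(self, batch):
        self.state, out = self.compute(self.state, batch)
        return out

def run_pipeline(nodes, source_batch):
    # push one source input batch through levels 1..n
    batch = source_batch
    for node in nodes:
        batch = node.process(batch)
    return batch
```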

15.3 Anomaly Detection 15.3.1 Anomalies We say that there is an anomaly in a data item if its value is unexpected. An anomaly is determined by correlating the observed value with an expected value. Three types of correlations are defined in [11]: Time correlation with the most recent values from the same sensor; Spatial correlation with the values produced by some other, nearby, similar sensors; and Functional correlation imposed by the functional relations between values of different sensors. We note that time correlation with respect to batch sets (sets of values), instead of just batches, can also be done. Also, time correlation of the spatial correlations (of recent batch sets), and similarly, time correlation of the functional correlations can also be done. Thus, an anomaly in a data item may be detected by examining (i) just the value, (ii) a set of recent values, (iii) the values produced by other similar sensors, or (iv) the values of different other parameters observed by other sensors. Expected values can be learned through training sets. Three types of learning are mentioned in [9]: supervised, semi-supervised and unsupervised.



Global and local reference models are proposed in [8] to store the expected values. Both models will be updated periodically, for example, with changes in the environment. Results from the processing of a sequence of batches (Continuous Query segments [13]), that is, from the time correlation, can also be used to update the reference models. We will use the reference model idea in the following.
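As one concrete and deliberately simple instance of a reference model, a per-sensor value range learned from recent normal batches can be kept and checked; the paper leaves the model itself application-specific, so this class is only an illustration.

```python
class RangeReferenceModel:
    """Expected [low, high] range for a sensor value, updated from normal batches."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def check(self, value):
        # returns None if the value is within the expected range,
        # otherwise the deviation from the range (a non-null anomaly description)
        if value < self.low:
            return self.low - value
        if value > self.high:
            return value - self.high
        return None

    def update(self, batch):
        # widen the range toward recently observed (indicative) values
        self.low = min(self.low, min(batch))
        self.high = max(self.high, max(batch))
```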

15.3.2 Preliminaries for Detection Algorithm We note the following. 1. Several (types of) anomalies may exist in a data item. For example, when the data item is temperature in a specific room in a house, anomalies could be deviations from a highlow range of (i) its value, (ii) its value compared to the temperatures in adjacent rooms, and (iii) its value in correlation to the house temperature setting in a centralized heating system. 2. We denote an anomaly as Ak , for some index k. For a batch b and anomaly Ak , Ak (b) will be null if b does not have that anomaly and non-null otherwise; in the latter case, the anomaly may be described in some standard way, for example, the amount of deviation from the expected value. We note that b could be a source input batch or a derived batch. We use S A(b) to denote the set of anomalies that may be found in b. 3. An anomaly may be detected at different levels of the fog hierarchy. For example, an anomaly in a source input batch can be detected either at the source itself (level 1) or at a fog node (say, level 2). In the latter case, the batch has to be sent to level 2. (In practice, instead of a batch, a summary of the batch may be sent. In this paper, we do not distinguish the summary from the batch itself.) This can be done if the level 1 node does not have the processing capability to detect the anomaly. Some anomalies may be detected only at certain higher levels. For example, spatial and functional correlations can be done at level 2 or higher where inputs from several sensors are available, whereas some time correlation may be done at level 1 itself. 4. At each level j, let S A j (b) denote the set of anomalies that may be detected in b in that level. We allow S A j (b) and S Al (b), for different j and l, to be nondisjoint. This allows for the same anomaly to be detected at different levels. 5. We define Re f j (Ai , b) to be the reference model against which anomaly Ai in b is checked at level j. Reference models could be individual for each anomaly or course-grained to check several anomalies in the same model. Further, the reference models for the same Ai and b at different levels could be different, for example, different ranges of room temperatures. 6. An anomaly Ak (b) in S A j (b) is said to be harmless for c j (the computation on that batch in that level) if the batch could be used for the computation; otherwise,


it is harmful for c j . We note that Ak (b) may be harmless for c j but harmful for cl (when it is in S Al (b) also). We allow this only when j < l.
7. A set of anomalies is harmless for c j if each anomaly in the set is harmless; otherwise (if at least one anomaly in the set is harmful), it is harmful. If a harmful anomaly is found in a batch, we terminate the computation on that batch.
8. Several computations may be done on a batch at a level. An anomaly may be harmful for some of them and harmless for others. As stated earlier, we focus on only one computation in this paper.
9. A computation c uses several input batches In1 (c), In2 (c), . . . , Inm (c) and produces an output batch Out (c). An anomaly Ak in Out (c) may be due to (different types of) anomalies in some of the input batches. We define a mapping µ from Ak (Out (c)) to a set of anomalies in each input batch. If the set is null for each input batch, it means that Ak (Out (c)) is not due to any anomaly in any input batch. (This might mean that the computing node is faulty.)
10. We say that an anomaly Ak (b) in level j is indicative if the reference model Ref j (Ak , b) needs to be updated, and isolated otherwise. An anomaly due to a fault would be isolated whereas one due to some changes in the environment is likely to be indicative. An anomaly in a batch may caution (the possibility of) that anomaly in some subsequent batches also. If so, it is called a consequential anomaly; if not, it is called singular.

15.3.3 Detection Procedure We consider the processing of a batch set Bi . At each level, at each node, both anomaly detection and the computation defined in that level are done. The algorithm has the following steps. 1. Processing starts at level 1. 2. At each level j, at a node vq , the following is done. Denote the computation at that node as cq . Let the input batches for cq be I n 1 , I n 2 , . . . , I n m , omitting the reference to cq for brevity. For each I n l , S A j (I n l ) are checked. To check anomaly Ak (I n l ), Re f j (Ak , I n l ) is used. 3. If a harmful anomaly is found in any of the input batches, the computation cq is not done; otherwise, cq is performed. Then, the anomalies in the output batch, namely, S A j (Out (cq )), are checked. If a harmful anomaly is found in Out (cq ), then it is not forwarded to the next level and the computation of Bi is terminated after cq . Further, for each anomaly A p found in Out (cq ), µ(A p , vq ) is applied to get the anomalies in each of the input batches and the corresponding information is sent to the respective children as Global Anomaly Notification (GAN). In addition, the reference models, in that level, of the output and input batches are updated as needed.



4. If no harmful anomaly is found in S A j (Out (cq )), Out (cq ) is forwarded to the next level. Suppose the computation succeeds at every level and hence stops at level n, and no harmful anomaly is found at any level. This information is sent as GAN to all the children. 5. At any level, when GAN is received from the parent, it is updated with the status of the anomaly sets in that level and sent to the children. The local reference models are also updated. If the received GAN contains a harmful anomaly, then the computation in that level is rolled back. 6. We note that the computation at a node is not performed or is rolled back only when there is a harmful anomaly in at least one of the input batches. We assume that a harmless anomaly in the output batch will not map to a harmful anomaly in any input batch. Therefore, if the computation at a node is not rolled back, then the computations in all descendants of that node will not be rolled back. Thus, for any batch, a prefix of its computation is done.
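A sketch of the per-node step of this procedure is given below. The anomaly checks, the node computation, and the mapping of output anomalies back to input anomalies are all application-specific placeholders (hypothetical methods of a node object), not part of the framework itself.

```python
def process_at_node(node, in_batches):
    """One node's step: check inputs, compute, check output, forward or abort."""
    # step 2: check the anomalies detectable at this level in every input batch
    found = {i: node.check_anomalies(b) for i, b in enumerate(in_batches)}
    if any(node.is_harmful(a) for anomalies in found.values() for a in anomalies):
        return None, found                      # step 3: computation not done; notify children
    out = node.compute(in_batches)              # step 3: perform c_q
    out_anomalies = node.check_anomalies(out)
    if any(node.is_harmful(a) for a in out_anomalies):
        # map output anomalies back to input anomalies (the mapping mu) and stop here
        return None, node.map_to_inputs(out_anomalies, in_batches)
    return out, found                           # step 4: forward Out(c_q) to the parent
```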

15.4 Computation Algorithm In this section, we describe an algorithm for anomaly detection along with computation for the case of a simple path hierarchy. At each level, there is a single node. The computation has one input batch and one output batch. For source input batch bi , the input batch at level j is b j,i . Slight changes are made in the procedure described in the last section as follows. 1. At each level, the anomalies are checked for the input batch alone. (Checking for the output batch will be done in the parent level where it becomes the input batch.) 2. Anomalies in an output batch are assumed to be mapped as such to the anomalies in the corresponding input batch and so the mapping µ is not mentioned explicitly. 3. Global Anomaly Notification is implemented as follows: – AS j (b j,i ) is the status of the anomalies in b j,i at level j. – C AS j (b j,i ) is the cumulative status of the anomalies in b j,i up to level j; C AS j (b j,i ) is C AS j (b j−1,i ) ∪ AS j (b j,i ). – F AS(bi ) is the final status of the anomalies in bi . – Procedure Check Anomaly j : (i) S A j (I n(c j,i )) is checked, and (ii) the anomalies are recorded in AS j (b j,i ) and added to C AS j (b j,i ). – If C AS j (b j,i ) is harmless, then c j,i is done and Out (c j,i ) and C AS j (b j,i ) are sent to the parent. – If C AS j (c j,i ) is harmful, then c j,i is not done, and C AS j (c j,i ) is assigned to F AS(bi ). F AS(b j,i ) and abort message are sent as GAN to the child. – We say that AS (similarly C AS, F AS), with appropriate indexes, is harmful for the respective levels if any anomaly in that set is harmful for the computation in that level.



– If no harmful anomalies are found and the computations are done at every level, then the assignment of C AS to F AS is done in the last level. – Updates to the reference models at each level are done when F AS is received in that level. – Thus, GAN is split into C AS while the processing goes up the levels and F AS while the processing goes down the levels. Further, updates to the reference models occur in the downward phase in the algorithm instead of in the upward phase in the framework. Additionally, the reference models are not updated when some harmful anomalies are detected. Algorithm 1 (Pessimistic Execution) – Processing starts for input batch bi at level 1. – At level j, 1 ≤ j < n, Check Anomaly j . success: If C AS j (b j,i ) is not harmful, c j,i is executed. If the execution is successful, • (Out (c j,i ), C AS j (b j,i )) is sent to the parent level j + 1 and Global Commit Notification (GCN) is waited for. This is the commit or abort notification for C(bi ) from the parent. • if (commit,F AS(bi )) is received from the parent, c j,i is committed and (commit,F AS(bi )) is sent to the child at level j − 1, if j > 1. The reference models of the indicative anomalies in S A j (b j,i ), if any, are updated. • if (abort,F AS(bi )) is received from the parent, if F AS(bi ) is harmful for level j, then c j,i is aborted and (abort, F AS(bi )) is sent to the child at level j − 1, if j > 1. If F AS(bi ) is not harmful, then c j,i is committed and (commit, F AS(bi )) is sent to the child at level j − 1, if j > 1, and the reference models of the indicative anomalies in S A j (b j,i ), if any, are updated. failure: If C AS j (b j,i ) is harmful (in which case, c j,i is not executed) or the execution of c j,i is unsuccessful (in which case, c j,i is aborted), C AS j (b j,i ) is assigned to F AS(bi ) and (abort,F AS(bi )) is sent to the child at level j − 1, if j > 1. – At level n, Check Anomaly n and assign C AS n (bn,i ) to F AS(bi ). If C AS n (bn,i ) is not harmful, then cn,i is executed. If the execution is successful, then cn,i is committed. The reference models of the indicative anomalies in S An (bn,i ), if any, are updated and (commit, F AS(bi )) is sent to the child at level n − 1. If C AS n (bn,i ) is harmful or execution of cn,i is unsuccessful, then cn,i is aborted and (abort,F AS(bi )) is sent to the child at level n − 1. – At level 1, after committing or aborting c1,i , the execution of c1,i+1 for the next batch bi+1 is started. In Algorithm 1, each batch is processed one at a time until the computations on that batch are committed or aborted at all levels. This delays the processing of the subsequent source input batches. An optimistic algorithm that alleviates this delay is outlined in the following. The anomaly detection steps are the same as in Algorithm 1. Therefore, they are omitted. The algorithm is from [13].


Algorithm 2 (Optimistic Execution with Reset)

Modifications to Algorithm 1:

– At level j, 1 ≤ j < n,
  success: if the execution of c_{j,i} is successful, then
    • Out(c_{j,i}) is sent to the parent at level j + 1, if j < n, as before, but processing of c_{j,i+1} continues;
    • if commit is received from the parent at level j + 1, c_{j,i} is committed and GCN is sent to the child at level j − 1, if j > 1; and
    • if abort is received from the parent at level j + 1, then, if batches with indices i + 1, . . . , k have been processed in that level, that is, c_{j,i+1}, . . . , c_{j,k} have been executed after c_{j,i}, the program state is reset to p_j(c_{j,i−1}) (effectively aborting c_{j,i}, c_{j,i+1}, . . . , c_{j,k}).
  failure: Resetting is done as above.

Here, resetting amounts to additionally aborting those batches which were processed between the original execution and the abort of this batch. Instead of resetting, a compensating computation c′_{j,i} can also be done, or the rollback can simply be ignored [13]. We note that when the abort of C(b_i) is due to a harmful anomaly in b_i, the above reset approach amounts to assuming similar anomalies in the subsequent batches also, that is, the anomaly is consequential. If the anomaly is singular, then the compensation or ignore options may be appropriate.
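To make the control flow of the pessimistic scheme (Algorithm 1) concrete, a minimal sequential simulation is sketched below. The Level class, its anomaly rule, and the batch contents are hypothetical placeholders for the level-specific computations described above, not the framework's actual implementation.

class Level:
    def __init__(self, name, harmful_tags=()):
        self.name = name
        self.harmful_tags = set(harmful_tags)   # anomaly tags harmful at this level
        self.model_updates = 0                  # reference-model update counter

    def check_anomaly(self, batch):             # CheckAnomaly_j: returns AS_j
        return set(batch.get("anomalies", ()))

    def harmful(self, anomalies):               # any anomaly harmful for this level?
        return bool(self.harmful_tags & anomalies)

    def execute(self, batch):                   # run c_{j,i}; Out(c_{j,i}) feeds the parent
        return True, {"data": batch["data"], "anomalies": batch.get("anomalies", ())}

    def update_models(self, fas):               # update reference models on commit
        self.model_updates += 1


def pessimistic_run(batch, levels):
    """Upward phase: check anomalies and execute level by level;
    downward phase: propagate (commit/abort, FAS) back to the children."""
    cas, executed = set(), []
    for level in levels:
        cas |= level.check_anomaly(batch)                  # CAS accumulates AS_j
        if level.harmful(cas):                             # failure: abort downward
            return propagate_down(executed, fas=cas, commit=False)
        ok, batch = level.execute(batch)
        if not ok:
            return propagate_down(executed, fas=cas, commit=False)
        executed.append(level)
    return propagate_down(executed, fas=cas, commit=True)  # level n: FAS := CAS


def propagate_down(executed, fas, commit):
    for level in reversed(executed):
        if commit or not level.harmful(fas):               # commit c_{j,i} at this level
            level.update_models(fas)
        # otherwise c_{j,i} is aborted at this level
    return fas


levels = [Level("edge"), Level("fog", harmful_tags={"sensor-fault"}), Level("cloud")]
fas = pessimistic_run({"data": [1, 2, 3], "anomalies": ["spike"]}, levels)
print(fas, [lv.model_updates for lv in levels])   # {'spike'} [1, 1, 1]: harmless, all commit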

15.5 Related Work

Anomaly detection in sensor networks has been discussed in several papers. Five requirements are listed in [8] for efficient and effective anomaly detection: (i) reduction of data; (ii) online detection; (iii) distributed detection; (iv) adaptive detection; and (v) correlation exploitation. The framework described in this paper either satisfies or is amenable to all these requirements. Anomaly detection in cyber-physical systems is discussed in [11]. The papers [6, 9] deal with anomaly detection in fog architectures. A distributed mechanism where anomalies are detected in fog nodes instead of in edge nodes is analyzed in [6]. Several other papers deal with anomaly detection in specific applications, describing the relevant detection algorithms. In most of them, the detection is done in the cloud. Adapting them to fog architectures is challenging. In this paper, we have given a generic framework for a fog hierarchy which can be tailored to any application.


15.6 Conclusion

This paper focuses on detection of data anomalies. Anomalies in multiple batches from a source may indicate the source failure, a hardware anomaly. Out-of-order arrival of batches, from one level to another, may indicate a network anomaly. Thus, data anomaly detection will help in the detection of other anomalies too. We have considered a hierarchy (tree) in this paper. Several other fog architectures have been described in the literature, for example, clustered, vehicular, and smartphone architectures in [3]. Our framework can easily be adapted to the other architectures.

Acknowledgements This work is supported in part by an NSERC (Natural Sciences and Engineering Research Council of Canada) Discovery Grant 3182.

References

1. Atif, Y., Ding, J., Jeusfeld, M.A.: Internet of things approach to cloud-based smart car parking. In: The 7th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2016), 19–22 Sept 2016, London, United Kingdom, pp. 193–198 (2016). https://doi.org/10.1016/j.procs.2016.09.031
2. Bonomi, F., Milito, R., Zhu, J., Addepalli, S.: Fog computing and its role in the internet of things. In: Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, MCC'12, pp. 13–16. ACM, New York, NY, USA (2012). https://doi.org/10.1145/2342509.2342513
3. Chang, C., Srirama, S.N., Buyya, R.: Indie fog: an efficient fog-computing infrastructure for the internet of things. Computer 50(9), 92–98 (2017)
4. Dastjerdi, A.V., Buyya, R.: Fog computing: helping the internet of things realize its potential. Computer 49(8), 112–116 (2016)
5. Jones, A., Subrahmanian, E., Hamins, A., Grant, C.: Humans' critical role in smart systems: a smart firefighting example. IEEE Internet Comput. 19(3), 28–31 (2015). https://doi.org/10.1109/MIC.2015.54
6. Lyu, L., Jin, J., Rajasegarar, S., He, X., Palaniswami, M.: Fog-empowered anomaly detection in internet of things using hyperellipsoidal clustering. IEEE Internet Things J. 4(5), 1174–1184 (2017)
7. Porter, M.E., Heppelmann, J.E.: How smart, connected products are transforming competition. Harvard Bus. Rev. 1–23 (2014)
8. Rassam, M.A., Zainal, A., Maarof, M.A.: Advancements of data anomaly detection research in wireless sensor networks: a survey and open issues. Sensors 13, 10087–10122 (2013)
9. Santos, J., Leroux, P., Wauters, T., Volckaert, B., Turck, F.D.: Anomaly detection for smart city applications over 5G low power wide area networks. In: Proceedings of NOMS 2018—IEEE/IFIP Network Operations and Management Symposium. IEEE Xplore (2018)
10. Satyanarayanan, M.: The emergence of edge computing. Computer 50(1), 30–39 (2017)
11. Sebestyen, G., Hangan, A.: Anomaly detection techniques in cyber-physical systems. Acta Univ. Sapientiae, Informatica 9(2), 101–118 (2017)
12. Tang, B., Chen, Z., Hefferman, G., Wei, T., He, H., Yang, Q.: A hierarchical distributed fog computing architecture for big data analysis in smart cities. In: Proceedings of the ASE BigData & SocialInformatics 2015, ASE BD&SI'15, pp. 28:1–28:6. ACM, New York, NY, USA (2015). https://doi.org/10.1145/2818869.2818898


13. Vidyasankar, K.: Consistency of continuous queries in fog computing. In: Proceedings of the 9th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2018), Procedia Computer Science. Elsevier (2018)
14. Vidyasankar, K.: Distributing computations in fog architectures. In: TOPIC'18 Proceedings. Association for Computing Machinery (2018). https://doi.org/10.1145/3229774.3229775

Chapter 16

Cloud Data for Marketing in Tourism Sector

Pritee Parwekar and Gunjan Gupta

Abstract Internet of things is one of the emerging technologies which will affect different application domains and will completely revolutionize the techniques incorporated by marketers and businesses. IoT connects millions of objects and integrates smart devices like RFID, sensors, and actuators, delivering real-time, context-based data which needs to be extracted and analyzed to create contextually effective marketing strategies to impact consumers at an unprecedented level. An enormous amount of data is stored on the cloud which can be used to market a product. Cloud data and the proliferation of IoT devices will disrupt the working of multiple industries and, prominently, digital marketing. This paper discusses the impact of cloud storage and IoT on digital marketing and, specifically, how the information can be leveraged to interpret the behavior of users in terms of traveling, shopping, social behavior, professional background, interests, and a lot of other personal information. This paper proposes a model which will help to channelize the cloud data and the data from IoT to digital marketing, specifically for the tourism sector.

16.1 Introduction

Internet of Things (IoT) is one of the trending [1] technologies affecting every industry and application. RFID, sensors, and actuators provide context-based real-time data which can be put to use by every industry such as agriculture, healthcare, security surveillance, transportation, manufacturing, marketing and retail business [2], e-commerce [3], tourism, and many more. The remote access of IoT devices provides us valuable data from smart objects, giving us an insight into the operation of the device in the customers' environment [4]. This data is stored in the cloud-based


service providers' platform [5]. Cloud technology and IoT unlock new possibilities and potential for marketing managers, and they need to integrate this technology in a seamless, unobtrusive manner into their marketing practices [6] to influence and reach customers.

The Internet is an unobtrusive and clandestine collector of information and the major contributor to the scores of social networking sites and free apps that have sprung up after the advent of smartphones. Social networking sites like Facebook, Twitter, LinkedIn, etc., capture the entire life of an individual in a variety of media formats. Nothing is free in this world, and the Internet is a classic illustration of this fact. Everything finally boils down to commerce. Not all this commerce is bad; it is mutually beneficial if handled responsibly. This is the tenet of digital marketing in today's times. Marketing has narrowed down from "generic" to "specific" for a focused and targeted reach. This paper brings out the ethical means of leveraging technology to enable and make more efficient use of digital marketing.

The Internet, which today we almost synonymously call the cloud, is an ocean of information in a variety of digital formats. The data on the cloud is not structured in a traditional tabular format usable by an RDBMS, with fields and primary keys. Big data analytics, which is able to handle a variety of data formats and make sense out of them, is the key enabler. Data available on the Internet cloud is in the form of traditional text, location information, pictures, videos, blogs, online behavior, time spent on the Internet, geotagging, travel plans, professional information, spending behavior, choices and tastes in food and clothing, and views on social and political aspects. All this information, enabled with the power of artificial intelligence, can map the psyche of an individual or a group; this data can be obtained from different sources on the Internet. The data can be sourced from cloud platforms, IoT devices, and social networking platforms, and it is growing exponentially with every passing day. Data from social networking sites, web applications, etc., also has a variety of information which, by using proper techniques, needs to be interpreted and analyzed so as to understand the behavior of the consumer toward his/her purchase decision.

The paper is organized as follows: Sect. 16.2 presents a consumer behavioral study for the tourism industry. Section 16.3 deals with the different sources of data from where the information can be obtained. Section 16.4 explains the proposed model for capturing marketing data. Section 16.5 deals with the results. Finally, Sect. 16.6 draws the conclusion.

16.2 Consumer Behavioral Study for the Tourism Industry

Consumer characteristics have a significant role in the choice of mode of information gathering prior to finalizing travel plans. The primary factors are the age and financial status of the individual. An online survey was conducted with more than 500 Indian respondents and the results are shown in the subsequent graphs. Due to high Internet penetration levels and experience with online purchasing, target respondents belonging to metro cities in India were chosen for this study. The target respondents are from


the middle and upper middle class. Though the study has been done in India, the same can be extended to any country. This is more so because the Indian populace is extremely diverse, belonging to many religions, languages, and ethnicities and, therefore, provides an extremely heterogeneous study group. A few questions where a response was sought were as follows:

1. What is your income? (INR)
2. How much time per day do you spend on social media platforms (Facebook, LinkedIn, Twitter, WhatsApp, etc.)?
3. Do you visit social networking sites as the main source of travel ideas and inspiration to plan a trip?
4. Have you been influenced by social media to the extent that you even changed your original travel plan?
5. Which social media platform do you prefer for finding information about a destination?
6. Do you go directly to the websites of the destination which you are thinking of visiting?
7. Do you feel recommendations from your family and friends will affect your travel plans?
8. Do you read independently published reviews in magazines and newspapers of the destination you are thinking of visiting?
9. Do you read travel-related magazines or websites?
10. Do you read the travel section of the newspaper?
11. Will you read travel reviews and opinions from travelers around the world on different travel review websites, online travel agencies, and tour operator sites?
12. Will you research on third-party travel websites like MakeMyTrip, Yatra, Cleartrip before planning your trip?
13. Do you conduct a general web search using Google or Yahoo?

The survey responses are summarized in Figs. 16.1, 16.2, and 16.3.

Fig. 16.1 Age influence on usage of social media platforms in travel decision-making


Fig. 16.2 Age influence on usage of search engines for finalizing travel plans

Fig. 16.3 Influence of income group on use of social media for planning travel
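The three figures above summarize cross-tabulations of the survey responses by age and income. Purely as an illustration (the column names, categories, and values below are hypothetical placeholders, not the actual survey data), such breakdowns could be produced as follows:

import pandas as pd

# Hypothetical survey table: one row per respondent.
responses = pd.DataFrame({
    "age_group":      ["18-25", "26-35", "36-50", "26-35", "50+"],
    "income_bracket": ["<5L", "5-10L", "10-20L", "5-10L", ">20L"],
    "uses_social_media_for_travel": [True, True, False, True, False],
})

# Share of respondents in each age group who rely on social media for travel ideas
# (the kind of breakdown shown in Fig. 16.1).
by_age = responses.groupby("age_group")["uses_social_media_for_travel"].mean()

# The same breakdown by income bracket (cf. Fig. 16.3).
by_income = responses.groupby("income_bracket")["uses_social_media_for_travel"].mean()

print(by_age)
print(by_income)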

16.3 Sources of Data

16.3.1 Methods of Sourcing Data from the Internet

The Internet provides an immense wealth of data which can be easily used for marketing. For a firm that has limited resources and a reasonable economic condition, it is prudent to use the data available on the Internet from a variety of online sources. There are many applications, websites, social media sites, commerce sites, and other web media which collect public data on the cloud. The Internet of Things is a buzzing technology that connects billions of everyday physical objects through RFID and sensors. Due to IoT, these physical objects are getting connected to the Internet and social networks and, thus, to other objects and devices across the Internet. From this big resource of data, information can be collected from the Internet/cloud through different sources.

All the above-mentioned sources, and many more which are not in the list, have tremendous information which can be used to target people for marketing. The devices connected to the Internet will not only give information but will also help


to know their everyday lifestyle. Almost 24/7, the wearable and connected devices on the Internet contribute to the information about specific behaviors, products, and services utilized. All this data is presently recorded in an unstructured manner. Therefore, there is a need to categorize the data and give it to the firms so that they can use the correct data suitable for their marketing requirements.

16.4 Model for Marketing Data Capture

IoT can pave a new way for digital marketers; digital marketers should understand how IoT can disrupt their marketing strategies, what risks are involved, and how they should revise and revamp their marketing techniques. In fact, digital marketers should devise a marketing model to exploit the power of IoT and create marketing strategies in such a manner that they can leverage this thriving technology. The data which is already present in the cloud and the data which is generated by IoT can be used to understand the likes/dislikes of customers [7]. IoT connects millions of devices, and these connected devices share information among each other and provide an unparalleled amount of data which can be transferred over the Internet without human intervention. This enormous amount of information about the devices and how the consumer interacts with them gives the marketer new opportunities to understand and analyze the needs of the consumer. All this data will be unstructured. The work in [8] demonstrates that multiple levels of dependencies determine the patterns of innovation among technology adopters and the supply chain members.

Data analytic tools are available even on open-source platforms like Hadoop, which structure largely unstructured multi-media information into a computer-readable format [9]. This data can then be filtered and categorized. Data mining techniques can further filter the data to remove the noise. Theories like game theory and other predictive models can help to predict individual and group behavior. Gray system theory [10] claims to provide a more valid and precise prediction of the understanding of the populace with regard to the current market situation and development prospects in every sector. In this regard, work by [11] on news affecting the price of products to accurately predict the price is also relevant.

The model proposed in this paper is, therefore, as follows: source the data from the cloud −→ run it through analytics −→ weed out noise −→ categorize −→ assign behaviors to individuals and groups −→ use the information to market (Fig. 16.4).
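To make the staged flow of the proposed model concrete, a minimal toy sketch is given below. All records, field names, and the trivial keyword rules are hypothetical placeholders for the analytics, noise-removal, and categorization components summarized in Fig. 16.4; they are not an implementation used in this study.

raw_records = [
    {"user": "u1", "text": "looking for goa beach resorts", "source": "social"},
    {"user": "u2", "text": "asdf1234@@@",                    "source": "social"},  # noise
    {"user": "u3", "text": "compare flight prices to delhi", "source": "search"},
]

def weed_out_noise(records):
    # Stage: drop records that carry no interpretable text (placeholder rule).
    return [r for r in records if any(w.isalpha() and len(w) > 2 for w in r["text"].split())]

def categorize(records):
    # Stage: assign each record to a marketing category via simple keyword rules.
    categories = {}
    for r in records:
        label = "travel" if any(k in r["text"] for k in ("resort", "flight", "hotel")) else "other"
        categories.setdefault(label, []).append(r)
    return categories

def assign_behavior(categories):
    # Stage: derive a coarse behavior profile per user from the categories.
    return {r["user"]: label for label, recs in categories.items() for r in recs}

profiles = assign_behavior(categorize(weed_out_noise(raw_records)))
print(profiles)   # {'u1': 'travel', 'u3': 'travel'} -> candidates for tourism campaigns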

16.4.1 Prediction Theories

Based on the mined information available, the social setup, and the latest geopolitical and financial happenings in the world, predictive theories can be applied to predict future trends at global, local, and individualistic levels [12, 13]. Nowadays,


Fig. 16.4 Model proposed for data marketing

e-commerce retailers are facing fiercer competition on price promotion, in that consumers can easily use a search engine to find another merchant selling an identical product to compare prices. One of the methods is to use public cloud data storage and leverage that information for marketing. The second method is to use the information available on social networking sites like Facebook, Instagram, and many other sites to advertise the product. People can spread the word by mentioning the product/company benefits and giving a link along with the chat/conversation to direct the interested consumer to fetch more details from the website. Blogs can also be useful: by creating a blog, one can not only describe and give reviews about the product but also redirect the targeted audience to the company details. Emails can be sent to users describing the services/products of the company. Customer intentions can be predicted by the use of intelligent methods; for example, customer repurchase intentions can be predicted using machine learning and the artificial bee colony (ABC) algorithm [14].


16.4.2 Marketing

The need of today's marketing is to create hype around the product. Attractive features of the product conveyed through indirect publicity also attract customers. The key is to create excitement about the product well before it is launched, through smart and fast techniques, to reach the targeted audience in the fastest possible time. WhatsApp messages, videos, animations, and anecdotes can be made viral by bringing ingenuity to the content. Firms need to partner with more successful firms for initial awareness of the product in the market. IoT can provide communication channels to support targeted marketing for product owners and enhance customer relationship management [4] and product support, as proposed by Taylor [15].

Excessive advertising of any product may also result in bringing down sales. So, in order to adopt a smart approach, there is a need to know about the data available on the cloud, along with the targeted age group and the Internet sources frequently used by these targeted groups. The younger generation is more comfortable on smartphones, while the middle age group prefers accessing websites through computers. Digital marketing finds its origin in the 1970s in the form of emails. It further progressed into the use of the WWW in the late 1980s and instant messaging services in the 1990s. The millennium brought in social networking sites, and from 2008 smartphones have brought ubiquitous access to the Internet and connected devices for very effective digital marketing. Context-based marketing and behavior-based marketing on online merchant sites are springing up. However, this is done more in an ad hoc manner and results in spamming and deterring the customer. A more effective and unobtrusive method of marketing to the correct customer is the need of the time. Among the different digital marketing platforms and channels that are being utilized to promote and market products, a few are mentioned below.

i. A firm can register itself with Google AdSense and make a variety of advertisements to market its products.
ii. Bulk SMS is one of the most effective ways to reach customers. One can purchase mobile numbers from a telecom company and use cloud services for sending bulk SMS.
iii. Instagram advertisements with celebrities help the product gain popularity. The company can pay one of many Instagram celebrities. Posting product images and new launches will promote the product and attract potential customers, and good reviews will convince them to buy the product.
iv. YouTube ads: Marketers can promote their products through video ads which run before and during actual video plays, and they have to pay a fair amount for these advertisements. This is a very effective and affordable platform for marketers to reach their target audience.
v. Use of a cloud platform for marketing: The company can list its products on an existing cloud platform for promotion. Rajabi [16] has designed an interactive marketing system that acts as a platform for both shopping malls and shoppers, providing details of the availability of the product, the precise place to buy, special discounts, and offers. IoT is used to store real-time information on the server about the availability of the products and what is already sold. Payments can be securely done through NFC.
vi. Facebook and other social sites: Companies can use the existing social sites for increasing brand awareness of the product. They can shoot videos, post interesting images, or showcase endorsements to promote them on social media sites. Ceipidor et al. [17] have devised a social media tool which understands the user's needs and product requirements. This platform also helps users share information on the shops that they visited, pictures of the products that they bought, the brands that they preferred, and, finally and most importantly, their customer experience of these brands and products. This will help their family and friends to make informed decisions while shopping and enhance their shopping experience. The consumers can earn rewards and points for this information, which will entice the customers to utilize this platform.
vii. E-commerce sites can host the advertisement and list the product in the recommended product list to gain popularity among people. According to e-commerce sales 2018 [18], maximum e-commerce happens on Amazon.com, then eBay, followed by Apple and other companies. Such companies may be used for marketing the product. Xu [19] has proposed an architecture for how IoT technology can be integrated into e-commerce marketing in inventory, logistics, and payment. RFID tag stickers on the products provide details of how many products are in the warehouse, how many are in transit, and also the exact location of the product during delivery. Logistics efficiency can be improved and payment can be made more secure using IoT.
viii. Advertisers these days are resorting to sponsored search (paid search), which allows them to include sponsored results based on selected keywords. These ads are often sold via real-time auctions, where advertisers bid on keywords, time, language, geographical location, and other such filters.

All the above-mentioned sources, and many more which are not in the list, have tremendous information which can be used to target people for marketing. The devices connected to the Internet will not only give information but will also help to know everyday lifestyle. From morning till night, all the wearable and connected devices on the Internet contribute information about the specific products and services utilized. All this data is present in an unstructured manner. There is a need to categorize the data and give it to the firms so that they can use the correct data suitable for their marketing requirements. This paper proposes a model which uses the available public data on the cloud and helps the firms with the correct data useful for their marketing.


16.5 Results

A survey was conducted to understand how many people use the cloud data and the data generated by IoT devices to learn about a place/product. From more than 500 responses collected, it has been observed that most of them use the sources/methods shown in Figs. 16.5, 16.6, 16.7, and 16.8. The collected responses were analyzed to understand the impact of

Fig. 16.5 YouTube used for marketing

Fig. 16.6 Facebook used for marketing

Fig. 16.7 Use of email/SMS for marketing


Fig. 16.8 Use of blogs/feedback/article for marketing

social media on travel plans. Different sources of information are used by different people based on their age-group, gender, and income as shown in Figs. 16.1, 16.2, and 16.3.

16.6 Conclusion

Use of data from the cloud is undoubtedly the fastest and most efficient method of marketing in today's digital world. However, while doing so, the approach should be as pervasive as possible so that the marketing media is not ignored by the customer. Data available in a variety of formats on the cloud needs to be filtered for a targeted approach using big data analytics and other data mining tools. Predictive algorithms further help in narrowing down the most optimal target audience which is likely to be converted into customers. The paper has attempted to provide a step-by-step model for data analysis and predictions. While further work is required for a complete automated implementation, the present study could definitely act as a precursor in driving thoughts in this direction. In the future, the authors intend to create a consolidated algorithm which can act as a backbone for travel websites/apps and also provide useful information to downstream service providers like travel agents and hoteliers.

References

1. Whitmore, A., Agarwal, A., Da Xu, L.: The Internet of Things—a survey of topics and trends
2. Đurđević, N., Labus, A., Bogdanović, Z., Despotović-Zrakić, M.: Internet of things in marketing and retail. Int. J. Adv. Comput. Sci. Appl. 6(3)
3. Yao, Y., Yen, B., Yip, A.: Examining the effects of the Internet of Things (IoT) on e-commerce: Alibaba case study
4. Decker, R., Stummer, C.: Marketing management for consumer products in the era of the Internet of Things. Adv. Internet Things 7, 47–70 (2017)
5. Chang, Y., Dong, X., Sun, W.: Influence of characteristics of the Internet of Things on consumer purchase intention
6. Klopper, D.: The possibilities and challenges of the application and integration of the Internet of Things for future marketing practice
7. Persico, V., Pescapé, A., Picariello, A., Sperlí, G.: Benchmarking big data architectures for social networks data processing using public cloud platforms. Futur. Gener. Comput. Syst. 89, 98–109 (2018)
8. Alsaad, A., Mohamad, R., Ismail, N.A.: The contingent role of dependency in predicting the intention to adopt B2B e-commerce. Inf. Technol. Dev., 1–29 (2018)
9. Dasgupta, N.: Practical Big Data Analytics: Hands-on Techniques to Implement Enterprise Analytics and Machine Learning Using Hadoop, Spark, NoSQL and R (2018)
10. Su, Y., Wang, Y., Mi, C.: The forecast of development prospects of China's cross-border e-commerce based on grey system theory. In: 2017 International Conference on Grey Systems and Intelligent Services (GSIS), pp. 182–186. IEEE (2017, August)
11. Tseng, K.K., Lin, R.F.Y., Zhou, H., Kurniajaya, K.J., Li, Q.: Price prediction of e-commerce products through Internet sentiment analysis. Electron. Commer. Res. 18(1), 65–88 (2018)
12. Pappas, I.O., Kourouthanassis, P.E., Giannakos, M.N., Lekakos, G.: The interplay of online shopping motivations and experiential factors on personalized e-commerce: a complexity theory approach. Telemat. Inf. 34(5), 730–742 (2017)
13. Blaise, R., Halloran, M., Muchnick, M.: Mobile commerce competitive advantage: a quantitative study of variables that predict m-commerce purchase intentions. J. Internet Commer. 17(2), 96–114 (2018)
14. Kumar, A., Kabra, G., Mussada, E.K., Dash, M.K., Rana, P.S.: Combined artificial bee colony algorithm and machine learning techniques for prediction of online consumer repurchase intention. Neural Comput. Appl., 1–14 (2017)
15. Taylor, M., Reilly, D., Wren, C.: Internet of things support for marketing activities. J. Strat. Mark., 1–12 (2018)
16. Rajabi, N., Hakim, A.: An intelligent interactive marketing system based-on Internet of Things (IoT). In: 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), pp. 243–247. IEEE (2015)
17. Ceipidor, U.B., Medaglia, C.M., Volpi, V., Moroni, A., Sposato, S., Tamburrano, M.: Design and development of a social shopping experience in the IoT domain: the ShopLovers solution. In: 2011 19th International Conference on Software, Telecommunications and Computer Networks (SoftCOM), pp. 1–5. IEEE (2011)
18. Godin, S.: How to get your ideas to spread (2003). Video, TED. https://www.ted.com/talks/seth_godin_on_sliced_bread
19. Xu, X.: IoT technology research in E-commerce. Inf. Technol. J. 13(16), 2552–2559 (2014)

Chapter 17

Road Travel Time Prediction Method Based on Random Forest Model

Wanchao Song and Yinghua Zhou

Abstract Accurately predicting the travel time of each key road in a certain period of time will help the traffic management department take measures to prevent and reduce traffic congestion. At the same time, it can help to make an optimal travel plan for the traveler based on dynamic traffic information. Consequently, the utilization efficiency of the road can be improved. RF-DBSCAN, a prediction model based on the random forest (RF) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), is proposed. After being trained on historical traffic datasets, the model can predict the road travel time taking into account the regularity of time series, weather factors, road structures, weekends, and holidays. Experiments are carried out and the results show that RF-DBSCAN has higher accuracy compared with the traditional random forest and GBDT (Gradient Boosting Decision Tree).

17.1 Introduction

Road travel time [1] is the average time that all cars take to drive from the upstream to the downstream end of a road in unit time. It is not only an important criterion for traffic efficiency evaluation but also an important data source for Intelligent Transportation Systems (ITS). With the continuous improvement of people's living standards and the rapid development of urbanization, cars are becoming a popular means of transportation. The increasing number of vehicles has also brought huge challenges to urban transportation, resulting in a growing conflict between road resource needs and supplies. At the same time, traffic congestion and traffic safety have attracted extensive concern. It is not wise to rely only on building transport facilities to solve the traffic problems in this era of big data, since the current transportation network is already of large scale.


It is necessary to improve the utilization of roads and to optimize and integrate transportation network resources to relieve the traffic pressure. Thus, the concept of ITS [2] was generated and put into practice. Dynamic travel time [3] is important information for an intelligent transportation system, and it is also a reference factor for road traffic management, road network optimization, and travel planning. How to accurately predict the travel time of each key road has always been a problem in intelligent transportation systems. It is also a hot issue for scholars to study.

The prediction models of road travel time generally include parametric models and nonparametric models. The parametric models consist of the history average model, ARMA, and ARIMA [4]; the nonparametric models consist of Support Vector Regression (SVR) [5], neural network models, some ensemble models, and so on. Xu [6] proposed a road traffic prediction algorithm based on the powerful linear fitting ability of ARIMA and the good real-time performance of the Kalman model. However, since the traffic states, which affect the travel time, are influenced by many environmental factors besides time and space, the ARIMA model, paying attention only to the characteristics of time, is not as good as other models which take more factors into consideration. Kumar [7] uses an artificial neural network (ANN) model to predict short-term traffic flow. The model is trained using the historical traffic datasets of each road section, such as vehicle speed, vehicle density, and day of the week, to achieve good prediction results. However, this model has a long training period with expensive computation and converges slowly when facing a large number of historical traffic data sets. Zhang [8] employs a GBDT method to analyze and model freeway travel time to improve the prediction accuracy and model interpretability.

This study proposes an RF-DBSCAN algorithm to predict travel time by taking into account the regularity of time series, weather factors, road structures, weekends, and holidays. Different from the random forest model, the leaf nodes of each decision tree in the RF-DBSCAN prediction model are clustered by the DBSCAN algorithm, and then the average of the cluster most similar to the test sample is regarded as the prediction result. The DBSCAN algorithm can detect outliers in the leaf nodes of each decision tree, and the decision tree can filter out the outliers when calculating the results. Therefore, RF-DBSCAN outperforms traditional RF methods in terms of accuracy.

17.2 The Principle of Random Forest Algorithm

Random Forest [9] is an ensemble learning algorithm based on the Bagging algorithm and the random subspace method [10], proposed by Leo Breiman in 2001. The ensemble algorithm consists of multiple base learners that can be defined as f(D, θ_k), where k = 1, 2, …, M. In terms of random forests, f(D, θ_k) represents a decision tree, θ_k is the independently distributed feature subspace, and M is the number of decision trees. The RF algorithm constructs multiple data sets using the Bagging algorithm, by random sampling with replacement. In order to guarantee that


each decision tree is independent, each decision tree is constructed from a unique data set; at the same time, some features are randomly selected from the feature vector as the feature subspace of the decision tree. When the decision tree is constructed, the best node is selected from the feature subset according to the corresponding splitting rule, and by this split node the data set is divided into two data subsets that are defined as the left branch and the right branch of this node. Each branch selects its respective splitting node to further divide the data subsets using the same splitting rule. Once the stopping condition is satisfied, the data subsets are not divided anymore and are eventually stored in the leaf nodes of each tree. It is data randomness and feature randomness that ensure the independence and diversity of the decision trees in the RF, which improves the performance of the algorithm. Precisely for this reason, the algorithm performs excellently in classification and regression tasks. The process of the algorithm is as follows:

(a) The number of decision trees T = {t_1, t_2, t_3, …, t_m} and the depth d of each tree are initialized, where t_1 represents a decision tree, m means the number of decision trees, and d stands for the depth of a tree.
(b) The sample subsets D = {D_1, D_2, D_3, …, D_m}, equal in number to the decision trees, are generated from the input sample data set D by the Bootstrap method, where m stands for the number of sample subsets.
(c) The method of random feature selection is taken to construct a separate feature vector subset F_i = {v_1, v_2, …, v_p} for each tree, where p represents the number of features and F_i stands for the feature vector set of the decision tree t_i, i ≤ m.
(d) Sample subset D_i and feature vector set F_i are input, respectively, to construct a decision tree model without pruning.
(e) If RF is used to deal with classification tasks, the classification result of each tree is obtained and the final result is chosen by voting; if the model is used for regression tasks, the prediction results of each tree are obtained and their average is regarded as the prediction result.
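As a minimal illustration of the procedure above (not the authors' code), a bagged ensemble of regression trees and the impurity-based feature ranking it provides can be obtained with scikit-learn; the toy feature matrix and target below are placeholders for the travel-time data described later.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 5))                                        # 500 samples, 5 placeholder features
y = 3.0 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.1, 500)     # synthetic travel time

# Steps (a)-(d): m bootstrapped subsets, one unpruned tree per subset, random
# feature subsets via max_features (scikit-learn draws the subset per split
# rather than per tree, a common variation of the random subspace method).
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Step (e), regression case: the forest prediction is the average over the trees.
print(rf.predict(X[:3]))

# Impurity-based feature importances, used in Sect. 17.4.2 to rank features.
print(rf.feature_importances_)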

17.3 The Principle of DBSCAN

DBSCAN [11] is a density-based clustering algorithm. It makes high-density points gradually fall together into clusters of any size using the neighborhood radius Eps (neighborhood of a point) and MinPts (density threshold). Moreover, the algorithm can also filter out noise data. The algorithm is mainly defined as follows:

Density value: Select a point as the center of a circle with the Eps value as the radius; the number of data points within the circle is the density value of that point.

Core point: If the density value of a point is larger than MinPts, the point is a core point; otherwise it is a boundary point.


Directly density-reachable: Let D be the data set and q a core point. For p, q ∈ D, if p lies in the Eps-neighborhood of q, then p is directly density-reachable from q.

Density-reachable: If there is a chain of points p_1, p_2, p_3, …, p_n such that p_{i+1} is directly density-reachable from p_i, then p_n is density-reachable from p_1.

Density-connected: If there is a point o such that p_1 and p_2 are both density-reachable from o with respect to Eps and MinPts, then p_1 and p_2 are density-connected through o.
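For illustration (not the authors' implementation), the clustering and noise-filtering behavior described above can be reproduced with scikit-learn's DBSCAN; the toy points below are placeholders, and the Eps = 1.5, MinPts = 2 values are simply the settings later reported in Sect. 17.4.4.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense groups of points plus one far-away outlier.
points = np.vstack([
    rng.normal(loc=0.0, scale=0.2, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.2, size=(20, 2)),
    [[20.0, 20.0]],                                  # isolated noise point
])

clustering = DBSCAN(eps=1.5, min_samples=2).fit(points)

# Labels 0, 1, ... are cluster ids; -1 marks points rejected as noise.
print(clustering.labels_)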

17.4 The Prediction Model of RF-DBSCAN

17.4.1 Data Sources

The traffic history data provided by the Guiyang Public Security Traffic Management Bureau are used for this study. The data from March 2017 to May 2017 are used as the training set, and the data of June 2017 are chosen as the test set. The data set records the travel time of different time periods (2 min as a time period) of each road section in one day. Meanwhile, the weather data of Guiyang from March to June, crawled from the website (http://www.tianqi.com), form part of the basic data set.

17.4.2 Feature Selection

As the dimensionality of the data set increases, the performance of the predictive model will be affected: the complexity of the training set increases, redundant features grow, and the prediction accuracy declines. In order to improve the prediction accuracy of the model and reduce the computational complexity, it is necessary to select the key features related to travel time. The feature importance measure based on random forest can be used to solve this problem. The method measures the impact of each feature on the impurity of the regression tree, and the features are ranked according to this measure, so the key features can be obtained. Considering the time-space characteristics of the road network [12] and the feature importance, the following attributes are selected as the feature set of the RF-DBSCAN model.

(a) Road attributes. Road attributes include the length, width, grade of a road (major road, auxiliary road), and road id.

(b) Properties of time. Properties of time generally include weekends and holidays, the time interval, and the travel times of the first few minutes before the predicted value.


Fig. 17.1 Travel time curve

First few minutes before the predicted value: if we need to predict the road travel time from 9:00 am to 9:02 am on a given day, we can use the travel time from 8:58 am to 9:00 am as a feature. Following this idea, we generally use the travel times of the first eight preceding time intervals as features of the model. The time interval itself can also be regarded as a feature of the predictive model. Figure 17.1 shows that the travel time changes with the time of day on the same road section. The morning peak, the afternoon peak, and the late peak are especially obvious.

(c) Road structures. Taking into account the interaction between the roads, the upstream and downstream information of the road segment can be used as features, such as the difference between the width of the upstream road segments and the width of the downstream road segments, and the difference between the number of upstream road sections and the number of downstream road sections.

(d) Weather factors. As we all know, the weather and the quality of air play a key role in people's travel plans. Therefore, weather and air quality can be viewed as features of the predictive model.

17.4.3 The Principle of RF-DBSCAN

RF can be applied to both classification and regression problems. In regression problems, the dependent variable is continuous; in classification problems, the dependent variable is categorical. Because the travel time is treated as a continuous variable, the task in this paper belongs to the regression problem. In this paper, RF regards the regression tree CART [13] combined with DBSCAN as the base learner to solve the regression problem. Figure 17.2 presents the specific structure. The specific steps are as follows:

(a) The feature set F = {v_1', v_2', …, v_n'} is ranked according to the importance of the features, where v_i' represents the importance value of feature vector v_i and n stands for the feature dimension. In this paper, the random forest algorithm is applied to calculate the importance of the features.


Fig. 17.2 Regression tree combined with DBSCAN

(b) The training dataset is divided into m subsets by the Bagging algorithm, and then a regression tree is constructed on each of these data subsets. The regression trees select the minimum residual variance as the criterion for choosing the splitting attribute and the splitting point. The specific process is as follows. First, the data of each feature are sorted in ascending order to serve as candidate split nodes. Data smaller than the split node are divided into the left branch; otherwise, they are divided into the right branch. In this way, two data subsets are obtained. Then, using the residual variance, as in Eq. (17.1), the value S of the undivided node, the value S_L of the left branch after division, and the value S_R of the right branch after division are computed. The best splitting node and splitting feature are obtained by maximizing S − (S_L + S_R). Finally, the above division process is repeated until the conditions for stopping the split are satisfied. At this time, the leaf nodes accumulate some sample data, and there will be noise data when a leaf node covers too much data.

    S = Σ_{i=1}^{N} (y_i − ȳ)²                                            (17.1)

where N means the number of samples in each sample subset, y_i stands for the predicted target, and ȳ represents the average of the predicted target.

(c) The predicted value of the regression tree is optimized using the density clustering algorithm. The sample data in the leaf nodes of the regression tree are clustered by the DBSCAN algorithm. At this point, the sample data sets in a leaf node form many clusters C = {c_1, c_2, …, c_k} of irregular shapes. Finally, the centroid


points D_c = {d_1, d_2, …, d_k} of each cluster are calculated, where k means the number of clusters and d_1 represents the centroid of the cluster c_1.

(d) To forecast a sample X, first, the sample X is moved from the root node of the regression tree to the corresponding leaf node; then, the cluster of that leaf node which is most similar to the sample X is determined by Eq. (17.2). Equation (17.2) is the weighted Euclidean distance: the smaller its result, the more similar the cluster is to the predicted sample.

    σ(X, d) = sqrt( Σ_{i=1}^{n} (1 − v_i') · (X_i − d_i)² )               (17.2)

where X_i is the i-th feature of the prediction sample X, d_i is the i-th component of the cluster centroid d, n stands for the sample dimension, and v_i' is the importance of feature i.

(e) To compute the prediction result of a single tree, first, the most similar cluster c_i is determined; then, the average of the target variable of the samples in c_i is computed; finally, this average is viewed as the prediction result of the single regression tree.

(f) To calculate the random forest prediction result, first, the prediction results of all regression trees are obtained; then, the prediction result of the random forest algorithm is computed by Eq. (17.3).

    y_rf = (1/m)(y_1 + y_2 + y_3 + · · · + y_m)                           (17.3)

where y_rf represents the final predicted value of the RF and y_1 stands for the predicted value of the first regression tree.
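A minimal sketch of steps (c)–(e) — clustering the samples of one leaf node with DBSCAN, choosing the cluster whose centroid is closest to the query under the weighted Euclidean distance of Eq. (17.2), and averaging its targets — is given below. The leaf samples, feature importances, and parameter values are illustrative placeholders, not the data or code used in this paper.

import numpy as np
from sklearn.cluster import DBSCAN

def leaf_prediction(leaf_X, leaf_y, x_query, importances, eps=1.5, min_pts=2):
    """Predict from one leaf: DBSCAN the leaf samples, pick the nearest cluster,
    return the mean target of that cluster (noise points, label -1, are ignored)."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(leaf_X)
    best_label, best_dist = None, np.inf
    for label in set(labels) - {-1}:
        centroid = leaf_X[labels == label].mean(axis=0)
        # Weighted Euclidean distance of Eq. (17.2).
        dist = np.sqrt(np.sum((1.0 - importances) * (x_query - centroid) ** 2))
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_label is None:                        # every leaf sample was noise
        return leaf_y.mean()
    return leaf_y[labels == best_label].mean()

# Toy leaf: two groups of travel-time samples plus one outlier that DBSCAN rejects.
leaf_X = np.array([[1.0, 0.2], [1.1, 0.25], [3.0, 0.9], [3.1, 0.95], [10.0, 5.0]])
leaf_y = np.array([12.0, 12.5, 20.0, 20.5, 60.0])
importances = np.array([0.7, 0.3])                # v_i' from the feature ranking
print(leaf_prediction(leaf_X, leaf_y, np.array([1.05, 0.22]), importances))  # 12.25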

17.4.4 Comparison of Experiment Results

A large number of experiments show that the performance of the random forest model is optimal when the number of trees is 400 and the depth of each tree is 3. RF-DBSCAN gives the best prediction result when Eps = 1.5 and MinPts = 2. In order to test the performance of the prediction model, the mean absolute error (MAE) and the mean absolute percentage error (MAPE), as in Eq. (17.4), are regarded as the evaluation indicators for comparison with GBDT and RF. The following groups of graphs (Figs. 17.3 and 17.4) show two days' comparison curves of measurement and prediction from 5:00 am to 20:00 pm. The prediction results in the case study show that the results given by these three methods are quite consistent and have good comparability. The RF-DBSCAN and the GBDT methods perform


Fig. 17.3 The travel time from 5:00 am to 20:00 pm, June 23, 2017

Fig. 17.4 The travel time from 5:00 am to 20:00 pm, June 24, 2017

better than the RF model when the travel time fluctuates greatly. Table 17.1 lists the MAE and MAPE of the three methods; the RF-DBSCAN model is relatively stable compared with the other two models.

Table 17.1 Prediction errors of the three models

         RF      GBDT    RF-DBSCAN
MAE      1.826   1.479   1.174
MAPE     0.178   0.134   0.107

    MAE = (1/N) · Σ_{i=1}^{N} |observed_i − predicted_i|;
    MAPE = (1/N) · Σ_{i=1}^{N} |(observed_i − predicted_i) / observed_i|   (17.4)
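For reference, these two error measures can be computed directly from the observed and predicted series; the arrays below are placeholder values, not the Guiyang data.

import numpy as np

def mae(observed, predicted):
    return np.mean(np.abs(observed - predicted))

def mape(observed, predicted):
    return np.mean(np.abs((observed - predicted) / observed))

observed  = np.array([10.0, 12.0, 15.0, 11.0])   # travel times (placeholder values)
predicted = np.array([ 9.5, 12.4, 14.0, 11.8])
print(mae(observed, predicted), mape(observed, predicted))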

17.5 Conclusions

The traditional random forest, when applied to regression problems, mainly takes the mean value of the leaf nodes of each regression tree as the prediction result of that tree. In the RF-DBSCAN model, the DBSCAN algorithm is combined with the regression tree to generate multiple clusters in the leaf nodes of the tree, and then the cluster of the leaf node most similar to the predicted sample is obtained by the weighted Euclidean distance. Finally, we regard the average of the samples in that cluster as the prediction result of each tree. The prediction model proposed in this paper reasonably removes the influence of noise data on the prediction results and thus improves the prediction accuracy of the algorithm.

References

1. John, R., Erik, V.Z.: A simple and effective method for predicting travel times on freeways. IEEE Trans. Intell. Transp. Syst. 5, 200–207 (2004)
2. Tak, S., Kim, S., Yeo, H.: Travel time prediction for origin-destination pairs without route specification in urban network. In: 2014 IEEE 17th International Conference on Intelligent Transportation Systems (2014)
3. Jiang, Z., Zhang, C., Xia, Y.: Travel time prediction model for urban road network based on multi-source data. Procedia Soc. Behav. Sci. 138, 811–818 (2014)
4. Billings, D., Yang, J.S.: Application of the ARIMA models to urban roadway travel time prediction—a case study. In: 6th IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp. 2529–2534 (2006)
5. Cortes, C., Vapnik, V.: Support vector machine. Mach. Learn. 20(3), 273–297 (1995)
6. Xu, D., Wang, Y., et al.: Real-time road traffic state prediction based on ARIMA and Kalman filter. Front. Inf. Technol. Electron. Eng. 18(2), 287–302 (2017)
7. Kumar, K., Pariad, M., Kativar, V.K.: Short term traffic flow prediction for a non urban highway using artificial neural network. Procedia Soc. Behav. Sci. 104(1), 755–764 (2014)
8. Zhang, Y., Haghani, A.: A gradient boosting method to improve travel time prediction. Transp. Res. Part C 58(Part B), 308–324 (2015)
9. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
10. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998)
11. Lv, Y., Ma, T., Tang, M., et al.: An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing 171 (2015)
12. Wang, J., Tsapakis, I., Zhong, C.: A space–time delay neural network model for travel time prediction. Eng. Appl. Artif. Intell. 52(1), 145–160 (2016)
13. Lawrence, R.L., Wright, A.: Rule-based classification systems using classification and regression tree (CART) analysis. Photogramm. Eng. Remote. Sens. 67, 1137–1142 (2001)

Chapter 18

Video Synchronization and Alignment Using Motion Detection and Contour Filtering

K. Seemanthini, S. S. Manjunath, G. Srinivasa and B. Kiran

Abstract The proposed method presents a proficient implementation of video synchronization and alignment using motion detection and contour filtering, based on various low-dimensionality frame matching techniques. In the proposed system, a motion detection algorithm is used to detect only the motion of the objects, and a contour filtering algorithm is used to recognize the objects based on their color. The algorithms are implemented in the Java language, which facilitates prototyping using an open source library. The applications of video alignment include the detection of objects such as vehicles, which is used in Advanced Driver Assistance Systems (ADAS) and also in video surveillance systems for traffic monitoring. The motion detection algorithm is used in CCTV surveillance for detecting terrorist threats. The contour filtering algorithm is implemented in medical examinations. The proposed system is tested on live datasets and obtains good change detection between the frames. Video synchronization and alignment algorithms had been developed in earlier years for plain datasets using static cameras. Compared to the algorithms developed in the initial stages, the proposed system provides better efficiency by optimizing the algorithms, improving data locality, and also executing the various modules of the application to speed up the application. The proposed system has obtained a speedup factor of 12.39x compared to the existing methods when processed with the testing dataset with a video resolution of 240 * 320, at 30 frames per second, using high definition cameras. The results obtained are further processed to run on embedded CPUs and GPU processors.


18.1 Introduction

18.1.1 Overview

Analyzing human activity has become a trending and challenging research area in recent years, and its applications are diverse. Activity understanding is necessary for surveillance systems and for improving computer technology. Computer Vision (CV) is an interdisciplinary field that deals with how computers can be made to gain a high-level understanding from digital images or videos. As we all know, Computer Vision applications run on many embedded devices such as surveillance cameras, smartphones, and driver assistance systems. In the proposed methodology, a compute-intensive CV application [1], namely the video alignment and change detection procedure, is developed and optimized to run on parallel accelerator hardware.

The proposed methodology takes two videos, named the training video and the testing video, taken at different times on the same trajectory, and finds the best matching pairs of frames between the two videos. Once the matches are performed, the algorithm can be used to find critical events. The proposed method has numerous applications, such as finding missing objects, detecting terrorist threats, flying drones, medical examination, and advanced driver assistance systems. The proposed system is designed to solve the problem of aligning video frames captured on similar trajectories recorded by moving cameras based on GPS information. In the proposed system, attention is given to a compute-intensive CV application, namely video alignment and change detection.

Video synchronization and alignment algorithms [2] have been developed in earlier years for plain datasets using static cameras. But the existing systems failed to provide accurate results and were not open source. Hence, it is required to develop an efficient open source video synchronization and alignment algorithm. Here, the motion detection algorithm is used to detect the motion of the objects [3] in the video, and the contour filtering algorithm recognizes the objects based on their color.
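As a rough illustration only (the paper's actual implementation is in Java with an open source CV library; the Python/OpenCV calls below are simply a convenient way to sketch the idea), motion between consecutive frames can be detected by frame differencing, and objects of a given color can be picked out by contour filtering:

import cv2
import numpy as np

def motion_mask(prev_frame, curr_frame, threshold=25):
    """Return a binary mask of pixels that changed between two frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    return mask

def color_contours(frame, lower_hsv, upper_hsv, min_area=100):
    """Return contours of regions whose color lies in the given HSV range."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lower_hsv, upper_hsv)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [c for c in contours if cv2.contourArea(c) > min_area]

# Example usage on two synthetic frames (placeholders for real video frames).
prev = np.zeros((240, 320, 3), dtype=np.uint8)
curr = prev.copy()
cv2.rectangle(curr, (50, 50), (100, 100), (0, 0, 255), -1)    # a "moving" red object
print(motion_mask(prev, curr).sum() > 0)                       # True: motion detected
print(len(color_contours(curr, np.array([0, 120, 70]), np.array([10, 255, 255]))))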

18.1.2 Problem Statement

Video synchronization and alignment algorithms had been developed in the earlier years for plain datasets using static cameras. But the existing system failed to provide


accurate results and also it was not an open source system [4]. Hence it is required to develop an efficient open source video synchronization and alignment algorithm. Here the motion detection algorithm is used to detect the motion of the objects in the video and contour filtering algorithm recognizes the objects based on its color.

18.1.3 Objectives

1. Development of a user-friendly interface for uploading the videos.
2. Implementation of a low dimensionality frame matching algorithm for detecting video alignment.
3. Rendering the result in the most effective way.

18.1.4 Existing System

In the existing system, the author has developed an algorithm to align two video sequences [5], i.e., the first and the second videos are compared to find the temporal and spatial correspondence between the frames. The video synchronization [6] and alignment techniques were developed in earlier years for static videos. In addition, a linear time correspondence between the frames has been used, which introduces restrictive assumptions and hence limits the practical applications of any solutions developed. The main drawback of the existing system is low accuracy. Hence, the proposed system is developed to align video sequences captured by different moving cameras, using the intensity of image fusion and GPS information.

18.1.5 Proposed System

The proposed system makes the following significant contributions:

1. It provides a proficient open source implementation of the video synchronization and alignment algorithm, which is executed on parallel hardware such as Graphical Processing Units (GPUs) and also on Chip Multiprocessors (CMPs).
2. The implementation is tested on various challenging datasets: a group of videos recorded while driving on a highway and also on various landmarks. Both datasets pose a serious challenge to our algorithm because of the big differences in trajectories. The proposed methodology is able to make a major difference by providing better alignment quality.



3. The proposed system develops a standard algorithm to compute the temporal alignment of the frames in the input videos. In the executed experiments, this algorithm improves the temporal alignment results. 4. The main advantage of the proposed system is that the ECC algorithm provides more accurate synchronization and alignment results than the Lucas–Kanade method (a brief sketch follows).
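To make the comparison in point 4 concrete, the following is a minimal OpenCV-Python sketch of ECC-based frame alignment using cv2.findTransformECC (the enhanced correlation coefficient maximization of [10]); OpenCV 4.1 or newer is assumed, and the file names, motion model, and termination criteria are placeholders rather than the authors' settings.

```python
import cv2
import numpy as np

def ecc_align(train_frame, test_frame, iterations=200, eps=1e-6):
    """Align test_frame to train_frame using the ECC criterion (OpenCV >= 4.1 assumed)."""
    train_gray = cv2.cvtColor(train_frame, cv2.COLOR_BGR2GRAY)
    test_gray = cv2.cvtColor(test_frame, cv2.COLOR_BGR2GRAY)

    # 2x3 warp matrix for a Euclidean (rotation + translation) motion model
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, iterations, eps)

    # Maximize the enhanced correlation coefficient between the two frames
    cc, warp = cv2.findTransformECC(train_gray, test_gray, warp,
                                    cv2.MOTION_EUCLIDEAN, criteria, None, 5)

    h, w = train_gray.shape
    aligned = cv2.warpAffine(test_frame, warp, (w, h),
                             flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
    return cc, aligned

# Example with placeholder file names: align the first frames of two recordings
train = cv2.VideoCapture("training.avi").read()[1]
test = cv2.VideoCapture("testing.avi").read()[1]
score, aligned = ecc_align(train, test)
```

Once the frames are warped into a common coordinate frame, change detection can be performed by differencing the aligned frames, which is the step the later components build on.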

18.2 Literature Survey This literature survey covers different techniques for solving the problem of video synchronization and alignment. Some of the methods use only spatial information to obtain the results, and some impose constraints on the video sequences. A feature-based image alignment algorithm [7] groups feature points into feature trajectories, and the alignment is recovered by establishing the correspondence between trajectories. This method works as follows: initially, the feature trajectories are constructed to find the centroid of the moving objects and to obtain the transformation matrix; the next step is to select the transformation matrix with the least MSE after alignment [8]; the process is repeated until the best transformation matrix [9] is obtained. Evangelidis [10] works on variations of spatial brightness, which are isolated to obtain the video alignment parameters. This method works as follows: initially, a spatiotemporal pyramid is constructed and the spatiotemporal alignment is estimated at each level; finally, the initial estimate at the current level is propagated to the next level. Anuradha and Karibasappa [11] developed an algorithm to create a secondary video that is spatially and temporally registered with the primary video [12]. The main purpose of this algorithm is to select feature points in the secondary image based on a weighting function, but the method fails to model discontinuities in the correspondence field.

18.3 System Design Figure 18.1 shows a general block diagram describing the activities performed in the proposed system. The entire architecture has been implemented in ten modules.



Fig. 18.1 Architecture of video change detection

The entire architecture diagram can be divided into the following divisions: 1. Data Access Layer The data access layer (DAL) exposes all the possible operations on the database. The internal components of the DAL are DAO classes, DAO interfaces, POJOs, and Utils. All the other modules of the proposed system communicate with the DAO layer for their data access needs. 2. Account Operations The account operations module provides the following functionalities to the end users of the proposed system: registration, login, logout, editing, changing password, deleting an account, etc. 3. Video Alignment Algorithm The proposed methodology takes two videos, the training video and the testing video, captured along the same trajectory at different times, and finds the best matching between the frames of the two videos. Once the matches are found, the algorithm can be used to detect critical events. 4. Contour Filtering Component This component is used to recognize objects [13]. It recognizes an object based on its color, which is given as input to the component. The input image can come either from a webcam or from an image already present in the system. The component searches the given image for the presence of the specified color. The system uses contour detection, which in turn uses



the contour filtering algorithm for edge detection to get the shape of the image. The resulting image is a black and white image that shows the object of interest. 5. Motion Detection Component This component detects motion of the objects. We use a video source to capture the motion of the objects [14]. The video sources can be a web camera or any video stored in the computer. This component detects motion in the source file provided to it. The path of the source file is provided as input to the component. This path is provided by the user after logging into the system. The user can log into the system only if he is a registered user. Hence the user has to register before logging in. This component is put in a jar file which is run in the user interface.
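The chapter's components are packaged as Java jar files and servlets; purely as an illustration of the two underlying ideas, frame differencing for motion detection and an HSV color mask with contour detection for recognizing an object by its color, a rough OpenCV-Python sketch is given below (OpenCV 4.x assumed; the threshold and color bounds are illustrative values, not the authors' settings).

```python
import cv2
import numpy as np

def detect_motion(prev_frame, frame, min_area=500):
    """Return contours of regions that changed between two consecutive frames."""
    diff = cv2.absdiff(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [c for c in contours if cv2.contourArea(c) >= min_area]

def filter_by_colour(frame, lower_hsv, upper_hsv):
    """Return a black-and-white mask and contours of the object matching a colour range."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower_hsv), np.array(upper_hsv))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return mask, contours

# Example: a red-ish object (HSV bounds are illustrative only)
# mask, contours = filter_by_colour(frame, (0, 120, 70), (10, 255, 255))
```

The returned mask corresponds to the black-and-white image of the object of interest described above, and the frame-differencing contours mark the regions of motion.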

18.4 Dataflow of Video Change Detection Dataflow diagrams give the end user a concrete idea of where the data they input goes and how it ultimately affects the structure of the whole system (Figs. 18.2, 18.3, and 18.4).

Fig. 18.2 Dataflow diagram of video change detection



Fig. 18.3 Dataflow diagram of account operations

18.5 Class Diagrams Class diagrams can also be used for data modeling [8]. The classes in a class diagram represent the main objects, the interactions in the application, and the classes to be programmed (Figs. 18.5, 18.6, 18.7 and 18.8).

18.6 Experimental Results The following table provides the performance comparison of the videos.

                   Video 1   Video 2   Video 3   Total
True positives        35        18        22       75
False positives        2         1         8       11
False negatives        9         6        13       28
Error rate           0.10      0.36      0.40     0.86

See Fig. 18.9.


Fig. 18.4 Dataflow diagram for motion detection algorithm and flow chart



Fig. 18.5 Class diagram of user servlet




Fig. 18.6 Class diagram of video change detection

18.7 Conclusion The proposed system provides an efficient open source implementation of the video synchronization and alignment algorithm. Alignment results are presented in the context of videos recorded along the same track at different times and possibly at different speeds. We explore two applications of the proposed video alignment method: one is the detection of vehicles, which could be of use in ADAS; the other is online difference spotting in videos of surveillance rounds. The motion detection component used in this paper detects the motion of objects in the video. This algorithm could be used in CCTV surveillance for detecting terrorist threats in parking lots, and the contour filtering component is used for recognizing objects based on their color. It


Fig. 18.7 Class diagram of motion detection servlet




Fig. 18.8 Class diagram of BVO servlet

Fig. 18.9 Performance analysis of the videos

uses the contour filtering algorithm for edge detection to get the shape of the image. The resulting image is a black and white image that shows the object of interest. This contour filtering could be used in medical examinations.

References
1. Evangelidis, G.D., Bauckhage, C.: Efficient Subframe Video Alignment Using Short Descriptors. OpenCV Computer Vision with Python. Packt Publishing (2013)
2. Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36 (2014)
3. Stein, J.: Programming computer vision with python: tools and algorithms for analyzing images. O'Reilly and Associate Series. O'Reilly Media, Incorporated (2012); Diego, F., Ponsa, D., Serrat, J., Lopez, A.M.: Video alignment for change detection. IEEE Trans. Image Process. 20(7), 1858–1869 (2011)
4. Sand, P., Teller, S.: Video matching. ACM Trans. Graph. 23(3), 592–599 (2004)
5. Bradski, G.: The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000)
6. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, ICCV '99, vol. 2, pp. 1150–, Washington, DC, USA (1999). IEEE Computer Society
7. Singh, S., Mandal, A.S., Shekar, C., Vohra, A.: Real-time implementation of change detection for automated video surveillance system. ISRN Electronics, Hindawi Publishing Corporation (2013)
8. Smistad, E., Falch, T.L., Bozorgi, M., Elster, A.C., Lindseth, F.: Medical image segmentation on GPUs—a comprehensive review. Med. Image Anal. 20(1), 1–18 (2015)
9. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
10. Evangelidis, G.D., Psarakis, E.Z.: Parametric image alignment using enhanced correlation coefficient maximization. IEEE Trans. Pattern Anal. Mach. Intell. 30(10), 1858–1865 (2008)
11. Anuradha, S.G., Karibasappa, K., Eswar Reddy, B.: Video segmentation for moving object detection using local change and entropy based adaptive window thresholding. IEEE Trans. Pattern Anal. Mach. Intell. 35(10), 2371–2386 (2013)
12. Kumar, R., Gupta, S., Venkatesh, K.S.: Cut scene change detection using spatio temporal video frame. In: 2015 Third International Conference on Image Information Processing
13. Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers: principles, techniques, and tools, 2nd edn. Addison-Wesley Longman Publishing Co. Inc., Boston, MA, USA (2006)
14. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pp. 2564–2571, Washington, DC, USA (2011). IEEE Computer Society

Chapter 19

Mutichain Enabled EHR Management System and Predictive Analytics Meghana Nagori, Aditya Patil, Saurabh Deshmukh, Gauri Vaidya, Mayur Rahangdale, Chinmay Kulkarni and Vivek Kshirsagar

Abstract One of the challenges in biomedical research and clinical practice is that we need to consolidate tremendous efforts in order to use all kinds of medical data for improving work processes, to increase capacity while lessening costs and enhancing efficiencies. Very few medical centers in India have digitized their patient records. Because of less interoperability among themselves, they end up having scattered and incomplete data. Health data is proprietary and being a personal asset of the patient, its distribution or use should be accomplished only with the patient’s consent and for a specific duration. This research proposes multichain as a secure, decentralized network for storing Electronic Health Records. The architecture provides users with a holistic, transparent view of their medical history by disintermediation of trust while insuring data integrity among medical facilities. This will open up new horizons of vital trends and insights for research, innovation, and development through robust analysis. The platform focuses on an interactive dashboard containing year, month, and season wise statistics of various diseases which are used to notify the users and the medical authorities on a timely basis. Prediction of epidemics using machine

M. Nagori (B) · A. Patil · S. Deshmukh · G. Vaidya · M. Rahangdale · C. Kulkarni · V. Kshirsagar Government College of Engineering, Aurangabad 431005, Maharashtra, India e-mail: [email protected] A. Patil e-mail: [email protected] S. Deshmukh e-mail: [email protected] G. Vaidya e-mail: [email protected] M. Rahangdale e-mail: [email protected] C. Kulkarni e-mail: [email protected] V. Kshirsagar e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_19




learning techniques will facilitate users by providing personalized care and the medical institutions for managing inventory and procuring medicines. Vital insights like patient to doctor ratio, infant mortality rates, and prior knowledge of the forthcoming epidemics will help government institutions to analyze and plan infrastructural requirements and services to be provided.

19.1 Introduction In 2014, according to the latest NSSO report, 44 out of each 1,000 Indians wind up getting hospitalized in a year. In 2016, the total number of hospitals in India reached 1,96,312, not all of which store data digitally [1]. Interoperability among the few organizations that use the Electronic Health Record (EHR) systems is difficult to achieve due to the differences in their technological tools used. An essential prerequisite for giving good patient care is the guarantee of the quality information in clinical health databases. Keeping the records on centralized servers makes them vulnerable. Hackers recently breached 1.5 million Singapore patient records, repeatedly targeting their Prime Minister [2]. It requires vast efforts to gather all the invoices and send them to insurance suppliers, pharmacists, etc., as information is not accessible on a solitary digital platform. In critical applications, absence of dependable information and sluggish interfaces prove devastating. Hospitals are wasting ample resources into duplicating the work that can be simply achieved by a refined framework and by expanding the proficiency of inadequately structured frameworks [3]. To improve the efficiency of patient care conveyance, healthcare parties must exchange the patient information among themselves, independent of their organizational and technological particularities. With the headways in innovations, there is a scope to increase the efficiency of Electronic Health Record Management, making diagnosis increasingly accurate and faster. Consolidating these advancements in healthcare and information technology would stimulate prodigious change in the healthcare industry [4]. Blockchain’s potential to accelerate and enhance research and development, care conveyance with fewer expenses is tremendous. The platform houses an interactive dashboard containing all vital pieces of insights gained from the data like infant mortality rates, major diseases in an area, epidemics, etc., provided, will help government officials to plan accordingly and take measures to try and eradicate these in future. The dashboard contains year, month, and season wise statistics of various diseases which can be used to notify the user or the medical authorities on a timely basis.



19.2 Related Work Security, trust, traceability, and control are the promises of the blockchain, and we intend its use for storing sensitive health data and for the operation and management of supply chains. The direst need to have such a platform can be elaborated and justified with these factual problems faced around the globe and the work done using similar platforms to overthrow them.

19.2.1 Verifying the Authenticity of Returned Drugs by Tracking Supply of Authorized Drugs in Supply Chain In 2016, the US pharmaceutical manufacturer sales were $323 billion. The 2018 forecast revenue for the top ten global pharmaceutical companies (which represent half of the sales of the top fifty companies) is $355 billion. Based on these estimates, the saleable returns market at 2–3% of total sales is 7–10 billion. Instead of obliterating these exemplary drug shipments, pharmaceutical companies instead chose to resell them. However, before they can resell these returned drugs, the pharmaceutical companies have a legal obligation to verify the authenticity of the returned drugs [5]. The Drug Supply Chain Security Act (DSCSA), in the United States stipulates that all US manufacturers must implement serialization or barcoding of drugs at a package level by November 2018. Also, by the same time next year, these serial numbers must be used to verify the authenticity of the returned drugs [5]. A far better approach is to have pharmaceutical manufacturers record the serial numbers of their packages on a blockchain, which serves as a decentralized and distributed ledger. Wholesalers and customers can then verify the authenticity of a drug package by connecting to the blockchain. The research aims to have a similar existing architecture while giving the user the power to control their own data.

19.2.2 Transparency and Traceability of Consent in Clinical Trials Patient consent involves making the patient aware of each step in the clinical trial process, including any possible risks posed by any procedure undergone. Traceability for stakeholders and transparency for patients are expected whenever clinical trial consent data is altered [6]. The Food and Drug Administration reports that roughly 10% of the trials they monitor involve issues related to consent collection. Cases of misrepresentation are frequently reported, for example, issues of backdating consent records [7].



Tracking the intricate information flow among the various stakeholders and archiving it progressively through a timestamped workflow is a vital step toward demonstrating data consistency.

19.2.3 Incentivized Access to Medical Data MedRec, proposed by Asaph Azaria, is a system that gives a transparent and accessible view of medical history by prioritizing patient agency. It provides a proof of concept that uses blockchain as a mediator for health information. MedRec incentivizes medical researchers and healthcare stakeholders to mine in exchange for access to aggregated and anonymized medical data, as a byproduct of sustaining and securing the system through Proof of Work [8]. A viable and secure shared network can be assembled only by providing big data that empowers researchers while engaging patients and providers. MedRec was validated only for medical records [8].

19.3 Problem Setting In this section we define the mathematical context of the architecture. Let U be the vector of users' details, with age, gender, and diagnostic result as attributes. The set of streams used as wallets to store the transactional details of the users is defined as W; we refer to the wallet of a user as Wui for every user ui ∈ U. Let H be the vector of hospitals' details, with location, number of doctors, and number of beds as attributes.

U = {u1, …, um},  W = {Wu1, …, Wum},  H = {h1, …, hn}

For the given set of users and hospitals, T is the set of transactions, where each transaction is denoted tk:

T = {t1, …, tk},  tk = (Wui, hj)

A transaction tk is the relation between the patient (or the user) and the hospital, and each transaction has a timestamp as an attribute. From the transaction and hospital vectors, the hospital-to-patient ratio and the doctor-to-patient ratio are calculated. Using the attributes of the hospitals, users, and transactions, area-specific diseases are predicted. Personalized care will be provided to patients by doctors based on the corresponding subset of transactional details T.
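As a purely illustrative reading of this formulation (the class and field names below are ours, not the chapter's), the sets U, W, H, and T and the doctor-to-patient ratio can be modeled as follows:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class User:                      # u_i in U
    uid: str
    age: int
    gender: str
    diagnostic_result: str

@dataclass
class Hospital:                  # h_j in H
    hid: str
    location: str
    num_doctors: int
    num_beds: int

@dataclass
class Transaction:               # t_k = (W_{u_i}, h_j), with a timestamp attribute
    wallet_id: str               # the user's wallet/stream W_{u_i}
    hospital_id: str
    timestamp: str

def doctor_to_patient_ratio(hospital: Hospital, transactions: List[Transaction]) -> float:
    """Doctors per distinct patient seen at one hospital, derived from the transaction set T."""
    patients = {t.wallet_id for t in transactions if t.hospital_id == hospital.hid}
    return hospital.num_doctors / max(len(patients), 1)
```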



19.4 Proposed Architecture The platform uses the MultiChain API to build and deploy a blockchain that implements a shared database as an administered distributed system. It also provides public–private key cryptography, transaction mechanisms, and fault-tolerant mechanisms that can survive malevolent actions, forming a secure and decentralized network for storing Electronic Health Records. The MultiChain blockchain provides timestamping, authenticity, and immutability to the streams, which are used as an append-only database. The medical authority issues a wallet to the patient on his/her first visit, triggering the creation of a stream in the background. A MultiChain blockchain can contain an unlimited number of streams, and these streams act as wallets for the patients. The data published in each stream is stored by every node, thus ensuring its security. The nodes can access any stream and perform actions such as adding a transaction, i.e., publishing reports to streams and retrieving them from streams, either as text or as files in formats like pdf, jpg, png, and dcm, with a limit of 64 MB on-chain and 1 GB off-chain. Instead of the original data, we can store the hashes of the data within the transactions to manage the data scalability issue. Files are stored off-chain using lossless compression algorithms, and digital tokens or hashes of them are stored on the blockchain using SHA-256. Data is anonymized, i.e., care is taken to protect the private and personal information of the patient, and is analyzed using machine learning techniques to predict vital insights and provide customized analysis—descriptive, predictive, as well as prescriptive. A trusted, secure environment will be created across the healthcare industry for collaborative and meaningful data exchange among the health centers. It is a multi-node platform with hospitals, insurance companies, and government officials as the nodes, thus aiming to bring revolution to the healthcare industry in the country (Fig. 19.1).
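The chapter does not give node configuration details; assuming a local MultiChain node with known JSON-RPC credentials, publishing a record hash to a patient's wallet stream could look roughly like the sketch below (publish and liststreamkeyitems are standard MultiChain RPC commands, while the URL, credentials, stream name, and item key are placeholders of ours).

```python
import hashlib
import json
import requests

RPC_URL = "http://127.0.0.1:8570"           # assumed local MultiChain node
AUTH = ("multichainrpc", "rpc-password")     # placeholder credentials

def rpc(method, *params):
    payload = {"method": method, "params": list(params), "id": 1}
    r = requests.post(RPC_URL, auth=AUTH, data=json.dumps(payload))
    r.raise_for_status()
    return r.json()["result"]

def publish_record_hash(patient_stream, record_bytes):
    """Store only the SHA-256 hash on-chain; the record itself stays off-chain."""
    digest = hashlib.sha256(record_bytes).hexdigest()
    txid = rpc("publish", patient_stream, "ehr-hash", digest.encode().hex())
    return digest, txid

def patient_history(patient_stream):
    """Retrieve every hash ever published to this patient's wallet stream."""
    return rpc("liststreamkeyitems", patient_stream, "ehr-hash")
```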

Fig. 19.1 Off-chain data in multichain



19.4.1 Multichain MultiChain is an open source platform for the intra- or interorganizational deployment of private blockchains. It is an augmented version of Bitcoin's blockchain based on similar principles. It also emphasizes end-user choice, enabling the user to control whether the chain is private or public, who can connect to the network, the target time for blocks, the screening of participants who may join the network, the maximum block size, and the metadata. Integrated management and user permissions help in eliminating the problems of mining, privacy, and openness [9].

19.4.2 Off-Chain Storage With large amounts of data being generated, questions of confidentiality and scalability arise. As a copy of every transaction is shared among all the nodes in the network, there would be a threat to data privacy. However, this can be addressed by using encryption and compression techniques: the data is encrypted using a key and then shared across the network, and only the nodes holding the key are able to access the data. Regarding scalability, since multiple file formats can be uploaded to MultiChain, it would be difficult to store the large amounts of data generated directly. The solution is that, instead of storing the original data on-chain, we store the hashes of the data within the transactions. The actual data is thus stored off-chain, preserving the performance and speed of MultiChain and adding a layer of security [10].

19.4.3 Hashing Hashing replaces sensitive information with a unique value that is not itself sensitive. This generated nonsensitive value acts as a unique identifier, the "hash" or "token", for a sensitive record, reducing the risk of unauthorized access. It can be used as a constant value shared by many end users. This allows users to interact with the data directly, without having to decrypt and re-encrypt the data each time they access the information in the MultiChain platform (Fig. 19.2) [11].
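A minimal sketch of this hash-or-token idea (the salt and example record below are our own illustrative values, not part of the chapter's implementation):

```python
import hashlib

def tokenize(sensitive_value: str, salt: str = "per-deployment-secret") -> str:
    """Replace a sensitive value with a non-sensitive, fixed-length identifier."""
    return hashlib.sha256((salt + sensitive_value).encode("utf-8")).hexdigest()

record_token = tokenize("patient-42|1989-05-01|diabetes-type-2")
print(record_token)   # 64-character hex string stored on the chain instead of the record
```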

19.4.4 Data Analysis Three stages for analysis on the data generated are descriptive, predictive, and prescriptive.



Fig. 19.2 The hashing process

1. Descriptive Analysis This is the initial phase of data analysis, which summarizes and organizes the collected data for better understanding. Such data is extremely helpful in forecasting vital facts. The data is turned into actionable insights and studied further in the predictive analysis phase. 2. Predictive Analysis The descriptive analysis above gives factual information that can improve administration and treatment and enhance diagnosis methodology. The gathered information is verified in a timely manner and is consumed by Machine Learning (ML) engines to obtain diversified predictions on the patient's medical background. These predictions lead to improved diagnostics and better healthcare. 3. Prescriptive Analysis Every medical condition, with its associated signs and symptoms, is used as data to predict whether the features of the corresponding condition lie in a particular disease region and, if so, to classify the disease and suggest preventive measures. Based on decision optimization technology, these capabilities enable doctors to recommend the best course of action for patients.

19.4.5 Machine Learning The value of machine learning in healthcare is its capacity to process immense datasets beyond the scope of human ability, and then reliably convert the analysis of that data into clinical insights that guide doctors in planning and providing care, ultimately leading to better outcomes, lower costs of care, and increased patient satisfaction. The analytics produced will recognize and propose preventive measures for epidemics, make accurate forecasts about an illness before it occurs, and support appropriate medical care based on the patient's medical background.
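Purely as a hedged illustration (the chapter does not specify its models, and the feature names and values below are synthetic), an outbreak-risk predictor over anonymized counts might be sketched with scikit-learn as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, anonymized features: [month, area_code, weekly_case_count, rainfall_mm]
X = np.array([[6, 3, 120, 300], [7, 3, 340, 420], [1, 1, 15, 10],
              [8, 2, 280, 390], [2, 1, 22, 25], [7, 1, 310, 410]])
y = np.array([0, 1, 0, 1, 0, 1])   # 1 = outbreak reported in the following month

model = LogisticRegression().fit(X, y)
# Risk score for a new anonymized record
print("outbreak risk:", model.predict_proba([[7, 2, 295, 400]])[0, 1])
```

In the proposed platform the same pattern would run over the anonymized, off-chain records rather than hand-written arrays.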



Fig. 19.3 The complete workflow

19.4.6 Workflow The workflow of the system, as shown in Fig. 19.3, can be summed up in the following steps: 1. The user's biometric verification, registration by an authorized medical authority, and issuance of a digital wallet (i.e., the health wallet) of the user. 2. Record storage on local servers using compression techniques to reduce storage space, and thus related costs, every time a user visits a medical center with his/her consent. 3. Anonymization of data and use of machine learning techniques on this anonymized data to predict vital statistics like the spread of disease, epidemics, etc. 4. Storage of the hash in the chain, where the transaction is verified and appended to the user's digital wallet. 5. The user can now log into the portal and view his/her complete medical history along with the analysis of his/her health profile.

19.5 Conclusion EHRs combined with analytics have an extraordinary potential to help clinical studies. The proposed framework can, to a certain extent, reform clinical research and provide control to stakeholders, including patients, health systems, analysts, industry, and society. The platform is intended to be utilized and shared among assorted healthcare facilities such as research facilities, small clinics, and specialists. It will help to collect



information from all professionals involved in the patient's care, in order to bring about decentralization in record management. The platform does not offer any crypto asset for trading and is solely based on using open source blockchain technology and related frameworks to create a solution for storing and sharing health records on a distributed network. Machine learning is used in the prediction of various diseases; building self-learning frameworks from such algorithms helps specialists achieve better and easier diagnosis of different diseases. Decisions related to patient medical care can thus be made with a view to productive outcomes.

References 1. Kurian, O.: The Wire. [2018-03-25]: Why It’s a Challenge To Make Quick Sense of India’s Health Data: https://thewire.in/health/imr-mmr-data-nss-srs 2. Yue, X., Wang, H., Jin, D., Li, M., Jiang, W.: Healthcare Data Gateways: Found Healthcare Intelligence on Blockchain with Novel Privacy Risk Control. https://link.springer.com/article/ 10.1007/s10916-016-0574-6 3. Dhillon, V., Metcalf, D., Hooper, M.: Blockchain in Health Care. https://link.springer.com/ chapter/10.1007/978-1-4842-3081-7_9 4. Linn, L.A., Koo, M.B.: Blockchain for Health Data and Its Potential Use in Health IT and Healthcare Related Research. https://www.healthit.gov/sites/default/files/11-74ablockchainforhealthcare.pdf 5. Pharmaceutical Commerce May 2017 Article on US drugs 2016 Sales. http:// pharmaceuticalcommerce.com/latest-news/us-drug-single-digit-growth 6. Gupta, U.C.: Journal on Informed Consent in Clinical Research: Re-visiting Few Concepts and Areas. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3601699/ 7. Benchoufi, M., Porcher, R., Ravaud, P.: Blockchain Protocols in Clinical Trials: Transparency and Traceability of Consent. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5676196/ 8. Nakamoto, S.: Bitcoin: A Peer-to-Peer Electronic Cash System—Bit coin.org. https://bitcoin. org/en/bitcoin-paper 9. Greenspan, G.: MultiChain Private Blockchain—White Paper. https://www.multichain.com/ download/MultiChain-White-Paper.pdf 10. Off-Chain Storage via Multichain. https://www.multichain.com/blog/2018/06/scalingblockchains-off-chain-data/ 11. Four Genuine Use Case of Blockchain. https://www.multichain.com/blog/2016/05/fourgenuine-blockchain-use-cases/ 12. Azaria, A., Ekblaw, A., Vieira, T., Lippman, A.: MedRec: Using Blockchain for Medical Data Access and Permission Management. https://ieeexplore.ieee.org/abstract/document/7573685? reload=true

Chapter 20

Quick Insight of Research Literature Using Topic Modeling Vrishali Chakkarwar and Sharvari C. Tamane

Abstract With the development in Information technology and advancement in education and research, a huge number of research publications and articles are generated every year. Crucial knowledge and information about innovative technology are embedded in these documents. It has become necessary to find a text analytics technique that gives quick insight into the research content from such enormous unstructured text data. Here we proposed information retrieval technique using topic modeling for taking quick outlook of the data. Latent Dirichlet is a generative probabilistic model which uses probability distribution of words in document to extract theme of documents. Topics generated show many hidden themes, correlated terminologies to main theme of documents which gives a quick overview of these documents. In this work Blockchain Technology, emergent technology publications are considered.

20.1 Introduction With the exponential growth of research publications in every domain, enormous research articles and papers are generated every year. Huge knowledge is embedded in these documents which are in unstructured text format. Different types of text analytics can be applied to reveal hidden knowledge in these documents. If any new researcher likes to do research he needs to download all publications in specific domain and go through it. It is a very tedious job. It necessitates quick insight into research trends through publications and articles. Automatic information retrieval from text documents has become a challenging task. This can be achieved by applying text analytic technique such as topic modeling. Blei [1] introduced Latent Dirichlet V. Chakkarwar (B) Government College of Engineering, Aurangabad, Maharashtra, India e-mail: [email protected] S. C. Tamane Jawaharlal Nehru Engineering College, Aurangabad, Maharashtra, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_20




Allocation (LDA), a statistical probabilistic method, to generate topics. Topics are distributions over a fixed vocabulary of words, and documents are distributions over topics. This methodology, when applied to a text corpus, generates topics which define the theme of the documents.

20.2 Literature Survey Text analytics techniques are applied in various fields to extract knowledge from word corpora. Chan et al. [2] proposed a text analytics method using topic modeling to study the correlation between genetic mutation tests and cancer types, as cancer has the highest mortality in the world and a great deal of research addresses its various aspects. Patients' details are stored in Electronic Health Records (EHR), which contain history, pathology reports, and treatment data; these EHR records are in unstructured form. Various mining algorithms have been applied to understand the clinical records, and topic modeling is applied to extract relevant topics from clinical notes. 5605 clinical notes are preprocessed using natural language preprocessing such as stop word removal, lemmatization, and stemming, and then topic modeling is applied. Topic modeling is an unsupervised statistical model that extracts topics from a word corpus; here Latent Dirichlet Allocation (LDA) with Gibbs sampling is used. Dimensionality reduction is applied using Principal Component Analysis (PCA). The results are correlated with genetic mutation tests, and strong correlations are observed between specific genetic mutation tests and cancer types. These results can be used for cancer prediction. O'Neill et al. [3] proposed a text analytics application for legislative text. It helps law practitioners to visualize and summarize British legislation and to extract the specific legal topics and related terms that are potentially associated with compliance across various documents. In this paper, many methods for topic modeling are discussed, such as Latent Semantic Indexing (LSI), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and the Hierarchical Dirichlet Process (HDP). United Kingdom legislative text is used, preprocessing with natural language processing is performed, and topic modeling is applied using Non-negative Matrix Factorization and LDA. It was observed that LDA gave the best results. Kolini and Janczewski [4] proposed clustering and topic modeling methods to analyze National Cyber Security (NCS) policies. This work examined 60 NCSs by applying machine learning approaches such as hierarchical clustering and topic modeling techniques. The results pinpointed that analytical methods such as LDA and clustering can be used during the analysis of qualitative data such as textual policies, strategies, and legislation to get a bigger picture and insights during the formulation of an NCS. This could be regarded as a complementary approach to assist policymakers in better identification of topics that are neglected or not covered appropriately. The overall similarity between NCSs was also observed.



20.3 Methodology 20.3.1 Preprocessing—Natural Language Processing The input data is text data with English-language properties. This data needs to be preprocessed before applying any clustering or machine learning algorithm. Typical preprocessing steps include stop word removal, stemming, lemmatization, and part-of-speech tagging. These functions are supported in the NLTK library [4–6]. Typical natural language processing steps are as follows, with a small sketch after this list. • Tokenization—Tokenization is the process of breaking text into pieces called tokens. Each token is referred to as a word. • Stop word removal—The most frequent words often do not carry much meaning. Examples: "the, a, of, for, in". Such words need to be removed; this is called stop word removal. • Stemming—English words like 'view' can be inflected with a morphological suffix to produce 'viewing' or 'viewed'. They share the same stem 'view'. • Verb removal—Verbs are helpful in describing actions but are not considered topics. Hence such verbs need to be removed.
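A minimal sketch of these steps with NLTK follows (the authors' exact pipeline and stop-word list are not given; the stopwords corpus download assumes an internet connection on first run, and verb removal via part-of-speech tagging is only indicated in a comment):

```python
from nltk import download
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

download("stopwords", quiet=True)           # one-time corpus download

tokenizer = RegexpTokenizer(r"[a-zA-Z]+")   # tokenization: keep alphabetic tokens only
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = tokenizer.tokenize(text.lower())
    tokens = [t for t in tokens if t not in stop_words]    # stop word removal
    # verb removal would additionally filter on nltk.pos_tag(tokens) here
    return [stemmer.stem(t) for t in tokens]                # stemming: viewing/viewed -> view

print(preprocess("Blockchain technology is viewed as a distributed ledger for secure transactions"))
```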

20.3.2 Topic Modeling Topic modeling algorithms are used to extract the underlying theme or topics of documents. Latent Dirichlet Allocation [3] is a statistical model-based technique. It can discover latent topics from a collection of documents using the probability distributions of the words that appear in these documents. It uses word-to-topic and topic-to-document distributions for generating the topics. A typical process of topic generation is as follows (Fig. 20.1).

Fig. 20.1 Topic modeling for text documents



Fig. 20.2 Graphical model for LDA

20.3.3 Latent Dirichlet Allocation Latent Dirichlet Allocation was introduced by Blei [1]. It is a Bayesian generative model using Dirichlet priors for topic mixtures. Figure 20.2 shows the graphical model of LDA: the probability of a word w depends on its topic assignment z, which in turn depends on the document's topic distribution θd drawn from the Dirichlet prior α; the word w also depends on the topic-to-word distribution governed by the prior β. For each document, θd is chosen from the Dirichlet prior α; then, for each word position, a topic is chosen according to θd, and a word w is generated given that topic z and β. A short sketch of the LDA generative process follows.
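The sketch below is a standard rendering of the LDA generative process, written as a small numpy sampler; it is our illustration rather than the authors' original listing, and α, β, the number of topics K, and the vocabulary size V are assumed inputs.

```python
import numpy as np

def generate_corpus(num_docs, doc_len, K, V, alpha=0.1, beta=0.01, seed=0):
    """Standard LDA generative process: phi_k ~ Dir(beta), theta_d ~ Dir(alpha),
    z ~ Multinomial(theta_d), w ~ Multinomial(phi_z)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)        # topic -> word distributions
    corpus = []
    for _ in range(num_docs):
        theta = rng.dirichlet([alpha] * K)         # document -> topic distribution
        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=theta)             # pick a topic for this word position
            words.append(rng.choice(V, p=phi[z]))  # pick a word from that topic
        corpus.append(words)
    return corpus, phi

docs, topics = generate_corpus(num_docs=5, doc_len=20, K=3, V=50)
```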

LDA assumes that documents exhibit multiple topics. Topics are simply distributions over a fixed vocabulary, and the number of topics is specified in advance. The sketch above explains a simple form of the generative process [3, 7]. Figure 20.3 shows topic generation using probability distributions. This is, in brief, LDA.

20.4 System Development Due to the advancement of education and research, an enormous number of conferences are conducted every year. Research content is published in journals, conferences, and books, and millions of publications are generated in various domains. Knowledge is expanding exponentially in unstructured text format, as this research work and these studies are deposited in the form of publications. There is an immense need to automatically discover current trends, topics, or patterns from research documents that can be used by a



Fig. 20.3 Process of topics generation from text data

new researcher to get an overview of different research trends. Topic modeling is a text mining algorithm that extracts the underlying theme of documents [8]. Problem definition. The computer science field has various domains such as data science, computer vision, Big Data analytics [9], high-performance computing, and artificial intelligence. Enormous numbers of publications and much literature are available on the internet, but at a glance it is very difficult to find the current trends in each field and their correlation with other fields. What is needed is a text analytics technique which gains a broad understanding of an entire dataset and explores trends in a specific domain [10]. Data set. In this work, the abstracts of 550 IEEE publications (conferences as well as journals) in the area of blockchain technology are used. Blockchain technology is an emerging field for secure transactions on distributed networks using cryptography; it is related to cryptocurrency and is used in distributed ledgers, and this emergent technology can be applied in various domains. The abstracts of the 550 publications are considered as the data set. This data is in unstructured text format and forms our word corpus. Different natural language processing steps are applied as preprocessing, and then LDA topic modeling is applied. Here we propose a method to extract current topics from text documents using LDA [11]. The detailed steps in the proposed system are as follows. Preprocessing. Natural language processing (NLP) is a technique that enables the computer to understand the English language [5]. It is used to interpret the text and transform it into a preprocessed form that can be used as input to the machine learning algorithm, LDA. The NLTK toolkit was used for this preprocessing. After applying all the steps discussed above in Sect. 20.3.1, the normalized text is generated. This text is used to generate a bag of words, which is nothing but a term frequency matrix. Different mathematical models can be applied to this matrix; here, the Latent Dirichlet Allocation (LDA) method for topic modeling is applied to the bag of words.



Topic Modeling. Latent Dirichlet Allocation (LDA) is an unsupervised, statistical approach to document modeling that discovers latent semantic topics in large collections of text documents [12, 13]. LDA considers that words carry strong semantic information, and documents discussing similar topics use a similar group of words. Latent topics are thus discovered by identifying groups of words in the corpus that frequently occur together.
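As a hedged sketch of this step (the chapter does not name its implementation library; gensim is assumed here, and the token lists shown are tiny placeholders for the preprocessed abstracts):

```python
from gensim import corpora, models

# `token_lists` stands in for the output of the preprocessing step:
# one list of normalized tokens per abstract.
token_lists = [["blockchain", "ledger", "transaction", "secure"],
               ["iot", "device", "blockchain", "cloud", "security"],
               ["bitcoin", "currency", "transaction", "payment"]]

dictionary = corpora.Dictionary(token_lists)                  # vocabulary (bag of words)
bow_corpus = [dictionary.doc2bow(doc) for doc in token_lists]  # term-frequency vectors

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```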

20.5 Experimental Results Here topic modeling is applied to 550 abstracts on blockchain technology, an emerging field that has applications in many areas. A new researcher needs to read many publications on the related technology, which is a very time-consuming task; topic modeling applied to this data gives a brief overview of all topics related to blockchain technology. Latent Dirichlet Allocation generates topics based on word frequency. After preprocessing, a document-term matrix is created, which is also called a bag of words, and LDA is applied to this bag of words to generate the topics using the probability distribution of words [7, 14]. Table 20.1 shows 10 topics, each with its 20 correlated words. These results were obtained after applying topic modeling to the abstracts of research publications related to blockchain technology; the objective was to identify various themes related to blockchain technology, and topic modeling enabled us to get a quick idea of this area. Topic 0 shows that many documents discuss technology, challenges, proposed systems, and applications. Topic 1 shows that blockchain technology has been used in IoT, mobile technology, smart systems, cloud security, and distributed networks. Topic 2 indicates its relation to cryptocurrency, bitcoin, financial transactions, and financial ledger management in distributed systems; Topics 2 and 3 are similar. Topic 4 shows applications related to vehicles, grid systems, smart systems, electricity generation, power mechanisms, and trading. Topic 5 indicates that it is used in the healthcare and medical domain for providing security and authentication of digital information. Topic 6 shows its applications in cloud systems and services to provide user authentication and integrity. Topic 7 indicates that this technology is used in financial transactions in terms of digital money and bitcoin. The other topics likewise represent crucial information. This application gives a broad view of a particular domain and its correlated keywords or topics.



Table 20.1 Experimental results for 10 topics

Topic 0: Blockchain, abstract, technology, paper, proposed, applications, propose, based, technologies, transactions, research, provide, decentralized, current, challenges, users, results, blockchains, information, potential

Topic 1: Data, blockchain, internet, iot, computing, things, security, cloud, devices, peer, based, architecture, keywords, mobile, technology, systems, management, distributed, network, smart

Topic 2: Blockchain, technology, peer, bitcoin, keywords, distributed, data, processing, electronic, ledger, management, computing, consensus, cryptography, systems, financial, supply, contracts, system, information

Topic 3: Blockchain, technology, peer, bitcoin, keywords, distributed, data, processing, electronic, ledger, management, computing, consensus, cryptography, systems, financial, supply, contracts, system, information

Topic 4: Energy, blockchain, peer, power, person, smart, based, system, security, vehicles, distributed, electricity, market, communication, trading, generation, systems, vehicle, grid, mechanism

Topic 5: Data, blockchain, medical, health, key, digital, management, records, system, healthcare, technology, public, information, scheme, identity, security, signature, based, authentication, privacy

Topic 6: Data, privacy, blockchain, access, cloud, storage, control, user, based, system, secure, sharing, integrity, scheme, authentication, computing, users, identity, management, service

Topic 7: Bitcoin, currency, financial, transaction, digital, money, cryptocurrency, analysis, exchange, crypto, transactions, block, mining, network, payment, learning, intelligence, currencies, wallet, credit

Topic 8: Smart, contracts, contract, blockchain, ethereum, software, platform, business, execution, based, engineering, keywords, city, services, home, model, framework, cities, language, process

Topic 9: Blockchain, systems, based, cyber, system, framework, security, architecture, software, decision, chain, data, block chains, performance, making, analysis, quality, model, trust, physical

20.6 Conclusion The literature review is critical and important for understanding current research in any domain. When masses of publications are available, a text mining technique which gives insight into all topics is a feasible solution. This type of text analytics reveals insights from a huge research literature, and many interrelated areas become visible through these analytics. Here a small data set is used: topic modeling with LDA applied to this dataset discovered many relevant topics in this domain, and many interrelated themes with respect to this domain have been discovered. The interrelated topics show a further path for research. The results indicate that LDA is an effective topic modeling algorithm



for generating the context of a document collection, as discussed above. This method can be applied to various fields, such as health informatics, economic growth, legal, political, and medical science literature, to reveal hidden themes in those domains. The objective of extracting the targeted information from a mass of literature using text mining is achieved successfully. Acknowledgements We would like to acknowledge the Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, India for supporting research facilities. We would also like to thank the Head of the Computer Science and Engineering Department and the Principal, Govt. College of Engineering, Aurangabad, Maharashtra, India for their valuable guidance on different aspects of this paper.

References 1. Blei, D.M., Andrew, Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003) 2. Chan, K.R., Lou, X., Karaletsos, T., Crosbie, C., Gardos, S., Artz, D., Rätsch, G.: An empirical analysis of topic modeling for mining cancer clinical notes. In: 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 56–63 (2013). https://doi.org/10.1109/icdmw. 2013.91 3. O’Neill, J., Robin, C., O’Brien, L, Buitelaar, P.: An analysis of topic modeling for legislative texts. In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), London, UK (2017) 4. Kolini, F., Janczewski, L.: Clustering and topic modeling: a new approach for analysis of national cyber security strategies. In: Twenty First Pacific Asia Conference on Information Systems, Langkawi (2017) 5. Sun, S., Luo, C., Chen, J.: A review of natural language processing techniques for opinion mining systems. Inf. Fusion 36, 10–15 (2017). Elsevier 6. Phand, S.A., Chakkarwar, V.A.: Enhanced sentiment classification using geo location tweets. In: ICICCT 2018, pp 881–886, IISC Banglore, India (2018) 7. Chen, H., Xie, L., Leung, C.-C., Lu, X., Ma, B., Li, H.: Modeling latent topics and temporal distance for story segmentation of broadcast news. IEEE/ACM Trans. Audio, Speech, Lang. Process. 25(1), 112–123(2017) 8. Uys, J.W., du Preez, N.D., Uys, E.W.: Leveraging unstructured information using topic modeling. In: PICMET 2008 Proceedings, pp. 955–961, 27–31 July, Cape Town, South Africa (c) (2008) 9. Tamane, S.C.: Text analytics for big data. Int. J. Mod. Trends Eng. Res. 02(03) (2015). ISSN: 2349-9745, p-ISSN: 2393-8161 10. ElShal, S., Mathad, M., Simm, J., Davis, J., Moreau, Y.: Topic modeling of biomedical text. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Shenzhen, China (2016). https://doi.org/10.1109/bibm.2016.7822606 11. Gao, Y., Xu, Y., Li, Y.: Pattern based topics for document modeling in information filtering. IEEE Trans. Knowl. Data Eng. 27(6) (2015) 12. Ko, N., Jeong, B., Choi, S., Yoon, J.: Identifying product opportunities using social media mining: application of topic modeling and chance discovery theory, pp. 2169–3536 © 2017 IEEE (2017). https://doi.org/10.1109/access



13. Chien, J.-T.: Hierarchical theme and topic modeling. IEEE Trans. Neural Netw. Learn. Syst. 27(3), 565–578, (2016) 14. Bulut, A.: TopicMachine: conversion prediction in search advertising using latent topic models. IEEE Trans. Knowl. Data Eng. 26(11) (2014)

Chapter 21

Secure Cloud-Based E-Healthcare System Using Ciphertext-Policy Identity-Based Encryption (CP-IBE) Dipa D. Dharamadhikari and Sharvari C. Tamane

Abstract Healthcare organizations are adopting Electronic Health Record (EHR) for better and fast services. Due to flexibility, security, and efficiency, cloud data storage has become a frequent alternative for deploying EHR systems. Several cryptography techniques provide security to cloud user’s data by encrypting and decrypting data for intended users. The present paper proposes Ciphertext-Policy Identitybased Encryption (CP-IBE) as a cryptographic technique for cloud data. CP-IBE is combination of data encryption and identity-based approach. It is enhanced by integrating CP-IBE algorithm with Elliptic Curve Cryptography (ECC) that gives a novel scheme for public-key cryptography known as Ciphertext-Policy Identitybased Elliptic Curve Cryptography (CP-IBE-ECC). Torsion point concept is used in the Elliptic Curve Cryptography (ECC) system. An elliptic curve is the formation of curve under an algebraic set of coordinates which does not intersect with other points during curve formation. This merit makes us define a new security model in healthcare applications. None of the studies has explored the merits of applying torsion points on key distribution system. It provides dynamic data support, efficiency, security as well as privacy for E-Healthcare System. This system focus on storing and maintaining the Health Records of patients electronically so refer these records as EHR. The EHRs are migrated to the cloud data centers in order to prevent medical errors and to have efficient storage and access to those records. IBE encryption technique is also applied in order to make EHRs more secure. The Shared Nearest Neighbor (SNN) clustering is applied to resolve the issues with attribute clustering. SNN works on estimating the nearest neighbors for the most shared attributes and then clusters the records.

D. D. Dharamadhikari (B) Department of Computer Science and Engineering, Marathwada Institute of Technology, Aurangabad, Maharashtra, India e-mail: [email protected] S. C. Tamane Department of Information Technology, Jawaharlal Nehru Engineering College, Aurangabad, Maharashtra, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_21




21.1 Introduction 21.1.1 An Overview Nowadays, abounding information obtains in all fields of real life, which is treated as a kind of increasingly important resource. The massive data storage has impacts on developing a secured framework. The massive accession information is often collected by the service providers and recorded as assets stored in cloud systems. The storage of such massive data is of great significance for the quality of service of computing systems and poses many challenges to storage service providers. Recently, cloud storage is getting more and more competitive and popular, which is also a trend of future storage techniques. Healthcare application is used for our proposed research. Recently, healthcare application gets attracted by the advents of cloud technologies. The main task of the healthcare application is to take care of their patient records, i.e., Electronic Health Records (EHR). This EHR is migrated to the cloud data centers in order to prevent medical errors. Therefore, privacy and security are the two important parameters that should be effectively investigated. An IdentityBased Encryption (IBE) scheme is a public-key cryptosystem where any string is a valid public key. It is a public-key encryption scheme in which any arbitrary string can be considered as a public key [1]. The motivation behind selecting this encryption for developing novel schemes is to help the deployment of a publickey infrastructure. It can manage a large number of public keys efficiently and it also makes the system simple by taking simple public keys from users. Senders can send mail to a recipient who is not having a public key. With this technique, the recipient’s certificate is not required for sending the e-mail. Also, sending an e-mail can be made time-bound and the system can refresh the private key of the recipient after equal intervals of time [2, 3]. Elliptic-curve cryptography (ECC) is an approach to public-key cryptography which is based on the algebraic structure of elliptic curves [4]. It can be applicable for encryption by combining the symmetric encryption scheme with the key agreement. Torsion point [5] deals with the security criteria. The proposed scheme considers the torsion point for deciding the security parameter for finite groups present in the healthcare system. It is applied in pairing for a cryptographic system designed for maintaining EHRs. The keys are distributed systematically by initializing the security attributes, i.e., identities of the users and picking elliptic points, that is, base point and torsion point on elliptic curves (a, b). Initially, the private key is generated. Then a master secret key is generated which will be used for encryption. This encrypted data can be sent through cloud environment and received by users. Then the decryption process can be applied at the receiver side by using the generated keys. This is how security is provided to this E-Healthcare system and records of patient EHRs are managed very efficiently. The proposed scheme focuses on developing an efficient storage scheme by creating efficient storage database schema and supporting dynamic data. It also focuses on developing a ciphertext database scheme for proposing a security-preserving



approach to ensure security and integrity of Electronic Health Records as well as proposing a privacy preserving approach to ensure the privacy of the personal health records. The performance of the system is achieved for the parameters: dynamic data support, efficiency, and security and privacy.
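The CP-IBE-ECC construction with pairings over torsion points is not part of any standard library API; purely as a hedged illustration of the key-generation step described above, elliptic-curve private keys for the authority (master key) and a user can be produced with the cryptography package as follows, where the curve choice and key roles are our assumptions:

```python
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives import hashes, serialization

# Master secret key held by the key-generation authority (assumed role)
master_key = ec.generate_private_key(ec.SECP256R1())

# Per-user private key; in the chapter's scheme it would be derived from the
# user's identity string via the CP-IBE-ECC construction, which is not shown here.
user_key = ec.generate_private_key(ec.SECP256R1())

public_pem = user_key.public_key().public_bytes(
    serialization.Encoding.PEM,
    serialization.PublicFormat.SubjectPublicKeyInfo,
)

# Example: the authority signs the user's public key so hospitals can verify it
signature = master_key.sign(public_pem, ec.ECDSA(hashes.SHA256()))
print(public_pem.decode())
```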

21.2 Related Work Cloud computing refers to a computing environment which provides communications-related services for network-based resources. With this environment, the user does not have to bother about the purchase storage or maintenance of the infrastructure or availability of own database. With this promising technology, one can access a shared pool of computing resources which provides unlimited services as well as functionality. It can also provide efficient, secure storage, and scalable information sharing with resilient Internet computation. Cloud computing deals with the way we design, build, deploy, and run the various applications which operate in the virtualized environment with the ability to dynamically grow, self heal, and share resources among a number of users [6]. Emphasis is made on various deficiencies related to security and efficiency of public data over cloud. The researchers [7] have devised the public data integrity auditing scheme which can handle shared dynamic data efficiently. This scheme was efficient for group user revocation based on vector commitment with more security and also verifies the correctness of the local revocation group signature. This scheme achieves efficiency, confidentiality, traceability, and also the accountability with more secure group user revocation. The novel scheme was designed for achieving the correctness of the outsourced data as well as applying more security levels for it [8]. But there is a drawback of this scheme that it was not able to maintain the confidentiality of the user data. An adaptive database schema design method [9] has been proposed for multitenant applications. It swaps Independent Tables Shared Instances (ITSI) and Shared Tables and Shared database Instances (STSI). This design discovered the balance between them in order to achieve high performance and good scalability. Base tables were created after selecting the significant attributes. Supplementary tables were created for less important attributes. The graph partitioning algorithm was applied in order to construct the base tables. The importance of all the attributes is calculated by applying page ranking algorithm. With this model, properties like high scalability, high performance, and less storage are achieved. Clustering is a very important technique used for the extraction of useful knowledge from the data. Class labels are not required in prior. So it is an unsupervised technique. The SNN is a clustering algorithm [10] that can form clusters of different shapes, sizes, and densities. It is the variant of the k-nearest neighbor algorithm. Attribute Based Encryption (ABE) is a cryptographic technique used in order to achieve fine-grained data access control [11]. It defines the access policies based on different attributes. The Ciphertext-Policy Attribute-based Encryption (CP-ABE)


defines a set of attributes which can be used for encryption and decryption [12]. The draw-back of this scheme is high resource costs. ECC is susceptible to the problem of discrete Logarithm [13]. A generator function with a number of parameters was used to construct the elliptic curve. Various attacks can be handled with this new and secure mechanism—Elliptic Curve Discrete Logarithmic Problem. The session keys are exchanged with Elliptic Curve Diffie– Hellman which provides more security while exchanging the keys. Also the session keys are distinct. The Elliptic Curve Diffie–Hellman technique made the communication between two parties very secure for the audio messages and increase the immunity against various attacks. In order to protect the privacy of the receiver and identify the receiver, Anonymous Identity-Based Encryption was applied [14]. Anonymous Identity-based Encryption with identity recovery is an anonymous IBE with characteristics of identity recovery. Identity Recovery Manager has a secret key to recover the identity of the receiver from the ciphertext. Standard encryption techniques are not suitable for the EHR system with a cloud environment. Symmetric-Key Cryptography is efficient but introduces complexity in EHR systems causing requirement of a mechanism for access control. Normally, all healthcare providers use one common shared key for both encryption and decryption. It is more vulnerable when the common key is compromised. Also, there are certain problems with the public-key cryptography technique. It is a secure method but it is difficult to operate in the huge infrastructure of a cloud environment. This system was used to maintain the EHR. Due to these inadequacies with traditional encryption, the paper builds on the proposed scheme, Ciphertext-Policy Attribute-based Encryption (CP-ABE) approach [15]. In present scenario [16] an efficient and secure system is required but the security parameters like efficiency, confidentiality, integrity privacy along with accountability & traceability are still untouched against any issue.

21.3 Proposed Methodology

The proposed storage-oriented framework is shown in Fig. 21.1.

21.3.1 Developing an Efficient Data Storage Scheme In the Efficient Data Storage Scheme, the Database as a Service (DBaaS), relational e-health cloud is processed. The main three challenges are efficient multi-tenancy, scalability, and database privacy. Figure 21.1 shows the structure of multi-tenancy architecture. Multi-tenancy is an architecture where a single instance of the software system serves multiple cloud clients. Here, each cloud client is referred to as a tenant. In order to support many cloud clients, the multi-tenant database necessitates


Fig. 21.1 Proposed framework: the cloud storage system is devised with an efficient data storage scheme in which attributes are arranged using a Shared Nearest Neighbour (SNN) database schema; before outsourcing, the database is encrypted with the proposed Enhanced Ciphertext-Policy Identity-Based Encryption, whose enhancement is the use of a torsion point (in addition to the base point) in Elliptic Curve Cryptography as the key distribution management system; the encrypted database is then outsourced, devising the storage system of the healthcare system towards the cloud

for excellent performance, lessened storage space, and good scalability. The biggest challenge in cloud storage is the formulation of effective database schema. The service provider should take responsibility for supporting dynamic data to efficiently manage the data. Generally, the redesign data in multi-tenant databases is known as “physical tables”. A database with group of tables (T1 …Tk ) will be maintained by the tenants. Each table is referred to as “source table”. The proposed scheme shows how to effectively and dynamically design high-quality physical tables. In real-time applications, the outsourced data is sensitive information and topic-related, containing the details about diseases and their causes. It also contains inpatient’s and


outpatient’s data, the availability of doctors in both normal and emergency environment, and the availability of medicines. From these observations, the attributes considered in designing database schema are divided into two types, viz., significant attributes and non-significant attributes. Significant attributes are used as “classic tables (highly significant attributes)” and “ground tables (less significant attributes)”, and non-significant attributes are used as “auxiliary tables” (insignificant attributes). Proposed dynamic model of building classic tables, ground tables and auxiliary tables are based on queries from different tenants (Fig. 21.2). As data generated in the e-health system is tremendous in nature, the support of dynamic data is required. In order to support the dynamic data, the attributes should be effectively clustered. The Shared Nearest Neighbor (SNN) clustering is deployed. Though various clustering algorithms persist, SNN supports the clustering process in terms of dynamic size, shape, and density. SNN works on estimating the nearest neighbors for the most shared attributes and then clusters the records. The importance of each attribute is assessed. SNN emerges from the base of k-nearest neighbor algorithm [17]. The SNN steps are as follows: 1. Let the input data be in P * Q matrix where P is the number of records and Q is the number of tenants

Fig. 21.2 Multi-tenant cloud architecture (cloud users send queries and data over the Internet to nodes in data centers)


2. Initialize k = number of nearest neighbors for each record; Eps is the density threshold and Minpts is the minimal density value
3. Apply clustering on the tenants
4. Compute the SNN density, Minpts, from the number of nearest neighbors that share frequently used attributes
5. Pick the core points from the Minpts
6. Construct the clusters.

The selection of the k value depends on the number of source tables. Thus, the k value should be balanced between a single cluster and the maximum number of clusters. This task is initiated to cover the following challenges:
• To find the significance level of attributes placed in different tables.
• To dynamically select the new attributes and redesign the table structure.
• To study the operational cost of EHR for basic operations like select, insert, update, and delete.
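As a concrete illustration of the SNN steps listed above, the hedged sketch below builds a shared-nearest-neighbour similarity from the k-nearest-neighbour lists and then groups records with a density-based pass. The use of scikit-learn, the toy 0/1 attribute matrix, and the parameter values are assumptions made for the example, not part of the proposed system.

```python
# Hedged sketch of SNN-style clustering over an attribute-usage matrix
# (records x attributes). Library choice and parameters are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def snn_clusters(X, k=5, min_shared=3, min_pts=3):
    # Step 2: k nearest neighbours for every record.
    _, knn = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    n = len(X)
    # SNN similarity: how many neighbours two records share.
    shared = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            shared[i, j] = len(set(knn[i]) & set(knn[j]))
    # Convert similarity to a distance and run a density-based pass
    # (Eps and Minpts thresholds) to pick core points and build clusters.
    distance = k - shared
    return DBSCAN(eps=k - min_shared, min_samples=min_pts,
                  metric="precomputed").fit_predict(distance)

if __name__ == "__main__":
    X = np.random.randint(0, 2, size=(30, 10))   # P records x Q attribute columns
    print(snn_clusters(X))
```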

21.3.2 Developing a Ciphertext Database Scheme The efficiently stored records of patients are then shared to the cloud servers and this process is known as outsourcing the databases. Before outsourcing the records, the database should be encrypted and then outsourced to the cloud servers. Since the EHRs can interoperate with other systems in the health information management systems, the records need to be encrypted before uploading them to the cloud servers. Once the records are uploaded to the cloud, the patient loses their access control over the cloud services. So the issues like information leakage, user authentication, key management, and revocation handling are required to be addressed here. The main intention is to provide encrypted databases and data access control via patient-centric models.

21.3.2.1 Designing Encrypted Databases

The system model for the outsourced databases is shown in (Fig. 21.3). The EHR system consists of multiple data owner/patient who will encrypt the personal records before uploading to the cloud servers. It is not an easy task to develop secured databases over Multiple Authorities Multiple Owners (MAMO) cloud environments. In this work, multiple authorities refer to staff members and other non-technicians whereas multiple owners refer to the owner of the medical records. An improved Ciphertext Policy Based Identity-based Encryption is proposed to secure the outsourced data even in case of untrusted cloud environment.


Fig. 21.3 System model for outsourced databases (the data is encrypted and stored through the cloud application server; the cloud database service provider stores the data, answers queries, and returns the results)

The premises considered while devising the Enhanced Ciphertext Policy Based Identity Encryption model are data lineage, data leakage, access control policy, and local–global authority for proper data accessing systems. Once the files are uploaded by the data owner, it should be arranged properly for easy file search and retrieval process. Therefore, the access structure plays a vital role in the CP-IBE systems. Encryption keys are the keys used for protecting data access from unauthorized users. The path of the encrypted file is known only to the storage servers which in turn provide access only from the main server. File insertion is the core process applied over the main server. Once the user authorization is verified, the file is inserted randomly for every one of its cloud users. The file key becomes hidden and safe. And also the user account is kept confidential in the database table. The path of the encrypted file from the storage server is found by using the user account name and the hash table input for the requested file. Thus, the encryption model plays a vital role between the storage server and the main server systems. Access structure is the field of security systems. It is used to retrieve resources from the coordination of multiple parties. Let us consider an access structure with different sets of parties (P1 , P2 …Pm ). For the proposed scheme, it is restricted to a monotone access structure. The parties contain two sets, namely, authorized sets and unauthorized sets. The proposed algorithm, CP-IBE-based on ECC scheme, consists of the following steps: (i) Setup a. In this phase, the setup algorithm adopts security parameter p (any attributes) and the set of identities I = {I1 , I2 …In ) as inputs. The proposed ECC algorithm is as follows. b. Pick an elliptic curve group G = {p, Ep (a, b), T) where p is the base point, T is the torsion point (selection of prime numbers over a number field) on elliptic curve Ep (a, b) over finite field Zp


c. Select three random private keys α, k_1, and k_2 from the finite field Z_p. Then compute
P_i = α^i P, Q_i = k_1 α^i P, R_i = k_2 α^i P, for all i = 1 to n, where n is the number of users.
d. Pick one-way collision-resistant hash functions H_1, H_2, and H_3 using complex multiplication:
H_1: {0, 1}^(a+bi) → F*_p
H_2: {0, 1}^(a+bi) → {0, 1}^(lσ)
H_3: {0, 1}^(a+bi) → {0, 1}^(lm)
where lσ is the length of the arbitrary string under the security parameter and lm is the length of the plaintext message M.
e. Output: Master Secret Key (MSK): α, k_1, and k_2; Master Public Key (MPK): (G, P_i, Q_i, R_i, H_1, H_2, and H_3)

(ii) Encrypt phase
Input: access policy P, Master Public Key, and plaintext message M
Output: ciphertext C
a. Choose an arbitrary number σ_m ∈ {0, 1}^(lσ) and compute r_m = H_1(P, M, σ_m) and K_m = KDF(r_m, P)
b. Assume the access policy P = (p_1, p_2, …, p_m); then f(x, P) = ∏(x + H_4(i))^(1−p)
c. Let f_i denote the coefficients of x^i in the polynomial f(x, P). Then the ciphertext is generated as
P_m,i = r_m P_i, for i = 1, …, n − |P|, and C_m = H_3(σ_m) ⊗ M.
Hence, the ciphertext C = {P, P_m,i, K1_m, K2_m, C_α, C_m}.

(iii) Private key generation phase
Input: user's identities I, Master Public Key (MPK), and Master Secret Key (MSK)
Output: user's secret key
Steps:
a. Let I = i_1, i_2, …, i_n; then the secret key is generated as follows, where f(x, A) is an n-degree polynomial with complex function Z_p(x)
b. Select two random numbers r_u and t_u and estimate
u_1 = r_u + k_1 t_u (mod p)
u_2 = s_u − k_2 t_u (mod p)
Output: the secret key k_u = (u_1, u_2).


(iv) Decrypt phase
Input: secret key k_u = (u_1, u_2), identity set I, and ciphertext C = {P, P_m,i, K1_m, K2_m, C_α, C_m}
Steps:
a. Compute
U = u_2 K1_m = (s_u − k_2 t_u)(r_m k_1 f(α, P)) P
V = u_1 K2_m = (r_u + k_1 t_u)(r_m k_2 f(α, P)) P
U + V = (s_u − k_2 t_u)(r_m k_1 f(α, P)) P + (r_u + k_1 t_u)(r_m k_2 f(α, P)) P
b. Replace C_i = a_i − b_i, where i = 1, 2, …, n. Let f(x, A, P) be the polynomial of degree n − |P|, f(x) = F(x, A, P) = ∏(x + H_3(i))^c, where f_i is the coefficient of the polynomial f(x).
Verify whether the condition r_m P = r'_m P holds by computing M' = C_m ⊗ H_3(σ'_m) and r'_m = H_1(P, M', σ'_m). If it holds, treat M' as the original plaintext M; otherwise, the output is null.
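To make the setup step more tangible, the hedged sketch below shows only the generic idea of deriving public parameters of the form P_i = α^i·P by elliptic-curve scalar multiplication. The toy curve y² = x³ + 2x + 3 over F_97, the base point (3, 6), and the parameter values are illustrative assumptions; it does not implement the pairing- or torsion-point-specific parts of the proposed CP-IBE scheme.

```python
# Toy illustration of deriving P_i = (alpha^i) * P on an elliptic curve.
# Curve, field size, and base point are tiny demo values, not secure choices.
P_MOD, A, B = 97, 2, 3            # y^2 = x^3 + 2x + 3 over F_97
BASE = (3, 6)                     # 6^2 = 36 = 3^3 + 2*3 + 3 (mod 97)

def _inv(x):
    return pow(x, P_MOD - 2, P_MOD)          # modular inverse (Fermat)

def point_add(p, q):
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                           # point at infinity
    if p == q:
        lam = (3 * x1 * x1 + A) * _inv(2 * y1) % P_MOD
    else:
        lam = (y2 - y1) * _inv(x2 - x1) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def scalar_mult(k, p):
    result, addend = None, p                  # double-and-add
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result

alpha, n_users = 5, 4                         # toy master secret and user count
public_params = [scalar_mult(pow(alpha, i), BASE) for i in range(1, n_users + 1)]
print(public_params)                          # P_1 ... P_n of the toy setup
```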

21.4 Conclusion Cloud data storage has become a frequent alternative for deploying EHR systems. This scheme focuses on maintaining the Electronic Health Records of all the patients taken care of by the Healthcare System. The Shared Nearest Neighbor clustering resolves the issues with attribute clustering which helps in finding significant attributes for this model. SNN works on estimating the nearest neighbors for most shared attributes and then clusters the records. With this proposed CP-IBE Scheme, EHRs become more secure by encrypting and decrypting data for the intended users and storing on the cloud. Elliptic Curve Cryptography technique is applied by defining a new security model for healthcare applications. The EHRs are migrated to the cloud data centers in order to prevent medical errors and to have efficient storage and access to those records. CP-IBE encryption technique is also applied in order to make EHRs more secure. The encryption and decryption have tolerable impact on average response time for accessing and updating record on cloud.

References 1. Boneh, D., Franklin, M.: Identity based encryption from the Weil pairing. J. Comput. 32(3), 586–615 (2003)


2. Wang, Q., Li, F., Wang, H.: An anonymous multireceiver with online/offline identity-based encryption. In: Hindawi Wireless Communications and Mobile Computing, pp. 1–18 (2018) 3. Hu, Z., Liu, S., Liu, J.: Revocable identity-based encryption and server-aided revocable IBE from the computational Diffie-Hellman assumption. Cryptogr. MDPI 2, 1–18 (2018) 4. Ding, S., Li, C., Li, H.: A novel efficient pairing-free CP-ABE based on elliptic curve cryptography for IoT. Special Section on Security and Trusted Computing for Industrial, IEEE Access 6, 27336–27345 (2018) 5. Clark, P.L., Cook, B., Stankewic, J.: Torsion Points on Elliptic Curves with Complex Multiplication (2013) 6. Lee-Post, A., Pakath, R.: Cloud Computing: A Comprehensive Introduction. Information Science Reference (an imprint of IGI Global), pp. 1–23 (2014) 7. Jiang, T., Chen, X., Ma, J.: Public integrity auditing for shared dynamic cloud data with group user revocation. IEEE Trans. Comput. 65(8), 2363–2373 (2015) 8. Wang, J., Chen, X.: Verifiable auditing for outsourced database in cloud computing. IEEE Trans. Comput. 64, 3293–3303 (2015) 9. Ni, J., Li, G., et. al.: Adaptive database schema design for multi-tenant data management. IEEE Trans. Knowl. Data Eng. 26(9), 2079–2093 (2013) 10. Licenciado, Doutor João Carlos Gomes Moura Pires.: Implementation for Spatial Data of the Shared Nearest Neighbour with Metric, pp. 1–71 (2012) 11. Goyal, V., Jain, A., Pandey, O., Sahai, A.: Bounded ciphertext policy attribute based encryption. In: ICALP Part II, LNCS, vol. 5126, pp. 579–591. Springer, Berlin, Heidelberg (2008) 12. Li, Q., Zhu, H., Ying, Z., Zhang, T.: Traceable ciphertext-policy attribute-based encryption with verifiable outsourced decryption in eHealth cloud. Hindawi Wirel. Commun. Mob. Comput. 2018(2018), 1–12 (2018) 13. Abdullah, K.E., Ali, N.H.M.: Security improvement in elliptic curve cryptography. Int. J. Adv. Comput. Sci. Appl. 9(5), 122–131 (2018) 14. Ma, X., Wang, X., Lin, D.: Anonymous Identity-Based Encryption with. In: ACISP, pp. 1–10 (2018) 15. Bethencourt, J., Sahai, A., Waters, B.: Ciphertext-policy attribute-based encryption. In: IEEE Symposium on Security and Privacy, pp. 110–122 (2007) 16. Dharmadhikari, D., Tamne, S.: Public Auditing Schemes (PAS) for Dynamic Data in Cloud: A Review. In: SmartCom 2017, CCIS 876, pp. 186–191. Springer (2018) 17. Naveen Bail, B., Mary Sowjana, A.: An incremental shared nearest neighbour clustering approach for numerical data using an efficient distance measure, Vishakhapatnam. Int. J. Adv. Trends Comput. Sci. Eng. 4(9), pp. 14192–14196 (2015)

Chapter 22

Security Vulnerabilities of OpenStack Cloud and Security Assessment Using Different Software Tools Manisha P. Bharati and Sharvari C. Tamane

Abstract New security challenges are raised because of cloud computing when contrasted with customary on-start as a result of its multi-occupant virtual condition on each cloud layer, namely Platform as a Service—PaaS, Infrastructure as a Service—IaaS, or Software as a Service—SaaS. Open clouds are utilizing restrictive cloud programming and security is generally kept up by issuing organizations. Security remains a concern for private clouds. Numerous components influence the cloud mis-configuration and integrity that could emerge on the grounds that security is kept up by an outsider. The target of this investigation is to inspect the territory of OpenStack cloud specifically. This will give a more noteworthy comprehension of in what way cloud computing capacities and any kinds of issues of security emerge in that. The investigation comprises three sections; in the primary section, the foundation of cloud computing and OpenStack is described. In the second section, OpenStack architecture is described. In the third section, known vulnerability exploitation and mitigation strategies are presented along with an assessment of various vulnerabilities in OpenStack is conducted utilizing top security scanners namely Metasploit and OpenVAS in an attempt to finding new vulnerabilities.

22.1 Introduction Earlier, Snort was a popular open source IDS as well as IPS was used in networks to spot any arriving attacks from any source and alert the network administrator about this brute-force attack [1]. Nowadays, there are numerous challenges in the cloud-based environment [2]. OpenStack is defined as an open-source distributed computing programming platform which is absolutely free [3]. Clients principally look at the infrastructure as an administration arrangement. Right now, OpenStack M. P. Bharati (B) SPPU Pune, Pune, Maharashtra, India e-mail: [email protected] S. C. Tamane BAMU Aurangabad, Aurangabad, Maharashtra, India © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_22


has released 18 versions so far, and elements of the framework have been improving continuously since its inception. As an open-source cloud platform its security level is comparatively high, yet bugs and vulnerabilities remain one of the most important concerns that need to be considered. The purpose of the OpenStack project is to create a global standard and, at the same time, a software model aimed at further development of cloud solutions, serving cloud providers as well as end users. The constant demand for computing power, practically unlimited storage space, rapid access to user data, and the ability to do all of this from any location in the world has pushed cloud software developers to add more software components with every release. OpenStack's initial version, known as Austin, was introduced in October 2010 and had just two modules, namely Swift and Nova. Recently, OpenStack announced the eighteenth version of its product, known as "OpenStack Rocky", which brings dozens of enhancements for operators, driven by real-world use cases and user feedback. From a security point of view, this implies many more possible ways to compromise the security of the cloud. With such a large number of potential weaknesses, a security evaluation of each cloud version, or of any basic structural modification of the cloud, is an absolute must.

22.2 Software Architecture Prior to some security evaluation, it is imperative to identify the structure of the main network. In view of this data, security tests could run on essential nodes and analysis could be done for the type of communication between services (Fig. 22.1). OpenStack gives clients a chance to introduce virtual machines VMs and instances that handle diverse tasks for communicating with the cloud domain. Level scaling turns out to be simple, i.e., tasks which benefit by running in parallel handle many clients on the fly by spinning more instances. Consider an example where a portable apps like for mobile phones that require to connect with a distant server might have the ability to separate, crafted by communicating with every client crosswise over various occasions, all connecting to one another, however, scaling quickly and successfully as the number of clients increases. OpenStack is open-source programming that indicates that anyone who selects acquires the source code, rolls out any improvements or adjustments they require, and unreservedly shares these enhanced changes back to the society. It likewise denotes that OpenStack has the benefit of thousands of engineers universally to build up the utmost grounded, utmost secure, and utmost powerful item.


Fig. 22.1 OpenStack network architecture [4]

22.2.1 Components of OpenStack OpenStack contains various parts with measured engineering and various code names. How about, we have concisely taken a gander at the segments of OpenStack. In spite of the fact that OpenStack is comprised of a few different segments as a result of its open nature, the OpenStack people group has perceived these nine parts as the center segments to be specific Image Service-Glance, DashboardHorizon, Object Storage-Swift, Networking-Neutron, Identity Service-Keystone, Block Storage-Cinder, Telemetry-Ceilometer, Orchestration-Heat, Compute-Nova, as appeared in Fig. 22.2 and Table 22.1. The fundamental components and their capacity for OpenStack are shown in Table 22.1.

22.2.2 OpenStack Usage in Cloud Platform The cloud is tied in with influencing accessible processing for end users in a distant environment, where the legitimate programming continues running as an admin on dependable and flexible servers as opposed to that on every end user PC. Cloud computing demonstrates a variety of things, however ordinarily the business claims


Fig. 22.2 Components of OpenStack [5]

running diverse stuff “as a service” software, platforms, and infrastructure. OpenStack is seen as Infrastructure as a Service (IaaS). The given structure suggests that OpenStack makes it straightforward for customers to quickly incorporate new examples so that other cloud parts can run. Regularly, the framework at that point runs a “platform” whereupon an engineer could make programming applications.

22.3 Vulnerabilities in OpenStack 22.3.1 Known Vulnerabilities in OpenStack OpenStack comprises different sets of components that provide individual functionality. These components can be exclusively targeted by an attacker. Therefore, to make OpenStack secure as a whole, these individual components have to be evaluated for security vulnerabilities. We take a look at some of the vulnerabilities in OpenStack that could have affected serious harm if they were not mitigated in time.


Table 22.1 OpenStack cloud components [3]

No. | Code name | Description
1 | Compute-Nova | Supplies compute instances. Supported hypervisors: Xen, VMware, KVM, Hyper-V
2 | Image Service-Glance | Registration, discovery, and delivery services for disk and server images
3 | Object Storage-Swift | Redundant and scalable storage system
4 | Dashboard-Horizon | Graphical web interface to access, provision, and automate the deployment of cloud resources
5 | Identity Service-Keystone | Verification system used throughout the cloud system
6 | Networking-Neutron | Service used for IP address space and network maintenance
7 | Block Storage-Cinder | Block-level storage devices utilized for compute instances
8 | Orchestration-Heat | Coordinates multiple cloud applications using templates
9 | Telemetry-Ceilometer | Provides all necessary system counters for customer billing; single point of contact for billing systems
10 | Database-Trove | Engine for non-relational and relational databases
11 | Elastic Map Reduce-Sahara | Provisions data-intensive application clusters like Hadoop or Spark
12 | Bare Metal Provisioning-Ironic | Provisions bare metal machines
13 | Multiple Tenant Cloud Messaging-Zaqar | Cloud messaging service for mobile and web developers
14 | Shared File System Service-Manila | Service for compute instances to allow access to shared file systems
15 | DNSaaS-Designate | DNS available as a service
16 | Security API-Barbican | REST API designed for encryption keys, X.509 certificates, and secure storage and management of secrets (i.e., passwords)

Session fixation vulnerability: The session fixation vulnerability was found in Horizon. This happened while utilizing the default marked treat session. The default situation in Horizon is to utilize marked cookies that stores state of the session at the customer end, which makes it probable that an attacker can catch a client’s cookies; they may perform designations as that client, regardless of whether the client has logged out. The administrations that were influenced by this vulnerability were Folsom, Grizzly, Horizon, Icehouse, and Havana. The Vulnerability: At the end point of the customer’s sessions, the server does not know about the client’s login state. The OpenStack consent tokens are secured in


the session ID inside the cookie. If an attacker can steal the cookie, they can perform all actions as the target user, even after the user has logged out. There are several ways in which the attacker can steal the cookie. One example is capturing it over the wire if Horizon is not configured to use SSL. The attacker may also access the cookie from the file system if they have access to the machine. There are also other ways to forge cookies which are beyond the scope of this note. Only by enabling a server-side session tracking solution, for instance memcache, is the session terminated after the user logs out. Doing this protects against an attacker re-using a cookie from a finished session. Horizon should request that Keystone revoke the token upon user logout, but this has not been implemented for the Identity API v3. Token revocation may also fail if the Keystone service is unavailable. Consequently, to ensure that sessions are not usable after the user logs out, server-side session tracking is recommended.
Solution: It is suggested that Horizon be configured to use a different session backend instead of signed cookies. One likely option is to use memcache sessions. To check whether signed cookies are being used, search for this line in Horizon's local_settings.py:
SESSION_ENGINE = 'django.contrib.sessions.backends.signed_cookies'
This vulnerability does not exist if SESSION_ENGINE is set to a value other than 'django.contrib.sessions.backends.signed_cookies'. Check for it in settings.py if SESSION_ENGINE is not set in local_settings.py. The steps to configure memcache sessions are:
1. Ensure the memcached service is running on the system
2. Ensure python-memcached is installed
3. Configure the memcached cache backend in local_settings.py:
CACHES = {'default': {'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache', 'LOCATION': '127.0.0.1:11211'}}
4. Use the real IP and port of the memcached service
5. Insert a line in local_settings.py to use the cache backend:
SESSION_ENGINE = 'django.contrib.sessions.backends.cache'


Solution: It is suggested that an immediate upgrade of the OpenSSL software be carried out on the systems that run OpenStack services. In most cases, the upgrade would be to OpenSSL version 1.0.1g; however, it is recommended to review the exact affected-version details on the Heartbleed site referenced in this paper. After upgrading the OpenSSL software, any services that use the OpenSSL libraries need to be restarted. A list of all processes that still have the old version of OpenSSL loaded can be obtained by running the command below:
lsof | grep ssl | grep DEL
Any processes shown by this command should be restarted, or the whole system can be restarted if preferred. In an OpenStack deployment, OpenSSL is typically used to enable SSL/TLS protection for OpenStack API endpoints, SSL terminators, databases, message brokers, and libvirt remote access. In addition to the native OpenStack services, commonly used software that may need to be restarted includes Apache HTTPD, Libvirt, MySQL, Nginx, PostgreSQL, Pound, Qpid, RabbitMQ, and Stud. It is also suggested that the current SSL/TLS keys be treated as compromised and new keys be generated. This includes keys used to enable SSL/TLS protection for OpenStack API endpoints, databases, message brokers, and libvirt remote access. In addition, any private information, for example credentials, that has been sent over an SSL/TLS connection may have been compromised. It is recommended that cloud administrators change any passwords, tokens, or other credentials that may have been communicated over SSL/TLS.
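As a quick, hedged illustration of the version check mentioned above, the snippet below prints the OpenSSL build that the local Python runtime is linked against; it only covers that one runtime and is not a substitute for auditing every service listed in the advisory.

```python
# Minimal check of the OpenSSL version linked into the local Python runtime.
import ssl

print(ssl.OPENSSL_VERSION)   # expect a post-Heartbleed build after upgrading
```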

22.3.2 Assessment for Security To examine the security of OpenStack cloud nodes along facilitated virtual machines—VMs, the OpenStack cloud is installed on a Linux CentOS 7 server; also every one of its segments (Controller and Compute) is introduced on a similar host. The cloud is facilitating four virtual machines with the accompanying OS types: Kali Linux cloud, Windows Server, CentOS 7 cloud, and Ubuntu cloud. The security assessment configuration is displayed in Fig. 22.3.

22.3.3 Network Vulnerability Scanners Among the wide variety of security scanners, two that give the best results were selected: OpenVAS and Metasploit Pro. Metasploit Pro is a security scanner that also allows exploiting the vulnerabilities it traces [7]. OpenVAS is free software for vulnerability scanning and management [8], as shown in Fig. 22.3. Metasploit x64 and OpenVAS scanning software were installed on Ubuntu 16, and Metasploit was also installed on a PC running Windows. In Metasploit Pro, the external scan did not discover any kind of weakness.
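For reproducing the kind of port inventory discussed below, a hedged sketch of a simple TCP reachability check is shown here; the host address, port list, and timeout are placeholder values, and a plain socket connect is only a rough stand-in for what Metasploit or OpenVAS actually probe.

```python
# Rough TCP reachability check against a few OpenStack default ports.
# Host, ports, and timeout are illustrative placeholders.
import socket

def open_ports(host, ports, timeout=1.0):
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:   # 0 means the connect succeeded
                found.append(port)
    return found

if __name__ == "__main__":
    print(open_ports("192.0.2.10", [22, 80, 3306, 5000, 8080]))
```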


Fig. 22.3 Assessment structure of OpenStack

The software reported the following TCP ports as open: 80, 20, 111, 873, 3306, 5000, 5901, 5900, 5903, 5904, 5907, 6000, 60, 6379, 8080, 11211, and UDP port 111 (PORTMAP). A complete description of each port referenced above and its utilization is given in Table 22.2.

Table 22.2 OpenStack default ports [9]

OpenStack service | Port | Used by
SSH | 22 | Client connection
HTTP | 80 | Dashboard-Horizon
SUNRPC | 111 | Port mapper-RPC
RSYNC | 873 | File synchronization protocol
MySQL | 3306 | Many cloud components
HTTP identity service | 5000 | Keystone service
VNC service | 5900-5999 | Virtual machine consoles
Object Storage-Swift | 6000 | Many cloud components
VNC proxy for browsers | 6080 | Service novnc proxy
Redis service port | 6379 | Storage software
HTTP alternate | 8080 | Object Storage-Swift service
Unprotected memcached | 11211 | Proxy server

Fig. 22.4 Security assessment (number of opened ports found by Metasploit and OpenVAS)

The OpenVAS vulnerabilities found are as follows:
– SSH Server CBC Mode Ciphers Enabled (TCP port 22)
– VNC Server Unauthenticated Access on TCP ports 5901–5903
– HTTP TRACE/TRACK methods allowed on TCP port 5000
OpenVAS also reported the following security flaw: TCP timestamps. This flaw permits estimation of server uptime and is of low significance. Solution: Deactivate TCP timestamps by adding the line 'net.ipv4.tcp_timestamps = 0' to /etc/sysctl.conf and then executing 'sysctl -p'. To assess the performance of the vulnerability scanners, we checked the opened ports found by each application after performing the internal security tests. According to the OpenStack documentation, the cloud ought to have around 29 opened ports [9]. A few ports are not "visible" to security scanners due to the application's security configuration or because of firewall rules. Metasploit found 17 opened ports, OpenVAS 9 ports. The outcomes are represented graphically in Fig. 22.4. Security evaluation: An internal scan of the cloud VMs allows the hosts internal to the cloud to be investigated further. Simply by having a guest account on one of the virtual machines, we can perform a cloud security check. The guest VM IP settings are the starting point; for instance, on the CentOS 7 virtual machine, the IP settings are IP 10.10.10.5, netmask 255.255.255.0, and gateway 10.10.10.1. Some VMs may have more access to the cloud's resources than others. Compromising a host with complete access to the cloud's resources can lead to nearly complete access to the cloud itself. By default, the cloud's internal VMs are not reachable from the Internet. OpenStack offers the possibility to associate an external IP [1] with each VM so that it can be accessed directly; these addresses can be attached and removed at any time.

22.4 Conclusion The security evaluations performed investigated the OpenStack cloud Pike version within the cloud environment. Experiments were executed using two vulnerability scanners: Metasploit and OpenVAS. This examination centers on OpenStack security matters and threats. Some components of OpenStack were viewed as protected while others need to be improved. Assessing the diagram in Fig. 22.4, we can conclude that Metasploit gave the most results.


From this point of view Metasploit could be considered the overall winner; nevertheless, OpenVAS is a free application and its results were surprisingly efficient and clear. To secure the OpenStack services, one should first consider which ports need to be accessible from the Internet and which ports should be kept open only inside the cloud. At the end of this investigation, there will be a list of open ports for internal use and another list for external ports. Some ports must be accessible only to a few internal or external hosts. When the whole list of ports and their restrictions is complete, basic firewall rules can be applied. OpenStack comes with a built-in firewall that can be configured through the dashboard web interface. Our upcoming work will explore additional security of cloud containers and will give a more detailed methodology for isolating virtual machines from their neighbors, i.e., from outside threats as well as inside threats. We will concentrate on cloud authentication as it also plays a vital role in cloud security.

References 1. Bharati, M., Tamane, S.: Defending against bruteforce attack using open source-SNORT. In: IEEE—International Conference on Inventive Computing and Informatics-2017 (2017). https:// ieeexplore.ieee.org/document/8365267/. https://doi.org/10.1109/ICICI.2017.8365267 2. Bharati, M., Tamane, S.: Intrusion detection systems (IDS) & future challenges in cloud based environment. In: 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM). https://ieeexplore.ieee.org/document/8122180. https://doi.org/10.1109/ icisim.2017.8122180 3. OpenStack Pike: https://releases.openstack.org/pike/ 4. Networking in OpenStack: Panoramic view: https://ilearnstack.com/tag/openstack/ 5. Albaroodi, H., Manickam, S., Singh, P.: Critical review of open-stack security: issues and weeknesses. J. Comput. Sci. 10(1), 23–33 (2014) (National Advanced IPv6 Centre (NAv6), Universiti Sains Malaysia, 11800, Penang, Malaysia) 6. The Heartbleed bug: http://heartbleed.com/, Openstack—manage IP addresses: https://docs. openstack.org/ocata/user-guide/cli-manage-ipaddresses.html 7. Installing Metasploit Pro, Ultimate, Express, and Community: https://metasploit.help.rapid7. com/docs 8. OpenVAS: http://www.openvas.org 9. Openstack firewalls and default ports: https://docs.openstack.org/newton/config-reference/ firewalls-defaultports.html

Chapter 23

Smart Physical Intruder Detection System for Highly Sensitive Area Smita Kasar , Vivek Kshirsagar, Sagar Bokan and Ninad Rathod

Abstract In this ever-growing world of automation and digitization, where data is a pivotal element for the growth of every individual, institution, and organization, whether digital or physical, data could also be the reason for destruction, if acquired by an antagonist through unconventional access. Data is a very sensitive point in all the domains ranging from an individual’s personal space to tactical military centers such as defense institutions, military matters, financial institutions, hospitals, and educational institutions. Thus, it is necessary to protect the data from intruders. Physical Intruder Detection is equally important as the detection of intrusion in computer networks. Though the later is always digital and without manual intervention. Physical Intruder Detection can be either digital or done manually. The paper presents a system for an enclosed area, based on IoT and supported by Digital Image Processing, to capture any Physical Intruder who breaches the security system and alert the rightful person regarding the intrusion. The approach uses the PIR motion sensor to detect any suspicious activity, turn on the webcam and with the help of Face Recognition System using Digital Image Processing, recognizes whether it is the rightful person or not. If it is an Intruder, then the webcam will start recording the activity of the Intruder and send a text message as well as an email to the system owner alerting him/her/them about the intrusion. A link to this live feed to the system owner is also attached to the alert message and mail. This Intruder Detection

S. Kasar (B) Department of Computer Science & Engineering, Maharashtra Institute of Technology, Aurangabad, Maharashtra, India e-mail: [email protected] V. Kshirsagar Department of Computer Science and MCA, Government Engineering College, Aurangabad, Maharashtra, India e-mail: [email protected] S. Bokan Nvidia, Pune, India N. Rathod TCS, Pune, India © Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_23


System is energy efficient as well because the webcam will be turned on only when the motion sensor detects any suspicious activity.

23.1 Introduction Safety and Security are the two terms which are intertwined with one another. To prevent the occurrence of any internal or external threat such as criminals or any individual that intends to hinder or destroy the durable state of the organization is termed as security. Protecting the organization from any damage or reducing the extent of damage in case any such attack happens is what Safety means. The paper proposes a system without manual intervention which mainly focuses on the safety of highly sensitive areas which contains confidential data of the organization. In case of Military Offices, the confidential data could be information about the military officers, data about the equipment possessed by particular military base, case files, and important letters which may be regarding the plans of the military. In case of Business Organizations, it could be the contact details of clients, bank details of the clients, etc. In case of Hospitals, it could be medical history of the patients. An intrusion detection system (IDS) is a system that monitors network traffic for suspicious activity and issues alert when such activity is discovered. The primary objective of the proposed intruder detection system is to grab hold of: • Any intruder sabotaging the infrastructure and • Any intruder or insider who tries to gain illegitimate access to confidential resources and documents Internet of Things is a trending technology all over the world nowadays. The basic idea of IoT is to connect any device to the internet, so that it can interact with other devices connected to the internet and exchange data at real time to share information about the surrounding environment. Devices which have sensors built within are connected to an IoT platform which collects data from many sources, analyze that data to give the most valuable information which could prove beneficial to address particular needs. In the proposed system, PIR sensor is used to detect suspicious motions and trigger the Intruder Detection System. These devices could be remotely controlled with the help of Mobile devices. Smart Home is one such technological phenomenon which is beneath the development of embedded system [1]. IoT has now become the basis of automation and the proposed work uses the IoT approach to automate the system. Identification and verification of a person from a digital image is the primary objective of the Facial Recognition System. One way to accomplish this is the comparison between the selected facial features from the captured image and the values in the face database. The traditional approach suggests the identification of an individual by extracting the facial features from the image of subject’s face using face recognition algorithms. These algorithms are further divided into two main approaches, i.e., the one with Geometric approach and the other is with Photometric approach.


The system uses the video frame captured in the webcam to recognize the intruder. This forms an important part of the System and the selection of Facial Recognition algorithm depends upon various factors and conditions [2].

23.2 Literature Survey Ever since the beginning of human civilization, building and maintaining perimeter security is always important. The security system could be in any form, whether a wall or a person charged with security of a particular area or even a well-trained dog, all involved in this system and has always been evaluative in detaining and blocking a perimeter breach. But having a real time system that analyzed a situation and gives immediate response could be crucial. In the paper [3] an approach to detect a physical intruder within a restricted area is proposed and alerts the concerned person regarding the intrusion. It proposes a security system with base components: the environmental design, the electronic and mechanical entrance control, intrusion detection, and video monitoring. The first component, i.e., the environmental design refers to the physical structure of the system and the workforce employed to monitor and grasp any physical threat to that particular area [4]. It suggests the use of metal detector to detect an intruder. Intrusion detection deals with an alarm and a system that triggers it on an unusual activity that seems like a security breach. The final layer called the video monitoring using any suitable device ranging from a camcorder with its own inbuilt memory storage to hidden camera. Such security systems are common in places such as airports, courts, schools, railway stations, night clubs, etc. to ensure that no one brings a metallic weapon into the premises. The system could either be a handheld type or a walk-through model. But this system is incapable of detecting any threat to confidentiality of an area or to detect a threat of stealth. The system may trigger a false alarm due to a fault in any part which can be frustrating. Monitoring the area all the time will require many monitoring officers and even then, an intruder could easily slip through. The system also lacks safety features. The proposed system in [5] suggests that images from CCTV surveillance will be used to identify an intruder based on the techniques in Digital Image Processing, which is an improvement to the previously published work. The algorithm used will detect and isolate moving objects in terms of their shape and their sizes, extract other properties of the object with respect to its position. The system uses the algorithm which is an improved and modified version of the algorithm described by Freer et al. It generates images which contain two intensity levels only, i.e., black and white, termed as threshold and binarised image [6]. Further analysis is performed on these images to detect intruder. There has been many more improvement in this system further and many new algorithms were proposed which efficiently detected physical intruder by performing digital image processing and analysis on the images obtained from the video surveillance footage. Any of these algorithms could be used in the proposed system to identify the intruder. On identification the functionality of live recording of the area can be triggered and the footage can be stored on HDFS. The major advantage of using


Hadoop would be that the data stored in HDFS could be processed in real time and the video surveillance footage is stored in its source file format. Thus, it will be more convenient to process and manipulate the data. Furthermore, an automation function can be provided to the system which could be used to remotely control the surveillance system, thus making it more energy efficient.

23.3 Proposed Methodology

The system owner or group of system owners are the only ones who can pass through the Intruder Detection System. The images of these legitimate users are already stored in the system. The system is described below:
• The proposed model is kept in the area where surveillance is required. The model is connected to the camera, the PIR sensor, and the trigger alert system.
• In case the PIR motion sensor senses any motion, the system triggers the camera to capture the image in the vicinity. Frame differencing ensures motion detection. Also, an email alert message is sent to the legitimate user.
• The system performs basic preprocessing operations on the image captured by the camera.
• Since the objective is to detect an intruder, the face detection algorithm is applied to the captured image.
• The system compares the extracted properties with those stored in the image database and classifies the image as legitimate or intruder.
• If the image is classified as an intruder, then the alert system is triggered at particular intervals.
The major components used in the system are a Raspberry Pi 3, a webcam, a PIR sensor, and the trigger system. An overview of the system workflow and the sequence of control flow through the system is shown in Fig. 23.1.

Fig. 23.1 Proposed system flow diagram

The PIR sensor works continuously while the system is turned on. It sends a read value to the system at regular


intervals. There are two possible values, ‘0’ or ‘1’. ‘0’ means ‘down’, i.e., there is no motion detected in the restricted zone. The read value is mostly ‘0’ because the restricted zone is empty in absence of the authorized user. But whenever motion is detected, the read value is set to ‘1’, i.e., ‘up’. This triggers the module which is responsible for Face Detection. In this step, the webcam captures the image of the space. Then various preprocessing operations are performed on the image before it is used to classify the body as an intruder or an authenticated user. If the Face Detection System classifies the body as an intruder, then the alert system is triggered. The alert system sends an alert to the authenticated user or group of authenticated users via text message and via email. Along with the alert message the system can send a link to the live feed of the restricted area in text as well as in mail. Simultaneously, the live feed can also be stored on the cloud storage, so that the user could later access the footage of the intruder activity even though he misses the live feed.
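A hedged sketch of this sensing-and-alert loop is given below; the GPIO pin number, camera index, polling interval, and notification helper are illustrative assumptions, and the face-check and alert functions are left as stubs rather than the authors' actual implementation.

```python
# Illustrative monitoring loop: poll the PIR pin, grab a frame on motion,
# and alert when the face is not recognised. Pin/camera/interval are assumed.
import time
import cv2                    # OpenCV for frame capture
import RPi.GPIO as GPIO       # Raspberry Pi GPIO access

PIR_PIN = 11

def is_authorised(frame) -> bool:
    # Placeholder for the face detection/recognition step (Sect. 23.3.2).
    return False

def send_alert(frame) -> None:
    # Placeholder for the SMS/e-mail alert with a link to the live feed.
    print("intruder alert sent")

GPIO.setmode(GPIO.BOARD)
GPIO.setup(PIR_PIN, GPIO.IN)
camera = cv2.VideoCapture(0)

try:
    while True:
        if GPIO.input(PIR_PIN):              # read value '1' => motion detected
            ok, frame = camera.read()
            if ok and not is_authorised(frame):
                send_alert(frame)
        time.sleep(1)                        # poll at a regular interval
finally:
    camera.release()
    GPIO.cleanup()
```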

23.3.1 Working of PIR Motion Sensor

The Passive Infrared (PIR) sensor is common in the application domain of security alarm systems and motion detectors [7]. The term "Passive" is used since it does not emit infrared beams but instead receives them from other bodies. It basically detects any change in heat, and as soon as it detects such a change its output pin becomes HIGH. The human body also emits a certain amount of infrared due to body heat. Thus, whenever a human body enters the restricted zone (i.e., passes through the range of the PIR sensor), infrared is produced owing to the friction between the air and the body, and the PIR sensor detects the change in heat. The motion detection algorithm is given below.

Algorithm MotionDetection(var a)
1. { b := average of selected color in the current frame;
2.   if abs(a − b) > threshold then
3.     motion detected;
4.   else
5.   { wait(x seconds);
6.     MotionDetection(b);
7.   }
8. }

Frame differencing method [8] is used to detect motion in the restricted area. It detects motion on the basis of changes in pixel position after each frame. The


most crucial aspect in this is the threshold which will be used as a scale to conclude whether there is any human body in the environment and also time ‘x’ after which the next frame is captured by the PIR sensor. The algorithm compares the change in pixel values in frames with the threshold value at an interval of x-seconds in an infinite loop until the system is turned off.
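A hedged Python rendering of this frame-differencing check is shown below; OpenCV, the per-pixel difference threshold, and the sampling interval are illustrative choices rather than the exact values used by the authors.

```python
# Frame differencing sketch: flag motion when consecutive frames differ
# by more than a threshold. Threshold and interval are illustrative.
import time
import cv2

def detect_motion(camera, threshold=8.0, interval=0.5):
    ok, previous = camera.read()
    while ok:
        time.sleep(interval)                 # the 'x seconds' wait
        ok, current = camera.read()
        if not ok:
            break
        diff = cv2.absdiff(cv2.cvtColor(previous, cv2.COLOR_BGR2GRAY),
                           cv2.cvtColor(current, cv2.COLOR_BGR2GRAY))
        if diff.mean() > threshold:          # abs(a - b) > threshold
            return True                      # motion detected
        previous = current
    return False

if __name__ == "__main__":
    print(detect_motion(cv2.VideoCapture(0)))
```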

23.3.2 Face Recognition Using Viola-Jones Algorithm and KNN Algorithm Basic filters are used to preprocess the image and minimize redundant and unwanted data from the image [9]. Preprocessing is necessary in case of unwanted noises in the images that may affect further face detection and the classification. The Viola-Jones algorithm [10] gives a solution to the problem of detection of face in an image. It requires full view of frontal upright face. The major phases in the algorithm are as follows. • Integral image: It is the representation of an image to compute the rectangle feature. The sum of all pixels above and to the left of point (x, y) is the value of the integral image given as, ii(x, y) =



  i x , y

x  ≤x,y  ≤y

Where ii(x, y) is an integral image and i(x, y) is the original image [11, 12]. • AdaBoost algorithm is used for the classifier • The methods used for combining classifiers in a cascade which will help to easily discard the background of the image. The cascade works in a way to reject as many as negatives as possible and trigger the evaluation with a positive instance [10] (Fig. 23.2). Principal Component Analysis (PCA) is used for preprocessing the image and removing redundant and unwanted data from the image [9]. PCA is a statistical approach which uses orthogonal transformation to convert a set of variable values that could be possibly correlated into principal components [13–15]. Then the KNearest Neighbor algorithm is used to classify the extracted face as an authorized user or an intruder [16, 17]. The classification is implemented in three phases: The first being Training Phase, second phase is the Testing Phase [18]. In the training phase a model is constructed from training images. Thus classes of image are generated on basis of the training images. In the testing phase the model is tested against test images and classified as intruder or authorized user [19].

23 Smart Physical Intruder Detection System for Highly …

227

Fig. 23.2 Face recognition in the proposed system

In a scenario where streaming data is to be stored on HDFS in real time we need to consider two important aspects: The first one being security of data and the system should receive new data without interruption. Spark can be used to ingest real time data into the Hadoop Distributed File System. Spark is an important open source project which is a part of the Hadoop Ecosystem and comes with the CDH [20].

228

S. Kasar et al.

23.4 Conclusion and Future Scope The proposed system gives instant outcome in real time and maintain the confidentiality of data for the user and notify him/her/them in case there is an unwanted personnel trying to gain access to the user space unconventionally. Various techniques in Digital Image Processing are used to analyze the images from the camera and to reduce the work required of a human operator. Also the system minimizes the necessity of a constant human supervision required to protect the area. The system uses Raspberry Pi for the monitoring system and presently proposes the use of a camera for capturing the image. The energy efficiency is achieved since the recording and capturing the images using camera begin only after the sensor detects the motion. However, to enhance the security, multiple cameras may be used to capture the live feed once the motion is detected. Also the live feed can be stored on cloud which will reduce the requirement of data storage in the near future. The future scope also includes to get the live feed on a mobile device and to control the basic properties of the system through a mobile application. The system can be further tested on a large scale to verify the performance and accuracy.

References 1. Sukmana, H.T., Farisi, M.G., Khairani, D.: Prototype utilization of PIR motion sensor for real time surveillance system and web-enabled lamp automation. In: IEEE Asia Pacific Conference on Wireless and Mobile (2015) 2. Bartlett, M.S., Movellan, J.R., Sejonowski, T.J.: Face recognition by independent component analysis. IEEE Trans. Neural Netw. 13(6), 1450–1464 (2002) 3. Oludele, A., Ayodele, O., Oladele, O., Olurotimi, S.: Design of an automated intrusion detection system incorporating an alarm. J. Comput. ISSN: 2151-9617 (2009) 4. Kodali, R.K., Soratkal, S.R.: MQTT based home automation system using ESP8266. In: IEEE Region 10 Humanitarian Technology Conference (2016) 5. Freer, J.A., Beggs, B.J., Fernandez-Canque, H.L., Chevrier, F., Goryashko, A.: Automatic video surveillance with intelligent scene monitoring and intruder detection. ECOS, pp. 109– 133 (1997). https://doi.org/10.1049/cp:19970433 6. Monzo, D., Albiol, A., Albiol, A., Mossi, J.M.: A comparative study of facial landmark localization methods for face recognition using HOG descriptors. In: Proceedings of IEEE (2010) 7. Jayant: PIR sensor based motion detector/sensor circuit. (2015) Retrieved from https:// circuitdigest.com/electronic-circuits/pir-sensor-based-motion-detector-sensor-circuit 8. Nguyen, H.-Q., Loan, T.T.K., Mao, B.D., Huh, E.-N.: Low Cost Real-Time System Monitoring Using Raspberry Pi. Computer Engineering Department, Kyung Hee University, Yongin, South Korea (2015) 9. Suganthy, M., Ramamoorthy, P.: Principal component analysis based feature extraction, morphological edge detection and localization for fast iris recognition. J. Comput. Sci. 8(9), 1428– 1433, ISSN 1549-3636 (2012) 10. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57(2), 137–154 (2004) 11. Ming, Y., Ruan, Q., Li, X., Mu, M.: Efficient kernel discriminate spectral regression for 3D face recognition. In: Proceedings of ICSP 2010 (2010) 12. Huang, L.-L., Shimizu, A., Hagihara, Y., Kobatake, H.: Face Detection from Cluttered Images Using a Polynomial Neural Network. Elsevier Science (2002)

23 Smart Physical Intruder Detection System for Highly …

229

13. Paul, L.C., Suman, A.A., Sultan, N.: Methodological analysis of principal component analysis (PCA) method. Int. J. Comput. Eng. Manag. 16(2), 32–38 (2013) 14. Castells, F., Laguna, P., Sornmo, L., Bollmann, A., Roig, J.M.: Principal component analysis in ECG signal processing. EURASIP J. Adv. Signal Process. 2007, 1–21 (2007) 15. Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 374(2065), 0150202. 13 Apr 2016 (2016) 16. Guo, G., Wang, H., Bell, D., Bi, Y., Greer, K.: KNN Model-Based Approach in Classification, pp. 1–12 17. Imandoust, S.B., Bolandraftar, M.: Application of K-nearest neighbor (KNN) approach for predicting economic events: theoretical background. Int. J. Eng. Res. Appl. 3(5), 605–610 (2013) 18. Moise, I., Pournaras, E., Helbing, D.: K-Nearest Neighbour Classifier. ETH zurich (2015) 19. Vermeulen, J., Hillebrand, A., Geraerts, R.: A comparative study of k-nearest neighbor techniques in crowd simulation. In: 30th Conference on Computer Animation and Social Agents, 23 May 2017 (2017) 20. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (2010) 21. Aswinth Raj, B.: How to Send Text Message (SMS) Using ESP8266. (2003) Retrieved from https://circuitdigest.com/microcontroller-projects/sending-sms-using-esp8266

Chapter 24

Two-level Classification of Radar Targets Using Machine Learning Aparna Rathi, Debasish Deb, N. Sarath Babu and Reena Mamgain

Abstract Measurements from an airborne radar sensor provide information on aerial, sea-surface and ground moving targets. Along with the targets of interest, the radar picks up detections from ground moving targets and windmills, which may clutter the radar operator's screen. Traffic on highways picked up by the radar may not be of interest to an operator who is primarily carrying out surveillance for airborne threats. The targets of interest may vary from fighters, helicopters and UAVs to transport-class aircraft. Classifying the targets enhances the situation-awareness information available to the commander. In this paper, we discuss how classification of targets may be carried out using machine learning techniques. The objective is to analyze and model data using statistics and machine learning. We concentrate on supervised learning and its usage in radar target classification.

24.1 Introduction

The airborne radar sensor primarily carries out early detection of aerial threats in the air-to-air mode of operation. The sensor is usually an Electronically Scanned Array (ESA), which provides rapid beam switching and allows intelligent configuration of the radar. A surveillance radar usually has a wide beam in elevation to meet the height-coverage requirements. Owing to the wide beam, the radar is likely to pick up ground moving targets along with aerial targets. To complicate matters further, height accuracies are bound to be poorer because of the wide elevation beam. Ground moving targets can rapidly lead to multiple target tracks and can clutter the radar screen. Windmills falling within the radar coverage further increase the number of clutter tracks, and the increase in clutter tracks can lead to operator discomfort, especially during an air combat situation. In a phased array radar, this leads to a large wastage of radar resources, as discussed in a subsequent section. In this paper, we propose multilevel classification for more effective management of radar resources. Initially, a first level of classification is applied to radar detections to broadly classify them as ground moving targets or aerial targets. Radar resources are not expended on targets classified as ground moving targets, and targets whose classification is uncertain are kept in abeyance until enough data is collected to take a correct decision. The second level of classification is applied after the presence of the target is confirmed and a firm target track is formed; at this level the targets are classified into fighter, commercial and helicopter classes. Enemy fighters present the greater threat, so based on the classification results adequate radar resources can be diverted to such targets. The paper is organized as follows. Section 24.2 gives a brief description of phased array radar and the different beam types used therein. Section 24.3 describes the design of the multilevel classifier, followed by Sect. 24.4, which outlines how the outputs from level 0 and level 2 can be fused. Section 24.5 presents the results of the simulation study, and the conclusion is drawn in Sect. 24.6.

24.2 Phased Array Radar

The phased array has become the obvious choice for most radar systems. It provides inertia-less beam steering and dedicated target tracking. Unlike a mechanically scanned radar, a phased array keeps the time from first detection to an established target track very short: the moment a detection is produced, verification beams are scheduled with a higher probability of detection to confirm the presence of a target, followed by initiation beams that transmit multiple near-consecutive beams to estimate the kinematic parameters of the target with a reasonable amount of accuracy. Although this allows the track to be established very quickly once the first detection is obtained, it also consumes a large amount of resources if the radar produces detections on targets that are of no interest for the given scenario. To avoid this, the radar should have some preliminary classification ability so that resources can be expended selectively. The different beams [1] used by phased array radars are enumerated in Table 24.1.

Table 24.1 Priority of different beams

  Priority      Beam type
  1 (Lowest)    Surveillance
  2             Track update
  3             Track initiation
  4             Verification
  5 (Highest)   Maintenance

The surveillance beam type is scheduled irrespective of detections and thus presents the minimum load to the radar. The other beams are played based on the availability of target detections, and these must be judiciously scheduled to conserve radar resources without compromising system performance.

24.3 Design of Multilevel Classifier

Classification is carried out at different levels with varying success rates. The first level of classification is carried out at the raw detection level and aids the decision on whether the special beams, viz. verification and initiation beams, are to be scheduled. Classification is also carried out using a signal processing technique, referred to here as the level 0 classifier. Classification at the signal processing level can also be achieved using imaging techniques [2], which are generally not available in many surveillance radars owing to their higher bandwidth requirement. These classification results are not used to decide the scheduling of special beams for the target track; the output of this classifier is fused with the output of the classifier at level 2.

24.3.1 Signal Level Classification: Level "0"

The classifier at signal level uses features extracted from the micro-Doppler signatures of the targets [3, 4]. Classification based on range-Doppler diagrams of radar targets is studied in [5]. Rotating parts of a target, when exposed to the radar, backscatter the incident energy and provide a peculiar micro-Doppler signature of the target. These signatures vary with the type of target, plane of rotation, aspect angle, frequency, rotation speed, number of blades, number of engines, and blade RCS. In the signal domain, the spectral energy of the micro-Doppler signature is extracted as the feature for level 0 classification. In this paper four types of targets are considered for the level 0 classifier: helicopter, propeller, jet engine aircraft, and turboprop airliner. Their distinguishing features are as follows:

(1) Jet engine aircraft: Although the rotating compressor and turbine blades of a jet engine are not fully exposed to the radar incident energy, at certain aspect angles and with a wavelength of a centimetre or less the incident energy can enter through the short air intake and large exhaust, producing a micro-Doppler signature of the aircraft. For an X-band radar, these signatures can be used to extract features such as the number of blades and the engine rotation speed. In a multi-engine configuration, however, parameter extraction is less accurate because of multiple echo interactions, leading to a higher probability of misclassification. In commercial airliners the blades are exposed more than in any military aircraft, depending on the location of the engines, so classification of commercial airliners based on spectral energy is quite accurate owing to the large RCS and resulting high SNR. For an S-band radar, the micro-Doppler returns are ambiguous in Doppler, and parameters such as the number of blades and the rotation rate cannot be estimated; therefore, the spectral energy computed from the micro-Doppler signature is used as the feature in the signal level classifier.

(2) Helicopter: A helicopter is much slower than an aircraft, but its blades are few in number and have comparatively high RCS. The slowly rotating blades produce short time-domain impulses, called blade flashes, which result in equispaced bands in the Doppler domain. The helicopter micro-Doppler signature is observed at all frequencies and all aspects owing to the high RCS and the horizontal plane of rotation. Although the number of blades and the rotation speed can be estimated even at S-band, the spectral content is estimated as the feature for the sake of uniformity in Doppler-domain signal processing. For the signal level classifier, this Doppler spectral energy, along with the amplitude width of the equispaced Doppler returns, is used as the classification feature.

(3) Propeller: The micro-Doppler signature of a propeller aircraft results in a continuous spectrum at the lower Doppler range, as the blades are few, with low RCS and a lower rotational speed than a jet aircraft. Also, the plane of blade rotation is vertical, so the micro-Doppler signature cannot be observed at all aspects. The Doppler spectral content is low for a propeller compared with a helicopter; even if multiple propellers are present with slight speed variations, the Doppler spectral content does not increase much.

(4) Turboprop with jet engine: Large transport-class airliners have multiple jet engines along with multiple propellers installed. This results in a micro-Doppler signature that is a combination of the jet engine aircraft and propeller signatures. The Doppler spectral content is therefore very high and may lead to misclassification in the signal level classifier; it has to be used along with the data level classifier to distinguish this target from the airliner or propeller classes.

24.3.2 Classification Level 1 Using Raw Detection

The first level of classification is applied to radar detections, which are classified into two broad classes: aerial and ground moving targets. For the design of the classifier we have used a supervised (machine) learning technique, which seeks to build a predictor model from a known set of input data and its responses; the model then generates reasonable predictions in response to new data. Supervised learning involves the following steps: (a) preparation of data, i.e. identifying the set of known inputs and their known responses used to build the predictor model; (b) choice of an appropriate algorithm for building the classifier; and (c) use of the trained model for prediction on new data. In the current context we have analyzed two methods: (i) the Naive Bayes classifier, which is designed for use when features are independent of one another within each class, although it also works well in practice when that independence assumption does not hold; and (ii) the Support Vector Machine, which is used for binary classification.


1. Naive Bayes Method: Naive Bayes classification is based on estimating P(X|Y), the probability or probability density of the features X given the class Y. For computing the probability we use normal (Gaussian) and kernel distributions and analyze the effectiveness of both. Normal (Gaussian) distribution: the normal distribution is appropriate for features that are normally distributed in each class. For each feature modelled with a normal distribution, the Naive Bayes classifier estimates a separate normal distribution for each class by computing the mean and standard deviation of the training data in that class. Kernel distribution: the kernel distribution is appropriate for features with a continuous distribution. It does not require a strong assumption such as normality and can be used when the distribution of a feature is skewed or has multiple peaks or modes; it requires more computing time and more memory than the normal distribution. For each feature modelled with a kernel distribution, the Naive Bayes classifier computes a separate kernel density estimate for each class from the training data of that class.

2. Support Vector Machine Method: Let the training data be of the form (x_i, y_i), where x_i ∈ R^d. Consider a hyperplane that separates aerial targets from ground targets. A point x lying on the hyperplane satisfies wx + b = 0, so we classify a feature x as an aerial target if wx + b > 0 and as a ground target if wx + b < 0; hence wx + b is called the decision function. Let d_A be the Euclidean distance from the hyperplane to the closest aerial-target data point and d_G the Euclidean distance from the hyperplane to the closest ground-target data point. The objective of the SVM is to find (w, b) such that the margin d_A + d_G is maximum. These conditions are formulated as a constrained optimization problem and solved by applying the KKT conditions. A perfect hyperplane, i.e. one that separates all aerial targets from all ground targets, rarely exists; this is handled by suitably modifying the constraints of the optimization problem. So far we have only considered the case where the decision function is a linear function of the data, which is not always true. In that case we can map the data to some other Euclidean space H, which may give better results, using a transformation f: R^d → H. In a linear SVM the training algorithm depends on the data only through dot products x_i · x_j. If there is a mapping f, the training algorithm depends only on the dot products f(x_i) · f(x_j); hence, if there is a kernel function K(x_i, x_j), we need only K(x_i, x_j) to train the SVM and need not even know f, and f(x) can even be infinite dimensional. In this study we have used an SVM with the linear kernel K(x, y) = x · y and with the Gaussian radial basis function (RBF) kernel K(x, y) = exp(−‖x − y‖² / (2σ²)).
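The chapter does not include an implementation, but both level-1 methods map directly onto standard library calls. The sketch below is only an illustration of the approach: the height/range-rate samples are synthetic stand-ins (their class sizes and distribution parameters are loosely taken from the figures quoted later in this section), not the authors' data.

```python
# Illustrative sketch (not the authors' code): level-1 classification of radar detections
# into Aerial vs Ground from two features (height in km, range rate in m/s).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic training data standing in for real detections.
aerial = np.column_stack([rng.normal(8.4, 5.1, 833), rng.normal(-12.0, 395.0, 833)])
ground = np.column_stack([rng.normal(1.7, 2.4, 739), rng.normal(-0.9, 24.6, 739)])
X = np.vstack([aerial, ground])
y = np.array([1] * len(aerial) + [0] * len(ground))          # 1 = aerial, 0 = ground

nb = GaussianNB().fit(X, y)                                   # normal-distribution Naive Bayes
svm_rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)          # SVM with Gaussian RBF kernel

new_detection = np.array([[0.5, 15.0]])                       # low, slow: likely a ground target
print(nb.predict(new_detection), svm_rbf.predict(new_detection))
```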

24.3.3 Level 2 Classifier Using Track Data

The main functionality of a surveillance radar is to maintain tracks on as many targets as possible in its coverage volume. With the flexibility of the agile beam of an electronically scanned array, targets of interest can be tracked with adaptive revisit rates. Targets that are in combat require higher accuracy, and hence more radar resource time, since such tracks may be performing evasive manoeuvres that call for better revisit rates; they also require higher measurement accuracies and therefore more time on target from the point of view of radar resource utilization. To arrive at optimum performance, targets of interest such as fighters can be updated with better revisit rates, while targets such as commercial aircraft can be tracked without the use of special radar time. The second level of classification is carried out using the estimated parameters of the targets once they are under track. In the literature, classification for airport surveillance radar has been studied in detail [6]. After the second level of classification of aerial targets, the tracks classified as fighters receive an adequate share of radar resources. Also, for areas where ground targets are classified as windmills, special data processing techniques are used to achieve optimum performance; these include marking the windmill areas and subsequently changing the track confirmation criteria for targets in those areas. Apart from optimizing radar resources, this also reduces the number of false tracks. Hence the classifier imparts a double benefit: optimum deployment of radar resources and a reduction in false tracks for the radar operator.

24.4 Fusion of Information with Signal Level Classifier

The data level classifier classifies the aerial targets as fighter, commercial, or helicopter. Although it produces a reasonable level of accuracy, this classification is augmented with the output of the signal level classifier. In the literature, the fusion of different classifier outputs has been studied for radar [7]. For information fusion we use simple weighted averaging, as described by Eq. (24.1); the weights w_d and w_s are decided based on the quality of the signal level and data level classifiers. Using data as well as signal analysis improves the knowledge base of the target and helps in applying the information for optimization of radar resources.

p_{c_i}^{fused} = (w_d p_{c_i}^{d} + w_s p_{c_i}^{s}) / 2    (24.1)

where p_{c_i}^{fused} is the posterior probability of a target belonging to class c_i after fusion, p_{c_i}^{d} is the probability of the target belonging to class c_i as predicted by the level 2 (data level) classifier, and p_{c_i}^{s} is the probability predicted by the level 0 (signal level) classifier. The weights shall add up to 1.
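As a small numeric illustration of Eq. (24.1) only — the class posteriors and weights below are invented for the example, not results from the paper:

```python
# Illustrative sketch of Eq. (24.1): fusing level-2 (data) and level-0 (signal) class posteriors.
import numpy as np

classes = ["fighter", "commercial", "helicopter"]
p_data   = np.array([0.70, 0.20, 0.10])    # assumed posterior from the level-2 (track-data) classifier
p_signal = np.array([0.55, 0.35, 0.10])    # assumed posterior from the level-0 (signal) classifier

w_d, w_s = 0.6, 0.4                        # quality-based weights, chosen so that w_d + w_s = 1

p_fused = (w_d * p_data + w_s * p_signal) / 2.0    # literally as written in Eq. (24.1)
p_fused = p_fused / p_fused.sum()                  # renormalise so the fused posteriors sum to 1
print(dict(zip(classes, p_fused)))
```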

24.5 Simulation and Results

For the simulation studies, the radar is assumed to be an electronically scanned phased array sensor with a wide elevation beam and a narrow azimuth beam to meet the surveillance requirements. The sensor is assumed to have an elevation accuracy of 1.5° and an azimuth accuracy of 0.5°, and the measurement accuracy of the range rate is assumed to be 10 m/s. From the experimental data collected with an airborne radar, we can make the following reasonable assumptions:

1. The height of ground-based targets varies from 0 to 600 m. This takes into account windmill installations on mountainous terrain.
2. The height of aerial targets is from 600 m upwards. There may be exceptions in the case of helicopters, which are capable of vertical takeoff and may be hovering near the ground.
3. The range rate of the majority of ground-based targets falls between 0 and 40 m/s from the main lobe clutter, and that of windmills may spread from 0 to 60 m/s depending on wind conditions.

The error in the measurement of target height is a function of range; Fig. 24.1 shows the height error as a function of range. Because the sensor has measurement inaccuracies, the elevation measured by the radar is erroneous. Figures 24.2 and 24.3 show the measured elevation and the computed height of the radar targets with respect to target range. The training step utilizes training samples and estimates the parameters of a probability distribution, assuming the features are conditionally independent given the class. In the prediction step, for new test samples, the classifier computes the posterior probability of the sample belonging to each class and classifies the sample according to the largest posterior probability.

Fig. 24.1 Variation in height error with range


Fig. 24.2 Simulated data for radar elevation measurement

Fig. 24.3 Simulated data for radar height measurement (true)

24.5.1 Results for Level 1 Classification

The simulated data are generated separately for training and testing of the classifier. For the level 1 classifier, the model parameters are tabulated in Table 24.2, with prior probabilities P(Ground target) = 0.47 and P(Aerial target) = 0.53. For training, a total of 1572 data samples were generated, 833 belonging to the aerial target class and the remaining 739 to the ground target class. The results of the Naive Bayes classifier are shown in Fig. 24.4: blue and red dots indicate detections from aerial and ground targets, respectively; samples classified as aerial targets are encircled in blue, and samples classified as ground targets are encircled in red. We can see from the figure that the misclassification rate is very low.

Table 24.2 Model parameters for classifier 1 (normal distribution)

  Class    Attribute     Mean          Standard deviation
  Aerial   Height        8.37 km       5.15 km
  Aerial   Range rate    −12.32 m/s    395.66 m/s
  Ground   Height        1.70 km       2.43 km
  Ground   Range rate    −0.85 m/s     24.60 m/s

Fig. 24.4 Plot of classification results level 1 for NBC

Table 24.3 Confusion matrix for classifier level 1 with NBC

             Ground   Aerial
  Ground      242        5
  Aerial        4      287

The efficacy of any classifier is indicated by its confusion matrix; the confusion matrix for the Naive Bayes classifier is shown in Table 24.3. The results of the SVM classification method with a linear kernel are shown in Fig. 24.5, and its confusion matrix is depicted in Table 24.4. With the linear kernel, out of the 1572 data samples given for training, the algorithm identified a total of 712 support vectors. Figure 24.6 shows the results of classification using the support vector machine with the RBF kernel, and Table 24.5 gives the corresponding confusion matrix. With the radial basis function kernel, out of the 1572 training samples, the algorithm identified a total of 204 support vectors.
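For illustration only, the sketch below shows how such a confusion matrix can be produced with off-the-shelf tools. The simulated data and the test-set size are assumptions (chosen so the test set totals 538 samples, as the tables imply), not figures given by the authors.

```python
# Illustrative sketch (assumed data): train the RBF-kernel SVM described above and
# build the confusion matrix reported in Tables 24.3-24.5.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)

def simulate(n_aerial, n_ground):
    aerial = np.column_stack([rng.normal(8.4, 5.1, n_aerial), rng.normal(-12.0, 395.0, n_aerial)])
    ground = np.column_stack([rng.normal(1.7, 2.4, n_ground), rng.normal(-0.9, 24.6, n_ground)])
    X = np.vstack([aerial, ground])
    y = np.array(["Aerial"] * n_aerial + ["Ground"] * n_ground)
    return X, y

X_train, y_train = simulate(833, 739)       # training sizes quoted in the text
X_test, y_test = simulate(291, 247)         # assumed test split (538 samples)

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test), labels=["Ground", "Aerial"])
print(cm)   # rows: true Ground/Aerial, columns: predicted Ground/Aerial
```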


Fig. 24.5 Plot of classification results level 1 with SVM

Table 24.4 Confusion matrix for classifier level 1 with SVM (linear kernel)

             Ground   Aerial
  Ground      212       35
  Aerial       84      207

Table 24.5 Confusion matrix for classifier level 1 with SVM (RBF kernel)

             Ground   Aerial
  Ground      247        0
  Aerial       19      272

Fig. 24.6 Plot of classification results level 1 for SVM with RBF kernel


24.5.2 Classification Level 2 Using Radar Track and Signal Data

The second level of classification is applied after a firm target track is formed. The aerial targets are further subclassified into three classes: fighters, commercial aircraft, and helicopters/UAVs. The features used for classification are the target track's speed, RCS, and height. Table 24.6 summarizes the model parameters for classifier 2; the prior probabilities used are P(Fighter) = 1/3, P(Helicopter) = 1/3 and P(Commercial) = 1/3. For training the NBC, a total of 1500 data samples were given, of which 500 belong to the fighter class, 500 to the helicopter class, and 500 to the commercial class. The fourth feature is the signal-level signature analysis, which is carried out separately; the result of the signal pattern analysis is then fused with the data level classifier to arrive at the final classification result. Figure 24.7 shows the results of classification for level 2, and Table 24.7 depicts the confusion matrix for classifier 2.

Table 24.6 Model parameters for classifier 2 (normal distribution)

  Class        Attribute   Mean            Standard deviation
  Fighter      RCS         13.04 m²        7.28 m²
  Fighter      Height      25,592.24 ft    14,827.07 ft
  Fighter      Speed       1655.60 m/s     785.51 m/s
  Helicopter   RCS         3.09 m²         1.48 m²
  Helicopter   Height      12,115.45 ft    7185.26 ft
  Helicopter   Speed       197.73 m/s      142.59 m/s
  Commercial   RCS         55.36 m²        26.45 m²
  Commercial   Height      35,991.30 ft    3677.56 ft
  Commercial   Speed       613.21 m/s      244.98 m/s

Fig. 24.7 Plot of classification results


Table 24.7 Confusion matrix for classifier level 2

               Helicopter   Fighter   Commercial
  Helicopter       200          0           0
  Fighter            4        180          16
  Commercial         0          7         193

24.6 Conclusion

In addition to detection and tracking of targets, it is also very important for a radar to carry out classification of targets. In this paper we have described multilevel classification of radar targets with varying accuracies. The classification of targets is carried out at the signal level, at the detection level, and post-tracking. The signal level classification is based on the spectral content of the target signature, which also varies with the orientation of the target. The plot-level classification helps conserve radar resources by not letting them be expended on targets that are of no interest to the operational scenario. The post-track classification provides the operator with information about the target types, and this information is fused with the classification carried out at the signal level. The confusion matrix for each classifier is presented, and it can be seen that the results of the classifiers are very encouraging.

References

1. Gini, F., Rangaswami, M.: Knowledge-Based Radar Detection, Tracking, and Classification. Wiley, USA (2008)
2. Tait, P.: Target classification for air defence radars. In: 2006 IET Seminar on High Resolution Imaging and Target Classification, London, pp. 3–16 (2006)
3. Yang, Y., Lei, J., Zhang, W., Chao, L.: Target classification and pattern recognition using micro-doppler radar signatures. In: Seventh ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD'06), pp. 213–217. Las Vegas, NV (2006)
4. Karabay, O., Yceda, S.M., Yceda, O.M., Cokun, A.F., Serim, H.A.: Micro-Doppler-based classification study on the detections of aerial targets and wind turbines. In: 2016 17th International Radar Symposium (IRS), pp. 1–4. Krakow (2016)
5. de Wit, J.J.M., van Dorp, P., Huizing, A.G.: Classification of air targets based on range-Doppler diagrams. In: 2016 European Radar Conference (EuRAD), pp. 89–92. London (2016)
6. Ghadaki, H., Dizaji, R.: Target track classification for airport surveillance radar (ASR). In: 2006 IEEE Conference on Radar, Verona, NY, USA, p. 4 (2006). https://doi.org/10.1109/RADAR.2006.1631787
7. Radoi, E., Hoeltzener, B., Pellen, F.: Improving the radar target classification results by decision fusion. In: 2003 Proceedings of the International Conference on Radar (IEEE Cat. No. 03EX695), Adelaide, SA, Australia, pp. 162–165 (2003). https://doi.org/10.1109/RADAR.2003.1278731

Chapter 25

A Cognitive Semantic-Based Approach for Human Event Detection in Videos K. Seemanthini , S. S. Manjunath , G. Srinivasa , B. Kiran and P. Sowmyasree

Abstract Surveillance systems are mainly used to monitor the activity and behaviour of small human groups. In particular, the described approach focuses on small human groups that stay in the same place for a while and characterizes the behaviour of the group. The approach has wide application in areas such as surveillance, group interaction analysis and behaviour classification. The video surveillance task here deals with the recognition and classification of group behaviour with respect to several activities such as normal speech, kicking, hugging and punching. The processing steps include frame generation; segmentation using Fuzzy C-Means (FCM) clustering; feature extraction using the Local Binary Pattern (LBP) and Hierarchical Centroid (HC); and classification of the features with a Convolutional Neural Network (CNN). The proposed model outperforms existing methods, achieving 80% accuracy.

25.1 Introduction


Analysis of human activity [1] has become a trending and challenging research area in recent years, and its applications are diverse. Activity understanding is necessary for surveillance systems and for improving computer technology. However, multiple activities involving multiple people may occur during activity recognition, which poses a challenging issue [2]. Recent studies have observed the collective behaviour of groups from various perspectives [3]. Group activities are differentiated by individual actions and by interactions with other people as well as with the environment, and they can be described by the locations and movements of the individuals [4, 5]. Various group activities exist, such as walking, handshaking, hugging and kicking. However, numerous factors such as background clutter, noise, occlusions, the appearance of the actors, video quality and illumination changes contribute to the difficulty of recognition and event detection among people [6]. The designed system presents an efficient approach for small-human-group and action detection. The adaptive FCM approach separates the human region by estimating the number of clusters and fusing those clusters to retrieve the human region; the human region and the background are thus separated, and the non-human region is eliminated. Features are extracted by applying the Completed Local Binary Pattern (LBP) and hierarchical centroid algorithms. These feature extraction methods produce a huge number of numerical features; subsequently, a smaller subset of features is selected based on optimality criteria to detect human objects [7, 8].

25.2 Methodology

The proposed architecture is shown in Fig. 25.1. It depicts the events involved in the interaction of a small human group. Initially, the human zone is predicted in the frame generation step. Identification and event detection in the small human group are based on FCM clustering, followed by extraction of relevant features with the LBP and hierarchical centroid algorithms; a CNN [9] classifier is used to classify the human activities. Each step is briefly summarized in the sections below.

25.2.1 Frame Generation

A video is a sequence of images over a given interval of time; each image in the video is called a frame. A key frame is a frame that provides more precise and compressed information about the video. In television and computer systems, the frame rate is the number of frames displayed per second and is widely used for synchronizing pictures or films. In real-time applications, 30 fps is preferred for television and 24 fps for a moving camera.
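The chapter does not specify how frames are extracted; the following is only a minimal sketch of one common way to do it with OpenCV. The file name and the one-frame-per-second sampling rate are placeholders.

```python
# Illustrative sketch (not the authors' code): extract grey-scale frames from a video.
import cv2

cap = cv2.VideoCapture("surveillance_clip.avi")    # assumed input file
fps = cap.get(cv2.CAP_PROP_FPS) or 24.0            # fall back to 24 fps if metadata is missing

frames, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % int(fps) == 0:                        # keep roughly one frame per second (arbitrary choice)
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    idx += 1
cap.release()
print(f"kept {len(frames)} of {idx} frames")
```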

Fig. 25.1 System architecture of the proposed approach (input video → frame generation → segmentation with FCM clustering → feature extraction with LBP and HC → CNN classifier → human group detection and event detection; training on segmented samples builds the knowledge base used by the CNN)

25.2.2 Fuzzy C-Means Clustering (FCM)

A clustering approach minimizes an objective function by minimizing the error function. In FCM, C represents the number of clusters; hence it is referred to as fuzzy C-means clustering. The FCM technique supports fuzzy membership, assigning a membership degree for each cluster; the significance of the fuzzy membership is analogous to the probability of a pixel in mixture modelling. An important benefit of the FCM technique is that new clusters can be generated that have close membership with existing clusters. In general, the fuzzy membership, the partition matrix and the objective function are the three basic components of the FCM method. The FCM algorithm is iterative and proceeds as follows:

Step 1: Initialize m (m > 1).
Step 2: Initialize the membership function µij, i = 1, 2, …, n; j = 1, 2, …, c.
Step 3: Find the cluster centres yj, j = 1, 2, …, c.
Step 4: Determine the Euclidean distance dij, i = 1, 2, …, n; j = 1, 2, …, c.

FCM partitions a given image with n objects or vectors x = (x1, x2, …, xn) into c (1 < c < n) clusters with centroids y = (y1, y2, …, yc). After applying the FCM technique, the fuzzy clusters are expressed in a fuzzy matrix µ with c columns and n rows, where n indicates the number of objects in the image and c represents the number of clusters; µij denotes the element in the ith row and jth column of µ. The weighting exponent m controls the fuzziness of the clusters [10], and dij is the Euclidean distance between object xi and cluster centre yj.


The centroid yj of the jth cluster is obtained using Eqs. (25.1) and (25.2):

J_m = Σ_{j=1}^{c} Σ_{i=1}^{n} µ_{ij}^m d_{ij}    (25.1)

d_{ij} = ‖x_i − y_j‖    (25.2)
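A minimal sketch of the FCM updates implied by the equations above, assuming a flattened grey-scale frame, the Euclidean distance of Eq. (25.2) and fuzzifier m = 2 (the chapter does not specify these implementation details):

```python
# Minimal FCM sketch over pixel intensities; not the authors' code.
import numpy as np

def fcm(x, c=3, m=2.0, n_iter=50, seed=0):
    x = x.reshape(-1, 1).astype(float)                  # n samples, 1 feature (intensity)
    rng = np.random.default_rng(seed)
    u = rng.random((len(x), c))
    u /= u.sum(axis=1, keepdims=True)                   # membership rows sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]  # cluster centres y_j
        d = np.abs(x - centers.T) + 1e-9                # distances d_ij
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)        # standard FCM membership update
    return u, centers

# Example: cluster a synthetic two-region "frame" and keep the brighter cluster.
frame = np.concatenate([np.full(500, 40.0), np.full(120, 200.0)]) + np.random.randn(620)
u, centers = fcm(frame, c=2)
labels = u.argmax(axis=1)
bright_cluster = centers.ravel().argmax()
print("pixels assigned to the bright cluster:", int((labels == bright_cluster).sum()))
```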

25.2.3 Feature Extraction

The segmentation result is passed to the feature extraction block for further processing. Two algorithms, LBP and the hierarchical centroid, are employed for the extraction of features; they are described below.

25.2.3.1 Local Binary Pattern

There are many approaches to extracting image features; one of them is the Local Binary Pattern (LBP) method. LBP describes the texture and shape of digital images. The extracted features consist of binary patterns that describe the pixel values, and the resulting features are combined into a single histogram feature that represents the image. Images are compared by computing the similarity between their feature histograms. LBP gives good results with respect to speed and discrimination. LBP works on the eight neighbouring pixels, using the centre pixel as a threshold: if the grey level of a neighbouring pixel is higher than that of the centre pixel, the value 1 is assigned; if it is smaller, 0 is assigned. The LBP code is produced by concatenating the eight resulting bits. The normal operation of the LBP algorithm is shown in Fig. 25.2. A further enhancement of the basic LBP operator, the extended uniform LBP, allows different neighbourhood sizes. An LBP is called uniform if it has at most two bitwise transitions from 0 to 1 or 1 to 0, for example 00000000, 00111100 and 11100011. Uniform LBP captures only local textures, such as line ends, edges, angles and spots. Figure 25.3 depicts a flow diagram of the local binary pattern and the overall procedure used to extract the resulting shape features.
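As an illustration of the uniform LBP histogram feature described above, the sketch below uses scikit-image; the library choice and the random stand-in image are assumptions, since the chapter does not name an implementation.

```python
# Illustrative sketch: compute the uniform LBP code image and its normalised histogram feature.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1):
    codes = local_binary_pattern(gray, P, R, method="uniform")   # P neighbours, radius R
    n_bins = P + 2                                               # uniform patterns plus "other"
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / hist.sum()                                     # single histogram feature

gray = (np.random.rand(64, 64) * 255).astype(np.uint8)           # stand-in for a segmented region
print(lbp_histogram(gray))
```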

25.2.3.2 Hierarchical Centroid

Extracting shape-derived features that are invariant to scaling, translation and rotation is a tough task in target recognition systems, and shape-based feature extraction [11] of samples is an emerging task in image processing. To capture the similarities between sample images and to boost the visualization effect, the hierarchical centroid algorithm is considered.


Fig. 25.2 Block diagram of the LBP feature extraction method

The algorithm produces a description of the selected input samples. The computation is performed twice, describing the segmented sample image vertically and horizontally, to boost accuracy. A binary or grey-scale sample is taken, together with the recursion depth for the segmented sample. Shape-based feature extraction via the hierarchical centroid method outputs the division locations, values around zero, the depth image length and the depth of the division locations. Figure 25.4 shows the flow chart of the hierarchical centroid: the function reads the input image and depth, obtains the transposed image and the difference of level 1 and level 2, performs normalization, calculates the normalization for location and size, computes the centroids, updates the distances and finally obtains the descriptors.
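The chapter describes the hierarchical centroid only at the level of the flow chart; the sketch below is one possible interpretation of those steps (recursive splitting at the centroid, run on the image and its transpose), not the authors' implementation.

```python
# Minimal hierarchical-centroid sketch: recursively split a binary silhouette at its
# centroid column and keep the normalised split positions as the shape descriptor.
import numpy as np

def hierarchical_centroid(mask, depth):
    descriptor = []

    def split(cols, lo, hi, level):
        if level == 0 or hi <= lo:
            return
        weights = cols[lo:hi]
        total = weights.sum()
        if total > 0:
            c = lo + int(np.round((weights * np.arange(hi - lo)).sum() / total))   # centroid column
        else:
            c = lo + (hi - lo) // 2                                                # empty slice: midpoint
        descriptor.append((c - lo) / max(hi - lo, 1))                              # normalised division location
        split(cols, lo, c, level - 1)
        split(cols, c, hi, level - 1)

    col_sums = mask.sum(axis=0).astype(float)
    split(col_sums, 0, mask.shape[1], depth)
    return np.array(descriptor)

mask = np.zeros((60, 80), dtype=np.uint8)
mask[10:50, 30:55] = 1                                    # stand-in segmented human region
# run on the image and its transpose, as the text describes (horizontal and vertical description)
features = np.concatenate([hierarchical_centroid(mask, 4), hierarchical_centroid(mask.T, 4)])
print(features.shape, features[:4])
```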

25.2.4 CNN Classifier

The CNN is comparable to a feedforward ANN and is used here to analyze and classify image features through feature extraction and classification. Basically, a CNN consists of four different kinds of layer: convolution, activation, pooling and fully connected (dense) layers. A CNN has sparse local connectivity and weight sharing; each layer is connected to part of the input, and neurons overlap with each other to cover the image. CNN neurons produce a set of feature maps with shared weights, so the whole procedure is equivalent to an ordinary convolution with one filter per feature map. Weight sharing significantly decreases the number of network parameters, increases efficiency and avoids overfitting. An activation layer extracts complex characteristics of the input. Pooling layers subsample the previous layer by aggregating small rectangular sets of values; maximum pooling summarizes the input and decreases the output's sensitivity to small input shifts. The fully connected layer provides the classification result.

Fig. 25.3 Flow chart for LBP (select the input image; determine its dimensions, block size and origin; fill the centre-pixel matrix; compute the LBP code image using interpolation weights and interpolated pixel values; update the resulting matrix)

Fig. 25.4 Flow chart for the hierarchical centroid technique (read the input image and depth; get the transposed image and the difference of level 1 and level 2; perform normalization for location and size; compute the centroids; update the distances; obtain the descriptors)


CNN training is similar to ANN training: the loss function is reduced through gradient descent and backpropagation. Although the CNN concept has existed for decades, training such classifiers with many stacked layers has only recently become practical, owing to the massive parallelization offered by GPUs, the large amount of available data and design tricks such as rectified linear activation units (ReLU). In the proposed CNN architecture (AlexNet), five convolution layers are used, followed by ReLU and max/average pooling, and three fully connected layers resulting in 1000 outputs (Softmax). The CNN classifier is trained with Stochastic Gradient Descent (SGD) with a momentum term to address the multinomial logistic regression. CNNs learn data representations at multiple levels of semantic abstraction, where structures such as cars or faces can be recognized by combining low-level features such as edges. Designing a CNN for a specific problem is not trivial, and research has been done on CNNs for image classification, texture recognition and medical image analysis; the optimal architecture and configuration of the CNN are decided by considering the nature of the problem. Here, texture is defined as a stochastic repetition of a few small structures (textons) compared with the whole region. Image convolution captures small structures that look similar to the CNN kernel, so filter banks recognize responses effectively and are employed in many texture analysis applications [12]. CNN kernels analyze the image texture by solving an optimization problem; however, some key objectives must be considered: (i) the convolution layer input must be smaller than the characteristic local texture structure, otherwise non-local characteristics inappropriate to the texture are captured; (ii) to prevent loss of information, pooling layers must not be used between convolution layers; and (iii) to achieve rotation invariance, the feature maps output by the convolution layers must be reduced to a single value per map after pooling. A colour image contains high-level structure, and a low-level feature can still be valid for the same class when flipped or rotated. The network input is a set of feature vectors convolved by five convolution layers. The kernel size for each layer is selected as 2 × 2, while the receptive field is maintained at 6 × 6 to extract similar local texture; the CNN kernel is relative to the receptive field of the neuron and can thus manage the complexity of the local texture. The receptive field of size 2 × 2 is incremented by 1 at each layer, and the number of kernels used depends on the input complexity. The average pooling layer has the same size as the output of the last convolution layer (i.e. 27), and the resulting feature maps of the last layer are fed as input to three fully connected layers.
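For illustration only, the sketch below builds a small AlexNet-style network of the kind described (five convolution layers with ReLU, pooling, three dense layers, SGD with momentum). The input size, filter counts and the four activity classes are assumptions for the example, not the authors' exact configuration.

```python
# Illustrative sketch (not the authors' network) of a five-conv / three-dense CNN trained with SGD.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(64, 64, 1), n_classes=4):   # e.g. speech, kicking, hugging, punching
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.GlobalAveragePooling2D(),                # average pooling over the last feature maps
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```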

25.3 Test Results and Discussions

The experimental outcome of the proposed method is briefly summarized for each step. In the preprocessing step, the input sample image is converted from RGB to grey scale. Segmentation is achieved using FCM clustering [11], and features of the segmented frames are extracted using the CLBP and HC algorithms.


The respective frame action is classified using the CNN approach. In Fig. 25.5, (a) indicates the input frame, (b) the segmented frame, (c) the human detection and (d) the activity. The overall model is evaluated using a confusion matrix, a prediction analysis method used to measure the true positive, false positive, true negative and false negative rates; it helps us analyze the methodology in a systematic manner. From the confusion matrix the following parameters are computed. Accuracy, the degree of exactness of a quantity, is given by

ACC = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN)    (25.3)

Precision is given by Eq. (25.4):

PPV = TP / (TP + FP)    (25.4)

Sensitivity, also called the true positive rate or recall, is given by Eq. (25.5).

Fig. 25.5 a Input frame, b segmented output, c human detection and d human activity (shown for frames 100, 330, 410 and 490 of 507)

Table 25.1 Confusion matrix (N = 500)

               Predicted No   Predicted Yes   Total
  Actual No    TN = 50        FP = 50         100
  Actual Yes   FN = 50        TP = 350        400
  Total        100            400

Fig. 25.6 Accuracy graph

TPR = TP / P = TP / (TP + FN)    (25.5)

Sensitivity is a degree of responsiveness or awareness to external or internal changes. The proposed model was evaluated on a total of 300 frames, of which the true positive (TP) score is 240, the true negative (TN) score is 50, 5 are false positives (FP) and 5 are false negatives (FN), as shown in Table 25.1; Fig. 25.6 shows the overall performance of the system.
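As a quick numeric check of Eqs. (25.3)–(25.5), the counts quoted in the sentence above can be plugged in directly; this snippet is purely illustrative.

```python
# Evaluate Eqs. (25.3)-(25.5) with the counts quoted in the text.
TP, TN, FP, FN = 240, 50, 5, 5
accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
print(f"ACC={accuracy:.3f}  PPV={precision:.3f}  TPR={recall:.3f}")
```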

25.4 Conclusions

We have presented a novel methodology for small-human-group activity detection that encodes the particular event in the video sequence using the LBP and HC techniques. Segmentation is achieved using FCM clustering. The proposed model makes use of feature descriptors termed LBP and hierarchical centroid, and the obtained features are matched with the features stored in the knowledge base using a CNN [13] classifier. The results demonstrate that the performance achieved is 80% accuracy in detecting and recognizing the group activity [14–16].


References

1. Ciptadi, A., Goodwin, M.S., Rehg, J.M.: Movement pattern histogram for action recognition and retrieval. In: Conference on Computer Vision, pp. 695–710. Springer (2014)
2. Manfredi, M., Vezzani, R., Calderara, S., Cucchiara, R.: Detection of static groups and crowds gathered in open spaces by texture classification. Pattern Recogn. Lett. 44, 39–48 (2014) (Elsevier)
3. Stephens, K., Bors, A.G.: Human group activity recognition based on modelling moving regions interdependencies. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2115–2120. IEEE (2016)
4. Tran, K.N., Gala, A., Kakadiaris, I.A., Shah, S.K.: Activity analysis in crowded environments using social cues for group discovery and human interaction modelling. Pattern Recogn. Lett. 44, 49–57 (2014) (Elsevier)
5. Yang, Y., Zhang, B., Yang, L., Chen, C., Yang, W.: Action recognition using completed local binary patterns and multiple-class boosting classifier. In: 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 336–340. IEEE (2015)
6. Liciotti, D., Massi, G., Frontoni, E., Mancini, A., Zingaretti, P.: Human activity analysis for in-home fall risk assessment. In: 2015 IEEE International Conference on Communication Workshop (ICCW), pp. 284–289. IEEE (2015)
7. Kim, Y., Moon, T.: Human detection and activity classification based on micro-Doppler signatures using deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 13(1), 8–12 (2016)
8. Al-Nawashi, M., Al-Hazaimeh, O.M., Saraee, M.: A novel framework for intelligent surveillance system based on abnormal human activity detection in academic environments. Neural Comput. Appl. 28(1), 565–572 (2017)
9. John, V., Mita, S., Liu, Z., Qi, B.: Pedestrian detection in thermal images using adaptive fuzzy C-means clustering and convolutional neural networks. In: IEEE, IAPR International Conference, pp. 246–249 (2015)
10. Priya, M.M., Nawaz, D.G.K.: MATLAB Based Feature Extraction and Clustering Images Using K-Nearest Neighbor Algorithm (2016)
11. Lajevardi, S.M., Hussain, Z.M.: Automatic facial expression recognition: feature extraction and selection. Image Video Process. 6(1), 159–169 (2012) (Springer)
12. Liu, L., Lao, S., Fieguth, P.W., Guo, Y., Wang, X., Pietikäinen, M.: Median robust extended local binary pattern for texture classification. IEEE Trans. Image Process. 25(3), 1368–1381 (2016)
13. Milan, A., Schindler, K., Roth, S.: Detection-and trajectory-level exclusion in multiple object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3682–3689 (2013)
14. Mazzon, R., Poiesi, F., Cavallaro, A.: Detection and tracking of groups in crowd. In: IEEE, International Conference, pp. 202–207 (2013)
15. Liu, L., Shao, L., Li, X., Lu, K.: Learning spatio-temporal representations for action recognition: a genetic programming approach. IEEE Trans. Cybern. 46(1), 158–170 (2016)
16. Chen, C., Jafari, R., Kehtarnavaz, N.: Action recognition from depth sequences using depth motion maps-based local binary patterns. In: 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1092–1099. IEEE (2015)

Chapter 26

Analysis of Adequate Bandwidths to Guarantee an Electoral Process in Ecuador

Segundo Moisés Toapanta Toapanta, Johan Eduardo Aguilar Piguave and Luis Enrique Mafla Gallegos

Abstract The analysis of appropriate bandwidths was made based on the electoral processes of countries where electronic voting is available, adapted to the situation of the electoral processes of Ecuador. The problem is the low importance given to, and the recurrence of, the scarce bandwidth used in the current electoral processes. The objective is to reflect the problems caused by a mediocre bandwidth for events of such magnitude and importance in any country and to create awareness of the appropriate conditions needed to guarantee all aspects of the electronic process. The method used is deductive, analyzing the data that were used as parameters to calculate the bandwidth in the electoral process in Nigeria. The analysis shows that the voting processes up to the 2017 period have not been optimal, but they are sustainable and acceptable despite the setbacks that occurred when issuing the results of the votes. It is concluded that there must be simulations to avoid failures during the actual electoral process and that resources for the implementation of electoral processes must be improved.

26.1 Introduction

The National Electoral Council (CNE) in Ecuador is in charge of formalizing the results of the electoral processes. Nevertheless, in the last electoral processes these final results have been questioned [1] due to several problems; the one that stands out most is the so-called "Electro blackout".


Ecuador has still not taken the step to electronic voting and prefers to keep the traditional system of voting by ballot [2]. Neighbouring countries such as Colombia are conducting research on the security of electronic voting in order to determine the right time to move to this type of process [3]. The problem of inadequate and insufficient bandwidth becomes even more pronounced in countries that have already implemented electronic voting in their electoral processes, such as Brazil and Venezuela in Latin America; there are even nations with sufficient resources to apply electronic voting that do not do so because of risk factors that have not yet been solved [4]. Something that is little discussed, and for which there is little documentation, is that electoral processes often suffer mishaps due to poor bandwidth or its poor implementation. Why should a bandwidth analysis be carried out to guarantee an electoral process? By extending the analysis to past problems in electoral processes and considering factors such as the transition from ballot voting to electronic voting and the relative increase in the population of the country, we can arrive at an adequate bandwidth for any electoral process. The objective is to reflect the problems caused by a mediocre bandwidth for events of such magnitude and importance in any country and to raise awareness of the appropriate conditions needed to guarantee all aspects of the electronic process. The articles analyzed for this document are: Numerical Analysis of Ecuador's Electoral Register Integrity [1], From Piloting to Roll-out Voting Experience and Trust in the First Full e-election in Argentina [2], Sistema de votación electrónico con características de seguridad SSL/TLS e IPsec en Colombia [3], Ensuring the Blind Signature for the Electoral System in a Distributed Environment [4], Bandwidth and Resource Allocation for Full Implementation of e-Election in Nigeria [5], Ancho de banda, crisis y crecimiento del PIB en países latinoamericanos en el periodo 2005–2015 [6], ICT for National Development in Nigeria creating an enabling environment [7], International Internet connectivity in Latin America and the Caribbean [8], Remote Internet Voting Security and Performance Issues [9], Comparison of ID-based Blind Signatures from Pairings for E-Voting Protocols [10], and Optimal Bandwidth Choice for the Regression Discontinuity Estimator [11]. The method used is deductive, analyzing the data that were used as parameters to calculate the bandwidth in the electoral process in Nigeria [5]. It is concluded that an increase in bandwidth is needed, both for ballot voting and for electronic voting, and that, if an electronic voting system is implemented, even more bandwidth should be allocated to the electoral process in Ecuador.


26.2 Materials and Methods

First, in the Materials section, some determining works were reviewed, such as the different election systems, the average bandwidths in Latin American countries and the growth of bandwidth in Ecuador over different periods of time. Second, in the Methods section, parameters were proposed according to the Ecuadorian electoral process in order to generate a bandwidth calculation model.

26.2.1 Materials

Information provided by the International Telecommunication Union (ITU) was used because it offers highly relevant data on bandwidth in Latin America. In addition, the results of the simulations developed in Nigeria for the implementation of electronic voting were analyzed [5] and compared with the existing data in Ecuador in order to recreate the simulation with the current population of the country. Only the countries that have already implemented electronic voting in a real situation [4] have been placed in Table 26.1 to compare bandwidth values; the comparison with Colombia was included considering that the security of implementing the electronic process has been studied in that territory [3].

Table 26.1 International Internet bandwidth per user, kb/s (Source: International Telecommunication Union)

  Country     Average Internet bandwidth   World ranking
  Brazil      42.97                        Position 60
  Venezuela   14.40                        Position 94
  Colombia    34.99                        Position 67
  Ecuador     36.89                        Position 65

The data taken as reference for the elaboration of this table correspond to the 2016 period. Curiously, despite the crisis experienced by the vast majority of Latin American countries and the limited penetration of bandwidth in them [6], there are countries that have already used electronic voting. Venezuela is an example: even though it is positioned near the bottom of the Latin American list, with a bandwidth value well below Ecuador's, it is one of the countries that has been able to carry out electoral processes through this system. The ITU is characterized by periodically providing information from all countries; with these data, we can be more specific in the analysis of the average Internet bandwidth in Ecuador and Nigeria over the past few years. Figure 26.1 shows the values published by the ITU between 2012 and 2016 for the bandwidth in Ecuador. The average value for Ecuador was 26.17 kb/s, with a minimum of 8.25 kb/s in 2012 and a maximum of 36.89 kb/s in 2016.


Fig. 26.1 Average Internet bandwidth in Ecuador, kb/s (Source International Telecommunication Union)

Fig. 26.2 Average Internet bandwidth in Nigeria, kb/s (Source International Telecommunication Union)

The data provided by the ITU between 2012 and 2016 for the bandwidth in Nigeria are shown in Fig. 26.2. The average value for Nigeria during that period was 1.48 kb/s, with a minimum of 0.11 kb/s in 2012 and a maximum of 3.44 kb/s in 2016. For this reason, the research work in Nigeria was taken as the main reference, with a maximum average bandwidth per user of 3.15 kb/s in 2016 and a population exceeding 170 million inhabitants, far above Ecuador's 17 million.

26.2.2 Methods

Bandwidth scarcity arises from the fact that many users who were not considered connect to the network [7], so the method analyzed is the one studied in Nigeria for the implementation of electronic voting [5]. However, since ballot voting is used in Ecuador, it was considered as a method with less control over the voting users. The approximate number of votes was determined taking into account the electoral laws of Ecuador, which indicate who may take part in the electoral process [1]; in this way, those who fall within the voting range are identified. The Ecuadorians participating in this election process are all those between the ages of 16 and 65, including those who are abroad. The figures registered by the CNE for the number of voters in the last elections are 12,437,523 inhabitants within Ecuador and 378,292 inhabitants abroad, giving a total of 12,815,815.


The research work in Nigeria was built around a simple setup: a server that received the votes and ten laptops simulating voting machines [5]. The results obtained were the average time between votes and the delays of the whole process, and the experiment also examined whether there was any congestion in the queues of voters during the drill. We took that analysis and applied it to the voting system carried out in the electoral processes of our country, emphasizing that we focus on the bandwidth of the information recipient, that is, the National Electoral Council (CNE) in Ecuador. The parameters deduced are:

• CT: number of total candidates,
• LV: vote length,
• VB: value of votes in bytes,
• VH: votes per average hour,
• TPA: the active voter population in Ecuador.

These parameters were taken into account for the calculations regarding the bandwidth of an Ecuadorian electoral process. It is important to mention that the number of total candidates (CT) is subject to special considerations, since in the same electoral process the elector votes on different ballots for candidates for various positions, not only for a specific category such as president or vice president of the Republic. In other words, during the same election process, winning candidates are also selected for the positions of ministers, assembly members, mayors, and councilors, among others.

26.3 Results

Regarding the results obtained in the analysis, the following is stated:

• The process is scalable: as the number of candidates and the population of Ecuador increase, the number of servers that receive the votes should also be increased.
• Increasing the servers that host the information requires increasing the bandwidth in proportion to that growth.
• Ecuador exceeds Nigeria in terms of infrastructure and bandwidth, so the electronic method is a viable process and, if implemented, would greatly improve the current electoral process.
• The electronic vote and the ballot vote must have the same level of privacy and security.
• Ecuador can make the leap to electronic voting at the national level, not just in specific sectors.


1. Mathematical description of the bandwidth calculation. Among the parameters mentioned, the following formulas are proposed for the calculation of the total candidates, the length of the vote, and the value of votes in bytes.

• To determine the total candidates, the following formula is used:

CT = number of candidates + 2    (26.1)

Two is added to the number of candidates because blank and null votes can also be cast in the electoral processes.

• The length of the vote is calculated with the formula below:

Length of vote = (CT × 3) × 2    (26.2)

The multiplication by 3 is fixed, while the multiplication by 2 is used because each value is stored as data of short type and occupies two bytes.

• The vote value in bytes is obtained from the following formula:

Value of votes = (Length of vote / 245) × 256    (26.3)

The division by 245 is given by the RSA (Rivest, Shamir, and Adleman) encryption standard, which allows this maximum of encrypted bytes, and the multiplication by 256 corresponds to the total length of the block that stores the vote (Figs. 26.3 and 26.4).

START
DATA:
  VARIABLES
    number_candidate : Numeric Entire
ALGORITHM:
  Read number_candidate
  Total_candidates = number_candidate + 2
  Length_votes = (Total_candidates x 3) x 2
  Value_votes = (Length_votes / 245) x 256
END

Fig. 26.3 Mathematical algorithm of the calculation of the vote value (Source: Authors)


START
DATA:
  VARIABLES
    People_between_16_and_65 : Numeric Entire
    People_foreign : Numeric Entire
ALGORITHM:
  Read People_between_16_and_65
  Read People_foreign
  Total_active_population = People_between_16_and_65 + People_foreign
  Votes_per_average_hour = Total_active_population / 12
  Bandwidth = Votes_per_average_hour / Value_Votes
END

Fig. 26.4 Mathematical algorithm of bandwidth calculation (Source: Authors)
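As an illustration, the two calculations of Figs. 26.3 and 26.4 can be combined into a short script. The sketch below is only a minimal reading of the formulas above; the division by 12 reflects the voting window implied by the algorithm, and the example figures for the candidates are hypothetical, not official values.

# Minimal sketch of the vote-value and bandwidth calculations (Eqs. 26.1-26.3, Figs. 26.3-26.4).
def value_of_votes(number_of_candidates: int) -> float:
    total_candidates = number_of_candidates + 2           # blank and null votes (26.1)
    length_of_vote = (total_candidates * 3) * 2           # short type, two bytes per value (26.2)
    return (length_of_vote / 245) * 256                   # RSA block of 256 bytes, 245 usable (26.3)

def required_bandwidth(people_16_to_65: int, people_abroad: int, vote_value_bytes: float) -> float:
    total_active_population = people_16_to_65 + people_abroad
    votes_per_average_hour = total_active_population / 12  # 12-hour window assumed by Fig. 26.4
    return votes_per_average_hour / vote_value_bytes       # as defined in Fig. 26.4

if __name__ == "__main__":
    vote_bytes = value_of_votes(number_of_candidates=8)    # hypothetical ballot with 8 candidates
    bw = required_bandwidth(12_437_523, 378_292, vote_bytes)
    print(f"Vote value: {vote_bytes:.2f} bytes, bandwidth figure: {bw:.2f}")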

2. Prototype of the generic algorithms for the calculation of the bandwidth. Figure 26.5 describes the steps for calculating the vote value in bytes with the established parameters. The steps of the first algorithm are:

• the number of total candidates, including the blank and null vote;
• the length of the vote, obtained from the previous calculation;
• the vote value in bytes.

Figure 26.6 describes the steps for calculating the bandwidth using the result of Fig. 26.5 and the other established parameters. The steps of the second algorithm are:

• the total active population;
• the average hourly votes;
• the bandwidth of the electoral process.

The algorithms in this investigation were described using programming language techniques and flowcharts in order to present them as prototypes; the flow diagrams of the resulting algorithms are shown below. Since this article is based on the electoral process in Ecuador, not only data such as the size of the eligible voting population must be taken into consideration, but also the state of the infrastructure on which the website that registers and publishes the votes is hosted. Therefore, security parameters and contingency plans must be contemplated when carrying out this process.


Fig. 26.5 Flowchart of the process of calculating the vote value in bytes for an electoral process

26.4 Discussion

From the analysis of the adequate bandwidth for an electoral process in Ecuador, the following points are proposed for discussion:

• The results were obtained without the integrity of the underlying data being 100% guaranteed, something that, in the perception of the majority in Latin America, should improve [8].
• The system must be available during the entire election process [9]; that is, there must be no interruption or crash in the system or on the website where the information is hosted.
• The parameters considered are applicable to similar countries in Latin America that wish to set aside paper-based vote counting and adopt electronic voting [8].


Fig. 26.6 Flowchart of the process of calculation of the bandwidth for an electoral process

• The telecommunications infrastructure in Ecuador is deficient; having high-availability servers does not by itself improve bandwidth [6].
• The protocol used in the electoral process also matters: measures that provide greater security also consume bandwidth resources [3, 10].


26.5 Future Work and Conclusions

Future work should carry out simulations of greater or lesser scale to avoid failures at the time of the actual electoral process, following as an example some of the works analyzed here [5], with the purpose of turning it into a process that is transparent at all times and in any place, free of possible suspicions of fraud; in addition, all information must be collected and published. The selection of bandwidth should be based on the requirements of each electoral process, since the same figures do not apply if the scope of the voting changes to a greater or lesser magnitude [11].

It is concluded that there should be drills to avoid failures at the time of the actual electoral process, together with improved resources for the implementation of electoral processes. A better technological infrastructure and adequate bandwidths are necessary to implement electoral processes with electronic voting, given that Ecuador is in conditions similar to those of other countries. In addition, the website where the electoral process information is stored must be available throughout the process. New proposals should be established to standardize a better Internet quality for all of Ecuador, since it currently concentrates on the main cities of the country.

Acknowledgements The authors thank the Universidad Politécnica Salesiana del Ecuador, the research group of the Guayaquil Headquarters "Computing, Security and Information Technology for a Globalized World" (CSITGW), created according to resolution 142-06-2017-07-19, and the Secretaría de Educación Superior, Ciencia, Tecnología e Innovación (Senescyt).

References

1. Mafla, E., Gallardo, C.: Numerical analysis of Ecuador's electoral register integrity. In: 2018 5th International Conference on eDemocracy and eGovernment (ICEDEG 2018), pp. 351–355 (2018). https://doi.org/10.1109/icedeg.2018.8372305
2. Pomares, J., Levin, I., Alvarez, R.M., Mirau, G.L., Ovejero, T.: From piloting to roll-out: voting experience and trust in the first full e-election in Argentina. In: Proceedings of the 6th International Conference on Electronic Voting (EVOTE 2014), pp. 33–42 (2014). https://doi.org/10.1109/evote.2014.7001136
3. Cardenas Urrea, S.E., Navarro Núñez, W., Sarmiento Osorio, H.E., Forero Paez, N.A., Bareño Gutierrez, R.: Sistema de votación electrónico con características de seguridad SSL/TLS e IPsec en Colombia. Rev. UIS Ing. 16, 75–84 (2017). https://doi.org/10.18273/revuin.v16n12017008
4. Toapanta, S., Huilcapi, D., Cepeda, M.: Ensuring the blind signature for the electoral system in a distributed environment. In: 2018 3rd International Conference on Computer Science and Information Engineering (ICCSIE 2018) (2018). https://doi.org/10.23977/iccsie.2018.1026
5. Akonjom, N.A., Ogbulezie, J.C.: Bandwidth and resource allocation for full implementation of e-election in Nigeria. Int. J. Eng. Trends Technol. 10, 58–65 (2014). https://doi.org/10.14445/22315381/IJETT-V10P211
6. Pascual, G., Elvia, I., Reinaldo, A.: Broadband, crisis and GDP growth in Latin American countries in the 2005–2015 period | Ancho de banda, crisis y crecimiento del PIB en países latinoamericanos en el periodo 2005–2015. In: Iberian Conference on Information Systems and Technologies (CISTI) (2017). https://doi.org/10.23919/cisti.2017.7976041
7. Oghogho, I.: ICT for national development in Nigeria: creating an enabling environment, 3 (2013)
8. Messano, O.: Study on international Internet connectivity in Latin America and the Caribbean, pp. 1–56 (2012)
9. Ahmed, M.I., Abo-Rizka, M.: Remote internet voting: security and performance issues. In: 2013 World Congress on Internet Security (WorldCIS 2013), pp. 56–64 (2013). https://doi.org/10.1109/worldcis.2013.6751017
10. Ribarski, P., Antovski, L.: Comparison of ID-based blind signatures from pairings for e-voting protocols. In: 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2014) Proceedings, pp. 1394–1399 (2014). https://doi.org/10.1109/mipro.2014.6859785
11. Imbens, G., Kalyanaraman, K.: Optimal bandwidth choice for the regression discontinuity estimator. Rev. Econ. Stud. 79, 933–959 (2012). https://doi.org/10.1093/restud/rdr043

Chapter 27

Load and Renewable Energy Forecasting for System Modelling, an Effort in Reducing Renewable Energy Curtailment

Dipam Chaudhari and Chaitanya Gosavi

Abstract Every day humanity achieves new feats in the field of technology, and the most fundamental thing that aids it in doing so is energy. With the invention of generators, fossil fuels were integrated into the generation of this energy, but this process became a breeding ground for carbon footprints. Hence, traditional energy sources such as wind, solar, and hydro have been integrated extensively all around the world to curb this deterioration of the environment. These sources have indeed helped in mitigating the problem of the increasing carbon footprint. However, even these Renewable Energy (RE) sources cannot be considered impeccable, as they have issues related to sustainability and usability. The problem of curtailment of these variable sources of energy has existed since the rise in their penetration of the world market. Due to uncertainty in the generation of power from renewables, various problems related to grid integration appear. Hence, to maintain the real-time balance between load and generation, some of the RE is wasted, which is called RE curtailment. This paper therefore focuses mainly on reducing such power losses due to this imbalance in the power management system. At first, a literature survey was conducted on the subject to understand the problem, its severity, and the various system strategies previously used to solve it. In this paper, day-ahead RE supply forecasting and day-ahead demand forecasting are integrated to enable day-ahead planning of the power demand–supply management. These forecasting models were designed using the concepts of Artificial Neural Networks, and the forecasted quantities were used to schedule day-ahead power demand and supply strategies. In case of system imbalance, i.e. during excess supply or excess demand, the utilities can be maintained using day-ahead power transactions with the electricity market. In practice, these forecasting methods may also have some modelling errors, which may affect the accuracy of the balance of the system. To mitigate this problem of system inaccuracy, the concept of a two-settlement strategy to balance the day-ahead market and the real-time market was used.

D. Chaudhari (B) Government College of Engineering, Aurangabad, MH, India e-mail: [email protected]
C. Gosavi Technical University of Berlin, Berlin, Germany e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_27

27.1 Introduction

According to a report by the World Bank on air pollution, exposure to ambient air pollution cost the world's economy $5.11 trillion in welfare losses in 2013. As a result, many international organizations are urging developing countries to shift their energy sources to nonconventional ones, as developing countries are major consumers of conventional sources of energy. Nonconventional sources of energy are considered cleaner, more reliable, sustainable, and environment-friendly: they do not cause hazardous emissions, they do not add to effective pollution, and they are cost-effective too. Can it therefore be said that renewable energy is the most effective and impeccable source of energy? Today's energy researchers would not agree; there are flaws even in this greenest form of energy.

The problem lies in post-generation storage and load management, because RE generation is uncertain: its output is determined by the underlying meteorological factors. Abundant RE may have to be curtailed because the real-time balance between load and generation must be maintained, and electric generation cannot be economically stored on a large scale. Although the energy can be generated with maximum efficiency, it cannot be used to its full potential. In cases of equipment maintenance, upgrade work, or failure, RE generation cannot be used. The RE reduction in these cases is known as curtailed electrical energy. With the rapid increase in the penetration of RE, this issue of energy curtailment rose subsequently. In China, the curtailed wind power generation was up to 12.3 billion kWh in the year 2011 [1]. In addition, China's power sector lost as much as 20.82 billion kWh in 2012, compared to 16.23 billion kWh in the year 2013. The National Renewable Energy Laboratory (US) defined curtailment as "a reduction in the output of a generator from what it could otherwise produce given available resources like wind or sunlight" [2]. Hence, curtailment poses a huge challenge to the sustainable use of RE resources, and sustainability can be improved by mitigating this energy curtailment.

Reduction of energy curtailment often involves increased flexibility of the system. The flexibility of the system can be improved by making certain physical additions to it, such as increased storage and grid capacity, and by operational and organizational changes. The use of pumped hydropower by the Portuguese grid operator showed a significant reduction in wind energy curtailment in the year 2011. Many alternatives have been explored to mitigate this problem. In this paper, a solution is presented that uses energy load and generation forecasting for improved load management, with controlled power distribution from conventional sources based on the previously forecasted quantities.


27.2 Literature Review

Rachel Golden and Bentham Paulos, in their studies, defined curtailment as a reduction in the output of a wind or solar generator from what it could otherwise produce. They discussed several causes of curtailment: excess generation during low-load periods, transmission congestion, lack of transmission access, and voltage and interconnection issues. In the US, local oversupply of energy and transmission constraints are considered to be the major reasons for energy curtailment [3]. In another study [4], the main causes of RE curtailment were observed to be equipment maintenance or failure, low generation flexibility, defective control strategies, etc. For large-scale renewable grid integration, energy curtailment is caused by mismatched electrical demands, transmission constraints, poorly coordinated dispatch scheduling, and a lack of coordination mechanisms. For small-scale renewables, it occurs because these plants are not considered during generation scheduling and also because of low wind power forecasting accuracy.

Many strategies have been applied to alleviate this problem, of which forecasting methods for improving energy planning models look quite promising. Some are stand-alone methods and some are hybrid ones. Regression analysis, univariate time series methods, and multivariate time series methods are the statistical ones. Computational intelligence methods such as machine learning and knowledge-based methods, along with mathematical programming, are also stand-alone methods used for energy forecasting. Hybrid methods combine two or more of the above-mentioned stand-alone methods. Based on measures such as the root mean square error (RMSE) and mean absolute error (MAE), computational intelligence methods are found to be more useful for electricity load and RE forecasting models [5].
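For reference, the two error measures mentioned above are straightforward to compute; the snippet below is a small illustrative sketch, and the arrays shown are invented example values rather than data from any of the cited studies.

import numpy as np

def rmse(actual: np.ndarray, forecast: np.ndarray) -> float:
    # Root mean square error between measured and forecasted series
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def mae(actual: np.ndarray, forecast: np.ndarray) -> float:
    # Mean absolute error between measured and forecasted series
    return float(np.mean(np.abs(actual - forecast)))

load_actual = np.array([310.0, 295.0, 330.0, 410.0])    # hypothetical hourly loads (MW)
load_forecast = np.array([300.0, 290.0, 340.0, 400.0])
print(rmse(load_actual, load_forecast), mae(load_actual, load_forecast))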

27.3 Proposed Work

In this paper, we address the problem of energy curtailment due to inefficient demand and supply management by optimizing load balancing with forecasting and a few arithmetical equations and considerations. Accurately forecasted day-ahead RE supply and day-ahead demand become the key factors for efficient load–supply balancing. Using both of these quantities, an effort was made to compute the day-ahead excess supply or excess demand, depending on the mathematical relationship between them. Dynamic energy dispatch and distribution methods are then discussed to deal with the excess supply or excess demand that occurs during distribution balancing.

Forecasting models based on Artificial Neural Networks are presented in this paper to forecast the day-ahead RE supply and the day-ahead demand. As these forecasting technologies are not 100% accurate, they may carry some uncertainties. These uncertainties are further calculated using the real-time power supply and demand, and from them the excess supply or excess demand can be calculated. Further techniques to deal with such excess supply or demand are then discussed. This paper mainly focuses on maximizing the use of RE sources with optimized grids and, hence, on reducing the power losses due to unnecessary energy curtailment.

27.3.1 Neural Network-Based Load Forecasting

Many forecasting methods have been used by researchers for short-term load forecasting in order to allocate the available resources rationally. These include traditional methods such as time series, regression, and ARIMA (Autoregressive Integrated Moving Average), as well as further computing techniques such as fuzzy logic, genetic algorithms, and neural networks [6–10]. Among these forecasting methods, the ANN (Artificial Neural Network) method is quite attractive because of its ability to handle the nonlinear relationship between load and the factors affecting it.

An artificial neural network is analogous to the biological neural network that constitutes the human brain. It is a framework in which machine learning algorithms work together to process complex data inputs. It consists of many computing elements, known as neurons, much like the human nervous system. These neurons are interconnected by synaptic weights, which transmit signals, and are typically arranged in layers. Every layer performs independent processing on its inputs; every neuron that receives a signal first processes it and then transmits it to the next connected neuron. As learning based on the inputs proceeds, the corresponding weights of the neurons change. The input to these artificial neurons is given by a feedforward network, where signals move in the forward direction from the input layer to the hidden layers and finally to the output layer.

The neural network model used for this purpose is a multilayered feedforward model with a sigmoid function, used to forecast the energy consumption of the given region. The day of the week, the interval of the day, the type of residency, weather data, and past usage data are given as inputs to the model. In this neural network model, every input p(i) has a weight w(i) associated with it. The combination of these inputs and weights produces the input p(j) to a neuron in the next layer, given as

p(j) = Σ_i p(i) · w(i) + b    (27.1)

where b is a bias value. To standardize the output of the synapses to the range [0, 1], neurons use a sigmoidal activation function, commonly known as the sigmoid function:

σ(x) = 1 / (1 + e^(−x))    (27.2)

Here, x is the input to a given neuron and σ(x) is the result of the sigmoid function. The number of inputs therefore decides the number of nodes in the input layer. Next to it is a hidden layer, in which the number of nodes is selected prudently on the basis of the work done in [11, 12]. Finally, there is an output layer which gives the desired output in the form of the forecasted load. The given inputs can also be normalized for smoother processing with the help of various normalization functions such as the z-score function [13, 14]. The proposed model is depicted in Fig. 27.1.

Fig. 27.1 Neural network model for load forecasting
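To make the forward pass of Eqs. (27.1) and (27.2) concrete, the following is a minimal sketch of a single hidden-layer feedforward pass in Python with NumPy. The layer sizes, the random weights, and the example feature vector (day of week, interval, residency type, temperature, past usage) are illustrative assumptions only; the paper's actual model and trained weights are not reproduced here.

import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))                      # Eq. (27.2)

def forward(p: np.ndarray, w_hidden: np.ndarray, b_hidden: np.ndarray,
            w_out: np.ndarray, b_out: float) -> float:
    hidden = sigmoid(w_hidden @ p + b_hidden)            # Eq. (27.1) per hidden neuron, then squashed
    return float(w_out @ hidden + b_out)                 # single output: forecasted load

rng = np.random.default_rng(0)
n_inputs, n_hidden = 5, 8                                # assumed layer sizes
w_hidden = rng.normal(size=(n_hidden, n_inputs))
b_hidden = rng.normal(size=n_hidden)
w_out = rng.normal(size=n_hidden)

features = np.array([2.0, 14.0, 1.0, 0.6, 0.45])         # hypothetical normalized inputs
print("forecasted load (untrained, demo only):", forward(features, w_hidden, b_hidden, w_out, 0.0))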

27.3.2 Renewable Generation Forecasting

Energy generation from RE resources such as solar and wind largely depends on the underlying weather conditions, and these are therefore considered variable sources of energy. Forecasting these sources hence poses a big challenge to the researchers working in this field, but for a proper, clean supply of energy and flawless load management it is a necessity. Generally, this forecasting is achieved using Numerical Weather Prediction (NWP) models, weather data, statistical models, or modern tools such as artificial intelligence. Hybrid models that aggregate the above approaches can also help in improving the accuracy of the model. It is often observed that short-term energy forecasting gives more accurate results than forecasting over longer horizons [15, 16].

The input layer of the ANN model consists of several nodes fed with NWP model data, GHI data, and previously measured power data. Apart from the internal weights of the nodes, it can be observed that the accuracy of the forecast relies heavily on the accuracy of the numerical weather prediction data; hence, for improving the forecasting accuracy, the focus has been heavily on NWP data accuracy. Further, the hidden layer consists of a predetermined number of hidden nodes, which receive weighted input from the earlier layer, process it using the sigmoid activation function (27.2), and transmit the result to the next layer. Using the training set, the model is trained and the weights are optimized using the back-propagation method [17].

27.4 Energy Supply Optimization with Grid Balancing

Once the load and renewable generation are aptly forecasted and their accuracy and errors are precisely maintained, further losses of RE can be reduced by implementing proper supply and energy management techniques. This can be achieved by identifying the regions to which the RE from a given plant is supplied, so that the energy demand for that region can be forecasted as discussed in Sect. 27.3.1. Let the day-ahead forecasted load for the given region be Lt24 and the day-ahead forecasted RE be REt24. The relation between these two quantities can be formulated as:

GEt24 = Lt24 − REt24, ∀t    (27.3)

GEt = Lt − REt, ∀t    (27.4)

where GEt24 is the day-ahead power quotient, which decides whether supply will be in excess or there will be an excess demand, GEt is the real-time power quotient, and Lt and REt are the real-time load and RE supply, respectively. The relation between the real-time power quotient and the day-ahead power quotient is given by:

GEt = GEt24 + Ert, ∀t    (27.5)

where Ert is a real-time error factor, since the real-time demand and RE supply may vary from the forecasted ones because of errors in the forecasting models. In Eq. (27.5), Ert can be given as:

Ert = ELt − EREt, ∀t    (27.6)

where ELt is the error of the day-ahead load forecast, while EREt is the error of the day-ahead RE supply forecast. These individual forecasting errors can be formulated as:

ELt = Lt − Lt24, ∀t    (27.7)

EREt = REt − REt24, ∀t    (27.8)


where Lt is the real-time load and REt is the real-time RE power supply. Hence, ELt is the difference between the real-time load and the day-ahead forecasted load, while EREt is the difference between the real-time RE power supply and the day-ahead forecasted power supply.

27.4.1 Power Supply Model

Once the power quotient GEt24 is quantified from Eq. (27.3), it becomes easy to infer whether the supply will be in excess or in deficit. In either case, it is crucial to manage the power supply; otherwise it may cause either power losses or, in the other case, load shedding. Mostly during off-peak load hours, when demands are low, there is a greater chance of surplus power, which needs to be utilized rationally without losses; that is, when Lt < REt, the value of GEt is negative. Day-ahead planning is done in this case to supply the generated power rationally to meet the demand. The day-ahead predicted REt24 is used to meet the demand Lt24, and the surplus power GEt24 is given by Eq. (27.3). The real-time demand and power supply, however, may differ. In such cases the excess of RE, i.e. EREt, is used to meet the demand of the excess load ELt. This ELt may occur for various reasons, such as model uncertainties or flexible loads caused by festivities, economic changes, or variable industrial or domestic demands. Depending on these factors, if EREt > ELt, then there is surplus power, which adds to the day-ahead surplus power to give the total surplus of the system for the given time t; while if EREt < ELt, then the excess demand can be met using the day-ahead surplus power, and the total surplus power of the system is reduced. If the day-ahead surplus power is not sufficient to meet these excess demands, then day-ahead grid power can be purchased from the day-ahead electricity market.

If EREt > ELt,

Prt = EREt − ELt, ∀t    (27.9)

where Prt is the real-time surplus power, and the total surplus power Pt of the system for the given time t is:

Pt = Pt24 + Prt, ∀t (Pt > Pt24)    (27.10)

If EREt < ELt, then Prt becomes negative and, in this case, Pt < Pt24. Due to various underlying topographical and financial limitations, and even taking storage inefficiencies into consideration, energy storage methods alone will not mitigate the power losses. The other way to look at these problems is the day-ahead electricity market: this surplus energy can be traded in the day-ahead electricity markets. Eventually, it can be seen that Pt24 is the amount of surplus power which can be traded in the market, but the actual surplus is Pt, which differs from Pt24. This difference can be managed using the real-time market and two-settlement system strategies [18, 19].

In the case of peak load, when demands are high, there is a possibility of a power shortage, i.e. demand may exceed the supplied power. In such cases of power shortage, when GEt > 0, the demand can be satisfied using a day-ahead purchase of grid power. Using the value of GEt24 and the average error factor, grid power can be purchased from the day-ahead market and utilized the next day. The error factor Ert can then be managed using the real-time market and the two-settlement system methods (Fig. 27.2).

Fig. 27.2 Power flow model
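The bookkeeping in Eqs. (27.3)–(27.10) can be summarized in a few lines of code. The sketch below is only an illustration of the sign conventions, assuming hourly values; the numbers and the function name are invented for the example and are not part of the paper's model.

def settle_hour(load_da: float, re_da: float, load_rt: float, re_rt: float) -> dict:
    # Day-ahead and real-time power quotients (Eqs. 27.3 and 27.4)
    ge_da = load_da - re_da
    ge_rt = load_rt - re_rt
    # Forecast errors and the real-time error factor (Eqs. 27.5-27.8)
    e_load = load_rt - load_da
    e_re = re_rt - re_da
    err = e_load - e_re          # equals ge_rt - ge_da
    # Real-time surplus when the RE excess exceeds the load excess (Eq. 27.9)
    pr_rt = e_re - e_load
    p_da = max(-ge_da, 0.0)      # day-ahead surplus exists only when forecasted RE exceeds forecasted load
    p_total = p_da + pr_rt       # total surplus of the system (Eq. 27.10)
    return {"GE_da": ge_da, "GE_rt": ge_rt, "Er": err, "Pr": pr_rt, "P_total": p_total}

# Hypothetical hour: 320 MW forecast load, 350 MW forecast RE, real-time 330 and 345 MW
print(settle_hour(320.0, 350.0, 330.0, 345.0))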

27.5 Conclusion

With the rapid increase in the penetration of renewable sources of energy in the world market, providers and central bodies face a serious issue of energy curtailment. Renewable energy is the greenest form of energy, but this problem keeps it from being an impeccable one. In this paper, methods to deal with this problem have been discussed: grid balancing and load management, supported by load and energy forecasting, were used to achieve a reduction in energy curtailment. Many solutions have previously been proposed based on policies, energy storage, system efficiency, etc., but the problem has persisted. Hence, the best way to address it is to use the generated RE immediately, with a minimal amount of storage in case of surplus production. To achieve this, precise forecasting of demand and RE generation needs to be performed. An Artificial Neural Network model was considered for forecasting these two quantities; the accuracy of the ANN model over other statistical methods has been studied by researchers for many years. The demand forecasting and energy forecasting models were then used to compute the supply quotient. Depending on its value, generators can carry out transactions in the day-ahead electricity market, keeping in mind the maximum utilization of these variable sources. In this way, renewable generation can be used with priority, and maximum exploitation of these resources can be ensured.

References

1. Ming, Z., Kun, Z., Jun, D.: Overall review of China's wind power industry: status quo, existing problems and perspective for future development. Int. J. Electr. Power Energy Syst. 76, 768–774 (2007)
2. Wind and solar energy curtailment: experience and practices in the United States. National Renewable Energy Laboratory (NREL)
3. Golden, R., Paulos, B.: Curtailment of renewable energy in California and beyond. Electr. J. 28(6), 36–50 (2015)
4. Li, C., Shi, H., Cao, Y., Wang, J., Kuang, Y., Tan, Y., Wei, J.: Comprehensive review of RE curtailment and avoidance: a specific example in China. Renew. Sustain. Energy Rev. 41, 1067–1079 (2015)
5. Debnath, K.B., Mourshed, M.: Forecasting methods in energy planning models. Renew. Sustain. Energy Rev. 88, 297–325 (2018)
6. Hernández, L., Baladrón, C., Aguiar, J.M., Carro, B., Sánchez-Esguevillas, A.: Improved short-term load forecasting based on two-stage predictions with artificial neural networks in a microgrid environment. Energies 6, 4489–4507 (2013)
7. Hsiao, Y.-H.: Household electricity demand forecast based on context information and user daily schedule analysis from meter data. IEEE Trans. Ind. Inform. 11(1), 33–43 (2015)
8. Marvuglia, A., Messineo, A.: Using recurrent artificial neural networks to forecast household electricity consumption. Energy Procedia 14, 45–55 (2012)
9. Twanabasu, S.R., Bremdal, B.A.: Load forecasting in a smart grid-oriented building. In: 22nd International Conference and Exhibition on Electricity Distribution (CIRED 2013), Institution of Engineering and Technology (2013)
10. Custer, C., Rezgui, Y., Mourshed, M.: Electrical load forecasting model: a critical systematic review. Sustain. Cities Soc. 35 (2017)
11. Pradhan, R.: Z score estimation for banking sector. Int. J. Trade Econ. Financ. 5(6), 516–520 (2014)
12. Mustaffa, Z., Yusof, Y.: A comparison of normalization techniques in predicting dengue outbreak. In: International Conference on Business and Economic Research, vol. 1, pp. 345–349 (2010)
13. Karsoliya, S.: Approximating number of hidden layer neurons in multiple hidden layer BPNN architecture. Int. J. Eng. Trends Technol. 3(6), 714–717 (2012)
14. Panchal, F.S., Panchal, M.: Review on methods of selecting number of hidden nodes in artificial neural network. Int. J. Comput. Sci. Mob. Comput. 3(11), 455–464 (2014)
15. Ghofrani, M., Alolayan, M.: Time series and renewable energy forecasting. IntechOpen (2017)
16. Dobbs, A.: Short-term solar forecasting performance of popular machine learning algorithms. National Renewable Energy Laboratory, Golden, Colorado (2017)
17. Tarigan, J., Diedan, R., Suryana, Y.: Plate recognition using backpropagation neural network and genetic algorithm. Procedia Comput. Sci. 116, 365–372 (2017)
18. Counsell, L.K., Evans, L.T.: Day ahead electricity markets: is there a place for a day ahead market in NZEM? New Zealand Institute for the Study of Competition and Regulation
19. Veit, D.J., Weidlich, A., Yao, J., Oren, S.S.: Simulating the dynamics in two-settlement electricity markets via an agent-based approach. Int. J. Manag. Sci. Eng. Manag. 1(2), 83–97 (2006)

Chapter 28

RAM: Rotating Angle Method of Clustering for Heterogeneous-Aware Wireless Sensor Networks

Kameshkumar R. Raval and Nilesh Modi

Abstract Advances in research on wireless networks of a few hundred tiny sensor nodes enable humans to obtain more precise information, from a remote location, about the environment of harsh and unattended areas. We extend such valuable efforts and present a new approach to form and manage the functioning of a wireless sensor network (WSN) by dividing the network into a number of clusters. To divide the whole WSN into geographically equal-sized clusters, we use the "Rotating Angle Method" algorithm. We have simulated and implemented this algorithm and compared the results with many other clustering-based energy-efficient WSN algorithms, which shows that this algorithm enhances the lifetime and stability period of the WSN.

K. R. Raval (B) Som-Lalit Institute of Computer Applications, SLICA, SLIMS Campus, University Road, Opp. Xavier's College, Navarangpura, Ahmedabad 380009, India e-mail: [email protected]
N. Modi BAOU, Dr. Baba Saheb Ambedkar Open University, Near Nirma University, Chharodi, S.G. Highway, Ahmedabad, India e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_28

28.1 Introduction

A wireless sensor node [1] has a sensing unit with sensors to sense data from the external environment, an ADC unit to convert the sensed analog data into digital form, a low-powered processing unit to process the data, and a wireless transmitter–receiver for communication. Sensor nodes are usually intended to be deployed in remote areas and in infrastructures that lack resources such as persistent power supply and wired networks, so a battery is used to power the different components of the sensor node discussed above. Sensor nodes are very small, low cost, and energized by a battery; once the battery is drained, the sensor node becomes useless and cannot be utilized again. Hundreds or thousands of sensor nodes are deployed randomly in the area that needs to be monitored. Each sensor node senses data from the external environment at periodic intervals and transmits the data to the base station (BS), either directly or via other sensor nodes (multi-hop). The base station is a node assumed to be full of resources such as a wired network and an uninterrupted power supply; some researchers use the term sink node for the base station. It presents the sensor field data to the end user at a remote location.

28.2 Related Work

Many researchers have made valuable efforts to develop more productive and energy-efficient algorithms. Direct Transmission (DT) [2] is the simplest algorithm, in which every sensor node transmits its data directly to the BS or sink node. The major drawback of this algorithm is that nodes which are geographically far from the BS drain their energy more quickly, because they transmit their sensed data directly over a long distance. In Minimum Transmission Energy (MTE) [2], sensor nodes transmit data to the BS via intermediate sensor nodes: nodes that lie geographically between the transmitting node and the BS act both as sensor nodes and as gateway nodes. They sense and transmit their own data, and they also receive data from other sensor nodes and forward it towards the BS. After a few rounds of this algorithm, the nodes nearer to the BS die first, as they play this dual role. In PEGASIS [3], a node passes its sensed data via other sensor nodes that lie between it and the BS, and if no intermediate node is available, the node transmits directly to the BS. This algorithm suffers from delayed information, as the data propagate along a long chain through multiple hops.

LEACH [4] is the pioneer cluster-based energy-efficient algorithm for WSN. Since nearby sensors sense essentially the same data, sending the same type of data from different sensor nodes is redundant. To save the energy of the sensor nodes and extend the lifetime of the WSN, LEACH proposed a cluster-based algorithm in which nearby sensor nodes form a cluster. One node of each cluster is elected as the head node of the cluster, and the other nodes act as cluster members. Cluster member nodes sense data from the external environment and send them to their cluster head node. The head node aggregates the data, and only the aggregated data (fewer bits) are transmitted to the distant BS. LEACH thus saves energy by restricting the transmission of redundant information to the BS. However, in the LEACH protocol a number of control frames are transferred between the sensor nodes to elect the head node and to distribute the TDMA schedule prepared by the head node of the cluster. The LEACH-C [5] algorithm overcomes this problem by increasing the responsibility of the BS: the BS forms the clusters, appoints a head node for each cluster, and informs the other cluster members about their TDMA schedule and their head node. The BS has unlimited resources, which can be used to save the energy of the WSN nodes.

LEACH and PEGASIS consider a homogeneous WSN with nodes of the same type and energy level. In LEACH-C, a heterogeneous network is considered: some time after the deployment of the network, a few more nodes can be added to extend its lifetime, so the WSN has some advanced nodes (recently added sensor nodes with more energy) and normal nodes with less energy. However, LEACH-C does not consider the energy level of the sensor nodes and, as a result, fails to utilize the higher energy of the advanced nodes. SEP [6] uses the higher energy of the advanced nodes and gives them greater priority to become head nodes, which increases the stability of the WSN; stability here means a larger number of live nodes, which in turn produces more accurate and precise data. ETSSEP [7] is another algorithm that uses a threshold-based technique to elect the cluster head and enhance the performance of SEP. In MH-LEACH [8], the authors proposed a LEACH variant where head nodes pass their data to the BS via multiple hops through the head nodes of other clusters. EE-LEACH [9] uses a Gaussian method to distribute sensor nodes and considers the residual energy of a node when electing the head node of the cluster. The authors in [10] used fuzzy logic to appoint the head node of the cluster.

Most of the algorithms discussed above use a randomization method to elect head nodes, after which the remaining nodes join the nearest head node and form the clusters. Some algorithms also consider the residual energy level of the nodes and give higher priority to the node with the highest energy. Randomized head-node election sometimes elects two or more cluster heads in the same area, while some areas of the WSN have no head node at all; sensor nodes in such areas must join a head node that is farther away and spend much more energy transmitting their data to it. In MSECHP [11], this was carefully observed and a new approach was introduced that appoints head nodes uniformly distributed over the WSN field by applying a virtual grid. MSECHP increased the performance, lifetime, and stability period of the network.

28.3 Model of Radio Transmission

We assume the simplest first-order radio transmission model to estimate the energy consumption. If E_TX is the energy required to transmit a bit and E_RX the energy required to receive a bit, then, as per the basic model of radio transmission,

E_elec = E_TX = E_RX = 50 nJ/bit    (28.1)


28.3.1 Transmission Energy

To transmit L bits to a recipient at distance d, a sensor node needs to amplify the transmitted signal, so it spends energy to both transmit and amplify the L bits, as shown below:

E_TX(L, d) = E_elec · L + E_amp(d) · L    (28.2)

The amplification energy for a distance of d meters is calculated using either the free-space or the multipath model, as shown below:

E_amp(d) = ε_fs · d^2,  if d < d0
E_amp(d) = ε_mp · d^4,  if d ≥ d0    (28.3)

Here d is the distance between the sender and the recipient node. The value of d0 is obtained from the following equation:

d0 = sqrt(ε_fs / ε_mp)    (28.4)

From the equations discussed above, we can write the summarized equation:

E_TX(L, d) = E_elec · L + ε_fs · d^2 · L,  if d < d0
E_TX(L, d) = E_elec · L + ε_mp · d^4 · L,  if d ≥ d0    (28.5)

28.3.2 Receiving Energy

The energy dissipated by the receiving sensor node to receive L bits of data is formulated as:

E_RX(L) = E_elec · L    (28.6)

Energy dissipation can be described by Fig. 28.1. L bits of data sensed by the transmitter node are given to the transmitter circuitry, which transmits the L bits and forwards them to the amplifier; the amplifier strengthens the signal of all L bits so that they can reach distance d. At distance d, the recipient node receives the L bits and spends the corresponding receiving energy for L bits.


Fig. 28.1 Communication of L bits between sender and recipient nodes at distance d
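The first-order radio model of Eqs. (28.1)–(28.6) translates directly into code. The sketch below is only an illustration of those equations, using the parameter values listed later in Table 28.1; the function names and the example packet size are assumptions made for the example.

import math

E_ELEC = 50e-9        # J/bit, Eq. (28.1)
EPS_FS = 10e-12       # J/bit/m^2, free-space amplifier
EPS_MP = 0.0013e-12   # J/bit/m^4, multipath amplifier
D0 = math.sqrt(EPS_FS / EPS_MP)   # crossover distance, Eq. (28.4)

def tx_energy(bits: int, d: float) -> float:
    # Transmission energy, Eq. (28.5): electronics plus distance-dependent amplification
    if d < D0:
        return E_ELEC * bits + EPS_FS * d ** 2 * bits
    return E_ELEC * bits + EPS_MP * d ** 4 * bits

def rx_energy(bits: int) -> float:
    # Receiving energy, Eq. (28.6)
    return E_ELEC * bits

packet = 4000                                         # assumed packet size in bits
print(tx_energy(packet, 30.0), rx_energy(packet))     # energy in joules for a 30 m link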

28.4 Proposed Rotating Angle Method Algorithm

In the proposed algorithm, we divide the whole WSN field into four equally sized regions, as shown in Fig. 28.2. Taking the center point of the WSN as the origin, we consider two circles with radii of 25 and 40 m, so that circle1 has a radius of 25 m and circle2 a radius of 40 m, as shown in Fig. 28.2 (left). On circle1, exactly at 45°, we place our first optimal point; the same process is applied to the remaining three quadrants, so that we obtain four optimal points on circle1, one in each quadrant, each making 45° with the X-axis. On circle2 we take two more optimal points, in opposite directions from each other, making 45° with the first optimal point, as shown in Fig. 28.2 (right). Thus, in the first quadrant we get three optimal points (one on circle1 and two on circle2). The same process is repeated for the other three quadrants, giving a total of 12 optimal points placed in the WSN field.

Fig. 28.2 Finding optimal geographic points for the appointment of cluster head nodes

The BS finds the sensor nodes nearest to these optimal points and promotes them to act as head nodes. The other sensor nodes join the nearest head node and form clusters; if any node finds the BS closer than all appointed head nodes, it can transmit its sensed data directly to the BS. All normal nodes sense data according to the TDMA schedule given to them by the BS and forward it to the head node of their cluster. The cluster head receives the data, aggregates them, and sends them directly to the BS, as shown in Fig. 28.3a. In the second round, the base angle is increased by a constant δ. When the optimal point on the first circle is moved, the two optimal points placed on the second circle also change, since they maintain exactly 45° with respect to the first optimal point on circle1. The angle of all four optimal points in all four quadrants is increased by δ. In every round, the angle of the optimal point on the first circle increases by δ and, based on it, the optimal points of circle2 change their positions. This rotates the head-node responsibility to other nodes, so all nodes get the opportunity to become the head node of a cluster.

Fig. 28.3 a Sensor nodes are transmitting their data to cluster head. b After completion of the first round, angle for the optimal point is rotated by 5°

The algorithm to find the optimal points for all quadrants and for both circles is given below.


for qt=0:1:3
    % Base angle of the optimal point on the first circle for this quadrant
    a1 = 90*qt + ang + 45;
    % Angles of the two optimal points on the second circle (offset by 22.5 degrees from a1)
    a2 = 90*qt + ang + 45 - 22.5;
    a3 = 90*qt + ang + 45 + 22.5;
    if a1 > 360
        a1 = a1 - 360;
    end
    if a2 > 360
        a2 = a2 - 360;
    end
    if a3 > 360
        a3 = a3 - 360;
    end
    % Optimal point on the first circle (radius rad1)
    Optimal(qt*3+1).xd = 50 + rad1*cosd(a1);
    Optimal(qt*3+1).yd = 50 + rad1*sind(a1);
    % Optimal points on the second circle (radius rad2)
    Optimal(qt*3+2).xd = 50 + rad2*cosd(a2);
    Optimal(qt*3+2).yd = 50 + rad2*sind(a2);
    Optimal(qt*3+3).xd = 50 + rad2*cosd(a3);
    Optimal(qt*3+3).yd = 50 + rad2*sind(a3);
end

In each round, the ang variable is increased by the constant δ, which rotates the optimal points on the circles. The variables rad1 and rad2 refer to the radii of circle1 and circle2, respectively.
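For readers who want to trace a whole round, the following is a rough sketch of how the optimal points could be turned into a cluster assignment. It is written in Python rather than MATLAB, uses invented helper names (optimal_points, nearest_node, assign_clusters) and toy node positions in a 100 m × 100 m field, and only paraphrases the description above; it is not the authors' simulation code.

import math

def optimal_points(ang, rad1=25.0, rad2=40.0, center=(50.0, 50.0)):
    # 12 optimal points: one on circle1 and two on circle2 per quadrant
    pts = []
    for qt in range(4):
        base = (90 * qt + ang + 45) % 360
        for radius, a in ((rad1, base), (rad2, (base - 22.5) % 360), (rad2, (base + 22.5) % 360)):
            pts.append((center[0] + radius * math.cos(math.radians(a)),
                        center[1] + radius * math.sin(math.radians(a))))
    return pts

def nearest_node(point, nodes):
    # The node closest to an optimal point is promoted to cluster head
    return min(nodes, key=lambda n: math.dist(n, point))

def assign_clusters(nodes, heads, bs=(50.0, 50.0)):
    # Each remaining node joins the nearest head, or the BS if the BS is closer
    return {n: min(list(heads) + [bs], key=lambda h: math.dist(n, h)) for n in nodes}

nodes = [(10.0, 20.0), (80.0, 75.0), (40.0, 60.0), (55.0, 15.0)]   # toy node positions
heads = {nearest_node(p, nodes) for p in optimal_points(ang=0.0)}
print(assign_clusters(nodes, heads))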

28.5 Simulation

28.5.1 Simulation Parameters

We simulated 100 sensor nodes in a WSN covering an area of 100 m × 100 m in MATLAB, with transmission and receiving energies of 50 nJ/bit. Of the 100 nodes, 10% are advanced nodes and the remaining nodes are normal nodes. Table 28.1 shows the details of all the parameters used for the simulation. Figure 28.4 shows the MATLAB simulation, in which the BS is represented as ×, normal nodes as O, advanced nodes as +, and dead nodes as ◆.


Table 28.1 Simulation parameters

Type of operation                                       Energy utilization
Transmitting / Receiving                                Transmission: ETX = 50 nJ/bit; Receiving: EREC = 50 nJ/bit; Eelec = ETX = EREC = 50 nJ/bit
Data processing                                         EDA = 5 nJ/bit
Transmit amplifier, free space model (dtoBS ≤ d0)       Efs = 10 pJ/bit/m2
Transmit amplifier, multipath model (dtoBS > d0)        Emp = 0.0013 pJ/bit/m4
Constant variation in δ                                 3
If mod(round, 400) = 0                                  rad1 = rad1 − δ; rad2 = rad2 + δ

Fig. 28.4 Simulation of rotating angle method clustering algorithm

28.5.2 Simulation Results

We compared our simulation results with other cluster-based algorithms such as LEACH-C, SEP, and MSECHP, using the same set of data. Figure 28.5 shows the comparative analysis between them. The results clearly show that MSECHP performs better than LEACH-C and SEP, while the RAM algorithm performs better than MSECHP on the same data values. We also compared our result statistics with HEED [12], LPCH, UDLPCH [13], and PCAC [14]. Figure 28.6 shows that, during the lifetime of the WSN, a larger number of live nodes are present in RAM. We ran the simulation more than five times and recorded the round number in which the first node died, as described in Table 28.2. The table shows that in the rotating angle method the first node dies after a larger number of rounds compared to LEACH-C, SEP, and MSECHP. A larger number of live nodes provides stability to the wireless sensor network.


Fig. 28.5 Comparison of the results of LEACH, SEP, and MSECHP

Fig. 28.6 Comparison of the result of MSECHP and rotating angle method

Table 28.2 Comparative analysis of first node dead in the specific round

Simulation   LEACH-C   SEP    MSECHP   RAM
1            996       893    1112     1227
2            1075      1041   1026     1178
3            1002      896    1096     1207
4            943       995    1107     1239
5            938       1002   1115     1288


28.6 Conclusion and Future Work

We conclude that the rotating angle method for heterogeneous-aware WSNs increases the lifetime and stability of the network; a larger number of live nodes in each round yields more accurate results. In future work, we will investigate how the algorithm reacts to different network sizes and how it performs when the density of the sensor nodes in the WSN changes. What is the optimal value of δ and the optimal radius of each circle? How much change is required in each radius, and how many circles should be considered for different sizes of WSNs? We will try our best to answer these questions in our future work.

References

1. Akkaya, K., Younis, M.: A survey on routing protocols for wireless sensor networks. 3, 325–349 (2005)
2. Shepard, T.: A channel access scheme for large dense packet radio networks. In: Proceedings of the ACM SIGCOMM, pp. 219–230
3. Jafri, M.R., Javaid, N., Javaid, A., Khan, Z.A.: Maximizing the lifetime of multi-chain PEGASIS using sink mobility. World Appl. Sci. J. 21, 1283–1289 (2013)
4. Heinzelman, W.R., et al.: Energy-efficient communication protocol for wireless microsensor networks. In: Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, pp. 3005–3014 (2000)
5. Heinzelman, W.B., Chandrakasan, A.P., Balakrishnan, H.: An application-specific protocol architecture for wireless microsensor networks. IEEE Trans. Wirel. Commun. 1, 660–670 (2002)
6. Smaragdakis, G., Matta, I., Bestavros, A.: SEP: a stable election protocol for clustered heterogeneous wireless sensor networks. In: Second International Workshop on Sensor and Actor Network Protocols and Applications (SANPA 2004), pp. 1–11 (2004)
7. Kumar, S., Kant, S., Awadhesh, V.: Enhanced threshold sensitive stable election protocol for heterogeneous wireless sensor network. Wirel. Pers. Commun. (2015). https://doi.org/10.1007/s11277-015-2925-x
8. Brand, H., Rego, S., Cardoso, R., Jr., J.C.: MH-LEACH: a distributed algorithm for multi-hop communication in wireless sensor networks, 55–61 (2014)
9. Arumugam, G.S., Ponnuchamy, T.: EE-LEACH: development of energy-efficient LEACH protocol for data gathering in WSN. EURASIP J. Wirel. Commun. Netw. 2015 (2015)
10. Amuthan, A., Arulmurugan, A.: Semi-Markov inspired hybrid trust prediction scheme for prolonging lifetime through reliable cluster head selection in WSNs. J. King Saud Univ. Comput. Inf. Sci. (2018). https://doi.org/10.1016/j.jksuci.2018.07.006
11. Raval, K.R., Modi, N.: MSECHP: more stable election of cluster head protocol for heterogeneous wireless sensor network. Adv. Intell. Syst. Comput. 508 (2017)
12. Younis, O., Fahmy, S.: HEED: a hybrid, energy-efficient, distributed clustering approach for ad hoc sensor networks. IEEE Trans. Mob. Comput. 4, 366–379 (2004)
13. Khan, Y., et al.: LPCH and UDLPCH: location-aware routing techniques in WSNs. In: Proceedings of the 2013 8th International Conference on Broadband, Wireless Computing, Communication and Applications (BWCCA 2013), pp. 100–105 (2013). https://doi.org/10.1109/bwcca.2013.25
14. Butun, I., Ra, I., Sankar, R.: PCAC: power and connectivity-aware clustering for wireless sensor networks. EURASIP J. Wirel. Commun. Netw. 2015 (2015)

Chapter 29

GWO-GA Based Load Balanced and Energy Efficient Clustering Approach for WSN

Amruta Lipare, Damodar Reddy Edla, Ramalingaswamy Cheruku and Diwakar Tripathi

Abstract Energy consumption of sensor nodes is one of the major challenges in wireless sensor networks (WSNs). To overcome this challenge, a clustering technique is used. In a cluster-based WSN, the leader of a cluster, called the cluster head (CH), collects, aggregates, and sends data to the base station; hence, balancing the data load is also one of the crucial tasks in a WSN. To address this problem, we use two bio-inspired algorithms for clustering, namely Grey Wolf Optimization (GWO) and the Genetic Algorithm (GA). The best-fitted solutions from GWO and GA undergo crossover and mutation operations to produce healthy offspring. The clustering solution obtained from GWO-GA is well load balanced and energy efficient. We compare the GWO-GA approach with some existing algorithms in terms of fitness values and different network parameters, namely the round in which the first sensor node dies and the round up to which half of the sensor nodes remain alive in the network. We observe that GWO-GA outperforms the existing algorithms.

A. Lipare (B) · D. Reddy Edla National Institute of Technology, Goa, India e-mail: [email protected]
D. Reddy Edla e-mail: [email protected]
R. Cheruku Mahindra Ecole Centrale, Hyderabad, Telangana, India e-mail: [email protected]
D. Tripathi Madanapalle Institute of Technology & Science, Madanapalle, Andhra Pradesh, India e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_29


29.1 Introduction

Over the last several decades, wireless sensor networks (WSNs) have come to play a dynamic role in the current generation of technology. A WSN has a number of applications in various fields such as health monitoring, military applications, and agriculture. A sensor node consists of sensors, a microcontroller, a transmitter, a receiver, etc. In a WSN, sensor nodes are battery operated, and charging or replacing batteries is a difficult task; therefore, this power must be used efficiently [1, 2]. Many researchers have proposed different energy-efficiency algorithms for WSNs. In a WSN with a cluster architecture, clusters are formed by grouping the sensor nodes. The leader of each cluster is called the cluster head (CH). The CH collects the sensed data from the sensor nodes of its cluster, aggregates the collected data, and transmits them towards the base station (BS), collectively through other CHs. The BS is a device connected to the Internet, from which updates and notifications about the targeted area can be obtained [3, 4]. In this paper, we use gateways instead of CHs; a gateway is a device with higher power capacity that works similarly to a CH.

In this paper, we mainly focus on the efficient energy consumption of sensor nodes, while also achieving overall load balancing of the gateways in the WSN. Many researchers have used different clustering algorithms for load balancing, such as distance-based clustering and density-based clustering. A node local density-based approach [5] for overall balancing of the data load in a WSN was used by Zang et al. They defined their algorithm in three phases. In the first phase, sensor nodes that lie within the communication range (R) of one and only one CH are assigned to that CH. In the second phase, the assignment is restricted to the communication range R/2, and sensor nodes are assigned to CHs within that range. In the third phase, the number of sensor nodes connected to each CH is counted, and the remaining sensor nodes are assigned to the CHs with the fewest connected sensor nodes. Researchers have also proposed different bio-inspired algorithms, such as evolutionary algorithms and swarm intelligence algorithms [6, 7]. The Genetic Algorithm (GA) [8] is an evolutionary algorithm inspired by the natural process of selection of genes. Swarm intelligence algorithms include Particle Swarm Optimization [9], Fish Schooling Optimization [10], and Grey Wolf Optimization (GWO) [11]. These algorithms have been adapted to solve different challenges in WSNs. Hussain et al. [12] applied the genetic algorithm for clustering in WSNs. Al-Aboody et al. [13] applied the GWO algorithm for cluster head selection together with a decision tree approach for energy efficiency in WSNs. In this paper, we use both the GWO and the GA approach to obtain a better optimal clustering solution that balances the overall load of the network.

The rest of the paper is organized as follows: Sect. 29.2 presents the preliminaries required for the proposed work; the proposed GWO-GA approach is elaborated in Sect. 29.3; Sect. 29.4 shows the experimental results of the proposed work; finally, the paper is concluded with the observed results.


29.2 Preliminaries

29.2.1 Overview of Grey Wolf Optimization

Mirjalili et al. [11] designed the grey wolf optimizer (GWO), which is inspired by the leadership hierarchy and hunting behavior of grey wolves. In GWO, the initial population of wolves is generated randomly, and the α, β, and δ wolves are the first, second, and third best solutions, respectively. The position of each grey wolf can be represented in vector form as $\vec{X} = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n)$, where n is the dimension of the search space. The α, β, and δ wolves move in the direction of the prey, and the ω wolves follow them. The wolves' hunting strategy is defined in three stages: (i) encircling prey, (ii) hunting prey, and (iii) attacking prey.

Encircling prey: In this stage, wolves change their positions by reflecting the positions of the α, β, and δ wolves throughout the optimization. This can be expressed mathematically by Eqs. 29.1 and 29.2:

$\vec{D} = |\vec{C} \cdot \vec{X}_p(t) - \vec{X}(t)|$  (29.1)

$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D}$  (29.2)

where $\vec{X}_p$ is the position of the prey, $\vec{X}$ is the wolf's position, t is the current iteration, $\vec{D}$ is the distance between the position vectors of the wolf and the prey, and $\vec{A}$ and $\vec{C}$ are coefficient vectors. The components of $\vec{a}$ are linearly decreased from 2 to 0 over the course of the iterations, and $\vec{r}_1$, $\vec{r}_2$ are random vectors in [0, 1], as shown in Eqs. 29.3 and 29.4, respectively:

$\vec{A} = 2\vec{a} \cdot \vec{r}_1 - \vec{a}$  (29.3)

$\vec{C} = 2 \cdot \vec{r}_2$  (29.4)

Hunting prey: The hunting behavior is guided by the α, β, and δ wolves because they have better knowledge about the location of the prey. The ω wolves update their positions according to these three best solutions, as expressed in Eq. 29.5:

$\vec{D}_\alpha = |\vec{C}_1 \cdot \vec{X}_\alpha - \vec{X}|; \quad \vec{D}_\beta = |\vec{C}_2 \cdot \vec{X}_\beta - \vec{X}|; \quad \vec{D}_\delta = |\vec{C}_3 \cdot \vec{X}_\delta - \vec{X}|$  (29.5)

Here $\vec{X}$ denotes the current position of the wolf; $\vec{X}_\alpha$, $\vec{X}_\beta$, and $\vec{X}_\delta$ are the positions of the alpha, beta, and delta wolves, and $\vec{D}_\alpha$, $\vec{D}_\beta$, and $\vec{D}_\delta$ are the corresponding distances. After calculating these distances, the candidate positions $\vec{X}_1$, $\vec{X}_2$, and $\vec{X}_3$ of the current wolves are evaluated as shown in Eqs. 29.6 and 29.7.


$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha; \quad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta; \quad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta$  (29.6)

$\vec{X}(t+1) = \dfrac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}$  (29.7)

where $\vec{A}_1$, $\vec{A}_2$, and $\vec{A}_3$ are random vectors and t specifies the current iteration number.

Attacking prey: Grey wolves diverge from each other while searching for prey (exploration) and converge while attacking the prey (exploitation).
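To make the update rules above concrete, the following minimal Python sketch implements one GWO position update for a continuous search space. The function and variable names are illustrative only and are not part of the original paper; the discrete encoding used later for sensor-to-gateway assignment would require an additional mapping step.

```python
import numpy as np

def gwo_step(positions, alpha, beta, delta, a):
    """One GWO iteration: move every wolf toward the alpha, beta and delta
    wolves according to Eqs. 29.1-29.7 (continuous version)."""
    new_positions = np.empty_like(positions)
    for i, X in enumerate(positions):
        candidates = []
        for leader in (alpha, beta, delta):
            r1, r2 = np.random.rand(X.size), np.random.rand(X.size)
            A = 2 * a * r1 - a           # Eq. 29.3
            C = 2 * r2                   # Eq. 29.4
            D = np.abs(C * leader - X)   # Eq. 29.5
            candidates.append(leader - A * D)  # Eq. 29.6
        new_positions[i] = sum(candidates) / 3.0   # Eq. 29.7
    return new_positions

# Example: 10 wolves in a 5-dimensional search space, 'a' decreasing from 2 to 0.
wolves = np.random.rand(10, 5)
for t in range(100):
    a = 2 - 2 * t / 100
    fitness = wolves.sum(axis=1)          # placeholder fitness (larger is better)
    order = np.argsort(fitness)[::-1]
    alpha, beta, delta = wolves[order[:3]]
    wolves = gwo_step(wolves, alpha, beta, delta, a)
```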

29.2.2 Overview of Genetic Algorithm

The genetic algorithm [8] is inspired by the natural process of selection of genes. It consists of the following steps:

Step 1. Initial population: A certain number of solutions are generated randomly.
Step 2. Fitness function evaluation: The randomly generated solutions are evaluated with the fitness function, and the fitness value of each solution is inspected.
Step 3. Selection: Parent solutions are selected using techniques such as roulette wheel selection or tournament selection.
Step 4. Crossover: The parent solutions obtained from the selection phase undergo the crossover operation, in which two parents exchange their information to generate a new off-spring solution.
Step 5. Mutation: A number of genes are changed to other genes.

The overall process produces new off-springs with better fitness values.
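As an illustration of Steps 1-5, the sketch below runs one generation of a simple genetic algorithm over binary-encoded solutions. The encoding, fitness function, and parameter values are placeholders, not the ones used in the proposed GWO-GA approach.

```python
import random

def ga_generation(population, fitness_fn, mutation_rate=0.05):
    """One GA generation: roulette wheel selection, one-point crossover, mutation."""
    scores = [fitness_fn(ind) for ind in population]
    total = sum(scores)

    def roulette():
        # Step 3: pick a parent with probability proportional to its fitness.
        r, acc = random.uniform(0, total), 0.0
        for ind, s in zip(population, scores):
            acc += s
            if acc >= r:
                return ind
        return population[-1]

    offspring = []
    while len(offspring) < len(population):
        p1, p2 = roulette(), roulette()
        point = random.randrange(1, len(p1))                      # Step 4: one-point crossover
        child = p1[:point] + p2[point:]
        child = [1 - g if random.random() < mutation_rate else g  # Step 5: mutation
                 for g in child]
        offspring.append(child)
    return offspring

# Step 1: random initial population of 20 bit strings; Step 2: count of ones as fitness.
pop = [[random.randint(0, 1) for _ in range(16)] for _ in range(20)]
pop = ga_generation(pop, fitness_fn=sum)
```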

29.2.3 Energy Consumption Model for WSN

For calculating the energy consumption of sensor nodes, we use the same energy model as in [14]. The energy consumed by a sensor node when sending l bits of data over distance d is calculated by Eqs. 29.8 and 29.9. Let $E_{elec}$ be the energy required for the electronic operations, let $\varepsilon_{fs}$ and $\varepsilon_{mp}$ be the amplification energies required to transmit data over a free-space channel and a multipath fading channel, respectively, and let $d_0$ be a threshold distance. The total transmission energy is expressed in Eq. 29.8:

$E_T(l, d) = \begin{cases} l \cdot E_{elec} + l \cdot \varepsilon_{fs} \cdot d^2, & d < d_0 \\ l \cdot E_{elec} + l \cdot \varepsilon_{mp} \cdot d^4, & d \ge d_0 \end{cases}$  (29.8)


The energy received by the sensor node is given in Eq. 29.9:

$E_R(l) = l \cdot E_{elec}$  (29.9)
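A direct transcription of Eqs. 29.8 and 29.9 in Python is shown below. The parameter values mirror those listed in Sect. 29.4, and the threshold d0 = sqrt(eps_fs / eps_mp) follows the usual first-order radio model, which is an assumption not stated explicitly in the text.

```python
import math

E_ELEC = 50e-9       # 50 nJ/bit
EPS_FS = 10e-12      # 10 pJ/bit/m^2
EPS_MP = 0.001e-12   # 0.001 pJ/bit/m^4 (as stated in Sect. 29.4)
D0 = math.sqrt(EPS_FS / EPS_MP)   # threshold distance (assumed definition)

def tx_energy(l_bits, d):
    """Transmission energy of Eq. 29.8: free-space below d0, multipath above."""
    if d < D0:
        return l_bits * E_ELEC + l_bits * EPS_FS * d ** 2
    return l_bits * E_ELEC + l_bits * EPS_MP * d ** 4

def rx_energy(l_bits):
    """Reception energy of Eq. 29.9."""
    return l_bits * E_ELEC

# Example: energy for a node to send a 4000-bit packet to a gateway 8 m away.
print(tx_energy(4000, 8), rx_energy(4000))
```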

29.3 Proposed Work

We apply the GWO and GA approaches to load-balanced clustering in WSNs. Our contributions to these approaches are as follows: (i) We restrict the assignment of sensor nodes to gateways in the initial population phase. Instead of assigning gateways randomly, we assign each sensor node to a gateway within its communication range for both the GWO and the GA approach; this increases the chance of generating a good solution in the first phase itself. (ii) In the crossover operation, instead of replacing both parent solutions, we replace only the worst solution. This helps to preserve the best solution in the population. (iii) In the mutation phase, instead of changing a random bit, we change the assignment of the sensor node connected to the farthest gateway to its nearest gateway, which reduces the energy consumption of that sensor node. (iv) The best M solutions obtained from the GWO and GA approaches undergo crossover and mutation operations.

Our GWO-GA approach is explained in the following steps:

1. We start with the GWO approach: the initial population phase generates solutions randomly, where each solution represents an assignment of sensor nodes to gateways.
2. Each solution is evaluated by the fitness function [7] given in Eq. 29.10. The estimated gateway load metric is used for efficient load balancing of the gateways and is computed using Eq. 29.11:

$\text{Fitness} = \text{Estimated gateway load} \times \dfrac{\#\text{Balanced gateways}}{\#\text{gateways}}$  (29.10)

$\text{Estimated gateway load} = \dfrac{\sum_{i=1}^{m} \text{Load distribution}(G_i)}{\#\text{gateways}}$  (29.11)

The metric Load distribution($G_i$) represents the overall load distribution of the gateways, as given in Eq. 29.12:

$\text{Load distribution}(G_i) = \dfrac{\text{Load}(G_i)}{\max_i \text{Load}(G_i)}, \quad \forall i = 1, 2, \ldots, m$  (29.12)


Load($G_i$) represents the total size of the outgoing packets of gateway $G_i$ and is calculated using Eq. 29.13:

$\text{Load}(G_i) = l \times \#\text{outgoing packets}$  (29.13)

In order to categorize the balanced and unbalanced gateways, the load of each gateway is checked against threshold loads. $\text{Max}_t$ and $\text{Min}_t$ are the maximum and minimum load thresholds, respectively; gateways whose load lies between $\text{Min}_t$ and $\text{Max}_t$ are considered balanced. These thresholds are given in Eqs. 29.14 and 29.15. The mean of the overall packet load is calculated using Eq. 29.16, and γ is a constant value.

$\text{Max}_t = \text{Mean} + \gamma \cdot \text{Mean}$  (29.14)

$\text{Min}_t = \text{Mean} - \gamma \cdot \text{Mean}$  (29.15)

$\text{Mean} = \dfrac{\sum_{i=1}^{m} \text{Load}(G_i)}{m}$  (29.16)

3. The three solutions with the best fitness values are retained for the next phases.
4. Following the GWO approach, the other solutions update their positions with respect to these three best solutions, as per Eqs. 29.1–29.7.
5. The best M solutions are retained for the next procedure, and the remaining size(initial population) − M solutions are discarded.
6. We then proceed with the GA approach: the initial population phase is carried out and solutions are generated randomly.
7. The same fitness function as in the GWO approach is used, as expressed in Eqs. 29.10–29.16.
8. After calculating the fitness value of each solution, roulette wheel selection is applied and the M best solutions are extracted.
9. The best M solutions obtained from the GWO approach and the best M solutions obtained from the GA approach undergo the crossover operation. We use one-point crossover to obtain the optimum solution.
10. In the mutation phase, we find the sensor node assigned to the farthest gateway and change its assignment to its nearest gateway.
11. Finally, we take the solution with the best fitness value among all the off-springs.

A sketch of the fitness evaluation defined in Eqs. 29.10–29.16 is given below.
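The following Python sketch evaluates the fitness of one candidate assignment using Eqs. 29.10–29.16. The packet length l, the constant γ, and the representation of a solution as a node-to-gateway assignment list are illustrative assumptions.

```python
def fitness(assignment, num_gateways, l_bits=4000, gamma=0.25):
    """Fitness of one solution (Eq. 29.10).

    assignment[i] is the index of the gateway that sensor node i sends to.
    """
    # Load(G_i): total size of outgoing packets per gateway (Eq. 29.13),
    # assuming one l-bit packet per assigned sensor node.
    load = [0.0] * num_gateways
    for g in assignment:
        load[g] += l_bits

    max_load = max(load) or 1.0
    load_distribution = [x / max_load for x in load]                 # Eq. 29.12
    estimated_gateway_load = sum(load_distribution) / num_gateways   # Eq. 29.11

    mean = sum(load) / num_gateways                                  # Eq. 29.16
    max_t = mean + gamma * mean                                      # Eq. 29.14
    min_t = mean - gamma * mean                                      # Eq. 29.15
    balanced = sum(1 for x in load if min_t <= x <= max_t)

    return estimated_gateway_load * balanced / num_gateways          # Eq. 29.10

# Example: 10 sensor nodes assigned to 3 gateways.
print(fitness([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], num_gateways=3))
```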

29.4 Results and Discussions

We simulated the experiments in MATLAB R2015a on a machine running the Microsoft Windows 7 operating system. The experiments were carried out for three scenarios: (i) 50 sensor nodes with 5 gateways, (ii) 100 sensor nodes with 11 gateways, and (iii) 150 sensor nodes with 19 gateways, each deployed in a 50 × 50 m² area.


Fig. 29.1 Comparison of GWO-GA with NLDLB, GA and GWO in terms of clustering fitness

The initial energy of each sensor node and gateway is set to 2 J and 10 J, respectively, and the communication range of a sensor node is 10 m. The values of the energy parameters $E_{elec}$, $\varepsilon_{fs}$, $\varepsilon_{mp}$, and $E_{DA}$ are taken as 50 nJ/bit, 10 pJ/bit/m², 0.001 pJ/bit/m⁴, and 5 nJ/bit, respectively.

We measured the clustering fitness value of the best solution of each compared algorithm and observed that GWO-GA outperforms GA [12], GWO [13], and NLDLB [5]. This is due to the crossover and mutation operations carried out on the best solutions from the GWO and GA algorithms. The comparison for this parameter is shown in Fig. 29.1.

We also evaluated energy efficiency by recording the number of rounds until the first node dies (FND) and until half of the nodes remain alive in the network; the greater the number of rounds, the longer the lifetime of the network. For all three scenarios, the first node in the GWO-GA network dies after a higher number of rounds than in the other compared algorithms. This is because, in the mutation phase, the allocation of a sensor node is changed to its nearest gateway, which saves the energy of that sensor node. The comparison for this parameter is shown in Fig. 29.2a.

Further, we measured the number of rounds until half of the sensor nodes are alive (HNA) in the network. This parameter reflects the stability of the network by capturing the gap between FND and HNA, and it is particularly relevant when the sensor nodes in the targeted area are closely deployed. From Fig. 29.2b we observe that GWO-GA performs better than the other compared algorithms because it generates a load-balanced network; therefore, the network is more stable in the period between FND and HNA.
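For reference, the simulation settings above can be collected into a single configuration structure; this is a hypothetical grouping for reproduction purposes, not code from the paper.

```python
# Simulation parameters reported in Sect. 29.4 (values as stated in the text).
SCENARIOS = [
    {"sensor_nodes": 50,  "gateways": 5},
    {"sensor_nodes": 100, "gateways": 11},
    {"sensor_nodes": 150, "gateways": 19},
]

PARAMS = {
    "area_m": (50, 50),              # deployment area, metres
    "node_initial_energy_J": 2,
    "gateway_initial_energy_J": 10,
    "comm_range_m": 10,
    "E_elec_nJ_per_bit": 50,
    "eps_fs_pJ_per_bit_m2": 10,
    "eps_mp_pJ_per_bit_m4": 0.001,
    "E_DA_nJ_per_bit": 5,
}
```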


Fig. 29.2 Comparison of the GWO-GA with NLDLB, GA, and GWO in terms of a first sensor node dies, b half of the sensor nodes are alive, over the number of rounds

29.5 Conclusion

In this paper, we have combined the GWO and GA approaches for load balancing in WSNs. The initial population phase of both algorithms is improved, and the crossover and mutation operations are upgraded to obtain healthy off-spring in terms of balanced load and efficient energy consumption of the network. To verify the efficiency of GWO-GA, we compared it against the NLDLB, GA, and GWO algorithms using the fitness value and network parameters such as the round in which the first node dies and the round in which half of the nodes are alive. We observed that GWO-GA outperforms the compared algorithms in terms of these network parameters, owing to the combination of the unique properties of GWO and GA.

References

1. Akyildiz, I.F., Weilian, S., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Comput. Netw. 38(4), 393–422 (2002)
2. Lipare, A., Edla, D.R.: Novel fitness function for SCE algorithm based energy efficiency in WSN. In: 9th IEEE International Conference on Computing, Communication and Networking Technologies, IISc, Bangalore, pp. 1–7 (2018)
3. Edla, D.R., Kongara, M.C., Cheruku, R.: SCE-PSO based clustering approach for load balancing of gateways in wireless sensor networks. Wirel. Netw. 1–15 (2018)
4. Edla, D.R., Kongara, M.C., Cheruku, R.: A PSO based routing with novel fitness function for improving lifetime of WSNs. Wirel. Pers. Commun. 1–17 (2018)
5. Zhang, J., Yang, T.: Clustering model based on node local density load balancing of wireless sensor networks. In: Fourth International Conference on Emerging Intelligent Data and Web Technologies, Xi'an, China, pp. 273–276 (2013)


6. Edla, D.R., Lipare, A., Cheruku, R., Kuppili, V.: An efficient load balancing of gateways using improved shuffled frog leaping algorithm and novel fitness function for WSNs. IEEE Sens. J. 17(20), 6724–6733 (2017)
7. Edla, D.R., Lipare, A., Cheruku, R.: Shuffled complex evolution approach for load balancing of gateways in wireless sensor networks. Wirel. Pers. Commun. 98(4), 3455–3476 (2018)
8. Deb, K., et al.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
9. Kennedy, J.: Particle swarm optimization. In: Encyclopedia of Machine Learning, pp. 760–766. Springer, Boston, MA (2011)
10. Bastos Filho, C.J.A., et al.: Fish school search. In: Nature-Inspired Algorithms for Optimisation, pp. 261–277. Springer, Berlin, Heidelberg (2009)
11. Mirjalili, S., Mirjalili, S.M., Lewis, A.: Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61 (2014)
12. Hussain, S., Matin, A.W., Islam, O.: Genetic algorithm for energy efficient clusters in wireless sensor networks. In: Fourth IEEE International Conference on Information Technology, Las Vegas, NV, USA (2007)
13. Al-Aboody, N.A., Al-Raweshidy, H.S.: Grey wolf optimization-based energy-efficient routing protocol for heterogeneous wireless sensor networks. In: 4th IEEE International Symposium on Computational and Business Intelligence (ISCBI), Olten, Switzerland (2016)
14. Heinzelman, W.B., Chandrakasan, A.P., Balakrishnan, H.: Application specific protocol architecture for wireless microsensor networks. IEEE Trans. Wirel. Commun. 1(4), 660–670 (2002)

Chapter 31

Proof of Authenticity-Based Electronic Medical Records Storage on Blockchain Mustafa Qazi, Devyani Kulkarni and Meghana Nagori

Abstract Electronic health records, riding the wave of digitalization, are currently booming in many hospitals. Despite these advancements, a plethora of challenges such as data interconnectivity, interoperability, and data sharing arise because hospitals, each with their own hospital management information system, form isolated clusters of data. These challenges can be addressed by effectively employing a blockchain platform. In this work, the authors propose a novel consensus algorithm, titled Proof of Authenticity, over a distributed platform for all medical stakeholders. Unlike previous approaches, in which researchers were the miners, this work illustrates a methodology for implementing blockchain in health care where hospitals and clinics assume the roles of both miners and validators. The peer-to-peer network is leveraged with a designed smart contract that follows the proof of authenticity mechanism. Medical stakeholders access the medical data under security protocols and with the patient's consent in a tamper-proof network. The proposed work aims for a more patient-centric and transparent healthcare system.

M. Qazi (B) · D. Kulkarni · M. Nagori
Government College of Engineering, Aurangabad 431001, Maharashtra, India
e-mail: [email protected]
D. Kulkarni e-mail: [email protected]
M. Nagori e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2020 Y.-D. Zhang et al. (eds.), Smart Trends in Computing and Communications, Smart Innovation, Systems and Technologies 165, https://doi.org/10.1007/978-981-15-0077-0_31

31.1 Introduction

According to the World Bank, India's total health expenditure (% of GDP) was reported to be 4.6851% in 2014 [1] (Fig. 31.1). India is a country with a population of around 1.33 billion spread over an area of 3.287 million square kilometres. Healthcare management in such a gigantic country is an immense challenge; thus, India has a great opportunity to imbibe technological advancements in the machine learning domain.


Fig. 31.1 Graph illustrating India’s total health expenditure (%GDP) per year

In 2015, a fiscal federalism reform gave the Indian states more control over their healthcare spending [2]. According to the World Bank, India's per capita income (nominal) was $1670 (Rs. 1,13,400) per year in 2016, which ranked India 112th out of 164 countries [3], and over 80% of the population is not covered by any health insurance [4]; this implies that big multispecialty hospitals are scarce in the country. Instead, small hospitals and clinics run by a single doctor with no electronic record system are very common in rural areas and medium-sized cities. These are the antecedents of the prevailing problems of patient record keeping in India.

Traditional decentralized databases, which could be used to maintain a common data platform, raise several architectural and ethical problems. Sharing healthcare data at an inter-institutional level is a complex commitment with the potential to significantly increase clinical and research effectiveness [5]. First and foremost, institutions often hesitate to share data because of privacy and ethical concerns [6], and, for obvious reasons, they may fear that sharing information will give others a competitive advantage [7]. Ensuring data integrity is a crucial professional and ethical obligation to protect patients and produce reliable results that can lead to new breakthroughs in medicines and treatments [8].

In this research, the authors propose a novel consensus mechanism for electronically storing medical data. The mechanism of the Proof of Authenticity consensus revolves around the consideration that the patient is the owner of his or her data. The objective is to promote a circular economy [9]; sharing and reusability of data should be an important facet of health care. This would allow health organizations to get hold of vast troves of previously inaccessible data, which can be used for education, research, software development, and any other credible health project [10]. This will inherently become a platform for systematic and complete keeping of medical records.


The blockchain is primarily a distributed ledger. It is a peer-to-peer technology that works on the principle of distributed consensus and maintains anonymity. A block is a digital page in the ledger that stores data publicly after approval from each node. By convincing users to give away large amounts of valuable data in exchange for free services, firms generate their revenue; blockchain has the ability to make these firms accountable for it. It is designed for the distribution of power, ending the monopoly over controlling the majority of traffic and thereby wealth, thus leading to a distribution of value. A blockchain is reliable in the sense that its transaction mechanism can continue even if parts of the network are shut down. Streams were introduced in the MultiChain platform; they enable a blockchain to be used as a general-purpose append-only database, with the blockchain providing time stamping, notarization, and immutability [11]. Smart contracts are systems that automatically move digital assets according to arbitrary prespecified rules. Smart contracts can be used to encode arbitrary state-transition functions, allowing the creation of systems that are confined within prespecified laws [12]. In this platform, smart contracts act as the gatekeeper of the blockchain, barring blocks that are unfit for the system.

31.2 Related Work

During this research, we examined two previous approaches: MedRec and MedChain. MedRec is a novel, decentralized record management system for handling Electronic Medical Records (EMRs) using blockchain technology [13]. It uses the Ethereum blockchain and its smart-contract feature to store medical data in the chain. The authors of MedRec introduced researchers as miners, and the incentive they receive for mining is the data itself. The data is stored off-chain in a database, and the query that fetches the patient data is stored in the chain. MedRec uses three smart contracts: the Registrar Contract (RC), the Summary Contract (SC), and the Patient–Provider Relationship Contract (PPRC). The RC is a global contract that maps participant identification strings to their Ethereum address IDs. A PPRC is issued between two nodes in the system, where one node stores and manages data for the other, and the SC functions as a fragmented path for participants in the system to locate their own medical record history [13].

MedChain addresses two primary challenges faced by EMR and electronic Protected Health Information (ePHI) systems: data security and interoperability. It attempts to solve them by decentralizing EMRs and putting the patients themselves in charge. One of the problems the authors mention is the question of who controls the data; since the patient is the link between the disparate records and providers, MedChain puts the patient at the center of the system rather than the healthcare provider. MedChain is a multiphase project: the first phase aims to establish what the authors call the "common data layer," and the subsequent phases shall release corresponding blockchain networks [14].


31.3 Our Methodology

The motivation behind the architecture is that the data belongs to the patient, thus aiming for a clean, decentralized, transparent, and circular-economy-based healthcare system. Unlike previous approaches, where the miners were separate systems mining blocks for some incentive, in this methodology we put forward hospitals as the miners. This raises a concern about the expense of computationally intensive and time-consuming tasks, which are hence placed in the hospitals' bucket of responsibilities. To prevent volatility in the system, we propose a new consensus algorithm, "Proof of Authenticity," which is discussed later.

The reason for excluding third-party miners is as follows: in the original paper of Satoshi Nakamoto, both ends of a transaction were users, precisely a sender and a receiver. However, this is not the case in health care. Here, the purpose of the blockchain is to store data from multiple independent sources, which creates a mutually distrustful environment. Mapping every user against a personal copy of the ledger is not very feasible, whereas each hospital can be given its own copy of the ledger. So, instead of making a third party the storage point, the hospitals themselves act as the storage point of the data. Streams are used to publish data to the blockchain, with each hospital getting its own stream. The actual data is stored in the format of key-value pairs, as discussed further below.
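As an illustration of the stream-based storage described above, the sketch below publishes one key-value item to a hospital's stream through MultiChain's JSON-RPC interface. The chain name, credentials, port, and stream name are placeholders, and the exact node configuration is an assumption rather than part of the paper.

```python
import json
import requests

# Placeholder connection details for a local MultiChain node (rpcuser/rpcpassword
# come from the node's multichain.conf; the RPC port depends on the chain parameters).
RPC_URL = "http://127.0.0.1:8570"
RPC_AUTH = ("multichainrpc", "rpc-password")

def rpc(method, *params):
    """Minimal JSON-RPC call to the MultiChain node."""
    payload = {"method": method, "params": list(params), "id": 1}
    resp = requests.post(RPC_URL, auth=RPC_AUTH, data=json.dumps(payload))
    resp.raise_for_status()
    return resp.json()["result"]

# Publish an encrypted record (hex-encoded) to this hospital's stream,
# keyed by the patient's unique identification number.
patient_key = "PUID-0001"
encrypted_record_hex = "aabbcc"          # stand-in for the real ciphertext
txid = rpc("publish", "hospital_A_stream", patient_key, encrypted_record_hex)

# Later, any permitted node can look the record up by patient key.
items = rpc("liststreamkeyitems", "hospital_A_stream", patient_key)
```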

31.3.1 Problem Setting

Let P = {P1, P2, …, Pn} be the set of patients and D = {D1, D2, …, Dn} be the data generated in hospitals using the Hospital Management Information System (HMIS). Each patient has a public and a private key. D is collected as JSON and protected with SHA-256 hashing and asymmetric cryptography, resulting in the set E = {E1, E2, E3, …, En}. E is encrypted with the patient's public key so that nobody other than the patient can read the data. D is transformed into hexadecimal form and converted into a proper data structure that includes:

- Publisher: the doctor, identified by the Doctor Unique Identification Number.
- Key: the patient's Patient Unique Identification Number (Fig. 31.2).

Fig. 31.2 Flow of patient data
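The following sketch shows one way the record construction described in Sect. 31.3.1 could look in Python: the patient's RSA public key encrypts the JSON record, the ciphertext is hex-encoded, and the doctor's and patient's identifiers are attached as publisher and key. The use of the `cryptography` package, the RSA-OAEP scheme, and the field names are illustrative assumptions, since the paper does not fix a particular library or key size.

```python
import json
from hashlib import sha256
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Hypothetical patient key pair; in the proposed system the patient already owns these.
patient_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
patient_public_key = patient_private_key.public_key()

# D: a (small) HMIS record collected as JSON.
record = {"diagnosis": "hypertension", "prescription": "amlodipine 5 mg"}
record_bytes = json.dumps(record).encode()

# E: record encrypted with the patient's public key (RSA-OAEP shown here; a real
# record of arbitrary size would need a hybrid scheme such as RSA + AES).
ciphertext = patient_public_key.encrypt(
    record_bytes,
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)

# Hexadecimal form plus integrity digest, structured as a stream item.
stream_item = {
    "publisher": "DUID-42",            # Doctor Unique Identification Number
    "key": "PUID-0001",                # Patient Unique Identification Number
    "data_hex": ciphertext.hex(),
    "sha256": sha256(record_bytes).hexdigest(),
}
print(stream_item["key"], stream_item["sha256"][:16])
```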


Algorithm 1: Proof_of_Authenticity (h, t, n)


Input: h - HMIS data, t - timestamp, n - nonce p