Dependability in Sensor, Cloud, and Big Data Systems and Applications: 5th International Conference, DependSys 2019, Guangzhou, China, November 12–15, 2019, Proceedings

This book constitutes the refereed proceedings of the 5th International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications, DependSys 2019, held in Guangzhou, China, in November 2019. The 39 revised full papers were carefully reviewed and selected from 112 submissions.




Guojun Wang Md Zakirul Alam Bhuiyan Sabrina De Capitani di Vimercati Yizhi Ren (Eds.)

Communications in Computer and Information Science

1123

Dependability in Sensor, Cloud, and Big Data Systems and Applications 5th International Conference, DependSys 2019 Guangzhou, China, November 12–15, 2019 Proceedings

Communications in Computer and Information Science

1123

Commenced Publication in 2007 Founding and Former Series Editors: Phoebe Chen, Alfredo Cuzzocrea, Xiaoyong Du, Orhun Kara, Ting Liu, Krishna M. Sivalingam, Dominik Ślęzak, Takashi Washio, Xiaokang Yang, and Junsong Yuan

Editorial Board Members
Simone Diniz Junqueira Barbosa, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil
Joaquim Filipe, Polytechnic Institute of Setúbal, Setúbal, Portugal
Ashish Ghosh, Indian Statistical Institute, Kolkata, India
Igor Kotenko, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
Lizhu Zhou, Tsinghua University, Beijing, China

More information about this series at http://www.springer.com/series/7899


Editors
Guojun Wang, Guangzhou University, Guangzhou, China
Md Zakirul Alam Bhuiyan, Fordham University, New York, USA
Sabrina De Capitani di Vimercati, Università degli Studi di Milano, Milan, Italy
Yizhi Ren, Hangzhou Dianzi University, Hangzhou, China

ISSN 1865-0929  ISSN 1865-0937 (electronic)
Communications in Computer and Information Science
ISBN 978-981-15-1303-9  ISBN 978-981-15-1304-6 (eBook)
https://doi.org/10.1007/978-981-15-1304-6

© Springer Nature Singapore Pte Ltd. 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

The 5th International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications (DependSys 2019) was held in Guangzhou, China, November 12–15, 2019, and was hosted by the School of Computer Science, Guangzhou University. This conference series brings together new ideas, techniques, and solutions for dependability and its issues in sensor, cloud, and big data systems and applications. The first and second DependSys were both held in Zhangjiajie, China (2015, 2016), the most beautiful place in the world, the third in Guangzhou, China (2017), and the fourth in Melbourne, Australia (2018). Following the style of the previous four successful conferences, DependSys 2019 provided a forum for individuals, academics, practitioners, and organizations who are developing or procuring sophisticated computer systems and who need to place great confidence in the dependability of their services.

This year, the conference received 112 submissions from all over the world. All submissions were reviewed by at least three reviewers in a high-quality review process. A total of 39 oral papers were presented at the conference and included in this Springer CCIS volume (an acceptance rate of 34.8%). The editors would like to thank the authors for their contributions and the reviewers for their thorough and constructive work, which contributed to the quality of the papers. In addition to the technical presentations, the program included a number of keynote speeches by world-renowned researchers. We are very grateful to the keynote speakers for their time and willingness to share their expertise with the conference attendees.

DependSys 2019 was made possible by the joint efforts of a large number of individuals and organizations worldwide. There is a long list of people who volunteered their time and energy to put together the conference and who deserve special thanks. First and foremost, we would like to greatly acknowledge the guiding work of the Steering Committee Chairs, Prof. Jie Wu from Temple University, USA, and Prof. Md Zakirul Alam Bhuiyan from Fordham University, USA. We are also deeply grateful to all the Program Committee members for their time and efforts in reading, commenting, debating, and finally selecting the papers. We would like to offer our gratitude to the General Chairs, Prof. Vincenzo Piuri, Prof. Witold Pedrycz, and Prof. Guojun Wang, for their tremendous support and advice in ensuring the success of the conference. Thanks also go to the Program Chairs: Md Zakirul Alam Bhuiyan, Sabrina De Capitani di Vimercati, and Yizhi Ren; the Workshop Chairs: Debiao He, Kevin I-Kai Wang, and Syed Hassan Ahmed; the Local Organizing Committee Chair: Jianer Chen; the Publicity Chairs: Kuan-Ching Li, Saqib Ali, Yulei Wu, and Michele Nappi; and the Journal Special Issue Chairs: Kim-Kwang Raymond Choo, Chunhua Su, and Arcangelo Castiglione.

It is worth noting that DependSys 2019 was jointly held with the 7th International Conference on Smart City and Informatization (iSCI 2019). We encouraged all participants to explore the co-located conferences while in Guangzhou.

Finally, we would like to thank all the organizers, contributing authors, and attendees of DependSys 2019 for a lively and scientifically stimulating meeting. Hopefully, you also enjoyed the beautiful city of Guangzhou, China!

November 2019

Guojun Wang
Md Zakirul Alam Bhuiyan
Sabrina De Capitani di Vimercati
Yizhi Ren
Kim-Kwang Raymond Choo

Organization

General Chairs
Vincenzo Piuri, University of Milan, Italy
Witold Pedrycz, University of Alberta, Canada
Guojun Wang, Guangzhou University, China

Program Chairs
Md Zakirul Alam Bhuiyan, Fordham University, USA
Sabrina De Capitani di Vimercati, Università degli Studi di Milano, Italy
Yizhi Ren, Hangzhou Dianzi University, China

Program Vice Chairs

Track 1: Dependability and Security Fundamentals and Technologies
Qin Liu, Hunan University, China
Karthigai Kumar, Karpagam College of Engineering, India

Track 2: Dependable and Secure Systems
Guangjie Han, Dalian University of Technology, China
Mamoun Alazab, Australian National University, Australia

Track 3: Dependable and Secure Applications
Zhiyuan Tan, Edinburgh Napier University, UK
Rasheed Hussain, Innopolis University, Russia
Thaier Hayajneh, Fordham University, USA

Track 4: Dependability and Security Measures and Assessments
Md. Arafatur Rahman, University Malaysia Pahang, Malaysia
Alireza Jolfaei, Federation University Australia, Australia
Aniello Castiglione, University of Naples Parthenope, Italy

Program Committee

Track 1: Dependability and Security Fundamentals and Technologies
Qin Liu, Hunan University, China
Karthigai Kumar, Karpagam College of Engineering, India
Shaikh Arifuzzaman, University of New Orleans, USA
Marco Guazzone, University of Piemonte Orientale, Italy
Subir Halder, B. C. Roy Engineering College, India
Mohammad Mehedi Hassan, King Saud University, Saudi Arabia
Mohammad Asadul Hoque, East Tennessee State University, USA
Alireza Jolfaei, Federation University Australia, Australia
Carlos Juiz, University of the Balearic Islands, Spain
Xiong Li, Hunan University of Science and Technology, China
Anfeng Liu, Central South University, China
Changqing Luo, Virginia Commonwealth University, USA
Sumesh Philip, Illinois State University, USA
Vladimir Podolskiy, Technical University of Munich, Germany
Shawon S. M. Rahman, The University of Hawaii-Hilo, USA
Md. Abdur Razzaque, University of Dhaka, Bangladesh
Rukhsana Ruby, Shenzhen University, China
Kouichi Sakurai, Kyushu University, Japan
Selina Sharmin, Jagannath University, Bangladesh
M. Kamruzzaman Sikder, Humber College, Canada
Houbing Song, Embry-Riddle Aeronautical University, USA
Yulei Wu, University of Exeter, UK
Guoqi Xie, Hunan University, China
Yifan Zhang, SUNY Binghamton, USA

Track 2: Dependable and Secure Systems
Guangjie Han, Dalian University of Technology, China
Mamoun Alazab, Australian National University, Australia
Muhammad Alam, Xi'an Jiaotong-Liverpool University, China
Yuanguo Bi, Northeastern University, USA
Sammy Chan, City University of Hong Kong, Hong Kong, China
Wooiping Cheah, Multimedia University, Malaysia
Lien-Wu Chen, Feng Chia University, Taiwan
Long Cheng, Northeastern University, USA
Long Cheng, State Key Laboratory of Management and Control for Complex Systems, China
Longjun Dong, Central South University, China
Xiao-Jiang (James) Du, Temple University, USA
Xiaopeng Fan, Shenzhen Institutes of Advanced Technology, China
Haiguang Fang, Capital Normal University, China
Weiwei Fang, Beijing Jiaotong University, China
Shibo He, Zhejiang University, China
Jiankun Hu, University of New South Wales at the Australian Defence Force Academy, Australia
Qiong Huang, South China Agricultural University, China
Muhammad Imran, King Saud University, Saudi Arabia
Wenbin Jiang, Chinese Academy of Sciences, China
Georgios Kambourakis, University of the Aegean, Greece
Arijit Karati, National Sun Yat-sen University, Taiwan
Marimuthu Karuppiah, VIT University Vellore, India
Gábor Kiss, Óbuda University, Hungary
Kenichi Kourai, Kyushu Institute of Technology, Japan
Aohan Li, Keio University, Japan
Chunxiao Li, Zhejiang University of Technology, China
Fan Li, Beijing Institute of Technology, China
Jie Li, Northeastern University, USA
Lu Li, University of Derby, UK
Guilan Luo, Dali University, China
Manuel Mazzara, Innopolis University, Russia
Wang Miao, University of Exeter, UK
Jianwei Niu, Beihang University, China
Jonghyun Park, Chonnam National University, South Korea
Al-Sakib Khan Pathan, International Islamic University Malaysia, Malaysia
Anand Paul, Kyungpook National University, South Korea
Gerardo Pelosi, Politecnico di Milano, Italy
Yuexing Peng, Beijing University of Posts and Telecommunications, China
Lianyong Qi, Qufu Normal University, China
Varatharajan Ramachandran, Bharath University, India
Yiqiang Sheng, National Network New Media Engineering Research Center, China
Mohammad Shojafar, University of Padua, Italy
Lei Shu, Guangdong University of Petrochemical Technology, China
Yunchuan Sun, Beijing Normal University, China
Apostolos Syropoulos, Greek Molecular Computing Group, Greece
Bing Tang, Hunan University of Science and Technology, China
Sana Ullah, Polytechnic Institute of Porto, Portugal
Jiafu Wan, South China University of Technology, China
Kun Wang, Nanjing University of Posts and Telecommunications, China
Lei Wang, Dalian University of Technology, China
Xiaonan Wang, Changshu Institute of Technology, China
Yu Wang, Guangzhou University, China
Zumin Wang, Dalian University, China
Xiaoling Wu, Guangzhou Institute of Advanced Technology, China
Xiaofei Xing, Guangzhou University, China
Jinbo Xiong, Fujian Normal University, China
Hui Xu, University of California, USA
Dequan Yang, Beijing Institute of Technology, China
Panglong Yang, PLA University of Science and Technology, China
Xindong You, Hangzhou Dianzi University, China
Deze Zeng, China University of Geosciences, China
Wenbo Zhang, Shenyang Ligong University, China
Yunzhou Zhang, Northeastern University, USA
Zhangbing Zhou, China University of Mining and Technology, China
Chuan Zhu, Hohai University, China
Zhifeng Zuo, University of Electro-Communications, Japan

Track 3: Dependable and Secure Applications
Farhan Ahmad, University of Derby, UK
Zhiyuan Tan, Edinburgh Napier University, UK
Rasheed Hussain, Innopolis University, Russia
Thaier Hayajneh, Fordham University, USA
Mohiuddin Ahmed, Edith Cowan University, Australia
Amjad Anvari-Moghaddam, Aalborg University, Denmark
William Buchanan, Edinburgh Napier University, UK
Arcangelo Castiglione, University of Salerno, Italy
Christian Esposito, University of Napoli Federico II, Italy
Franco Frattolillo, University of Sannio, Italy
Luis Javier García Villalba, Universidad Complutense de Madrid, Spain
Damien Hanyurwimfura, University of Rwanda, Rwanda
Peng Hao, Beihang University, China
Julio Hernandez-Castro, University of Kent, UK
Fatima Hussain, Royal Bank of Canada, Canada
Aruna Jamdagni, Western Sydney University, Australia
Mian Jan, Abdul Wali Khan University Mardan, Pakistan
Syed Muhammad Ahsan Kazmi, Innopolis University, Russia
Chaker Abdelaziz Kerrache, University of Ghardaia, Algeria
Hasan Ali Khattak, COMSATS, Pakistan
JooYoung Lee, Innopolis University, Russia
Shancang Li, University of West England, UK
Xiong Li, Hunan University of Science and Technology, China
Shujun Li, University of Kent, UK
Entao Luo, Hunan University of Science and Engineering, China
Weizhi Meng, Technical University of Denmark, Denmark
Naghmeh Moradpoor, Edinburgh Napier University, UK
Mahmuda Naznin, Bangladesh University of Engineering and Technology, Bangladesh
Alma Oracevic, Innopolis University, Russia
Paul Pang, Unitec Institute of Technology, New Zealand
Constantinos Patsakis, University of Piraeus, Greece
Deepak Puthal, University of Technology Sydney, Australia
Farzana Rahman, Florida International University, USA
Shalli Rani, Chitkara University, India
Zeinab Rezaiefar, Hanyang University, South Korea
Imed Romdhani, Edinburgh Napier University, UK
Kashif Saleem, King Saud University, Saudi Arabia
Wei Shi, Carleton University, Canada
Houbing Song, Embry-Riddle Aeronautical University, USA
Traian Marius Truta, Northern Kentucky University, USA
Alexandr Vasenev, Netherlands Organisation for Applied Scientific Research, The Netherlands
Kamal Z. Zamli, University Malaysia Pahang, Malaysia
Nicola Zannone, Eindhoven University of Technology, The Netherlands
Qingchen Zhang, St. Francis Xavier University, Canada
Shaobo Zhang, Hunan University of Science and Technology, China
Xuyun Zhang, University of Auckland, New Zealand

Track 4: Dependability and Security Measures and Assessments
Md. Arafatur Rahman, University Malaysia Pahang, Malaysia
Alireza Jolfaei, Federation University Australia, Australia
Venki Balasubramanian, Federation University Australia, Australia
Amin Beheshti, Macquarie University, Australia
Mohammad Asad Rehman Chaudhry, University of Toronto, Canada
Tooska Dargahi, University of Salford, UK
Salvatore Distefano, University of Messina, Italy
Angela Guercio, Kent State University at Stark, USA
Mohammad Sayad Haghighi, University of Tehran, Iran
Abbas Haider, National University of Sciences and Technology, Pakistan
Wenbin Jiang, Huazhong University of Science and Technology, China
Mohammad Mehedi Hassan, King Saud University, Saudi Arabia
Pouya Ostovari, San Jose State University, USA
Prantosh Kumar Paul, Raiganj University, Raiganj, India
Biplob Ray, Central Queensland University, Australia
Mubashir Husain Rehmani, COMSATS Institute of Information Technology, Pakistan
Genaina Rodrigues, University of Brasilia, Brazil
Amin Sakzad, Monash University, Australia
Sattar Seifollahi, Federation University Australia, Australia
Hossain Shahriar, Kennesaw State University, USA
Shahab Shmshir, Norwegian University of Science and Technology, Norway
Junggab Son, Kennesaw State University, USA
Houbing Song, Embry-Riddle Aeronautical University, USA
Sona Taheri, Federation University Australia, Australia
Muhamed Turkanovic, University of Maribor, Slovenia
Muhammad Usman, University of Surrey, UK
Xin-Wen Wu, Indiana University of Pennsylvania, USA
Chunsheng Zhu, University of British Columbia, Canada

Workshop Chairs
Debiao He, Wuhan University, China
Kevin I-Kai Wang, The University of Auckland, New Zealand
Syed Hassan Ahmed, Georgia Southern University, USA

Local Organizing Committee Chair
Jianer Chen, Guangzhou University, China

Publicity Chairs
Kuan-Ching Li, Providence University, Taiwan
Saqib Ali, Guangzhou University, China
Yulei Wu, The University of Exeter, UK
Michele Nappi, Università di Salerno, Italy

Publication Chairs
Shuhong Chen, Guangzhou University, China
Guihua Duan, Central South University, China

Journal Special Issue Chairs
Kim-Kwang Raymond Choo, The University of Texas at San Antonio, USA
Chunhua Su, The University of Aizu, Japan
Arcangelo Castiglione, University of Salerno, Italy

Registration Chairs
Xiaofei Xing, Guangzhou University, China
Pin Liu, Central South University, China

Conference Secretariat
Wenyin Yang, Foshan University, China

Steering Committee
Jie Wu (Chair), Temple University, USA
Md Zakirul Alam Bhuiyan (Chair), Fordham University, USA
Guojun Wang, Guangzhou University, China
Vincenzo Piuri, University of Milan, Italy
Jiannong Cao, Hong Kong Polytechnic University, Hong Kong, China
Laurence T. Yang, St. Francis Xavier University, Canada
Sy-Yen Kuo, National Taiwan University, Taiwan
Yi Pan, Georgia State University, USA
A. B. M Shawkat Ali, The University of Fiji, Fiji
Mohammed Atiquzzaman, University of Oklahoma, USA
Al-Sakib Khan Pathan, Southeast University, Bangladesh
Kenli Li, Hunan University, China
Shui Yu, University of Technology Sydney (UTS), Australia
Yang Xiang, Swinburne University of Technology, Australia
Kim-Kwang Raymond Choo, The University of Texas at San Antonio, USA
Kamruzzaman Joarder, Federation University and Monash University, Australia

Contents

Dependability and Security Fundamentals and Technologies

Secrecy Outage Probability of Secondary System for Wireless-Powered Cognitive Radio Networks (Kun Tang, Shaowei Liao, Md. Zakirul Alam Bhuiyan, and Wei Shi) 3
CodeeGAN: Code Generation via Adversarial Training (Youqiang Deng, Cai Fu, and Yang Li) 18
Information Consumption Patterns from Big Data (Jesús Silva, Ligia Romero, Claudia Fernández, Darwin Solano, Nataly Orellano Llinás, Carlos Vargas Mercado, Jazmín Flórez Guzmán, and Ernesto Steffens Sanabria) 31
DHS-Voting: A Distributed Homomorphic Signcryption E-Voting (Xingyue Fan, Ting Wu, Qiuhua Zheng, Yuanfang Chen, and Xiaodong Xiao) 40
Towards In-Network Generalized Trustworthy Data Collection for Trustworthy Cyber-Physical Systems (Hafiz ur Rahman, Guojun Wang, Md Zakirul Alam Bhuiyan, and Jianer Chen) 54
QoS Based Clustering for Vehicular Networks in Smart Cities (Soumia Bellaouar, Mohamed Guerroumi, and Samira Moussaoui) 67
Searchable Attribute-Based Encryption Protocol with Hidden Keywords in Cloud (Fang Qi, Xing Chang, Zhe Tang, and Wenbo Wang) 80

Dependable and Secure Systems

A Comparative Study of Two Different Spam Detection Methods (Haoyu Wang, Bingze Dai, and Dequan Yang) 95
Towards Privacy-preserving Recommender System with Blockchains (Abdullah Al Omar, Rabeya Bosri, Mohammad Shahriar Rahman, Nasima Begum, and Md Zakirul Alam Bhuiyan) 106
Integrating Deep Learning and Bayesian Reasoning (Sin Yin Tan, Wooi Ping Cheah, and Shing Chiang Tan) 119
Assessing the Dependability of Apache Spark System: Streaming Analytics on Large-Scale Ocean Data (Janak Dahal, Elias Ioup, Shaikh Arifuzzaman, and Mahdi Abdelguerfi) 131
On the Assessment of Security and Performance Bugs in Chromium Open-Source Project (Joseph Imseis, Costain Nachuma, Shaikh Arifuzzaman, and Zakirul Alam Bhuiyan) 145
Medical Image Segmentation by Combining Adaptive Artificial Bee Colony and Wavelet Packet Decomposition (Muhammad Arif, Guojun Wang, Oana Geman, and Jianer Chen) 158
Recommender System for Decentralized Cloud Manufacturing (Karim Alinani, Deshun Liu, Dong Zhou, and Guojun Wang) 170
Gait Analysis for Gender Classification in Forensics (Paola Barra, Carmen Bisogni, Michele Nappi, David Freire-Obregón, and Modesto Castrillón-Santana) 180
Hybrid Cloud Computing Architecture Based on Open Source Technology (Amelec Viloria, Hugo Hernández Palma, Wilmer Cadavid Basto, Alexandra Perdomo Villalobos, Carlos Andrés Uribe de la Cruz, Juan de la Hoz Hernández, and Omar Bonerge Pineda Lezama) 191

Dependable and Secure Applications

Transportation and Charging Schedule for Autonomous Electric Vehicle Riding-Sharing System Considering Battery Degradation (Mingchu Li, Tingting Tang, Yuanfang Chen, and Zakirul Alam Bhuiyan) 203
Blockchain-Powered Service Migration for Uncertainty-Aware Workflows in Edge Computing (Xiaolong Xu, Qingfan Geng, Hao Cao, Ruichao Mo, Shaohua Wan, Lianyong Qi, and Hao Wang) 217
Towards the Design of a Covert Channel by Using Web Tracking Technologies (Aniello Castiglione, Michele Nappi, and Chiara Pero) 231
Dependable Person Recognition by Means of Local Descriptors of Dynamic Facial Features (Aniello Castiglione, Giampiero Grazioli, Simone Iengo, Michele Nappi, and Stefano Ricciardi) 247
From Data Disclosure to Privacy Nudges: A Privacy-Aware and User-Centric Personal Data Management Framework (Yang Lu, Shujun Li, Athina Ioannou, and Iis Tussyadiah) 262
A Socio-Technical and Co-evolutionary Framework for Reducing Human-Related Risks in Cyber Security and Cybercrime Ecosystems (Tasmina Islam, Ingolf Becker, Rebecca Posner, Paul Ekblom, Michael McGuire, Hervé Borrion, and Shujun Li) 277
Mobile APP User Attribute Prediction by Heterogeneous Information Network Modeling (Hekai Zhang, Jibing Gong, Zhiyong Teng, Dan Wang, Hongfei Wang, Linfeng Du, and Zakirul Alam Bhuiyan) 294
Application of Internet of Things and GIS in Power Grid Emergency Command System (Daning Huang, Huihe Chen, Shiyong Huang, Yuchang Lin, and Ying Ma) 304
Visualized Panoramic Display Platform for Transmission Cable Based on Space-Time Big Data (Renxin Yu, Qinghuang Yao, Tianrong Zhong, Wei Li, and Ying Ma) 314

Dependability and Security Measures and Assessments

Optimal Personalized DDoS Attacks Detection Strategy in Network Systems (Mingchu Li, Xian Yang, Yuanfang Chen, and Zakirul Alam Bhuiyan) 327
AI and Its Risks in Android Smartphones: A Case of Google Smart Assistant (Haroon Elahi, Guojun Wang, Tao Peng, and Jianer Chen) 341
A Light-Weight Framework for Pre-submission Vetting of Android Applications in App Stores (Boya Li, Guojun Wang, Haroon Elahi, and Guihua Duan) 356
Nowhere Metamorphic Malware Can Hide - A Biological Evolution Inspired Detection Scheme (Kehinde O. Babaagba, Zhiyuan Tan, and Emma Hart) 369
Demand Forecasting Method Using Artificial Neural Networks (Amelec Viloria, Luisa Fernanda Arrieta Matos, Mercedes Gaitán, Hugo Hernández Palma, Yasmin Flórez Guzmán, Luis Cabas Vásquez, Carlos Vargas Mercado, and Omar Bonerge Pineda Lezama) 383
Analyzing and Predicting Power Consumption Profiles Using Big Data (Amelec Viloria, Ronald Prieto Pulido, Jesús García Guiliany, Jairo Martínez Ventura, Hugo Hernández Palma, José Jinete Torres, Osman Redondo Bilbao, and Omar Bonerge Pineda Lezama) 392

Explainable Artificial Intelligence for Cyberspace

A New Intrusion Detection Model Based on GRU and Salient Feature Approach (Jian Hou, Fangai Liu, and Xuqiang Zhuang) 405
Research on Electronic Evidence Management System Based on Knowledge Graph (Honghao Wu) 416
Research on Security Supervision on Wireless Network Space in Key Sites (Honghao Wu) 426
Review of the Electric Vehicle Charging Station Location Problem (Yu Zhang, Xiangtao Liu, Tianle Zhang, and Zhaoquan Gu) 435
Structural Vulnerability of Power Grid Under Malicious Node-Based Attacks (Minzhen Zheng, Shudong Li, Danna Lu, Wei Wang, Xiaobo Wu, and Dawei Zhao) 446
Electric Power Grid Invulnerability Under Intentional Edge-Based Attacks (Yixia Li, Shudong Li, Yanshan Chen, Peiyan He, Xiaobo Wu, and Weihong Han) 454
Design and Evaluation of a Quorum-Based Hierarchical Dissemination Algorithm for Critical Event Data in Massive IoTs (Ihn-Han Bae) 462
Comparative Analysis on Raster, Spiral, Hilbert, and Peano Mapping Pattern of Fragile Watermarking to Address the Authentication Issue in Healthcare System (Syifak Izhar Hisham, Mohamad Nazmi Nasir, and Nasrul Hadi Johari) 477

Author Index 487

Dependability and Security Fundamentals and Technologies

Secrecy Outage Probability of Secondary System for Wireless-Powered Cognitive Radio Networks

Kun Tang1(B), Shaowei Liao1, Md. Zakirul Alam Bhuiyan2, and Wei Shi3

1 School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, China
{tangkun,liaoshaowei}@scut.edu.cn
2 Department of Computer and Information Sciences, Fordham University, New York 10023, USA
[email protected]
3 School of Information Technology, Carleton University, Ottawa K1S 5B6, Canada
[email protected]

Abstract. In this paper, we consider a secrecy wireless-powered cognitive radio network, in which an energy-harvesting secondary system can share the spectrum of the primary system by assisting its transmission. In particular, we focus on secure information transmission for the secondary system when an eavesdropper is present to intercept the secondary user's confidential information. Closed-form analytical expressions for the primary outage probability, the secondary secrecy outage probability (SOP), and the probability of non-zero secrecy capacity (PNSC) are derived. We also jointly design the optimal time-switching ratio and power-splitting coefficient to optimize the secondary secrecy outage performance under the primary requirement constraint. To solve this nonconvex problem, we prove the biconcavity of the optimization problem and then develop a corresponding algorithm. Numerical results show that the proposed transmission scheme provides more secure information transmission for the secondary system while guaranteeing the outage performance of the primary system.

Keywords: Cognitive radio network · Energy harvesting · Secrecy outage probability · Probability of non-zero secrecy capacity · Biconcave

1 Introduction

With the rapid growth of wireless devices and services, next-generation mobile communication technologies are expected to provide high-capacity, low-energy-consumption transmission services [1]. Meanwhile, conventional spectrum management strategies leave spectrum resources under-utilized most of the time [2]. Therefore, how to effectively support these new wireless services and applications under the constraint of limited radio spectrum is becoming an extreme challenge. Cognitive radio (CR) technologies have been recognized as a promising approach to the shortage of spectrum resources: secondary users (SUs) can be allowed to opportunistically access the licensed spectrum for data transmission, which is also known as dynamic spectrum access (DSA) [3].

Energy scarcity is another critical factor affecting the development of wireless communications, especially for sensor and cellular networks, which are generally powered by batteries that are difficult to replace. To alleviate this, wireless-powered (WP) technology has attracted considerable attention, since devices can scavenge energy from the surrounding environment, such as solar, wind, or RF signals, and convert it into electric energy for future data transmission [4]. Especially with the concurrent developments in antenna and circuit design, wireless energy harvesting based on RF signals is attractive due to its wireless, low-cost, and small-form-factor implementation [5,6]. Therefore, combining cognitive radio networks with energy harvesting can effectively improve both spectral efficiency and energy efficiency.

Nevertheless, security issues exist in WP-based CR networks (WP-CRNs) due to the open nature of the wireless medium, where potential eavesdroppers may overhear legitimate users' confidential information. To address secure transmission in WP-CRNs, physical-layer security has been discussed in [7,8]. In [7], secure information transmission for the primary system is investigated when the SUs are the potential eavesdroppers, and a joint algorithm to derive the optimized power-splitting coefficient and secure beamforming vector is proposed. The authors of [8] investigated the probability of strictly positive secrecy capacity for an underlay CRN with a full-duplex WP secondary system.

In this paper, we study secure information transmission for the secondary system of a WP-CRN, where an eavesdropper attempts to intercept the secondary user's confidential information. We derive closed-form expressions for the primary outage probability and for the SOP and PNSC of the secondary system. We further aim to optimize the secrecy outage performance of the secondary system while guaranteeing the outage probability requirement of the primary system. To handle this nonconvex problem, the optimization problem is proved to be biconcave, and an effective algorithm is then proposed.

The remainder of this paper is organized as follows. In Sect. 2, the system model is introduced and the spectrum sharing scheme is proposed. Section 3 analyzes the outage probability of the primary system and the SOP and PNSC of the secondary system. An algorithm to derive the optimal time-switching ratio and power-splitting coefficient is given in Sect. 4. Simulation and numerical results are presented in Sect. 5. Finally, Sect. 6 concludes this paper.

2 System Model and Transmission Protocol

2.1 System Model

We consider an overlay cognitive radio network with a wireless-powered relay (CRN-WPR), as shown in Fig. 1. In the primary system, a primary transmitter (PT) intends to communicate with a primary receiver (PR) with the assistance of a secondary transmitter (ST) acting as a relay, since large propagation loss and shadowing exist between the PT and PR. In the secondary system, the ST delivers its confidential information to the desired secondary receiver (SR), while a secondary eavesdropper (SE) located within the transmission range of the ST aims to intercept the ST's confidential information. We assume that the PT has a fixed power supply, whereas the ST may have limited battery reserves and needs to harvest energy from the received primary radio frequency (RF) signals. All nodes operate in half-duplex mode and are equipped with a single antenna.

Fig. 1. System model of a CRN-WPR. The blue line denotes the first wireless information and power transfer phase, the black line represents the second information transmission phase, and the red lines denotes the third transmission phase. (Color figure online)

All channels undergo flat block Rayleigh fading, which remains quasi-static within one time slot and changes independently across time slots. Let $h_{PS}$, $h_{SP}$, $h_{SR}$, and $h_{SE}$ denote the channel coefficients between PT and ST, ST and PR, ST and SR, and ST and SE, respectively. Each coefficient $h_A$ with $A \in \{PS, SP, SR, SE\}$ is a zero-mean complex Gaussian random variable with variance $\lambda_A = d_A^{-\theta}$, so the channel power gain $|h_A|^2$ is exponentially distributed with mean $\lambda_A$, where $d_A$ denotes the transmission distance and $\theta$ the path-loss exponent. We also assume that global channel state information (CSI) is available at the ST [7].

2.2 Energy Harvesting and Information Transmission

As shown in Fig. 1, the transfer protocol for a transmission block consists of three phases. In the first phase, PT takes a fraction of time $\alpha T$ ($0 < \alpha < 1$) to transmit its signal to the relay node ST; the received signal can be expressed as
$$y_{ST} = \sqrt{P_P}\, h_{PS}\, x_P + n_{ST}, \qquad (1)$$


where $P_P$ denotes the transmission power of the primary signal, $x_P$ is the unit-power signal intended for PR, and $n_{ST} \sim \mathcal{CN}(0, \delta_{ST})$ represents the additive white Gaussian noise (AWGN) introduced by the antenna of the relay node ST. Based on the power-splitting method, the received signal at the ST is divided into two streams, one for energy harvesting and the other for relaying information. The portion of the received signal used for energy harvesting is given by
$$\sqrt{\beta}\, y_{ST} = \sqrt{\beta P_P}\, h_{PS}\, x_P + \sqrt{\beta}\, n_{ST}, \qquad (2)$$
where $0 < \beta < 1$ denotes the fraction of the received signal power split off for energy harvesting. The amount of harvested energy is then calculated as
$$E(\alpha T) = \alpha T \beta \eta P_P |h_{PS}|^2, \qquad (3)$$

where $0 < \eta < 1$ is the energy conversion efficiency. Note that the energy harvested from thermal noise is negligible compared with that of the primary signal. Without loss of generality, we assume $T = 1$ in the following.

During the second phase of duration $(1-\alpha)/2$, ST forwards the residual primary signal $\sqrt{1-\beta}\, y_{ST}$ to PR based on the amplify-and-forward (AF) strategy. The signal broadcast by the ST is $\sqrt{P_{ST}}\,\tilde{x}_P$ with $\tilde{x}_P = \sqrt{\omega}\left(\sqrt{1-\beta}\, y_{ST} + n_C\right)$, where $n_C \sim \mathcal{CN}(0, \delta_C)$ denotes the AWGN introduced by the signal conversion from passband to baseband at the ST, and $\omega$ is the power normalization factor, given by
$$\omega = \frac{1}{(1-\beta)\left(P_P |h_{PS}|^2 + \delta_{ST}\right) + \delta_C}. \qquad (4)$$
In practice, the antenna noise power $\delta_{ST}$ is much smaller than the conversion noise $\delta_C$ and far below the average power of the received signal; moreover, $\delta_C$ can be ignored at high signal-to-noise ratio (SNR). We thus set $\delta_{ST} = \delta_C = 0$ in the normalization factor to simplify the analysis, and the approximation $\tilde{\omega}$ used in the rest of this paper is
$$\tilde{\omega} = \frac{1}{(1-\beta)\, P_P |h_{PS}|^2}. \qquad (5)$$
Therefore, the corresponding received signal at the PR can be expressed as
$$y_{PR} = \sqrt{P_{ST}}\, \tilde{x}_P\, h_{SP} + n_{PR} = \sqrt{P_{ST}} \left( \frac{x_P\, h_{PS}}{\sqrt{|h_{PS}|^2}} + \frac{\sqrt{1-\beta}\, n_{ST} + n_C}{\sqrt{(1-\beta)\, P_P |h_{PS}|^2}} \right) h_{SP} + n_{PR}, \qquad (6)$$
where $n_{PR} \sim \mathcal{CN}(0, \delta_{PR})$ denotes the AWGN at the PR and $P_{ST} = \frac{\alpha \beta \eta}{1-\alpha} P_P |h_{PS}|^2$; the factor $\frac{\alpha}{1-\alpha}$ follows from the fact that half of the harvested energy is used for the relaying transmission and the duration of each transmission phase is normalized to $\frac{1-\alpha}{2}$. Thus, the signal-to-interference-plus-noise ratio (SINR) at the PR for decoding $x_P$ is given by

$$r_{PR} = \frac{\frac{\alpha}{1-\alpha} \beta \eta P_P |h_{PS}|^2 |h_{SP}|^2}{\left( \frac{\alpha}{1-\alpha} \beta \eta \delta_{ST} + \frac{\alpha}{1-\alpha} \frac{\beta}{1-\beta} \eta \delta_C \right) |h_{SP}|^2 + \delta_{PR}}. \qquad (7)$$

The achievable rate at the PR is then expressed as $C_{PR} = \frac{1-\alpha}{2} \log_2(1 + r_{PR})$.

In the third phase, ST transmits its unit-power signal $x_S$ to SR, while SE can also eavesdrop on $x_S$ owing to the broadcast nature of the wireless medium. The received signals at the SR and SE are respectively given by
$$y_{SR} = \sqrt{P_{ST}}\, x_S\, h_{SR} + n_{SR}, \qquad y_{SE} = \sqrt{P_{ST}}\, x_S\, h_{SE} + n_{SE}, \qquad (8)$$
where $n_{SR} \sim \mathcal{CN}(0, \delta_{SR})$ and $n_{SE} \sim \mathcal{CN}(0, \delta_{SE})$ represent the AWGN at the SR and SE, respectively. The corresponding SNRs at the SR and SE can be expressed as
$$r_{SR} = \frac{\alpha \beta \eta P_P |h_{PS}|^2 |h_{SR}|^2}{(1-\alpha)\, \delta_{SR}}, \qquad r_{SE} = \frac{\alpha \beta \eta P_P |h_{PS}|^2 |h_{SE}|^2}{(1-\alpha)\, \delta_{SE}}. \qquad (9)$$
For simplicity, we assume the received noise powers at the SR and SE are the same, i.e., $\delta_{SR} = \delta_{SE} = \delta_0$, in the following. Accordingly, the achievable data rates at the SR and SE are respectively
$$C_{SR} = \frac{1-\alpha}{2} \log_2(1 + r_{SR}), \qquad C_{SE} = \frac{1-\alpha}{2} \log_2(1 + r_{SE}). \qquad (10)$$
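To make the signal model concrete, the following sketch evaluates the harvested energy (3), the SINR (7), and the SNRs (9) for one random channel realization. All numerical values are illustrative assumptions, not taken from the paper.

```python
# A minimal numerical sketch of the Section 2 signal model (Eqs. (3), (7), (9)).
import numpy as np

rng = np.random.default_rng(0)

P_P, eta, alpha, beta = 1.0, 0.6, 0.4, 0.5            # assumed power, efficiency, alpha, beta
d_PS, d_SP, d_SR, d_SE, theta = 5.0, 5.0, 3.0, 4.0, 3.0
delta = dict(ST=1e-3, C=1e-3, PR=1e-3, SR=1e-3, SE=1e-3)

def channel_gain(d):
    """Exponential power gain with mean d^{-theta} (Rayleigh fading)."""
    return rng.exponential(d ** (-theta))

g_PS, g_SP, g_SR, g_SE = (channel_gain(d) for d in (d_PS, d_SP, d_SR, d_SE))

# Harvested energy (Eq. (3), T = 1) and relay transmit power P_ST
E = alpha * beta * eta * P_P * g_PS
P_ST = alpha * beta * eta * P_P * g_PS / (1 - alpha)

# SINR at PR (Eq. (7))
c = alpha / (1 - alpha)
r_PR = (c * beta * eta * P_P * g_PS * g_SP) / (
    (c * beta * eta * delta['ST'] + c * beta / (1 - beta) * eta * delta['C']) * g_SP + delta['PR'])

# SNRs at SR and SE (Eq. (9))
r_SR = alpha * beta * eta * P_P * g_PS * g_SR / ((1 - alpha) * delta['SR'])
r_SE = alpha * beta * eta * P_P * g_PS * g_SE / ((1 - alpha) * delta['SE'])

C_PR = (1 - alpha) / 2 * np.log2(1 + r_PR)
C_S = max((1 - alpha) / 2 * (np.log2(1 + r_SR) - np.log2(1 + r_SE)), 0.0)
print(f"C_PR = {C_PR:.3f} bit/s/Hz, secrecy rate C_S = {C_S:.3f} bit/s/Hz")
```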

3 Analysis of System Outage Performance

3.1 Outage Probability of the Primary System

An outage event occurs if the achievable data rate of the PR falls below a given target threshold $\gamma_P$. Thus, the primary outage probability can be expressed as
$$P_{out}^P = \Pr\{C_{PR} < \gamma_P\}. \qquad (11)$$
Based on (7), we have the following proposition.

Proposition 1: Define $l = \frac{\alpha \beta \eta P_P}{1-\alpha}$, $m = \frac{\alpha \beta \eta}{1-\alpha}\left(\delta_{ST} + \frac{\delta_C}{1-\beta}\right)$, and $\tilde{\gamma}_P = 2^{2\gamma_P/(1-\alpha)} - 1$. Let $X = |h_{PS}|^2$ and $Y = |h_{SP}|^2$. The primary outage probability in the considered cognitive radio network is given by
$$P_{out}^P = 1 + Q_1^2 - Q_1 Q_2 - Q_1, \qquad (12)$$
where
$$Q_1 = \exp\left(-\frac{\tilde{\gamma}_P\, m}{l\, \lambda_{PS}}\right), \qquad (13)$$
$$Q_2 = \frac{1}{\lambda_{PS}} \exp\left(-\frac{\tilde{\gamma}_P\, m}{l\, \lambda_{PS}}\right) \sqrt{\frac{4 \tilde{\gamma}_P \delta_{PR} \lambda_{PS}}{l\, \lambda_{SP}}}\; K_1\!\left(\sqrt{\frac{4 \tilde{\gamma}_P \delta_{PR}}{l\, \lambda_{PS} \lambda_{SP}}}\right), \qquad (14)$$
with $K_1(\cdot)$ denoting the first-order modified Bessel function of the second kind, defined in [9].

Proof: The proof is provided in Appendix A.
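As a sanity aid, the following sketch, under assumed parameter values, evaluates the closed-form expression (12)-(14) with scipy's $K_1$ and prints a brute-force Monte Carlo estimate of $\Pr\{C_{PR} < \gamma_P\}$ alongside it.

```python
# Evaluating Proposition 1 (Eqs. (12)-(14)) next to a Monte Carlo estimate; values are assumptions.
import numpy as np
from scipy.special import k1   # modified Bessel function K_1, as in [9]

rng = np.random.default_rng(1)
P_P, eta, alpha, beta = 1.0, 0.6, 0.4, 0.5
lam_PS, lam_SP = 0.2, 0.2                     # means of X = |h_PS|^2 and Y = |h_SP|^2
d_ST, d_C, d_PR = 1e-2, 1e-2, 1e-2            # noise powers
gamma_P = 0.1                                  # primary target rate

l = alpha * beta * eta * P_P / (1 - alpha)
m = alpha * beta * eta / (1 - alpha) * (d_ST + d_C / (1 - beta))
g = 2 ** (2 * gamma_P / (1 - alpha)) - 1      # gamma_P tilde

# Closed form (12)-(14)
Q1 = np.exp(-g * m / (l * lam_PS))
Q2 = (1 / lam_PS) * Q1 * np.sqrt(4 * g * d_PR * lam_PS / (l * lam_SP)) \
     * k1(np.sqrt(4 * g * d_PR / (l * lam_PS * lam_SP)))
P_closed = 1 + Q1 ** 2 - Q1 * Q2 - Q1

# Monte Carlo estimate of Pr{C_PR < gamma_P} using the SINR in Eq. (7)
N = 1_000_000
X = rng.exponential(lam_PS, N)
Y = rng.exponential(lam_SP, N)
r_PR = l * X * Y / (m * Y + d_PR)
P_mc = np.mean((1 - alpha) / 2 * np.log2(1 + r_PR) < gamma_P)
print(f"closed form: {P_closed:.4f}, Monte Carlo: {P_mc:.4f}")
```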

3.2 Secrecy Outage Performance of the Secondary System

The SOP is defined as the probability that the achievable secrecy rate is smaller than a given threshold $C_{th}$. Based on (9) and (10), the achievable secrecy rate of the secondary system is given by
$$C_S = \left[ \frac{1-\alpha}{2} \log_2(1 + r_{SR}) - \frac{1-\alpha}{2} \log_2(1 + r_{SE}) \right]^+, \qquad (15)$$
where $[x]^+$ denotes the maximum of $x$ and 0. Therefore, the SOP of the secondary system is expressed as
$$P_{SOP}^S = \Pr\{C_S < C_{th}\} = \Pr\left\{ \frac{1-\alpha}{2} \log_2\!\left(r_{SOP}^S\right) < C_{th} \right\}, \qquad (16)$$
where $r_{SOP}^S = (1 + r_{SR})/(1 + r_{SE})$.

Proposition 2: Define $\tilde{C}_{th} = 2^{2C_{th}/(1-\alpha)}$. The SOP of the secondary system is then given by
$$P_{SOP}^S = 1 + T_1^2 - T_1 T_2 - T_1, \qquad (17)$$
where
$$T_1 = \frac{\lambda_{SR}}{\lambda_{SR} + \lambda_{SE}\, \tilde{C}_{th}}, \qquad (18)$$
$$T_2 = \frac{1}{\lambda_{SR} + \lambda_{SE}\, \tilde{C}_{th}} \sqrt{\frac{4 \delta_0 (\tilde{C}_{th} - 1)\, \lambda_{SR}}{l\, \lambda_{PS}}}\; K_1\!\left(\sqrt{\frac{4 \delta_0 (\tilde{C}_{th} - 1)}{l\, \lambda_{PS} \lambda_{SR}}}\right). \qquad (19)$$
The proof is omitted, since the derivation is similar to that of Proposition 1.

The PNSC is defined as the probability that a positive secrecy capacity exists between SR and SE. Based on (16), the PNSC can be written as
$$P_{PNSC}^S = 1 - P_{SOP}^S(C_{th} \le 0). \qquad (20)$$
In this case, we obtain $0 < \tilde{C}_{th} \le 1$ when $C_{th} \le 0$, which corresponds to the event that the received SNR at the SR is no larger than that at the SE, i.e., $r_{SR} \le r_{SE}$. Therefore, the PNSC depends only on the channel power gain ratio $|h_{SR}|^2 / |h_{SE}|^2$. After some algebraic manipulation, we have
$$P_{PNSC}^S = \frac{\lambda_{SR}}{\lambda_{SR} + \lambda_{SE}}. \qquad (21)$$
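Because (21) reduces to the probability that one exponential variate exceeds another, it can be checked in a few lines (the values below are assumptions):

```python
# Quick check of Eq. (21): PNSC = lambda_SR / (lambda_SR + lambda_SE).
import numpy as np

rng = np.random.default_rng(2)
lam_SR, lam_SE = 0.5, 0.2
g_SR = rng.exponential(lam_SR, 1_000_000)
g_SE = rng.exponential(lam_SE, 1_000_000)
# r_SR > r_SE iff |h_SR|^2 > |h_SE|^2, since both SNRs in Eq. (9) share the same prefactor
print("simulated:  ", np.mean(g_SR > g_SE))           # ~0.714
print("closed form:", lam_SR / (lam_SR + lam_SE))     # 0.7142...
```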

4 Optimal Time-Switching Ratio and Power-Splitting Coefficient

4.1 Optimization Problem Analysis

Following the above analysis of the system outage performance, improving the secondary data rate directly reduces the SOP and enhances the PNSC of the secondary system. Moreover, optimizing the data rate also increases energy efficiency, meaning that a given amount of harvested energy can be used to transmit more information. Therefore, in this section, we focus on the joint design of the optimal time-switching ratio and power-splitting coefficient that maximizes the secondary secrecy rate under the primary user's rate constraint and the range constraints on the time-switching ratio and power-splitting coefficient. Mathematically, the optimal scheme can be represented as the following optimization problem (P1):

$$\max_{\alpha,\beta}\; C_S = \left[ \frac{1-\alpha}{2} \log_2(1 + r_{SR}) - \frac{1-\alpha}{2} \log_2(1 + r_{SE}) \right]^+ \qquad (22)$$
$$\text{s.t.}\quad C1: C_{PR} \ge r_P; \qquad C2: 0 < \alpha < 1; \qquad C3: 0 < \beta < 1.$$
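Since Appendix B establishes that the objective and constraint of (P1) are biconcave in $(\alpha, \beta)$, the standard alternate convex search of [10] applies: optimize $\alpha$ with $\beta$ fixed, then $\beta$ with $\alpha$ fixed, until a partial optimum is reached. The sketch below illustrates this pattern, assuming callables for $C_S(\alpha, \beta)$ and $C_{PR}(\alpha, \beta)$ evaluated on given channel gains; it is a generic outline in the spirit of [10], not necessarily the paper's exact algorithm.

```python
# Illustrative alternate convex search (ACS) for the biconcave problem (P1).
# `secrecy_rate` and `primary_rate` are placeholders for C_S(alpha, beta) and C_PR(alpha, beta).
import numpy as np

def acs(secrecy_rate, primary_rate, r_P, tol=1e-4, max_iter=50):
    grid = np.linspace(0.01, 0.99, 99)          # search range for alpha and beta
    alpha, beta, best = 0.5, 0.5, -np.inf
    for _ in range(max_iter):
        # Step 1: best feasible alpha for fixed beta (1-D concave search, done here by grid)
        feas = [a for a in grid if primary_rate(a, beta) >= r_P]
        if feas:
            alpha = max(feas, key=lambda a: secrecy_rate(a, beta))
        # Step 2: best feasible beta for fixed alpha
        feas = [b for b in grid if primary_rate(alpha, b) >= r_P]
        if feas:
            beta = max(feas, key=lambda b: secrecy_rate(alpha, b))
        val = secrecy_rate(alpha, beta)
        if val - best < tol:                     # partial optimum reached
            break
        best = val
    return alpha, beta, best
```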

A Proof of Proposition 1

Writing $w = \tilde{\gamma}_P$ for brevity, the outage event $C_{PR} < \gamma_P$ is equivalent to $r_{PR} < w$, i.e., $Y(lX - wm) < w\delta_{PR}$. Hence the primary outage probability can be expanded as
$$P_{out}^P = \Pr\left\{ Y < \frac{w\delta_{PR}}{lX - wm} \,\Big|\, X > \frac{wm}{l} \right\} \Pr\left\{ X > \frac{wm}{l} \right\} + \Pr\left\{ X \le \frac{wm}{l} \right\}, \qquad (24)$$
since for $X \le wm/l$ the left-hand side of $Y(lX - wm) < w\delta_{PR}$ is non-positive and the outage event occurs with probability one. Here
$$\Pr\left\{ Y < \frac{w\delta_{PR}}{lX - wm} \,\Big|\, X > \frac{wm}{l} \right\} = \int_{\frac{wm}{l}}^{\infty} \left[ \frac{1}{\lambda_{PS}} \exp\left(-\frac{x}{\lambda_{PS}}\right) - \frac{1}{\lambda_{PS}} \exp\left(-\frac{x}{\lambda_{PS}} - \frac{w\delta_{PR}}{\lambda_{SP}(lx - wm)}\right) \right] dx. \qquad (25)$$
Defining a new integration variable $\bar{x} = x - wm/l$, Eq. (25) can be rewritten as
$$\Pr\left\{ Y < \frac{w\delta_{PR}}{lX - wm} \,\Big|\, X > \frac{wm}{l} \right\} = \underbrace{\exp\left(-\frac{wm}{l\lambda_{PS}}\right)}_{Q_1} - \underbrace{\frac{1}{\lambda_{PS}} \exp\left(-\frac{wm}{l\lambda_{PS}}\right) \sqrt{\frac{4w\delta_{PR}\lambda_{PS}}{l\lambda_{SP}}}\; K_1\!\left(\sqrt{\frac{4w\delta_{PR}}{l\lambda_{PS}\lambda_{SP}}}\right)}_{Q_2}. \qquad (26)$$
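The closed form in (26) rests on the Laplace-type integral $\int_0^{\infty} e^{-ax - b/x}\, dx = 2\sqrt{b/a}\, K_1(2\sqrt{ab})$ (Gradshteyn-Ryzhik 3.324.1, cf. [9]), which can be verified numerically:

```python
# Numerical check of the integral identity behind Eq. (26); a and b are arbitrary test values.
import numpy as np
from scipy.integrate import quad
from scipy.special import k1

a, b = 1.7, 0.4
lhs, _ = quad(lambda x: np.exp(-a * x - b / x), 0, np.inf)
rhs = 2 * np.sqrt(b / a) * k1(2 * np.sqrt(a * b))
print(lhs, rhs)   # the two values agree to quadrature accuracy
```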


Besides, we have
$$\Pr\left\{ X > \frac{wm}{l} \right\} = \exp\left(-\frac{wm}{l\lambda_{PS}}\right) = Q_1, \qquad \Pr\left\{ X \le \frac{wm}{l} \right\} = 1 - \exp\left(-\frac{wm}{l\lambda_{PS}}\right) = 1 - Q_1. \qquad (27)$$

Therefore, the outage probability of the primary system can be derived as (12).

B Proof of Theorem 1

B.1 Convexity of the Objective Function G(α, β)

By Lemma 1 in [10], $\log_2(p(x))$ is a concave function if $p(x)$ is a positive concave function. Therefore, for given $\alpha = \alpha_0$, we define the auxiliary function
$$\tilde{G}(\alpha_0, \beta) = \frac{1 + r_{SR}(\alpha_0, \beta)}{1 + r_{SE}(\alpha_0, \beta)} = \frac{1 + \frac{\alpha_0 \beta \eta P_P |h_{PS}|^2 |h_{SR}|^2}{(1-\alpha_0)\,\delta_0}}{1 + \frac{\alpha_0 \beta \eta P_P |h_{PS}|^2 |h_{SE}|^2}{(1-\alpha_0)\,\delta_0}} \qquad (28)$$

and demonstrate the concavity of $G(\alpha_0, \beta)$ by analyzing the second derivative of $\tilde{G}$ with respect to $\beta$, which is given by
$$\frac{\partial^2 \tilde{G}(\alpha_0, \beta)}{\partial \beta^2} = \frac{2\alpha_0^2 (\alpha_0 - 1)\, \delta_0\, \eta^2 P_P^2\, |h_{PS}|^4 |h_{SE}|^2 \left( |h_{SR}|^2 - |h_{SE}|^2 \right)}{\left[ \alpha_0 \eta P_P |h_{PS}|^2 |h_{SE}|^2 \beta + (1-\alpha_0)\,\delta_0 \right]^3}. \qquad (29)$$
Since $\tilde{G}(\alpha_0, \beta) > 0$ and the value of $\partial^2 \tilde{G}(\alpha_0, \beta)/\partial \beta^2$ is negative, $\tilde{G}(\alpha_0, \beta)$ is concave in $\beta$; thus $G(\alpha_0, \beta)$ is also concave in $\beta$.

For given $\beta = \beta_0$, the second partial derivative of $G(\alpha, \beta_0)$ is given by
$$\frac{\partial^2 G(\alpha, \beta_0)}{\partial \alpha^2} = \frac{-(\pi - \epsilon)\left[ 2\epsilon\pi\alpha + (\epsilon + \pi)(1 - \alpha) \right]}{2 \ln 2\; \left[ (\epsilon - 1)\alpha + 1 \right]^2 \left[ (\pi - 1)\alpha + 1 \right]^2}, \qquad (30)$$
where
$$\epsilon = \frac{\beta_0 \eta P_P |h_{PS}|^2 |h_{SE}|^2}{\delta_0}, \qquad \pi = \frac{\beta_0 \eta P_P |h_{PS}|^2 |h_{SR}|^2}{\delta_0}.$$
Based on the above analysis, the value of $\partial^2 G(\alpha, \beta_0)/\partial \alpha^2$ is negative, which proves that $G(\alpha, \beta_0)$ is concave in $\alpha$.
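Expression (30) can be checked against a direct finite-difference evaluation of $G(\alpha, \beta_0)$; the $\epsilon$ and $\pi$ values below are arbitrary test assumptions.

```python
# Finite-difference sanity check of the second derivative in Eq. (30).
import numpy as np

eps, pi_ = 2.0, 3.0     # epsilon and pi as defined after Eq. (30), with assumed test values

def G(a):
    # G(alpha, beta0) = (1-alpha)/2 * log2[(1 + pi*a/(1-a)) / (1 + eps*a/(1-a))]
    return (1 - a) / 2 * np.log2(((pi_ - 1) * a + 1) / ((eps - 1) * a + 1))

a, h = 0.5, 1e-4
numeric = (G(a + h) - 2 * G(a) + G(a - h)) / h ** 2
closed = -(pi_ - eps) * (2 * eps * pi_ * a + (eps + pi_) * (1 - a)) / (
    2 * np.log(2) * ((eps - 1) * a + 1) ** 2 * ((pi_ - 1) * a + 1) ** 2)
print(numeric, closed)   # both ~ -0.681 for these test values
```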

B.2 Convexity of the Constraint Function H(α, β)

To simplify the proof, we further assume $\delta_{ST} = \delta_{PR} = \delta_C = \delta_0$ in the following. For given $\alpha = \alpha_0$, $H(\alpha_0, \beta)$ is given by
$$H(\alpha_0, \beta) = \frac{1-\alpha_0}{2} \log_2\left( 1 + \frac{\frac{\alpha_0 \beta \eta P_P}{1-\alpha_0} |h_{PS}|^2 |h_{SP}|^2}{\left( \frac{\alpha_0 \beta \eta \delta_0}{1-\alpha_0} + \frac{\alpha_0 \beta \eta \delta_0}{(1-\alpha_0)(1-\beta)} \right) |h_{SP}|^2 + \delta_0} \right). \qquad (31)$$


Based on the above analysis, the concavity of $H(\alpha_0, \beta)$ can be demonstrated through the concavity of $r_{PR}(\alpha_0, \beta)$ with respect to $\beta$. Writing $r_{PR}(\alpha_0, \beta) = \theta \big/ \left( \frac{1}{1-\beta} + \frac{1}{\varphi\beta} + 1 \right)$, we have
$$\frac{\partial^2 r_{PR}(\alpha_0, \beta)}{\partial \beta^2} = -\frac{4\theta}{\varphi \beta^2 (1-\beta)^2 \left( \frac{1}{1-\beta} + \frac{1}{\varphi\beta} + 1 \right)^3} - \frac{2\theta \left( 1 + \frac{1}{\varphi\beta} \right)}{(1-\beta)^3 \left( \frac{1}{1-\beta} + \frac{1}{\varphi\beta} + 1 \right)^3} - \frac{2\theta \left( 1 + \frac{1}{1-\beta} \right)}{\varphi \beta^3 \left( \frac{1}{1-\beta} + \frac{1}{\varphi\beta} + 1 \right)^3}, \qquad (32)$$
where $\theta = P_P |h_{PS}|^2 / \delta_0$ and $\varphi = \alpha_0 \eta |h_{SP}|^2 / (1 - \alpha_0)$. Every term on the right-hand side of (32) is negative, so $r_{PR}(\alpha_0, \beta)$ is concave in $\beta$; as a result, $H(\alpha_0, \beta)$ is concave in $\beta$.

For given $\beta = \beta_0$, it is difficult to obtain the exact second partial derivative of $H(\alpha, \beta_0)$ with respect to $\alpha$ in a tractable form, so an approximation for the regime in which the primary system operates at high SNR is used to establish the concavity of $H(\alpha, \beta_0)$ [10]:
$$H(\alpha, \beta_0) \approx \tilde{H}(\alpha, \beta_0) = \frac{1-\alpha}{2} \log_2\left( \frac{\varsigma\alpha}{(\lambda - 1)\alpha + 1} \right), \qquad (33)$$
where
$$\varsigma = \frac{P_P \beta_0 \eta |h_{PS}|^2 |h_{SP}|^2}{\delta_0}, \qquad \lambda = \frac{(2 - \beta_0)\, \beta_0 \eta |h_{SP}|^2}{1 - \beta_0}.$$
The corresponding second partial derivative of $\tilde{H}(\alpha, \beta_0)$ with respect to $\alpha$ is then given by
$$\frac{\partial^2 \tilde{H}(\alpha, \beta_0)}{\partial \alpha^2} = -\frac{2\lambda\alpha + 1 - \alpha}{2 \ln 2\; \alpha^2 \left[ (\lambda - 1)\alpha + 1 \right]^2}. \qquad (34)$$
It is then easy to see that $\tilde{H}(\alpha, \beta_0)$ is concave in $\alpha$.

References

1. Ullah, H., Nair, N.G., Moore, A., et al.: 5G communication: an overview of vehicle-to-everything, drones, and healthcare use-cases. IEEE Access 7, 37251–37268 (2019)
2. Ji, P., Jia, J., Chen, J.: Joint optimization on both routing and resource allocation for millimeter wave cellular networks. IEEE Access 7, 93631–93642 (2019)
3. Sharma, M., Sahoo, A.: Stochastic model based opportunistic channel access in dynamic spectrum access networks. IEEE Trans. Mob. Comput. 13(7), 1625–1639 (2014)
4. Wang, C., Li, Y., Yang, Y., et al.: Combining solar energy harvesting with wireless charging for hybrid wireless sensor networks. IEEE Trans. Mob. Comput. 17(3), 560–576 (2018)
5. Chen, H., Zhai, C., Li, Y., et al.: Cooperative strategies for wireless-powered communications: an overview. IEEE Wirel. Commun. 25(4), 112–119 (2018)
6. Tang, K., Shi, R., Dong, J.: Throughput analysis of cognitive wireless acoustic sensor networks with energy harvesting. Future Gener. Comput. Syst. 86, 1218–1227 (2018)
7. Jiang, L., Tian, H., Qin, C., et al.: Secure beamforming in wireless-powered cooperative cognitive radio networks. IEEE Commun. Lett. 20(3), 522–525 (2016)
8. Zhang, J.-L., Pan, G.-F., Wang, H.-M.: On physical-layer security in underlay cognitive radio networks with full-duplex wireless-powered secondary system. IEEE Access 4, 3887–3893 (2016)
9. Gradshteyn, I.S., Ryzhik, I.M.: Table of Integrals, Series, and Products, 7th edn. Academic Press, New York (2007)
10. Gorski, J., Pfeuffer, F., Klamroth, K.: Biconvex sets and optimization with biconvex functions: a survey and extensions. Math. Methods Oper. Res. 66(3), 373–407 (2007)

CodeeGAN: Code Generation via Adversarial Training

Youqiang Deng1, Cai Fu1(B), and Yang Li2

1 Huazhong University of Science and Technology, Wuhan 430074, China
{dengyouqiang,fucai}@hust.edu.cn
2 Wuhan Maritime Communication Research Institute, Wuhan 430074, China
liyang 22 [email protected]

Abstract. The automatic generation of code is an important research problem in the field of machine learning. The Generative Adversarial Network (GAN) exhibits a powerful ability in image generation. However, generating code via GAN is so far an unexplored research area, because the discrete output of a language model hinders the application of gradient-based GANs. In this paper, we propose a model called CodeeGAN to generate code via adversarial training. First, we adopt the Policy Gradient method from Reinforcement Learning (RL) to handle the problem of discrete data: because the generative model emits discrete tokens, it cannot be adjusted directly by gradient descent. Second, we use Monte Carlo Tree Search (MCTS) to create a rollout network for evaluating the loss of generated tokens. Based on these two mechanisms, we build the CodeeGAN model to generate code via adversarial training. We evaluate the model on datasets from four different platforms. Our model shows better performance than other existing works and proves that code generation via adversarial training is an efficient method.

Keywords: GAN · Code generation · CNN · LSTM · Policy Gradient · Monte Carlo Tree Search

1 Introduction

The ability of AI to generate computer programs with high accuracy is meaningful for the development of machine learning. In 2017, Tony Beltramelli presented the pix2code model based on CNN and LSTM, which converts input GUI screenshots into the corresponding computer programs [1]. Since then, researchers have made various attempts based on pix2code. Screenshot-to-code, proposed by Emil Wallner [2], is one such attempt: it combines ideas from sketching interfaces [3] and im2markup [4] and converts input screenshots into computer code. Sketch-code [5] is another pix2code-based model that draws inspiration from Screenshot-to-code, but its author makes a bigger breakthrough: Sketch-code learns features from hand-drawn wireframes (without requiring a specific GUI screenshot) to generate the corresponding HTML code. Meanwhile, Microsoft has also conducted related research on AI-generated code: Microsoft's ailab proposed the Sketch2Code model in 2018 [6]. Sketch2Code likewise converts hand-drawn web page sketches into corresponding HTML code, and it performs better than Sketch-code.

However, although these studies made some breakthroughs, they have certain limitations. On one hand, the accuracy of the generated code is not satisfactory; for example, pix2code only reaches an accuracy of about 77%. On the other hand, the scalability of these models is poor. There is still a long way to go before these studies see practical application.

Meanwhile, the Generative Adversarial Network (GAN) [7] exhibits a powerful ability in image generation. The core idea of GAN is based on game theory: a generative model is responsible for generating data, while a discriminative model distinguishes generated data from real data. If the generated data is too realistic for the discriminative model to tell apart, the system reaches a Nash equilibrium. Therefore, we propose to use the idea of GAN to generate computer code: we first create a generative model to generate code, and then construct a discriminative model to judge whether the generated code is real or not.

Unfortunately, applying GAN to code generation faces two challenges. First, GAN was originally designed to generate real-valued continuous data and has difficulty directly generating discrete token sequences, such as text [8]. The generative model starts from random noise and is then regulated by model parameters; the gradient of the loss from the discriminative model guides the generation, slightly changing the generated values to make them more realistic. If the generative model generates discrete tokens, the discriminative model cannot provide such slight-change guidance, because there is probably no corresponding token for such a slight change in the limited dictionary space [9]. Second, GAN can only give a score/loss to an entire generated sequence, and cannot evaluate the quality of a partial sequence [10].

In this paper, to address these two issues, we treat code generation by the generative model as a sequential decision-making process and describe it with a Markov Decision Process (MDP) [13]. The MDP is a mathematically idealized form of the reinforcement learning problem about which precise theoretical statements can be made. In an MDP there is an agent, a learner or decision maker, and everything besides the agent is called the environment. When the agent takes an action, the environment responds and presents a new situation to the agent; the environment also gives rise to rewards that the agent seeks to maximize over time through its choice of actions [13]. We thus regard the generative model as an agent that generates tokens, where an action is the next token to be generated. Besides, we create a discriminative model to guide the generation of the generative model. A minimal sketch of this policy-gradient view is given below.
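The following sketch, assuming a PyTorch environment, shows a REINFORCE-style update for a token-generating LSTM whose reward comes from an external scorer (in CodeeGAN, the discriminator). All names and sizes are illustrative assumptions; this is an illustration of the policy-gradient idea, not the paper's implementation.

```python
# Minimal policy-gradient (REINFORCE) sketch: the generator is an agent, the next token
# is the action, and a per-sequence reward (e.g. a discriminator score) guides updates.
import torch
import torch.nn as nn

class TokenGenerator(nn.Module):
    def __init__(self, vocab=100, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, tokens, state=None):
        out, state = self.lstm(self.embed(tokens), state)
        return self.head(out), state

gen = TokenGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

def reinforce_step(reward_fn, seq_len=20, batch=8, bos=0):
    """One policy-gradient update: sample sequences, score them, ascend E[reward * log pi]."""
    tokens = torch.full((batch, 1), bos, dtype=torch.long)
    log_probs, state = [], None
    for _ in range(seq_len):
        logits, state = gen(tokens[:, -1:], state)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        nxt = dist.sample()                        # discrete action: next token
        log_probs.append(dist.log_prob(nxt))
        tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)
    reward = reward_fn(tokens[:, 1:])              # e.g. discriminator score per sequence
    loss = -(reward.detach() * torch.stack(log_probs, 1).sum(1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage with a dummy reward (stand-in for a trained discriminator):
reinforce_step(lambda seq: torch.rand(seq.size(0)))
```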

20

Y. Deng et al.

model consists of the generative model and the discriminative model mentioned above, and we name it CodeeGAN. Specifically, the contributions of this paper can be summarized as follows:

• We propose a code generation method by applying adversarial training.
• We create the CodeeGAN model to bring about code generation via adversarial training, which consists of a generative model and a discriminative model.
• We further propose a generic code generation framework by combining multiple deep learning methods, and a code discrimination framework by applying a text classification method.
• We test the effectiveness of the proposed method by conducting experiments on an Android UI dataset, an iOS UI dataset, a web-based UI dataset and a hand-drawn website mockups dataset. Experimental results show that our proposed method is more efficient compared with other existing works.

This paper is structured as follows. The problem statement and an introduction to the proposed scheme are given in Sect. 1. Section 2 reviews related work. In Sect. 3, we create our model according to the proposed idea and introduce the model in detail. Section 4 presents the algorithms used during the training process. In Sect. 5, we conduct experiments to verify the efficiency and accuracy of our model. Section 6 concludes the paper.

2 Related Work

With the development of Machine Learning in image processing, text recognition, etc., researchers have begun to focus on the automatic generation of computer programs. Over the last few years, attempts at generating code from input images have accelerated; each research study specifies the machine learning algorithms applied and the system performance achieved [29]. A recent example is DeepCoder, proposed by Balog et al. [19]. DeepCoder is a system able to generate computer programs to augment traditional search techniques. The approach of DeepCoder is to train a neural network to predict properties of the program that generated the outputs from the inputs. Balog et al. used the neural network's predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver [19]. In another work, pix2code, Beltramelli [1] used the deep neural networks CNN and LSTM to form the training model pix2code, generating computer programs from input GUI screenshots. Beltramelli's work shows that deep learning methods can be leveraged to train a model end-to-end to automatically generate code from a single input image with over 77% accuracy [1]. In addition, Microsoft AI Lab proposed Sketch2Code [6] in 2018, a solution that uses AI to transform a handwritten user interface design from a picture into valid HTML markup code, i.e., it converts hand-drawn GUI images into HTML code.

Meanwhile, Generative Adversarial Networks (GAN) have been proved to be significant in image generation since they were proposed by Goodfellow et al. [7] in 2014. The training procedure of GAN is a minimax game between a generative model and a discriminative model [10]. There are many variants of GAN, such as DCGAN, WGAN and LSGAN, which are optimizations of the original GAN. But there are not many research studies on text sequence generation via GAN [26–28], because the discrete output of language models hinders the application of gradient-based GANs. In this regard, Kusner et al. [20] believed that such a problem can be avoided with the Gumbel-softmax distribution, a continuous approximation of the polynomial distribution based on the softmax function; they evaluated the performance of GANs based on recurrent neural networks with Gumbel-softmax output distributions on the task of generating sequences of discrete elements [20]. Zhang et al. [21] adopted the approach of smooth approximation to solve the problem of non-differentiable gradients caused by discreteness. They proposed a generic framework employing Long Short-Term Memory (LSTM) and convolutional neural networks (CNN) for adversarial training to generate realistic text [21]. In another line of work, Yu et al. [10] proposed a model called SeqGAN, which solves the problem of how to apply adversarial training in the NLP area and has excellent performance in text generation tasks. Yu et al. considered the process of sequence generation as a sequential decision process in Reinforcement Learning: the generative model is regarded as an agent, the tokens generated so far as the current state, the next token to be generated as the action, and the discriminative model's evaluation score for the sequence as the reward. Updating the parameters of the generative model via Policy Gradient avoids the problem of non-differentiable gradients caused by discreteness.

3 CodeeGAN

Our model for the task of code generation, named CodeeGAN, consists of a generative model (G) and a discriminative model (D). It first generates code using G, which consists of a Convolutional Neural Network (CNN) and two Recurrent Neural Networks (RNN). The CNN of G plays the role of an image encoder, extracting image features and arranging them in a grid. The first RNN of G encodes the input tokens into an intermediary representation, while the second RNN of G decodes the representations learned by both the CNN and the first RNN. Then we train D to provide guidance for improving G: D gives probabilities indicating how likely a sequence is to come from the real data. The illustration of CodeeGAN is shown in Fig. 1.

3.1 The Generative Model

Fig. 1. The illustration of CodeeGAN.

We create our generative model with a CNN and two RNNs. Since we need to generate corresponding computer code from the input images, our generator must have a neural network for processing the images. CNN has unique advantages in image processing because of its special structure of local weight sharing: images can be directly input into the network, which avoids the complexity of data reconstruction during feature extraction and classification [25]. The input images are initially resized to 256 × 256 pixels and the pixel values are normalized before being fed into the CNN. To train the generative model, our inputs also include the DSL code that describes the images. A domain-specific language (DSL) is a computer language specialized to a particular application domain. In this paper, we use DSL code from Beltramelli's datasets [1] to describe the input images instead of HTML code. Thus we need an RNN to process the input DSL code. To achieve better performance and simplify the network, our RNNs use Long Short-Term Memory networks [22]. A Long Short-Term Memory network, usually just called an "LSTM", is a special kind of RNN capable of learning long-term dependencies. Three gate functions are introduced in LSTM, namely the input gate i_t, forget gate f_t and output gate o_t, which are used to control input values, memory values and output values. The different LSTM gate outputs can be computed as follows:

i_t = σ(W_i x_t + R_i h_{t−1} + b_i + w_i ⊙ c_{t−1})    (1)

o_t = σ(W_o x_t + R_o h_{t−1} + b_o + w_o ⊙ c_t)    (2)

f_t = σ(W_f x_t + R_f h_{t−1} + b_f + w_f ⊙ c_{t−1})    (3)

where W_{i,o,f} and R_{i,o,f} are weight matrices, b_{i,o,f} are bias vectors, x_t is the new input vector at time t, h_{t−1} is the previously produced output vector, σ(·) is the logistic sigmoid function, and ⊙ is the element-wise multiplication operator (Hadamard product). From the above formulations we obtain the outputs of the input, output and forget gates. But these gate outputs are not our goal: what we really want is the hidden state. The hidden units h_t are updated as follows:

u_t = tanh(W_u x_t + R_u h_{t−1} + b_u)    (4)

c_t = i_t ⊙ u_t + f_t ⊙ c_{t−1}    (5)

h_t = o_t ⊙ tanh(c_t)    (6)
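To make these update rules concrete, the following minimal NumPy sketch (our illustration, not the paper's implementation) computes one LSTM step exactly as in Eqs. (1)–(6). The parameter container P and its sizes are hypothetical; the peephole weights w_i, w_o, w_f are element-wise vectors as in the equations.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step following Eqs. (1)-(6); P holds all weights."""
    i_t = sigmoid(P["Wi"] @ x_t + P["Ri"] @ h_prev + P["bi"] + P["wi"] * c_prev)  # Eq. (1)
    f_t = sigmoid(P["Wf"] @ x_t + P["Rf"] @ h_prev + P["bf"] + P["wf"] * c_prev)  # Eq. (3)
    u_t = np.tanh(P["Wu"] @ x_t + P["Ru"] @ h_prev + P["bu"])                     # Eq. (4)
    c_t = i_t * u_t + f_t * c_prev                                                # Eq. (5)
    o_t = sigmoid(P["Wo"] @ x_t + P["Ro"] @ h_prev + P["bo"] + P["wo"] * c_t)     # Eq. (2)
    h_t = o_t * np.tanh(c_t)                                                      # Eq. (6)
    return h_t, c_t

# Tiny demo with random (illustrative) weights: hidden size 4, input size 3.
n, m = 4, 3
rng = np.random.default_rng(0)
P = {k: rng.standard_normal((n, m)) for k in ("Wi", "Wf", "Wu", "Wo")}
P.update({k: rng.standard_normal((n, n)) for k in ("Ri", "Rf", "Ru", "Ro")})
P.update({k: rng.standard_normal(n) for k in ("bi", "bf", "bu", "bo", "wi", "wf", "wo")})
h, c = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n), P)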

Assume we are given a GUI image I and a corpus X = {x_1, · · · , x_t} corresponding to the image I. The generative model can be expressed mathematically as follows:

p = CNN(I)    (7)

q_t = LSTM(x_t)    (8)

y_t = softmax(LSTM(p, q_t))    (9)

The input GUI image I is processed by the CNN to obtain a tensor p, while the corpus X = {x_1, · · · , x_t} is processed by the first LSTM to obtain another tensor q_t. Then we concatenate tensor p and tensor q_t into one tensor, which is fed to the second LSTM. The following token x_{t+1} is used as the training label, and the softmax layer performs multi-class classification and predicts the result y_t. Thus, from the input image I and corpus X = {x_1, · · · , x_t}, the generative model generates the next token y_t. We label the generated tokens with 0 and the original real context tokens with 1, and feed them to the discriminative model for discrimination.
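As a rough sketch of a generator with this shape, the following Keras code assembles a CNN image encoder and two LSTMs as in Eqs. (7)–(9). It is an assumption-laden approximation, not the authors' network: the layer counts, layer sizes and the VOCAB_SIZE/SEQ_LEN constants are illustrative placeholders.

from tensorflow.keras import layers, Model

VOCAB_SIZE, SEQ_LEN = 20, 48  # assumed sizes, for illustration only

# CNN image encoder: 256x256 GUI screenshot -> feature vector p (Eq. 7)
img_in = layers.Input(shape=(256, 256, 3))
x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
p = layers.Dense(128, activation="relu")(layers.Flatten()(x))
p = layers.RepeatVector(SEQ_LEN)(p)          # tile p across the token sequence

# First LSTM encodes the DSL context tokens into q_t (Eq. 8)
ctx_in = layers.Input(shape=(SEQ_LEN, VOCAB_SIZE))
q = layers.LSTM(128, return_sequences=True)(ctx_in)

# Second LSTM decodes the concatenated representations; softmax predicts y_t (Eq. 9)
decoder = layers.LSTM(256)(layers.concatenate([p, q]))
y = layers.Dense(VOCAB_SIZE, activation="softmax")(decoder)

generator = Model(inputs=[img_in, ctx_in], outputs=y)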

3.2 The Discriminative Model

In the Generative Adversarial Network framework, the generative model is pitted against an adversary, while the discriminative model learns to determine whether a sample comes from the model distribution or the data distribution [7]. Now that we have created the generative model, how to build the discriminative model is a key problem. The discriminative model is a classifier that classifies the input generated data and real data, and outputs the probability values of the two categories. There are many classifiers for text classification tasks, such as the deep neural network (DNN) [14], the convolutional neural network (CNN) [15] and the recurrent convolutional neural network (RCNN) [16]. CNN has recently been shown to be highly effective for text (token sequence) classification [17], so we choose CNN as our discriminative model's network. Our CNN for text classification uses an embedding layer, followed by a convolutional layer, a max-pooling layer and a softmax layer. Assume we are given the generated corpus Y = {y_1, · · · , y_n}; each word y_t is embedded into a k-dimensional word vector W_e[y_t], where W_e ∈ R^{k×V} is a word embedding matrix (to be learned) and V is the vocabulary size. A convolution operation involves a filter W_c ∈ R^{k×h}, applied to a window of h words to produce a new feature. We can induce one feature map as follows:

T = conv2d(Y, W_c) + b    (10)

c = softmax(T)    (11)

where conv2d(·) denotes the convolutional function, b is a bias vector, and T represents the result of the convolutional operation. After the nonlinear activation function softmax(·), we get the feature map c. At the same time, in order to achieve better performance and prevent over-fitting due to an overly complicated network, we add a highway network to the discriminative model.
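A minimal Keras-style sketch of such a CNN text classifier follows (embedding, convolution, max-pooling, softmax, cf. Eqs. (10)–(11)). All sizes are assumed for illustration, and the highway network mentioned above is omitted from this sketch.

from tensorflow.keras import layers, Model

VOCAB_SIZE, SEQ_LEN, EMB_DIM = 20, 48, 64  # assumed sizes, for illustration only

tok_in = layers.Input(shape=(SEQ_LEN,))
e = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tok_in)           # embedding layer (W_e)
t = layers.Conv1D(128, kernel_size=3, activation="relu")(e)  # filter W_c over windows of h = 3 tokens, Eq. (10)
t = layers.GlobalMaxPooling1D()(t)                           # max-pooling layer
prob = layers.Dense(2, activation="softmax")(t)              # real vs. generated, Eq. (11)

discriminator = Model(tok_in, prob)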

3.3 Rollout Network

As mentioned above, for discrete data the discriminative model cannot use the gradient descent algorithm to update the generative model. Thus we adopt the Policy Gradient method from Reinforcement Learning to solve this problem. The state of the currently generated code is s_t; the generative model then takes an action A_t to generate the next token y_t. For this action A_t, we give the generative model a reward via the potential reward mechanism: whenever a token is generated, there is a corresponding reward. How is this reward mechanism implemented? Our target is to generate a complete sentence for scoring. For the currently generated tokens, we adopt the approach of Monte Carlo Tree Search to complete the remaining part of the sentence, and then pass the complete sentence to the discriminative model for scoring [10]. After that we obtain each token's reward. We feed each token's reward into the rollout network, then update the parameters of the rollout network. Since the structure of the rollout network is the same as that of the generative model and parameter sharing between the two is implemented, the weights and biases of the generative model update accordingly. The process of Monte Carlo Tree Search is shown in Fig. 2.
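A compact sketch of this rollout-based reward estimation follows; rollout_policy (a next-token sampler that would share weights with G) and discriminator (returning the probability that a sequence is real) are hypothetical callables standing in for the networks described above.

import random

def rollout_rewards(prefix, rollout_policy, discriminator, T, N):
    """Estimate the reward of a partial sequence `prefix` (state s_t) by
    completing it N times with the rollout policy and averaging the
    discriminator's scores, as in Eq. (14) below."""
    total = 0.0
    for _ in range(N):
        seq = list(prefix)
        while len(seq) < T:                  # sample the unknown last T - t tokens
            seq.append(rollout_policy(seq))  # next-token sampler shared with G
        total += discriminator(seq)          # probability that seq is real
    return total / N

# Toy usage with stand-in callables:
toy_policy = lambda seq: random.randrange(5)
toy_disc = lambda seq: 0.5
print(rollout_rewards([1, 2], toy_policy, toy_disc, T=6, N=8))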

Fig. 2. The process of Monte Carlo Tree Search: from the current state, the next token is chosen, a rollout completes the sequence, and the result is backpropagated; this repeats until all searches are done.

4 Training Process

Assume we are given the input GUI image I and the contextual sequence X = {x_1, · · · , x_t} corresponding to the image I. Instead of directly minimizing the objective function from the standard GAN, we adopt an approach similar to feature matching. In total, we have the generative model with loss L_G and the discriminative model with loss L_D, and we can describe our goal as:

minimize L_G = − Σ_{i=1}^{T} Σ_{y_t ∈ Y} x_{t+1} log(y_t) · R_D^G(s = x_{1:t}, a = y_t)    (12)

minimize L_D = − E_{Y∼p_data}[log D(Y)] − E_{Y∼G}[log(1 − D(Y))]    (13)

where x_{t+1} denotes the expected token, y_t denotes the predicted token, the state s represents the decoded tokens before the current timestep, and the action a represents the next token to be generated. Thus R_D^G represents the reward of the current tokens. We use the Reinforcement Learning algorithm [18] and regard the probability estimated by the discriminative model as the reward. Formally, we have:

R_D^G(s = x_{1:t}, a = y_t) = (1/N) Σ_{n=1}^{N} D(Y_{1:T}^n)    (14)

To minimize the loss of the generative model L_G, we must calculate the reward of each generated token. The rewards are initialized with random values between 0 and 1. For each epoch, we first train the generative model to minimize L_G. Then we generate tokens with the sample network through weight sharing. After that, we concatenate the generated tokens and the real tokens to feed the discriminative model, which is trained to minimize L_D. When the discriminative model obtains a result, the generative model is ready to update in the next epoch. The key strategy is to calculate the reward via the classification result of the discriminative model. Unfortunately, the discriminative model only provides a reward value for a finished sequence. To solve this problem, we apply Monte Carlo Tree Search with a roll-out policy to sample the unknown last T − t tokens for the current state [10]. We can describe the process as:

{Y_{1:T}^1, . . . , Y_{1:T}^N} = MC(Y_{1:t}, N)    (15)
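As a sketch of how Eq. (12) can be evaluated once the per-token rewards of Eq. (14) are available, the following function (an illustration under assumed array shapes, not the paper's training code) accumulates the reward-weighted negative log-likelihood of the expected tokens:

import numpy as np

def generator_loss(token_probs, expected_tokens, rewards):
    """Policy-gradient loss of Eq. (12): token_probs[t] is the softmax
    distribution y_t over the vocabulary at step t, expected_tokens[t] is
    the index of x_{t+1}, and rewards[t] is R_D^G for the prefix x_{1:t}."""
    loss = 0.0
    for y_t, x_next, r in zip(token_probs, expected_tokens, rewards):
        loss -= np.log(y_t[x_next] + 1e-12) * r  # -log p(expected token), weighted by reward
    return loss

# e.g. generator_loss([np.array([0.1, 0.9])], [1], [0.7])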

In addition, to prevent overfitting, we use two different methods in the generative model and the discriminative model, respectively. Dropout is a common method of preventing overfitting in deep learning. During the training phase, each neural unit is discarded with a probability p, while in the test phase all neural units are retained but the weight parameters w are multiplied by p. Considering that the output of a neuron in the first hidden layer before dropout is x, the expected value after dropout is

E = p · x + (1 − p) · 0 = px    (16)

To find an appropriate dropout rate p for the generative model, we made many attempts. Finally, dropout regularization set to 25% is applied to the CNN networks in the generative model after each max-pooling operation, and to 30% after each fully-connected layer [1]. Besides, L2 regularization is also a good choice for preventing overfitting, and its formula is simple:

L = E_in + λ Σ_j w_j²    (17)

where E_in is the training-sample error, which does not contain the regularization term, and λ is the regularization parameter. L2 regularization alleviates the problem of collinearity and filters out noise in raw data. Thus we add an L2 regularization term to the loss function of the discriminative model.
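For illustration, the L2 penalty of Eq. (17) can be added to a training error as in the following sketch, where e_in and the list of weight arrays are assumed inputs:

import numpy as np

def regularized_loss(e_in, weights, lam):
    """Eq. (17): add an L2 penalty over all weight arrays to the
    training error e_in; lam is the regularization parameter lambda."""
    return e_in + lam * sum(np.sum(w ** 2) for w in weights)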

5 Experiments

To train our model and evaluate the results, we select datasets from four different platforms: an Android UI dataset¹, an iOS UI dataset¹, a web-based UI dataset¹ and a hand-drawn website mockups dataset². Each dataset consists of 1750 images and 1750 contexts. We divide each dataset into a training set and a test set, where the training set contains 1500 images and 1500 contexts, and the test set contains 250 images and 250 contexts. For horizontal comparison, our hand-drawn website mockups dataset differs from the other three: its training set contains 1530 images and 1530 contexts, and its test set contains 170 images and 170 contexts. Dataset statistics are shown in Table 1.

¹ http://github.com/tonybeltramelli/pix2code/tree/master/datasets
² http://sketch-code.s3.amazonaws.com/data/all_data.zip

Table 1. Dataset statistics

Datasets           | Training set         | Test set
                   | Instances | Samples  | Instances | Samples
Android dataset    | 1500      | 85353    | 250       | 14668
iOS dataset        | 1500      | 93554    | 250       | 16102
Web-based dataset  | 1500      | 144260   | 250       | 23698
Hand-drawn dataset | 1530      | 146788   | 170       | 16276

We feed our CodeeGAN with datasets from the four different platforms for training, and calculate the loss over 10 epochs. Since our goal is to make the data
generated by the generative model too realistic to be discriminated by the discriminative model, we use the cross-entropy loss function of the generative model as the evaluation. The cross-entropy loss curves over 10 epochs on the four different datasets of CodeeGAN are shown in Fig. 3(a). The loss curves indicate that the convergence behaviors of the different datasets clearly differ. Compared with the loss curves of the Android UI dataset and the iOS UI dataset, the loss curves of the web-based UI dataset and the hand-drawn website mockups dataset converge to smaller values because of the larger number of samples: the more samples a dataset has, the more features the model can learn from it. According to Table 1, the Android UI dataset has the fewest samples of all the datasets, and the value its loss curve converges to is the largest. Thus, we can conclude that the number of dataset samples affects the learning outcome of the model, and ultimately affects the accuracy of code generation. For a particular classifier, each instance gets a classification result. CodeeGAN is actually a multi-classifier, so we need to evaluate its classification accuracy with an appropriate evaluation method. Here we use the ROC (Receiver Operating Characteristic) curve to visually show the classification effect of CodeeGAN; it plots the false positive (FP) rate against the true positive (TP) rate [30]. The abscissa of the ROC curve is the False Positive Rate (FPR), and the ordinate is the True Positive Rate (TPR). TPR measures the fraction of positive samples that CodeeGAN correctly identifies, and FPR measures the fraction of negative samples that CodeeGAN misclassifies as positive. AUC (Area Under Curve) indicates the area under the ROC curve, whose value lies between 0 and 1. As a numerical value, AUC can be used to intuitively evaluate the quality of the classifier: the larger the value, the better the classification effect. ROC curves calculated during evaluation with CodeeGAN trained for 10 epochs on the four different datasets are shown in Fig. 3(b).
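For reference, ROC curves and AUC values of the kind reported below can be computed with scikit-learn as in the following sketch; the labels and scores shown are hypothetical stand-ins for the discriminator's outputs.

from sklearn.metrics import roc_curve, auc

# Hypothetical data: y_true marks real (1) vs. generated (0) sequences,
# y_score is the discriminator's probability that each sequence is real.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2, 0.7, 0.55]

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC =", auc(fpr, tpr))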

Fig. 3. Training loss on different datasets and ROC curves of CodeeGAN: (a) CodeeGAN training loss; (b) CodeeGAN ROC curves.

Obviously, CodeeGAN’s ROC curves indicate that the AUC values evaluated on Android UI dataset, iOS UI dataset, web-based UI dataset and hand-drawn

28

Y. Deng et al.

website mockups dataset are 0.753, 0.948, 0.956 and 0.951, respectively. Among them, the AUC value evaluated on the Android UI dataset is the lowest, because the Android UI dataset has fewer samples than the other three datasets. The AUC values evaluated on the other three datasets are around 0.95, which are fairly high accuracies. Therefore, our CodeeGAN, which generates code via adversarial training, is shown to be effective. In addition, to compare with pix2code, we calculate the training loss of pix2code for 10 epochs on the four different datasets and compare the loss curves of CodeeGAN and pix2code in the same coordinate system. Training loss comparisons between CodeeGAN and pix2code are shown in Fig. 4. The loss-curve comparisons on the four datasets look similar: both pix2code and CodeeGAN converge to a minimum value after training for 10 epochs, but on every dataset CodeeGAN converges to a smaller minimum value than pix2code. Taking the web-based UI dataset as an example, the loss of pix2code gradually converges from 1.16 to 0.14 after training for 10 epochs, while CodeeGAN's loss drops to about 0.08. The comparison results on the other three datasets are similar to the result on the web-based UI dataset. Thus we can conclude that CodeeGAN has better performance than pix2code.

Fig. 4. Training loss comparisons between CodeeGAN and pix2code on four different datasets: (a) Android dataset; (b) iOS dataset; (c) web-based dataset; (d) hand-drawn dataset.

6 Conclusion

In this paper, we propose a method to generate code from input images via adversarial training. According to this idea, we create our model, named CodeeGAN, which consists of a generative model and a discriminative model. The generative model is based on CNN and LSTM: the CNN processes the input GUI images, the first LSTM processes the input contexts, and the results of both are fed into the second LSTM for code prediction. The discriminative model is a code classifier based on CNN, which takes the data generated by the generative model and the real data as inputs and outputs the probability of discrimination (that is, the probability of code classification). In order to update the generative model's parameters via Policy Gradient, we design a rollout network
whose structure is basically consistent with the generative model and which shares parameters with it. After theoretical derivation and experimental verification, we obtain AUC values of around 0.95 on three of the four datasets. According to the design idea of CodeeGAN, it is not difficult to understand why the structure of CodeeGAN works. There is a game relationship between the generative model and the discriminative model of CodeeGAN: the generative model always tries to generate data to deceive the discriminative model, while the discriminative model is eager to improve its ability to discriminate between generated data and real data. At the end of this game, data generated by the generative model is too realistic to be discriminated by the discriminative model. This is an ideal state, in which the data generated by the generative model is indistinguishable from real data. Although such an ideal state is not achievable, we can get as close as possible to it in order to improve the accuracy of code generation.

References

1. Beltramelli, T.: pix2code: Generating Code from a Graphical User Interface Screenshot. arXiv preprint arXiv:1705.07962 (2017)
2. Wallner, E.: Screenshot-to-code (2017). https://github.com/emilwallner/Screenshot-to-code
3. Wilkins, B., Gold, J., Owens, G., Chen, D., Smith, L.: Sketching Interfaces: Generating code from low fidelity wireframes (2017). https://airbnb.design/sketching-interfaces
4. Deng, Y., Kanervisto, A., Ling, J., Rush, A.M.: Image-to-markup generation with coarse-to-fine attention. In: ICML (2017)
5. Kumar, A.: Sketch-code (2018). https://github.com/ashnkumar/sketch-code
6. Microsoft AI Lab: Sketch2Code (2018). https://github.com/Microsoft/ailab/tree/master/Sketch2Code
7. Goodfellow, I.J., et al.: Generative Adversarial Nets. arXiv preprint arXiv:1406.2661 (2014)
8. Huszar, F.: How (not) to train your generative model: scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101 (2015)
9. Goodfellow, I.J.: Generative adversarial networks for text (2016). http://goo.gl/Wg9DR7
10. Yu, L., Zhang, W., Wang, J., Yu, Y.: SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. arXiv preprint arXiv:1609.05473 (2017)
11. Bachman, P., Precup, D.: Data generation as sequential decision making. In: NIPS, pp. 3249–3257 (2015)
12. Bahdanau, D., Brakel, P., Xu, K., et al.: An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086 (2016)
13. Sutton, R.S., Barto, A.G.: Finite Markov decision processes. In: Reinforcement Learning, pp. 47–71 (2018)
14. Vesely, K., Ghoshal, A., Burget, L., Povey, D.: Sequence-discriminative training of deep neural networks. In: INTERSPEECH, pp. 2345–2349 (2013)
15. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
16. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: AAAI, pp. 2267–2273 (2015)
17. Zhang, X., LeCun, Y.: Text understanding from scratch. arXiv preprint arXiv:1502.01710 (2015)
18. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
19. Balog, M., Gaunt, A.L., Brockschmidt, M., Nowozin, S., Tarlow, D.: DeepCoder: learning to write programs. arXiv preprint arXiv:1611.01989 (2016)
20. Kusner, M., Lobato, J.: GANs for Sequences of Discrete Elements with the Gumbel-softmax Distribution. arXiv preprint arXiv:1611.04051 (2016)
21. Zhang, Y., Gan, Z., Carin, L.: Generating Text via Adversarial Training. arXiv preprint arXiv:1725.07232 (2017)
22. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
23. Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., Wang, J.: Long Text Generation via Adversarial Training with Leaked Information. arXiv preprint arXiv:1709.08624 (2017)
24. Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., Jurafsky, D.: Adversarial Learning for Neural Dialogue Generation. arXiv preprint arXiv:1701.06547 (2017)
25. Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634 (2015)
26. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: Proceedings of the 33rd International Conference on Machine Learning, vol. 3 (2016)
27. Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242 (2016)
28. Shetty, R., Rohrbach, M., Hendricks, L.A., Fritz, M., Schiele, B.: Speaking the same language: matching machine to human captions by adversarial training. arXiv preprint arXiv:1703.10476 (2017)
29. Alrowaily, M., Alenezi, F., Lu, Z.: Effectiveness of machine learning based intrusion detection systems. In: Wang, G., Feng, J., Bhuiyan, M.Z.A., Lu, R. (eds.) SpaCCS 2019. LNCS, vol. 11611, pp. 277–288. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-24907-6_21
30. Manavi, M., Zhang, Y.: A new intrusion detection system based on gated recurrent unit (GRU) and genetic algorithm. In: Wang, G., Feng, J., Bhuiyan, M.Z.A., Lu, R. (eds.) SpaCCS 2019. LNCS, vol. 11611, pp. 368–383. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-24907-6_28

Information Consumption Patterns from Big Data

Jesús Silva¹, Ligia Romero², Claudia Fernández², Darwin Solano², Nataly Orellano Llinás³, Carlos Vargas Mercado⁴, Jazmín Flórez Guzmán³, and Ernesto Steffens Sanabria⁴

¹ Universidad Peruana de Ciencias Aplicadas, Lima, Peru
² Universidad de la Costa, St. 58 #66, Barranquilla, Atlántico, Colombia
  {lromero11,cfernand10,dsolano1}@cuc.edu.co
³ Corporación Universitaria Minuto de Dios – UNIMINUTO, Barranquilla, Colombia
⁴ Corporación Universitaria Latinoamericana, CUL, Barranquilla, Colombia
  {cvargas,esteffens}@ul.edu.co

Abstract. Virtual social networks represent an important opportunity to build friendlier communication bridges between students, teachers and other actors in the educational field. In this sense, our study presents an approximation to the connection habits of university students in these networks, which in the future will make it possible to take advantage of these platforms to achieve successful communication between actors. Thus, the characterization of the uses, habits and consumption of virtual social networks becomes very relevant.

Keywords: Big data · Social networks · Consumption patterns

1 Introduction

Individuals, especially the youngest, use virtual social networks (VSN) for multiple purposes: entertainment, socialization, citizen organization and information, among others. There is one point, however, that cannot be ignored: their influence among university students. The works in [1–6] have made outstanding contributions in this regard, with special attention to this population. Some recent studies focused on the uses and effects of virtual social networks in school environments in Latin America point to a recognition of the didactic potential of the network by the students themselves, who are even willing to participate in this type of communication to disseminate and share materials, opinions and knowledge [7–12]. Humans, as social beings, require the collaboration and participation of their peers to undertake actions that meet their different needs. In this sense, communication and interaction are needed. The authors of [13–15] explain that individuals are mainly social since they are born and develop in social groups. It is precisely in social interaction (contact with other people and joint participation in various activities) that important processes take place, such as socialization, group integration and the construction of identity.


However, since the beginning of the 21st century, the forms of communication and interaction between individuals have changed significantly. In certain socioeconomic and socio-cultural contexts, people are no longer contacted only directly (physically, interpersonally, face-to-face) but also virtually, through the mediation of digital technologies such as the mobile telephone and the Internet. Virtual social networks cannot be understood without first characterizing the concept of information and communication technologies (ICT) and the framework in which they appeared. In the information society there is not just an important technological revolution but also an important impact of this technology on the social component, which can be seen in product innovation and in the execution of processes in which virtuality takes on an increasingly important role [16–18]. This study presents specific information on the connection habits in these networks among undergraduate students at the University of Mumbai in India, particularly at its most populous area, the Social Sciences, Economic-Administrative and Humanities campus, in 2018.

2 An Approach for the Analysis of Virtual Social Networks Through Data Mining

Data mining is an extraction technique that organizes, groups, relates and classifies information from databases by means of determined algorithms. It operates through a confusion matrix in which all the attributes added to the database are mixed [13]. The function of the selected algorithms is to search for rules, trends and behavioral predictions. The data mining technique has several applications in different sciences. For example, it is used to analyze databases from organizations and companies, but in general it can be applied to any social phenomenon whose indicators can be systematized in a database. In the field of technology, it is used to analyze learning trends and learning objects. A number of experiments have also been carried out on virtual social networks based on data mining. [19] explored the cognitive potential of Facebook and the possibility of using this platform as a pedagogical tool. [20] applied data mining to Facebook to study the personality of 20,000 users and presented behavior and interaction patterns based on their profiles and publications on this network. [21] used this technique to discover the use patterns of virtual social networks and their relationship with users' decisions, focusing especially on the possibilities of using these networks as educational tools; their study was based on young people from Turkey and examined the time of use and the frequency with which these young people accessed Facebook, as well as the contributions of this network in educational environments. This is the context of the present research, which seeks to know the connection habits of university students in relation to their favorite virtual social networks. The relevance of this proposal is that, through the application of data mining techniques, it is possible to detect several trends that can have a positive impact on teaching and learning processes. An example would be the creation of optional subjects for students and teachers oriented to the use of this type of network from an academic perspective. Another example is the characterization of the use of mobile
devices by students of the Social Sciences, Economic-Administrative and Humanities campus, which can help determine whether the technological infrastructure of the campus is adequate. A final aspect of interest is that it makes it possible to know the habits of using virtual social networks at specific times from a gender perspective [22, 23].

3 Method

The information presented in this work was obtained through the application of an exploratory instrument designed to capture students' habits of use and consumption of virtual social networks. To this end, the questionnaire asked the participants, among other questions, about their favorite networks, the devices they used to connect and the locations from which they did so. A total of 1457 questionnaires were collected from an estimated sample of 952 participants. All of them were studying at some faculty of the Social Sciences, Economic-Administrative and Humanities campus, whose population was 12536 undergraduate students for the 2017–2018 school year. For the calculation of the sample, MACStats version 2.5 was used, considering an Alpha value of 0.5 and values of p and q equal to 0.5, which resulted in a sample size of 952; as could be observed, in the process of applying the questionnaires, the programmed sample was exceeded by far [24, 25]. In order to carry out the data mining process, a database was created with 13 attributes, among which general information was considered, such as the sex of the participants, their faculty of origin, and the frequency with which they connect to the Internet from places such as home or school. Also included was the frequency of connection from different devices, such as smartphones, tablets and laptops or desktops. The networks selected for the data mining process were the most popular: Facebook, Twitter, YouTube, Instagram, Google+ and the WhatsApp instant messaging service, which was pointed out repeatedly by the informants and was therefore included among the selected attributes. For the analysis, the WEKA (Waikato Environment for Knowledge Analysis) software was used, an open-source visual tool developed by researchers at the University of Waikato in New Zealand. WEKA is developed in the Java programming language and operates in different operating system environments. In this tool, the SimpleKMeans grouping algorithm and the Apriori and Predictive Apriori association algorithms were applied. With this attribute base, based on the application of the mentioned algorithms, a series of grouping and association patterns was obtained, which will be explained below [26, 27].

4 Results

It was previously noted that the Alexa website positioned Facebook as the most popular network in the world. The data collected locally coincide with this assertion, but an unexpected fact emerged: the popularity of WhatsApp, which is not a network itself but a digital instant messaging service that allows creating specific groups of contacts, thus defined as networks. For this reason, it was not included in the original design of the questionnaire, but in the pilot study the volunteer students commented
that it should be included, as they not only used it but also conceived of it as a virtual social network. Given this, the results obtained were that 27% of the sample ranked WhatsApp as the network with the highest connection rate, followed by Facebook with 25%, YouTube with 16%, and Twitter with 11% (see Fig. 1).

Fig. 1. Virtual social networks with the highest connection frequency, as reported by the informants.

For their part, Tumblr and Skype barely reach 1%, and LinkedIn closes with zero. The latter is not used by the volunteers because it is a network with a professional focus, i.e. it seeks contacts with more specific interests such as work and professional updating, which may not appeal to students who are just beginning their professional career. Regarding the uses and habits of Internet consumption, it stands out that 84% of students connect from home always or frequently, while 28% do so from school. With respect to the time of connection to the Internet from some device, 46% of the students assert that they dedicate from one to six hours to it. Fifty-seven percent reported that they usually connect from a computer, and 78% always or frequently connect from a smartphone. So far, only findings based on descriptive statistics have been presented, i.e. general data provided by the informants. Next, based on the analysis carried out with data mining, a set of behaviors and connection habits will be presented, as well as the preferred virtual social networks. Below are the results obtained from the implementation of various data mining techniques using the WEKA software on the database generated by the authors. First, we experimented with the generation of student groups. Subsequently, association rules were extracted based on the uses and consumption of virtual social networks.

4.1 Grouping Algorithm

Group generation is one of the most commonly used techniques in data mining. This work experimented with the SimpleKMeans algorithm [28], which has been used successfully in various studies [29]. First, we experimented with the generation of two groups of students, in which gender, faculty, frequency of connection at school,
frequency of connection at home and the YouTube virtual social network stand out as mutually exclusive attributes (see Table 1). The values reported in Table 1 point to clear gender differences and the broad presence of the Faculties of Law and Accounting and Administration in the population that participated in the experiment.

Table 1. Generation of two student groups with the SimpleKMeans algorithm using the WEKA program.

Attribute                                                      | Group 1 (476)  | Group 2 (476)
Gender                                                         | Feminine       | Masculine
Faculty                                                        | Faculty of Law | Faculty of Accounting and Administration
Connection frequency in the faculty                            | Never          | Sometimes
Connection frequency at home from a laptop or desktop computer | Always         | Frequently
YouTube virtual social network                                 | No             | Yes

In general, both men and women show little connection to the Internet from the faculty. This behavior is due, in part, to the restrictions on connecting to the wireless network from mobile devices within the campus of Social Sciences, Economic-Administrative Sciences and Humanities. Another important factor is that most social science subjects do not require Internet connections during classes. As a result, most students do not bring their laptop to school every day. On the contrary, the connection frequency at home by means of a laptop or desktop computer is permanent among women and only frequent among men. Finally, the presence of the YouTube video platform stands out: the men who participated in the experiment consider it one of their favorites. Subsequently, three groups of students were generated, in which the mutually exclusive attributes are the frequency of connection at school and the frequency of connection at home using a laptop or desktop computer (Table 2). The values reported in Table 2 identify that, in general, students connect little at school. There is even a group of 324 students who stated that they never connect there. As for the frequency of connection at home by means of a laptop or desktop computer, most of the students who participated in the experiment do so always or frequently.

Table 2. Generation of three groups of students with the SimpleKMeans algorithm using the WEKA program.

Attribute                                                     | Group 1 (324) | Group 2 (498) | Group 3 (130)
Frequency of connection at the faculty                        | Never         | Sometimes     | Rarely
Frequency of connection at home by laptop or desktop computer | Always        | Frequently    | Never
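As a rough illustration of this grouping step, the sketch below reproduces an analogous k-means clustering in Python with scikit-learn rather than WEKA's SimpleKMeans; the survey rows and the ordinal encoding of the frequency scale are assumptions made for the example.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey rows using the ordinal attributes of Tables 1-2.
df = pd.DataFrame({
    "connection_at_faculty": ["Never", "Sometimes", "Rarely", "Never"],
    "connection_at_home":    ["Always", "Frequently", "Never", "Always"],
})
order = [["Never", "Rarely", "Sometimes", "Frequently", "Always"]] * 2
X = OrdinalEncoder(categories=order).fit_transform(df)

# k = 3 mirrors the three-group experiment of Table 2.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)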

4.2 The Association Algorithms

The generation of association rules is another data mining technique. In this work, two algorithms were applied to study the behavior of informants with respect to their preferences for one network or another. The first algorithm used was Predictive Apriori, from which 20 rules were retrieved whose alpha reliability was 0.99. In the present study, three of them are described, chosen for their level of confidence and depth. Among the results, the great popularity of the WhatsApp instant messaging service stands out in relation to the different places and devices of connection, as well as the networks indicated as preferred (Table 3). The results in Table 3 show WhatsApp's outstanding presence among user predilections, an aspect that will be developed below.

Table 3. Example of three rules generated through the application of the Predictive Apriori algorithm.

Rule generated through WEKA | Interpretation in neutral language | Confidence index
Device_Smartphone = Always, Favorite_Facebook = Yes, Favorite_Twitter = Yes, Favorite_Google+ = No (67) ==> Favorite_WhatsApp = Yes (67) | If the user always accesses the Internet from a smartphone and Facebook and Twitter are his favorite networks, but not Google+, then he has a preference for WhatsApp | Conf.: 0.99452
Gender = Feminine, Device_Smartphone = Frequently, Favorite_Facebook = Yes, Favorite_Instagram = Yes (20) ==> Favorite_WhatsApp = Yes (20) | If they belong to the female gender, frequently connect to the Internet from a smartphone and among their favorite VSN are Facebook and Instagram, then they also have a predilection for WhatsApp | Conf.: 0.99987
Device_Tablet = Always, Favorite_Facebook = Yes, Favorite_Instagram = Yes (49) ==> Favorite_WhatsApp = Yes (49) | If the user always accesses the Internet from a tablet and their favorite networks include Facebook and Instagram, then WhatsApp is also among their favorites | Conf.: 0.99347

The second algorithm used was Apriori, which yielded 20 association rules whose alphas also reached 0.99 reliability. Unlike the results produced by Predictive Apriori, these rules show a greater presence of the Facebook social network in the sketched patterns (Table 4). As in the previous case, some of the most outstanding rules resulting from the Apriori algorithm are explained below. Table 4 shows WhatsApp's great impact among the studied population, even above the most popular virtual social network in the world, Facebook. This result highlights a couple of ideas: the students' indifference to the distinction between a virtual social network and an instant messaging service, and the great power of penetration of technologies such as the smartphone in the daily lives of individuals. Although WhatsApp currently has computer applications, its main device continues to be the smartphone.

Table 4. Example of three rules generated through the Apriori algorithm.

Rule generated by WEKA | Interpretation in neutral language | Confidence index
Connection_Home = Always, Favorite_Facebook = Yes, Favorite_YouTube = Yes (199) ==> Favorite_WhatsApp = Yes (184) | If the user always accesses the Internet from home and prefers WhatsApp and YouTube, then Facebook is also one of their favorite VSN | Conf.: 0.91
Favorite_YouTube = Yes, Favorite_WhatsApp = Yes, Favorite_Instagram = No (177) ==> Favorite_Facebook = Yes (163) | If YouTube and WhatsApp are among the user's predilections, but not Instagram, then Facebook stands out among their favorite networks, too | Conf.: 0.91
Gender = Female, Favorite_Facebook = Yes (233) ==> Favorite_WhatsApp = Yes (210) | If they belong to the female gender and their favorite network is Facebook, then they also have a predilection for WhatsApp | Conf.: 0.89
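For illustration, association rules of this kind can be mined in Python with the mlxtend library instead of WEKA's Apriori, as in the following sketch; the one-hot survey data shown is hypothetical.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot survey data in the same attribute style as Table 4.
df = pd.DataFrame({
    "Favorite_Facebook": [1, 1, 0, 1, 1],
    "Favorite_WhatsApp": [1, 1, 1, 1, 0],
    "Favorite_YouTube":  [1, 0, 1, 1, 0],
}).astype(bool)

frequent = apriori(df, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "confidence"]])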

5 Conclusions

Among the main results, there are clear trends in connection habits, the preference for certain virtual social networks and the main devices. In the application of the SimpleKMeans algorithm, the generation of groups with Internet connection habits from a laptop or desktop computer in the "always" and "frequently" ranges stands out. Also, YouTube appears as the favorite virtual social network of a particular group. With the same algorithm, three groups of students were found: one that never connects to the Internet from school, another that does so sometimes, and another that does so rarely. This means that the school is one of the least frequent places to connect to the Internet, which contrasts with the results for connection from home. In the latter case, three groups are observed: one that always connects, another that does so frequently, and a third that never does. This implies an important socioeconomic condition, since it could mean the existence of a group of individuals whose purchasing power is sufficient to contract a home Internet service and another group that cannot afford it. With respect to the results of the association algorithms (Predictive Apriori and Apriori), the most popular networks are Facebook, YouTube and Twitter, but students recognize the WhatsApp instant messaging service as a virtual social network, and it occupies an important place among users' preferences. This result reveals an important idea: if a particular technology is useful and user-friendly for the daily communication of individuals, they will use it, regardless of whether it is a network or an instant messaging service, as happens in the case of WhatsApp. In general terms, with respect to the findings obtained through the application of data mining, the outlined rules suggest important conditions. With regard to
connection devices, it can be observed that, despite the great acceptance of the smartphone in everyday life, the laptop or desktop computer continues to be the device that university students prefer for connecting to the Internet.

References

1. Romero, C., Ventura, S.: Educational data mining: a survey from 1995 to 2005. Expert Syst. Appl. 33(1), 135–146 (2007)
2. Romero, C., Ventura, S.: Educational data mining: a review of the state of the art. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 40(6), 601–618 (2010)
3. Scheffer, T.: Finding association rules that trade support optimally against confidence. Intell. Data Anal. 9(4), 381–395 (2004)
4. Hernández, J.A., Burlak, G., Muñoz Arteaga, J., Ochoa, A.: Propuesta para la evaluación de objetos de aprendizaje desde una perspectiva integral usando minería de datos. In: Hernández, A., Zechinelli, J. (eds.) Avances en la ciencia de la computación, pp. 382–387. Universidad Autónoma de México, México (2006)
5. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Statistics, pp. 281–297. University of California, Berkeley (1967)
6. Martínez, G.: Minería de datos. Cómo hallar una aguja en un pajar. Ingenierías XIV(53), 53–66 (2001)
7. Medina, R., Cortés, R.: El MSN como medio de comunicación y socialización entre los jóvenes de Motul, Yucatán. In: Cortés, R. (ed.) Comunicación y juventud en Yucatán. Ediciones de la Universidad Autónoma de Yucatán, Mérida (2010)
8. Tsiniduo, M., et al.: Evaluation of the factors that determine quality in higher education: an empirical study. Qual. Assur. Educ. 18, 227–244 (2010)
9. Gonzalez Espinoza, O.: Quality of higher education: concepts and models. Cal. Sup. Educ. 28, 249–296 (2008)
10. Bonerge Pineda Lezama, O., Varela Izquierdo, N., Pérez Fernández, D., Gómez Dorta, R.L., Viloria, A., Romero Marín, L.: Models of multivariate regression for labor accidents in different production sectors: comparative study. In: Tan, Y., Shi, Y., Tang, Q. (eds.) DMBD 2018. LNCS, vol. 10943. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93803-5_5
11. Izquierdo, N.V., Lezama, O.B.P., Dorta, R.G., Viloria, A., Deras, I., Hernández-Fernández, L.: Fuzzy logic applied to the performance evaluation. Honduran coffee sector case. In: Tan, Y., Shi, Y., Tang, Q. (eds.) ICSI 2018. LNCS, vol. 10942, pp. 164–173. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93818-9_16
12. Pineda Lezama, O., Gómez Dorta, R.: Techniques of multivariate statistical analysis: an application for the Honduran banking sector. Innovare: J. Sci. Technol. 5(2), 61–75 (2017)
13. Viloria, A., Lis-Gutiérrez, J.P., Gaitán-Angulo, M., Godoy, A.R.M., Moreno, G.C., Kamatkar, S.J.: Methodology for the design of a student pattern recognition tool to facilitate the teaching-learning process through knowledge data discovery (big data). In: Tan, Y., Shi, Y., Tang, Q. (eds.) DMBD 2018. LNCS, vol. 10943. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93803-5_63
14. Chase, R.B., et al.: Administración de operaciones: producción y cadena de suministros. McGraw-Hill/Interamericana Editores, Bogota (2009)
15. Chen, T.-L., Su, C.-H., Cheng, C.-H., Chiang, H.-H.: A novel price-pattern detection method based on time series to forecast stock market. Afr. J. Bus. Manag. 5(13), 5188 (2011)
16. Conejo, A.J., Contreras, J., Espinola, R., Plazas, M.A.: Forecasting electricity prices for a day-ahead pool-based electric energy market. Int. J. Forecast. 21(3), 435–462 (2005)
17. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
18. Du, X.F., Leung, S.C.H., Zhang, J.L., Lai, K.K.: Demand forecasting of perishable farm products using support vector machine. Int. J. Syst. Sci. 44(3), 556–567 (2011)
19. Matich, D.J.: Redes Neuronales: Conceptos básicos y aplicaciones. Cátedra de Informática Aplicada a la Ingeniería de Procesos–Orientación I (2001)
20. Mercado, D., Pedraza, L., Martínez, E.: Comparación de Redes Neuronales aplicadas a la predicción de Series de Tiempo. Prospectiva 13(2), 88–95 (2015)
21. Wen, Q., Mu, W., Sun, L., Hua, S., Zhou, Z.: Daily sales forecasting for grapes by support vector machine. In: Li, D., Chen, Y. (eds.) CCTA 2013. IAICT, vol. 420, pp. 351–360. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54341-8_37
22. Cáceres, M., Ruiz, J., Brändle, G.: Comunicación interpersonal y vida cotidiana. La presentación de la identidad de los jóvenes en Internet. CIC: Cuadernos de Información y Comunicación 14, 213–231 (2009)
23. Cancelo, M., Almansa, A.: Estrategias comunicativas en redes sociales. Estudio comparativo entre las universidades de España y México. Historia y Comunicación Social 18, 423–435 (2013)
24. Castells, M.: La era de la información: economía, sociedad y cultura, vol. 3. Siglo XXI Editores, México (2004)
25. Cortés, R.: Interacción en Redes Sociales Virtuales entre estudiantes de Licenciatura. Una aproximación con fines pedagógicos. Revista Iberoamericana de Producción Académica y Gestión Educativa (1) (2015). http://www.pag.org.mx/index.php/PAG/article/view/107/155
26. Cortés, R., Canto, P.: Usos de la red social Facebook entre estudiantes universitarios. In: Prieto, M.E., Pech, S.J., Pérez, A. (eds.) Tecnologías y aprendizaje. Avances en Iberoamérica, vol. 1, pp. 351–358. Editorial de la Universidad Tecnológica de Cancún, Cancún (2013)
27. Wu, Q., Yan, H.S., Yang, H.B.: A forecasting model based on support vector machine and particle swarm optimization. In: 2008 Workshop on Power Electronics and Intelligent Transportation System, pp. 218–222 (2008)
28. Zhang, G.P.: Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50(Suppl. C), 159–175 (2003)
29. Viloria, A., Gaitan-Angulo, M.: Statistical adjustment module advanced optimizer planner and SAP generated the case of a food production company. Indian J. Sci. Technol. 9(47) (2016). https://doi.org/10.17485/ijst/2016/v9i47/107371

DHS-Voting: A Distributed Homomorphic Signcryption E-Voting

Xingyue Fan, Ting Wu, Qiuhua Zheng, Yuanfang Chen, and Xiaodong Xiao

School of Cyberspace, Hangzhou Dianzi University, Hangzhou, China
{fanxingyue,wuting,171270014}@hdu.edu.cn

Abstract. In electronic voting, voters encrypt their ballots and post them on a bulletin board, where the results are tallied by authorities. The right to tally is fully controlled by the authorities, who generate a certificate to prove that the results are correct, but attempts by the authorities to tamper with the bulletin board and falsify the certificate are inevitable. Therefore, we propose an electronic voting scheme based on distributed encryption and homomorphic signcryption, DHS-Voting, which not only enables the verification of signatures to be completed quickly, but also allows anyone to obtain the result of the election with the help of the authorities, making the election results more credible. In addition, we prove in detail that DHS-Voting satisfies ballot privacy and some necessary properties of electronic voting.

Keywords: Electronic voting · Homomorphic encryption · Signcryption · Distributed encryption · Privacy

1 Introduction

Electronic voting (e-voting) as mentioned in this paper refers to online voting, which allows voters to sign and encrypt their ballots on their personal computers or mobile devices, and then put them on a bulletin board. After all the hidden ballots have been verified and tallied, the final results are obtained and released. Everything on the bulletin board is transparent and visible except the secret keys of voters and authorities [1]. E-voting has been studied by many scholars for its good characteristics of protecting voters' privacy. The development of e-voting is mainly divided into the following stages: mix-nets [2–4], blind signatures [5–7] and homomorphic encryption [8–12]. Homomorphic encryption is the most widely used in the design of e-voting schemes because it allows ballots to be counted without decryption [13,14], and ElGamal [15] is the most widely used cryptosystem, because it satisfies additive homomorphism and is well practiced in Helios [8] (e.g., Norway, Australia and France used Helios to vote). In an e-voting scheme with homomorphic encryption, the voters need to sign their own ballots to prove that they really voted by themselves, which
increases the number of operations the voter performs on the ballot. A homomorphic signcryption algorithm can solve this problem. Zhang et al. proposed the first homomorphic signcryption and applied it to an e-voting scheme [16]. However, that scheme's security has not been proven properly: the simulator only simulates the encryption part without considering signature verification. Moreover, it has only one authority to tally; the others have only the power to verify. A homomorphic signcryption algorithm is a combination of homomorphism and signcryption, which can not only complete encryption and signature at the same time, but also compute on the ciphertext, so that the decryption results correspond to the results of computing on the plaintext. The homomorphic signcryption algorithm satisfies:

dhs(f(hs(x), hs(y))) = g(x, y),
vhs(f(hs(x), hs(y))) = vhs(x) && vhs(y),

where
– f() and g() are two functions,
– hs() is a homomorphic signcryption algorithm,
– dhs() is a decryption algorithm,
– vhs() is an algorithm for verifying signatures.

We mainly study electronic voting based on homomorphic signcryption, and further optimize it through distributed encryption, resulting in the DHS-Voting scheme. Figure 1 illustrates the design of the DHS-Voting scheme. The entities are:

Fig. 1. DHS-voting scheme.

– $V_1, \cdots, V_{N_V}$: $N_V$ eligible voters cast their hidden ballots $(b_1, \cdots, b_{N_V})$ in an election.
– $A_1, \cdots, A_{N_A}$: $N_A$ authorities are responsible for generating the voting key $(PK_1, \cdots, PK_{N_A})$ and the partial decryption keys $(\overline{PK}_1, \cdots, \overline{PK}_{N_A})$.
– BB: a bulletin board responsible for publishing information and for preliminary processing of ballots.


The most important attribute of e-voting is to protect the privacy of voters. The DHS-voting scheme proposed in this paper ensures the necessary properties listed below [18–20]:

– Eligibility: only registered voters can submit their ballots.
– Uniqueness: each voter can cast only one ballot.
– Transparency: the bulletin board is open to all, not just to voters.
– Correctness: all qualified ballots are included in the election result.
– Privacy: no adversary can obtain any hidden information from BB other than what is voluntarily disclosed.
– Consistency: the election results must correspond to the ballots cast by the voters.

The contributions of this paper are as follows. (i) We propose an electronic voting scheme based on distributed encryption and homomorphic signcryption, DHS-voting, which not only enables signatures to be verified quickly, but also allows everyone to tally the ballots, making the election results more credible. (ii) We give proofs that DHS-voting satisfies ballot privacy (BPRIV, proposed by Bernhard et al. [17]) and other necessary properties of e-voting.

2 Preliminaries of Cryptography

2.1 Homomorphic Signcryption

The proposed homomorphic signcryption (HS) algorithm consists of seven procedures [21]: HS = (Setup, KeyGenR, KeyGenS, Signcrypt, Unsigncrypt, Verify, Evaluate).

– $params \leftarrow Setup(1^\lambda)$ ($\lambda$ is a security parameter). Takes $\lambda$ as input and outputs $params = (G, g, q)$ ($G$: a cyclic group of order $q$ with generator $g$).
– $(SK, PK) \leftarrow KeyGenR(params)$. Takes $params$ as input and outputs the key pair of the receiver $(SK, PK)$ by $SK = (x_0, x_1, x_2) \leftarrow Z_q^3$, $PK = (y_0, y_1, y_2) = (g^{x_0}, g^{x_1}, g^{x_2})$.
– $(sk, pk) \leftarrow KeyGenS(params)$. Takes $params$ as input and outputs the key pair of the sender $(sk, pk)$ by $sk = w \leftarrow Z_q$, $pk = h = g^w$.
– $c \leftarrow Signcrypt(params, PK, sk, m)$. Given a random value $r \leftarrow Z_q$, the $params$, and a plaintext $m$, the ciphertext $c$ is computed by $c = (c_0, c_1, c_2) = (g^r,\, g^m y_0^r,\, y_1^{w+m} y_2^r)$.
– $m \leftarrow Unsigncrypt(params, SK, c)$. Given the $params$, $SK$, and $c$, the plaintext $m$ is computed by
$$m = \log_g \frac{c_1}{c_0^{x_0}}.$$
– $0/1 \leftarrow Verify(params, pk, SK, m, c)$. Takes $params$, $c$, $m$, $pk$, and $SK$ as input and outputs 1 if $c_2 = pk^{x_1} y_1^m c_0^{x_2}$. Otherwise, the plaintext $m$ is rejected and it outputs 0.
– $C \leftarrow Evaluate(c_1, \cdots, c_n, pk_1, \cdots, pk_n)$. Takes each sender's $pk_i$ and $c_i$ ($n$ is the number of senders) and outputs $C = (C_0, C_1, C_2)$ by
$$pk = \prod_{i=1}^{n} pk_i,\quad C_0 = \prod_{i=1}^{n} c_{i,0},\quad C_1 = \prod_{i=1}^{n} c_{i,1},\quad C_2 = \prod_{i=1}^{n} c_{i,2}.$$

Then, we get the sum of all plaintexts using $Unsigncrypt$ above: $Unsigncrypt(C) = \sum_{i=1}^{n} m_i = m$.
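To make the seven procedures concrete, the following is a minimal Python sketch of Signcrypt, Unsigncrypt, Verify, and Evaluate over a toy Schnorr group. The group parameters, the brute-force discrete logarithm used for message recovery, and all identifier names are illustrative assumptions, not part of the paper:

```python
import math
import random

# Toy Schnorr group: q | (p - 1), g generates the order-q subgroup
# (assumed small parameters; a real deployment needs large primes).
p, q = 2579, 1289
g = pow(2, (p - 1) // q, p)

def keygen_receiver():
    sk = tuple(random.randrange(q) for _ in range(3))    # SK = (x0, x1, x2)
    return sk, tuple(pow(g, x, p) for x in sk)           # PK = (y0, y1, y2)

def keygen_sender():
    w = random.randrange(q)
    return w, pow(g, w, p)                               # (sk, pk) = (w, g^w)

def signcrypt(PK, w, m):
    y0, y1, y2 = PK
    r = random.randrange(q)
    return (pow(g, r, p),                                  # c0 = g^r
            pow(g, m, p) * pow(y0, r, p) % p,              # c1 = g^m * y0^r
            pow(y1, (w + m) % q, p) * pow(y2, r, p) % p)   # c2 = y1^(w+m) * y2^r

def unsigncrypt(SK, c):
    gm = c[1] * pow(pow(c[0], SK[0], p), p - 2, p) % p   # g^m = c1 / c0^x0
    return next(m for m in range(q) if pow(g, m, p) == gm)  # small-range dlog

def verify(pk, SK, m, c):
    x0, x1, x2 = SK
    y1 = pow(g, x1, p)
    # accept iff c2 == pk^x1 * y1^m * c0^x2
    return c[2] == pow(pk, x1, p) * pow(y1, m, p) * pow(c[0], x2, p) % p

def evaluate(cs, pks):
    pk = math.prod(pks) % p
    C = tuple(math.prod(c[j] for c in cs) % p for j in range(3))
    return C, pk

SK, PK = keygen_receiver()
senders = [keygen_sender() for _ in range(3)]
msgs = [1, 0, 1]
cs = [signcrypt(PK, w, m) for (w, _), m in zip(senders, msgs)]
C, pk = evaluate(cs, [h for _, h in senders])
assert unsigncrypt(SK, C) == sum(msgs)   # additive homomorphism
assert verify(pk, SK, sum(msgs), C)      # aggregate signature verifies
```

The final two assertions exercise exactly the property the paper relies on: the evaluated ciphertext decrypts to the sum of the plaintexts, and the aggregate signature verifies against the product of the senders' public keys.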

2.2 Security of Homomorphic Signcryption

The security of HS covers both the encryption algorithm and the signature algorithm, i.e., IND-CPA security and weak unforgeability.

Theorem 1. The probability that a PPT algorithm $A$ breaks the IND-CPA security of HS is bounded by $Adv_D^{DDH}$ (cf. Appendix A); therefore, HS is IND-CPA secure:
$$Adv_A^{IND\text{-}CPA} \le 2 \cdot Adv_D^{DDH}. \tag{1}$$

Theorem 2. The probability that a PPT algorithm $F$ breaks the weak unforgeability of HS is bounded by $Adv_C^{CDH}$ (cf. Appendix A); therefore, HS is weakly unforgeable:
$$Adv_F^{WUF} \le Adv_C^{CDH}. \tag{2}$$

The proofs of Theorems 1 and 2 are in Appendix A.

2.3 Distributed Homomorphic Encryption

We assume there are $l$ receivers, each of which generates its own key pair $(SK_i, PK_i)$ according to the algorithm KeyGenR: $SK_i = (x_{i,0}, x_{i,1}, x_{i,2}) \leftarrow Z_q^3$, $PK_i = (y_{i,0}, y_{i,1}, y_{i,2}) = (g^{x_{i,0}}, g^{x_{i,1}}, g^{x_{i,2}})$. Then the encryption public key is:
$$PK = (y_0, y_1, y_2) = \prod_{i=1}^{l} PK_i = \left(\prod_{i=1}^{l} y_{i,0},\ \prod_{i=1}^{l} y_{i,1},\ \prod_{i=1}^{l} y_{i,2}\right). \tag{3}$$

For the ciphertext $c = (c_0, c_1, c_2)$ generated by the algorithm Signcrypt, the decryption process requires the participation of all receivers. Each receiver computes $c_0^{x_{i,0}}$, and the plaintext $m$ can be obtained from the following formula:
$$\frac{c_1}{\prod_{i=1}^{l} c_0^{x_{i,0}}} = \frac{c_1}{c_0^{\sum_{i=1}^{l} x_{i,0}}} = g^m.$$
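A short continuation of the previous sketch shows this distributed variant: the $l$ receivers' public keys are multiplied componentwise as in Formula (3), and decryption combines one share $c_0^{x_{i,0}}$ from each receiver. All names and parameters are again illustrative, and the helpers come from the earlier sketch:

```python
import math

# Reuses p, q, g, keygen_receiver, keygen_sender, signcrypt from above.
l = 3
receivers = [keygen_receiver() for _ in range(l)]
PK = tuple(math.prod(pk[j] for _, pk in receivers) % p for j in range(3))

w, _ = keygen_sender()
m = 5
c = signcrypt(PK, w, m)

shares = [pow(c[0], sk[0], p) for sk, _ in receivers]   # each c0^{x_{i,0}}
gm = c[1] * pow(math.prod(shares) % p, p - 2, p) % p    # c1 / prod(shares)
assert gm == pow(g, m, p)                               # joint decryption works
```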

2.4 Proof of Partial Knowledge

The proof of partial knowledge (PPK) is an interactive proof protocol between a prover and a verifier inspired by disjunctive Chaum–Pedersen proofs [23]. Given $G$, $q$, and $g$, PPK is described as follows.

Prover:
– $t, a_1, d_1 \leftarrow Z_q$;
– signcrypt $m_0$ to get the ciphertext $c = (c_0, c_1)$ (cf. Sect. 2.1);
– $P = (P_0, P_1, P_2) = \left(g^t,\ y_0^t,\ (g^{m_1 \cdot a_1}) \cdot y_0^{d_1} / c_1^{a_1}\right)$;
– $a = hash(c_0 \parallel c_1 \parallel P_0 \parallel P_1 \parallel P_2)$, where $hash()$ denotes a hash function, e.g., MD5;
– $a_0 = a \oplus a_1$, where $\oplus$ denotes XOR;
– $d_0 = r \cdot a_0 + t$.

Then the prover sends $c, P, a_0, a_1, d_0, d_1$ to the verifier.

Verifier:
– $a_0 \oplus a_1 = hash(c_0 \parallel c_1 \parallel P_0 \parallel P_1 \parallel P_2)$;
– $g^{d_0} = P_0 \cdot c_0^{a_0}$;
– $y_0^{d_0} = P_1 \cdot (c_1 / g^{m_0})^{a_0}$;
– $y_0^{d_1} = P_2 \cdot (c_1 / g^{m_1})^{a_1}$.

If all the equations hold, the verifier believes that the ciphertext $c$ is indeed an encryption of $m_0$ or $m_1$, but cannot tell which one is the true one.
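A Python rendering of this proof, on the same toy group as before, may make the flow clearer. It follows the equations above for the case where the real plaintext is $m_0$ (when the real vote is $m_1$, the roles of the two branches swap); SHA-256 stands in for the generic $hash()$, and all helper names are assumptions:

```python
import hashlib
import random

# Reuses the toy group (p, q, g) from the earlier sketches.
def H(*vals):
    data = "|".join(str(v) for v in vals).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

def ppk_prove(y0, r, c0, c1, m1):
    """Prove (c0, c1) encrypts m0, simulating the m1 branch."""
    t, a1, d1 = (random.randrange(q) for _ in range(3))
    P0, P1 = pow(g, t, p), pow(y0, t, p)
    # P2 = g^{m1*a1} * y0^{d1} / c1^{a1}   (simulated branch)
    P2 = pow(g, m1 * a1, p) * pow(y0, d1, p) * pow(pow(c1, a1, p), p - 2, p) % p
    a = H(c0, c1, P0, P1, P2)
    a0 = a ^ a1                        # split the challenge by XOR
    d0 = (r * a0 + t) % q
    return (P0, P1, P2), a0, a1, d0, d1

def ppk_verify(y0, c0, c1, m0, m1, P, a0, a1, d0, d1):
    P0, P1, P2 = P
    inv = lambda x: pow(x, p - 2, p)   # modular inverse (p prime)
    return (a0 ^ a1 == H(c0, c1, P0, P1, P2)
            and pow(g, d0, p) == P0 * pow(c0, a0, p) % p
            and pow(y0, d0, p) == P1 * pow(c1 * inv(pow(g, m0, p)) % p, a0, p) % p
            and pow(y0, d1, p) == P2 * pow(c1 * inv(pow(g, m1, p)) % p, a1, p) % p)

# Encrypt m0 = 0 (a "no" vote) and prove it hides 0 or 1.
x0 = random.randrange(q); y0 = pow(g, x0, p)
r = random.randrange(q)
c0, c1 = pow(g, r, p), pow(y0, r, p)   # (g^r, g^0 * y0^r)
proof = ppk_prove(y0, r, c0, c1, 1)
assert ppk_verify(y0, c0, c1, 0, 1, *proof)
```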

2.5 Zero Knowledge Proof

The zero knowledge proof (ZKP) is an interactive proof protocol between a prover and a verifier inspired by Schnorr proofs [22]. Given $G$, $q$, $g$, and $h$ ($G = \langle g \rangle = \langle h \rangle$), the prover convinces the verifier that $A = g^x$ and $B = h^x$ have the same exponent $x$.

Prover:
– chooses $t \leftarrow Z_q$;
– computes $T_1 = g^t$, $T_2 = h^t$;
– $T = hash(T_1 \parallel T_2)$;
– $s = x \cdot T + t$.

Then the prover sends $A, B, T_1, T_2, s, T$ to the verifier.

Verifier:
– computes $T = hash(T_1 \parallel T_2)$;
– verifies $g^s = A^T \cdot T_1$ and $h^s = B^T \cdot T_2$.

If all the equations hold, the verifier believes that $A$ and $B$ have the same exponent, but cannot determine its exact value.
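This equal-exponent proof translates to a few lines of Python; reusing the toy group and the H() helper from the sketches above, with h an assumed second generator:

```python
import random

h = pow(g, 17, p)                      # an assumed second generator of <g>

def zkp_prove(x):
    t = random.randrange(q)
    T1, T2 = pow(g, t, p), pow(h, t, p)
    T = H(T1, T2)                      # Fiat-Shamir challenge
    s = (x * T + t) % q
    return pow(g, x, p), pow(h, x, p), T1, T2, s, T

def zkp_verify(A, B, T1, T2, s, T):
    if T != H(T1, T2):                 # recompute and match the challenge
        return False
    return (pow(g, s, p) == pow(A, T, p) * T1 % p
            and pow(h, s, p) == pow(B, T, p) * T2 % p)

assert zkp_verify(*zkp_prove(random.randrange(q)))
```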

3 Preliminaries of E-Voting

The DHS-voting scheme is a tuple (Setup, ASetup, VSetup, Vote, Tally, ATally, Verify) of algorithms. The voting process involves an authority center, authorities $(A_1, \cdots, A_{N_A})$, voters $(V_1, \cdots, V_{N_V})$, and BB.


– Setup generates trusted public parameters $params$.
– ASetup is used by $A_j$ to generate his/her own key pair $(SK_j, PK_j)$; the voting key $PK$ is calculated by the BB.
– VSetup is used by $V_i$ to generate his/her own key pair $(sk_i, pk_i)$.
– Vote is used by $V_i$ to cast his/her own ballot.
– Tally is used by BB to test the eligibility of ballots and to count all eligible ballots.
– ATally is used by $A_j$ to generate his/her own partial decryption key.
– Verify is used by anyone to obtain and verify the election results.

3.1 Ballot Privacy

Definition 1 (BPRIV) (Definition 7 in [17]). Consider a voting scheme $V = (Setup, ASetup, VSetup, Vote, Tally, ATally, Verify)$. We say the scheme has ballot privacy if the advantage of $A$ in distinguishing between the games $Exp_{A,V}^{bpriv,0}$ and $Exp_{A,V}^{bpriv,1}$,
$$\left| \Pr[Exp_{A,V}^{bpriv,0}(\lambda) = 1] - \Pr[Exp_{A,V}^{bpriv,1}(\lambda) = 1] \right|,$$
is negligible in $\lambda$.

Experiment $Exp_{A,V}^{BPRIV,\beta}(\lambda)$:
$(pk, sk) \leftarrow VSetup$, $f \leftarrow A(Vote)$,
Let $c^0 \leftarrow Vote(id, b^0)$, $c^1 \leftarrow Vote(id, b^1)$,
If the PPK of $b^\beta$ is valid, $BB^0 = BB^0 \parallel b^0$ and $BB^1 = BB^1 \parallel b^1$; else return $\bot$,
If the PPK of $f$ is valid, $BB^0 = BB^0 \parallel f$ and $BB^1 = BB^1 \parallel f$; else return $\bot$,
Return $BB^\beta$.

If $\beta = 0$: $SK, ZKP \leftarrow ATally$, $ballots \leftarrow Verify(SK, ZKP, BB^0)$.

If $\beta = 1$: $SK, ZKP \leftarrow ATally$, $ballots \leftarrow Verify(SK, ZKP, BB^0)$, $SK', ZKP' \leftarrow Sim(ballots, BB^1)$, where $Sim()$ can simulate $SK$ and $ZKP$.

Return $ballots$.

3.2 Ballot Consistency

Definition 2 (Strong Consistency). Consider a voting scheme $V = (Setup, ASetup, VSetup, Vote, Tally, ATally, Verify)$. We say the scheme has strong consistency if the advantage of $A$,
$$Adv_A^{cons} = \Pr[Exp_A^{cons}(\lambda) = 1],$$
is negligible in $\lambda$.


Experiment $Exp_A^{cons}(\lambda)$:
$params \leftarrow Setup$,
$(PK_1, \cdots, PK_{N_A}, PK) \leftarrow ASetup$,
$((id_1, b_1), \cdots, (id_{N_V}, b_{N_V})) \leftarrow Vote$,
$SK, ZKP \leftarrow ATally$,
$ballots \leftarrow Verify(SK, ZKP, BB)$,
If $ballots \neq Extract(b_1) + \cdots + Extract(b_{N_V})$, return 1; else return 0.

$Extract()$ returns the actual value (e.g., 0 or 1) of each ballot.

4 DHS-Voting Scheme

We consider approval voting, in which voters can only vote "yes" or "no" for each candidate, in the DHS-voting scheme; our scheme also applies to score voting by using the matrix construction of ballots in [9].

4.1 Initialization and Registration

In the initialization phase, the authority center generates the voting public parameters $params$ through the algorithm Setup. Then each $A_j$ generates his/her own key pair $(SK_j, PK_j)$ and publishes $PK_j$. The related algorithm is described as follows:

– $KeyGen(q, g)$ generates the key pair $(SK_j, PK_j)$ for $A_j$:
$$SK_j = (x_{j,0}, x_{j,1}, x_{j,2}) \leftarrow Z_q^3,\qquad PK_j = (y_{j,0}, y_{j,1}, y_{j,2}) = (g^{x_{j,0}}, g^{x_{j,1}}, g^{x_{j,2}}). \tag{4}$$

Then the voting public key $PK$ is generated according to Formula (3). In the registration phase, the authority center generates an ID number $id_i$ for each eligible voter. Each voter uses his/her $id_i$ to vote instead of his/her real identity.

4.2 Generate Ballot

The voter $V_i$ casts a ballot $b_i \in \{0, 1\}$, $i \in \{1, \cdots, N_V\}$ (1 means yes, 0 means no), as described in Algorithm 1. The voter $V_i$ then sends the hidden ballot $c_i$, the PPK of the ballot $PPK_i$, his/her public key $pk_i$, and the ID number $id_i$ to BB.


Algorithm 1. Generation of Ballots and PPK
Input: $b_i$, $PK$, $params$
Output: $V_i$'s hidden ballot $c_i = (c_{i,0}, c_{i,1}, c_{i,2})$; a PPK of $V_i$'s ballot, $PPK_i$
$sk_i = w_i \leftarrow Z_q$, $pk_i = h_i = g^{w_i}$
Signcrypt: $r_i \leftarrow Z_q$
$c_i = (c_{i,0}, c_{i,1}, c_{i,2}) = (g^{r_i},\ g^{b_i} y_0^{r_i},\ y_1^{w_i+b_i} y_2^{r_i})$
PPK: $t_i, a_{i,1}, d_{i,1} \leftarrow Z_q$
$P_i = (P_{i,0}, P_{i,1}, P_{i,2}) = (g^{t_i},\ y_0^{t_i},\ (g^{b_i \cdot a_{i,1}}) \cdot y_0^{d_{i,1}} / c_{i,1}^{a_{i,1}})$
$a_i = hash(c_{i,0} \parallel c_{i,1} \parallel P_{i,0} \parallel P_{i,1} \parallel P_{i,2})$
$a_{i,0} = a_i \oplus a_{i,1}$
$d_{i,0} = r_i \cdot a_{i,0} + t_i$
$PPK_i = \langle c_i, P_i, a_{i,0}, a_{i,1}, d_{i,0}, d_{i,1} \rangle$
return $c_i = (c_{i,0}, c_{i,1}, c_{i,2})$, $PPK_i$

4.3 Vote Tallying

When each ballot is received, BB follows these steps to perform Tally. First, BB verifies the $id_i$ and the eligibility of the ballot using the PPK (Sect. 2.4), i.e., that the hidden ballot encrypts 0 or 1. Eligible ballots are posted on BB, where each voter can check whether his/her ballot has been modified during a reasonable complaint period. When the voting poll closes, BB performs tallying using Evaluate:
$$pk = \prod_{i=1}^{n} pk_i,\qquad C = (C_0, C_1, C_2) = \left(\prod_{i=1}^{n} c_{i,0},\ \prod_{i=1}^{n} c_{i,1},\ \prod_{i=1}^{n} c_{i,2}\right). \tag{5}$$

The decryption and verification procedure must be completed with the full cooperation of all authorities. Each authority $A_j$ computes his/her own partial decryption key and publishes it:
$$\overline{PK}_j = (\overline{PK}_{j,0}, \overline{PK}_{j,1}, \overline{PK}_{j,2}) = (C_0^{x_{j,0}},\ pk^{x_{j,1}},\ C_0^{x_{j,2}}). \tag{6}$$

To prevent these values from being incorrectly calculated, or $A_j$ from behaving maliciously, $A_j$ must generate a ZKP to guarantee the correctness of these values (cf. Sect. 2.5, Prover). After all the authorities have published these values, anyone can get the election results through Algorithm 2.

5 Argumentation of Security for DHS-Voting

As described in Sect. 1, e-voting has more requirements than other types of voting, and these requirements are determined by the cryptographic schemes used by the e-voting system; the most important ones are protecting voters' privacy and ensuring correct election results. Therefore, we discuss the BPRIV, strong consistency, and correctness of DHS-voting in this section.


Algorithm 2. Acquisition and validation of election results
Input: $C$, $pk$, $PK$, $\overline{PK}_j$, $ZKP_j$
Output: $ballots$
$Flag_{ZKP} = 0$
Verify the ZKPs (cf. Sect. 2.5, Verifier):
for $j = 1 \to N_A$ do
  if $ZKP_j$ is verified successfully then
    $Flag_{ZKP} = 1$
  end if
end for
if $Flag_{ZKP}$ then
  $g^{ballots} = \dfrac{C_1}{\prod_{j=1}^{N_A} \overline{PK}_{j,0}}$
end if
if $C_2 = \prod_{j=1}^{N_A} \overline{PK}_{j,1} \cdot y_1^{ballots} \cdot \prod_{j=1}^{N_A} \overline{PK}_{j,2}$ then
  return $ballots$
end if
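For concreteness, the following sketch exercises the whole pipeline of Eq. (5), Eq. (6), and Algorithm 2 end to end on the toy group and helpers from the Sect. 2.1 sketch; the counts $N_A$, $N_V$, the brute-force extraction of the tally from $g^{ballots}$, and all names are illustrative assumptions:

```python
import math
import random

NA, NV = 3, 5
authorities = [keygen_receiver() for _ in range(NA)]          # (SKj, PKj)
PK = tuple(math.prod(pk[j] for _, pk in authorities) % p for j in range(3))

votes = [random.randrange(2) for _ in range(NV)]              # 0/1 ballots
voters = [keygen_sender() for _ in range(NV)]
cs = [signcrypt(PK, w, b) for (w, _), b in zip(voters, votes)]

# Eq. (5): BB aggregates the hidden ballots and the voters' public keys.
pk = math.prod(h for _, h in voters) % p
C = tuple(math.prod(c[j] for c in cs) % p for j in range(3))

# Eq. (6): each authority publishes its partial decryption key.
parts = [(pow(C[0], sk[0], p), pow(pk, sk[1], p), pow(C[0], sk[2], p))
         for sk, _ in authorities]

# Algorithm 2: anyone recovers g^ballots and checks the aggregate signature.
g_tally = C[1] * pow(math.prod(pt[0] for pt in parts) % p, p - 2, p) % p
ballots = next(t for t in range(NV + 1) if pow(g, t, p) == g_tally)
rhs = (math.prod(pt[1] for pt in parts) * pow(PK[1], ballots, p)
       * math.prod(pt[2] for pt in parts)) % p
assert ballots == sum(votes) and C[2] == rhs
```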

5.1 Argumentation of BPRIV

We define a sequence of games from $\beta = 0$ to $\beta = 1$ to simulate the interaction between a BPRIV challenger and an adversary. There is only a subtle difference between adjacent games, which the adversary cannot detect; by this chain of games, we conclude that DHS-voting is BPRIV private.

Game $G_{-1}$. Let this game correspond to Experiment $Exp_{A,V_{DHS}}^{bpriv,\beta}$ for $\beta = 0$ in the BPRIV game.

Game $G_0$. Let this game be a transition game between Game $G_{-1}$ and Game $G_1$, and define Game $G_0$ as a series of games (Game $G_{0,1}, \cdots, G_{0,n}$, where $n$ is the number of entries on $BB^0$).

Game $G_{0,i}$. Game $G_{0,i}$ has two possibilities depending on the choice of the tuple $(id_i, b_i^0, b_i^1, v_i^0, v_i^1)$:
– if $b_i^0 = b_i^1$, no change;
– if $b_i^0 \neq b_i^1$, replace the $i$-th entry $(id_i, b_i^0)$ in $BB^0$ with $(id_i, b_i^1)$.

The probability of an adversary distinguishing between Game $G_{0,i-1}$ and Game $G_{0,i}$ is negligible because of the NM-CPA property of the signcryption scheme in DHS-Voting. The NM-CPA property relies on IND-CPA together with simulation-sound extractable zero-knowledge proofs, which are the Schnorr and Chaum–Pedersen protocols; the relevant proof is given in [24].

Game $G_1$. Rename Game $G_{0,n}$ as Game $G_1$. Then, Game $G_1$ corresponds to Experiment $Exp_{A,V_{DHS}}^{bpriv,\beta}$ for $\beta = 1$ in the BPRIV game. Thus, we conclude that
$$\left| \Pr[Exp_{A,V_{DHS}}^{bpriv,0}(\lambda) = 1] - \Pr[Exp_{A,V_{DHS}}^{bpriv,1}(\lambda) = 1] \right|$$
is negligible in $\lambda$.

5.2 Argumentation of Strong Consistency

To argue that DHS-voting satisfies strong consistency as in Definition 2, two conditions need to be met: first, the homomorphism of HS ensures that the election results are what the voters chose; second, no one can forge a ballot.

Theorem 3. The final election results are what the voters actually cast.

Proof. To see that the signcryption homomorphism holds after Evaluate, consider an evaluated ciphertext $C = (C_0, C_1, C_2)$ and the partial decryption keys $\overline{PK}_j$. Hence:

$$\begin{aligned}
m &= \log_g \frac{C_1}{\prod_{j=1}^{N_A} \overline{PK}_{j,0}}
   = \log_g \frac{\prod_{i=1}^{n} c_{i,1}}{\prod_{j=1}^{N_A} C_0^{x_{j,0}}}
   = \log_g \frac{\prod_{i=1}^{n} c_{i,1}}{C_0^{\sum_{j=1}^{N_A} x_{j,0}}}
   = \log_g \frac{\prod_{i=1}^{n} c_{i,1}}{\left(\prod_{i=1}^{n} c_{i,0}\right)^{x_0}} \\
&= \log_g \frac{\prod_{i=1}^{n} g^{m_i} \cdot y_0^{r_i}}{\left(\prod_{i=1}^{n} g^{r_i}\right)^{x_0}}
   = \log_g \frac{g^{\sum_{i=1}^{n} m_i} \cdot y_0^{\sum_{i=1}^{n} r_i}}{\left(g^{\sum_{i=1}^{n} r_i}\right)^{x_0}}
   = \log_g \frac{g^{\sum_{i=1}^{n} m_i} \cdot g^{x_0 \left(\sum_{i=1}^{n} r_i\right)}}{\left(g^{\sum_{i=1}^{n} r_i}\right)^{x_0}} \\
&= \log_g g^{\sum_{i=1}^{n} m_i} = \sum_{i=1}^{n} m_i.
\end{aligned} \tag{7}$$

The above shows that the sum of the plaintexts can be obtained by decrypting the homomorphic result. This means that if no one cheats, HS guarantees that the decrypted election results are authentic.

Theorem 4. No PPT forger $F_{voting}$ is able to successfully submit a forged ballot.

Proof. We obtained the weak unforgeability of HS in Theorem 2, which means that no PPT forger $F_{voting}$ can forge a correct signature. Therefore, no $F_{voting}$ can forge another's signature to vote in DHS-voting.

5.3 Argumentation of Correctness

When arguing the correctness of the election results, we need to consider the eligibility of the ballots and the accuracy of the tallying.

Theorem 5. The ballots are eligible and validly signed by the voters.

Proof. The correctness of the PPK is described in Sect. 2.4. To see the correctness of the signature verification of HS, consider an evaluated ciphertext $C = (C_0, C_1, C_2)$, $pk$, $PK$, and the partial decryption keys $\overline{PK}_j$. Hence:


$$\begin{aligned}
C_2 &= \prod_{j=1}^{N_A} \overline{PK}_{j,1} \cdot y_1^{m} \cdot \prod_{j=1}^{N_A} \overline{PK}_{j,2}
     = \prod_{j=1}^{N_A} pk^{x_{j,1}} \cdot y_1^{m} \cdot \prod_{j=1}^{N_A} C_0^{x_{j,2}} \\
&= pk^{x_1} \cdot y_1^{\sum_{i=1}^{n} m_i} \cdot \left(\prod_{i=1}^{n} c_{i,0}\right)^{x_2}
     = g^{\left(\sum_{i=1}^{n} w_i\right)\cdot x_1} \cdot y_1^{\sum_{i=1}^{n} m_i} \cdot \left(\prod_{i=1}^{n} g^{r_i}\right)^{x_2} \\
&= y_1^{\sum_{i=1}^{n} w_i} \cdot y_1^{\sum_{i=1}^{n} m_i} \cdot g^{\left(\sum_{i=1}^{n} r_i\right)\cdot x_2}
     = \prod_{i=1}^{n} y_1^{w_i} \cdot \prod_{i=1}^{n} y_1^{m_i} \cdot \prod_{i=1}^{n} y_2^{r_i}
     = \prod_{i=1}^{n} c_{i,2}.
\end{aligned} \tag{8}$$

The above shows that the proposed homomorphic signcryption system is additively homomorphic for signatures; that is, after homomorphic evaluation it can correctly verify the signature on a ciphertext produced under different private keys.

Theorem 6. The DHS-voting scheme is correct during tallying.

Proof. All hidden ballots and tallying results are posted on BB, so everyone, not just the voters, can check whether any ballots are missing. The accuracy of the election results obtained from the tallying is guaranteed by the additive homomorphism of HS (Sect. 2.1) and Theorem 3. This shows that in the DHS-voting scheme, the process of tallying and decryption is credible.

6 Conclusions

In this paper, we propose an e-voting scheme using homomorphic signcryption and distributed encryption, which gives voters more power and makes elections fairer and more transparent. We believe that simpler and more democratic voting will be a direction of future development.

A Security

Definition 3 (DDH (Decisional Diffie–Hellman) Assumption). Let $G$, $q$, $g$ and three elements $a, b, c \leftarrow Z_q$. For a PPT algorithm $D$, the advantage of distinguishing $D = (g, g^a, g^b, g^{ab})$ and $R = (g, g^a, g^b, g^c)$,
$$Adv_D^{DDH} = \left| \Pr[D(\text{triplet } R) \to 1] - \Pr[D(\text{triplet } D) \to 1] \right|, \tag{9}$$
is negligible in $\lambda$.


Definition 4 (CDH (Computational Diffie–Hellman) Assumption). Let $G$, $q$, $g$ and two elements $a, b \leftarrow Z_q$. For a PPT algorithm $C$, the advantage of obtaining $g^{ab}$,
$$Adv_C^{CDH} = \Pr[g^{ab} \leftarrow C(G, q, g^a, g^b) \mid a, b \leftarrow Z_q,\ G = \langle g \rangle], \tag{10}$$

is negligible in $\lambda$.

The proof of Theorem 1. In order to prove that HS is IND-CPA secure, we set up three simulation games.

GAME 1. This game is the basic IND-CPA game of HS.
1. The simulator $S$ runs Setup and ASetup to obtain $params$ and $(SK, PK)$, where $(SK, PK) = ((x_0, x_1, x_2), (y_0, y_1, y_2))$. Then the $params$ and the $PK$ are sent to a PPT adversary $A$.
2. $A$ generates $sk = w \leftarrow Z_q$ and sends it to $S$.
3. $S$ selects a random bit $\tilde{b} \leftarrow \{0, 1\}$ and computes $c = (c_0, c_1, c_2)$ according to Signcrypt ($r \leftarrow Z_q$), where
$$c_0 = g^r,\quad c_1 = g^{m_{\tilde{b}}} y_0^r,\quad c_2 = y_1^w y_1^{m_{\tilde{b}}} y_2^r.$$
Then, $S$ sends $c$ to $A$.
4. Finally, $A$ outputs $\tilde{b}' \in \{0, 1\}$. $A$ wins the game if $\tilde{b} = \tilde{b}'$.

GAME 2. Except for the third step, this game is the same as GAME 1. In the third step, $c$ is computed as follows ($r \leftarrow Z_q$):
$$R_0 \leftarrow G,\quad c_0 = g^r,\quad c_1 = g^{m_{\tilde{b}}} R_0,\quad c_2 = y_1^w y_1^{m_{\tilde{b}}} y_2^r.$$

GAME 3. Except for the third step, this game is the same as GAME 2. In the third step, $c$ is computed as follows ($r \leftarrow Z_q$):
$$R_0, R_1 \leftarrow G,\quad c_0 = g^r,\quad c_1 = g^{m_{\tilde{b}}} R_0,\quad c_2 = y_1^w y_1^{m_{\tilde{b}}} R_1.$$

We analyze the three games under the DDH assumption; let $win_i$ denote the event that $A$ wins GAME $i$, $i \in \{1, 2, 3\}$. The advantage of the PPT adversary $A$ winning GAME 1 is
$$Adv_A^{IND\text{-}CPA} = \left| \Pr[win_1] - \frac{1}{2} \right|. \tag{11}$$

Lemma 1. If there is a PPT adversary $A$ who can distinguish between GAME 1 and GAME 2 (resp. GAME 2 and GAME 3), then there is a PPT algorithm $D$ with advantage
$$Adv_D^{DDH} = |\Pr[win_2] - \Pr[win_1]|,\quad Adv_D^{DDH} = |\Pr[win_3] - \Pr[win_2]| \tag{12}$$
that breaks the DDH assumption.


Lemma 2. The PPT adversary $A$ has no advantage in winning GAME 3, that is,
$$\Pr[win_3] = \frac{1}{2}. \tag{13}$$

Therefore, we obtain Eq. (1), and Theorem 1 is completely proved.

The proof of Theorem 2. In order to prove the weak unforgeability of HS, we set up a simulation game as follows.
1. The algorithm $C$ obtains a CDH instance $(g, g^a, g^b)$, then chooses $x_0, x_2 \leftarrow Z_q$ and computes a $PK$ and $pk$, where $PK = (y_0, y_1, y_2) = (g^{x_0}, g^b, g^{x_2})$, $pk = g^a$. Then $(params, pk, PK)$ is sent to the forger $F$.
2. $F$ computes $c = (c_0, c_1, c_2)$ on $b$ according to Signcrypt and sends $c$ to $C$.
Finally, the algorithm $C$ can calculate $g^{ab}$ by
$$g^{ab} = \frac{c_2}{c_0^{x_2} \cdot y_1^{\log_g \frac{c_1}{c_0^{x_0}}}}. \tag{14}$$

Since the advantage $Adv_C^{CDH}$ is negligible under the CDH assumption, $Adv_F^{WUF}$ is negligible for any PPT forger $F$; thus we obtain Eq. (2), and Theorem 2 is completely proved.

References

1. Del Blanco, D.Y.M., Alonso, L.P., Alonso, J.A.H.: Review of cryptographic schemes applied to remote electronic voting systems: remaining challenges and the upcoming post-quantum paradigm. Open Math. 16(6), 95–112 (2018)
2. Fujioka, A., Okamoto, T., Ohta, K.: A practical secret voting scheme for large scale elections. In: Seberry, J., Zheng, Y. (eds.) AUSCRYPT 1992. LNCS, vol. 718, pp. 244–251. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-57220-1_66
3. Islam, N., Alam, K.M.R., Tamura, S., Morimoto, Y., et al.: A new e-voting scheme based on revised simplified verifiable re-encryption mixnet. In: International Conference on Networking, Systems and Security, Dhaka. IEEE (2017)
4. Chang, D., Chauhan, A.K., K, M.N., Kang, J.: Apollo: end-to-end verifiable voting protocol using mixnet and hidden tweaks. In: Kwon, S., Yun, A. (eds.) ICISC 2015. LNCS, vol. 9558, pp. 194–209. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30840-1_13
5. Kumar, M., Katti, C.P., Saxena, P.C.: A secure anonymous e-voting system using identity-based blind signature scheme. In: Shyamasundar, R.K., Singh, V., Vaidya, J. (eds.) ICISS 2017. LNCS, vol. 10717, pp. 29–49. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72598-7_3
6. López-García, L., Perez, L.J.D., Rodríguez-Henríquez, F.: A pairing-based blind signature e-voting scheme. Comput. J. 57(10), 1460–1471 (2014)


7. Zhang, H., You, Q., Zhang, J.: A lightweight electronic voting scheme based on blind signature and Kerberos mechanism. In: International Conference on Electronics Information and Emergency Communication, Beijing, pp. 210–214. IEEE (2015)
8. Adida, B.: Helios: web-based open-audit voting. In: Proceedings of the USENIX Security Symposium, Berkeley, pp. 335–348 (2008)
9. Yang, X., Yi, X., Kelarev, A., et al.: A secure verifiable ranked choice online voting system based on homomorphic encryption. IEEE Access 6, 20506–20519 (2018)
10. Mateu, V., Miret, J.M., Sebé, F.: A hybrid approach to vector-based homomorphic tallying remote voting. Int. J. Inf. Secur. 15(2), 211–221 (2016)
11. Huszti, A.: A homomorphic encryption-based secure electronic voting scheme. Publ. Math. 79(3), 479–496 (2015)
12. Kiayias, A., Zacharias, T., Zhang, B.: An efficient E2E verifiable e-voting system without setup assumptions. IEEE Secur. Priv. 15(3), 14–23 (2017)
13. Yi, X., Paulet, R., Bertino, E.: Homomorphic Encryption and Applications. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-319-12229-8
14. Huang, R., Li, Z., Zhao, J.: A verifiable fully homomorphic encryption scheme. In: Wang, G., Feng, J., Bhuiyan, M.Z.A., Lu, R. (eds.) SpaCCS 2019. LNCS, vol. 11611, pp. 412–426. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-24907-6_31
15. ElGamal, T.: A public key cryptosystem and a signature scheme based on discrete logarithms. In: Blakley, G.R., Chaum, D. (eds.) CRYPTO 1984. LNCS, vol. 196, pp. 10–18. Springer, Heidelberg (1985). https://doi.org/10.1007/3-540-39568-7_2
16. Zhang, P., Yu, J., Liu, H.: A homomorphic signcryption scheme and its application in electronic voting. J. Shenzhen Univ. Sci. Eng. 28(6), 489–494 (2011)
17. Bernhard, D., Cortier, V., Galindo, D.: A comprehensive analysis of game-based ballot privacy definitions. In: Symposium on Security and Privacy, San Jose. IEEE (2015)
18. Alhothaily, A., Hu, C., Alrawais, A., et al.: A secure and practical authentication scheme using personal devices. IEEE Access 5, 11677–11687 (2017)
19. Bräunlich, K., Grimm, R.: A formal model for the requirement of verifiability in electronic voting by means of a bulletin board. In: Heather, J., Schneider, S., Teague, V. (eds.) Vote-ID 2013. LNCS, vol. 7985, pp. 93–108. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39185-9_6
20. Cortier, V., Galindo, D., Küsters, R., et al.: SoK: verifiability notions for e-voting protocols. In: IEEE Symposium on Security and Privacy, San Jose, pp. 779–798. IEEE (2016)
21. Rezaeibagha, F., Mu, Y., Zhang, S., Wang, X.: Provably secure homomorphic signcryption. In: Okamoto, T., Yu, Y., Au, M.H., Li, Y. (eds.) ProvSec 2017. LNCS, vol. 10592, pp. 349–360. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68637-0_21
22. Schnorr, C.: Efficient signature generation by smart cards. J. Cryptol. 4(3), 161–174 (1991)
23. Chaum, D., Pedersen, T.P.: Wallet databases with observers. In: Brickell, E.F. (ed.) CRYPTO 1992. LNCS, vol. 740, pp. 89–105. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-48071-4_7
24. Bernhard, D., Pereira, O., Warinschi, B.: How not to prove yourself: pitfalls of the Fiat-Shamir heuristic and applications to Helios. In: Wang, X., Sako, K. (eds.) ASIACRYPT 2012. LNCS, vol. 7658, pp. 626–643. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34961-4_38

Towards In-Network Generalized Trustworthy Data Collection for Trustworthy Cyber-Physical Systems

Hafiz ur Rahman1, Guojun Wang1(B), Md Zakirul Alam Bhuiyan2, and Jianer Chen1

1 School of Computer Science, Guangzhou University, Guangzhou 510006, China
hafiz [email protected], [email protected], [email protected]
2 Department of Computer and Information Sciences, Fordham University, New York, NY 10458, USA
[email protected]

Abstract. Data trustworthiness (i.e., the data is free from error, up to date, and originates from a reputable source) is always preferred. However, due to environmental influences (i.e., equipment faults, noise, or security attacks) and technology limitations, a wireless sensor module can neither store/process all raw data locally nor reliably forward the data to a destination in a heterogeneous IoT environment. As a result, the sensing data collected by IoT/Cyber-Physical Systems (CPS) is inherently noisy and unreliable, and may trigger many false alarms. Such false or misleading data can lead to wrong decisions once the data reaches the end entities/cloud. Therefore, it is highly desirable to identify trustworthy data before data transmission, aggregation, and storage at the end entities/cloud. In this paper, we propose an In-network Generalized Trustworthy Data Collection (IGTDC) framework for collaborative IoT environments. The key idea of IGTDC is to allow a sensor module to check whether the raw/sensed data is reliable before routing it to the sink/edge node. It also identifies whether the received data is trustworthy before aggregation at the sink/edge node. Besides, IGTDC facilitates identifying a faulty sensor. For reliable event detection in real time, without waiting for a trust report from the end entities/cloud, we use collaborative IoT and gate-level modeling with a Verilog User Defined Primitive (UDP) to make sure that the collected data/event information is reliable before it is sent to the end entities/cloud. We use BCD 8421 (Binary Coded Decimal) in the gate-level modeling as a flag that assists in identifying a faulty or compromised sensor. Through simulations with Icarus Verilog, we demonstrate that the data collected in IGTDC is trustworthy, enables trustworthy data aggregation for event detection, and helps to identify a faulty sensor.

Keywords: Trustworthy data · Event detection · Cyber-Physical System · Internet of Things

1 Introduction

The Internet of Things (IoT) can integrate physical objects with the virtual world (i.e., smart devices, sensors, computers, etc.) for various operations. Millions of embedded devices are being used today in safety- and security-critical applications such as crowdsensing, structural health monitoring, industrial control systems, chemical explosion detection, and military surveillance [1–3]. According to one report, around 50 billion objects and accessories will be attached to the Internet by the year 2020, and these numbers will surpass the number of human beings on earth [2]. According to [4], around 85% of corporations/businesses worldwide will be shifting to the IoT domain. In addition, IoT-based facilities have grown exponentially over the past five years, especially in telemedicine and production uses. It is expected that IoT will contribute about USD 1.1–2.5 trillion to the global economy by 2020 [5–7].

IoT sensors are generally deployed to monitor and measure different attributes of the environment. These sensors sense and generate a substantial amount of data that will be used for decision and policy making [8–11]. The quality of the data or monitoring and the timely detection of events are essential problems in most sensing applications. This is especially true in cases of structural damage or fire, where the system should be capable of identifying failures in the collected data in real time and taking immediate recovery measures to avoid unnecessary operations [9,10]. The reliability of data and event detection is strongly solicited where public safety and economic loss are involved [10,12]. However, deployed sensors are vulnerable to damage in rough environments. Due to technology limitations and environmental influences (e.g., equipment faults, background noise, clutter, lack of sensor calibration, and security attacks), the sensor data collected by IoT/Cyber-Physical Systems (CPS) is inherently noisy, usually unreliable, and may trigger many false alarms [10,13–15]. Such false or misleading data can lead to wrong decisions and ultimately result in wasted resources and time, and could even risk human life [10].

As a motivation, a recent investigation conducted by HP [6] reported that 70% of IoT devices are exposed to various cyber-attacks and that a huge amount of sensor network data was affected by data failures. About 60%, 50%, and 25% of the data items were faulty and erroneous in the Great Duck Island, Macroscope project, and Berkeley lab experiments, respectively [16–19]. Moreover, unreliable or compromised data may be received at the time of acquisition; we then apply several robust security algorithms to process, store, and transfer the data, and finally make decisions for numerous CPS applications. As a result, decisions made in a cyber system based on the collected data may be meaningless/untrustworthy: we may be processing, transmitting, encrypting, or storing compromised data. Therefore, it is highly desirable to extract meaningful information/trustworthy data from a large volume of noisy data for trustworthy IoT/CPS before investing cost and time in processing, storing, and transmitting unsecured and untrustworthy data.


It is challenging to differentiate meaningful and significant data from a considerable quantity of noisy data. Current approaches for measuring data trustworthiness are generally meant for the web and traditional sensor networks. These methods are less suited to IoT/CPS, since IoT/CPS has an inherently different nature from other paradigms or domains: IoT/CPS environments are heterogeneous and dynamic compared with traditional sensor networks and the web domain. Currently, there is no comprehensive approach to the problem of high-assurance data trustworthiness for trustworthy IoT and cyber systems. Few researchers [20–22] consider security and privacy before data processing, storage, and transmission. Complete data trustworthiness requires an articulate solution combining different approaches/techniques, for instance, a system for computing data trustworthiness in IoT that can distinguish data faults, noise, and cyberattacks throughout the data life cycle (e.g., during data collection, data transmission, and cloud data storage).

In this paper, we propose an In-network Generalized Trustworthy Data Collection (IGTDC) framework for collaborative IoT environments. The key idea of IGTDC is to allow a sensor module to check whether the raw/sensed data is reliable before transmitting it to the sink/edge node. It also identifies whether the received data is trustworthy before aggregation at the sink/edge node. Besides, IGTDC helps to identify a faulty sensor. For reliable event detection in real time, without waiting for a trust report from the end entities, we use collaborative IoT and gate-level modeling with a Verilog User Defined Primitive (UDP) [23] to make sure that the collected data/event information is reliable before it is sent to the end entities/cloud. IGTDC forwards only the corresponding true/real alarm for an event to the sink/edge. We use a binary code in the gate-level modeling as a flag that helps to identify a faulty or compromised sensor. We envision that IGTDC provides a low-cost, energy-efficient, online, in-network solution towards trustworthy data collection for a trustworthy cyber system in a dynamic environment. In summary, during data collection, IGTDC can decide in real time whether the acquired/sensed data is trustworthy and finally transmit the trustworthy/reliable data to the sink/edge node; before aggregating the collected data, the sink/edge node checks whether the collected data is reliable.

The rest of the paper is organized as follows. Section 2 provides an overview of the data characteristics of IoT/CPS and the IGTDC framework. Section 3 describes the need for data trustworthiness for event detection. Section 4 evaluates our IGTDC framework. Finally, we draw conclusions and further research directions in Sect. 5.

2 IoT Data Characteristics and IGTDC Framework

In this section, we describe the characteristics of the generalized IoT environment and the IGTDC framework. Typically, an IoT system consists of three layers, as shown in Fig. 1. A physical awareness layer observes and measures the physical environment: the interconnected 'things' such as sensors or mobile devices monitor


Fig. 1. Generalized IoT system

and collect different types of environmental events (e.g., sleep habits, fitness levels, humidity, temperature, smoke, movement positions, pressure, etc.). These data or events can be further aggregated, fused, processed, and disclosed to derive useful information for smart services, such as event detection at the gateway. A network layer sends and processes the collected data, and the application layer provides situational-awareness information services to end entities. All IoT-related smart services, no matter how diverse, follow five basic steps at all times: (1) sense; (2) transmit; (3) store; (4) analyze; (5) act [6,8,24].

Resource-scarce devices such as IoT devices are dumb (i.e., they are used only for data gathering, with the data transmitted to the cloud, where predictions are made). However, transmitting data to the cloud has issues such as draining the battery of devices, as well as privacy, latency, and bandwidth concerns. This raises the questions: can we make these devices intelligent? Can we design an algorithm that can be deployed on tiny devices for prediction, provides near state-of-the-art performance, and performs faster predictions without draining the battery? A huge line of work exists on designing memory-efficient models using machine learning approaches, for instance, compressing neural networks [25], pruning random forests [26], and compressing k-Nearest Neighbors (kNN) [27]. However, all of these have one or more of the following problems: the models do not fit into tiny devices with less than 2 KB of RAM; they perform poorly when compressed onto tiny devices; or they do not generalize to other supervised learning tasks such as multilabel classification and ranking.

In this paper, we emphasize the first layer of the generalized IoT system, as shown in Fig. 2. We propose the In-network Generalized Trustworthy Data Collection (IGTDC) framework for collaborative IoT environments. We design a tiny programmable logic device (PLD), as discussed in Sect. 1, that works at the


Fig. 2. Generalized IoT system with IGTDC framework

edge level (i.e., at the sensor module or in-network level) in order to achieve data trustworthiness during data collection. IGTDC can be deployed at the sensor module level, provides near state-of-the-art performance without draining the battery, is low cost and energy efficient, is an in-network solution, and can work in any heterogeneous, dynamic/unstable IoT environment. IGTDC is based on digital combinational logic (i.e., truth tables, K-maps, and gate-level implementation). This utility uses no memory, owing to its combinational design and few logic elements (gates) [23]. Furthermore, it is based on simple Boolean expressions (i.e., Sum of Products (SOP) and Product of Sums (POS)) that can be modified according to the nature of the IoT environment. That is why we term this method the In-network Generalized Trustworthy Data Collection (IGTDC) framework.

3 Trustworthy Data Collection for Event Detection

In this section, we present trustworthy data collection for the IoT/CPS environment using the IGTDC framework. Our initial interest is the physical perception part: to identify locally, at the sensor module, whether the acquired data is trustworthy, and finally to transmit the reliable data to the upper layers of the IoT environment. For example, during data collection, the IGTDC framework can decide that the acquired data is trustworthy without waiting for a trust report from the end entities/cloud, and it then forwards only the accurate data to the gateway.

In this paper, we consider a fire detection scenario for simplicity. Fire is a major cause of accidents, claiming precious lives and property. The Global Warming Report 2017 names "rapidly spreading fire as one of the real reasons behind the increment in an Earth-wide temperature boost" [28]. Common causes of wildfire are lightning, extremely hot and arid weather, and human carelessness, as in the Amazon rainforest fires [29]. So far this year, more than 80,000 fires in the country have been detected by Brazil's space research center [29,30]. Hence, timely detection of fire is critical for avoiding major accidents.


Fig. 3. Sensor's data life cycle in the IGTDC framework

Fig. 4. Logic domain interpretation (i.e., truth table, K-map, and BCD flag information) for trustworthy event (fire) detection

A fire results in significant changes in the temperature, the humidity of the air, and the percentage of carbon dioxide and carbon monoxide present in the atmosphere. Hence, appropriate sensors have to be installed at vulnerable places to detect the mentioned physical quantities [31,32]. In the IGTDC framework, sensors are deployed according to an engineering-driven deployment method. In the design prototype, sensors are installed in three distinct locations to identify the exact location of a fire hazard that has taken place. The fire-related data is collected by three different sensors (temperature, smoke, and humidity). Via the PLD/User Defined Primitive (UDP), the sensor data are compared with predefined threshold values, and the corresponding event information is finally sent to a gateway for data aggregation, as shown in Fig. 3.

Figure 3 depicts the sensor's data life cycle in the IGTDC framework for trustworthy data collection in the IoT/CPS environment. Firstly, different sensors measure


the temperature, smoke, and humidity of the environment and send the readings to the PLD (sensor module) for processing. All sensor data is considered raw before being processed by the PLD. In step II, the PLD module processes and evaluates all raw sensor readings using User Defined Primitive (UDP) logic (i.e., truth table, K-map, and BCD values), as shown in the tables of Fig. 4. A K-map is essentially an alternative representation of a truth table that was designed to facilitate the creation of minimal Boolean expressions [33]. K-maps reduce logic functions more quickly and easily compared to Boolean algebra [33]. By reducing, we mean simplifying: reducing the number of gates and inputs. We like to simplify logic to the lowest-cost form to save costs by eliminating components.

In IGTDC, the User Defined Primitive (UDP) logic table output is considered reliable if and only if all the sensor signals cross their thresholds for a specific event at a given time. Otherwise, the signals are considered faulty/fake or compromised. As a result, the PLD forwards the result and the corresponding flag values to the gateway, according to the truth tables of Fig. 4. We call the data "routed data" after it has been processed by the PLD module. The flag (BCD 8421, Binary Coded Decimal) contains information about the current sensor status (i.e., each sensor's on/off status and the corresponding sensor signals for an event at a given time interval; 1 indicates that a sensor observed an event and crossed its threshold, and vice versa).

We observe that several approaches have traditionally been proposed to discover unreliable sensor module data at the gateway, for example data provenance, signal correlation analysis, the neighborhood similarity hypothesis, Mutual Information Independence (MII), signal comparison (i.e., signal to signal, signal to noise, random signal sampling), and truth discovery, to name a few. These techniques can also resolve conflicts among multiple noisy and cluttered sensor data. We used the BCD 8421 code, a weighted code in which a four-bit character represents the decimal digits 0 to 9. It helps the gateway to investigate and identify a faulty or compromised sensor. The final decision for an accurate and reliable event is taken by the gateway once it collects the outputs from all sensor modules and identifies whether the collected outputs are faulty. If an anomalous change appears across the sensor modules, there is a possibility that a sensor is faulty or compromised.

In summary, using the logic output values, a sensor module can decide locally, without the help of a trust report from an edge/fog node or a centralized cloud, whether the raw data is reliable for an event. If it is reliable, it is routed towards the gateway for data aggregation. Finally, we use BCD 8421 in the IGTDC framework as a trust aggregation technique for reliable event and faulty/compromised sensor detection at the gateway.
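The following is a minimal Python rendering of this decision logic (the paper realizes it as a Verilog UDP); the threshold values and the direction of each comparison are assumptions for illustration:

```python
# Assumed thresholds; a fire raises temperature and smoke but lowers humidity.
THRESHOLDS = {"temperature": 57.0, "smoke": 300.0, "humidity": 18.0}

def crossed(name, value):
    t = THRESHOLDS[name]
    return value <= t if name == "humidity" else value >= t

def evaluate_module(readings):
    """Mirror of the UDP table: the event output is 1 only on the row where
    every sensor crosses its threshold; the flag packs the per-sensor bits
    (temperature, smoke, humidity), like the BCD-style flag sent to the
    gateway for backtracing a faulty or compromised sensor."""
    bits = [int(crossed(k, readings[k]))
            for k in ("temperature", "smoke", "humidity")]
    flag = bits[0] << 2 | bits[1] << 1 | bits[2]
    return int(all(bits)), flag

print(evaluate_module({"temperature": 70.0, "smoke": 500.0, "humidity": 10.0}))
# -> (1, 7): consistent fire evidence, routed to the gateway with flag 0b111
print(evaluate_module({"temperature": 20.0, "smoke": 500.0, "humidity": 60.0}))
# -> (0, 2): a lone smoke spike is withheld and flagged as suspect
```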

4 Results and Discussion

In this section, we discuss the performance evaluation of IGTDC to justify trustworthy data collection for a trustworthy cyber system through simulations. We use an Intel Core i7 with 16 GB RAM running Windows 10, 64-bit. We used the latest version of Icarus Verilog (0.9.7) for the PLD coding and GTKWave (3.3) [34,35] for examining the digital sensor waveforms. GTKWave is used to check the output of the


Fig. 5. Logic domain interpretation (i.e., truth table, K-map, and BCD flag information) for three different sensor modules

Fig. 6. GTKWave waveform for trustworthy event detection using PLD/User Defined Primitive (UDP) (Logic Object Domain - I, Logic Object Domain - II, and Logic Object Domain - III)

Fig. 7. GTKWave waveform for trustworthy event detection using PLD/User Defined Primitive (UDP) (Logic Object Domain - I) with some noise/faulty signals


Fig. 8. GTKWave waveform for trustworthy event detection using PLD/User Defined Primitive (UDP) (Logic Object Domain - II) without noise/faulty signals

Fig. 9. GTKWave waveform for trustworthy event detection using PLD/User Defined Primitive (UDP) (Logic Object Domain - III) with some noise/faulty signals

various digital emulators and analysis tools for debugging Verilog or VHDL simulation models [34,35]. Figure 5 shows the logic domain interpretation (i.e., truth table, K-map, and BCD flag information) for three different sensor modules. Moreover, as shown in Fig. 5, the PLD uses the logic domain values as the foundation for the different operations/test cases. We implement a lightweight/tiny Verilog PLD/User Defined Primitive (UDP) programming utility and optimize the output with the K-map "don't care" condition [33] and the BCD (8421) flag, as shown in Fig. 5. K-maps also allow easy minimization of functions whose truth tables include "don't care" conditions. A "don't care" condition is a combination of inputs for which the designer does not care what the output is; therefore, "don't care" conditions can either be included or excluded. In Fig. 5, a "don't care" is indicated with a red 'X'. We are only interested in reliable output for a reliable event; therefore, we did not consider those outputs for which information is missing. We envision that this helps to save device battery and reduces latency and bandwidth concerns.

When a sensor reading is ready to be transferred from an IoT sensor, it first goes to the sensor module. The K-map outputs are either '1', '0', or 'x' and are given as input to the gateway. We combine the trustworthiness of these attributes to evaluate the final confidence in the raw data itself. The control logic maintains all the internal system actions and outputs.

Figure 6 shows the digital sensor waveform using GTKWave. As shown in Fig. 6, the simulation runs for 240 s. According to the logic object module shown in Fig. 5, only one output value is routed to the gateway, represented by the '1' output column with the corresponding flag of '111' in the table. The output is routed to the gateway if and only if all sensors (temperature, smoke, and humidity) detect the environmental event (i.e., all sensors exceed their threshold values). Figure 6 shows the same


observations using the GTKWave waveform for reliable event detection. We can see that, out of seven, only one reliable event is detected, at 110 s in a 240-s simulation period. In the second set of simulations, trustworthy data collection is evaluated by randomly modifying multiple sensors in the data sets: at different time intervals of the simulation, we randomly select a few sensors and feed faulty signals into their data collection model. Figures 7, 8, and 9 show the corresponding digital waveforms and flags for trustworthy event detection as well as for faulty sensor signals. Figure 7 shows that the humidity, smoke, and temperature sensors have faulty or partly altered signals at 400 s, 450 s, 520 s, 580 s, and 630 s, respectively. Figure 8 has no faulty or altered signal injection, which means that the acquired data is reliable and the corresponding event signal is trustworthy. However, one faulty/compromised signal is injected for the humidity sensor, as shown in Fig. 9; the corresponding event signal is recorded at 450 s. Whenever the gateway receives such data, it can discard the data from the aggregation, or a data reconstruction method can be applied to the unreliable data. By using the flag, the corresponding faulty/compromised sensor(s) can be backtraced as well.

5 Conclusions and Future Directions

In this paper, we have presented an In-network Generalized Trustworthy Data Collection (IGTDC) framework for event and faulty sensor detection in the IoT/CPS environment. It facilitates real-time trustworthy data collection at the edge of the IoT/CPS system (i.e., at the sensor module and gateway level). For reliable event detection, we built a tiny utility using a programmable logic device (PLD) that works at the sensor module level to make sure that the collected and transmitted data for an event is trustworthy. Through simulations with Icarus Verilog, we demonstrate that the data collected in IGTDC is trustworthy and enables reliable data aggregation for an event as well as faulty sensor detection. Our future work comprises a full implementation of the framework for data protection/security before routing to a gateway. One limitation of our work is that we could not evaluate our framework against currently available frameworks; this is another direction for future work.

Acknowledgments. This work was supported in part by the National Natural Science Foundation of China under Grant 61632009 and 61872097, in part by the Guangdong Provincial Natural Science Foundation under Grant 2017A030308006, and in part by the High-Level Talents Program of Higher Education in Guangdong Province under Grant 2016ZJ01.

References

1. Cam-Winget, N., Sadeghi, A.-R., Jin, Y.: Can IoT be secured: emerging challenges in connecting the unconnected. In: Proceedings of 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6 (2016)


2. Cisco, Internet of Things (IoT) - Cisco IoT Product Portfolio - Cisco. http://www.cisco.com/c/en/us/solutions/internet-ofthings/iot-products.html. Accessed 23 May 2019
3. ur Rahman, H., Azzedin, F., Shawahna, A., Sajjad, F., Abdulrahman, A.S.: Performance evaluation of VDI environment. In: 2016 Sixth International Conference on Innovative Computing Technology (INTECH), pp. 104–109. IEEE, August 2016
4. Manyika, J., Chui, M., Bughin, J., Dobbs, R., Bisson, P., Marrs, A.: Disruptive technologies: advances that will transform life, business, and the global economy, vol. 12. McKinsey Global Institute, San Francisco, CA (2013)
5. Andrew, M.: How the Internet of Things will affect security and privacy (2016). http://www.businessinsider.com/internet-of-things-security-privacy-2016-8?IR=T
6. Makhdoom, I., et al.: Anatomy of threats to the internet of things. IEEE Commun. Surv. Tutorials 21(2), 1636–1675 (2018)
7. ur Rahman, H., Wang, G., Chen, J., Jiang, H.: Performance evaluation of hypervisors and the effect of virtual CPU on performance. In: 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 772–779. IEEE, October 2018
8. Karkouch, A., Mousannif, H., Al Moatassime, H., Noel, T.: Data quality in internet of things: a state-of-the-art survey. J. Netw. Comput. Appl. 73, 57–81 (2016)
9. Bhuiyan, M.Z.A., Wu, J.: Trustworthy and protected data collection for event detection using networked sensing systems. In: 2016 IEEE 37th Sarnoff Symposium. IEEE (2016)
10. Bhuiyan, M.Z.A., Wang, G., Choo, K.K.R.: Secured data collection for a cloud-enabled structural health monitoring system. In: 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE (2016)
11. Javadpour, A., Abharian, S.K., Wang, G.: Feature selection and intrusion detection in cloud environment based on machine learning algorithms. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), pp. 1417–1421. IEEE, December 2017
12. Arif, M., Wang, G., Wang, T., Peng, T.: SDN-based secure VANETs communication with fog computing. In: Wang, G., Chen, J., Yang, L. (eds.) Security, Privacy, and Anonymity in Computation, Communication, and Storage. SpaCCS 2018. LNCS, vol. 11342, pp. 46–59. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-05345-1_4
13. Elahi, H., Wang, G., Li, X.: Smartphone bloatware: an overlooked privacy problem. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 169–185. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_15
14. Li, J., Deng, H., Jiang, W.: Secure vibration control of flexible arms based on operators' behaviors. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 420–431. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_34


15. Haron, N., Jaafar, J., Aziz, I.A., Hassan, M.H., Shapiai, M.I.: Data trustworthiness in internet of things: a taxonomy and future directions. In: 2017 IEEE Conference on Big Data and Analytics (ICBDA), pp. 25–30. Kuching (2017)
16. Sharma, A.B., Golubchik, L., Govindan, R.: Sensor faults: detection methods and prevalence in real-world datasets. ACM Trans. Sens. Netw. (TOSN) 6(3), 23 (2010)
17. Tolle, G., et al.: A macroscope in the redwoods. In: Proceedings of the 3rd International Conference on Embedded Networked Sensor Systems, pp. 51–63. ACM, November 2005
18. Kamal, A.R.M., Bleakley, C., Dobson, S.: Packet-level attestation (PLA): a framework for in-network sensor data reliability. ACM Trans. Sens. Netw. (TOSN) 9(2), 19 (2013)
19. Samuel, M.: Intel lab data, June 2004. http://db.csail.mit.edu/labdata/labdata.html
20. Li, W., Song, H., Zeng, F.: Policy-based secure and trustworthy sensing for internet of things in smart cities. IEEE Internet Things J. 5(2), 716–723 (2018)
21. Tang, L.-A., et al.: Trustworthiness analysis of sensor data in cyber-physical systems. J. Comput. Syst. Sci. 79(3), 383–401 (2013)
22. Tang, L.-A., et al.: Tru-alarm: trustworthiness analysis of sensor networks in cyber-physical systems. In: 2010 IEEE International Conference on Data Mining. IEEE (2010)
23. Ciletti, M.D.: Modeling, Synthesis, and Rapid Prototyping with the Verilog HDL. Prentice Hall, Upper Saddle River (1999)
24. Han, G., Jiang, J., Shu, L., Niu, J., Chao, H.-C.: Management and applications of trust in wireless sensor networks: a survey. J. Comput. Syst. Sci. 80(3), 602–617 (2014)
25. Luo, T., Nagarajan, S.G.: Distributed anomaly detection using autoencoder neural networks in WSN for IoT. In: 2018 IEEE International Conference on Communications (ICC). IEEE (2018)
26. Xie, M., Hu, J., Han, S., Chen, H.-H.: Scalable hypergrid k-NN based online anomaly detection in wireless sensor networks. IEEE Trans. Parallel Distrib. Syst. 24(8), 1661–1670 (2013)
27. Zhang, Y., Meratnia, N., Havinga, P.J.: Distributed online outlier detection in wireless sensor networks using ellipsoidal support vector machine. Ad Hoc Netw. 11(3), 1062–1074 (2013)
28. Bailey, D.L., et al.: Combined PET/MRI: global warming–summary report of the 6th International Workshop on PET/MRI, 27–29 March 2017, Tübingen, Germany. Mol. Imag. Biol. 20(1), 4–20 (2018)
29. Amazon rainforest fires: everything we know and how you can help. https://www.cnet.com/how-to/amazon-rainforest-fire-whats-happening-now-and-how-you-can-help-update-indigeous-tribes/. Accessed 17 Aug 2019
30. Aguiar, A.P.D., et al.: Modeling the spatial and temporal heterogeneity of deforestation-driven carbon emissions: the INPE-EM framework applied to the Brazilian Amazon. Glob. Change Biol. 18(11), 3346–3366 (2012)
31. Saeed, F., et al.: IoT-based intelligent modeling of smart home environment for fire prevention and safety. J. Sens. Actuator Netw. 7(1), 11 (2018)
32. Adriano, D.B., Wahyu, A.C.B.: IoT-based integrated home security and monitoring system. J. Phys. Conf. Ser. 1140(1), 012006 (2018)
33. Mehta, R., et al.: Delivering high performance result with efficient use of K-map. Int. J. Control Autom. 9(2), 307–312 (2016)


34. Gai-ning, H.A.N., Qing-lin, X.U., Qun, D.U.A.N.: Research and application of digital circuit base on Iverilog and GTKWave. J. Xianyang Normal Univ. 4, 58–61 (2009)
35. Yadav, P., Abdul Rajak, A.R., Fathima, A.: VLSI implementation of gold sequence by novel method. Int. J. Adv. Stud. Comput. Sci. Eng. 2(2), 32 (2013)

QoS Based Clustering for Vehicular Networks in Smart Cities

Soumia Bellaouar(B), Mohamed Guerroumi(B), and Samira Moussaoui(B)

Vehicular Networks for Intelligent Transport Systems (VNets) Group, Laboratory for Research in Intelligent Computing, Mathematics, and Applications (RIIMA), Department of Computer Science, University of Science and Technology Houari Boumediene, 16000 Bab-Ezzouar, Algiers, Algeria
[email protected], [email protected], [email protected]

Abstract. Today, Vehicular Ad Hoc Network (VANET) is one of the most important smart city system components. VANETs are distinguished by their high mobility, high information storage capacity, available energy, and the environment governed by road infrastructure but there are still many challenges to be faced to achieve good communication between vehicles. In the traditional ad hoc mobile networks, these difficulties have often been overcome by a clustered topology. The clustering in VANETs can reduce the communication problems and makes the network appear smaller and more stable according to the view of each vehicle. In this work, we propose a VANET clustering solution called “QoSCluster”, which aims to maintain the cluster structure while respecting the quality of service requirements as the network involves. We simulate our protocol using the Veins platform, the OMNET simulator, and the realistic mobility model SUMO. Keywords: Vehicular Ad Hoc Networks · Vehicle to vehicle communication (V2V) · Clustering · Quality of service · Cluster Head

1 Introduction

VANETs offer a wide range of applications, such as safety applications for drivers and passengers. These networks are characterized by a dynamic topology, due to the high mobility of vehicles, and by a high density, which increases the number of broadcast messages and causes congestion in the network. The clustering technique is considered an efficient means to eliminate, or at least minimize, this broadcast storm problem [1,2]. Clustering vehicles is a natural phenomenon in the network, and it helps face several VANET challenges such as the broadcast storm problem.


2 Related Work

Most of the proposed clustering algorithms aim to maintain the stability of vehicle groups; others rely on metrics such as the number of hops (distance), direction, or relative mobility. In [3], the clustering algorithm extends the lifetime of the cluster and of the Cluster Head (CH). The CH election decision involves three parameters: relative speed (RS), remote connectivity (DC) and expected reciprocal average transmission count (RMETX). The authors of [4] introduce a multi-hop clustering protocol based on the relative mobility between vehicles within a K-hop distance, which can extend the coverage area of clusters and improve their stability. In [5], the authors consider clusters that naturally form groups of vehicles called convoys. The lane in which a vehicle is traveling is not taken into consideration, although the grouping of vehicles on the same lane is favored by the criterion of belonging to the convoy. A new Distributed Multi-hop Clustering algorithm based on Neighborhood Follow (DMCNF) for VANETs is proposed in [1], in which CHs are selected via the vehicle-following relationship. This protocol is based on the assumption that a vehicle cannot efficiently determine which vehicle among its multi-hop neighbors is the most stable, but can easily select it among its one-hop neighbors. Each vehicle has a direct follow relationship with its one-hop neighbors, and a following chain can exist between two non-neighboring vehicles via an indirect follow relationship. A neighborhood-follow cluster is defined as a group of vehicles whose CH is the vehicle that is directly or indirectly followed by the other vehicles. In [6], the QoS metric is considered an important parameter to establish and maintain network connectivity. In [7], a new clustering-based reliable low-latency multipath routing (CRLLR) scheme is proposed, employing the Ant Colony Optimization (ACO) technique and building on the existing ad hoc on-demand multipath distance vector (AOMDV) routing scheme. Link reliability is used as the criterion for Cluster Head (CH) selection. Moreover, the ACO technique is employed to efficiently compute the optimal routes among communicating vehicles in terms of four QoS metrics: reliability, end-to-end latency, throughput and energy consumption. In [8], a clustering protocol extending the street-centric QoS-OLSR protocol for urban VANETs is proposed, where QoS and the current street are used for cluster head and MPR selection to improve network connectivity, stability, and performance. The QoS metric for electing cluster heads is based on available bandwidth, lane weight and neighbors on other streets.

3 Network Architecture Model

In this work, we introduce a multi-metric clustering protocol, QoS-Cluster, based on quality of service requirements and on changes in the relative mobility of vehicles, as well as on other metrics that allow us to build more stable and efficient clusters. In large-scale and complex VANETs, a vehicle can hardly collect precise details on multi-hop vehicles and can hardly decide which CH to choose among its multi-hop neighbors. However, a vehicle can easily collect this information and determine which vehicle within its transmission range is the most stable. In most existing clustering protocols, a vehicle must determine the cumulative weight value of all vehicles at N hops in order to elect the CH. This generates and disseminates many control messages throughout the network, reduces the effectiveness of clusters and increases overhead. Our main idea is to build, in a first phase, one-hop clusters based on stability and quality of service metrics, and then to merge neighboring clusters to build multi-hop clusters. A Cluster Member (CM) node saves the IDs of all CHs that have already invited it, in order to restore the cluster in the event of a connection failure with the new CH. This reduces network overload by avoiding reforming the one-hop clusters, and minimizes the time during which vehicles do not belong to any cluster. We also define relay nodes (CG: Cluster Gateway), which allow the transfer of emergency messages between clusters with a better quality of service. Cluster nodes are classified as Cluster Head (CH), Cluster Gateway (CG), Isolated Cluster Head (ICH) or Cluster Member (CM). When a vehicle joins the VANET, it initializes its status to FN (Free Node).

3.1 Assumptions

In this network, we consider vehicles equipped with a single wireless transmission interface for V2V communications. All nodes have the same synchronized timer and a geolocation system that provides the speed, position, and direction of travel of the vehicle. We also assume that the VANET deployment environment is a highway consisting of three (3) lanes, characterized by high density, the absence of obstacles and high-speed vehicle traffic. Three message types are used in our solution: "beacon" messages, control messages used for cluster formation and maintenance (JOIN, HeadMsg, AckCm, CGPropose, ...) and alert messages. Table 1 gives a summary of all message types used in our proposal.

3.2 Cluster Head Election Metrics

Cluster formation should take into account a compromise between quality of service and mobility parameters. The quality of service parameters are used to ensure transmission reliability, ensure communication (inter-cluster and intra-cluster) and increase cluster head coverage, while the mobility parameters are considered to guarantee the stability of the network. Therefore, our QoS-based clustering protocol combines the two types of metrics for the selection of the CH. QoS Metrics: In our protocol, the QoS metrics are used in the election of the Cluster Head to improve delivery delays and ensure reliable dissemination of emergency messages between clusters. We consider three QoS parameters: the available bandwidth, the delay and the link expiration time.

Table 1. Type of messages.

Messages       Description
Beacon         Vehicle-related information
JOIN           Request to join a cluster
AckCH          Acknowledgment of a join request
HeadMsg        Invitation request to (re-)join a cluster; carries the weight of the current CH vehicle
AckCM          Acknowledgment sent by a CM to its CH
INVIT          Merger invitation
INVIT-Replay   Merger proposal
AckFusion      Merger validation
Fusion         Merger information for cluster members
CGpropose      Role proposition as gateway between clusters
Select         Answer to CGpropose
QuitCM         A CM quits its cluster
QuitCH         A CH quits its role

– Average link expiration time: the link expiry time (LET) can be an important criterion of network quality; we consider that speed alone is not sufficient to assess the stability of the link between two mobile nodes. The link expiry time LETij between vehicle i and vehicle j, and the average link expiry time, are calculated as in [9].
– Average link bandwidth: we calculate this metric by estimating the bandwidth occupied on the link, i.e., the number of bits transmitted between two vehicles i and j on a given channel during a duration T, with respect to the maximum capacity C of the channel, as in [6].
– Average end-to-end delay: the end-to-end delay Delayij is defined as the time it takes for packets to be delivered from a source vehicle i to a destination vehicle j. This includes all possible delays caused by processing, buffering, transmission and propagation; it is calculated as in [10].

Stability Metrics: Stability parameters are applied in the CH election process to form a stable cluster structure, extending the lifetime of the clusters and thus reducing the number of reconfigurations of this structure. We consider three stability metrics in our QoS-Cluster protocol: the level of connectivity, the relative mobility and the relative distance.

– The level of connectivity: it is defined as the total number of neighbors located in the transmission range of a vehicle i. This cardinality is computed according to formula 1:

DegVi = |N|     (1)

– Relative mobility: it is calculated as in [10].
– The average relative distance: we follow an existing strategy defined in the protocol of [11], where the vehicle position is represented by the x coordinate only. This assumes that the trajectory of all vehicle nodes is a straight line, since the lane width is small.

3.3 Calculation of the Combined Weight

Each vehicle periodically calculates the values of the metrics defined above using the speed, direction and position information exchanged through beacon messages. By combining these measurements, we obtain a weight, defined by Eq. 2, which expresses the ability of a vehicle to become a CH. Each vehicle node computes this combined weight, and the node with the highest weight is selected as CH.

Weight = c1 · StabVi + c2 · QoSVi     (2)

where c1 + c2 = 1. The QoS and Stab values are computed according to formulas 3 and 4, respectively:

QoSVj = (q1 · BWj + q2 · LETj) / (q3 · Delayj)     (3)

StabVi = (s1 · Degj) / (s2 · RelMobj + s3 · Distj)     (4)

where q1 + q2 + q3 = 1 and s1 + s2 + s3 = 1. The coefficients c1, c2, q1, q2, q3, s1, s2, and s3 are weighting factors that indicate the level of importance given to the quality of service or stability components. The choice of the parameters q1, q2, q3, s1, s2 and s3 is based on the work done in [6] and [3], such that q1 = 0.2, q2 = 0.3, q3 = 0.5, s1 = 0.1, s2 = 0.6 and s3 = 0.3. The factors c1 and c2 are chosen during the simulation phase in order to achieve a compromise between Stab and QoS, taking into consideration the network requirements within a given period of time: one may favor stability over QoS depending on the type of data, or the other way around, or choose equal values for the two factors.
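The combined weight is simple enough to sketch directly in code. The following Python fragment mirrors formulas 2–4 with the weighting factors fixed above; the metric inputs (BW, LET, Delay, Deg, RelMob, Dist) are assumed to be the per-vehicle averages already computed from beacon exchanges. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
# Minimal sketch of the combined CH-election weight (formulas 2-4).
# Metric inputs are assumed to be per-vehicle averages derived from
# beacon information; c1/c2 are left as parameters (see Sect. 4).

Q1, Q2, Q3 = 0.2, 0.3, 0.5          # QoS weighting factors, from [6]
S1, S2, S3 = 0.1, 0.6, 0.3          # stability weighting factors, from [3]

def qos_value(bw, let, delay):
    """Formula 3: higher bandwidth and link expiry time, lower delay."""
    return (Q1 * bw + Q2 * let) / (Q3 * delay)

def stab_value(deg, rel_mob, dist):
    """Formula 4: higher connectivity, lower relative mobility/distance."""
    return (S1 * deg) / (S2 * rel_mob + S3 * dist)

def combined_weight(bw, let, delay, deg, rel_mob, dist, c1=0.5, c2=0.5):
    """Formula 2: the vehicle with the highest weight becomes CH."""
    assert abs(c1 + c2 - 1.0) < 1e-9
    return c1 * stab_value(deg, rel_mob, dist) + c2 * qos_value(bw, let, delay)
```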

3.4 QoS-Based Clustering Protocol "QoS-Cluster"

Our clustering algorithm can be divided into three main phases.


Initial Phase: This phase covers the period when no clusters have yet been set up in the network. Each vehicle node begins in the initial FN state and periodically broadcasts a Hello (beacon) message, every beacon interval BI, to its one-hop neighbors, to inform them of its presence and exchange the information needed for the CH election (id, speed, position, direction, ...). After receiving a beacon, each vehicle updates its neighborhood table, adding the sending vehicle if it is not already present or updating its information, and recomputes the quality of service parameters of the radio links as well as the relative speed and distance to its neighbors. The beacon message includes the vehicle ID, current status, current position, current speed, driving direction, combined weight value, CH ID, hop level in the cluster and a timestamp.

Clustering Phase: This is the process of forming the k-hop cluster structure, based on the merger of clusters. It consists of two steps:

– Formation of one-hop clusters: Each vehicle broadcasts beacon messages to its one-hop neighbors every period T, containing its weight calculated using the mobility and quality of service information collected during the initial phase. When the CD (Collection Duration) timer expires, each node compares the weight values of all its neighbors, and the vehicle with the highest weight declares itself CH and sends a HeadMsg message, containing its ID and weight, to its one-hop neighbors. When an FN vehicle receives a HeadMsg message, it changes its status to CM to assume its role as a cluster member and updates the ID of its CH, which is initially NULL. It then sends an AckCm acknowledgment to its CH so as to be included in its membership table. If the vehicle receives several HeadMsg messages from its neighbors, it chooses the CH with the highest weight and records the IDs of all the CHs that invited it to join their clusters, ordered by weight. If the vehicle receiving the HeadMsg message is already in the CM state, it compares the received weight with that of its current CH and joins the new cluster if the latter is better, by sending it an AckCm; a QuitCm message is then sent to the old CH. On receiving an AckCm or a QuitCm, the CH respectively adds the sending vehicle to, or deletes it from, its membership list. If the weight of the current CH is still better, the CM only records the ID and weight of the CH that sent the HeadMsg in its Heads table and proposes itself as a relay between the two neighboring clusters.

– Merging clusters and selecting relays (gateways): When the member nodes of a cluster are declared as CM, they broadcast an INVIT message to their one-hop neighbors that are not members of the same cluster, in order to invite other neighboring clusters to merge with their cluster. The INVIT message contains the CH ID and the weight of the CM that sends it. If the node receiving the INVIT message is a CH, it first checks whether it has already received an INVIT from the same cluster; if so, it records the invitation with the ID and weight of the sending CM and the ID of its CH, otherwise it keeps only the best inviting CM. If the vehicle receiving the INVIT message is a CM, it proposes itself as a relay between the two clusters. When the FD (Fusion Duration) timer expires, the CH chooses the best invitation (best inviting CM weight) and replies with an INVIT-Replay message. When the inviting CM receives the INVIT-Replay, it forwards it to its CH, which responds with a positive AckFusion if the sum of the hop counts of the two clusters does not exceed the maximum hop count of the network, or with a negative AckFusion otherwise. The chronology of the launch of each timer (BI, T, CD, and FD) is shown in Fig. 2. In the case of a positive AckFusion, the clusters are merged: the CH becomes a CM, together with its members, after sending its membership table to the new CH in an AckCm message. This is done by broadcasting a Fusion message to its members, containing the ID and weight of the new CH as well as the hop count of the new cluster, in order to inform them of the merger. Each member checks the MAX hop count before executing the merging process with clusters in its neighborhood. A node receiving a negative AckFusion chooses another invitation among those recorded, if any, and repeats the merging process; otherwise it waits for a new invitation. Upon receipt of the AckCm, the new CH updates the hop count of its cluster as well as its membership table. A vehicle can propose itself as a CG by sending a CGPropose message to the two neighboring CHs, containing the ID of its CH and its weight. A CH confirms the role of the node as CG with a positive CGSelect. If a CH receives multiple CGPropose messages for the same cluster, the node with the best weight is selected. On receiving a positive CGSelect message from both CHs, the node changes its status to CG and begins its bridging role between the neighboring clusters; a vehicle can assume the CG role for several clusters at the same time if no other proposal is received for the neighboring clusters. As soon as another CG is selected, a negative CGSelect is sent to the old CG, which directly leaves its role and becomes an ordinary CM, as shown in Fig. 1. Here, every node j computes its weight parameters from the values stored in its neighbor list, because these parameters are carried in the periodic beacon messages.
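To make the one-hop formation step concrete, the sketch below shows the decision a vehicle takes when its CD timer expires and how it handles HeadMsg invitations. The class, the `send` callback and the data structures are illustrative names of ours, not part of the protocol specification.

```python
# Illustrative rendering of the one-hop formation step (our sketch, not
# the authors' implementation): when the CD timer expires, a vehicle
# declares itself CH if its combined weight beats all one-hop neighbors;
# HeadMsg/AckCm/QuitCm handling follows the description above.

FN, CH, CM = "FN", "CH", "CM"

class Vehicle:
    def __init__(self, vid, weight):
        self.vid, self.weight = vid, weight
        self.state, self.ch_id = FN, None
        self.ch_weight = float("-inf")
        self.neighbors = {}          # one-hop neighbor id -> combined weight
        self.heads = []              # (weight, id) of CHs that invited us

    def on_cd_timer(self, send):
        # Highest weight in the one-hop neighborhood -> declare CH.
        if all(self.weight >= w for w in self.neighbors.values()):
            self.state = CH
            send("HeadMsg", src=self.vid, weight=self.weight)

    def on_head_msg(self, src, weight, send):
        self.heads.append((weight, src))     # kept for failure recovery
        self.heads.sort(reverse=True)
        if self.state == FN or (self.state == CM and weight > self.ch_weight):
            if self.state == CM:
                send("QuitCm", dst=self.ch_id)   # leave the weaker cluster
            self.state, self.ch_id, self.ch_weight = CM, src, weight
            send("AckCm", dst=src)               # join the better cluster
```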

Fig. 1. Clustering formation process.


Maintenance Phase: Once clusters and CHs have been selected, three cluster state transitions are possible for a vehicle (Fig. 1):
- An FN joins an existing cluster to become CM: at this point, the neighboring nodes broadcast beacon messages periodically. If the FN hears the beacon message of a neighboring CH, it sends it a JOIN message asking to join its cluster. If the FN receives multiple AckCh replies from different CHs, the one with the best combined weight is chosen. If no cluster is detected in the vicinity of the FN, the latter waits for the launch of the cluster formation process with neighboring FNs, if any; otherwise, it declares itself an ICH. This can reduce the number of CHs in the network while increasing their stability.
- A CM leaves the cluster: if the CH does not receive a beacon from a specific CM, this CM is considered to have left the current cluster and is deleted from the list of members. If the deleted CM is not located at the first hop level, the CMs of the lower hop level broadcast INVIT messages in their vicinity, and those at the higher hop level select a new CH if no other neighboring CM is available, since they can no longer reach their CH.
- The CH leaves the current cluster: if a CM has not received a beacon from its CH for a period of time, this CH is assumed to have left the current cluster. The CM joins one of the other clusters recorded during the cluster formation phase, if any, by sending a JOIN message; otherwise it keeps the FN status until the next round of CH elections in the neighborhood. A CH may also lose its role if another CH is elected in the neighborhood after an update, or when it loses the connection with its neighbors. Thus, if a CH vehicle receives a HeadMsg message from a neighbor, meaning that the neighbor's weight is better, it joins it directly, leaving its own cluster and broadcasting a QuitCH to its members if the two clusters cannot be merged; otherwise, the merging process described above is triggered by sending an INVIT-Replay message to the new CH. Once they receive a QuitCh message, the cluster members select the best CH recorded in their Heads table, if it exists, by sending it a JOIN message; otherwise they remain in the FN status until the next CH election process. Upon receipt of a JOIN, the CH adds the vehicle to its list of members and acknowledges it. The nodes are synchronized by handling self-messages that each node sends to itself when its time intervals expire. Figure 2 shows the launch chronology of the QoS-Cluster timers.

Fig. 2. Chronology of the launch of the timers of QoS-Cluster.

4 Performance Evaluation

In order to study the impact of our contribution and evaluate the performance of our protocol, we ran several scenarios, varying some simulation parameters and metrics such as: the maximum speed, the number of vehicles in the network, the maintenance interval T (the period after which the cluster head election process is restarted), the number of hops (the maximum distance between a CM and its CH) and, finally, the weight factors (c1 and c2) used to calculate the combined weight. The objective of this variation is to observe the different behaviors of vehicles under this protocol. We chose the multi-hop DMCNF protocol [1] for the performance comparison, in which relative mobility is calculated from the variation of the packet delay of two consecutive messages.

4.1 Simulation Parameters

In this section, we evaluate the behavior of our proposed protocol and of the DMCNF [1] comparison protocol according to the maintenance period (T), which represents the interval between two launches of the CH election process. We fix the CD collection and FD fusion durations at 10 s and 5 s respectively and vary the maintenance interval (T = 40 s, 60 s). In order to measure some clustering metrics, we also vary the maximum speed (10, 20, 30 m/s), fixing the number of vehicles at 100, the weighting factors c1 and c2 at 0.5, and the CD and FD durations at 10 s and 5 s respectively. The main configuration parameters are shown in Table 2.

Table 2. Simulation parameters.

Parameters                 Values
Simulation time            200 s
Max speed (velocity)       10, 20 and 30 m/s
Propagation model          SimplePathloss
Scenario                   Highway
Length of highway          3 km
Packet size                512 bits
Max hop                    3
Beacon interval (BI)       1 s
Collection duration (CD)   10 s
Fusion duration (FD)       5 s
Maintain interval (T)      40, 60 s
Maximal emission power     10 mW

4.2 Measured Performance Metrics

The performance of the clustering mechanism is evaluated using OMNeT++, Veins, and SUMO. We measure the stability of the clusters through the ClusterHead duration and the cluster member duration. Another necessary evaluation metric is the overhead, also called communication overhead.

Average Duration of CH: it is the period of time during which a vehicle keeps its CH status, from its election until it loses the role. Figure 3 illustrates the variation of the ClusterHead lifetime as a function of the maximum speed for two maintenance interval values T (T = 40 and T = 60), with the maximum hop count MaxHop = 3.

Fig. 3. The average duration of ClusterHead: MaxHop = 3.

Average CM Duration: it is the time interval between the moment a vehicle joins a specific cluster and the moment it leaves that cluster. The lifetime of a cluster is a necessary performance parameter to evaluate the efficiency of the DMCNF and QoS-Cluster protocols. As shown in Fig. 4, the average cluster member duration increases with increasing values of T, as a function of the maximum speed.


Fig. 4. The average duration of cluster member: MaxHop = 3.

Fig. 5. Overhead of DMCNF and QoS-Cluster protocols: MaxHop = 3.

Overhead: We consider the communication cost of cluster formation and CH selection as the communication overhead; it is the amount of information (in bits) circulating in the network per unit of time. As a result, the overhead accounts for all control messages generated by each vehicle throughout the network. Figure 5 presents the overhead of the QoS-Cluster and DMCNF [1] protocols. We observe that our protocol generates less overhead than the DMCNF protocol. First, the cluster stability of our protocol is higher, because the CH and CM durations are longer. Another reason is that the DMCNF protocol calculates relative mobility based on the variation of the inter-packet delay of two consecutive messages, which requires additional "Hello" messages between neighbors on top of the periodic ones, whereas QoS-Cluster uses the relative speed between neighbors to calculate relative mobility; this requires only a single beacon message sent periodically for updates.

5 Conclusion and Perspective

We proposed a clustering protocol that maintains the stability of the vehicular network while meeting quality of service requirements. This protocol is based on three fundamental phases: initialization, one-hop cluster formation, and cluster fusion. To measure the performance of our protocol, several simulations were performed, varying the mobility scenarios and different metrics, using the OMNeT++ simulation tool. The obtained results show the effectiveness of our solution compared to DMCNF in terms of ClusterHead lifetime, Cluster Member lifetime and overhead, which confirms that our solution is well suited to road applications. As possible perspectives to improve our solution in the short and long term, we expect:
– The integration of a reliability mechanism to take into account the different network and road scenarios in a smart city.
– The improvement of the interval calculation functions, by performing more tests for the computation of the weighting factors.
– An analytical study of the proposed protocol.
– The adaptation of the weighting factors to road applications such as safety applications.

References
1. Chen, Y., Fang, M., Shi, S., Guo, W., Zheng, X.: Distributed multi-hop clustering algorithm for VANETs based on neighborhood follow. EURASIP J. Wirel. Commun. Netw. 2015(1), 98 (2015)
2. Singh, P., Pal, R., Gupta, N.: Clustering based single-hop and multi-hop message dissemination evaluation under varying data rate in vehicular ad-hoc network. In: Choudhary, R.K., Mandal, J.K., Auluck, N., Nagarajaram, H.A. (eds.) Advanced Computing and Communication Technologies. AISC, vol. 452, pp. 359–367. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-1023-1_36
3. Jin, D., Shi, F., Song, J.: Cluster based emergency message dissemination scheme for vehicular ad hoc networks. In: Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, p. 2. ACM (2015)
4. Zhang, Z., Boukerche, A., Pazzi, R.: A novel multi-hop clustering scheme for vehicular ad-hoc networks. In: Proceedings of the 9th ACM International Symposium on Mobility Management and Wireless Access, pp. 19–26. ACM (2011)


5. Kaisser, F., Johnen, C., Vèque, V.: Étude de la formation de convois dans un réseau de véhicules sur autoroute (2011)
6. Fekair, M.E.A., Lakas, A., Korichi, A.: CBQoS-Vanet: cluster-based artificial bee colony algorithm for QoS routing protocol in VANET. In: 2016 International Conference on Selected Topics in Mobile & Wireless Networking (MoWNeT), pp. 1–8. IEEE (2016)
7. Abbas, F., Fan, P.: Clustering-based reliable low-latency routing scheme using ACO method for vehicular networks. Veh. Commun. 12, 66–74 (2018)
8. Kadadha, M., Otrok, H., Barada, H., Al-Qutayri, M., Al-Hammadi, Y.: A cluster-based QoS-OLSR protocol for urban vehicular ad hoc networks. In: 2018 14th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 554–559. IEEE (2018)
9. Zhang, L., El-Sayed, H.: A novel cluster-based protocol for topology discovery in vehicular ad hoc network. Procedia Comput. Sci. 10, 525–534 (2012)
10. Ucar, S., Ergen, S.C., Ozkasap, O.: VMaSC: vehicular multi-hop algorithm for stable clustering in vehicular ad hoc networks. In: 2013 IEEE Wireless Communications and Networking Conference (WCNC), pp. 2381–2386. IEEE (2013)
11. Aissa, M., Arafah, B., Henchiri, M.: Safe clustering algorithm in vehicular ad-hoc networks (2014)

Searchable Attribute-Based Encryption Protocol with Hidden Keywords in Cloud

Fang Qi1, Xing Chang1, Zhe Tang1(B), and Wenbo Wang2(B)

1 School of Computer Science and Engineering, Central South University, Changsha 410083, China
[email protected]
2 Department of Computing and Software, McMaster University, Hamilton L8R 3G8, Canada
[email protected]

Abstract. With the continuous development of mobile devices, the emergence of 5G networks and the wide deployment of cloud computing applications, the computing and storage performance of mobile devices has been greatly improved. Social network applications are platforms for communication between users: although their forms are diverse, they focus on connecting different users, enabling them to communicate and interact conveniently to meet their social needs. However, the rise of mobile social networks still faces many challenges, including information security, privacy preserving and access control. Previous work relied on resource-consuming traditional cryptographic methods, and with massive data the profile matching process is inefficient. Aiming at solving these problems, a searchable encryption scheme with hidden keywords and fine-grained access control is proposed in this paper. Profile owners can design a flexible access policy on their personal profiles. With the hidden keywords, the efficiency of profile matching is largely increased. Security analysis shows that the proposed scheme can prevent the leakage of both the private information and the hidden keywords. A detailed performance analysis demonstrates its efficiency and practicability.

Keywords: CP-ABE · Hidden keywords search · Privacy preserving · Access control · Friend discovery

1 Introduction

This era has witnessed the continuous development of mobile intelligent devices, which enrich life and stimulate the emergence of mobile social networks. There are many social applications in mobile social networks, such as Douban, WeChat, QQ, Weibo, Facebook, etc. Taking Douban as an example, it allows users to create a label set to identify who they are. At the same time, users with certain tags such as hobby, location, character, etc. can be found for communicating or making friends. Matching profiles with the same hobbies and experiences is

a common approach in friend discovery. Nevertheless, unresolved security and privacy issues hinder its reliability and popularity [1]. In recent years, a number of schemes on privacy preserving in mobile social networks have been proposed to reduce the risk of privacy leakage. For example, Luo et al. [7] propose a matching scheme under different authorities and realize cross-domain data access and sharing. Li et al. [3] propose a private matching scheme, namely FindU, based on common interests and without reliance on a trusted third party (TTP), but this scheme fails to provide fine-grained access control. The authors of [6] provide a keyword searchable attribute-based encryption scheme with attribute update for cloud storage, which combines an attribute-based encryption scheme with a keyword searchable encryption scheme. Qi et al. [4] employed an asymmetric scalar-product construction based on KNN query, but the representation of interests is too simple to obtain an accurate result. Li et al. [2] propose a scheme based on traditional attribute-based encryption that realizes distributed access control and supports a complete fine-grained revocation mechanism. This paper proposes a secure attribute-based encryption system with hidden keywords and a cloud-assisted friend discovery scheme, to achieve fine-grained access control and privacy protection. Attribute-based encryption falls into two categories, namely ciphertext-policy attribute-based encryption (CP-ABE) and key-policy attribute-based encryption (KP-ABE). CP-ABE [5] was first proposed in [8]. Compared to KP-ABE [9], it can not only keep the encrypted information confidential even if the cloud storage server is not entirely credible, but also allow users to customize the access structure. In this way, CP-ABE effectively achieves fine-grained access control. By employing the powerful storage and computation ability of the cloud server, it greatly reduces the computing and storage cost on the client. Traditional friend discovery schemes find potential friends by comparing the access policy and the attribute universe one by one, which reduces efficiency and causes excessive energy consumption in large-scale mobile networks. Differently, the proposed scheme searches by the hidden keywords, thus reducing the work pressure on the cloud server and preventing the leakage of private information. The main contributions are outlined as follows.
– Flexible key management. In our scheme, the secret key is related to a user's personal attributes, which means each user has an independent secret key. The risk of key leakage is thus avoided.
– Fine-grained access control. Before the profile owner uploads files, he/she designs a specific access policy for each profile; the access policy is embedded in the personal profile, and only qualified users can decrypt the corresponding ciphertext.
– Hidden keywords. The indexing keywords submitted by users are in the form of a trapdoor, and the keywords in the profile are also encrypted, so the TTP is unable to eavesdrop on any plaintext. We use the keywords to be queried as a trapdoor, and if a profile's keywords satisfy the trapdoor, the profile is matched.


– Good Performance. We have conducted an extensive analysis showing that our scheme is secure against collusion attacks and protects query privacy in the standard model. In particular, our scheme significantly reduces the matching query time and the computation overhead.

The rest of this paper is structured as follows. In Sect. 2, we describe the preliminaries. In Sect. 3, we describe our system architecture and models. Section 4 describes the detailed scheme of this paper. The formal security proof and the performance evaluations of our scheme are given in Sects. 5 and 6, respectively. The last section concludes this paper.

2 Preliminaries

In this section, we briefly present some background information, which is the basis of the proposed scheme.

2.1 Bilinear Mapping

Suppose p is a prime number, G and GT are cyclic groups of order p, and g is a generator of G. Suppose the discrete logarithm problem is hard in G and GT. A bilinear mapping e : G × G → GT has the following characteristics:
1. Bilinear: ∀x, y ∈ G and a, b ∈ Zp, e(x^a, y^b) = e(x, y)^(ab).
2. Non-degenerate: for the generator g ∈ G, e(g, g) ≠ 1.
3. Computable: ∀u, v ∈ G, there exists an efficient algorithm to compute e(u, v).

2.2 Access Structure and Access Tree

Access structures are used in the study of security systems where multiple parties need to cooperate to access a resource. Groups of parties that are granted access are called authorized. Let {P1, P2, ..., Pn} be a set of attributes. A collection A ⊆ 2^{P1,P2,...,Pn} is monotone if ∀B, C: if B ∈ A and B ⊆ C, then C ∈ A. An access structure is a collection A of non-empty subsets of {P1, P2, ..., Pn}, i.e., A ⊆ 2^{P1,P2,...,Pn}\{∅}. The sets in A are called the authorized sets, and the sets not in A are called the unauthorized sets [10,11].

Let T be a tree with root r representing an access structure. Each non-leaf node of the tree is a threshold gate, described by its children and a threshold value. If numx is the number of children of a node x, its threshold value kx satisfies 1 ≤ kx ≤ numx. When kx = 1, the threshold gate is an OR gate; when kx = numx, it is an AND gate. Each leaf node x of the tree represents an attribute attri and has a threshold value kx = 1. We denote the parent of node x by parent(x). The access tree T defines an ordering between the children of every node: the children of a node are numbered from 1 to num, and index(x) denotes the label of x.

Let Tx be the subtree of T rooted at node x. If a set of attributes S satisfies Tx, we denote it as Tx(S) = 1. Tx(S) is computed as follows: if x is a non-leaf node, evaluate Tx'(S) for all children x' of node x; Tx(S) returns 1 if and only if at least kx children return 1. If x is a leaf node, Tx(S) returns 1 if and only if attr(x) ∈ S. Thus, by this recursive computation, a set S satisfies T when Tr(S) = 1, where r is the root node of T. Figure 1 shows an example of an access tree. According to the example, three kinds of people can satisfy the access tree: first, a woman not more than 30 years old; second, a man who likes to play basketball; third, a doctor. A person who is a woman but 40 years old cannot pass the verification.
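The recursive evaluation of Tx(S) translates directly into code. Below is a minimal Python rendering of the threshold-gate semantics just described; the node layout (`threshold`, `children`, `attr`) and the concrete tree are our own assumptions, with the tree inferred from the textual description of the Fig. 1 example.

```python
# Recursive evaluation of an access tree against an attribute set S.
# A leaf carries an attribute; an inner node is a k-of-n threshold gate
# (k = 1 acts as OR, k = number of children acts as AND).

class Node:
    def __init__(self, threshold=1, children=None, attr=None):
        self.threshold = threshold          # k_x
        self.children = children or []      # ordered children, 1..num_x
        self.attr = attr                    # set only on leaves

def satisfies(node, S):
    """Return True iff T_x(S) = 1 for the subtree rooted at `node`."""
    if node.attr is not None:               # leaf: attr(x) must be in S
        return node.attr in S
    ok = sum(satisfies(child, S) for child in node.children)
    return ok >= node.threshold             # at least k_x children return 1

# Tree inferred from the Fig. 1 example:
# (woman AND age<=30) OR (man AND basketball) OR doctor
tree = Node(1, [
    Node(2, [Node(attr="woman"), Node(attr="age<=30")]),
    Node(2, [Node(attr="man"), Node(attr="basketball")]),
    Node(attr="doctor"),
])
assert satisfies(tree, {"woman", "age<=30"})
assert not satisfies(tree, {"woman"})       # a 40-year-old woman fails
```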

Fig. 1. Example of Access Tree

2.3 Bilinear Diffie-Hellman (BDH) Assumption

The BDH problem in G is defined as follows: given (g, g^a, g^b, g^c) ∈ G as input, compute e(g, g)^(abc) ∈ GT. We say that an adversary A has advantage ε in solving the BDH problem in G if

Pr[A(g, g^a, g^b, g^c) = e(g, g)^(abc)] ≥ ε     (1)

We say that the BDH assumption holds in G if no probabilistic polynomial-time adversary A has a non-negligible advantage in solving the BDH problem in G.

3 System Architecture and Model

3.1 System Architecture

As shown in Fig. 2, the architecture of the friend discovery system mainly contains the following four components:


1. Trusted authority (TA). In this scheme, TA is a trusted third party which honestly obeys the protocol of the scheme and is responsible for revocation, parameter generation, and secret key generation, distribution and management;
2. Cloud server. The cloud server is honest-but-curious [12]: it follows the protocol but tries to get more information than allowed. It stores huge volumes of personal profiles and performs the efficient retrieval and attribute matching process that realizes fine-grained access control for privacy preserving;
3. Profile owner. Before uploading, the profile owner specifically designs the access policy for each profile. The access policy is embedded in the profile, and the encryption process is controlled by the access policy;
4. Visitor. The visitor gets the secret key from TA according to his/her personal attribute universe. A visitor can send a retrieve request, and if his/her attributes satisfy the access policy, he/she can visit the profile.

Fig. 2. Architecture of friend discovery system

3.2 System Definition

The proposed scheme consists of the following algorithms:

SysInit: On input 1^λ, where λ is the security parameter, TA outputs the global public parameter PK and the master key MK.

KeyGen: TA inputs PK, MK, the user's identifier Ui and attribute universe ΨUi, then generates the secret key skUi and public key pkUi. TA transmits skUi to Ui, who keeps it secret. pkUi is stored in the cloud server.


Encrypt: Profile owner Ui inputs the plaintext M, access policy A, the public key pkUi and keyword set W = {w1, w2, ..., wn}. The algorithm returns the ciphertext CT and secret index I(W). Finally, Ui transmits CT, I(W), A to the cloud server.

Trapdoor: If visitor Uv wants to search for information containing the keyword set W, he/she first needs to generate the corresponding trapdoor. Visitor Uv inputs the secret key skUv and the keyword collection W on the mobile terminal; the output trapdoor is TDw.

Retrieve: When the cloud server receives the retrieve request, it searches for the secret indexes I(W) that satisfy the trapdoor TDw. Then the cloud server verifies whether ΨUv satisfies the access policy in the corresponding ciphertext. Finally, the cloud server transmits the collection of qualified ciphertexts CT*.

Decrypt: Upon receiving the ciphertext collection CT*, and on inputting skUv, the visitor obtains the plaintext collection M*.

Revoke: If the authorization of user Ui has expired or needs to be revoked in advance, TA first generates a revocation certificate RevokeUi. After receiving RevokeUi, the cloud server deletes the corresponding (pkUi, Ui) and adds this certificate to the revocation list. If a user who would like to visit the stored information is in the revocation list, the cloud server rejects the search request.
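The seven algorithms define the interface between TA, profile owner, cloud and visitor. A skeletal Python rendering of that interface is given below purely to fix the data flow; all class and method names are illustrative assumptions of ours, and the cryptographic bodies are those given in Sect. 4.

```python
# Illustrative interface for the seven algorithms (data flow only).

class TrustedAuthority:
    def sys_init(self, sec_param):          # SysInit -> (PK, MK)
        ...
    def key_gen(self, PK, MK, uid, attrs):  # KeyGen -> (sk_U, pk_U)
        ...
    def revoke(self, MK, uid, date):        # Revoke -> Revoke_U certificate
        ...

class ProfileOwner:
    def encrypt(self, M, policy, pk_U, W):  # Encrypt -> (CT, I(W))
        ...

class Visitor:
    def trapdoor(self, sk_U, W):            # Trapdoor -> TD_w
        ...
    def decrypt(self, sk_U, CTs):           # Decrypt -> plaintexts M*
        ...

class CloudServer:
    def retrieve(self, TD_w, revocation_list):
        # Match TD_w against stored indexes I(W), then check the
        # visitor's attributes against each ciphertext's policy.
        ...
```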

3.3 Security Model

The formal security model of our proposed scheme is defined by the following game, run between a challenger C and an adversary A.
Init: The adversary A declares the access structure A' that he wishes to be challenged upon.
Setup: The challenger runs the Setup algorithm and gives the global parameter PK to the adversary A.
Key query phase 1: The adversary A repeatedly queries the private keys corresponding to sets of attributes Ψ1, Ψ2, ..., Ψn, where none of the Ψi satisfies A', for all i, 1 ≤ i ≤ n.
Challenge phase: A submits two equal-length messages m0, m1. The challenger flips a random coin b ∈ {0, 1} and encrypts mb under A'. The ciphertext CT' is given to A.
Key query phase 2: Key query phase 1 is repeated.
Guess: The adversary A outputs a guess b' of b.
The advantage with which adversary A wins this game is Adv(λ) = |Pr[b' = b] − 1/2|. The proposed scheme is secure if, for any polynomial-time adversary, the advantage Adv(λ) is negligible.

3.4 Adversary Model and Design Goal

In the profile matching process, the following adversary model is usually considered: in the honest-but-curious (HBC) model [12], an attacker honestly follows the protocol but tries to get more information from the received messages than allowed. In this paper, we suppose all the authorities and users are honest-but-curious.


The main goal, as well as the great challenge, of our scheme is to conduct efficient matching while resisting the chosen-plaintext attack and the collusion attack.

4 Proposed Scheme

In this section, we depict the proposed scheme in detail. Compared to traditional CP-ABE, our scheme achieves hidden keyword search, thereby sharply reducing the computation cost and communication overhead, especially in the cloud environment. It mainly consists of the following phases: system initialization, key generation, information encryption, trapdoor generation and retrieve, decryption and revocation. Suppose Ψ = {Ψ1, Ψ2, ..., Ψm} is the collection of attributes in this scheme, the number of attributes is m, and H : {0, 1}* → G is a hash function.

4.1 System Initialization

For a friend discovery system, we assume the trusted authority guides the whole system. With the security parameter 1^λ and Ψ as input, TA outputs the global parameter PK and the master key MK. Suppose G is a bilinear group of order p and g is a generator of G. TA randomly selects m + 1 numbers α, t1, t2, ..., tm ∈ Z*p, then computes y = e(g, g)^α and Ti = g^(ti). Finally, TA secretly keeps the master key

MK = (α, ti)     (2)

and publishes the global parameter

PK = (g, y, Ti, G, H)     (3)

where 1 ≤ i ≤ m.

4.2 Key Generation

TA inputs PK, MK, the user's identifier Ui and the attribute universe ΨUi of Ui, where ΨUi ⊆ Ψ. Then it randomly selects r, βi, η ∈ Z*p and computes d0 = g^(α−r) and d0' = η. For every attribute attri in ΨUi, TA computes di = g^(r·ti^(−1)). Hence, Ui's secret key and public key are, respectively,

skUi = (d0, d0', di | ∀attri ∈ ΨUi)     (4)

pkUi = βi · d0'     (5)

TA transmits skUi to Ui, and Ui keeps skUi secret. pkUi is stored in the cloud server.

4.3 Encryption

Profile owner Ui inputs the plaintext M, the keyword set W = {w1, w2, ..., wn}, the public key pkUi and the access policy A. Each keyword is hashed separately, and the result is a collection of hash values. Suppose the access tree corresponding to A is T*, and the root node is R with value s. attrj,i means that attrj in the access policy A is the i-th attribute in Ψ, and Cj,i is the corresponding ciphertext component of attrj,i. Then TA runs Algorithm 1.

Algorithm 1: Encryption
Input: plaintext M, keyword set W = {w1, w2, ..., wn}, the public key pkUi and access policy A
Output: ciphertext CT, secret index I(W)
1  TA randomly selects s ∈ Z*p and computes C0 = g^s, C1 = M·y^s;
2  According to the threshold gate between each parent and its children in T*, TA does the following operations:
3  if the threshold gate is OR then
4      the value of all children is s;
5  else
6      suppose there are t children; TA randomly selects s1, s2, ..., s(t−1) ∈ Z*p;
7      st = s − Σ(i=1..t−1) si;
8      for j from 1 to t do tj = sj;
9  end
10 for every leaf attrj,i in T* do Cj,i = Tj^(si);
11 The ciphertext CT = (T*, C0, C1, Cj,i | attrj,i ∈ T*);
12 TA randomly chooses γ ∈ Z*p and computes A = g^γ, B = e(H(W), g^(γ·pkUi))^(1/βi);
13 The secret index I(W) = (A, B);
14 return ciphertext CT, secret index I(W).

Finally, Ui transmits CT, I(W), A to the cloud server.
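The core of Algorithm 1 is how the root secret s is pushed down the access tree: copied through OR gates and additively split through AND gates, so that each leaf receives the share used in its ciphertext component. The sketch below executes exactly that recursion; the tuple-based node layout and the example policy are our own illustrative assumptions.

```python
# Sketch of the share distribution in Algorithm 1: the root secret s is
# copied through OR gates and additively split through AND gates, so
# each leaf attribute receives the share s_i used in C_{j,i}.
import random

P = 2**31 - 1                                   # toy group order (prime)

def distribute(node, s, shares):
    """node = ('leaf', attr) | ('or', kids) | ('and', kids)."""
    kind = node[0]
    if kind == "leaf":
        shares[node[1]] = s                     # leaf share -> C_{j,i}
    elif kind == "or":                          # every child gets s itself
        for child in node[1]:
            distribute(child, s, shares)
    else:                                       # AND: additive t-of-t split
        kids = node[1]
        parts = [random.randrange(P) for _ in kids[:-1]]
        parts.append((s - sum(parts)) % P)      # s_t = s - (s_1+...+s_{t-1})
        for child, si in zip(kids, parts):
            distribute(child, si, shares)

shares = {}
policy = ("or", [("and", [("leaf", "student"), ("leaf", "basketball")]),
                 ("leaf", "doctor")])
s = random.randrange(P)
distribute(policy, s, shares)
assert (shares["student"] + shares["basketball"]) % P == s
assert shares["doctor"] == s
```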

4.4 Trapdoor Generation and Retrieve

If visitor Uv wants to search for information containing the keyword set W, he/she first needs to generate the corresponding trapdoor. Uv inputs the secret key skUv and the keyword collection W on the mobile terminal; the mobile terminal selects a random number λ ∈ Z*p. The trapdoor is

TDw = (TDw,1, TDw,2) = (λ, H(w)^(λ/βv))     (6)


Then Uv sends TDw to the cloud to find the satisfying profiles. When receiving TDw, the cloud server searches for the matched profiles, i.e., those for which TDw and I(W) satisfy the formula

Satisfied ← e(TDw,2, A^(pkUv)) = B^(TDw,1)     (7)

Then the cloud server verifies whether ΨUv satisfies the access policy in the corresponding ciphertext. If Σ(attrv,i ∈ ΨUv) sv = s, then the cloud server transmits the collection of qualified ciphertexts CT* = {CT1, CT2, ..., CTn} to the visitor.
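It is worth spelling out why formula (7) fires exactly on the hidden keyword. Expanding both sides with A = g^γ and B = e(H(W), g^(γ·pkUi))^(1/βi) gives (a sketch of the algebra; the final step assumes that TA arranges the ratio pkU/βU, i.e., d0', to be the same value for the users involved, per the generation of pkUi = βi·d0' in Sect. 4.2):

e(TDw,2, A^(pkUv)) = e(H(w)^(λ/βv), g^(γ·pkUv)) = e(H(w), g)^(λγ·pkUv/βv)

B^(TDw,1) = e(H(W), g^(γ·pkUi))^(λ/βi) = e(H(W), g)^(λγ·pkUi/βi)

The two sides coincide exactly when H(w) = H(W) and pkUv/βv = pkUi/βi.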

4.5 Decryption and Revocation

Once receiving the ciphertext collection, the visitor inputs skUv to decrypt. First, we can get

∏(attrv,i ∈ ΨUv) e(Cv,i, dv) = ∏(attrv,i ∈ ΨUv) e(g^(tv·si), g^(r·tv^(−1))) = e(g, g)^(rs)     (8)

Next, we have

e(C0, d0) · e(g, g)^(rs) = e(g^s, g^(α−r)) · e(g, g)^(rs) = e(g, g)^(αs)     (9)

Finally, we can obtain the plaintext by computing

C1 / e(g, g)^(αs) = M · y^s / e(g, g)^(αs) = M     (10)
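To see the blinding-factor cancellation of formulas (8)–(10) actually execute, the toy script below tracks every group element by its exponent only, so multiplication in G or GT becomes modular addition and the pairing e(g^a, g^b) = e(g, g)^(ab) becomes modular multiplication. This representation is cryptographically meaningless (discrete logs are trivial here), and all names are ours, but it is a faithful, runnable rendering of the algebra.

```python
# Toy, runnable check of the algebra in formulas (2)-(5) and (8)-(9).
# Group elements are tracked by their exponents mod P only.
import random

P = 2**31 - 1                                  # toy prime group order
rnd = lambda: random.randrange(1, P)
inv = lambda x: pow(x, -1, P)                  # modular inverse (Py >= 3.8)
pair = lambda a, b: (a * b) % P                # exponent of e(g,g)

# SysInit: MK = (alpha, t_i); T_i = g^{t_i} is represented by t_i.
m = 3
alpha, t = rnd(), [rnd() for _ in range(m)]
T = t[:]

# KeyGen: d_0 = g^{alpha - r}, d_i = g^{r * t_i^{-1}}
r = rnd()
d0 = (alpha - r) % P
d = [(r * inv(ti)) % P for ti in t]

# Encrypt: root secret s split into AND-gate shares; C_0 = g^s,
# C_i = T_i^{s_i}
s = rnd()
shares = [rnd() for _ in range(m - 1)]
shares.append((s - sum(shares)) % P)           # s_m = s - sum(others)
C0 = s
C = [(T[i] * shares[i]) % P for i in range(m)]

# Decrypt: product over i of e(C_i, d_i) = e(g,g)^{rs} (formula 8), then
# e(C_0, d_0) * e(g,g)^{rs} = e(g,g)^{alpha*s} (formula 9). A product in
# GT is a sum of exponents in this representation.
rs = sum(pair(C[i], d[i]) for i in range(m)) % P
assert rs == (r * s) % P
assert (pair(C0, d0) + rs) % P == (alpha * s) % P
print("decryption algebra checks out")
```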

If the authorization of user Ui has expired or needs to be revoked in advance, TA first generates a revocation certificate:

RevokeUi = {Ui, Date, SigMK(Ui, Date)}     (11)

Here Ui is the user's identifier, Date is the revocation time, and SigMK(Ui, Date) is the signature of the revocation information based on MK. TA transmits RevokeUi to the cloud. When receiving RevokeUi, the cloud server deletes the corresponding (pkUi, Ui) and adds this certificate to the revocation list. If a user who wants to visit the stored information is in the revocation list, the cloud server rejects the search request.

5 Security Analysis

5.1 Profile Security Analysis

This section demonstrates that our scheme achieves privacy protection. Suppose there exist an adversary A and a challenger C.
Definition 1. Our proposed scheme achieves privacy preserving if all polynomial-time adversaries have at most a negligible advantage in the security game of Sect. 3.3.


Lemma 1. The proposed scheme protects against the chosen-plaintext attack.
Proof. Suppose the adversary A can break our proposed scheme with advantage AdvA; then the challenger C can break the underlying CP-ABE scheme with an advantage AdvC equal to AdvA.
Setup: the searchable CP-ABE scheme gives C the global parameters PK = (g, y, Ti, G, H). C randomly selects α, t1, t2, ..., tm ∈ Zp, gives A the public parameters PK' = (g, y, Ti, G, H), and keeps MK' = (α, Ti).
Key query phase 1: A submits ΨUi to the random oracle KeyGen, and C submits ΨUi to the proposed scheme, obtaining skUi = (d0, d0', di | ∀attri ∈ Ψi) and pkUi = βi·d0'. Then C sends skUi and pkUi to A.
Challenge phase: The adversary A gives C the access policy A' and two messages m0, m1 of the same length. Then C submits (A', m0, m1) to the searchable CP-ABE scheme and gets the ciphertext CT' = (A', C0, C1, Cj,i | attrj,i ∈ A'). Note that the above operations are subject to the restriction that ΨUi cannot satisfy the access policy A'.
Key query phase 2: A once again repeats the operations of key query phase 1.
Guess: The adversary A outputs a guess b' of b, and C submits b' to the searchable CP-ABE scheme.
From the above analysis, it is obvious that the distributions of parameters, keys and ciphertexts are the same as in the real scheme. Hence, we get AdvC = AdvA.

5.2 Query Privacy Analysis

When a visitor searches for profiles, the system first generates the trapdoor TDw according to the keyword set W. TDw is produced using the hash function H : {0, 1}* → G. Due to the one-way property of the hash function, even though TDw is submitted to the cloud server, the server cannot invert TDw to recover the keyword set W. Hence, the privacy of the query phase is preserved. Moreover, the random number λ ∈ Z*p efficiently resists replay attacks.

5.3 Against Collusion

The biggest threat to the searchable ABE protocol with hidden keywords in clouds is the colluding-user attack. The secret sharing value s in this scheme is embedded in the ciphertext, and users or conspirators need to recover e(g, g)^(γs) to decrypt the ciphertext. Each user's key is uniquely generated from the system master key MK and a random value γ. Each user has a different value γ, and for any given combination of secret keys, the value of γ cannot be factored out. Therefore, users cannot recover e(g, g)^(γs) or M even if they conspire.

6 Performance Analysis

In this section, we evaluate the proposed scheme by comparing it with several existing works in terms of efficiency and practicability.

6.1 Storage Overhead

Regarding storage overhead, compared with other methods, this construction reduces the storage cost of the system. Storage overhead refers to the size of the secret key. N, S and |G| respectively represent the number of attributes in the access structure, the number of attributes owned by the user, and the length of an element of the group G. The storage overhead of the proposed scheme is nearly half that of the others; in addition, the attribute revocation function is supported, as shown in Table 1.

Table 1. Comparison of properties.

Scheme      Attribute revocation  Access structure  Secret-key size
Wang [6]    Supported             LSSS              (S + 2)|G|
Zhang [13]  Non-supported         LSSS              (2S + 2)|G|
Ma [14]     Non-supported         LSSS              (2S + 1)|G|
Zheng [15]  Non-supported         Tree              (2S + 1)|G|
Ours        Supported             Tree              (S + 2)|G|

6.2 Computational Overhead

In the comparison of computational overhead, this scheme reduces the computation cost of KeyGen and Decryption. In Table 2, E represents an exponential operation. Compared to the schemes of Zhang et al. [13], Wang et al. [6], Ma et al. [14] and Zheng et al. [15], our KeyGen algorithm reduces the number of exponential operations by nearly 50%. The overall efficiency of this scheme is also higher than that of the other schemes, indicating improved performance.

Table 2. Comparison of computing cost.

Scheme      KeyGen      Encryption   Pairings in decryption
Wang [6]    (2 + 2S)E   (2 + 3N)E    2 + N
Zhang [13]  (2 + 2S)E   (3 + N)E     2 + 2N
Ma [14]     2SE         (6 + N)E     1 + 2N
Zheng [15]  (2 + 2S)E   (4 + 2N)E    3 + 2N
Ours        (1 + S)E    (3 + N)E     1 + 2N

7

91

Conclusion

In this paper, a searchable CP-ABE friend discovery scheme with hidden keywords is proposed to achieve flexible fine-grained access control and privacy preserving. The detailed security analysis demonstrates that the scheme can resist chosen-plaintext attack in the standard model and performs well in storage and computational cost. It is expected to apply this scheme to the mobile healthcare in future work. Acknowledgments. This work is supported by the National Natural Science Foundation of China under Grant No.61632009, and by the earmarked fund for China Agriculture Research System, and by the Hunan Province Key Research and Development Plan under Grant 2018NK2037, and by the Science and Technology Project of Changsha City under Grant No.kq1701089.

References 1. Zhou, J., Cao, Z., Dong, X., et al.: TR-MABE: white-box traceable and revocable multi-authority attribute-based encryption and its applications to multi-level privacy-preserving e-healthcare cloud computing systems. In: IEEE INFOCOM 2015 - IEEE Conference on Computer Communications, pp. 2398–2406. IEEE (2015) 2. Li, Y., Qi, F., Tang, Z.: Traceable and complete fine-grained revocable multiauthority attribute-based encryption scheme in social network. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 87–92. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1 8 3. Li, M., Cao, N., Yu, S., et al.: FindU: privacy-preserving personal profile matching in mobile social networks. In: 2011 Proceedings IEEE INFOCOM, pp. 2435–2443. IEEE (2011) 4. Qi, F., Wang, W.: Efficient private matching scheme for friend information exchange. In: Wang, G., Zomaya, A., Perez, G.M., Li, K. (eds.) ICA3PP 2015. LNCS, vol. 9530, pp. 492–503. Springer, Cham (2015). https://doi.org/10.1007/ 978-3-319-27137-8 36 5. Rifki, S., Park, Y., Moon, S.: A fully secure ciphertext-policy attribute-based encryption with a tree-based access structure. J. Inf. Sci. Eng. 31(1), 247–265 (2015) 6. Wang, S., Ye, J., Zhang, Y.: A keyword searchable attribute-based encryption scheme with attribute update for cloud storage. PLoS ONE 13(5), e0197318 (2018) 7. Luo, E., Wang, W., Meng, D., Wang, G.: A privacy preserving friend discovery strategy using proxy re-encryption in mobile social networks. In: Wang, G., Ray, I., Alcaraz Calero, J.M., Thampi, S.M. (eds.) SpaCCS 2016. LNCS, vol. 10066, pp. 190–203. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49148-6 17 8. Bethencourt, J., Sahai, A., Waters, B., et al.: Ciphertext-policy attribute-based encryption. In: IEEE Symposium on Security and Privacy, pp. 321–334. IEEE Computer Society, Los Alamitos (2007) 9. Goyal, V., Pandey, O., Sahai, A., et al.: Attribute-based encryption for fine-grained access control of encrypted data. In: Proceedings of the 13th ACM Conference on Computer and Communications Security, pp. 89–98. ACM (2006)

92

F. Qi et al.

10. Ye, J., Zhang, W., Wu, S., et al.: Attribute-based fine-grained access control with user revocation. In: Linawati, Mahendra, M.S., Neuhold, E.J., Tjoa, A.M., You, I. (eds.) Information and Communication Technology - EurAsia Conference. LNCS, pp. 586–595. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3642-55032-4 60 11. Zhu, S., Yang, X., Wu, X.: Secure cloud file system with attribute based encryption. In: International Conference on Intelligent Networking and Collaborative Systems (INCos), pp. 99–102. IEEE (2013) 12. Zhou, J., Cao, Z., Dong, X., et al.: Securing m-healthcare social networks: challenges, countermeasures and future directions. IEEE Wirel. Commun. 20(4), 12–21 (2013) 13. Zhang, M., Du, W., Yang, X., et al.: A fully secure KP-ABE scheme in the standard model. J. Comput. Res. Dev. 52(8), 1893–1991 (2015) 14. Ma, S., Lai, J., Deng, R.H., et al.: Adaptable key-policy attribute-based encryption with time interval. Soft Comput. 21(20), 6191–6200 (2017) 15. Zheng, Q., Xu, S., Ateniese, G.: VABKS: verifiable attribute-based keyword search over outsourced encrypted data. In: IEEE Infocom, pp. 522–530. IEEE (2015)

Dependable and Secure Systems

A Comparative Study of Two Different Spam Detection Methods Haoyu Wang1, Bingze Dai1, and Dequan Yang2(&) 1

2

School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China Network Information Technology Center, Beijing Institute of Technology, Beijing 100081, China [email protected]

Abstract. With the development of the Internet, the problem of spam has become more and more prominent. Attackers can spread viruses through spam or place malicious advertisements, which have seriously interfered with people’s life and internet security. Therefore, it is of great significance to study efficient spam detection methods. Currently using machine learning methods for spam detection has become a mainstream direction. In this paper, the machine learning method of Bayesian linear regression and decision forest regression are used to conduct experiments on a data set from UCI Machine Learning Repository. We use the trained models to predict whether a mail is spam or not, and find better prediction scheme by comparing quantitative results. The experimental results show that the method of decision forest regression can get better performance and is suitable for numerical prediction. Keywords: Bayesian linear regression  Decision forest regression detection  Machine learning  Numerical prediction

 Spam

1 Introduction “Spam” refers to some unsolicited e-mails or text messages that often contain advertisements or trashwares. Spams are sent out through computer network and mobile phone to many different addresses, usually indiscriminately. Twitter spam is usually referred to as the unsolicited tweets that contain malicious links directing victims to external sites with malware downloads, phishing, drug sales, scams, etc. [1]. Spam email is still one of the serious problems that plague Internet communication in the world, and with the continuous development of Online Social Networks (OSNs), such as Facebook, Twitter and Instagram, these social platforms have become a very important part of people’s lives, because people are using these platforms to socialize more and more. This environment where a large number of users are active at the same time has become an ideal working environment for spammers. Therefore, it is very necessary to adopt a more effective detection and filtration method for users. Spam not only seriously wastes network resources, but also takes up users’ valuable time. It also poses a threat to Internet security and directly causes huge economic losses. Spam has seriously plagued the normal mail communication of hundreds of © Springer Nature Singapore Pte Ltd. 2019 G. Wang et al. (Eds.): DependSys 2019, CCIS 1123, pp. 95–105, 2019. https://doi.org/10.1007/978-981-15-1304-6_8


millions of Internet users, has taken up a large amount of the Internet's limited storage, computing and network resources, has reduced the efficiency of network use, and has consumed a large amount of users' processing time. Moreover, spam has gradually become a major way for viruses to spread on the Internet. Faced with the growing problem of spam, more and more technologies are being applied to anti-spam work, so it is of great significance to study efficient spam detection technology. At present, there are many methods for detecting spam. For example, blacklisting is one of the most effective and convenient methods; however, because blacklists lag behind newly emerging spam sources, people are looking for automatically updated or dynamically monitored methods, and the work of Fu et al. [2] moves in this direction. Most current spam detection methods work by examining the text of emails, but Youn et al. [3] and Li et al. [4] have proposed ways to identify spam by detecting image information. In addition, spam detection using machine learning methods is increasingly popular. In this article, we apply Bayesian linear regression and decision forest regression to predict a mail's characteristic value, which determines whether the mail is spam, and compare the results of the two experiments. The rest of the paper is organized as follows: in the second part, we introduce recent research on the machine learning methods of Bayesian classifiers and decision forests. In the third part, we use the Bayesian method and the decision forest method to experiment on the same mail data set. In the fourth part, the results of the two experiments are presented and compared. Finally, we summarize in the fifth part.

2 Related Work

A number of studies on the Bayes classifier have been reported during the last ten years. Rusland, Wahid et al. [5] used the naive Bayes algorithm to test performance across multiple data sets. Their results show that the type of e-mail and the number of instances in a data set affect the performance of the naive Bayes algorithm; they found that, for naive Bayes classifiers, data sets with fewer e-mails and attributes perform better. Wei [6] introduced the concept and process of the naive Bayes classifier and gave two examples; he also noted that although the naive Bayes classifier has proved to be a very efficient classification method, its assumption of independence between attributes (usually words or phrases in e-mail) is a limitation. Yang et al. [7] proposed a new feature selection method, called Bi-Test, which evaluates whether the probability of being classified as spam satisfies a threshold. Using the Naive Bayes (NB) and Support Vector Machine (SVM) classification algorithms on six benchmark spam corpora (PU1, PU2, PU3, PUA, LingSpam, CSDMC2010), Bi-Test was evaluated and compared with four well-known feature selection algorithms (information gain, the χ² statistic, the Gini index, and the Poisson distribution). The experimental results show that when using the naive Bayes classifier, Bi-Test performs significantly better than the χ² statistic and the Poisson distribution, and is equivalent to information gain and the improved Gini


index on the F1 measure; when using the SVM classifier, its performance is comparable to the other methods. Moreover, Bi-Test runs faster than the other four algorithms. To improve the speed of mail classification, Feng et al. [8] proposed a quick online spam classification method based on active and incremental learning, training the classifier according to incremental learning theory. They applied support vector machine (SVM), naive Bayes (NB) and k-nearest neighbor (KNN) classifiers to the Trec2007 and Enron-spam corpora. Their experimental results show that, compared with six typical incremental learning methods based on active learning, the proposed method greatly reduces the time consumed by mail classification while preserving classification accuracy. Gao et al. [9] constructed a privacy-preserving NB classifier that is resistant to the substitution-then-comparison (STC) attack. Without using fully homomorphic encryption, which has a large computational overhead, they proposed a scheme for avoiding information leakage under the STC attack; their key technique involves a "double-blind" mechanism and demonstrates how it can be combined with additively homomorphic encryption and oblivious transfer to protect the privacy of both parties. At the same time, the machine learning method of random forests has been widely used in spam detection. He [10] proposed a random-forest-based method for recognizing product review spam, which repeatedly draws equal numbers of samples from the majority and minority classes (or assigns them equal weights) to build the random forest model; experimental results on an Amazon data set show that recognition based on random forests outperforms other baseline methods. Al-Janabi and Andras [11] conducted a systematic analysis of random forest classification and assessed the impact of key parameters such as the number of trees, tree depth and minimum leaf size on classification performance. Their results show that controlling the complexity of the random forest classifier is of great significance for classifying social media spam. Sun et al. [12] proposed an email filtering model based on category feature selection and a feedback-learning random forest algorithm. Their experimental results show that this method can effectively alleviate the impact of redundant information and noisy data on classification performance, realize self-regulation of the email filtering system, and catch the changing trend of spam in time. Together these studies provide important insights into the Bayesian and random forest approaches to spam detection.

3 Experimental Method

This article uses the Spambase Data Set created by Mark Hopkins et al., from the UCI Machine Learning Repository. This data set extracts some characteristics of spam and quantifies them to build a numeric data set. Two regression algorithms for


numerical prediction are used in this paper to conduct the experiments: the Bayesian regression algorithm and the decision forest algorithm. The two approaches and their advantages are briefly described below.

3.1 Bayesian Linear Regression

When we only have limited data or want to use prior probabilities in the model, Bayesian linear regression can satisfy these needs. The Bayesian linear regression method is special compared with other regression algorithms in that it does not find a single optimal value of the target parameters, but determines the posterior probability distribution of the model parameters. By training the model on the input and output parameters in the dataset, the posterior distribution of a parameter in the model can be obtained:

\[ P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} \tag{1} \]

In this function, P(x|y) is the posterior probability distribution of a model parameter calculated from a pair of given input and output. It is equal to the likelihood of the output P(y|x), multiplied by the prior probability P(x) of the parameter x for the given input, and divided by the normalization constant. This is a simple form of Bayes' theorem, which is the basis of Bayesian inference:

\[ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Normalization}} \tag{2} \]

A linear regression model is a linear combination of basis functions of a set of input variables x; its mathematical expression is

\[ y(x, w) = w_0 + \sum_{j=1}^{M} w_j \phi_j(x) \tag{3} \]

where M is the number of basis functions. If we assume that \( \phi_0(x) = 1 \), then

\[ y(x, w) = \sum_{j=0}^{M} w_j \phi_j(x) = w^{T} \phi(x) \tag{4} \]

where \( w = \{w_0, \ldots, w_M\} \) and \( \phi = \{\phi_0, \ldots, \phi_M\} \). The probability density function of the linear model is then

\[ P(T \mid x, w, \beta) = \prod_{i=1}^{N} \mathcal{N}\left(t_i \mid y(x, w),\, \beta^{-1} I\right) \tag{5} \]

where T is the target data vector, \( T = \{t_1, \ldots, t_N\} \).

Assuming that the prior probability P(w) obeys a Gaussian distribution,

\[ P(w) = \mathcal{N}(w \mid 0,\, \alpha^{-1} I) \tag{6} \]

the posterior probability can be expressed as

\[ P(w \mid X, T) = \frac{P(T \mid w, X)\,P(w)}{P(T \mid X)} \tag{7} \]

After the posterior probability distribution is obtained by training the model, we can acquire the value of the estimated parameter with the maximum posterior probability density, which is ŵ. Based on this estimated parameter, the output for new input data can be estimated. Compared with other typical regression algorithms such as Ordinary Least Squares (OLS) and Maximum Likelihood Estimation (MLE), Bayesian linear regression has three main advantages:

1. Prior distribution: If there are data or reasonable guesses about a domain or about the model parameters, they can be included when using Bayesian linear regression, whereas with OLS all required information about the parameters must be obtained from the data. If there is no prior knowledge, a non-informative prior, such as a normal distribution, can be applied to the parameters. Using this estimation may produce larger errors when the data is small, but as the number of data points increases, the estimates increasingly trend towards the values predicted by OLS.

2. Posterior distribution: The result of Bayesian linear regression is a distribution over model parameters based on the training data and the prior probability. This allows the uncertainty of the model to be quantified: if there are fewer data points, the posterior distribution will be more dispersed. As the number of data points increases, the effect of the prior is reduced, and with infinite data the output parameters converge to the values obtained by the OLS method.

3. Preventing over-fitting: Since maximum likelihood estimation can make the model too complex and produce over-fitting, simply using maximum likelihood estimation is not always effective, whereas Bayesian linear regression can solve the over-fitting problem of maximum likelihood estimation.

Treating the model parameters as a probability distribution reflects the essence of Bayesian theory: starting with an initial estimate and a prior distribution, the model makes fewer mistakes as more data is collected and gets closer to the truth. Bayesian reasoning can also be understood as a natural extension of our intuition: we start with an initial hypothesis, and as data that supports or denies it is collected, our model of the world changes.
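To make the procedure concrete, the following is a minimal sketch of Bayesian linear regression using scikit-learn's BayesianRidge on synthetic data (our illustration; the experiments in this paper used the built-in model of Azure Machine Learning Studio). It places a Gaussian prior on the weights, as in Eq. (6), and returns a posterior predictive mean and standard deviation, mirroring the distributional output described above.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic regression data standing in for the Spambase features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.5, -2.0, 0.0, 0.7, 0.0])
y = X @ w_true + rng.normal(scale=0.3, size=200)

model = BayesianRidge()  # Gaussian prior N(0, alpha^-1 I) on the weights
model.fit(X, y)

# Posterior predictive mean and standard deviation for new inputs:
# the std quantifies the model's uncertainty and shrinks as data grows.
mean, std = model.predict(X[:3], return_std=True)
print(mean, std)
```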

3.2 Decision Forest Regression

The random decision forest regression method generates a new decision model. As the name suggests, a random decision forest establishes a forest model in a random way. The forest model consists of many decision trees, and there is no correlation between any two


decision trees. A decision tree consists of nodes and directed edges. Generally, a decision tree contains one root node, several internal nodes, and several leaf nodes; each node tests an attribute of the objective function it depends on, and the value of the objective function is reached at the leaf nodes through the branches. The decision process starts from the root node of the decision tree: the data to be tested is compared with the feature node, and the next branch is selected according to the comparison result, until a leaf node gives the final decision. Repeating the above process yields a forest with t decision trees. The decision tree algorithm in random decision forest regression recursively constructs each decision tree, using the minimum-error criterion to select features and generate a binary tree [13]. After the forest model is obtained, whenever there is a new input, each decision tree in the forest evaluates the sample and gives a predicted value; the final value for the sample is the average of the predicted values over all decision trees. Figure 1 shows the random decision forest framework. In the process of establishing a decision tree, the samples need to be resampled; sampling with replacement is applied here. Assuming there are N input samples, N samples are drawn. We also assume that the number of input features is M: when splitting at each node of each decision tree, m input features are randomly selected from the M input features, and the best one of these m features is chosen for the split. m does not change during the construction of the decision tree. As a result, each tree is trained on a subset rather than on all of the samples, which has the advantage that the forest does not easily over-fit. Each tree in a regression decision forest outputs a Gaussian distribution as its prediction; an aggregation is performed over the ensemble of trees to find the Gaussian distribution closest to the combined distribution of all trees in the model.
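The mechanism just described can be sketched directly with scikit-learn decision trees (an illustration of ours, not the paper's code): N rows are bootstrap-sampled with replacement for each tree, a random subset of features is considered at each split, and the forest's output is the average of the per-tree predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(float)  # toy 0/1 target, like the Spambase label

t, trees = 8, []
for _ in range(t):
    idx = rng.integers(0, len(X), size=len(X))         # bootstrap: N samples drawn with replacement
    tree = DecisionTreeRegressor(max_features="sqrt")  # random feature subset considered per split
    trees.append(tree.fit(X[idx], y[idx]))

# The forest's prediction is the average of the individual trees' predictions.
forest_pred = np.mean([tree.predict(X[:5]) for tree in trees], axis=0)
print(forest_pred)
```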

Fig. 1. Random decision forest framework


Random decision forests have several advantages:

1. Training can be highly parallelized, runs efficiently on large data sets, and can produce high-accuracy classifiers.
2. They can handle a large number of input variables.
3. While scoring samples, they can output the importance of each feature to the predicted target.
4. When some features are missing, accuracy can still be maintained; the tolerance for feature loss is high.
5. The training process of a random forest is very fast.

3.3 Experimental Process

The experiments in our paper were carried out in Microsoft's Azure Machine Learning Studio, using its existing machine learning models: the Bayesian linear regression model and the random decision forest model. First, we uploaded the spam dataset downloaded from the UCI database to the Studio platform and used it as the first module of the process. Then we split the data set and used 75% of it (3,450 emails) to train the models. The features used for training were all the attributes given in the data set, and the predicted attribute is whether the email is spam or not. The remaining 25% of the data (1,150 emails) was used to test the models after training; the predicted values were finally obtained and compared with the known results. The specific flow charts are shown in Figs. 2 and 3.

Fig. 2. Bayesian linear regression method

Fig. 3. Decision forest regression method


In the random decision forest model, the decision tree node parameters are set as in Table 1.

Table 1. Decision tree node parameters.

Parameter                                   Value
Number of decision trees                    8
Maximum depth of the decision trees         32
Number of random splits per node            128
Minimum number of samples per leaf node     1
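A minimal end-to-end sketch of the experiment is shown below, assuming the Spambase file spambase.data has been downloaded from the UCI repository (57 feature columns plus a 0/1 label). The forest parameters follow Table 1 where scikit-learn offers an equivalent; the Azure-specific "number of random splits per node" has no direct counterpart here, so it is omitted.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = pd.read_csv("spambase.data", header=None)  # assumed local copy of the UCI file
X, y = data.iloc[:, :-1], data.iloc[:, -1]

# 75% of the mails for training, the remaining 25% for testing, as in Sect. 3.3.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "decision forest": RandomForestRegressor(n_estimators=8, max_depth=32, min_samples_leaf=1),
    "Bayesian linear": BayesianRidge(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: MSE = {mse:.3f}")
```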

When everything is ready, the experiment can begin.

4 Experimental Result

In UCI's Spambase Data Set, spam is quantified by the number 1 and non-spam by the number 0, so the trained model predicts the characteristic value of an email from the input attribute values and finally classifies that value toward 1 or 0. An email is considered spam within a certain range of values close to 1, and non-spam within a certain range close to 0. To verify the performance of the model predictions, we know in advance that the 25% of the dataset participating in the test (1,150 emails) contains 435 spam emails, the rest being non-spam. After constructing the models according to the flow charts in Sect. 3.3, the two experiments using the different methods were carried out; the predicted-value distributions of the test mails are shown in Figs. 4 and 5.

Fig. 4. Predicted value of Bayesian linear regression.


Fig. 5. Predicted value of decision forest regression.

It can be seen from Fig. 4 that there is a peak in the predicted-value range (–0.15, 0.15] (i.e., near 0) and another peak in the range (0.9, 1.05] (i.e., near 1). This result indicates that about half of the emails in the test set can be predicted as spam or non-spam, and the separation between the two peaks is obvious. However, since Bayesian linear regression predicts the posterior probability distribution of a parameter, its forecasts are widely distributed. Moreover, some emails fall into the fuzzy area in the middle of the predicted range, which means they cannot be clearly distinguished as spam or non-spam. The presence of these mails also reflects one limitation of the Bayesian linear regression method in practical applications: it requires a certain amount of samples to train the model in order to obtain good results.

From Fig. 5 we can see that all the predicted values are distributed in the interval [0, 1], and most of the mails can be classified as spam or non-spam. Compared with the Bayesian linear regression method, the decision forest model places fewer characteristic values in the middle fuzzy region, and the closer to the middle of the range (i.e., a predicted value of 0.5), the fewer mails are found there. This result indicates that the random decision forest has the advantage of high accuracy; simply using more trees or setting a greater tree depth trains the model more adequately and ultimately makes the prediction performance even better. By contrast, the decision forest regression algorithm performs better than the Bayesian linear regression algorithm, and its predicted values are closer to the actual situation.
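The distribution analysis above can be reproduced numerically. The sketch below uses synthetic stand-in scores (two peaks near 0 and 1 plus a small fuzzy middle), since the actual predicted values are not published, and counts how many mails fall into the ambiguous region.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for the 1,150 predicted characteristic values: peaks near 0 and 1, plus a fuzzy middle.
pred = np.concatenate([
    rng.normal(0.0, 0.08, 700),
    rng.normal(1.0, 0.08, 400),
    rng.uniform(0.2, 0.8, 50),
])

labels = (pred >= 0.5).astype(int)            # values near 1 are treated as spam
fuzzy = np.sum((pred > 0.15) & (pred < 0.9))  # mails left in the ambiguous middle region
counts, edges = np.histogram(pred, bins=np.arange(-0.15, 1.06, 0.15))
print(f"flagged as spam: {labels.sum()}, fuzzy-region mails: {fuzzy}")
print(dict(zip(np.round(edges[:-1], 2), counts)))
```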


The Mean Square Error (MSE) between the predicted and actual values is 0.053 for the random forest algorithm, while it is 0.111 for the Bayesian linear regression algorithm. Compared with the Bayesian linear regression algorithm, the random forest regression algorithm therefore reduces the prediction error by more than half; it is the more accurate machine learning method and can describe and predict the experimental data more precisely.

5 Conclusion

In this paper, we conducted two experiments on the Azure Machine Learning Studio platform, using the machine learning methods of Bayesian linear regression and random decision forest regression to detect a given set of spam data. The experimental results show how the two methods differ in their predictions. The Bayesian linear regression method is based on the posterior probability distribution of the characteristic parameters, so its predicted values are relatively scattered. The random decision forest method uses the least-square-error criterion to generate binary trees for feature selection; compared with the Bayesian linear regression method it has higher accuracy, and it also has the advantages of simple modeling and fast training, which makes it well suited as a benchmark model in machine learning. With the continuous development of artificial intelligence and machine learning technology, researchers will surely reach a new level of spam classification to meet users' need for a good email communication environment. A likely direction for spam detection is to produce better classification standards, such as extracting more complex and accurate attributes as feature signatures for identifying spam. In addition, researchers can develop more powerful, low-complexity algorithms to improve on random decision forests, such as neural networks. Developing more accurate and faster spam detection methods is therefore one of the future directions in the field of machine learning.

Acknowledgments. This research was supported by CERNET Innovation Project (NGII20180407).

References 1. Benevenuto, F., Magno, G., Rodrigues, T., Almeida, V.: Detecting spammers on twitter. In: Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), p. 12, July 2010 2. Fu, Q., Feng, B., Guo, D., Li, Q.: Combating the evolving spammers in online social networks. Comput. Secur. 72, 60–73 (2018) 3. Youn, S., Cho, H.C.: Improved spam filter via handling of text embedded image e-mail. J. Electr. Eng. Technol. 10(1), 401–407 (2015) 4. Li, S., et al.: WAF-based chinese character recognition for spam image filtering. Chin. J. Electron. 27(5), 1050–1055 (2018)


5. Rusland, N.F., Wahid, N., Kasim, S., Hafit, H.: Analysis of Naïve Bayes algorithm for email spam filtering across multiple datasets. In: IOP Conference Series: Materials Science and Engineering. IOP Publishing, August 2017
6. Wei, Q.: Understanding of the naive Bayes classifier in spam filtering. In: AIP Conference Proceedings. AIP Publishing (2018)
7. Yang, J., Liu, Y., Liu, Z., Zhu, X., Zhang, X.: A new feature selection algorithm based on binomial hypothesis testing for spam filtering. Knowl.-Based Syst. 24(6), 904–914 (2011)
8. Feng, L., Wang, Y., Zuo, W.: Quick online spam classification method based on active and incremental learning. J. Intell. Fuzzy Syst. 30(1), 17–27 (2016)
9. Gao, C.Z., Cheng, Q., He, P., Susilo, W., Li, J.: Privacy-preserving Naive Bayes classifiers secure against the substitution-then-comparison attack. Inf. Sci. 444, 72–88 (2018)
10. He, L.: Identification of product review spam by random forest. J. Chin. Inf. Process. 29(3), 150–154 (2015)
11. Al-Janabi, M., Andras, P.: A systematic analysis of random forest based social media spam classification. In: Yan, Z., Molva, R., Mazurczyk, W., Kantola, R. (eds.) Network and System Security. NSS 2017. LNCS, vol. 10394, pp. 427–438. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64701-2_31
12. Sun, X., Han, L., Li, K.: One email filtering system based on category feature selection and feedback learning random forest algorithm. Comput. Appl. Softw. 32(4), 67–71 (2015)
13. Huawei Cloud. https://support.huaweicloud.com/algnoderef-mls/mls_02_0054.html. Accessed 05 July 2019

Towards Privacy-preserving Recommender System with Blockchains

Abdullah Al Omar1, Rabeya Bosri1, Mohammad Shahriar Rahman2, Nasima Begum1, and Md Zakirul Alam Bhuiyan3

1 University of Asia Pacific, Dhaka, Bangladesh
{omar.cs,nasima.cse}@uap-bd.edu, [email protected]
2 University of Liberal Arts Bangladesh, Dhaka, Bangladesh
[email protected]
3 Fordham University, New York, USA
[email protected]

Abstract. Data tampering is one of the most pressing personal-information security issues in online business portals. For various individual or business purposes, clients need to share their personal information with these portals. Taking advantage of this sharing, online business sites accumulate client data, including clients' most sensitive information, and run various data analyses without the clients' authorization. To propose suggestions, data analysis may need to be done in the online business portals: a recommender system creates an automated, personalized rundown of items based on the user's preferences when searching for any product over the portal. These days, the recommender system is part and parcel of online marketing and business portals; however, secure control of client information is missing to some extent in such systems. Blockchain technology can guarantee security of data manipulation for the clients of these online portals, since it is a secure distributed ledger for storing data transactions. This paper presents a privacy-preserving platform for a recommender system utilizing blockchain technology. The distributed ledger attribute of blockchain gives any client a verified domain where information is utilized for analysis only with his/her required consents. Under this platform, clients get rewards (i.e., points, discounts) from the proposed online-based company for sharing their information to figure out and propose relevant suggestions.

Keywords: Blockchain · User-centric recommender system · Privacy-preserving platform · Private data analysis · Secure protocol

1 Introduction

Recommender systems or frameworks [25], being a subclass of information filtering system, make a prediction on a number of items. Online business portals


are broadly utilizing recommendation engines to establish relevant and more accurate suggestions that reflect the client's previous preferences. Expanding product sales is the essential objective of a recommender system. These days, recommendation methods are being utilized not only in online business portals but also in different industries. Amazon [20], Netflix [13], LinkedIn, Facebook [6] and YouTube [5] are the most well-known examples of commercial companies using recommender systems or techniques to a great extent. There are two different ways of producing recommendations: collaborative filtering and content-based filtering. To make proper recommendations, collaborative filtering [15, 23] tends to require storing clients' private information, and here comes the issue of data protection and security for the clients. Hazardous circumstances arise when online portals unveil their clients' private information, and there are several instances of such disclosures [18]. A few platforms [1–3, 7, 8, 19, 27, 30] have been proposed to deal with information protection or security issues. However, the odds of unveiling clients' information still exist; for example, the online company can control or share clients' information with an unauthorized third party without the client's consent. In our platform, we utilize blockchain to hold the record of information-sharing exchanges. Blockchain is a data structure with the following properties: immutability, append-only, ordered, open and transparent, and secure (identification, authentication, authorization) [28]. Blockchain became an increasingly well-known platform through the Bitcoin cryptocurrency [16], where it serves as a public ledger to hold and maintain the transactions and historical states. The first blockchain was presented by Satoshi Nakamoto in 2008 [21]. The reason behind its ubiquity is that it is a decentralized and computerized record of exchanges, where transactions are recorded and handled with no third-party intervention [26]. All stored information is encrypted, and it is possible to compute an encrypted outcome over encrypted information with the help of a fully homomorphic encryption algorithm [11]. Its structure is a linear sequence of blocks: each block contains the cryptographic hashes corresponding to the previous and current block to guarantee continuity and immutability of the chain, and this chaining mechanism guarantees the integrity of the protected data structure. A blockchain can be totally open to the general public, so that whoever wants to can join, or it may be totally private, permitting only a handful of selected parties; these two categories are public and private blockchains [10], along with another kind known as the permissioned blockchain [4, 24]. Blockchains are built from private-key cryptography, a P2P network [29], and the blockchain protocol itself. Proof-of-Work [14, 21] and Proof-of-Stake [17, 22] are the two primary ways to validate transactions on a blockchain.
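As a small illustration of the chaining mechanism described above (our sketch; the paper gives no code), each block stores the hash of the previous block's header, so tampering with any stored exchange invalidates every later link:

```python
import hashlib
import json
import time

def make_block(prev_hash: str, payload: dict) -> dict:
    """Create a block whose hash covers its payload and the previous block's hash."""
    header = {"prev_hash": prev_hash, "timestamp": time.time(), "payload": payload}
    digest = hashlib.sha256(json.dumps(header, sort_keys=True).encode()).hexdigest()
    return {**header, "hash": digest}

genesis = make_block("0" * 64, {"event": "genesis"})
b1 = make_block(genesis["hash"], {"event": "cluster data shared with TDS"})
b2 = make_block(b1["hash"], {"event": "recommendation computed by RS"})
# Altering b1's payload changes its hash, so b2's prev_hash no longer matches:
assert b2["prev_hash"] == b1["hash"]
```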


Therefore, blockchain is a secure, ordered and immutable data structure that stores exchanges of client data, and it promises recognizable proof, which is called identification, authentication and approval or authorization, by the clients. In our platform we utilize a permissioned blockchain, which provides access control, and to find optimal recommendations with respect to users' interests we use collaborative filtering. Through blockchain, our platform guarantees information security, as clients have access to the blockchain for checking and approving the information exchange transactions. Additionally, as client information is stored by our platform, web-based companies are no longer able to gather information arbitrarily from the clients. Our platform enables the clients to have authority over their information, which is missing in current recommender systems.

Our Contribution: In this paper, we have proposed a framework for a protected and privacy-preserving recommender system to guarantee the privacy and security of client data. The principal idea of this paper is to keep client data secure and inaccessible to any third party. In our framework, client information security issues will not arise, because all of the information is stored in the cloud under controlled access; by utilizing blockchain to store the details of every information exchange, our platform achieves accountability, integrity and security. Some other cryptographic functionalities are also proposed in our platform to guarantee security. Our contribution in a nutshell is as follows:

– Security and privacy guarantee: Accountability, integrity, pseudonymity, security and privacy are guaranteed by the proposed platform, which makes it a privacy-preserving and secure platform. A rigorous analysis is given in Sect. 4.
– Analysis: A rigorous analysis of pseudonymity, accountability, integrity, security and privacy has been shown.

Paper Organization: The remainder of the paper is organized as follows: Sect. 2 discusses the related works; Sect. 3 explains the protocol overview and the working scenario of our platform; the protocol analysis is described in Sect. 4; concluding remarks are given in Sect. 5.

2 Related Work

In this section, we include some of the works that have been done in this field and describe each of them. Lam et al. discussed significant research questions related to the privacy and security of recommender systems in [18], concentrating on automated collaborative filtering; their research brings forward a detailed discussion of infringements of clients' trust and the danger of unveiling clients' personal information. Researchers have taken different approaches to address information privacy and security, and various methods have been proposed so far. A recently constructed data management platform utilizes blockchain to ensure personal information privacy [30]. In web-based shopping, the security of client data has always been a burning question; a framework is proposed in [9] to guarantee the security of client information, where the company deals with the clients through a mutual contract and gets access to some partial information shared by each client.


In [8], a secure recommender system or framework has been proposed. The authors proposed a secure recommender framework utilizing blockchain along with secure multiparty computation [12]. In the framework, companies can store client information and use blockchain to store that information (e.g., favorite lists, habits, shopping history, and sensitive data like credit card data). All information is encrypted and, because of the cryptographic functionalities, is not available to anyone without the consent of the clients. Companies offer incentives if the clients give access to their information for figuring out proper recommendations. In their proposed framework, collaborative filtering runs on the data authorized by the clients to select an ideal recommendation. In spite of the fact that they proposed a protected framework, it is lacking in some critical points. First, in their proposed framework the companies are not permitted to access the stored information of the clients, but the companies are still permitted to gather client information arbitrarily and to store it without the client's authorization; thus, the framework cannot manage the issue of gathering and storing information from clients without their permission. Also, they claim that the entire computation is done in the blockchain, yet computation in the blockchain is practically infeasible to that degree; to the extent we can ascertain, it is impossible. This is the major issue in their proposed framework. Lastly, a completely anonymous recommender framework may raise the probability of information loss: unauthenticated access allows companies to easily create spam profiles and control the ratings of their own items without much difficulty. Thus, the idea of fully anonymous customers leaves the scheme open to fraudulent activities.

3 Our Protocol

In this section, we present the architecture as well as the design view of our mechanism. Table 1 describes the notations used in this paper.

3.1 Entities and Steps Involved in Our Protocol

Figure 1 shows the entities of our platform; their roles are described briefly here:

Cloud (C): In our platform, C is used as data storage. The information of users who are interested in joining our platform is stored in C, and only M has the ability to establish a connection with C. Therefore, there is no chance for third parties (i.e., companies, illegal authorities) to access the users' data. C has two entities:

1. User List (UL): UL stores all the users who are interested in joining our system. No company is able to access the UL; M is the only entity able to connect with UL, and UL is generally updated by M.

Table 1. Terminology table

Notation  Description
GUi       Guest user
BC        Blockchain
M         Manager
Ci        Clusters in the cloud
RS        Recommender system
IDg       ID generator
TDS       Temporary data storage
IDi       A unique number corresponding to GUi
Ri        Recommendation
Pi        Product type
C         Cloud of our system

2. Cluster: A cluster stores specific product information. In our platform a cluster has two parts:
(a) Product Type (Pi): The specific product's name is stored in this part. Using this Pi, C can identify a particular Ci, and M is able to give it points.
(b) Point: All the users in a Ci have the same points. Whenever an Ri is calculated using a particular Ci's information, all the users of that Ci benefit by receiving points from M. Users may get some offer or discount from companies using these points.

Recommender System (RS): RS computes the Ri using the permitted data and sends it to M.

Blockchain (BC): This is the most important part of our system. BC holds the transactions between C and RS. The purpose of this log is that if any Ci's data is tampered with, it can be established what data was shared with which GUi at what time.

ID Generator (IDg): A user can establish a connection with our platform through a trusted party, defined as IDg (we assume that IDg is a trusted party of our platform). IDg receives the request from GUi and generates a unique IDi corresponding to the GUi. After generating the IDi, IDg sends it to M with the Pi. IDg does not store any IDi; every time, it generates a fresh one and shares it with M.

Guest User (GUi): A guest user who is new to this system is defined as GUi. GUi is capable of connecting with IDg and M. GUi requests IDg for an Ri and receives the Ri from M.

Temporary data storage (TDS): TDS is a trusted party of our platform (by assumption) that can store data temporarily.

Fig. 1. User centric recommender system platform

When M sends a data access request to C, C transfers the particular Ci's data to TDS. TDS waits for C's response; after a certain time period, TDS stops collecting data and forwards the shared data to RS. The important fact here is that TDS never stores the data permanently and does not share any data with the other parties of our platform.

Manager (M): The whole process of requesting and sending the Ri occurs through M, which has the ability to connect with all other entities. The process starts when M receives the IDi and Pi from IDg and ends when GUi is redirected to the company's page or a particular product's page. Between these two steps M does some other work, such as storing the IDi, sending a request to RS for the Ri, and sending a data access request to C for sharing the particular Ci's data with TDS. M receives the Ri from RS and then gives some points to the particular Ci that shared its data with TDS for calculating the Ri. M also connects with GUi by sending GUi the Ri and the joining request to this system. M has the ability


to update the UL if GUi accepts the request. Companies are connected to this system through M.

Steps in Our System: The steps of our system can be followed in Fig. 1.

• Step 1: GUi requests an Ri.
• Step 2: IDg receives the Pi from GUi, then generates a unique IDi corresponding to GUi. After generating the IDi, IDg sends the IDi and Pi to M as an Ri request.
• Step 3(a): M stores the IDi and sends the Ri request to C.
• Step 3(b): M sends a data access request to C.
• Step 3(c): RS revokes TDS for a data response.
• Step 4(a): With user permission (taken when users confirm joining our platform), the data is stored in TDS.
• Step 4(b): The transaction between C and TDS is stored in BC.
• Step 5: TDS forwards the shared data to RS for calculating the Ri.
• Step 6: RS sends the Ri to M.
• Step 7(a): When M receives the Ri, it gives some points to the Ci.
• Step 7(b): M forwards the Ri to GUi through his IDi and sends a request to join the Ci.
• Step 8: GUi sends his response to M.
• Step 9(a): If the user confirms, M stores the user's data in the Ci and updates the UL.
• Step 9(b): Otherwise, GUi just gets the Ri and is redirected to that company's page or the particular product's page.
• Step 10: GUi visits our platform's recommended company's page.

3.2 Functionalities of Our Protocol in Depth

Request Sending to Our Platform: Figure 2 shows the low-level view of sending a request to our platform. GUi sends the request to our platform by sending the Pi. IDg receives the request and generates a unique IDi corresponding to the Pi. IDg then sends the IDi and Pi to M; M stores the IDi and forwards the Pi to C. IDg uses a random function for calculating the IDi:

Random(Pi) = IDi

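A minimal sketch of such a random, stateless ID generator is given below (our assumption of one concrete realization; the paper only specifies Random(Pi) = IDi). Because IDg stores nothing and draws a fresh token per request, the IDi cannot be linked back to the GUi:

```python
import secrets

def generate_id(product_type: str) -> str:
    # Fresh cryptographically random token per request; nothing is stored by IDg.
    return f"{product_type}-{secrets.token_hex(16)}"

id_i = generate_id("laptop")  # e.g. 'laptop-9f86d081884c7d65...'
print(id_i)
```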

Fig. 2. Low level view of sending request

Recommendation Calculation: After receiving the Pi from M, C searches for the particular Ci that has the same product as the Pi. Then all the users of that corresponding Ci provide access to their data (this permission is taken from the users at the time they join our platform), and data fetching from C to TDS starts. After a certain time period, TDS stops storing data and forwards the shared data to RS. Figure 3 shows the low-level view of the Ri calculation procedure. For calculating the Ri, our platform uses collaborative filtering. Collaborative filtering (CF) is a method of making recommendations by filtering the information of other users who have the same interests as GUi. Here, collaborative filtering is performed in two steps: first, collecting the data of the users who have the same Pi as GUi, which is performed by TDS after the cloud finds the particular Ci having the same Pi; second, calculating the Ri corresponding to GUi, which is performed by RS.

Fig. 3. Low level view of recommendation calculation
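The two CF steps can be sketched as follows (a toy illustration of user-based collaborative filtering with cosine similarity; the paper does not fix a particular CF variant). Rows stand for users in the matched cluster Ci, columns for products, and the GUi's known preferences are used to weight the cluster's ratings:

```python
import numpy as np

# Toy rating matrix for the cluster Ci (rows: users, columns: products; 0 = unrated).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)
target = np.array([4, 3, 0, 0], dtype=float)  # GUi's known preferences

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

sims = np.array([cosine(target, row) for row in R])  # step 1: similar users in Ci
scores = (sims / sims.sum()) @ R                     # step 2: similarity-weighted ratings
scores[target > 0] = -np.inf                         # skip products GUi already rated
print("recommended product index:", int(np.argmax(scores)))
```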

Sending the Recommendation: Figure 4 shows the low-level view of sending the Ri from our platform to GUi. RS sends the Ri to M, and M forwards the Ri to GUi. M identifies GUi from the stored IDi corresponding to the Ri, and the Ri is given to GUi in a digitally signed format. Here, we are using the Schnorr signature


scheme. After receiving the Ri, GUi verifies the signature. If the verification succeeds, GUi is redirected by M to the platform-recommended company's page or website.

Fig. 4. Low level view of recommendation sending
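For completeness, a toy sketch of Schnorr signing and verification is given below (our illustration with insecurely small parameters, purely to show the mechanics M and GUi would use; a real deployment would use a standard large group).

```python
import hashlib
import secrets

# Toy Schnorr group: prime p, prime q dividing p - 1, generator g of order q.
p, q = 48731, 443              # 443 * 110 = 48730 = p - 1 (insecure demo sizes)
g = pow(2, (p - 1) // q, p)
assert g != 1

def H(r: int, m: str) -> int:
    return int.from_bytes(hashlib.sha256(f"{r}|{m}".encode()).digest(), "big") % q

def keygen():
    x = secrets.randbelow(q - 1) + 1  # private key held by M
    return x, pow(g, x, p)            # public key y = g^x mod p

def sign(x: int, m: str):
    k = secrets.randbelow(q - 1) + 1
    r = pow(g, k, p)
    e = H(r, m)
    return e, (k + x * e) % q         # signature (e, s)

def verify(y: int, m: str, sig) -> bool:
    e, s = sig
    r = (pow(g, s, p) * pow(y, q - e, p)) % p  # g^s * y^(-e) = g^k for a valid signature
    return H(r, m) == e

x, y = keygen()
sig = sign(x, "Ri: recommended product #42")
print(verify(y, "Ri: recommended product #42", sig))  # True
```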

Joining Request to GUi and Adding GUi to this Platform: Figure 5 shows the procedure of sending the joining request to GUi and adding GUi to this platform. When GUi receives the Ri, GUi also receives another request from M, namely the request to join this system. If GUi accepts the request, GUi is asked to answer some questions, for example about his favorite list; all this information, together with GUi's shopping history, is then stored in C by M. By accepting the joining request, users also give the data-access permission (the platform asks about this permission separately). Finally, GUi becomes a member of our platform when M updates the user list.

Fig. 5. Low level view of adding a new member


After the UL is updated, GUi is redirected to the company's page or website. If GUi does not accept the request, GUi is simply redirected to the company's page or website.

4 Protocol Analysis

In this section, we analyze the security claims of our system. This analysis shows how our system provides the security metrics to the user.

• Pseudonymity: GUi is recognized by M during the sending of the Ri. No party associated with our platform (i.e., an organization) can identify the GUi during communication with our platform or from a BC transaction, which offers pseudonymity. Unauthorized access is more likely to occur as a consequence of full anonymity: unauthenticated access raises the odds of companies controlling the ranking of their items by making spam or dummy profiles. That is why our platform enforces client pseudonymity rather than anonymity.
• Privacy: The proposed framework maintains GUi's privacy and protection in the UL. The framework does not reveal client data during its computations, and M acts as a delegate between GUi and the organization. GUi will get recommendations; however, no party will have the capability to trace their source back to the GUi's information.
• Integrity: The client's information is private for all clients. M passes a request for access along with the Ri to GUi; GUi acknowledges the access request and enables M to store and access its data. If not, our framework does not store any of the information, and no other entity in the framework ever gets access to the Ri. Besides this, every access to data for Ri calculations is stored in BC as a transaction. The integrity of every client's data is maintained as a result of this access technique, and an unaltered record of information accesses is saved in the BC as transactions.
• Accountability: M remains accountable for any accessed client information. The data transaction is put on the BC after the access request of M is approved by a GUi, so the client can trace back any access to the data.
• Security: The information of the clients is stored in C. Access to and storage of information is only possible after M's access request to GUi is acknowledged and permission is provided by the GUi; M cannot access the data without the consent of the GUi. Other entities of the framework (e.g., TDS, RS) have no independent means of access to client information: it is the sole authority of M to access the client's information, and except for M no other party has any sort of access to it. When sending the Ri to GUi in a digitally signed format, M uses the Schnorr signature scheme, so our framework offers safety through the Schnorr signature scheme. Additionally, the retrieved client information is only stored until the computation of RS is finished and M has been given the Ri; the TDS deletes all information stored in it after the information has been used.

5 Conclusion

In this paper, we have proposed a structure for a secure recommender framework that ensures client information protection utilizing blockchain technology. Accumulating clients' information without guaranteeing security is one of the primary issues in recommender frameworks; here, we lay down a solution to handle this adverse situation. In our proposed system, clients have access control over their information. Our framework and its clients are not completely anonymized, so the possibility of unapproved parties making fake profiles to raise the rating of their own items is disallowed by our proposed framework. To the best of our knowledge, this is the first time a client-driven recommender framework has been proposed that cryptographically ensures the clients' information privacy and security. In future, we intend to work on the construction and deployment (prototyping) of the proposed framework and its performance evaluation. We additionally plan to incorporate a constructive analysis comparing our framework with a well-known blockchain-based recommender framework.

Acknowledgments. This work is partially supported by the Institute of Energy, Environment, Research and Development (IEERD), University of Asia Pacific (UAP), Bangladesh.

References

1. Al Omar, A., Bhuiyan, M.Z.A., Basu, A., Kiyomoto, S., Rahman, M.S.: Privacy-friendly platform for healthcare data in cloud based on blockchain environment. Future Gener. Comput. Syst. 95, 511–521 (2019)
2. Al Omar, A., Rahman, M.S., Basu, A., Kiyomoto, S.: MediBchain: a blockchain based privacy preserving platform for healthcare data. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10658, pp. 534–543. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72395-2_49
3. Azaria, A., Ekblaw, A., Vieira, T., Lippman, A.: MedRec: using blockchain for medical data access and permission management. In: International Conference on Open and Big Data (OBD), pp. 25–30. IEEE (2016)
4. Cachin, C.: Architecture of the hyperledger blockchain fabric. In: Workshop on Distributed Cryptocurrencies and Consensus Ledgers, vol. 310 (2016)
5. Davidson, J., et al.: The YouTube video recommendation system. In: Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 293–296. ACM (2010)
6. Dutta, P., Kumaravel, A.: A novel approach to trust based identification of leaders in social networks. Indian J. Sci. Technol. 9(10) (2016)
7. Felt, A., Evans, D.: Privacy protection for social networking platforms. In: Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, 22 May 2008
8. Frey, R., Wörner, D., Ilic, A.: Collaborative filtering on the blockchain: a secure recommender system for e-commerce. In: Proceedings of the 22nd Americas Conference on Information Systems (AMCIS 2016), San Diego, CA, USA, 11–13 August 2016


9. Frey, R.M., Vuckovac, D., Ilic, A.: A secure shopping experience based on blockchain and beacon technology. In: Proceedings of the Poster Track of the 10th ACM Conference on Recommender Systems (RecSys 2016), Boston, USA, 17 September 2016 (2016)
10. Gabison, G.: Policy considerations for the blockchain technology public and private applications. SMU Sci. Tech. L. Rev. 19, 327 (2016)
11. Gentry, C.: A Fully Homomorphic Encryption Scheme. Stanford University, Stanford (2009)
12. Goldreich, O.: Secure multi-party computation. Manuscript. Preliminary version, 78 (1998)
13. Gomez-Uribe, C.A., Hunt, N.: The Netflix recommender system: algorithms, business value, and innovation. ACM Trans. Manag. Inf. Syst. (TMIS) 6(4), 13 (2016)
14. Hazari, S.S., Mahmoud, Q.H.: A parallel proof of work to improve transaction speed and scalability in blockchain systems. In: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0916–0921, January 2019
15. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. (TOIS) 22(1), 5–53 (2004)
16. Iwamura, M., Kitamura, Y., Matsumoto, T., Saito, K.: Can we stabilize the price of a cryptocurrency?: Understanding the design of bitcoin and its potential to compete with central bank money (2014)
17. King, S., Nadal, S.: PPcoin: peer-to-peer crypto-currency with proof-of-stake. Self-published paper, 19 August 2012
18. Lam, S., Frankowski, D., Riedl, J.: Do you trust your recommendations? An exploration of security and privacy issues in recommender systems. Emerging Trends Inf. Commun. Secur., 14–29 (2006)
19. Liang, X., Shetty, S., Tosh, D., Kamhoua, C., Kwiat, K., Njilla, L.: ProvChain: a blockchain-based data provenance architecture in cloud environment with enhanced privacy and availability. In: Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 468–477. IEEE Press (2017)
20. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comput. 7(1), 76–80 (2003)
21. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system (2008)
22. Nguyen, C.T., Hoang, D.T., Nguyen, D.N., Niyato, D., Nguyen, H.T., Dutkiewicz, E.: Proof-of-stake consensus mechanisms for future blockchain networks: fundamentals, applications and opportunities. IEEE Access, 1 (2019)
23. Pazzani, M.J.: A framework for collaborative, content-based and demographic filtering. Artif. Intell. Rev. 13(5–6), 393–408 (1999)
24. Qiu, C., Yu, F.R., Xu, F., Yao, H., Zhao, C.: Permissioned blockchain-based distributed software-defined industrial internet of things. In: 2018 IEEE Globecom Workshops (GC Wkshps), pp. 1–7. IEEE (2018)
25. Resnick, P., Varian, H.R.: Recommender systems. Commun. ACM 40(3), 56–58 (1997)
26. Swan, M.: Blockchain: Blueprint for a New Economy. O'Reilly Media, Inc. (2015)
27. Tasnim, M.A., Omar, A.A., Rahman, M.S., Bhuiyan, M.Z.A.: CRAB: blockchain based criminal record management system. In: Wang, G., Chen, J., Yang, L.T. (eds.) SpaCCS 2018. LNCS, vol. 11342, pp. 294–303. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-05345-1_25
28. Xu, C., et al.: Making big data open in edges: a resource-efficient blockchain-based approach. IEEE Trans. Parallel Distrib. Syst. 30(4), 870–882 (2018)


29. Yamamoto, S., Nakao, A.: In-network P2P packet cache processing using scalable P2P network test platform. In: 2011 IEEE International Conference on Peer-to-Peer Computing, pp. 162–163, August 2011
30. Zyskind, G., Nathan, O., et al.: Decentralizing privacy: using blockchain to protect personal data. In: Security and Privacy Workshops (SPW), 2015 IEEE, pp. 180–184. IEEE (2015)

Integrating Deep Learning and Bayesian Reasoning

Sin Yin Tan, Wooi Ping Cheah, and Shing Chiang Tan

Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, 75450 Melaka, Malaysia
{tan.sin.yin,wpcheah,sctan}@mmu.edu.my

Abstract. Deep learning (DL) is an excellent function estimator with amazing results on perception tasks such as visual recognition and text recognition. However, its inner architecture acts as a black box, because users cannot understand why particular decisions are made. Bayesian reasoning (BR) provides an explanation facility and causal reasoning under uncertainty, which can overcome this demerit of DL. This paper proposes a framework for the integration of DL and BR that leverages their complementary merits based on their inherent internal architectures. The migration from a deep neural network (DNN) to a Bayesian network (BN) involves extracting rules from the DNN and constructing an efficient BN based on the rules generated, to provide intelligent decision support with accurate recommendations and logical explanations to the users.

Keywords: Black box of deep learning · Bayesian reasoning · Integration · Rule extraction

1 Introduction

Deep Learning (DL) and Bayesian reasoning (BR) are currently two active areas of research in artificial intelligence (AI). They are two different but complementary tasks performed by an intelligent agent. DL provides a powerful model and an easy framework for machine learning. It has achieved significant success in many perception-related tasks, which involve seeing, hearing, and reading by machine; specific application areas are visual object recognition, text understanding, and speech recognition. These are fundamental tasks for a functioning AI or data engineering system. DL is being used to solve problems that require high predictive power for accurate decision support, and it improves an intelligent agent's performance by enhancing its capability to perceive the environment. DL discovers complex structure with multiple hidden layers between the input and output layers, which allows the computational model to learn representations of data with multiple levels of processing layers on large data sets using the backpropagation algorithm. Backpropagation computes how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Unfortunately, DL suffers from a difficulty commonly known as the black box problem. This is because a deep learning algorithm scans through a set of data, finds patterns in it, and


builds a model that can be used for accurate prediction or classification tasks. However, the model may in fact be a black box: something we can feed inputs to and get outputs from, but whose inner workings cannot be interpreted accurately [1–3].

Bayesian reasoning (BR) provides a powerful approach for information propagation, integration, and manipulation. It has achieved significant success in many thinking-related tasks, which involve uncertainty quantification, cause-and-effect analysis, and decision making. It improves an intelligent agent's performance by enhancing its capability to infer implicit knowledge from an explicit representation of the environment; specific application areas are information retrieval and large-scale ranking and recommendation systems. BR seeks to establish the relationship between causes and effects, which can be represented by a Bayesian Network (BN). A BN is a framework for modelling, representing and reasoning about causal knowledge [4–6]: the domain variables are represented by the nodes, while the cause-and-effect relationships between the variables are expressed by the links between the nodes. The graphical structure provides an intuitive way of understanding causal models, such that the causes of some events can be diagnosed and their effects predicted, as well as a way of interpreting the reasoning processes. Applying BR is beneficial to decision support systems: the inherent structure is easy to understand, and the decision made is convincing, since the system is able to explain why a particular decision is made. Unfortunately, causality utterances are often used in situations plagued with uncertainty, which affects predictive power and jeopardizes accurate decision support.

Each of these two research areas has shortcomings that can be effectively addressed by the other, pointing towards a need for integration between them; the complementary aspects of the two research areas are the main focus of this paper. We assume that the black box nature is an inherent feature of DL, which cannot be solved by simply modifying the architecture of DL itself. Instead, the black box nature of deep learning can be alleviated by complementing it with some external explanation facilities that are not built into the deep learning architecture itself. In this study, we propose a method which not only overcomes the black box problem inherent to DL by providing some form of explanation facilities that help to reveal the internal architecture and operations of the DL algorithm, but also reduces the uncertainty problem inherent to BR by introducing the high and accurate predictive power of DL. The method leverages the complementary roles of these two areas to provide intelligent decision support with accurate recommendations and logical explanations. This research is significant because the proposed method is able to support accurate prognostic reasoning through prediction and analysis, as well as intuitive diagnostic reasoning through justification and explanation. To achieve this, a detailed comparison of the two frameworks is conducted based on their inherent features; the method proposed by Zilke [7] is adopted to extract rules from a Deep Neural Network (DNN), and the method proposed by Zarikas et al. [8] is adopted to construct an efficient and convenient Bayesian network (BN) based on the expert rules generated. The new framework will be a simple and effective method for decision support leveraging the merits of both DL and BR.

This paper is organized into the following sections: Sect. 2 presents a short review of DL and BR. Sect. 3 provides a brief qualitative comparison of DL and BR

based on their inherent features. Section 4 describes the proposed framework for migration from DNN to BN by extracting rules from a trained NN. Finally, Sect. 5 outlines the conclusion of the study.

2 Related Work

Most machine learning techniques involve shallow-structured architectures, which consist of a single layer of nonlinear feature transformations and lack multiple layers of adaptive non-linear features. Deep learning is one of the popular machine learning techniques, in which features are represented with many layers of processing stages in a hierarchical architecture. They are successfully applied in pattern classification and feature learning. Since 2006, DL has emerged as a new area of machine learning research [9–11]. According to news reported in the New York Times [12], many applications in machine learning and artificial intelligence have adopted the techniques developed from DL research within the past few years. A series of workshops, conferences, and many journals have been devoted exclusively to deep learning. Many reputable academic institutions and big companies, such as Microsoft, Google, Facebook, Baidu, etc., are doing research in this area. The reviews highlight the necessity of deep architectures: to extract complex structure and build internal representations from rich sensory inputs [13, 14]. Similarly, the human visual system is also hierarchical in nature [15, 16]. It is natural to believe that the state-of-the-art in processing can advance if efficient and effective deep learning algorithms are developed.

The study of Bayesian reasoning (often called causality) has become an important research area in artificial intelligence. Judea Pearl won the 2011 Turing Award for his fundamental contributions to artificial intelligence through the development of a calculus for probability and BR [17]. Interest in BR and how it relates to data science and engineering has increased in recent years. Kleinberg explained the importance of causal reasoning in decision support based on a framework called data's hierarchy of needs [18]. At the beginning of the big data boom, people were mostly worried about the storage and processing power for large amounts of data. Not satisfied with that, people switched their focus to reporting and visualization to extract insights about the data, which is called business intelligence. That is still not enough, because we can make better decisions if we can predict what is going to happen, so the focus switched again to predictive analytics. Unfortunately, models built for predictive analytics often end up depending on associations between the features and the predicted outputs. If the resolution suggested by the model comes without understanding, the representation of features may lead us to erroneous inference and potentially harmful interventions. BR aims to identify factors that are independent of spurious correlations, allowing stakeholders to make intelligent decisions [19, 20].

Despite these successes, the black box nature of DL is a drawback in practical deployment. The inner mechanism of a NN lacks transparency and interpretability. It is crucial for the users that a decision support system is able to provide some prognostic causal reasoning, which has been highlighted in various domains such as medical,

finance, military, etc. [21–23], to increase the trust of users in the predictions of the decision support system. Besides, companies in Europe have been required to deploy decision-making algorithms with explanations under the GDPR since 2018 [24]. This black box problem has been highlighted over the last few decades, and its recent re-emergence has brought positive impact to the industry. Many researchers are trying to open and explain the black boxes by building novel interpretable models with different approaches to bridge the gap. Ehsan et al. [25] introduced AI rationalization to generate natural language explanations; Wang et al. [26] worked on a probabilistic graphical model approach for recommender systems and topic models; Brendel et al. [27], Ribeiro et al. [28], Samek et al. [29], Tamagnini et al. [30], and Burns et al. [31] worked on image or text classifiers; Bastani et al. [32] interpreted black box models by using decision trees via model extraction; Shwartz-Ziv et al. [33] studied learning representations of DNNs through the information bottleneck. Meanwhile, DARPA has been working on a program called 'Explainable AI (XAI)' since 2017 [34, 35].

In fact, several researchers have tried to solve this black box problem with a rule extraction approach. A review by Hailesilassie [36] shows that most researchers have successfully extracted rules from artificial NNs with one hidden layer, but not much work has been done on extracting rules from DNNs. Meanwhile, Zilke [7] successfully extracted rules from a DNN, and González et al. [37] improved the algorithm by reducing the connectivity of the network to facilitate rule extraction. Rule extraction is considered a mature discipline in AI, but there are limited reviews on the construction of BNs from rule bases; most researchers constructed BNs from datasets. However, Zarikas et al. [8] demonstrated how to translate fuzzy rules into probabilities by using a defuzzification process to construct a BN. Thus, the methods proposed by Zilke [7] and Zarikas et al. [8] will be adopted in this proposed framework by extracting the rules from a DNN and constructing a BN based on the rules generated.

Indeed, DL and BR are currently two of the most active areas of research in AI, but there are not many research works that compare their respective roles in the development of decision support systems. There have rarely been any attempts to leverage their complementary nature and mutual merits by integrating the two frameworks, which are founded on completely different conceptual paradigms. This proposed research is an attempt to bridge the gap.

3 Comparison of DL and BR

A qualitative comparison of deep learning and Bayesian reasoning has been conducted based on the following criteria: efficiency of the learning algorithm, transparency of the training algorithm, accuracy in prediction, decomposability of the model learned, expressiveness in representation, and adequacy in reasoning. Table 1 shows a basic qualitative comparison of DNN and BN. The comparison will help to determine the best way to integrate these two frameworks.

Table 1. Qualitative comparison of DNN and BN

Criteria                            DNN   BN
Efficiency of the learning          ✓     ✗
Transparency                        ✗     ✓
Accuracy                            ✓     ✗
Decomposability                     ✗     ✓
Expressiveness in representation    ✗     ✓
Adequacy in reasoning               ✗     ✓

Fig. 1. Example of DNN with 4 hidden layers (Left) and BN with Compatible CPTs (Right)

The diagram illustrates an example of an e-commerce function in two different approaches, DNN and BN (see Fig. 1). Based on the data obtained from an e-commerce function, the learning algorithms of DL are able to learn the purchasing pattern of the customers easily based on the features captured in the data set (age, gender, income, credit card offer, festive promotion, discounted item, membership offer, quality of service, payment method, point collection, delivery cost, and graphical user interface). Meanwhile, the e-commerce sellers are able to use the trained DNN to predict the sales volume (output). However, it would be more helpful if the algorithms were able to assist the e-commerce sellers in improving their sales volume or in diagnosing a fall in the business. They need to know the reasons or causes that are actually contributing to the lower sales volume. Hence, BN is a great tool that is able to complement this deficiency of DL algorithms by providing causal reasoning and facilitating humans in the decision-making process. From the BN diagram (see Fig. 1 (Right)), the lower sales volume (65%) is most probably caused by the higher delivery cost (82.7%) rather than by the fewer membership offers (42.3%). In addition, we may observe the changes of uncertainty in the CPTs of the nodes if some evidence is included.

4 Migration

Migration involves two stages: (1) obtaining rules from a trained DNN, whereby the rules map the NN behavior, and (2) converting the rules extracted from the DNN into conditional probability tables (CPTs) that are compatible with the qualitative causal structure of the BN.

4.1 Extracting Rules from a Trained DNN

For this approach, the method proposed by Zilke [7] will be adopted for obtaining rules from trained deep neural networks. The extracted rules are decompositional rules that can describe the inner structure of the DNN. For every term present in each hidden layer, a decision tree (DT) algorithm needs to be applied to find the DT that describes the hidden layer hk by means of hk−1, which can be transformed into the rule set Rhk−1→hk. In the next step, the intermediate rules are combined. The rule-merging process is repeated until rules that describe the terms present in the input and output layers are obtained. To reduce the complexity of the rules, redundant rules are removed.

Step 1. Create a DT by using the C4.5 algorithm. Given a DNN with k hidden layers h1, h2, …, hk, a DT is constructed that consists of split points on the activation values of the last hidden layer's neurons, with the corresponding outputs in the tree's leaves. As a result, a DT that describes the output layer o by split points on the values of the last hidden layer hk is obtained. The respective rule set is called Rhk→o. The terms in Rhk→o are used directly to differentiate the positive and negative examples on which the DT algorithm runs.

Fig. 2. Construction of DT for last hidden layer's neurons and the corresponding outputs. (The C4.5 tree splits on the activation values h2,1 > 0 and h2,0 > 0, with the output classes y = 1 and y = 0 at its leaves.)

The diagram above illustrates an example of a DNN with 2 hidden layers (see Fig. 2 (Left)), which consists of input neurons x0, x1 and output neurons y0, y1. The first hidden layer h1 consists of 3 neurons, while the last hidden layer consists of 2 neurons. Using the C4.5 algorithm, a decision tree is constructed that describes the 2 neurons in the last hidden layer h2 and their corresponding outputs based on the activation values of the last hidden layer's neurons (see Fig. 2 (Right)).
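To make the idea of Step 1 concrete, the following is a minimal sketch, assuming hypothetical toy activation data, of how a C4.5-style split point on a hidden neuron's activations could be found; the Sample type, the toy values, and the single-split scope are illustrative assumptions, not the authors' implementation.

object RuleSketch {
  // Each sample pairs the last hidden layer's activations with the output class.
  final case class Sample(h: Vector[Double], y: Int)

  // Shannon entropy of a set of class labels.
  def entropy(ys: Seq[Int]): Double = {
    val n = ys.size.toDouble
    ys.groupBy(identity).values.map { g =>
      val p = g.size / n
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Find the (neuron index, threshold) split minimizing the weighted
  // child entropy, the criterion C4.5 uses when growing the root node.
  def bestSplit(data: Seq[Sample]): (Int, Double) = {
    val candidates = for {
      j <- data.head.h.indices
      t <- data.map(_.h(j)).distinct
    } yield (j, t)
    candidates.minBy { case (j, t) =>
      val (left, right) = data.partition(_.h(j) <= t)
      val n = data.size.toDouble
      (left.size / n) * entropy(left.map(_.y)) +
        (right.size / n) * entropy(right.map(_.y))
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq( // toy activations of a 2-neuron last hidden layer
      Sample(Vector(0.9, -0.3), 1), Sample(Vector(0.7, -0.1), 1),
      Sample(Vector(-0.4, 0.2), 0), Sample(Vector(-0.6, 0.5), 0))
    val (j, t) = bestSplit(data)
    println(s"First split of R_hk->o: IF h2,$j <= $t THEN one class ELSE the other")
  }
}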

Step 2. Get the intermediate rule sets for each hidden layer: Ri→h1, Rh1→h2, …, Rhk−2→hk−1, Rhk−1→hk, Rhk→o. The next shallower hidden layer, i.e. hk−1, is processed. The DT algorithm is applied to find decision trees that describe layer hk by means of hk−1, which can be transformed into the rule set Rhk−1→hk. This is continued until decision trees/rules are obtained that express the terms in the first hidden layer h1 by terms consisting of the original inputs to the NN, i.e. Ri→h1.

Step 3. Merge the intermediate rule sets. To obtain more understandable rules, we need to merge the intermediate rules, i.e. merge the rule sets Rhk−1→hk and Rhk→o to get the rule set Rhk−1→o, which describes the output layer o in terms of the hidden layer hk−1.

Step 4. Repeat the merging process until we obtain a rule set that describes the terms in the output layer o directly by the terms in the input layer i, i.e. Ri→o. The procedures for extracting rules from a trained DNN are illustrated in the diagram below (see Fig. 3), and a small sketch of the merging step follows the figure. Eventually, the rule set that describes the terms in the input layer i and the output layer o is extracted through the process of merging.

Fig. 3. The proposed work for steps 2–4.
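As an illustration of the merging in Steps 3–4, here is a minimal sketch, assuming a deliberately simplified rule representation (a conjunction of Boolean terms implying a single conclusion); real decompositional rule sets are richer than this.

object MergeSketch {
  // A rule: IF all premises hold THEN the conclusion holds.
  final case class Rule(premises: Set[String], conclusion: String)

  // Merge a deeper rule set with a shallower one: wherever a premise of a
  // deeper rule is the conclusion of a shallower rule, substitute its premises.
  def merge(deeper: Seq[Rule], shallower: Seq[Rule]): Seq[Rule] =
    for {
      d <- deeper
      // pick one supporting shallower rule per premise (cross product)
      support <- d.premises.toSeq
        .map(p => shallower.filter(_.conclusion == p))
        .foldLeft(Seq(Seq.empty[Rule]))((acc, rs) => for (a <- acc; r <- rs) yield a :+ r)
    } yield Rule(support.flatMap(_.premises).toSet, d.conclusion)

  def main(args: Array[String]): Unit = {
    val rIH = Seq(Rule(Set("x0>0"), "h1,0>0"), Rule(Set("x1<=0"), "h1,1>0"))
    val rHO = Seq(Rule(Set("h1,0>0", "h1,1>0"), "y=1"))
    merge(rHO, rIH).foreach(println) // Rule(Set(x0>0, x1<=0), y=1)
  }
}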

4.2 Construction of BN from the Expert Rules

The method proposed by Zarikas et al. [8] will be adopted to construct an efficient and convenient Bayesian network based on the expert rules generated in the previous stage. The major disadvantage of constructing a BN is the vast amount of subjective probabilities needed; the proposed method is able to overcome this problem. The expert rules are transformed, according to certain equations, into the probabilities used in the CPTs. The rules are elicited and categorized into certain types. For each type of rule, a mathematical expression is determined by referring to the Centre of Area (CoA) defuzzification method, which sets the conditional probabilities correctly.

Step 1. Set feature nodes and rule nodes in the network. Given that x1, x2, …, xi are a list of mutually exclusive features, each feature is represented by a node called a feature node. Their children are rule nodes, which encode the information extracted from the expert rules in the previous stage. At the beginning, set the feature nodes and link them with the related rule nodes, so that each rule node is connected to all of its relevant features. The following diagram illustrates an example BN showing the linkage of 4 feature nodes (x1, x2, x3, and x4) and 3 rule nodes (R1, R2, and R3) (see Fig. 4); a small code sketch of this structure follows the figure. The feature nodes are always the parents of the rule nodes. One feature can be mentioned in several rules, but features that are not included in active rules are omitted from the BN.

Fig. 4. The linkage between the feature nodes and rule nodes.
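The bipartite feature-node/rule-node structure of Step 1 can be captured with a very small data model; the sketch below is illustrative only (the node names mirror Fig. 4, with a hypothetical extra feature x5 that occurs in no active rule and is therefore pruned, as described above).

object BnStructureSketch {
  final case class RuleNode(name: String, parents: Set[String]) // parents are feature nodes

  // Keep only the features that appear in at least one active rule (Step 1).
  def featureLayer(features: Set[String], rules: Seq[RuleNode]): Set[String] =
    features.filter(f => rules.exists(_.parents.contains(f)))

  def main(args: Array[String]): Unit = {
    val features = Set("x1", "x2", "x3", "x4", "x5") // x5 occurs in no rule
    val rules = Seq(
      RuleNode("R1", Set("x1", "x2")),
      RuleNode("R2", Set("x2", "x3")),
      RuleNode("R3", Set("x3", "x4")))
    println(featureLayer(features, rules)) // Set(x1, x2, x3, x4): x5 is omitted
  }
}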

Step 2. Determine the probabilities in the CPTs for the rule nodes. Triangular membership functions will be used to describe the fuzzy degrees of belief among the feature and rule nodes. Each rule set will be described by a probability after defuzzification with the CoA method, which is depicted in Table 2.

Table 2. Defuzzification of CoA(B)

Linguistic description             Probability
B0 = extra weak (e.wk)             0
B1 = very very weak (v.v.wk)       0.1
B2 = very weak (v.wk)              0.2
B3 = weak (wk)                     0.3
B4 = weak-medium (wk.md)           0.4
B5 = medium (md)                   0.5
B6 = medium-high (md.hg)           0.6
B7 = high (hg)                     0.7
B8 = very high (v.hg)              0.8
B9 = very very high (v.v.hg)       0.9
B10 = extra strong (e.st)          1

Defuzzification is a process of obtaining a probability given fuzzy sets and their corresponding membership degrees, expressed as linguistic values, as input.
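As a concrete illustration, the Table 2 lookup can be written directly; this is a minimal sketch (the function name Defz follows the notation used in the example below, and the map simply encodes Table 2).

object DefzSketch {
  // CoA defuzzification of the linguistic values in Table 2.
  val Defz: Map[String, Double] = Map(
    "extra weak" -> 0.0, "very very weak" -> 0.1, "very weak" -> 0.2,
    "weak" -> 0.3, "weak-medium" -> 0.4, "medium" -> 0.5,
    "medium-high" -> 0.6, "high" -> 0.7, "very high" -> 0.8,
    "very very high" -> 0.9, "extra strong" -> 1.0)

  def main(args: Array[String]): Unit = {
    // Rule 1 below ("IF x1 increases, THEN the output decreases", strength
    // 'very very high') becomes the CPT entry P(y-|x1+) = Defz(very very high) = 0.9.
    println(Defz("very very high"))
  }
}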

For example, Rule 1: "IF x1 increases, THEN the output decreases." can be written as the probability P(y−|x1+) = Defz(very very high). Referring to Table 2, the probability for the linguistic input value 'very very high' is 0.9. However, in the case of no previous knowledge about the rules and features, a 'neutral policy' is applied by assigning 1/n to each of the n states. All CPTs have to be filled in for all combinations of the different states of the features and rules.

Step 3. Assign the probabilities in the CPTs by using a mathematical description. In order to construct the BN, a mathematical description for the CPT assignment according to the different types of rules is necessary. In this case, we use the triangular membership function, which can be expressed as the two-valued function given in Eq. (1), together with \(\bar{\mu}(B) = \mathrm{CoA}(B)\):

\[
\mu(x) =
\begin{cases}
\dfrac{x - x_{n-1}}{x_n - x_{n-1}}, & x_{n-1} \le x \le x_n \\[6pt]
\dfrac{x_{n+1} - x}{x_{n+1} - x_n}, & x_n < x \le x_{n+1}
\end{cases}
\tag{1}
\]

where n = 1, 2, 3, …, 10 and x_n = n/10.
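A small sketch of Eq. (1), assuming the grid x_n = n/10 stated above; returning 0 outside the support is an implementation choice made for this sketch, not part of the original formulation.

object TriangularSketch {
  // Eq. (1): triangular membership centred at x_n = n/10 (n = 1..10).
  def mu(n: Int)(x: Double): Double = {
    val (lo, c, hi) = ((n - 1) / 10.0, n / 10.0, (n + 1) / 10.0)
    if (x >= lo && x <= c) (x - lo) / (c - lo)
    else if (x > c && x <= hi) (hi - x) / (hi - c)
    else 0.0 // outside the support (a choice made for this sketch)
  }

  def main(args: Array[String]): Unit = {
    val muHigh = mu(7) _  // 'high' peaks at x_7 = 0.7 (cf. Table 2)
    println(muHigh(0.7))  // 1.0 at the centre
    println(muHigh(0.65)) // 0.5 halfway down the left slope
  }
}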

As an example, for Rule 2: "IF xj increases, THEN the outputi decreases.", the general form of the CPT can be evaluated from Eq. (2). Similarly, for Rule 3: "IF xj exists, THEN the outputi increases.", the CPT can be filled in according to Eq. (3) [8]. These rules are represented by the following equations over the different states:

\[
P(y_i^- \mid x_j^+) =
\begin{pmatrix}
j = 1,\dots,k & j = k+1 & j = k+1+v,\ v = 1,\dots,m-k-2 & j = m \\
1/N & \bar{\mu}(md) & \bar{\mu}(md) + a_v\,(\bar{\mu}(hg) - \bar{\mu}(md)) & \bar{\mu}(hg) \\
1/N & \bar{\mu}(wk) & \bar{\mu}(wk) + a_v\,(\bar{\mu}(v.wk) - \bar{\mu}(wk)) & \bar{\mu}(v.wk) \\
1/N & \bar{\mu}(v.wk) & \bar{\mu}(v.wk) + a_v\,(\bar{\mu}(v.v.wk) - \bar{\mu}(v.wk)) & \bar{\mu}(v.v.wk) \\
1/N & 0 & 0 & 0
\end{pmatrix}
\tag{2}
\]

with i = 1, 2, 3, 4, N = 4, j = 1, 2, …, m, and \(a_v = v/(m-(k+1))\) for v = 1, 2, …, m−k−2.

\[
P(y_i^+ \mid x_j \text{ exists}) =
\begin{pmatrix}
j = 1 & j = 2 \\
1/N & 0 \\
1/N & \bar{\mu}(v.v.wk) \\
1/N & \bar{\mu}(v.wk) \\
1/N & \bar{\mu}(hg)
\end{pmatrix}
\tag{3}
\]

with i = 1, 2, 3, 4, N = 4, and j = 1, 2.

As a conclusion, Fig. 5 illustrates the process of migration from DNN to BN by extracting the rules from DNN and constructing an efficient BN based on the rules generated. This can assist users in making intelligent decisions.

Fig. 5. Framework for integration of Deep Learning and Bayesian Reasoning

5 Conclusion

Based on the literature review, both deep learning and Bayesian reasoning show very impressive performance in various domains, but the black box problem of DL and the lack of accuracy in BR undermine certain aspects of their applicability. The new framework for the integration of DL and BR proposed in this paper is superior because the two complement each other and leverage their respective merits; in this way, they become more powerful in terms of prediction and causal reasoning. We believe that the new framework will bring significant impact to users, as it can be used to generate explanations for the predictions made by deep learning algorithms while also retaining high accuracy in prediction. In decision support, it will definitely increase users' trust in the recommendations generated by the system, which is very useful in many domains such as stock analysis, network intrusion detection, medical diagnosis, and machine fault localization.

Acknowledgments. This work was supported by the Fundamental Research Grant Scheme (FRGS) from the Ministry of Education and Multimedia University, Malaysia (Project ID: FRGS/1/2018/ICT02/MMU/02/1).

References 1. Kim, B.: Interactive and interpretable machine learning models for human machine collaboration. Doctoral dissertation, Massachusetts Institute of Technology (2015) 2. Lipton, Z. C.: The mythos of model interpretability. arXiv preprint arXiv:1606.03490 (2016) 3. Castelvecchi, D.: Can we open the black box of AI? Nat. News 538(7623), 20 (2016) 4. Ping, C. W.: A methodology for constructing causal knowledge model from fuzzy cognitive map to bayesian belief network. Unpublished Ph.D. Thesis, Chonnam National University (2009)

5. Cheah, W.P., Kim, Y.S., Kim, K.Y., Yang, H.J.: Systematic causal knowledge acquisition using FCM constructor for product design decision support. Expert Syst. Appl. 38(12), 15316–15331 (2011) 6. Wee, Y.Y., Cheah, W.P., Tan, S.C., Wee, K.: A method for root cause analysis with a Bayesian belief network and fuzzy cognitive map. Expert Syst. Appl. 42(1), 468–487 (2015) 7. Zilke, J.: Extracting rules from deep neural networks. Unpublished thesis (2015) 8. Zarikas, V., Papageorgiou, E., Regner, P.: Bayesian network construction using a fuzzy rule based approach for medical decision support. Expert Syst. 32(3), 344–369 (2015) 9. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006) 10. Bengio, Y.: Learning deep architectures for AI. Found. Trends® Mach. Learn. 2(1), 1–127 (2009) 11. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015) 12. Markoff, J.: Scientists see promise in deep-learning programs. New York Times, 23 (2012) 13. Deng, L.: Computational models for speech production. In: Ponting, K. (ed.) Computational Models of Speech Pattern Processing, pp. 199–213. Springer, Berlin, Heidelberg (1999). https://doi.org/10.1007/978-3-642-60087-6_20 14. Deng, L.: Switching dynamic system models for speech articulation and acoustics. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds.) Mathematical Foundations of Speech and Language Processing, pp. 115–133. Springer, New York (2004). https://doi.org/10.1007/978-1-4419-9017-4_6 15. George, D.: How the brain might work: a hierarchical and temporal model for learning and recognition. Stanford University, Palo Alto, California (2008) 16. Bouvrie, J.V.: Hierarchical learning: theory with applications in speech and vision. Doctoral dissertation, Massachusetts Institute of Technology (2009) 17. Pearl, J.: Causality: Models, Reasoning and Inference, vol. 29. MIT Press, Cambridge (2000) 18. Kleinberg, S.: Why: A Guide to Finding and Using Causes. O’Reilly Media Inc, Sebastopol (2015) 19. Kleinberg, S.: Causality, Probability, and Time. Cambridge University Press, Cambridge (2013) 20. Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and think like people. Behav. Brain Sci. 40, e253 (2017) 21. Bojarski, M., et al.: Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911 (2017) 22. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, August 2015 23. Holzinger, A., Biemann, C., Pattichis, C.S., Kell, D.B.: What do we need to build explainable AI systems for the medical domain? arXiv preprint arXiv:1712.09923 (2017) 24. Goodman, B., Flaxman, S.: European Union regulations on algorithmic decision-making and a “right to explanation”. AI Mag. 38(3), 50–57 (2017) 25. Ehsan, U., Harrison, B., Chan, L., Riedl, M.O.: Rationalization: a neural machine translation approach to generating natural language explanations. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 81–87. ACM, December 2018 26. Wang, H., Yeung, D.Y.: Towards Bayesian deep learning: a framework and some existing methods. IEEE Trans. Knowl. Data Eng. 28(12), 3395–3408 (2016) 27. 
Brendel, W., Bethge, M.: Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760 (2019)

28. Ribeiro, M.T., Singh, S., Guestrin, C.: Why should i trust you?: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, August 2016 29. Samek, W., Wiegand, T., Müller, K.R.: Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296 (2017) 30. Tamagnini, P., Krause, J., Dasgupta, A., Bertini, E.: Interpreting black-box classifiers using instance-level visual explanations. In: Proceedings of the 2nd Workshop on Human-In-theLoop Data Analytics, p. 6. ACM, May 2017 31. Burns, C., Thomason, J., Tansey, W.: Interpreting Black Box Models with Statistical Guarantees. arXiv preprint arXiv:1904.00045 (2019) 32. Bastani, O., Kim, C., Bastani, H.: Interpreting blackbox models via model extraction. arXiv preprint arXiv:1705.08504 (2017) 33. Shwartz-Ziv, R., Tishby, N.: Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810 (2017) 34. Gunning, D.: Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web, 2 (2017) 35. Gunning, D., Aha, D.W.: DARPA’s explainable artificial intelligence program. AI Mag. 40 (2), 44–58 (2019) 36. Hailesilassie, T.: Rule extraction algorithm for deep neural networks: a review. arXiv preprint arXiv:1610.05267 (2016) 37. González, C., Loza Mencía, E., Fürnkranz, J.: Re-training deep neural networks to facilitate Boolean concept extraction. In: Yamamoto, A., Kida, T., Uno, T., Kuboyama, T. (eds.) DS 2017. LNCS (LNAI), vol. 10558, pp. 127–143. Springer, Cham (2017). https://doi.org/10. 1007/978-3-319-67786-6_10

Assessing the Dependability of Apache Spark System: Streaming Analytics on Large-Scale Ocean Data

Janak Dahal1,2, Elias Ioup3, Shaikh Arifuzzaman1,2(✉), and Mahdi Abdelguerfi1,2

1 Computer Science Department, University of New Orleans, New Orleans, LA 70148, USA
2 Canizaro Livingston Gulf States Center for Environmental Informatics, New Orleans, LA 70148, USA {jdahal,smarifuz,mahdi}@uno.edu
3 US Naval Research Laboratory, Stennis Space Center, Hancock, MS 39529, USA [email protected]

Abstract. Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark’s streaming library is increasingly becoming a popular choice as it can stream and analyze a significant amount of data. In this paper, we analyze large-scale geo-temporal data collected from the USGODAE (United States Global Ocean Data Assimilation Experiment) data catalog, and showcase and assess the dependability of Spark stream processing. We measure the latency of streaming and monitor scalability by adding and removing nodes in the middle of a streaming job. We also verify the fault tolerance by stopping nodes in the middle of a job and making sure that the job is rescheduled and completed on other nodes. We design a full-stack application that automates data collection, data processing and visualizing the results. We also use Google Maps API to visualize results by color coding the world map with values from various analytics. Keywords: Parallel performance · Fault tolerance · Streaming analytics · Apache spark · Hadoop · Temporal data · Large-scale system

1 Introduction

Processing and analyzing data in real time can be a challenge because of its size. In the current age of technology, data is produced and continuously recorded by a wide range of sources [3, 4]. According to a marketing paper published by IBM in 2017, as of 2012, 2.5 quintillion bytes of data were generated every day, and 90% of the world's data had been created since 2010 [22]. With new satellites, sensors, and websites coming into existence every day, data is only bound to grow exponentially. The number of users interacting with these media is producing data at an enormous rate [1, 2, 15]. With the Internet reaching new nooks and corners of the world, sources of potential data are ever-growing. As more data keeps coming into existence, the necessity of a system that can analyze it in real time becomes even more imminent. Although the concept of batch

processing (using multiple commodity machines in a truly distributed setting [18]) was a revolution when it first came into existence, it might not be a complete solution to the need for real-time processing. Such on-the-fly processing has applications in many areas such as banking, marketing, and social media. For example, identifying and blocking fraudulent banking transactions require quick actions by processing vast amounts of data and producing quick results. Sensitive and illegal posts on social media can be quickly removed to nullify the adverse effects on its users. Weather data, like the one used in this research, can be analyzed in real time to detect or predict different climatic conditions.

2 Background

The notion of using commodity machines as a computational power came into existence with the advent of the Google File System (GFS). It introduced a distributed file system that excelled in performance, scalability, reliability, and availability [9]. As this truly distributed and replicated file system became rigidly stable, the next step up the ladder was to be able to process the data stored in it. For this, Google introduced MapReduce as a programming model [7]. This new parallel programming model demonstrated the ability to write small programs (map and reduce classes) for processing big data. It introduced the concept of offloading computation to the data itself, thus nullifying the effect of the network bottleneck on batch processing by not having to move the input data between nodes. Hadoop is the most popular MapReduce framework today, but it has its limitations. The most prominent shortcoming of Hadoop lies in iterative data processing [23]. Extending Hadoop beyond conventional batch processing requires various third-party libraries. Storm can be used along with Hadoop to accomplish real-time processing [10]. Other libraries such as Hive, Giraph, HBase, Flume, and Scalding are designed to tackle specific operations, e.g., querying and graphing. Managing these different libraries can be time-consuming from a development point of view.

With Hadoop's limitations in mind, another large-scale framework, Spark, was designed to reuse a working set of data across multiple operations [23]. The more iterative a computation is, the more efficiently the job runs on Apache Spark. The Spark streaming library has become widely popular for running real-time processing jobs. This library allows applications to stream data from different sources [14]. Some of the most popular streaming sources include Kafka, Flume, Twitter, and HDFS. Data can be streamed into the streaming job from one or more sources and unified into a single stream, as sketched below. For the application designed for this paper, data is streamed from the Hadoop File System (HDFS).
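A minimal sketch of unifying several input streams into one, assuming two hypothetical HDFS directories as sources; the paths and batch interval are illustrative only.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-sketch")
    val ssc = new StreamingContext(conf, Seconds(30)) // 30 s batch window (illustrative)

    // Two file-based sources, each monitored for newly arriving files.
    val a = ssc.textFileStream("hdfs:///data/sourceA") // hypothetical path
    val b = ssc.textFileStream("hdfs:///data/sourceB") // hypothetical path

    // Unify them so downstream transformations see a single stream.
    val unified = a.union(b)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}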

3 Apache Spark

Introduced in a paper published in 2010, Spark is a cluster computing framework that uses a read-only collection of objects called Resilient Distributed Datasets (RDDs), which let users perform in-memory calculations on large clusters [24]. RDDs are fault-tolerant, parallel data structures that make it possible to explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators [24]. As the intermediate results are stored

in memory, iterative analytics such as PageRank calculation, k-means clustering, and linear regression become much more efficient in Spark compared to Hadoop [10].

3.1 Resilient Distributed Data (RDD)

An RDD is defined as a collection of elements partitioned across different nodes in a cluster that can be operated on in parallel [24]. From a user's point of view, it looks like a data structure, but behind the scenes, it performs all the operations necessary to run in a distributed framework. Failures across large clusters are inevitable; thus, the RDDs in Spark were designed with fault tolerance in mind. Since most of the operations in Spark are lazy (no operations are run on data unless an action, e.g., collect, reduce, etc., is called), the operations on RDDs are stored in the form of a Directed Acyclic Graph (DAG). A DAG is a collection of functional lineage such as map and filter. Such awareness of the functional lineage makes it possible for Spark to handle node failures gracefully [24]. These RDDs drive the streaming framework in Apache Spark. They have the following properties, which make sure that Apache Spark Streaming maintains its integrity:

Replicated. RDDs are split between various data nodes in a cluster. Replicas are also spread across the cluster to make sure that the system can recover from any aftermath of a node crash. Processing occurs on nodes in parallel, and all RDDs are stored in memory on each node.

Immutable. When an operation is performed on an RDD, the original RDD is not changed. Instead, a new RDD is created out of that operation [24]. Only two kinds of operations are performed on an RDD, namely transformations and actions. A transformation transforms the RDD into a new one, whereas an action gets data from the RDD.

Resilient. Resiliency pertains to the replication of data and the storing of the lineage of operations on RDDs. When a worker node crashes, the state of an RDD can be regenerated by running the same set of transformations to reach the current state of the RDD [24].

3.2 Apache Spark Streaming

In many real-world applications, time-sensitive data can often get stale very quickly. Thus, to make the most of such data, it must be analyzed on time. For example, if a banking website starts generating piles of 500 errors, the potential of an incoming request crashing the server must be evaluated in real time. Traditional MapReduce is not a viable solution for such cases, as it is mostly suited for offline batch processing where results are not associated with any latency requirement [23]. If the input data is repeatedly produced in discrete sets, multiple passes of the map and reduce tasks would create overhead, which can be eliminated by using Spark instead. Apache Spark Streaming lets the program store results in an intermediate format in memory, and when new data arrives as another discrete set, it is batched so that transformations can be performed on it quickly and efficiently [23]. Figure 1 outlines the Apache streaming framework. Data can be streamed into the Apache Spark streaming framework from various sources like Kafka, Flume, Twitter, and HDFS [17]. A receiver must be instantiated and hooked

Fig. 1. Outline of Apache streaming framework used in this paper.

up with the streaming source to start the flow of data. One receiver can only stream data from one input source, and if we have multiple stream sources, then we can union them so that they can be processed as a single stream [12]. Once the receiver starts receiving the data from the streaming source, Spark stores the data in a series of RDDs delineated by a specified time window. After this time, the data is passed into the Spark core for processing. To start any Spark streaming job, there need to be at least two cores: one that receives the data as a stream and one that processes the data.
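The following is a minimal sketch of such a job, assuming an illustrative HDFS input directory and batch window; note the local[2] master, reflecting the requirement of at least two cores (one for the receiver, one for processing).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object StreamingSkeleton {
  def main(args: Array[String]): Unit = {
    // At least two cores: one to receive the stream, one to process it.
    val conf = new SparkConf().setAppName("streaming-skeleton").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Minutes(1)) // batch window (illustrative)

    // New files landing in this HDFS directory form the stream.
    val lines = ssc.textFileStream("hdfs:///incoming") // hypothetical path

    // Each batch window becomes a series of RDDs handed to the Spark core.
    lines.foreachRDD { rdd => println(s"records in this window: ${rdd.count()}") }

    ssc.start()
    ssc.awaitTermination()
  }
}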

4 Streaming Analytics on Large-Scale Ocean Data

We develop an application to run queries on a large oceanographic dataset and produce results on the fly. Apache Spark is chosen as the platform for the application because of its streaming library. We stream data into the streaming job from HDFS. We collected data from the United States Global Ocean Data Assimilation Experiment (USGODAE) data catalog and then processed and stored it in HDFS. The application streams new data within a configurable window of time and runs transformations and actions to generate results.

4.1 Setup and Configuration

Although Hadoop is not required to run Spark, we installed it because our application reads data from HDFS. Hadoop was first installed in a single-node setting, and then other nodes were added one at a time. Each time a node was added, the sample MapReduce tasks were run to make sure that the job was making use of all the nodes. Five nodes with identical computational power were used to create the cluster. We install Apache Spark along with SBT and Scala. SBT is used to build the Scala projects. Scala is used as the programming language of choice to write streaming jobs. We install Spark in the same way as Hadoop, by starting with a single node and adding one node at a time. Two worker instances (SPARK_WORKER_INSTANCES = 2) ran on each machine to utilize the dual CPUs. Each worker is set up to utilize up to 15 GB of memory (SPARK_WORKER_MEMORY = 15 GB) and up to 16 cores (SPARK_WORKER_CORES = 16). We set up the Hadoop File System (HDFS) on each of the nodes. YARN, a resource manager with a dashboard to visualize and summarize the metrics, runs on the driver node. We set up a REPL environment, or Spark shell, on each node to make sure that debugging is swift when a transformation needs to be performed on a set of data. Figure 2 summarizes the Apache Spark installation.

Fig. 2. Apache Spark system configuration used for our work. The IP addresses are hidden for privacy reasons.

4.2 Description of Datasets

Data is generated every 6 h by an oceanographic model (NAVGEM, the Navy Global Environmental Model) that predicts various environmental variables for the next 24 h to 180 h. The number of output files from the model depends on the type of variable. The data is generated for 198 different variables covering the entire world with a precision interval of 0.5◦. The model generates multiple files with the results, and each file contains only data for a single variable. The complete set of data for ten years is about 110 TB, but we have only about 4.5 TB of disk space available in the distributed file storage. Therefore, we include only four variables in our experiments: ground sea temperature, pressure, air temperature, and wind speed. We use Panoply [10] as a GUI to visualize the input data and resulting data. Our datasets cover the entire world, so the size of the data array is 361 × 720, where 361 represents all latitude points from −90◦ north to +90◦ north with 0.5◦ increments, whereas 720 represents all longitude points from 360◦ east to 0◦ east, also with 0.5◦ increments.

Procedure for Data Collection. We collect our data using the following steps.
1. A Java program downloads the data into the filesystem.
2. NCAR Command Language was used to convert the data from GRIB (General Regularly-distributed Information in Binary form) format [11] into the NetCDF3 data format.
3. CDO (Climate Data Operators [19], written by the Max Planck Institute for Meteorology) was used to merge the data files so each file could contain data for multiple variables.
4. Files were copied to the HDFS using standard HDFS commands.
We wrote a bash script to automate the above steps and make them seamless.

4.3 SciSpark

Our application extends the functionality of the SciSpark [16] project by changing its open-source code as needed. The SciSpark library facilitates the process by removing the need to write wrapper classes to represent GRIB data. The library provides a class called SciTensor that represents NetCDF data and implements all basic mathematical operations such as addition, subtraction, and multiplication. We add new functions to the SciTensor class to calculate the maximum (max), minimum (min), and standard deviation. Other significant changes included logic to account for missing variables in a dataset. For multivariable analysis, we added logic to create unique names for the x and y axes when creating a NetCDF result file with more than one variable. We create RDDs using the SciTensor library and feed them into the Spark streaming queue.

4.4 Application in Use

Our application streams new files from a location in HDFS and writes the results back to HDFS. The job runs with a configurable time window and performs transformations and actions on all the RDDs accumulated during that time frame. We use the QueueStream API in Apache Spark to read the stream of new RDDs inside the streaming job. New RDDs are represented as a Discretized Stream (DStream) of type SciTensor. The Spark Streaming API defines a DStream as the fundamental abstraction in Spark Streaming: a continuous sequence of RDDs (of the same type) [26]. Figure 3 summarizes the outline of our application, and a minimal sketch of the queue-based streaming loop follows the figure.

Fig. 3. A simplistic overview of our entire application. We design a full-stack application that automates data collection, data processing and visualizing the results.
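A minimal sketch of the queue-based loop, assuming plain Double-valued RDDs rather than the SciTensor type used in the actual application (the queue contents and interval are illustrative).

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("queue-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // New RDDs pushed into this queue appear as a DStream inside the job.
    val queue = new mutable.Queue[RDD[Double]]()
    val stream = ssc.queueStream(queue)

    // Transformations/actions run on the RDDs accumulated in each window.
    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) println(s"window mean: ${rdd.mean()}")
    }

    ssc.start()
    queue += ssc.sparkContext.makeRDD(Seq(1.0, 2.0, 3.0)) // illustrative data
    ssc.awaitTerminationOrTimeout(30000)
    ssc.stop()
  }
}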

A scheduled job running on the host runs every hour to download new data from the FTP server. After the download is completed, the data is processed and uploaded to HDFS. The streaming job running on the cluster processes these new files and updates the result. The website running on a separate server polls the result file and visualizes the data using Google Maps.

5 Performance Evaluation

We use statistical analysis on the data using Apache Spark Streaming to generate summary results in real time. In the process, we also assess the dependability and scaling of the Spark streaming framework.

5.1 Complexity of the Operation

We evaluate and measure two significant steps in the streaming process, namely transformations and actions. We design multiple mathematical queries of varying complexity and run jobs to measure the performance of Apache Spark Streaming. For example, average, maximum, and minimum are more straightforward mathematical operations, whereas standard deviation can be regarded as a more complex one. We perform the following statistical analyses: mean, max, min, and standard deviation. Once a user submits the streaming job, it cannot be changed for the lifetime of that job. The input sizes per streaming window for each job were approximately 180 MB, 500 MB, 1 GB, and 2 GB. The streaming window was set to 6 h for the streaming process because the input data is produced by the model every 6 h.

Variation of Each Statistical Analysis. Since GRIB1 data represents values in 361 × 720 2D arrays and the values are scattered across multiple files, to calculate an aggregate for each index, the same indices across multiple files were aggregated. To calculate aggregate results for each latitude and each longitude point, 361 and 720 more values in each file needed to be aggregated, respectively. Moreover, calculating one single aggregate result for all the values across all the files increased the operation complexity, as it had to aggregate more values. The variations of the statistical analysis, in ascending order of complexity, are listed as follows (a Spark sketch of variations (i) and (iv) is given before Table 1): (i) one result for each combination of latitude and longitude points, (ii) one result for each latitude point, (iii) one result for each longitude point, and (iv) one single aggregate result for all data points. Table 1 shows the average execution time for each variation of all four statistical analyses. It shows that the execution time is directly proportional to the complexity of the operation. More transformations were required on the data when running with variations 2, 3, and 4. Each additional transformation increased the length of the DAG and thus increased the execution time.

Multivariable Analysis of the GRIB Data. In addition to the above metrics, we also perform a multivariable analysis to measure the latency of each streaming window. The same four statistical analyses were performed, but with a varying number of variables. These analyses were serialized, thus increasing the number of transformations and actions for each additional variable. There was 50 GB of data initially stored in HDFS, which required longer execution time as each worker had to process more data. Each streaming window was once again fed with four different datasets of size 180 MB, 500 MB, 1 GB, and 2 GB. Figure 4 shows our results on the initial set of data. As expected, the execution time increases with the complexity of the operation. The standard deviation operation took the most time because the algorithm had multiple transformations to perform.
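As an illustration of variation (i) versus variation (iv), the sketch below aggregates toy (latitude, longitude, value) records, first per grid cell and then into one global standard deviation; the record layout is an assumption made for the sketch, not the SciTensor representation used in the real job.

import org.apache.spark.{SparkConf, SparkContext}

object AggregationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("agg-sketch").setMaster("local[2]"))

    // Toy records: (lat index, lon index, value) gathered from several files.
    val records = sc.parallelize(Seq((0, 0, 1.0), (0, 0, 3.0), (0, 1, 2.0), (0, 1, 4.0)))

    // Variation (i): one mean per (lat, lon) cell.
    val perCell = records
      .map { case (lat, lon, v) => ((lat, lon), (v, 1L)) }
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (s, c) => s / c }
    perCell.collect().foreach(println)

    // Variation (iv): one global standard deviation over all data points.
    println(records.map(_._3).stdev())

    sc.stop()
  }
}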

Table 1. Result for statistical analysis

Variation  Dataset size  DAG length  Execution time (s)
1          180 MB        5           22
1          500 MB        5           37
1          1 GB          5           49
1          2 GB          5           81
2          180 MB        6           23
2          500 MB        6           41
2          1 GB          6           51
2          2 GB          6           101
3          180 MB        6           22
3          500 MB        6           43
3          1 GB          6           60
3          2 GB          6           117
4          180 MB        7           42
4          500 MB        7           87
4          1 GB          7           133
4          2 GB          7           278

Fig. 4. Statistical Analysis of the initial set of data.

Further, as for DAG lengths, standard deviation has a larger DAG than those of max and mean. The length of a DAG is directly related to the latency of the corresponding streaming job. In other words, more map and filter functions are run on the dataset for operations with higher complexity. The dataset for each 6-h period was roughly 1 GB in size, and the latency for streaming 1 GB of data was significantly smaller than for the initial data. For input sources that generate discrete data at a regular interval, a streaming job is more suitable than

a batch processing job because of the lack of overhead in running an iterative job [6]. Figure 5 shows the results for batches of streams, which achieve better runtime performance than on the initial set of data.

Fig. 5. Statistical analysis of batches of streams.

5.2 Number of Executor Nodes

We ran streaming jobs with different numbers of worker nodes to record the change in latency. Data was streamed from HDFS, and YARN was used as a dashboard to visualize the states of the different worker nodes. Since Apache Spark utilizes in-memory datasets [23], the multi-node setup outperformed the single-node setup, as it could use more resources from each worker, as shown in Fig. 6. It is clear from the figure that there is linear scalability in latency for a streaming job. This result shows that the efficiency of a streaming job is directly proportional to the number of workers.

Fig. 6. Statistical analysis on the initial transformation vs. the number of executors.

5.3 Scalability

As data grows and higher processing speed is desired, new nodes should be easily added to the cluster. During our experiment, nodes were killed and started in the cluster with fair speed and ease. We wrote bash scripts to control the state of a node and used the YARN dashboard to verify the state. Figure 7 shows the state of the cluster after multiple nodes were killed. Further, Figs. 8 and 9 showcase how different metrics of a streaming job can be visualized using Apache Spark's dashboard. Figure 8 plots the scheduling delay, and Fig. 9 plots the processing time for batches run with different numbers of executor nodes. Yellow represents six executors, brown five executors, and purple three executors. The sizes of the datasets in the different batches were 180 MB, 500 MB, and 1 GB. The scheduling delay and processing time are both directly proportional to the size of the data and inversely proportional to the number of worker instances.

Fig. 7. Status of dead workers on YARN dashboard.

Fig. 8. Scheduling delay for different datasets with the varying number of executors. Yellow represents six executors, brown five executors, and purple three executors. (Color figure online)

Fig. 9. Processing time for different datasets with the varying number of executors. Yellow represents six executors, brown five executors, and purple three executors. (Color figure online)

5.4 Fault Tolerance

Spark can reconstruct the RDDs using the lineage information stored in the RDD objects when a node falls apart [23]. Since the data is already replicated across nodes in HDFS, lost partitions can be reconstructed in parallel across multiple nodes without much overhead. If the node running the receiver fails, then another node is spun up with the receiver, which can continue to read from HDFS. If the receiver were using Kafka or Flume as a source instead of HDFS, then a small amount of data that had not yet been replicated to other nodes in the cluster might be lost [6]. We measure the performance of a system running a streaming job with various node failures to assess the fault tolerance capability of Apache Spark streaming. Spark's dashboard interface was used to visualize the difference in latency for different batches running with and without node failures. Figure 10 shows that if some nodes fail while running a batch, it takes longer to account for the lost nodes and reschedule those jobs on different node(s). For instance, stage ID 520 lost a node with two workers, and the driver had to reschedule seven tasks running on that node somewhere else. As a result, the latency increased from 2.9 to 5.1 min. A checkpoint-based recovery sketch is shown after Fig. 10.

Fig. 10. Difference in processing time for node failures. The first row demonstrates the execution time with node failures.
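A minimal sketch of driver-side recovery via checkpointing, assuming a hypothetical checkpoint directory; this uses the standard StreamingContext.getOrCreate pattern rather than anything specific to the authors' deployment.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object CheckpointSketch {
  val checkpointDir = "hdfs:///checkpoints/streaming-sketch" // hypothetical path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-sketch")
    val ssc = new StreamingContext(conf, Minutes(1))
    ssc.checkpoint(checkpointDir) // persist metadata/lineage for recovery
    ssc.textFileStream("hdfs:///incoming").count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart after a failure, the context is rebuilt from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}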

5.5 Visual Application

We develop a web interface to demonstrate a sample usage of our application. The web page uses Google Maps and its developer API to visualize the results generated by our application. The web application is written in the .NET MVC framework. The server-side code grabs the latest result from the cluster by using the WinSCP library (this was used to avoid installing FTP on the master in the cluster), then converts the results into text format using ncl dump. A text dump of the resulting NetCDF file is processed and sent to the view. A JavaScript function regularly polls for the result, and once the Spark application generates the result, it is visualized on the web (Fig. 11).

Fig. 11. Screenshot of a color-coded representation of the result. This UI visualizes a single-variable result file by color coding the latitude and longitude over Google Maps based on the value of the variable at that coordinate.

6 Additional Observations and Findings

Spark Streaming vs. Hadoop's Batch Processing vs. Storm Trident. An iterative job like the one used in this experiment can be expressed as multiple Map and Reduce operations in Hadoop. However, different MapReduce jobs cannot share data. So, for iterative analysis, the same dataset must be read from HDFS multiple times, and results would need to be written to HDFS many times as well [5]. These iterations create much overhead because of the I/O operations and other unwanted computations [8]. Spark tackles these issues by storing intermediate results in memory. Spark Streaming uses D-Streams, or discretized streams of RDDs, which provide consistent, "exactly-once" processing across the cluster [25] and thus significantly increase the performance of iterative analysis. Apache Storm can process unbounded streams of data in real time, and it can be used alongside Hadoop, but it only guarantees "at-least-once" processing [20]. Trident bolsters Storm by providing micro-batching and other abstractions that ensure "exactly-once" processing [21]. It would take three different libraries working seamlessly together to accomplish what Spark Streaming can accomplish by itself. The time and effort required to set up and maintain a Storm Trident application along with Hadoop can hamper production and deployment. In contrast, Spark's Streaming library is written directly over its core and maintained by the same group who maintain the core's code base. Thus, Spark Streaming outshines both Hadoop and the Storm Trident combination for streaming scientific data.

Limitations of Spark. When a dataset is large enough that no more RDDs can be stored in memory, Spark starts to replace RDDs, and such frequent replacement degrades the latency [13]. However, for this work, we needed a framework that would seamlessly stream relatively large datasets, and Spark Streaming was able to handle it efficiently.

7 Conclusions

We use SciSpark successfully with Apache Spark to stream GRIB1 data in a streaming application. The bulk of the logic in this application lies in the ability to convert the statistical analyses into transformations and actions that run on the DStream of RDDs of type SciTensor. Datasets ranging from 180 MB to 50 GB were used in the application without running into any memory issues. Various properties of a streaming application, like operation complexity, scalability, and fault tolerance, were assessed, and results were summarized using simple mathematical operations like mean, min/max, and standard deviation. Based on these results and other properties of Apache Spark Streaming, we are confident that Spark Streaming is a better solution than Hadoop or Storm Trident for streaming scientific data.

Acknowledgments. Part of this work was supported by ONR contracts N00173-16-2-C902 and N00173-14-2-C901, and Louisiana Board of Regents RCS Grant LEQSF(2017-20)-RD-A-25.

References 1. Arifuzzaman, S., Khan, M.: Fast parallel conversion of edge list to adjacency list for largescale graphs. In: Proceedings of the 23rd High Performance Computing Symposium (HPC 2015), Alexandria, VA, USA, pp. 17–24, April 2015 2. Arifuzzaman, S., Khan, M., Marathe, M.: A fast parallel algorithm for counting triangles in graphs using dynamic load balancing. In: 2015 IEEE BigData Conference (2015) 3. Arifuzzaman, S., Khan, M., Marathe, M.V.: PATRIC: a parallel algorithm for counting triangles in massive networks. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013), San Francisco, CA, USA, pp. 529–538, October 2013 4. Arifuzzaman, S., Pandey, B.: Scalable mining, analysis, and visualization of protein-protein interaction networks. Int. J. Big Data Intell. (IJBDI) 6(3/4), January 2019. https://doi.org/10. 1504/IJBDI.2019.10019036 5. Bu, Y., et al.: Haloop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3(1–2), 285–296 (2010). https://doi.org/10.14778/1920841.1920881 6. Cordava, P.: Analysis of real time stream processing systems considering latency. White paper (2015) 7. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008) 8. Ekanayake, J., et al.: Twister: a runtime for iterative mapreduce. In: 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818 (2010). https://doi. org/10.1145/1851476.1851593 9. Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. SIGOPS Oper. Syst. Rev. 37(5), 29–43 (2003). https://doi.org/10.1145/1165389.945450 10. Gopalani, S., Arora, R.: Comparing apache spark and map reduce with performance analysis using K-means. Int. J. Comput. Appl. 113(1), 8–11 (2015) 11. GRIB: Converting grib (1 or 2) to netcdf. www.ncl.ucar.edu/Applications/griball.shtml (2018). Accessed 9 Dec 2018 12. Grulich, P.M., Zukunft, O.: Bringing big data into the car: Does it scale? In: 2017 International Conference on Big Data Innovations and Applications (Innovate-Data), pp. 9–16 (2017). https://doi.org/10.1109/Innovate-Data.2017.14

13. Gu, L., Li, H.: Memory or time: performance evaluation for iterative operation on hadoop and spark. In: 2013 IEEE 10th International Conference on High Performance Computing and Communications, pp. 721–727 (2013). https://doi.org/10.1109/HPCC.and.EUC.2013.106 14. Krob, J., Krcmar, H.: Modeling and simulating apache spark streaming applications. Softwaretechnik-Trends 36, 1–3 (2016) 15. Motaleb Faysal, M.A., Arifuzzaman, S.: A comparative analysis of large-scale network visualization tools. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 4837– 4843, December 2018. https://doi.org/10.1109/BigData.2018.8622001 16. Palamuttam, R., et al.: SciSpark: applying in-memory distributed computing to weather event detection and tracking. In: 2015 IEEE International Conference on Big Data (Big Data), pp. 2020–2026 (2015). https://doi.org/10.1109/BigData.2015.7363983 17. Salloum, S., et al.: Big data analytics on apache spark 1(3), 145–164 (2016). https://doi.org/ 10.1007/s41060-016-0027-9 18. Sattar, N.S., Arifuzzaman, S.: Overcoming mpi communication overhead for distributed community detection. In: Majumdar, A., Arora, R. (eds.) Software Challenges to Exascale Computing, pp. 77–90. Springer, Singapore (2019) 19. Schulzweida, U., et al.: CDO user’s guide: Climate data operators, April 2018 20. Toshniwal, A., et al.: Storm@ twitter. In: 2014 ACM SIGMOD International Conference on Management of Data, pp. 147–156 (2014). https://doi.org/10.1145/2588555.2595641 21. Trident, A.: Trident tutorial (2018). https://storm.apache.org/documentation/Tridenttutorial. html. Accessed 9 Dec 2018 22. Winans, M., et al.: 10 key marketing trends for 2017 (2017). www-01.ibm.com/common/ssi/ cgi-bin/ssialias?htmlfid=WRL12345USEN. Accessed 9 Dec 2018 23. Zaharia, M., et al.: Spark: cluster computing with working sets. In: 2nd USENIX Conference on Hot Topics in Cloud Computing, 22–25 June 2010 24. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: 9th USENIX Conference on Networked Systems Design and Implementation, p. 2 (2012) 25. Zaharia, M., et al.: Discretized streams: fault-tolerant streaming computation at scale. In: Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 423–438 (2013). https://doi.org/10.1145/2517349.2522737 26. Zaharia, M., et al.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https://doi.org/10.1145/2934664

On the Assessment of Security and Performance Bugs in Chromium Open-Source Project

Joseph Imseis1, Costain Nachuma1, Shaikh Arifuzzaman1,2(✉), and Zakirul Alam Bhuiyan3

1 Computer Science Department, University of New Orleans, New Orleans, LA 70148, USA {jimseis,cnachuma,smarifuz}@uno.edu
2 Big Data and Scalable Computing Research (BDSC), UNO, New Orleans, LA 70148, USA
3 Computer Science Department, Fordham University, Bronx, NY 10458, USA [email protected]

Abstract. An individual working in software development should have a fundamental understanding of how different types of bugs impact various project aspects. This knowledge allows one to improve the quality of the created software. The problem, however, is that previous research typically treats all types of bugs as similar when analyzing several aspects of software quality (e.g., predicting the time to fix a bug) or concentrates on a particular bug type (e.g., performance bugs) with little comparison to other types. In this paper, we look at how different types of bugs, specifically performance and security bugs, differ from one another. Our study is based on a previous study by Zaman et al. [1], which was performed on the Firefox project. Instead of Firefox, we work with the Chrome web browser. In our case study, we find that security bugs are fixed faster than performance bugs and that the developers who were assigned to security bugs typically have more experience than the ones assigned to performance bugs. Our findings emphasize the importance of considering the different types of bugs in software quality research and practice.

Keywords: Assessment of security · Performance bugs · Data mining · Software repository · Chromium project · Open-source project



1 Introduction

In past studies, researchers have noted that the majority of software development costs come from maintenance and evolutionary activities [2, 3]. Researchers have focused on detecting various code smells and types of bugs in order to improve the quality assurance of software products. For example, CCFinder, a token-based code clone detection system, was used by researchers to determine how code clones affected the chances of faults occurring in a particular program [4]. Other prediction models tend to focus on the time it takes to fix a bug [5–7] and which developer should fix that bug [8].


The problem is that most researchers working in quality assurance tend to treat bugs as generic, leaving out the differing aspects of each bug (i.e., there is no distinction between different bug types). This is concerning, because a security bug that gives an intruder administrator access could be grouped alongside a simple syntax bug. Taking this example into consideration, we believe that it is vital for software quality researchers and developers to take into account the various aspects of different bug types. Based on a previous study by Zaman et al. [1], we study the characteristics and differences of security and performance bugs in the Chromium open-source project to demonstrate that these characteristics differ from one bug type to another. We aim to answer the following three research questions: (Q1) How fast are the bugs fixed? On average, performance bugs take ~44% longer to fix than security bugs. (Q2) Which types of bugs are assigned to more experienced developers? Developers who work on security bugs have, on average, fixed ~37% more prior bugs, although developers who work on performance bugs have slightly (~8%) more experience measured in days. (Q3) What are the characteristics of the bug fixes? On average, security bug fixes have significantly more lines of code added/deleted. Section Summary: Section 2 presents our design and approach. Section 3 answers the three research questions. Section 4 discusses the limitations. Section 5 discusses related work. Section 6 concludes the paper.

2 Design and Approach

This section presents the design of our study, in which we compare security and performance bugs of the Chromium web browser. Figure 1 gives an overview of our approach. First, we scrape the data from the Chromium bug issue tracker website. The bugs obtained are then categorized as security or performance bugs. For all the bugs, we then calculate several metrics and use them to make comparisons across the bug types. This section explains each of these steps in a generalized fashion.

2.1 Classification of Bugs

The Chromium issue tracker has a built-in tagging feature that we were able to use. Specifically, we wanted to look at all issues that contained fixed bugs. We did this by selecting "All issues" from the dropdown menu and typing the keywords "status: Fixed" in the omnibox provided.


Fig. 1. Overview of approach to analyze the differences in characteristics of security and performance bugs.

Luckily, the issue tracker also has a "Type" parameter that allowed us to search specifically for security bugs (i.e., our input for the omnibox was "Type = Bug-Security"). However, no category of the Type parameter pertained to performance. Hence, in order to obtain bugs related to performance, we used the "summary:" parameter, which checks the description of a bug for the search term provided (e.g., the search "summary: performance" would return bugs with the word performance in their summary/description section). In addition to the word "performance", we also used heuristic keywords to obtain more bugs related to performance. The heuristics agreed upon are as follows: "duration", "execution time", "fast", "faster", "memory", "runtime", "slow", and "slower". Using these heuristics, we found 4605 performance bugs. A similar method was followed for security bugs using the following terms: "exploit", "safe", "safety", "secure", "steal", "vulnerable", "vulnerability", and "vulnerabilities". A total of 5575 security bugs were found, of which 4899 bugs came from the "Type: Bug-Security" search and 676 bugs came from our heuristic terms.
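For illustration, these omnibox queries can be generated mechanically; a minimal sketch (the keyword lists come from the text above; the query composition is an assumption, not the tracker's documented syntax):

```python
# Build omnibox queries for the Chromium issue tracker from the
# heuristic keyword lists used in this study.
PERFORMANCE_TERMS = ["performance", "duration", "execution time", "fast",
                     "faster", "memory", "runtime", "slow", "slower"]
SECURITY_TERMS = ["exploit", "safe", "safety", "secure", "steal",
                  "vulnerable", "vulnerability", "vulnerabilities"]

def build_queries(terms):
    # Restrict each query to fixed bugs whose summary contains the term.
    return ['status: Fixed summary: "%s"' % term for term in terms]

for query in build_queries(PERFORMANCE_TERMS + SECURITY_TERMS):
    print(query)
```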


Unfortunately, without the use of the Monorail API, downloading these bug reports becomes tedious, as the bug issue tracker only allows a user to download 1000 records at a time. To remedy this, we wrote a simple Python script, RecordScrapper, which automatically sends download requests for the next 1000 records. RecordScrapper iteratively sends these requests until all the records pertaining to a particular category are downloaded. Note that all bugs obtained via the method mentioned above were downloaded as CSV files between March 2019 and April 2019.
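RecordScrapper itself is not listed in the paper; the batching idea can be sketched as follows (the export URL and its parameter names are illustrative placeholders, not the tracker's documented API):

```python
import requests

CSV_URL = "https://bugs.chromium.org/p/chromium/issues/csv"  # illustrative

def download_all(query, batch=1000, max_records=20000):
    """Request CSV exports in 1000-record batches until none remain."""
    rows = []
    for start in range(0, max_records, batch):
        # 'q', 'num', and 'start' are hypothetical parameter names.
        params = {"q": query, "num": batch, "start": start}
        resp = requests.get(CSV_URL, params=params, timeout=30)
        resp.raise_for_status()
        lines = resp.text.splitlines()[1:]  # drop the CSV header row
        if not lines:
            break
        rows.extend(lines)
    return rows
```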

2.2 Tracking Bugs

To answer research question 1, we determine the life cycle of every category of bug. To do this we wrote a Python script, DateScrapper, which uses the bug data, specifically the BugID, obtained from RecordScrapper as input. DateScrapper then iteratively takes a BugID and appends it to the end of this Chromium issue tracker link: https://bugs.chromium.org/p/chromium/issues/detail_ezt?id=. By doing so, the script gains access to the information associated with that particular BugID. From here, the script obtains the start/assigned date and end/fixed date for each individual BugID. It does so by employing a Python package containing an HTML parser, known as Beautiful Soup. Beautiful Soup allows us to create a soup object which contains the HTML of an individual bug report. Using this soup object, we can target key areas of the website by their tags, classes, or several other web elements. This is how we acquire the start/assigned date and end/fixed date. In the event that no start date is found, the date that the bug was assigned to a developer is used. If both the start date and the assignment date are not found, then the date that the bug was reported is taken as the start date. If the end date/date fixed is not found, then the date that the bug was closed is taken as the end date. We had no instances where the closed date was missing, as this is a mandatory field for each bug enforced by the Chromium team. The times obtained by DateScrapper are represented in UNIX time and need to be converted into a more readable format. To do so, we wrote another Python script, UnixConverter, which computes the length of time taken to fix a bug in the typical month, day, year, hour format. It first converts the start and end dates acquired from DateScrapper into this standard form and then takes the difference between the start and end dates, which gives us the total length of time taken to fix a bug (i.e., the bug's life cycle). Note that for our purposes the output file of UnixConverter only contains the BugID and the number of days and hours taken to fix a particular bug. We compute the mean of these values and use them as part of our analysis in Sect. 3.
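Neither script is listed in the paper; a minimal sketch of the two steps (the CSS selectors and attribute names are hypothetical — the real page markup must be inspected):

```python
import requests
from bs4 import BeautifulSoup

DETAIL_URL = "https://bugs.chromium.org/p/chromium/issues/detail_ezt?id="

def scrape_dates(bug_id):
    """DateScrapper step: return (start, end) UNIX timestamps for a bug."""
    html = requests.get(DETAIL_URL + str(bug_id), timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    start = int(soup.select_one(".assigned-date")["data-time"])  # hypothetical selector
    end = int(soup.select_one(".fixed-date")["data-time"])       # hypothetical selector
    return start, end

def lifecycle(start, end):
    """UnixConverter step: life cycle as whole days plus leftover hours."""
    seconds = end - start
    return seconds // 86400, (seconds % 86400) // 3600
```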


Fig. 2. This figure illustrates that one particular bug, represented by its bugID (green box), may have multiple change log links (blue boxes) associated with it. (Color figure online)

Fig. 3. This figure is the result of clicking on the first hyperlink (blue box) in Fig. 2. Notice that the bug ID (boxed in green) matches the bug ID shown in Fig. 2. (Color figure online)

2.3 Obtaining Final Metrics

In order to answer research questions 2 and 3, we need to determine the lines of code a single developer has altered (i.e., added or removed), the number of days taken to fix a bug, and the number of bugs each developer has fixed. We created a fourth Python script, MergeEmail, which takes the BugIDs from the output of RecordScrapper (BugID, developer email, etc.) and merges it with the output file of DateScrapper (BugID, start and end time in UNIX format). This produces a file which contains the BugID, developer email, start time and end time (UNIX format). The sole purpose of MergeEmail is thus to attach the relevant developer email, which is how we identify which developer worked on a particular BugID. It must be noted that if no email was provided for a bug, then the record is not considered for merging and analysis.
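A minimal pandas sketch of the MergeEmail step (the column names are assumptions based on the description above):

```python
import pandas as pd

def merge_email(records_csv, dates_csv):
    """Join RecordScrapper output (BugID, email, ...) with DateScrapper
    output (BugID, start, end); drop bugs with no owner email."""
    records = pd.read_csv(records_csv)   # assumed columns: BugID, email, ...
    dates = pd.read_csv(dates_csv)       # assumed columns: BugID, start, end
    merged = records.merge(dates, on="BugID")
    return merged.dropna(subset=["email"])[["BugID", "email", "start", "end"]]
```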

Fig. 4. The figure is the result of clicking on the second hyperlink (second blue box from the top) in Fig. 2. Notice that the bug ID (boxed in red) DOES NOT match the bug ID shown in Fig. 2. Hence, this change log and its added/deleted lines of code should not be considered when referring to bug 944359. (Color figure online)

The output of MergeEmail then becomes the input for a python script named DaysBugs. DaysBugs finds all the bugs for a single developer and then selects the lowest start date for a bug and the highest end date (fixed date) for that developer. This gives us the period between the first bug fixed by the developer and the last bug fixed by the developer in UNIX time which is then converted into days. DaysBugs will also keep track of the number of bugs fixed by each developer during this time period.
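The DaysBugs aggregation maps naturally onto a group-by; a sketch under the same column-name assumptions:

```python
import pandas as pd

def days_bugs(merged):
    """Per developer: days between the earliest bug start and the latest
    bug fix (UNIX seconds), plus the number of bugs fixed."""
    grouped = merged.groupby("email")
    return pd.DataFrame({
        "days": (grouped["end"].max() - grouped["start"].min()) / 86400.0,
        "bugs_fixed": grouped["BugID"].count(),
    }).reset_index()
```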


So, the output of DaysBugs will be as follows: Developer email/ID, total number of days the developer worked on the bugs, and number of bugs fixed. To obtain the lines of code for each individual developer we use yet another python script, LocBug. This script takes the output of RecordScrapper (BugID, developer email, etc.) and iteratively appends the BugID to the chromium issue tracker link: https://bugs.chromium.org/p/chromium/issues/detail_ezt?id=similar to DateScrapper. Once on the bug report page, LocBug again creates a Soup object in order to target key areas of the website. In this case we are specifically targeting the ‘comment-list’ webelement as it contains the particular hyperlinks we need to retrieve the individual lines of code for each bug. These hyperlinks are structured as follows: https://chromium-review.googlesource.com/c/chromium/src/+/ https://codereview.chromium.org/ https://chromium-review.googlesource.com/ We noted that the hyperlinks above are incomplete, in that each bug will contain a different variant of the link (i.e., a different set of numbers at the end). In other words https://chromium-review.googlesource.com/c/1475945 and https://chromium-review. googlesource.com/c/1471357 have the same format but different ending parameters (1475945 vs. 1471357). Hence, the two differing links point to two different bug change-logs. Since Beautiful Soup does not have a specific way to distinguish the differing links, we take the hyperlinks stated above, find the parameters (numbers after the links on the webpage) and then append to the end of the links. Once this is done, LocBug will open the link it generated in order to view the change log. This change log will contain the lines of code added and deleted in a particular web-element referred to as “mainContent”. To obtain access to this webelement, a tool similar to Beautiful Soup is used, called Selenium. Selenium is an automated web-browser python package that allows us to interact and manipulate webobjects/elements. In addition to scraping the lines of code added and removed, LocBug uses Selenium to ensure that the change-log link/hyperlink generated refers to the same bug on the chromium issue tracker bug report page. The reason this is needed is because there may be multiple references to other bugs within a particular bug report page as shown in Figs. 2, 3, and 4. After iteratively confirming the that change-log refers to the particular bugID, LocBug will then summate the lines of code together and save them to a file with the associated bugID (i.e., if a bug has multiple hyperlinks associated with it then LocBug will parse them all and summate the lines of code added and deleted). So, the output file generated by LocBug will contain all BugIDs and the associated lines of code added and deleted. To obtain just the email and lines of code added/deleted, we use a python script called DevLoc which takes, as input, the results of RecordScrapper (BugIDs, developer email, etc.) and the output of LocBug (BugID and lines of code added/removed). By merging the input and pruning all non-needed features, we are left with the developer emails and the lines of code added and deleted for every single bug. At this point, we need to take into account that a developer can work on more than one bug and hence their developer email will be seen multiple times in the output file of DevLoc. Hence, we want just one developer email that contains ALL the lines of code added/deleted for all the bugs that a particular developer worked on. 
This is done using Apache Hadoop, a framework that allows for the distributed processing of large data sets [9]. We specifically use a customized MapReduce job to obtain our results. A visual representation of the process can be seen in Fig. 5 below.

Fig. 5. We use the Hadoop MapReduce framework. The lines of code (LOC) for every bug worked on by a developer are summed and then assigned to that developer. Here, a circle denotes a single developer (key), a rectangle denotes the lines of code (LOC) for a single bug (value), and the inverted triangle denotes the total lines of code (LOC) for all bugs a developer worked on (output).
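The paper's customized MapReduce job is not listed; a minimal Hadoop Streaming sketch of the same aggregation (developer email as key, LOC as value) could look like this:

```python
# mapper.py -- emit "email<TAB>loc" for each "email,loc" input line
import sys

for line in sys.stdin:
    email, loc = line.strip().split(",")
    print("%s\t%s" % (email, loc))
```

```python
# reducer.py -- sum LOC per developer (Hadoop sorts mapper output by key)
import sys

current, total = None, 0
for line in sys.stdin:
    email, loc = line.strip().split("\t")
    if email != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = email, 0
    total += int(loc)
if current is not None:
    print("%s\t%d" % (current, total))
```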

3 Answers to Our Research Questions

Each of the subsections below pertains to one of the three research questions we studied using the data obtained from the Chromium Issue Tracker. Each subsection contains our motivation behind the question, our approach to answering it, and a discussion of our findings. Table 1 summarizes our results for the three research questions.

3.1 How Fast Are Bugs Fixed?

Motivation: Lead developers and project managers should be aware that security and performance bugs need to be handled differently, in that one type of bug may hold higher precedence than another. For example, a project manager may want to quickly assign developers to a set of security bugs, as those bugs may pose an immediate risk to the users/company. In addition, it would be ideal for the project manager to assign experienced developers to those higher-risk bugs so as to reduce the need for future maintenance. Our study measures which type of bug, performance or security, is fixed faster in the Chromium project. Approach: The metric used to answer this question is the bug's life cycle. We specifically computed the difference between the bug's start/assigned date and end/fixed date to determine its life cycle.


Since the number of security bugs obtained was significantly larger than the number of performance bugs, we needed to consider balancing the data in order to make accurate comparisons. To do this balancing, we first evaluated the mean for all security bugs. Next, we evaluated the mean for a randomized subset of security bugs of the same size as the performance bug set. When comparing the mean of the full security bug set vs. the mean of the random subset, we observed that the means were significantly different. This implies that the data is unbalanced, and hence we used an equal number of security and performance bugs in our evaluations. Findings: From Fig. 6 we can see that performance bugs take more time to fix. From Table 1, we can see that performance bugs take ~44% [i.e., |(160691.11/111307.48) − 1| × 100] longer to fix than security bugs. Both findings suggest that security bugs are prioritized and hence are considered more important by the Chromium team to fix than performance bugs.
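The balancing check described in the approach above can be sketched as a simple random subsample (the column name is an assumption):

```python
def balanced_means(security_df, performance_df, seed=0):
    """Mean fix time of all security bugs vs. a random security subset
    sized to match the performance bug set."""
    subset = security_df.sample(n=len(performance_df), random_state=seed)
    return security_df["fix_minutes"].mean(), subset["fix_minutes"].mean()
```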

Fig. 6. Average time taken between bug assignment and fix: the chart shows the average number of minutes taken to fix a bug in each category. Security bugs are prioritized and fixed faster.

Table 1. Mean values for the three research questions.

Metric of interest                        | Security bugs | Performance bugs
Time between assignment and fix (minutes) | 111307.48     | 160691.11
Person experience in days                 | 598.28        | 646.33
Person experience in prior bugs fixed     | 5.69          | 3.56
Number of lines changed                   | 1813.25       | 1559.40

3.2 Which Bugs Are Fixed by More Experienced Developers?

Motivation: Determining which developer has the most expertise to fix a bug is vital. A previous study by Ackerman and Halverson [10] found that the experience of a developer was the primary criterion engineers used to determine expertise. For instance, performance bugs are complex in that they require comprehensive knowledge of the organization of the software system. Likewise, fixing a security bug requires sufficient understanding of the locations in the source code where security vulnerabilities could lie. In our study, we are interested in the difference in experience between developers who work on security bugs vs. developers who work on performance bugs.


Approach: We measure the experience of the developer fixing a particular bug using the following two metrics:
– Experience in days, i.e., the number of days from the first bug fix of the developer to the current bug's fix date.
– Number of previously fixed bugs by the developer.
Findings: According to Table 1 above, the number of prior bugs fixed by developers who work on security bugs is greater by ~37% [i.e., |1 − (3.56/5.69)| × 100] compared to developers who work on performance bugs. This suggests that developers who work on security bugs are more experienced. When we measure experience based on the number of days from first bug fix to the current bug fix, we observe that the developers who worked on performance bugs have more experience (~8% more [i.e., |(646.33/598.28) − 1| × 100]). The averages from Table 1 are also represented as bar charts in Figs. 7 and 8 below.

Fig. 7. Mean developer experience in days: the chart represents the average amount of days that a developer worked on bugs in a particular category.

Fig. 8. Mean developer experience: the chart represents the average number of bugs that a developer fixed in a particular category. Security bugs are fixed by more experienced developers.

3.3 Characteristics of the Bug Fix

Motivation: Knowing the complexity of bug types can help guide project managers when they need to assign the right developers to fix complex bugs. Having this insight will lead to faster bug fixes and lower cost in terms of future maintenance. We hope to find a correlation similar to the one seen by Zaman et al., in which security bugs are considered more complex than performance bugs. Approach: To quantify the complexity of a bug fix, we used the total number of lines added/deleted as our metric.


Findings: According to Table 1 and Fig. 9, security bug fixes were found, on average, to be more complex than performance bug fixes. The number of lines of code altered for security bugs was found to be ~14% [i.e., |1 − (1559.40/1813.25)| × 100] more than for performance bugs.

Fig. 9. Mean complexity of bug types: the bar chart represents the average LOC changed for a particular category. Security bugs are more complex and require writing or changing a higher number of LOCs.

4 Limitations of the Assessment

To obtain a fair amount of bug data for analysis, we used multiple heuristic keywords. The bugs obtained via these heuristics may not always be bug reports that pertain to performance or security. For example, our heuristic term "slow" may give us a bug which is classified as a compatibility bug and not a performance or security bug. We also did not check whether any of the bugs we obtained belonged to both the security and performance categories. In the future, we hope to perform statistical sampling to estimate the precision and recall of our heuristics, so we can be confident that a particular bug belongs in its respective category. It must be noted that a bug can be worked on by multiple developers at once, yet in our approach we only use what is labeled as the "Owner" of the bug report and hence only report a single developer per bug (i.e., we do not consider any developers besides the owner). This was done to avoid distributing the lines of code added and deleted among multiple developers. Out of the total bugs found for each category, a significant number were not considered due to missing fields, for example: no owner given, no start/fixed date given, or zero lines of code added or deleted.

5 Related Work

Our report is based on a similar study by Zaman et al., in which the authors analyzed Firefox Bugzilla bug reports and found that security bugs are fixed and triaged much faster, but are reopened and tossed more frequently, than performance and other bugs [1]. They also found that security bugs involve more developers and impact more files in a project. Researchers also use the IEEE Standard Classification for Software Anomalies to classify software defects [11]. This classification states that security and performance correspond to two of the six different types of problems that are based on the effect of defects. Gegick et al. identify security bug reports via a text mining approach [12]. Similar to Zaman et al., we suggest that developers and project managers should use the classification of bug reports by bug type (security and performance) to improve software quality.

6 Conclusion

In this study, we analyzed the differences in time to fix, developer experience, and bug fix characteristics between security and performance bugs in order to validate our assumption that security and performance bugs are different, and thus quality assurance should take the bug type into account. We found that, on average, security bugs are fixed faster than performance bugs. We also found that, on average, developers who worked on security bugs have fixed more bugs than developers who worked on performance bugs. However, based on the number of days developers have been fixing bugs, we found only a small difference (~8%) between performance and security bugs. Compared to performance bugs, security bugs were found to be more complex in terms of lines of code. In the future, we hope to fully automate our analysis, so that we may draw more precise conclusions.

References

1. Zaman, S., Adams, B., Hassan, A.E.: Security versus performance bugs: a case study on Firefox. In: Proceedings of the 8th Working Conference on Mining Software Repositories (2011)
2. Shihab, E., et al.: Predicting re-opened bugs: a case study on the Eclipse project. In: Proceedings of the 17th Working Conference on Reverse Engineering, WCRE 2010, Washington, DC, USA, pp. 249–258 (2010)
3. Erlikh, L.: Leveraging legacy system dollars for e-business. IT Prof. 2(3), 17–23 (2000)
4. Jaafar, F., Lozano, A., Gueheneuc, Y.-G., Mens, K.: On the analysis of co-occurrence of anti-patterns and clones. In: 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS) (2017)
5. Panjer, L.D.: Predicting Eclipse bug lifetimes. In: Proceedings of the 4th International Workshop on Mining Software Repositories, MSR 2007, Washington, DC, USA (2007)
6. Weiss, C., Premraj, R., Zimmermann, T., Zeller, A.: How long will it take to fix this bug? In: Proceedings of the 4th International Workshop on Mining Software Repositories, MSR 2007, Washington, DC, USA (2007)
7. Kim, S., Whitehead Jr., E.J.: How long did it take to fix bugs? In: Proceedings of the 3rd International Workshop on Mining Software Repositories. ACM, New York (2006)
8. Anvik, J., Hiew, L., Murphy, G.C.: Who should fix this bug? In: Proceedings of the 28th International Conference on Software Engineering, ICSE 2006, pp. 361–370, New York, NY, USA (2006)
9. Apache Hadoop. https://hadoop.apache.org/. Accessed 30 Apr 2019
10. Ackerman, M.S., Halverson, C.: Considering an organization's memory. In: Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work, CSCW 1998, pp. 39–48. ACM, New York (1998)


11. Draft Standard for IEEE Standard Classification for Software Anomalies. IEEE Unapproved Draft Std P1044/D00003, February 2009
12. Gegick, M., Rotella, P., Xie, T.: Identifying security bug reports via text mining: an industrial case study. In: Proceedings of the 7th International Workshop on Mining Software Repositories, pp. 11–20, May 2010

Medical Image Segmentation by Combining Adaptive Artificial Bee Colony and Wavelet Packet Decomposition

Muhammad Arif1, Guojun Wang1(B), Oana Geman2, and Jianer Chen1

1 School of Computer Science, Guangzhou University, Guangzhou 510006, China
[email protected], [email protected], [email protected]
2 Department of Health and Human Development, Stefan cel Mare University Suceava, Suceava, Romania
[email protected]

Abstract. Segmentation of MRI images plays a significant and helpful role in diagnosis and treatment trials. However, intensity inhomogeneity, irregular borders, and poor contrast can cause great difficulties in segmenting the bleeding regions in brain MRI images. Much work has been done in medical imaging. We propose a novel technique for image segmentation. Our technique is based on discrete wavelet packet decomposition and ant colony optimization, to reduce the disadvantages of conventional algorithms in handling irregular shapes in medical image processing. To improve the performance of the proposed procedure, we use the artificial bee colony to optimize and classify the features selected or extracted by the WPD. Results show that our method performs better in segmenting curvy shapes and haemorrhagic areas in MRI images.

Keywords: Segmentation · MRI · Brain · Classification · Optimization

1 Introduction

Bleeding in the brain is caused by leakage or rupture of blood vessels within the brain material [1–3]. Due to high mortality and morbidity, finding brain haemorrhage is the primary task for patients with head injuries [4]. Diagnostic speed is important, so magnetic resonance imaging (MRI), which is faster and less expensive than a full medical examination, remains the gold standard for the first examination [5–7]. The appearance of a blood clot in MRI images depends on its intrinsic properties such as its size, location, relationship with surrounding structures, and factors like the thickness of the scanning slice [8–10]. Segmentation of blood clots plays an important and useful role in diagnostic and treatment trials. However, intensity inhomogeneity, irregular boundaries, and poor contrast may cause significant difficulties in segmenting the bleeding regions from CT scans [11–13]. Many researchers in medical imaging have applied two of the most commonly used approaches, based on clustering and fuzzy logic [14] and on active contours [15], in the case of brain bleeding in MRI imaging [16]. Grannan et al. [17] contributed meaningfully by taking advantage of FCM followed by functional characterization of the regions. Weishaupt [18] describes how MRI works and also explains the function and workings of resonance imaging. Masulli et al. [19] proposed a semi-automatic segmentation strategy for brain haematoma and edema. Their strategy is to develop an approach that first segments the haematoma and then separates out the edema region at a specific level. Gilligan et al. [20] reviewed the outcomes of traumatic brain injury (TBI) involving intracranial haematoma (ICH) for patients in rural and remote South Australia and neighbouring states. Patients were referred to the Royal Adelaide Hospital (RAH), a first-level trauma treatment centre with a major neurosurgery service. Their strategy segments epidural and subdural bleeding with results practically identical to those of human specialists. Cocosco et al. [21] suggested an adaptive method for MR brain images. The training set can be designed using the pruning method; brain MRI pathology and anatomical variation can be accommodated in this way using a segmentation method. The wrongly labelled samples produced by prior tissue probability maps can be reduced by using the minimum spanning tree. The early k nearest neighbour (KNN) [22] method is used to classify the tissue obtained from the brain MRI image [23]. The main downside of this method is its inability to accurately classify tumours in the brain [24]. Image segmentation has many applications: it is used in hospitals to assist physicians and doctors, in traffic flow prediction [25–28], object identification and classification, fraud and crime detection, and face recognition and identification [29,30].

2 Materials and Methods

In this section, we describe the new method for the segmentation of medical images. Our method is based on discrete wavelet packet decomposition and ant colony optimization, to reduce the disadvantages of traditional algorithms in processing curved shapes in medical images; the flow chart is shown in Fig. 1. The flow diagram has two parts: one for training the method and the other for testing it. In training, we first take an MRI image with haemorrhage and apply some pre-processing, in which we correct the border of the image by dilation and fill the missing regions using known filters. Applying connected-component filters, we remove the skull from the brain image, and then apply wavelet packet decomposition and extract features using wavelet statistical texture analysis. To increase the performance of our proposed technique, we use the artificial bee colony to optimize and classify the features extracted by the WPD. At the end, we apply our method to unknown MRI images to check its validity. The proposed method is partitioned into four phases:

(i) Discrete Wavelet Packet Decomposition (DWPD),
(ii) Extraction of the region of interest (ROI),
(iii) Feature optimization using the artificial bee colony (ABC),
(iv) Classification of the optimized features using a support vector machine.

Fig. 1. Flow diagram of the proposed method

2.1 Wavelet Packet Decomposition

In the discrete wavelet transform, only the approximation subbands are split to produce the next level of decomposition; this procedure may discard some information. To handle this problem, wavelet packet decomposition was introduced [19]. Wavelet packet decomposition is a generalized form of the discrete wavelet transform; it improves the poor frequency localization of wavelet bases at high frequencies and thus gives an efficient decomposition of signals containing both transient and stationary components. Wavelet packet decomposition splits both the detail and approximation subbands and produces a complete binary tree, as shown in Fig. 2.

Fig. 2. Wavelet packet decomposition tree

The wavelet packet decomposition function is given below. In each decomposition step, both the detail and the approximation subbands are processed further. The separated components are more effective than the plain wavelet transform for optimization and classification tasks. The wavelet packet method performs a multilevel band division of the signal, so the high-frequency part can be divided further at each level. The corresponding band matches the spectrum of the signal according to the characteristics of the signal to be decomposed, and the time-frequency resolution can thus be improved [4,31]. The wavelet packet basis functions are given as follows:

W^n_{j,k}(t) = 2^{j/2} W^n(2^j t − k)

where n = 1, 2, 3, ... indexes the packet, j is the scale factor and k is the translation factor. Here f(t) denotes the signal in the time domain and the sampling rate is fs. A WPD with decomposition level J is applied to the signal f(t), and 2^J groups of wavelet packet coefficients are obtained. The frequency band that the i-th (i = 0, 1, ..., 2^J − 1) wavelet packet coefficient corresponds to is

[(i/2^J) fs, ((i + 1)/2^J) fs]

The wavelet coefficients of group i can be written as

W^i_{J,k} = ∫_{−∞}^{∞} f(t) W^i_{J,k}(t) dt


where i = 0, 1, 2, ..., 2^J − 1. The energy of band i is

E(J, i) = Σ_k |W^i_{J,k}|²

The dimensionless, normalized energy of band i is

Ē(J, i) = E(J, i) / ( Σ_{i=0}^{2^J − 1} |E(J, i)|² )^{1/2}

Then the wavelet packet energy of the signal f(t) at decomposition level J is written as

T_J = [ Ē(J, 0), Ē(J, 1), ..., Ē(J, 2^J − 1) ]

Such wavelet packet energies can be used as the input of WST and WCT to extract the features.
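For illustration, the band energies E(J, i) and their normalized form can be computed with an off-the-shelf wavelet library; a minimal 1-D sketch assuming PyWavelets (the paper applies the 2-D analogue to images):

```python
import numpy as np
import pywt

def wpd_energy(signal, level=2, wavelet="db4"):
    """Wavelet packet energies E(J, i) and the normalized vector T_J."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    bands = wp.get_level(level, order="freq")        # 2**level subbands
    energy = np.array([np.sum(np.square(b.data)) for b in bands])
    t_j = energy / np.sqrt(np.sum(energy ** 2))      # normalized energies
    return energy, t_j
```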

2.2 Feature Extraction

For feature extraction, we propose the two techniques given below: the Wavelet statistical texture (WST) features obtained from the 2-level WPD low- and high-frequency subbands, and the Wavelet co-occurrence texture (WCT) features obtained from the high-frequency subbands of the wavelet packet decomposition. Once all the features are extracted, we use the artificial bee colony for feature optimization, to choose the optimal statistical texture features. Texture analysis is a quantitative method that can be used to measure and distinguish structural abnormalities in different tissues. Since the tissues shown in the brain are difficult to classify using the shape or intensity level of the data, the extraction of texture features is essential for better classification. For feature classification, we use the support vector machine (SVM). The purpose of the feature optimization process is to reduce the original information set by keeping definite features that distinguish one region of interest from another. Analysis and representation of the textures present in medical images can be done using the combination of the WST features obtained from the 2-level WPD low- and high-frequency subbands and the WCT features obtained from the WPD high-frequency subbands. The values of both feature sets are then normalized by subtracting the minimum value and dividing by the difference between the maximum and the minimum value, where the maximum and minimum are determined from the training data set. If a feature value in the data set is less than the minimum, it is set to the minimum; if it is greater than the maximum, it is set to the maximum (sketched below). Feature selection is the process of choosing a subset of features that are relevant to a particular application; it improves classification by searching for the best subset of features, based on the consistent behaviour of the original features according to the evaluation criteria of the particular application. Optimal feature selection reduces data dimensionality and computational time and increases classification accuracy. The feature selection problem involves selecting a subset of features from the total number of features, given a specific optimization criterion. T refers to the subset of selected features and V denotes the set of discarded remaining features.
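A minimal sketch of the normalization rule just described (clip to the training-set range, then scale):

```python
import numpy as np

def minmax_normalize(features, train_min, train_max):
    """Clip WST/WCT feature values to the training min/max, then map
    them linearly onto [0, 1]."""
    clipped = np.clip(features, train_min, train_max)
    return (clipped - train_min) / (train_max - train_min)
```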

2.3 Artificial Bee Colony Algorithm (ABC)

In this section, we first explain the operation of the ABC and then apply the optimization process. ABC divides the artificial honey bees into three groups: employed bees, onlooker bees and scout bees. Half of the colony consists of employed bees, and the other half consists of onlooker bees. In a bee colony, at first some bees randomly search for food in a certain area around the hive. After finding a food source, these bees carry some nectar to the hive, store the nectar, and share the nectar information of the food sources with other bees in the dance area of the hive. The colony's bees then enter another cycle of foraging. In each cycle, the following developments occur. (1) Based on the shared information, an employed bee becomes a scout if its food source is abandoned, or otherwise continues foraging. (2) Some onlookers will at the same time follow some of the employed bees, in light of the information obtained, to further exploit some of the remembered food sources. (3) Some scouts will spontaneously start a random search. The crucial stage in the ABC algorithm, from which the overall behaviour emerges, is information sharing. This is achieved through the behaviour of the onlookers, who choose their food source according to the probability

P_i = f_i / Σ_{k=1}^{SN} f_k

where f_i is the fitness value of the i-th food source (a position in the parameter space). In this way, the onlookers examine promising regions with higher probability than others. Candidate food sources derived from the retained sources are generated according to

v_ij = x_ij + φ_ij (x_i,j − x_k,j)

where i, k = 1, ..., SN, j = 1, ..., n, v_i is the new food source created from both the current food source x_i and a randomly selected food source x_k of the population, and φ_ij is a random number in [−1, 1].

y > 0. As shown in [19], the false positive rate p(y) and the false negative rate q(x, y) are two key parameters that attract the attention of the defenders. It is assumed that the noise follows a Gaussian distribution. This assumption applies to many real situations, such as cooperative security detection [21] and static traffic monitoring [20,23]. Then, the false positive rate is

p(y) = ∫_y^∞ e^(−z²/2) / √(2π) dz    (2)


We can see that p(y) is a strictly decreasing function of y. The false negative rate q is expressed as

q(x, y) = ∫_{−∞}^{y} e^(−(z−x)²/2) / √(2π) dz    (3)

Obviously, q(x, y) is strictly decreasing in x and strictly increasing in y. The defender's loss on an attacked node i is defined as

l_i^A = q(x_i, y_i)S(x_i)W_i + c_1 p(y_i)W_i,  i ∈ N    (4)

where c_1 ∈ (0, 1) is a constant and W_i is the security assets of node i. Conversely, the loss on an unattacked node i is defined as

l_i^N = c_1 p(y_i)W_i,  i ∈ N    (5)

Based on the above definitions, the payoffs of the game model on a single node i are given in Table 2.

Table 2. Payoffs/loss on a single node

                   | Attack                                                  | Not attack
Attacker's payoff  | U_a^A(x_i, y_i) = q(x_i, y_i)S(x_i)W_i                  | 0
Defender's loss    | U_d^A(x_i, y_i) = q(x_i, y_i)S(x_i)W_i + c_1 p(y_i)W_i  | U_d^N(x_i, y_i) = c_1 p(y_i)W_i
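For concreteness, the error rates in Eqs. (2) and (3) are just tails of the standard normal distribution and can be evaluated directly; a minimal sketch assuming SciPy:

```python
from scipy.stats import norm

def false_positive(y):
    """p(y) from Eq. (2): upper tail of the standard normal at y."""
    return norm.sf(y)          # equals 1 - norm.cdf(y)

def false_negative(x, y):
    """q(x, y) from Eq. (3): mass of N(x, 1) below the threshold y."""
    return norm.cdf(y - x)
```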

2.2 Stackelberg Game Model of the Whole Network System

Our goal is to minimize the loss of the network system when attackers can observe the defense strategy. We model this as a Stackelberg game. Without loss of generality, we assume W_1 ≥ W_2 ≥ ... ≥ W_n. The defender's strategy is to set intrusion detection thresholds y_i (i = 1, ..., n). After observing the network system's node security assets and intrusion detection strategy, the attacker selects a subset A of the network system nodes and adopts an attack intensity x_i for each node in A. We assume the attacker has K (a constant) resources, so the attacker's strategy must satisfy |A| ≤ K, where |A| is the size of A. This is feasible in practice: because the attacker's resources are limited, the number of nodes they can attack is limited. The attacker's payoff is expressed as

U_a = Σ_{i∈A} q(x_i, y_i)S(x_i)W_i    (6)


and the total loss of the network system is the sum of all nodes' losses:

L_d = Σ_{i∈A} l_i^A + Σ_{i∉A} l_i^N    (7)
    = Σ_{i∈A} (q(x_i, y_i)S(x_i)W_i + c_1 p(y_i)W_i) + Σ_{i∉A} c_1 p(y_i)W_i    (8)
    = U_a + Σ_{i∈N} c_1 p(y_i)W_i    (9)

In the above formula, l_i^A is the loss of the network system when node i is attacked, and l_i^N is the loss when node i is not attacked.

3 Solving the Game

In the process of network intrusion detection defense and attack, the attacker observes the defender's strategy and decides whether to attack according to the security assets and threshold of each node. This problem is modeled as a Stackelberg game, and the most commonly used method to solve for the equilibrium of a Stackelberg game is backward induction. The game tree is shown in Fig. 1.

Fig. 1. Game tree of DDoS attack detection defense and attack in a network system

The strategy of the defender (who represents the network system) is y_i (i = 1, ..., n). N denotes no attack and Y denotes attack. x_i (i = 1, ..., n) indicates the strength of the DDoS attack on node i. U_a^A(x_i, y_i), i = 1, ..., n, is the payoff of the attacker. Correspondingly, U_d^A(x_i, y_i) and U_d^N(x_i, y_i), i = 1, ..., n, are the losses of the defender. In the game tree, any decision node and its subsequent structure form a subgame containing only the attacker's decision. Obviously, the attacker's decision depends on his own payoff, and he has an optimal strategy. This optimal strategy is the Nash equilibrium of the subgame of the original decision. If the attacker has multiple optimal strategies, we assume the attacker chooses any one of them.


From (6) and (9), we can see that L_d equals U_a plus a term that does not depend on the attacked set, so any optimal attacker response brings the same loss to the defender. This reflects the existence of the SSE solution. Then, assuming that the attacker's strategy is optimal, we find the optimal defense strategy that minimizes the defender's loss. Since the defense strategy y_i (i = 1, ..., n) in this paper takes values in a continuum, it is not feasible to search for the optimal strategy exhaustively. We could also solve this problem using a mixed integer linear program, but when the system size becomes too large, the computational cost becomes unacceptable. Therefore, we design a new effective algorithm to compute the optimal strategy of the defender. Assuming that the optimal strategy of the network system is y_i^* (i = 1, ..., n), we first consider the payoff of the attacker. The attacker's optimal strategy is to select the |A| nodes with the highest q(x_i, y_i)S(x_i)W_i values to maximize his own payoff. For each attacked node i (i ∈ A), the optimal attack intensity x_i^* should satisfy:

q(x_i^*, y_i^*)S(x_i^*)W_i ≥ q(x, y_i^*)S(x)W_i  for all x ∈ R, x ≠ x_i^*    (10)

In other words, after the attacker observes the strategy set by the defender on a node he chooses to attack, the attacker will choose the attack intensity x that maximizes the value of q(x, y)S(x). From Fig. 2, we see that there exists a unique x that maximizes the value of q(x, y)S(x). So we can define:

F(y) = max_x q(x, y)S(x)    (11)


Fig. 2. The value of q(x, y)S(x) as a function of x, y.

After the above analysis, the attacker's optimal strategy is the set A that includes the |A| nodes with the maximum value of F(y_i^*)W_i.

3.1 Optimal Defense Strategy

The loss of the defender depends on the attacker's decision. Assuming that the attacker's optimal strategy A is given, the defender's optimal strategy is to minimize his own loss. We can represent the network system as N = (N − A, A), where the set A is the collection of all nodes attacked by the attacker, and the set N − A is the collection of all nodes not in A. Obviously, the attacker gains the optimal payoff by attacking the |A| nodes with the largest F(y_i)W_i, that is, min_{i∈A} F(y_i)W_i ≥ max_{i∉A} F(y_i)W_i. We define y_i^A and y_i^N as the optimal DDoS detection thresholds when node i is attacked and not attacked, respectively. Then y_i^A, y_i^N should satisfy:

min_{i∈A} F(y_i^A)W_i ≥ max_{i∉A} F(y_i^N)W_i    (12)

And because the network system's loss is the sum of every node's loss, the thresholds y_i^A, y_i^N must be the optimal strategy for the defender. We conclude that formula (12) gives the method of searching for the optimal defensive strategy of the network system under a known attack strategy A.

Theorem 1. There is a constant λ. Given the attacker's optimal set A, the defender chooses the optimal strategy as follows:
– for every node i ∈ A, if F(y_i^A)W_i < λ, then y_i is chosen so that F(y_i)W_i = λ; otherwise, y_i = y_i^A;
– for every node i ∉ A, if F(y_i^N)W_i > λ, then y_i is chosen so that F(y_i)W_i = λ; otherwise, y_i = y_i^N.

Proof. From the above discussion, it is concluded that (12) is a necessary and sufficient condition for the set A to be the optimal strategy of the attacker. Let λ = max_{i∉A} F(y_i)W_i. Next, we discuss the two cases in detail:

– If node i ∈ A, then from (12), the value of F(y_i)W_i cannot be less than λ. So, if F(y_i^A)W_i ≥ λ, the optimal threshold of node i is y_i^A by the definition of y_i^A. But if F(y_i^A)W_i < λ, recalling (11) and using the convexity of q(x, y), it can be seen that the y_i with F(y_i)W_i = λ is the optimal strategy for node i.
– If node i ∉ A, F(y_i)W_i ≤ λ can be obtained similarly. If F(y_i^N)W_i ≤ λ, the optimal threshold of node i is y_i^N. But if F(y_i^N)W_i is greater than λ, recalling that l_i^N = c_1 p(y_i)W_i, the y_i with F(y_i)W_i = λ is the optimal strategy for node i.

So, we can find the value of λ for any given set A. We call solving for the optimal strategy under a given A a subproblem. Further, we can find the optimal strategy for the defender by solving all subsets of size K. In practice, this exhaustive method is not feasible because of the large search space. In the next section, we describe an effective and feasible method to search for the optimal defensive strategy of the network system.

3.2 Implementation of Defender's Optimal Strategy

Algorithm 1 shows how we can find the optimal DDoS intrusion detection strategy for a network system N. In Algorithm 1, we first initialize λ. In lines 1–15, we calculate the optimal loss l_i^N when node i is not attacked, the optimal loss l_i^A when node i is attacked, and their difference D_i. Then, in line 16, we determine the attacker's optimal set A. In lines 17–30, the optimal threshold y_i is determined according to whether node i is in the set A.


Algorithm 1. Optimal defense strategy

Input: N, K
Initialize: λ
1: for each i ∈ N do
2:   if F(y_i^N)W_i < λ then
3:     l_i^N = c_1 p(y_i^N)W_i
4:   else
5:     compute y_i such that F(y_i)W_i = λ
6:     l_i^N = c_1 p(y_i)W_i
7:   end if
8:   if F(y_i^A)W_i > λ then
9:     l_i^A = F(y_i^A)W_i + c_1 p(y_i^A)W_i
10:  else
11:    compute y_i such that F(y_i)W_i = λ
12:    l_i^A = F(y_i)W_i + c_1 p(y_i)W_i
13:  end if
14:  D_i = l_i^A − l_i^N
15: end for
16: select the set A of the K nodes with the lowest values of D_i
17: for each i ∈ A do
18:   if F(y_i^A)W_i ≥ λ then
19:     y_i = y_i^A
20:   else
21:     compute y_i such that F(y_i)W_i = λ
22:   end if
23: end for
24: for each i ∉ A do
25:   if F(y_i^N)W_i ≤ λ then
26:     y_i = y_i^N
27:   else
28:     compute y_i such that F(y_i)W_i = λ
29:   end if
30: end for
31: return y_i, (i = 1, ..., n)
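Algorithm 1 can be prototyped directly. A minimal NumPy sketch for one fixed λ: the functions F and p and the per-node optima y_A, y_N are assumed given, and the inversion of F(y)·W = λ is done by a coarse grid scan purely for illustration:

```python
import numpy as np

def optimal_defense(W, K, lam, F, p, y_A, y_N, c1=0.25,
                    grid=np.linspace(0.1, 10.0, 2001)):
    """Sketch of Algorithm 1: thresholds y and total loss for a given lambda."""
    n = len(W)
    Fg = np.array([F(g) for g in grid])

    def y_at(w):
        # Grid point where F(y) * w is closest to lambda.
        return grid[np.argmin(np.abs(Fg * w - lam))]

    lN, lA = np.empty(n), np.empty(n)
    for i in range(n):                       # lines 1-15: losses and D_i
        yn = y_N[i] if F(y_N[i]) * W[i] < lam else y_at(W[i])
        lN[i] = c1 * p(yn) * W[i]
        ya = y_A[i] if F(y_A[i]) * W[i] > lam else y_at(W[i])
        lA[i] = F(ya) * W[i] + c1 * p(ya) * W[i]
    D = lA - lN
    A = set(int(j) for j in np.argsort(D)[:K])   # line 16: K lowest D_i

    y = np.empty(n)
    for i in range(n):                       # lines 17-30: final thresholds
        if i in A:
            y[i] = y_A[i] if F(y_A[i]) * W[i] >= lam else y_at(W[i])
        else:
            y[i] = y_N[i] if F(y_N[i]) * W[i] <= lam else y_at(W[i])
    loss = sum(lA[i] if i in A else lN[i] for i in range(n))
    return y, loss
```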

We can intuitively see that A is the set of the K nodes of the network system with the largest F(y_i)W_i. Next we prove that A minimizes the loss of the defender. Assume the set corresponding to the optimal defender's strategy is A∗, with A∗ different from A; the defender's loss is then:

L_d^* = Σ_{i∈A∗} l_i^A + Σ_{i∉A∗} l_i^N    (13)


Let A+ = A∗ ∪ A − A and A− = A ∪ A∗ − A∗. Then compute the difference ΔL:

ΔL = L_d − L_d^*
   = (Σ_{i∈A} l_i^A + Σ_{i∉A} l_i^N) − (Σ_{i∈A∗} l_i^A + Σ_{j∉A∗} l_j^N)
   = (Σ_{i∈A−} l_i^A + Σ_{i∈A+} l_i^N) − (Σ_{i∈A+} l_i^A + Σ_{i∈A−} l_i^N)
   = Σ_{i∈A−} D_i − Σ_{i∈A+} D_i

From line 16 of Algorithm 1, the nodes in the subset A have the lowest D_i values. So Σ_{i∈A−} D_i − Σ_{i∈A+} D_i < 0. Then ΔL < 0 and L_d < L_d^*. This contradicts the assumption that the set A∗ is the optimal defender's strategy, so A is the attacker's optimal strategy and it minimizes the loss of the defender. Algorithm 1 thus gives a method for finding the minimum defender's expected loss L_d for a given λ, which we denote L_d(λ). In this way, we can find the optimal strategy of the defender by finding the optimal value of λ. According to lines 17–30 in Algorithm 1, it is then easy to compute the optimal defense strategy (y_1, y_2, ..., y_n).


Fig. 3. The curve F (y).


Fig. 4. The curve p(y).

From the previous analysis, we know that finding the optimal value of λ, which minimizes the loss of the network system, is the core issue. We first need to compute the values of y_i^A and y_i^N. Observe the function curves of F(y) and p(y) in Figs. 3 and 4. Since p(y) is a decreasing function and y ∈ [ymin, ymax], Equation (5) shows that y_i^N = ymax is the optimal value for all nodes regardless of the value of W_i.

Optimal Personalized DDoS Attacks Detection Strategy in Network Systems 60

50

Expected loss

Expected loss

40 30 20 10 0

50

40

30

20 3

3.5

4 λ

4.5

5

3

(a) K = 1

3.5

4 λ

4.5

5

(b) K = 5

60

Expected loss

60

Expected loss

337

50 40 30

50 40 30

20

20 3

3.5

4 λ

4.5

(c) K = 6

5

3

3.5

4 λ

4.5

5

(d) K = 7

Fig. 5. Expected loss as a function of λ for various values of K: (a) K = 1, (b) K = 5, (c) K = 6, (d) K = 7.

For y_i^A, it can be represented as arg min_y (F(y)W_i + c_1 p(y)W_i). This objective is convex, and we can use an exhaustive grid search to calculate the value of y_i^A. Finally, we should find the optimal λ. With Algorithm 1, we can calculate the defender's optimal strategy and minimum loss for a given value of λ, so we can use an exhaustive search to find the optimal value of λ. Using Algorithm 1, we sampled enough points to obtain the defender's loss as a function of λ. From Fig. 5, we can easily find the optimal λ and obtain the optimal defender's strategy according to lines 17–30 of Algorithm 1.
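The outer search over λ is then a one-dimensional scan; a sketch reusing the optimal_defense prototype above (the λ range follows Fig. 5):

```python
import numpy as np

def best_lambda(W, K, F, p, y_A, y_N, lo=3.0, hi=5.0, steps=200):
    """Scan lambda on a grid and keep the value minimizing L_d(lambda)."""
    best_lam, best_loss, best_y = None, float("inf"), None
    for lam in np.linspace(lo, hi, steps):
        y, loss = optimal_defense(W, K, lam, F, p, y_A, y_N)
        if loss < best_loss:
            best_lam, best_loss, best_y = lam, loss, y
    return best_lam, best_loss, best_y
```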

4 Experimental Results

The purpose of the experiments is to demonstrate the practicability of our method and to show that it is superior to non-strategic solutions in reducing the loss of the network system under DDoS attacks.

4.1 Non-strategy DDoS Attack Detection Methods

Two non-strategic strategies are used for comparison. In the first non-strategic strategy, assuming that the attacker selects nodes uniformly at random, the defender computes the optimal strategy for every node to minimize the network system's total loss. The strategy y1 = (y11, y12, ..., y1n) is computed as follows:

arg min_y ( (K/|N|) Σ_{i∈N} W_i ) F(y) + c_1 p(y) Σ_{i∈N} W_i


In the second non-strategic strategy, the attacker is assumed to select only the nodes with the maximum W_i. The defense strategy y2 = (y21, y22, ..., y2n) is computed as follows; in particular, this strategy was explored and used in [22]:

arg min_y ( max_{A:|A|=K} Σ_{i∈A} W_i ) F(y) + c_1 p(y) Σ_{i∈N} W_i
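Both baselines reduce to a one-dimensional minimization; a sketch, reading each arg min as a single shared threshold applied to every node:

```python
import numpy as np

def baseline_thresholds(W, K, F, p, c1=0.25,
                        grid=np.linspace(0.1, 10.0, 1000)):
    """Uniform thresholds y1 (random-attacker model) and y2
    (worst-case top-K model)."""
    W = np.asarray(W, dtype=float)
    top_k = np.sort(W)[-K:].sum()       # max over |A| = K of sum of W_i
    obj1 = [(K / len(W)) * W.sum() * F(y) + c1 * p(y) * W.sum() for y in grid]
    obj2 = [top_k * F(y) + c1 * p(y) * W.sum() for y in grid]
    return grid[int(np.argmin(obj1))], grid[int(np.argmin(obj2))]
```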

4.2 Performance Comparison


Fig. 6. The defender's expected loss L as a function of K, for the strategies y1 and y2 and our strategy.

In the following, we use numerical examples to show the results of our method for detecting DDoS attacks in network systems. We consider a network system with 100 nodes and set the threshold bounds ymin = 0.1 and ymax = 10, c1 = 0.25, t = 50, and W1 = 180, W2 = 179, ..., W100 = 81. This experimental setting can also be seen in [22]. We calculate the network system's loss for each defense strategy according to the attacker's strategy. Our aim is to analyze the performance of the different defense strategies against DDoS attacks. Figure 6 displays the comparison results. Compared with the two non-strategic defender's strategies, we can see that the threshold y1 is superior to the threshold y2, but the curve of our defender's strategy is the lowest in Fig. 6. This shows that our threshold y obtains the minimum loss. Therefore, our defender's strategy is the optimal strategy for network systems to detect DDoS attacks.

5 Conclusion

In this paper, a personalized DDoS detection strategy based on a Stackelberg security game is proposed to monitor the network system. First, a Stackelberg game model is established. The defender's strategy is the detection thresholds assigned to the nodes in the network system. After observing the defender's strategy, the attacker's strategy is to choose a set of nodes in the network system and assign an attack strength to each node. Then, we characterize the best response of the attacker and find the optimal defense strategy for the network system. As we analyzed, Algorithm 1 makes it easy to obtain the personalized strategies for DDoS attack detection, even for large data sets. Finally, we use simulated data sets to evaluate our theoretical results. We compared the personalized strategies with two non-strategic strategies and found that our personalized strategies are significantly superior. This shows that our method is not only simple to compute but also performs well. In the future, considering irrational attacker strategies is a valuable direction.

Acknowledgments. This work is supported by the National Natural Science Foundation of China (Grant No. 61802097), and the Project of Qianjiang Talent (Grant No. QJD1802020).

References

1. Breton, M., Alj, A., Haurie, A.: Sequential Stackelberg equilibria in two-person games. J. Optim. Theory Appl. 59(1), 71–97 (1988)
2. Chen, Y., et al.: When traffic flow prediction and wireless big data analytics meet. IEEE Network 33(3), 161–167 (2019)
3. Chen, Y., Zhang, Y., Maharjan, S., Alam, M., Wu, T.: Deep learning for secure mobile edge computing in cyber-physical transportation systems. IEEE Network (2019)
4. Garcia-Teodoro, P., Diaz-Verdejo, J., Macia-Fernandez, G., Vazquez, E.: Anomaly-based network intrusion detection: techniques, systems and challenges. Comput. Secur. 28(1), 18–28 (2009)
5. Han, L., Zhou, M., Jia, W., Dalil, Z., Xu, X.: Intrusion detection model of wireless sensor networks based on game theory and an autoregressive model. Inf. Sci. 476, 491–504 (2019)
6. Jain, M., et al.: Software assistants for randomized patrol planning for the LAX Airport Police and the Federal Air Marshal Service. Interfaces 40(4), 267–290 (2010)
7. Khanna, S., Venkatesh, S.S., Fatemieh, O., Khan, F., Gunter, C.A.: Adaptive selective verification: an efficient adaptive countermeasure to thwart DoS attacks. IEEE/ACM Trans. Networking 20(3), 715–728 (2012)
8. Kiekintveld, C., Islam, T., Kreinovich, V.: Security games with interval uncertainty. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 231–238 (2013)
9. Laszka, A., Abbas, W., Sastry, S.S., Vorobeychik, Y., Koutsoukos, X.: Optimal thresholds for intrusion detection systems. In: Symposium and Bootcamp on the Science of Security, pp. 72–81 (2016)
10. Leitmann, G.: On generalized Stackelberg strategies. J. Optim. Theory Appl. 26(4), 637–643 (1978)
11. Liang, X., Xiao, Y.: Game theory for network security. IEEE Commun. Surv. Tutorials 15(1), 472–486 (2013)
12. Liao, H.J., Lin, C.H.R., Lin, Y.C., Tung, K.Y.: Intrusion detection system: a comprehensive review. J. Netw. Comput. Appl. 36(1), 16–24 (2013)


13. Mall, P., Bhuiyan, M.Z.A., Amin, R.: A lightweight secure communication protocol for IoT devices using physically unclonable function. In: Wang, G., Feng, J., Bhuiyan, M.Z.A., Lu, R. (eds.) SpaCCS 2019. LNCS, vol. 11611, pp. 26–35. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-24907-6_3
14. Manikopoulos, C., Papavassiliou, S.: Network intrusion and fault detection: a statistical anomaly approach. IEEE Press (2002)
15. Manshaei, M.H., Zhu, Q., Alpcan, T., Hubaux, J.P.: Game theory meets network security and privacy. ACM Comput. Surv. 45(3), 1–39 (2013)
16. Roy, S., Ellis, C., Shiva, S., Dasgupta, D., Shandilya, V., Wu, Q.: A survey of game theory as applied to network security. In: Hawaii International Conference on System Sciences, pp. 1–10 (2010)
17. Sarker, J.H., Nahhas, A.M.: Mobile RFID system in the presence of denial-of-service attacking signals. IEEE Trans. Autom. Sci. Eng. PP(99), 1–13 (2016)
18. Shieh, E., An, B.: Protect: an application of computational game theory for the security of the ports of the United States. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), pp. 13–20 (2012)
19. Wang, D., Wang, Z., Li, G., Wang, W.: Distributed filtering for switched nonlinear positive systems with missing measurements over sensor networks. IEEE Sens. J. 16(12), 4940–4948 (2016)
20. Wu, H., Dang, X., Wang, L., He, L.: Information fusion-based method for distributed domain name system cache poisoning attack detection and identification. IET Inf. Secur. 10(1), 37–44 (2016)
21. Wu, H., Wang, W.: A game theory based collaborative security detection method for Internet of Things systems. IEEE Trans. Inf. Forensics Secur. 13(6), 1432–1445 (2018)
22. Wu, H., Wang, W., Wen, C., Li, Z.: Game theoretical security detection strategy for networked systems. Inf. Sci. 453, 346–363 (2018)
23. Yu, S., Zhou, W., Doss, R., Jia, W.: Traceback of DDoS attacks using entropy variations. IEEE Trans. Parallel Distrib. Syst. 22(3), 412–425 (2011)
24. Zonouz, S.A., Khurana, H., Sanders, W.H., Yardley, T.M.: RRE: a game-theoretic intrusion response and recovery engine. In: IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 439–448 (2009)

AI and Its Risks in Android Smartphones: A Case of Google Smart Assistant Haroon Elahi, Guojun Wang(B) , Tao Peng, and Jianer Chen School of Computer Science, Guangzhou University, Guangzhou 510006, People’s Republic of China [email protected], {csgjwang,jianer}@gzhu.edu.cn, pengtao [email protected]

Abstract. This paper intends to highlight the risks of AI in Android smartphones. In this regard, we perform a risk analysis of Google Smart Assistant, a state-of-the-art, AI-powered smartphone app, and assess the transparency of its risk communication to users and of its implementation. Android users rely on the transparency of an app's description and Permission requirements for its risk evaluation, and many risk evaluation models consider the same factors while calculating app threat scores. Further, different risk evaluation models and malware detection methods for Android apps use an app's Permissions and API usage to assess its behavior. Therefore, in our risk analysis, we assess Description-to-Permissions fidelity and Functions-to-API-Usage fidelity in Google Smart Assistant. We compare Permission and API usage in Google Smart Assistant with those of four leading smart assistants and discover that Google Smart Assistant has unusual Permission requirements and sensitive API usage. Our risk analysis finds a lack of transparency in the risk communication and implementation of Google Smart Assistant. This lack of transparency may make it impossible for users to assess the risks of this app. It also makes some state-of-the-art app risk evaluation models and malware detection methods ineffective.

Keywords: Artificial intelligence risks · Transparency · Privacy · Security · Smart assistants

1 Introduction

The number and capabilities of Artificial Intelligence (AI) systems are rapidly increasing. Like many other domains, smartphones are silently yet rapidly integrating AI technologies. Android is at the forefront of this silent transformation. AI capabilities have already started emerging in the latest Android smartphones, and advances like the introduction of neural networks for on-device machine learning in the Android 8.1 release and the incorporation of AI-based features in Android Pie demonstrate the industry's readiness to embrace AI [1,2,12]. Such advancements are early indications of the transformations that smartphones will undergo in the next few years.


Smart assistants are the most visible form of AI application in smartphones. A recent UNESCO report [3] says that AI and the digital assistants it powers are ushering humanity into an era that portends changes as deep, expansive, personal, and long-lasting as those that grew out of the industrial revolution. However, all is not well with these rapidly emerging apps, and recent research has identified different security, privacy, and trust issues in them [4–10]. Still, some issues remain unexplored. For example, transparency is fundamental to the privacy protection of users in the current privacy self-management model, and it also ensures accountability [11]. Nevertheless, there is currently no study investigating the transparency issues in smart assistants. The purpose of conducting this research is to highlight the potential risks of AI applications in Android smartphones. In this regard, it focuses on assessing the transparency of risk communication and implementation in Google Smart Assistant (GSA), an AI-powered app, and uses Description-to-Permissions fidelity and Functions-to-API-Usage fidelity to perform a risk analysis. This research identifies a lack of transparency in the risk communication and implementation of GSA, highlights the associated risks, and urges the taking of adequate measures to ensure privacy and security in Android smartphones. The rest of the paper is organized as follows. Section 2 establishes the theoretical background. Section 3 reviews related work, and Sect. 4 details our case study. Section 5 provides the results. Section 6 discusses different implications, and finally, Sect. 7 concludes this paper.

2 Background

2.1 Android Permissions

Android Permissions protect the privacy and security of user data and critical features in Android smartphones by limiting an app's access to these data and device features. An app is required to have user permission to access sensitive data and critical device features. App Permissions are assigned three protection levels, according to the nature of the data or features they access: Normal, Signature, and Dangerous. Android grants Normal Permissions automatically at install time because they pose minimal risk to users' privacy [13]. Signature Permissions protect app components and are used by apps that require the services offered by the components of the other apps which define such Permissions. Dangerous Permissions protect a user's private information: data stored on the device that can affect his or her privacy, or data whose access can affect the operations of other apps. Dangerous Permissions are declared by an app's developer in its manifest file, and a user needs to explicitly provide consent before an app can proceed with the corresponding functions.

2.2 Android Sensitive APIs

Android treats its apps as separate users. Each Android app has a unique ID, and it runs in an Application Sandbox, a Linux-based security mechanism for isolating app resources, which protects apps and system resources. By default, these applications have limited access to system resources. Android exposes its resources to applications through application programming interfaces (APIs). Android sensitive APIs are a subset of Android APIs that are governed by Android Permission settings [14]. These APIs access sensitive information and critical device features. Thus, the sensitive APIs implemented within an Android app depict its critical behavior. When an app calls a sensitive API, the OS verifies whether the app has the Permission to access the protected resource reached through that API, and allows such access only if the user has granted the app the Permission for the underlying resources.
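To make this API–Permission relationship concrete, the sketch below pairs a few well-known sensitive framework calls with the Permissions that guard them. This is a minimal illustration, not an exhaustive map; the entries are standard, documented pairings, and the helper name is ours.

```python
# Minimal sketch: a tiny API-to-Permission map of the kind used by
# static-analysis tools. Entries are illustrative, not exhaustive.
from typing import Optional

SENSITIVE_API_TO_PERMISSION = {
    "Landroid/telephony/TelephonyManager;->getDeviceId":
        "android.permission.READ_PHONE_STATE",
    "Landroid/location/LocationManager;->getLastKnownLocation":
        "android.permission.ACCESS_FINE_LOCATION",
    "Landroid/telephony/SmsManager;->sendTextMessage":
        "android.permission.SEND_SMS",
    "Landroid/hardware/Camera;->open":
        "android.permission.CAMERA",
}

def permission_guarding(api_signature: str) -> Optional[str]:
    """Return the Permission that governs a sensitive API call, if known."""
    return SENSITIVE_API_TO_PERMISSION.get(api_signature)
```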

2.3 Risks of Android Apps

Risks in Android smartphones emerge mainly from the apps acquired and installed by users. These apps, for their functioning, require access to user data in smartphones and to critical device features, which can, in turn, generate sensitive data [16]. Whereas a primary assumption is that these requests are limited to the data and features needed for performing the functions detailed in their descriptions, several incentives, such as data harvesting for targeted marketing and user profiling, motivate apps to engage in practices such as data over-collection [15]. In data over-collection, an app gains access to data beyond its operational scope. Such access to data poses risks to the privacy and safety of users and to the security of Android devices. With the increasing role of Android smartphones in technologies like smart homes and their connectivity to corporate networks, such risks are not limited to the privacy and safety of Android users. These devices can also cause computing-infrastructure and business losses. Moreover, it is hard to precisely imagine the risks of the information collected from these devices in a world where "if you torture the data long enough, it will confess to anything" [17].

2.4 Android Users and App Risks

Smartphone users operate these devices under their own responsibility to protect against any associated risks. For example, the end user license agreement (EULA) of the Vivo X20 smartphone says that "The entire risk arising out of the use or performance of any software, documentation, and content, whether pre-installed, downloaded, or otherwise acquired, remain with you [the user]." Android uses a similar model, in which a user is responsible for identifying and mitigating the risks of downloading, installing, and using different apps. However, to assist users in assessing the risks of apps, app stores like Google Play provide the app description, the app's Permission requirements, app ratings, and user reviews. Users must read the terms and conditions of use of these apps, understand their functions, assess the rationality of their Permission demands, and decide whether or not to acquire an app and allow it access to their data and devices.

2.5 App Risk Evaluation Models

Different risk evaluation models have been proposed for assessing Android apps. Some of them use the same metrics as are available to Android users, that is, app descriptions and Permissions [18]. App Permissions are central to many risk evaluation models. For example, [19–22] use app Permissions to determine an app's access to sensitive user data and critical device features when calculating app risks. Some of these models mix additional metrics with app Permissions. For example, some risk evaluation models may employ methods like code reviews or monitor data flows. These models assess the Permissions and the sensitive API usage in an app to determine its behavior [20]. Yet others use user input and SQLite configuration files along with an app's Permissions to assess its risks [23].

2.6 Transparency

The Cambridge online dictionary [24] defines transparency as "the characteristic of being easy to see through" and "the quality of being done in an open way." The European General Data Protection Regulation (GDPR) declares that personal data processing is legal and fair only when consent is obtained fairly and transparently and the data are collected accordingly [25]. The GDPR mentions that the processing of personal data can only be legal and fair if the personal data collection, its use, its processing, and even the extent of such processing are transparent to the data subjects. The data subjects should be aware of the identity of the data collector (controller), and the specific purposes of the data collection, and the risks, rules, safeguards, and their rights concerning the data processing, should be transparently communicated to them. Moreover, the principles of personal data processing provided in the regulation advise that these data should be processed lawfully, fairly, and transparently.

3 Related Work

Although there are widespread security and privacy concerns about AI and its products, most of the attention of those assessing its risks is bound to the side effects of artificial general intelligence or superintelligence, and the risks of applications employing some narrower form of intelligence are getting little attention [26]. Consequently, we see that there are only a handful of studies which focus on security and privacy issues in smart assistants. Here, we mention some of the recent works.

Alepis and Patsakis [4] demonstrated that voice-controlled assistants could be easily manipulated while bypassing Android's security mechanisms. They propose that while the AI base of these assistants makes them more extensible, it also makes comprehending their full risks difficult. Zhang et al. [5] manipulated skills of Amazon Alexa and Google Home by launching remote, large-scale attacks. They carried out these attacks remotely by publishing attack skills, and they stole user data and eavesdropped on conversations. Zhang et al. [9] demonstrated how smart assistants could be exploited to collect user data without users' knowledge. They developed a spyware app and disguised it as a popular microphone-controller game app. They stealthily recorded incoming and outgoing calls and synthesized Google Assistant keywords. They used Google Smart Assistant to collect ambient light and noise data by exploiting the accelerometer, ambient light sensor, and microphone without being detected by antivirus systems. Seymour [7] identified a lack of choice for users and a lack of control over data collection by existing smart assistants as their two main problems. They suggested that these devices could collect data not only from their direct surroundings but also from other smart devices in their vicinity. This was demonstrated by Chung et al. [10], who collected data from Amazon Alexa and its enabled devices and successfully characterized the lifestyles and living patterns of its users. Lau et al. [6] studied the factors affecting people's adoption or rejection of smart speakers, their related privacy concerns, and privacy-seeking behaviors around these devices. They found that a lack of trust towards smart speaker companies was one of the main reasons for non-adoption among non-users. They also found that users of these devices did not understand their privacy risks, rarely used their privacy controls, and that the privacy controls in these devices were not well aligned with users' needs.

Overall, the number of these studies is small, and their scope is limited, which leaves huge research gaps. Most of these studies focus on exploiting the abilities of different smart assistants [4,5,9,10]. Others motivate further investigation of issues in the privacy controls of smart assistants and exploration of related risks [6,7].

4 Case Study

To highlight the potential risks of AI applications in Android smartphones, in this section we assess the transparency of risk communication and implementation in GSA, taking it as an example. GSA is an AI-enabled app available in smartphones, laptops, wearable devices, vehicles, and smart home devices like smart speakers, smart TVs, and smart displays. Google expects that it will soon be installed on a billion devices [27]. It engages in two-way communication with its users, senses or collects data and voice commands, and can perform several functions for its users without explicit intervention. The wide-scale support and use of GSA and its features make it an exciting product and an attractive case for studying its risk behavior and underlying mechanisms.

4.1 Method

In order to evaluate the transparency of risk communication and implementation in GSA, we performed a static analysis of the GSA app. App users evaluate app risks by reading app descriptions and reviewing the corresponding Permission requirements before installation [18]. Security researchers consider app descriptions, their Permission requirements, and additional factors such as sensitive API usage while analyzing an app for risks. Therefore, we used Description-to-Permission fidelity and Functions-to-API-Usage fidelity, and compared the Permission and sensitive API usage of GSA with those of four popular smart assistants: Alexa, Cortana, Dragon, and Lyra Virtual Assistant.

Description-to-Permission fidelity is an established method for learning an app's behavior. Gorla et al. [14] introduced this method to validate whether the functions of an app conform to its claims. This method is critical from the point of view of a user, whose only chance of evaluating the risks of an app depends on the app information provided on the app store. Such information includes an app's functional description, rating, reviews, developer information, number of downloads, and its permission requirements. However, for this research, we only use app descriptions and Permission requirements, as required by the selected method.

As mentioned earlier in the background section, Android sensitive APIs are a subset of Android application programming interfaces (APIs) that are governed by the respective Android Permissions [14,28,29]. Thus, they represent the actual implementation and behavior of an app as reflected in its code. Sensitive APIs, as their name suggests, access sensitive information and critical device features in a smartphone. Thus, they can be used to depict the critical behaviors of an app. We used Functions-to-API-Usage fidelity to verify the transparency of the code implementation against the functions of GSA.

4.2 Procedure

We acquired the APK file of GSA (version 0.1.187945513) from www.apkpure.com and its description from the Google Play store. We searched for leading smart assistants in the Google Play store and selected Alexa (version 2.2.241878.0), Cortana (version 3.0.0.2376-enus-release), Dragon (version 5.9.2 - build 278), and Lyra Virtual Assistant (version 1.2.4). In order to extract their permissions and verify sensitive API usage, we used Androguard (version 3.1.1), an open-source reverse engineering tool [30]. We used Python to write custom scripts for this purpose. For identifying the app functions from their descriptions, we used open and axial coding. Coding is the process of manually deciphering and naming interesting concepts found in the data [31]. In open coding, we read the descriptions of all five smart assistants, obtained from the Google Play store, line by line, and assigned functional names to these texts. These functional names constituted open codes. Axial coding serves to refine and differentiate open codes and lends them the status of categories. Thus, we compared the open codes defined in the descriptions of all smart assistants and finalized the function names, which are provided in Table 1.
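For illustration, the following is a minimal sketch of the kind of custom script we mean, assuming the Androguard 3.x Python interface; the file name and the short API list are placeholders for the real inputs.

```python
# Minimal sketch of permission extraction and sensitive-API lookup with
# Androguard 3.x. "assistant.apk" and TARGET_APIS are placeholders.
from androguard.misc import AnalyzeAPK

TARGET_APIS = ["getDeviceId", "sendTextMessage",
               "getLastKnownLocation", "requestSync", "speak"]

a, d, dx = AnalyzeAPK("assistant.apk")  # APK, DalvikVMFormat, Analysis

# Permissions declared in the manifest.
for perm in sorted(a.get_permissions()):
    print("PERMISSION:", perm)

# Presence of the targeted sensitive APIs anywhere in the app's code.
for api in TARGET_APIS:
    hits = list(dx.find_methods(methodname=api))
    print("API:", api, "-> used" if hits else "-> not used")
```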

4.3 Findings

Description-to-Permissions Fidelity. Performing open and axial coding led us to discover fifteen main functions in the description of GSA. These functions are shown in Table 1. They show that GSA can operate across devices and synchronizes its data for this purpose. Users can ask it to play music, make calls, send messages, send emails, plan events, set reminders, get directions, play YouTube videos, launch different apps, and track their physical activities. Further, it can be used to manage smart home devices. It uses voice recognition, calendar events, and cross-device operation to achieve many of these tasks. As given on the Play store, it requires only one Permission to perform these functions, namely READ GSERVICES. Our extraction and review of its Permissions from its package file confirmed the same.

[Table 1. Functions of the examined smart assistants discovered in their descriptions. Columns: Alexa, Cortana, Dragon, GSA, Lyra; rows: Across Devices, Listen Music, Shopping, News Updates, Voice Recognition, Personalization, Calendar Management, Calls and Messaging, Intercom, Smart Home Device Mgmt., Online Search, Reminders, Notes Taking, Social Media Posts, Send Emails, Event Planning, Get Directions, Play YouTube Videos, Translation, Launch Apps, Physical Activity Tracking. An x marks the presence of a function in a given smart assistant; - marks its absence.]


Functions-to-API-Usage Fidelity. As shown in Table 1, we discovered fifteen functions in the description of GSA as provided on the Google Play store. We used the Androguard reverse engineering tool to discover the usage of the related APIs in GSA. We searched for APIs used for learning device identity, making calls, text messaging, network information collection, location-based services, streaming services, synchronization, network connectivity, and speech recognition. As shown in Table 2, our experiment discovered the use of only one out of the twenty-five targeted APIs in GSA.

[Table 2. A comparative overview of API usage in the examined smart assistants. Columns: Alexa, Cortana, Dragon, GSA, Lyra; rows: getDeviceId, getSubscriberId, getCallState, getNetworkCountry, sendTextMessage, getMessageBody, getNetworkInfo, getLastKnownLocation, getLatitude, getLongitude, getOutputStream, openStream, sendDataMessage, setPlaybackStat, setCallback, getAssets, getHost, requestSync, startSync, stopSync, loadUrl, connect, getLanguage, speak, shutdown. An x indicates the usage of a given API by a smart assistant; - means the absence of its use.]


Comparison with Leading Smart Assistants. As is apparent from the list of functions provided in Table 1, most of the functions discovered in GSA's description are also discovered in the descriptions of the other smart assistants. For example, just like GSA, Alexa, Cortana, and Lyra Virtual Assistant operate across devices, and all four of the evaluated smart assistants use voice recognition and offer personalized services to their users. Just like GSA, users can ask Lyra Virtual Assistant to find directions for them, and it can play YouTube videos for them. Cortana and Dragon can launch apps, and Dragon can send emails, just like GSA. Only physical activity tracking and event planning are app features found exclusively in GSA. However, there are functions which the other smart assistants offer to their users and GSA does not: online shopping, news updates, intercom service, notes taking, social media posting, and translation are functions that one smart assistant or another offers, but GSA does not.

Further, as given in Table 3, when we look at the Permission requirements of these five smart assistants, GSA presents an exceptional case. While Alexa, Cortana, Dragon, and Lyra Virtual Assistant require a large number of Normal, Dangerous, and System or Signature Permissions that match the functions these smart assistants claim to offer, GSA requires only one System or Signature Permission.

Table 3. A comparison of the Permission requirements of the examined smart assistants.

App      Normal  Dangerous  System or Signature  Total
Alexa    21      9          26                   56
Cortana  29      18         14                   61
Dragon   20      18         6                    44
GSA      0       0          1                    1
Lyra     10      16         2                    28

Table 2 presents the comparative findings of API usage for learning device identity, making calls, text messaging, network information collection, location-based services, streaming services, synchronization, network connectivity, and speech recognition. Again, we find an unusual usage pattern in GSA: apart from the use of the getAssets API, which provides access to an app's assets, GSA does not use any of the standard APIs mentioned in Table 2 that correspond to the functions listed in its description on the Google Play store.

5 Results

Looking at the findings of our analysis of the GSA app using Description-to-Permissions fidelity and Functions-to-API-Usage fidelity, it is evident that GSA conforms to neither of these tests. Neither do its Permission requirements correspond to the description displayed to its users in the Google Play Store, nor does its API usage reflect the functions discovered in its description. Table 4 presents the Dangerous Permissions in Alexa, Cortana, Dragon, and Lyra Virtual Assistant. It is easy to map their demands for Dangerous Permissions to their functions given in Table 1. This is the transparency that is expected from Permission use in apps and that helps users in their risk evaluation of an app. However, in the case of GSA, its Permission demands cannot be transparently mapped to its functions. It is important to note that READ GSERVICES is not in the list of documented permissions, and no official explanations are available that could elaborate on how it works.

Table 4. Dangerous Permissions in Alexa, Cortana, Dragon, and Lyra

Alexa

RECORD AUDIO, ACCESS FINE LOCATION, CAMERA, WRITE EXTERNAL STORAGE, READ EXTERNAL STORAGE, READ CONTACTS, CALL PHONE, SEND SMS, READ PHONE STATE

Cortana

RECORD AUDIO, ACCESS FINE LOCATION, ACCESS COARSE LOCATION, CALL PHONE, READ PHONE STATE, PROCESS OUTGOING CALLS, READ EXTERNAL STORAGE, WRITE EXTERNAL STORAGE, CAMERA, SEND SMS, READ CONTACTS, WRITE CALENDAR, READ CALENDAR, READ SMS, WRITE SMS, RECEIVE SMS, RECEIVE MMS, ACCESS LOCATION EXTRA COMMANDS

Dragon

RECORD AUDIO, READ PHONE STATE, WRITE EXTERNAL STORAGE, CALL PHONE, READ CONTACTS, READ SMS, SEND SMS, WRITE SMS, READ CALENDAR, WRITE CALENDAR, WRITE CONTACTS, ACCESS FINE LOCATION, ACCESS COARSE LOCATION, ACCESS LOCATION EXTRA COMMANDS, RECEIVE SMS, READ EXTERNAL STORAGE, READ CALL LOG, WRITE CALL LOG

Lyra

CALL PHONE, ACCESS FINE LOCATION, ACCESS COARSE LOCATION, RECORD AUDIO, READ PHONE STATE, READ CONTACTS, WRITE CONTACTS, READ CALENDAR, WRITE CALENDAR, SEND SMS, RECEIVE SMS, READ SMS, WRITE SMS, CAMERA, READ EXTERNAL STORAGE, WRITE EXTERNAL STORAGE

6 Discussion

Android smartphones pose numerous security and privacy risks [15,33]. In this paper, we performed a risk analysis of Google Smart Assistant, a state-of-the-art, AI-powered smartphone app. We discovered that Google used an unusual, undocumented Permission in GSA, which has also affected the underlying implementation. With a large number of smart assistants available at commercial scale and their expected role in human lives and society, the findings of our research become very critical [3,32]. In this section, we discuss different implications.

6.1 Risks to Privacy-Protection

In Android smartphones, the sole responsibility for privacy protection lies with the user. App descriptions and Permissions are the only tools available to users for the risk evaluation of an app. Google has provided a list of Permissions and their role in data and feature protection in Android phones; READ GSERVICES is not in that list [13]. The only information available on this Permission, coming from an unofficial source, says that it lets an app read the configuration of Google services [34]. However, it sheds no light on the nature and scope of this Permission. Therefore, users are completely in the dark when it comes to the risks of allowing this Permission. This case of GSA confirms the concerns raised by researchers that the permission model of the Android ecosystem does not protect the rights of users concerning the protection of their data [35]. Apps should be transparent towards the user to meet the conditions of informed consent, purpose limitation, clarity, and proportionality regarding the extent and duration of data storage [25]. However, such indicators are missing in the current implementation of GSA.

6.2 Risks to Risk Evaluation

Many state-of-the-art risk evaluation models use an app's description, its Permissions, and its API usage to assess its risks. However, the lack of transparency in GSA's Permission requirements and underlying mechanisms makes it impossible to map its API calls and Permissions to its functions, and these models become ineffective. For example, one recently published work [21] proposes a risk evaluation model for Android apps which calculates a threat score for an app based on its requested Permissions. It calculates different types of threat scores, including privacy, system, and financial threat scores, based on the requested Permissions of an app. If we used this model to evaluate the risks of GSA, it would generate entirely inaccurate threat scores. The same is true for other risk evaluation models using app Permissions or API usage.

6.3 Risks to Malware Detection

Many malware detection methods use the sensitive APIs in apps for detecting their malicious behaviors. Such methods rely on the fact that, contrary to the names of user-defined functions, sensitive APIs cannot be easily obfuscated by existing techniques. For example, MalPat uses Permission-related APIs to learn the behavior of benign and malicious apps [29]. DAPASA, an approach to detect Android piggybacked apps through sensitive subgraph analysis, constructs a static function-call graph of a given app by considering the invocation patterns of sensitive APIs [36]. A recent work proposing a framework to mitigate privilege escalation attacks in Android smartphones uses Dangerous and Normal Permissions to determine the capabilities of apps [37]. Likewise, another recent work identifies 'Significant' Permissions and uses them for malware detection [38]. All such malware detection techniques that use app Permissions or sensitive API calls will fail to determine the behavior of apps that follow the implementation demonstrated by GSA.
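To illustrate what such detectors consume, the sketch below builds the kind of binary Permission/API feature vector that models like [38] operate on; the feature list is a small hypothetical subset of the real feature space.

```python
# Minimal sketch of the binary feature vector that Permission- and
# API-based malware detectors typically consume. FEATURES is a small
# hypothetical subset, not the full feature space.
FEATURES = ["SEND_SMS", "READ_CONTACTS", "RECORD_AUDIO",
            "getDeviceId", "sendTextMessage"]

def feature_vector(permissions: set, apis: set) -> list:
    """1 if the app declares the permission / calls the API, else 0."""
    observed = permissions | apis
    return [1 if f in observed else 0 for f in FEATURES]

# An app that hides its Permission and API usage, as GSA does, yields an
# almost-empty vector and is effectively invisible to such detectors.
print(feature_vector({"SEND_SMS"}, {"getDeviceId"}))  # [1, 0, 0, 1, 0]
```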

6.4 Unfitness of Permissions to Serve AI Needs

Intelligent smartphone applications will have properties that AI exhibits, including learning from experience, awareness of their surroundings, the ability to reason and deduce, and the application of reasoning and awareness to achieve rationality [39–41]. They will cooperate with other intelligent artifacts in their surroundings and negotiate for their advantage [42]. They will be autonomous and will apply their traits to solve complex problems independently [43]. GSA exhibits some of these traits when it engages in conversation with its users, or when it makes calls for restaurant bookings and salon appointments. However, GSA does not yet truly possess the traits that future intelligent smartphone applications will have. Despite the limited AI features of GSA, its design suggests that the standard set of Android Permissions does not help in achieving its goals. As a result, Google needed to use a powerful Permission, which, at the minimum, violates the principle of transparency. This suggests that Android Permissions will not serve the needs of privacy protection and security in the intelligent Android applications expected to emerge in the future.

7 Conclusion

In this paper, we presented a risk analysis of GSA to explore the potential risks of AI applications in Android smartphones. We focused on its transparency in risk communication to users and in implementation. We tested its Description-to-Permission fidelity and Functions-to-API-Usage fidelity to assess the transparency needed for a fair evaluation of its behavior by users and experts. A Permission and API usage comparison with four leading smart assistants revealed that GSA has unusual Permission requirements and sensitive API usage. Consequently, it does not conform to the transparency requirements needed to assist users in its risk evaluation, which puts their privacy and safety at risk. We also showed that the deviation of its Permission and API usage from the conventional approach makes some of the state-of-the-art risk assessment and malware detection methods ineffective. The findings of this research demonstrate how a lack of transparency in the risk communication and implementation of AI applications may affect user privacy, risk evaluation, and device security. We believe that there is an urgent need to set up design and implementation standards for AI applications to contain their risks.

Acknowledgments. This work was supported in part by the National Natural Science Foundation of China under Grants 61632009, 61802076 and 61872097, in part by the Guangdong Provincial Natural Science Foundation under Grant 2017A030308006, and in part by the High-Level Talents Program of Higher Education in Guangdong Province under Grant 2016ZJ01.


References

1. Villani, C., et al.: For a Meaningful Artificial Intelligence: Towards a French and European Strategy. Conseil national du numérique, Paris (2018)
2. AI on the Honor V10 is a game-changer. https://www.androidauthority.com/ai-on-the-honor-v10-is-a-game-changer-832613/
3. UNESCO, EQUALS Skills Coalition: I'd blush if I could: closing gender divides in digital skills through education (2019)
4. Alepis, E., Patsakis, C.: Monkey says, monkey does: security and privacy on voice assistants. IEEE Access 5, 17841–17851 (2017)
5. Zhang, N., Mi, X., Feng, X., Wang, X., Tian, Y., Qian, F.: Understanding and mitigating the security risks of voice-controlled third-party skills on Amazon Alexa and Google Home (2018). arXiv:1805.01525 [cs.CR]
6. Lau, J., Zimmerman, B., Schaub, F.: Alexa, are you listening? Proc. ACM Hum. Comput. Interact. 2, 1–31 (2018)
7. Seymour, W.: How loyal is your Alexa? In: Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems - CHI 2018, pp. 1–6. ACM Press, New York (2018)
8. Michaely, A.H., Zhang, X., Simko, G., Parada, C., Aleksic, P.: Keyword spotting for Google Assistant using contextual speech recognition. In: Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, pp. 272–278, January 2018
9. Zhang, R., Chen, X., Lu, J., Wen, S., Nepal, S., Xiang, Y.: Using AI to Hack IA: A New Stealthy Spyware Against Voice Assistance Functions in Smart Phones (2018). arXiv:1805.06187 [cs.CR]
10. Chung, H., Lee, S.: Intelligent Virtual Assistant Knows Your Life, pp. 1–6 (2018). arXiv:1803.00466 [cs.CY]
11. Acquisti, A., Adjerid, I., Brandimarte, L.: Gone in 15 seconds: the limits of privacy transparency and control. IEEE Secur. Priv. 11, 72–74 (2013)
12. Google: Android Pie. https://www.android.com/versions/pie-9-0/
13. Google: Permissions Overview. https://bit.ly/2HcAcye
14. Gorla, A., Tavecchia, I., Gross, F., Zeller, A.: Checking app behavior against app descriptions. In: Proceedings of the 36th International Conference on Software Engineering - ICSE 2014, pp. 1025–1035. ACM Press, New York (2014)
15. Elahi, H., Wang, G., Xie, D.: Assessing privacy behaviors of smartphone users in the context of data over-collection problem: an exploratory study. In: 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1–8. IEEE (2017)
16. Song, Y., Chen, Y., Lang, B., Liu, H., Chen, S.: Topic model based Android malware detection. In: Wang, G., Feng, J., Bhuiyan, M.Z.A., Lu, R. (eds.) SpaCCS 2019. LNCS, vol. 11611, pp. 384–396. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-24907-6_29
17. Varian, H.R.: Computer mediated transactions. Am. Econ. Rev. 100, 1–10 (2010)
18. Qu, Z., Rastogi, V., Zhang, X., Chen, Y., Zhu, T., Chen, Z.: AutoCog: measuring the description-to-permission fidelity in Android applications. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS 2014, pp. 1354–1365 (2014)


19. Jing, Y., Ahn, G.-J., Zhao, Z., Hu, H.: RiskMon: continuous and automated risk assessment of mobile applications. In: Proceedings of the 4th ACM Conference on Data and Application Security and Privacy - CODASPY 2014, pp. 99–110 (2014)
20. Rashidi, B., Fung, C., Bertino, E.: Android resource usage risk assessment using hidden Markov model and online learning. Comput. Secur. 65, 90–107 (2017)
21. Dini, G., Martinelli, F., Matteucci, I., Petrocchi, M., Saracino, A., Sgandurra, D.: Risk analysis of Android applications: a user-centric solution. Futur. Gener. Comput. Syst. 80, 505–518 (2018)
22. Bal, G., Rannenberg, K., Hong, J.I.: Styx: privacy risk communication for the Android smartphone platform based on apps' data-access behavior patterns. Comput. Secur. 53, 187–202 (2015)
23. Yeh, K.H., Lo, N.W., Fan, C.Y.: An analysis framework for information loss and privacy leakage on Android applications. In: 2014 IEEE 3rd Global Conference on Consumer Electronics, pp. 216–218 (2014)
24. Transparency. Cambridge Dictionary (Online). https://tinyurl.com/y2k94p7u
25. The European Parliament and the Council of the European Union: Regulation (EU) 2016/679 (GDPR). Off. J. Eur. Union, pp. 1–88 (2016)
26. Page, J., Bain, M., Mukhlish, F.: The risks of low level narrow artificial intelligence. In: 2018 IEEE International Conference on Intelligence and Safety for Robotics (ISR), pp. 1–6. IEEE (2018)
27. Porter, J.: The biggest Google Assistant products from CES 2019. https://tinyurl.com/ycasf9j4
28. Felt, A.P., Chin, E., Hanna, S., Song, D., Wagner, D.: Android permissions demystified. In: Proceedings of the 18th ACM Conference on Computer and Communications Security, Chicago, Illinois, USA, pp. 627–638. ACM, New York (2011)
29. Tao, G., Zheng, Z., Guo, Z., Lyu, M.R.: MalPat: mining patterns of malicious and benign Android apps via permission-related APIs. IEEE Trans. Reliab. 67, 355–369 (2018)
30. Desnos, A.: Androguard: reverse engineering, malware and goodware analysis of Android applications. https://github.com/androguard
31. Bohm, A.: Theoretical coding: text analysis in grounded theory. In: Flick, U., von Kardoff, E., Stein, I. (eds.) A Companion to Qualitative Research, pp. 270–275. SAGE Publications, London (2004). ISBN: 9780761973751
32. Lugano, G.: Virtual assistants and self-driving cars: to what extent is artificial intelligence needed in next-generation autonomous vehicles? In: 15th International Conference on ITS Telecommunications, pp. 1–5 (2017)
33. Elahi, H., Wang, G., Li, X.: Smartphone bloatware: an overlooked privacy problem. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 169–185. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_15
34. READ GSERVICE. https://tinyurl.com/y27dz3we
35. Tsavli, M., Efraimidis, P.S., Katos, V., Mitrou, L.: Reengineering the user: privacy concerns about personal data on smartphones. Inf. Comput. Secur. 23, 394–405 (2015)
36. Fan, M., Liu, J., Wang, W., Li, H., Tian, Z., Liu, T.: DAPASA: detecting Android piggybacked apps through sensitive subgraph analysis. IEEE Trans. Inf. Forensics Secur. 12, 1772–1785 (2017)
37. Xu, Y., Wang, G., Ren, J., Zhang, Y.: An adaptive and configurable protection framework against Android privilege escalation threats. Futur. Gener. Comput. Syst. 92, 210–224 (2019)


38. Li, J., Sun, L., Yan, Q., Li, Z., Srisa-an, W., Ye, H.: Significant permission identification for machine-learning-based Android malware detection. IEEE Trans. Ind. Inform. 14, 3216–3225 (2018)
39. van Ditmarsch, H., French, T.: On the interactions of awareness and certainty. In: Wang, D., Reynolds, M. (eds.) AI 2011. LNCS (LNAI), vol. 7106, pp. 727–738. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25832-9_74
40. Spohn, W.: Two coherence principles. In: Causation, Coherence, and Concepts. Boston Studies in the Philosophy of Science, vol. 256. Springer, Dordrecht (2009). https://doi.org/10.1007/978-1-4020-5474-7_10
41. Lee, S.-Y., Lin, F.J.: Situation awareness in a smart home environment. In: 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT), pp. 678–683. IEEE (2016). https://doi.org/10.1109/WF-IoT.2016.7845412
42. Sinha, A., Anastasopoulos, A.: Incentive mechanisms for fairness among strategic agents. IEEE J. Sel. Areas Commun. 35, 288–301 (2017). https://doi.org/10.1109/JSAC.2017.2659061
43. Vernon, D., Metta, G., Sandini, G.: A survey of artificial cognitive systems: implications for the autonomous development of mental capabilities in computational agents. IEEE Trans. Evol. Comput. 11, 1–30 (2007). https://doi.org/10.1109/TEVC.2006.890274

A Light-Weight Framework for Pre-submission Vetting of Android Applications in App Stores Boya Li1 , Guojun Wang1(B) , Haroon Elahi1 , and Guihua Duan2 1 School of Computer Science, Guangzhou University, Guangzhou 510006, China [email protected], [email protected], [email protected] 2 School of Computer Science and Engineering, Central South University, Changsha, China [email protected]

Abstract. In general, smartphone apps are rolled out under a data over-collection-based business model. Under this model, users can download and use the apps free of cost, but users are asked to grant a large number of permissions to access data and resources on their smartphones. Apps collect user data and sell them to interested third parties for making profits, or abuse smartphone resources for financial gains. This phenomenon introduces privacy and trust issues. Existing vetting mechanisms in app stores mainly depend on user feedback and expert reviews and only target malicious apps. Permission-abusive apps are not included in this list yet. In this paper, we propose a light-weight framework for pre-submission vetting of Android apps by app stores. We generate functional signatures of an app from its description and analyze them to build a profile that contains different permission usage scores, or suggests whether an app is suspicious. This framework can be used as the first line of defense in app stores to vet newly submitted apps.

Keywords: Android permissions · App vetting · Privacy · Trust

1 Introduction

Smartphones are powerful personal computing devices that know where we live, where we go, whom we meet, whom we sleep with, and a lot more. Therefore, it is no surprise that they are considered the most intimate surveillance devices [1]. Access to the data generated and stored in smartphones and the transmission of these data to third-party servers can put the privacy of their users at stake and shake their trust. Recent reports show how consensually collected data from different apps have been put to unexpected uses, revealing intimate personal details and enabling citizen manipulation. For example, Uber used the location data of passengers to predict one-night stands in San Francisco [2], and Facebook user data were utilized to manipulate the political opinions of users [3]. There are yet other examples of app providers abusing smartphone resources for financial gains [4–8]. These privacy and trust threats have their roots in the excessive permissions granted by smartphone users to different apps. This starts in the app stores when a user downloads an app for his or her functional needs. App stores provide a functional description of such apps, and the resources and user data needed to perform these functions are listed as permission requirements. It is the user who, despite his or her ignorance and lack of security awareness, decides whether an app is safe for use and authorizes which features and data in the smartphone are accessible by third-party apps [11,14]. As a result, users may end up installing apps that acquire more resources and collect more data than needed for their listed functions [9]. One of the main reasons for the distribution of such apps through these app stores is that, contrary to users' beliefs, while app stores make it easy for users to install and run applications, they provide few guarantees about their provenance or behavior [10]. Consequently, different apps may get unobtrusive access to user data and use it for purposes unknown to the users, or may abuse phone features for unfair financial gains.

Existing vetting mechanisms in the app stores are not reliable and require extra controls [12,13]. The purpose of conducting this research is to propose a framework for pre-submission vetting of Android apps in app stores. Contrary to existing approaches, the purpose of this framework is to prevent permission-abusive apps from entering app stores. This framework uses an expert-generated rules dictionary and functional signatures of an app, generated from its description. It identifies excessive and resource-consuming permissions and assesses the adequacy of permission use to generate a profile that contains different permission usage scores, or suggests whether an app is suspicious. This framework can be used as the first line of defense in app stores to vet newly submitted apps.

2 Background

In this section, we briefly introduce Android permissions, the challenges regarding lack of user attention to and comprehension of permissions, the problem of permission over-claim by apps, and the problem of resource abuse.

2.1 Android Permissions

The purpose of the Android permission mechanism is to protect the privacy of its users. Android permissions have different types according to their protection levels. Dangerous permissions and normal permissions are the two most relevant permission types from the users' perspective. In post-Android 6.0 versions, apps must ask for user approval of dangerous permissions at runtime; these permissions protect sensitive data or critical device features in an Android smartphone. An app's access to such data and features may affect user privacy. However, Android OS grants normal permissions automatically [16].


Further, post-Android 6.0 versions let a user enable or disable different permissions after an app has been installed. However, this feature is based on the assumption that users understand the permission system. Moreover, although dangerous permissions need explicit user approval at runtime, apps have a 'remember this permission' option to bypass this explicit approval in the future. Likewise, it is Google who decides which permissions are dangerous or normal. Finally, once a user grants a permission, the legal liability for data protection lies with the user.

2.2 Low Attention and Comprehension

Different studies show that the permission comprehension rates of users are very low [14]. A vast majority of users are not aware of the difference between permissions like 'approximate location' and 'precise location' [17]. Even some developers find it hard to understand Android permissions [18]. Therefore, the security design within smartphone platforms is not simple and non-technical enough to cater to a diverse user base, and these users are unable to resolve the privacy-functionality trade-off [19].

2.3 Permission Over-Claim by Apps

According to the app permission best practices recommended by Google, "Permission requests protect sensitive information available from a device and should only be used when access to information is necessary for the functioning of your app" [20]. However, a large number of apps over-claim permissions in practice [21–23]. As a result, a non-negligible portion of applications suffers from permission gaps, i.e., they do not use all the permissions they declare [24]. While some of these gaps may be the result of the incompetence of developers, others can also be used to cater to the needs of sophisticated data harvesting techniques [25,26]. A recent study carried out a static analysis of 10,710 apps and found that 76.08% of them explicitly over-claimed permissions [27]. In 424 cases, sensitive permissions were used only by the advertisement library's code instead of serving the functional needs of the app.

2.4 Resource Abuse in Smartphones

Recently, resource abuse by apps has emerged as a significant problem in Android smartphones. For example, millions of Android phones have been targeted for crypto-mining [28], and different apps have been identified that abuse energy and Internet traffic [4,5]. The co-existence of apps in Android, their multi-modal behaviors, and their ability to delegate tasks to other apps enable them to abuse resources without being detected [4]. Furthermore, different smartphone features, such as the 3G radio, CPU, and display, can be abused. Partial wake-locks, full wake-locks, and the keyboard backlight can drain the battery [29]. Similarly, advertisements displayed in different apps consume a significant amount of energy [31].

3 Related Work

In this section, we introduce some of the solutions that have been proposed to address different aspects of permission over-claim in Android apps.

Hamed et al. [35] proposed an approach to increase user awareness of the privacy risks of permissions in Android applications. Their approach uses a contextual model in which the risk score varies with changes in the context of use. This approach has four steps. In the first step, they identify the permissions required by applications. In the second step, they identify dangerous permissions and the interactions between them. In the third step, they compute relative scores. Finally, they use colored bar graphs to show the scores to the user. Taylor and Martinovic [37] used an approach which performs contextual permission analysis of similar Android apps. Their analysis determines whether a permission-hungry app can be replaced by a functionally similar app requiring less sensitive access to devices. Their results show that up to 50% of apps on the Google Play store have better alternatives that provide similar functionality. Han et al. [36] used a semantic approach based on judging the soundness of descriptions to detect potentially malicious applications, based on the semantic relatedness between the applications' descriptions and the .apk files. Wu et al. [38] proposed a Permission Abuse Checking System (PACS). They used apps' metadata, user reviews, and descriptions to obtain the maximum frequent itemsets and construct a permission feature database. Their evaluation shows that, in their data set, 77.6% of apps over-claim permissions. In another study, Slavin et al. [39] proposed a semi-automated framework to help developers check consistency between their privacy policies and their apps' code. Their framework consists of a privacy policy terminology and an API that maps these terms to the API methods that deal with sensitive information. Further, they consider information flows to detect misalignments. Another important solution was proposed by Dao et al. [4] to detect energy-hungry apps. They developed a tool called TIDE, in the form of an app that can be installed on smartphones, that identifies energy-hungry apps specific to their usage profiles.

Large inconsistencies in the results of [24,27,29] and [36] demand better alternatives. The use of app reviews also makes such analyses questionable, as fake reviews are themselves a big problem. Similarly, [27] uses only dangerous permissions and the interactions among them for evaluation and ignores the rest of the permissions. Overall, these studies use techniques that involve analysis of app code and are developed for users, yet they require an understanding of permissions and usage expertise, which is a known problem. Furthermore, most of the solutions are app-based tools intended to detect permission or resource abuse in already-installed applications. Hence, the problem of a lack of understanding of the permission mechanism is ignored. It is still the technically naïve smart-phone user who needs to learn more to manage permissions. This raises the need for solutions which identify and filter such apps before they are distributed through app stores and reach users' smart-phones.

4 Assumptions

We assume that developers follow the instructions of Google, which say that "Apps must provide accurate disclosure of their functionality and should perform as reasonably expected by the user" [34], that they are transparent in providing the functional descriptions and required permissions, and that these descriptions are consistent with the underlying code implementations. Although some malicious applications can try to deceive initial checks, a vast majority of apps provide reasonable information about permissions and descriptions, which can be used to make an initial assessment of permission abuse [36].

5 Proposed Framework

In this section, we present the threat analysis and an overview of the overall design of our framework.

5.1 Threat Model

A user needing to install a smart-phone app accesses app stores. App markets use a standard template to display the description, reviews, rating, permissions, developers' information, etc. The description of a given app explains its functional capabilities, and its permissions show the data and resources that it requires to perform different functions. Gaining such permissions enables apps to access user data and device features, including those required for data collection [11]. Users, however, are naive in general and do not understand the permission mechanism in smart-phone apps [14]. Neither do they know the privacy and security risks associated with freely available apps in the app stores. They assume that third-party apps installed from platforms like the Google Play store are safe (e.g., users downloaded a spyware app 8 million times [13]). Whereas the priority of such stores is to prevent the distribution of malicious apps, permission-abusive apps are not treated as malicious by these stores, and developers manipulate this situation. Developers ask for excessive app permissions with the intention of over-collecting user data to make profits by selling these data, or of abusing device resources to reduce their costs. Developers even use deceptive methods such as fake reviews to lure users into downloading their apps [15]. Thus, unwarranted use of data and loss of control over data may result in privacy violations for the user. Abuse of resources amounts to a breach of trust.

5.2 Overview of the Approach

The purpose of proposing this framework is to introduce a first-line-of-defense warning mechanism to prevent the submission of permission- and resource-abusive apps into app markets. Such apps are among the most significant factors contributing to privacy violations of smart-phone users, as a result of data over-collection by the apps and resource abuse [25–27]. However, the frequency of app submission is quite high, and big app stores like Google Play want a quick rollout of apps while also maintaining quality. This goal requires a light-weight framework that can rapidly evaluate submitted apps and raise a flag in case an app requires a thorough evaluation by experts or a secondary-level system. Our framework serves this purpose. It uses an app's description, permissions, and category to generate a permission and resource usage profile. In order to generate these profiles, we identify the tasks provided in the functional description of an app and determine whether the permissions claimed by the app are appropriate or excessive and how many resources the app claims. Previous studies have used app descriptions [36,37], permissions [32,33,37], and techniques like static code analysis and natural language processing to assess permission abuse [27,32,36,39]. After evaluation of the input description and permissions, our system identifies excessive permissions and resource usage, weights are assigned, and final scores are generated. Such scores are displayed to experts, who decide whether to accept or reject the app. Figure 1 provides an overview of the proposed framework.

Fig. 1. An overview of the proposed framework.

5.3 Steps

In this section, we explain the different steps involved in the execution of our framework.

Preparation of Dictionary. This step involves defining association rules and compiling them into a dictionary. The dictionary lists rules for basic tasks and the relevant permissions. Different functions that an app may use to differentiate itself for better positioning in the market are excluded. We follow Android's guidelines in this regard, which state that "Permission requests protect sensitive information available from a device and should only be used when access to information is necessary for the functioning of your app" [20]. Further, the use of permissions involving resource usage should also be considered for assessing the resource usage of an app. For example, existing research uses the CPU, network interfaces, wake-locks, and display to estimate energy consumption [4]. Similarly, in order to save energy, Android aggressively puts its components (e.g., 3G radio and CPU) into an idle state shortly after they become inactive [29]. The Android developer guide and app development experience should be used while preparing the dictionary. We propose that different sets of association rules should be defined for the different app categories in the app store. These association rules map app permissions and tasks in different app categories. Let c be an app category, {P} be a set of permissions, and {T} be a set of tasks that can be performed using this permission set; then we can write:

c → {T} (1)

{T} → {P} (2)

The notion of T and P can be defined as follows: T = {t_1, t_2, t_3, ..., t_n} and P = {p_1, p_2, p_3, ..., p_m}. For a task t_j (where t_j ∈ T) which requires a set of l permissions, where m ≥ l ≥ 1, we can write:

t_j → {p_1, p_2, p_3, ..., p_l} (3)

For example, if, in the case of a messenger app, the 'SEND MESSAGE' task requires the 'INTERNET' and 'WAKE LOCK' permissions, we write it as follows.

{SEND MESSAGE} → {INTERNET, WAKE LOCK} (4)

For learning the resource-consuming characteristics of an app, a set of all resource-consuming permissions should be included in the dictionary. We call it {P_R} and define it as follows.

P_R = {p_R1, p_R2, p_R3, ..., p_Rn} (5)

The manual definition of the dictionary has two advantages: (1) it provides the flexibility required to deal with an evolving problem of permission use, and (2) it incorporates the expertise and judgment of experts and eliminates the impact of the false positives or false negatives associated with machine learning-based methods. Furthermore, any erroneous rules can be replaced by more effective rules if needed. This is in contrast to machine learning-based methods, where changes to the underlying algorithm would be needed to achieve the same goal.
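For illustration, the association rules above can be held in a plain nested mapping. In the following sketch the category, task names and permission strings are placeholders rather than entries from an actual rule set.

```python
# A minimal sketch of the expert-defined dictionary, assuming a plain
# nested mapping; the category, task names and permission strings are
# placeholders, not entries from an actual rule set.
DICTIONARY = {
    "MESSENGER": {
        "tasks": {
            "SEND_MESSAGE": {"INTERNET", "WAKE_LOCK"},  # rule (4)
            "CALL": {"CALL_PHONE", "RECORD_AUDIO"},
        },
        # the resource-consuming permission set {PR} for this category
        "resource_permissions": {"INTERNET", "WAKE_LOCK"},
    },
}

def minimum_permissions(category: str, tasks: set) -> set:
    """Union of the permissions the dictionary maps to the given tasks."""
    rules = DICTIONARY[category]["tasks"]
    needed = set()
    for task in tasks:
        needed |= rules.get(task, set())
    return needed
```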


Generation of Functional Signatures. In this step, the description and the metadata of an app are analyzed. Words or pairs of words that represent atomic tasks should be sorted. For example, 'call' is a basic task that all messenger apps perform. Similarly, 'messaging' or 'send message' represents another individual task. In addition to identifying the tasks, an app's permission requirements should be identified for further analysis and its category should be learned (each app store has its app categories, and a developer must specify an app's category during app submission). Formally, let FD be the functional description of a smart-phone app and P be the set of permissions required to perform the set of tasks T described in FD. It should be possible to distinctly identify the set of tasks T that a given app can perform, provided that a set of permissions P is granted by the user. Thus, we can say that FD represents a set of tasks T = {t1, t2, ..., tn} that a given app can perform, provided that a set of m permissions P = {p1, p2, ..., pm} is granted by the user. However, each task ti ∈ T has its own requirement for a set of permissions p, where p ∈ P′ = {p′1, p′2, ..., p′m′} and m ≥ m′ ≥ 1. Therefore, we can define FD as follows:

FD = {(ti, p) : ti ∈ T, p ∈ P′ ⊆ P}    (6)
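A rough sketch of this step, reusing the dictionary structure sketched earlier and assuming naive keyword matching (a real implementation would need proper natural language processing):

```python
import re

# A rough sketch of functional-signature generation via simple keyword
# matching against the task names of the dictionary sketched above; the
# helper and its matching strategy are hypothetical.
def functional_signature(description: str, category: str) -> list:
    rules = DICTIONARY[category]["tasks"]
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    # eq. (6): pair each task found in the description with the
    # permission set the dictionary says it requires
    return [(task, perms) for task, perms in rules.items()
            if set(task.lower().split("_")) <= tokens]
```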

Evaluation. During this step, the category of the newly submitted app is sent to the dictionary, and the sets of relevant tasks and permissions needed to perform these tasks are retrieved. Furthermore, a set of resource-consuming permissions PR is also retrieved. Set operations are performed to identify the excessive permissions by finding the difference between the set of permissions of the given app and the set of minimum permissions retrieved from the dictionary. Thus, if Pex is the set of excessive permissions, P is the set of permissions of the app, and Pmin is the set of permissions retrieved from the dictionary, then:

Pex = P − Pmin    (7)

The set of excessive permissions is further split into dangerous and normal permissions. Similarly, the set of resource-consuming permissions is determined by finding the intersection of the set retrieved from the dictionary and the set of permissions of the given app. If PR is the set of resource-consuming permissions and P is the set of permissions of the given app, then we can find the set of resource-consuming permissions Y in the app as follows:

Y = PR ∩ P    (8)

Profiling. In this step, weights are assigned to the over-claimed permissions according to their categories, i.e., normal and dangerous. If P is the set of permissions of a given app such that i permissions are dangerous and j permissions are normal, then we can calculate the respective scores as follows:

Score(Pdan) = Σ Pi    (9)

Score(Pnor) = Σ Pj    (10)

A single dangerous permission is assigned a weight equal to two (2), and a normal permission is weighed half of it, getting a score of one (1). If there are i dangerous excessive permissions (Pex−dan) and j normal excessive permissions (Pex−nor), then we can calculate the respective scores as follows:

Score(Pex−dan) = Σ Pi    (11)

Score(Pex−nor) = Σ Pj    (12)

Therefore, the cumulative score of excessive permissions is calculated as follows:

Score(Pcum−ex) = Score(Pex−dan) + Score(Pex−nor)    (13)

Similarly, resource usage scores are generated using the permissions that determine the resource usage requirements of the app. Resource usage permissions are all weighed equally:

Score(PRapp) = Σ YN    (14)

Furthermore, if an app claims to offer too many functions and demands too few permissions in return, our approach declares it as suspicious.
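The set operations and scoring of Eqs. (7)-(14) can be sketched directly; the weights follow the 2-for-dangerous, 1-for-normal scheme stated above, while the dangerous-permission list here is an illustrative placeholder.

```python
# A minimal sketch of the evaluation and profiling steps (eqs. (7)-(14)),
# using weight 2 per dangerous and 1 per normal permission as stated
# above; DANGEROUS is an illustrative stand-in for the platform's list
# of dangerous permissions.
DANGEROUS = {"READ_SMS", "ACCESS_FINE_LOCATION", "RECORD_AUDIO"}

def profile(app_perms: set, p_min: set, p_resource: set) -> dict:
    p_ex = app_perms - p_min            # eq. (7): excessive permissions
    y = p_resource & app_perms          # eq. (8): resource-consuming ones
    ex_dan = p_ex & DANGEROUS
    ex_nor = p_ex - DANGEROUS
    return {
        "excessive": p_ex,
        # eqs. (11)-(13): cumulative score of excessive permissions
        "score_excessive": 2 * len(ex_dan) + 1 * len(ex_nor),
        "score_resource": len(y),       # eq. (14): equal weights
    }
```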

6 Discussion

This paper proposes a light-weight framework for pre-submission app vetting in app stores, to prevent permission-abusive apps from entering app stores and to relieve users of app vetting. In this section, we compare our approach with other recently proposed approaches and discuss its strengths and weaknesses. Table 1 presents a comparison of the proposed approach with existing state-of-the-art approaches. Contrary to the approach proposed in [35], which takes into account only the dangerous permissions and uses different contexts to generate scores for warning users, our approach evaluates all the permissions and the functional description provided in the app store. As opposed to their solution, which aims to warn users, the proposed framework helps experts prevent permission- and resource-abusive apps from entering the app stores. Similarly, when compared with [36], which establishes relatedness between the description and the implementation of apps, the proposed framework does not analyze app code. Finally, TIDE [4], a solution proposed to find energy-hungry apps on a smartphone, needs to be installed by the user and requires access to OS files, which may introduce security issues. Contrary to all these frameworks, the proposed framework evaluates the permissions demanded by apps, while considering the functions given in their descriptions, against thresholds set by app store admins. This evaluation is currently the responsibility of users who want to install these apps; only the proposed approach relieves users of a vetting task they are not well equipped to perform.


Table 1. Comparison with other approaches.

| Work | Uses | Requirements | Audience | Goal | Works in |
|------|------|--------------|----------|------|----------|
| [35] | Dangerous and normal permissions and context of use | Installation in smartphone | App users | Generating risk scores of apps | Smartphone |
| [36] | App descriptions and app code | Static analysis and related expertise | App developers and app store administrators | Malware detection and assessing description soundness | Development environments and app stores (post-submission) |
| [37] | Dangerous permissions | Access to all available apps in an app category | App users | Recommending apps sticking to the least-privilege principle | App store (post-submission) |
| [38] | Permissions, descriptions, reviews, ratings | Availability of apk file and app description | Unspecified | Classification of apps into different categories | Post-submission |
| [39] | Privacy policy and app code | Privacy concerns among developers | App developers | Helping mobile-app developers check their privacy policies against their apps' code for consistency | Development environment |
| [40] | App permissions and installed apps | Installation in phone and permission to monitor other apps | App users | Helping in permission management | Smartphones |
| [4] | OS data | Installation in phone and access to OS files | App users | Helping in identifying energy-hungry apps | Smartphones |
| Proposed approach | App descriptions and permissions | Access to required data at pre-submission stage | App store administrators | Preventing permission and resource abusive apps from entering the app store | App store |

Furthermore, our approach is semi-automatic, where the expert-defined rules and thresholds can be more reliable and flexible. Such rules can be more effective than user-feedback-dependent mechanisms of app filtering or those that involve the use of user reviews [38]. It is known that user feedback can be biased, and shilling attacks and fake reviews can be launched to take an unfair advantage over competitors. It is important to note that the proposed framework does not address related issues after apps are installed; in such cases, the solutions proposed in [40,41] or PACS [29] can be more effective. Furthermore, if a developer provides false information regarding the capabilities and permissions of an app, it can deceive this initial check. However, as mentioned earlier, the purpose of the proposed framework is not to detect malicious apps, which requires thorough code analysis. Overall, we believe that the proposed model will strengthen the first line of defense in an app store and relieve users of the hectic job of vetting apps.

7 Conclusion

In this paper, we presented a lightweight framework to prevent data and resource abusive apps from entering app stores, as a first step to fight data over-collection and resource abuse. It can also detect suspicious apps by detecting an imbalance between the functions an app lists and the permissions it requests. Our approach can help experts in reviewing/vetting apps submitted to app stores. This contributes towards the privacy protection of users, reduces their privacy burden, and improves the trust relationship between users and app stores. In the future, we intend to implement this framework and extend it by adding an optional second stage for a thorough evaluation of suspicious apps.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China under Grant 61632009, in part by the Guangdong Provincial Natural Science Foundation under Grant 2017A030308006, and in part by the High-Level Talents Program of Higher Education in Guangdong Province under Grant 2016ZJ01.

References

1. Schneier, B.: It's not just Facebook. Thousands of companies are spying on you. https://bit.ly/2ro89mx. Accessed 10 Apr 2018
2. Ramos, D.: Uber crunches user data to determine where the most 'one-night stands' come from. https://tinyurl.com/y5qd6agd. Accessed 10 Apr 2018
3. Graham-Harrison, E., Cadwalladr, C., Osborne, H.: Cambridge Analytica boasts of dirty tricks to swing elections (2018). https://tinyurl.com/y23bgenk
4. Dao, T.A., Singh, I., Madhyastha, H.V., Krishnamurthy, S.V., Cao, G., Mohapatra, P.: TIDE: a user-centric tool for identifying energy hungry applications on smartphones. IEEE/ACM Trans. Netw. 25, 1459–1474 (2017)
5. Rahman, S., et al.: Internet data budget allocation policies for diverse smartphone applications. EURASIP J. Wirel. Commun. Netw. 2016, 226 (2016)
6. Zhang, S., Wang, G., Bhuiyan, M.Z.A., Liu, Q.: A dual privacy preserving scheme in continuous location-based services. IEEE Internet Things J. 5, 4191–4200 (2018)
7. Zhang, S., Li, X., Tan, Z., Peng, T., Wang, G.: A caching and spatial K-anonymity driven privacy enhancement scheme in continuous location-based services. Future Gener. Comput. Syst. 94, 40–50 (2019)
8. Elahi, H., Wang, G., Li, X.: Smartphone bloatware: an overlooked privacy problem. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 169–185. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_15


9. Phung, P.H., Mohanty, A., Rachapalli, R., Sridhar, M.: HybridGuard: a principal-based permission and fine-grained policy enforcement framework for web-based mobile applications. In: IEEE Security and Privacy Workshops (SPW), pp. 147–156. IEEE (2017)
10. Fragkaki, E., Bauer, L., Jia, L., Swasey, D.: Modeling and enhancing Android's permission system. In: Foresti, S., Yung, M., Martinelli, F. (eds.) ESORICS 2012. LNCS, vol. 7459, pp. 1–18. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33167-1_1
11. Mylonas, A., Kastania, A., Gritzalis, D.: Delegate the smartphone user? Security awareness in smartphone platforms. Comput. Secur. 34, 47–66 (2013)
12. Welch, C.: Google took down over 700,000 bad Android apps in 2017. The Verge (2018). https://tinyurl.com/yco84en2. Accessed 10 Sep 2019
13. Stefanko, L.: First-of-its-kind spyware sneaks into Google Play. WeLiveSecurity (2019). https://tinyurl.com/y6gq2z2v. Accessed 10 Sep 2019
14. Elahi, H., Wang, G., Xie, D.: Assessing privacy behaviors of smartphone users in the context of data over-collection problem: an exploratory study. In: IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1–8. IEEE (2017)
15. Martens, D., Maalej, W.: Towards understanding and detecting fake reviews in app stores. Empir. Softw. Eng. 1–40 (2019). https://doi.org/10.1007/s10664-019-09706-9
16. Google: Permissions Overview. https://bit.ly/2HcAcye
17. Fu, H., Lindqvist, J.: General area or approximate location? In: Proceedings of the 13th Workshop on Privacy in the Electronic Society - WPES 2014, pp. 117–120. ACM Press, New York (2014)
18. Felt, A.P., Ha, E., Egelman, S., Haney, A., Chin, E., Wagner, D.: Android permissions: user attention, comprehension, and behavior. In: Proceedings of the Eighth Symposium on Usable Privacy and Security, Washington DC, pp. 1–14. ACM, New York (2012)
19. Fife, E., Orjuela, J.: The privacy calculus: mobile apps and user perceptions of privacy and security. Int. J. Eng. Bus. Manag. 4, 1–10 (2012)
20. Google: App Permissions (Usage Notes). https://bit.ly/2LQoE61
21. Felt, A.P., Chin, E., Hanna, S., Song, D., Wagner, D.: Android permissions demystified. In: Proceedings of the 18th ACM Conference on Computer and Communications Security, Chicago, Illinois, USA, pp. 627–638. ACM, New York (2011)
22. Stevens, R., Ganz, J., Filkov, V., Devanbu, P., Chen, H.: Asking for (and about) permissions used by Android apps. In: 10th IEEE Working Conference on Mining Software Repositories (MSR), San Francisco, CA, pp. 31–40. IEEE (2013)
23. Wang, J., Cheng, H., Xue, M., Hei, X.: Revisiting localization attacks in mobile app people-nearby services. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 17–30. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_2
24. Bartel, A., Klein, J., Le Traon, Y., Monperrus, M.: Automatically securing permission-based software by reducing the attack surface: an application to Android. In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering - ASE 2012, p. 274. ACM Press, New York (2012)
25. Seneviratne, S., Seneviratne, A., Mohapatra, P., Mahanti, A.: Predicting user traits from a snapshot of apps installed on a smartphone. Mob. Comput. Commun. Rev. 18, 1–8 (2014)


26. Dimitriadis, A., Efraimidis, P.S., Katos, V.: Malevolent app pairs: an Android permission overpassing scheme. In: Proceedings of the ACM International Conference on Computing Frontiers - CF 2016, pp. 431–436. ACM Press, New York (2016)
27. Tang, J., Li, R., Han, H., Zhang, H., Gu, X.: Detecting permission over-claim of Android applications with static and semantic analysis approach. In: IEEE Trustcom/BigDataSE/ICESS, pp. 706–713. IEEE (2017)
28. Segura, J.: Drive-by cryptomining campaign targets millions of Android users. https://tinyurl.com/y6pnjdob
29. Kang, Y., Miao, X., Liu, H., Ma, Q., Liu, K., Liu, Y.: Learning resource management specifications in smartphones. In: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS, January 2016, pp. 100–107 (2016)
30. Banerjee, A., Chong, L.K., Ballabriga, C., Roychoudhury, A.: EnergyPatch: repairing resource leaks to improve energy-efficiency of Android apps. IEEE Trans. Softw. Eng. 44, 470–490 (2017)
31. Prochkova, I., Singh, V., Nurminen, J.K.: Energy cost of advertisements in mobile games on the Android platform. In: Proceedings of the 6th International Conference on Next Generation Mobile Applications, Services and Technologies, NGMAST 2012, pp. 147–152 (2012)
32. Sun, L., Li, Z., Yan, Q., Srisa-an, W., Pan, Y.: SigPID: significant permission identification for Android malware detection. In: 11th International Conference on Malicious and Unwanted Software (MALWARE), pp. 59–66. IEEE (2016)
33. Bugiel, S., et al.: XManDroid: a new Android evolution to mitigate privilege escalation attacks. Center for Advanced Security Research Darmstadt, Darmstadt (2011)
34. Google: Privacy, Security, and Deception. Google Play (2018). https://tinyurl.com/y63o8qbb. Accessed 18 Apr 2018
35. Hamed, A., Ben Ayed, H.K.: Privacy risk assessment and users' awareness for mobile apps permissions. In: IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), pp. 1–8. IEEE (2016)
36. Han, W., Wang, W., Zhang, X., Peng, W., Fang, Z.: APP vetting based on the consistency of description and APK. In: Yung, M., Zhu, L., Yang, Y. (eds.) INTRUST 2014. LNCS, vol. 9473, pp. 259–277. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27998-5_17
37. Taylor, V.F., Martinovic, I.: SecuRank: starving permission-hungry apps using contextual permission analysis. In: Proceedings of the 6th Workshop on Security and Privacy in Smartphones and Mobile Devices - SPSM 2016, pp. 43–52. ACM Press, New York (2016)
38. Wu, J., Yang, M., Luo, T.: PACS: permission abuse checking system for Android applications based on review mining. In: IEEE Conference on Dependable and Secure Computing, pp. 251–258. IEEE (2017)
39. Slavin, R., et al.: Toward a framework for detecting privacy policy violations in Android application code. In: Proceedings of the 38th International Conference on Software Engineering - ICSE 2016, pp. 25–36. ACM Press, New York (2016)
40. Calciati, P., Gorla, A.: How do apps evolve in their permission requests? A preliminary study. In: IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp. 37–41. IEEE (2017)
41. Cheng, Y., Yan, Z.: PerRec: a permission configuration recommender system for mobile apps. In: Ibrahim, S., Choo, K.-K.R., Yan, Z., Pedrycz, W. (eds.) ICA3PP 2017. LNCS, vol. 10393, pp. 476–485. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65482-9_34

Nowhere Metamorphic Malware Can Hide - A Biological Evolution Inspired Detection Scheme

Kehinde O. Babaagba, Zhiyuan Tan, and Emma Hart

School of Computing, Edinburgh Napier University, Edinburgh EH10 5DT, UK
{K.Babaagba,Z.Tan,E.Hart}@napier.ac.uk

Abstract. The ability to detect metamorphic malware has generated significant research interest over recent years, particularly given its proliferation on mobile devices. Such malware is particularly hard to detect via signature-based intrusion detection systems due to its ability to change its code over time. This article describes a novel framework which generates sets of potential mutants and then uses them as training data to inform the development of improved detection methods (either in two separate phases or in an adversarial learning setting). We outline a method to implement the mutant generation step using an evolutionary algorithm, providing preliminary results that show that the concept is viable as a first step towards instantiation of the full framework.

Keywords: Metamorphic malware · Evolutionary algorithm · Mutant generation · Mobile devices · Detection methods · Adversarial learning

1 Introduction

Malicious attacks continue to pose serious security threats to most information assets. They also constitute one of the most commonly found attack vectors. The 2019 Internet Security Threat Report by Symantec revealed that there has been an increase in malicious attacks in the form of formjacking attacks, with approximately 4,800 websites falling victim monthly. Ransomware is now targeting enterprises, with a 12% rise in the number of infections compared to the previous year's attack incidence. To prevent detection and elimination of malicious binaries, obfuscation techniques are employed by sophisticated malware creators. These techniques often involve packing the malware (also known as malware packing), transforming its static binary code (polymorphism), or transforming the dynamic binary code of the malware (metamorphism). Amongst these sophisticated malware families, metamorphic malware is particularly complex and dangerous, presenting security threats to many endpoint devices, including desktops, servers, laptops, kiosks and mobile devices with an Internet connection. Its danger arises from its ability to transform its program code


between generations using various means, including instruction substitution (substituting a given instruction sequence with its equivalent); garbage code insertion (inserting junk code into the original program code); control-flow alteration (distorting the flow of control within the original program code using loops); and register reordering (reordering the variables in the original program code).

In a bid to curb the impact of metamorphic malware, several detection strategies have been adopted: Alam et al. [1] provide a detailed overview, including Opcode-Based Analysis (OBA), Control Flow Analysis (CFA) and Information Flow Analysis (IFA), which differ depending on the type of information being used in the analysis. We propose an alternative approach via a novel framework which contains two components: the first component generates a set of potential mutants of existing malware; the second component trains a detection system to recognise the new mutants. The two components may be used sequentially or in an adversarial setting, that is, the mutant generator can create increasingly more complex mutants as the detection system gets better at recognising mutants. In this paper we describe the generic concept of the framework and then propose a method for instantiating the mutant-generating component using an evolutionary algorithm [8]. This population-based search technique has been successfully used in code-modification scenarios, e.g. to fix bugs [10] or speed up code [6], as well as in some previous attempts to evolve mutants of existing malware [2].

The main contributions of the paper are as follows:

– A review of current metamorphic malware detection techniques.
– A proposal for a novel framework for developing malware detectors capable of recognising future malware mutations.
– A proof-of-concept that an Evolutionary Algorithm, as the mutation engine component of the framework, can be used to generate new mutant samples.

The rest of the paper is structured as follows. Section 2 gives a brief history of metamorphic malware detection. Section 3 discusses the challenges to existing detection techniques. In Sect. 4, evolutionary based malware detection and its challenges are discussed. Section 5 describes the proposed detection framework. In Sect. 6, preliminary experimentation and discussion are presented. A conclusion is drawn in Sect. 7.

2 A Brief History of Metamorphic Malware Detection

A number of techniques have been developed over the years to combat the threat of malicious software. The strategies that have been used can be grouped into three, namely:

– Signature based detection;
– Heuristic based detection;
– Malware normalisation and similarity-based detection.


2.1 Signature-Based Detection

This involves the extraction of unique byte streams which define the malware's signature. Files in the host machine are scanned in order to find a given malicious signature. The work of [17] was one of the first signature-based malware detection schemes and has subsequently been referenced by other research works such as [24]. The authors of [17] proposed a system called Malicious Code Filter (MCF), a static analysis tool used for malware classification. Their scheme looked for tell-tale signs in malicious code. These signs refer to attributes of a piece of program code that can be used to determine whether the code is malicious or not, without the need for expert coding knowledge. Their system was successful in proving that tell-tale signs are useful in identifying malicious code.

Several signature-based methods have been used for detecting metamorphic malware in particular. The authors of [12] used string signatures for metamorphic malware detection and achieved a false alarm rate of less than 0.1%. The work presented in [26] introduces Aho-Corasick (AC), a string matching algorithm for detecting metamorphic malware. [23] employs a static scanner for detecting metamorphic malware with a high detection rate. More signature-based metamorphic malware detection techniques, such as string scanning with special cases like wild-cards or mismatches, bookmarks, and speed-up search algorithms [14], are found in other related work.

Signature-based methods of malware detection provide a fast and easy means of detecting malware. However, they are often not efficient in detecting advanced malware, such as malware that employs obfuscation techniques to mask its code structure, because they can only recognise specific code versions. It would therefore be useful if signature-based systems were fed with new data that represent potential variants.

2.2 Heuristic Based Detection

A detection technique that involves the analysis of the behaviour, functionality and characteristics of a suspicious file without relying on a signature is referred to as heuristic detection. Unlike signature-based detection, its goal is not to discover a given signature but to detect malicious functionalities such as a malware's payload or distribution routine. This method employs data mining and machine learning techniques in detecting malware. These include supervised learning (learning with a guide), semi-supervised learning (learning with a partial guide) and unsupervised learning (learning without a guide). Some of these machine learning techniques include Decision Trees (DT) [3], Hidden Markov Models (HMM) [25], and Support Vector Machines (SVM) [20].

Machine learning based malware detection is data driven: it discovers relationships between the underlying structure of data, collected either before or after the execution of the malware, and its classification as malicious or non-malicious. Data collected prior to the execution of the malware includes information that can be derived from the file without running it, such as its code characteristics and its file format, among others. On the other hand, data collected after the malware executes derives from the artefacts left behind by the executed malicious code. These include behavioural descriptions of the malicious code such as process-related activities and registry-related activities, among others.

In unsupervised machine learning, the aim is to obtain previously unknown structural descriptions of data without a guide, for instance pre-existing labels. This can be done in a number of ways, in which clustering analysis is commonly involved and helps segment a dataset. Unsupervised learning based malware detection does not require the datasets used to be labelled as either clean or malicious. This makes unsupervised learning useful to cybersecurity experts, as a wide variety of unlabelled datasets are readily available. In supervised machine learning, the data has to be labelled as either malicious or clean. This helps the model in determining the labels for new instances. Supervised learning models have to be trained with sufficient data for better predictions; the trained model is then fed with new samples for prediction.

A pioneering work in heuristic based malware detection is that of [22], which uses Naïve Bayes (NB) in automatically identifying malicious patterns in malware. Their NB approach calculated, from the program's feature set, the probability that the program was malicious. Their heuristic based approach was better than other previously employed signature-based approaches in terms of detection accuracy. Since then, a number of research works, such as [21], have used heuristic methods for malware detection. In metamorphic malware detection, heuristic based methods such as DT, HMM and SVM have been used. For instance, a combination of a statistical chi-squared test and HMM is used by [25] in detecting metamorphic viruses. [3] also uses a statistical classifier that employs a DT in metamorphic malware detection. A single-class SVM is used by [20] in Android based metamorphic malware classification. These works led to increased metamorphic malware detection rates.

2.3 Malware Normalisation and Similarity-Based Detection

An attempt to transform metamorphic malware back to its original form is termed malware normalisation. The level of code obfuscation determines the effort that must be put into normalising the metamorphic malware. This technique was first introduced by Periot [19], whose approach took advantage of various code optimisation strategies to enhance the detection of malware. [29] also uses term rewriting as a means of normalising metamorphic malicious code. The various mutations of the metamorphic malware are modelled as rewrite rules, which are then transformed into a rule set for normalising the metamorphic malware. This approach was applied to the metamorphic engines' rule set and was used in the normalisation of variants produced by the mutation engine. A comprehensive list of techniques for code normalisation is given in [5].

In addition, similarity-based approaches, for instance structural entropy and compression-based techniques, were applied in [4] and [16] to detect metamorphic malware. While structural entropy involves the examination of the raw bytes of the mutated file, compression-based detection involves the use of compression ratios of the mutated file in a bid to create sequences that represent the file. In the work of [4], structural entropy was used for metamorphic malware detection. Their approach involved segmenting the binaries and then finding the similarity between their segments. The authors in [16] used a compression-based technique to detect metamorphic malware. Their approach used compression ratios to define the files. They then compared the file sequences against one another and used a scoring system to classify the files as either malicious or clean.

3 Limitations of Current Solutions to Metamorphic Malware Detection

Metamorphic malware is a class of highly sophisticated malicious software that commonly involves complex transformations of its code during each propagation. Due to the sophisticated means by which metamorphic malware can change its code, many existing detection approaches perform poorly. Signature-based detection approaches, for instance, are not efficient when faced with novel metamorphic malware. They are not only very time-consuming, since they require new signatures to be compared against large databases of malicious signatures, but also require those databases to be updated periodically. Moreover, signature-based approaches are often reactive and therefore cannot detect new attacks.

The file scanning process in heuristic based detection is usually based only on the attack name/label, which limits the information that can be derived. This method sometimes uses statistics for its predictive analysis, which is prone to diagnostic errors when the initial learning process is corrupted. If the algorithm is not trained appropriately, the resulting predictions may be inaccurate.

Most malware normalisation and similarity-based approaches still cannot detect advanced metamorphic malware with complex levels of obfuscation. Consequently, low detection rates result when they are used on such malware. In the case of malware normalisation using control-flow graphs, the normalisation process can be hampered by unreachable code streams, which can lead to control-flow-graph fall-through edges. Complex code streams (such as those that employ opaque predicates and branch functions) are often difficult to detect. Similarity-based techniques are often prone to false alarms and are susceptible to mutations that employ a lot of packing or compression. The compression ratio is very important in the segmentation phase of compression-based similarity detection; consequently, previously compressed code makes this detection inefficient.

4 Evolutionary Based Malware Detection

We propose that using an Evolutionary Algorithm to generate new malware samples, in order to create detectors that can recognise potential future variants of a class of malware, will address some of the above issues.


The term EA refers to a class of problem-solving techniques inspired by Darwin's theory of evolution [8], in which the quality (fitness) of a population increases over time due to the pressure exerted by natural selection. Given a quality function that needs to be optimised, a population of randomly generated potential solutions to the problem is first created. Solutions are selected for reproduction in a manner biased by their fitness; a reproduction operator generates new offspring from selected solutions by applying processes which mix information from two or more solutions (crossover) and/or by a mutation process that makes small, random changes to solutions. As new, fitter solutions replace poorer-quality ones in the population, the population as a whole becomes fitter as the process is iterated.

The flexible nature of EAs allows them to be applied to any task that can be expressed as a function optimisation task. Although a significant amount of literature focuses on their use in combinatorial or continuous optimisation domains, they have more recently been applied to malware analysis and detection, such as malware feature extraction [27] and classification problems [32], among others. [7] described a proof-of-concept that an EA could evolve a detector to recognise a virus signature represented as a bit-string, although this was only tested on 8 arbitrary functions designed to show its ability to cope in complex landscapes, rather than on real viruses. The approach offers the following advantages:

– Exploration of a huge search space, which is one of the challenges to be solved in searching for code variants;
– Operators that enable easy manipulation of code;
– A proven ability in the transformation, optimisation and improvement of software code [6,15,30].

The idea of using EA-based techniques for malware analysis and detection is not a new concept. For example, [18] use EAs to improve classifier selection and performance for malware detection. In [13], the authors use a variant of an EA called Genetic Programming (GP) to evolve variants of a buffer-overflow attack with the objective of providing better detectors, showing that GP could effectively evolve programs that "hid" malicious code, evading detection by Snort in 2011 instances. In [2], the authors also used GP to create new mutant samples, applying their approach to Android-based metamorphic malware. New malware samples were created using mutation techniques and evaluated on their ability to evade detection by eight antivirus systems. Similarly, [31] creates metamorphic PDF malware using GP and tests the instances of the evolved malware to determine whether they evade detection by PDF detectors. The mutants were generated by employing mutation operators such as junk code insertion and code reordering in [2], and operators such as deletion, insertion and replacement in [31]. All of the above approaches generate new malware that can be used to train malware detectors in order to achieve greater detection rates.


4.1 Challenges with Current EA-Based Approaches

Although the approaches described above provide evidence that EAs are a useful methodology, there remains much scope for improvement. [13] focus on buffer-overflow attacks rather than metamorphic malware, while [31] focus on PDF malware. Although [2] focus on metamorphic malware, they only consider 8 antivirus engines (Eset, GData, Ikarus, Kaspersky, Avast, TrendMicro, BitDefender and Norton) when evaluating whether their evolved malware is able to evade detection. It is also unclear whether they test that the mutants which evade detection are still malicious. The fitness function that guides evolution scores each solution with a discrete value between 0 and 8, depending on how many engines it evades. This provides very little information to guide the evolutionary algorithm through the search space to find evasive solutions.

Our proposed scheme addresses the above weaknesses as follows. Firstly, we focus on mobile platforms, as they are currently targeted by recent malware attacks. We evaluate the evasiveness of the created variants against a large set of 63 AV engines. Rather than considering only evasiveness as the fitness metric, we also measure the structural similarity and the behavioural similarity between the evolved mutants and the original malware and include this information in the fitness function. This makes it easier for the EA to traverse the fitness landscape, as the fitness function is more fine-grained, and hopefully to discover better solutions. It also increases the diversity of evolved variants. Finally, we also ensure that all variants retain their malicious nature once mutation has occurred.

5 Framework of Detection Scheme

This section proposes a framework of metamorphic malware detection for mobile computing platforms in order to address some of the challenges raised above. The framework comprises five functional modules, namely a data source (i.e., a mobile malware dump), a disassembly tool (i.e., apktool), a mutation engine, a data store for APK variants and a malware detector. The conceptual overview of the proposed framework is shown in Fig. 1.

Fig. 1. A conceptual overview of the proposed detection framework.

The framework includes a data source from which the mobile malware is collected. It then uses a disassembly tool to disassemble the mobile malware from APK to smali. The smali files can then be fed to the mutation engine module. This module generates novel malware mutants, representing potential future variants of existing malware. The new mutants are then stored in a data store. The data stored in the data store module can be used to train a detection module that offers improved protection against future mutants. If the system is used in an adversarial context, then improvements in the detection module drive the generation of more diverse mutants, hence driving further improvement in the detection system. To implement the malware generation module, we propose the use of an Evolutionary Algorithm, which is explained in the next section. This technique is well known for its ability to search vast spaces of potential solutions [11]; we use

it to efficiently search for unseen mutants that represent potential states that existing malware might morph into, thereby providing improved training data for a detection module. The detection module can employ machine learning, using an appropriate learner to detect the generated mutants. The machine-learning-based detector receives the newly created mutants from the data store as training data. We suggest that the probabilistic and evidence-based nature of Bayesian inference techniques makes them a suitable machine learning candidate. Metrics that might be used to evaluate the suitability of a machine learner are suggested below.

– Training time of the model – the time taken to train the new machine learning adaptation.
– Classification time – the time taken for the new machine learning adaptation to classify a sample as either malicious or benign.
– Model accuracy – how accurate the model is in detecting the malicious binaries.

At the current stage of the research, we focus on the mutation engine in order to generate high-quality and diverse samples to serve as a rich training set for the machine learning model. At a later stage of the research, we will then train a machine learning model on the generated diverse variants.
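For the disassembly module mentioned above, a minimal sketch that drives apktool from Python is given below; the APK name and output directory are hypothetical, and error handling is omitted.

```python
import subprocess
from pathlib import Path

# A minimal sketch of the disassembly step, assuming apktool is on the
# PATH; "sample.apk" and the output directory are hypothetical names.
def disassemble(apk: str, out_dir: str = "smali_out"):
    subprocess.run(["apktool", "d", apk, "-o", out_dir, "-f"], check=True)
    # Collect the smali files to feed to the mutation engine
    return sorted(Path(out_dir).rglob("*.smali"))

smali_files = disassemble("sample.apk")
```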

5.1 Malware Evolution

The mutation engine shown in Fig. 1 is implemented using an EA. The goal of the EA is to generate a new set of malware variants that evade current detection engines and are diverse with respect to their behavioural and structural similarity. The malware used for demonstration in this paper was collected from Contagio Mini Dump [28], a mobile malware dump. The code of this mobile malware is first reverse engineered from an Android Package (APK) to smali and


then converted to a document vector that serves as input to the evolutionary algorithm. The EA given in Algorithm 1 is used to evolve instances of the given malicious code. It begins with an initial population of solutions created by applying a random mutation process to the original malware. Each solution is evaluated using a fitness function, defined in Sect. 5.1. The fitness function is minimised by the algorithm, i.e., the smaller the value, the more evasive the variant and the more it is structurally and behaviourally dissimilar to the original malware. The main loop of the algorithm then selects a mutant as a parent and mutates it to create a child mutant. The child replaces the solution in the population that has the worst fitness if its own fitness is better than the worst.

Algorithm 1. Evolutionary Algorithm
1: Initialize pop of m random mutants mi, i ∈ [0, m − 1]
2: Assign fitness to each mutant
3: while maximum number of iterations not reached do
4:   Randomly select k variants from pop, and set parent pbest to the fittest variant
5:   Generate a new mutant mnew from pbest by mutating pbest, selecting a mutation operator with uniform probability
6:   Evaluate fitness fitnew of the new mutant
7:   if fitnew < fitworst then    (fitness is minimised, so lower is better)
8:     Replace the worst-fit member of pop with mnew
9:     Update fitworst
10:  end if
11: end while
12: return the variants created
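A minimal runnable sketch of this loop is given below; the mutate and fitness arguments are stubs standing in for the smali-level operators and the weighted fitness described in this section, not our implementation.

```python
import random

# A minimal sketch of Algorithm 1 (a steady-state EA in which fitness is
# minimised). `mutate` and `fitness` are stand-in stubs, not the exact
# operators or fitness used in the study.
def evolve(original, mutate, fitness, pop_size=20, k=5, iterations=100):
    pop = [mutate(original) for _ in range(pop_size)]
    fits = [fitness(m) for m in pop]
    for _ in range(iterations):
        # Step 4: tournament selection, best (lowest fitness) of k variants
        parent = min(random.sample(range(pop_size), k), key=lambda i: fits[i])
        child = mutate(pop[parent])          # Step 5
        f_child = fitness(child)             # Step 6
        worst = max(range(pop_size), key=lambda i: fits[i])
        if f_child < fits[worst]:            # Steps 7-9: replace the worst
            pop[worst], fits[worst] = child, f_child
    return pop                               # Step 12: the variants created
```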

The fitness function referred to in Algorithm 1 is given below and returns a value between 0 and 1:

    f(x) = 1                                      if the variant is not executable
    f(x) = w0·DR(x) + w1·SS(x) + w2·BS(x)         otherwise

subject to

    Σ(i=0 to 2) wi = 1   and   0 ≤ DR(x), SS(x), BS(x) ≤ 1
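The piecewise fitness can be sketched as follows; the weight values are illustrative placeholders, and the executability check is a hypothetical attribute standing in for the tests described later in this section.

```python
# A sketch of the piecewise fitness above; the weights are illustrative
# placeholders (they must sum to 1), and `executable`, `dr`, `ss`, `bs`
# are hypothetical stand-ins for the checks described in this section.
def fitness(variant, dr, ss, bs, w=(0.4, 0.3, 0.3)):
    if not variant.executable:           # non-executable variants score 1
        return 1.0
    assert abs(sum(w) - 1.0) < 1e-9      # the constraint on the weights
    return w[0] * dr(variant) + w[1] * ss(variant) + w[2] * bs(variant)
```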

The defined fitness function takes into consideration the code-level similarity (denoted SS(x)) between the original malicious file and its variants; the behavioural similarity (denoted BS(x)) between the original malicious file and its variants; and the detection rate (denoted DR(x)) of the variants. The three functions (DR(x), SS(x) and BS(x)) can be weighted with values between 0 and 1 to favour one type of solution over another. These functions are described below:


The Code Level/Structural Similarity: SS(x) measures the similarity between the original smali file and its mutants. Text similarity (cosine similarity, Levenshtein distance and fuzzy string matching) and source code similarity (the JPlag and Sherlock plagiarism detectors as well as normalised compression distance) metrics are employed. The structural similarity between the original APK file and its mutants is an average of all the similarity metrics employed, where a value of 0 means the original APK file and its mutants are completely dissimilar and 1 means they are identical.

The Behavioural Similarity: The behaviour of the APK file's mutants is measured using strace and Monkey Runner. Strace is used to monitor the system calls of the variants, while Monkey Runner is used to simulate user actions. A feature vector is constructed from the strace log, where each vector element represents the frequency of a system call. The behavioural similarity is the cosine similarity between the feature vectors of the original APK file and its mutants. The result is a value between 0 and 1, where 0 means the original APK file and its mutants are completely dissimilar and 1 means they are identical.

Evasiveness of Variants: The function DR(x) assesses the ability of a mutated APK to evade detection by antivirus engines. It is measured with the analysis report from VirusTotal, which comprises 63 antivirus engines representing most of the state of the art. The function checks which of these antivirus engines flag a submitted file as malicious or benign. DR(x) returns the percentage of engines that detect the variant, where a lower percentage indicates a more evasive variant.

Maliciousness and Executability: To assess the executability of a mutated APK file, tests of its compilation and execution are conducted after mutation. The tools used for this assessment are as follows.

– To check the compilation of an APK file: apktool, apksigner and zipalign.
– To check that the file runs: the Android emulator.

We wrap the functions that check that the APK variants compile and run properly in a bash script. Finally, in order to ensure that the evolved variants retain their maliciousness, we use Droidbox, a dynamic analysis tool, to check that the variants are still malicious. We only analyse the results for the variants that retain their maliciousness.
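For illustration, the cosine similarity between two system-call frequency vectors can be computed as follows; the syscall names in the usage note are illustrative.

```python
import math
from collections import Counter

# A sketch of the behavioural-similarity computation: cosine similarity
# between system-call frequency vectors built from strace logs.
def cosine_similarity(calls_a, calls_b):
    a, b = Counter(calls_a), Counter(calls_b)
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# e.g. cosine_similarity(["read", "write", "read"], ["read", "open"])
```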

6 Experiments

In this preliminary study, the malware utilised for demonstration is from the Contagio Minidump. The original parent malware was selected from the


dump at random. The experiment was conducted in a VMware Workstation VM running the Ubuntu operating system. The EA uses a population size of 20. Parents are selected using tournament selection [9] with k = 5. The best of the k selected parents is mutated by adding junk code (such as line numbers), reordering its variables, or distorting the program code's control flow through the insertion of a goto statement that jumps to a label that does nothing. The EA is then run 10 times for 100 iterations, and the best variant produced in each of the 10 runs is recorded.
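Two of these operators can be sketched at the smali-text level as follows; the inserted instructions are no-op placeholders rather than the exact transformations used in the study, and a real implementation would have to respect method boundaries.

```python
import random

# Illustrative smali-text mutation operators: junk-code insertion and
# control-flow distortion via a goto to a do-nothing label; variable
# reordering is omitted for brevity. The inserted lines are no-op
# sketches, not the study's exact transformations.
def insert_junk(lines):
    i = random.randrange(len(lines) + 1)
    return lines[:i] + ["    nop"] + lines[i:]

def insert_goto(lines):
    i = random.randrange(len(lines) + 1)
    tag = f":noop_{random.randrange(10**6)}"
    return lines[:i] + [f"    goto {tag}", f"    {tag}"] + lines[i:]

def mutate(lines):
    return random.choice([insert_junk, insert_goto])(list(lines))
```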

Fig. 2. Boxplots of best fitness. (a) Best fitness in terms of detection rate (DR(x)), behavioural similarity (BS(x)) and structural similarity (SS(x)) of the variants for the ten runs of the EA. (b) Best fitness for 5 runs of the EA where the fitness is a weighted combination of DR(x), BS(x) and SS(x) as given in Sect. 5.1.

First, we conduct three experiments in turn, in which the fitness function uses only one of the three metrics DR, BS and SS, to understand the influence of each individual component; i.e., the weight of interest is set to 1 and the weights for the other two metrics are set to 0. The results of these initial experiments are illustrated in Fig. 2(a). The original malware had a detection rate (DR(x)) of 0.597. We see from Fig. 2(a) that we are able to create variants in which the best has a rate of 0.278 and all 10 values are less than 0.32. We also see that we are able to create variants that are only 35% behaviourally (BS(x)) similar and 64% structurally (SS(x)) similar to the original malware, indicating behavioural and structural diversity. When we combine all the functions in the fitness function as given in Sect. 5.1, we are able to generate a diverse set of mutants that are executable, evasive, and behaviourally and structurally different from the original malware, which would have a weighted fitness value of 0.8657 (with a value of 1 for BS(x) and SS(x) and 0.597 for DR(x)), as seen in Fig. 2(b). We also analyse which engines are more likely to be fooled by the new evasive mutants, as seen in Fig. 3. The original malware was detected by 37 out of the


63 antivirus engines. We analyse how many of the engines that detected the original malware were fooled by the best variant found in each of the 10 runs of the EA (Fig. 3).

Fig. 3. Analysis of the detection engines

Some detection engines (e.g., Trustlook, McAfee and F-Secure) are fooled by all 10 new variants. On the other hand, Avast Mobile and Babable were not fooled in any of the runs (i.e., they were able to detect the new malware). Overall, 14 of the engines were not fooled in any of the 10 runs, while 17 of the engines were fooled in all 10 runs of the EA.

7 Conclusion

In this paper we carried out a review of current methods for metamorphic malware analysis and detection. We also proposed a framework that could be used to tackle the detection of such malware by creating a mutation engine that provides new training examples representing potential states the malware can morph into, which can then be used to train better detectors. Furthermore, we provided a proof of concept for the mutation engine using an EA, showing that it is capable of generating a diverse set of malicious mutants. This is advantageous in that it will help in determining the effectiveness of existing IDSs. To complete the framework, future work will conduct a more thorough experimental analysis of the EA, including its parameters, and investigate how the approach generalises to other classes of malware. It will also focus on designing new machine learning (ML) methods that can be trained using the newly created metamorphic malware in order to be robust to future attacks.


References

1. Alam, S., Traore, I., Sogukpinar, I.: Current trends and the future of metamorphic malware detection. In: Proceedings of the 7th International Conference on Security of Information and Networks, SIN 2014, pp. 411–416. ACM, New York (2014)
2. Aydogan, E., Sen, S.: Automatic generation of mobile malwares using genetic programming. In: Mora, A.M., Squillero, G. (eds.) EvoApplications 2015. LNCS, vol. 9028, pp. 745–756. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16549-3_60
3. Bashari Rad, B., Masrom, M., Ibrahim, S., Ibrahim, S.: Morphed virus family classification based on opcodes statistical feature using decision tree. In: Abd Manaf, A., Zeki, A., Zamani, M., Chuprat, S., El-Qawasmeh, E. (eds.) ICIEIS 2011. CCIS, vol. 251, pp. 123–131. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25327-0_11
4. Baysa, D., Low, R.M., Stamp, M.: Structural entropy and metamorphic malware. J. Comput. Virol. 9(4), 179–192 (2013)
5. Bruschi, D., Martignoni, L., Monga, M.: Code normalization for self-mutating malware. IEEE Secur. Priv. 5(2), 46–54 (2007)
6. Cody-Kenny, B., Galván-López, E., Barrett, S.: locoGP: improving performance by genetic programming Java source code. In: Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion 2015, pp. 811–818. ACM, New York (2015)
7. Edge, K.S., Lamont, G.B., Raines, R.A.: A retrovirus inspired algorithm for virus detection & optimization. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, GECCO 2006, pp. 103–110. ACM, New York (2006)
8. Eiben, A.E., Smith, J.E.: What is an evolutionary algorithm? In: Eiben, A.E. (ed.) Introduction to Evolutionary Computing, pp. 15–35. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-662-05094-1_3
9. Fang, Y., Li, J.: A review of tournament selection in genetic programming. In: Cai, Z., Hu, C., Kang, Z., Liu, Y. (eds.) ISICA 2010. LNCS, vol. 6382, pp. 181–192. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16493-4_19
10. Forrest, S., Nguyen, T., Weimer, W., Le Goues, C.: A genetic programming approach to automated software repair. In: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, GECCO 2009, pp. 947–954. ACM, New York (2009)
11. García-Martínez, C., Lozano, M.: Local search based on genetic algorithms. In: Siarry, P., Michalewicz, Z. (eds.) Advances in Metaheuristics for Hard Optimization. Natural Computing Series, pp. 199–221. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72960-0_10
12. Griffin, K., Schneider, S., Hu, X., Chiueh, T.: Automatic generation of string signatures for malware detection. In: Kirda, E., Jha, S., Balzarotti, D. (eds.) RAID 2009. LNCS, vol. 5758, pp. 101–120. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04342-0_6
13. Kayacik, H.G., Heywood, M., Zincir-Heywood, N.: On evolving buffer overflow attacks using genetic programming. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, GECCO 2006, pp. 1667–1674. ACM, New York (2006)
14. Konstantinou, E.: Metamorphic virus: analysis and detection. Technical report, Department of Mathematics, Royal Holloway, University of London (2008)


15. Langdon, W.B., Harman, M.: Optimizing existing software with genetic programming. IEEE Trans. Evol. Comput. 19(1), 118–135 (2015)
16. Lee, J., Austin, T.H., Stamp, M.: Compression-based analysis of metamorphic malware. Int. J. Secur. Netw. 10(2), 124–136 (2015)
17. Lo, R.W., Levitt, K.N., Olsson, R.A.: MCF: a malicious code filter. Comput. Secur. 14, 541–566 (1995)
18. Martin, A., Menéndez, H.D., Camacho, D.: Genetic boosting classification for malware detection. In: IEEE Congress on Evolutionary Computation (CEC), pp. 1030–1037 (2016)
19. Periot, F.: Defeating polymorphism through code optimization. In: Virus Bulletin, pp. 142–159 (2003)
20. Sahs, J., Khan, L.: A machine learning approach to Android malware detection. In: European Intelligence and Security Informatics Conference (2012)
21. Santos, I., Brezo, F., Ugarte-Pedrero, X., Bringas, P.G.: Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inf. Sci. 231, 64–82 (2013)
22. Schultz, M.G., Eskin, E., Zadok, F., Stolfo, S.J.: Data mining methods for detection of new malicious executables. In: Proceedings of the 2001 IEEE Symposium on Security and Privacy, S&P 2001, pp. 38–49 (2001)
23. Sung, A.H., Xu, J., Chavez, P., Mukkamala, S.: Static analyzer of vicious executables (SAVE). In: Proceedings of the Annual Computer Security Applications Conference, ACSAC (2004)
24. Tabish, S.M., Shafiq, M.Z., Farooq, M.: Malware detection using statistical analysis of byte-level file content. In: Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics - CSI-KDD 2009 (2009)
25. Toderici, A.H., Stamp, M.: Chi-squared distance and metamorphic virus detection. J. Comput. Virol. 9(1), 1–14 (2013)
26. Tran, N.P., Lee, M.: High performance string matching for security applications. In: International Conference on ICT for Smart Society, pp. 1–5, June 2013
27. Vatamanu, C., Gavrilut, D., Benchea, R., Luchian, H.: Feature extraction using genetic programming with applications in malware detection. In: 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp. 224–231 (2015)
28. Vidas, T.: Contagio mobile - mobile malware mini dump (2015). http://contagiominidump.blogspot.com/2015/01/android-hideicon-malware-samples.html
29. Walenstein, A., Mathur, R., Chouchane, M.R., Lakhotia, A.: Normalizing metamorphic malware using term rewriting. In: Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation, SCAM 2006 (2006)
30. White, D.R., Arcuri, A., Clark, J.A.: Evolutionary improvement of programs. IEEE Trans. Evol. Comput. 15(4), 515–538 (2011)
31. Xu, W., Qi, Y., Evans, D.: Automatically evading classifiers: a case study on PDF malware classifiers. In: 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA. The Internet Society (2016)
32. Yusoff, M.N., Jantan, A.: A framework for optimizing malware classification by using genetic algorithm. In: Zain, J.M., Wan Mohd, W.M., El-Qawasmeh, E. (eds.) ICSECS 2011. CCIS, vol. 180, pp. 58–72. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22191-0_5

Demand Forecasting Method Using Artificial Neural Networks

Amelec Viloria1, Luisa Fernanda Arrieta Matos2, Mercedes Gaitán3, Hugo Hernández Palma4, Yasmin Flórez Guzmán5, Luis Cabas Vásquez4, Carlos Vargas Mercado4, and Omar Bonerge Pineda Lezama6

1 Universidad de La Costa, St. 58 #66, Barranquilla, Atlántico, Colombia, [email protected]
2 Universidad Simón Bolívar, Barranquilla, Colombia, [email protected]
3 Corporación Universitaria Empresarial de Salamanca – CUES, Barranquilla, Colombia, [email protected]
4 Corporación Universitaria Latinoamericana, Barranquilla, Colombia, {hhernandez,lcabas,cvargas}@ul.edu.co
5 Corporación Universitaria Minuto de Dios – UNIMINUTO, Barranquilla, Colombia, [email protected]
6 Universidad Tecnológica Centroamericana (UNITEC), San Pedro Sula, Honduras, [email protected]

Abstract. Based on a forecast, the decision maker can determine the capacity required to meet a certain forecast demand, as well as balance capacities in advance in order to avoid underuse or bottlenecks. This article proposes a procedure for forecasting demand through Artificial Neural Networks. For validation, the proposed procedure was applied in a soda trading and distribution company, where three types of products were selected.

Keywords: Forecast · Artificial Neural Networks · Big Data · Demand

1 Introduction

Historically, in the business context, process owners and senior management focus much of their concern on knowing the future state of their sales, demand, and inputs, and on everything that represents a risk or an opportunity for the management of their finances. The importance of making accurate forecasts for business management derives from this, since forecasting is one of the premises for planning, organizing, implementing, and logistically controlling a set of coordinated activities or processes so as to make the most effective use of production factors, giving priority to the most critical processes and their key activities, so that the decisions made about them generate the greatest possible positive impact [1–6].


Regardless of the classification of forecasting methods into qualitative and quantitative, and the possibility of using them in isolation, the authors consider that no single method should be relied on for correct forecasting performance; success, translated into a more accurate forecast, consists in most cases in combining qualitative and quantitative methods, since their sources and results generally complement each other [7–12].

In this context, Artificial Neural Networks (ANNs) emerge as an effective tool to solve the problem. ANNs represent one of the best-known techniques of Artificial Intelligence, according to several authors [13–18], inspired by the nature of human intelligence and its eagerness to understand and develop simple intelligent entities for building complex intelligent systems. An ANN is a mathematical model inspired by biological neural networks. Its basic unit is an elementary processor called a neuron, which computes a weighted sum of its inputs and then applies a function to obtain a signal that is transmitted to other neurons [19–21]. Through a learning algorithm, ANNs adjust their architecture and parameters so as to minimize an error function that indicates the degree of fit to the data. In the last two decades, ANNs have gained prominence in time series forecasting in a large number of areas of business management, including finance, power generation, services, medicine, environmental sciences, and materials sciences [22–24].

The universal approximation capability of ANNs for continuous functions with first and second derivatives over their whole domain has been demonstrated mathematically. In addition, several studies show that ANNs can accurately approximate various types of complex functional relationships. This last characteristic is very important for the described application, since any prediction model is expected to accurately detect the functional relationship between the variable to be predicted and other relevant factors or variables [25–27].

There are different types of neural networks, depending on the type of learning desired. The type most used in classification and prediction is the Multilayer Perceptron, which consists of neurons connected by layers, each layer with a number of associated neurons. The learning used in this type of network is error backpropagation, which aims to minimize the error between the desired output and that of the neural model over a set of already classified observations [28].
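To make the above concrete, the following minimal sketch trains a one-hidden-layer perceptron with error backpropagation on a synthetic demand series; the layer sizes, learning rate, and toy data are illustrative assumptions, not the configuration used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: predict next month's demand from the previous 4 months, scaled to [0, 1].
series = 50 + 10 * np.sin(np.arange(60) / 3) + rng.normal(0, 1, 60)
series = (series - series.min()) / (series.max() - series.min())
X = np.array([series[i:i + 4] for i in range(len(series) - 4)])
y = series[4:].reshape(-1, 1)

# One hidden layer of sigmoid units, linear output.
W1, b1 = rng.normal(0, 0.5, (4, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

lr = 0.5
for epoch in range(2000):
    h = sigmoid(X @ W1 + b1)            # forward pass
    pred = h @ W2 + b2
    err = pred - y                      # gradient of squared error w.r.t. the output
    dW2 = h.T @ err / len(X)            # backpropagate the error layer by layer
    db2 = err.mean(0)
    dh = err @ W2.T * h * (1 - h)
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

pred = sigmoid(X @ W1 + b1) @ W2 + b2
print("final MSE:", float(((pred - y) ** 2).mean()))
```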

2 Method

The proposed procedure goes through three phases that establish guiding parameters so that its users better understand its application. The proposal was motivated by the fact that, regardless of the diversity of applications of ANNs for forecasting time series and the satisfactory results obtained, no methodological tool was available that guarantees the construction of ANN models with accurate forecasts for the situation described in this research.

Phase I. Analysis. It aims to compile the necessary information for the design and application of the forecasting instrument. As a relevant element, the selection of the good to be forecasted through Pareto analysis is proposed (when necessary). Finally, the information required to apply the instrument will be compiled, which will depend on the variables considered: forecasting period, product price, client type, market segments to which the product is offered, timeliness of the supplies needed for its manufacture/offer, among others.

Phase II. Design of the forecasting instrument. As its name indicates, this phase aims to obtain the forecasting instrument (in this case, an ANN). The procedure was designed to work with multilayer perceptron ANNs, an architecture with good results in other experiments. First, the minimum requirements for testing and operation are defined, taking into account the hardware (a computer with 512 MB of RAM, a 2.7 GHz processor, and an 80 GB hard disk), the software (Weka), and the operating system (Linux). The variables to be used in forecasting the demand of the good(s) are then selected, so the variables correlated with demand have to be studied. The variables are classified as follows:

• nominal: their values represent categories that are not intrinsically ordered;
• ordinal: their values represent categories with some ordering;
• scale: their values represent ordered categories with a meaningful metric.

Given the particularities of the software, the data in Microsoft Excel must be saved in a *.csv (comma-delimited) file, after which the Arff Converter.exe application is executed. Figure 1 shows the process graphically.

Fig. 1. File treatment process
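The file treatment of Fig. 1 (Excel sheet saved as comma-delimited *.csv, then converted to Weka's ARFF format by the Arff Converter.exe application) can be approximated outside that tool. The sketch below is an illustrative equivalent; the file names and the attribute list are hypothetical.

```python
import csv

def csv_to_arff(csv_path, arff_path, relation, attributes):
    """attributes: list of (name, arff_type) pairs, e.g. ('sales', 'numeric')."""
    with open(csv_path, newline="") as src, open(arff_path, "w") as dst:
        dst.write(f"@relation {relation}\n\n")
        for name, arff_type in attributes:
            dst.write(f"@attribute {name} {arff_type}\n")
        dst.write("\n@data\n")
        reader = csv.reader(src)
        next(reader)                     # skip the header row exported from Excel
        for row in reader:
            dst.write(",".join(row) + "\n")

# Hypothetical attribute list for a demand dataset.
csv_to_arff("demand.csv", "demand.arff", "demand",
            [("month", "numeric"), ("supplies_ok", "{yes,no}"), ("demand", "numeric")])
```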


Once the data files are loaded, the following ANN parameters will be defined: GUI, AutoBuild, Debug, Decay, HiddenLayers, LearningRate, Momentum, NominalToBinaryFilter, NormalizeAttributes, NormalizeNumericClass, Reset, Seed, TrainingTime, ValidationSetSize and ValidationThreshold.

Phase III. Implementation. It aims to detect possible deviations in the results. The results of applying the RNAPM (the multilayer perceptron ANN, by its Spanish acronym) are therefore validated by determining the deviations (errors) with respect to the known demand of previous periods. The objective is to forecast the quantity demanded with a minimum margin of error. Where:

• Forecast by RNAPM: amount forecast by the neural network of the proposed system in month n.
• Forecast by other method: quantity forecast by the other method in month n.
• Actual demand: actual amount of sales provided by the company in the known period.
• RNAPM error: difference between the amount predicted by the neural network and the actual demand.
• Error of the other method: difference between the quantity predicted by the other method and the actual demand.

Once the sums of the forecast errors are obtained, the average errors are calculated. Acceptance criterion: the average error of the RNAPM must be less than the average error of the other method's forecast for the RNAPM forecasts to be valid. If any deviation is observed, it is recommended to return to step six of the procedure to rectify the designed model.
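A minimal sketch of the Phase III check, assuming hypothetical demand and forecast series: the RNAPM forecast is accepted only if its average error is below that of the reference method.

```python
import numpy as np

actual = np.array([120, 132, 118, 125, 140, 129])   # known demand (hypothetical)
rnapm = np.array([121, 130, 119, 124, 138, 131])    # neural-network forecast
other = np.array([115, 137, 110, 131, 128, 122])    # e.g. exponential smoothing

err_rnapm = np.abs(rnapm - actual).mean()           # average error of the RNAPM
err_other = np.abs(other - actual).mean()           # average error of the other method
accepted = err_rnapm < err_other                    # acceptance criterion of Phase III
print(err_rnapm, err_other, "RNAPM accepted" if accepted else "revisit the model design")
```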

3 Results

The first step of the procedure was carried out satisfactorily; its results provided a holistic view of the company's processes and of their relations with the planning process. A Pareto analysis was performed, from which the products to forecast were obtained: class a, class b and class c. The database of the Mistral® software, the one used by the entity, was taken for the historical periods from 2007 to 2014. Figures 2, 3 and 4 show the behavior of products a, b and c, respectively. Based on the work with the entity's experts, the dependent variable was selected as the historical sales level plus the recorded faults (nominal), and the independent variables were: weighting of months, the existence of similar drugs, sales status and the timeliness of supplies. The selected software assumed the incremental variable (i) as independent, and 100% of the sample was used for training with a backpropagation algorithm.


Fig. 2. Behavior of product a demand (demand plotted against time in months)

Fig. 3. Behavior of product b demand (demand plotted against time in months)

An analysis of the company's existing documentation related to the monthly marketing plans was carried out for the periods from 2010 to 2019, taking each year prior to the one analyzed as the basis for calculating the timeliness of supplies, so the year 2019 was excluded from the analysis. The professional software Weka 3.6 on Linux was used to build the RNAPM. The multilayer perceptron is a model for predicting the future values of one or more variables in the output layer. Nine neural networks were designed, trained and tested, each in three experiments. The parameters obtained, as well as the predictive values, are shown in Table 1.


Fig. 4. Behavior of product c demand (demand plotted against time in months)

Table 1. Results of the different experiments of ANN design.

Parameters                P_1      P_2      P_3      P_4      P_5      P_6      P_7      P_8      P_9
GUI                       True     True     True     True     True     True     True     True     True
Auto build                True     True     True     True     True     True     True     True     True
Debug                     False    False    False    False    False    False    False    False    False
Decay                     False    False    False    False    False    False    False    False    False
Hidden layers             20,19    18,20    20,20    20,18    19,20    18,19    19,18    18,18    19,19
Learning rate             0.4      0.3      0.3      0.5      0.5      0.4      0.3      0.5      0.4
Momentum                  0.5      0.2      0.2      0.3      0.3      0.5      0.2      0.3      0.5
Nominal to binary filter  True     True     True     True     True     True     True     True     True
Normalize attributes      True     True     True     True     True     True     True     True     True
Normalize numeric class   True     True     True     True     True     True     True     True     True
Reset                     False    False    False    False    False    False    False    False    False
Seed                      0        0        0        0        0        0        0        0        0
Training time             5×10^5   10^9     10^6     5×10^5   10^9     10^6     5×10^5   10^9     10^6
Validation set size       0        0        0        0        0        0        0        0        0
Validation threshold      10       20       15       10       20       15       10       20       15

There was little variation in the performance of each network across the training experiments conducted (small standard deviations). The P_3 model presented the smallest approximation errors. The results obtained at the end of training and prediction were verified and found to lie within an acceptable range, so the deviation in the predicted amounts was minimal. The acceptance criterion (Table 2) was met, and it was also verified that the average error of the RNAPM was always lower than the average error of the forecast made through simple exponential smoothing, which was selected following the method proposed by [29]. Once the instrument was validated, the demand for the selected products in the second semester of 2018 was forecast (Table 3). From the comparison with the historical behavior, it is foreseen that product a will behave stably with a central tendency, product b will decay in consumption with respect to the second semester of 2018, and product c will behave irregularly, forecasting a decrease in its consumption. These results show that ANNs are the best prediction technique for the presented case, an argument supported by several authors, although in other scenarios [15, 20, 21, 24, 26, 30].

Table 2. Errors in the estimations by both methods.

                         2017       2018
RNAPM error  Product 1   0.0026920  0.0028381
             Product 2   0.0000000  0.0187533
             Product 3   0.0106047  0.0002437
SES error    Product 1   0.0670756  0.2457916
             Product 2   0.4418707  3.3253832
             Product 3   2.2318796  1.7294731

Table 3. Forecasts (in physical units) for the second semester of 2018.

Months     Product 1  Product 2  Product 3
July       123118     339457     103
August     117455     504552     132
September  121763     416448     97
October    126605     407427     126
November   118333     457435     126
December   149670     504392     135

4 Conclusions

This paper proposes a procedure for forecasting demand through ANNs, whose relevance was demonstrated through a practical case in which nine models were used and the one with the smallest error was selected. It is feasible to train a multilayer perceptron artificial neural network from the main variables that influence demand, with a performance that allows its use in the decision-making process. The results obtained confirm the feasibility of using ANNs as reliable forecasting techniques and lay solid foundations for their implementation in forecasting, a criterion demonstrated in diverse studies [31–39].


References

1. Acosta, M.C., Villarreal, M.G., Cabrera, M.: Estudio de validación de un método para seleccionar técnicas de pronóstico de series de tiempo mediante redes neuronales artificiales. Ingeniería Investigación y Tecnología XIV(1), 53–63 (2013). http://www.sciencedirect.com/science/article/pii/S140577431372225X
2. Fernández Enríquez, F., de la Fé Dotres, S., Miraglia Ubals, D.: Pronóstico de las pérdidas en redes de distribución mediante redes neuronales. Energética XXVI(1), 17–21 (2005). http://rie.cujae.edu.cu/index.php/RIE/article/download/142/141
3. Lizarazo, J.M., Gómez, J.G.: Desarrollo de un modelo de redes neuronales artificiales para predecir la resistencia a la compresión y la resistividad eléctrica del concreto. Ingeniería e Investigación 27(1), 8 (2007)
4. Valarie Zeithaml, A., Parasuraman, A., Berry, L.L.: Total Quality Management Services. Diaz de Santos, Bogota (1993)
5. Carman, J.M.: Consumer perceptions of service quality: an assessment of the SERVQUAL dimensions. J. Retail. 69(Spring), 127–139 (1990)
6. Larrea, P.: Quality of Service, the Marketing Strategy. Diaz Santos, Madrid (1991)
7. Hair Jr., J.F., Anderson, R.E., Tatham, R.L., Black, W.C.: Multivariate Analysis, 5th edn. Prentice Hall, Iberia (1999)
8. Tsiniduo, M., et al.: Evaluation of the factors that determine quality in higher education: an empirical study. Qual. Assur. Educ. 18(8), 227–244 (2010)
9. Gonzalez Espinoza, O.: Quality of higher education: concepts and models. Cal. Sup. Educ. 28, 249–296 (2008)
10. Bonerge Pineda Lezama, O., Varela Izquierdo, N., Pérez Fernández, D., Gómez Dorta, R.L., Viloria, A., Romero Marín, L.: Models of multivariate regression for labor accidents in different production sectors: comparative study. In: Tan, Y., Shi, Y., Tang, Q. (eds.) DMBD 2018. LNCS, vol. 10943. Springer, Cham (2018)
11. Izquierdo, N.V., et al.: Fuzzy logic applied to the performance evaluation. Honduran coffee sector case. In: Tan, Y., Shi, Y., Tang, Q. (eds.) ICSI 2018. LNCS, vol. 10942, pp. 164–173. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93818-9_16
12. Pineda Lezama, O., Gómez Dorta, R.: Techniques of multivariate statistical analysis: an application for the Honduran banking sector. Innovare: J. Sci. Technol. 5(2), 61–75 (2017)
13. Viloria, A., Lis-Gutiérrez, J.P., Gaitán-Angulo, M., Godoy, A.R.M., Moreno, G.C., Kamatkar, S.J.: Methodology for the design of a student pattern recognition tool to facilitate the teaching-learning process through knowledge data discovery (Big Data). In: Tan, Y., Shi, Y., Tang, Q. (eds.) DMBD 2018. LNCS, vol. 10943. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93803-5_63
14. Chase, R.B., et al.: Administración de operaciones: producción y cadena de suministros. McGraw-Hill/Interamericana Editores, Bogota (2009)
15. Chen, T.-L., Su, C.-H., Cheng, C.-H., Chiang, H.-H.: A novel price-pattern detection method based on time series to forecast stock market. Afr. J. Bus. Manage. 5(13), 5188 (2011)
16. Conejo, A.J., Contreras, J., Espinola, R., Plazas, M.A.: Forecasting electricity prices for a day-ahead pool-based electric energy market. Int. J. Forecast. 21(3), 435–462 (2005)
17. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). https://doi.org/10.1023/A:1022627411411
18. Du, X.F., Leung, S.C.H., Zhang, J.L., Lai, K.K.: Demand forecasting of perishable farm products using support vector machine. Int. J. Syst. Sci. 44(3), 556–567 (2011)
19. Matich, D.J.: Redes Neuronales: Conceptos básicos y aplicaciones. In: Cátedra de Informática Aplicada a la Ingeniería de Procesos–Orientación I (2001)


20. Mercado, D., Pedraza, L., Martínez, E.: Comparación de Redes Neuronales aplicadas a la predicción de Series de Tiempo. Prospectiva 13(2), 88–95 (2015)
21. Nayak, S.C., Misra, B.B., Behera, H.S.: Impact of data normalization on stock index forecasting. Int. J. Comp. Inf. Syst. Ind. Manage. Appl. 6, 357–369 (2014)
22. Obando, J.R.: Elementos de Microeconomía. EUNED (2000)
23. Ruan, D.: Fuzzy Systems and Soft Computing in Nuclear Engineering. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-7908-1866-6
24. Sanclemente, J.C.: Las ventas y el mercadeo, actividades indisociables y de gran impacto social y económico: El aporte de Tosdal. Innovar 17(30), 160–162 (2007)
25. Sapankevych, N., Sankar, R.: Time series prediction using support vector machines: a survey. IEEE Comput. Intell. Mag. 4(2), 24–38 (2009)
26. Swanson, N., White, H.: Forecasting economic time series using flexible versus fixed specification and linear versus nonlinear econometric models. Int. J. Forecast. 13(4), 439–461 (1997)
27. Toro, E.M., Mejia, D.A., Salazar, H.: Pronóstico de ventas usando redes neuronales. Scientia et Technica 10(26), 25–30 (2004)
28. Villada, F., Muñoz, N., García, E.: Aplicación de las Redes Neuronales al Pronóstico de Precios en Mercado de Valores. Información Tecnológica 23(4), 11–20 (2012)
29. Vitez, O.: Cuáles se consideran los principales indicadores económicos (2017). https://pyme.lavoztx.com/cules-se-consideran-los-principales-indicadoreseconmicos-9641.html. Accessed 7 Dec 2017
30. Wen, Q., Mu, W., Sun, L., Hua, S., Zhou, Z.: Daily sales forecasting for grapes by support vector machine. In: International Conference on Computer and Computing Technologies in Agriculture, pp. 351–360 (2013)
31. Wu, Q., Yan, H.S., Yang, H.B.: A forecasting model based support vector machine and particle swarm optimization. In: 2008 Workshop on Power Electronics and Intelligent Transportation System, pp. 218–222 (2008)
32. Zhang, G.P.: Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50, 159–175 (2003)
33. Departamento Administrativo Nacional de Estadística (DANE): Importaciones colombianas (2019). https://www.dane.gov.co/index.php/comercio-exterior/importaciones
34. Jain, M., Verma, C.: Adapting k-means for clustering in big data. Int. J. Comput. Appl. 101(1), 19–24 (2014)
35. Comisión Económica para América Latina y el Caribe (CEPAL): Visión agrícola del TLC entre Colombia y Estados Unidos: preparación, negociación, implementación y aprovechamiento. Serie Estudios y Perspectivas, vol. 25, p. 87 (2013)
36. Henao-Rodríguez, C., Lis-Gutiérrez, J.-P., Gaitán-Angulo, M., Malagón, L.E., Viloria, A.: Econometric analysis of the industrial growth determinants in Colombia. In: Wang, J., Cong, G., Chen, J., Qi, J. (eds.) ADC 2018. LNCS, vol. 10837, pp. 316–321. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-92013-9_26
37. Lis-Gutiérrez, J.-P., Gaitán-Angulo, M., Henao, L.C., Viloria, A., Aguilera-Hernández, D., Portillo-Medina, R.: Measures of concentration and stability: two pedagogical tools for industrial organization courses. In: Tan, Y., Shi, Y., Tang, Q. (eds.) ICSI 2018. LNCS, vol. 10942, pp. 471–480. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93818-9_45
38. Viloria, A.: Commercial strategies providers pharmaceutical chains for logistics cost reduction. Indian J. Sci. Technol. 8(1), Q16 (2016)
39. Viloria, A., Gaitan-Angulo, M.: Statistical adjustment module advanced optimizer planner and SAP generated the case of a food production company. Indian J. Sci. Technol. 9(47) (2016). https://doi.org/10.17485/ijst/2016/v9i47/107371

Analyzing and Predicting Power Consumption Profiles Using Big Data

Amelec Viloria1(&), Ronald Prieto Pulido2, Jesús García Guiliany2, Jairo Martínez Ventura3, Hugo Hernández Palma3, José Jinete Torres4, Osman Redondo Bilbao5, and Omar Bonerge Pineda Lezama6

1 Universidad de La Costa, St. 58 #66, Barranquilla, Atlántico, Colombia
[email protected]
2 Universidad Simón Bolívar, Barranquilla, Colombia
{rprieto1,jesus.garcia}@unisimonbolivar.edu.co
3 Corporación Universitaria Latinoamericana, Barranquilla, Colombia
{academico,hhernandez}@ul.edu.co
4 Universidad Libre Seccional Barranquilla, Barranquilla, Colombia
[email protected]
5 Corporación Politécnico de La Costa Atlántica, Barranquilla, Colombia
[email protected]
6 Universidad Tecnológica Centroamericana (UNITEC), San Pedro Sula, Honduras
[email protected]

Abstract. This paper presents a prediction study for two buildings located at the University of Mumbai in India, in order to determine a method that fits the forecasts of organization expenses. The Euclidean distance (ED), the mean absolute error (MAE), the mean absolute percentage error (MAPE) and the root of the mean quadratic error (RMQE) are used to evaluate the predictive capability of the models supported by each statistical method; according to this assessment, the best predictions come from the ARIMA method.

Keywords: Prediction · Power consumption · Big data · ARIMA

1 Introduction

Since its beginnings, electricity has been an essential resource for the development of industrial, commercial and economic activities [1, 2]. In recent decades, the number of devices connected to the network has grown, thus increasing the demand for Electric Power (EP) [3, 4]. The importance of predicting the Electric Power Consumption (EPC) stems from the difficulty of storing EP [5]. Thus, predicting the future behavior of the EPC becomes important, since this prior knowledge speeds up the management and planning of economic, human and energy resources [6, 7]. EPC prediction did not receive the needed attention until the 1970s, when the oil crisis generated uncertainty; before that, extrapolation of the trend was sufficient to explain the increase in EP demand [8]. From that time onwards, the development of more useful and precise techniques for EPC prediction began [8].

At present, there are statistical methods, such as clustering, that allow analyzing and classifying the EPC, identifying patterns and anomalies in EPC profiles [9]. This topic is studied in [10], which proposes an index associated with changes in the EPC profile that would minimize costs and maintenance. As another example, the model in [11] simulates the integration of distributed resources for buildings in the TRNSYS software, resulting in a model capable of predicting the contribution of each technology with climatic variables as inputs. Statistical techniques of varying difficulty and precision can be applied to EPC prediction. It is worth mentioning [12], where univariate and simple methods such as moving average and exponential smoothing are applied. Another alternative is to address predictions in a multivariate way, applying multiple linear regression, as in [13, 14], using climate variables to explain the EPC. Time series techniques, traditionally applied in economics, are adapted and executed by means of autoregressive integrated moving average (ARIMA) processes applied to EPC prediction [15, 16]. This work aims to use Big Data to predict the EPC in advance for a set of buildings.

2 Method

2.1 Data

The data collected for the analysis and prediction of EPC profiles correspond to meteorological measurements and EPC records from two buildings of the University of Mumbai, which from now on will be called UPS and UPV. The UPS has an intelligent metering system that is a pioneer in the region and has been studied, structured and implemented previously in [17]. This research is preceded by studies on the development of electric networks towards the smart grid and new measurement technologies [18], as well as by publications on models for optimizing the infrastructure, planning and communication scalability of these new technologies [19, 20]. As a result of these works, the building load profile was obtained, whose database contains 25,909 values (ranging from March 14 to December 8, 2018). The EPC of the UPS building is registered by a DiMET3 GSM CT three-phase meter every 15 min, which involves managing a large amount of information and therefore requires an adequate communication infrastructure providing storage capacity and security for the smart metering system [15]. Monitoring through the Mr. DiMS central management system covers: voltages, currents, frequency, demand, power factor, and active and reactive energy. In turn, the UPV building has a total of 84,649 records (from July 1, 2014 to November 27, 2018).

The meteorological data of the UPS building are acquired by the meteorological station of the Universidad Politécnica Salesiana in Cuenca/INER, whose values are recorded hourly, providing the following variables for the UPS building: average global radiation, average temperature, average humidity, average barometric pressure, wind speed and wind direction. In the case of the UPV, temperature and humidity are also considered. The available information was adapted to the study requirements: the consumption recorded every 15 min was averaged hourly to obtain an hourly EPC, also identifying the days, months and working days, using Excel for the whole process. Meteorological data were matched to those of the EPC. Table 1 shows the characteristics of the data available for the UPS. The number of variables available for the UPV is smaller than for the UPS, as shown in Table 2.

Table 1. Variables of the UPS building.

Description                     Type of variable  Category     Unit
Date                            Qualitative       Record       Day, month, year
Time                            Quantitative      Record       Hours, minutes
Day                             Qualitative       Categorical  U
Labor                           Qualitative       Categorical  U
Month                           Qualitative       Categorical  U
Voltage                         Quantitative      Numerical    V
Current                         Quantitative      Numerical    A
Power                           Quantitative      Numerical    W
Varh                            Quantitative      Numerical    Q
Vah                             Quantitative      Numerical    S
FP                              Quantitative      Numerical
Demand.max                      Quantitative      Numerical    W
Energy                          Quantitative      Numerical    Wh
Average barometric pressure    Quantitative      Numerical    hPa
Average relative humidity       Quantitative      Numerical    %
Average global solar radiation  Quantitative      Numerical
Average air temperature         Quantitative      Numerical    °C
Wind direction                  Quantitative      Numerical    º
Wind speed                      Quantitative      Numerical    m/s

Table 2. Variables of the UPV building.

Description        Type of variable  Category     Unit
Date               Qualitative       Record       Day, month, year
Time               Quantitative      Record       Hours, minutes
Day                Qualitative       Categorical  U
Labor              Qualitative       Categorical  U
Month              Qualitative       Categorical  U
Power              Quantitative      Numerical    W
Energy             Quantitative      Numerical    Wh
Relative humidity  Quantitative      Numerical    %
Air temperature    Quantitative      Numerical    °C
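The adaptation just described (15-minute records averaged into hourly EPC values and matched with hourly weather data) was done in Excel; an equivalent sketch in pandas is shown below, with placeholder file and column names.

```python
import pandas as pd

# 15-minute consumption records; 'ts' is the timestamp column (placeholder names).
power = pd.read_csv("ups_power.csv", parse_dates=["ts"]).set_index("ts")
weather = pd.read_csv("ups_weather.csv", parse_dates=["ts"]).set_index("ts")

hourly_power = power["power_kw"].resample("1H").mean()  # average the four 15-min values per hour
hourly = weather.join(hourly_power, how="inner")        # match weather rows to the hourly EPC
hourly["labor"] = hourly.index.dayofweek < 5            # working-day flag, as in the dataset
print(hourly.head())
```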


3 Results

The ARIMA model is applied in the SPSS program, following the Box-Jenkins method, whose development follows the concepts mentioned in the introduction. The information required for estimating the model and predicting the EPC profile is imported from the data segmented in Excel [21, 22].

3.1 Identification

With the selected data, the day and time are defined for each imported record. Then, the autocorrelation functions are requested for estimating the orders of the parameters p, d and q. Once the correlograms are obtained, they are analyzed [23, 24]. In the two cases under study, one differentiation was applied to transform the series into a stationary series, obtaining the following results:

• The UPS autocorrelation function shows a behavior decreasing towards zero, with its first values significant. The partial autocorrelation function, on the other hand, shows just one high lag, suggesting that the estimated model is composed of first-order values for p, d and q.
• Reviewing the partial autocorrelation function of the analyzed series of the UPV building, there is only one significant coefficient and the overall behavior is not pronounced, which makes it difficult to match the function to a theoretical pattern. For this reason, orders varying from 1 to 3 are proposed for p and q, while a single integration order d is recommended.
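The identification step can be reproduced with statsmodels: difference the series once and inspect the correlograms. The series file is a placeholder; the closing comment reflects the UPS case described above.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# 'epc' stands in for the hourly consumption series prepared earlier.
epc = np.loadtxt("epc_hourly.txt")
diffed = np.diff(epc)                 # one differentiation to approach stationarity

print("ACF :", np.round(acf(diffed, nlags=10), 2))
print("PACF:", np.round(pacf(diffed, nlags=10), 2))
# A decaying ACF with a single significant PACF lag suggests p = d = q = 1, as for the UPS.
```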

3.2 Model Estimation

The model estimation results from the approximation of the autocorrelation functions to the theoretical patterns mentioned in the introduction. In the case of the UPS, the autocorrelation function decreases exponentially without vanishing, while the partial autocorrelation function shows one high coefficient with the remaining lags not significant, so AR(1) and MA(1) models are estimated [25–27]. From the autocorrelation functions of the EPC series of the UPV, an exponential decrease of the autocorrelation function is identified, while the partial autocorrelation function shows that only its first value is significant; therefore AR(1) and MA(1) models are suggested. It should be noted that the estimated model is one option for predicting the EPC profile of the studied buildings, since other valid options may exist.

3.3 Model Validation

To analyze a prediction model, the SPSS software requires establishing the orders of the seasonal and non-seasonal parts of ARIMA, each composed of autoregressive, moving-average and integration factors. The estimated model is then adjusted to obtain good results. Table 3 shows the statistics for the ARIMA (1,0,1)(0,1,1) model, which describes an ARMA in the non-seasonal part and an SMA with one differentiation in the seasonal part. The statistics for the UPS indicate that the model has a good fit, since the R-squared values obtained promise a good explanation of the EPC profile. In addition, the significance level "sig" is greater than 0.05, indicating that the residual errors are random.

Table 3. Statistics of the ARIMA (1,0,1)(0,1,1) prediction model for UPS.

Model               Number of   R        Stationary  Ljung-Box Q(18)          Number of
                    predictors  squared  R squared   Statistics  DF    Sig.   outliers
Power [kW] Model_1  0           0.742    0.992       7.157       15    0.96   0

The ARIMA (0,0,0)(1,1,1) prediction model for a building of the UPV specifies an ARMA in the seasonal part; its statistics are shown in Table 4. The results indicate a good fit given the value of R squared, while the significance level "sig" is less than 0.05, showing that the residual errors are not random.

Table 4. Statistics of the ARIMA (0,0,0)(1,1,1) prediction model for UPV.

Model               Number of   R        Stationary  Ljung-Box Q(18)          Number of
                    predictors  squared  R squared   Statistics  DF    Sig.   outliers
Power [kW] Model_1  0           0.15     0.93        10098.045   16    0.0    0

3.4 Prediction

In order to predict the EPC profile of either building in SPSS, it is necessary to set the parameters of the analyzed and accepted model. In addition, the program requires that the prediction period and time be established; see Figs. 1 and 2.


Fig. 1. Prediction of the EPC profile of a working day at the UPS building (power in kW vs. hours). The blue line represents the prediction and the red line the actual profile. (Color figure online)

Fig. 2. Prediction of the EPC profile in a working day at UPV (power in kW vs. hours)
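An equivalent of the SPSS prediction run can be sketched with statsmodels' SARIMAX. The (1,0,1)(0,1,1) orders follow Table 3; the daily seasonal period of 24 h and the input file are assumptions made for illustration.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

epc = np.loadtxt("epc_hourly.txt")          # placeholder hourly EPC series
model = SARIMAX(epc, order=(1, 0, 1), seasonal_order=(0, 1, 1, 24))
result = model.fit(disp=False)

forecast = result.forecast(steps=24)        # predict the EPC profile of the next day
print(np.round(forecast, 1))
```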

3.5 Evaluation

The models supported by each statistical method (EM) are evaluated in predicting the EPC profiles of each building. The results reveal the prediction capacity of each EM, supported by the calculation of the statistical errors of each prediction. In the tables of prediction errors, the abbreviations refer to: Winter's method (WM), simple linear regression (SLR) and multiple linear regression (MLR). For the UPS, the forecast covers the first 7 days of December 2018, so the prediction results run from Friday, December 1 through Thursday, December 7, 2018. Winter's method yields a prediction with some similarity to the behavior of the real EPC profile, but with a noticeable difference between the values. The simple linear regression predicts, according to its variable, a profile that is far from explaining the EPC profile, as does the MLR. The ARIMA model thus shows the best prediction, and the prediction errors in Table 5 confirm it.

Table 5. Profile prediction errors for Friday, December 1, 2018.

Statistical prediction errors  WM        SLR      MLR     ARIMA
ED [kW]                        112.3633  49.354   47.696  8.0452
MAE [kW]                       19.8694   7.5785   8.275   1.3963
MAPE [%]                       133.1254  84.254   85.447  10.785
RMQE [kW]                      22.4587   10.785   9.747   1.5236

For the UPV, predictions of the EPC profiles from Friday, July 1 to Thursday, July 7, 2016 are required. On this occasion, the same EMs used for the UPS are examined. The results show a good fit for Winter's method followed by the ARIMA model, while the SLR and MLR only provide rough approximations (Table 6).

Table 6. EPC profile prediction errors for Friday, July 1, 2016.

Statistical prediction errors  WM       SLR       MLR      ARIMA
ED [kW]                        34.1748  231.9654  192.258  96.785
MAE [kW]                       6.0478   42.752    32.759   11.369
MAPE [%]                       14.236   101.987   71.963   11.3752
RMQE [kW]                      7.0789   8.852     39.1458  18.7952

The errors shown in Table 6 indicate that the best prediction for this day is made by Winter's method, while the ARIMA EM shows a certain deficiency.
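The four error measures reported in Tables 5 and 6 reduce to a few lines of code; the profile arrays below are hypothetical stand-ins for one predicted and one actual day.

```python
import numpy as np

actual = np.array([310., 305., 298., 420., 515., 480.])  # hypothetical hourly EPC (kW)
pred = np.array([300., 310., 290., 400., 500., 470.])    # hypothetical model output

ed = np.linalg.norm(pred - actual)                        # Euclidean distance (ED)
mae = np.abs(pred - actual).mean()                        # mean absolute error (MAE)
mape = 100 * (np.abs(pred - actual) / actual).mean()      # mean absolute percentage error (MAPE)
rmqe = np.sqrt(((pred - actual) ** 2).mean())             # root of the mean quadratic error (RMQE)
print(ed, mae, mape, rmqe)
```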


4 Conclusions

The structure of the study for predicting the electric power consumption (EPC) profile in the buildings involved allowed evaluating the efficiency of the application of statistical methods (EM) with different models on different days. The descriptive analysis of the data, through the analysis of variance and the verification of the significance level, made it possible to characterize the behavior of the power throughout the week (including Saturday and Sunday), as well as to exclude, by factor analysis, the hPa variable (barometric pressure). The ARIMA models present values of the determination coefficient R2 greater than 0.8, which, together with the analysis of the root of the mean quadratic error, guarantees a good fit in the prediction. A significance level (p) greater than 0.05 in an ARIMA model suggests that it is more likely to be suitable for making predictions, although this does not ensure that the remaining models, with a significance level below 0.05, are invalid. The predictions made with the ARIMA models for the UPS were fully effective and their evaluation showed low prediction errors; likewise, the ARIMA EM stands out in most of the predictions for the UPV. The daily prediction of an EPC profile can be executed using the Winter model or an ARIMA model, as confirmed by the results of the fits in the final part of this study. For weekly prediction, the margin of error increases, partly due to the behavior of the EPC profile. The application of SLR and/or MLR for daily or weekly prediction through climate variables is deficient, as explained by the resulting low values of the determination coefficient. Therefore, for predicting the EPC profile of either of the two institutions or of other consumers, the Winter method and/or the ARIMA model should be executed, since these techniques guarantee a prediction with a good fit, while SLR and MLR are not suitable alternatives due to the weak relationship of the EPC with climate variables.

References 1. Bradley, P., Mangasarian, O.: Feature selection via concave minimization and support vector machines. In: Shavlik, J. (ed.) Machine Learning, pp. 82–90. ICML, San Francisco (1998) 2. Hu, C., Du, S., Su, J., et al.: Discussion on the ways of purchasing and selling electricity and the mode of operation in China’s electricity sales companies under the background of new electric power reform. Power Netw. Technol. 40(11), 3293–3299 (2016) 3. Xue, Y., Lai, Y.: The integration of great energy thinking and big datas thinking: Big data and electricity big data. Power Syst. Autom. 40(1), 1–8 (2016) 4. Wang, Y., Chen, Q., Kang, C., et al.: Clustering of electricity consumption behaviour dynamics toward big data applications. IEEE Trans. Smart Grid 7(5), 2437–2447 (2017) 5. Rong, L., Guosheng, F., Weidai, D.: Statistical Analysis and Application of SAS. China Machine Press, Beijing (2011)


6. Ozger, M., Cetinkaya, O., Akan, O.B.: Energy harvesting cognitive radio networking for IoT-enabled smart grid. Mob. Netw. Appl. 23(4), 956–966 (2017) 7. Isasi, P., Galván, I.: Redes de Neuronas Artificiales. Un enfoque Práctico, ISBN 8420540250. Pearson (2004) 8. Mangasarian, O.: Arbitrary-norm separating plane, Tech. rep. 97-07, Computer Science Department, University Wisconsin, Madison (1997) 9. Bradley, P., Fayyad, U., Mangasarian, O.: Mathematical programming for data mining: formulations and challenges. Informs J. Comput. 11, 217–238 (1999) 10. Rahmani, A.M., Liljeberg, P., Preden, J., Jantsch, A.: Fog Computing in the Internet of Things. Springer, New York (2018). ISBN 978-3-319-57638-1, ISBN 978-3-319-57639-8 (eBook) 11. Gangurde, H.D.: Feature selection using clustering approach for big data, Int. J. Comput. Appl. (0975–8887) Innovations and Trends in Computer and Communication Engineering (ITCCE), pp. 1–3 (2014) 12. Abualigah, L.M., Khader, A.T., Al-Beta, M.A., Alomari, O.A.: Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering. Expert Syst. Appl. 84, 24–36 (2017) 13. Sanchez, L., Vásquez, C., Viloria, A., Cmeza-estrada: Conglomerates of latin American countries and public policies for the sustainable development of the electric power generation sector. In: Tan, Y., Shi, Y., Tang, Q. (eds.) DMBD 2018. LNCS, vol. 10943. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93803-5_71 14. Sánchez, L., Vásquez, C., Viloria, A., Rodríguez Potes, L.: Greenhouse gases emissions and electric power generation in latin American countries in the period 2006–2013. In: Tan, Y., Shi, Y., Tang, Q. (eds.) DMBD 2018. LNCS, vol. 10943. Springer, Cham (2018). https://doi. org/10.1007/978-3-319-93803-5_73 15. Perez, R., et al.: Fault diagnosis on electrical distribution systems based on fuzzy logic. In: Tan, Y., Shi, Y., Tang, Q. (eds.) ICSI 2018. LNCS, vol. 10942, pp. 174–185. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93818-9_17 16. Suárez, O.M.: Application of the factorial analysis to the investigation of markets. case of study. Sci. Tech. 3(35), 281–286 (2007) 17. Bin Mohamad, I., Usman, D.: Standardization and its effects on K-means clustering algorithm. Res. J. Appl. Sci. Eng. Technol. 6(17), 3299–3303 (2013) 18. Carrasco, Á.: Explicando puntaje Z. Tripod.com (2003). http://aathosc.tripod.com/ PuntajeZ22.htm. Accessed 06 Dec 2017 19. Silva, V., Jesús, A.: Indicators systems for evaluating the efficiency of political awareness of rational use of electricity. In: Advanced Materials Research, vol. 601, pp. 618–625. Trans Tech Publications (2013) 20. Peralta, A., Inga, E., Hincapié, R.: Optimal scalability of FiWi networks based on multistage stochastic programming and policies. J. Opt. Commun. Netw. 9(12), 1172 (2017) 21. Ramón, P., Vásquez, C., Viloria, A.: An intelligent strategy for faults location in distribution networks with distributed generation. J. Intell. Fuzzy Systems Preprint, 1–11 (2019) 22. Gonen, T.: Electric Power Distribution System Engineering, vol. II. McGraw-Hill, Sacramento (1986) 23. Ghia, A., Rosso, A.: Análisis de respuesta de la demanda para mejorar la eficiencia de sistemas eléctricos, 2nd edn. Camara Argentina de la Construccion, Buenos Aires (2009) 24. Pérez Arriaga, J.I., Sánchez de Tembleque, L.J., Pardo, M.: La gestión de la demanda de electricidad, vol. I, no. I (2005)


25. Microsoft: Microsoft Excel 2016, Microsoft (2016). https://products.office.com/es/excel. Accessed 03 Aug 2017 26. Castañeda, M.B., Cabrera, A., Navarro, Y., Vries, W.: Procesamiento de Datos y Análisis Estadístico usando SPSS, vol. 53, no. 9. Porto Alegre (2010) 27. MathWorks: MathWorks America Latina (2017). https://la.mathworks.com/help/matlab/ index.html. Accessed 25 Aug 2017

Explainable Artificial Intelligence for Cyberspace

A New Intrusion Detection Model Based on GRU and Salient Feature Approach

Jian Hou, Fangai Liu(&), and Xuqiang Zhuang

Shandong Normal University, Shandong 250014, China
[email protected]

Abstract. The Gated Recurrent Unit (GRU) is a variant of the recurrent neural network, like the LSTM network. Compared with the plain RNN, both networks achieve higher accuracy on sequence problems, and both have proven effective in a variety of machine learning tasks such as natural language processing, text classification and speech recognition. In addition, the network unit structure of the GRU is simpler than the LSTM unit structure, which facilitates model training. The NSL-KDD dataset, the replacement of KDD Cup 99, is still one of the standard datasets for measuring the effectiveness of intrusion detection models. In order to reduce the feature dimension and exploit prior knowledge of computer networks, a GRU intrusion detection method based on salient features (SF-GRU) is proposed. SF-GRU selects the distinctive features associated with each intrusion form and uses a GRU network on the selected features to improve detection efficiency. The experimental results show that, compared with traditional deep learning methods, this proposal achieves higher accuracy and computational efficiency.

Keywords: Intrusion detection · Gate Recurrent Unit · Salient feature selection · Prior knowledge

1 Introduction

Deploying a network intrusion detection system (NIDS) at key nodes of the network is one of the important means to guarantee the security of cyberspace. At present, there are two commonly used network intrusion detection technologies: misuse detection based on prior knowledge, and anomaly detection based on network behavior. The former mainly targets intrusions with known attack patterns and cannot judge intrusions whose patterns are not covered by prior knowledge; anomaly detection based on network behavior distinguishes normal from anomalous data by analyzing characteristics of the network flow, thus detecting intrusion behavior. Because of the inherent advantages of the latter technology, more and more scholars have begun to study network intrusion detection from this perspective.

As one of the important technologies in the field of artificial intelligence, machine learning keeps expanding its application fields. The research on network anomaly



behavior detection combined with machine learning technology has also received much attention. However, traditional machine learning is often inefficient in dealing with large-scale data. With the continuous development of Internet applications, network bandwidth, transmission rates and the number of application characteristics keep increasing, and so do the efficiency requirements on network intrusion detection technology. Compared with traditional machine learning, deep learning can handle higher-dimensional learning and more complex computing [1]. In recent years, the advantages of deep learning in dealing with large-scale, high-dimensional feature data have been recognized by many scholars, and deep learning has also been applied to the study of cyberspace security [3]. Many scholars have studied the advantages of deep learning over traditional machine learning in network intrusion detection based on different algorithms. For network flow feature selection, DBN parameter tuning and pre-training can improve detection efficiency and reduce the false positive rate [5]; regarding the training feature dimension, [6] applied the LSTM algorithm both with all features and with a subset of features, showing that LSTM has advantages over other machine learning algorithms when trained on a feature subset. Also as a variant of RNN, GRU networks improve learning efficiency and detection accuracy to a certain extent compared with LSTM networks [11].

According to the time-series characteristics of network data, this paper uses a GRU network to train on the NSL-KDD dataset and test the generated model. At the same time, in order to reduce the feature dimension without losing feature information, this paper proposes the SF-GRU (GRU based on Salient Features) algorithm, a GRU deep learning algorithm based on feature selection. By selecting the salient features of the relevant intrusion behaviors from the feature data, the input feature dimension is further reduced, and the complexity of the deep learning algorithm is reduced without affecting the evaluation indices.

2 Related Work

The Gated Recurrent Unit (GRU) is an effective tool for processing sequence data; especially when learning hidden features of sequence data, GRU readily achieves reasonable classification. The GRU network is a kind of Recurrent Neural Network (RNN) [7]. LSTM solves the vanishing-gradient problem that RNNs often face on long training sequences, but in practical applications LSTM has a complex internal structure and a large number of parameters, which slows the convergence of recurrent neural network training. Cho et al. proposed the Gated Recurrent Unit (GRU), which has fewer model parameters and can still propagate long-term information [8]. Compared with LSTM, GRU has fewer parameters, better performance and faster convergence. The GRU network simplifies the LSTM memory unit and uses two gates (reset gate, update gate) to achieve selective memory of long-term information in discrete time series. To facilitate understanding of the GRU algorithm, Fig. 1 describes the GRU forward computation unit.


Fig. 1. GRU forward calculation unit

The equations of state for the two gates and the GRU memory cell in the figure are:

r_t = \sigma(W_r \cdot x_t + U_r \cdot h_{t-1} + b_r)    (1)

z_t = \sigma(W_z \cdot x_t + U_z \cdot h_{t-1} + b_z)    (2)

\hat{h}_t = \tanh(W \cdot x_t + r_t \odot (U \cdot h_{t-1}) + b)    (3)

h_t = (1 - z_t) \odot \hat{h}_t + z_t \odot h_{t-1}    (4)

where \sigma is the sigmoid function; r_t is the output of the reset gate, controlling the effect of the previous hidden-layer output h_{t-1} on time t; z_t is the output of the update gate, used to determine the acceptance of the current input, similar to the input gate in LSTM; z_t enables the gradient to propagate effectively, alleviating gradient vanishing. The model is tested separately for each intrusion method, and the final output is a binary classification problem, so logistic regression is used for classification. The network error function is the cross-entropy loss. In order to achieve fast convergence of gradient descent in the GRU back-propagation process, adaptive moment estimation (Adam) is adopted as the optimization algorithm.
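A direct numpy transcription of Eqs. (1)–(4) is given below to illustrate one forward step of a GRU unit; the input and hidden dimensions and the random weights are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 122, 64          # e.g. an encoded NSL-KDD record and a hidden size (illustrative)

Wr, Ur, br = rng.normal(0, .1, (n_hid, n_in)), rng.normal(0, .1, (n_hid, n_hid)), np.zeros(n_hid)
Wz, Uz, bz = rng.normal(0, .1, (n_hid, n_in)), rng.normal(0, .1, (n_hid, n_hid)), np.zeros(n_hid)
W,  U,  b  = rng.normal(0, .1, (n_hid, n_in)), rng.normal(0, .1, (n_hid, n_hid)), np.zeros(n_hid)
sigmoid = lambda a: 1 / (1 + np.exp(-a))

def gru_step(x_t, h_prev):
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)          # reset gate, Eq. (1)
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)          # update gate, Eq. (2)
    h_hat = np.tanh(W @ x_t + r * (U @ h_prev) + b)   # candidate state, Eq. (3)
    return (1 - z) * h_hat + z * h_prev               # new hidden state, Eq. (4)

h = np.zeros(n_hid)
for x in rng.normal(0, 1, (5, n_in)):                 # a toy sequence of 5 records
    h = gru_step(x, h)
print(h.shape)
```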

3 Algorithm and Evaluation Index

The deep learning algorithm used in this paper is a recurrent neural network (RNN) with gated recurrent units (GRU). In order to reduce the relative complexity of the algorithm, this paper proposes different input features for different intrusion types.

3.1 Algorithm Model

It is well known that the KDD dataset contains redundant data. Even though NSL-KDD is greatly optimized with respect to KDD'99, this redundancy still exists, especially with respect to the detection of a specific intrusion type. According to the packet characteristics of computer networks, different types of intrusion packets have different network features. For example, the four features most relevant to DOS intrusion are "service", "flag", "src_bytes", and "count" [9], although testing with only these four features is not the most effective. Therefore, based on prior knowledge of computer networks and existing research results, this paper selects different network features for different intrusion types, feeds them into GRU networks for detection, and finally realizes binary intrusion detection through the established recognition model.

The classification model proposed in this paper consists of four GRU network identification units (Fig. 2), each corresponding to the detection of one intrusion behavior. In the training phase, the training data is preprocessed and input into each GRU identification unit, and each unit is trained individually according to its label. This process can be parallelized, meaning that the four subunits are trained separately at the same time, reducing the overall training time. According to the characteristics of the NSL-KDD dataset, in order to solve the data imbalance problem and improve recognition accuracy, the input data of each GRU unit is preprocessed differently: the input of the DOS attack recognition unit is all NSL-KDD records after feature screening; the input of the Probe attack recognition unit is all feature-screened data records with DOS attacks removed; the R2L attack recognition unit is trained twice: first, the preprocessed data with negative labels is under-sampled so that R2L attack data accounts for about 40%, the model is saved after training, and then the whole dataset is used for transfer learning to train the final model; for the U2R attack recognition unit, the preprocessed positive samples are over-sampled with the SMOTE algorithm, synthesizing new samples to alleviate the class imbalance problem [12].
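The over-sampling step for the U2R unit can be sketched as a minimal nearest-neighbour SMOTE over numeric features; this is an illustrative re-implementation under assumed data shapes, not the authors' exact routine.

```python
import numpy as np

def smote(X_minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating points toward one of their k nearest neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        nn = np.argsort(d)[1:k + 1]                    # skip the point itself
        j = rng.choice(nn)
        gap = rng.random()
        out.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(out)

u2r = np.random.default_rng(1).normal(size=(52, 10))   # stand-in for the scarce U2R records
augmented = np.vstack([u2r, smote(u2r, n_new=500)])    # over-sampled positive class
print(augmented.shape)                                  # (552, 10)
```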

Fig. 2. SF-GRU intrusion detection model


In the recognition stage, each record to be identified passes through the four identification units in turn, and its classification is recorded according to whether it is detected as intrusion data, realizing binary detection for each intrusion behavior. If the model is used for binary detection of intrusion behavior as a whole, the detection values of the four GRU identification units are combined by means of an "or" operation: any record detected as an intrusion by any unit is determined to be an intrusion.
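A sketch of the "or" combination of the four unit outputs, with hypothetical per-unit decisions:

```python
import numpy as np

# Hypothetical per-unit binary decisions for 6 records (1 = flagged as intrusion).
dos, probe, r2l, u2r = (np.array(v) for v in
                        ([1, 0, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0],
                         [0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]))
overall = dos | probe | r2l | u2r     # "or" combination: any unit firing marks an intrusion
print(overall)                        # [1 1 1 0 1 0]
```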

3.2 Feature Selection

From the observation and analysis of the NSL-KDD data, each type of intrusion is reflected differently in the packet records, with different related feature elements. Therefore, when analyzing a certain type of attack, only the salient feature elements of that attack type, as reflected in the packets, are selected.

1. Denial of Service (DOS). DOS attacks are common attacks that cause server crashes. The common attack means is to send a large number of requests to the target server in a short time, occupying a large amount of server resources and preventing the server from providing services. Therefore, the salient feature elements of a DOS attack must contain basic features such as service type, connection status, and data volume per unit time on the target host. Existing research results show that the salient feature elements for DOS attacks are the 11 features 3–6, 8, 23, 29, 36 and 38–40 in Table 2.

2. Probe. Probe attacks include IP sweep, nmap and so on, which are network scanning attacks or methods, so they are highly correlated with network protocols, service types, and attack sources. Statistics show that the salient feature elements for Probe attacks are the 14 features 2–6, 12, 29, 32–37 and 40 in Table 2.

3. Remote to local (R2L) & User to root (U2R). The amount of data for these two types of intrusion is relatively small, and the amount of information reflected in the feature elements is also scarce. According to the statistical analysis of the training dataset, the salient feature elements used for R2L intrusion are the 14 features 1, 3, 5, 6, 10, 24, 32, 33, 35–39 and 41 in Table 2; for U2R intrusion, the salient feature elements are the 8 features 3, 5, 6, 10, 14, 17, 32 and 33 in Table 2.

3.3 Feature Data Preprocessing

In the NSL-KDD dataset, there are three types of feature data: Boolean, symbolic, and continuous. Among them, Boolean data and percentage-type continuous data can be used directly for training, such as features 25–31 in Table 2.


For symbolic features, this paper uses one-hot coding to map them to multidimensional Boolean vectors. For example, in "protocol_type", tcp maps to [1,0,0], udp maps to [0,1,0], and icmp maps to [0,0,1]. Similarly, "service" maps to a 70-dimensional Boolean vector and "flag" maps to an 11-dimensional Boolean vector. The remaining continuous feature values are updated with the following formula:

x_i = \frac{x_i - \min x_i}{\max x_i - \min x_i}    (5)
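The encoding and normalization just described, sketched on two illustrative NSL-KDD fields:

```python
import numpy as np

protocols = ["tcp", "udp", "icmp"]

def one_hot(value, vocabulary):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0      # e.g. "udp" -> [0, 1, 0]
    return vec

def min_max(column):
    col = np.asarray(column, dtype=float)   # Eq. (5): scale continuous features to [0, 1]
    return (col - col.min()) / (col.max() - col.min())

print(one_hot("udp", protocols))            # [0. 1. 0.]
print(min_max([0, 5, 10, 20]))              # [0.   0.25 0.5  1.  ]
```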

ð5Þ

Evaluation Index

This paper mainly sets evaluation indices for the binary classification problem, using the confusion matrix as the evaluation visualization tool, as shown in Table 1.

Table 1. Confusion matrix

          Predict
Actual    Positive  Negative
Positive  TP        FN
Negative  FP        TN

where TP is the amount of data correctly predicted as intrusions; FP is the amount of data predicted as intrusions but actually normal; FN is the amount of data predicted as normal but actually intrusions; TN is the amount of data correctly predicted as normal. According to the values of the four elements in the confusion matrix, this paper uses the following evaluation indicators:

– Accuracy (AC). Accuracy indicates the percentage of records that are correctly classified by the algorithm. This index is the most important indicator for evaluating the performance of the algorithm. The calculation formula is:

AC = \frac{TP + TN}{TP + TN + FP + FN}    (6)

– Precision (P). The proportion of correctly predicted intrusion records among all records predicted as intrusions expresses the precision of the algorithm. The calculation formula is:

P = \frac{TP}{TP + FP}    (7)


– Recall (R). Recall is the proportion of correctly detected intrusions among all actual intrusions. The calculation formula is:

$R = \frac{TP}{TP + FN}$  (8)

Reducing the number of records reported as intrusions may increase precision to a certain extent, but it will reduce recall. Therefore, precision and recall must be considered jointly to express the algorithm's intrusion detection ability. The joint measure is the F value:

$F = \frac{2PR}{P + R}$  (9)
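These four indices follow directly from the confusion matrix. The sketch below is a direct transcription of Eqs. (6)–(9), with guards against empty denominators:

```python
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the four evaluation indices of Eqs. (6)-(9)."""
    ac = (tp + tn) / (tp + tn + fp + fn)        # accuracy, Eq. (6)
    p = tp / (tp + fp) if tp + fp else 0.0      # precision, Eq. (7)
    r = tp / (tp + fn) if tp + fn else 0.0      # recall, Eq. (8)
    f = 2 * p * r / (p + r) if p + r else 0.0   # F value, Eq. (9)
    return {"AC": ac, "P": p, "R": r, "F": f}
```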

4 Experiment and Result Analysis

4.1 Dataset Selection

This paper uses the NSL-KDD dataset as the test dataset for the proposed algorithm. Compared with the original KDD'99 dataset, the NSL-KDD dataset has the following four advantages. First, the training set contains no redundant records, so the classifier does not favor more frequent records. Second, there are no duplicate records in the test set, which makes the detection rate more accurate. Third, the classification rates of different machine learning methods vary over a wider range, which makes the evaluation of different learning techniques more accurate and effective. Fourth, the numbers of training and test records are set reasonably, which lowers the cost of running the full set of experiments. In addition, many researchers have studied machine learning on the NSL-KDD dataset [8], so comparative data are easy to obtain. The NSL-KDD dataset contains the "KDDTrain+" training set of 125,973 records, the "KDDTest+" test set of 22,544 records, and the "KDDTest-21" test set of 11,850 high-difficulty records. Each record contains 41 features, 1 class label and 1 difficulty level. Three of the 41 features are non-numeric: "protocol_type", "service", and "flag", which need to be digitized during data preprocessing. Table 2 shows the 41 feature elements and their data types for each record in the NSL-KDD dataset.

Table 2. Features and types

Type of feature | Features
Numeric | (1) Duration, (5) Src_bytes, (6) Dst_bytes, (9) Urgent, (10) Hot, (11) Num_failed_logins, (13) Num_compromised, (16) Num_root, (17) Num_file_creations, (18) Num_shells, (19) Num_access_files, (20) Num_outbound_cmds, (23) Count, (24) Srv_count, (25) Serror_rate, (26) Srv_serror_rate, (27) Rerror_rate, (28) Srv_rerror_rate, (29) Same_srv_rate, (30) Diff_srv_rate, (31) Srv_diff_host_rate, (32) Dst_host_count, (33) Dst_host_srv_count, (34) Dst_host_same_srv_rate, (35) Dst_host_diff_srv_rate, (36) Dst_host_same_src_port_rate, (37) Dst_host_srv_diff_host_rate, (38) Dst_host_serror_rate, (39) Dst_host_srv_serror_rate, (40) Dst_host_rerror_rate, (41) Dst_host_srv_rerror_rate
Nominal | (2) Protocol_type, (3) Service, (4) Flag
Binary | (7) Land, (8) Wrong_fragment, (12) Logged_in, (14) Root_shell, (15) Su_attempted, (21) Is_hot_login, (22) Is_guest_login
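As a hedged illustration of working with this dataset, the sketch below loads the distribution files and derives the binary target used by the evaluation indices above; the file names "KDDTrain+.txt"/"KDDTest+.txt" and the generic column names are assumptions about the standard NSL-KDD distribution:

```python
import pandas as pd

# The first six names follow Table 2; the rest are generic placeholders.
NAMED = ["duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes"]
COLS = NAMED + [f"f{i}" for i in range(7, 42)] + ["label", "difficulty"]

train = pd.read_csv("KDDTrain+.txt", header=None, names=COLS)
test = pd.read_csv("KDDTest+.txt", header=None, names=COLS)

# Binary target: 1 for any intrusion record, 0 for normal traffic.
y_train = (train.pop("label") != "normal").astype(int)
train.pop("difficulty")  # the difficulty level is not used as a feature
```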

4.2 Experimental Results

According to the above model and evaluation indices, this paper carries out simulation experiments on the NSL-KDD dataset. The main purpose of the experiments is to find the optimal hyperparameters of the GRU sub-networks in the model, such as the optimal learning rate and GRU hidden layer size; the model with the optimal hyperparameters is then trained on the training set and compared on the test sets. According to the research results in [10] and the structural similarity between GRU and LSTM networks, the learning rate and the hidden layer size of a GRU network affect the algorithm independently of each other, so they can be tuned separately when adjusting the network. In this experiment, the learning rate is tuned separately for each GRU learning module. The results are as follows (Tables 3 and 4):

Table 3. Accuracy (AC)

Type  | Learning rate 0.1% | Learning rate 0.05% | Learning rate 0.01% | Learning rate 0.005%
DOS   | 0.9742 | 0.9806 | 0.9788 | 0.9729
Probe | 0.9768 | 0.9848 | 0.9812 | 0.9755
U2R   | 0.9994 | 0.9995 | 0.9967 | 0.9993
R2L   | 0.9936 | 0.9984 | 0.9967 | 0.9979

Table 4. F value

Type  | Learning rate 0.1% | Learning rate 0.05% | Learning rate 0.01% | Learning rate 0.005%
DOS   | 0.9638 | 0.9728 | 0.9702 | 0.9620
Probe | 0.8693 | 0.9160 | 0.8967 | 0.8624
U2R   | 0.5627 | 0.5804 | 0.5891 | 0.5977
R2L   | 0.9011 | 0.9031 | 0.8967 | 0.8721


In the second phase of the experiment, the learning rate is fixed at 0.05% and the size of the GRU hidden layer is varied to train the model. The experimental results show that the model performs best when the hidden layer size is 80; the specific results are as follows (Table 5):

Table 5. Fixed learning rate of 0.05%

Type  | 40 hidden | 60 hidden | 80 hidden | 100 hidden
DOS   | 0.9529 | 0.9626 | 0.9728 | 0.9456
Probe | 0.9119 | 0.9129 | 0.9364 | 0.9189
U2R   | 0.5804 | 0.5804 | 0.5717 | 0.5804
R2L   | 0.9026 | 0.9102 | 0.9186 | 0.9080

From the experimental results, the F value for the U2R type is low. Analysis of the raw data shows that this is because the false positive rate is high, which is caused by over-fitting of the model; nevertheless, owing to the data imbalance, the accuracy of the model remains high. Therefore, this paper chooses the model with a learning rate of 0.05% and a GRU hidden layer size of 80, and compares its accuracy with that of traditional machine learning algorithms. The SF-GRU intrusion detection model shows certain performance advantages on the NSL-KDD dataset; the experimental results are as follows (Table 6):

Table 6. Accuracy comparison of algorithms

Type  | SF-GRU | Random forest | J48    | SVM    | CART
DOS   | 0.9812 | 0.9821        | 0.8248 | 0.9778 | 0.8894
Probe | 0.9861 | 0.9762        | 0.8029 | 0.9074 | 0.8273
U2R   | 0.9995 | 0.9754        | 0.7394 | 0.9376 | 0.7308
R2L   | 0.9984 | 0.9681        | 0.8759 | 0.9182 | 0.8083

The comparison results show that the SF-GRU intrusion detection model improves accuracy over the traditional machine learning algorithms on most attack types. Meanwhile, owing to the simplified recurrent cell and the lower-dimensional input data, the SF-GRU intrusion detection model also has lower time complexity than the traditional LSTM-based intrusion detection model.
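The selected configuration can be sketched as follows. The paper does not state its framework, optimizer, or exact input shaping, so this minimal Keras sketch assumes Adam and treats each record's salient-feature vector as a sequence of scalars fed to the GRU:

```python
import tensorflow as tf

def build_sf_gru(num_salient_features: int) -> tf.keras.Model:
    """GRU binary classifier with the hyperparameters selected above:
    hidden layer size 80 and learning rate 0.05% (i.e. 5e-4)."""
    model = tf.keras.Sequential([
        # Each salient feature value is fed as one timestep with one channel.
        tf.keras.layers.GRU(80, input_shape=(num_salient_features, 1)),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # intrusion vs. normal
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),  # assumed optimizer
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# One binary sub-model per attack type, e.g. DOS uses its 11 salient features.
dos_model = build_sf_gru(num_salient_features=11)
```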

5 Conclusions

In the NSL-KDD dataset, each record contains 41 features, many of which are redundant for detecting certain attack behaviors. Therefore, by selecting the appropriate feature data to participate in the computation during data preprocessing, not only does the detection rate not decrease as features are removed, but the model training time is also effectively reduced. Based on prior knowledge of computer networks and existing research results, this paper applies the GRU algorithm to the NSL-KDD dataset after feature selection, and proposes a GRU intrusion detection algorithm based on salient features (SF-GRU). Experiments show that this method achieves higher accuracy than traditional machine learning methods; compared with the original LSTM detection method, the model complexity is reduced and the algorithm efficiency is improved.

Acknowledgments. This work was supported by the National Natural Science Foundation of China (61772321), the CERNET Innovation Project (NGII20170508), and in part by the Guangdong Province Key Research and Development Plan (Grant No. 2019B010137004), the National Key Research and Development Plan (Grant No. 2018YFB1800701, No. 2018YFB0803504, and No. 2018YEB1004003), and the National Natural Science Foundation of China (Grant No. U1636215 and 61572492).

References

1. Yin, C., Zhu, Y., Fei, J., et al.: A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access 5, 21954–21961 (2017)
2. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
3. Yuan, Z., Lu, Y., Wang, Z., et al.: Droid-Sec: deep learning in Android malware detection. In: ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 371–372. ACM (2014)
4. Depren, O., et al.: An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks. Expert Syst. Appl. 29(4), 713–722 (2005)
5. Gao, N., Gao, L., Gao, Q., et al.: An intrusion detection model based on deep belief networks. In: 2014 Second International Conference on Advanced Cloud and Big Data (CBD). IEEE Computer Society (2014)
6. Staudemeyer, R.C.: Applying long short-term memory recurrent neural networks to intrusion detection. S. Afr. Comput. J. 56(1), 136–154 (2015)
7. Cho, K., Van Merriënboer, B., Gulcehre, C., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. Comput. Sci. (2014)
8. Cho, K., Van Merriënboer, B., Gulcehre, C., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
9. Staudemeyer, R.C., Omlin, C.W.: Extracting salient features for network intrusion detection using machine learning methods. S. Afr. Comput. J. 52(1), 82–96 (2014)
10. Greff, K., Srivastava, R.K., Koutník, J., et al.: LSTM: a search space odyssey. IEEE Trans. Neural Networks Learn. Syst. 28(10), 2222–2232 (2015)
11. Agarap, A.F.M.: A neural network architecture combining gated recurrent unit (GRU) and support vector machine (SVM) for intrusion detection in network traffic data. In: Proceedings of the 2018 10th International Conference on Machine Learning and Computing (ICMLC 2018), Macau, China, pp. 26–30. ACM Press (2018)
12. Han, H., Wang, W.Y., Mao, B.H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Proceedings of the 2005 International Conference on Advances in Intelligent Computing, vol. I (2005)
13. Wang, Q., et al.: Research on CTR prediction based on deep learning. IEEE Access 7, 12779–12789 (2018)
14. Tian, Z., et al.: Real time lateral movement detection based on evidence reasoning network for edge computing environment. IEEE Trans. Industr. Inf. 15(7), 4285–4294 (2019)
15. Tian, Z., Li, M., Qiu, M., Sun, Y., Su, S.: Block-DEF: a secure digital evidence framework using blockchain. Inf. Sci. 491, 151–165 (2019). https://doi.org/10.1016/j.ins.2019.04.011
16. Tian, Z., Gao, X., Su, S., Qiu, J., Du, X., Guizani, M.: Evaluating reputation management schemes of internet of vehicles based on evolutionary game theory. IEEE Trans. Veh. Technol. 68(6), 5971–5980 (2019)
17. Tian, Z., Su, S., Shi, W., Du, X., Guizani, M., Yu, X.: A data-driven method for future internet route decision modeling. Future Gener. Comput. Syst. 95, 212–220 (2019)
18. Tan, Q., Gao, Y., Shi, J., Wang, X., Fang, B., Tian, Z.: Toward a comprehensive insight into the eclipse attacks of Tor hidden services. IEEE Internet Things J. 6(2), 1584–1593 (2019)

Research on Electronic Evidence Management System Based on Knowledge Graph

Honghao Wu
Beijing Police College, Beijing 102202, China
[email protected]

Abstract. With the development of information technology, electronic data clues have been playing an important role in the investigation of criminal cases. However, the difficulty of analyzing massive data has also brought great challenges to criminal investigations. At present, big data, artificial intelligence and other technologies are used to store and analyze electronic evidence such as text, pictures, videos and forms. However, deeply mining and making full use of the semantic information of evidence, and breaking the barriers between various heterogeneous data, remain hard problems. To solve them, an innovation in the storage form of data is needed to construct a unified expression for heterogeneous information. This paper proposes to build a case-oriented knowledge graph, and provides solutions for associating the knowledge graph with external information and with non-electronic data evidence, respectively. This knowledge graph can be used to realize reasoning and judgment over cross-structure information and to assist the public security organs in solving cases.

Keywords: Criminal cases · Electronic evidence · Knowledge graph · Information extraction · Data fusion

1 Introduction

Data that can be stored, processed, and transmitted in digital form and can prove the facts of a case is collectively referred to as electronic data evidence. Due to the widespread use of information technology, a large number of clues and pieces of evidence exist in the form of electronic data. These data provide a large amount of information for the detection and trial of cases, effectively improving the case-solving rate [1–3]. However, massive electronic data has brought great challenges to the investigation of criminal cases. Electronic information clues in criminal cases are derived from the various data generated in the course of a crime, and these data are heterogeneous. If only one type of data is analyzed, it is difficult to produce satisfactory results because of insufficient information [4–7], and the links between the various parts of the data are also cut off. At the same time, electronic data clues usually take various forms: the processing and analysis of texts, forms, pictures and videos often require different methods, which makes semantic association difficult [8–10]. Therefore, the key to applying artificial intelligence technology to the analysis of electronic data clues is the semantic fusion of heterogeneous electronic data information [4].


Compared with a structured database, a knowledge graph can express more information and is better suited to information analysis in the era of big data. The knowledge graph, proposed by Google in 2012, represents knowledge in the form of triplets. Triplets have good expressive and learning ability and are highly extensible, which has made them the mainstream storage form of knowledge. Knowledge graphs are now widely used in semantic search and automatic question answering [11]. Knowledge graph technology mainly includes knowledge acquisition, data fusion, and knowledge computation and application, and is now widely applied in the medical, financial and legal fields [12–17]. This paper makes full use of the advantages of knowledge graphs and proposes to construct a case-oriented knowledge graph to solve the heterogeneity problem among electronic data clues while digging deeper into the semantic information of the clues. This paper also proposes a case-oriented knowledge graph model comprising a data layer, a network layer and an application layer. In the remainder of this paper, the technical details and difficulties of the architecture, together with their solutions, are discussed in detail.

2 Overall Architecture and Difficulty Analysis

The case-oriented knowledge graph system mainly includes three layers: the data layer, the network layer and the application layer (see Fig. 1). The data layer mainly includes existing structured databases and semi-structured forms, as well as unstructured text data and audio and video data. The entities and relationships generated by data cleaning and information extraction over the data collected in the data layer constitute the data of the network layer.

Fig. 1. Case-oriented knowledge graph system

The network layer of the case-oriented knowledge graph has the following characteristics. First, since cases are continuously generated, the knowledge graph should be extensible, so if a distributed representation is used, the learning process should be incremental. Second, the case evidence knowledge base should be able to combine a common knowledge graph with the case knowledge graph, because if case information were also kept in the knowledge graph indefinitely, the graph would become too large and its expression quality would suffer. The application layer mainly serves to quickly extract data from the network layer for different tasks.

The technical difficulties in building a case-oriented knowledge graph are as follows:

1. Case information extraction and integration relies on a variety of machine learning techniques, including natural language processing, image recognition, etc. [18–22]. At present, these technologies have made great breakthroughs individually, but how to integrate them so that they support each other is still a difficult task. Regardless of whether the data is text, picture or video, it ultimately expresses objective things existing in reality, so a semantic expression method for heterogeneous information must be developed. With the development of deep learning in recent years, distributed representation methods have made great progress. These methods represent the semantic relationships between entities as vectors and matrices. On the one hand, this moves away from symbolic representations such as text and pictures and realizes a semantic representation of heterogeneous information; on the other hand, it is convenient for computer recognition and calculation (a sketch of such a representation follows this list).
2. Identification of new types of cases. Forms of criminal cases change with each passing day, and failure to identify new forms of crime will affect the effectiveness of the system. In addition, the analysis and reasoning of cases are similar: investigation ideas can be summarized through the analysis and study of a large number of cases. As a result, the construction of the knowledge base needs to be extensible. Transfer learning, a current research hotspot, aims to train learning models in one field and transplant them into similar fields [23].
3. Integrating external information. One of the basic characteristics of big data is its huge volume, and the convergence of large amounts of data makes more accurate reasoning possible. Therefore, the case knowledge graph needs not only to extract information from the case itself, but also to integrate information in existing databases and information existing in non-electronic form.
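As one concrete instance of such a distributed representation (the paper names no specific model), the sketch below scores triples with TransE, where a triple (h, r, t) is considered plausible when the vector h + r is close to t; the vocabularies and random embeddings are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # embedding dimension (illustrative)

# Illustrative entity/relation vocabularies for a case knowledge graph.
entities = {"suspect_A": 0, "phone_123": 1, "scene_X": 2}
relations = {"owns": 0, "appeared_at": 1}

E = rng.normal(size=(len(entities), DIM))   # entity embeddings
R = rng.normal(size=(len(relations), DIM))  # relation embeddings

def transe_score(h: str, r: str, t: str) -> float:
    """TransE energy ||h + r - t||: lower means more plausible. Training
    adjusts the embeddings so that true triples receive low energy."""
    return float(np.linalg.norm(E[entities[h]] + R[relations[r]] - E[entities[t]]))

print(transe_score("suspect_A", "owns", "phone_123"))
```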

3 Collection and Management of Electronic Clues

The clues of criminal cases cover all aspects of society; they are complicated and contain large amounts of data. The data sources include the various data resources generated by internal public security operations and a large number of social data resources. As clues to criminal cases, the information in these data plays an important role in the automatic reasoning and judgment of cases. However, because this information takes different forms, it is not conducive to unified identification and management by computers. Therefore, this paper proposes to use Internet of Things technology to manage the original information, with the purpose of reworking it in a way that suits big data processing. Using radio frequency technology, the original information in a file is uniquely marked in the form of a two-dimensional code, as shown in Fig. 2. The flow of the original information between departments and the storage location of the corresponding electronic material are recorded by radio frequency devices. Through this technology, the security of the original evidence can be guaranteed, and the location of files can be seen visually. The technology shortens the time required for recording work and reduces possible errors: recording tasks are performed automatically when the two-dimensional code is scanned.

Fig. 2. The two-dimensional code for the original data materials

The management of the original evidence materials is case-oriented, and the document is the minimum management unit. The design of the two-dimensional code for the original data materials mainly considers four stages: the generation, use, circulation and preservation of paper file materials. The relevant elements include documents, places and persons. The encoding of a document is mainly composed of the case number, volume number and document number, which together uniquely mark the document within a file. The roles of persons include production, management and use.
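A minimal sketch of generating such a code is shown below, assuming the third-party Python package `qrcode`; the identifier format and the sample numbers are hypothetical, since the paper does not specify the exact encoding scheme:

```python
import qrcode  # third-party package: pip install qrcode[pil]

def document_code(case_no: str, volume_no: str, doc_no: str) -> str:
    """Compose the unique document identifier described above:
    case number + volume number + document number (format illustrative)."""
    return f"{case_no}-{volume_no}-{doc_no}"

code = document_code("2019A0153", "V02", "D017")  # hypothetical numbers
qrcode.make(code).save(f"{code}.png")  # printed label, scanned by RF devices
```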

4 Heterogeneous Case Information Fusion

Case information can be classified into structured, semi-structured and unstructured information according to its structural level. Structured information is more conducive to computer processing and analysis than semi-structured and unstructured information. The knowledge graph proposed by Google in 2012 converts knowledge into triples, i.e. (e1, r, e2), where e is an entity and r is the semantic relationship between the two entities. This form is simple and universal, and is suitable for complex semantic environments. At the same time, to facilitate the calculation and processing of entities and semantic relations, triplets are often stored in the form of distributed representations. At present, there are relatively mature technologies for transforming structured, semi-structured and unstructured information into triples.

For structured information, the public security departments have established a number of large-scale structured databases covering people, places, things, organizations and other information. This structured information is authentic and can serve as an important part of the common sense base. At present, the information in these databases is mainly converted into knowledge base information through ETL technology.

Similar to structured data, the construction of a large number of information systems has produced a large amount of semi-structured information, such as web contents. This kind of data is more diverse than structured information and its information density is lower, but wrapper-based extraction works relatively well and the data rules are relatively simple, which makes it easy for machine learning methods to achieve good results. Most importantly, the semantic rules of semi-structured information are richer and can express multiple semantic relationships between entities.

Compared with structured and semi-structured information, the proportion of unstructured information in case clues is getting higher and higher: text, audio and video evidence occupy increasingly important positions. According to statistics, more than 80% of the overall information in case files consists of text. Text is a highly generalized symbolic form that is easy for humans to understand, whereas computers operate on numbers and handle symbols poorly. Text is a kind of unstructured data; in order to build a large-scale knowledge graph and update and maintain it in time, unsupervised learning is needed to extract knowledge for the knowledge graph. Compared with traditional statistical learning models, word vectors used as the input of deep learning can express the similarity between words while retaining the robustness of statistical models. Google's word2vec can reflect the hyponymy of words, but it still needs to reflect more semantic relations to be applied to the construction of and reasoning over knowledge graphs. If the word vectors can be matched with the feature vectors describing entities in the knowledge graph model, unsupervised construction of the knowledge graph will be easier; progress in this area also helps to integrate knowledge.

The most important method for converting unstructured information into structured information is information extraction, a technology that extracts the entities in texts and the semantic relations between them to constitute structured information. The entities and semantic relationships generated by information extraction are important parts of the knowledge graph. Information extraction technology can be mainly divided into entity-oriented named entity recognition and relationship extraction for the semantic relationships between entities.
Traditional information extraction technology is mainly oriented to text information, mainly including named entity recognition and relationship extraction, i.e. extracting entity information about people, places, things, organizations and so on from texts and extracting the semantic relationships between these entities. Beyond text, audio information can first be converted into text and then extracted.
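A minimal sketch of this text-oriented pipeline is shown below, using spaCy as one possible toolkit (an assumption; the paper names no library). Real relationship extraction would classify the semantic link between each entity pair; here, same-sentence co-occurrence merely produces candidate triples:

```python
import spacy

# "en_core_web_sm" is an assumption; any NER-capable pipeline would do.
nlp = spacy.load("en_core_web_sm")

def extract_entities_and_candidates(text: str):
    """Named entity recognition plus naive candidate-triple generation."""
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    candidates = []
    # Pair entities occurring in the same sentence as relation candidates.
    for sent in doc.sents:
        ents = list(sent.ents)
        for i in range(len(ents)):
            for j in range(i + 1, len(ents)):
                candidates.append((ents[i].text, "related_to", ents[j].text))
    return entities, candidates

ents, triples = extract_entities_and_candidates(
    "Zhang San met Li Si at Beijing West Railway Station on Monday.")
```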


In addition, the rapid development of computer vision technology in recent years has greatly improved the accuracy of image and video information extraction. For audio, the signal can first be converted into text and then processed; for video, the entities in each frame can be identified first, and the semantic relationships between entities can then be extracted from the changes between frames. Through the techniques above, various types of information can be uniformly transformed into a knowledge graph. However, since the generation processes differ, the same entity or semantic relationship may be expressed in different ways, which means that semantic alignment is required; this is also called the heterogeneity problem of knowledge bases. It has two main causes: first, different knowledge bases describe the same entity differently; second, different knowledge bases contain different entities. In order to generate a unified knowledge base, semantic alignment is needed. At present, the commonly used approach is to achieve semantic alignment through distant supervision over existing knowledge graphs, and to improve the alignment, each entity needs to be given semantic labels. Because structured data is highly reliable and the transformation of structured databases into a knowledge graph is highly accurate, the part of the knowledge graph generated from structured data can serve as the supervision set in distant supervision, while relation extraction continuously expands the knowledge graph to achieve semantic alignment. Existing artificial intelligence technology can automatically extract entity information such as people, places, things and organizations from text materials. By uniquely marking the entities appearing in files, these entities can not only be associated with the public security databases, breaking the barrier between the structured and unstructured data held by the public security system, but also provide a basis for associating files across different cases, which makes it easy to link cases.

5 Construction of Case Knowledge Graph

To understand a case, we need to track the development of social hot events in time and build an event-oriented knowledge graph. Building such a knowledge graph differs greatly from building a traditional one. First, social hot events are dynamic: a single event changes and develops over time, which requires real-time updating of the knowledge graph. At the same time, the time axis is an important attribute linking the entities in a series, so the time attribute must be taken into account when constructing a time-oriented knowledge graph.

From the perspective of timeliness and scope of action, a case-oriented knowledge base includes common sense information, such as social relations and geographic locations, and case information, such as the development process of a specific case. Compared with the knowledge base built from common sense information, case information is often temporary and richer in content; it is usually meaningful only for one case, and the information in it is often unconfirmed and may even be contrary to the facts. Therefore, if this information were stored together with the common sense base, it would easily interfere with the common sense information and make the common sense base extremely large. The network layer is therefore divided into two parts: common knowledge and case knowledge. Since the construction of knowledge graphs is currently oriented mainly toward common knowledge, this paper focuses on the extraction and construction of case knowledge in the case knowledge graph.

Case information extraction has become a popular research field. Its purpose is to present cases in a structured form. Compared with the traditional triplet form, case information extraction adds time and location information to form a quintuple. It is mainly divided into two parts: one is the identification of cases, which includes the identification of new cases and the gathering of information on the same case; the other is the extraction of the information content of the case, mainly the entity information, the semantic relationships between entities, and the time and place of each semantic relationship. The elements that make up a case include the type of case, keywords, and elements. The main bases for determining relevant information when extracting case information are:

1. Type of case. By classifying existing case types, a three-tier case type tree can be formed from the root according to jurisdiction, crime and criminal form, and a case classification model can be built by extracting the characteristics of the different case types. Because the number of criminal case types is limited, existing classification models can achieve relatively good results. The difficulty lies in the fact that many different forms of cases fall under the same crime. For example, with the development of network technology, cyber fraud has become a high-profile form of fraud whose criminal form changes with each passing day, which challenges the classification of case types. The case type tree can reduce the interference of wrong information.
2. Keywords of cases. To uniquely determine a case, its composition can be classified, following the elements of a crime, into the subject and object of the crime and the subjective and objective aspects of the crime. Summarizing these four aspects yields a unique mark for the case. The subject of the crime mainly refers to criminals and criminal groups, where a criminal group can be represented as an aggregation of criminals. The object of the crime mainly refers to the object of infringement, such as people, things and organizations; the substantive relationships among people, things and organizations are conducive to reasoning about the case. The subjective and objective aspects are difficult to express as entities, so Boolean values can be used to express them, such as whether there is subjective consciousness of crime and whether there are criminal facts.
3. The extraction of the elements of cases. The information in a case is complicated and redundant. To ensure the integrity of the information without extracting it repeatedly, in addition to the type and keywords of the case, the relevant entities of the case can be selected according to persons, places, things and organizations, together with time information. From all the information above, the semantic relationships between the elements are extracted, forming the case knowledge quintuple, i.e. <entity 1, relation, entity 2, time, place>; a minimal sketch of this structure follows this list.
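The quintuple can be sketched as a simple data structure; the field names and sample values below are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CaseFact:
    """Case knowledge quintuple <entity1, relation, entity2, time, place>,
    extending the (e1, r, e2) triple with the when/where of the fact."""
    entity1: str
    relation: str
    entity2: str
    time: Optional[str] = None   # e.g. an ISO timestamp of the event
    place: Optional[str] = None  # e.g. a geocoded location entity

fact = CaseFact("suspect_A", "transferred_money_to", "account_B",
                time="2019-06-01T14:30", place="ATM_Xicheng_03")
```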


Fig. 3. Structure of subjective information and objective information

In the process of case reasoning, to add subjective information into the knowledge graph, a knowledge graph model with two poles is proposed: a subjective information pole and an objective information pole. A document may not evaluate objective things, that is, it may contain no subjective information, but it must contain objective information. At the lower level of objective things is the subject level, where each subject represents a fact, and there can be spatio-temporal or logical links between subjects. Below each subject is its internal structure, which describes the subject as a generalized table, in which each sub-table refers to one aspect of the subject, called an attribute of the subject. The elements of a sub-table include the adjectives and adverbs expressing the subject or describing one of its attributes. At the lowest level are knowledge granules: each knowledge granule consists of a group of similar words, with word vectors constructed by machine learning as the representation of words. Below the subjective pole are the different emotional classifications; each classification exists as a knowledge granule, and the words in it are weighted according to their emotional intensity. This research will focus on the construction of a knowledge base based on XML and a text processing method oriented to the knowledge graph (Fig. 3).

6 Conclusion and Future Work

This paper proposes building a case-oriented electronic evidence knowledge graph to realize reasoning and judgment over heterogeneous evidence, and proposes a model for extending the knowledge graph with external information and non-electronic data evidence. This model can deeply mine and make full use of the semantic information of evidence, break the barriers between various heterogeneous pieces of evidence, innovate the storage form of data, and fully utilize big data and artificial intelligence technologies to analyze electronic data such as text, pictures, videos and forms. Using massive electronic data to assist criminal case investigation will become a popular research field, and we will try to build an extensible knowledge graph.

Acknowledgments. This work was supported in part by the Guangdong Province Key Research and Development Plan (Grant No. 2019B010137004), the National Key Research and Development Plan (Grant No. 2018YFB1800701, No. 2018YFB0803504, and No. 2018YEB1004003), and the National Natural Science Foundation of China (Grant No. U1636215 and 61572492).

References

1. Zhao, Z.: Design and implementation of automatic reasoning method for electronic evidence. J. People's Public Secur. Univ. Chin. 22(02), 83–87 (2016)
2. Zheng, T., Zhang, Y.: Application of domain ontology-based data mining technology in investigation of bribery crime. Chin. Prosecutor 2016(03), 55–57 (2016)
3. Yin, Y.: Design and Implementation of Judges Knowledge Management System. University of Chinese Academy of Sciences, Beijing (2014)
4. Tian, Z., et al.: Real time lateral movement detection based on evidence reasoning network for edge computing environment. IEEE Trans. Industr. Inf. 15(07), 4285–4294 (2019)
5. Yang, J., Cao, J., Gao, H.: Ontology-based telecom fraud analysis knowledge base model. Comput. Eng. Des. 38(06), 1418–1423 (2017)
6. Shen, J., Miao, T., Liu, Q., Ji, S., Wang, C., Liu, D.: S-SurF: an enhanced secure bulk data dissemination in wireless sensor networks. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 395–408. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_32
7. Yan, H., Deng, J., Yang, Y.: Research on construction of conceptual knowledge base of public security information by thematic map ontology technology. J. People's Public Secur. Univ. Chin. 18(01), 44–48 (2012)
8. She, G., Zhang, Y.: Research on ontology-based case reasoning scheme for criminal trials. Libr. Inf. Work 58(13), 118–124 (2014)
9. Cao, X., Dang, L., Fan, K., Fu, Y.: An attack to an anonymous certificateless group key agreement scheme and its improvement. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 56–69. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_5
10. Tian, Z., Gao, X., Su, S., Qiu, J., Du, X., Guizani, M.: Evaluating reputation management schemes of internet of vehicles based on evolutionary game theory. IEEE Trans. Veh. Technol. 68(6), 5971–5980 (2019)
11. Bollacker, K., Evans, C., Paritosh, P.: Freebase: a collaboratively created graph database for structuring human knowledge. In: ACM SIGMOD International Conference on Management of Data, SIGMOD, pp. 1247–1250. DBLP (2008)
12. Tan, Q., Gao, Y., Shi, J., Wang, X., Fang, B., Tian, Z.: Toward a comprehensive insight into the eclipse attacks of Tor hidden services. IEEE Internet Things J. 6(2), 1584–1593 (2019)
13. Ouyang, L., Tian, Y., Tang, H., Zhang, B.: Chinese named entity recognition based on B-LSTM neural network with additional features. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 269–279. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_22
14. Zuva, T., Kwuimi, R.: Comprehensive diversity in recommender systems. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 571–583. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_45
15. Gao, X.: Construction of legal knowledge graph of artificial intelligence for civil judicial application based on essential factual civil judgment theory. Law Soc. Develop. 24(06), 66–80 (2018)
16. Li, M., Liu, W., Mi, Y.: Electronic evidence system of public data based on block chain and relevance analysis. J. Intell. Sys. 2019(09), 1–13 (2019)
17. Hao, P.: Study on the Construction of Security Police Knowledge Graph. People's Public Security University of China, Beijing (2019)
18. Xu, H.: Modeling Research on Ontology of Criminal Case Domain. People's Public Security University of China, Beijing (2017)
19. Lu, M.: Application and prospect of artificial intelligence in criminal justice. J. Liaoning Inst. Public Secur. Judicial Manage. 2019(04), 52–57 (2019)
20. Yao, Y.: On the innovative strategies of economic crime investigation under background of big data. Legal Syst. Econ. 2019(07), 132–133 (2019)
21. Ji, J.: Construction and application of police cloud platform in the perspective of Big Data. Electron. Tech. Softw. Eng. 2019(13), 169–170 (2019)
22. Tian, Z., Su, S., Shi, W., Du, X., Guizani, M., Yu, X.: A data-driven model for future internet route decision modeling. Future Gener. Comput. Sys. 95, 212–220 (2019)
23. Xu, Z., Li, X.: Secure transfer protocol between app and device of Internet of Things. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10658, pp. 25–34. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72395-2_3

Research on Security Supervision on Wireless Network Space in Key Sites

Honghao Wu
Beijing Police College, Beijing 102202, China
[email protected]

Abstract. With the development of wireless networks in recent years, the number of crimes caused by wireless network security problems keeps rising, and once a wireless network is compromised, the consequences can be serious. Existing stand-alone security equipment can hardly meet the security requirements of key sites, so this paper conducts in-depth research on Wi-Fi space security management in such environments. We employ a distributed probe set combined with a knowledge base to implement the supervision of wireless network security in key sites. Through the extensive collection of Wi-Fi terminal, AP, channel and other information in specific areas, we realize overall monitoring and blocking of unsafe behaviors such as unauthorized Wi-Fi access, illegal client access, illegal sniffing, brute force cracking, hijacking, phishing and so on. As a result, security managers can spot Wi-Fi-related security events immediately and dispose of the problems in a timely manner.

Keywords: Key sites · Wireless network · Wi-Fi · Distributed probe · Security supervision

1 Introduction

With the development of wireless network technology and the popularization of mobile intelligent products, Wi-Fi networks are increasingly widely applied in social work and life. Most important places, such as universities, shopping malls and major institutions, have achieved full Wi-Fi coverage, and some cities have proposed the goal of building "wireless cities". People can access the Internet through Wi-Fi whenever and wherever they like. Wi-Fi technology makes our lives more convenient; at the same time, however, cybercrimes caused by Wi-Fi security problems are increasing, such as stealing personal sensitive information, illegal login, network fraud and remote control. Consequently, it is essential to lay stress on Wi-Fi network security in order to protect citizens' information security. At present, when public security police face such illegal and criminal activities, it is difficult to detect them and search for evidence. Specifically, it is difficult to detect the criminal activities through the terminal devices involved, let alone extract crime evidence from them, so there are no good detection methods for such cases [1–3]. In important places where Wi-Fi is the form of Internet connection, if hackers tamper with or intrude into the network environment without authorization, there will be serious network security events and risks, bringing considerable detection difficulties to the public security organs when investigating security risks. Besides, because Wi-Fi signals have no physical boundary, attackers can easily launch contactless covert attacks on them, which are difficult for security managers to perceive [4–8].

Our study of security supervision technologies for wireless network space in key sites utilizes distributed Wi-Fi probes combined with secure hosts to regulate Wi-Fi space security in key-site environments. It realizes the extensive collection of Wi-Fi terminal, AP, channel and other information in specific areas, as well as overall monitoring and timely blocking of unsafe behaviors such as unauthorized Wi-Fi access, illegal client access, illegal sniffing, brute force cracking, hijacking, phishing and so on. This work helps public security organs use Wi-Fi probe information sensing technology to proactively prevent and combat cybercrimes in the wireless network environment, and greatly reduces the detection difficulty for public security organs when they carry out major security tasks to identify security hazards.

2 Related Work

Wi-Fi space security management in specific environments is a research hotspot. Through such research, the government can maintain the security of spatial information and protect social safety and interests; at the corporate and personal level, Wi-Fi spatial information security also protects the safety of individual information. Moreover, cyberspace security is the lifeblood of infrastructure security and is a very important part of infrastructure security legislation throughout the world. The development of this project, especially key technology breakthroughs such as reliability and information security protection, will greatly improve Wi-Fi space security capability and lay a good foundation for its further development.

The main security issues confronting Wi-Fi networks at present are as follows:

1. Forged Wi-Fi. As user demand for mobile communications grows, many public places, such as hotels, restaurants and airports, offer wireless Internet services. The SSID of the Wi-Fi there tends to follow predictable rules, usually the name of the place. Criminals can forge the SSID to lure users to connect to an illegal wireless network, and then intercept the contents of the users' communications to steal their information and property [9–11].
2. Malicious attacks on Wi-Fi. Wi-Fi acts as the gateway through which users access the Internet. Once it is attacked by hackers, the wireless network will go down, compromising network availability. Furthermore, hackers can add malicious links to wireless networks, which threatens users' personal information and property once they visit illegal websites [12, 13].

The threats to Wi-Fi network security in key sites are more prominent. Key sites here mainly refer to important spaces, including airports, the underground, arenas and meeting places, especially places where critical meetings are held. One characteristic of such spaces is the large flow and strong mobility of people, which makes security protection difficult. The damage of malicious attacks can be extensive, and once panic spreads from online to offline [14], the impact becomes even worse. Critical meetings are characterized by high confidentiality requirements, and information leaks can lead to serious consequences. What is more, the high value of the information attracts attacks of higher technical sophistication, which makes protection difficult. Therefore, we highlight the threats to Wi-Fi safety in key sites.

Currently, Wi-Fi detection in particular spaces lacks long-term, full-coverage and systematic detection equipment. Most of the means at home and abroad for Wi-Fi space security supervision are stand-alone security detection devices [15–19]. The main disadvantages of this approach are as follows:

1. Under some special circumstances, for instance signal interference and irregular terrain that lead to signal failure, it is difficult to realize monitoring and all-round control of the wireless network [20].
2. Different kinds of threats require different protective measures. The subsystems are relatively independent of each other and lack a semantic understanding of malicious events, so that threats can slip through [21, 22].

3 Analysis of Key Issues in Wireless Network Security Supervision

In order to solve the problems above, this paper adopts a security management scheme of distributed Wi-Fi probes with centralized control, so as to realize control measures for Wi-Fi in particular areas such as key sites, important spaces and critical meeting places. These measures are long-term, fast-response and full-coverage, and include safety monitoring, early warning, blocking, basic information collection and on-site inspection. Compared with stand-alone Wi-Fi security measures, this scheme has the following advantages:

1. All-round control under special circumstances. Sensing devices are distributed in the wireless network environment to achieve full signal coverage and conduct all-round monitoring of Wi-Fi in specific areas. The public security authorities manage the sensing devices through the central control system and issue task instructions to the signal relay equipment, which processes the task messages and distributes them to designated sensing devices for execution. The Wi-Fi probes in the sensing devices then find hotspots, scan the terminal devices connected to each hotspot and return the results to the signal relay devices in real time. The signal relay devices collect the data returned from the distributed probes and forward it to the central control system, which conducts a comprehensive analysis of the information received; if the information contains malicious attacks, the relevant hotspots and terminal devices are blocked. By repeatedly testing and calibrating the Wi-Fi probe signal coverage during deployment, blank areas between probes are avoided, and with multi-point, multi-angle testing, full signal coverage can be achieved.
2. A complete security feature base. We build a comprehensive feature and knowledge base, mainly including databases about the detected devices such as a vulnerability database, fingerprint database, threat intelligence database, malicious program database, virus database and so on. The security feature base provides the basic data support for wireless network detection in key places. Through data fusion among the various bases, a knowledge base oriented to malicious threat events is formed, and the semantic expression of malicious events is learned by training on large amounts of data. At the same time, the knowledge base constantly expands itself through self-learning to cover new malicious events. As wireless security is deployed in key places, the feature base is continuously improved and enriched, building a complete security feature base step by step.
3. Fixing malicious programs found in wireless networks as evidence, which is conducive to their further analysis and supports the detection of criminals who undermine wireless network security. The core technologies of evidence fixation for wireless network malicious programs are malicious program detection and fixed upload analysis. The special evidence-fixing equipment is mainly used at the front end; its main functions include security detection, device address book backup, MAC address acquisition, crime evidence fixation and evidence upload. Through this device, wireless network security detection, data backup, evidence fixation and evidence upload can be realized, and the uploaded data can serve as a data source for the cloud platform. Through the construction of the cloud platform and the analysis of the overall criminal evidence data, the development trend of malicious-program-related crimes against wireless network security can be monitored and controlled.

The ability to analyze and identify malicious programs in wireless networks is the key to wireless network space security supervision. To better detect and identify these programs, four key issues need to be solved: identifying unknown viruses with a dynamic sandbox engine; scanning malicious programs in all directions and dimensions with a static scanning engine; establishing a virus feature library for more comprehensive and accurate comparison and analysis; and building a vulnerability feature library, based on the vulnerability information of the National Information Security Vulnerability Sharing Platform (CNVD) including that of major security vendors, to discover malicious programs that exploit vulnerabilities.
Specifically, the analysis of malicious programs in wireless networks should provide the following functions:


1. The dynamic sandbox engine is mainly used to identify unknown viruses, judging whether network behavior is malicious according to behavioral anomalies. It maintains a complete simulation environment: after abnormal behavior is caught in the production network, the behavior is replayed in the simulation environment for confirmation and judgment, and identified malicious behavior is then incorporated into the virus feature library.
2. The static scanning engine integrates multiple engines into one comprehensive engine, relying on the MTX virus detection engine of the National Security Center, a self-developed virus detection engine and a special recovery engine, to achieve all-round, multi-dimensional scanning of malicious programs.
3. The virus feature library integrates the feature library of the Anti-Network Virus Alliance (ANVA) of the Ministry of Industry and Information Technology, which includes not only the features found by this system itself but also those found by security manufacturers across the industry, giving the library authority in the security field. This virus feature library enables more comprehensive and accurate comparison and analysis of malicious programs.
4. The vulnerability database uses the vulnerability information in the National Information Security Vulnerability Sharing Platform (CNVD), including the vulnerability information of major security vendors, and discovers malicious programs that exploit vulnerabilities through their vulnerability characteristics.
5. After a mobile terminal is connected to the special evidence-fixing equipment through USB, it can be examined with one press of the security detection button or through the device's touch screen. The test results are presented in the form of electronic reports.

Fig. 1. Functional flow diagram of distributed Wi-Fi probe and centralized control security management scheme


The workflow of distributed Wi-Fi probe and centralized control security management scheme is shown in Fig. 1. In the detecting part, after the identification of the malicious hotspot through Wi-Fi probe, we can find the signal strength of the hotspot based on each probe, conduct precise location the triangulation algorithm and locate the precise position of the malicious hotspot. The main functions include security detection, equipment address book backup, MAC address acquisition, crime evidence fixation, evidence upload. After the mobile terminal is connected to the special equipment for evidence fixing through USB, it can detect the mobile terminal by the key of the security detection button, or by the touch screen function of the device. The test results are presented in the form of electronic reports. Device address book backup, evidence-fixed special equipment supports rapid backup of the address book in the mobile phone, so as to be able to remove the threat for the mobile phone after detecting the security threat, restore the factory settings when serious, and then restore the address book information for the mobile phone. MAC address acquisition, evidence-fixing device supports the acquisition of mobile terminal MAC address. When the mobile terminal connects to USB, the device automatically acquires the mobile terminal MAC address, so that the user is unconscious. After detection of malicious programs in mobile terminals by special equipment for evidence fixing, samples of malicious programs can be extracted by one button for evidence fixing, so that the evidence of network crime can be retained and the situation that the current grass-roots cadre police can not solve this problem can be solved. Evidence upload function supports uploading malicious program samples extracted by fixed special equipment to the cloud. The way of uploading is through wired connection network, which has higher security than wireless transmission. By using the technology of big data security analysis technology is upgraded from feature-based matching analysis to behavior-based abnormal analysis, including but not limited to same-type word hotspot, homonymic hotspot, flood attack, attack management background, phishing attack and other malicious attacks. Upgrade the detection based on attack characteristics to the depth detection of attack model, from the positioning of security events based on time, place and behavior characteristics to the tracing based on big data, so as to determine the authenticity and source of abnormal attack behaviors and evaluate the harm caused by the attacks. Moreover, for the identified malicious hotspots, their behavior characteristics will be recorded in the sample database by intercepting flows, and be aligned with the information in the knowledge base, so that the knowledge base can be enlarged and is likely to generate feature models of new attacks. Wi-Fi spatial security data analysis of specific environments such as key sites, important spaces and critical conferences, employ a streaming analysis engine based on memory-based complex event processing technology to conduct a real-time correlation analysis of the detected data. Meanwhile, it carries out a real-time data analysis based on machine learning by continuous aggregation engine. 
The algorithms used by the data detection engine include a behavioral-profile learning algorithm and a continuous clustering analysis algorithm, among others, so that the engine can discover current security threats and attacks through real-time analysis; a simplified illustration follows. Through the analysis model, we can realize dynamic crowd monitoring and early-warning analysis of security threats in key areas.
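As a deliberately simplified illustration of such behavior-based analysis, the toy detector below flags a hotspot whose frame rate suddenly deviates from its rolling baseline, which is roughly how a flood attack would surface. The class name, field names, window size and threshold are all hypothetical and are not part of the deployed system.

```python
from collections import deque

class RateAnomalyDetector:
    """Toy behavior-based detector: flags hotspots whose event rate
    deviates sharply from a rolling per-hotspot baseline."""

    def __init__(self, window=60, threshold=4.0):
        self.window = window        # samples kept per hotspot
        self.threshold = threshold  # z-score that triggers an alert
        self.history = {}           # hotspot MAC -> recent rates

    def observe(self, mac, frames_per_sec):
        hist = self.history.setdefault(mac, deque(maxlen=self.window))
        alert = False
        if len(hist) >= 10:         # wait for a baseline to form
            mean = sum(hist) / len(hist)
            var = sum((x - mean) ** 2 for x in hist) / len(hist)
            std = var ** 0.5 or 1e-9
            alert = (frames_per_sec - mean) / std > self.threshold
        hist.append(frames_per_sec)
        return alert

# Hypothetical per-second frame rates; the final burst is flagged.
detector = RateAnomalyDetector()
for rate in [20, 22, 19, 21, 20, 23, 18, 22, 21, 20, 500]:
    flagged = detector.observe("aa:bb:cc:dd:ee:ff", rate)
print("flood-like burst flagged:", flagged)
```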


Finally, based on the analysis of malicious hotspots, we can extract their attack behaviors and block malicious behaviors precisely. First, a malicious hotspot is blocked so that it becomes unavailable, without affecting normal use of trusted hotspots. Second, a particular client connected to a hotspot can be blocked and forced offline without affecting other clients. As a result, the scheme achieves accurate, intelligent blocking of both malicious hotspots and clients.

4 Wireless Network Security Supervision Scheme for Key Sites

Distributed Wi-Fi probes combined with a secure host manage Wi-Fi space security in specific environments such as key sites, important spaces (airports, subways, racetracks, venues, buses) and critical conferences. The scheme implements extensive collection of Wi-Fi terminal, AP, channel and other information in specific areas, and realizes monitoring and blocking of unsafe behaviors such as Wi-Fi private access, illegal client access, illegal sniffing, brute-force cracking, hijacking and phishing. The networking of the sensing equipment is shown in Fig. 2 below. The scheme mainly consists of probe equipment, signal relay equipment, a central control system and sensing devices.

Fig. 2. Information security detection devices based on distributed Wi-Fi probe

The information collected by the Wi-Fi probe covers Wi-Fi terminals, APs and channels, specifically including hotspot name, MAC address, location information and so on. Behavioral-characteristic detection enables analysis of malicious attack behavior, and the distributed probes identify the location of malicious hotspots accurately and achieve blocking of malicious hotspots and clients. Through the security analysis model, we can establish a Wi-Fi space security analysis model for the specific environment of key sites, so as to realize dynamic crowd monitoring and early-warning analysis of security threats. At present, malicious-hotspot positioning based on the triangulation algorithm achieves an accuracy error of no more than 10 m; once malicious attack behavior is discovered, the position of the malicious AP can be located quickly, and the system conducts real-time monitoring and blocking of unsafe behaviors such as Wi-Fi private access, illegal client access, illegal sniffing, brute-force cracking, hijacking and phishing. The scheme has the following functions:

1. Mass collection: the Wi-Fi probe covers a wide range and can collect MAC addresses within it; the data volume is not limited, so MAC addresses can be collected in large quantities.
2. Real-time data transmission: the Wi-Fi probe transmits monitoring data back in real time.
3. Identity matching: as the unique identification code of a phone, the MAC address can be matched against other data.
4. Precise location: malicious hotspots can be located precisely by the triangulation algorithm (see the sketch following this list).
5. Malicious behavior blocking: malicious hotspots and clients can be blocked.
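To make the localization function concrete, the sketch below shows one standard way such triangulation can be implemented: each probe converts the received signal strength (RSSI) of the rogue hotspot into a distance estimate via a log-distance path-loss model, and the position is then obtained by linearized least squares. This is a minimal sketch under assumed propagation parameters, not the system's actual implementation; the probe coordinates, RSSI values and path-loss constants are hypothetical.

```python
import numpy as np

def rssi_to_distance(rssi, rssi_at_1m=-40.0, path_loss_exp=3.0):
    """Log-distance path-loss model: RSSI = RSSI(1m) - 10*n*log10(d)."""
    return 10 ** ((rssi_at_1m - rssi) / (10 * path_loss_exp))

def trilaterate(probes, distances):
    """Least-squares position estimate from >= 3 probes.

    Subtracting the first circle equation from the others linearizes
    the system into A @ [x, y] = b.
    """
    (x0, y0), d0 = probes[0], distances[0]
    A, b = [], []
    for (xi, yi), di in zip(probes[1:], distances[1:]):
        A.append([2 * (xi - x0), 2 * (yi - y0)])
        b.append(d0**2 - di**2 + xi**2 - x0**2 + yi**2 - y0**2)
    pos, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return pos

# Hypothetical probe positions (metres) and observed RSSI of a rogue AP.
probes = [(0.0, 0.0), (30.0, 0.0), (0.0, 30.0)]
rssi = [-62.0, -70.0, -55.0]
dists = [rssi_to_distance(r) for r in rssi]
print("estimated AP position:", trilaterate(probes, dists))
```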

5 Conclusion and Future Work

This paper presents an in-depth study of Wi-Fi space security management in particular environments. We implement extensive collection of Wi-Fi terminal, AP, channel and other information in specific areas, and realize overall monitoring and blocking of unsafe behaviors such as Wi-Fi private access, illegal client access, illegal sniffing, brute-force cracking, hijacking and phishing. Distributed probe devices carry out comprehensive asset management of Wi-Fi-related devices in specific areas under centralized control from the management centre, so that security managers can spot Wi-Fi-related security behaviors immediately and handle problems in a timely manner.

Acknowledgments. This work was supported in part by Guangdong Province Key Research and Development Plan (Grant No. 2019B010137004), the National Key Research and Development Plan (Grant No. 2018YFB1800701, No. 2018YFB0803504, and No. 2018YEB1004003), and the National Natural Science Foundation of China (Grant No. U1636215 and 61572492).


Review of the Electric Vehicle Charging Station Location Problem

Yu Zhang1, Xiangtao Liu2(&), Tianle Zhang1,2, and Zhaoquan Gu1,2

1 Guangzhou University, Guangzhou 510006, China
2 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
[email protected]

Abstract. To encourage energy conservation and emission reduction, electric vehicles (EVs) have gradually become one of the most important emerging strategic industries in many countries. With the gradual maturing of battery technology, policy support for the new-energy industry and the marketization of EVs in many countries, the location of charging stations has become part of urban development strategies; vehicle life and battery life remain the bottleneck hindering EV development, so the key to the problem lies in the reasonable siting and deployment of charging stations. The charging station location problem has therefore become a research hotspot in recent years. EV charging station location is essentially an application scenario of the facility location problem, a long-standing problem for which researchers have proposed many classic models and algorithms. This paper reviews the research progress and open problems of charging station location in recent years from the aspects of models and algorithms, and finally gives some future research directions.

Keywords: Location model · Optimization algorithm · Charging station · Electric vehicle · Optimization

1 Introduction

The location problem is a classic operations research problem. The facility location problem can be summarized in one sentence: select appropriate points from a candidate set so as to optimize all objectives. Site selection theory officially began in 1909, when Weber proposed determining the location of a warehouse so as to minimize the distance from the warehouse to its customers [1]; this marked the birth of the facility location problem. In 1964, Hakimi proposed the classic P-median and P-center problems [2], a milestone that carried the facility location problem into a new era. Since then, articles on facility location have sprung up: more and more scholars study this field and have proposed various models and algorithms, each with its own advantages and disadvantages in different periods and scenarios.

In the 1990s, with the booming development of EVs, the demand for EV charging stations increased. Governments in various countries increased their support
for the EV industry, capital flowed into the field, and EV charging station location became an area of concern to both industry and academia. On the basis of the traditional facility location problem, the EV charging station location problem was born; it differs from the traditional problem in that its considerations are more complex and the problem is harder. During this period, charging station siting was mainly based on traditional facility location methods, with specific consideration of the factors influencing charging stations, such as owners' driving habits, power grid distribution and traffic flow; adaptive algorithms were used to seek the optimal solution through repeated iterations. In recent years, with the vigorous development of the EV market and further advances in battery technology, many new methods and ideas have emerged in this field, with many innovations in both model construction and algorithms. However, existing reviews of the EV charging station location problem either offer only simple statistics and data analysis of the available material or merely summarize the algorithms or the models. The purpose of this paper is to make a complete and objective summary and to analyze the entire landscape of EV charging station location from a macro perspective. This paper classifies the classic location problems from two aspects and expounds their respective development history and latest developments through careful study of dozens of related papers published after 2000. Then, building on the traditional facility location problem, the EV charging station location models and algorithms are analyzed in turn. Finally, some future directions for EV charging station location research are proposed.

2 EV Charging Station Location Model

2.1 Influencing Factors of EV Charging Location

Single Influencing Factor. The influencing factors of the location model are the basic premise of siting decisions, and much research on EV charging station location focuses on identifying and optimizing them. Combing through existing research ideas, the factors affecting site selection can be summarized in the following three points:

Charging Station's Own Factors. Construction costs, operating costs, maintenance costs, etc.

Charging Station-Consumer Factors. User travel behavior and charging behavior, including charging demand generated at fixed points or while travelling; travel mileage, starting point and road traffic; and user preferences, such as whether a user is willing to detour to a charging station with many spare spots to receive service.

Charging Station-Supplier Factors. Power grid distribution, grid losses, transmission line distance, grid peaks, etc.


Among the above three points, the second is the focus of research: how to arrange the location and type of charging stations so as to maximize quality of service while minimizing user waiting time, user expenditure and costs is the central question of the EV charging station location problem.

Multiple Influencing Factors. Beyond quality of service or the cost of charging stations, some scholars have begun to take multiple factors into account and have proposed multi-objective location models. Liu et al. took geographical factors and service radius into account and established a dual-objective optimization model minimizing the sum of charging station cost and network loss cost [3]. Sun et al. considered space-time constraints and, combining the driving limitations and charging demand characteristics of EVs from the perspective of the spatio-temporal distribution of travelers' charging demand and their charging decision process, established the Time-Spatial Location Model (TSLM) [4]. Some researchers focus on whole-society cost. Ge et al. fully considered road network structure, traffic flow information and user path loss, took the interests of both charging station operators and charging users into account, and proposed a planning model that minimizes the cost to society as a whole [5]. Huang et al. considered service capacity, extended it to the number of charging stations and the power consumption quota, and established a new integer programming model for charging station location [6]. Cai et al. considered driver behavior and established a model of EV spatial charging demand based on the idea of travel chains [7]. In addition, charging station-supplier factors are also a research focus. Guo paid attention to the impact of charging stations on grid power quality, mainly the harmonics generated by the charging process, calculating the power flow of the distribution network and charging stations against the actual situation; the layout plan is divided into a public-welfare demonstration stage and a commercial operation stage, with corresponding planning methods proposed for the characteristics of each [8]. Chen et al. took the entire power supply network as the research object, including the supply head end, candidate charging station nodes and traditional load nodes, and solved the plan for connecting EV charging stations to the grid with an improved tree-structure-coded single-parent genetic algorithm [7]. Wang considered pre-selected charging station points, set the candidate addresses on nodes where the traffic network and the distribution system coincide, and set the objective function to maximize captured traffic flow while minimizing distribution network loss and node voltage offset [9]. Some scholars have considered environmental factors. Chen et al. proposed a multi-objective location and capacity planning model for EV charging stations that takes carbon emissions into account: charging station construction and running costs, user charging time, and the carbon emissions generated by driving to charging stations are jointly optimized, with the capacity limit of the charging station as the constraint, and Pareto-optimal front analysis is used to select candidate siting solutions [10].
Liu considered that distributed power supplies provide power to both distribution network loads and EV charging stations, and established an EV charging station capacity-location model incorporating distributed power supply, with the minimum total cost, the
minimum network loss and the highest traffic satisfaction as objectives [11]. Suo et al. combined the planning of centralized charging stations with distribution network dispatch and established a two-level planning model for centralized charging stations that considers load shifting; the charging station's own operation is also a consideration [12].

2.2 Modeling Method of Charging Station Location

Basic Modeling Method. Charging station location is a subset of the facility location problem, so before studying it, it is necessary to analyze and summarize the classic facility location problems. After several decades of development, the facility location problem can be divided into the following categories: the P-median problem, the P-center problem and the Maximum Coverage problem [13]. Their respective characteristics are summarized in Table 1.

Table 1. Facility location problems

Location problem  | Meaning                                          | Scenario                   | Common algorithm
P-median          | P points that make the average effect optimal    | Public facilities          | Lagrangian relaxation algorithm, etc.
P-center          | P points that make the worst result optimal      | Emergencies, military      | Drezner-Wesolowsky method
Maximum coverage  | Minimum facilities/cost covering the most demand | Inclusive facility service | Branch and Bound, Genetic Algorithm, etc.

In addition to classification by optimization target, the facility location problem can be divided into Continuous Location, Network Location and Discrete Location according to where candidate points may be placed. Continuous Location can place facilities at any point in the candidate space; Discrete Location can place facilities only on defined discrete points; and Network Location is a special case that places facilities on the vertices and edges of a network. Network site selection was proposed by Shier in 1977 and was first applied to the siting of 'mutual threats' [14], with the optimization goal of locating P service facilities in the network so that the shortest distance between any two facilities is maximized. EV charging station location research generally uses a grid-based approach to divide continuous areas into discrete points. For example, Jun et al. divide the charging station planning area under study into blocks, determine the charging demand coefficient of each block, and then select the locations of charging stations [15]. The purpose is to reduce computational difficulty and improve operability, achieving a better balance between theory and practice (a toy sketch of this discretization is given below). On top of the traditional location classification, EV charging station location is further divided into Point Demand, Flow Demand and Mixed Models.
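As an illustration of such grid-based discretization, the sketch below divides a continuous planning area into blocks and computes each block's charging demand coefficient from point demand data. It is a minimal sketch, not the method of [15]; the area size, block count and demand points are hypothetical.

```python
import numpy as np

def block_demand_coefficients(points, bbox, n_blocks=10):
    """Discretise a continuous planning area into an n x n grid and
    compute each block's share of total charging demand.

    points: (N, 2) array of demand locations; bbox: (xmin, ymin, xmax, ymax).
    """
    xmin, ymin, xmax, ymax = bbox
    counts, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=n_blocks,
        range=[[xmin, xmax], [ymin, ymax]])
    return counts / counts.sum()   # demand coefficient per block

# Hypothetical demand points drawn inside a 10 km x 10 km area.
rng = np.random.default_rng(0)
demand = rng.uniform(0, 10_000, size=(500, 2))
coeff = block_demand_coefficients(demand, (0, 0, 10_000, 10_000))
print(coeff.round(3))
```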


Point Demand. The Point Demand model assumes that charging demand is generated at fixed points such as residences and workplaces, and arranges charging stations under this assumption [16]. The optimization goal is to arrange stations so that the number of demand points served is maximized. The advantage of the Point Demand model is that it is simple, easy to understand and simulates fixed user requirements well; but precisely because of its simplicity, it cannot capture charging demand generated during trips, and the spatio-temporal attributes of demand are not considered.

Traffic Demand. The Traffic Demand model, also known as the Flow-Capturing Location Model (FCLM), was first proposed by Hodgson in 1990 [17]. It assumes that demand is generated during travel, that is, people develop charging demand while driving. FCLM abstracts user travel into travel flows and places a charging station at a point on a flow; the flow is then considered captured by that station, and the optimization goal is to capture the most travel flow. Traffic Demand fits user behavior well, making station locations more realistic, but the model considers neither the impact of traffic volume nor the capacity limits of charging stations. In a complex urban traffic network the size and direction of flows vary greatly, and a simple traffic demand model does not take this into account.

Mixed Model. Hodgson and Rosing extended Point Demand and Traffic Demand into a hybrid goal-programming model that combines the two and extends the optimization to dual targets: maximizing captured flow and minimizing the sum of weighted distances [18]. Kuby and Lim extended FCLM to FRLM (Flow Refueling Location Model) based on vehicle range; the model places demand flows on the shortest paths of the OD matrix, but because OD traffic values are not easily available, its application is limited [19].
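The flow-capturing idea underlying the Traffic Demand model and its extensions can be made concrete with a small sketch: travel flows are given as node paths with volumes, a flow counts as captured once any node on its path hosts a station, and stations are placed greedily to capture the largest remaining volume. This greedy heuristic is only an illustration of the capture logic, not the exact formulation of [17]; the flows below are hypothetical.

```python
def greedy_fclm(flows, p):
    """flows: list of (path_nodes, volume); p: number of stations.
    A flow is 'captured' once any node on its path hosts a station."""
    stations, remaining = set(), list(flows)
    for _ in range(p):
        best_node, best_gain = None, 0.0
        candidates = {n for path, _ in remaining for n in path}
        for node in candidates:
            gain = sum(v for path, v in remaining if node in path)
            if gain > best_gain:
                best_node, best_gain = node, gain
        if best_node is None:
            break
        stations.add(best_node)
        remaining = [(path, v) for path, v in remaining
                     if best_node not in path]
    return stations

# Hypothetical OD flows over a toy road network.
flows = [(["a", "b", "c"], 40), (["c", "d"], 25), (["b", "e"], 10)]
print(greedy_fclm(flows, p=1))   # {'c'} captures 65 of 75 units
```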

2.3 Improved Modeling Method

Multilevel Modeling. With the development of charging technology, charging stations of different levels, such as fast charging, slow charging and battery swapping, have emerged, so the research has advanced with the times. Zhang et al. proposed a multi-level charging station location model and algorithm based on tabu search; they designed a combination of tabu coding and initial-solution construction and used a 2-opt neighborhood search strategy to iterate towards the optimal solution [20]. Li et al. considered the joint siting of charging stations and the distribution network in battery-swapping mode, combining traditional centralized charging stations with battery-swapping stations; the model minimizes the total weighted distance for users to swap batteries and the construction, operation and maintenance costs of the charging and swapping facilities, with user requirements, active power balance and line transmission
capacity as constraints [21]. Duan et al. proposed a charging station location and capacity determination method that takes the complementarity between different types of charging piles into account, aiming at the shortest charging distance for users; on this basis, they proposed a multi-day charging load prediction method for EVs [22].

Interdisciplinary Modeling. In recent years, research hotspots in location problems have gradually gathered around subject fusion, combining other decision-making methods, such as game theory and queuing theory, with traditional charging station location methods. Song introduced queuing theory to study the number and scale of public fast charging stations on top of a location algorithm, establishing a charging station location decision model that combines the gravity center method with facility location theory [23]. Zhou et al. introduced game theory on the basis of the traditional location model, established a non-cooperative static complete-information game model, compared charging station layouts, and realized a collaborative optimization system [24]. Liu integrated the Delphi method and grey analytic hierarchy process (GAHP) into a new comprehensive evaluation method and applied it to the optimal decision of EV charging station siting [25].

3 Location Algorithm

3.1 Traditional Algorithm

Since the charging station location problem is essentially NP-hard, it is difficult to find the optimal solution in polynomial time, so solutions mainly rely on various adaptive algorithms, chiefly Genetic Algorithms (GA), Particle Swarm Optimization (PSO), Tabu Search (TS) and Simulated Annealing (SA). However, because a single traditional adaptive algorithm suffers from local optima and premature convergence, and because model complexity grows with the number of considerations, the location algorithm itself is also a focus of current research, which proceeds mainly along the following lines [26].
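As a concrete example of such an adaptive algorithm, the sketch below applies simulated annealing to a small p-median-style siting instance: at each step one selected site is swapped at random, worse solutions are accepted with a temperature-dependent probability, and the temperature cools geometrically. It is a simplified sketch, not any cited method; the coordinates, initial temperature and cooling rate are hypothetical.

```python
import math, random

def pmedian_cost(sites, demands):
    """Sum over demand points of the distance to the nearest open site."""
    return sum(min(math.dist(d, s) for s in sites) for d in demands)

def anneal_pmedian(candidates, demands, p, t0=100.0, cooling=0.995,
                   steps=5000, seed=1):
    rng = random.Random(seed)
    current = rng.sample(candidates, p)
    best, best_cost = list(current), pmedian_cost(current, demands)
    cost, t = best_cost, t0
    for _ in range(steps):
        neighbour = list(current)
        neighbour[rng.randrange(p)] = rng.choice(candidates)  # swap one site
        new_cost = pmedian_cost(neighbour, demands)
        # Accept improvements always; worse moves with prob e^(-delta/T).
        if new_cost < cost or rng.random() < math.exp((cost - new_cost) / t):
            current, cost = neighbour, new_cost
            if cost < best_cost:
                best, best_cost = list(current), cost
        t *= cooling
    return best, best_cost

# Hypothetical candidate sites and demand points on a unit square.
rng = random.Random(0)
cands = [(rng.random(), rng.random()) for _ in range(20)]
dem = [(rng.random(), rng.random()) for _ in range(100)]
print(anneal_pmedian(cands, dem, p=3))
```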

3.2 Improved Algorithm

Algorithm Combination. On the basis of a determined objective function, Liu et al. proposed solving the location problem with a quantum particle swarm optimization algorithm: the superposition and probabilistic-expression characteristics of quantum theory are used to search for the optimal particle position with quantum rotating gates, while quantum NOT gates implement particle position mutation to avoid premature convergence, increasing population diversity, global optimization ability and efficiency [27]. Feng et al. proposed a new multi-population hybrid genetic algorithm (MPHGA) that combines the standard genetic algorithm (SGA) with the Alternate Location-Allocation (ALA) algorithm. For the multiple objectives of charging station planning, a
multi-population concept is adopted, establishing several populations that conduct co-evolutionary search [28]. Gao et al. combined an immune algorithm with the fuzzy analytic hierarchy process to propose a two-step optimized location method for urban EV charging stations: (1) based on analysis of road-segment charging requirements, an immune algorithm searches the planning area for candidate road sections to be built; (2) for the candidate sites on those sections, factors such as geography and the distribution network are quantified with the fuzzy analytic hierarchy process, and the optimal location scheme is determined after comprehensive evaluation [29].

Artificial Neural Network. The high performance of neural networks, especially under multivariate optimization conditions, has attracted researchers' attention, and some scholars have tried to use neural networks for the facility location problem. Ma et al. used a Hopfield artificial neural network to determine the location of a logistics distribution center, converting the problem into an energy function and mapping the problem variables to network states to optimize the transportation route [30]. Li used a fuzzy neural network to evaluate distribution center locations: an application network model is established with historical data of operating logistics distribution centers as input values and expert evaluations as output values; the network is encoded and trained in MATLAB, the parameters of the fuzzy neural network are varied across repeated experiments, and the experimental results are compared to choose the optimal BP fuzzy neural network [31].

Computational Geometry. Because facility location selects several nodes in a network of points, abstracting real-world siting requirements into the Voronoi diagram of computational geometry is also a key research direction. Tang et al. introduced Voronoi diagrams from computational geometry into charging station planning, solving the problem of dividing service ranges when EVs are unevenly distributed [32]. Ge et al. considered users' path loss in charging station layout and used a weighted Voronoi diagram to divide charging station service ranges, optimizing the planning scheme with the goal of minimizing whole-society cost [33]. Suo et al. solved the problem with a hybrid intelligent algorithm combining an improved genetic algorithm with adaptive particle swarm optimization, applying the weighted Voronoi diagram to divide the service areas of centralized charging stations. Song et al. used the global optimization ability of the differential evolution algorithm to determine the locations and construction levels that maximize charging station revenue [34]. Qu et al. established a charging station location and sizing model with the goal of minimizing whole-society cost and adopted a chaotic simulated annealing particle swarm optimization algorithm to find optimal EV charging stations considering road network structure, traffic flow information and user cost.
The planning model uses the weighted Voronoi diagram to divide the planning area into service areas, introduces chaos theory for dynamic assignment, and combines
the probabilistic jump ability of the simulated annealing algorithm, giving the algorithm higher global optimization ability; the validity and practicability of the planning scheme are verified by numerical examples [35].
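The weighted Voronoi partition recurring in these works can be sketched simply: each demand point is assigned to the station that minimizes distance divided by a station weight (for example, capacity), so more heavily weighted stations claim larger service areas. A minimal sketch with hypothetical coordinates and weights:

```python
import math

def weighted_voronoi_assign(demand_points, stations):
    """stations: list of ((x, y), weight); bigger weight -> larger cell.
    Assigns each demand point to the station with minimal d(p, s)/w_s
    (a multiplicatively weighted Voronoi partition)."""
    assignment = {}
    for p in demand_points:
        assignment[p] = min(
            range(len(stations)),
            key=lambda i: math.dist(p, stations[i][0]) / stations[i][1])
    return assignment

stations = [((0.0, 0.0), 2.0), ((10.0, 0.0), 1.0)]   # hypothetical
points = [(2.0, 1.0), (6.0, 0.0), (9.0, 1.0)]
print(weighted_voronoi_assign(points, stations))
```

In this example the middle point is assigned to the heavier station even though it lies closer to the lighter one, which is exactly the effect the weighting is meant to produce.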

4 Conclusions and Prospects

The charging station location problem is a new application scenario of the traditional location problem. As a classic operations research problem abstracted from real scenes, the location model has always been designed closely around real application scenarios, and the essence of facility location is multi-objective optimization. In the 21st century, serious air pollution caused by fuel vehicles and advances in battery technology have led to large-scale commercialization of EVs: car companies at home and abroad, such as BYD and Tesla, have invested heavily in EV development, and national policies strongly support it. The location of charging stations is thus a new-era application of the location problem. The Point Demand, Traffic Demand and Mixed Models of charging station location are all abstractions and simplifications of real scenes. Overall, siting objectives have moved from single to multiple, from the charging station-demander side to the two-tier supplier-charging station-demander relationship, and even to the overall interests of society: environmental pollution, carbon emissions and other environmental factors. Charging station location algorithms have moved from single algorithms to combinations of algorithms, and from traditional adaptive algorithms to interdisciplinary approaches, introducing knowledge from game theory and computational geometry to overcome the convergence difficulty and premature convergence of single adaptive algorithms [36]. Analysis of existing research shows that charging station siting closely follows social development and technological change [37]. Finally, based on an understanding of existing research and a prediction of future developments, several research suggestions are proposed.

4.1 Location Models

The growing number of EVs will generate a great deal of data, including travel trajectories, travel frequency, etc.; such large-scale data contains a large amount of high-value-added information for siting research. The location model can evolve from traditional static siting to dynamic siting [38]. With the introduction of mobile charging vehicles, new charging modes have appeared alongside the original fast and slow charging, and these new modes bring new considerations to the charging station location problem [39]. In addition, multi-factor location models can be proposed by integrating user behavior, road conditions and grid peak values; of course, multi-factor models place higher demands on the algorithm, requiring the original algorithms to be improved into innovative new charging station location algorithms [40].

4.2 Location Algorithms

At present, charging station location algorithms are mainly traditional adaptive ones represented by genetic algorithms. These algorithms have flaws, including poor local optima, slow calculation and large errors when dealing with large-scale data. Deep learning algorithms that have arisen in recent years perform excellently on large-scale data sets and can be introduced into charging station location. Deep-learning-based location algorithms can cover multiple factors, select model features autonomously, and evolve over time [41].

4.3 Large-Scale Verification

Existing charging station location models and algorithms mainly stay at the theoretical level and remain some distance from practical application; problems include over-theorization of the models, difficulty of application and a lack of deductive power [42]. Future research can focus on verifying location model performance through simulation in large-scale vehicle movement scenarios, verifying the functional performance of the models in real environments, and providing visualization of results for charging station location models and algorithms.

Acknowledgments. This work was supported in part by Guangdong Province Key Research and Development Plan (Grant No. 2019B010137004), the National Key Research and Development Plan (Grant No. 2018YFB1800701, No. 2018YFB0803504, and No. 2018YEB1004003), and the National Natural Science Foundation of China (Grant No. U1636215 and 61572492).

References

1. Weber, A.: Theory of the Location of Industries (trans. by C.J. Friedrich from Weber's 1909 book). The University of Chicago Press, Chicago (1929)
2. Hakimi, S.L.: Optimum locations of switching centers and the absolute centers and medians of a graph. Oper. Res. 12, 450–459 (1964)
3. Zhipeng, L., Fulian, W., Yusheng, X., et al.: Optimal location and capacity of electric vehicle charging station. Power Syst. Autom. 36(3) (2012)
4. Sun, X., Liu, K., Zuo, Z.: A spatiotemporal location model for locating electric vehicle charging stations. Prog. Geogr. 31(6), 686–692 (2012)
5. Ge, S.-Y., Feng, L., Liu, H., et al.: Planning of electric vehicle charging stations considering users' convenience. Adv. Technol. Electr. Eng. Energy 33, 71–75 (2014)
6. Zhensen, H., Jun, Y.: Problem of location electric vehicle refueling stations with service capacity. Ind. Eng. Manag. 20(5), 111–118 (2015)
7. Chen, T., Wei, Z.-N., Wu, S., et al.: Distribution network planning by considering siting and sizing of electric vehicle charging stations. In: Proceedings of the Chinese Society of Universities for Electric Power System and Its Automation (2013)
8. Yandong, G.: Research on the layout of charging station for electric vehicle in the city. North China Electric Power University (2013)
9. Hui, W.: Planning and operation of electric vehicle charging stations. Zhejiang University (2013)


10. Chen, G., Mao, Z., Li, J., et al.: Multi-objective optimal planning of electric vehicle charging stations considering carbon emission. Autom. Electr. Power Syst. 38, 49–53 (2014)
11. Liu, B., Liu, X., Li, J., et al.: Multi-objective planning of distribution network containing distributed generation and electric vehicle charging stations. Power Syst. Technol. 39(2), 450–456 (2015)
12. Li, S., Wei, T., Muke, B., et al.: Locating and sizing of centralized charging stations in distribution network considering load shifting. Proc. CSEE 34, 1052–1060 (2014)
13. Yang, F.-M., Hua, G.-W., Deng, M., et al.: Some advances of the researches on location problems. Oper. Res. Manag. Sci. 14, 1–7 (2005)
14. Shier, D.R.: A min-max theorem for p-center problems on a tree. Transp. Sci. 11, 243–252 (1977)
15. Yang, J., Liao, B., Wang, X., et al.: Planning of charging facilities of electric vehicles based on geographical zonal charging demand coefficients. Electr. Power Constr. 36, 52–60 (2015)
16. Liu, K., et al.: Research review on optimization methods of electric vehicle charging station layout. J. Wuhan Univ. Technol. Traffic Sci. Eng. 3, 523–528 (2015)
17. Hodgson, M.J.: A flow-capturing location-allocation model. Geogr. Anal. 22(3), 270–279 (1990)
18. Hodgson, M.J., Rosing, K.E.: A network location-allocation model trading off flow capturing and p-median objectives. Ann. Oper. Res. 40(2), 247–260 (1992)
19. Kuby, M.J., Lim, S.: The flow-refueling location problem for alternative-fuel vehicles. Socio-Econ. Plan. Sci. 39(2), 125–145 (2005)
20. Zhang, G.-L., Li, B., Wang, Y.-F.: Location and algorithm of multi-level electric vehicle charging stations. J. Shandong Univ. Eng. Sci. 41, 136–142 (2011)
21. Li, G., Zhang, Z.S., Wen, L.Y.: Planning of battery-switching and vehicle-charging network based on battery switching mode. Power Syst. Prot. Control 41(20), 93–98 (2013)
22. Duan, Q., Sun, Y., Zhang, X., et al.: Location and capacity planning of electric vehicles charging piles. Power Syst. Prot. Control (2017)
23. Song, Y.: Research on the layout planning of electric vehicle charging station in the city. Beijing Jiaotong University (2011)
24. Zhou, H.-C., Li, H.-F.: Optimization model of electric vehicle charging station siting based on game theory. Sci. Technol. Ind. 11(2), 51–54 (2011)
25. Liu, Y., Zhou, B., Feng, C., et al.: Application of comprehensive evaluation method integrated by Delphi and GAHP in optimal siting of electric vehicle charging station. In: International Conference on Control Engineering and Communication Technology (ICCECT). IEEE Computer Society (2012)
26. Cao, Q., Chen, W.: Review of the research on emergency facility location problem. Department of Logistics Command, Army Logistics University, Chongqing 401331, China
27. Liu, Z., Zhang, W., Wang, Z.: Optimal planning of charging station for electric vehicle based on quantum PSO algorithm. Proc. CSEE 32(22), 39–45 (2012)
28. Chao, F., Bu-Xiang, Z., Nan, L., et al.: Electric vehicle charging station planning based on multiple-population hybrid genetic algorithm. In: Proceedings of the CSU-EPSA (2013)
29. Gao, Y.-J., Guo, Y.-D., Li, T.-T.: Optimal location of urban electric vehicle charging stations using a two-step method. Electr. Power 46, 143–147 (2013)
30. Bin, H.: Application of Hopfield artificial nerve network for place selection optimization of delivery center of logistics. Modular Mach. Tool Autom. Manuf. Tech. 3, 24–29 (2003)
31. Ping, L.: Application of fuzzy neural network in location selection of distribution centers. Shanxi Electronic Technology (2018)
32. Tang, X., Liu, J., Liu, Y., et al.: Electric vehicle charging station planning based on computational geometry method. Autom. Electr. Power Syst. 36, 24–30 (2012)


33. Ge, S.-Y., Feng, L., Liu, H., Wang, L.: An optimization approach for the layout and location of electric vehicle charging stations. Electr. Power 45(11), 96–101 (2012)
34. Song, Z., Wang, X., Lun, L., et al.: Site planning of EV charging stations based on the maximum profits. J. East China Jiaotong Univ. 31, 50–55 (2014)
35. Rui, Q., Ruoyu, L., Yichu, L., Zhongliang, Z.: Charging station planning scheme of electric vehicle based on chaotic simulated annealing particle swarm optimization, vol. 10, pp. 41–46 (2019)
36. Wang, Q., Duan, G., Luo, E., Wang, G.: Research on Internet of vehicles' privacy protection based on tamper-proof with ciphertext. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K.-K.R. (eds.) SpaCCS 2017. LNCS, vol. 10656, pp. 42–55. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72389-1_4
37. Si, G., Guan, Z., Li, J., Liu, P., Yao, H.: A comprehensive survey of privacy-preserving in Smart Grid. In: Wang, G., Ray, I., Alcaraz Calero, J.M., Thampi, S.M. (eds.) SpaCCS 2016. LNCS, vol. 10066, pp. 213–223. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49148-6_19
38. Tian, Z., et al.: Real time lateral movement detection based on evidence reasoning network for edge computing environment. IEEE Trans. Industr. Inf. 15(7), 4285–4294 (2019)
39. Tian, Z., Li, M., Qiu, M., Sun, Y., Su, S.: Block-DEF: a secure digital evidence framework using blockchain. Inf. Sci. 491, 151–165 (2019). https://doi.org/10.1016/j.ins.2019.04.011
40. Tian, Z., Gao, X., Su, S., Qiu, J., Du, X., Guizani, M.: Evaluating reputation management schemes of internet of vehicles based on evolutionary game theory. IEEE Trans. Veh. Technol. 68(6), 5971–5980 (2019)
41. Tian, Z., Su, S., Shi, W., Du, X., Guizani, M., Yu, X.: A data-driven method for future internet route decision modeling. Future Gener. Comput. Syst. 95, 212–220 (2019)
42. Tan, Q., Gao, Y., Shi, J., Wang, X., Fang, B., Tian, Z.: Toward a comprehensive insight into the eclipse attacks of tor hidden services. IEEE Internet Things J. 6(2), 1584–1593 (2019)

Structural Vulnerability of Power Grid Under Malicious Node-Based Attacks

Minzhen Zheng1, Shudong Li2(&), Danna Lu1, Wei Wang1, Xiaobo Wu3(&), and Dawei Zhao4

1 School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China
2 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
[email protected]
3 School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
[email protected]
4 Shandong Provincial Key Laboratory of Computer Networks, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250014, China

Abstract. In recent years, power grid collapses in many countries have not only brought great inconvenience to daily life but also caused huge economic losses. It is therefore particularly important to analyze the structural vulnerability of power grids. In this paper, the US power grid, with 4941 nodes and 6594 edges, is taken as the example. The network is attacked by deleting some percentage of nodes ranked by degree, k-shell value, betweenness centrality, and clustering coefficient, respectively. The largest connected component G, efficiency E, and average distance L are analyzed to measure the vulnerability of the US power grid. The simulation results show that, in view of the largest connected component G and efficiency E, the betweenness centrality-based attack is more destructive to the network structure than the other attacks, while the attack based on the aggregation coefficient is the least destructive.

Keywords: Complex networks · US power grid · Vulnerability · Largest connected component · Robustness

1 Introduction

In June 2019, Argentina and Uruguay suffered a massive power outage: a failure in the Argentine grid affected the connected system and left the entire country, and several provinces in neighboring countries, unable to supply electricity, causing huge economic losses. It is especially important to study the impact of different degrees of collapse on the structure of critical infrastructure, especially power grids [1, 2]. It is therefore necessary to find the vulnerable nodes of the entire network under node-based attacks. Vulnerability is a recently introduced concept for assessing the

survivability and reliability of network systems [3]. It has been applied to power networks, aviation networks, and transportation networks [4, 5]. At present, the most important direction in grid vulnerability research is to use network theory to judge the vulnerability of the grid and predict the possibility of cascading failure. Internationally, there has been research on safe grid loads [6], distributed power sources that reduce the grid's burden and thereby its vulnerability [7], power supply composition [8], grid structures [9], grid operations [10], and the impact of important transmission channels on vulnerability [11]. In this paper, we consider the electrical power grid of the western United States, which has 4941 nodes and 6594 edges [12], and use vulnerability metrics to comprehensively evaluate its structural survivability. We simulate intentional attacks by deleting some percentage of nodes ranked by degree, k-shell value, betweenness centrality and clustering coefficient, respectively. Three metrics (largest connected component, efficiency and average path length) are introduced to explore the vulnerable nodes of the US power grid under the different node-based attacks. It should be pointed out that degree, k-shell value, betweenness centrality and clustering coefficient represent different characteristics of nodes, so the vulnerable nodes can be identified under these node-based attacks. The rest of this paper is organized as follows. Section 2 introduces the definitions and calculation methods of the vulnerability metrics and the node-attack reference indicators. Section 3, taking the US power grid as the example, attacks nodes by the different indicators and calculates the largest connected component, efficiency and average path length of the entire network after each attack to obtain simulation results. Section 4 discusses the simulation results and the network vulnerability analysis [13–15].

2 The Measuring Metrics

2.1 Largest Connected Component G

The largest connected component is the connected component of the network that contains the largest number of nodes. If S is the number of nodes in this component and N is the total number of nodes, the largest-connected-component ratio of the network is [16]:

G = \frac{S}{N}    (1)

2.2 Efficiency E

Efficiency measures the information transmission capability between network nodes: the shorter the shortest paths d_{ij}, the higher the efficiency, so the efficiency is:

E = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}}    (2)

2.3 Average Path Length L

The average path length is the average number of edges traversed in passing from one node to another, so it is defined as [17]:

L = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij}    (3)
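These three metrics can be computed directly with standard graph tooling. The sketch below, assuming the networkx library, evaluates G, E and L for a graph; since L in Eq. (3) is undefined on a disconnected graph, it is computed here on the largest component. The toy random graph merely stands in for the 4941-node US grid edge list.

```python
import networkx as nx

def structural_metrics(G, n_original):
    """Largest-connected-component ratio, efficiency, average path length."""
    if G.number_of_nodes() == 0:
        return 0.0, 0.0, 0.0
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    g_ratio = giant.number_of_nodes() / n_original       # Eq. (1)
    efficiency = nx.global_efficiency(G)                 # Eq. (2)
    avg_len = nx.average_shortest_path_length(giant)     # Eq. (3), on the giant component
    return g_ratio, efficiency, avg_len

# Toy stand-in for the US power grid.
G = nx.erdos_renyi_graph(200, 0.02, seed=42)
print(structural_metrics(G, n_original=200))
```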

3 Different Node-Based Attacks

3.1 Betweenness Centrality Bi

Betweenness centrality measures each node's contribution to the connectivity between other nodes: it counts how many of the shortest paths between all other node pairs pass through the node. The betweenness centrality B_i of node i is defined as:

B_i = \sum_{a \neq b} \frac{\sigma_{ab}(i)}{\sigma_{ab}}    (4)

where \sigma_{ab} is the number of shortest paths between nodes a and b, and \sigma_{ab}(i) is the number of those paths passing through node i.

3.2 Aggregation Coefficient Ci

The aggregation coefficient is an important parameter for measuring the degree of aggregation of network nodes, reflecting the degree of small grouping in the network. If a node has K_i neighbors, the total number of possible edges among these neighbors is K_i(K_i - 1)/2, and the actual number of edges among them is E_i, so the aggregation coefficient of each node is:

C_i = \frac{2 E_i}{K_i (K_i - 1)}    (5)

3.3 K-shell Value

K-shell is an indicator used to describe the influence of a node [18, 19]. First, nodes of degree 1 are deleted one by one; after each deletion the degrees of all remaining nodes are recalculated, and deletion continues until no node of degree 1 remains in the network. All nodes removed in this pass are assigned a k-shell value of 1. The process is then repeated for degrees 2, 3, and so on, until every node of the network has been deleted and assigned its k-shell value.
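The iterative pruning just described is the standard k-core decomposition, which networkx exposes directly; the sketch below (on a toy graph rather than the real grid) recovers each node's k-shell value and groups the nodes by shell.

```python
import networkx as nx

G = nx.karate_club_graph()     # toy stand-in for the power grid
kshell = nx.core_number(G)     # node -> k-shell value via iterative pruning

# Nodes in shell k are those whose core number equals k.
shells = {}
for node, k in kshell.items():
    shells.setdefault(k, []).append(node)
print(sorted(shells))
```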

4 Simulation Result Analysis

In many cases, the collapse of a national grid is caused by the failure of a very small number of nodes, which triggers the collapse of the entire grid, so it is meaningful to study the impact of such attacks on the whole network structure. The largest connected component describes the connectivity of the entire network. Efficiency measures the capability of information transmission between nodes, that is, how quickly one node's collapse affects another. The average path length is the mean number of edges on paths between node pairs, which also reflects how quickly a node's collapse propagates. All three therefore reflect the vulnerability of the network structure. In Figs. 1, 2 and 3, nodes are sorted from large to small by degree K_i, k-shell value, betweenness centrality B_i, and aggregation coefficient C_i respectively, and are then attacked in proportion; the largest connected component, efficiency and average path length serve as performance indicators for studying the structural vulnerability of the US grid. Since the whole network is considered, the degree, betweenness centrality, k-shell value and aggregation coefficient of each node are computed, and the ratio of deleted nodes to the total number of nodes is used as the measure across the entire network [20, 21]. A sketch of this attack procedure follows.
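The attack procedure reduces to ranking nodes by an importance index, deleting the top fraction, and re-evaluating the metrics. A minimal sketch assuming networkx, with a toy random graph standing in for the US grid and the largest-connected-component ratio as the recorded metric:

```python
import networkx as nx

def attack_curve(G, ranking, fractions):
    """Delete the top-ranked nodes at each fraction and record the
    largest-connected-component ratio of what remains."""
    n = G.number_of_nodes()
    ordered = sorted(G.nodes, key=lambda v: ranking[v], reverse=True)
    curve = []
    for f in fractions:
        H = G.copy()
        H.remove_nodes_from(ordered[: int(f * n)])
        giant = max(nx.connected_components(H), key=len) if H else set()
        curve.append(len(giant) / n)
    return curve

G = nx.erdos_renyi_graph(300, 0.015, seed=7)    # toy stand-in for the grid
rankings = {
    "degree": dict(G.degree()),
    "betweenness": nx.betweenness_centrality(G),
    "k-shell": nx.core_number(G),
    "clustering": nx.clustering(G),
}
for name, r in rankings.items():
    print(name, [round(x, 2) for x in attack_curve(G, r, [0.1, 0.2, 0.4])])
```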

Fig. 1. Using the largest connected component as a measure of structural vulnerability.


Fig. 2. Using efficiency as a measure of structural vulnerability.

Fig. 3. Using the average distance L as a measure of structural vulnerability.

4.1 The Simulation Results with the Reference Indicators of the Attack Nodes

When nodes are destroyed in proportion according to the order of the aggregation coefficient, the impact on the structural vulnerability of the network is the smallest, and the aggregation coefficient never drops to zero; that is, with the aggregation coefficient as the reference index, attacking nodes is less destructive to the whole network than the other methods. When nodes are destroyed according to degree or betweenness centrality, the impact on the vulnerability of the entire network structure is basically the same: once 20%–40% of the nodes are destroyed, the whole network has essentially disintegrated.


When nodes are destroyed in order of k-shell value, the vulnerability seen through the largest connected component and efficiency shows a rapid initial decline; after about a quarter of the nodes are destroyed, the decline becomes gentle until the whole network disintegrates. The three figures also show that when nodes are destroyed in order of aggregation coefficient, the three performance indicators do not decrease monotonically: as nodes collapse, each remaining node's neighborhood shrinks, so the number of possible edges between neighbors decreases, which can increase the aggregation coefficient [22, 23].

4.2 Simulation Results Based on Network Structure Vulnerability Measures

Taking the largest connected component and efficiency as the structural vulnerability measures, the network structure disintegrates quickly during node deletion: for attacks based on degree and betweenness centrality, the G and E values of the entire network are almost zero once 20% of the nodes are deleted. When the average path length is used as the measure, the change is not monotonic as with the previous two: as nodes are deleted, the path lengths from some nodes to others increase while the total number of nodes changes little, so the curve fluctuates [24, 25].

5 The Conclusions

Through the study of the structural vulnerability of the US power grid, it is found that different attack methods have different destructive effects on the overall network structure. When the k-shell value or the aggregation coefficient is used as the basis for judging node importance, attacking nodes does the least damage to the whole network, and the network shows stronger invulnerability. Measured by the largest connected component and efficiency, the whole network is basically in a discrete state once 20% of the important nodes are attacked; deleting nodes in these two ways therefore drives the network to disintegration faster [26]. Regardless of the attack method, the more important the node, that is, the larger its index value, the more its removal damages the network structure. Protecting these nodes is therefore extremely important: by protecting them, the robustness of the network structure is improved, the entire power network can operate normally, and the collapse of the whole system due to the destruction of individual nodes, with the resulting economic loss and inconvenience to daily life, is reduced. In the future, with limited resources, we will discuss how to protect the key nodes under different attacks.


Acknowledgement. This research was funded by NSFC (No. 61672020, U1803263, U1636215, 61702309), (No. 18-163-15-ZD-002-003-01), National Key Research and Development Program of China (No. 2019QY1406), Key R&D Program of Guangdong Province (No. 2019B010136003, 2019B010137004), Project of Shandong Province Higher Educational Science and Technology Program (No. J16LN61), and the National Key Research and Development Plan (No. 2018YFB1800701, No. 2018YFB0803504, and No. 2018YEB1004003).

References

1. Xin, L., Dongmei, J., Guoquan, Y.: The method in researching the vulnerability of the internet and the problems existing in it. Software 1(1) (2012)
2. Nicholson, C.D., Barker, K., Ramirez, M.J.E.: Flow-based vulnerability measures for network component importance: experimentation with preparedness planning. Reliab. Eng. Syst. Saf. 145, 62–73 (2016)
3. Huan, P., Rong, G., Gangdun, H.: Analysis and assessment of a realistic power grid's dynamic structure vulnerability. Smart Grid 7(5), 392–401 (2017)
4. Hong, L., Ouyang, M., Peeta, S., et al.: Vulnerability assessment and mitigation for the Chinese railway system under floods. Reliab. Eng. Syst. Saf. 137, 58–68 (2015)
5. Minghui, X.: The Vulnerability Analysis of Urban Complementary Transit Network under Spatially Localized Failures. Huazhong University of Science (2018)
6. Skelton, R.P., Anderegg, L., Prahlad, P.: No local adaptation in leaf or stem xylem vulnerability to embolism, but consistent vulnerability segmentation in a North American oak. New Phytologist 223(3), 1296–1306 (2019)
7. Hwang, T.S., Park, S.Y.: A seamless control strategy of a distributed generation inverter for the critical load safety under strict grid disturbances. IEEE Trans. Power Electron. 28(10), 4780–4790 (2013)
8. Atanov, I.V., Khorol'Skii, V.Y., Ershov, A.B., et al.: Formalization of the process of directional composition of structures of autonomous power-supply systems during design. Russ. Electr. Eng. 88(8), 475–479 (2017)
9. Bahmani-Firouzi, B., Azizipanah-Abarghooee, R.: Optimal sizing of battery energy storage for micro-grid operation management using a new improved bat algorithm. Int. J. Electr. Power Energy Syst. 56(3), 42–54 (2014)
10. Gini, F., Luise, M., Reggiannini, R.: Cramer-Rao bounds in the parametric estimation of fading radiotransmission channels. IEEE Trans. Commun. 46(10), 1390–1398 (1998)
11. Umberger, A.: Distributed generation: how localized energy production reduces vulnerability to outages and environmental damage in the wake of climate change. Golden Gate U. Envtl. L.J. (2012)
12. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393, 440–442 (1998)
13. Dan, L.: Overview of research on grid vulnerability based on complex network theory. Heilongjiang Sci. Technol. Inf. 36 (2016)
14. Lixiong, X., Junyong, L., Yang, L., Zhanxin, Y., Li, Z., Yang, W.: Structural characteristics investigation of electric power transmission networks (S2) (2014)
15. Motter, A.E., Lai, Y.C.: Cascade-based attacks on complex networks. Phys. Rev. E 66(6), 065102 (2002)
16. Jing, G., Wang, D.R.: The vulnerability analysis on power communication networks based on complex network theory. Power Syst. Commun. 30(9), 6–10 (2009)


17. Li, S., Wu, X., Li, A., Zhou, B., Tian, Z., Zhao, D.: Structural vulnerability of complex networks under multiple edge-based attacks. In: 2018 IEEE 3rd International Conference on Data Science in Cyberspace (IEEE DSC 2018), pp. 405–409 (2018)
18. Crucitti, P., Latora, V., Marchiori, M.: A topological analysis of the Italian electric power grid. Physica A 338, 92–97 (2004)
19. Zhao, D., Li, L., Li, S., Huo, Y., Yang, Y.: Identifying influential spreaders in interconnected networks. Phys. Scr. 89(1) (2014)
20. Li, S., Li, L., Jia, Y., Liu, X., Yang, Y.: Identifying vulnerable nodes of complex networks in cascading failures induced by node-based attacks. Math. Probl. Eng. 2013, 938398 (2013)
21. Wei, Z., Liu, J., Zhu, G., et al.: A new integrative vulnerability evaluation model to power grid based on running state and structure. Autom. Electr. Power Syst. 33(8), 11–15 (2009)
22. Li, S., Li, L., Yang, Y., Luo, Q.: Revealing the process of edge-based-attack cascading failures. Nonlinear Dyn. 69(3), 837–845 (2012)
23. Tian, Z., Su, S., Shi, W., Du, X., Guizani, M., Yu, X.: A data-driven method for future internet route decision modeling. Future Gener. Comput. Syst. 95, 212–220 (2019)
24. Anji, M., Yu, J., Guo, Z.: Electric power grid structural vulnerability assessment. In: Power Engineering Society General Meeting. IEEE (2006)
25. Zhao, D., Li, L., Peng, H., Luo, Q., Yang, Y.: Multiple routes transmitted epidemics on multiplex networks. Phys. Lett. A 378, 770–776 (2014)
26. Coppo, M., Pelacchi, P., Pilo, F., et al.: The Italian smart grid pilot projects: selection and assessment of the test beds for the regulation of smart electricity distribution. Electr. Power Syst. Res. 120, 136–149 (2015)

Electric Power Grid Invulnerability Under Intentional Edge-Based Attacks

Yixia Li1, Shudong Li2(&), Yanshan Chen1, Peiyan He1, Xiaobo Wu3, and Weihong Han2(&)

1 School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China
2 Cyberspace Institute of Advance Technology, Guangzhou University, Guangzhou 510006, China
{lishudong,hanweihong}@gzhu.edu.cn
3 School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China

Abstract. The power grid, as a kind of complex network, is particularly important for every country, and can bring huge losses if it suffers natural or even man-made attacks. Therefore, how to identify the vulnerable edges of a power grid under attacks has become an important proposition. In this paper, taking the US power grid as an example, we deliberately delete a certain percentage of edges according to different strategies, each representing a different attack, and calculate the collapse degree of the attacked network by three metrics (the largest connected component G, the efficiency E, and the average distance L). We find that under intentional attacks on the edges with higher edge betweenness centrality, and on the edges with larger multiplication of node betweenness centrality, the US power grid shows inferior invulnerability. The methods used in this paper can be applied to identify the vulnerable edges of complex networks, especially of key infrastructures.

Keywords: Power grid · Invulnerability · Edge-based attack · Betweenness centrality

1 Introduction

From the collapse of the northern US power grid in 2003 to the collapse of the national power grid in Argentina on June 16, 2019, large-scale blackouts have caused incalculable losses to society and daily life. Therefore, how to measure the invulnerability of the network has become an important issue. In fact, the grid system can be simplified to a complex network model with power stations as nodes, and attacks are generally aimed at nodes and edges. In this respect, Arianos tried to find the most critical edges in the network, and found that the net-ability was capable of identifying some of the most critical edges [1]. Koç proposed the effective graph resistance to measure the vulnerability of the network, and found that by increasing the effective graph resistance, the power grid becomes more resistant [2]. Then, for the failures of overloaded nodes, Kadloor found that randomly


deleting nodes would cause other nodes to fail, and studied the disturbance level the system can accept before node deletions result in the failure of all the nodes in the network [3]. Wang took the effective graph resistance as the vulnerability metric of a grid against cascading failures, and found that bringing an additional line into the network system can increase the grid's vulnerability [4]. Simonsen found that, due to overshooting in the loads, the flow dynamics may reduce the resistance of the network [5].

However, how to identify the vulnerable edges of a power grid has not been investigated enough until now. In this paper, considering the importance of edges in the power grid, we take the topological structure of the US power grid as an example. We find that by attacking the edges (removing them from the network) [6, 7] with large edge betweenness centrality and large multiplication of node betweenness centrality ($B_{ij}$), the network shows the worst invulnerability. More technically, we attack 5% of the edges each time under five different strategies (edge betweenness centrality, $B_{ij}$, $K_{ij}$, $C_{ij}$, $KShell_{ij}$) and calculate the vulnerability metrics ($G$, $E$, $L$). From the resulting curves, we find that the vulnerability metrics decrease fastest when edges are deleted by edge betweenness centrality and by $B_{ij}$. Therefore, we draw the conclusion that, in a scale-free power network, the edges with large edge betweenness centrality and large $B_{ij}$ are the most vulnerable.

The organization of this paper is as follows. In Sect. 2, we introduce the metrics of edge importance in a network. In Sect. 3, we introduce the metrics used to measure the collapse degree of a network. In Sects. 4 and 5, we take the US power grid as an example, discuss the collapse degree of the network under deliberate attacks on edges, and describe the network's invulnerability. In Sect. 6, we summarize the simulations and analysis and give the conclusions.

2 The Metrics to Measure the Importance of Edges in a Network

In this part, for the electric power grid, we introduce some characteristics used to identify vulnerable edges. Firstly, we suppose that power is transferred along shortest paths: the more shortest paths pass through an edge, the more important it is. Therefore, we introduce the edge betweenness centrality [8, 9] as a metric of edge importance in the network. For nodes, the degree [10] ($K_i$), the node betweenness centrality [11, 12] ($B_i$), and the clustering coefficient [13] ($C_i$) are natural importance metrics. Since an edge connects two nodes, it is reasonable to define the importance of the edge connecting node $i$ and node $j$ by the following three metrics:

$$K_{ij} = K_i \cdot K_j \quad (1)$$

$$B_{ij} = B_i \cdot B_j \quad (2)$$

$$C_{ij} = C_i \cdot C_j \quad (3)$$

Finally, since edges disappear as small-degree nodes are iteratively deleted, and the degrees of the remaining nodes change accordingly, the k-shell decomposition leads to another edge metric, $KShell_{ij}$ [14].
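As an illustration of how these edge-importance metrics could be computed in practice, the following sketch uses NetworkX; since the text does not spell out how $KShell_{ij}$ combines the two endpoints' k-shell indices, the product form used here is an assumption.

```python
import networkx as nx

def edge_importance_metrics(G):
    """Compute the edge-importance metrics of Sect. 2 for every edge of G."""
    deg = dict(G.degree())                        # node degree K_i
    bet = nx.betweenness_centrality(G)            # node betweenness B_i
    clu = nx.clustering(G)                        # clustering coefficient C_i
    ebc = nx.edge_betweenness_centrality(G)       # edge betweenness centrality
    core = nx.core_number(G)                      # k-shell index of each node
    metrics = {}
    for u, v in G.edges():
        metrics[(u, v)] = {
            "K_ij": deg[u] * deg[v],              # Eq. (1)
            "B_ij": bet[u] * bet[v],              # Eq. (2)
            "C_ij": clu[u] * clu[v],              # Eq. (3)
            "edge_betweenness": ebc.get((u, v), ebc.get((v, u))),
            "KShell_ij": core[u] * core[v],       # assumed product form
        }
    return metrics
```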

3 The Metrics of Network Collapse

The network's maximal connected component coefficient [15] ($G$) is an important metric for studying network connectivity. After the network is attacked, the higher the remaining connectivity, the smaller the network's collapse, so $G$ can be used as a metric of a network crash. Suppose that, in a network with $N$ nodes, there are $S$ nodes in the maximal connected component; we then define the maximal connected component coefficient as

$$G = \frac{S}{N} \quad (4)$$

Similarly, the average path length [16] ($L$) and the efficiency [17] ($E$) can also be used as metrics of the degree of collapse. The longer the average path length, the lower the connectivity and the higher the degree of collapse; the higher the efficiency, the stronger the connectivity and the lower the degree of collapse. Denoting the shortest-path distance between nodes $i$ and $j$ by $d_{ij}$, the average path length ($L$) and the efficiency ($E$) are defined as follows:

$$L = \frac{1}{N} \sum_{i \neq j} d_{ij} \quad (5)$$

$$E = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}} \quad (6)$$
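A minimal sketch of how these collapse metrics might be computed with NetworkX follows; since an attacked network is usually disconnected, pairs with no connecting path ($d_{ij} = \infty$) are simply skipped in the sums for $L$ and $E$, which is an assumption the text does not state explicitly.

```python
import networkx as nx

def collapse_metrics(G, N):
    """Largest-component ratio (Eq. 4), average path length (Eq. 5) and
    efficiency (Eq. 6) of a possibly attacked network with N nodes."""
    giant = max(nx.connected_components(G), key=len)
    g_ratio = len(giant) / N
    total_d = total_inv = 0.0
    for _, dists in nx.all_pairs_shortest_path_length(G):
        for d in dists.values():
            if d > 0:                 # skip self-pairs and unreachable pairs
                total_d += d
                total_inv += 1.0 / d
    return g_ratio, total_d / N, total_inv / (N * (N - 1))
```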

4 Exploration of the Power Grid Structure

In order to study the invulnerability of the power grid, we take the electrical power grid [18] of the western United States, which has 4941 nodes and 6594 edges, as an example, treating it as a scale-free [19] network in which each edge has length 1. By analyzing the degree ($K_i$) distribution and the betweenness centrality ($B_i$) distribution [20, 21] of the nodes in the network, we obtain the distribution maps shown in Figs. 1 and 2.

In the United States power grid, nodes with small degree and small betweenness centrality account for a large proportion; in particular, nodes with a betweenness centrality of less than 0.02 account for more than 95%. Therefore, the US power grid is composed of a small proportion of large-degree and large-betweenness-centrality nodes, while small-degree and small-betweenness-centrality nodes account for the majority of the grid. It can also be concluded that the basic structure of the US grid is


Fig. 1. $K_i$ distribution map.

Fig. 2. $B_i$ distribution map.

relatively dense around a few nodes, while the network structure near most nodes is relatively sparse. Moreover, it is easy to see that the degree distribution of the grid is close to a power-law distribution, so the grid can be regarded as a scale-free network model.


5 Simulation of the US Power Grid

Taking the US power grid as an example, we deliberately attack the important edges and then calculate the degree of network collapse, thereby judging the invulnerability of the network. According to the five importance metrics, we sort the edges under five strategies, attack 5% of the edges each time, and calculate the corresponding maximal connected component coefficient ($G$), efficiency ($E$) and average path length ($L$). The results are shown in Figs. 3, 4 and 5.

Fig. 3. The largest connected component G as a function of deleting percentage p.

From the above three figures, we can see that the important edges, as measured by these five metrics, are attacked first: in the early stage the network is greatly damaged, while in the later stage the impact is small. This shows that the edges attacked early are the important edges of the network, which verifies the soundness of the measurement metrics [22]. In addition, it can be seen that by attacking the edges with large edge betweenness centrality and the edges with large multiplication of node betweenness centrality ($B_{ij}$), the network suffers the greatest damage. Therefore, to enhance the network's invulnerability, we should focus on protecting the edges with large edge betweenness centrality and large multiplication of node betweenness centrality.
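A minimal sketch of the attack simulation is given below; it assumes a static strategy in which the edge ranking is computed once on the intact network (the text does not say whether the metrics are recomputed after each 5% removal), and the file name in the usage example is hypothetical.

```python
import math
import networkx as nx

def attack_curve(G0, ranked_edges, step=0.05):
    """Delete edges in order of decreasing importance, `step` of all edges
    at a time, recording the largest-connected-component ratio G."""
    G = G0.copy()
    N = G0.number_of_nodes()
    m = len(ranked_edges)
    batch = math.ceil(step * m)
    curve = []
    for start in range(0, m, batch):
        G.remove_edges_from(ranked_edges[start:start + batch])
        giant = max(nx.connected_components(G), key=len)
        curve.append((min(1.0, (start + batch) / m), len(giant) / N))
    return curve

# Example: rank edges by the product of endpoint betweenness centralities
# grid = nx.read_edgelist("uspowergrid.txt")   # hypothetical file name
# bet = nx.betweenness_centrality(grid)
# ranked = sorted(grid.edges(), key=lambda e: bet[e[0]] * bet[e[1]], reverse=True)
# curve = attack_curve(grid, ranked)
```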


Fig. 4. Efficiency E as a function of deleting percentage p.

Fig. 5. The average distance L as a function of deleting percentage p.


6 Conclusions

In this paper, we adopted five metrics to measure the importance of edges, and studied the invulnerability of the US power grid under the corresponding five deliberate edge-based attacks. We can see that different attacks are differently destructive to the network. From the figures, the degree of network collapse caused by attacking important edges is huge: when the deletion percentage reaches 40%, the network is near collapse. Among the five strategies, attacking the edges with large edge betweenness centrality and the edges with large multiplication of node betweenness centrality damages the network most. What's more, it can be inferred that the edge betweenness centrality and the multiplication of node betweenness centrality are the most important edge metrics in the network. Therefore, protecting the edges with large edge betweenness centrality and large multiplication of node betweenness centrality has important practical significance for enhancing the robustness of the network. In the future, considering that protecting the network involves substantial cost and resources [23–25], we will investigate how to protect these important edges under different attacks.

Acknowledgement. This research was funded by NSFC (No. 61672020, U1803263, U1636215), (No. 18-163-15-ZD-002-003-01), National Key Research and Development Program of China (No. 2019QY1406), Key R&D Program of Guangdong Province (No. 2019B010136003, 2019B010137004), a Project of Shandong Province Higher Educational Science and Technology Program (No. J16LN61), and the National Key Research and Development Plan (No. 2018YFB1800701, No. 2018YFB0803504, and No. 2018YEB1004003).

References

1. Arianos, S., Bompard, E., Carbone, A., et al.: Power grids vulnerability: a complex network approach. Chaos 19(1), 175 (2009)
2. Koç, Y., Warnier, M., Van Mieghem, P., et al.: The impact of the topology on cascading failures in electric power grids. Comput. Sci. (2013)
3. Kadloor, S., Santhi, N.: Understanding cascading failures in power grids. Comput. Sci. 28(5), 24–30 (2012)
4. Wang, X., Koc, Y., Robert, E., et al.: A network approach for power grid robustness against cascading failures. In: 2015 7th International Workshop on Reliable Networks Design and Modeling (RNDM). IEEE (2015)
5. Simonsen, I., Buzna, L., Peters, K., et al.: Transient dynamics increasing network vulnerability to cascading failures. Phys. Rev. Lett. 100(21), 218701 (2008)
6. Buldyrev, S.V., Parshani, R., Paul, G., et al.: Catastrophic cascade of failures in interdependent networks. Nature 464, 1025–1028 (2010)
7. Schaub, M.T., Lehmann, J., Yaliraki, S.N., et al.: Structure of complex networks: quantifying edge-to-edge relations by failure-induced flow redistribution. Netw. Sci. 2(01), 66–89 (2014)
8. Yong, Y., Yu, F.: Case study on survivability of urban rail transit network. Logistics Technol. 37(12), 58–62 (2018)


9. Sun, Y., Yang, D., Meng, L., et al.: Universal framework for vulnerability assessment of power grid based on complex networks. In: The 30th Chinese Control and Decision Conference (2018)
10. Runze, W., Wanxu, W., Li, L., Bing, F., Liangrui, T.: Topology diagnosis of power communication network based on node influence. Power Syst. Prot. Control 47(10), 147–155 (2019)
11. Riondato, M., Kornaropoulos, E.M.: Fast approximation of betweenness centrality through sampling. Data Min. Knowl. Discov. 30(2), 438–475 (2016)
12. Segarra, S., Ribeiro, A.: Stability and continuity of centrality measures in weighted graphs. IEEE Trans. Sig. Process. 64(3), 543–555 (2016)
13. Yun, L.: Node importance rank by attribute reduction set evaluation and application. Shandong Normal University (2018)
14. Li, S., Wu, X., Zhu, C., Li, A., Li, L., Jia, Y.: Vulnerability of complex networks under multiple node-based attacks. In: IET International Conference on Information & Communications Technologies (2013)
15. Ruan, Y., Lao, S.-Y., Wang, J., Bai, L., Chen, L.-D.: Node importance measurement based on neighborhood similarity in complex network. Acta Phys. Sin. 66(03), 371–379 (2017)
16. Li, C., Wei, L., Lu, T., Gao, W.: Invulnerability simulation analysis of compound traffic network in urban agglomeration. J. Syst. Simul. 30(02), 489–496 (2018)
17. Sun, K., Han, Z.X., Cao, Y.J.: Review on models of cascading failures in complex power grid. Power Syst. Technol. 13, 1–9 (2005)
18. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393(6684), 440 (1998)
19. Yuejin, T., Xin, L., Jun, W., Hongzhong, D.: Main scientific problems for the invulnerability research of complex networks. In: The 15th Chinese Congress of Systems Science and Systems Engineering Proceedings. Systems Engineering Society of China (2008)
20. Verma, T., Ellens, W., Kooij, R.E.: Context-independent centrality measures underestimate the vulnerability of power grids. Int. J. Crit. Infrastruct. 11(1), 62 (2013)
21. Tian, Z., et al.: Real time lateral movement detection based on evidence reasoning network for edge computing environment. IEEE Trans. Ind. Inform. 15(7), 4285–4294 (2019)
22. Tian, Z., Su, S., Shi, W., Du, X., Guizani, M., Yu, X.: A data-driven method for future internet route decision modeling. Future Gener. Comput. Syst. 95, 212–220 (2019)
23. Li, S., Li, L., Yang, Y., Luo, Q.: Revealing the process of edge-based-attack cascading failures. Nonlinear Dyn. 69(3), 837–845 (2012)
24. Li, S., Li, L., Jia, Y., Liu, X., Yang, Y.: Identifying vulnerable nodes of complex networks in cascading failures induced by node-based attacks. Math. Probl. Eng. 2013, 938398 (2013)
25. Zhao, D., Li, L., Peng, H., Luo, Q., Yang, Y.: Multiple routes transmitted epidemics on multiplex networks. Phys. Lett. A 378, 770–776 (2014)

Design and Evaluation of a Quorum-Based Hierarchical Dissemination Algorithm for Critical Event Data in Massive IoTs

Ihn-Han Bae(&)

School of IT Eng, Catholic University of Daegu, Gyeongsan 38430, South Korea
[email protected]

Abstract. The appearance of IoT (Internet-of-Things) applications such as environmental monitoring, smart cities, and home automation has turned the IoT concept into reality at a massive scale. Mission-critical event data dissemination in massive IoT networks imposes constraints on the message transfer delay between devices. Due to the low power and short communication range of IoT devices, event data is expected to be relayed over multiple D2D (device-to-device) links before reaching the destination. The coexistence of a massive number of IoT devices poses the challenge of maximizing the successful transmission capacity of the overall network while reducing the multi-hop transmission delay, in order to support mission-critical applications. In this paper, we first propose two extensions of the s-grid quorum, the xS-grid quorum and the ┣-shaped grid quorum, which are quorum structures that can be used in many different applications including decentralized consensus, distributed mutual exclusion, and replica control. Next, we propose a quorum-based hierarchical dissemination algorithm for critical event data executed in the edge-fog environment of massive IoTs. The proposed algorithm constructs xS-grid and ┣-shaped grid quorums in the overlay networks on the edge nodes of the lower layer and on the fog servers of the upper layer, respectively, to support event dissemination, and provides reliable and guaranteed real-time data dissemination. Finally, the performance of the proposed quorums and algorithm is evaluated through an analytical model.

Keywords: Mission critical data dissemination · D2D link · Edge-fog computing · Massive IoT · Quorum structure

1 Introduction

As IoT is deployed in the field, two main categories of IoT applications are beginning to emerge, which can be defined as critical IoT and massive IoT. Critical IoT applications such as autonomous driving or remote surgery are those which require very low latency on ultra-reliable networks, often combined with very high throughput. The IoT paradigm is rapidly maturing, with many massive deployments in areas such as smart cities, home automation and agricultural monitoring. These massive IoT applications require scalable low-power connectivity over a long range. Recently, the demand for hybrid IoT applications mixing critical IoT and massive IoT, such as


healthcare, security, energy and industrial automation, is increasing. These are referred to as mission-critical IoT applications, and they additionally require specific QoS (Quality-of-Service) guarantees in terms of latency, throughput or reliability. Hence, the successful transport of data with minimum delay from source to destination is highly essential for the effective operation of massive IoT systems [1–3]. In this paper, we first propose two extensions of the S-Grid quorum, the xS-Grid and the ┣-shaped grid quorums, which are quorum structures that can be used in many different applications including decentralized consensus, distributed mutual exclusion, and replica control. Next, we propose QHDA (Quorum-based Hierarchical Dissemination Algorithm), which is executed in the edge-fog environment for critical event data in massive IoTs. QHDA constructs xS-grid and ┣-shaped grid quorums in the overlay networks on the edge nodes of the lower layer and on the fog servers of the upper layer, respectively, to support event dissemination, and provides reliable and guaranteed real-time data dissemination. Then, the performance of QHDA is evaluated through an analytical model.

The remainder of this paper is organized as follows. Section 2 presents related works in the areas of edge-fog computing, quorum systems and IoT data dissemination. In Sect. 3, we design the extended S-Grid quorums, which are the quorum structures used by the proposed QHDA, and propose QHDA, a quorum-based hierarchical dissemination algorithm for critical event data in massive IoTs. Subsequently, the performance of the proposed quorums and of QHDA is presented in Sect. 4. Finally, Sect. 5 presents future research challenges and our conclusions.

2 Related Works

2.1 Edge-Fog Computing

Organizations that rely heavily on data are increasingly likely to use cloud, fog, and edge computing infrastructures. These architectures allow organizations to take advantage of a variety of computing and data storage resources, including the Industrial Internet of Things (IIoT). Cloud, fog and edge computing may appear similar, but they are different layers of the IIoT [4]. The fog-edge layer is the perfect junction where there are enough compute, storage and networking resources to mimic cloud capabilities at the edge and support the local ingestion of data and the quick turnaround of results. Figure 1 shows a pictorial representation of the edge-fog computing model in IoT [5].

Edge-centric computing is based on a decentralized model that interconnects heterogeneous cloud resources controlled by a variety of entities. Edge computing for the IIoT allows processing to be performed locally at multiple decision points for the purpose of reducing network latency; compute and storage systems reside at the edge as well, as close as possible to the component, device, application or human that produces the data being processed. The purpose is to remove processing latency, because the data need not be sent from the edge of the network to a central processing system and then back to the edge.


Fig. 1. Edge-fog computing model with IoT

A fog environment places intelligence at the local area network (LAN). This architecture transmits data from endpoints to a gateway, where it is then transmitted to sources for processing and return transmission. Fog computing is also a virtualized platform located between end users and cloud data centers hosted on the Internet, and it provides advantages in terms of reduced delay, lower power consumption, and reduced data traffic over the network.

2.2 Quorum Systems

Quorum systems serve as a basic tool providing a uniform and reliable way to achieve coordination between processors in distributed systems. A set system is a collection of sets, $S = \{S_1, S_2, \ldots, S_m\}$, over an underlying universe $U = \{u_1, u_2, \ldots, u_n\}$. A set system is said to satisfy the intersection property if every two sets $S, R \in S$ have a nonempty intersection. Set systems with the intersection property are known as quorum systems, and the sets in such a system are called quorums [6].

Fig. 2. Grid quorum system

Many quorum systems have been proposed: grid, torus, cyclic and finite projective plane. In this paper, we discuss the grid quorum and the s-grid quorum as related research. In a grid system, shown in Fig. 2, elements are arranged as a $\sqrt{n} \times \sqrt{n}$ array (square). A quorum can be any set containing a column and a row of elements in the array, so the quorum size in a square system is $2\sqrt{n} - 1$. An alternative is a "triangle" grid-based quorum in which all elements are organized in a "triangle" fashion; the quorum size in a "triangle" system is approximately $\sqrt{2}\sqrt{n}$ [7].
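As a small illustration (not taken from [7]), the sketch below builds the row-plus-column quorum of a node in a square grid, assuming elements are numbered row by row from 0:

```python
import math

def grid_quorum(n, node):
    """Quorum of `node` in a sqrt(n) x sqrt(n) grid: its whole row plus
    its whole column, giving the quorum size 2*sqrt(n) - 1."""
    side = math.isqrt(n)
    r, c = divmod(node, side)
    return {r * side + j for j in range(side)} | {i * side + c for i in range(side)}

assert len(grid_quorum(16, 5)) == 2 * 4 - 1
```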

Fig. 3. All four possible quorums based on s-grid(4 × 4) [8]

Stepped grid (s-grid) quorum [8] is very flexible and works with any array size. Given a universal set in which elements are arranged as a $t \times w$ array, with the rightmost column regarded as wrapping around back to the leftmost column, a quorum of an s-grid($t \times w$) quorum system is formed by selecting:

ⓐ all elements of a row $i$, $0 \le i \le t - 1$;
ⓑ all elements in column 0, from the first element down to the first element of the selected row;
ⓒ all elements in the last column $w - 1$, from the last element of the selected row down to the last element.

In s-grid(4 × 4), the quorums shown in Fig. 3 are selected. A sketch of this construction is given below.
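The sketch assumes row-major numbering of the $t \times w$ array from 0:

```python
def sgrid_quorum(t, w, i):
    """Row-based s-grid quorum for row i of a t x w array (steps (a)-(c))."""
    row = {i * w + j for j in range(w)}                 # (a) all of row i
    col_head = {r * w for r in range(i + 1)}            # (b) column 0, rows 0..i
    col_tail = {r * w + (w - 1) for r in range(i, t)}   # (c) column w-1, rows i..t-1
    return row | col_head | col_tail
```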

2.3 Data Dissemination Methods

The IoT platform supports end-to-end dissemination of real-time data from sensors to sinks, and in the reverse direction, over a range of up to several kilometers without relying on fixed network infrastructures such as 4G and low power wide area networks (LPWANs). End-to-end connectivity requires multi-hop communication due to transmit power constraints and the possible lack of line-of-sight between source and sink. Daneels et al. [1] proposed a new general-purpose IoT platform based on a combination of LoWPAN and multi-hop Wireless Sensor Network (WSN) technology. It supports reliable and guaranteed real-time data dissemination and analysis, as well as actuator control, in dynamic and challenging infrastructure-less environments. Tao et al. [8] proposed a novel data dissemination scheme (MM-GSQ) constructed on top of a new quorum system in wireless sensor networks; MM-GSQ uses a new quorum system named the spatial neighbor proxy quorum (SNPQ). Imani et al. [9] devised a new quorum system called the stepped grid (s-grid), which is used in data dissemination and power saving protocols for multi-hop ad hoc networks. S-grid has comparably high expected quorum overlap size (EQOS) values and better neighbor sensibility than other methods: grid, cyclic, torus and FPP.

3 QHDA for Massive IoTs

3.1 Extended S-Grid Quorums

(1) xS-grid quorum

In the conventional s-grid quorum [8], a row-based quorum first takes all the elements of one row of the grid and then completes the quorum with column elements. In the proposed xS-grid quorum, a column-based s-grid quorum is added to the existing row-based s-grid quorum, so the xS-grid quorum consists of two quorums, a row-based quorum and a column-based quorum, and is constructed in the following manner:

$$Q_{xS} = Q_{xS\_row} \cup Q_{xS\_column} \quad (1)$$

where $Q_{xS\_row}$ and $Q_{xS\_column}$ represent the row-based and column-based s-grid quorums, respectively.

• The row-based s-grid quorum $Q_{xS\_row}$ is configured in the same way as the existing s-grid quorum.
• The column-based s-grid quorum $Q_{xS\_column}$ is formed by selecting the following elements in a grid whose elements are arranged as a $t \times w$ array:
ⓐ all elements of a column $j$, $0 \le j \le w - 1$;
ⓑ all elements in row 0, from the first element to the first element of the selected column;
ⓒ all elements in the last row $t - 1$, from the last element in the selected column to the last element.

In a (4 × 4) grid, the column-based s-grid quorums $Q_{xS\_column}$ are shown in Fig. 4.

Fig. 4. All four possible column-based s-grid(4 × 4) quorums


Fig. 5. The xS-grid quorum of node 7 on s-grid(4 × 5)

Figure 5 shows the xS-grid quorum of node 7, $Q_{xS}(7)$, in a (4 × 5) grid array. The xS-grid quorum is the union of the row-based and column-based quorums, so $Q_{xS}(7)$ is:

$Q_{xS}(7) = Q_{xS\_row}(7) \cup Q_{xS\_column}(7) = \{0, 5, 6, 7, 8, 9, 14, 19\} \cup \{0, 1, 2, 7, 12, 17, 18, 19\} = \{0, 1, 2, 5, 6, 7, 8, 9, 12, 14, 17, 18, 19\}$.
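Reusing the sgrid_quorum sketch from Sect. 2.2, the xS-grid quorum is simply the union of the row-based and column-based constructions; the assertion reproduces the worked example for node 7:

```python
def sgrid_col_quorum(t, w, j):
    """Column-based s-grid quorum for column j (steps (a)-(c) above)."""
    col = {r * w + j for r in range(t)}                 # (a) all of column j
    row_head = {c for c in range(j + 1)}                # (b) row 0, cols 0..j
    row_tail = {(t - 1) * w + c for c in range(j, w)}   # (c) row t-1, cols j..w-1
    return col | row_head | row_tail

def xs_grid_quorum(t, w, node):
    """Eq. (1): union of the row-based and column-based s-grid quorums."""
    i, j = divmod(node, w)
    return sgrid_quorum(t, w, i) | sgrid_col_quorum(t, w, j)

# Reproduces the worked example: node 7 in a 4 x 5 grid
assert xs_grid_quorum(4, 5, 7) == {0, 1, 2, 5, 6, 7, 8, 9, 12, 14, 17, 18, 19}
```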

(2) ┣-shaped grid quorum

We extend the s-grid quorum and propose a ┣-shaped grid system with a smaller quorum size than that of the s-grid. The proposed ┣-shaped grid quorum is easy to implement and works with any array size. Like the xS-grid, the ┣-shaped grid can also form a row-based quorum and a column-based quorum. Given a universal set in which elements are arranged as a $t \times w$ array, with the rightmost column regarded as wrapping around back to the leftmost column, the column-based quorum of the ┣-shaped grid for node $(i, j)$ is formed by selecting the following elements:

ⓐ all elements of column $j$, $0 \le j \le w - 1$;
ⓑ calculate the right displacement distance (rdd) from column $j$ to the rightmost column, $rdd = w - j - 1$;
ⓒ if the rdd is greater than or equal to $\lfloor w/2 \rfloor$, select all elements of row $i$ within the right displacement distance;
ⓓ otherwise, skip the elements of row $i$ within the right displacement distance, wrap back to the leftmost column, and select all elements of row $i$ from column 0 up to the selected column $j$.

In the (4 × 4) grid, the column-based quorums of the ┣-shaped grid are shown in Fig. 6. Figure 7 shows the column-based quorums of the ┣-shaped grid for node 6 and node 13, $Q_{┣\_column}(6)$ and $Q_{┣\_column}(13)$, which have the intersection $Q_{┣\_column}(6) \cap Q_{┣\_column}(13) = \{8, 11\}$. Coordination between the nodes can be achieved through such an intersection.


In Fig. 7, the grid quorum of node 6 is $Q_G(6) = \{1, 5, 6, 7, 8, 9, 11, 16\}$, i.e., all of row 1 plus all of column 1.

Fig. 6. The column-based quorums of ┣-shaped grid(4 × 5) for node 6 and node 13

Fig. 7. The column-based quorums of ┣-shaped grid(4 × 5) for node 6 and node 13

Then, the column-based quorum and the row-based quorum of the ┣-shaped grid for node 6, and their union, are as follows:

$Q_{┣grid}(6) = Q_{┣\_row}(6) \cup Q_{┣\_column}(6)$,
$Q_{┣\_row}(6) = \{5, 6, 7, 8, 9, 11, 16\}$,
$Q_{┣\_column}(6) = \{1, 6, 7, 8, 9, 11, 16\}$,
$Q_{┣grid}(6) = \{1, 5, 6, 7, 8, 9, 11, 16\}$.

Therefore, $Q_{┣grid}(6) = Q_G(6)$; this is the case in which the ┣-shaped grid quorum reaches its largest size.
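The construction steps ⓐ–ⓓ translate directly into a short sketch (row-major numbering assumed); the assertion reproduces $Q_{┣\_column}(6)$ from Fig. 7:

```python
def l_shaped_col_quorum(t, w, node):
    """Column-based ┣-shaped grid quorum for a node at (i, j) (steps (a)-(d))."""
    i, j = divmod(node, w)
    quorum = {r * w + j for r in range(t)}      # (a) all of column j
    rdd = w - j - 1                             # (b) right displacement distance
    if rdd >= w // 2:                           # (c) enough room to the right
        quorum |= {i * w + c for c in range(j + 1, w)}
    else:                                       # (d) wrap to the leftmost column
        quorum |= {i * w + c for c in range(j)}
    return quorum

# Reproduces the example of Fig. 7: node 6 in a 4 x 5 grid
assert l_shaped_col_quorum(4, 5, 6) == {1, 6, 7, 8, 9, 11, 16}
```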

3.2 QHDA Design

The logical hierarchical architecture of the edge-fog overlay in which the proposed QHDA is implemented is shown in Fig. 8. We construct overlay networks in the form of ($t \times w$) grid arrays on each edge network and on the fog network, respectively. When disseminating mission-critical event data, each edge node considers the popularity of the critical event data received from the IoT device, in order to provide spatial locality for the event data. Each edge node periodically calculates the popularity of the data requested from it; the current local popularity of data $i$ is defined as LP (local popularity) [10]:


$$LP_i = \beta \cdot \frac{F_i}{\sum_{j=1}^{n} F_j} \quad (2)$$

where $\beta$ represents a weight, $F_i$ represents the access frequency of data $i$, and $\sum_{j=1}^{n} F_j$ represents the total access frequency of all data at time $t$.

Fig. 8. The logical hierarchical structure on edge-fog overlays for QHDA

When an edge node receives an event datum from an IoT device of the lower IoT layer, QHDA chooses the quorum, and hence the number of edge nodes in which the event data is stored, depending on whether the received event data is popular or not. If the event data is popular, it is stored in the edge nodes of the xS-grid quorum; otherwise, it is stored only in the edge nodes of the row-based s-grid quorum. In addition, the receiving edge node transmits the event data to the corresponding fog server of the upper fog layer, and the fog server receiving the event data stores it in the fog servers of the column-based quorum of the ┣-shaped grid.

Fig. 9. Message sequence diagram for data request/response in the worst case of QHDA


If a requested mission-critical event datum is not in the local edge node, the corresponding edge node sends a request message to the edge nodes of its row-based s-grid quorum. If the required event data is not found in the edge layer, the request message is transmitted to the corresponding fog server. If the requested event datum is not in the local fog server either, the request message is transmitted to the fog servers in the column-based ┣-shaped grid quorum of that fog server. Figure 9 shows the sequence diagram of the request and response messages for critical event data in the worst case of the proposed QHDA.

In the example of Fig. 8, the popular event data generated by device A of the IoT layer is transmitted to node 19 of the edge layer, and node 19 stores the data in the xS-grid quorum $Q_{xS}(19) = \{9, 11, 12, 13, 15, 17, 18, 19, 20\}$. Node 19 also transfers the data to server 7 of the upper fog layer, which stores the data in the column-based ┣-shaped grid quorum $Q_{┣\_column}(7) = \{1, 4, 7, 8\}$. If device B of the IoT layer requests the mission-critical data of device A to perform a task, the request message is delivered to the corresponding node 17 of the upper edge layer, and device B receives the required data to perform the mission. If the device requesting the required data, such as IoT device C, is on a remote sub-network different from that of the IoT device generating the mission-critical data, the request message reaches the corresponding server 3 in the fog layer, which sends a request message to its column-based ┣-shaped grid quorum $Q_{┣\_column}(3) = \{0, 3, 4, 5, 6\}$. The required data is then received from the intersection $\{4\}$ of $Q_{┣\_column}(7) = \{1, 4, 7, 8\}$, which stores the data, and $Q_{┣\_column}(3) = \{0, 3, 4, 5, 6\}$, which receives the request message, and the mission is performed.
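Combining the quorum sketches above, the replica-placement rule just described can be summarized as follows; the overlay dimensions and the popularity flag are illustrative assumptions:

```python
def qhda_replicas(edge_dims, fog_dims, edge_node, fog_server, popular):
    """Return the edge-layer and fog-layer replica sets for a new event datum."""
    te, we = edge_dims
    tf, wf = fog_dims
    if popular:                                   # popular: xS-grid quorum
        edge_set = xs_grid_quorum(te, we, edge_node)
    else:                                         # unpopular: row-based s-grid only
        i, _ = divmod(edge_node, we)
        edge_set = sgrid_quorum(te, we, i)
    fog_set = l_shaped_col_quorum(tf, wf, fog_server)  # fog: ┣-shaped column quorum
    return edge_set, fog_set
```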

4 Performance Evaluation

4.1 Performance of xS-Grid and ┣-Shaped Grid Quorums

We describe some metrics for comparing the performance of quorum systems [11]:

• Expected quorum overlap size (EQOS): this metric helps to evaluate the average-case neighbor sensibility. The formal definition of EQOS is given in Definition 1.
• Active ratio (AR): the fraction of nodes that have to keep a critical event datum, measured by the ratio of the quorum size to the system size.

Definition 1. For a quorum system $Q$ under $U = \{0, 1, 2, \ldots, n-1\}$, the EQOS of $Q$ is:

$$\sum_{G, H \in Q} p(G)\, p(H)\, |G \cap H| \quad (3)$$

where $p(G)$ and $p(H)$ are the probabilities of accessing quorums $G$ and $H$ under the quorum access strategy.

Definition 2. In a quorum system $Q = \{Q_1, Q_2, Q_3, \ldots, Q_n\}$ under $U = \{0, 1, 2, \ldots, n-1\}$, the relative size of quorum $Q_i$ with respect to the system size $n$ is known as the active ratio of $Q_i$ [12]:


$$\text{Active Ratio}(Q_i) = \frac{|Q_i|}{n} \quad (4)$$
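Both metrics are straightforward to compute exactly by enumeration; the sketch below assumes the uniform access strategy $p(G) = p(H) = 1/|Q|$, which is an assumption since the access strategy is not fixed here.

```python
from itertools import product

def eqos(quorums):
    """Eq. (3) under a uniform access strategy: mean |G ∩ H| over ordered pairs."""
    p = 1.0 / len(quorums)
    return sum(p * p * len(g & h) for g, h in product(quorums, repeat=2))

def active_ratio(quorum, n):
    """Eq. (4): relative quorum size |Q_i| / n."""
    return len(quorum) / n
```

Applied to the twenty quorums of the xS-grid(4 × 5) built earlier, eqos gives the exact value against which the approximate closed-form derivation below can be checked.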

We derive the EQOS of the xS-grid(4 × 5) shown in Fig. 5 and then generalize the result. The set of quorums of the xS-grid(4 × 5) is

$Q_{xS}(4 \times 5) = \{Q_{ij} \mid i = 0, 1, \ldots, 3;\ j = 0, 1, \ldots, 4\}$,

where $Q_{00}$ = {0, 1, 2, 3, 4, 5, 9, 10, 14, 15, 16, 17, 18, 19}, $Q_{01}$ = {0, 1, 2, 3, 4, 6, 9, 11, 14, 16, 17, 18, 19}, …, $Q_{33}$ = {0, 1, 2, 3, 5, 8, 10, 13, 15, 16, 17, 18, 19} and $Q_{34} = Q_{00}$. Table 1 lists the quorum sizes.

Table 1. Analysis of the quorum size in the xS-grid(4 × 5)

w\t    0    1    2    3
0     12   13   12    8
1     13   13   13   11
2     12   13   13   12
3     11   13   13   13
4      8   12   13   14
Σ     56   64   64   56

The xS-grid quorum system has an average quorum size $\overline{|Q_{xS}|} = \frac{1}{20}(56 + 64 + 64 + 56) = 12$. We can generalize the quorum size to a $t \times w$ grid array as follows:

$$|Q_{xS}(t \times w)| = 2(t + w - 3) \quad (5)$$

Suppose $Q_{ij}$ and $Q_{kl}$ are two quorums in the xS-grid quorum system, where $Q_{ij}$ contains all elements of the row $i$-based s-grid quorum plus all elements of the column $j$-based s-grid quorum, and $Q_{kl}$ is defined analogously for row $k$ and column $l$. Note that it is possible that $Q_{ij} = Q_{kl}$. Therefore, there are $(tw)^2$ possible permutations of $Q_{ij}$ and $Q_{kl}$ in a $t \times w$ grid array. The approximate average EQOS of the xS-grid quorum system can be evaluated by considering four typical cases of $Q_{ij}$ and $Q_{kl}$, ignoring special cases such as corner nodes.

• Case 1. $i = k$, $j = l$: the overlap of $Q_{ij}$ and $Q_{kl}$ has average size $2(t + w - 3)$. There are $tw$ occurrences of this case.
• Case 2. $i = k$ and $j \ne l$: the overlap has average size $2w - \frac{w}{2}$. There are $tw \cdot (t - 1)$ occurrences of this case.
• Case 3. $i \ne k$ and $j = l$: the overlap has average size $2t - \frac{t}{2}$. There are $tw \cdot (w - 1)$ occurrences of this case.
• Case 4. $i \ne k$ and $j \ne l$: the overlap has average size $t + \frac{w}{2}$. There are $tw \cdot (t - 1) \cdot (w - 1)$ occurrences of this case.


Summing up the above results over all cases, the EQOS of the xS-grid($t \times w$) quorum system under the universal set $\{0, 1, 2, \ldots, n-1\}$ is:

$$EQOS_{xS(t \times w)} = \frac{QOS_{xS1} + QOS_{xS2} + QOS_{xS3} + QOS_{xS4}}{(tw)^2} \quad (6)$$

where $QOS_{xS1} = 2(t + w - 3) \cdot tw$, $QOS_{xS2} = (2w - \frac{w}{2}) \cdot tw(t-1)$, $QOS_{xS3} = (2t - \frac{t}{2}) \cdot tw(w-1)$ and $QOS_{xS4} = (t + \frac{w}{2}) \cdot tw(t-1)(w-1)$. Then, the EQOS of the xS-grid(4 × 5) is:

$$EQOS_{xS(4 \times 5)} = \frac{240 + 450 + 480 + 1560}{400} = \frac{2730}{400} = 6.825.$$

We derive the EQOS of the ┣-shaped grid(4 × 5) shown in Fig. 7 in the same way. The set of quorums of the ┣-shaped grid(4 × 5) is

$Q_{┣grid} = \{Q_{ij} \mid i = 0, \ldots, 3;\ j = 0, \ldots, 4\}$,

where $Q_{00}$ = {0, 1, 2, 3, 4, 5, 10, 15}, $Q_{01}$ = {1, 2, 3, 4, 6, 11, 16}, $Q_{02}$ = {2, 3, 4, 7, 12, 17}, $Q_{03}$ = {0, 1, 2, 3, 8, 13, 18}, …, and $Q_{34}$ = {4, 9, 14, 15, 16, 17, 18, 19}. The ┣-shaped grid quorum system has an average quorum size $\overline{|Q_{┣grid}|} = t + \frac{3w}{4}$ in a $t \times w$ grid array.

Similarly to the xS-grid quorum, suppose $Q_{ij}$ and $Q_{kl}$ are two quorums in the ┣-shaped grid quorum system, where $Q_{ij}$ contains all elements of the column $j$-based ┣-shaped grid quorum and $Q_{kl}$ contains all elements of the column $l$-based ┣-shaped grid quorum. There are $(tw)^2$ possible permutations of $Q_{ij}$ and $Q_{kl}$ in a $t \times w$ grid array. The approximate average EQOS of the ┣-shaped grid quorum system can be figured out by considering the following four cases of $Q_{ij}$ and $Q_{kl}$:

• Case 1. $i = k$, $j = l$: the overlap of $Q_{ij}$ and $Q_{kl}$ has average size $t + \lfloor \frac{3w}{4} \rfloor$. There are $tw$ occurrences of this case.
• Case 2. $i = k$ and $j \ne l$: the overlap has average size $\frac{w + 1}{2}$. There are $tw \cdot (t - 1)$ occurrences of this case.
• Case 3. $i \ne k$ and $j = l$: the overlap has average size $t$. There are $tw \cdot (w - 1)$ occurrences of this case.
• Case 4. $i \ne k$ and $j \ne l$: the overlap has average size 1.5. There are $tw \cdot (t - 1) \cdot (w - 1)$ occurrences of this case.

Summing up the above results over all cases, the EQOS of the ┣-shaped grid($t \times w$) quorum system under the universal set $\{0, 1, 2, \ldots, n-1\}$ is:

$$EQOS_{┣grid(t \times w)} = \frac{QOS_{┣G1} + QOS_{┣G2} + QOS_{┣G3} + QOS_{┣G4}}{(tw)^2} \quad (7)$$

where $QOS_{┣G1} = (t + \lfloor \frac{3w}{4} \rfloor) \cdot tw$, $QOS_{┣G2} = \frac{w + 1}{2} \cdot tw(t-1)$, $QOS_{┣G3} = t \cdot tw(w-1)$ and $QOS_{┣G4} = 1.5 \cdot tw(t-1)(w-1)$. Then, the EQOS of the ┣-shaped grid(4 × 5) is:

$$EQOS_{┣grid(4 \times 5)} = \frac{140 + 180 + 320 + 360}{400} = \frac{1000}{400} = 2.5.$$

Figures 10 and 11 show the performance of the different grid quorum systems in terms of EQOS and AR for various grid system sizes. From Fig. 10, we see that the proposed xS-grid quorum provides the highest EQOS among the compared grid quorum systems and thus has the best neighbor sensibility; the xS-grid quorum is therefore useful for designing distributed algorithms that exploit spatial locality. We also see from Fig. 11 that the proposed ┣-shaped grid quorum provides the lowest AR among the compared grid quorum systems: it has the minimum number of nodes keeping an unpopular event datum. The ┣-shaped quorum system is therefore useful for designing distributed algorithms that minimize message transmission overhead when communication cost is high due to the long distance between two nodes.

Fig. 10. EQOS comparison of different quorum systems for grid system sizes

Fig. 11. Active ratio comparison of different quorum systems for grid system sizes

4.2 Performance of QHDA

This section evaluates the performance of the proposed QHDA through an analytical model. The performance is evaluated in terms of the total number of messages for a given data request-to-dissemination ratio and the average delay of an event data dissemination. Since the dissemination delay is proportional to the number of hops a data dissemination message travels, the dissemination delay is evaluated by that hop count. We assume that the data access rate follows a Zipf-like distribution [12, 13]. The access probability of the $i$-th most popular data item in the Zipf-like distribution is:


$$P(i) = \frac{1}{i^{\alpha} \sum_{k=1}^{n} \frac{1}{k^{\alpha}}} \quad (8)$$

where the parameter $n$ is the total number of data items, and $\alpha$ represents the degree of skew of the distribution: the larger $\alpha$ is, the more skewed the access distribution.
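A direct transcription of Eq. (8):

```python
def zipf_probs(n, alpha):
    """Eq. (8): access probability of the i-th most popular of n data items."""
    norm = sum(1.0 / k ** alpha for k in range(1, n + 1))
    return [1.0 / (i ** alpha * norm) for i in range(1, n + 1)]

probs = zipf_probs(n=100, alpha=1.0)   # the setting discussed below
```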

In a Zipf-like distribution with $n = 100$ and $\alpha = 1$, if the threshold value of the popularity index of a popular data item is 0.02, there are only 16 popular data items among the 100 data items, and the sum of the probabilities of accessing popular data is about 0.833. For the performance evaluation of QHDA, we evaluate the total number of messages and the data dissemination delay using the parameter values shown in Table 2. The total number of messages ($TN_{msg}$) includes the data dissemination messages and the data request messages, as shown in Eq. (9):

$$TN_{msg} = NM_{diss} + \frac{\lambda}{\mu} NM_{req} \quad (9)$$

$$NM_{diss} = P_{hd}\left(|Q_{xS}| + |Q_{┣grid}|\right) + \left(1 - P_{hd}\right)\left(|Q_{s\text{-}grid}| + |Q_{┣grid}|\right),$$

$$NM_{req} = P_{lr}\left\{P_{hr}\left[1 + (1 - P^{lh}_{hr})|Q_{s\text{-}grid}| + (1 - P^{eh}_{hr})|Q_{┣grid}|\right] + (1 - P_{hr})\left[1 + (1 - P^{lh}_{cr})|Q_{s\text{-}grid}| + (1 - P^{eh}_{cr})|Q_{┣grid}|\right]\right\} + (1 - P_{lr})\left(|Q_{s\text{-}grid}| + |Q_{┣grid}|\right),$$

where $P^{lh}_{hr}$ and $P^{eh}_{hr}$ represent the hit probabilities for popular data requested in the local edge node or in the local edge overlay network of the requesting device, respectively; $P^{lh}_{cr}$ and $P^{eh}_{cr}$ represent the corresponding hit probabilities for unpopular data; and $\lambda/\mu$ represents the request-to-dissemination ratio.
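The message-count model can be transcribed as a function of the quorum sizes and hit probabilities; since Eq. (9) had to be reconstructed from a damaged extraction, this sketch should be read as an approximation of the authors' model rather than a definitive implementation:

```python
def total_messages(q_xs, q_s, q_l, P_hd, P_hr, P_lr,
                   P_lh_hr, P_eh_hr, P_lh_cr, P_eh_cr, req_per_diss):
    """Eq. (9): expected messages per disseminated datum.
    q_xs, q_s, q_l are |Q_xS|, |Q_s-grid| and |Q_┣-grid|."""
    nm_diss = P_hd * (q_xs + q_l) + (1 - P_hd) * (q_s + q_l)
    nm_req = (P_lr * (P_hr * (1 + (1 - P_lh_hr) * q_s + (1 - P_eh_hr) * q_l)
                      + (1 - P_hr) * (1 + (1 - P_lh_cr) * q_s + (1 - P_eh_cr) * q_l))
              + (1 - P_lr) * (q_s + q_l))
    return nm_diss + req_per_diss * nm_req
```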

Table 2. Parameters and values for performance evaluation

Parameter                                                  Value
Grid system size, N                                        Local edge grid size Ge(5 × 8); fog grid size Gf(4 × 5); N = (20 × 40) = 800
Probability that the requested data is popular, Phr        0.84
Probability that the disseminated data is popular, Phd     0.16
Probability of requesting local data, Plr                  0.7

We compare the performance of QHDA with quorum-based data dissemination algorithms that work in a similar way. Figure 12 shows the total number of messages transmitted by the data dissemination algorithms. We confirmed that QHDA's message traffic load is lower than that of the other algorithms. The reason is that QHDA uses the xS-grid quorum to provide high spatial locality in the edge layer for popular or critical data dissemination, and the ┣-shaped grid quorum to reduce the high communication costs among the servers in the fog layer.

Fig. 12. Comparison of total number of messages for quorum-based data dissemination algorithms

Fig. 13. Comparison of average dissemination delay for quorum-based data dissemination algorithms

We also see from Fig. 13 that QHDA provides a much shorter average data dissemination delay than the other algorithms. This is because QHDA not only reduces the quorum size by using the edge-fog-based hierarchical quorums, xS-grid and ┣-shaped grid, but also uses an edge-fog-based multicast quorum system, unlike the multi-hop quorum systems of s-grid and MM-GSQ.

5 Conclusion

In this paper, we designed and evaluated a quorum-based hierarchical dissemination algorithm, QHDA, to propagate critical event data in edge-fog IoT environments. To design QHDA, two quorum structures, xS-grid and ┣-shaped grid, were designed with EQOS and AR in mind. QHDA executes in edge-fog based IoT environments, where the xS-grid quorum is used in the edge layer to take advantage of spatial locality and the ┣-shaped grid quorum is used in the fog layer to reduce communication overhead. The performance of QHDA was evaluated through an analytical model. As a result, we confirmed that the performance of the proposed QHDA is superior to that of other quorum-based algorithms.

Future research topics include quorum-based IoT mechanisms that provide reliable and timely propagation for healthcare workplace safety, critical event data dissemination in the Internet of Vehicles (IoV), and machine learning-based data dissemination methods for massive IoTs.

Acknowledgements. This paper was supported by research grants from Daegu Catholic University in 2019.


References

1. Daneels, G., et al.: Real-time data dissemination and analytics platform for challenging IoT environments. In: Global Information Infrastructure and Networking Symposium (GIIS), pp. 1–8. IEEE, St. Pierre (2017)
2. Massive IoT: different technologies for different needs. https://wirepas.com/download/. Accessed 10 June 2019
3. Farooq, M.J., ElSawy, H., Zhu, Q., Alouini, M.-S.: Optimizing mission critical data dissemination in massive IoT networks. In: The 2017 International Workshop on SpaSWiN, pp. 1–6. IEEE, Paris (2017)
4. Cloud, fog and edge computing – what's the difference? https://www.winsystems.com/cloudfog-and-edge-computing-whats-the-difference/. Accessed 10 June 2019
5. Omoniwa, B., Hussain, R., Javid, M.A., Bouk, S.H., Malik, S.A.: Fog/edge computing-based IoT (FECIoT): architecture, applications, and research issues. Internet Things J. 6(3), 4118–4149 (2019)
6. Naor, M., Wool, A.: The load, capacity and reliability of quorum systems. SIAM J. Comput. 27(2), 423–447 (1998)
7. Lai, S., Zhang, B., Ravindran, B., Cho, H.: CQS-pair: cyclic quorum system pair for wakeup scheduling in wireless sensor networks. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 295–310. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-92221-6_20
8. Tao, Z., Li, S., Lu, Z., Zhang, X.: A data dissemination algorithm based on geographical quorum system in wireless sensor network. In: Seventh Annual Communication Networks and Services Research Conference, pp. 317–324. IEEE, Moncton (2009)
9. Imani, M., Noshiri, O., Joudaki, M., Pouryani, M., Dehghan, M.: Adaptive S-grid: a new adaptive quorum-based power saving protocol for multi-hop ad hoc networks. In: 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), pp. 470–475. IEEE, Tehran (2017)
10. Thar, K., Oo, T.Z., Pham, C., Ullah, S., Lee, D.H., Hong, C.S.: Efficient forwarding and popularity based caching for content centric network. In: International Conference on Information Networking (ICOIN), pp. 330–335. IEEE, Cambodia (2015)
11. Jiang, J.-R.: Expected quorum overlap sizes of quorum systems for asynchronous power-saving in mobile ad hoc networks. Comput. Netw. 52(17), 3296–3306 (2008)
12. Breslau, L., Cao, P., Pan, L., Phillips, G., Shenker, S.: Web caching and Zipf-like distributions: evidence and implications. In: International Conference on Computer Communications, pp. 126–134. IEEE, New York (1999)
13. Lee, E.-J., Bae, I.-H.: Design and evaluation of a cluster-based fuzzy cooperative caching method for MANETs. J. Korean Data Inf. Sci. Soc. 22(2), 269–285 (2011)

Comparative Analysis on Raster, Spiral, Hilbert, and Peano Mapping Pattern of Fragile Watermarking to Address the Authentication Issue in Healthcare System

Syifak Izhar Hisham1(&), Mohamad Nazmi Nasir1, and Nasrul Hadi Johari2

1 Faculty of Computing, Universiti Malaysia Pahang, 26300 Gambang, Malaysia
[email protected]
2 Faculty of Engineering Technology of Mechanical, Universiti Malaysia Pahang, 26600 Pekan, Malaysia

Abstract. Healthcare service is one of the focus areas in supporting the existence of smart cities. Nowadays, many procedures are done paperless and fully in digital form. Medical scans are also stored in digital modality formats and transmitted through hospital management systems. An authentication mechanism is needed during transmission, hence the need for watermarking. One interesting research direction in watermarking is the embedding pattern of watermark data in the early stage of watermarking. The objective of this research is to investigate the best pattern for determining the bit locations for watermark embedding for copyright protection. This paper applies four types of embedding patterns to medical images, where the quality of the watermarked images depends on the mapping pattern. It compares a straightforward mapping such as the raster pattern against unique mappings such as the spiral, Hilbert, and Peano patterns. After mapping, all schemes share the same stages of a watermarking scheme: embedding, detection and recovery. The comparison factors include the peak signal-to-noise ratio (PSNR) and mean squared error (MSE) values of the embedded images, and the computational time. From the results, the significant difference is the computational time: the time taken by the unique patterns is significantly longer than by the raster pattern. When it comes to handling superabundant data, it is crucial to produce a user-friendly system. As a whole, however, the results show that the Peano pattern embedding scheme has a unique pattern that is hard to track, yet its computational time to watermark is acceptable.

Keywords: Data authentication · Data security · Medical image watermarking · Mapping pattern · Smart city · Smart healthcare system


1 Introduction

There is a wide range of definitions used to define a smart city, as it depends on the people, the places, the level of development, the resources and the focuses related to the city [1–4]. The ASEAN Smart Cities Network (ASCN) endorsed the Smart Cities Framework in July 2018, whose focus areas are Civic and Social, Health and Wellbeing, Safety and Security, Quality Environment, Built Infrastructure, and Industry & Innovation [5]. These areas are also in focus in other countries. As defined by Mohanty [6], "A smart sustainable city is an innovative city that uses information and communication technologies (ICTs) and other means to improve quality of life, efficiency of urban operations and services, and competitiveness, while ensuring that it meets the needs of present and future generations with respect to economic, social and environmental aspects" [6].

Nowadays, many improvements have been made in healthcare services in order to support the existence of smart cities [7]. One of the reasons is the integration and optimization of ICT in healthcare systems to save cost, resources, and time, while at the same time expediting services. With that, many procedures are done paperless and fully in digital form. Medical scans are stored in electronic modalities such as magnetic resonance imaging (MRI), computed tomography (CT) scans, mammograms, ultrasound, and positron-emission tomography (PET). The digital medical images are used by radiologists, doctors, consultants and other clinical professionals once stored in the hospital management system. The images may need to be transmitted to other hospitals if the related consultants or surgeons are not in the same hospital. Thus, given the critical status of medical images, especially because of patients' privacy, a security system is needed [8].

The idea of using watermarking on medical data is a security matter in the sense that we want to ensure the critical medical data is authentic when a radiologist or doctor refers to it. Nowadays, medical images are not printed anymore. Since many advantages of using digital medical images have been discovered and they are frequently used in the medical domain, most hospitals face issues in managing a large amount of data storage, such as administrative documents, patient information and medical images. Therefore, it is important to handle those data accurately to avoid problems of loss, tampering, manipulation and mishandling of records at the hospital [9, 10]. Digital watermarking has become a research focus for medical documents, specifically medical images [11–15]. Medical images are compelling because they are used nowadays as evidence in jurisdiction and as documents for insurance claims. Thus, watermarking is needed to ensure that all attached documents and evidence are valid and unedited [16, 17]. Various studies on medical image watermarking have been carried out for copyright protection, authentication and patient management systems [18–21]. Furthermore, future medical information database systems are forecast to be integrated with watermarking schemes used to protect the security of personal data and medical information [9, 10, 21].

Comparative Analysis on Raster, Spiral, Hilbert, and Peano Mapping Pattern

479

and originality is high [22, 23]. Image authentication can assure receivers that the received image is from the authorized source and that the image content is identical to the one sent [24]. Nowadays, even by using generic software for image elaboration, a medical image can be attacked by erasing or adding any sign of disease onto it. If this image were a critical piece of evidence in a legal case or police investigation, this form of tampering might pose a serious problem. Especially where the telemedicine technology is widely implemented, it is a serious call to start implementing the security system to medical images. One popular mechanism to develop a fragile watermarking scheme that has tamper localization and recovery feature is the block-based mechanism [10]. It is a popular mechanism introduced by Fridrich and Goljan [25] and is well-known to explain the problem of collage attack, vector quantization (VQ) counterfeiting attack and cut-and paste attack [24–29]. The method is by separating the block independence to enable recovery data embedding in block by block. However, it is well-known that most introduced schemes which implementing block-based mechanism are using raster manner of numbering, which, ill-mannered attackers can just modify the watermarked image and cover it up if they manage to obtain the block-mapping sequence in advance [30]. Accordingly, Fridrich and Goljan [25] stated that reconstruction is accomplished by embedding the recovery bits in a block far from the original block. From the experimental results done by [27], it showed that the recovery bits were not embedded in blocks situated in the same column, but with some percentage in the same row. Those in the same row must have an odd distance from the original because the way we spread the tamper was by using the same size, as the block used for embedding and the distance from each other were, at least, one block. If we change the tamper block size in the spread-tampered blocks, then we may have a different result. For medical images, which usually have the region of interest at the center (refer to Fig. 1), the way we embed recovery data at the center is crucial. When pseudorandom mapping process takes place, the mapping will lead the important recovery data to the center if the block data is numbered in raster path. Thus, the unique way of block numbering can help to further the location of recovery data.

Fig. 1. The grey and white area as the Region of Interest (ROI) and the black area as the Region of Non-Interest (RONI) of the image

Some fragile watermarking schemes used raster mapping for the watermarking embedding [27] which at the center of the image, the security data and the original data will keep together. Once the attack hit central location bits, both security and original

480

S. I. Hisham et al.

data will be attacked. There is also schemes using unique pattern for the watermarking [29], but the operational time is long which it may not efficient for loads of data. Thus, a better mapping pattern is needed. In this paper, we investigate the results and outputs that could be contributed by four types of mapping in four schemes of fragile watermarking, which are raster pattern by AW-TDR [27], spiral pattern by SPIRAL-LSB [24], Hilbert pattern by HILBERT-LSB [29], and Peano pattern. At the end of this paper, this research would suggest the best pattern to determine the bit locations for watermark embedding to be applied in healthcare system to ensure the data protection is secured.

2 Methodology

Many patterns can potentially be used in an embedding algorithm to secure the location of the embedded watermark data. Patterns determine the embedding order of each pixel or each block, depending on the watermarking algorithm. Figure 2 shows the patterns that have been used in watermark embedding algorithms: raster, spiral, Hilbert, and Peano. AW-TDR [27] uses the raster pattern and SPIRAL-LSB [24] the spiral pattern, while HILBERT-LSB [29] uses the Hilbert pattern in the mapping process before embedding. We developed an algorithm with the Peano mapping pattern, named PEANO-LSB, to compare its results with those of the other three patterns. The Hilbert and Peano patterns are both recursive space-filling curves.

Fig. 2. Mapping patterns of embedding, (from left) Raster, Spiral, Hilbert, Peano
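As a concrete illustration of how a recursive space-filling order can replace raster numbering, the sketch below converts a one-dimensional Hilbert index into two-dimensional coordinates using the standard iterative bit-manipulation construction for a 2^k x 2^k grid. This is a generic textbook construction, not the exact routine of HILBERT-LSB, and hilbert_d2xy is a hypothetical name; the Peano pattern can be generated analogously with a base-3 recursion.

```python
def hilbert_d2xy(n: int, d: int) -> tuple[int, int]:
    """Convert index d along a Hilbert curve into (x, y) on an n x n
    grid, where n is a power of two.  Standard iterative construction;
    a generic sketch, not the routine used by HILBERT-LSB."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate the current quadrant so the sub-curve is oriented
        # consistently with the parent curve.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Numbering a whole 8 x 8 grid in Hilbert order:
order = [hilbert_d2xy(8, d) for d in range(8 * 8)]
```

Numbering pixels (or blocks) by increasing d gives a traversal in which consecutive indices are always spatially adjacent, unlike the raster order, and whose layout is much harder for an attacker to predict.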

Figure 3 shows the mapping algorithm diagram of HILBERT-LSB [29]. The algorithms of AW-TDR, SPIRAL-LSB and PEANO-LSB differ only in the chosen pattern, which is created using the coordinate concept, as shown in Fig. 3. The third step in Fig. 3 forms the pattern by choosing each pixel according to the direction of the pattern (refer to Fig. 2). The coordinate of the pixel is checked to see whether it has already been numbered: if it has, the pixel is skipped; if it has not, it is numbered and mapped so that the pattern continues. A recursive space-filling pattern maps all pixels of the image, whereas the spiral pattern maps only the pixels that can form the spiral route.


Fig. 3. The structure of numbering and mapping algorithm in HILBERT-LSB [29]
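To illustrate why the spiral route covers only a square region, here is a minimal sketch of a clockwise inward spiral ordering over an n x n area; the actual starting point and direction used by SPIRAL-LSB may differ, and spiral_order is a hypothetical helper name. For a non-square image, only the pixels inside the largest inscribed square would be numbered, which explains the unwatermarked margin discussed in the next section (refer to Fig. 6).

```python
def spiral_order(n: int):
    """Yield (row, col) coordinates of an n x n region in a clockwise
    inward spiral from the top-left corner.  A minimal sketch: the
    starting point and direction of SPIRAL-LSB may differ."""
    top, bottom, left, right = 0, n - 1, 0, n - 1
    while top <= bottom and left <= right:
        for c in range(left, right + 1):              # top edge, left to right
            yield top, c
        for r in range(top + 1, bottom + 1):          # right edge, downwards
            yield r, right
        if top < bottom:
            for c in range(right - 1, left - 1, -1):  # bottom edge, right to left
                yield bottom, c
        if left < right:
            for r in range(bottom - 1, top, -1):      # left edge, upwards
                yield r, left
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1

coords = list(spiral_order(512))  # e.g. the inscribed 512 x 512 region
```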

Figure 4 shows the embedding algorithm diagram for HILBERT-LSB, SPIRAL-LSB, PEANO-LSB and AW-TDR. The embedding process takes place after the mapping process; this algorithm is the part of each fragile watermarking scheme that contributes the authentication feature. All four schemes also have a detection phase to locate any manipulated area in the watermarked image; in this paper, however, we discuss only the embedding phase. Three types of data are embedded: the authentication bits, produced from the average intensity of the pixel blocks and sub-blocks; the parity check bits; and the recovery data of the pixels, produced from the average intensity of the sub-blocks. These three types of data are embedded in the least significant bit (LSB) of each pixel in the blocks.

Fig. 4. The structure of embedding algorithm
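The common final step of all four schemes is LSB replacement. The sketch below shows this generic step: clearing the least significant bit of each pixel in a block and writing one payload bit into it, together with the kind of masked average intensity from which authentication and recovery bits are derived. The payload layout differs between the schemes, so embed_lsb, avg_intensity_msb, and the flat bits argument are hypothetical illustrations, not the papers' exact routines.

```python
import numpy as np

def embed_lsb(block: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Write one payload bit into the least significant bit of every
    pixel in `block` (uint8).  Generic LSB-replacement step only; the
    layout of authentication, parity, and recovery bits inside `bits`
    differs between the four schemes."""
    assert bits.size == block.size and np.isin(bits, (0, 1)).all()
    return (block & ~np.uint8(1)) | bits.reshape(block.shape).astype(block.dtype)

def avg_intensity_msb(block: np.ndarray) -> int:
    """Average intensity of a block with the LSBs masked out -- the
    kind of quantity from which the schemes derive authentication and
    recovery bits (hypothetical helper for illustration)."""
    return int(np.mean(block & ~np.uint8(1)))
```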


3 Results and Discussion

The same set of image data is used for this comparative evaluation: we embedded watermarks using all four patterns into all of the data and observed the results. PSNR and MSE values are used to evaluate the quality of the embedded images. PSNR is often used as a measure of image fidelity in research; a high PSNR means the image has high fidelity. MSE is the average term-by-term difference between the original image I and the watermarked image I'. If I and I' are identical, then MSE(I', I) is 0; the nearer the value is to 0, the smaller the difference from the original version (a short computational sketch of both metrics follows this discussion).

All patterns can embed the watermark data without leaving any visible trace, because the schemes use the same embedding algorithm apart from the pattern mapping. Visually, the human visual system cannot see any difference between the original and the watermarked image, as shown in Fig. 5 for the AW-TDR results. A medical image, especially a grey-scale one, is usually hard to watermark across the entire image without disturbing its visual appearance, because of the few colours in it; a change could affect the diagnosis, as it might appear as fine dots or strangely shaped artefacts. We therefore carry the watermark data in the least significant bit (LSB) of each pixel to overcome this challenge.

Table 1 shows a consistent picture: the PSNR values of the raster-pattern embedded images are the lowest, the Hilbert and Peano patterns record similar PSNR values, and the spiral mapping pattern records the highest PSNR. All of the values are nonetheless excellent for PSNR, since they lie well above 32 dB, the benchmark for visible distortion. The results parallel the MSE values: the AW-TDR (raster) embedded images have higher MSE than the Hilbert- and Peano-embedded images, and SPIRAL-LSB records the smallest MSE, i.e. the least difference from the original. The reason SPIRAL-LSB records such a high PSNR and small MSE is that it does not embed into the whole image: as the spiral is not a recursive space-filling curve, it can only embed within a square region, and the area outside that square is left unwatermarked (refer to Fig. 6). This pattern therefore cannot be considered when medical images are not square.

The 'Computational time' column shows the time taken to embed the watermark into the images, which vary in size as they are different medical images; the bigger the image, the longer the whole-image embedding takes. From the table, the raster pattern clearly takes the shortest time to embed, and SPIRAL-LSB is consistently the second fastest. HILBERT-LSB takes the longest, with a large gap to AW-TDR and SPIRAL-LSB. This does not reflect the quality of the watermarked images, but it is a factor in judging whether the system is efficient: users do not want to wait long for an embedding process to complete. When superabundant data must be handled every day, the computational time to watermark is crucial to producing a user-friendly system.
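As noted above, the two quality metrics can be computed as follows. This is the standard textbook definition with an assumed 8-bit peak value of 255, not code taken from the compared schemes.

```python
import numpy as np

def mse(original: np.ndarray, watermarked: np.ndarray) -> float:
    """Mean squared error between the original image I and the
    watermarked image I'; 0 means the two images are identical."""
    diff = original.astype(np.float64) - watermarked.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original: np.ndarray, watermarked: np.ndarray,
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB, assuming 8-bit pixels."""
    m = mse(original, watermarked)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)
```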


Fig. 5. The original data (left) and the watermarked data of AW-TDR (right)

Table 1. Comparison results of Raster, Spiral, Hilbert, and Peano pattern embedding (the 'Image' column of the printed table showed the four test images, labelled here as Image 1-4).

Image    Mapping pattern  PSNR value (dB)  MSE value  Computational time (s)
Image 1  Raster           53.6208          0.2825       2.4735
Image 1  Spiral           65.0266          0.0237       2.9635
Image 1  Hilbert          58.3968          0.0943     510.7969
Image 1  Peano            58.3872          0.0941     408.6719
Image 2  Raster           54.1729          0.2488       1.9531
Image 2  Spiral           63.3011          0.0304       2.8438
Image 2  Hilbert          58.9348          0.0830      45.9219
Image 2  Peano            58.9406          0.0831      37.7813
Image 3  Raster           53.6985          0.2775       1.4219
Image 3  Spiral           55.2634          0.1935       2.8281
Image 3  Hilbert          53.9764          0.2787       5.6563
Image 3  Peano            53.6794          0.2787       4.4844
Image 4  Raster           54.0338          0.2569       1.9688
Image 4  Spiral           64.4188          0.0235       2.7656
Image 4  Hilbert          58.8268          0.0839      55.8750
Image 4  Peano            58.8912          0.0852      44.8906


Fig. 6. An original mammogram and the watermark data

The Hilbert and Peano patterns make it hard for attackers to guess the sequence in which the bits are mapped and where the watermarked data are located. The starting point of the mapping is also not fixed at the left edge as in the raster pattern, but can be anywhere in the image, which adds to the security of the scheme. Given the points above, PEANO-LSB, which uses the Peano mapping pattern, contributes to a good watermarking scheme: it secures the location of the watermark data far from the original location, better than the raster pattern; it can watermark any shape, whether rectangular or square, and is not limited like the spiral pattern; and its computational time is acceptable compared to HILBERT-LSB.

4 Conclusion

This research investigated the best pattern for determining the bit locations for watermark embedding for copyright protection. Rather than a typical order such as the raster pattern, the data are better mapped using a unique pattern, e.g. Hilbert, Peano, or spiral, to provide extra security. Setting the raster pattern aside, as it is not a unique pattern, the test results favor PEANO-LSB, whose PSNR and MSE values are good. Overall, the results of the Peano-pattern embedding scheme show that its pattern is hard for manipulators to track, while its computational time to watermark remains acceptable for use in a healthcare system. A future direction of this work is to incorporate the Internet of Things (IoT) concept into the proposed scheme [31–33].

Acknowledgments. This research work is supported by the project 'Authentication Watermarking in Digital Text Document Images Using Unique Pattern Numbering and Mapping' (RDU190366), funded by Universiti Malaysia Pahang. The data in this paper can be accessed on the FKom, UMP website at http://fskkp.ump.edu.my.

References

1. Xu, H., Geng, X.: People-centric service intelligence for smart cities. Smart Cities 2(2), 135–152 (2019)
2. Joss, S., Sengers, F., Schraven, D., Caprotti, F., Dayot, Y.: The smart city as global discourse: storylines and critical junctures across 27 cities. J. Urban Technol. 26, 3–34 (2019)


3. Anthopoulos, L.G.: The rise of the smart city. In: Understanding Smart Cities: A Tool for Smart Government or an Industrial Trick? PAIT, vol. 22, pp. 5–45. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57015-0_2
4. Albino, V., Berardi, U., Dangelico, R.M.: Smart cities: definitions, dimensions, and performance. J. Urban Technol. 22(1), 3–21 (2015)
5. ASEAN Smart Cities Framework (ASCF). The 32nd ASEAN Summit (2018). https://asean.org/storage/2019/02/ASCN-ASEAN-Smart-Cities-Framework.pdf
6. Mohanty, S.: Everything you wanted to know about smart cities. IEEE Consum. Electron. Mag. 5, 60–70 (2016). https://doi.org/10.1109/MCE.2016.2556879
7. Trencher, G., Karvonen, A.: Stretching "smart": advancing health and well-being through the smart city agenda. Local Environ. 24(7), 610–627 (2019)
8. Ali, Z., Hossain, M.S., Muhammad, G., Aslam, M.: New zero-watermarking algorithm using Hurst exponent for protection of privacy in telemedicine. IEEE Access 6, 7930–7940 (2018)
9. Chauhan, D.S., Singh, A.K., Kumar, B., Saini, J.P.: Quantization based multiple medical information watermarking for secure e-health. Multimedia Tools Appl. 78(4), 3911–3923 (2019)
10. Thakur, S., Singh, A.K., Ghrera, S.P., Elhoseny, M.: Multi-layer security of medical data through watermarking and chaotic encryption for tele-health applications. Multimedia Tools Appl. 78(3), 3457–3470 (2019)
11. Planitz, B.M., Maeder, A.J.: A study of block-based medical image watermarking using a perceptual similarity metric. In: Digital Image Computing: Techniques and Applications, p. 70 (2005)
12. Liu, X., et al.: A novel robust reversible watermarking scheme for protecting authenticity and integrity of medical images. IEEE Access 7, 76580–76598 (2019)
13. Zhong, X., Shih, F.Y.: A high-capacity reversible watermarking scheme based on shape decomposition for medical images. Int. J. Pattern Recogn. Artif. Intell. 33(01), 1950001 (2019)
14. Liu, J., Li, J., Ma, J., Sadiq, N., Ai, Y.: FDCT and perceptual hash-based watermarking algorithm for medical images. In: Chen, Y.W., Zimmermann, A., Howlett, R., Jain, L. (eds.) Innovation in Medicine and Healthcare Systems, and Multimedia, pp. 157–168. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-8566-7_15
15. Priya, S., Santhi, B.: A novel visual medical image encryption for secure transmission of authenticated watermarked medical images. Mobile Networks Appl. 1–8 (2019)
16. Ernawan, F.: Tchebichef image watermarking along the edge using YCoCg-R color space for copyright protection. Int. J. Electr. Comput. Eng. 9 (2019). ISSN 2088-8708
17. Alias, N., Ernawan, F.: Multiple watermarking technique based on RDWT-SVD and human visual characteristics. J. Theor. Appl. Inf. Technol. 97, 14 (2019)
18. Swaraja, K., Meenakshi, K., Padmavathi, K.: An optimized blind dual medical image watermarking framework for tamper localization and content authentication in secured telemedicine. Biomed. Signal Process. Control 55, 1–15 (2020)
19. Umamageswari, A., Leo Vijilious, M.A.: Enhancing security in medical image informatics using geometrical attacks. Current Sci. 117(3) (2019)
20. Ma, B., Li, B., Wang, X., Wang, C., Li, J., Shi, Y.: A code division multiplexing and block classification-based real-time reversible data-hiding algorithm for medical images. J. Real-Time Image Process. 16(4), 857–869 (2019)
21. Al-Haj, A., Amer, A.: Secured telemedicine using region-based watermarking with tamper localization. J. Digit. Imaging 27(6), 737–750 (2014)
22. Coatrieux, G., Maitre, H., Sankur, B., Rolland, Y., Collorec, R.: Relevance of watermarking in medical imaging. In: Proceedings of IEEE EMBS Information Technology Applications in Biomedicine, Arlington, pp. 250–255 (2000)
23. Tan, C.K., Ng, C., Xu, X., Poh, C.L., Yong, L.G., Sheah, K.: Security protection of DICOM medical images using dual-layer reversible watermarking with tamper detection capability. J. Digit. Imaging 24(3), 528–540 (2011)
24. Hisham, S.I., Muhammad, A.N., Zain, J.M., Badshah, G., Arshad, N.W.: Digital watermarking for recovering attack areas of medical images using spiral numbering. In: 2013 International Conference on Electronics, Computer and Computation (ICECCO), pp. 285–288 (2013)
25. Fridrich, J., Goljan, M., Baldoza, A.C.: New fragile authentication watermark for images. In: International Conference on Image Processing (ICIP 2000), Vancouver, BC, pp. 446–449. IEEE Computer Society (2000)
26. Holliman, M., Memon, N.: Counterfeiting attacks on oblivious block-wise independent invisible watermarking schemes. IEEE Trans. Image Process. 9(3), 432–441 (2000)
27. Zain, J.M., Fauzi, A.R.M.: Medical image watermarking with tamper detection and recovery. In: 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 3270–3273 (2006)
28. Liu, K.C.: Color image watermarking for tamper proofing and pattern-based recovery. IET (IEE) Image Process. 6(5), 445–454 (2012)
29. Hisham, S.I., Zain, J.M., Arshad, N.W., Liew, S.C.: HILBERT-LSB-C as authentication system for color medical images. In: 2015 4th International Conference on Software Engineering and Computer Systems (ICSECS), IEEE Xplore Proceedings, pp. 15–20 (2015)
30. Chang, C.C., Fan, Y.H., Tai, W.L.: Four-scanning attack on hierarchical digital watermarking method for image tamper detection and recovery. Pattern Recogn. 41(2), 654–661 (2008)
31. Rahman, M.A.: Reliability analysis of ZigBee based intra-vehicle wireless sensor networks. In: Sikora, A., Berbineau, M., Vinel, A., Jonsson, M., Pirovano, A., Aguado, M. (eds.) LNCS, vol. 8435, pp. 103–112. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-06644-8_10
32. Bhuiyan, M.Z.A., Wang, G., Tian, W., Rahman, M.A., Wu, J.: Content-centric event-insensitive big data reduction in internet of things. In: GLOBECOM 2017 - 2017 IEEE Global Communications Conference, Singapore, pp. 1–6 (2017)
33. Al-Nadwi, M.M.K., Refat, N., Zaman, N., Rahman, M.A., Bhuiyan, M.Z.A., Razali, R.B.: Cloud enabled e-glossary system: a smart campus perspective. In: Wang, G., Chen, J., Yang, L.T. (eds.) SpaCCS 2018. LNCS, vol. 11342, pp. 251–260. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-05345-1_21
